
JD3 FastRelax over MPI - crashes on relax completion/before writing output


Bit of a complicated issue here, so apologies in advance if it's not entirely clear; happy to provide clarification.

I'm running a simple RosettaScripts file (relax_monomer_foldtree.rs_.xml_.txt) on one or more structures and am using JD3 to distribute the jobs (relax_jd3_S.xml_.txt). This works fine for small-to-moderately sized proteins and complexes (generally <1800 total amino acids in three chains of <600 each). For large proteins (e.g., full spike proteins; 3500+ total a.a.), however, it only works properly if all tasks are on a single node. If tasks are split across 2+ nodes, the job fails when the first task completes, but before the relaxed PDB is written, with the following message:

[][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1338: Got FATAL event local access violation work queue error on QP 0xea

[][handle_cqe] Send desc error in msg to 12, wc_opcode=0
[][handle_cqe] Msg from 12: wc.status=9 (remote invalid request error), wc.wr_id=0x15863a0, wc.opcode=0, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS
[][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b)  operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:497: [] Got completion with error 9, vendor code=0x8a, dest rank=12
: No such file or directory (2)
srun: error: atl1-1-02-020-36-2: task 12: Exited with exit code 255
srun: error: atl1-1-02-020-35-1: task 2: Exited with exit code 252
slurmstepd: error:  mpi/pmix_v3: _errhandler: atl1-1-02-020-36-2 [2]: pmixp_client_v2.c:212: Error handler invoked: status = -25, source = [slurm.pmix.4660380.0:12]


The relevant part appears to be the IBV_WC_REM_INV_REQ_ERR. Has anyone run into this before on their own MPI runs? Is it more likely a Rosetta problem, or an MPI-configuration problem on our end? Increasing RAM (initially at 8 GB/CPU, all the way up to 32 GB/CPU) does not appear to help.
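One experiment I'm considering (untested, so take with salt): since the error text points at either an undersized receive buffer or an oversized RDMA request, MVAPICH2's internal buffer and eager-threshold settings could be raised before the srun call. The variable names below are from the MVAPICH2 2.3 user guide; the values are illustrative guesses, not a verified fix.

```shell
# Hypothetical MVAPICH2 tuning for very large serialized messages.
# Values are illustrative; see the MVAPICH2 2.3 user guide for ranges.
export MV2_VBUF_TOTAL_SIZE=1048576      # larger internal communication buffers
export MV2_IBA_EAGER_THRESHOLD=1048576  # switch to rendezvous protocol later
export MV2_USE_RDMA_FAST_PATH=0         # rule out the RDMA fast path
# ...then launch the same srun command as in the run command below.
```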

Run command (slurm)

srun -JRelax $ROSETTA_PATH/bin/rosetta_scripts_jd3.cxx11threadmpiserialization.linuxgccrelease \
  -job_definition_file "$jd3" \
  -use_truncated_termini \
  -missing_density_to_jump \
  -jd3:n_archive_nodes 2 \
  -mpi_tracer_to_file logs/mpi \
  -linmem_ig 10
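For reference, the single-node configuration that does work can be pinned in the batch script. This is only a sketch: the task count follows the nstructs*pdbs + n_archive_nodes + 1 formula from the specs below, and the concrete numbers (5 structures x 3 PDBs, 2 archive nodes, 1 master) are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sbatch header forcing every MPI rank onto one node,
# the only configuration that currently succeeds for large proteins.
#SBATCH -J Relax
#SBATCH --nodes=1            # all tasks on a single node (the working case)
#SBATCH --ntasks=18          # nstructs*pdbs + n_archive_nodes + 1 = 5*3 + 2 + 1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G
```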

Some specs:

RHEL 7.9, 8 GB/CPU, 1 CPU/task (nstructs*pdbs + n_archive_nodes + 1 tasks), rosetta.source.release-340 compiled with mvapich2/2.3.6 and gcc/10.3.0 with the mpi, serialization, and cxx11thread extras.

RosettaScripts file (532 bytes)
JD3 job file (629 bytes)
Tue, 2024-03-12 12:51