
Relax multiple PDB files with MPI, jd2, and a pdblist on TACC Stampede2


Hello,

I'm having trouble relaxing multiple PDB files via MPI and a pdblist on the TACC Stampede2 cluster. Only the first file in the pdblist gets relaxed, regardless of which PDB file is listed first. My pdblist has Unix line endings, so that should be fine. Do I need to add some flag to relax multiple files in a batch over MPI? What am I missing?
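For reference, a quick way to double-check the line endings from the login node (plain coreutils; "pdblist" is just the name of my list file) is something like this:

-------------------------------

# Report the file type; CRLF (Windows) line endings would be flagged here
file pdblist

# Show line ends explicitly: '$' marks an LF ending, '^M$' would mark CRLF
cat -A pdblist

-------------------------------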

Here's the SLURM file I'm using:

-----------------------------------------------------

#!/bin/bash

#SBATCH -n 5              # total number of MPI tasks in this job
#SBATCH -N 1              # total number of nodes (Stampede2 nodes have 68 cores each)
#SBATCH -p normal         # queue
#SBATCH -t 00:10:00       # run time (hh:mm:ss)
#SBATCH -o RelaxLog       # name of the stdout log file
#SBATCH -e ErrorLog       # name of the stderr (error) log file

module load intel/17.0.4
module load impi/17.0.3
module load rosetta

ibrun relax.cxx11mpi.linuxiccrelease -in:file:l pdblist -in:file:fullatom -relax:quick -nstruct 2 -out:suffix _relaxed -out:path:pdb Output_PBDs -out:path:score Output_Scores -database=$HOME/rosetta_database -mpi_tracer_to_file Output

-----------------------------------------------------

I ran a comparable command on my personal computer (non-MPI build), and there the "-in:file:l pdblist" flag works fine: all PDB files in the list are relaxed. Here is that command:

-------------------------------------------------

Rosetta/main/source/bin/relax.linuxgccrelease -in:file:l pdblist -in:file:fullatom -relax:quick -nstruct 2 -out:suffix _relaxed -out:path:pdb Output_PBDs -out:path:score Output_Scores -database Rosetta/main/database/

-------------------------------------------------

So, I'm thinking that there's something wrong with the way the JobDistributor reads the pdblist. Here's the output of the master node when I try to run the SLURM file on TACC Stampede2 (and only the first PDB file in the pdblist is relaxed):

-------------------------------------------------

core.init: (0) Rosetta version unknown:exported  from http://www.rosettacommons.org
core.init: (0) command: /home1/apps/intel17/impi17_0/rosetta/3.8/bin/relax.cxx11mpi.linuxiccrelease -in:file:l pdblist -in:file:fullatom -relax:quick -nstruct 2 -out:suffix _relaxed -out:path:pdb Output_PBDs -out:path:score Output_Scores -database=/home1/01748/pl6218/rosetta_database -mpi_tracer_to_file Output
core.init: (0) 'RNG device' seed mode, using '/dev/urandom', seed=1855461808 seed_offset=0 real_seed=1855461808
core.init.random: (0) RandomGenerator:init: Normal mode, seed=1855461808 RG_type=mt19937
core.scoring.ScoreFunctionFactory: (0) SCOREFUNCTION: talaris2014
core.scoring.etable: (0) Starting energy table calculation
core.scoring.etable: (0) smooth_etable: changing atr/rep split to bottom of energy well
core.scoring.etable: (0) smooth_etable: spline smoothing lj etables (maxdis = 6)
core.scoring.etable: (0) smooth_etable: spline smoothing solvation etables (max_dis = 6)
core.scoring.etable: (0) Finished calculating energy tables.
basic.io.database: (0) Database file opened: scoring/score_functions/hbonds/sp2_elec_params/HBPoly1D.csv
basic.io.database: (0) Database file opened: scoring/score_functions/hbonds/sp2_elec_params/HBFadeIntervals.csv
basic.io.database: (0) Database file opened: scoring/score_functions/hbonds/sp2_elec_params/HBEval.csv
basic.io.database: (0) Database file opened: scoring/score_functions/rama/Rama_smooth_dyn.dat_ss_6.4
basic.io.database: (0) Database file opened: scoring/score_functions/P_AA_pp/P_AA
basic.io.database: (0) Database file opened: scoring/score_functions/P_AA_pp/P_AA_n
basic.io.database: (0) Database file opened: scoring/score_functions/P_AA_pp/P_AA_pp
protocols.relax.FastRelax: (0) ================== Using default script ==================
protocols.jd2.PDBJobInputter: (0) Instantiate PDBJobInputter
protocols.jd2.PDBJobInputter: (0) PDBJobInputter::fill_jobs
protocols.jd2.PDBJobInputter: (0) pushed 5hqi_ignorechain.pdb nstruct indices 1 - 2
protocols.evaluation.ChiWellRmsdEvaluatorCreator: (0) Evaluation Creator active ... 
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Getting next job to assign from list id 1 of 2
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for job requests...
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from 6 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending new job id 1 to node 6 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Getting next job to assign from list id 2 of 2
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for job requests...
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from 7 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending new job id 2 to node 7 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: No more jobs to assign, setting next job id to zero
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Finished handing out jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 9 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  8 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 8
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 8 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  9 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 9
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 7 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  1 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 1
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 6 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  2 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 2
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 5 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  3 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 3
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 4 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  4 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 4
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 3 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  5 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 5
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 2 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  7 with tag 30
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received job success message for job id 2 from node 7 blocking till output is done 
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received job output finish message for job id 2 from node 7
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master set job 2 as completed/deletable.
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 2 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  7 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 7
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 1 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  6 with tag 30
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received job success message for job id 1 from node 6 blocking till output is done 
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received job output finish message for job id 1 from node 6
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master set job 1 as completed/deletable.
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Waiting for 1 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from  6 with tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 6
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Finished sending spin down signals to slaves
protocols::checkpoint: (0) Deleting checkpoints of FastRelax

-------------------------------------------------

There were two files in that pdblist, but the job distributor only pushed the first one (note the single "pushed 5hqi_ignorechain.pdb nstruct indices 1 - 2" line above, with nothing for the second structure). There were no error messages, and the error log was blank. Everything reports success, yet only the first file in the pdblist was relaxed.

Any ideas???

As a side note, I can successfully run the relax protocol on TACC Stampede2 if I just change my SLURM file from "-in:file:l pdblist" to "-in:file:s 5hqi_ignorechain.pdb", so I know it works fine with a single PDB file.
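(I believe -in:file:s also accepts several structures at once, so a workaround along these lines might tide things over until the list issue is sorted out; "second_structure.pdb" below is just a placeholder, not one of my actual files:)

-------------------------------

# second_structure.pdb is a placeholder for another input PDB
ibrun relax.cxx11mpi.linuxiccrelease -in:file:s 5hqi_ignorechain.pdb second_structure.pdb -in:file:fullatom -relax:quick -nstruct 2 -out:suffix _relaxed -out:path:pdb Output_PBDs -out:path:score Output_Scores -database=$HOME/rosetta_database -mpi_tracer_to_file Output

-------------------------------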

Thanks very much for your help!

Sincerely,

AJ

Sat, 2018-07-14 13:14
AJVincelli

In case anybody is wondering, here's the solution to this problem: add a blank line (i.e., a trailing newline) after the last file in the pdblist.

So the pdblist file should look like this:

------------------------

Structure1.pdb
Structure2.pdb
Structure3.pdb
 

-------------------------------

Note the blank line after Structure3.pdb. Apparently the MPI job distributor stops reading AT the last line rather than AFTER it, so without a terminating newline the last entry in the list is never queued.
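For anyone hitting the same thing, a shell check/fix along these lines should work (plain bash/coreutils; "pdblist" is the name of the list file):

-------------------------------

# wc -l counts newline characters, so a list missing its final newline
# reports one fewer line than it has entries
wc -l pdblist

# Append a newline only if the last byte of the file is not already one
[ -n "$(tail -c 1 pdblist)" ] && printf '\n' >> pdblist

-------------------------------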

Thanks!

Sun, 2018-11-25 11:29
AJVincelli