How to perform abinitiorelax.mpi.linuxgccrelease in parallel mode with MPI


Hi,

I have compiled Rosetta 3.8 successfully with the command "./scons.py bin mode=release extras=mpi cxx=icc". But when I run the command "mpirun -np 64 AbinitioRelax.mpi.linuxiccrelease @options", the following error is displayed at the end:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID XXXXX RUNNING AT LOCALHOST.LOCALDOMAIN
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

 

Could you help me solve this problem? Many thanks!

Kindest regards

 

Jiyuan

Sun, 2017-09-24 18:37
kingljy

That message is from the MPI system, and is basically reporting that whatever program you're running under MPI has encountered a problem and so the MPI system is shutting the whole thing down. The `EXIT CODE:11` in the message indicates that this is likely a SegFault in the Rosetta code.

To track this down, it will help to take a closer look at the Rosetta output. What is being printed to the Tracers? Do you get any results, or does the program terminate with this issue immediately?

SegFaults are a bit tricky to track down. It's often necessary to recompile Rosetta in debug mode (with `mode=debug` on the scons command line) and then re-run things. Debug mode has extra checks which help with debugging, but slow things down. If possible, try running the same command on a single processor without MPI, and see if you can provoke an error message that way. (Without MPI the results will be more interpretable.) If you need MPI in order to provoke the behavior, use the `-mpi_tracer_to_file` option (e.g. `-mpi_tracer_to_file tracer.log`) so that each process writes its output to its own log file.
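Something along these lines, adjusting paths and names to your setup (the non-MPI binary name below is my assumption based on the usual Rosetta naming; yours may differ):

./scons.py bin mode=debug cxx=icc                    # debug build without MPI
AbinitioRelax.default.linuxiccdebug @options         # same job, single processor, no mpirun

./scons.py bin mode=debug extras=mpi cxx=icc         # debug build of the MPI variant, if you need it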

Hopefully you'll get some sort of error message or traceback from the debug build that will help us track down the issue. (If you're still getting a SegFault and no message with the debug build, you may need to run it under a debugger to get a backtrace.)
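If you do end up needing a debugger, a minimal gdb session on the single-process debug binary would look roughly like this (binary name assumed as above):

gdb --args AbinitioRelax.default.linuxiccdebug @options
(gdb) run          # wait for the SegFault to trigger
(gdb) bt           # print the backtrace at the crash site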

 

Another thing you probably want to do before going through the hassle of recompiling is to double-check all your input files. SegFaults normally pop up when input files don't quite conform to the format which Rosetta expects.
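As a quick illustration, a couple of shell checks along these lines can catch the obvious problems (the file names here are placeholders for your own inputs):

grep -v '^>' target.fasta | tr -d ' \t\n' | wc -c    # residue count -- should match the length you expect
head -n 3 frags.200.3mers frags.200.9mers            # fragment files exist, are non-empty, and look like fragment format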

Mon, 2017-09-25 09:56
rmoretti

Dear rmoretti,

Thank you so much for your reply. I have recompiled Rosetta 3.8 following your instructions. After running the command "mpirun -np 64 AbinitioRelax.mpi.linuxiccrelease -mpi_tracer_to_file tracer.log @options", I don't get any results in the input_files directory, and the error is still displayed:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID XXXXX RUNNING AT LOCALHOST.LOCALDOMAIN
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

The content of the file tracer.log is:

------------------------------------------------------------------------

core.init: (0) Rosetta version unknown:exported  from http://www.rosettacommons.org
core.init: (0) command: AbinitioRelax.mpi.linuxiccrelease -mpi_tracer_to_file tracer.log @options
core.init: (0) 'RNG device' seed mode, using '/dev/urandom', seed=-1822093000 seed_offset=0 real_seed=-1822093000
core.init.random: (0) RandomGenerator:init: Normal mode, seed=-1822093000 RG_type=mt19937
core.init: (0) Resolved executable path: /home/Rosetta/rosetta_bin_linux_2017.08.59291_bundle/main/source/build/src/release/linux/3.10/64/x86/icc/16.0/mpi/AbinitioRelax.mpi.linuxiccrelease
core.init: (0) Looking for database based on location of executable: /home/Rosetta/rosetta_bin_linux_2017.08.59291_bundle/main/database/
protocols.abinitio.AbrelaxApplication: (0) read fasta sequence: 225 residues
RPVFEREIYTAGIYETDTSNRELLTVHATHTEGLDITYTMDLDTMVVDPSLEGVRESAFT    
LHPSSGVLSLNMNPLDTMVGMFEFDVVATDTRGAEARTDVKIYLITHLNRVYFLFNNTL
DVVDSNRAFIADTFSSVFSLTCNIDAVLRAPDSSGAARDDRTEVRAHFIRNHVPATTDEI
EQLRSNTILLRAIQETLLTRELHLEDFVGGSSPELGVDNSLT
protocols.evaluation.ChiWellRmsdEvaluatorCreator: (0) Evaluation Creator active ...
core.chemical.GlobalResidueTypeSet: (0) For ResidueTypeSet centroid there is no shadow_list.txt file to list known PDB ids.
core.chemical.GlobalResidueTypeSet: (0)     This will turn off PDB component loading for ResidueTypeSet centroid
core.chemical.GlobalResidueTypeSet: (0)     Expected file: /home/Rosetta/rosetta_bin_linux_2017.08.59291_bundle/main/database/chemical/residue_type_sets/centroid/shadow_list.txt
core.chemical.GlobalResidueTypeSet: (0) Finished initializing centroid residue type set.  Created 62 residue types
core.chemical.GlobalResidueTypeSet: (0) Total time to initialize 0.05 seconds.
core.io.fragments: (0) reading fragments from file: aat000_09_05.200_v1_3 ...
core.io.fragments: (0) rosetta++ fileformat detected! Calling legacy reader...
core.fragments.ConstantLengthFragSet: (0) finished reading top 25 9mer fragments from file aat000_09_05.200_v1_3
core.io.fragments: (0) reading fragments from file: aat000_03_05.200_v1_3 ...
core.io.fragments: (0) rosetta++ fileformat detected! Calling legacy reader...
core.fragments.ConstantLengthFragSet: (0) finished reading top 200 3mer fragments from file aat000_03_05.200_v1_3
core.fragment: (0) compute strand/loop fractions for 221 residues...
protocols.abinitio.AbrelaxApplication: (0) run ClassicAbinitio.....
basic.io.database: (0) Database file opened: scoring/score_functions/EnvPairPotential/env_log.txt
basic.io.database: (0) Database file opened: scoring/score_functions/EnvPairPotential/cbeta_den.txt
basic.io.database: (0) Database file opened: scoring/score_functions/EnvPairPotential/pair_log.txt
basic.io.database: (0) Database file opened: scoring/score_functions/EnvPairPotential/cenpack_log.txt
basic.io.database: (0) Database file opened: scoring/score_functions/SecondaryStructurePotential/phi.theta.36.HS.resmooth
basic.io.database: (0) Database file opened: scoring/score_functions/SecondaryStructurePotential/phi.theta.36.SS.resmooth
core.scoring: (0) ATOM_VDW set to CENTROID_ROT_MIN
basic.io.database: (0) Database file opened: scoring/score_functions/hbonds/sp2_elec_params/HBPoly1D.csv
basic.io.database: (0) Database file opened: scoring/score_functions/hbonds/sp2_elec_params/HBFadeIntervals.csv
basic.io.database: (0) Database file opened: scoring/score_functions/hbonds/sp2_elec_params/HBEval.csv
basic.io.database: (0) Database file opened: scoring/score_functions/rama/Rama_smooth_dyn.dat_ss_6.4
basic.io.database: (0) Database file opened: scoring/score_functions/centroid_smooth/cen_rot_pair_params.txt
basic.io.database: (0) Database file opened: scoring/score_functions/centroid_smooth/cen_rot_env_params.txt
basic.io.database: (0) Database file opened: scoring/score_functions/centroid_smooth/cen_rot_cbeta_params.txt
basic.io.database: (0) Database file opened: scoring/score_functions/centroid_smooth/cen_rot_pair_ang_params.txt
core.scoring.AtomVDW: (0) Openning alternative vdw file: /home/Rosetta/rosetta_bin_linux_2017.08.59291_bundle/main/database/chemical/atom_type_sets/centroid_rot//min.txt
core.scoring: (0) ATOM_VDW set to CENTROID_ROT_MIN
core.scoring.ScoreFunctionFactory: (0) SCOREFUNCTION: talaris2014
core.scoring.etable: (0) Starting energy table calculation
core.scoring.etable: (0) smooth_etable: changing atr/rep split to bottom of energy well
core.scoring.etable: (0) smooth_etable: spline smoothing lj etables (maxdis = 6)
core.scoring.etable: (0) smooth_etable: spline smoothing solvation etables (max_dis = 6)
core.scoring.etable: (0) Finished calculating energy tables.
basic.io.database: (0) Database file opened: scoring/score_functions/P_AA_pp/P_AA
basic.io.database: (0) Database file opened: scoring/score_functions/P_AA_pp/P_AA_n
basic.io.database: (0) Database file opened: scoring/score_functions/P_AA_pp/P_AA_pp
protocols.jobdist.JobDistributors: (0) Node: 0 next_job()
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 1 1  1
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 1 to node 2
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 2 1  2
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 2 to node 4
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 3 1  3
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 3 to node 5
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 4 1  4
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 4 to node 6
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 5 1  5
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 5 to node 7
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 6 1  6
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 6 to node 12
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 7 1  7
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 7 to node 17
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 8 1  8
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 8 to node 18
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 9 1  9
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 9 to node 19
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 10 1  10
protocols.jobdist.JobDistributors: (0) Master Node --available job? 1
protocols.jobdist.JobDistributors: (0) Master Node -- Assigning job 1 10 to node 26
protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
protocols.jobdist.JobDistributors: (0) Looking for an available job: 11 1  11

------------------------------------------------------------------------

Can you suggest a solution for this error?

 

Jiyuan

Mon, 2017-09-25 21:09
kingljy

It looks like you're still running the release-mode application (AbinitioRelax.mpi.linuxiccrelease). You need to switch to the debug-mode application that you compiled (AbinitioRelax.mpi.linuxiccdebug) in order to get the extra debug-mode checks.
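In other words, something like this (binary name taken from your debug build; adjust the path as needed):

mpirun -np 64 AbinitioRelax.mpi.linuxiccdebug -mpi_tracer_to_file tracer.log @options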

Also, when using `-mpi_tracer_to_file`, be sure to examine all the various log files produced (there should be one for each process), not just the main one.
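For example, a quick way to scan all of them at once (the per-process files are usually the name you gave plus a per-process suffix, so a wildcard catches them):

ls tracer.log*                                           # one log per process
grep -il "error\|exception\|segmentation" tracer.log*    # which processes reported a problem
tail -n 20 tracer.log*                                    # last lines of each log, where a crash usually shows up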

Tue, 2017-10-10 09:56
rmoretti