
Submitting job to local cluster


Hi All,

I have installed Rosetta on our institute cluster with MPI support (extras=mpi). I want to run a Rosetta script (favor_native_residue.xml) on the cluster. Can anyone kindly let me know how to do it? Attached is the script I used to submit a GROMACS job on the same cluster (originally gromacs_job.sh, renamed to gromacs_job.txt so it could be attached) and my favor_native_residue.xml (renamed to favor_native_residue.txt for attachment).

 

Thanks a lot.

 

Attachments:
gromacs_job.txt (1.25 KB)
favor_native_residue.txt (583 bytes)
Fri, 2016-04-15 01:38
tusharranjanmoharana

1) Don't run on MPI first; run test jobs outside MPI to make sure you have all your files set up right.  Debugging MPI is a nightmare, and the problem is almost never actually in the MPI layer.  Have you compiled both the normal non-MPI build (for testing) and the MPI build?

2) You have a Rosetta script, which runs through the rosetta_scripts application, but you haven't mentioned your command-line flags.  The source you got favor_native_residue.xml from should have included a set of flags/options to accompany it.  You'll need those too, probably with minor modifications.  For MPI you'll want "-mpi_tracer_to_file proc" to write each processor's stdout to a separate file.  Depending on the number of processors, you might want "-mpi_work_partition_job_distributor" to avoid wasting one processor on the head node.

3) Your script will run GROMACS or Rosetta equally well; just change that last line to call your Rosetta executable instead of GROMACS.  Instead of passing all your flags there, I'd use the @ syntax, so it would look something like "$(which mpirun) -np $NSLOTS /path/to/rosetta_scripts.mpi.linuxgccrelease @flags", where flags is a file with each command-line option on a separate line (a fuller sketch follows below).
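To make that concrete, here is a minimal sketch. The input name and nstruct value are placeholders; your real flags should come from whatever documentation accompanied favor_native_residue.xml. The flags file might look like:

-s input.pdb
-parser:protocol favor_native_residue.xml
-nstruct 10
-mpi_tracer_to_file proc

and the last line of your submission script would then become:

$(which mpirun) -np $NSLOTS /path/to/rosetta_scripts.mpi.linuxgccrelease @flags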

Fri, 2016-04-15 07:28
smlewis

Hi smlewis,

Thanks for your suggestions.  I prepared the job submission script and flags file as you described. I am attaching the .sh script I used to submit the job (rosetta_job_submit.txt), the flags file (flag.txt), and the error log (rosetta.txt) that I got after submitting to the server. I would be grateful for anything else to try. The PDB and resfiles are in their respective places and are fine (they work on my local computer). If you need any other information, please ask.

Sat, 2016-04-16 04:12
tusharranjanmoharana

Unfortunately I don't know what the problem is because I've never used your cluster architecture (I've always used LSF).  

I guess here are some things to try:

1) is there a space after the @ for @flags.txt?  Don't put a space.

2) the *only* Rosetta output I am seeing is the line "ERROR: ERROR:  Multiple values specified for option -out:nstruct" - which doesn't even make sense because you clearly only specified it once.  Do the proc files never get generated (proc_0, proc_1, etc)?

3) Can you try running the MPI version on the head node?  Don't let it run long, just 30 seconds or so to verify that it spins up correctly, then kill it (see the example after this list).

4) Most of those error lines in rosetta.txt are the same line over and over indicating something is missing.  Ask around other users of your cluster if that message means something in a general sense (it's not Rosetta specific, that's for sure).  Maybe they'll just know "oh, it means widget B is missing" and that will fix it.
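For the head-node smoke test in (3), something like this would work (a sketch only: timeout is from GNU coreutils, the Rosetta path is a placeholder, and 2 processors is just enough to exercise the MPI layer):

timeout 30 $(which mpirun) -np 2 /path/to/rosetta_scripts.mpi.linuxgccrelease @flags.txt

timeout kills the job after 30 seconds, so you don't have to watch it.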

Wed, 2016-04-20 09:27
smlewis

1) Spaces should be fine between the @ and the options file name, at least for recent-ish weekly releases. (It's a feature which has been added to allow tab-completion of option file names.)

2) Normally, if you specify something like -nstruct twice (e.g. once on the commandline and once in an options file) you'll get "WARNING: Override of option -out:nstruct sets a different value", rather than that error message.

Instead, it looks like you have multiple "values" being passed to the -out:nstruct option. This doesn't happen on the command line, but in the option file, if you have a line like "-out:nstruct 20 5" you'll get exactly the error seen.

In your case, it's not an actual second number, but probably a non-printable character at the end of the line. I'm guessing it's a Unix/Windows line ending issue: Windows uses an extra character in its line endings, which causes a host of problems on Unix systems. Try running the dos2unix application against your flags.txt file. Either that, or re-type the file in a text editor directly on a Unix system (DON'T copy/paste).
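Two standard Unix commands (nothing Rosetta-specific) make this easy to check and fix:

cat -A flags.txt     # lines ending in ^M$ have Windows (CRLF) line endings
dos2unix flags.txt   # rewrites the file in place with Unix (LF) endings

Running "file flags.txt" will also report "with CRLF line terminators" if the endings are wrong.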

4) Agreed that those errors are not due to Rosetta. But I'm guessing it's also related to the line ending issue. If rosetta_job_submit.txt has the same Windows line ending issues, then Unix may be interpreting the non-printable character as a "command", and giving that error when it can't run it.

Thu, 2016-04-28 15:16
rmoretti

Hi smlewis,

Hi rmoretti,

Thanks a lot for your valuable advice, and sorry for the late reply. I tried your suggestions, but unfortunately nothing worked. When I ran on the head node as smlewis suggested, I got the following error:

Error obtaining unique transport key from ORTE (orte_precondition_transports not present in
the environment).

  Local host: epihed1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)


When I ran the test mentioned in the Rosetta manual (./integration.py --mpi-tests), I got the errors listed in error_cluster.txt. I got similar errors on my local computer (listed in error_local.txt), but there I didn't encounter any problems when running actual jobs.

It would be a great help if you could suggest anything.

Thanks a lot

Tushar

 

Mon, 2016-05-09 02:06
tusharranjanmoharana

Hi smlewis,

Hi rmoretti,

After changing one line in the flags file I am getting a completely new error. The files are attached with the required names. Please go through them.

Mon, 2016-05-09 04:13
tusharranjanmoharana

Continued from the previous message.

Mon, 2016-05-09 04:14
tusharranjanmoharana

I can't say for certain, as I've never used your cluster system, but I'm interpreting the "Error obtaining unique transport key from ORTE" (and the associated message) as indicating that your cluster is set up to only run MPI jobs through its scheduler - that is, you can't actually run a serial job on the head node. Talk to your cluster administrators to be sure, but there certainly are clusters where everything has to go through MPI (and others where running anything on the head node is prohibited).

Regarding the new_error.txt, it looks like you're getting some sort of carriage return right after "flags.txt" in your job submission script. (Note how the closing single quote is on its own line.) I'm guessing it's the same non-printable character issue as before. Try running your submission script (as distinct from the flags.txt file) through dos2unix.

I'd also *highly* recommend that you avoid doing things on Windows machines when working with the cluster. Try to edit files only directly on a Unix system (or a Mac OS X machine). If you edit files on your Windows machine, you're going to keep running into line ending issues, and you'll keep getting errors unless you remember to pass everything through dos2unix.
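Since dos2unix leaves files that already have Unix endings untouched, it's safe to convert everything you've been editing in one go (file names here are taken from your attachments; adjust to the real names):

dos2unix rosetta_job_submit.sh flags.txt favor_native_residue.xml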

Mon, 2016-05-09 09:23
rmoretti