Hi guys,
I recently downloaded and compiled Rosetta with MPI support to take advantage of the 32-core processor in our workstation. Compilation went well and I can call protocols, but they all seem to hang.
To help narrow things down, I am working out of the DeNovo Structure Prediction tutorial demo directory - I can call the protocol and it seems to start running as normal:
mpirun -n 32 $ROSETTA_MPI/main/source/bin/AbinitioRelax.mpi.linuxgccrelease @input_files/options
Everything starts up like normal, but it always ends up hanging on this output:
~$: protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
I dug around these forums - seems that the code is still trying to run off of only one core - not sure why. Is there a way to specify I want to run on many cores? I thought this was the purpose of running compiled binaries with extras=mpi.
I looked into the code where it gets stuck, and it seems to be waiting forever on a return from the MPI_Recv( ) function. I could be wrong, as I can't read C++ all that well:
(From protocols.jobdist.JobDistributors)
    while ( true ) {
        int node_requesting_job( 0 );

        JobDistributorTracer << "Master Node -- Waiting for job request; tag_ = " << tag_ << std::endl;
        MPI_Recv( & node_requesting_job, 1, MPI_INT, MPI_ANY_SOURCE, tag_, MPI_COMM_WORLD, & stat_ );
        bool const available_job_found = find_available_job();

        JobDistributorTracer << "Master Node --available job? " << available_job_found << std::endl;

        Size job_index = ( available_job_found ? current_job_ : 0 );
        int struct_n = ( available_job_found ? current_nstruct_ : 0 );
        if ( ! available_job_found ) {
            JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << std::endl;
            MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
            break;
        } else {
            JobDistributorTracer << "Master Node -- Assigning job " << job_index << " " << struct_n << " to node " << node_requesting_job << std::endl;
            MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
            MPI_Send( & struct_n, 1, MPI_INT, node_requesting_job, tag_, MPI_COMM_WORLD );
            // ++current_nstruct_; handled now by find_available_job
        }
    }

    // we've just told one node to spin down, and
    // we don't have to spin ourselves down.
    Size nodes_left_to_spin_down( mpi_nprocs() - 1 - 1);

    while ( nodes_left_to_spin_down > 0 ) {
        int node_requesting_job( 0 );
        int recieve_from_any( MPI_ANY_SOURCE );
        MPI_Recv( & node_requesting_job, 1, MPI_INT, recieve_from_any, tag_, MPI_COMM_WORLD, & stat_ );
        Size job_index( 0 ); // No job left.
        MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
        JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << " with " << nodes_left_to_spin_down << " remaining nodes." << std::endl;
        --nodes_left_to_spin_down;
    }

}
Any help is appreciated!
Thanks!
Nathan
In your output, are you getting any '(1)' or other such (non-zero) labels?
The other thing I would double check is that the MPI libraries you compiled with are the proper "flavor" and version to go with the mpirun command you're using. If you have a "flavor" mismatch (e.g. running a Rosetta compiled with OpenMPI with an MPICH2 mpirun), you might have issues getting Rosetta to recognize that it's running under MPI.
I just ran it again, and it appears that all outputs have '(0)' as a label - no non-zero labels.
I need to double check the MPI libraries. Do you have a suggestion as to how I can check that? I am attempting to run the protocols using mpirun. I have OpenMPI installed, and when I compiled Rosetta, it was calling mpicc to compile the source. I also had to comment out all the header file environment variables in the site.settings file to get the code to compile with extras=mpi - I am not sure if this is necessary information, but it seems that both the INCLUDE and LD_LIBRARY_PATH environment variables were empty when I compiled - and it was able to compile after I told it to ignore those.
I am not sure if this is sufficient information! Let me know... Thank you!