cluster.mpi.linuxgccrelease failed

Hi there,
I was clustering a silent file containing my 10% lowest-energy decoys when cluster.mpi.linuxgccrelease stopped and printed the following on screen:

--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 19956 on node compute-1-5 exited on signal 9 (Killed).
--------------------------------------------------------------------------

It looks like a problem with mpirun rather than with the cluster.mpi.linuxgccrelease binary.
I'm running cluster.mpi.linuxgccrelease with the following command line:
mpirun -x LD_LIBRARY_PATH=$LIB --mca btl_tcp_if_include eth0 -np 20 --host compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20 $BIN/cluster.mpi.linuxgccrelease -in:file:fullatom -in:file:silent_struct_type binary -in:file:silent ecut_10.out -cluster:radius -1

Did I miss some special MPIRUN option?
Thanks in advance.

Post Situation: 
Mon, 2014-02-24 12:11
fred

The clustering code was never parallelized, to my knowledge. I don't think it should actually fail under MPI, but it certainly won't work any better than the non-MPI version.

Mon, 2014-02-24 12:27
smlewis

Hi smlewis,
Thanks for your reply. Judging by the output on screen, the MPI version seems to work reasonably well, but it doesn't write the expected clusters before it dies. So I thought I had missed some mpirun option. Well, if you don't use the MPI build of cluster, who am I to use it? Thanks for sharing.
Best.

EDIT: the information below might be useful to other users and/or the authors.
Feb 25 14:59:59 compute-1-20 kernel: Out of memory: Kill process 25129 (cluster.mpi.lin) score 445 or sacrifice child
Feb 25 14:59:59 compute-1-20 kernel: Killed process 25129, UID 1006, (cluster.mpi.lin) total-vm:7986656kB, anon-rss:7777940kB, file-rss:2620kB
The kernel killed the process with the status "Out of memory". The same job completed with the non-MPI version, cluster.default.linuxgccrelease.
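
For anyone sizing a similar job, here is a rough sketch of the arithmetic behind that kernel message. It adds the anon-rss and file-rss values from the log to get the resident size of one rank at the moment it was killed, then divides an assumed node memory by that size. The 16 GiB node size and the helper name ranks_per_node are my assumptions, not values from this cluster. The process was still growing when it was killed, so treat the result as an upper bound.

```shell
# Hypothetical helper: how many ranks of a given resident size fit on one node.
# Arguments: node memory in kB, per-rank resident memory in kB.
ranks_per_node() {
    node_kb=$1
    rank_kb_arg=$2
    # Integer division: any partial rank is dropped.
    echo $(( node_kb / rank_kb_arg ))
}

# Resident size of the killed rank, from the kernel log: anon-rss + file-rss (kB).
rank_kb=$(( 7777940 + 2620 ))

# ASSUMPTION: a 16 GiB node (16777216 kB). Check the real value on a compute
# node with: grep MemTotal /proc/meminfo
ranks_per_node 16777216 "$rank_kb"
```

The division ignores memory used by the operating system and other jobs, so leaving some headroom below the printed number is safer.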

Tue, 2014-02-25 11:04
fred

This problem was solved by decreasing the number of processes per compute node.
See https://www.rosettacommons.org/node/3619
Hope it helps.
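
A minimal sketch of what fewer processes per node can look like, assuming Open MPI (the --mca option in my original command suggests it) and its -npernode option. PER_NODE=1 is an assumed value, not necessarily the one used in the linked thread. The script only prints the command, so it can be checked before running.

```shell
# Host list from the original command line.
HOSTS=compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20

# ASSUMPTION: one rank per node; raise it only if the node's memory allows.
PER_NODE=1

# Count the hosts in the comma-separated list, then derive the total rank count.
NODES=$(echo "$HOSTS" | tr ',' '\n' | wc -l)
NP=$(( NODES * PER_NODE ))

# Dry run: print the command instead of executing it. $LIB and $BIN are left
# unexpanded on purpose; they are the same variables as in the original command.
echo "mpirun -x LD_LIBRARY_PATH=\$LIB --mca btl_tcp_if_include eth0 -npernode $PER_NODE -np $NP --host $HOSTS \$BIN/cluster.mpi.linuxgccrelease -in:file:fullatom -in:file:silent_struct_type binary -in:file:silent ecut_10.out -cluster:radius -1"
```

Newer Open MPI releases express the same limit with --map-by ppr:N:node, so check the mpirun man page for the installed version.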

Thu, 2014-03-27 11:29
fred