Rosetta (antibody_designer) performance


Hi All,

 

I am building Rosetta (rosetta_bin_linux_2021.16.61629_bundle) and running a simulation that my customer provided.

What I would like to ask is whether the performance I am seeing is reasonable, and how I can improve it further.

Here is my environment:

1. Rosetta version: rosetta_bin_linux_2021.16.61629_bundle (cannot be changed)
2. Binary: antibody_designer.mpi.linuxiccrelease
3. Input dataset: please refer to the attached picture file; in particular, nstruct is 25,000.
4. Compute resource: mpirun -n 12160 $rosettaexe @common_flags.txt @experimental_flags.txt

What I am curious about is how the work is distributed: I used about 12,160 MPI ranks to generate 25,000 PDB files. Which of the following is how the computation is allocated?

#1: One MPI rank works on one PDB at a time, so each rank ends up computing two PDBs (at most three).
#2: Multiple MPI ranks work together on one PDB, so the 12,160 ranks process the PDBs one at a time.

 

Please let me know if you have any suggestions, or any questions about the parts I should explain further.

 

Thanks,

Kihang

Attachment: Picture1.gif (203.98 KB)
Wed, 2022-03-02 00:48
Kihang _Youn

Hello.

Rosetta parallelizes per nstruct, with one rank serving as the head node. So it's one processor per output PDB - your option #1.

When one MPI job gets done, if there are other jobs left to run, it will run them. The thing to note about MPI and Rosetta is that it will wait until all the PDBs are complete before releasing the processors. So to save compute costs it's best to have the number of processors be an even divisor of nstruct, plus 1 for the head node. For example, if you have about 1k processors available for 2k nstruct, you would have mpiexec use 1001 processors so the work divides evenly. Individual nstruct may take different amounts of time, so you may still have processors waiting, but not as much.
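As a minimal sketch of that rule for the run above (25,000 nstruct, same flag files as in the original command; only the rank count changes):

    # 25,000 workers (one per output structure) + 1 head rank for dispatch/IO.
    mpirun -n 25001 $rosettaexe @common_flags.txt @experimental_flags.txt
    # With fewer ranks, pick a worker count that divides nstruct evenly,
    # e.g. 2,501 ranks -> each of the 2,500 workers produces exactly 10 structures.

(This is only an illustration of the divisor-plus-one rule described above, not a recommendation to actually use 25,001 ranks.)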

To alleviate this, you could also run it in batch mode instead of MPI, but you would need to set the constant_seed option manually so that the seed changes per process.
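A rough sketch of that batch-mode alternative, assuming a serial build of the binary is available and using the standard -run:constant_seed, -run:jran, -nstruct and -out:suffix options (the variable name $rosettaexe_serial and the seed/nstruct values are placeholders; check the option names against your build):

    # Launch 12 independent serial jobs, each with its own seed and output suffix
    # so the runs do not collide, then wait for all of them to finish.
    for i in $(seq 1 12); do
        $rosettaexe_serial @common_flags.txt @experimental_flags.txt \
            -run:constant_seed -run:jran $((1000 + i)) \
            -nstruct 100 -out:suffix _batch$i &
    done
    wait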

Wed, 2022-03-02 06:22
jadolfbr

 

Hi jadolfbr,

Thank you for your quick reply.

I understand how Rosetta's parallel operation works. Let me explain how the PDB files are being created.

#1. 1 PDB with 2 processes takes 25 minutes.
#2. 25,000 PDBs with 12,000 processes take 14 hours. For the first 30 minutes only logs are actively generated; after that, PDB files are created one by one at a rate of about 33 per minute over the remaining 13 hours.

Of course, I don't expect 25,000 PDBs run on 25,000 processes to take the same time as one PDB run on one process, but I am trying to figure out what is making the run so slow. My guess is that there is a bottleneck in the write stage, or that only a designated (small) number of processes is in charge of writing, so that only a few I/O operations happen at a time.
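One quick way to confirm that output rate while the job is running (a sketch; "out" is a placeholder for the actual output directory):

    # Count finished PDBs once a minute to see the effective write rate.
    watch -n 60 'ls out/*.pdb 2>/dev/null | wc -l'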

Wed, 2022-03-02 21:39
Kihang _Youn

I see. Yes, the head processor controls I/O so that the score file is not overwritten. I would imagine that it's the appending of the score files and the concurrent writing that is taking that long. However, I have written hundreds of thousands of nstructs and didn't have much of a problem. I will check with some people to see if they can weigh in as to what they think the issue may be. If the PDBs came out faster, i.e. once a minute, I could see more of why this would be the case.
 

The only issue with antibody_designer specifically is the concurrent reads of the SQLite database. This may slow down the overall concurrency. I've never had the opportunity to run it on so many processors at once, so I have never run into an issue like that, if it exists. The database is fairly large, so that may be the main problem, especially with a shared file system. It only reads 200 at a time to save on memory, but that is still quite a bit of I/O.
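If the shared file system turns out to be the culprit, one possible workaround is to stage the database onto node-local scratch before the run. This sketch assumes a node-local $TMPDIR, the standard -database flag, and placeholder paths; on multi-node runs the copy would have to happen on every node (e.g. via the scheduler's prolog):

    # Copy the Rosetta database (which includes the antibody design SQLite files)
    # to local disk so the concurrent reads stay off the shared file system.
    cp -r /shared/rosetta/main/database $TMPDIR/rosetta_database
    mpirun -n 12160 $rosettaexe @common_flags.txt @experimental_flags.txt \
        -database $TMPDIR/rosetta_database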

Wed, 2022-03-02 22:00
jadolfbr

I will see what can be done about it. A different database engine could speed it up. I'm not sure what else, since each process pulls from the database independently, and it is fairly large to store in RAM per process.

Wed, 2022-03-02 22:07
jadolfbr

Please check whether I have understood correctly:

The file that is locked for writing is score.sc. (That is the suspected bottleneck, but you have never seen the symptom even when running more than 100k nstruct.)
Each write is done by a worker process, but it is managed by the master node.

Just one question: if 100 nstructs were executed by 101 processes (#1) and 1 nstruct were executed by 2 processes (#2), would you expect the times to be similar?

I will report the results after running more combinations of nstruct and process counts.

Thu, 2022-03-03 00:19
Kihang _Youn

 


Hello all,

I have a new analysis result.
These are running times for different nstruct values and core counts.

#cores   #structs   Etime (sec)
    76         75         2,901
   152        150         3,156
   304        300         4,771
   608        600         6,449
  1216       1200         6,828
  2432       2400         6,435
  3648       3600         6,529
  4864       4800         9,676
  6004       6000        12,494
  7524       7500        16,948
 15048      15000        35,098
 25004      25000        59,027

 

Compute resources were used in roughly a 1:1 ratio with nstruct, so I expected the elapsed times to stay at about the same level if there were no performance degradation.
However, performance appears to drop steadily starting at around 3,600 cores, and I'd like to hear your opinion on how to fix this.
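To make the trend explicit, here is a quick pass over the table (a sketch; it assumes the table is saved as a whitespace-separated file called scaling.txt with one header line) that normalizes each wall time against the 76-core run. With cores scaling 1:1 with nstruct, this ratio would ideally stay near 1.0, but it grows to roughly 20x at 25,000 cores:

    # Strip the thousands separators, then print each run's wall time
    # relative to the 76-core baseline.
    awk 'NR == 2 { gsub(",", "", $3); base = $3 }
         NR > 1  { gsub(",", "", $3); printf "%6d cores: %5.1fx the 76-core wall time\n", $1, $3 / base }' scaling.txt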

The improvements I can think of are listed below; is it possible to implement them in Rosetta now?
#1. Communicate results to the master rank so that only one rank is responsible for I/O.
#2. Remove the lock on score.sc.
#3. Write the files to local rather than shared storage (see the sketch after this list).
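For idea #3, a minimal sketch of what that could look like from the job-script side, assuming a node-local $TMPDIR and the standard -out:path:all flag (all paths here are placeholders, and on multi-node runs the copy-back would need to run on every node that writes output):

    # Write PDBs and score files to local scratch during the run,
    # then copy the results back to shared storage once at the end.
    mkdir -p $TMPDIR/out
    mpirun -n 25001 $rosettaexe @common_flags.txt @experimental_flags.txt \
        -out:path:all $TMPDIR/out
    rsync -a $TMPDIR/out/ /shared/project/results/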

Best Regards,
Kihang

 

 

Thu, 2022-03-10 16:30
Kihang _Youn