Problems running multiple Rosetta tasks

Posted: Mon Jun 29, 2020 7:07 pm
by UBT - Timbo
Hi all

A fellow UBT member on Rosetta (who isn't a forum member and contacted me via the Rosetta website) has got an issue with his hardware as follows:
I have a Twin E5 CPU PC running Windows 10. This machine is fully committed to Rosetta.

36 work units each are distributed to the two CPUs, but one or other CPU is always only partly used, with an average of 25-30 work units. This swaps and changes between the CPUs seemingly at random.

I can temporarily resolve this by using the "set affinity" feature in Task Manager to reassign a work unit to the more idle node, but it's not long before one or other of the CPUs is short of work again.

Aside from the usual specific things that might be limiting the hardware (i.e. BM or Rosetta preferences incorrectly set, app_config setting limits in the Rosetta folder, lack of available free RAM), can anyone think of any other reason why Rosetta might not be firing on all cylinders?

regards
Tim

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:26 am
by chriscambridge
My first question would be what does he mean by "node", as in "to reassign a work unit to the more idle node"?

In HPC, nodes are what you have when, say, 4 independent machines sit within a single server: each node has its own CPU, RAM, HD, etc., but power, fans, etc. are all provided as a whole via the server. That would be called a 4-node server.

This is obviously not what he means, so it's a bit unclear exactly what he means by "I can temporarily resolve this by using the "set affinity" feature in Task Manager to reassign a work unit to the more idle node, but it's not long before one or other of the CPUs is short of work again".

Also the only thing I could find about BOINC Affinity was related to NUMA, and I don't think this is of any relevance either:

https://boinc.berkeley.edu/dev/forum_th ... p?id=10124

Forgetting both of these, it does sound pretty strange though. About the only time that I have seen BM act erratically is when tasks are not able to finish in time, but in that case the last thing you would expect is for BM to *not* fully utilise all cores/threads.

Perhaps BM is not recognising all cores/threads (due to a bug or faulty CPU/etc)?

I would restart BM and ask him to post (or send you) the startup lines from the Event Log; at least then we can see what BM is seeing in terms of CPU, etc. For example:
30/06/2020 09:13:33 | | Windows processor group 0: 8 processors
30/06/2020 09:13:33 | | Host name: MYHOSTNAME
30/06/2020 09:13:33 | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz [Family 6 Model 60 Stepping 3]
30/06/2020 09:13:33 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx smx tm2 pbe fsgsbase bmi1 smep bmi2
30/06/2020 09:13:33 | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.18363.00)
30/06/2020 09:13:33 | | Memory: 15.89 GB physical, 23.13 GB virtual
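
When comparing startup messages from several hosts, the relevant numbers can be pulled out with a short script. This is a minimal sketch that assumes the Event Log lines look exactly like the excerpt above; the function name `summarize_startup` and the regular expressions are illustrative and not part of BOINC.

```python
import re

# Three of the Event Log lines quoted above, used here as sample input.
SAMPLE_LOG = """\
30/06/2020 09:13:33 | | Windows processor group 0: 8 processors
30/06/2020 09:13:33 | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz [Family 6 Model 60 Stepping 3]
30/06/2020 09:13:33 | | Memory: 15.89 GB physical, 23.13 GB virtual
"""

# Patterns assume the exact wording shown in the excerpt (an assumption).
GROUP_RE = re.compile(r"Windows processor group (\d+): (\d+) processors")
CPU_RE = re.compile(r"\| Processor: (\d+) ")
MEM_RE = re.compile(r"Memory: ([\d.]+) GB physical, ([\d.]+) GB virtual")


def summarize_startup(log_text):
    """Collect processor groups, CPU count and memory from startup lines."""
    summary = {"groups": {}, "cpus": None, "ram_gb": None, "virtual_gb": None}
    for line in log_text.splitlines():
        group = GROUP_RE.search(line)
        cpu = CPU_RE.search(line)
        mem = MEM_RE.search(line)
        if group:
            # Map group number -> logical processors reported in that group.
            summary["groups"][int(group.group(1))] = int(group.group(2))
        elif cpu:
            summary["cpus"] = int(cpu.group(1))
        elif mem:
            summary["ram_gb"] = float(mem.group(1))
            summary["virtual_gb"] = float(mem.group(2))
    return summary


if __name__ == "__main__":
    print(summarize_startup(SAMPLE_LOG))
```

On the 72-thread host the "processor group" lines are worth a look in particular: Windows places at most 64 logical processors in a single processor group, so a machine of that size should report more than one group.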

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:27 am
by Woodles
I'm guessing by 'node' they mean a CPU in a normal dual-CPU setup.

CPU percentage?
Disc space?
RAM allowance?

Sounds like RAM usage. When I tried Rosetta a while back, each task was taking over a gig of RAM. If they have 64 gig installed and don't allow all of it to be used for BOINC, then they might be limited to 61 - 66 tasks.

Is the host visible on the Rosetta website?

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:29 am
by chriscambridge
If you have a lack of RAM then it normally suspends tasks with a message of "not enough memory" (or something similar).

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:33 am
by Woodles
That was my previous experience as well, but it's the usual way tasks end up limiting themselves.

I have had a badly configured host that was set to use only 10 gig of disc space (rather than leave 10 gig free), and more tasks just didn't get started. The error log stated "application needs xxx of disc space, only yyy available", but there was no other indication.
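
The gap between "use at most 10 gig" and "leave 10 gig free" can be sketched with a little arithmetic. This is a simplified illustration, not the BOINC client's actual logic; the function names and the 500 gig example drive are hypothetical.

```python
def allowance_use_at_most(limit_gb):
    """'Use no more than X' setting: BOINC may occupy at most limit_gb."""
    return float(limit_gb)


def allowance_leave_free(free_gb, boinc_used_gb, reserve_gb):
    """'Leave at least X free' setting: BOINC may grow until only
    reserve_gb of the drive is left unused."""
    return max(0.0, free_gb + boinc_used_gb - reserve_gb)


if __name__ == "__main__":
    # Hypothetical 500 gig drive: 300 gig free, BOINC already using 5 gig.
    print(allowance_use_at_most(10))         # BOINC capped at 10 gig
    print(allowance_leave_free(300, 5, 10))  # BOINC may use up to 295 gig
```

With the same "10" typed into the wrong box, the allowance drops from hundreds of gig to 10, which matches the symptom of tasks quietly not starting.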

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:35 am
by chriscambridge
Yeah that is very true, I hadn't thought of that. I think if he restarted BM he would get that "lack of disc space" notification pretty quick if that was the issue?
Woodles wrote: I'm guessing by 'node' they mean a CPU in a normal dual-CPU setup.
Can you use the affinity menu to tell BM what CPU to run a given task on? That sounds pretty complex for BM.

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:37 am
by Woodles
Is this "Bill"?

His 72-thread, dual E5 host has 64 gig of RAM, and the task output states "Peak working set size 2,229.42 MB" for RAM usage, so that's a 2+ gig peak for each task and 'only' 64 gig available.
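
As a rough worst case, the figures above can be turned into a task count. This sketch assumes every task hits its peak working set at the same moment, which is pessimistic; the function names are illustrative only.

```python
def max_tasks_by_ram(total_ram_gb, peak_per_task_mb, usable_fraction=1.0):
    """How many tasks fit if each one sits at its peak working set."""
    usable_mb = total_ram_gb * 1024 * usable_fraction
    return int(usable_mb // peak_per_task_mb)


def ram_needed_gb(tasks, peak_per_task_mb):
    """RAM required for all tasks to peak simultaneously."""
    return tasks * peak_per_task_mb / 1024


if __name__ == "__main__":
    # 64 gig of RAM and the 2,229.42 MB peak quoted from the task output.
    print(max_tasks_by_ram(64, 2229.42))         # 29
    print(round(ram_needed_gb(72, 2229.42), 1))  # 156.8
```

In practice tasks rarely all peak together, so the real ceiling should sit well above 29, but the numbers show why RAM is a reasonable suspect on a 72-thread host.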

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:42 am
by Woodles
The disc error message only appears when BOINC tries to start another task; it wouldn't show in the start-up messages.

BM knows nothing about affinity; it just starts the application. You have to assign a specific process to a specific core using Task Manager or similar. Then it loses it as soon as the task finishes and you have to do it again with the new process (even though it's the same application).

Open Task Manager, right click on a process and there's a "set affinity" option, normally set to "ALL".

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:46 am
by chriscambridge
I only had the lack of disc space once, incorrectly I might add, so was not sure exactly when it got triggered.

Oh he's talking about affinity via Windows/OS; I thought he meant via BM. Hmm not sure if he should start messing with that, seems like a great way to confuse BM even further.

I guess the simple solution/analysis for him is just to check task manager when BM is fully loaded with R@H, and check how much Disk/RAM is used/left!

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 9:54 am
by Woodles
It's easy to confuse BM :)

Can he add options to cc_config.xml and report the results?

<cc_config>
  <log_flags>
    <mem_usage_debug>1</mem_usage_debug>
    <sched_op_debug>1</sched_op_debug>
    <slot_debug>1</slot_debug>
    <state_debug>1</state_debug>
    <task_debug>1</task_debug>
  </log_flags>
</cc_config>
I'm assuming he does have enough tasks to run anyway and not just ~60 downloaded?
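
Since a single dropped angle bracket is enough to make cc_config.xml unparsable, it can help to generate the file and check that it is well formed before restarting the client. A minimal sketch using Python's standard library; the flag names are the ones listed in this post, while `build_cc_config` and `is_well_formed` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Log flags suggested in the post above.
DEBUG_FLAGS = [
    "mem_usage_debug",
    "sched_op_debug",
    "slot_debug",
    "state_debug",
    "task_debug",
]


def build_cc_config(flags):
    """Return cc_config XML text with each given log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text):
    """True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    text = build_cc_config(DEBUG_FLAGS)
    print(text)
    print(is_well_formed(text))
```

The output is written on one line without indentation, which is still valid XML; it only checks syntax, not whether the client recognises each flag.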

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 1:07 pm
by UBT - Timbo
Woodles wrote: Tue Jun 30, 2020 9:37 am Is this "Bill"?

His 72-thread, dual E5 host has 64 gig of RAM, and the task output states "Peak working set size 2,229.42 MB" for RAM usage, so that's a 2+ gig peak for each task and 'only' 64 gig available.
Hi Mark

Yup - it is "Bill".

I've already replied to him with a number of things to check, and I'm waiting to hear back from him. However, in my reply to him, I linked to my first post here, so he can read any comments made (as this section is "public") and maybe any other suggestions can be checked.

I, too, am assuming he is talking about this specific host (link below). And getting this host to run (say) 70+ concurrent Rosetta tasks might be tricky with only 64 GB of RAM. Likewise, there could be an awful lot of "disk-paging".

https://boinc.bakerlab.org/rosetta/show ... id=3147531

Another possible issue (which I have found limited info about) is that BOINC "only supports a maximum of 32 cores per physical CPU and only 64 threads".

https://boinc.berkeley.edu/forum_thread.php?id=12877

So, given a 72-thread host, it's possible that the maximum number of tasks that could run concurrently is 64. Does anyone else have a single host with more than 64 vCores? Have you found any limitations regarding maximum number of concurrent tasks (on a Windows machine)?

If Bill finds he has 32 threads running on one CPU, then it's quite possible that he might only get 25-30 running on the 2nd CPU, depending on memory limits, and OS requirements, plus maybe running his GPU (NVIDIA Quadro K620).

regards
Tim

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 1:10 pm
by UBT - Timbo
chriscambridge wrote: Tue Jun 30, 2020 9:46 am I only had the lack of disc space once, incorrectly I might add, so was not sure exactly when it got triggered.

Oh he's talking about affinity via Windows/OS; I thought he meant via BM. Hmm not sure if he should start messing with that, seems like a great way to confuse BM even further.

I guess the simple solution/analysis for him is just to check task manager when BM is fully loaded with R@H, and check how much Disk/RAM is used/left!
Hi Chris

Thanks for the input on this.

Hopefully, I'll hear back from Bill and somehow get to the bottom of this.

regards
Tim

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 1:55 pm
by Woodles
Hi Tim,
UBT - Timbo wrote: Tue Jun 30, 2020 1:07 pm I've already replied to him with a number of things to check, and I'm waiting to hear back from him. However, in my reply to him, I linked to my first post here, so he can read any comments made (as this section is "public") and maybe any other suggestions can be checked.
So we have to stop insulting him now and clean up the thread? :lol:
UBT - Timbo wrote: Tue Jun 30, 2020 1:07 pm And getting this host to run (say) 70+ concurrent Rosetta tasks might be tricky with only 64 GB of RAM. Likewise, there could be an awful lot of "disk-paging".
Could also be a limit on how much disc/swap file is allowed to be used?
UBT - Timbo wrote: Tue Jun 30, 2020 1:07 pm Another possible issue (which I have found limited info about) is that BOINC "only supports a maximum of 32 cores per physical CPU and only 64 threads".

https://boinc.berkeley.edu/forum_thread.php?id=12877

So, given a 72-thread host, it's possible that the maximum number of tasks that could run concurrently is 64. Does anyone else have a single host with more than 64 vCores? Have you found any limitations regarding maximum number of concurrent tasks (on a Windows machine)?
The top Rosetta host apparently has 192 threads - https://boinc.bakerlab.org/rosetta/show ... id=3959217

I also believe Damien has a 3990X Threadripper (64 cores / 128 threads) and he's not mentioned any problems with running more than 32 threads.

Mark

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 2:17 pm
by UBT - Timbo
Hi Mark

Some more info just received from Bill:
The host in question has no purpose other than running BOINC and Rosetta.

OS: Windows 10 Pro 10.0.18362 Build 18362
Mainboard: Supermicro X10DAi V1.02
CPUs: 2x Xeon E5-2697 v4, 18 cores / 36 logical processors each. Total 72.
GPU: not assigned to run any projects or tasks.
BIOS: AM 2.0a, 09/11/2016 Mode UEFI
RAM: 64GB. All permanently available to BOINC.
C Drive: NVMe INTEL SSDPEDMW40 368GB. Total used space 108GB All available to BOINC.
Page/swap file use: 100%. Space available 30000MB
BOINC CPU Computing/Usage Limits: None.
Min Work Buffer: 2 days. (Currently 418 tasks).

If Windows Task Manager/details/work unit/affinity is adjusted to provide a balanced work load between the two nodes or CPUs, the RAM used is usually around 30GB, but doesn't change much even without any adjustment.
The page/swap file, using the SSD, is 30 GB, so I doubt that is too much of a limitation...but it could be made larger, just in case?
The top Rosetta host apparently has 192 threads - https://boinc.bakerlab.org/rosetta/show ... id=3959217
Nice find - but that host is running Linux, not Windows, so maybe the OS is a limitation?

Damien's hosts are hidden, so I can't check which OS he is running.

So, perhaps the 2 main factors are "memory limitations" and "Windows limitations"?

regards
Tim

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 4:50 pm
by Woodles
Hi Tim,

So:
Plenty of disc space, Rosetta mod recommended 1 gig per task + 10 so ~82 needed.
Plenty of swap space (might run slower with > 64 gig memory needed but it'll still run)
Plenty of work waiting.
Sounds like the RAM is sufficient for 72 tasks at once if 60 - 65 tasks is only 30 gig.
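
The estimates in this list can be reproduced with simple arithmetic. This sketch assumes disc need is 1 gig per task plus 10 gig, as recommended, and that RAM use scales linearly with the number of running tasks, which is only an approximation; the function names are illustrative.

```python
def disk_needed_gb(tasks, per_task_gb=1.0, overhead_gb=10.0):
    """Disc space for the given number of tasks: 1 gig each plus 10."""
    return tasks * per_task_gb + overhead_gb


def projected_ram_gb(target_tasks, observed_ram_gb, running_tasks):
    """Scale observed RAM use linearly up to target_tasks (an assumption)."""
    return target_tasks * observed_ram_gb / running_tasks


if __name__ == "__main__":
    print(disk_needed_gb(72))                      # 82.0
    print(projected_ram_gb(72, 30, 60))            # 36.0
    print(round(projected_ram_gb(72, 30, 65), 1))  # 33.2
```

On these averages, 72 tasks would need roughly 33 to 36 gig of RAM, comfortably under 64 gig; peaks for individual tasks can still be much higher than the average.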

I'm still tending towards memory limits as Rosetta takes a lot of RAM and other projects not as much. Can he try a different project and see if 72 tasks will run? ODLK/Latinsquares uses very little RAM if I remember correctly?

Thread limitations:
128 threads on Windows server - https://boinc.bakerlab.org/rosetta/show ... id=4404016
128 threads on Windows 10 - https://boinc.bakerlab.org/rosetta/show ... id=4451936

Looks like Damien's 3990X is on QuChem, not Rosetta.

Mark

Re: Problems running multiple Rosetta tasks

Posted: Tue Jun 30, 2020 6:29 pm
by UBT - Timbo
Hi Mark

Thanks for your further input.

I was also thinking of suggesting that he try another project, just as a test...and to choose one that was low on memory requirements.

At least that will help to confirm whether it is a "hardware" issue (such as RAM) or a software/configuration type issue.

regards and thanks
Tim