GPUGrid bad batch of jobs

chriscambridge
Active UBT Contributor 1+ yr
Posts: 2083
Joined: Mon Aug 08, 2016 1:56 pm
Location: UK

Post by chriscambridge » Wed Oct 07, 2020 1:56 pm

In the last half hour every host has been getting computation errors, so I would say that a bad batch of jobs has been added to the GPUGrid queue.

If you are running GPUGrid at this time you may want to check your hosts.
Last edited by chriscambridge on Wed Oct 07, 2020 4:38 pm, edited 1 time in total.

Post by chriscambridge » Wed Oct 07, 2020 2:21 pm

Other people are also now getting these failed tasks:

https://www.gpugrid.net/forum_thread.php?id=5179

Woodles
Active UBT Contributor 20+ yrs
Posts: 11561
Joined: Thu Dec 20, 2007 12:00 am
Location: Cambridgeshire

Post by Woodles » Wed Oct 07, 2020 3:58 pm

Thanks Chris. I only had one GPU on it but failing tasks here as well :(

Oh well, it's been a while since I did any Collatz :)

Post by chriscambridge » Thu Oct 08, 2020 4:43 pm

GPUGrid has now been temporarily suspended.

The project will be offline until the app can be rebuilt. Uploads are still working, however, for outstanding tasks that were completed before the issue.

https://www.gpugrid.net/forum_thread.php?id=5180#55462

From what I can gather, the ACEMD license wasn't renewed, and this is causing all tasks to fail.

https://www.gpugrid.net/forum_thread.php?id=4970
https://www.gpugrid.net/forum_thread.php?id=5179

Post by chriscambridge » Fri Oct 09, 2020 9:31 pm

Dears, thanks for your patience.

I updated the acemd3 apps.

Also, I verified that very few results came from apps built for the old CUDA version, so I won't be re-deploying them. In other words, apps now require CUDA 10 (Linux) and CUDA 10.1 (Windows). In terms of driver versions:

CUDA 10.1 (10.1.105) >= 418.39 for Windows
CUDA 10.0 (10.0.130) >= 410.48 for Linux

https://docs.nvidia.com/deploy/cuda-com ... index.html
https://www.gpugrid.net/forum_thread.php?id=5183#55495
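
As a rough sketch (not part of the GPUGrid announcement), the driver minimums quoted above could be checked with a few lines of Python. The thresholds are copied exactly as posted and have not been verified independently; `MIN_DRIVER`, `parse_version` and `driver_ok` are made-up names used only for illustration.

```python
# Minimum driver versions exactly as quoted in the post above
# (assumption: copied as posted, not independently verified).
MIN_DRIVER = {
    "windows": (418, 39),  # CUDA 10.1 (10.1.105)
    "linux": (410, 48),    # CUDA 10.0 (10.0.130)
}


def parse_version(text):
    """Turn a string such as '418.56' into (418, 56) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(installed, platform):
    """Return True if the installed driver meets the quoted minimum."""
    return parse_version(installed) >= MIN_DRIVER[platform.lower()]


if __name__ == "__main__":
    print(driver_ok("435.21", "linux"))  # True
    print(driver_ok("390.87", "linux"))  # False
```

Versions are compared as tuples of integers rather than as strings or floats, so that 418.9 is correctly treated as older than 418.39. On a host with the Nvidia driver installed, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints the version string to pass in.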

Post by Woodles » Sat Oct 10, 2020 11:30 am

Thanks Chris, I'll check which drivers I'm using; they seemed to be alright before.

Post by chriscambridge » Sun Oct 11, 2020 12:20 pm

I can't remember if you're running Windows or Linux (or perhaps both), but the latest Nvidia Linux (Mint) driver is 450.

I'm running both 435 and 450 and they both work fine on GPUGrid.

Something I did notice in the latest Mint update was that there are now two different Nvidia driver packages per version:

e.g. nvidia-driver-450 and nvidia-driver-450-server

I did a quick Google but I couldn't really find anything that explained the difference between the two.

The server version was "recommended" in Mint, but knowing how Nvidia is with GTX/RTX drivers in servers/data centers, I avoided it and just stuck with the normal non-server version.

Post by Woodles » Sun Oct 11, 2020 1:22 pm

Mainly Linux with 418.56 up to 455.23, plus a few Windows hosts on 430.86 to 445.75.

Looks like I should be alright but I'll update them anyway once the sprint is over.
