Massive SETI server failure - Now fixed

UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

Happy May Day. Unfortunately for us it's been "Mayday! Mayday!" At 4:43 (PDT) this morning, our science database machine, thumper, became hasenfeffer. It currently refuses to acknowledge that it has any disk drives. Since the controllers are attached to the motherboard, major repairs will probably be required. No work can be created until this machine gets fixed. We are on the phone with Sun now in hopes of securing repair or a replacement.
So 319,395 WUs to go
Last edited by UBT - Halifax-lad on Mon May 14, 2007 5:56 pm, edited 3 times in total.
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

More news, and the work is going fast. Note that the developers say we could be in for a long dry period:
This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes.

But when I came in I found Eric dissecting our master database server, thumper. That's never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there's something dead between the power supply and the disk controllers so the drives don't even spin up. Booting from a DVD and an "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support.

Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we're gonna do. We're pretty confident the data is intact, and as long as some server somewhere can mount the 24 SATA drives that make up the database, the SETI@home science data will be perfectly intact. Failing that, we can recover from tape, but unfortunately we're at a bad point in the backup cycle, so the most recent tape is a week old.

Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage. We placed many phone calls and asked for favors but so far no sure immediate solution presented itself. We have some leads, but does anybody have a 64-bit multiprocessor 8+GB system with 24 SATA drive bays they can loan us within the next 24-48 hours?

- Matt
Darren
Posts: 752
Joined: Mon Mar 13, 2006 12:00 am

Post by Darren »

Well, I say we should all grab what we can so there are fewer left for the teams that specialise in SETI! ;)
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

Darren wrote: Well, I say we should all grab what we can so there are fewer left for the teams that specialise in SETI! ;)
If you can get past the queue of everyone else trying to get work too. The project has just come out of a scheduled outage, so there's lots of backlog.
Darren
Posts: 752
Joined: Mon Mar 13, 2006 12:00 am

Post by Darren »

Just bagged 41 of the little beauties.
UBT - The Prof....
Posts: 1630
Joined: Mon Nov 06, 2006 12:00 am

Post by UBT - The Prof.... »

Attempting smash and grab but no luck as yet :(
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

UBT - The Prof.... wrote: Attempting smash and grab but no luck as yet :(
I gave in; I'm just going to carry on with other projects. Looks like Temujin is gonna have to find a new temp project.
UBT - The Prof....
Posts: 1630
Joined: Mon Nov 06, 2006 12:00 am

Post by UBT - The Prof.... »

Grab OK, but only got one WU :(
Still, beggars can't be choosers. It's better than nothing 8)
Temujin
Posts: 2259
Joined: Mon Mar 13, 2006 12:00 am

Post by Temujin »

UBT - Halifax-lad wrote:
UBT - The Prof.... wrote: Attempting smash and grab but no luck as yet :(
I gave in; I'm just going to carry on with other projects. Looks like Temujin is gonna have to find a new temp project.
I run a 5 day cache, so I'm ok for a while yet :D
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

Astropulse is kicking out work too; it's just swamped with requests as well, seeing as it's on the same servers.

Word of warning: Astropulse runs slowly on Windows and may miss deadlines, but Eric has said manual credits would be granted for any work that misses the deadline but completes afterwards 8)

There's still plenty of enhanced work on this project too, which is just the same as on the normal SETI project.
Darren
Posts: 752
Joined: Mon Mar 13, 2006 12:00 am

Post by Darren »

And the well has run dry....
UBT - Timbo
UBT Forum Admin
Posts: 9680
Joined: Mon Mar 13, 2006 12:00 am
Location: NW Midlands

Post by UBT - Timbo »

Darren wrote:And the well has run dry....
Hi all,

Clearly, the SETI server problem is still ongoing and doesn't look like getting better any time soon, unless you have a Sun X4500 server lying around!!

These things normally cost $48,000 each (but are on special offer now for just $24,000 - see here: http://www.sun.com/servers/x64/x4500/).

In the meantime, may I be so bold as to suggest you consider crunching for one of the other projects?

See this link for the "Attach to project" URLs of various other worthy projects - http://forum.ukboincteam.com/viewtopic.php?t=1700

There's plenty to choose from and most have work available...

regards

Tim
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

Great news! Sun Microsystems is coming to the rescue and will be replacing our inoperative science database server. They are preparing the machine now and will be rushing it to us on Monday. Once we have the machine up and the database recovered, we can start sending work out again. Details on the server crash and our recovery from it can be found in Technical News.
UBT - Timbo
UBT Forum Admin
Posts: 9680
Joined: Mon Mar 13, 2006 12:00 am
Location: NW Midlands

Post by UBT - Timbo »

UBT - Halifax-lad wrote:
Great news! Sun Microsystems is coming to the rescue and will be replacing our inoperative science database server. They are preparing the machine now and will be rushing it to us on Monday. Once we have the machine up and the database recovered, we can start sending work out again. Details on the server crash and our recovery from it can be found in Technical News.
Congrats to "Sun" then for becoming the "hero" in this.

Think that's gotta be worth a link to them:

http://www.sun.com  :wink:

regards,

Tim
Temujin
Posts: 2259
Joined: Mon Mar 13, 2006 12:00 am

Post by Temujin »

Yep, well done Sun

Most of my machines have now run out of Seti WUs and have been moved over to either Einstein or WEP with a smattering of TMRL & Riesel.

As they reckon they won't get the new server till late Monday, I can't see us getting any new work till Tuesday at the earliest, and maybe even Wednesday.
Oh well, at least it gives the other projects a look in :D
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

Update: Our science database server died on Tuesday, May 1st. We haven't been able to create new workunits since then (though we are still accepting completed results). Sun is graciously replacing this server. The bad news is, despite earlier claims, it won't be here until Friday the 11th, which means the earliest we'll be creating new work is Monday the 14th. Thank you for your continued patience! Updates, discussion, and more information about this and other server-related topics can be found in Technical News
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

More or less fixed now:
Update: We got the new server yesterday, inserted our old disks and booted it up. It came right up, but verifying the file systems took overnight. The work is being created, the splitters and assimilators are working. It will be a while before we catch up. Thank you for your continued patience and support.
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

Post by UBT - Halifax-lad »

Plenty of work now for the alien hunters
Temujin
Posts: 2259
Joined: Mon Mar 13, 2006 12:00 am

Post by Temujin »

What with previous reports of SETI server recovery being greatly exaggerated, it looks like opti apps can also download WUs now.
Things "might" be getting back to normal, touch wood, fingers crossed.