MJ-12 News

Various OTHER distributed computing projects, such as Folding@home, Dimes, Majestic-12, etc
UBT - Halifax-lad
Posts: 3790
Joined: Mon Mar 13, 2006 12:00 am

MJ-12 News

Post by UBT - Halifax-lad »

From the forum over at MJ-12:
Hi all,

First of all I want to say a big thank you for your patience all this time, and specifically while I was doing contract work this September. It was very tough to put any time towards the project because I worked 6 days a week, but now that is over and it was well worth it: the hard-earned money will enable us to go much further than I thought was possible. Well, I tried hard not even to think about how to index more than 4 bln pages, but now I can confidently set a high but fairly easily achievable target of a 10 bln page index. :)

The actual plan is broken into short- and medium-term actions that I want to take, in approximately this order:

1) Get the 2nd line working and push out a new release with full Boolean logic, fix index server #2, plus the normal things that need to be done

2) Implement some overdue fixes to the node and add support for priority buckets

3) Implement selective index recrawl - the most highly ranked URLs will be recrawled hourly/daily/weekly to keep the index fresh - #2 is required for that (a rough sketch of the idea follows below)
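Neither the bucket scheme nor the recrawl intervals have been published, so what follows is only a sketch of how priority buckets and rank-based recrawl scheduling might fit together - the bucket names, rank thresholds and intervals are all made up for illustration:

[code]
import heapq
import time

# Assumed recrawl intervals per priority bucket (seconds); the real
# MJ-12 thresholds and intervals are not public, these are illustrative.
BUCKETS = {
    "hourly": 60 * 60,
    "daily": 24 * 60 * 60,
    "weekly": 7 * 24 * 60 * 60,
}

def bucket_for(rank):
    """Map a URL's rank score to a priority bucket (thresholds invented)."""
    if rank >= 0.9:
        return "hourly"
    if rank >= 0.5:
        return "daily"
    return "weekly"

class RecrawlQueue:
    """Min-heap keyed on next-due time: pop whatever is due for recrawl."""

    def __init__(self):
        self._heap = []

    def schedule(self, url, rank, now=None):
        now = time.time() if now is None else now
        due = now + BUCKETS[bucket_for(rank)]
        heapq.heappush(self._heap, (due, url, rank))

    def due_urls(self, now=None):
        """Yield URLs whose recrawl time has arrived, rescheduling each."""
        now = time.time() if now is None else now
        while self._heap and self._heap[0][0] <= now:
            _, url, rank = heapq.heappop(self._heap)
            yield url
            self.schedule(url, rank, now)
[/code]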

Real deal

4) Implement the new indexing format

For some time I resisted indexing more data just for the sake of a high number, because I wanted to learn the limitations of the index structure we currently have, particularly when it comes to multi-word searching: exactly the problem area that we have now. I now know what needs to be done to make the search engine faster and more relevant; the cost is increased disk space usage, but thankfully there is money for nice big 750 GB disks, plus more servers.
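The post doesn't say what the new format actually looks like, but the trade-off it describes - faster, more relevant multi-word search at the cost of more disk - is the classic result of storing word positions in the postings, so that phrase and proximity matches can be checked straight from the index. A toy illustration of that idea (not the actual MJ-12 format):

[code]
from collections import defaultdict

# Toy positional inverted index: term -> {doc_id: [positions]}.
# Storing positions is what inflates disk usage, but it lets
# multi-word (phrase) queries be answered from the index alone.
index = defaultdict(dict)

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term].setdefault(doc_id, []).append(pos)

def phrase_search(*terms):
    """Return doc ids where the terms appear consecutively, in order."""
    if not terms or any(t not in index for t in terms):
        return set()
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])           # every term must be present
    hits = set()
    for doc in docs:
        for p in index[terms[0]][doc]:  # try each start position
            if all(p + i in index[terms[i]][doc] for i in range(1, len(terms))):
                hits.add(doc)
                break
    return hits

add_document(1, "distributed search engine index")
add_document(2, "search the distributed index")
print(phrase_search("distributed", "search"))  # {1}
[/code]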

5) Make the merging process support the new indexing format; more importantly, I need to make the distributed processing library work across multiple machines to get more speed out of it. Those who were here in Feb-Mar will remember the pain we went through while merging a mere 600 mln pages: it was hell on Earth due to bugs, and generally it was just too slow - parallelising merging across multiple machines is the complete solution to the problem.
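The merge internals aren't public either, but the usual way to parallelise an index merge across machines is to partition the term space and let each worker k-way merge its own range independently. A single-host sketch of that idea, with the worker count and term ranges purely illustrative:

[code]
import heapq
from multiprocessing import Pool

def merge_runs(runs):
    """K-way merge of sorted (term, doc_id) runs - the core of index merging."""
    return list(heapq.merge(*runs))

def partition(runs, lo, hi):
    """Keep only postings whose term falls in [lo, hi) - one worker's share."""
    return [[p for p in run if lo <= p[0] < hi] for run in runs]

if __name__ == "__main__":
    # Two sorted runs, as produced by separate indexing passes.
    runs = [
        [("apple", 1), ("cat", 3), ("zebra", 2)],
        [("apple", 4), ("dog", 1), ("zebra", 5)],
    ]
    # Split the term space; in a real setup each range goes to its own machine.
    ranges = [("a", "m"), ("m", "{")]  # "{" sorts just after "z"
    with Pool(2) as pool:
        parts = pool.starmap(
            merge_runs, [(partition(runs, lo, hi),) for lo, hi in ranges])
    merged = [posting for part in parts for posting in part]
    print(merged)
[/code]

Because each term range merges with no coordination beyond the initial split, adding machines should speed the merge up almost linearly, which is presumably why spreading it across multiple machines helps so much here.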

6) Add simple support for the new index structures in the search engine - nothing too powerful, just to have checks in place that it all works; this will lay the groundwork for SE improvements after mass indexing is done

7) Buy lots of hardware for mass indexing (billions of pages) and merging here :)

Time-wise, I think we will start mass indexing in November - it will certainly be painful initially, but this time there will be a lot more hardware available, so things will be much easier in this respect. I remember last year it was hell to merge a mere 45 mln pages, but this time we will be merging chunks of at least 1 bln pages, probably 1.5 bln 8)