Introduction to Nutch & Hadoop
After Lucene, its author created another powerful tool: Nutch. Nutch is a crawler built on top of Lucene. With Nutch, you can launch a multi-threaded crawler to gather information from the Web. At the time of writing, Nutch is at version 0.9. Nutch comes with a list of cool features, including whole-Web crawling and local file crawling for the intranet, indexing all the while.
Hadoop was designed to handle the petabytes of data that Nutch could potentially store and process. In fact, Hadoop has its own file system: the Hadoop Distributed File System (HDFS), which can run on any old run-of-the-mill, low-cost hardware.
Hadoop works by storing part of the file system’s data on each of the servers in the cluster. As new queries come in, HDFS follows the "moving computation is cheaper than moving data" rule: running the query's processing as close as possible to the data is faster than placing the query at random within the cluster and moving data long distances across the network.
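To make the "moving computation is cheaper than moving data" idea concrete, here is a toy sketch in Python. It is not Hadoop's real scheduler or API: the function names (place_blocks, schedule), the round-robin placement, and the node names are all made up for illustration. It only shows the preference for running a task on a server that already holds the data.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: spread each block's replicas round-robin over the nodes.

    Hypothetical helper, not how HDFS actually chooses replica locations.
    """
    replication = min(replication, len(nodes))
    locations = {}
    for block in range(num_blocks):
        locations[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return locations


def schedule(block, locations, free_slots):
    """Pick a node to process `block`; returns (node, is_local).

    Prefers a node that already stores the block (computation moves to
    the data). Only if none of those has a free slot does it fall back
    to any free node, in which case the data must cross the network.
    """
    for node in locations[block]:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, False
    raise RuntimeError("no free slots in the cluster")


# Usage: four assumed nodes, four blocks, two replicas per block.
nodes = ["node1", "node2", "node3", "node4"]
locations = place_blocks(4, nodes, replication=2)
free_slots = {node: 1 for node in nodes}
assignments = [schedule(block, locations, free_slots) for block in range(4)]
```

In this small run every block finds a free server that already holds one of its replicas, so no data has to move. When all replica holders are busy, schedule falls back to a remote node and flags the task as non-local.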
I have searched around to see if anyone could give me some tips on this tool. Surprisingly, I didn’t find much. But don’t worry, I have found a few resources that can at least get you started playing with it.
Set up Nutch
Here is the guide written by Peter Wang that I followed to get my Nutch up and running. Follow it and get your Nutch running before going further. By the way, if you want to run Nutch with Solr, this is a good tutorial.
Nutch Architectural Review