Architecture
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users' search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments described below in order to produce page summaries and to provide access to cached pages.)
The main practical spin-off from this design is that the crawler and searcher systems can be scaled independently on separate hardware platforms. For instance, a highly trafficked search page that provides searching for a relatively modest set of sites may only need a correspondingly modest investment in the crawler infrastructure, while requiring more substantial resources for supporting the searcher.
We will look at the Nutch crawler here, and leave discussion of the searcher to part two.
The Crawler
The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structure: the web database, a set of segments, and the index. We describe each of these in more detail next.
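To make this concrete before diving into the data structures, a single intranet-scale crawl is started with one command along the following lines. The seed file name, output directory, and depth are illustrative, and the exact flags vary between Nutch releases:

# Crawl the seed URLs listed in the file "urls", storing the WebDB, segments,
# and index under "crawl-dir", and repeating the crawl cycle to a depth of 3.
bin/nutch crawl urls -dir crawl-dir -depth 3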
The web database, or WebDB, is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links. A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including the number of links in the page (also called outlinks); fetch information (such as when the page is due to be refetched); and the page's score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages). A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.
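You can get a feel for what the WebDB holds by examining it with the readdb tool. The commands below assume the database lives at crawl-dir/db, as created by the crawl tool; the option names may differ slightly between Nutch versions:

# Print summary statistics: the number of pages and links in the WebDB
bin/nutch readdb crawl-dir/db -stats

# Dump every page entry (URL, MD5 hash, score, next fetch time, and so on)
bin/nutch readdb crawl-dir/db -dumppageurl

# Dump every link entry as source-to-target URL pairs
bin/nutch readdb crawl-dir/db -dumplinks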
A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it's easy to tell how old they are.
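Because of this naming scheme, a plain directory listing is enough to see how old each segment is. For example, assuming a crawl directory called crawl-dir:

# List the segments created so far; each directory is named by its
# creation timestamp (for example, 20060522121130), so obsolete
# segments are easy to spot and delete
ls crawl-dir/segments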
The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes. Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.
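A quick way to keep the two meanings apart is to look at the directories on disk. The Nutch segments sit under the segments directory, while Lucene's internal segment files live inside the merged index directory. The layout below assumes the directories produced by the crawl tool; the exact Lucene file names depend on the Lucene version:

# Nutch segments: one timestamped directory per fetch run
ls crawl-dir/segments

# Lucene index: a single directory of index files, including Lucene's own
# internal "segments" bookkeeping file, which is unrelated to Nutch segments
ls crawl-dir/index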
The crawl tool
Now that we have some terminology, it is worth trying to understand the crawl tool, since it does a lot behind the scenes. Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB, a set of fetchers downloads the content from the Web, the crawler updates the WebDB with new links that were found, and then the crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle) and the cycle repeats. This cycle is often referred to as the generate/fetch/update cycle, and runs periodically as long as you want to keep your search index up to date.
URLs with the same host are always assigned to the same fetchlist. This is done for reasons of politeness, so that a web site is not overloaded with requests from multiple fetchers in rapid succession. Nutch observes the Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.
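As a concrete example of the Robots Exclusion Protocol, a site owner who wants to keep crawlers out of a /private/ area would serve a file like the one sketched below from the root of the site; the contents are purely illustrative:

# An illustrative robots.txt, served at http://example.com/robots.txt:
# a compliant crawler such as Nutch will not fetch anything under /private/
cat <<'EOF' > robots.txt
User-agent: *
Disallow: /private/
EOF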
The crawl tool is actually a front end to other, lower-level tools, so it is possible to get the same results by running the lower-level tools in a particular sequence. Here is a breakdown of what crawl does, with the lower-level tool names in parentheses:
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle has finished, the crawler goes on to create an index from all of the segments (steps 7-10). Each segment is indexed independently (step 8), before duplicate pages (that is, pages at different URLs with the same content) are removed (step 9). Finally, the individual indexes are combined into a single index (step 10).
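Expressed as commands, a single pass through these steps looks roughly like the sketch below. The directory names are illustrative and the exact arguments differ between Nutch releases, so check each tool's usage message before relying on the details:

# 1. Create a new, empty WebDB under db/
bin/nutch admin db -create

# 2. Inject the seed URLs listed in the file "urls" into the WebDB
bin/nutch inject db -urlfile urls

# 3-5. One generate/fetch/update round; repeat to the required depth (step 6)
bin/nutch generate db segments         # write a fetchlist into a new segment
s=`ls -d segments/2* | tail -1`        # pick up the segment just created
bin/nutch fetch $s                     # fetch the pages on its fetchlist
bin/nutch updatedb db $s               # add newly found links to the WebDB

# 7-10. Turn the fetched segments into a single searchable index
bin/nutch updatesegs db segments       # push scores and links into the segments
bin/nutch index $s                     # index a segment (run once per segment)
bin/nutch dedup segments dedup.tmp     # remove duplicate content and URLs
bin/nutch merge index segments/*       # merge the segment indexes into one index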
The dedup tool removes duplicate URLs from the segment indexes. The problem it solves is not duplication of URLs within the WebDB; that cannot happen, since the WebDB does not allow duplicate URL entries. Rather, duplicates arise when a URL is re-fetched while the segment from an earlier fetch still exists (because it hasn't been deleted). This situation can't occur during a single run of the crawl tool, but it can during re-crawls, which is why dedup removes duplicate URLs as well as duplicate content.
While the crawl tool is a great way to get started with crawling websites, you will need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl. We shall see how to do this in the real-world example later, in part two of this series. Also, crawl is really aimed at intranet-scale crawling. To do a whole web crawl, you should start with the lower-level tools. (See the "Resources" section for more information.)
Configuration and Customization
All of Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. The main configuration file is conf/nutch-default.xml. As the name suggests, it holds the default settings and should not be modified. To change a setting, create conf/nutch-site.xml and add your site-specific overrides there; values in the site file take precedence over the defaults.
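For example, suppose you want to set the name the crawler reports to the web servers it visits. A minimal sketch of a site file, assuming the http.agent.name property and the XML skeleton used in conf/nutch-default.xml (the root element name has changed across Nutch versions, so copy the skeleton from your own default file):

# Create a minimal conf/nutch-site.xml that overrides a single property.
# The <configuration> root element matches recent releases; older ones
# used a different root, so mirror whatever conf/nutch-default.xml uses.
cat <<'EOF' > conf/nutch-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
EOF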
Nutch defines various extension points, which allow developers to customize Nutch's behavior by writing plugins, found in the plugins subdirectory. Nutch's parsing and indexing functionality is implemented almost entirely by plugins--it is not in the core code. For instance, the code for parsing HTML is provided by the HTML document parsing plugin, parse-html. You can control which plugins are available to Nutch with the plugin.includes and plugin.excludes properties in the main configuration file.
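A quick way to see which plugins are switched on by default is to look up the plugin.includes property in the default configuration file; its value is a regular expression matched against plugin ids such as parse-html. To change the set, copy the whole property element into conf/nutch-site.xml and edit the value there. For example:

# Show the plugin.includes property and a few lines of surrounding context
grep -A 5 'plugin.includes' conf/nutch-default.xml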