Saturday, January 31, 2009

Nutch Folder Structure

Nutch data is composed of:

1. The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched and, if so, when.
2. The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and the anchor text of each link.
3. A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
   - crawl_generate names a set of URLs to be fetched
   - crawl_fetch contains the status of fetching each URL
   - content contains the content of each URL
   - parse_text contains the parsed text of each URL
   - parse_data contains outlinks and metadata parsed from each URL
   - crawl_parse contains the outlink URLs, used to update the crawldb
4. The indexes. These are Lucene-format indexes.
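
The segment layout described above can be checked with a small script. What follows is only a rough sketch in Python, not part of Nutch itself: the function names are made up for illustration, and it assumes that segments sit in a "segments" folder directly under the crawl directory. The six subdirectory names are the ones listed above.

```python
from pathlib import Path

# Subdirectory names taken from the segment description above.
SEGMENT_PARTS = (
    "crawl_generate",
    "crawl_fetch",
    "content",
    "parse_text",
    "parse_data",
    "crawl_parse",
)


def missing_segment_parts(segment_dir):
    """Return the expected subdirectories that are absent from one segment."""
    segment = Path(segment_dir)
    return [name for name in SEGMENT_PARTS if not (segment / name).is_dir()]


def check_crawl_dir(crawl_dir):
    """Map each segment name to its list of missing parts.

    An empty list means the segment has all six subdirectories.
    Assumption: segments live under <crawl_dir>/segments.
    """
    segments_root = Path(crawl_dir) / "segments"
    if not segments_root.is_dir():
        return {}
    return {
        seg.name: missing_segment_parts(seg)
        for seg in sorted(segments_root.iterdir())
        if seg.is_dir()
    }
```

Calling check_crawl_dir("crawl") returns a dictionary with one entry per segment folder. A non-empty list is not necessarily an error: a segment that has been generated but not yet fetched or parsed would be expected to lack the later subdirectories.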

Useful Links:

http://wiki.apache.org/nutch/NutchHadoopTutorial

http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
