Here are the steps required:
1) Download Nutch 0.9 from the SVN repository
2) Create a Java Project in MyEclipse
a) Choose "Create Project from existing source" (Eclipse will scan all the folders that contain Java files and make them source folders)
b) Go to the third tab, "Libraries". Click the "Add Class Folder" button and check the conf folder.
c) Go to the fourth tab, "Order and Export", and move the "conf" folder up to the first position
3) Configure Nutch
a) Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
b) In the conf folder, do the following:
* Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt
* Rename automaton-urlfilter.txt.template to automaton-urlfilter.txt
* In crawl-urlfilter.txt replace
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ with
+^http://([a-z0-9]*\.)*org.apache.com/
* Create a urls folder. Add a file urls.txt with the seed URLs to crawl, one per line
c) Edit nutch-site.xml and add the following properties inside the <configuration> element, filling in each empty <value>
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
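Every <value> above is empty, and the description says http.agent.name must not be, so it can help to check the file before launching. The following is a minimal sketch, not part of Nutch: the class AgentNameCheck and its methods are hypothetical names, the default path conf/nutch-site.xml is an assumption about where the project root is, and it assumes the file uses the <property>/<name>/<value> layout shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Nutch class): reports which agent properties are still empty.
class AgentNameCheck {

    // Returns true when the named property exists and its <value> is non-blank.
    static boolean isSet(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.getLength() > 0
                            && !values.item(0).getTextContent().trim().isEmpty();
                }
            }
            return false; // property not present at all
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed location; pass another path as the first argument if needed.
        Path site = Paths.get(args.length > 0 ? args[0] : "conf/nutch-site.xml");
        if (!Files.exists(site)) {
            System.out.println("Not found: " + site);
            return;
        }
        String xml = new String(Files.readAllBytes(site), StandardCharsets.UTF_8);
        String[] names = {
            "http.agent.name", "http.agent.description",
            "http.agent.url", "http.agent.email"
        };
        for (String name : names) {
            System.out.println(name + (isSet(xml, name) ? " is set" : " is EMPTY or missing"));
        }
    }
}
```

Run it from the project root. Any property reported as empty or missing should be filled in before moving on to step 4.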
4) Create an Eclipse launcher
Menu Run > "Run..."
Create "New" for "Java Application"
Set the Main class to
org.apache.nutch.crawl.Crawl
On the Arguments tab, set Program arguments to
urls -dir crawl -depth 3 -topN 50
and VM arguments to
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Click "Run"
If all works, you should see Nutch start crawling
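If the crawl starts but fetches nothing, one thing worth checking is whether the seed URLs pass the filter rule from step 3b. The following is a minimal sketch under stated assumptions: it assumes the filter applies the pattern after the leading '+' with find-style matching, UrlFilterSketch and accepts are hypothetical names rather than Nutch classes, and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class): tries the accept rule from step 3b on sample URLs.
class UrlFilterSketch {

    // The rule from crawl-urlfilter.txt with the leading '+' (accept marker) removed.
    static final Pattern RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)*org.apache.com/");

    // True when the rule matches; the '^' anchors the match to the start of the URL.
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up sample URLs; replace them with the lines from urls.txt.
        String[] samples = {
            "http://org.apache.com/",
            "http://www.org.apache.com/index.html",
            "http://example.com/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + (accepts(url) ? "accepted" : "not matched"));
        }
    }
}
```

The dots in org.apache.com are unescaped, so each one matches any character. Writing the domain as org\.apache\.com in crawl-urlfilter.txt makes the rule stricter.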
I have a question: how are the final results of Nutch stored? I mean, in which format is the information from the analyzed links stored?
I understand that Nutch needs the information to be plain text in order to parse it, but in which format is it finally stored?