Here are the steps required:
1) Download Nutch 0.9 from the SVN repository
2) Create a Java Project in MyEclipse
a) Choose "Create Project from existing source" (Eclipse will scan all the folders that contain Java files and make them source folders)
b) Go to the third tab, "Libraries". Click the "Add Class Folder" button and check the conf folder.
c) Go to the fourth tab, "Order and Export", and move the "conf" folder up to the first position
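A quick way to confirm that the conf folder really ended up on the classpath is to look up one of its files as a classpath resource. The following is only a minimal sketch: the class name ConfOnClasspathCheck is hypothetical, and it assumes nutch-default.xml and nutch-site.xml sit directly inside conf.

```java
public class ConfOnClasspathCheck {
    // Returns true if the named file can be found on the classpath.
    public static boolean onClasspath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ConfOnClasspathCheck.class.getClassLoader();
        }
        return loader.getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        // File names assumed to live directly in the conf folder.
        String[] names = {"nutch-default.xml", "nutch-site.xml"};
        for (String name : names) {
            String status = onClasspath(name)
                    ? " found"
                    : " NOT found; check the conf class folder";
            System.out.println(name + status);
        }
    }
}
```

Run it as a Java application from the same Eclipse project. If either file is reported as not found, revisit the "Libraries" and "Order and Export" tabs.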
3) Configure Nutch
a) Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
b) In the conf folder, do the following:
* Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt
* Rename automaton-urlfilter.txt.template to automaton-urlfilter.txt
* In crawl-urlfilter.txt replace
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ with
+^http://([a-z0-9]*\.)*org.apache.com/
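* Create a urls folder and add a file urls.txt listing the seed URLs to crawl. The launcher in step 4 passes "urls" as a relative path, so the folder must be reachable from the run's working directory (in Eclipse this defaults to the project root)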
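To see which URLs the edited rule lets through, the regular expression can be tried out on its own. This is a minimal sketch, not Nutch's actual filter code: the class name UrlFilterSketch and the sample URLs are hypothetical, and it assumes the rule is applied by searching the URL for the pattern after stripping the leading '+', which marks an accept rule.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // The rule from crawl-urlfilter.txt without its leading '+'.
    static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*org.apache.com/");

    // Returns true if the URL matches the accept rule.
    public static boolean accepted(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, for illustration only.
        String[] samples = {
            "http://org.apache.com/",
            "http://www.org.apache.com/docs/index.html",
            "http://example.com/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + accepted(url));
        }
    }
}
```

Note that the dots in org.apache.com are unescaped in the rule, so each of them matches any character. Writing them as \. makes the rule stricter.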
c) Edit nutch-site.xml and add the following properties, filling in each empty <value> (http.agent.name in particular must not be left empty)
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
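Since an empty http.agent.name is easy to overlook, a small check of the edited file can help. The sketch below is an illustration under assumptions, not part of Nutch: the class name AgentNameCheck and the sample value "mycrawler" are hypothetical, and it assumes the properties are wrapped in a <configuration> root element. It uses only the XML parser that ships with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AgentNameCheck {
    // Returns the trimmed <value> of the named property,
    // "" if the value is empty, or null if the property is absent.
    public static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.getLength() > 0
                            ? values.item(0).getTextContent().trim()
                            : "";
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical configuration content; "mycrawler" is a placeholder.
        String xml = "<configuration><property><name>http.agent.name</name>"
                + "<value>mycrawler</value></property></configuration>";
        String value = propertyValue(xml, "http.agent.name");
        System.out.println(value == null || value.isEmpty()
                ? "http.agent.name is not set"
                : "http.agent.name = " + value);
    }
}
```

To check the real file, read conf/nutch-site.xml into a string and pass it to propertyValue in place of the inline sample.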
4) Create an Eclipse launcher
Open the menu Run > "Run..."
Create a "New" configuration for "Java Application"
Set the Main class to
org.apache.nutch.crawl.Crawl
On the Arguments tab, set Program arguments to
urls -dir crawl -depth 3 -topN 50
and VM arguments to
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Click "Run"
If everything works, you should see Nutch start crawling
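For reference, the launcher settings above correspond roughly to the following small Java program. This is a sketch under assumptions rather than an official entry point: the class name CrawlLauncherSketch is hypothetical, and setting the system properties in code has the same effect as the -D VM arguments only if nothing initializes logging before they are set. The Crawl class is looked up reflectively so the sketch compiles without Nutch on the build path.

```java
import java.lang.reflect.Method;

public class CrawlLauncherSketch {
    // Program arguments exactly as entered in the launcher.
    public static String[] crawlArgs() {
        return "urls -dir crawl -depth 3 -topN 50".split(" ");
    }

    public static void main(String[] args) throws Exception {
        // Counterpart of the -D VM arguments from the launcher.
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");

        // Throws ClassNotFoundException at run time if Nutch
        // is not on the classpath.
        Method crawlMain = Class.forName("org.apache.nutch.crawl.Crawl")
                .getMethod("main", String[].class);
        crawlMain.invoke(null, (Object) crawlArgs());
    }
}
```

In the arguments, urls is the seed folder, -dir names the output directory, -depth limits how many link levels are followed from the seeds, and -topN caps the number of pages fetched at each level.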