Sunday, February 1, 2009

Setting Up Nutch 0.9 on MyEclipse on Windows

I have successfully set up Nutch 0.9 in MyEclipse on Windows.

Here are the steps required:

1) Download Nutch 0.9 from the SVN repository

2) Create a Java Project in MyEclipse
a) Choose "Create Project from existing source" (Eclipse will scan all the folders that contain Java files and make them source folders)
b) Go to the third tab, "Libraries". Click the "Add Class Folder" button and check the conf folder.
c) Go to the fourth tab, "Order and Export", and move the "conf" folder up to the first position.

3) Configure Nutch 
a) Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
b) In the conf folder, do the following:
* Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt
* Rename automaton-urlfilter.txt.template to automaton-urlfilter.txt
* In crawl-urlfilter.txt, replace
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  with the pattern for the domain you want to crawl, for example
  +^http://([a-z0-9]*\.)*apache.org/
* Create a urls folder. Add a file urls.txt containing the seed URLs to crawl
c) Edit nutch-site.xml and add the following properties, filling in each empty <value> with your own details:

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
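
Since the description for http.agent.name says it must not be empty, it can be handy to check the file before launching a crawl. Below is a minimal sketch, not part of Nutch itself, that reads a nutch-site.xml style document with the JDK's XML parser and returns the value of a given property. The class name AgentNameCheck, the method propertyValue, and the sample agent name "MyCrawler" are all my own placeholders for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Nutch class.
class AgentNameCheck {

    // Returns the trimmed <value> of the named <property>, or "" if it is absent or empty.
    static String propertyValue(InputStream xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        // "MyCrawler" is a placeholder agent name, not a value from this post.
        String sample = "<configuration><property><name>http.agent.name</name>"
                + "<value>MyCrawler</value></property></configuration>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        String value = propertyValue(in, "http.agent.name");
        System.out.println(value.isEmpty()
                ? "http.agent.name is empty, please set it"
                : "http.agent.name = " + value);
    }
}
```

To check the real file, pass a java.io.FileInputStream for conf/nutch-site.xml instead of the in-memory sample.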

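The crawl-urlfilter.txt rule from step 3b is a regular expression with a leading "+" that marks matching URLs as included. To see which URLs such a rule lets through, here is a small stand-alone sketch using java.util.regex. It models only that single "+" rule and is a simplified stand-in, not Nutch's own filter class. The domain example.org and the names UrlFilterSketch and accepts are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a single include rule; not Nutch's filter implementation.
class UrlFilterSketch {

    // example.org is a placeholder; substitute the domain you put in crawl-urlfilter.txt.
    static final String RULE = "+^http://([a-z0-9]*\\.)*example.org/";

    // Returns true when the rule starts with '+' and its regex is found in the URL.
    static boolean accepts(String rule, String url) {
        boolean include = rule.charAt(0) == '+';
        Pattern pattern = Pattern.compile(rule.substring(1));
        return include && pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.org/",
            "http://lucene.example.org/nutch/",
            "http://example.com/",
            "https://example.org/"
        };
        for (String url : samples) {
            System.out.println(accepts(RULE, url) + "  " + url);
        }
    }
}
```

Note that the pattern only covers http:// URLs, so an https:// URL on the same domain is not accepted by this rule, and subdomains are accepted through the ([a-z0-9]*\.)* part.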

4) Create Eclipse Launcher

  • Menu Run > "Run..."

  • Create a "New" configuration for "Java Application"

  • Set the Main class to:

    org.apache.nutch.crawl.Crawl

  • On the Arguments tab, under Program arguments, enter:

    urls -dir crawl -depth 3 -topN 50

  • Under VM arguments, enter:

    -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

  • Click "Run"

  • If everything works, you should see Nutch start crawling
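
The first program argument, urls, names the seed folder created in step 3. If you would rather generate the seed file than create it by hand, the sketch below writes one URL per line to urls/urls.txt under a base folder. I am assuming the launcher runs with the project root as its working directory, which is the Eclipse default for a Java application. The class SeedFileSketch, the method writeSeeds, and the seed http://example.org/ are placeholders of mine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class SeedFileSketch {

    // Writes one seed URL per line to <baseDir>/urls/urls.txt and returns the file path.
    // Any existing urls.txt in that folder is overwritten.
    static Path writeSeeds(Path baseDir, List<String> seedUrls) throws IOException {
        Path urlsDir = Files.createDirectories(baseDir.resolve("urls"));
        return Files.write(urlsDir.resolve("urls.txt"), seedUrls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder seed; replace with the sites you actually want to crawl.
        Path file = writeSeeds(Path.of("."), List.of("http://example.org/"));
        System.out.println("Wrote " + file.toAbsolutePath());
    }
}
```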


1 comment:

  1. I have a question: how are the final results of Nutch stored? I mean, in what format is the information from the analyzed links stored?

    I understand that Nutch needs the content as plain text in order to parse it, but in what format is it finally stored?
