J2ee & Web Search: February 2009

I have successfully set up Nutch 0.9 in MyEclipse on Windows.

Here are the steps required:

1) Check out Nutch 0.9 from the SVN repository

2) Create a Java Project in MyEclipse

a) Choose "Create project from existing source" (Eclipse will scan all the folders that contain Java files and make them source folders).

b) Go to the third tab, "Libraries". Click the "Add Class Folder" button and check the conf folder.

c) Go to the fourth tab, "Order and Export", and move the "conf" folder up to the first position.

3) Configure Nutch

a) Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml

b) In the conf folder, do the following:

* Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt

* Rename automaton-urlfilter.txt.template to automaton-urlfilter.txt

* In crawl-urlfilter.txt replace

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ with

+^http://([a-z0-9]*\.)*org.apache.com/

* Create a urls folder. Add a file urls.txt containing the seed URLs to crawl, one per line
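Before crawling, it can help to confirm that the seed URLs actually pass the accept rule above. The following is a minimal sketch and not part of Nutch: it assumes the rule behaves like an ordinary java.util.regex pattern once the leading "+" (accept) is stripped, and it ignores every other rule in crawl-urlfilter.txt. The class name SeedFilterCheck and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: tests URLs against one accept rule.
final class SeedFilterCheck {

    // The accept rule from crawl-urlfilter.txt, without the leading '+'.
    static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*org.apache.com/");

    // True if the rule is found in the URL; '^' anchors it to the start.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        // Sample seeds, made up for illustration.
        String[] seeds = {
            "http://org.apache.com/",
            "http://www.org.apache.com/index.html",
            "http://lucene.apache.org/nutch/"
        };
        for (String seed : seeds) {
            System.out.println((accepts(seed) ? "accept " : "reject ") + seed);
        }
    }
}
```

Note that the dots in the domain part of the rule are unescaped, so each one matches any character. Writing the domain as org\.apache\.com would make the rule stricter.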

c) Edit nutch-site.xml and add the following properties inside the <configuration> element, filling in each <value> (http.agent.name in particular must not be empty):

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
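Since the http.agent.name description says the value must not be empty, a quick check of nutch-site.xml before launching can save a debugging round. The sketch below is an assumption-laden illustration, not Nutch code: it parses the configuration text with the JDK's DOM parser and looks up one property. The class name AgentNameCheck, the method propertyValue, and the agent name "MyCrawler" are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads one property from config XML.
final class AgentNameCheck {

    // Returns the trimmed <value> of the named <property>, or null if absent.
    static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed XML (for example a </property> without its opening tag).
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample; in practice, pass the contents of conf/nutch-site.xml.
        String xml = "<configuration><property>"
                + "<name>http.agent.name</name><value>MyCrawler</value>"
                + "</property></configuration>";
        String agent = propertyValue(xml, "http.agent.name");
        System.out.println(agent == null || agent.isEmpty()
                ? "http.agent.name is missing or empty"
                : "http.agent.name = " + agent);
    }
}
```

The method takes the XML as a string so it can be tried without touching the real file. Reading conf/nutch-site.xml from disk and passing its contents in would be the obvious next step.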

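4) Create an Eclipse Launcher

a) Open the menu Run > "Run..."

b) Create a "New" configuration for "Java Application"

c) Set the Main class to

org.apache.nutch.crawl.Crawl

d) On the Arguments tab, under Program arguments, enter

urls -dir crawl -depth 3 -topN 50

Here urls is the seed folder created earlier, -dir names the output folder, -depth is the link depth to crawl from the seed pages, and -topN caps the number of pages fetched at each level.

e) Under VM arguments, enter

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

f) Click "Run". If everything works, you should see Nutch start crawling.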

Sunday, February 1, 2009

Setting Up Nutch 0.9 on MyEclipse on Windows
