Sunday, June 8, 2014

How to configure Nutch in Eclipse for SOLR

Check out and build Nutch:
1.    Get the latest source code from SVN using the terminal.
For Nutch 1.x (i.e. trunk), run this:
svn co https://svn.apache.org/repos/asf/nutch/trunk
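
For Nutch 2.x, check out the 2.x branch instead (branch location at the time of writing; verify on the Nutch site if it has moved):
svn co https://svn.apache.org/repos/asf/nutch/branches/2.x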


2.    Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”.
To do this, rename the conf/nutch-site.xml.template file to nutch-site.xml and make the changes there.
See conf/nutch-default.xml for a description of these properties.
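
For example, a minimal nutch-site.xml might contain (the agent name "MyCrawler" is only a placeholder; pick your own):

<property>
   <name>http.agent.name</name>
   <value>MyCrawler</value>
</property>
<property>
   <name>http.robots.agents</name>
   <value>MyCrawler,*</value>
</property>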

3.    Also add a “plugin.folders” property and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. For example, if Nutch is checked out at "/home/Desktop/2.x",
set the property to:


<property>
   <name>plugin.folders</name>
   <value>/home/Desktop/2.x/build/plugins</value>
</property>

There is no build/plugins folder yet; it is created under {PATH_TO_NUTCH_CHECKOUT} when you run the "ant eclipse" command in step 4.
That is why the property must be set to the absolute path {PATH_TO_NUTCH_CHECKOUT}/build/plugins.
Do not use a relative path here, as it won't work.

4.    Run this command:
ant eclipse



5.    Load project in Eclipse
5.1.    In Eclipse, click on “File” -> “Import...”

5.2.    Select “Existing Projects into Workspace”

5.3.    In the next window, set the root directory to the location where you checked out Nutch 2.x (or trunk). Click “Finish”.

5.4.    You will now see a new project named 2.x (or trunk) added to the workspace. Wait a moment while Eclipse refreshes its SVN cache and builds the workspace. You can see the status in the bottom right corner of Eclipse.

5.5.    In Package Explorer, right click on the project “2.x” (or trunk), select “Build Path” -> “Configure Build Path”

5.6.    In the “Order and Export” tab, scroll down and select “2.x/conf” (or trunk/conf). Click the “Top” button. Sadly, Eclipse will rebuild the workspace again, but this time it won’t take as long.


6.    Download the following jar file:
http://mvnrepository.com/artifact/org.elasticsearch/elasticsearch/0.90.1
Add the jar to the project's build path in Eclipse.
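
If you prefer the command line, the jar can be fetched directly from Maven Central (standard repository layout for the artifact above):

wget https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch/0.90.1/elasticsearch-0.90.1.jar

One way to add it in Eclipse is “Build Path” -> “Configure Build Path” -> “Libraries” -> “Add External JARs...”.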

7.    You may get one compile error for “ElasticsearchException”. The 0.90.1 jar still uses the older class name, so change it to “ElasticSearchException” (capital S), i.e. the import becomes org.elasticsearch.ElasticSearchException.


8.    Now you are ready to run the Nutch code in Eclipse:
8.1.    Let's start with the inject operation.

8.2.    Right click on the project in “Package Explorer” -> select “Run As” -> select “Run Configurations”.

8.3.    Create a new configuration and name it "inject".
For 1.x (i.e. trunk), set the main class to: org.apache.nutch.crawl.Injector
For 2.x, set the main class to: org.apache.nutch.crawl.InjectorJob

8.4.    In the “Arguments” tab, under program arguments, provide the path of the directory that holds your seed URLs (see the example after these steps).

8.5.    Set VM Arguments to “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log”

8.6.    Click "Apply" and then click "Run".

8.7.    If everything is set up correctly, you should see the inject operation progressing in the console.
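
For reference, a typical set of program arguments for the inject configuration (the directory names "crawl/crawldb" and "urls" are placeholders, not required names):

For 1.x, Injector expects the CrawlDb path followed by the seed directory:
crawl/crawldb urls

For 2.x, InjectorJob writes into the configured Gora storage backend, so only the seed directory is passed:
urls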



Classes in Nutch 1.x (i.e. trunk)
inject :- org.apache.nutch.crawl.Injector
generate :- org.apache.nutch.crawl.Generator
fetch :- org.apache.nutch.fetcher.Fetcher
parse :- org.apache.nutch.parse.ParseSegment
updatedb :- org.apache.nutch.crawl.CrawlDb


Classes in Nutch 2.x
inject :- org.apache.nutch.crawl.InjectorJob
generate :- org.apache.nutch.crawl.GeneratorJob
fetch :- org.apache.nutch.fetcher.FetcherJob
parse :- org.apache.nutch.parse.ParserJob
updatedb :- org.apache.nutch.crawl.DbUpdaterJob
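
For the 1.x classes above, a plausible set of program arguments for each run configuration (directory names are again placeholders; <segment> stands for the timestamped directory that Generator creates under crawl/segments):

inject :- crawl/crawldb urls
generate :- crawl/crawldb crawl/segments
fetch :- crawl/segments/<segment>
parse :- crawl/segments/<segment>
updatedb :- crawl/crawldb crawl/segments/<segment>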


HOPE THIS HELPS!!!!

6 comments:

  1. How can I use the crawl script in Eclipse to crawl?

     Reply: What do you mean by the crawl script? If you configure your project in Eclipse, you get the Run option.
     HTH
  2. Salaam, I did the same but got this error:

    Injector: java.io.IOException: Failed to set permissions of path: \path\to\large\hadoop\tmp\mapred\staging\SemsEM1495190232\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)

     Reply: Hi, were you trying this on Windows Server? I faced the same issue back in 2012 when I was running it on Windows Server.
  3. When I change some conf, I get this error (any support?):

    Injector: starting at 2015-11-10 05:54:14
    Injector: crawlDb: -dir
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    java.io.IOException: Job failed!
    Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:330)
    at org.apache.nutch.crawl.Injector.run(Injector.java:388)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:377)

     Reply: Not sure what exact error you are getting; you should try to debug it further. With no inputs I cannot identify the error.
