Technical Fundas: Crawl your website using Nutch Crawler without Indexing the HTML content into SOLR

Sunday, July 27, 2014

This article explains how to crawl a website with the Nutch crawler without indexing the HTML content into Solr.

To do this, edit the crawl script in the Nutch package and remove the following pieces of code:

SOLRURL="$3"

if [ "$SOLRURL" = "" ]; then
    echo "Missing SOLRURL : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

  echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"
  $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ]
   then exit $?
  fi

  echo "Cleanup on SOLR index -> $SOLRURL"
  $bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb

  if [ $? -ne 0 ]
   then exit $?
  fi
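
One thing to check after removing SOLRURL="$3": the usage message above lists <numberOfRounds> as the fourth argument, so the script presumably reads it from "$4". With the Solr URL gone, it makes sense to read the number of rounds from "$3" instead. The following is only a minimal sketch of that argument handling. parse_crawl_args is a hypothetical helper, not part of Nutch, and the variable names SEEDDIR and LIMIT are assumptions about the stock script.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch): parse the crawl script's
# arguments once the Solr URL has been dropped from the command line.
parse_crawl_args() {
  SEEDDIR="$1"      # directory containing the seed URL list
  CRAWL_PATH="$2"   # directory holding crawldb, linkdb and segments
  LIMIT="$3"        # number of rounds; assumed to be "$4" before the change

  if [ -z "$SEEDDIR" ] || [ -z "$CRAWL_PATH" ] || [ -z "$LIMIT" ]; then
    echo "Usage: crawl <seedDir> <crawlDir> <numberOfRounds>" >&2
    return 1
  fi
}

# Example call with placeholder directory names.
parse_crawl_args urls crawl 2 && \
  echo "Crawling $SEEDDIR into $CRAWL_PATH for $LIMIT rounds"
```

If you would rather not touch the argument positions, you can leave the number of rounds in the fourth position and pass any placeholder value as the third argument.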

Hope This Helps!!!
