Sunday, July 27, 2014

Crawl your website using Nutch Crawler without Indexing the HTML content into SOLR

This article explains how to crawl a website with the Nutch crawler without indexing the HTML content into Solr.

To do this, edit the crawl script that ships with the Nutch package and remove the following pieces of code:

SOLRURL="$3"

if [ "$SOLRURL" = "" ]; then
    echo "Missing SOLRURL : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi


echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"
$bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

if [ $? -ne 0 ]
then exit $?
fi

echo "Cleanup on SOLR index -> $SOLRURL"
$bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb

if [ $? -ne 0 ]
then exit $?
fi
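
One thing to watch for: once SOLRURL="$3" is removed, the positional arguments shift by one. Below is a minimal sketch of what the argument handling could look like afterwards. It assumes your copy of the script reads the number of rounds into LIMIT from "$4" (check this in your version), and the function name parse_crawl_args is hypothetical, used here only to keep the example self-contained:

```shell
#!/bin/bash
# Sketch only: argument handling for a crawl script with the Solr parts removed.
# parse_crawl_args is a hypothetical helper, not part of the Nutch package.
parse_crawl_args() {
  SEEDDIR="$1"
  CRAWL_PATH="$2"
  # Assumption: the unmodified script used LIMIT="$4" because SOLRURL took "$3".
  LIMIT="$3"

  if [ -z "$SEEDDIR" ] || [ -z "$CRAWL_PATH" ] || [ -z "$LIMIT" ]; then
    echo "Usage: crawl <seedDir> <crawlDir> <numberOfRounds>" >&2
    return 1
  fi
}

# Example call (arguments are placeholders):
#   parse_crawl_args urls crawl 2
```

With this change the script would be called with three arguments instead of four. Alternatively, you can leave the LIMIT line untouched and keep passing a placeholder value as the third argument.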

Hope This Helps!!!
