Friday 9 August 2013

Deleting Dead URLs & Files That No Longer Exist in Nutch

Nutch allows you to crawl the web or a filesystem and build up an index of all its content. If your objective is simply to crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it is harder, largely because the Nutch documentation does not have many details on how to do it.

When you recrawl your source, there can be previously active URLs or files that no longer exist, and you would like Nutch to remove them from its indexes.  Nutch will pick up any changes made to documents it has indexed in the past, but any files that have been deleted still remain in the indexes.

If you wish to skip straight to the solution then just go to the end of the post, but if you wish to understand what is happening then read on.

Nutch stores a record of all the files/urls it has encountered while crawling in a database called the crawldb.  Initially this is built from the list of urls/files provided by the user via the inject command, which will normally be taken from your seed.txt file.

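For example, assuming the seed list lives in a directory called urls (containing seed.txt) and the crawldb is kept under crawl/crawldb (both paths are just illustrative), the crawldb is first populated with something like:

# urls/seed.txt holds one url/file per line, e.g. http://www.example.com/
bin/nutch inject crawl/crawldb urls
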
Nutch uses a generate/fetch/update process (a sketch of one full cycle is shown after this list):
generate:  This command looks at the crawldb for all the urls/files that are due for fetching and groups them into a segment. A url/file is due for fetching if it is new, or if the fetch interval for that url/file has expired and it is now due for recrawling (the default interval is 30 days).
fetch:  This command goes and fetches all the urls/files specified in the segment.
update:  This command adds the results of the crawl, which have been stored in the segment, into the crawldb. Each url/file is updated to record the time it was fetched and when its next fetch is scheduled. Any newly discovered urls/files are added and marked as not fetched.

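As an illustration, here is a minimal sketch of one generate/fetch/update cycle using the standard Nutch 1.x command line (directory names are again illustrative):

# Select the urls/files that are due and group them into a new segment
bin/nutch generate crawl/crawldb crawl/segments

# Segments are named by timestamp, so the last one listed is the newest
SEGMENT=$(ls -d crawl/segments/* | tail -1)

# Fetch everything listed in that segment
bin/nutch fetch $SEGMENT

# Parse the fetched content (not needed if fetcher.parse is set to true)
bin/nutch parse $SEGMENT

# Fold the results back into the crawldb, updating statuses and fetch times
bin/nutch updatedb crawl/crawldb $SEGMENT
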
How does Nutch detect whether a page has changed or not? Each time a page is fetched, Nutch computes a signature for it. At the next fetch, if the signature is the same (or if the web server returns a 304 because of the If-Modified-Since header), Nutch can tell that the page was not modified. It is not just the content that counts: if the http headers or metatags have changed, the page will be marked as modified.  If a document no longer exists, the server returns a 404 and the record is marked DB_GONE. During the update cycle Nutch has the ability to purge all the urls/files that have been marked DB_GONE.

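If you want to see how many of your urls/files are in each state, the crawldb can be inspected with the readdb command (path illustrative):

# Prints a count per status, including db_fetched, db_unfetched and db_gone
bin/nutch readdb crawl/crawldb -stats
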
The linkdb stores the inverted link information (which pages link to each url, and with what anchor text) that Nutch builds from the crawl; together with the crawldb and the segments, this is the data that Nutch passes to the Solr server during the solrindex process.

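A typical end-of-cycle indexing step is sketched below. The Solr URL and paths are placeholders, and the -deleteGone option (which asks the indexer to send deletes for documents marked gone) is only available in recent Nutch 1.x releases, so check the usage of bin/nutch solrindex in your version:

# Build/refresh the linkdb from the segment that was just fetched
bin/nutch invertlinks crawl/linkdb $SEGMENT

# Send the documents to Solr; with -deleteGone the indexer also issues
# delete requests for documents that have been marked gone
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT -deleteGone
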
To tell Nutch that you would like all the urls/files that have been deleted to be purged from the crawldb, you need to add the following property to your nutch-site.xml:

<property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will add purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>
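
With this property in place, one way to check that the purge is happening (paths as before) is to compare the db_gone count in the crawldb statistics before and after the updatedb step:

# Before updatedb: dead urls/files appear under the db_gone status
bin/nutch readdb crawl/crawldb -stats

bin/nutch updatedb crawl/crawldb $SEGMENT

# After updatedb with db.update.purge.404 set to true, the db_gone
# records should have been purged from the crawldb
bin/nutch readdb crawl/crawldb -stats
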
I hope this is of some help to you. 

1 comment:

  1. but ever read this: http://lucene.472066.n3.nabble.com/404-removal-not-working-and-title-mysteriously-appearing-in-content-td4295917.html
