Friday 9 August 2013

Solr 4 Deduplication On Windows

Setting Up Deduplication

I followed the Solr wiki to try to set up deduplication, but all it did was break my Solr server. The settings below worked for me and should work for you too.

Any changes made to the schema.xml file for Solr should also be reflected in the schema.xml file in Nutch.

schema.xml

You need a separate field to store the signature:


<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an update request processor chain:
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Overwrite an existing document that has the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- Field (defined in schema.xml) that stores the computed signature -->
    <str name="signatureField">signature</str>
    <!-- Field(s) the signature is computed from -->
    <str name="fields">id</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Also be sure to change your update handlers to use the defined chain:


<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
The update processor chain can also be specified per request with the parameter update.chain=dedupe.
Note that before Solr 3.2 the parameter is named update.processor instead.
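
As a rough illustration of the per-request form, an update message like the one below could be posted to the update handler with the chain named in the URL, for example http://localhost:8983/solr/update?update.chain=dedupe&commit=true. This is only a sketch: the host, port and path assume a default single-core install, and the id value is made up.

```xml
<!-- Hypothetical update message; the id value is a placeholder. -->
<add>
  <doc>
    <field name="id">http://example.com/page.html</field>
  </doc>
</add>
```

With the dedupe chain applied, the signature field is computed from id at index time, so it does not need to be included in the message.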
