Tuesday 30 July 2013

MySQL Server (5.6), Installer, Connector or Workbench Won't Install (Windows 7 64-bit)

If you’re having trouble installing MySQL on Windows 7 (the process should be much the same on other platforms), I’d recommend the following procedure.  Don’t skip the first steps, which describe uninstalling and removing the previous install, as this clears up most issues.

Caution: deleting the data files will delete any data previously entered into MySQL.

1.  Remove previous MySQL installs (and remnants)
a.  Stop the MySQL service (Start | Control Panel | System & Security | Administrative Tools | Services)

b.  Remove all programs with MySQL in the name using Windows Add or Remove Programs (Start | Control Panel | Uninstall a Program)

c.  Delete the MySQL data folders below (AppData is a hidden folder). You will lose any stored data, so make a copy first if you want to keep it:
       C:\Users\{Your Username}\AppData\Local\MySQL
       C:\Users\{Your Username}\AppData\LocalLow\MySQL
       C:\Users\{Your Username}\AppData\Roaming\MySQL


d.  Delete the MySQL folders from:
        C:\Program Files\MySQL\
        C:\Program Files (x86)\MySQL\
        C:\ProgramData\MySQL\                --(this is also a hidden folder)


————–
That’s it for the removal and clean up.  Now, we’ll begin the install.

2.a  Go to the MySQL website and download the latest MySQL Installer.  It comes as an MSI package and is 32-bit (x86), but it will run on, and install the server for, a 64-bit machine.

b.  Launch the installer from your Downloads folder.  Follow the wizard through to the end, and make a note of the password you set for the new server installation in case you forget it.

TROUBLESHOOTING
What if it didn’t work?

1.  Check the service from the Control Panel.  Is it running?  If so, try to log into MySQL; I’ve seen instances where the installer reported an error but actually completed successfully.  If it’s not running, try to start it and take note of the complete error message.  (Command-line equivalents of these checks are sketched after this list.)

2.  Check the error log.  It lives in the server's data directory, typically C:\ProgramData\MySQL\MySQL Server 5.6\data.  It has a .err extension and you should be able to open it with Notepad.  Ignore any errors about the InnoDB plugin not loading; that's a symptom, not the problem.  Look for an error about missing data files or a mismatch in file sizes.

3.  1045 error?  The message should end with (using password: NO) or (using password: YES).  If it says "YES", it's not a port issue or a firewall issue — it's a password issue.  It's usually caused by not deleting the data files from an old install: the password is kept in the data directory, so the password you set during the previous install probably doesn't match what you're entering now.  You can either re-install after deleting all the old files, or reset your root password (see the sketch after this list and http://dev.mysql.com/doc/refman/5.1/en/resetting-permissions.html ).

4.  Still stuck?   Email Oracle's Support Team
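
If you prefer the command line, here is a rough sketch of the equivalent checks and of the password reset described in the manual page linked above.  The service name MySQL56 and the install path are the defaults created by MySQL Installer for a 5.6 server, so adjust them if yours differ, and treat C:\mysql-init.txt and 'NewPassword' as placeholders.

rem Check whether the MySQL service exists and is running, and start it if needed
sc query MySQL56
net start MySQL56

rem Try to log in as root (you will be prompted for the password set during install)
"C:\Program Files\MySQL\MySQL Server 5.6\bin\mysql.exe" -u root -p

rem To reset a forgotten root password: stop the service, create C:\mysql-init.txt
rem containing the single line
rem     SET PASSWORD FOR 'root'@'localhost' = PASSWORD('NewPassword');
rem then start the server once with that file (you may also need --defaults-file
rem pointing at your my.ini), log in, and afterwards restart the service normally.
net stop MySQL56
"C:\Program Files\MySQL\MySQL Server 5.6\bin\mysqld.exe" --init-file=C:\mysql-init.txt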

Monday 29 July 2013

Error parsing: xxx failed(2,0): Unable to read 512 bytes from 65536 in stream of length 65421

The error is caused by the length of the file being parsed, so the exact numbers in the message will differ from file to file.

By default, Nutch limits how much of each file it will download for parsing (the default limit is 65536 bytes, which is why that figure appears in the message).  If your file is larger than the limit, Nutch will either truncate the content or give up on it altogether.


SOLUTION:

To remove the limit, set the content limit to -1.  This tells Nutch that there is no cap on file length, so it will keep downloading and parsing until the file is complete.  Be aware that this can cause performance problems with very large files.

If you are using a web crawler:

Add this property to nutch-site.xml:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
               If this value is nonnegative (>=0), content longer than it will be
               truncated; otherwise, no truncation at all.
  </description>
</property>
If you are using a filesystem crawler:

Add this property to nutch-site.xml:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
               If this value is nonnegative (>=0), content longer than it will be
               truncated; otherwise, no truncation at all.
  </description>
</property>
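
If removing the limit entirely feels too aggressive, an alternative (my own suggestion rather than anything from the Nutch documentation quoted above) is to raise the limit to a fixed size that comfortably covers your largest files, for example 10 MB:

<property>
  <name>http.content.limit</name>
  <value>10485760</value>
  <description>Raised limit of 10 MB (10485760 bytes); set file.content.limit
               instead if you are running the filesystem crawler.</description>
</property>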

Friday 26 July 2013

Java Multi-Threaded Recursive File & Folder Crawler

This is a multi-threaded Java program that takes a file path, recursively walks every folder beneath it, and prints out the contents of each directory.  The main thread feeds directories into a shared work queue, and a pool of worker threads drains the queue and lists each directory's entries.

package FileName;
import java.util.*;
import java.io.*;
 
public class fileCrawler {
 
  private WorkQueue workQ;
 
 private class Worker implements Runnable {
 
  private WorkQueue queue;
 
  public Worker(WorkQueue q) {
   queue = q;
  }
 
//  since main thread has placed all directories into the workQ, we
//  know that all of them are legal directories; therefore, do not need
//  to try ... catch in the while loop below
 
  public void run() {
   String name;
   while ((name = queue.remove()) != null) {
    File file = new File(name);
    String entries[] = file.list();
    if (entries == null)
     continue;
    for (String entry : entries) {
     if (entry.compareTo(".") == 0)
      continue;
     if (entry.compareTo("..") == 0)
      continue;
     String fn = name + "\\" + entry;
     System.out.println(fn);
    }
   }
  }
 }
 
 public fileCrawler() {
  workQ = new WorkQueue();
 }
 
 public Worker createWorker() {
  return new Worker(workQ);
 }
 
 
// need try ... catch below in case the directory is not legal
 
 public void processDirectory(String dir) {
  try {
   File file = new File(dir);
   if (file.isDirectory()) {
    String entries[] = file.list();
    if (entries == null)        // directory could not be read
     return;
    workQ.add(dir);             // hand this directory to the worker threads
    for (String entry : entries) {
     String subdir;
     if (entry.compareTo(".") == 0)
      continue;
     if (entry.compareTo("..") == 0)
      continue;
     if (dir.endsWith("\\"))
      subdir = dir + entry;
     else
      subdir = dir + "\\" + entry;
     processDirectory(subdir);  // recurse into sub-directories
    }
   }
  } catch (Exception e) {
   // ignore directories that cannot be accessed
  }
 }
 
 public static void main(String Args[]) {
 
  fileCrawler fc = new fileCrawler();
 
//  now start all of the worker threads
 
  int N = 5;
  ArrayList<Thread> thread = new ArrayList<Thread>(N);
  for (int i = 0; i < N; i++) {
   Thread t = new Thread(fc.createWorker());
   thread.add(t);
   t.start();
  }
 
//  now place each directory into the workQ
 
  fc.processDirectory(Args[0]);
 
//  indicate that there are no more directories to add
 
  fc.workQ.finish();
 
  for (int i = 0; i < N; i++){
   try {
    thread.get(i).join();
   } catch (Exception e) {};
  }
 }
}



package FileName;
import java.util.*;
 
public class WorkQueue {
 
//
// since we are providing the concurrency control, can use non-thread-safe
// linked list
//
  private LinkedList<String> workQ;
 private boolean done;  // no more directories to be added
 private int size;  // number of directories in the queue
 
 public WorkQueue() {
  workQ = new LinkedList<String>();
  done = false;
  size = 0;
 }
 
 public synchronized void add(String s) {
  workQ.add(s);
  size++;
  notifyAll();
 }
 
 public synchronized String remove() {
  String s;
  while (!done && size == 0) {
   try {
    wait();
   } catch (Exception e) {};
  }
  if (size > 0) {
   s = workQ.remove();
   size--;
   notifyAll();
  } else
   s = null;
  return s;
 }
 
 public synchronized void finish() {
  done = true;
  notifyAll();
 }
}
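
To compile and run the crawler, here is a minimal sketch.  It assumes the two classes are saved as fileCrawler.java and WorkQueue.java inside a folder named FileName (to match the package declaration) and that the commands are run from the folder above it; the path to crawl is just an example.

javac FileName\fileCrawler.java FileName\WorkQueue.java
java FileName.fileCrawler "C:\folder\to\crawl"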

Thursday 25 July 2013

Solr XSLT alternative with live links

The XSLT files that ship with Solr are pretty good and cover a wide variety of transforms, but they didn't give me exactly what I wanted.  I use Nutch as a filesystem crawler, so I wanted the transform to turn the URLs (URIs) of the crawled files into hyperlinks, letting me click through to download a file or view it in my browser.  I had no previous XSLT experience, so I adapted the example.xsl file that comes with Solr; this is the final version:

<?xml version='1.0' encoding='UTF-8'?>
 
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
>
 
  <xsl:output media-type="text/html" encoding="UTF-8"/> 
 
  <xsl:variable name="title" select="concat('Solr search results (',response/result/@numFound,' documents)')"/>
 
  <xsl:template match='/'>
    <html>
      <head>
        <title><xsl:value-of select="$title"/></title>
        <xsl:call-template name="css"/>
      </head>
      <body>
        <h1><xsl:value-of select="$title"/></h1>
        <div class="note">
          ******* Author: Allan Macmillan ******** Blog:amac4.blogspot.com ********* allan2.xls
        </div>
        <xsl:apply-templates select="response/result/doc"/>
      </body>
    </html>
  </xsl:template>
 
  <xsl:template match="doc">
    <xsl:variable name="pos" select="position()"/>
    <div class="doc">
      <table width="100%">
        <xsl:apply-templates>
          <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
        </xsl:apply-templates>
      </table>
    </div>
  </xsl:template>
 
  <xsl:template match="doc/*[@name='score']" priority="100">
    <xsl:param name="pos"></xsl:param>
    <tr>
      <td class="name">
        <xsl:value-of select="@name"/>
      </td>
      <td class="value">
        <xsl:value-of select="."/>
 
        <xsl:if test="boolean(//lst[@name='explain'])">
          <xsl:element name="a">
            <!-- can't allow whitespace here -->
            <xsl:attribute name="href">javascript:toggle("<xsl:value-of select="concat('exp-',$pos)" />");</xsl:attribute>?</xsl:element>
          <br/>
          <xsl:element name="div">
            <xsl:attribute name="class">exp</xsl:attribute>
            <xsl:attribute name="id">
              <xsl:value-of select="concat('exp-',$pos)" />
            </xsl:attribute>
            <xsl:value-of select="//lst[@name='explain']/str[position()=$pos]"/>
          </xsl:element>
        </xsl:if>
      </td>
    </tr>
  </xsl:template>
 
  <xsl:template match="doc/arr" priority="100">
    <tr>
      <td class="name">
        <xsl:value-of select="@name"/>
      </td>
      <td class="value">
        <ul>
        <xsl:for-each select="*">
          <li><xsl:value-of select="."/></li>
        </xsl:for-each>
        </ul>
      </td>
    </tr>
  </xsl:template>
 
 
  <xsl:template match="doc/*">
    <tr>
      <td class="name">
        <xsl:value-of select="@name"/>
      </td>
      <td class="value">
      <xsl:variable name="var" select="."/>
      <xsl:choose>
         <xsl:when test="starts-with($var,'file:////')">
      <a>
            <xsl:attribute name="href">
               <xsl:value-of select="$var"/>
            </xsl:attribute>
     <xsl:value-of select="$var"/></a>
         </xsl:when>
  <xsl:otherwise>
             <xsl:value-of select="$var"/>
         </xsl:otherwise>
      </xsl:choose>
      </td>
    </tr>
  </xsl:template>
 
  <xsl:template match="*"/>
 
  <xsl:template name="css">
    <script>
      function toggle(id) {
        var obj = document.getElementById(id);
        obj.style.display = (obj.style.display != 'block') ? 'block' : 'none';
      }
    </script>
    <style type="text/css">
      body { font-family: "Lucida Grande", sans-serif }
      td.name { font-style: italic; font-size:80%; }
      td { vertical-align: top; }
      ul { margin: 0px; margin-left: 1em; padding: 0px; }
      .note { font-size:80%; }
      .doc { margin-top: 1em; border-top: solid grey 1px; }
      .exp { display: none; font-family: monospace; white-space: pre; }
    </style>
  </xsl:template>
 
</xsl:stylesheet>
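
To use the stylesheet, save it into your core's conf/xslt directory (e.g. $SOLR_HOME/collection1/conf/xslt/) and reference it with the wt and tr parameters, along these lines (the file name here is just a placeholder):

http://localhost:8080/solr/select?q=*:*&wt=xslt&tr=mystylesheet.xsl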


XSLT Error For Solr - HTTP Status 500 - {msg=getTransformer fails in getContentType...

It is common to apply an XSLT stylesheet to an XML file to give it a cleaner look, and Solr is no different, shipping with four pre-written XSL files for you to choose from.

The error message "XSLT Error For Solr - HTTP Status 500 - {msg=getTransformer fails in getContentType..." is a common one and in most cases easily fixed.  There are a few reasons it appears:

  1. You have not specified a stylesheet to use
           e.g. http://localhost:8080/solr/select?q=*&wt=xslt
     Make sure you have specified a stylesheet and that the stylesheet you specified actually exists
           e.g. http://localhost:8080/solr/select?q=*&wt=xslt&tr=example.xsl
  2. A common mistake is to think that a stylesheet has a .xslt file extension - it doesn't! The extension is .xsl, so make sure you use .xsl both when you save your file and when you specify it in your query.
  3. There are errors in your XSL file - be careful if you are working in an editor that does no syntax checking (such as Notepad); it is easy to forget to close a tag or make a spelling mistake.

Stack Trace:
 {msg=getTransformer fails in getContentType,trace=java.lang.RuntimeException: getTransformer fails in getContentType
    at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:74)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:623)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:372)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1008)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
 Caused by: java.io.IOException: 'tr' request parameter is required to use the XSLTResponseWriter
    at org.apache.solr.response.XSLTResponseWriter.getTransformer(XSLTResponseWriter.java:1

Tuesday 23 July 2013

Setting Up Tika & Extracting Request Handler

Some of this is covered in the Solr set-up guide.
Sometimes indexing prepared text files (XML, CSV, JSON, etc.) is not enough; there are numerous situations where you need to extract data from binary files - for example, indexing the contents of PDF files.  To do that we can use Apache Tika, which comes built into Apache Solr via its ExtractingRequestHandler.


Preparation

You should have worked through the Solr set-up guide before starting this one; it can be found at:

If you want a fully functioning file or web crawler that uses Nutch and indexes into Solr, follow the next part of the guide at:

Set-Up Guide

  • In the $SOLR_HOME/collection1/conf/solrconfig.xml file there is a section headed "Solr Cell Update Request Handler". Update or replace the code there so that it reads:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
  • Create an "extract" folder anywhere on the system; one option is to put it inside the $SOLR_HOME folder. Copy the solr-cell-4.3.0.jar file from $SOLR/dist into it, then copy the contents of the $SOLR/contrib/extraction/lib/ folder into the same extract folder.
  • In the solrconfig.xml file, add a lib directive for the directory you have chosen:
<lib dir="$SOLR_HOME/extract" regex=".*\.jar" />
  • In the schema.xml file, the <field name="text"…..> line needs to be edited to read:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
  • To test that it works, open a command prompt, navigate to any directory containing a PDF file, and run the following command, replacing FILENAME.pdf with the file to be used:
curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
  • If everything has worked correctly, output like the following should be displayed:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">578</int>
  </lst>
</response>
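
To confirm the PDF is actually searchable after the commit, a quick check (using the standard select handler) is to query for the id you passed in literal.id:

curl "http://localhost:8080/solr/select?q=id:1"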

Next Steps

You now have Solr configured and ready to use Tika to extract the data that you need.  The next step is to configure Nutch, an open-source crawler that will crawl the web (or a filesystem) to find documents to index:

How It Works

Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents - not only binary files but also HTML and XML files.

To add a handler that uses Apache Tika, we add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example. In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory we created: the dir attribute of the lib tag should point to the path of that directory, and the regex attribute is a regular expression telling Solr which files to load.

Now let's discuss the default configuration parameters. The fmap.content parameter tells Solr which field the parsed document content should go to; in our case it goes to the field named text. The lowernames parameter is set to true, which tells Solr to lowercase all field names that come from Tika. The uprefix parameter is very important: it tells Solr how to handle fields that are not defined in the schema.xml file. The name of the field returned by Tika is prefixed with the value of this parameter before being sent to Solr. For example, if Tika returns a field named creator and we don't have such a field in our index, Solr will index it under a field named attr_creator, which is handled by a dynamic field. The last parameter, captureAttr, tells Solr to index Tika XHTML elements into separate fields named after those elements.

Finally, we have a command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. The first defines a unique identifier; it is useful to be able to set this while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
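
For that uprefix behaviour to work, schema.xml needs a dynamic field matching attr_*. The stock example schema already includes one; if yours does not, a rule along these lines (shown here as a sketch) should be added:

<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>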

Source Code

If you are unsure of anything then pop me an email and I can send you a sample schema.xml and solrconfig.xml for you to use.