File Types

From DBSight Full-Text Search Engine/Platform Wiki

Default Supported File Types

  • html
  • pdf
  • xls
  • rtf
  • csv
  • txt
  • zip

You can also extend or customize this list to support your own file types.

How are file types recognized?

For any column of BLOB type, the type of the contained file is determined by another column whose field type is set to "Blob File Name".

For example, when a row stores a file in a BLOB column, the file name is usually stored as well. The file name's extension determines the file type. If the file name is something like "abc.pdf" or "c:\abc\def\abc.pdf", or anything else ending with ".pdf", the binary data in the BLOB is parsed as PDF.
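The rule described above can be sketched in a few lines of Java. This is only an illustration: the class FileTypeSketch and the method extensionOf are hypothetical names, not part of DBSight, and the sketch makes no claim about how DBSight treats upper-case extensions.

```java
// Hypothetical helper, not part of DBSight: derives a file type from the
// value of a "Blob File Name" column by looking at the extension.
class FileTypeSketch {

    // Returns the text after the last '.' of the last path segment, or "" if there is none.
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        // Only inspect the last path segment so a dot in a directory name is ignored.
        int lastSeparator = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String baseName = fileName.substring(lastSeparator + 1);
        int dot = baseName.lastIndexOf('.');
        if (dot < 0 || dot == baseName.length() - 1) {
            return "";
        }
        return baseName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("abc.pdf"));
        System.out.println(extensionOf("c:\\abc\\def\\abc.pdf"));
        System.out.println(extensionOf("noExtension"));
    }
}
```

Both "abc.pdf" and "c:\abc\def\abc.pdf" yield the same extension, which is why a full path works as well as a bare file name in the "Blob File Name" column.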

If all the BLOB content is of the same type, PDF for example, you can select a column with a constant value such as "aFileName.pdf" and use it as the "Blob File Name".

How to add your own file types

This is an advanced topic. In short, you implement an interface whose single method takes an InputStream as input and returns a String, then add 4 lines to an XML file to register the file type and your new Java class.

The source code of the interface is shown here. The interface is packaged inside dbsight.jar.

package net.javacoding.xsearch.indexer.filter;

import java.io.InputStream;

public interface Filter {
    abstract String getString(InputStream is);
}

Here is example code for RTFFilter:

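   import java.io.File;
   import java.io.FileInputStream;
   import java.io.InputStream;
   import java.io.InputStreamReader;
   import java.io.Reader;

   import net.javacoding.xsearch.indexer.filter.Filter;

   // The imports for RTFParser and RTFParserDelegateImpl are not shown in this excerpt.
   public class RTFFilter implements Filter {

       // Extracts the plain text of an RTF document, or returns null if parsing fails.
       public String getString(InputStream is) {
           Reader reader = new InputStreamReader(is);
           RTFParserDelegateImpl delegate = new RTFParserDelegateImpl();
           RTFParser rtfParser = RTFParser.createParser(reader);
           rtfParser.setNewLine("\n");
           rtfParser.setDelegate(delegate);

           try {
               rtfParser.parse();
           } catch (com.etranslate.tm.processing.rtf.ParseException e) {
               return null;
           }

           return delegate.getText();
       }

       // Command-line check: prints the text extracted from the file named in args[0].
       public static void main(String[] args) throws Exception {
           RTFFilter filter = new RTFFilter();
           System.out.println(filter.getString(new FileInputStream(new File(args[0]))));
       }
   }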
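For comparison, here is a minimal sketch of a filter that needs no third-party parser. It assumes the BLOB holds UTF-8 text, which may not hold for your data. PlainTextFilterSketch is a hypothetical name, and the Filter interface is redeclared locally only so that the sketch compiles without dbsight.jar; in a real filter you would import net.javacoding.xsearch.indexer.filter.Filter instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Local stand-in with the same shape as net.javacoding.xsearch.indexer.filter.Filter,
// declared here only so the sketch compiles on its own.
interface Filter {
    String getString(InputStream is);
}

// Hypothetical filter, not part of DBSight: treats the BLOB content as UTF-8 text.
class PlainTextFilterSketch implements Filter {

    public String getString(InputStream is) {
        try {
            // Copy the whole stream into memory, then decode it in one step.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = is.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Same convention as the RTF example: null signals that extraction failed.
            return null;
        }
    }

    public static void main(String[] args) {
        Filter filter = new PlainTextFilterSketch();
        InputStream in = new ByteArrayInputStream("hello, DBSight".getBytes(StandardCharsets.UTF_8));
        System.out.println(filter.getString(in));
    }
}
```

Whatever the format, the contract stays the same: read the bytes from the stream and return the text that should be indexed.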

To register the filter for the file type, locate the file WEB-INF/conf/xsearch-web-config.xml, find the <filters> tag, then copy and modify lines similar to these:

   <filter name="Excel">
     <filter-class>net.javacoding.xsearch.indexer.filter.POIExcelFilter</filter-class>
     <filename-pattern><![CDATA[.*\.xls]]></filename-pattern>
   </filter>
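The filename-pattern value is a regular expression. The sketch below shows how such a pattern could select a filter class for a given file name. It is an assumption that the pattern must match the whole file name and that matching is case-sensitive; DBSight's actual lookup code is not shown on this page, and FilterLookupSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not DBSight code: maps filename patterns to filter class names,
// in the spirit of the <filters> section of xsearch-web-config.xml.
class FilterLookupSketch {

    // Keeps registration order so the first registered match wins.
    private final Map<Pattern, String> filters = new LinkedHashMap<Pattern, String>();

    void register(String filenamePattern, String filterClass) {
        filters.put(Pattern.compile(filenamePattern), filterClass);
    }

    // Returns the class name of the first pattern that matches the whole file name, or null.
    String find(String fileName) {
        for (Map.Entry<Pattern, String> entry : filters.entrySet()) {
            if (entry.getKey().matcher(fileName).matches()) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FilterLookupSketch lookup = new FilterLookupSketch();
        lookup.register(".*\\.xls", "net.javacoding.xsearch.indexer.filter.POIExcelFilter");
        System.out.println(lookup.find("report.xls"));
        System.out.println(lookup.find("c:\\abc\\def\\report.xls"));
        System.out.println(lookup.find("report.pdf"));
    }
}
```

Under these assumptions, a pattern such as .*\.xls accepts any name ending in ".xls", with or without a leading path, and rejects "report.xlsx" and "REPORT.XLS". If you need the latter, a pattern like .*\.(xls|XLS) is one option.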