Search outside of database

From DBSight Full-Text Search Engine/Platform Wiki


DBSight was originally built for database search, but it is no longer limited to databases. You can write a customized fetcher that crawls data from any source and feeds it to DBSight, and doing so is fairly simple.

Download Sample Fetcher

A sample fetcher is available. It does nothing useful on its own, but the sample project is a quick way to get started.

http://www.dbsight.net/download/customized_fetcher.rar

Directory Structure

A fetcher's files are usually laid out as follows:

WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/fetcher.xml
WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/*.jar
WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/*.zip
WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/*.properties

The <fetcher_dir_name> should be unique among all installed fetchers.

During indexing, the fetcher's *.jar, *.zip, and *.properties files are loaded into the classpath first, followed by the system classpath, and then all the usual *.jar and *.zip files under WEB-INF/lib and WEB-INF/lib/ext/indexing.
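The loading order can be pictured with a small sketch. This is not DBSight's actual loader, which is internal to the product; FetcherClasspathSketch, archivesIn and orderedEntries are hypothetical names. The sketch only builds the ordered list of entries that one might hand to a java.net.URLClassLoader, and it leaves out the system classpath, which sits between the fetcher directory and WEB-INF/lib.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of the DBSight API.
class FetcherClasspathSketch {

    // Returns the *.jar and *.zip files directly under dir, sorted by path.
    static List<Path> archivesIn(Path dir) throws IOException {
        List<Path> found = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return found;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{jar,zip}")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    found.add(p);
                }
            }
        }
        Collections.sort(found);
        return found;
    }

    // Fetcher-specific entries come first, then the shared library directories.
    static List<Path> orderedEntries(Path webInfLib, String fetcherDirName) throws IOException {
        Path fetcherDir = webInfLib.resolve("ext").resolve("fetcher").resolve(fetcherDirName);
        List<Path> entries = new ArrayList<>();
        entries.add(fetcherDir); // the directory itself exposes *.properties as resources
        entries.addAll(archivesIn(fetcherDir));
        entries.addAll(archivesIn(webInfLib));
        entries.addAll(archivesIn(webInfLib.resolve("ext").resolve("indexing")));
        return entries;
    }
}
```

Because the fetcher's own entries come first, a library bundled in the fetcher directory would be found before a different version of the same library under WEB-INF/lib, assuming ordinary first-match lookup.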

Customized Fetcher

When creating an index, you will notice a new "Customized Fetcher" option. The fetcher must be written in Java and extend the AbstractFetcher class. Two methods need to be defined.

Override List<FieldType> getFieldTypes(Properties p)

The Properties object contains the name/value string pairs that can be specified when configuring the fetcher.
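Since every configured value arrives as a string, a fetcher usually needs to apply defaults and parse numbers itself. Below is a minimal sketch using only java.util.Properties; the keys root_dir and batch_size, the defaults, and the class name FetcherSettingsSketch are made up for illustration and are not settings that DBSight defines.

```java
import java.util.Properties;

// Hypothetical helper; the property keys are illustrative, not DBSight settings.
class FetcherSettingsSketch {

    private static final int DEFAULT_BATCH_SIZE = 100;

    // Falls back to the current directory when the key is not configured.
    static String rootDir(Properties p) {
        return p.getProperty("root_dir", ".");
    }

    // Parses an integer setting, using the default when it is missing or malformed.
    static int batchSize(Properties p) {
        String raw = p.getProperty("batch_size");
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_BATCH_SIZE;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_BATCH_SIZE;
        }
    }
}
```

Falling back to a default on a malformed value keeps indexing running; a fetcher that prefers to fail fast could throw an exception instead.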

Here, you just need to list the fields and specify which one is the primary key and which one is the modified date. Here is a simple example:

   public List<FieldType> getFieldTypes(Properties p) {
       List<FieldType> ret = new ArrayList<FieldType>();
       ret.add(new NumberFieldType("id").setPrimaryKey(true));
       ret.add(new StringFieldType("title"));
       ret.add(new TimestampFieldType("modified_time").setModifiedTime(true));
       return ret;
   }

Override void execute(Properties p, long lastRunTime, List<Column> columns)

The Properties object contains the name/value string pairs that can be specified when configuring the fetcher.

In this method, call scheduleDocument(document) to pass each document down the pipeline. This way documents do not pile up in memory: once a document has been scheduled, its memory can be freed before the next one is processed.

If lastRunTime == 0, the index is being re-created. If lastRunTime > 0, it is the latest modified time in the existing index and the run is incremental. Here is a dumb example:

   public void execute(Properties p, long lastRunTime, List<Column> columns) {
       if(lastRunTime>0) {
           // an incremental indexing
           for(int i=50;i<150;i++) {
               TextDocument td = new TextDocument();
               td.add("id", Integer.toString(i));
               td.add("title", "incrementally updated document title "+i);
               td.add("modified_time", new Date(System.currentTimeMillis()));
               scheduleDocument(td);
           }
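           // delete an old document; the loop variable i is out of scope here,
           // so the key is given explicitly (0 is only an illustrative id)
           TextDocument old = new TextDocument();
           old.add("id", Integer.toString(0));
           scheduleDeleteDocument(old);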
       }else {
           // a re-create indexing
           for(int i=0;i<100;i++) {
               TextDocument td = new TextDocument();
               td.add("id", Integer.toString(i));
               td.add("title", "document title "+i);
               td.add("modified_time", new Date(System.currentTimeMillis()));
               scheduleDocument(td);
           }
       }
   }
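For a source that is not a database, the same full-versus-incremental logic might look like the sketch below, which walks a directory tree and emits one document per file changed since lastRunTime. It is only an illustration under stated assumptions: the Consumer stands in for scheduleDocument, a Map stands in for TextDocument, the relative path is used as the key, and IncrementalFileCrawlSketch is a hypothetical class that does not extend AbstractFetcher.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Date;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical stand-in: sink plays the role of scheduleDocument,
// and each Map plays the role of a TextDocument.
class IncrementalFileCrawlSketch {

    // Emits one document per regular file; returns how many were emitted.
    static int crawl(Path root, long lastRunTime, Consumer<Map<String, Object>> sink)
            throws IOException {
        int count = 0;
        try (Stream<Path> walk = Files.walk(root)) {
            Iterator<Path> it = walk.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                Path file = it.next();
                long modified = Files.getLastModifiedTime(file).toMillis();
                // Incremental run: skip files not modified since the last run.
                if (lastRunTime > 0 && modified <= lastRunTime) {
                    continue;
                }
                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("id", root.relativize(file).toString());
                doc.put("title", file.getFileName().toString());
                doc.put("modified_time", new Date(modified));
                // Hand the document off immediately instead of collecting a list.
                sink.accept(doc);
                count++;
            }
        }
        return count;
    }
}
```

Each document is passed to the sink as soon as it is built, so only one document is held at a time, which mirrors the reason for calling scheduleDocument inside the loop.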

Get Started

The sample fetcher has all the information you need. Start with build.xml and adjust three key properties:

dir : the directory name. It should be unique among all fetchers.
DBSIGHT_HOME : the directory where DBSight is installed.
jarName : the name of your jar file.

Then use the sample as a starting point and rename DumbFetcher and TestFetcher to your own class and package names.

After that, you can start writing your own code. Once it runs with your TestFetcher, run "ant" and the fetcher will be deployed to your DBSight instance.

Here is the source code for the dummy fetcher:

http://www.dbsight.net/download/customized_fetcher.rar

Here is the source code for the Salesforce fetcher:

http://www.dbsight.net/download/sforce_fetcher.rar

FAQ

What is the List<Column> columns argument to the execute method?

It is the list of selected columns. The configuration UI allows you to choose a subset of columns instead of the full list.

How do I let DBSight handle different file types?

DBSight should be able to handle common file types. You can use this API:

 TextDocument td = new TextDocument();
 File f = new File("/home/dbsight/f1.doc");
 td.add("file_name",f.getName()); //this is needed because DBSight does not automatically save the file name field.
 td.add("file_content", f);  // add the actual file
 // adding a remote file
 td.add("file_content2", "some remote word file.doc", new URL("http://www.abcdef.com/asdf/hijk.doc").openStream());

How do I handle deleted documents?

Use this API:

 TextDocument td = new TextDocument();
 td.add("your_primary_key_field", "primary_key_value");
 scheduleDeleteDocument(td);
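The API above schedules a delete once the key is known; working out which keys have disappeared from the source is left to the fetcher. One possible approach, sketched below, is a set difference between the keys indexed earlier and the keys currently in the source. This assumes the fetcher keeps its own record of previously indexed keys, which is not something described on this page, and DeletedKeySketch is a hypothetical helper.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; DBSight itself only needs the resulting keys.
class DeletedKeySketch {

    // Returns the keys that were indexed before but no longer exist in the source.
    static Set<String> keysToDelete(Set<String> indexedKeys, Set<String> sourceKeys) {
        Set<String> gone = new TreeSet<>(indexedKeys); // copy, so the inputs stay untouched
        gone.removeAll(sourceKeys);
        return gone;
    }
}
```

Inside execute, each returned key could then be put into a TextDocument under the primary key field and passed to scheduleDeleteDocument.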