Search outside of database
From DBSight Full-Text Search Engine/Platform Wiki
DBSight was originally built for database search, but it is no longer limited to databases. You can now write a customized fetcher that crawls data from another source and feeds it to DBSight, and doing so is fairly simple.
Download Sample Fetcher
There is a sample fetcher. It is deliberately dumb, but you can use the sample project to get started quickly.
http://www.dbsight.net/download/customized_fetcher.zip
Directory Structure
A fetcher usually consists of these files:
    WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/fetcher.xml
    WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/*.jar
    WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/*.zip
    WEB-INF/lib/ext/fetcher/<fetcher_dir_name>/*.properties
The <fetcher_dir_name> should be unique.
During indexing, the *.jar, *.zip, and *.properties files in the fetcher directory are loaded into the classpath first, followed by the system classpath and all the usual *.jar and *.zip files under WEB-INF/lib and WEB-INF/lib/ext/indexing.
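The lookup order described above can be made concrete with a small sketch. This is not DBSight code: the class and method names are hypothetical, and it only lists files in the order the text describes, assuming the shared directories are searched in the order they are named. It does not build a class loader, and it leaves out the system classpath, which sits between the fetcher files and the shared libraries.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not DBSight code: lists files in the search order described above.
class FetcherSearchOrderSketch {

    // Files directly inside dir whose names end with one of the suffixes, sorted by path.
    static List<File> matching(File dir, String... suffixes) {
        List<File> result = new ArrayList<File>();
        File[] files = dir.listFiles();
        if (files == null) {
            return result; // missing or unreadable directory
        }
        Arrays.sort(files);
        for (File f : files) {
            String name = f.getName().toLowerCase();
            for (String suffix : suffixes) {
                if (f.isFile() && name.endsWith(suffix)) {
                    result.add(f);
                    break;
                }
            }
        }
        return result;
    }

    // Fetcher-specific files come first, then the shared library directories.
    // The system classpath sits between the two groups and is not listed here.
    static List<File> searchOrder(File webInf, String fetcherDirName) {
        File lib = new File(webInf, "lib");
        File fetcherDir = new File(lib, "ext/fetcher/" + fetcherDirName);
        List<File> order = new ArrayList<File>();
        order.addAll(matching(fetcherDir, ".jar", ".zip", ".properties"));
        order.addAll(matching(lib, ".jar", ".zip"));
        order.addAll(matching(new File(lib, "ext/indexing"), ".jar", ".zip"));
        return order;
    }
}
```

One practical consequence of this order is that a library bundled in the fetcher directory is found before a copy with the same classes under WEB-INF/lib.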
Customized Fetcher
When creating an index, you will notice a new "Customized Fetcher" option to select. The fetcher must be written in Java and extend the AbstractFetcher class. Two methods need to be defined.
Override List<FieldType> getFieldTypes(Properties p)
The Properties object contains the name/value string pairs that can be specified when configuring the fetcher.
Here you only need to list the fields and specify which one is the primary key and which one holds the modified date. Here is a simple example:
    public List<FieldType> getFieldTypes(Properties p) {
        List<FieldType> ret = new ArrayList<FieldType>();
        ret.add(new NumberFieldType("id").setPrimaryKey(true));
        ret.add(new StringFieldType("title"));
        ret.add(new TimestampFieldType("modified_time").setModifiedTime(true));
        return ret;
    }
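Since the Properties object only carries strings, a fetcher usually has to supply defaults and parse numbers itself. The following is a minimal sketch of that, using only java.util.Properties. The class name and the keys used in the usage note ("source.dir", "batch.size") are hypothetical and not defined by DBSight.

```java
import java.util.Properties;

// Hypothetical helper, not part of DBSight: reads fetcher settings with fallbacks.
class FetcherConfigSketch {

    // Return the trimmed value for key, or fallback when it is missing or blank.
    static String getString(Properties p, String key, String fallback) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            return fallback;
        }
        return value.trim();
    }

    // Parse an integer setting; fall back when the value is missing or malformed.
    static int getInt(Properties p, String key, int fallback) {
        String value = getString(p, key, null);
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

Inside a fetcher method this could be used as FetcherConfigSketch.getString(p, "source.dir", "/tmp") or FetcherConfigSketch.getInt(p, "batch.size", 100), so a missing or mistyped setting falls back to the default instead of failing the indexing run.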
Override void execute(Properties p, long lastRunTime)
The Properties object contains the name/value string pairs that can be specified when configuring the fetcher.
In this method, call scheduleDocument(document) to pass each document down the pipeline. This way documents do not pile up in memory: once a document has been handed off, its memory can be freed before the next one is processed.
If lastRunTime == 0, the index is being re-created from scratch. If lastRunTime > 0, the run is incremental, and lastRunTime is the latest modified time found in the existing index. Here is a dumb example:
    public void execute(Properties p, long lastRunTime, List<Column> columns) {
        if(lastRunTime>0) {
            // an incremental indexing
            for(int i=50;i<150;i++) {
                TextDocument td = new TextDocument();
                td.add("id", Integer.toString(i));
                td.add("title", "incrementally updated document title "+i);
                td.add("modified_time", new Date(System.currentTimeMillis()));
                scheduleDocument(td);
            }
        }else {
            // a re-create indexing
            for(int i=0;i<100;i++) {
                TextDocument td = new TextDocument();
                td.add("id", Integer.toString(i));
                td.add("title", "document title "+i);
                td.add("modified_time", new Date(System.currentTimeMillis()));
                scheduleDocument(td);
            }
        }
    }
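For a slightly more realistic picture of the lastRunTime logic, here is a self-contained sketch that treats a directory of files as the data source. It does not use the DBSight classes: DocumentSink is a hypothetical stand-in for the hand-off that scheduleDocument(document) performs, documents are plain maps, and the file name serves as the id (the example above uses a numeric id instead).

```java
import java.io.File;
import java.util.Arrays;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the pipeline hand-off that scheduleDocument(document) performs.
interface DocumentSink {
    void schedule(Map<String, Object> document);
}

// Hypothetical sketch, not DBSight code: walks a directory and emits one document per file.
class DirectoryFetchSketch {

    // Returns the number of documents handed to the sink.
    static int fetch(File dir, long lastRunTime, DocumentSink sink) {
        File[] files = dir.listFiles();
        if (files == null) {
            return 0; // not a directory, or unreadable
        }
        Arrays.sort(files); // stable order across runs
        int emitted = 0;
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            long modified = f.lastModified();
            // lastRunTime == 0: full re-create, take every file.
            // lastRunTime > 0: incremental run, take only files changed since then.
            if (lastRunTime > 0 && modified <= lastRunTime) {
                continue;
            }
            Map<String, Object> doc = new LinkedHashMap<String, Object>();
            doc.put("id", f.getName());
            doc.put("title", f.getName());
            doc.put("modified_time", new Date(modified));
            sink.schedule(doc); // hand off immediately; no reference is kept afterwards
            emitted++;
        }
        return emitted;
    }
}
```

Because each document is passed to the sink as soon as it is built, the loop never holds more than one document at a time, which is the same reason the text gives for calling scheduleDocument(document) once per document.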
Get Started
The sample fetcher has all the information you will need. Start with build.xml and adjust three key properties:
    dir: the directory name. It should be unique among all fetchers.
    DBSIGHT_HOME: the directory where you installed DBSight.
    jarName: the name to give your jar file.
Then use the sample as a starting point and rename DumbFetcher and TestFetcher to your own class names and package.
After that, you can start writing your own code. Once it runs with your TestFetcher, run "ant" and the fetcher will be deployed to your DBSight instance.
Here is the source code for the dummy fetcher:
http://www.dbsight.net/download/customized_fetcher.rar
Here is the source code for the Salesforce fetcher:
http://www.dbsight.net/download/sforce_fetcher.rar
