API

From DBSight Full-Text Search Engine/Platform Wiki

Search API

Background

Some users would like to search DBSight directly from Java or other languages. Using XML/JSON/HTML for this can be fragile, since those formats can change, and passing text between the search client and the DBSight server is not very efficient, especially XML.

So we implemented a binary protocol based on Google's Protocol Buffers. It turns out to be very efficient.

This opens the door for Java, C++, Python, and other languages to interact directly with DBSight.

Protocol Definition

This protocol runs between any search client and DBSight's searchProtocolBuffer.do endpoint. For Java, DBSight provides the SearchConnection class to wrap the communication. For other languages, you can simply send an HTTP request to searchProtocolBuffer.do containing a SearchRequest message, and get a SearchResponse back.

The definition of the protocol is actually pretty small:

 package net.javacoding.xsearch.api.protocol;
 
 option optimize_for = SPEED;
 
 option java_package = "net.javacoding.xsearch.api.protocol";
 option java_outer_classname = "SearchProtocol";
 
 message SearchRequest {
   repeated string index = 1;
   optional string query = 2;
   optional string lucene_query = 3;
   optional int32 start = 4;
   optional int32 result_per_page = 5;
   repeated Sort sort = 6;
   optional bool debug = 7;
   repeated StringColumnFormat string_column_format = 8;
   optional string begin_highlight_tag = 9;
   optional string end_highlight_tag = 10;
   optional int32 boolean_operator = 11; //0 for default, 1 for AND, 2 for OR
   optional int32 facet_count_limit = 12 [default = 17];   //avoid too many facet_counts 
   optional string source_location = 13;
   optional string user_input = 14;
   optional string searchable = 15;
   optional bool enalbe_facet_search = 16 [default = true]; // whether to do facet search or not
   optional int32 random_query_seed = 17 [default = 0];   // random seed if random query is needed
   optional string defaultQuery = 18;  //use this query if no search results found
   optional string lucene_query_parser_name = 19;
   optional string filterable = 20;
   optional int32 max_document = 21;  //21 and 22 are used for sharded search
   repeated SearchTermFreqency search_term_frequency = 22;
   optional bool lucene_and_operator = 23; //empty or false for default or, true for AND
   optional string roles = 24;  //comma separated roles
   optional string queryString = 25;  //directly from request.getQueryString()
 }
 message Sort {
   optional string column = 1;
   optional bool descending = 2 [default = true];
 }
 message StringColumnFormat {
   required string column = 1;
   enum StringFormat {
     DIRECT = 0;
     HTML = 1;
     HIGHLIGHTED = 2;
     HIGHLIGHTED_HTML = 3;
     SUMMARIZED = 4;
     SUMMARIZED_HTML = 5;
   }
   optional StringFormat string_format = 2 [default = DIRECT];
 }
 
 message SearchResponse {
   required int64 search_time = 1;
   required int32 total = 2;
   repeated Document doc = 3;
   repeated FacetChoice facet_choice = 4;
   optional string index_name = 5;
 }
 message Document {
   required float score = 1;
   required float boost = 2;
   repeated Field field = 3;
 }
 message Field {
   required string name = 1;
   required string value = 2;
   enum Type {
     STRING = 0;
     DATETIME = 1;
     NUMBER = 2;
   }
   optional Type type = 3 [default = STRING];
 }
 message FacetChoice {
   required string column = 1;
   repeated FacetCount facet_count = 2;
   optional int32 facet_count_total = 3;
   optional int32 min_integer_value = 4;
   optional int32 max_integer_value = 5;
   optional int64 usage_counter = 6;
 }
 message FacetCount {
   required string value = 1;
   optional string end_value = 2;
   required int32 count = 3;
 }
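To see why the binary protocol is compact, here is a hand-rolled sketch of the Protocol Buffer wire encoding for a minimal SearchRequest carrying only index = "books" and query = "love". This is for illustration only; real clients should use the protoc-generated SearchProtocol classes (or SearchConnection in Java) rather than encode bytes by hand.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class WireSketch {
    // Length-delimited field: tag byte = (field_number << 3) | 2 (wire type 2),
    // then the length, then the UTF-8 bytes. Single-byte tags and lengths are
    // enough here; full varint handling is omitted for brevity.
    static void writeString(ByteArrayOutputStream out, int fieldNumber, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write((fieldNumber << 3) | 2);
        out.write(bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    public static byte[] encode() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, "books"); // repeated string index = 1
        writeString(out, 2, "love");  // optional string query = 2
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode()) hex.append(String.format("%02X", b & 0xFF));
        System.out.println(hex); // 0A05626F6F6B7312046C6F7665 -- 13 bytes total
    }
}
```

The whole request fits in 13 bytes, which is the efficiency gain over sending the equivalent XML.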

Java Search

Requirements

  1. Java 1.5 or later
  2. copy WEB-INF/lib/dbsight.jar to your classpath
  3. copy WEB-INF/lib/dbsight-search-api.jar to your classpath
  4. copy WEB-INF/lib/protobuf.jar to your classpath

Sample Usage

       import net.javacoding.xsearch.api.Document;
       import net.javacoding.xsearch.api.FacetChoice;
       import net.javacoding.xsearch.api.FacetCount;
       import net.javacoding.xsearch.api.Field;
       import net.javacoding.xsearch.api.Result;
       import net.javacoding.xsearch.api.SearchConnection;
       import net.javacoding.xsearch.api.SearchQuery;
       ...
       SearchConnection s = new SearchConnection("http://localhost:8080/dbsight/").setIndex("my_index");
       SearchQuery q = new SearchQuery("love").setDebug(true)
                           .highlight("title").summarize("description")
                           .setHighlightTag("<span style=\"color:#666\">", "</span>")
                           .setFacetCountLimit(20);
       Result sr = s.search(q);
       System.out.println("total:"+sr.getTotal());
       System.out.println("doc count:"+sr.getDocList().size());
       System.out.println("Search time:"+sr.getSearchTime());
       for(Document d : sr.getDocList()) {
           System.out.println("---------------------------");
           for(Field f : d.getFieldList()) {
               System.out.println(f.getName()+"("+f.getType()+")"+f.getValue());
           }
       }
       for(FacetChoice fChoice : sr.getFacetChoiceList()) {
           System.out.println("Narrow By " + fChoice.getColumn());
           for(FacetCount fc : fChoice.getFacetCountList()) {
               System.out.println("  " + fc.getValue() + (fc.getEndValue().length()==0? "" : ","+fc.getEndValue()) + " ~ " + fc.getCount());
           }
       }

Check JavaDoc

The Javadoc shipped with DBSight has the most accurate information. Be sure to check the Javadoc for net.javacoding.xsearch.api.SearchConnection.

Submit API

Background

A very common requirement is to search content the user has just submitted. Previously, DBSight always relied on the scheduled database crawler to pick up new content, so it was impossible to meet this requirement.

Now, with this API, when a user submits content, the application can also submit that content to DBSight with a simple HTTP call. The content is kept in memory for fast searching and updating. We call this in-memory index the "Buffer Index".

Buffer Index Usage

Send an HTTP POST to DBSight with a special field called "indexName". For example, if you have an index "books" with fields such as "title", "author", "isbn", "price", and "publish_date", you can submit to:

http://localhost:8080/dbsight/submit.do

with parameters and values like

indexName=books
title=Good Book
author=me
isbn=123456789
price=23.99
publish_date=1221968258421                        //the date as a number of milliseconds since the epoch

All unknown columns are skipped.
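The submit call above can be sketched in Java. This is a minimal sketch, assuming the submit.do URL and the field names from the example; the form-encoding helper is separate so it can be inspected without a running DBSight server, and error handling is omitted.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class BufferIndexSubmit {
    // Encode the parameters as an application/x-www-form-urlencoded body.
    public static String buildFormBody(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // POST one document to DBSight's submit.do (sketch only).
    public static int submit(String baseUrl, Map<String, String> params) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(baseUrl + "submit.do").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(buildFormBody(params).getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("indexName", "books");
        doc.put("title", "Good Book");
        doc.put("author", "me");
        doc.put("isbn", "123456789");
        doc.put("price", "23.99");
        doc.put("publish_date", Long.toString(System.currentTimeMillis()));
        System.out.println(buildFormBody(doc));
        // submit("http://localhost:8080/dbsight/", doc); // requires a running DBSight
    }
}
```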

Data Types

Date,Timestamp

The value should be the number of milliseconds since January 1, 1970, 00:00:00 GMT. In Java, you can simply use:

new Date().getTime()
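For example, the publish_date value 1221968258421 from the submit example above decodes back to a 2008-09-21 UTC timestamp:

```java
import java.time.Instant;
import java.util.Date;

public class DateMillis {
    public static void main(String[] args) {
        // Both of these produce milliseconds since 1970-01-01T00:00:00 GMT:
        long a = new Date().getTime();
        long b = System.currentTimeMillis();
        System.out.println(a);

        // The sample publish_date value decoded back to a readable instant:
        System.out.println(Instant.ofEpochMilli(1221968258421L)); // 2008-09-21T03:37:38.421Z
    }
}
```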

Numbers, Float, Integer, Long

Nothing special here. Just print the number as a string, for example, 1234567.89.

String

Nothing special here.

Buffer Index Life Cycle

The Buffer Index holds its data in memory.

When submitting documents, the Primary Key Column is required. When content with the same primary key value is submitted again, the document is updated: the submitted fields are overwritten, and the remaining fields are kept unchanged.
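The field-level update behavior can be illustrated with a plain map merge. This is a sketch of the semantics only, not DBSight's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BufferIndexUpdateSemantics {
    // Re-submitting a document with the same primary key overwrites only the
    // submitted fields; fields that are not re-submitted keep their old values.
    public static Map<String, String> applyUpdate(Map<String, String> existing,
                                                  Map<String, String> submitted) {
        Map<String, String> merged = new LinkedHashMap<>(existing);
        merged.putAll(submitted);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new LinkedHashMap<>();
        existing.put("isbn", "123456789");   // primary key
        existing.put("title", "Good Book");
        existing.put("price", "23.99");

        Map<String, String> update = new LinkedHashMap<>();
        update.put("isbn", "123456789");     // same primary key -> update, not insert
        update.put("price", "19.99");        // only price is re-submitted

        System.out.println(applyUpdate(existing, update)); // title kept, price overwritten
    }
}
```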

When a scheduled incremental or re-create indexing finishes, the buffer index is discarded, since all of its documents are duplicated in the newly created index. This has several advantages:

  1. Developers don't need to rely solely on a fragile HTTP call. Just submit and forget: retry logic is not really needed, and there is no risk of slowing down the whole application.
  2. If DBSight needs to shut down for maintenance, the impact is minimal. Quite likely only the user who just submitted the document will try to search for it.

DBSight can always fall back on robust database crawling to retrieve the correct content.

If you want to submit content via the API and keep the buffer index content without running another database indexing, you can schedule an "Index API Submissions" job, which merges the buffer index data into the on-disk index. Here is how it works:

  1. When submitting content via the API, the content is not only indexed into the in-memory buffer index, but also stored on disk.
  2. When the "Index API Submissions" job runs, the on-disk content is indexed again and merged into the on-disk index.

Current Buffer Index Limitation

Content submitted via this API is currently not included in narrowBy (facet search) results.