Select from Database

From DBSight Full-Text Search Engine/Platform Wiki

Main Query: Order content by modified date

It's strongly recommended that you order the query results by the modified date, in descending order, like:

 select article_id, modified_date, category_id, sub_category_id, content
 from   articles
 order by 2 desc

Only with this ordering can DBSight quickly determine which documents have been modified since the last indexing. Incremental Indexing stops as soon as it finds an already-indexed modified date.
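The stop-at-the-first-seen-date behavior can be sketched as follows. This is a hypothetical illustration using Python's built-in sqlite3 module, not DBSight's actual implementation; the table and column names follow the example above.

```python
import sqlite3

# Hypothetical sketch of the incremental-indexing loop: walk the main
# query in descending modified_date order and stop at the first row
# whose modified_date is already covered by the index.
def incremental_rows(conn, last_indexed_date):
    cur = conn.execute(
        "select article_id, modified_date, content "
        "from articles order by 2 desc")
    fresh = []
    for article_id, modified_date, content in cur:
        if modified_date <= last_indexed_date:
            break  # everything from here on is already indexed
        fresh.append((article_id, modified_date, content))
    return fresh

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table articles (article_id integer, modified_date text, content text)")
conn.executemany("insert into articles values (?, ?, ?)", [
    (1, "2024-01-01", "old"),
    (2, "2024-01-05", "newer"),
    (3, "2024-01-09", "newest"),
])
rows = incremental_rows(conn, "2024-01-01")
```

Only articles 3 and 2 are returned; the loop never even reads past the first already-indexed date.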

It's strongly recommended that you select both the Primary Key Column and the Modified Date Column from the tables, although they are not required.

Primary Key column

If you specify a Primary Key Column, documents with the same primary key value will be found and the older content will be updated.

What if the Primary Key column consists of several actual columns?

If you want to use a combination of columns as the primary key, you can simply build the primary key in SQL:

 select concat(category, article_id) as the_primary_key, modified_date, ...
 from   articles
 order by 2 desc

You don't need to change your database tables or add any functional index for this "Primary Key". It is only used by DBSight to uniquely identify each document; it is never used to query the database.
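A minimal sketch of the same idea, using Python's built-in sqlite3 (sqlite concatenates with `||` where MySQL uses concat()); the table and values here are made up for illustration:

```python
import sqlite3

# Two rows share article_id 7 but differ by category; the concatenated
# synthetic key still identifies each document uniquely.
conn = sqlite3.connect(":memory:")
conn.execute("create table articles (category text, article_id integer)")
conn.executemany("insert into articles values (?, ?)",
                 [("tech", 7), ("life", 7)])
keys = [row[0] for row in conn.execute(
    "select category || article_id as the_primary_key from articles")]
```

Both keys are distinct, even though neither column is unique on its own.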

Modified Date column

If you specify a Modified Date Column, Incremental Indexing can quickly find out which documents have been modified and retrieve only those documents.

The Modified Date Column can be a Date/Timestamp, or an incrementing integer. So if your content is never modified and article_id is incrementing, you can even use:

 select article_id, content
 from   articles
 order by 1 desc

Main Query Tricks

Problem: If you have a lot of data, say several million rows, sorting on modified_date can take a long time. On some databases, especially MySQL, not only does the sort take a long time, but it is repeated every time DBSight paginates through the result set.

Solution: here is a trick learned from a customer.

  1. Create a MySQL MyISAM table
  2. Insert ordered data into it
  3. Just "select * from the_table" during indexing.

For example, if you have a table A with a timestamp column "TS", before indexing you can rebuild a pre-sorted MyISAM staging table from it, for instance:

 create table A_sorted engine=MyISAM as
 select * from A order by TS desc

Then simply use this as the Main Query:

 select * from A_sorted

The staging table should be a MyISAM one, which means it does not support foreign keys, but it is very fast to build.
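The three steps above can be sketched end to end, here with Python's built-in sqlite3 standing in for MySQL. sqlite has no storage engines, so there is no ENGINE=MyISAM clause, and the staging table name A_sorted is an assumption:

```python
import sqlite3

# Hypothetical end-to-end sketch of the staging-table trick.
conn = sqlite3.connect(":memory:")
conn.execute("create table A (id integer, TS text, content text)")
conn.executemany("insert into A values (?, ?, ?)", [
    (1, "2024-01-03", "b"),
    (2, "2024-01-09", "c"),
    (3, "2024-01-01", "a"),
])

# Steps 1 + 2: build the scratch table with rows already inserted in
# descending TS order (in MySQL this would carry ENGINE=MyISAM).
conn.execute("create table A_sorted as select * from A order by TS desc")

# Step 3: the Main Query is then a plain select with no per-page sort.
rows = list(conn.execute("select id from A_sorted"))
```

The indexer pays the sorting cost once, at staging-table build time, instead of on every page of the result set.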

Subsequent Queries

If you have some one-to-many associations, you can use Subsequent Queries.

In the above example, if one article has many comments, you can get the comments in Subsequent Queries:

 select comment, comment_date
 from comments
 where article_id = ?
 order by 2 asc

And choose the parameter to be article_id from the main query.

Vertical Concatenation

Usually you want all comments related to the article to be saved with it, so that they are indexed as one big document. To do so, make sure the checkbox for Concatenate is selected, so that the comments from all rows are combined together.
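A hypothetical sketch of a Subsequent Query with vertical concatenation, using Python's built-in sqlite3; joining with a single space is an assumption, as DBSight's actual separator may differ:

```python
import sqlite3

# One article has many comments; fetch them with the parameterized
# subsequent query and combine the rows into one text blob.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table comments (article_id integer, comment text, comment_date text)")
conn.executemany("insert into comments values (?, ?, ?)", [
    (1, "first!", "2024-01-01"),
    (1, "nice article", "2024-01-02"),
    (2, "meh", "2024-01-03"),
])

def comments_for(article_id):
    cur = conn.execute(
        "select comment, comment_date from comments "
        "where article_id = ? order by 2 asc", (article_id,))
    # Concatenate: all rows become one value on the indexed document.
    return " ".join(comment for comment, _ in cur)

blob = comments_for(1)
```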

Cache List of Values

If you have a list of values to look up, you don't need complicated joins or outer joins. In the example above, to select the category name, you can use this SQL:

 select category_name
 from categories
 where category_id = ?

And choose the parameter to be category_id from the main query.

Make sure the checkbox for Cache is selected.
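The Cache option behaves roughly like a memoized lookup. Here is a sketch of the idea with Python's built-in sqlite3 and functools.lru_cache; this is an illustration of the concept, not DBSight's actual cache implementation:

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("create table categories (category_id integer, category_name text)")
conn.executemany("insert into categories values (?, ?)",
                 [(1, "news"), (2, "sports")])

# Each distinct category_id hits the database once; repeats are served
# from memory instead of re-running the subsequent query.
@lru_cache(maxsize=None)
def category_name(category_id):
    row = conn.execute(
        "select category_name from categories where category_id = ?",
        (category_id,)).fetchone()
    return row[0] if row else None

names = [category_name(cid) for cid in (1, 2, 1, 1, 2)]
```

Five documents trigger only two database lookups, which is exactly why caching pays off when the value list is small.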

Batch Subsequent Queries (since 2.0beta)

The performance problem to solve

When subsequent queries are not batched, suppose there are 1 Main Query and 1 Subsequent Query. To create the index from scratch, the total number of JDBC select executions would be 1+N, where N is the total number of documents.

The solution

To resolve this, we batch the subsequent queries. With a batch size of b, the total number of JDBC calls becomes 1+N/b. For example, if there are 1 million records and the batch size b is 200, the total number of JDBC calls is 1+1000000/200=5001, saving 1000000*(1-1/200)=995000 JDBC calls.
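The arithmetic above can be checked with a few lines of Python (rounding the last partial batch up to a full call is an assumption):

```python
import math

# One Main Query call, plus one batched Subsequent Query call per
# batch of batch_size documents.
def jdbc_calls(n_docs, batch_size):
    return 1 + math.ceil(n_docs / batch_size)

total = jdbc_calls(1_000_000, 200)   # batched
unbatched = 1 + 1_000_000            # one subsequent call per document
saved = unbatched - total
```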

How to write Batched Subsequent Queries

1. First, to batch subsequent queries, the SQL should look like:

select field1, field2, ..., fieldx, ...
from a_table
where fieldx IN ( ? )

During execution, fieldx IN (?) will be expanded to fieldx IN (?,?,?,?,?,...), with one placeholder per value in the batch.

2. Second, fieldx must be included in the selected field list. When DBSight gets the results, it needs to assign each row back to the original records, so fieldx needs to be in the results.

Also, fieldx must be a single-valued column, not a multi-valued column.
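A hypothetical sketch of how the IN (?) expansion and the re-grouping of rows by fieldx could work, using Python's built-in sqlite3 (not DBSight's actual code):

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("create table a_table (fieldx integer, field1 text)")
conn.executemany("insert into a_table values (?, ?)", [
    (1, "x"), (1, "y"), (2, "z"), (3, "w"),
])

# Expand "IN (?)" into one placeholder per key, run a single query for
# the whole batch, then group the rows back to their originating
# documents using fieldx -- which is why fieldx must be selected.
def batch_lookup(keys):
    placeholders = ",".join("?" * len(keys))
    sql = f"select fieldx, field1 from a_table where fieldx in ({placeholders})"
    by_key = defaultdict(list)
    for fieldx, field1 in conn.execute(sql, keys):
        by_key[fieldx].append(field1)
    return by_key

result = batch_lookup([1, 3])
```

One JDBC-style call serves the whole batch, and each document still receives only its own rows.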

Synchronize Deleted Documents

Soft-Deleted Documents Queries

If you soft-delete a document, i.e. flag it as deleted in the database instead of removing the row, you can use this query to find out more efficiently which documents should be excluded from the search.

In the above example, if you have a flag column "is_deleted", you can get the primary key column of the deleted documents by:

 select article_id
 from   articles
 where is_deleted = 1

This query can only have one column selected.

Queries for Hard-Deleted Documents

The SQL is the exact opposite of the "Soft-Deleted Documents Queries": you select all the primary keys that should remain in the search.

In the above example, you can get all the primary key values of the existing documents by:

 select article_id
 from   articles

This query can only have one column selected.

This could take some time, even during incremental indexing, if the total number of documents is huge, because DBSight needs to collect all the existing primary keys to find out which ones have already been deleted from your database. It's not especially efficient, but it's quick enough for most situations.
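The key-comparison step amounts to a set difference. A sketch in plain Python, with made-up key sets for illustration:

```python
# Keys still present in the database (from the hard-delete query) vs.
# keys currently in the index; anything only in the index was deleted
# from the database and should be removed from the search.
keys_in_db = {1, 2, 4, 5}          # e.g. select article_id from articles
keys_in_index = {1, 2, 3, 4, 5}    # collected from the existing index
to_remove = keys_in_index - keys_in_db
```

Collecting keys_in_index is what makes this pass expensive for huge indexes, as the text above notes.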

How does incremental indexing work?

Normal Main-Query-Based Incremental Indexing

Usually, for incremental indexing, DBSight applies the same Main Query, ordered by descending modified date, and stops processing when it finds a record that's already in the index.

Without any special hint, the JDBC driver may retrieve all the records for processing, which can take a long time. So you can set the fetch size (in the advanced options page) to ask the JDBC driver to return rows as soon as it has found enough rows for the fetch size. Some JDBC drivers, such as MySQL's, don't follow this fetch size hint well, so DBSight has MySQL-specific code for that.

The "Incremental Indexing" will stop as soon as it sees an already included "modified time".

The "Full Indexing" will skip modified times that are already included and run to the end. If any record has a duplicated primary key, that record will be updated.

Alternative Incremental Indexing

In the "Advanced SQL" section, you can create an alternative incremental indexing SQL. It can have one or several placeholders for input parameters. Each parameter is filled in as a timestamp during execution, with a value equal to the latest modified_date in the existing index.

The incremental SQL has a big advantage over Main-Query-based incremental indexing: it does not require the results to be ordered by the last modified_date. This can make a big performance difference if there are many results and sorting takes time on the database side. Of course, in most cases, since we usually query with

last_modified_date > ?

it's better to have an index on last_modified_date anyway, so sorting may not matter much. But removing the ordering requirement surely reduces the load on the database, and it's much faster in most cases.
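A minimal sketch of the alternative incremental query, using Python's built-in sqlite3. In real use the parameter value would come from the existing index; here it is hard-coded for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table articles (article_id integer, last_modified_date text)")
conn.executemany("insert into articles values (?, ?)", [
    (1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09"),
])

# Fill the placeholder with the newest modified date already indexed;
# note there is no order-by, so the database never sorts the results.
last_indexed = "2024-01-05"  # max(last_modified_date) in the existing index
fresh = conn.execute(
    "select article_id from articles where last_modified_date > ?",
    (last_indexed,)).fetchall()
```

Only article 3 comes back, and with an index on last_modified_date this stays cheap no matter how large the table grows.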

Finding Updated Rows

Here is a common question:

If there is a Main Query and 3 Subsequent Queries, which means:
  Main table
  child table 1
  child table 2
  child table 3
Say, for example, even if one of the child tables gets modified, the index
needs to be refreshed; in that case, how does the incremental indexing work?

The most efficient way would be to use the advanced SQLs, with this SQL:

select * from main_table where primary_key in (
 select primary_key from main_table where updated_at > ?
 union all
 select primary_key from child_table1 where updated_at > ?
 union all
 select primary_key from child_table2 where updated_at > ?
 union all
 select primary_key from child_table3 where updated_at > ?
)

This should work very well when incremental indexing is executed periodically.

However, this approach has trouble detecting rows deleted from the child tables.

To really resolve the issue, it's best to update the modified_date column in the main table whenever child tables are updated, deleted, or inserted into. This is the most efficient way to find updated rows.
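One way to implement this is with database triggers on the child tables. A hypothetical sketch using Python's built-in sqlite3; MySQL trigger syntax differs slightly, and only the INSERT case is shown (similar triggers would cover UPDATE and DELETE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table articles (article_id integer primary key, modified_date text);
create table comments (comment_id integer primary key,
                       article_id integer, comment text, created_at text);

-- Bump the parent row's modified_date whenever a child row is
-- inserted, so Main-Query-based incremental indexing picks the
-- article up again on the next run.
create trigger comments_touch_article after insert on comments
begin
  update articles set modified_date = new.created_at
  where article_id = new.article_id;
end;
""")
conn.execute("insert into articles values (1, '2024-01-01')")
conn.execute("insert into comments values (1, 1, 'hi', '2024-02-01')")
stamp = conn.execute(
    "select modified_date from articles where article_id = 1").fetchone()[0]
```

Inserting the comment moves the article's modified_date forward, so the standard incremental Main Query finds it with no union over child tables.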

Performance Tuning Query

Avoid Order By in Main Query with fresh index

The "order by modified_date desc" in the Main Query is mostly a requirement for incremental indexing. So if you are creating an index from scratch, or always use "Re-create whole index", you can omit the order-by clause to speed up the Main Query.

You can use "Advanced Query" to have an incremental-specific Main Query, or manually add the order-by clause back once you are done with the fresh index creation.

SQL Server Locking

By default, SQL Server takes shared locks on any rows read by a query or stored procedure. If the Main Query is long-running, it can prevent other users from UPDATEing or DELETEing rows in a timely manner. The easiest way to turn off locking is to use the NOLOCK table hint as part of the query. For example:

select * from tablex with (NOLOCK)

Of course, "select *" is not a recommended approach for real production level code. :)

Performance Tuning Examples

Forum Messages

If you are thinking of selecting data like this:

select subject,body,p.mem,login,forumtitle
from pgd_messages p inner join pgd_members m on m.mem=p.mem
inner join pgd_forums f on f.forumid=p.forumid

Depending on the data volume, you could possibly speed things up with this approach:

Main Query: Get All Messages,
 select threadid,subject,body,p.mem, p.forumid  from pgd_messages p
Subsequent query 1: Set p.mem as the input variable, enable batch join
 select m.login from pgd_members m where m.mem in (?)
Subsequent query 2: Set p.forumid as the input variable, enable caching since there are not many forums.
 select f.forumtitle from pgd_forums f where f.forumid= ?

By using batch joins and caching, the Main Query becomes much simpler and should be processed much faster. The subsequent joins are well cached and faster too.