User dictionary

From DBSight Full-Text Search Engine/Platform Wiki

Spell Checking

There are two ways to do spell checking.

Spell checking based on words from the index

  • If "Enable Index-Specific Spell Checking" is enabled, the words in the index itself are used as the dictionary.
  • Currently, you need to run "Re-build the spell check index" from the dashboard to re-create the dictionary. The reasoning is that once the index has grown fairly large, the set of words in it changes little between incremental indexing runs, so rebuilding the dictionary only occasionally is sufficient.

Spell checking based on User Dictionary

  • DBSight currently ships with a default English dictionary for spell checking. To use another language, delete everything in spell_check.txt and replace it with words from your language. The spell check dictionary file is:
WEB-INF/data/dictionary/spell_check.txt

It is a text file in which each line represents one valid word. When a user submits a query to DBSight, the spell checker looks up each query word in the dictionary. If there is a match, the word is treated as valid; otherwise the closest match is returned as a spelling suggestion.


  • You can update or replace the dictionary by editing spell_check.txt, which is especially useful if some of your keywords are not listed in the dictionary.
  • The user dictionary index is created or refreshed after an incremental or full indexing run, which checks the timestamp of spell_check.txt and re-creates the index if needed.
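
The lookup described above (exact match first, closest match otherwise) can be sketched in a few lines of Python. This is only an illustration: the source does not say which similarity measure DBSight uses, so the standard-library difflib module stands in for it, and the function names and the three-word dictionary are hypothetical.

```python
import difflib


def load_dictionary(lines):
    """Return the set of valid words, one per line, lower-cased."""
    return {line.strip().lower() for line in lines if line.strip()}


def check_query(query, dictionary):
    """Map each unknown query word to its closest dictionary word, or None."""
    candidates = sorted(dictionary)
    suggestions = {}
    for word in query.lower().split():
        if word in dictionary:
            continue  # exact match: treated as a valid word
        # ASSUMPTION: difflib similarity stands in for DBSight's own matching.
        matches = difflib.get_close_matches(word, candidates, n=1)
        suggestions[word] = matches[0] if matches else None
    return suggestions


# Stand-in for the contents of spell_check.txt (hypothetical sample words).
words = load_dictionary(["search", "database", "engine"])
result = check_query("serch the databse", words)
print(result)
```

With this tiny dictionary, "serch" and "databse" receive suggestions, while "the" has no sufficiently close match and maps to None. A real dictionary would contain common words as well.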

Stop Words and Synonyms

Stop Words

These are words that should be ignored during searching, usually common words such as "an" and "the". For example, if a user enters the query

search the database

results containing

search a database
search database
search the database

should be matched. What's more,

search the database

will be ranked higher than the other results (available since 1.5.4).

The file is:

WEB-INF/dictionary/stopwords.txt

Each line holds one word, and matching is case-insensitive.
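
The matching behaviour in the example above can be sketched as follows. This is a simplified illustration rather than DBSight's implementation: the function names are hypothetical, the stop word list is a small sample, and the ranking boost for the exact phrase is not modelled.

```python
def load_stop_words(lines):
    """Read one stop word per line, ignoring case and blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def strip_stop_words(text, stop_words):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


# Stand-in for the contents of stopwords.txt (hypothetical sample).
stop_words = load_stop_words(["a", "an", "the"])

query = strip_stop_words("search the database", stop_words)
documents = ["search a database", "search database", "Search THE database"]

# A document matches when its remaining terms equal the query's terms.
matched = [doc for doc in documents if strip_stop_words(doc, stop_words) == query]
print(query)
print(matched)
```

All three sample documents reduce to the same two terms as the query, so all three match.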

Synonyms

These are the words that are equivalent to each other. The file is:

WEB-INF/dictionary/synonyms.txt

Each line lists several equivalent words separated by spaces, and matching is case-insensitive.
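
As a rough sketch of how such a file could be interpreted, the Python below builds a lookup table from space-separated lines and expands a term into its synonym group. The sample lines and function names are hypothetical, and how DBSight applies the groups internally is not described here.

```python
def load_synonyms(lines):
    """Map every word to the full set of words sharing a line with it."""
    table = {}
    for line in lines:
        group = {word.lower() for word in line.split()}
        for word in group:
            table.setdefault(word, set()).update(group)
    return table


def expand(term, table):
    """Return the term together with its synonyms (case-insensitive)."""
    key = term.lower()
    return table.get(key, {key})


# Stand-in for the contents of synonyms.txt (hypothetical sample lines).
synonyms = load_synonyms(["car automobile auto", "Laptop notebook"])

print(expand("AUTO", synonyms))
print(expand("database", synonyms))
```

A term with no entry in the file expands only to itself.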

How to use it?

Stop words and synonyms are closely tied to Analyzers. Since each field can have a different Analyzer, each field can also choose whether stop words and synonyms are applied, by selecting the check box alongside the Analyzer selection.

Reserved Words

These are words that should not be "analyzed" by analyzers (available since 1.5.5 beta). The file is:

WEB-INF/dictionary/reserved.txt

For example, "C#" or "C++" should not be reduced to "C", which is what most analyzers would do.

Any field that enables "Synonyms and Stopwords" also gets these reserved words.
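
To illustrate why reserved words matter, here is a toy analyzer in Python that keeps only letters and digits, with and without a reserved word list. It is a deliberately simplified stand-in for a real Lucene analyzer; the function name and the tokenization rule are assumptions made for this sketch.

```python
import re


def analyze(text, reserved=frozenset()):
    """Toy analyzer: lower-case, split on whitespace, keep letters and digits.

    Tokens found in `reserved` bypass the analysis and are kept verbatim.
    """
    tokens = []
    for raw in text.lower().split():
        if raw in reserved:
            tokens.append(raw)  # reserved word: not analyzed
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", raw))
    return tokens


# Stand-in for the contents of reserved.txt.
reserved_words = {"c#", "c++"}

plain = analyze("C# and C++ developers")
protected = analyze("C# and C++ developers", reserved_words)
print(plain)      # both language names collapse to "c"
print(protected)  # "c#" and "c++" survive intact
```

Without the reserved list, both language names become indistinguishable from the letter "c".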

To search for these reserved words, you can either use a Lucene query directly, such as

 lq=fieldName:c#

or use a "phrase search" to keep the DBSight query parser from applying the analyzer to the query:

 q="c#"
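
If these parameters are sent as part of a URL query string (an assumption; the source does not show the full request), characters such as "#" and "+" need percent-encoding, because a literal "#" starts a URL fragment and a literal "+" is decoded as a space in form-encoded queries. A minimal Python sketch using the standard library:

```python
from urllib.parse import urlencode

# Lucene query against a field; "fieldName" is the placeholder used above.
lucene_query = urlencode({"lq": "fieldName:c#"})

# Phrase searches, with the double quotes included in the value.
phrase_csharp = urlencode({"q": '"c#"'})
phrase_cpp = urlencode({"q": '"c++"'})

print(lucene_query)   # lq=fieldName%3Ac%23
print(phrase_csharp)  # q=%22c%23%22
print(phrase_cpp)     # q=%22c%2B%2B%22
```

The encoded strings can then be appended to whatever search URL your DBSight installation exposes.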