Step by step

From DBSight Full-Text Search Engine/Platform Wiki

Table of contents

Background

What is Freedb

For those who don't already know, freedb is "a database to look up CD information using the internet. This is done by a client that calculates a (nearly) unique disc ID and then queries the database. As a result, the client displays the artist, CD-title, tracklist and some additional infos."

The freedb data can be downloaded in a tar file, which includes about 1.7 million files, each containing the information of an individual CD.

The Hardware

CPU P3 450MHz
Memory 256M
Hard disk 20G

Preparing Data

Get Data

I downloaded the tar file from freedb, ran a preprocessing Perl script to merge all included files to 2 big flat files, then bulk loaded the data files into each of the 2 MySQL tables respectively.

Database Schema

   CREATE TABLE albums (
     dir CHAR(10) NOT NULL,
     disc_id CHAR(8) NOT NULL,
     artist_name VARCHAR(255) DEFAULT ,
     tracks TINYINT UNSIGNED DEFAULT 0,
     album_title VARCHAR(255) DEFAULT ,
     year CHAR(4),
     genre VARCHAR(255),
     duration MEDIUMINT UNSIGNED DEFAULT 0,
     description TEXT,
     last_modify_time INT UNSIGNED DEFAULT 0
   );
   CREATE TABLE tracks (
     dir CHAR(10) NOT NULL,
     disc_id CHAR(8) NOT NULL,
     track_no TINYINT UNSIGNED NOT NULL DEFAULT 0,
     track_title VARCHAR(255) DEFAULT ,
     duration MEDIUMINT UNSIGNED DEFAULT 0,
     description TEXT
   );
  • Indexes (including primary key) were not created for the sake of loading performance.
  • In order to support multi-language, please run this command "SET NAMES 'utf8';" in MySQL.
  • It took 12 minutes to import 1.24G data.

Selecting Data

Upgrade JDBC driver

The JDBC driver for MySQL shipped by DBSight is MySQL JDBC 2.0.14. It works on MySQL 4.0 or lower versions. The recent license prevents DBSight from shipping the latest version. So I did the following:

Create Index

The index name is "freedb", which should be unique and can not be changed later.

Then click on "Configure JDBC Connection".

Configure JDBC Connection

Choose "Save", then click on "Configure SQL Queries to get documents".

Configure Main SQL Query - Step 1

Now comes a little challenge. A SQL query is needed to retrieve the documents (one document per album). It's recommended to have the primary key and modified date column, and order results by the modified date column in descending order. So I constructed this query:

   SELECT CONCAT(dir, disc_id) AS id,
          dir, disc_id,
          artist_name, tracks, album_title, year, genre, duration, description,
          last_modify_time
   FROM   albums
   ORDER  BY last_modify_time DESC

An index on the last_modify_time column could be created in order to improve the SELECT performance.

   ALTER TABLE albums ADD INDEX(last_modify_time);
   #1707313 rows affected (1 min 24.99 sec)

Since "albums" is a large table, building index took a while.

Then click on "Automatically Generate Result Columns".

Configure Main SQL Query - Step 2

Scrolled down to step 2, configuring index field types for each column.

column name original field type chose field type
id Text Keyword
dir Text UnIndexed
disc_id Text UnIndexed
year Text Keyword
genre Text Keyword

Click on "Save". Choose "Create a new Subsequent SQL Query" in the message box.

Configure Subsequent SQL Query

Read the instructions in the SQL textbox and replace them with your query. This step is similar to that of Configure Main SQL Query, except that:

  • Subsequent query only retrieves information of one document.
  • Subsequent query can reference parameters from the main query and previous subseqent queries.
  • Subsequent query results can be cached, which is useful for list-of-value (LOV) fields.

In this case, the subsequent query is to retrieve all tracks data of the album identified by "disc_id".

  • I first concatenate track's title and description columns into a single "content" column, then "vertically concatenate" the "content" column of all tracks in each album.
  • For the Index Field Type, I use the default value "Text".

This time, I follow the "Setup Wizard" on the upper right corner, to "Select a display template".

Displaying Data

Select a Scaffold

I chose the "Search Engine" scaffold.

Configure Template

Remember you can always change the values later. Changes I have made:

  1. Included in Front Page: Uncheck it because I don't have other indexes to search.
  2. Page Title: Typed in "Search freedb".
  3. URL Prefix: http://www.freedb.org/freedb_search_fmt.php?cat=rock&id= (I left the category "rock" there to change it later.)
  4. ID Column: Selected "disc_id".
  5. Title Column: Selected "album_title", chose to escape HTML, and need highlighting if there are words matching the query.
  6. Summary Column: Selected "content", chose to escape HTML, and need summarizing the related context matching the query.
  7. Narrow Search Results: This is a very useful feature. Make sure it's selected.
  8. Recent Searches: Search history stored in cookies. Select it.
  9. Spell Checker: a useful feature. Select it.
  10. Syndicate: RSS/Atom feed for your search. Not really useful for freedb. I selected it for demo purpose only.
  11. Results Per Page: a nice feature. Select it.

I named the template as "free".

Modify the Template

Remember the URL Prefix that I left to make changes later? Now is the time.

First click the "free" template in the template listing page.

There is a list of files generated from the example template "Search Engine".

  • "main.ftl" is, as its name suggests, the main template that includes other component templates.
  • "documents.ftl" is the most important component template, which displays the search results.

Then click on "document.ftl". Now I see the Freemarker code to generate the HTML for all the matching documents. The link for each document is also here.

There is a "Column Wizard" to help you generate some Velocity code. I selected "dir" and got this code:

   ${doc.get('dir')}

Each document is represented by the object "doc", and "dir" is one of its attributes.

Then I found "rock" in the URL, and replaced it with ${doc.get('dir')}

Indexing Data

Build Index

Now everything is ready. Let's schedule to get the data from database.

But wait! The data is 1.2G, but the default maximal index size is only 500M. So I went to "Data Source"->"Advanced Settings", and increased the maximal index size to 20G. Please keep in mind that the disk space should be at least 3~4 times bigger than the index size.

"Indexing Memory Limit" is useful if you have a small memory machine like mine. I set it to 64M.

The data won't change often. So I just need a one-time job. I went to the Dashboard, chose "Full Indexing", and hit "Go".

Cool! It's running!

Let's wait for the computer to do the hard work.

Fine Tuning

Do more tweaking with the template

The DBSight's scaffolding power is that it can quickly create a template, yet you can freely to customize the template. Or you can choose some partial scaffolding to apply a different UI for some components.

See Results

Go to http://search.dbsight.com/