Step by step
From DBSight Full-Text Search Engine/Platform Wiki
| Table of contents |
Background
What is Freedb
For those who don't already know, freedb is "a database to look up CD information using the internet. This is done by a client that calculates a (nearly) unique disc ID and then queries the database. As a result, the client displays the artist, CD-title, tracklist and some additional infos."
The freedb data can be downloaded in a tar file, which includes about 1.7 million files, each containing the information of an individual CD.
The Hardware
| CPU | P3 450MHz |
| Memory | 256M |
| Hard disk | 20G |
Preparing Data
Get Data
I downloaded the tar file from freedb, ran a preprocessing Perl script to merge all included files to 2 big flat files, then bulk loaded the data files into each of the 2 MySQL tables respectively.
Database Schema
CREATE TABLE albums (
dir CHAR(10) NOT NULL,
disc_id CHAR(8) NOT NULL,
artist_name VARCHAR(255) DEFAULT ,
tracks TINYINT UNSIGNED DEFAULT 0,
album_title VARCHAR(255) DEFAULT ,
year CHAR(4),
genre VARCHAR(255),
duration MEDIUMINT UNSIGNED DEFAULT 0,
description TEXT,
last_modify_time INT UNSIGNED DEFAULT 0
);
CREATE TABLE tracks (
dir CHAR(10) NOT NULL,
disc_id CHAR(8) NOT NULL,
track_no TINYINT UNSIGNED NOT NULL DEFAULT 0,
track_title VARCHAR(255) DEFAULT ,
duration MEDIUMINT UNSIGNED DEFAULT 0,
description TEXT
);
- Indexes (including primary key) were not created for the sake of loading performance.
- In order to support multi-language, please run this command "SET NAMES 'utf8';" in MySQL.
- It took 12 minutes to import 1.24G data.
Selecting Data
Upgrade JDBC driver
The JDBC driver for MySQL shipped by DBSight is MySQL JDBC 2.0.14. It works on MySQL 4.0 or lower versions. The recent license prevents DBSight from shipping the latest version. So I did the following:
- Download one from http://www.mysql.com/downloads/api-jdbc.html
- Unzip the file, put mysql-connector-java-3.1.8-bin.jar under WEB-INF/lib/ext/jdbc/mysql/
- Restart the Java application server
Create Index
The index name is "freedb", which should be unique and can not be changed later.
Then click on "Configure JDBC Connection".
Configure JDBC Connection
Choose "Save", then click on "Configure SQL Queries to get documents".
Configure Main SQL Query - Step 1
Now comes a little challenge. A SQL query is needed to retrieve the documents (one document per album). It's recommended to have the primary key and modified date column, and order results by the modified date column in descending order. So I constructed this query:
SELECT CONCAT(dir, disc_id) AS id,
dir, disc_id,
artist_name, tracks, album_title, year, genre, duration, description,
last_modify_time
FROM albums
ORDER BY last_modify_time DESC
An index on the last_modify_time column could be created in order to improve the SELECT performance.
ALTER TABLE albums ADD INDEX(last_modify_time); #1707313 rows affected (1 min 24.99 sec)
Since "albums" is a large table, building index took a while.
Then click on "Automatically Generate Result Columns".
Configure Main SQL Query - Step 2
Scrolled down to step 2, configuring index field types for each column.
| column name | original field type | chose field type |
| id | Text | Keyword |
| dir | Text | UnIndexed |
| disc_id | Text | UnIndexed |
| year | Text | Keyword |
| genre | Text | Keyword |
Click on "Save". Choose "Create a new Subsequent SQL Query" in the message box.
Configure Subsequent SQL Query
Read the instructions in the SQL textbox and replace them with your query. This step is similar to that of Configure Main SQL Query, except that:
- Subsequent query only retrieves information of one document.
- Subsequent query can reference parameters from the main query and previous subseqent queries.
- Subsequent query results can be cached, which is useful for list-of-value (LOV) fields.
In this case, the subsequent query is to retrieve all tracks data of the album identified by "disc_id".
- I first concatenate track's title and description columns into a single "content" column, then "vertically concatenate" the "content" column of all tracks in each album.
- For the Index Field Type, I use the default value "Text".
This time, I follow the "Setup Wizard" on the upper right corner, to "Select a display template".
Displaying Data
Select a Scaffold
I chose the "Search Engine" scaffold.
Configure Template
Remember you can always change the values later. Changes I have made:
- Included in Front Page: Uncheck it because I don't have other indexes to search.
- Page Title: Typed in "Search freedb".
- URL Prefix: http://www.freedb.org/freedb_search_fmt.php?cat=rock&id= (I left the category "rock" there to change it later.)
- ID Column: Selected "disc_id".
- Title Column: Selected "album_title", chose to escape HTML, and need highlighting if there are words matching the query.
- Summary Column: Selected "content", chose to escape HTML, and need summarizing the related context matching the query.
- Narrow Search Results: This is a very useful feature. Make sure it's selected.
- Recent Searches: Search history stored in cookies. Select it.
- Spell Checker: a useful feature. Select it.
- Syndicate: RSS/Atom feed for your search. Not really useful for freedb. I selected it for demo purpose only.
- Results Per Page: a nice feature. Select it.
I named the template as "free".
Modify the Template
Remember the URL Prefix that I left to make changes later? Now is the time.
First click the "free" template in the template listing page.
There is a list of files generated from the example template "Search Engine".
- "main.ftl" is, as its name suggests, the main template that includes other component templates.
- "documents.ftl" is the most important component template, which displays the search results.
Then click on "document.ftl". Now I see the Freemarker code to generate the HTML for all the matching documents. The link for each document is also here.
There is a "Column Wizard" to help you generate some Velocity code. I selected "dir" and got this code:
${doc.get('dir')}
Each document is represented by the object "doc", and "dir" is one of its attributes.
Then I found "rock" in the URL, and replaced it with ${doc.get('dir')}
Indexing Data
Build Index
Now everything is ready. Let's schedule to get the data from database.
But wait! The data is 1.2G, but the default maximal index size is only 500M. So I went to "Data Source"->"Advanced Settings", and increased the maximal index size to 20G. Please keep in mind that the disk space should be at least 3~4 times bigger than the index size.
"Indexing Memory Limit" is useful if you have a small memory machine like mine. I set it to 64M.
The data won't change often. So I just need a one-time job. I went to the Dashboard, chose "Full Indexing", and hit "Go".
Cool! It's running!
Let's wait for the computer to do the hard work.
Fine Tuning
Do more tweaking with the template
The DBSight's scaffolding power is that it can quickly create a template, yet you can freely to customize the template. Or you can choose some partial scaffolding to apply a different UI for some components.
