Remote Index Replication

From DBSight Full-Text Search Engine/Platform Wiki


How To Setup

1. Assumption: Make sure the index definition is the same on both computers. One is the client, the other is the server.

You can download the index definition from one computer and upload it to the other to make sure they are the same.

2. Go to the client's "Schedule" tab, save the URL for

Subscribe: Subscribe to index generated by another server.

Note: You don't need to check "Enable Schedule for Indexing Action" or select "Subscribe: Subscribe to index generated by another server" at this point. You can schedule them later.

3. There is no step 3. You are already done. You can go to the dashboard and select "Fetch Subscribed Index".

How does it work?

Index Replication is a feature invented by DBSight. Basically, it copies the index over the network. It doesn't interrupt searching: once the copy is complete, it warms up the new index, switches on the new index, switches off the old index, and removes the old index.
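To make the order of those steps concrete, here is a toy model in plain Ruby. It is only a sketch of the sequence described above, not DBSight's implementation: the ToyIndex and ToySearcher classes and all of their methods are hypothetical names invented for this illustration.

```ruby
# Illustrative only: a toy model of the swap sequence, not DBSight's code.
class ToyIndex
  attr_reader :name, :state

  def initialize(name)
    @name = name
    @state = :new
  end

  def warm_up; @state = :warm; end
  def close;   @state = :closed; end
  def delete;  @state = :deleted; end

  def search(query)
    raise "index #{name} is not searchable" unless @state == :warm
    "#{name}: results for #{query}"
  end
end

class ToySearcher
  def initialize(index)
    index.warm_up
    @active = index
    @lock = Mutex.new
  end

  # Searches always go to whichever index is currently active.
  def search(query)
    @active.search(query)
  end

  # Install a freshly copied index without interrupting searches.
  def install(new_index)
    new_index.warm_up        # 1. warm up the new index
    old = nil
    @lock.synchronize do
      old = @active
      @active = new_index    # 2. switch on the new index
    end
    old.close                # 3. switch off the old index
    old.delete               # 4. remove the old index
    old
  end
end

old_index = ToyIndex.new("v1")
searcher = ToySearcher.new(old_index)
puts searcher.search("dbsight")
searcher.install(ToyIndex.new("v2"))
puts searcher.search("dbsight")
```

The point of the ordering is that the old index keeps serving searches until the new one is warm, so there is never a moment without a searchable index.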

Benchmark

In our benchmark, transmitting a 1.2G index took less than 2 minutes on a 100Mbit/s local network. Some simple math tells us the index was transmitted at about 85Mbit/s, which is close to the capacity of the link. So the bottleneck is actually the network.
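The "simple math" can be written out as a few lines of Ruby. This sketch assumes "1.2G" means 1.2 gigabytes with 1 gigabyte taken as 1000 megabytes, and it uses the full 2 minutes, so it gives a lower bound of roughly 80Mbit/s. A rate of 85Mbit/s corresponds to a transfer time of about 113 seconds, which fits "less than 2 minutes".

```ruby
# Lower bound on throughput implied by the benchmark figures above.
index_size_gb = 1.2      # index size, reading "1.2G" as 1.2 gigabytes (assumption)
max_seconds   = 120.0    # "less than 2 minutes"
link_mbit_s   = 100.0    # nominal capacity of the local network

index_size_mbit = index_size_gb * 1000 * 8         # gigabytes -> megabits
min_throughput  = index_size_mbit / max_seconds    # Mbit/s if it took the full 2 minutes
utilization     = min_throughput / link_mbit_s     # fraction of the link in use

puts "at least #{min_throughput.round(1)} Mbit/s (#{(utilization * 100).round}% of the link)"
```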

Is the distribution triggered by pull or push?

By default, it is pull: when a new index is built, the indexing server just waits for the other nodes to retrieve it.

But you can customize this by notifying the other nodes to retrieve the new index right away. There are 2 ways to do it:

Ping a URL

In "Data Source"->"Advanced Settings"->"Fetching Data"->"URL To Ping", set to

http://your_other_node_url:8080/dbsight/scheduleAJob.do?indexName=your_index_name&cmd=inMemory%20stopIndexing%20unlockStoppedIndex%20retrieveSubscription%20updateFromSubscribedIndex%20updateFromSubscribedSpellCheckIndex&text=Fetch%20Subscribed%20Index

Note: On the searching servers, you also need to allow access from the indexing server's IP. In "Data Source"->"Advanced Settings"->"Security"->"Allowed IP or Host Name List", enter the indexing server's IP.
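Because the cmd parameter is a list of commands separated by URL-encoded spaces, it is easy to drop a "%20" when editing the URL by hand. One way to avoid that is to build the URL from a plain list, as in the sketch below. The helper name fetch_subscribed_index_url is hypothetical, and the command list is assumed to be the space-separated list that appears in the URL above. ERB::Util.url_encode is from Ruby's standard library and encodes spaces as %20.

```ruby
require "erb"

# Build the scheduleAJob.do URL from a list of commands so that every
# space is percent-encoded as %20.
def fetch_subscribed_index_url(host, index_name, port = 8080)
  commands = %w[
    inMemory stopIndexing unlockStoppedIndex retrieveSubscription
    updateFromSubscribedIndex updateFromSubscribedSpellCheckIndex
  ]
  cmd  = ERB::Util.url_encode(commands.join(" "))
  text = ERB::Util.url_encode("Fetch Subscribed Index")
  "http://#{host}:#{port}/dbsight/scheduleAJob.do" \
    "?indexName=#{ERB::Util.url_encode(index_name)}&cmd=#{cmd}&text=#{text}"
end

puts fetch_subscribed_index_url("your_other_node_url", "your_index_name")
```

The printed URL can then be pasted into the "URL To Ping" setting.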

Use Ruby code

"Ping a URL" only works for one additional searching server. To notify more searching servers, you will need to use Ruby/JRuby. Just go through the servers and send the notification to each one in turn. For example,

require "java"
# HttpPost is the HTTP utility class that ships with DBSight
include_class "net.javacoding.xsearch.utility.HttpPost"
index_name = "your_index_name"
# Ask each searching server in turn to fetch the subscribed index
['192.168.1.100','192.168.1.101'].each{|client_server|
 HttpPost.new("http://#{client_server}:8080/dbsight/scheduleAJob.do?indexName=#{index_name}&cmd=inMemory%20stopIndexing%20unlockStoppedIndex%20retrieveSubscription%20updateFromSubscribedIndex%20updateFromSubscribedSpellCheckIndex&text=Fetch%20Subscribed%20Index").send
}

Then save this file as WEB-INF/scripts/notify_clients.rb, and add "notify_clients.rb" to the command in WEB-INF/conf/xsearch-web-config.xml:

 <commands>
  <command name="Incremental Indexing">... buildDictionaryIfNeeded ping-a-url notify_clients.rb</command>
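When there are several searching servers, it may be useful if one unreachable server does not stop the others from being notified. The sketch below shows one way to structure the loop for that. The notify_all helper is a hypothetical name, and the block stands in for the real HTTP call. Inside DBSight the block would send the request with HttpPost as in the script above. Whether Java exceptions raised by HttpPost are caught by "rescue StandardError" depends on the JRuby version, so treat that as an assumption to verify.

```ruby
# Call the block once per server and collect failures instead of
# stopping at the first one. Returns a hash of server => error message.
def notify_all(servers)
  failed = {}
  servers.each do |server|
    begin
      yield server
    rescue StandardError => e
      failed[server] = e.message
    end
  end
  failed
end

servers = ['192.168.1.100', '192.168.1.101']
failed = notify_all(servers) do |server|
  # Stand-in for the real HTTP call; simulates one unreachable node.
  raise "connection refused" if server == '192.168.1.100'
end
failed.each { |server, reason| puts "could not notify #{server}: #{reason}" }
```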

FAQ

1. If the index data reaches gigabytes in size, transferring it to the searching nodes via HTTP will reduce performance greatly. Do you have any performance data on this? Because the searching nodes should stay available and consistent for users, the distribution should be performed on all of them at the same time.

A: The transfer is not the bottleneck. It is done in the background and it won't affect current searching, because the transfer and the searching operate on two different indexes. The switch happens only when the transfer is finished, and then the old index is deleted.

2. In addition, before the transfer starts, the searching nodes should determine whether the transfer is really needed, to avoid unnecessary transfers. Do you have this kind of mechanism?

A: There is a threshold that can be set on the indexing node. If the threshold is not met, the indexing node won't publish the new index.