#include <boost/program_options.hpp>
#include <boost/format.hpp>
#include <lshkit.h>
#include <lshkit/tune.h>
Assume you have a sample data file sample.data. You need to take the following two steps for parameter tuning:

Step 1. Run fitdata to gather statistics from the sample data:

fitdata -D sample.data
The following options are acceptable:
Allowed options:
  -h [ --help ]           produce help message.
  -N [ -- ] arg (=0)      # of points to use. If N = 0, the whole sample dataset is used.
  -P [ -- ] arg (=50000)  # pairs of points to sample
  -Q [ -- ] arg (=1000)   # queries to sample
  -K [ -- ] arg (=100)    Top K
  -F [ -- ] arg (=10)
  -D [ --data ] arg       data file
For example, we use the following command to generate statistics from audio.data:
fitdata -D audio.data
We get the following output:

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
3.28031 2.90649
0.831193 -0.160539 0.150609
0.825921 -0.186676 0.177095
Cut and paste the last three lines into a file named audio.param.
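If you want to consume these statistics programmatically, here is a minimal sketch that reads the three lines of audio.param into a struct. The 2/3/3 grouping mirrors the output above; the struct and its field names are hypothetical illustrations, not lshkit's own model classes.

#include <fstream>
#include <iostream>

// Hypothetical container for the fitted statistics; the field names are
// illustrative only -- lshkit defines its own model classes.
struct ModelParam {
    double stat[2];  // line 1: two global statistics
    double gM[3];    // line 2: three fitted coefficients
    double gG[3];    // line 3: three fitted coefficients
};

bool loadParam(const char *path, ModelParam *p) {
    std::ifstream in(path);
    if (!in) return false;
    for (int i = 0; i < 2; ++i) in >> p->stat[i];
    for (int i = 0; i < 3; ++i) in >> p->gM[i];
    for (int i = 0; i < 3; ++i) in >> p->gG[i];
    return bool(in);  // false if the file held fewer than 8 numbers
}

int main() {
    ModelParam p;
    if (!loadParam("audio.param", &p)) {
        std::cerr << "cannot read audio.param\n";
        return 1;
    }
    std::cout << "first statistic: " << p.stat[0] << "\n";  // 3.28031 for audio
    return 0;
}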
Step 2. Run mplsh-tune with the parameter file generated in step 1:

mplsh-tune -P audio.param -N 54387 -L 8 -T 20 -R 0.8 -K 50

  -P  the parameter file we generated in the first step
  -N  the size of the whole dataset; note that in the first step we may only use a subset of the data to fit the model
  -L  # of hash tables to use
  -T  # of bins probed in each hash table
  -R  required recall
  -K  K-NNs to find
mplsh-tune will output the best W and M that meet the required recall. For the audio data, we get the following output:
L = 8 T = 20 M = 20 W = 4.32 recall = 0.805276 cost = 0.0968405
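As a sanity check on what that cost figure means: if, as the [COST] line reported by mplsh-run below suggests, cost is the expected fraction of the dataset scanned per query, the prediction translates into an absolute number of candidate points like this (a back-of-the-envelope sketch under that assumption, not lshkit code):

#include <iostream>

int main() {
    const unsigned N = 54387;       // whole dataset size passed via -N
    const double cost = 0.0968405;  // predicted cost from mplsh-tune
    // Under the fraction-scanned interpretation, each query examines
    // roughly cost * N candidate points (about 5267 here).
    std::cout << "~" << (unsigned)(cost * N) << " points scanned per query\n";
    return 0;
}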
We can then run the benchmark and see how close the prediction is to the real numbers:
mplsh-run -D audio.data -B audio.query -L 8 -T 20 -W 4.32 -M 20 -Q 100 -K 50
Loading data...done.
Loading benchmark...done.
Initializing index... done.
Populating index...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
CREATE TIME: 3.19s
Running queries...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
QUERY TIME: 0.57s
[RECALL] 0.8192 +/- 0.234635
[COST] 0.115853 +/- 0.0980032
(Note that the audio dataset is a pretty hard one.)
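For reference, a recall figure like the one above can be computed from a benchmark in the standard way: for each query, count what fraction of the true K nearest neighbors the index returned, then report mean and deviation over all queries. The sketch below follows that textbook definition; we have not verified it against mplsh-run's source.

#include <cmath>
#include <cstdio>
#include <set>
#include <vector>

// truth[q] and found[q] hold the IDs of the true and the returned K-NNs
// of query q. Returns the mean recall over all queries; *stddev receives
// the standard deviation, matching the "mean +/- deviation" style above.
double meanRecall(const std::vector<std::vector<unsigned> > &truth,
                  const std::vector<std::vector<unsigned> > &found,
                  double *stddev) {
    std::vector<double> r;
    for (size_t q = 0; q < truth.size(); ++q) {
        std::set<unsigned> t(truth[q].begin(), truth[q].end());
        size_t hit = 0;
        for (unsigned id : found[q]) hit += t.count(id);
        r.push_back(double(hit) / truth[q].size());
    }
    double mean = 0.0, var = 0.0;
    for (double v : r) mean += v;
    mean /= r.size();
    for (double v : r) var += (v - mean) * (v - mean);
    *stddev = std::sqrt(var / r.size());
    return mean;
}

int main() {
    // Toy example: two queries, K = 3.
    std::vector<std::vector<unsigned> > truth = {{1, 2, 3}, {4, 5, 6}};
    std::vector<std::vector<unsigned> > found = {{1, 2, 9}, {4, 5, 6}};
    double dev;
    double mean = meanRecall(truth, found, &dev);
    std::printf("[RECALL] %g +/- %g\n", mean, dev);  // 0.833333 +/- 0.166667
    return 0;
}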
T is the number of buckets to probe in each hash table. A value in the range of 10 to 100 is usually fine. Larger T results in lower cost: even though more buckets are probed in each hash table, probing more buckets allows each hash bucket to be made smaller, and the overall effect is that fewer points have to be scanned to reach the required recall. However, our model doesn't account for the cost of generating the probe sequence, and when T is very large that cost can become significant. So in practice T should not be much larger than 100.
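For intuition, below is a heavily simplified sketch of query-directed probing for a single hash table, in the spirit of multi-probe LSH: besides the query's home bucket, we probe the buckets whose boundaries the query sits closest to. It considers single-coordinate perturbations only, so it illustrates the idea rather than reproducing lshkit's actual probe-sequence algorithm; the work of generating and ordering these candidate perturbations is exactly the per-query overhead that the tuning model ignores.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Probe {
    std::vector<int> bucket;  // perturbed hash values h_1..h_M of one bucket
    float score;              // distance from q to the crossed boundary, in units of W
};

// hashes holds the unrounded values f_j(q) = (a_j . q + b_j) / W for the
// M hash functions of one table; floor(f_j(q)) gives the home bucket.
// Returns the home bucket followed by the T-1 most promising
// single-coordinate perturbations.
std::vector<Probe> probeSequence(const std::vector<float> &hashes, unsigned T) {
    size_t M = hashes.size();
    std::vector<int> home(M);
    for (size_t j = 0; j < M; ++j) home[j] = (int)std::floor(hashes[j]);

    std::vector<Probe> cand;
    for (size_t j = 0; j < M; ++j) {
        float frac = hashes[j] - home[j];  // position inside the slot, in [0, 1)
        Probe lo = {home, frac};           // cross the lower boundary
        lo.bucket[j] -= 1;
        Probe hi = {home, 1.0f - frac};    // cross the upper boundary
        hi.bucket[j] += 1;
        cand.push_back(lo);
        cand.push_back(hi);
    }
    // Probe the buckets the query sits closest to first.
    std::sort(cand.begin(), cand.end(),
              [](const Probe &a, const Probe &b) { return a.score < b.score; });

    std::vector<Probe> seq;
    seq.push_back(Probe{home, 0.0f});
    for (size_t i = 0; i + 1 < T && i < cand.size(); ++i) seq.push_back(cand[i]);
    return seq;
}

int main() {
    std::vector<float> hashes = {2.1f, 5.9f, 3.5f};    // M = 3, one table
    for (const Probe &p : probeSequence(hashes, 4)) {  // T = 4 probes
        for (int h : p.bucket) std::printf("%d ", h);
        std::printf("(score %.2f)\n", p.score);
    }
    return 0;
}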
tune::Interval intervals[]

Initial value:

{{MIN_L, MAX_L + 1}, {MIN_T, MAX_T + 1}, {0, MAX_M - MIN_M + 1}, {0, NUM_W + 1}}
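intervals[] evidently spans the 4-D grid of candidate (L, T, M, W) settings searched by the tuner. As a rough illustration, the sketch below scans such a grid exhaustively for the cheapest setting that still meets the required recall. The bounds, the DELTA_W step, and the predictRecall/predictCost placeholders are hypothetical stand-ins, not lshkit's definitions.

#include <cstdio>

struct Interval { unsigned begin, end; };  // half-open range [begin, end)

// Illustrative bounds only; lshkit defines its own values.
const unsigned MIN_L = 1,  MAX_L = 20;
const unsigned MIN_T = 1,  MAX_T = 100;
const unsigned MIN_M = 1,  MAX_M = 20;
const unsigned NUM_W = 400;
const float    DELTA_W = 0.01f;  // hypothetical grid-index-to-W step

Interval intervals[] = {{MIN_L, MAX_L + 1},
                        {MIN_T, MAX_T + 1},
                        {0, MAX_M - MIN_M + 1},
                        {0, NUM_W + 1}};

// Placeholder predictors standing in for the performance model fitted in
// step 1; they are NOT lshkit functions.
float predictRecall(unsigned, unsigned, unsigned, float) { return 1.0f; }
float predictCost(unsigned, unsigned, unsigned, float) { return 1.0f; }

// Brute-force version of the search: find the cheapest (L, T, M, W)
// whose predicted recall still meets the requirement R.
void tuneGrid(float R) {
    float bestCost = 1e30f;
    unsigned bestL = 0, bestT = 0, bestM = 0;
    float bestW = 0.0f;
    for (unsigned l = intervals[0].begin; l < intervals[0].end; ++l)
    for (unsigned t = intervals[1].begin; t < intervals[1].end; ++t)
    for (unsigned m = intervals[2].begin; m < intervals[2].end; ++m)
    for (unsigned w = intervals[3].begin; w < intervals[3].end; ++w) {
        unsigned M = MIN_M + m;  // grid index -> actual M
        float W = w * DELTA_W;   // grid index -> actual W
        if (predictRecall(l, t, M, W) < R) continue;
        float c = predictCost(l, t, M, W);
        if (c < bestCost) {
            bestCost = c; bestL = l; bestT = t; bestM = M; bestW = W;
        }
    }
    std::printf("L = %u T = %u M = %u W = %g cost = %g\n",
                bestL, bestT, bestM, bestW, bestCost);
}

int main() { tuneGrid(0.8f); return 0; }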