tools/mplsh-tune.cpp File Reference

Automatic parameter tuning for MPLSH.

#include <boost/program_options.hpp>
#include <boost/format.hpp>
#include <lshkit.h>
#include <lshkit/tune.h>


Functions

double recall (const tune::Input &x)
double cost (const tune::Input &x)
bool constraint (const tune::Input &x)
int main (int argc, char *argv[])


Variables

tune::Interval intervals []
double target_recall
MultiProbeLshDataModel * model

Detailed Description

Automatic parameter tuning for MPLSH.

Assume you have a sample data file. You need to take the following two steps for parameter tuning:

1. Create a model of data distribution.

Use the following command

 fitdata -D

The following options are acceptable:

 Allowed options:
    -h [ --help ]            produce help message.
    -N [ -- ] arg (=0)     # of points to use.  if N = 0, then the whole sample dataset is used.
    -P [ -- ] arg (=50000) # pairs of points to sample
    -Q [ -- ] arg (=1000)  # queries to sample
    -K [ -- ] arg (=100)   Top K
    -F [ -- ] arg (=10)      
    -D [ --data ] arg      data file

For example, we use the following command to generate statistics from the audio dataset:

 fitdata -D

We get the following output

  0%   10   20   30   40   50   60   70   80   90   100%
  3.28031 2.90649
  0.831193        -0.160539       0.150609
  0.825921        -0.186676       0.177095

Cut and paste the last three lines into a file named audio.param.
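
The parameter file is plain whitespace-separated text: two numbers on the first line and three on each of the next two. As a hedged illustration (not lshkit's own loader), a minimal C++ check that the file holds the expected 2 + 3 + 3 = 8 values could look like this:

 #include <fstream>
 #include <iostream>
 #include <vector>

 int main() {
     std::ifstream in("audio.param");      // the three lines pasted above
     std::vector<double> values;
     double v;
     while (in >> v) values.push_back(v);  // read every whitespace-separated number
     std::cout << "read " << values.size() << " values\n";
     return values.size() == 8 ? 0 : 1;    // the layout above has 8 numbers
 }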

2. Use mplsh-tune to tune parameters.

There are four parameters in MPLSH: L, T, M, W. You choose L and T and let mplsh-tune find the optimal M and W. For example, we use the following command to tune M and W for the audio dataset.

 mplsh-tune -P audio.param -N 54387 -L 8 -T 20 -R 0.8 -K 50
  -P is the parameter file we generated in the 1st step;
  -N is the whole dataset size.  Note that in the first step we only use a subset of the data to generate the model.
  -L # of hash tables to use.
  -T # of buckets probed in each hash table.
  -R required recall.
  -K K-NNs to find.
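
The boost/program_options include on this page suggests these switches are ordinary program_options flags. The sketch below shows how such a set of options is typically declared; it is illustrative only (the long names param, size, recall, and topk are assumptions, not taken from mplsh-tune's source):

 #include <boost/program_options.hpp>
 #include <iostream>
 #include <string>

 namespace po = boost::program_options;

 int main(int argc, char *argv[]) {
     std::string param_file;
     unsigned N, L, T, K;
     double R;
     po::options_description desc("Allowed options");
     desc.add_options()
         ("help,h", "produce help message")
         ("param,P", po::value<std::string>(&param_file), "parameter file produced by fitdata")
         ("size,N", po::value<unsigned>(&N), "whole dataset size")
         ("L,L", po::value<unsigned>(&L), "# of hash tables")
         ("T,T", po::value<unsigned>(&T), "# of buckets probed in each hash table")
         ("recall,R", po::value<double>(&R), "required recall")
         ("topk,K", po::value<unsigned>(&K), "K-NNs to find");
     po::variables_map vm;
     po::store(po::parse_command_line(argc, argv, desc), vm);
     po::notify(vm);
     if (vm.count("help")) { std::cout << desc << std::endl; return 0; }
     // ... hand the parsed values to the tuning routine ...
     return 0;
 }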

mplsh-tune will output the best M and W that meet the required recall. For the audio data, we get the following output:

 L = 8   T = 20  M = 20  W = 4.32        recall = 0.805276       cost = 0.0968405

We can then run the benchmark and see how close the prediction is to the real numbers:

 mplsh-run -D -B audio.query -L 8 -T 20 -W 4.32 -M 20 -Q 100 -K 50

 Loading data...done.
 Loading benchmark...done.
 Initializing index...
 Populating index...
 0%   10   20   30   40   50   60   70   80   90   100%
 Running queries...
 0%   10   20   30   40   50   60   70   80   90   100%
 QUERY TIME: 0.57s
 [RECALL] 0.8192 +/- 0.234635
 [COST] 0.115853 +/- 0.0980032

(Note that the audio dataset is a pretty hard one.)

How to choose L and T?

L is the number of hash tables maintained in memory; generally, a larger L gives better performance (lower cost to reach a given recall). Hash tables store only pointers to the feature vectors, so on a 64-bit machine N points take 8N bytes per table plus some overhead. For our audio dataset of about 55K points, one hash table takes about 500KB of memory, and 8 hash tables take about 4MB.
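
As a quick sanity check of that arithmetic, here is a standalone sketch (assumptions: 8 bytes per pointer on a 64-bit machine, per-table overhead ignored, and roughly 55,000 points as in the audio dataset):

 #include <cstddef>
 #include <iostream>

 int main() {
     const std::size_t points = 55000;           // approximate size of the audio dataset
     const std::size_t tables = 8;               // L
     const std::size_t entry  = sizeof(void *);  // 8 bytes on a 64-bit machine
     const double per_table_kb = points * entry / 1024.0;                     // ~430 KB before overhead
     const double total_mb     = points * entry * tables / (1024.0 * 1024.0); // ~3.4 MB before overhead
     std::cout << "per table: " << per_table_kb << " KB, total: " << total_mb << " MB\n";
     return 0;
 }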

T is the number of buckets to probe in each hash table. A number from 10 to 100 is usually fine. A larger T results in lower cost: although more buckets are probed in each hash table, probing more buckets lets us make each bucket smaller, and the overall effect is that fewer points are scanned to reach the required recall. However, our model doesn't account for the cost of generating the probe sequence, and when T is very large that cost can be significant. So in practice T should not be much larger than 100.

Variable Documentation

tune::Interval intervals[]

Initial value:

 {{MIN_L, MAX_L + 1},
  {MIN_T, MAX_T + 1},
  {0, MAX_M - MIN_M + 1},
  {0, NUM_W + 1}}
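
These half-open [begin, end) intervals span the candidate values of L, T, M, and W (the last two as indices into discretized ranges); the recall, cost, and constraint functions declared above are evaluated over points of this grid. The sketch below is a simplified, self-contained illustration of such a constrained search, not lshkit's tune implementation: it uses plain std::vector instead of tune::Input, toy stand-ins for recall() and cost(), illustrative bounds instead of the MIN_/MAX_ constants, and inlines the constraint as recall(x) >= target_recall. (mplsh-tune itself fixes L and T from the command line and effectively searches only the M and W dimensions.)

 #include <iostream>
 #include <limits>
 #include <utility>
 #include <vector>

 // Toy stand-ins; lshkit derives the real recall() and cost() from the fitted data model.
 double recall(const std::vector<unsigned> &x) { return 0.05 * x[0] + 0.01 * x[1] + 0.02 * x[3]; }
 double cost(const std::vector<unsigned> &x)   { return 0.001 * (x[0] * x[1] + x[2] + x[3]); }

 int main() {
     const double target_recall = 0.8;
     // One half-open interval per tunable dimension: L, T, M, W (illustrative bounds).
     const std::pair<unsigned, unsigned> intervals[4] = {{1, 9}, {1, 21}, {5, 25}, {0, 17}};

     std::vector<unsigned> best;
     double best_cost = std::numeric_limits<double>::max();
     for (unsigned l = intervals[0].first; l < intervals[0].second; ++l)
       for (unsigned t = intervals[1].first; t < intervals[1].second; ++t)
         for (unsigned m = intervals[2].first; m < intervals[2].second; ++m)
           for (unsigned w = intervals[3].first; w < intervals[3].second; ++w) {
             std::vector<unsigned> x = {l, t, m, w};
             if (recall(x) < target_recall) continue;               // the constraint
             if (cost(x) < best_cost) { best_cost = cost(x); best = x; }
           }

     if (!best.empty())
         std::cout << "L=" << best[0] << " T=" << best[1] << " M=" << best[2]
                   << " W-index=" << best[3] << " cost=" << best_cost << "\n";
     return 0;
 }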
