#include <boost/program_options.hpp>
#include <boost/format.hpp>
#include <lshkit.h>
#include <lshkit/tune.h>
Assume you have a sample data file sample.data. You need to take the following two steps for parameter tuning:

Step 1. Run fitdata to gather statistics from the sample data:

fitdata -D sample.data
The following options are acceptable:
Allowed options:
  -h [ --help ]           produce help message.
  -N [ -- ] arg (=0)      # of points to use. If N = 0, the whole sample dataset is used.
  -P [ -- ] arg (=50000)  # pairs of points to sample
  -Q [ -- ] arg (=1000)   # queries to sample
  -K [ -- ] arg (=100)    Top K
  -F [ -- ] arg (=10)
  -D [ --data ] arg       data file
For example, we use the following command to generate statistics from audio.data:
fitdata -D audio.data
We get the following output:

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
3.28031 2.90649
0.831193 -0.160539 0.150609
0.825921 -0.186676 0.177095
Cut and paste the last three lines into a file named audio.param.
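If you want to consume these statistics programmatically, here is a minimal sketch that reads the three lines of audio.param into a struct. The 2/3/3 grouping mirrors the output above; the struct and its field names are hypothetical illustrations, not lshkit's own model classes.

#include <fstream>
#include <iostream>

// Hypothetical container for the fitted statistics; the field names are
// illustrative only -- lshkit defines its own model classes.
struct ModelParam {
    double stat[2];  // line 1: two global statistics
    double gM[3];    // line 2: three fitted coefficients
    double gG[3];    // line 3: three fitted coefficients
};

bool loadParam(const char *path, ModelParam *p) {
    std::ifstream in(path);
    if (!in) return false;
    for (int i = 0; i < 2; ++i) in >> p->stat[i];
    for (int i = 0; i < 3; ++i) in >> p->gM[i];
    for (int i = 0; i < 3; ++i) in >> p->gG[i];
    return bool(in);  // false if the file held fewer than 8 numbers
}

int main() {
    ModelParam p;
    if (!loadParam("audio.param", &p)) {
        std::cerr << "cannot read audio.param\n";
        return 1;
    }
    std::cout << "first statistic: " << p.stat[0] << "\n";  // 3.28031 for audio
    return 0;
}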
Step 2. Run mplsh-tune with the parameter file generated in step 1:

mplsh-tune -P audio.param -N 54387 -L 8 -T 20 -R 0.8 -K 50

  -P  the parameter file we generated in the first step
  -N  the size of the whole dataset; note that in the first step we may only use a subset of the data to fit the model
  -L  # of hash tables to use
  -T  # of bins probed in each hash table
  -R  required recall
  -K  K-NNs to find
mplsh-tune will output the best W and M that meet the required recall. For the audio data, we get the following output:
L = 8 T = 20 M = 20 W = 4.32 recall = 0.805276 cost = 0.0968405
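As a sanity check on what that cost figure means: if, as the [COST] line reported by mplsh-run below suggests, cost is the expected fraction of the dataset scanned per query, the prediction translates into an absolute number of candidate points like this (a back-of-the-envelope sketch under that assumption, not lshkit code):

#include <iostream>

int main() {
    const unsigned N = 54387;       // whole dataset size passed via -N
    const double cost = 0.0968405;  // predicted cost from mplsh-tune
    // Under the fraction-scanned interpretation, each query examines
    // roughly cost * N candidate points (about 5267 here).
    std::cout << "~" << (unsigned)(cost * N) << " points scanned per query\n";
    return 0;
}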
We can then run the benchmark and see how close the prediction is to the real numbers:
mplsh-run -D audio.data -B audio.query -L 8 -T 20 -W 4.32 -M 20 -Q 100 -K 50
Loading data...done.
Loading benchmark...done.
Initializing index... done.
Populating index...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
CREATE TIME: 3.19s
Running queries...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
QUERY TIME: 0.57s
[RECALL] 0.8192 +/- 0.234635
[COST] 0.115853 +/- 0.0980032
(Note that the audio dataset is a pretty hard one.)
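For reference, a recall figure like the one above can be computed from a benchmark in the standard way: for each query, count what fraction of the true K nearest neighbors the index returned, then report mean and deviation over all queries. The sketch below follows that textbook definition; we have not verified it against mplsh-run's source.

#include <cmath>
#include <cstdio>
#include <set>
#include <vector>

// truth[q] and found[q] hold the IDs of the true and the returned K-NNs
// of query q. Returns the mean recall over all queries; *stddev receives
// the standard deviation, matching the "mean +/- deviation" style above.
double meanRecall(const std::vector<std::vector<unsigned> > &truth,
                  const std::vector<std::vector<unsigned> > &found,
                  double *stddev) {
    std::vector<double> r;
    for (size_t q = 0; q < truth.size(); ++q) {
        std::set<unsigned> t(truth[q].begin(), truth[q].end());
        size_t hit = 0;
        for (unsigned id : found[q]) hit += t.count(id);
        r.push_back(double(hit) / truth[q].size());
    }
    double mean = 0.0, var = 0.0;
    for (double v : r) mean += v;
    mean /= r.size();
    for (double v : r) var += (v - mean) * (v - mean);
    *stddev = std::sqrt(var / r.size());
    return mean;
}

int main() {
    // Toy example: two queries, K = 3.
    std::vector<std::vector<unsigned> > truth = {{1, 2, 3}, {4, 5, 6}};
    std::vector<std::vector<unsigned> > found = {{1, 2, 9}, {4, 5, 6}};
    double dev;
    double mean = meanRecall(truth, found, &dev);
    std::printf("[RECALL] %g +/- %g\n", mean, dev);  // 0.833333 +/- 0.166667
    return 0;
}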
T is the number of buckets to probe in each hash table. A value in the range of 10 to 100 is usually fine. Larger T results in lower cost: even though more buckets are probed in each hash table, probing more buckets allows each hash bucket to be made smaller, and the overall effect is that fewer points have to be scanned to reach the required recall. However, our model doesn't account for the cost of generating the probe sequence, and when T is very large that cost can become significant. So in practice T should not be much larger than 100.
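For intuition, below is a heavily simplified sketch of query-directed probing for a single hash table, in the spirit of multi-probe LSH: besides the query's home bucket, we probe the buckets whose boundaries the query sits closest to. It considers single-coordinate perturbations only, so it illustrates the idea rather than reproducing lshkit's actual probe-sequence algorithm; the work of generating and ordering these candidate perturbations is exactly the per-query overhead that the tuning model ignores.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Probe {
    std::vector<int> bucket;  // perturbed hash values h_1..h_M of one bucket
    float score;              // distance from q to the crossed boundary, in units of W
};

// hashes holds the unrounded values f_j(q) = (a_j . q + b_j) / W for the
// M hash functions of one table; floor(f_j(q)) gives the home bucket.
// Returns the home bucket followed by the T-1 most promising
// single-coordinate perturbations.
std::vector<Probe> probeSequence(const std::vector<float> &hashes, unsigned T) {
    size_t M = hashes.size();
    std::vector<int> home(M);
    for (size_t j = 0; j < M; ++j) home[j] = (int)std::floor(hashes[j]);

    std::vector<Probe> cand;
    for (size_t j = 0; j < M; ++j) {
        float frac = hashes[j] - home[j];  // position inside the slot, in [0, 1)
        Probe lo = {home, frac};           // cross the lower boundary
        lo.bucket[j] -= 1;
        Probe hi = {home, 1.0f - frac};    // cross the upper boundary
        hi.bucket[j] += 1;
        cand.push_back(lo);
        cand.push_back(hi);
    }
    // Probe the buckets the query sits closest to first.
    std::sort(cand.begin(), cand.end(),
              [](const Probe &a, const Probe &b) { return a.score < b.score; });

    std::vector<Probe> seq;
    seq.push_back(Probe{home, 0.0f});
    for (size_t i = 0; i + 1 < T && i < cand.size(); ++i) seq.push_back(cand[i]);
    return seq;
}

int main() {
    std::vector<float> hashes = {2.1f, 5.9f, 3.5f};    // M = 3, one table
    for (const Probe &p : probeSequence(hashes, 4)) {  // T = 4 probes
        for (int h : p.bucket) std::printf("%d ", h);
        std::printf("(score %.2f)\n", p.score);
    }
    return 0;
}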
tune::Interval intervals[]

Initial value:

{{MIN_L, MAX_L + 1}, {MIN_T, MAX_T + 1}, {0, MAX_M - MIN_M + 1}, {0, NUM_W + 1}}
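intervals[] evidently spans the 4-D grid of candidate (L, T, M, W) settings searched by the tuner. As a rough illustration, the sketch below scans such a grid exhaustively for the cheapest setting that still meets the required recall. The bounds, the DELTA_W step, and the predictRecall/predictCost placeholders are hypothetical stand-ins, not lshkit's definitions.

#include <cstdio>

struct Interval { unsigned begin, end; };  // half-open range [begin, end)

// Illustrative bounds only; lshkit defines its own values.
const unsigned MIN_L = 1,  MAX_L = 20;
const unsigned MIN_T = 1,  MAX_T = 100;
const unsigned MIN_M = 1,  MAX_M = 20;
const unsigned NUM_W = 400;
const float    DELTA_W = 0.01f;  // hypothetical grid-index-to-W step

Interval intervals[] = {{MIN_L, MAX_L + 1},
                        {MIN_T, MAX_T + 1},
                        {0, MAX_M - MIN_M + 1},
                        {0, NUM_W + 1}};

// Placeholder predictors standing in for the performance model fitted in
// step 1; they are NOT lshkit functions.
float predictRecall(unsigned, unsigned, unsigned, float) { return 1.0f; }
float predictCost(unsigned, unsigned, unsigned, float) { return 1.0f; }

// Brute-force version of the search: find the cheapest (L, T, M, W)
// whose predicted recall still meets the requirement R.
void tuneGrid(float R) {
    float bestCost = 1e30f;
    unsigned bestL = 0, bestT = 0, bestM = 0;
    float bestW = 0.0f;
    for (unsigned l = intervals[0].begin; l < intervals[0].end; ++l)
    for (unsigned t = intervals[1].begin; t < intervals[1].end; ++t)
    for (unsigned m = intervals[2].begin; m < intervals[2].end; ++m)
    for (unsigned w = intervals[3].begin; w < intervals[3].end; ++w) {
        unsigned M = MIN_M + m;  // grid index -> actual M
        float W = w * DELTA_W;   // grid index -> actual W
        if (predictRecall(l, t, M, W) < R) continue;
        float c = predictCost(l, t, M, W);
        if (c < bestCost) {
            bestCost = c; bestL = l; bestT = t; bestM = M; bestW = W;
        }
    }
    std::printf("L = %u T = %u M = %u W = %g cost = %g\n",
                bestL, bestT, bestM, bestW, bestCost);
}

int main() { tuneGrid(0.8f); return 0; }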