include/lshkit/mplsh.h File Reference

Multi-Probe LSH indexing. More...

#include <lshkit/common.h>
#include <lshkit/lsh.h>
#include <lshkit/composite.h>
#include <lshkit/metric.h>
#include <lshkit/lsh-index.h>
#include <lshkit/mplsh-model.h>
#include <lshkit/topk.h>

Go to the source code of this file.

Namespaces

namespace  lshkit

Classes

struct  lshkit::Probe
 Probe vector. More...
class  lshkit::ProbeSequenceTemplates
 Probe sequence template. More...
class  lshkit::MultiProbeLsh
 Multi-Probe LSH class. More...
struct  lshkit::MultiProbeLsh::Parameter
class  lshkit::MultiProbeLshIndex< KEY >
 Multi-Probe LSH index. More...

Typedefs

typedef std::vector< Probe > lshkit::ProbeSequence
 Probe sequence.

Functions

void lshkit::GenProbeSequenceTemplate (ProbeSequence &seq, unsigned M, unsigned T)
 Generate a template probe sequence.

Variables

ProbeSequenceTemplates lshkit::__probeSequenceTemplates


Detailed Description

Multi-Probe LSH indexing.

Multi-Probe LSH (MPLSH) uses the same data structure as LshIndex, except that it probes more than one buckets in each hash table to generate more accurate results. Equivalently, less hash tables are needed to achieve the same accuracy. The limitation is that the current implementation only works for L2 distance.

Follow the following 4 steps to use the MPLSH API.

1. Implement a scanner class which scan the candidate keys.

The MPLSH data structure doesn't manage the feature vectors, but only keeps the keys to retrieve them. You need to provide a scanner class to MPLSH, and for each query, MPLSH will pass the candidate keys to the scanner. The scanner usually keeps an K-NN data structure internally and updates it when receives candidate keys.

MPLSH uses the scanner as a unary function taking a key as argument.

The default scanner implementation is TopkScanner.

 class Scanner
 {
      ...
      void operator () (unsigned key);
 };

2. Construct the MPLSH data structure.

Assume we use key type KEY.

 typedef MultiProbeLshIndex<KEY> Index;

 Index index;

The index can not be used yet.

3. Populate the index / Load the index from a previously saved file.

When the index is initially built, use the following to populate the index:
 Index::Parameter param;

 //Setup the parameters.  Note that L is not provided here.
 param.W = W;
 param.H = H; // See H in the program parameters.  You can just use the default value.
 param.M = M;
 param.dim = DIMENSION_OF_THE_DATA
 DefaultRng rng; // random number generator.
 
 index.init(param, rng, L);
 
 for (each possible key, value pair) {
     index.insert(key, value);
 }
 
 // You can now save the index for future use.
 ofstream os(index_file.c_str(), std::ios::binary);
 index.save(os);

Or you can load from a previously saved file

 ifstream is(index_file.c_str(), std::ios::binary);
 index.load(is);

4. Query the MPLSH.

 float *query;
 ...
 index.query(query, T, scanner);   cnt is the number of points actually scanned.

See the source file lshkit/tools/mplsh-run.cpp for a full example of using MPLSH.

For adaptive probing, I hard coded the sensitive range of KNN distance to [0.0001W, 100W] and logarithmically quantized the range to 200 levels. If you find that your KNN distances fall outside this range, or want more refined quantization, you'll have to modify the code in lshkit::MultiProbeLshIndex::init().

Reference

Wei Dong, Zhe Wang, William Josephson, Moses Charikar, Kai Li. Modeling LSH for Performance Tuning.. To appear in In Proceedings of ACM 17th Conference on Information and Knowledge Management (CIKM). Napa Valley, CA, USA. October 2008.

Qin Lv, William Josephson, Zhe Wang, Moses Charikar, Kai Li. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). Vienna, Austria. September 2007.


Get LSHKIT at SourceForge.net. Fast, secure and Free Open Source software downloads doxygen