include/lshkit/sketch.h File Reference

Implementation of LSH based sketches. More...

#include <lshkit/matrix.h>

Go to the source code of this file.

Namespaces

namespace  lshkit

Classes

class  lshkit::Sketch< LSH, CHUNK >
 LSH-based sketcher. More...
class  lshkit::WeightedHammingHelper< CHUNK >
 Weighted hamming distance calculator. More...


Detailed Description

Implementation of LSH based sketches.

Sketches are compact representations (as bit-vectors) of large objects. They are simply constructed by concatenating a bunch of 1-bit LSH hash values. The distance between the original objects can be approximated by the hamming distance between sketches. The generated bit-vectors are stored in arrays of type CHUNK (unsigned char by default, so each chunk has 8 bits).

Following is an example of distance estimation using sketches:

      #include <lshkit.h>
      using namespace lshkit;

      typedef Sketch<ThresholdingLsh> MySketch; // approximate L1 distance
      TresholdingLsh::Parameter param;

      param.dim = 128;
      param.min = 0.0;
      param.max = 255.0; // Parameters are set for, maybe, SIFT features.

      DefaultRng rng;

      const static SKETCH_BITS = 256;
      const static SKETCH_BYTES = 256 / 8;

      MySketch sketch(SKETCH_BYTES, param, rng); // from 128 floats to 256 bits.
      // Note that in practice you are probably not going to simply construct a sketcher like this.
      // The sketcher used to sketch the query point should be the exact one used to
      // sketch the data points.  So by the time a database has been constructed,
      // the sketcher is most probably already constructed and saved in an archive.
      // You should use the serialize method to load the sketcher from the archive (first default construct
      // a sketcher, than apply sketch.serialize(some_input_stream, 0).)


      
      // Allocate space for sketches.  Using "new" is probably not a good idea.
      char *query_sketch = new char[SKETCH_BYTES];
      float *asym_info = new float[SKETCH_BITS]; // to hold asymmetric information.

      float *query = ...;

      sketch.apply(query, query_sketch, asym_info);

      WeightedHammingHelper asym_helper(SKETCH_BYTES);
      asym_helper.update(query_sketch, asym_info);

      metric::hamming<char> hamming(SKETCH_BYTES);

      // Scan the database.
      // DATABASE is a container of previously constructed sketches.
      BOOST_FOREACH (const char* data_sketch, DATABASE) {

          // evaluate the symmetric sketch distance
          float sym_dist = hamming(query_sketch, data_sketch);

          // evaluate the asymmetric sketch distance
          float asym_dist = asym_helper.distTo(data_sketch);

          // asym_dist should be more reliable than sym_dist for ranking.
      }

See src/sketch-run.cpp for a full example.

For more information on sketches and asymmetric distance estimators, see

Wei Dong, Moses Charikar, Kai Li. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval. Singapore. July 2008.


Get LSHKIT at SourceForge.net. Fast, secure and Free Open Source software downloads doxygen