#include <lshkit/matrix.h>
Go to the source code of this file.
Namespaces | |
namespace | lshkit |
Classes | |
class | lshkit::Sketch< LSH, CHUNK > |
LSH-based sketcher. More... | |
class | lshkit::WeightedHammingHelper< CHUNK > |
Weighted hamming distance calculator. More... |
Sketches are compact representations (as bit-vectors) of large objects. They are simply constructed by concatenating a bunch of 1-bit LSH hash values. The distance between the original objects can be approximated by the hamming distance between sketches. The generated bit-vectors are stored in arrays of type CHUNK (unsigned char by default, so each chunk has 8 bits).
Following is an example of distance estimation using sketches:
#include <lshkit.h> using namespace lshkit; typedef Sketch<ThresholdingLsh> MySketch; // approximate L1 distance TresholdingLsh::Parameter param; param.dim = 128; param.min = 0.0; param.max = 255.0; // Parameters are set for, maybe, SIFT features. DefaultRng rng; const static SKETCH_BITS = 256; const static SKETCH_BYTES = 256 / 8; MySketch sketch(SKETCH_BYTES, param, rng); // from 128 floats to 256 bits. // Note that in practice you are probably not going to simply construct a sketcher like this. // The sketcher used to sketch the query point should be the exact one used to // sketch the data points. So by the time a database has been constructed, // the sketcher is most probably already constructed and saved in an archive. // You should use the serialize method to load the sketcher from the archive (first default construct // a sketcher, than apply sketch.serialize(some_input_stream, 0).) // Allocate space for sketches. Using "new" is probably not a good idea. char *query_sketch = new char[SKETCH_BYTES]; float *asym_info = new float[SKETCH_BITS]; // to hold asymmetric information. float *query = ...; sketch.apply(query, query_sketch, asym_info); WeightedHammingHelper asym_helper(SKETCH_BYTES); asym_helper.update(query_sketch, asym_info); metric::hamming<char> hamming(SKETCH_BYTES); // Scan the database. // DATABASE is a container of previously constructed sketches. BOOST_FOREACH (const char* data_sketch, DATABASE) { // evaluate the symmetric sketch distance float sym_dist = hamming(query_sketch, data_sketch); // evaluate the asymmetric sketch distance float asym_dist = asym_helper.distTo(data_sketch); // asym_dist should be more reliable than sym_dist for ranking. }
See src/sketch-run.cpp for a full example.
For more information on sketches and asymmetric distance estimators, see
Wei Dong, Moses Charikar, Kai Li. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval. Singapore. July 2008.