mhash-c

about | usage | input | arguments | benchmarks | resources

[figure: empirical-f480-n4000.png]

about

Given a database of N vectors, each of dimensionality d, identifying exact or nearly exact matches to a single query vector takes Ω(Nd) time. Given a second database of M such queries, the problem requires Ω(NMd) time. For large databases, this quickly becomes intractable.
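
For intuition, the brute-force approach reduces to a pairwise similarity computation like the sketch below (illustrative only; the function name and the sorted-feature-index representation are our assumptions, not code from this repository), which costs O(d) per pair and must be repeated for every pair of vectors.

/* Illustrative sketch, not mhash's actual code: Jaccard similarity of two
 * vectors represented as sorted lists of feature indices. Runs in O(d). */
double jaccard(const unsigned *a, int na, const unsigned *b, int nb) {
    int i = 0, j = 0, inter = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j])     { inter++; i++; j++; }
        else if (a[i] < b[j]) { i++; }
        else                  { j++; }
    }
    int uni = na + nb - inter;               /* size of the union */
    return uni ? (double)inter / uni : 1.0;  /* treat two empty sets as identical */
}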

To remedy this, we can employ approximation methods. Locality-Sensitive Hashing (LSH) is one such method. MinHashing is an LSH scheme that defines vector similarity using the Jaccard similarity coefficient. This project provides a multithreaded implementation of MinHashing in C using the pthreads library. A great explanation of both LSH and MinHashing (as well as numerous other fantastic topics) can be found in the excellent text Mining of Massive Datasets [1].
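
To sketch the core idea (this is not the code in this repository; the hash family, names, and fixed prime are assumptions chosen for illustration), a MinHash signature of length n is obtained by taking, for each of n random hash functions, the minimum hash value over a vector's feature indices:

#include <stdint.h>

/* Illustrative sketch only: MinHash signature of length n for a vector given
 * as a list of m feature indices, using n affine hash functions
 * h_i(x) = (a_i * x + b_i) mod p. Assumes a_i, b_i < p so the arithmetic
 * below cannot overflow a 64-bit integer. */
#define MH_PRIME 2147483647ULL  /* 2^31 - 1, chosen here for illustration */

void minhash_signature(const uint32_t *features, int m,
                       const uint64_t *a, const uint64_t *b, int n,
                       uint64_t *sig) {
    for (int i = 0; i < n; i++) {
        uint64_t best = UINT64_MAX;
        for (int j = 0; j < m; j++) {
            uint64_t h = (a[i] * features[j] + b[i]) % MH_PRIME;
            if (h < best) best = h;
        }
        sig[i] = best;
    }
}

Two vectors then agree in any single signature position with probability equal to their Jaccard similarity, which is the property the banding scheme in the benchmarks section exploits.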

usage

The project is built using the make utility. Navigate into the project directory and type

make

to build the project. The following command will identify all pairs of similar vectors in bench/test.csv using a single thread.

cat bench/test.csv | ./mhash -b 2 -r 10 -t 1 -E

input

The program takes one or two databases of vectors as input. Consider the following format:

N,D,d
m1,f1,f2,...,fm1
m2,fm1+1,fm1+2,...,fm1+m2
...
mN,fD-mN+1,fD-mN+2,...,fD

where N is the number of vectors in the database, D is the total number of features across all N vectors, and d is the dimensionality, i.e. the number of possible features, of each vector. Each row begins with mi, the number of features in the i-th vector, followed by that vector's features. A single such database can be provided and compared to itself using the -E option. If a second, distinct database is to be compared against the first, it should follow the first immediately on standard input.
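
For illustration, a hypothetical database of N = 3 vectors over d = 10 possible features, with D = 7 features in total, could look like this (whether feature indices are 0- or 1-based is our assumption; 0-based indices are used below):

3,7,10
2,0,4
3,1,4,9
2,0,9

Here the first vector contains the m1 = 2 features {0, 4}, the second the m2 = 3 features {1, 4, 9}, and the third the m3 = 2 features {0, 9}.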

arguments

-b    <int>     number of bands per signature (n=r*b)         (required)
-r    <int>     size of each band (# rows)                    (required)
-s    <int>     seed for reproducibility                      (default: 0)
-t    <int>     number of threads to use                      (default: 1)
-E              enables exclusive comparison mode; prevents
                observations with the same index from being paired
--verbose       enables verbosity, i.e. a bunch of debugging
                garbage will be thrown into standard output
--no-print      pairing results will not be sent to stdout

benchmarks

The probability of any two vectors with Jaccard similarity s being paired when using b bands of r rows each is 1-(1-s^r)^b [1]. A successful implementation must provide this theoretical guarantee. We verify this is the case with our implementation as follows:

  1. Generate a dataset D by sampling from the space of all possible vectors uniformly at random.
  2. Run mhash on D with various settings of b and r.
  3. Bucket pairs by similarity and compute the proportion of pairs called within each bucket (bucketing is necessary because the distribution of similarities is discrete).
  4. Graph both the theoretical curve and this empirical curve. The latter curve should approximate the former.
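
The theoretical curve in step 4 comes directly from the formula above. A minimal sketch of that computation in C (the function name is ours, not part of mhash; link with -lm):

#include <math.h>

/* Probability that two vectors with Jaccard similarity s become a candidate
 * pair under LSH banding with b bands of r rows each: 1 - (1 - s^r)^b. */
double pair_probability(double s, int r, int b) {
    return 1.0 - pow(1.0 - pow(s, (double)r), (double)b);
}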

One would expect a larger dataset with higher-dimensional vectors to yield a better approximation to the theoretical curve for all settings of b and r. The graphs below seem to support this intuition. (T) denotes a theoretical curve computed using the above formula.

[figure: empirical-f96-n1000.png (F = 96, N = 1000)]
[figure: empirical-f480-n4000.png (F = 480, N = 4000)]

We have also investigated timing requirements and peak memory usage of mhash over various settings and inputs. These are provided below. All memory results are provided in kilobytes (kB) and all timing results are provided in seconds. All benchmarking experiments were run on an Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz with 1.5 TB of RAM. Only one thread was used per mhash instance. Every data point is the average of 100 runs.

We used the --no-print flag when running mhash to suppress output of identified pairs. Running time is heavily influenced by the number of pairs mhash identifies and subsequently prints, which can be quadratic in the size of the dataset in the worst case. Output quickly becomes the bottleneck once the number of identified pairs grows beyond linear in the size of the input.

[figure: memory_usage-f96.png (F = 96)]
[figure: memory_usage-f480.png (F = 480)]
[figure: time_actual-f96.png (F = 96)]
[figure: time_actual-f480.png (F = 480)]

We have provided the necessary tools to repeat this analysis here. Raw data for the above graphs can be found here.

resources

[1] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets.
