In 1999, Gionis et al.
Let H = {h: S → U} be a family of hash functions
- if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \geq p_1$
- if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \leq p_2$
With
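The two conditions above can be checked numerically on a small example. The sketch below does not use the orthogonal LSH heuristic from this repository; it assumes the classic bit-sampling family on Hamming space, where S = {0,1}^d, U = {0,1}, and each hash function returns one coordinate, h_i(x) = x[i]. The function names and the sample vectors are hypothetical and chosen only for illustration.

```python
import random


def hamming(u, v):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def collision_probability(q, v):
    """Exact Pr_H[h(q) = h(v)] for bit sampling (assumed family, not the repo's).

    A uniformly chosen coordinate agrees on q and v unless it is one of the
    hamming(q, v) differing positions.
    """
    return 1.0 - hamming(q, v) / len(q)


def estimate_collision_probability(q, v, trials=10_000, seed=0):
    """Monte Carlo estimate of the same probability by sampling h from H."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(len(q))  # draw one hash function h_i
        hits += q[i] == v[i]       # count collisions h_i(q) == h_i(v)
    return hits / trials


q = (1, 0, 1, 1, 0, 0, 1, 0)
near = (1, 0, 1, 1, 0, 0, 1, 1)  # Hamming distance 1 from q
far = (0, 1, 0, 0, 0, 0, 1, 0)   # Hamming distance 4 from q

print(collision_probability(q, near))  # near point collides often
print(collision_probability(q, far))   # far point collides less often
```

For this assumed family, a point within distance r_1 collides with probability at least p_1 = 1 - r_1/d, and a point farther than r_2 collides with probability at most p_2 = 1 - r_2/d, which matches the shape of the two conditions in the definition.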
This repository contains a C/C++ implementation of the orthogonal LSH heuristic. The heuristic takes three inputs: an integer-encoded sequence of words (the query), an integer-encoded text file, and an inverse dictionary. A query may contain at most nine words; zero padding is added if the query is shorter than nine words. The repository also includes several Bash and Python scripts that support the data preprocessing phases.
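As an illustration of the padding rule, here is a minimal Python sketch. It is not the repository's C/C++ code: `pad_query`, `QUERY_SIZE`, and `PAD` are hypothetical names, and appending the zeros at the end of the sequence (rather than the front) is an assumption.

```python
QUERY_SIZE = 9  # maximum query length stated in the text
PAD = 0         # padding label; assumes 0 is not used for a real word


def pad_query(labels, size=QUERY_SIZE, pad=PAD):
    """Return the integer-encoded query extended with trailing zeros.

    Trailing (rather than leading) padding is an assumption of this sketch.
    """
    if len(labels) > size:
        raise ValueError(f"query has {len(labels)} words, maximum is {size}")
    return list(labels) + [pad] * (size - len(labels))


print(pad_query([12, 7, 431]))
```

Rejecting over-long queries with an error, instead of truncating them silently, is a design choice of the sketch and may differ from what the implementation does.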
The preprocessing workflow starts by converting PDF files to UTF-8 text files with the pdf2txt script; you might need to install the pdftotext utilities first. In the normalization phase, multiple filters can be applied selectively. The corpora extraction phase retrieves the unique strings and sorts them in lexicographical order. Each string is assigned a unique integer label in the Z_inverted index phase. In the relabeling phase, the normalized text files are relabeled using the inverted index. Finally, the vectorized text files can be used by LSH to perform a query. Note that the query must also be relabeled using the same inverted index.
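The extraction, indexing, and relabeling phases can be sketched in a few lines of Python. This is not the repository's scripts: all function names are hypothetical, tokens are assumed to be whitespace-separated, and labels are assumed to start at 1 so that 0 stays free for padding.

```python
def extract_corpus(lines):
    """Corpora extraction: unique tokens, sorted lexicographically."""
    return sorted({tok for line in lines for tok in line.split()})


def build_inverted_index(corpus, first_label=1):
    """Assign each unique string an integer label (starting at 1 by assumption)."""
    return {tok: label for label, tok in enumerate(corpus, start=first_label)}


def build_inverse_dictionary(index):
    """Map labels back to strings, e.g. to print matches as words."""
    return {label: tok for tok, label in index.items()}


def relabel(line, index):
    """Relabeling: replace every token by its integer label.

    Raises KeyError for a token missing from the index.
    """
    return [index[tok] for tok in line.split()]


normalized = ["the quick brown fox", "the lazy dog"]
index = build_inverted_index(extract_corpus(normalized))
encoded = [relabel(line, index) for line in normalized]
query = relabel("lazy fox", index)  # the query uses the same index

print(index)
print(encoded)
print(query)
```

Because the query goes through the same index as the text, a query word that never occurs in the corpus has no label; the sketch surfaces this as a `KeyError` rather than guessing a value.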