Data Preprocessing — Deduplication with MinHash and LSH

Wenjing Zhan
4 min read · Nov 2, 2020


When preprocessing text, one headache a data scientist has to deal with is duplicated or near-duplicate documents.

The most straightforward way to find them is a double loop that traverses all documents for a complete pairwise comparison.

If the corpus is small, that is fine: Levenshtein distance can provide quite a good match. But as corpus size and document complexity grow, screening becomes really time- and space-consuming.
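To make the pain concrete, here is a minimal sketch of that naive screening; the corpus and the distance threshold are placeholders, and the Levenshtein implementation is the textbook dynamic-programming version.

```python
def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

docs = ["some text", "some text!", "a different document"]  # placeholder corpus

# The double loop: every document is compared with every other one,
# which is O(N^2) comparisons over the whole corpus.
near_duplicates = [(i, j)
                   for i in range(len(docs))
                   for j in range(i + 1, len(docs))
                   if levenshtein(docs[i], docs[j]) <= 2]  # threshold is arbitrary
print(near_duplicates)  # [(0, 1)]
```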

There is another way to handle it: MinHash and LSH. If you are not interested in the mechanism, just stop reading and jump to the datasketch tutorial (URL below). It saved me tons of preprocessing hours.

http://ekzhu.com/datasketch/lsh.html

Now some Math

Jaccard Similarity

Understanding hashing with LSH involves two other concepts: 1) Jaccard similarity, a way to compute the similarity of two sets, defined as J(A, B) = |A ∩ B| / |A ∪ B|; and 2) min-hashing, covered in the next section.
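In code, Jaccard similarity is one line over Python sets:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 5}, {2, 3, 4, 6}))  # 1/6 ≈ 0.167
```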

Matrix Representation and Min-Hashing

Consider the sets S1 = {1, 2, 5}, S2 = {3}, S3 = {2, 3, 4, 6}, S4 = {1, 4, 6} (example borrowed from the MinHash class notes of utah.edu). From them we can create a matrix representation M like below, with one element per row and the sets S1, S2, …, S4 as columns. A 1 represents the presence of the element in the set, while 0 is the opposite.

Element  S1  S2  S3  S4
1         1   0   0   1
2         1   0   1   0
3         0   1   1   0
4         0   0   1   1
5         1   0   0   0
6         0   0   1   1

Min-hashing means: if we randomly permute the rows of the matrix representation, the min-hash of a set is the element of the first row, in the permuted order, whose entry in that set's column is 1. For the matrix above, two permutations give, for example:

m(S1) = 1, m(S2) = 3, m(S3) = 2, m(S4) = 1

m(S1) = 2, m(S2) = 3, m(S3) = 2, m(S4) = 6
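A small sketch that reproduces those two rows: the first uses the original row order, and the second uses one permutation consistent with the values above (the notes do not say which permutation was actually drawn, so [2, 3, 6, 1, 4, 5] is just one order that matches).

```python
def minhash(s, order):
    # The min-hash of s is the first element of the permuted order present in s.
    return next(e for e in order if e in s)

sets = [{1, 2, 5}, {3}, {2, 3, 4, 6}, {1, 4, 6}]  # S1..S4

for order in ([1, 2, 3, 4, 5, 6],   # original row order -> 1, 3, 2, 1
              [2, 3, 6, 1, 4, 5]):  # a matching permutation -> 2, 3, 2, 6
    print([minhash(s, order) for s in sets])
```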

Hash-Based Signature Matrix

If we repeat the permutation many times, we get a series of hash values for each set S. Together they compose the hash-based signature matrix.

For example, collecting the two permutations above gives the hash-based signature matrix [[1, 3, 2, 1], [2, 3, 2, 6]]. Each column is the hash signature of one set S. The fraction of rows on which two columns agree then estimates the Jaccard similarity of the two sets.
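With many random permutations, the agreement rate between two signature columns converges to the true Jaccard similarity. A minimal sketch, assuming a universe of {1, …, 6} and 200 permutations:

```python
import random

sets = [{1, 2, 5}, {3}, {2, 3, 4, 6}, {1, 4, 6}]  # S1..S4
universe = list(range(1, 7))
rng = random.Random(42)

# Each shuffled order contributes one row of min-hash values.
signature = []
for _ in range(200):
    order = universe[:]
    rng.shuffle(order)
    signature.append([next(e for e in order if e in s) for s in sets])

def estimated_jaccard(sig, a, b):
    # Fraction of signature rows on which columns a and b agree.
    return sum(row[a] == row[b] for row in sig) / len(sig)

print(estimated_jaccard(signature, 0, 2))  # ≈ |S1 ∩ S3| / |S1 ∪ S3| = 1/6
```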

Fast Min-Hashing

A permutation is not the only way to hash. For large datasets, we can instead use hash functions to mimic the effect of a permutation, which is the fast min-hashing algorithm. The Utah class notes referenced above list the pseudo-code.

Translating that pseudo-code into the above example: i is the row number and j is the hash-function index.

Suppose we have k = 2 random hash functions {h1, h2}, where each hj : E → [N] is chosen at random. The initial hash-based signature entry for every set is C_{j,S} = ∞, and C_{j,S} is updated according to the above algorithm: whenever row i has a 1 in the column of S, C_{j,S} becomes min(C_{j,S}, hj(i)). Take S1 as an example. If we set h1(i) = (i + 1) % 5 and h2(i) = (3i + 1) % 5, then the column representation of S1 and the h1, h2 hash values are:

i   S1  h1(i)  h2(i)
1    1    2      4
2    1    3      2
3    0    4      0
4    0    0      3
5    1    1      1
6    0    2      4

The signature entries for S1 start at infinity. Iterating along i, the hash value of each row present in S1 is compared to the current signature entry, which always keeps the minimum: here C_{1,S1} = min(2, 3, 1) = 1 and C_{2,S1} = min(4, 2, 1) = 1.

Once the iteration is done, we have the updated hash-based signature for every set S, and we can proceed to estimate the Jaccard similarity.
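Putting the update rule into runnable form, here is a sketch of the pseudo-code using the two toy hash functions above (they are for illustration, not a recommended choice for real data):

```python
import math

sets = [{1, 2, 5}, {3}, {2, 3, 4, 6}, {1, 4, 6}]  # columns S1..S4
hashes = [lambda i: (i + 1) % 5,       # h1
          lambda i: (3 * i + 1) % 5]   # h2

# C[j][c] starts at infinity and keeps the smallest h_j(i) seen
# over the rows i where column c has a 1.
C = [[math.inf] * len(sets) for _ in hashes]
for i in range(1, 7):                  # i is the row number
    for c, s in enumerate(sets):
        if i in s:                     # M[i][c] == 1
            for j, h in enumerate(hashes):
                C[j][c] = min(C[j][c], h(i))

print(C)  # [[1, 4, 0, 0], [1, 0, 0, 3]] — one signature column per set
```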

Locality-Sensitive Hashing

With a large corpus of N documents, traversing all pairs to compute the Jaccard similarity is O(N²). But if we compare only candidate pairs that are likely to be similar and ignore the rest, the cost drops to roughly O(N). I think this is the key idea of LSH.

LSH works a bit like batching in data science. Instead of training on the dataset as a whole, the training samples are first cut into batches, and the algorithm iterates over them to update the results.

LSH looks similar, except it batches the features instead of the samples: the rows of the signature matrix are cut into bands, and sets whose bands hash to the same bucket land together as candidate Jaccard-similarity pairs. As this bucketing runs over all documents, the entire corpus gets sorted into groups of near-duplicates.
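In practice, the datasketch library linked above does all of this for you. A minimal deduplication sketch, where the documents, the whitespace tokenization, and the 0.5 threshold are placeholders to adapt to your corpus:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # Naive whitespace tokenization — word or character shingles
    # are common alternatives in practice.
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

docs = {  # placeholder documents
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over a lazy dog",
    "d3": "a completely unrelated note about preprocessing",
}

# Documents whose signature bands land in the same buckets become
# candidate pairs at estimated Jaccard similarity >= 0.5.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
minhashes = {key: minhash_of(text) for key, text in docs.items()}
for key, m in minhashes.items():
    lsh.insert(key, m)

print(lsh.query(minhashes["d1"]))  # candidates for d1 (includes d1 itself)
```

The buckets replace the full pairwise comparison, which is what makes deduplication on a big corpus tractable.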
