LSH is a really neat algorithm, but to my understanding (at least from what I’ve seen in the literature), it also tends to be rather inefficient. For it to have good precision you need longer hashes, but that reduces recall. It also doesn’t tend to produce a well-balanced distribution of entries over buckets. More recent research has therefore focused on more elaborate hashing functions that can produce shorter codes and a better-balanced distribution over buckets.
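To make that precision/recall tradeoff concrete, here's the standard banding analysis for MinHash-style LSH (a rough sketch of my own, not something from the article): with b bands of r rows each, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, so for a fixed signature length, longer bands sharpen precision but cost recall at moderate similarity.

def candidate_prob(s, bands, rows):
    # MinHash LSH banding: docs with Jaccard similarity s collide in one
    # band with probability s**rows, and become candidates if any band collides
    return 1 - (1 - s ** rows) ** bands

for s in (0.2, 0.5, 0.8, 0.9):
    # same total signature length (100), different band shapes:
    # rows=10 filters low-similarity pairs much harder, but also misses
    # many genuinely similar pairs around s=0.5
    print(s, round(candidate_prob(s, bands=20, rows=5), 3),
             round(candidate_prob(s, bands=10, rows=10), 3))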
The article is well put together and nicely illustrated, though :)
1) I think it would be helpful to motivate LSH from the start by explaining that LSH queries are an approximation of doing pairwise Jaccard similarity searches (there's a quick numeric comparison at the end of this comment). I think Ullman does this.
2) It's a different approach to minhashing from the way I think about it. One-hot encoding is not strictly necessary: all you need to do is, for each round, randomly assign every shingle in the vocabulary an integer and take the lowest one that is present in the document.
Here's some concise Python that will do it (using mmh3's seeded hash in place of explicit random permutations):
import mmh3  # murmurhash3; seed k plays the role of the k-th random ordering

def _chunk_text(m, chunksize=5):
    # character shingles of length `chunksize`
    len_m = len(m)
    end = len_m - chunksize + 1 if len_m > chunksize else len_m
    return set(m[i:i + chunksize] for i in range(end))

def _minhash(s, siglen=20):
    # signature = per-seed minimum hash over the document's shingles
    def minhash_k(chunks, k):
        return min(mmh3.hash(x, k) for x in chunks)
    chunks = _chunk_text(s)
    return [minhash_k(chunks, k) for k in range(siglen)]
Although I suppose either approach is valid. The big idea is that, for each position in the signature, you apply the same random ordering of shingle ids to every document, and there are hopefully enough shared shingles that two similar documents will often end up with the same leftmost (minimum) shingle under that ordering; a signature is just the list of these minimum values, one per ordering.
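To tie this back to 1), here's a rough sketch (mine, not the article's) using _chunk_text/_minhash from above: the fraction of signature positions that agree is the cheap estimate of the exact pairwise Jaccard similarity that LSH bucketing approximates.

def _jaccard(a, b):
    # the exact pairwise similarity that minhash/LSH approximates
    sa, sb = _chunk_text(a), _chunk_text(b)
    return len(sa & sb) / len(sa | sb)

def _sig_similarity(a, b, siglen=200):
    # fraction of matching signature positions ~= jaccard similarity
    sig_a, sig_b = _minhash(a, siglen), _minhash(b, siglen)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / siglen

doc1 = "locality sensitive hashing approximates jaccard similarity"
doc2 = "locality sensitive hashing estimates jaccard similarity"
print(_jaccard(doc1, doc2))         # exact
print(_sig_similarity(doc1, doc2))  # minhash estimate, should be close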
I read the article but could find nothing in it that credits Piotr Indyk, one of the inventors of Locality Sensitive Hashing and the math behind it. Give credit where it is due!