Using the MinHasher #609

Gaglia88 · 2016-12-22T16:02:55Z

Hi,
I'm trying to use MinHasher to compute LSH using differents profiles, each profiles has a list of tokens (a bag of words).
If I understand right the process to do is:

For each profile create an hash for each token using the minHasher.init method , this step produces a list of (profileID, [tokens hashes])
For each profile using the method minHasher.sum, sum all generated hashes, now I have a list of (profileID, hash)
For each profile's hash apply the minHasher.buckets method to obtain the buckets, so I have a list of (profile, [buckets])
Through some mapping process produce a list of (bucket, [profiles]), so if two profiles finish in the same bucket they are canditates to matching.

Is this process correct?

Because I tried it, but it never produces colliding buckets, I mean each bucket contains always a single profile.

This is my testing code

val numHashes = 1024
val targetThreshold = 0.1
val numBands = MinHasher.pickBands(targetThreshold, numHashes)
implicit lazy val minHasher = new MinHasher32(numHashes, numBands)

val p1 = Profile(1)
p1.addAttribute(KeyValue("Key", "a b c d"))

val p2 = Profile(2)
p2.addAttribute(KeyValue("Key", "a c e f"))

val profiles = p1 :: p2 :: Nil

val profileWithTokens = profiles.flatMap {
  profile =>
    profile.attributes.flatMap {
      kv =>
        kv.value.split("\\W+").filter(_.trim.size > 0).map {
          token =>
            (profile.id, token)
        }
    }
}.distinct

profileWithTokens.foreach(println)
println()

val profileWithHash = profileWithTokens.map{
  x =>
    val profileID = x._1
    val tokenHash = minHasher.init(x._2)
    (profileID, tokenHash)
}

profileWithHash.foreach(println)
println()

val profileWithHashes = profileWithHash.groupBy(x => x._1)

profileWithHashes.foreach(println)
println()

val profileWithSignature = profileWithHashes.map{
  x =>
    (x._1, minHasher.sum(x._2.map(_._2)))
}

profileWithSignature.foreach(println)
println()

println("Jaccard Similarity = "+minHasher.similarity(profileWithSignature(1), profileWithSignature(2)))
println()

val profileWithBuckets = profileWithSignature.map(x => (x._1, minHasher.buckets(x._2)))

val bucketsWithProfiles = profileWithBuckets.flatMap(x => x._2.map(y => (y, x._1))).groupBy(x => x._1).map(x => (x._1, x._2.map(_._2)))

println("Number of buckets "+bucketsWithProfiles.size)

println("Number of buckets with more than 1 element "+bucketsWithProfiles.filter(_._2.size > 1).size)

And this is the output

(1,a)
(1,b)
(1,c)
(1,d)
(2,a)
(2,c)
(2,e)
(2,f)

(1,MinHashSignature([B@319b92f3))
(1,MinHashSignature([B@fcd6521))
(1,MinHashSignature([B@27d415d9))
(1,MinHashSignature([B@5c18298f))
(2,MinHashSignature([B@31f924f5))
(2,MinHashSignature([B@5579bb86))
(2,MinHashSignature([B@5204062d))
(2,MinHashSignature([B@4fcd19b3))

(2,List((2,MinHashSignature([B@31f924f5)), (2,MinHashSignature([B@5579bb86)), (2,MinHashSignature([B@5204062d)), (2,MinHashSignature([B@4fcd19b3))))
(1,List((1,MinHashSignature([B@319b92f3)), (1,MinHashSignature([B@fcd6521)), (1,MinHashSignature([B@27d415d9)), (1,MinHashSignature([B@5c18298f))))

(2,MinHashSignature([B@6483f5ae))
(1,MinHashSignature([B@b9afc07))

Jaccard Similarity = 0.31640625

Number of buckets 965
Number of buckets with more than 1 element 0

The text was updated successfully, but these errors were encountered:

avi-stripe · 2016-12-22T17:02:02Z

The process is right. I haven't read your code in detail but on a quick scan it looks right. The similarity here is fairly low (I'm used to trying to match jaccard similarities closer to the 0.8 or 0.9 range) but you do have the threshold set quite low as well, so it should work. Have you:

tested two identical profiles to see if at least those match up (the buckets should be identical)
tested any other random profiles? maybe this one is just a fluke (though that's unlikely)

Gaglia88 · 2016-12-22T18:07:45Z

Hi, thanks for your reply.
I was not really sure that the process that I have implemented was right.

I tested the algorithm with two identical profiles, and they map in the same buckets! Thanks for the hint.

So I think I have made an error in the process of mapping (profile, [buckets]) to (bucket, [profile])
I will check it.

EDIT:
The problem was the last flatmap, I don't know why but it didn't work properly.
I solved in this way

val bucketsWithProfiles = profileWithBuckets.map { x => x._2.map((_, x._1)) }.flatten.groupBy(x => x._1).map(x => (x._1, x._2.map(_._2)))

Gaglia88 closed this as completed Dec 22, 2016

Gaglia88 reopened this Dec 22, 2016

Gaglia88 mentioned this issue Jun 6, 2017

The MinHasher execution depends by the number of nodes #634

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using the MinHasher #609

Using the MinHasher #609

Gaglia88 commented Dec 22, 2016

avi-stripe commented Dec 22, 2016

Gaglia88 commented Dec 22, 2016 •

edited

Using the MinHasher #609

Using the MinHasher #609

Comments

Gaglia88 commented Dec 22, 2016

avi-stripe commented Dec 22, 2016

Gaglia88 commented Dec 22, 2016 • edited

Gaglia88 commented Dec 22, 2016 •

edited