itspawanbhardwaj/spark-fuzzy-matching

Maven Central

For Scala 2.10

<dependency>
  <groupId>com.github.itspawanbhardwaj</groupId>
  <artifactId>spark-fuzzy-matching_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

For Scala 2.11

<dependency>
  <groupId>com.github.itspawanbhardwaj</groupId>
  <artifactId>spark-fuzzy-matching_2.11</artifactId>
  <version>1.0.1</version>
</dependency>

Metrics and algorithms

Functions

  • All functions are defined under com.pb.fuzzy.matching.functions.

import com.pb.fuzzy.matching.functions._ // import to use fuzzy matching functions

  
  levenshteinFn(document, document1)
  diceSorensenFn(document, document1, nGramSize)
  hammingFn(document, document1)
  jaccardFn(document, document1, nGramSize)
  jaroFn(document, document1)
  jaroWinklerFn(document, document1)
  nGramFn(document, document1, nGramSize)
  overlapFn(document, document1, nGramSize)
  ratcliffObershelpFn(document, document1)
  weightedLevenshteinFn(document, document1, deleteWeight, insertWeight, substituteWeight)
  metaphoneFn(document, document1)
  computeMetaphoneFn(document)
  nysiisFn(document, document1)
  computeNysiisFn(document)
  refinedNysiisFn(document, document1)
  computeRefinedNysiisFn(document)
  refinedSoundexFn(document, document1)
  computeRefinedSoundexFn(document)
  soundexFn(document, document1)
  computeSoundexFn(document)
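
Several of the functions above take an nGramSize argument. As a rough illustration of what that parameter controls, here is a minimal plain-Scala sketch of Jaccard similarity over character n-grams. It is not this library's implementation: the object name NGramJaccardSketch and its helpers are hypothetical, and returning an Option when a string is shorter than the n-gram size is an assumption made for this sketch.

```scala
object NGramJaccardSketch {
  // Split a string into its set of overlapping character n-grams,
  // e.g. "night" with n = 2 gives Set("ni", "ig", "gh", "ht").
  // Strings shorter than n yield an empty set.
  def nGrams(s: String, n: Int): Set[String] =
    if (n <= 0 || s.length < n) Set.empty[String]
    else s.sliding(n).toSet

  // Jaccard similarity: size of the intersection divided by size of the union
  // of the two n-gram sets. Returns None when either string has no n-grams
  // (an assumption of this sketch).
  def jaccard(a: String, b: String, n: Int): Option[Double] = {
    val ga = nGrams(a, n)
    val gb = nGrams(b, n)
    if (ga.isEmpty || gb.isEmpty) None
    else Some((ga intersect gb).size.toDouble / (ga union gb).size)
  }

  def main(args: Array[String]): Unit = {
    // Larger n-grams are stricter: fewer grams are shared between the strings.
    println(jaccard("night", "nacht", 1)) // 3 shared of 7 distinct characters
    println(jaccard("night", "nacht", 2)) // 1 shared of 7 distinct bigrams
  }
}
```

A smaller nGramSize tolerates more character-level noise, while a larger one rewards longer shared runs of characters.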

Example

The project includes FuzzyMatchingJoinExample, which joins a dataset of correctly spelled titles with a dataset of misspelled titles:

Dataset with proper names
+--------------------+--------------------+-------+
|               title|               gener|ratings|
+--------------------+--------------------+-------+
|The Shawshank Red...|        Crime. Drama|    9.3|
|       The Godfather|        Crime. Drama|    9.2|
|     The Dark Knight|Action. Crime. Drama|    9.0|
|The Godfather: Pa...|        Crime. Drama|    9.0|
|        Pulp Fiction|        Crime. Drama|    8.9|
+--------------------+--------------------+-------+
only showing top 5 rows

Dataset with misspelled names
+--------------------+----+--------+
|               title|year|duration|
+--------------------+----+--------+
|dhe Shwshnk Redem...|1994|     142|
|        dhe Godfdher|1972|     175|
|      dhe Drk Knighd|2008|     152|
|dhe Godfdher: Prd II|1974|     202|
|        Pulp Ficdion|1994|     154|
+--------------------+----+--------+
only showing top 5 rows

Dataset after fuzzy join
+--------------------+--------------------+-------+--------------------+----+--------+
|               title|               gener|ratings|               title|year|duration|
+--------------------+--------------------+-------+--------------------+----+--------+
|The Shawshank Red...|        Crime. Drama|    9.3|dhe Shwshnk Redem...|1994|     142|
|       The Godfather|        Crime. Drama|    9.2|        dhe Godfdher|1972|     175|
|     The Dark Knight|Action. Crime. Drama|    9.0|      dhe Drk Knighd|2008|     152|
|        Pulp Fiction|        Crime. Drama|    8.9|        Pulp Ficdion|1994|     154|
|    Schindler's List|Biography. Drama....|    8.9|    Schindler's Lisd|1993|     195|
+--------------------+--------------------+-------+--------------------+----+--------+
only showing top 5 rows
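
The idea behind the join above can be sketched without Spark, using plain Scala collections and a hand-written Levenshtein distance. This is only an illustration under stated assumptions, not the code of FuzzyMatchingJoinExample: the object FuzzyJoinSketch, the fuzzyJoin helper, the lowercasing step, and the distance threshold of 3 are all hypothetical choices. The sample rows are taken from the tables above.

```scala
object FuzzyJoinSketch {
  // Classic dynamic-programming Levenshtein distance with unit costs.
  def levenshtein(a: String, b: String): Int = {
    // prev(j) is the distance between the processed prefix of a and b.take(j).
    var prev = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // For each left row, keep the closest right row if its key is within
  // maxDistance edits (keys are compared case-insensitively).
  def fuzzyJoin[A, B](left: Seq[(String, A)],
                      right: Seq[(String, B)],
                      maxDistance: Int): Seq[((String, A), (String, B))] =
    left.flatMap { l =>
      right
        .map(r => (r, levenshtein(l._1.toLowerCase, r._1.toLowerCase)))
        .filter(_._2 <= maxDistance)
        .sortBy(_._2)
        .headOption
        .map { case (r, _) => (l, r) }
    }

  def main(args: Array[String]): Unit = {
    val ratings = Seq("The Godfather" -> 9.2, "The Dark Knight" -> 9.0, "Pulp Fiction" -> 8.9)
    val years   = Seq("dhe Godfdher" -> 1972, "dhe Drk Knighd" -> 2008, "Pulp Ficdion" -> 1994)
    // Threshold of 3 edits is an assumption chosen for these sample rows.
    fuzzyJoin(ratings, years, maxDistance = 3).foreach(println)
  }
}
```

This sketch compares every left row with every right row, which is quadratic in the number of rows. On large datasets some form of candidate pruning (for example, grouping by a phonetic code first) would usually be needed before computing distances.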

Library used

stringmetric: string metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).