
An alternative to fuzzy matching techniques in NLP for better performance

When working with natural language processing, we run into many scenarios that call for string matching. Most often we use fuzzy matching to find the closest match for a string in a database; in other cases we use it to catch typos, mistranslations, and similar errors.

Besides fuzzy matching libraries, we often use edit distance (for example Levenshtein distance), TF-IDF over character n-grams, or word embeddings to capture meaning and to match words across strings.
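To make the edit distance idea concrete, here is a minimal sketch of Levenshtein distance in plain Python. The function names `levenshtein` and `similarity` are made up for this illustration, and scaling the distance by the longer string's length is just one common way to get a 0 to 1 score, not necessarily what any particular library does.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("happy", "hippy"))
print(similarity("happy", "hippy"))
```

A single substitution separates "happy" from "hippy", so the two strings score as close neighbours even though they mean different things. That is the main limitation of purely character-based methods.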

PolyFuzz brings these methods together in one library. Let's move step by step. There are several ways to install PolyFuzz, depending on which extras you need:

# Install the base dependencies
pip install polyfuzz
# Speed up the cosine similarity comparison and decrease memory usage
pip install "polyfuzz[fast]"
# Make use of the transformers
pip install "polyfuzz[flair]"
# Install all additional dependencies
pip install "polyfuzz[all]"

How to make it work?

Suppose you have two lists of strings: [happily, happy, hippy, holi, holiday, holidays, cool, school, fool] and [happy, holiday, schools]. Say we want to measure how similar they are using edit distance. This is how we can do it with PolyFuzz.

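from polyfuzz import PolyFuzz

# The two lists from the text (assumed to be what the original gist loaded as config.list1 and config.list2)
from_list = ["happily", "happy", "hippy", "holi", "holiday", "holidays", "cool", "school", "fool"]
to_list = ["happy", "holiday", "schools"]

model = PolyFuzz("EditDistance")  # edit-distance matcher
model.match(from_list, to_list)   # match each string in from_list against to_list
print(model.get_matches())        # table of matches with their similarity scores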

Running this gives a result like the following:

[Image: match results. Generated by author.]
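If you want to see what the matching step boils down to, here is a rough sketch using only the standard library. It scores with `difflib.SequenceMatcher`, which is a stand-in and not the scorer PolyFuzz uses, so the numbers will not be the same as in the result above. The helper name `best_matches` is made up for this example.

```python
from difflib import SequenceMatcher

from_list = ["happily", "happy", "hippy", "holi", "holiday", "holidays", "cool", "school", "fool"]
to_list = ["happy", "holiday", "schools"]


def best_matches(sources, targets):
    """For every source string, return (source, best target, score)."""
    results = []
    for source in sources:
        # Score the source against every candidate and keep the highest score
        scored = [(SequenceMatcher(None, source, target).ratio(), target) for target in targets]
        score, target = max(scored)
        results.append((source, target, round(score, 3)))
    return results


for row in best_matches(from_list, to_list):
    print(row)
```

Every string gets a match, even a poor one, which is why the similarity score matters: you would normally discard matches below some threshold that suits your data.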

What other features does it offer?

PolyFuzz can also group and cluster matches. In the previous results, you can see that some strings could sensibly be grouped together. PolyFuzz gives you the ability to do so.

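from polyfuzz import PolyFuzz

# The two lists from the text (assumed to be what the original gist loaded as config.list1 and config.list2)
from_list = ["happily", "happy", "hippy", "holi", "holiday", "holidays", "cool", "school", "fool"]
to_list = ["happy", "holiday", "schools"]

model = PolyFuzz("EditDistance")
model.match(from_list, to_list)
model.group("EditDistance")   # group similar strings, again using edit distance
print(model.get_matches())    # the matches now include the group information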

You can see the results below, where the input strings are grouped together.

[Image: grouped match results. Generated by author.]
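One way to picture grouping is to link any two strings whose similarity clears a threshold and then treat every connected set of strings as one group. The sketch below does that with the standard library only. It illustrates the idea and is not PolyFuzz's internal implementation: the `difflib` scorer, the 0.75 threshold, and the name `group_strings` are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def group_strings(strings, threshold=0.75):
    """Link strings whose similarity is >= threshold and return the groups."""
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent links to the representative of the group containing s
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # shorten the chain as we go
            s = parent[s]
        return s

    for a, b in combinations(strings, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in strings:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


words = ["happily", "happy", "hippy", "holi", "holiday", "holidays", "cool", "school", "fool"]
for group in group_strings(words):
    print(group)
```

Because links are transitive here, two strings can land in the same group without being directly similar, as long as a third string connects them. Raising the threshold gives smaller, tighter groups.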

The strings can also be put together in clusters, which PolyFuzz lets you do with very little effort.

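from polyfuzz import PolyFuzz

# The two lists from the text (assumed to be what the original gist loaded as config.list1 and config.list2)
from_list = ["happily", "happy", "hippy", "holi", "holiday", "holidays", "cool", "school", "fool"]
to_list = ["happy", "holiday", "schools"]

model = PolyFuzz("EditDistance")
model.match(from_list, to_list)
model.group("EditDistance")    # grouping must run before clusters are available
print(model.get_clusters())    # the clusters of strings that were grouped together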

You can see the clusters below, in which some strings are grouped together.

[Image: clusters of grouped strings. Generated by author.]

PolyFuzz also comes with several models already implemented: RapidFuzz, EditDistance, TF-IDF, FastText and GloVe, and 🤗 Transformers.

You can pick whichever of these models fits your requirements for string matching, grouping, and clustering.
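To give a feel for how the TF-IDF approach on character n-grams differs from edit distance, here is a small standard-library sketch. It is not PolyFuzz's code: the trigram size, the smoothed IDF formula, and the helper names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)] or [text]


def tfidf_vectors(strings, n=3):
    """Build one sparse TF-IDF vector (a dict) per string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    # Smoothed IDF: n-grams shared by many strings get a lower weight
    idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


words = ["holiday", "holidays", "school"]
vectors = tfidf_vectors(words)
print(round(cosine(vectors[0], vectors[1]), 3))  # holiday vs holidays
print(round(cosine(vectors[0], vectors[2]), 3))  # holiday vs school
```

Strings that share most of their trigrams score high, while strings with no trigram in common score zero. Since each string becomes a vector, the comparison can be done with matrix operations, which is what makes this approach attractive for large lists.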

Check out the repo linked below to see how you can use them and put them to work for your own needs.


MaartenGr/PolyFuzz

PolyFuzz performs fuzzy string matching, string grouping, and contains extensive evaluation functions. PolyFuzz is…

github.com

