An alternative to fuzzy matching techniques in NLP for enhancing performance
When working with natural language processing, we face a lot of scenarios that call for string matching techniques. Mostly we use fuzzy matching to find the closest match of a string in a database; in other cases we use it to catch typos, mistranslations, etc.
Besides fuzzy libraries, for string matching we often use edit distance, Levenshtein distance, TF-IDF on character-based n-grams, or word embeddings to understand the meaning of words and to match between strings.
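To make the edit distance idea concrete, here is a minimal pure-Python sketch of Levenshtein distance. The function names `levenshtein` and `similarity` are my own for illustration, and the normalization by the longer string's length is just one common convention, not necessarily what any particular library does.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance into [0, 1]; 1.0 means identical strings.
    Dividing by the longer length is an assumption made for this sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("holidays", "holiday"))
print(levenshtein("hippy", "happy"))
print(similarity("school", "schools"))
```

A lower distance (or a similarity closer to 1) means the two strings need fewer edits to become the same, which is why this works well for typos.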
Let’s go step by step. There are different ways to install PolyFuzz, which you can see below.
https://gist.github.com/raoofnaushad/d358d22d19c748b2bf566a0d25ce416d#file-polyfuzz-py
How to make it work?
Suppose you have two sets of strings, [happily, happy, hippy, holi, holiday, holidays, cool, school, fool] and [happy, holiday, schools], and you want to find how similar they are based on edit distance. This is how you can do it with PolyFuzz.
https://gist.github.com/raoofnaushad/846be92b9905715f1ffddcfcaece578e#file-simple_polyfuzz-py
Doing this gives you each input string paired with its closest match and a similarity score.
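If you want to see what this kind of matching does under the hood, the following is a rough stand-in that uses only the standard library. It is not PolyFuzz's implementation: the scorer is `difflib.SequenceMatcher.ratio()`, which is a similarity ratio rather than a true edit distance, and `score` and `best_matches` are hypothetical helper names.

```python
from difflib import SequenceMatcher

from_list = ["happily", "happy", "hippy", "holi", "holiday",
             "holidays", "cool", "school", "fool"]
to_list = ["happy", "holiday", "schools"]


def score(a: str, b: str) -> float:
    # Stand-in scorer (an assumption of this sketch), value in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def best_matches(sources, targets):
    """For every source string, pick the target with the highest score."""
    rows = []
    for source in sources:
        best = max(targets, key=lambda target: score(source, target))
        rows.append((source, best, round(score(source, best), 3)))
    return rows


for row in best_matches(from_list, to_list):
    print(row)  # (from, to, similarity)
```

Each row has the same shape as the result described above: the input string, its closest match, and a score.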
What other features does it offer?
It comes with grouping and clustering of matches. From the previous results you can see that some of the strings could be grouped together. PolyFuzz gives you the ability to do so.
https://gist.github.com/raoofnaushad/13eafe6c5d9200398cfb673bc5c1068c#file-group_polyfuzz-py
You can see the results below, where input strings are grouped together.
You can also put them together in clusters, which PolyFuzz lets you do with very little effort.
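The idea behind grouping can be sketched without any library: link every pair of strings whose similarity reaches a threshold, then collect the linked strings into groups. This is only an illustration under my own assumptions (a `difflib` scorer, a 0.75 threshold, and a small union-find), not the algorithm PolyFuzz uses internally.

```python
from difflib import SequenceMatcher
from itertools import combinations


def score(a: str, b: str) -> float:
    # Stand-in scorer (an assumption of this sketch), value in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def group_strings(strings, min_similarity=0.75):
    """Group strings that are connected by pairs scoring >= min_similarity."""
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent links to the group representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(strings, 2):
        if score(a, b) >= min_similarity:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in strings:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


words = ["happily", "happy", "hippy", "holi", "holiday",
         "holidays", "cool", "school", "fool"]
for group in group_strings(words):
    print(group)
```

Raising `min_similarity` gives you more, tighter groups; lowering it merges more strings together.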
https://gist.github.com/raoofnaushad/b5a3af39385a33ae48f1d08aca11016f#file-cluster_polyfuzz-py
You can see the cluster below in which some strings are grouped together.
PolyFuzz also has a few models implemented in it. These include RapidFuzz, EditDistance, TF-IDF, FastText and GloVe, and Transformers.
You can use these models based on your requirements for string matching, grouping and clustering.
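To give a feel for how a different model scores strings, here is a small pure-Python sketch of TF-IDF over character n-grams with cosine similarity. The helper names, the trigram size, the space padding, and the smoothed IDF formula are all assumptions made for illustration and may differ from what PolyFuzz's TF-IDF model does.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3):
    padded = f" {text} "  # pad so word boundaries form their own n-grams
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(strings, n: int = 3):
    """Return one sparse TF-IDF vector (a dict) per input string."""
    docs = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram appears.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(strings)
    return [
        {gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
         for gram, tf in doc.items()}
        for doc in docs
    ]


def cosine(u: dict, v: dict) -> float:
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


words = ["happy", "holiday", "schools", "holidays"]
vectors = dict(zip(words, tfidf_vectors(words)))
print(cosine(vectors["holiday"], vectors["holidays"]))  # share most trigrams
print(cosine(vectors["holiday"], vectors["happy"]))     # share no trigrams
```

Character n-gram models like this compare shared chunks of characters instead of counting edits, so they tend to scale better to long lists, while embedding-based models are the ones to reach for when you care about meaning rather than spelling.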
Check out the repo link below to see how you can use them and put them to work for you.