Thumb1

How to correct names and words with mistranslations and misspellings using python?

We do face lots of cases where we have to match a word with a lot of variations. This can be because of typos, pronunciation errors, nicknames, short forms, etc. This can be experienced in the case of matching names in the database with query names. When you need matching of input text to the database we can not expect an exact match always. NLP developers might have gone through scenarios to extract names as named entities and to match it for misspellings and mistranslations. For regular chatbots with text as inputs misspellings are the most common issues that we need to solve. This is usually solved with ordinary fuzzy matching techniques and libraries.

But there are cases we get transcribed sentences as input (input got as a result of speech to text conversation) and we have to deal with such errors. Here we are looking for techniques that we can use for searching input words or names to database entries.

Image for post
Photo by Surendran MP on Unsplash

Common Key Method — NYSIS

Nysis is one of the commonly used techniques for solving similar issues. NYSIS was originally used by what is now the New York Division of Criminal Justice Services to help identify people in their database. It produces better results.

import fuzzy
from common import names
for each in names:
print("Name =>> {} and Conversion =>> {}".format(each, fuzzy.nysiis(each)))
view raw nysis.py hosted with ❤ by GitHub

The results are pretty good. Check down below.

Image for post
Generated By Author

Common Key Method — Double Metaphone

Another library which got published in 1990. It is one of the complex architecture among this. It includes special rules for handling spelling inconsistencies and also looking at combination of consonants. This library was modified with more precision which is called Double Metaphone. Double metaphone further refines the matching by returning both a “primary” and “secondary” code for each name.

import fuzzy
from common import names
dmetaphone = fuzzy.DMetaphone(4)
for each in names:
print("Name =>> {} and Conversion =>> {}".format(each, dmetaphone(each)))
view raw dmetaphone.py hosted with ❤ by GitHub

The result contain two hash values. This can be enabled for making the precision better.

Image for post
Generated By Author

Common Key Method — Soundex

A soundex value is created by taking the first letter and converting the rest of the consonants (not vowels) from letters to digits from a basic lookup table. Even though it appeared to be a naive approach it is useful in many cases.

import fuzzy
from common import names
soundx = fuzzy.Soundex(4)
for each in names:
print("Name =>> {} and Conversion =>> {}".format(each, soundx(each)))
view raw soundex.py hosted with ❤ by GitHub

In the official documentation it is suggesting an output like this. But for me when I checked faced lots of issues. When I look for the error it is understood that soundex appears to be broken.

Image for post
Official Documentation

List Method

This method is used to list all the possible spelling variations of each name component. That is this can create almost all the variations of a given name (which is computationally expensive) then matching can be taken from that.

It comes up with computational cost and reduced speed. But whenever a user complains about a mismatch it is easy to add the new match to the given list.

Distance Method

Methods like Levenshtein distance, the Jaro–Winkler distance, and the Jaccard similarity coefficient can be used to look for the character by character distance between two names. Thus by understanding the error in terms of character is another option to check the matching.

Machine Learning Method

This is another approach suggested by the community that we can train a model that can intake two names and return as a similarity score between them. For doing this we have to train a model with similar and disimilar names so that model learns the pattern of such data and give us a good similarity score.

Word Embedding Method

Here we see that sometimes the names contain synonyms with it this is usually seen in organization names. In these cases, we can use word embeddings. Since word embeddings are numerical vector representations of a word’s semantic meaning. If two words or documents have similar vectors then we can consider them as semantically similar. This idea can be used to implement in name matching case.

Textual Similarity Search

In order to look for typos and errors in names textual similarity search is another option to check the accuracy of them. Jaro-Winkler DistanceHamming DistanceDamerau-Levenshtein Distance and also the regular Levenshtein Distance can be used for this.

from pyjarowinkler.distance import get_jaro_distance
import editdistance
from common import names
name1 = "kale"
print("Querying name ", name1)
for each in names:
print("Name =>> {} and jaro value =>> {} and levenshtein value =>> {}"\
.format(each, get_jaro_distance(name1, each), editdistance.eval(name1, each)))
view raw textsim.py hosted with ❤ by GitHub

We tried to query a name “kale” with the rest of names in the dict. Checkout the results in the below image.

Image for post
Generated By Author

Ensembled Approach

By reading and looking at those outputs you must have figured out that not one can give us a perfect recall and precision together. So we have to do some ensembled approaches for having good accuracy and precision.

  1. Combining common key word (Phonetics) score along with jarovakue or levenshtein score can be a good criterion.
  2. Also, we observed that nysis and Dmetaphone (first hash value) can also be ensembled and can be produced to obtain a better approach.

Checkout the github repo for the code.

LEAVE A REPLY

Please enter your comment!
Please enter your name here