In this blog I will go over the steps required to dedupe your data using the Levenshtein distance method using MySQL function. As most of the dedupe of data is based on the exact matches but some time we need to identify the similar address or due to some typo mistake while entering data by the end user so to find out the similarity between data I have used this method to find the similarity between data.
Creating the function in MySQL for Levenshtein distance score calculation between two strings or multiple words string. In this blog I am not going to discuss the details about the Levenshtein distance score in detail. But you can see its details in Wikipedia online for details.
Now I will take some sample data to demonstrate how to use the Levenshtein distance score to check the similarity between data.
Now we can dedupe the data using the MySQL query.
As you can see in this query levenshtein_ratio3 function is calculating the similarity score between two email strings as it is normal a person can have same email addresses with different domains. So here we can see the benefit of levenshtein function to calculate the sore of similarity. We can select the records which have similarity score of more then 70% to be considered as similar and we can merge those records if we want to be.
As you can see in the results similar email addresses are selected with different domains so we can review the similarity and make decision accordingly.
Conclusion:
As now a days it is common way users are getting registered on one site with different emails but using the levenshtein distance method we can check the similar users.
Thanks for using MySQL!
Comments