Search

Dedupe MySQL data using Levenshtein distance

In this blog I will go over the steps required to dedupe your data using the Levenshtein distance method using MySQL function. As most of the dedupe of data is based on the exact matches but some time we need to identify the similar address or due to some typo mistake while entering data by the end user so to find out the similarity between data I have used this method to find the similarity between data.

Creating the function in MySQL for Levenshtein distance score calculation between two strings or multiple words string. In this blog I am not going to discuss the details about the Levenshtein distance score in detail. But you can see its details in Wikipedia online for details.



Now I will take some sample data to demonstrate how to use the Levenshtein distance score to check the similarity between data.


Now we can dedupe the data using the MySQL query.


As you can see in this query levenshtein_ratio3 function is calculating the similarity score between two email strings as it is normal a person can have same email addresses with different domains. So here we can see the benefit of levenshtein function to calculate the sore of similarity. We can select the records which have similarity score of more then 70% to be considered as similar and we can merge those records if we want to be.


As you can see in the results similar email addresses are selected with different domains so we can review the similarity and make decision accordingly.


Conclusion:

As now a days it is common way users are getting registered on one site with different emails but using the levenshtein distance method we can check the similar users.

Thanks for using MySQL!

20 views0 comments

Recent Posts

See All

What are the future prospects of Java

According to a survey conducted, nearly 63% programmers mentioned that they will likely continue to work with Java along with coding with Python, SQL and HTML/CSS. In addition, the giants in the corpo

Deleting Duplicate Rows in MySQL

In this blog, you will study several ways to delete duplicate rows in MySQL. In this blog, I have shown you how to find duplicate values in a table. Once the duplicates rows are recognized, you may ne

Upload Data to MySQL tables using mysqlimport

Uploading quite a lot of rows of data from a text, csv, and excel file into a MySQL table is a repetitive task for sysadmins and DBAs who are handling MySQL database. This blog clarifies 4 applied exa