Effective Methods for Assessing String Similarity in Python
Chapter 1: Introduction to String Similarity
In a recent project, I needed to match surnames in a dataset that contains potential typographical errors. The goal is to find and correct the wrong entries. To do this, I compute the similarity between surnames and group those that exceed a chosen similarity threshold. This raises the question: which metrics can measure that similarity effectively?
Here, I will introduce three distinct metrics for gauging the similarity between strings, specifically surnames:
- Levenshtein Distance
- Jaro Similarity
- American Soundex
As a practical example, we will analyze a dataset of 30 fabricated surnames, generated with a ChatGPT prompt, five of which contain intentional typos. For further details on how the dataset was constructed, refer to the linked chat.
To implement surname matching, we will utilize the jellyfish library in Python, which can be installed using the command pip install jellyfish. For additional information regarding this library, please visit its official website.
Before diving into the similarity metrics, let’s take a look at our dataset.
Surnames Dataset
The dataset comprises the following 30 surnames:
- Smith
- Johnson
- Williams
- Brown
- Jones
- Garcia
- Davis
- Rodriguez
- Martinez
- Taylor
- Andersson
- White
- Wilson
- Thomas
- Lee
- Hall
- Harris
- Perez
- Jackson
- Martin
- Thompson
- Murphy
- Turner
- Baker
- Kim
- Johnsonn
- Daviis
- Rodriguuez
- Martinezz
- Tayylor
The last five surnames feature typographical errors. To load the dataset as a Pandas DataFrame, use the following code:
import pandas as pd
# surnames.csv is assumed to hold the names in a single column called 'surname'
df = pd.read_csv('surnames.csv')
Now, let’s examine how our similarity metrics can help identify these errors, starting with Levenshtein Distance.
Levenshtein Distance
Levenshtein Distance is the minimum number of single-character edits required to transform one string into another. The allowed edits are substitutions, deletions, and insertions. For instance, changing "Pam" to "Sam" requires one substitution (P to S). We can compute the Levenshtein distance for each surname pair in our dataset using the jellyfish library, which provides the following functions:
- levenshtein_distance(s1, s2): Computes the standard Levenshtein distance.
- damerau_levenshtein_distance(s1, s2): Computes a variant that also counts the transposition of two adjacent characters as a single operation.
The closer the distance is to zero, the more similar the strings are, with identical strings yielding a distance of zero.
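To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance. The function name levenshtein is my own and is not part of jellyfish; in practice the library's levenshtein_distance is the one to use, and this version is only meant to show what is being counted.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s1 into s2 (illustrative sketch)."""
    # previous[j] is the distance between the prefix of s1 processed so far
    # and the first j characters of s2. Row 0 is "insert j characters".
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters of s1 into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("Pam", "Sam"))           # 1
print(levenshtein("Johnson", "Johnsonn"))  # 1
print(levenshtein("Martinez", "Martin"))   # 2
```

The last call shows why a threshold of 1 does not pair Martinez with Martin: two deletions separate them.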
Here’s how to calculate the Levenshtein distance for each surname pair:
from jellyfish import levenshtein_distance
# Compare every unordered pair of surnames exactly once
rows = []
for i in range(len(df)):
    surname1 = df['surname'][i]
    for j in range(i + 1, len(df)):
        surname2 = df['surname'][j]
        leven_distance = levenshtein_distance(surname1, surname2)
        rows.append({'surname1': surname1, 'surname2': surname2, 'levenshtein_distance': leven_distance})
# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
similarity_df = pd.DataFrame(rows, columns=['surname1', 'surname2', 'levenshtein_distance'])
# Save the results to a CSV file
similarity_df.to_csv('leven_similarity.csv', index=False)
Next, we will filter surnames with a Levenshtein distance of less than 2 and save the results:
candidates_levenshtein_df = similarity_df[similarity_df['levenshtein_distance'] <= 1].reset_index(drop=True)
candidates_levenshtein_df.to_csv('candidates_levenshtein.csv', index=False)
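This filter catches all five typos in our dataset, since each one differs from the correct surname by a single inserted character. On larger datasets, the same threshold can also pair distinct surnames that happen to differ by one letter.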
Now, let’s explore the next metric: Jaro Similarity.
Jaro Similarity
Jaro Similarity measures how alike two strings are based on the number of characters they share and the number of transpositions among those characters, relative to the lengths of both strings. Two characters count as a match only if they are equal and sit at the same or nearby positions. The Jaro similarity score ranges from 0 to 1, with higher values indicating greater similarity.
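As an illustration of how such a score can be computed, here is a pure-Python sketch of the textbook Jaro formula, together with the Winkler prefix adjustment. The names jaro and jaro_winkler, the prefix scale of 0.1, and the prefix cap of 4 are the usual textbook choices and are assumptions on my part; the jellyfish implementations may differ in edge cases.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no farther apart than this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


print(round(jaro("Martinez", "Martin"), 3))          # 0.917
print(round(jaro_winkler("Martinez", "Martin"), 3))  # 0.95
```

With these numbers, Martinez and Martin clear a 0.9 threshold on plain Jaro, and the shared prefix pushes the Winkler variant higher still.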
The jellyfish library offers two functions for calculating Jaro Similarity:
- jaro_similarity(s1, s2)
- jaro_winkler_similarity(s1, s2): A variant of Jaro similarity that gives extra weight to strings sharing a common prefix.
For our analysis, we will use the first function and keep the pairs with a Jaro similarity of at least 0.9:
from jellyfish import jaro_similarity
# Compare every unordered pair of surnames exactly once
rows = []
for i in range(len(df)):
    surname1 = df['surname'][i]
    for j in range(i + 1, len(df)):
        surname2 = df['surname'][j]
        jaro_sim = jaro_similarity(surname1, surname2)
        rows.append({'surname1': surname1, 'surname2': surname2, 'jaro_similarity': jaro_sim})
# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
similarity_df = pd.DataFrame(rows, columns=['surname1', 'surname2', 'jaro_similarity'])
# Save results to a CSV file
similarity_df.to_csv('jaro_similarity.csv', index=False)
# Keep only the pairs with a Jaro similarity of at least 0.9
candidates_jaro_df = similarity_df[similarity_df['jaro_similarity'] >= 0.9].reset_index(drop=True)
candidates_jaro_df.to_csv('candidates_jaro.csv', index=False)
This approach also included the surnames Martinez and Martin, which were not captured by the Levenshtein distance.
Next, we’ll examine the final metric: American Soundex.
American Soundex
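American Soundex encodes a string as a four-character phonetic code: the first letter, followed by three digits that represent the consonant sounds after it. Names that sound alike tend to share the same code. We can transform each surname into its phonetic representation and then apply either Levenshtein distance or Jaro similarity to compare these representations.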
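To show what such a phonetic code looks like, here is a pure-Python sketch of the commonly documented American Soundex rules. The name soundex_sketch is my own, and I am assuming, without having verified every edge case, that it agrees with jellyfish's soundex() on typical ASCII surnames.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no digit
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex_sketch(name: str) -> str:
    """First letter plus three digits, padded with zeros (illustrative)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    last = CODES.get(first, "")  # the first letter's own code counts too
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != last:
            digits.append(code)
        last = code  # a vowel resets this, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


for surname in ["Martin", "Martinez", "Johnson", "Johnsonn", "Lee"]:
    print(surname, soundex_sketch(surname))
# Martin M635
# Martinez M635
# Johnson J525
# Johnsonn J525
# Lee L000
```

Because the code is capped at four characters, Martin and Martinez collapse to the same value, which is worth keeping in mind when comparing codes rather than the surnames themselves.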
The jellyfish library includes the soundex() function for calculating the American Soundex. The following code illustrates how to calculate Jaro similarity for the phonetic representations:
from jellyfish import jaro_similarity, soundex
# Compare the Soundex codes of every unordered pair of surnames
rows = []
for i in range(len(df)):
    surname1 = df['surname'][i]
    for j in range(i + 1, len(df)):
        surname2 = df['surname'][j]
        jaro_sim = jaro_similarity(soundex(surname1), soundex(surname2))
        rows.append({'surname1': surname1, 'surname2': surname2, 'jaro_soundex_similarity': jaro_sim})
# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
similarity_df = pd.DataFrame(rows, columns=['surname1', 'surname2', 'jaro_soundex_similarity'])
# Save results to a CSV file
similarity_df.to_csv('jaro_soundex_similarity.csv', index=False)
# Keep only the pairs whose Soundex codes have a Jaro similarity of at least 0.9
candidates_jaro_soundex_df = similarity_df[similarity_df['jaro_soundex_similarity'] >= 0.9].reset_index(drop=True)
candidates_jaro_soundex_df.to_csv('candidates_jaro_soundex.csv', index=False)
This method produced more matches than the previous two, because distinct surnames that sound alike can map to the same code. Martin and Martinez, for example, both encode to M635.
In all discussed approaches, a key question arises: how can we determine if matched surnames represent distinct names or if one contains a typographical error? Strategies might include:
- Manual review for small match sets
- Downloading a comprehensive surname list and developing a procedure to verify the existence of surnames, only calculating similarity for non-existent entries.
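The second strategy can be sketched as follows. To keep the example free of dependencies, it swaps in difflib.get_close_matches from the standard library, which ranks candidates by SequenceMatcher ratio rather than by any of the three metrics above. KNOWN_SURNAMES is a tiny hypothetical stand-in for a downloaded reference list, and the cutoff of 0.85 is an arbitrary choice for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference list; a real one would hold many thousands of names
KNOWN_SURNAMES = ("Smith", "Johnson", "Davis", "Rodriguez",
                  "Martinez", "Martin", "Taylor")


def suggest_corrections(entries, known=KNOWN_SURNAMES, cutoff=0.85):
    """Return the closest known surname for each entry not in the list."""
    known_set = set(known)
    suggestions = {}
    for entry in entries:
        if entry in known_set:
            continue  # the surname exists, so no similarity check is needed
        # An empty list means nothing in the reference list is close enough
        suggestions[entry] = get_close_matches(entry, known, n=1, cutoff=cutoff)
    return suggestions


entries = ["Smith", "Johnsonn", "Daviis", "Martinezz", "Kowalski"]
for entry, matches in suggest_corrections(entries).items():
    print(entry, matches)
```

Entries with an empty suggestion list, such as Kowalski here, are either genuine surnames missing from the reference list or errors too severe to match, so they still need a manual look.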
The challenge of string matching is complex and varies based on your specific objectives. This article merely scratches the surface; it’s your turn to delve deeper!
Comparing the Three Metrics
- Levenshtein Distance: This metric is suitable for strings that may differ by one or two characters and is effective for spell-checking.
- Jaro Similarity: This metric accounts for string length and is better suited for identifying duplicates or record linkage.
- American Soundex: Though it cannot be used alone for string similarity, it can enhance other methods for tasks like detecting dictation errors or phonetic matching.
The following figure summarizes the appropriate applications for each metric.
Summary
Congratulations! You’ve learned how to compute string similarity using three metrics: Levenshtein distance, Jaro similarity, and American Soundex.
- Levenshtein distance is a non-negative integer.
- Jaro similarity ranges from 0 to 1.
- American Soundex provides a phonetic representation of strings.
You can access the complete code from this article on my GitHub repository, linked here.
Thank you for reading! If you have any questions or feedback, feel free to comment or connect with me on LinkedIn.
See you next time!