Effective Methods for Assessing String Similarity in Python

Chapter 1: Introduction to String Similarity

In a recent project, I needed to match surnames in a dataset that contained potential typographical errors. The objective was to pinpoint and rectify incorrect entries by assessing the similarity between surnames and grouping those that exceed a specified similarity threshold. This leads us to the question: what metrics can effectively measure similarity?

Here, I will introduce three distinct metrics for gauging the similarity between strings, specifically surnames:

Levenshtein Distance
Jaro Similarity
American Soundex

As a practical example, we will analyze a dataset of 30 fabricated surnames, generated with a ChatGPT prompt, in which five entries contain intentional typos. For further details on how the dataset was constructed, refer to the linked chat.

To implement surname matching, we will utilize the jellyfish library in Python, which can be installed using the command pip install jellyfish. For additional information regarding this library, please visit its official website.

Before diving into the similarity metrics, let’s take a look at our dataset.

Surnames Dataset

The dataset comprises the following 30 surnames:

Smith
Johnson
Williams
Brown
Jones
Garcia
Davis
Rodriguez
Martinez
Taylor
Andersson
White
Wilson
Thomas
Lee
Hall
Harris
Perez
Jackson
Martin
Thompson
Murphy
Turner
Baker
Kim
Johnsonn
Daviis
Rodriguuez
Martinezz
Tayylor

The last five surnames feature typographical errors. To load the dataset as a pandas DataFrame, use the following code (the examples below assume the CSV file has a column named surname):

import pandas as pd

df = pd.read_csv('surnames.csv')

Now, let’s examine how our similarity metrics can help identify these errors, starting with Levenshtein Distance.

Levenshtein Distance

Levenshtein distance is the minimum number of single-character edits required to transform one string into another. The allowed edits are substitutions, deletions, and insertions. For instance, changing "Pam" to "Sam" requires one substitution (P to S), so the distance is 1. We can compute the Levenshtein distance for each surname pair in our dataset using the jellyfish library, which provides the following functions:

levenshtein_distance(s1, s2): Computes the standard Levenshtein distance.
damerau_levenshtein_distance(s1, s2): Computes a variant that also counts the transposition of two adjacent characters as a single operation.

The closer the distance is to zero, the more similar the strings are, with identical strings yielding a distance of zero.
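To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. It is only an illustration of the idea, not the jellyfish implementation, and the function name levenshtein is my own choice for this example.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s1 into s2."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into an empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2 (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Pam", "Sam"))           # one substitution
print(levenshtein("Johnson", "Johnsonn"))  # one insertion
```

Keeping only two rows of the table at a time is enough here, because each cell depends only on the current and the previous row.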

Here’s how to calculate the Levenshtein distance for each surname pair:

from jellyfish import levenshtein_distance

# Compare every unordered pair of surnames exactly once
rows = []
for i in range(len(df)):
    surname1 = df['surname'][i]
    for j in range(i + 1, len(df)):
        surname2 = df['surname'][j]
        leven_distance = levenshtein_distance(surname1, surname2)
        rows.append({'surname1': surname1, 'surname2': surname2, 'levenshtein_distance': leven_distance})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
similarity_df = pd.DataFrame(rows, columns=['surname1', 'surname2', 'levenshtein_distance'])

# Save the results to a CSV file
similarity_df.to_csv('leven_similarity.csv', index=False)

Next, we will filter surnames with a Levenshtein distance of less than 2 and save the results:

candidates_levenshtein_df = similarity_df[similarity_df['levenshtein_distance'] <= 1].reset_index(drop=True)

candidates_levenshtein_df.to_csv('candidates_levenshtein.csv', index=False)

This metric identified all five of our typos. On larger or messier datasets, however, a threshold of 1 can produce false positives, because some genuinely distinct surnames also differ by a single character.

Now, let’s explore the next metric: Jaro Similarity.

Jaro Similarity

Jaro Similarity assesses the degree of similarity between two strings by considering their lengths and the number of common characters, placing greater emphasis on characters that occupy the same or similar positions. The Jaro similarity score ranges from 0 to 1, with higher values indicating greater similarity.
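The description above can be sketched in pure Python as follows. This is a simplified illustration of the standard Jaro definition, not the jellyfish source code, and the function name jaro is an assumption made for this example. Two characters count as common only if they are equal and lie within a small window of each other; common characters that appear in a different order count as transpositions.

```python
def jaro(s1, s2):
    """Jaro similarity between s1 and s2, a value between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that line up in a different order are half-transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro("Martinez", "Martin"), 4))  # 0.9167
```

For Martinez and Martin, six characters match with no transpositions, so the score is (6/8 + 6/6 + 6/6) / 3, roughly 0.9167.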

The jellyfish library offers two functions for calculating Jaro Similarity:

jaro_similarity(s1, s2)
jaro_winkler_similarity(s1, s2): A variant of Jaro similarity that gives a higher score to strings sharing a common prefix.
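As a rough illustration of what the Winkler variant adds, the sketch below applies the textbook prefix bonus to an already computed Jaro score. The function name winkler_boost and its parameters are hypothetical, chosen for this example; the scaling factor of 0.1 and the prefix cap of 4 are the values commonly quoted for the metric, and the jellyfish implementation may differ in its details.

```python
def winkler_boost(jaro_score, s1, s2, prefix_scale=0.1, max_prefix=4):
    """Apply the Winkler prefix bonus to an existing Jaro score.

    jaro_score is assumed to be the Jaro similarity of s1 and s2.
    """
    # Length of the common prefix, capped at max_prefix characters
    prefix_len = 0
    for a, b in zip(s1, s2):
        if a != b or prefix_len == max_prefix:
            break
        prefix_len += 1
    # Move the score part of the way towards 1.0, proportional to the shared prefix
    return jaro_score + prefix_len * prefix_scale * (1.0 - jaro_score)


# A Jaro score of 0.9 with a shared prefix of four or more characters becomes 0.94
print(round(winkler_boost(0.9, "Martinez", "Martin"), 2))  # 0.94
```

Because typos in surnames tend to occur later in the word rather than in the first few letters, this bonus generally pushes true matches higher than unrelated pairs.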

For our analysis, we will use the first function, jaro_similarity, and keep pairs with a Jaro similarity of at least 0.9:

from jellyfish import jaro_similarity

# Compare every unordered pair of surnames exactly once
rows = []
for i in range(len(df)):
    surname1 = df['surname'][i]
    for j in range(i + 1, len(df)):
        surname2 = df['surname'][j]
        jaro_sim = jaro_similarity(surname1, surname2)
        rows.append({'surname1': surname1, 'surname2': surname2, 'jaro_similarity': jaro_sim})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
similarity_df = pd.DataFrame(rows, columns=['surname1', 'surname2', 'jaro_similarity'])

# Save results to a CSV file
similarity_df.to_csv('jaro_similarity.csv', index=False)

candidates_jaro_df = similarity_df[similarity_df['jaro_similarity'] >= 0.9].reset_index(drop=True)

candidates_jaro_df.to_csv('candidates_jaro.csv', index=False)

This approach also flagged the pair Martinez and Martin, which the Levenshtein filter did not capture because their distance is 2.

Next, we’ll examine the final metric: American Soundex.

American Soundex

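American Soundex converts a string into a short phonetic code: the first letter of the name followed by three digits that encode the consonants after it, so that names that sound alike receive the same or similar codes. We can transform each surname into its Soundex code and then apply either Levenshtein distance or Jaro similarity to compare these codes.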
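To show how such a code is derived, here is a small pure-Python sketch of the usual American Soundex rules. It is an illustration written for this article rather than the jellyfish implementation, and the name soundex_code is my own. Similar-sounding consonants share a digit, vowels are dropped, repeated digits are collapsed, and the result is padded with zeros to four characters.

```python
def soundex_code(name):
    """Return the four-character American Soundex code for name."""
    # Consonant groups that share a digit; vowels, H, W and Y get no digit
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                # the first letter is kept as-is
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = codes.get(ch, "")
        if code and code != previous:
            result += code             # skip repeats of the same digit
        previous = code                # a vowel resets previous to ""
    return (result + "000")[:4]        # pad with zeros, keep four characters


print(soundex_code("Martinez"), soundex_code("Martin"))  # M635 M635
```

Martinez and Martin end up with the same code because only the first three consonant digits after the initial letter are kept.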

The jellyfish library includes the soundex() function for calculating the American Soundex. The following code illustrates how to calculate Jaro similarity for the phonetic representations:

from jellyfish import jaro_similarity, soundex

# Compare the Soundex codes of every unordered pair of surnames
rows = []
for i in range(len(df)):
    surname1 = df['surname'][i]
    for j in range(i + 1, len(df)):
        surname2 = df['surname'][j]
        jaro_sim = jaro_similarity(soundex(surname1), soundex(surname2))
        rows.append({'surname1': surname1, 'surname2': surname2, 'jaro_soundex_similarity': jaro_sim})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
similarity_df = pd.DataFrame(rows, columns=['surname1', 'surname2', 'jaro_soundex_similarity'])

# Save results to a CSV file
similarity_df.to_csv('jaro_soundex_similarity.csv', index=False)

candidates_jaro_df = similarity_df[similarity_df['jaro_soundex_similarity'] >= 0.9].reset_index(drop=True)

candidates_jaro_df.to_csv('candidates_jaro_soundex.csv', index=False)

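This method produced more matches than the previous two. That is expected: a Soundex code keeps only the first letter and three digits, so distinct surnames often receive identical or nearly identical codes.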

In all discussed approaches, a key question arises: how can we determine if matched surnames represent distinct names or if one contains a typographical error? Strategies might include:

Manually reviewing the matches when the match set is small.
Downloading a comprehensive surname list and verifying whether each surname exists in it, then calculating similarity only for the entries that do not.
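The second strategy could look roughly like the sketch below. Everything in it is an assumption made for illustration: the tiny known_surnames set stands in for a real reference list, the helper names are hypothetical, and a compact pure-Python Levenshtein function is included only to keep the example self-contained (in practice you could call jellyfish instead).

```python
def edit_distance(s1, s2):
    """Compact Levenshtein distance, included only to keep the sketch self-contained."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


def suggest_corrections(surnames, known_surnames, max_distance=1):
    """Map each surname missing from the reference list to its close known surnames."""
    suggestions = {}
    for surname in surnames:
        if surname in known_surnames:
            continue  # the name exists, so we treat it as correct
        close = sorted(k for k in known_surnames
                       if edit_distance(surname, k) <= max_distance)
        if close:
            suggestions[surname] = close
    return suggestions


# Hypothetical reference list; a real one would come from an external source
known_surnames = {"Johnson", "Davis", "Martinez", "Martin", "Lee"}
print(suggest_corrections(["Johnson", "Johnsonn", "Daviis", "Martin"], known_surnames))
# {'Johnsonn': ['Johnson'], 'Daviis': ['Davis']}
```

Because Martin is found in the reference list, it is never compared against Martinez, which avoids flagging two legitimate surnames as a typo pair.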

The challenge of string matching is complex and varies based on your specific objectives. This article merely scratches the surface; it’s your turn to delve deeper!

Comparing the Three Metrics

Levenshtein Distance: This metric is suitable for strings that may differ by one or two characters and is effective for spell-checking.
Jaro Similarity: This metric accounts for string length and is better suited for identifying duplicates or record linkage.
American Soundex: Though it cannot be used alone for string similarity, it can enhance other methods for tasks like detecting dictation errors or phonetic matching.

The following figure summarizes the appropriate applications for each metric.

Summary

Congratulations! You’ve learned how to compute string similarity using three metrics: Levenshtein distance, Jaro similarity, and American Soundex.

Levenshtein distance is a non-negative integer.
Jaro similarity ranges from 0 to 1.
American Soundex provides a phonetic representation of strings.

You can access the complete code from this article on my GitHub repository, linked here.

Thank you for reading! If you have any questions or feedback, feel free to comment or connect with me on LinkedIn.

See you next time!
