Normalization in preparation for fuzzy matching

May 24, 2018

When we compare entities stored as strings, we often need to consider several distinct pieces of each string. For example, suppose our data contains the following names:

Steven M Johnson

Steve Michael Johnson

When we, as humans, compare these entities, we don’t directly compare each string as a whole. Instead we compare substrings. “Steven” is compared to “Steve,” “M” to “Michael,” and “Johnson” to “Johnson.” We do this because we recognize that a name is not a single unit but is composed of smaller attributes: in this case the first, middle, and last names. It makes more sense to compare corresponding substrings than to compare the entire strings. Taken as a whole, “Steven M Johnson” may not be very similar (however we choose to measure similarity) to “Steve Michael Johnson,” but the individual components are not far apart.
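The difference between whole-string and component-wise comparison can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's difflib.SequenceMatcher as one possible similarity measure, and it assumes both names have the same number of whitespace-separated parts in the same order. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical (ignoring case)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_by_component(a: str, b: str) -> list[tuple[str, str, float]]:
    """Pair whitespace-separated tokens by position and score each pair."""
    return [(x, y, similarity(x, y)) for x, y in zip(a.split(), b.split())]


# Score for the two full strings
whole = similarity("Steven M Johnson", "Steve Michael Johnson")

# One score per component: first, middle, and last name
parts = compare_by_component("Steven M Johnson", "Steve Michael Johnson")

for left, right, score in parts:
    print(f"{left!r} vs {right!r}: {score:.2f}")
print(f"whole strings: {whole:.2f}")
```

The per-component scores make it clear where the two names agree (the last name matches exactly) and where they differ (the middle name is abbreviated), which a single whole-string score hides.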

This process of deconstructing entities into more atomic components is called normalization, and it is an essential step in string comparison and fuzzy matching. It allows us to make better-informed assessments when comparing strings, because we only compare substrings that are directly related to one another.

Automating the process of normalization

We humans naturally understand the semantic components of entities, but creating automated processes to do the same is a different matter. Luckily, PolyAnalyst has a solution. By pre-labeling data columns with semantic categories such as names, addresses, or dates, we can instruct PolyAnalyst to normalize the corresponding data with prewritten rules. When we wish to normalize data, all we need to do is tell PolyAnalyst what kind of data our columns contain, and it will do the rest.

An example of normalization

For example, suppose we have the following data:

Steve Michael Johnson

987 S Woodwillow Dr Apt 9

2/12/1982

These three columns represent different types of semantic information – names, addresses, and dates respectively. Once we label these columns appropriately and run normalization we would end up with something like this:

Name

First Name	Middle Name	Last Name
Steve	Michael	Johnson

Address

Number	Modifier	Street	Type	Modifier 2
987	South	Woodwillow	Drive	Apartment 9

Date

Month	Day	Year
2	12	1982
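To make the splitting step concrete, here is a minimal Python sketch of how a name and a date could be broken into components. It is not how PolyAnalyst implements its rules; the function names are hypothetical, the name splitter assumes a simple "first [middle ...] last" order, and the date splitter assumes a month/day/year format separated by slashes.

```python
def split_name(full_name: str) -> dict[str, str]:
    """Split a name into first, middle, and last parts (assumes 'first [middle ...] last' order)."""
    parts = full_name.split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])  # empty string when there is no middle name
    return {"first": first, "middle": middle, "last": last}


def split_date(text: str) -> dict[str, int]:
    """Split a date into numeric parts (assumes month/day/year with '/' separators)."""
    month, day, year = (int(piece) for piece in text.split("/"))
    return {"month": month, "day": day, "year": year}


name = split_name("Steve Michael Johnson")
date = split_date("2/12/1982")
print(name)
print(date)
```

Real data is messier than this (suffixes, multi-word surnames, day-first dates), which is why rule sets tied to a labeled semantic category are useful.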

Notice that “S,” “Dr,” and “Apt” have all been expanded. This is another function of normalization. Many different strings can represent the same information: “S” and “South” are semantically equivalent, as are “Dr” and “Drive” and “Apt” and “Apartment.” We should therefore choose one form, arbitrarily if need be, and normalize every variant to it. This lets later comparisons be precise.
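Expanding abbreviations to a single canonical form can be sketched as a dictionary lookup. The table below is a small hypothetical example, not PolyAnalyst's rule set, and the lookup ignores context.

```python
# Hypothetical lookup table mapping lowercase abbreviations to one canonical form.
ABBREVIATIONS = {
    "n": "North",
    "s": "South",
    "e": "East",
    "w": "West",
    "dr": "Drive",
    "st": "Street",
    "ave": "Avenue",
    "apt": "Apartment",
}


def expand_abbreviations(address: str) -> str:
    """Replace each known abbreviation with its canonical form; leave other tokens unchanged."""
    expanded = []
    for token in address.split():
        key = token.rstrip(".").lower()  # treat "Dr." and "dr" the same way
        expanded.append(ABBREVIATIONS.get(key, token))
    return " ".join(expanded)


normalized = expand_abbreviations("987 S Woodwillow Dr Apt 9")
print(normalized)
```

Because the lookup is context-free, it would expand “St” to “Street” even where it means “Saint,” so a production rule set needs to take a token's position in the address into account.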

Conclusion

Ultimately, normalization is a rather effortless task for an analyst yet incredibly important in the process of fuzzy matching. Ensuring that our data is broken down into atomic pieces for ease of comparison is crucial to obtaining accurate results.
