References → Fuzzy Join Recipe

Fuzzy matching is a useful recipe when two string columns between datasets don’t exactly match and have poor join quality. Joining on names (Nick vs. Nicholas), addresses (road vs rd), regions (NY vs New York) can be tricky. The fuzzy match will help identify matches based on how ‘close’ the strings appear to each other.

Configuration

ConfigurationDescription
Recipe NameA freeform name of how a user would like to name a recipe
InputSelect a previously constructed recipe to process
Target ColumnChoose the column from the input data that will be cleaned and matched
Primary Key ColumnsSelect the primary key columns for the data input
Lookup InputSelect the lookup data input for the fuzzy join operation
Reference ColumnChoose the column from the lookup data that will serve as the reference for matching
Maximum distance CutoffSpecify the maximum allowable distance for a match. Values range from 0 to 1000.
Include distance as a columnToggle to include or exclude the Distance column in the output. This can be helpful to view distance before removing rows with a filter.

Levenshtein distance

Levenshtein distance is the primary method of determining the similarity distance between two strings. Simply put, the distance between the two words is the minimum number of character edits between two strings. These distances are calculated by counting the inserts, substitutions, and deletes from a string.

Example: Let’s compare the similarity of the words “Margaret” and “Maggie”, where “Margaret” is the source word.

Similar letters: M-A-G-E are both shared between words Substitute letters (2): G (is substituted for R), I (substituted for the second G) Removed letters (2) : R and T

Since two letters were substituted and two were removed, the distance between these two words is four. Since the original word was eight characters and four required editing, it’s considered 50% similar. Use this calculator to test other strings.

Best Practices

  • When it comes to fuzzy matching, there is no ‘perfect’ answer. It’s an iterative art to find which threshold will join records that should be joined while leaving less similar records unjoined. Play with the maximum cutoff distance until you are happy with the result.
  • Avoid setting maximum distance to a high value. High values could result in all records cross joining, and as a result, create large and long-running materialized views.