This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the data category.
Last Updated: 2025-01-18
I had hundreds of fields (law_disciplines, colleges, etc.) where users were previously allowed to enter whatever they wanted, leading to lots of "almost the same" strings, e.g. University of London vs London University.
Eventually I wanted to streamline this to enable better on-site filters. Doing this manually would have been too painful to consider, so I used a fuzzy match algorithm to give me an idea of closeness and automate the process (or as much of it as possible).
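To make "closeness" concrete, here is a minimal sketch of one common similarity measure, Dice's coefficient over character bigrams. It is an illustration only: the helper names `bigrams` and `dice_score` are made up for this example, and a real fuzzy match library may score strings differently.

```ruby
# Illustrative only: a hand-rolled similarity score, not any library's internals.

# Split a string into lowercase character bigrams, e.g. "abc" -> ["ab", "bc"].
def bigrams(string)
  string.downcase.each_char.each_cons(2).map(&:join)
end

# Dice's coefficient: 2 * shared bigrams / total bigrams, from 0.0 to 1.0.
def dice_score(a, b)
  a_pairs = bigrams(a)
  b_pairs = bigrams(b)
  return 0.0 if a_pairs.empty? || b_pairs.empty?

  # Count shared bigrams as a multiset so repeated pairs are not over-counted.
  remaining = b_pairs.tally
  shared = a_pairs.count do |pair|
    next false unless remaining.fetch(pair, 0) > 0
    remaining[pair] -= 1
    true
  end

  (2.0 * shared) / (a_pairs.size + b_pairs.size)
end

puts dice_score("University of London", "London University") # => 0.8
```

Word order barely matters to a bigram-based score, which is why the two spellings of the same university still come out as a close match.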
Here was the time-saving code (relying on a generic fuzzy match library):
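require "fuzzy_match" # assumption: the fuzzy_match gem, whose API matches the calls below

# Returns the closest entry in possibilities and flags weak matches for manual review.
def best_match(needle, possibilities)
  match, score = FuzzyMatch.new(possibilities).find_with_score(needle)
  # Guard against a nil score in case nothing matched at all.
  puts "Possible issue: #{match}:#{needle}" if score.nil? || score < 0.5
  match
end

# map (rather than each) so the old/new name hashes are actually collected.
renames = colleges.map { |college|
  {
    id: college.id,
    old_name: college.name,
    new_name: best_match(college.name, valid_colleges)
  }
}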
98% were perfectly matched, and the low scores indicated which records needed reviewing by hand.
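The code above only prints its warnings. If the score were kept alongside each row, the hand-review list could be separated out in one step. The following is a rough sketch under that assumption: the `Result` struct, `split_for_review`, and the sample rows (including their scores) are hypothetical, and the 0.5 cut-off simply mirrors the one used in best_match.

```ruby
# Hypothetical result rows; in practice the score would come from the fuzzy matcher.
Result = Struct.new(:old_name, :new_name, :score, keyword_init: true)

REVIEW_THRESHOLD = 0.5 # same cut-off as the warning in best_match

# Splits rows into [confident matches, rows to check by hand].
def split_for_review(results, threshold: REVIEW_THRESHOLD)
  # partition returns [rows where the block is truthy, rows where it is not];
  # a missing (nil) score lands in the review pile.
  results.partition { |row| row.score && row.score >= threshold }
end

# Made-up sample data for illustration.
results = [
  Result.new(old_name: "London University", new_name: "University of London", score: 0.8),
  Result.new(old_name: "LSE", new_name: "University of Leeds", score: 0.2)
]

accepted, needs_review = split_for_review(results)
needs_review.each { |row| puts "Check by hand: #{row.old_name} -> #{row.new_name}" }
```

Keeping the review rows in their own array makes it easy to export them for manual correction while applying the confident matches automatically.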