Tuesday, May 29, 2007

Convergence of Matching and Search

I’ve been looking at name and address matching software recently. This is a field I’ve stopped following closely because the action has moved to the higher planes of Customer Data Integration and Master Data Management. If there’s any trend in the matching field, it’s the development of generic string matching techniques that can be used on any data, not just names and addresses.

Like all matching engines, these return a score indicating how closely two strings resemble each other. In name/address matching, the scoring method has usually been pretty simple, at most a “string distance” calculation counting the number of differences between one string and another. All the real work went into the data parsing and standardization used to split a record into its components (first name, last name, house number, street name, etc.) and to remove variations such as alternate spellings and nicknames vs. formal names.
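To make the “string distance” idea concrete, here is a minimal sketch of one common version, Levenshtein edit distance, turned into a 0-to-1 score. The function names and the normalization by the longer string's length are my own illustrative choices, not how any particular product computes its scores.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score where 1.0 means identical strings.
    Dividing by the longer length is an assumption made for illustration."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("Smith", "Smyth"))  # 1
print(similarity("Smith", "Smyth"))
print(similarity("Peggy", "Margaret"))
```

A score like this handles typos well, but it says nothing about which part of the record a string came from, which is why the parsing step traditionally carried so much of the load.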

The newer approaches—I’m thinking specifically of Netrics, although Choicemaker and SAS DataFlux also qualify to some degree—apply more advanced matching to the strings themselves. This means they have less need for parsing and standardization.

Let’s acknowledge that string matching can never find the matches that depend on standardization against reference data. “Peggy” and “Margaret” are simply not similar strings, so the only way a computer can know one is a nickname for the other is if a reference database makes the connection. But advanced string matching can look for relationships among segments within strings, such as words in different sequences, that make parsing less essential. Since parsing is both computationally intensive and itself less than perfectly accurate, this offers definite advantages.

These advantages are particularly evident once you move beyond the highly structured world of names and addresses to other types of data. Here, external information is less likely to be available to help with standardization, so the ability to uncover subtle relationships between different strings becomes more important. Actually, “subtle” isn’t quite the right word here: any human would recognize that “Mary Smith Gallagher” and “Gallagher Mary” are probably the same person, while a simple matching algorithm would see almost no similarity.
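The “Gallagher Mary” example is easy to reproduce. The sketch below contrasts a deliberately naive position-by-position comparison with a score based on shared words regardless of order. Both functions are hypothetical illustrations of the general idea, not the method used by Netrics or any other vendor, and the word-level score (shared words divided by the size of the smaller word set) is just one of several reasonable choices.

```python
def positional_similarity(a: str, b: str) -> float:
    """Naive baseline: fraction of character positions that agree."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / longest


def token_overlap(a: str, b: str) -> float:
    """Order-insensitive score: shared words divided by the size of the
    smaller word set (the overlap coefficient)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


print(positional_similarity("Mary Smith Gallagher", "Gallagher Mary"))  # 0.05
print(token_overlap("Mary Smith Gallagher", "Gallagher Mary"))          # 1.0
```

No parsing into first name and last name was needed to get the second result. A real engine would presumably also tolerate misspellings within each word, which this sketch does not attempt.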

What’s interesting is that this sort of matching applies to the wider world of search at least as well as to the traditional world of data quality. Most discussions of search are warped by the gravitational field of Internet search engines, which leads them to focus on finding the most popular or authoritative content relating to a query string. But for other applications, such as text search, string similarity is a primary concern.

As with name and address matching, text search often contains a large component of parsing and standardization, which the text search people would label as “semantic analysis”. Again, this indisputably adds important value. But simple misspellings and partial information abound in search inputs, and often in the data being searched as well. An engine that cannot overcome such imperfections will be at best partially effective. This is where more sophisticated matching methods can help.
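As a rough illustration of how approximate matching helps with misspelled search inputs, the sketch below uses Python's standard difflib module to return every document containing a word that is close to the query term. The fuzzy_search function, the sample documents, and the 0.8 cutoff are all assumptions made for the example; a production search engine would index its data rather than scan it.

```python
import difflib


def fuzzy_search(query: str, documents: list[str], cutoff: float = 0.8) -> list[str]:
    """Return the documents containing at least one word whose similarity
    to the query meets the cutoff, so small typos still produce hits."""
    q = query.lower()
    hits = []
    for doc in documents:
        words = doc.lower().split()
        # get_close_matches returns the words scoring at or above the cutoff
        if difflib.get_close_matches(q, words, n=1, cutoff=cutoff):
            hits.append(doc)
    return hits


documents = ["Mary Smith Gallagher", "Margaret Jones", "Peggy Smith"]
print(fuzzy_search("galagher", documents))  # ['Mary Smith Gallagher']
```

An exact-match search for “galagher” would return nothing here, which is the “at best partially effective” outcome described above.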

In short, I’m proposing there is some useful opportunity for cross-fertilization between matching software and search vendors. Not perhaps the most brilliant insight ever, but worth mentioning nevertheless.
