The data
The IATI standard has a codelist of organisation registry agencies (such as Chambers of Commerce) that is used to create an organisation identifier. The organisation identifier should start with the code of one of these agencies. However, the codelist is not complete (yet).
On the OpenCorporates website it is possible to search for companies that are registered with several ‘registry agencies’. To use the information found via OpenCorporates, the original registry agencies should be on the IATI codelist. So we need to get a list of the agencies on the OpenCorporates website that are not known to IATI.
The problem
Both lists have an (ISO 3166) country code and a registry name. The same registry could have a (slightly) different name on the two lists, while (about) the same name could exist in multiple countries. So we want to perform a fuzzy match on the registry name, taking the country into account: there is no need to compare the UK Companies House with the Companies House of Gibraltar.
Pentaho Data Integration has a fuzzy match step that matches two datasets on one field, with a choice of search algorithms. Unfortunately it is not possible to add another field to restrict the possible matches.
The solution
Instead of a one-step approach we need three steps:
- merge the datasets based on the country code
- calculate the (Levenshtein) distance
- determine the best/correct matches
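The three steps above can also be sketched outside PDI. The Python below is only an illustration of the idea, not the transformation used here: the function names, the sample rows and the `max_distance` threshold of 5 are all assumptions made for the example, and lower-casing the names before comparing is an extra choice of mine.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def match_registries(iati, opencorporates, max_distance=5):
    """Match (country, name) rows per country; return (matches, unmatched).

    max_distance is an assumed cut-off; in practice the best candidates
    would still be checked by hand.
    """
    # Step 1: "merge" on country code by grouping the IATI names per country.
    by_country = defaultdict(list)
    for country, name in iati:
        by_country[country].append(name)

    matches, unmatched = [], []
    for country, name in opencorporates:
        # Step 2: distance to every IATI registry in the same country only.
        scored = [
            (levenshtein(name.lower(), candidate.lower()), candidate)
            for candidate in by_country.get(country, [])
        ]
        # Step 3: keep the closest candidate (ties broken alphabetically).
        best = min(scored, default=None)
        if best is not None and best[0] <= max_distance:
            matches.append((country, name, best[1], best[0]))
        else:
            unmatched.append((country, name))
    return matches, unmatched


# Made-up sample rows, for illustration only.
iati_rows = [
    ("GB", "Companies House"),
    ("GI", "Companies House Gibraltar"),
]
oc_rows = [
    ("GB", "Companies House UK"),
    ("GI", "Companies House, Gibraltar"),
    ("NL", "Kamer van Koophandel"),
]

matches, unmatched = match_registries(iati_rows, oc_rows)
for row in matches:
    print("match:", row)
for row in unmatched:
    print("not known to IATI:", row)
```

The rows that end up in `unmatched` are the candidates for the list we are after: agencies on the OpenCorporates side with no sufficiently close name in the same country on the IATI side.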
In this case this method was sufficient, because we only had a few registries in each country. With other datasets this method is not optimal. I hope somebody will extend the PDI fuzzy match step. A Jira case has already been filed.