Quality of IATI Organisation Identifier

One of the big advantages of IATI activity data is that it is possible to find information about a specific organisation and answer questions like:

  • In which countries or sectors is an organisation active?
  • What roles does an organisation have?
  • How many activities is an organisation involved in?
  • What kind of activities is an organisation involved in?
  • With which organisations is an organisation working?

The problem

To merge information from different publishers you can use the organisation name, but that is (to put it mildly) risky. Organisations could be known under different names or spellings (“World Bank” and “The World Bank” are two different names, but I guess they are the same organisation; while two activities with “Freedom Forum” could be with two different organisations). Therefore the attribute “ref” is available in the IATI standard, containing the IATI organisation identifier. The precise format of this identifier is described here.

To be useful this identifier must be available and correct. To test this we harvested every organisation name with its ref from all published IATI files.

Results

  • In total we found 17987 different organisations.
  • We have 13224 organisations with an organisation identifier (which does not sound bad) and 4763 organisations without this identifier.
  • However, we have only 994 organisations with a valid organisation identifier, of which only 677 identifiers are unique (so we have different organisation names with the same identifier).
  • And we have 16993 (94.5%) organisations without a valid organisation identifier (of which 15867 organisation names are unique).

The solution

We are developing a service to verify organisation identifiers and to suggest corrections. We hope to have this available in January 2015.

 

Technical background:

  • An organisation is unique based on the combination of organisation name and organisation identifier. So if, say, organisation “world bank” is used in one activity with a correct identifier and in another activity without an identifier, these count as two different organisations.
  • The IATI files as known on 15 December 2014 were used for the results.
  • We only checked whether the organisation identifier was well-formed. We did not check whether the ‘base identifier’ was correct. An identifier counts as valid if it is:
    • on the codelist with organisation names of version 1.04
    • or of the format registryCode_someIdentifier
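
The well-formedness check described above could look roughly like the sketch below. It is not the code we used: the codelist excerpts are illustrative placeholders, the function name is made up, and it assumes that the registry code and the rest of the identifier are joined by a hyphen (as in GB-COH-1234567).

```python
# Illustrative excerpts only; the real codelists are much longer.
ORGANISATION_CODES = {"44000", "41114"}      # assumed entries of the organisation codelist
REGISTRY_AGENCY_CODES = {"GB-COH", "NL-KVK"}  # assumed entries of the registry agency codelist


def is_well_formed(ref, org_codes=ORGANISATION_CODES, agency_codes=REGISTRY_AGENCY_CODES):
    """Return True if ref is on the organisation codelist or starts with a registry code."""
    if ref is None:
        return False
    ref = ref.strip()
    if not ref:
        return False
    # Case 1: the identifier is listed on the organisation codelist.
    if ref in org_codes:
        return True
    # Case 2: registry code, separator, and a non-empty remainder.
    for agency in agency_codes:
        prefix = agency + "-"
        if ref.startswith(prefix) and len(ref) > len(prefix):
            return True
    return False
```

Note that this only tests the shape of the identifier. Whether the part after the registry code really exists at that registry is not checked, just as in our harvest.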

Combine two lists of registry agencies (fuzzy match)

The data


The IATI standard has a codelist of organisation registry agencies (such as Chambers of Commerce) used to create an organisation identifier. The organisation identifier should start with the code of one of these agencies. However, the codelist is not complete (yet).

 

opencorporates

On the OpenCorporates website it is possible to search for companies that are registered via several ‘registry agencies’. To use the information found via OpenCorporates, the original registry agencies should be on the IATI codelist. So we need to get a list of agencies from the OpenCorporates website that are not known to IATI.

The Problem

Both lists have an (ISO 3166) country code and a registry name. The same registry could have a (slightly?) different name on each list, but (about) the same name could exist in multiple countries. So we want to perform a fuzzy match on registry name, taking the country into account. We don’t need to compare the UK Companies House with the Companies House of Gibraltar.

Pentaho Data Integration has a fuzzy match step that matches two datasets on one field, using a choice of search algorithms. Unfortunately it is not possible to add another field to restrict the possible matches.

The solution


Instead of a one step approach we need three steps:

  1. merge the datasets based on the country code
  2. calculate the (Levenshtein) distance
  3. determine the best/correct matches
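
Outside PDI, the three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the transformation itself: the function names, the tuple layout and the `max_distance` threshold are assumptions made for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def best_matches(iati, opencorporates, max_distance=5):
    """Both inputs are lists of (country_code, registry_name) tuples."""
    # Step 1: group the IATI registries by country code.
    by_country = {}
    for country, name in iati:
        by_country.setdefault(country, []).append(name)

    matches = []
    for country, name in opencorporates:
        # Step 2: distance to every registry in the same country only.
        scored = [(levenshtein(name.lower(), candidate.lower()), candidate)
                  for candidate in by_country.get(country, [])]
        # Step 3: keep the closest candidate if it is close enough.
        if scored and min(scored)[0] <= max_distance:
            distance, candidate = min(scored)
            matches.append((country, name, candidate, distance))
        else:
            matches.append((country, name, None, None))
    return matches
```

Registries that come back with `None` are the candidates for being missing on the IATI codelist. The threshold still needs a manual check, because a short name can be within a few edits of an unrelated one.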

In this case this method was sufficient, as we only had a few registries in each country. But with other datasets this method is not optimal. I hope somebody will extend the PDI fuzzy match step. A Jira case has already been filed.

 

Pentaho Community Meeting 2014: Hackathon

Presenting our results at PCM14 (gpx output on screen)

This year the Pentaho Community Meeting 2014 (PCM14) was in Antwerpen and started with a (short) hackathon. Some company groups were formed. Together with Peter Fabricius I joined the people from Cipal (or they joined us). We did not get an assignment, so we had to come up with something nice ourselves. Our first thought was to do ‘something’ with Philae (the lander that had just started sending information from the comet “Churyumov-Gerasimenko”). We searched for some data, but we could not find anything useful.

So we decided to take a subject closer to home and wondered if we could map the locations of the PCM14 participants. We already had a Kettle transformation to get the location data (city, country) from Facebook pages, pass it to a geocoding service to get the latitude and longitude, and save it to a gpx file. It was based on some work Peter did for German rally teams to the Orient. We ‘only’ needed to adjust it to our needs, and we needed data to request the Facebook company pages of the participants.
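
For readers who have never seen a gpx file: it is plain XML with one waypoint per location. The last stage of such a pipeline could be sketched as below. This is not the Kettle transformation; the function name, the `creator` value and the tuple layout are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_gpx(locations):
    """locations: iterable of (name, latitude, longitude) tuples."""
    gpx = ET.Element("gpx", {
        "version": "1.1",
        "creator": "pcm14-sketch",  # free-text label, chosen for this example
        "xmlns": "http://www.topografix.com/GPX/1/1",
    })
    for name, lat, lon in locations:
        # Each location becomes a <wpt> element with lat/lon attributes.
        wpt = ET.SubElement(gpx, "wpt", {"lat": f"{lat:.6f}", "lon": f"{lon:.6f}"})
        ET.SubElement(wpt, "name").text = name
    return ET.tostring(gpx, encoding="unicode")
```

Calling `build_gpx([("Antwerpen", 51.2194, 4.4025)])` gives a string that can be written to a `.gpx` file and opened in a GPX viewer.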

From Bart we got a list of the email addresses of the participants (it has advantages to be part of a semi-Belgian team, and one of the team members was actually working on Bart’s machine ;-)). We were able to grab the domain name without country code using LibreOffice (sorry, we only had an hour to code) and tried to feed it to the Facebook Graph API. It is basically just an HTTP client step to get the info from e.g. http://graph.facebook.com/pentaho. This returns the company page in a nice JSON format (unfortunately(?) the Graph API does not return the location for normal ‘users’ with this method). One request broke the Kettle transformation (some strange error), so we removed that organization.
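
What we did by hand in the spreadsheet could be sketched in Python as follows. The function names are made up, and the rule used (take the label just before the last dot) is an assumption that breaks on addresses like name@example.co.uk.

```python
def company_slug(email):
    """Return the domain label before the top-level domain, e.g. 'pentaho'."""
    domain = email.rsplit("@", 1)[-1].lower()
    labels = domain.split(".")
    # 'pentaho.com' -> 'pentaho'; a bare host name is returned unchanged.
    return labels[-2] if len(labels) >= 2 else labels[0]


def graph_url(slug):
    """Build the page URL in the form used in this post."""
    return "http://graph.facebook.com/" + slug
```

So `graph_url(company_slug("someone@pentaho.com"))` yields the URL mentioned above. Whether a company page actually exists under that name is a guess, which probably explains part of the missing results.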

Facebook returned the country name, but the geocoding tool needed the 2-character country code. Because Peter had only German teams, he just added GE, but of course this was not an option for us. Fortunately we had a database with the country to ISO code translation. So we could feed the geocoding service with the right data, and this also returns some nice JSON.
After about 37 requests we got an error: no content is allowed before the prolog (or something like that). Damn, we had hit some rate limiting… So we needed to delay each request by a second to get all the results. On the first run we did not get all the results. Why? We don’t know…
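
The delay itself is trivial, but for completeness here is the idea as a small sketch. The `fetch` argument stands for whatever does the actual request; it and the other names are placeholders, not part of our transformation.

```python
import time


def fetch_all(items, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(item) for every item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(item))
    return results
```

The `sleep` parameter is only there so the pause can be replaced by a stub when trying the function out without waiting.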

In the meantime Peter and ‘uh, I forgot his name’ were busy trying to get the BI server installed and to prepare a dashboard with a map, which should read a Kettle transformation step and plot the participants. They also had some issues, but……

It was time for the presentations… At that point we did not have anything to show… No results of the Kettle transformation, no map… During the setup of one of the presentations I ran the Kettle transformation again and hooray, I got a GPX file. It contained 9 locations of the participants (we had about 55 different companies in our list). Since we did not have the map ready, we could not present it using the BI server. But also in this case ‘Google was our friend’. By uploading it to Google Drive and previewing the content using My GPX Reader (it took some clicks), we were able to show it to the public.

On my way to the podium I noticed Facebook also returns the latitude and longitude. So we did not need the detour via the geocoding service 🙁


After all presentations were made, the jury discussed the products and presentations and we won!!! (as did all the other teams). We got some nice Raspberry Pi B+ boards. In case you don’t know what it is: basically it is a hand-sized desktop computer with no case and a lot of connectors…

Thanks Bart and Matt for organizing this hackathon!!!

Edit: By request I added a sample input file. I also changed it to read csv: facebook_locations

Just a simple question, or how open source should work

This morning I asked a simple question: ‘Did you make progress with some adjustments to some nice nerdy stuff?’ ‘Yes he did’, said somebody else, but ‘his work needs review’, said another. ‘But why don’t you help us and test and review it?’

As I was already considering making the adjustments myself, and had already investigated it a little, I said ‘alright, I’ll try it’. The changes were made in the same file in which I was planning to make the adjustments, so that looked good. Unfortunately the adjustments were not correct. So I reported that back. And of course I got the reaction: ‘Well, fix it’…

I know the problem, I know the solution. Alright, I’ll fix it. That did not cost me that much time, but how do you test whether it works after you changed it? If you know how to do it, it is easy, but if you don’t… (basically it was running `ant`, removing the original solution, extracting a zip to the proper location and restarting the server). That took me some time. And yes, it seems to work. But how to send the nice nerdy adjustments to the original developers…

Also easy, if you know how to do it… (basically it was forking the repo of the person who adjusted the nerdy stuff, making adjustments, and sending a pull request to the original developer).

Hooray!!! I contributed to an open source project. And a while later: Wow, they are able to mobilize others to improve their program. Well done Pentaho, and good choice Jaap-André.

 

(originally posted at www.nivocer.com)

IATI identifier (IATI reporting: It’s all about identifiers)

The IATI identifier should be the key to the information

One of the identifiers in the IATI Standard is the activity identifier, called the IATI identifier. It identifies an activity. It is useful because multiple reporting organisations could report about the same activity. By referencing the activity of another reporting organisation, you could match these reports and determine, for instance, how the money flows.

According to the guidelines, it should:

  • exist
  • be unique
  • start with the identifier of the reporting organisation
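
These three rules are easy to express in code. The sketch below is not the query we ran on the imported data; the function name and the tuple layout are assumptions, and each rule is counted independently, so one activity can fail more than one check.

```python
from collections import Counter


def check_identifiers(activities):
    """activities: list of (reporting_org_ref, iati_identifier) tuples."""
    # How often does each non-empty identifier occur?
    counts = Counter(ident for _, ident in activities if ident)
    report = {"missing": 0, "not_unique": 0, "wrong_prefix": 0, "ok": 0}
    for org_ref, ident in activities:
        missing = not ident
        not_unique = bool(ident) and counts[ident] > 1
        # An empty reporting organisation ref can never be a valid prefix.
        wrong_prefix = bool(ident) and not (org_ref and ident.startswith(org_ref))
        report["missing"] += missing
        report["not_unique"] += not_unique
        report["wrong_prefix"] += wrong_prefix
        report["ok"] += not (missing or not_unique or wrong_prefix)
    return report
```

Only the `ok` count plus the number of activities failing at least one rule add up to the total; the three failure counts can overlap.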

How good or bad is the use of the IATI identifier

At this moment (actually yesterday), we were able to import 429,675 activities from 226 reporting organisations (some files are not valid, don’t exist or were not downloadable, see also http://bjwebb.github.io/IATI-Dashboard/index.html).

  • 4866 activities don’t have an identifier
  • 73125 activities don’t have a unique identifier
  • 206198 activities have an identifier which does not start with the reporting organisation identifier
  • Combined: 213843 activities (about 50%) have an IATI identifier which matches the guidelines (and 215832 don’t)

Remarkably, 4829 activities don’t have a reporting organisation identifier.

How many reporting organisations are good

170 of 226 reporting organisations have correct activity identifiers. So 56 organisations don’t.

However, the real results are worse: I assumed the reporting organisation identifier is valid, which is not always true. But that is for another post.

 

(earlier posted on nivocer.com)

IATI reporting: It’s all about identifiers

A while ago, I started investigating the IATI data. The International Aid Transparency Initiative (IATI) makes information about aid spending easier to access, use and understand. Different stakeholders (ministries of foreign affairs, NGOs, etc.) publish information about budget, spending, participants, results and more about their activities. It is useful to look at the data of an individual organisation, but if you aggregate the data of all organisations, you can answer many more (and more interesting?) questions like:

  • Which organisations are active in Mali?
  • How much money does organisation X receive (and spend)?
  • Which organisation is the most successful considering budget spending, results, etc.?

This is very important if you want to compare or aggregate results: organisations need to report in a consistent manner. The IATI Standard defines guidelines and code lists to support this. However, the data contains a lot of violations of these guidelines. In this series I want to report on these violations.