Day of sustainable eggs

Today is both the Day of Sustainability and International Egg Day (*). I thought it would be fun to combine the two. An interesting question is then: how many eggs are produced sustainably, and what is the trend? Finding data turned out to be disappointing. CBS (Statistics Netherlands) only publishes the number of farms and the number of animals, and only distinguishes between total and organic. So it does not publish separate figures for, say, free-range, barn and cage eggs. The Productschap Vee, Vlees en Eieren (the Dutch product board for livestock, meat and eggs) did publish an overview, but the board was abolished as of 1 January 2015, so no recent figures can be found there either. Many other 'egg' organisations are also hard to find online or have no data.

The best data I found is therefore the number of organic laying hens as a share of all laying hens for the period 2011-2014. The figure below shows that this percentage is slightly more than 2%. A small increase seems to have been achieved in 2012. If I could have spent more time, I might also have found information on free-range and barn eggs. We would then have a better picture of the production (and with it the consumption) of sustainable eggs.

Figure: Day of Sustainability: percentage of organic laying hens in the Netherlands

*) In the coming period I will more often write a post inspired by a 'Day of...'. Almost every day is some kind of special day. A nice overview can be found at: http://www.fijnedagvan.nl/. I will pick a number of them to write a data-based post about, always starting from a question.

Quality of IATI Organisation Identifier

One of the big advantages of IATI activity data is that it is possible to find information about a specific organisation. It lets you answer questions like:

  • In which countries or sectors is an organisation active?
  • What roles does an organisation have?
  • How many activities is an organisation involved in?
  • What kind of activities is an organisation involved in?
  • With which organisations is an organisation working?

The problem

To merge information from different publishers you can use the organisation name, but that is (to put it mildly) risky. Organisations can be known under different names or spellings: “World Bank” and “The World Bank” are two different names, but I guess they are the same organisation, while two activities with “Freedom Forum” could refer to two different organisations. Therefore the attribute “ref” is available in the IATI standard, containing the IATI organisation identifier. The precise format of this identifier is described here.

To be useful, this identifier must be available and correct. To test this, we harvested every organisation name with its refs from all published IATI files.
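A minimal sketch of such a harvest for a single file, not the code used for this post. It assumes IATI 1.0x-style files, where the organisation name is the text of the `reporting-org` and `participating-org` elements (in 2.x the name sits in a nested `narrative` element). The sample XML, the choice of elements and the function name `harvest_organisations` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real IATI files contain many more elements and attributes.
SAMPLE = """
<iati-activities version="1.04">
  <iati-activity>
    <reporting-org ref="NL-KVK-00000000">Example Foundation</reporting-org>
    <participating-org ref="44000" role="Funding">World Bank</participating-org>
    <participating-org role="Implementing">The World Bank</participating-org>
  </iati-activity>
  <iati-activity>
    <reporting-org ref="NL-KVK-00000000">Example Foundation</reporting-org>
    <participating-org role="Implementing">Freedom Forum</participating-org>
  </iati-activity>
</iati-activities>
"""

# Assumption: only these two elements are harvested.
ORG_TAGS = ("reporting-org", "participating-org")


def harvest_organisations(xml_text):
    """Return the set of distinct (name, ref) pairs found in one IATI file."""
    root = ET.fromstring(xml_text)
    found = set()
    for tag in ORG_TAGS:
        for element in root.iter(tag):
            name = (element.text or "").strip()
            ref = (element.get("ref") or "").strip()  # empty string if missing
            found.add((name, ref))
    return found


organisations = harvest_organisations(SAMPLE)
with_ref = {org for org in organisations if org[1]}
without_ref = organisations - with_ref
print(len(organisations), len(with_ref), len(without_ref))
```

Keeping (name, ref) pairs in a set means the same name with and without an identifier counts as two organisations, which is the counting rule used in this post.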

Results

  • In total we found 17987 different organisations.
  • 13224 organisations have an organisation identifier (which does not sound bad) and 4763 organisations have none.
  • However, only 994 organisations have a valid organisation identifier. Of these identifiers, 677 are unique, so some identifiers are used with several different organisation names.
  • That leaves 16993 organisations (94.5%) without a valid organisation identifier, of which 15867 organisation names are unique.

The solution

We are developing a service to verify organisation identifiers and to suggest correct ones. We hope to have this available in January 2015.

 

Technical background:

  • An organisation is unique based on the combination of organisation name and organisation identifier. So if, say, the organisation “world bank” is used in one activity with a correct identifier and in another activity without an identifier, it counts as two different organisations.
  • The results are based on the IATI files that were known on 15 December 2014.
  • We only checked whether the organisation identifier was well-formed. We did not check whether the ‘base identifier’ was correct. An identifier is considered valid if it is:
    • on the codelist with organisation names of version 1.04
    • or of the format registryCode_someIdentifier
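The check described above could be sketched as follows. This is an illustration under assumptions, not the script used for the numbers in this post: the two code sets hold only a few illustrative values instead of the full codelists, and the separator between registry code and identifier is a parameter, defaulting to the hyphen seen in identifiers such as `NL-KVK-...`.

```python
# Illustrative values only; the real codelists are much longer.
KNOWN_ORGANISATION_CODES = {"44000"}
REGISTRY_CODES = {"GB-COH", "NL-KVK"}


def is_well_formed(ref, organisation_codes=KNOWN_ORGANISATION_CODES,
                   registry_codes=REGISTRY_CODES, separator="-"):
    """Check the form of an identifier, not whether the base identifier exists."""
    ref = (ref or "").strip()
    if not ref:
        return False
    # Rule 1: the identifier is on the organisation codelist.
    if ref in organisation_codes:
        return True
    # Rule 2: a known registry code, the separator, and a non-empty remainder.
    for code in registry_codes:
        prefix = code + separator
        if ref.startswith(prefix) and len(ref) > len(prefix):
            return True
    return False


for candidate in ("44000", "NL-KVK-12345678", "NL-KVK-", "World Bank", ""):
    print(repr(candidate), is_well_formed(candidate))
```

An identifier that passes this check can still point to a registration number that does not exist; that is the ‘base identifier’ check we did not do.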

Combine two lists of registry agencies (fuzzy match)

The data

The IATI standard has a codelist of organisation registry agencies (such as Chambers of Commerce) that is used to create an organisation identifier. The organisation identifier should start with the code of one of these agencies. However, the codelist is not complete (yet).

 

On the opencorporates website it is possible to search for companies that are registered with several ‘registry agencies’. To use the information found via opencorporates, the original registry agencies should be on the IATI codelist. So we need a list of the agencies on the opencorporates website that are not yet known to IATI.

The Problem

On both lists we have an (ISO 3166) country code and a registry name. The same registry could have a (slightly?) different name on the two lists, but (about) the same name could also exist in multiple countries. So we want to perform a fuzzy match on registry name, taking the country into account. We don’t need to compare the UK Companies House with the Companies House of Gibraltar.

Pentaho Data Integration has a fuzzy match step that matches two datasets on one field, using a choice of search algorithms. Unfortunately it is not possible to add a second field to restrict the possible matches.

The solution

[Figure: PDI fuzzy match]

Instead of a one step approach we need three steps:

  1. merge the datasets based on the countrycode
  2. calculate the (levenshtein) distance
  3. determine the best/correct matches
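Outside PDI, the three steps above can be sketched in plain Python. This is an illustration, not the transformation used for this post: the sample rows, the function names and the `max_distance` threshold are assumptions, and names are compared in lower case.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_registries(iati, opencorporates, max_distance=5):
    """Map (country, name) from opencorporates to (iati_code, distance) or None."""
    by_country = {}
    for country, code, name in iati:
        by_country.setdefault(country, []).append((code, name))

    result = {}
    for country, oc_name in opencorporates:
        # Step 1: only consider IATI registries with the same country code.
        candidates = by_country.get(country, [])
        # Step 2: calculate the distance to every candidate.
        scored = [(levenshtein(oc_name.lower(), name.lower()), code)
                  for code, name in candidates]
        # Step 3: keep the closest candidate if it is close enough.
        best = min(scored, default=None)
        if best is not None and best[0] <= max_distance:
            result[(country, oc_name)] = (best[1], best[0])
        else:
            result[(country, oc_name)] = None
    return result


# Hypothetical sample rows: (country, code, name) and (country, name).
iati_rows = [("GB", "GB-COH", "Companies House"),
             ("NL", "NL-KVK", "Kamer van Koophandel")]
oc_rows = [("GB", "UK Companies House"),
           ("GI", "Companies House"),
           ("NL", "Kamer van Koophandel")]

matches = match_registries(iati_rows, oc_rows)
for key, value in matches.items():
    print(key, value)
```

In this sample the Gibraltar registry gets no match, even though its name is identical to the UK one, because candidates are limited to the same country first. Entries that come back as `None` are the agencies that would be new to the IATI codelist.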

In this case this method was sufficient, because we only had a few registries in each country. With other datasets it is not optimal. I hope somebody will extend the PDI fuzzy match step. A Jira case has already been filed.

 

Just a simple question, or how open source should work

This morning I asked a simple question: ‘Did you make progress with some adjustments to some nice nerdy stuff?’ ‘Yes he did’, said somebody else, but ‘his work needs review’, said another. ‘But why don’t you help us and test and review it?’

As I was already considering making the adjustments myself, and had already investigated it a little, I said ‘alright, I’ll try it’. The changes were made in the same file in which I was planning to make the adjustments, so that looked good. Unfortunately the adjustments were not correct. So I reported it back. And of course I got the reaction: ‘Well, fix it’…

I know the problem, I know the solution. Alright, I’ll fix it. That did not cost me much time, but how do you test whether it works after you have changed it? If you know how to do it, it is easy, but if you don’t… (basically it was running `ant`, removing the original solution, extracting a zip to the proper location and restarting the server). That took me some time. And yes, it seems to work. But how to send the nice nerdy adjustments to the original developers…

Also easy if you know how to do it… (basically it was forking the repo of the person who adjusted the nerdy stuff, making the adjustments, and sending a pull request to the original developer).

Hooray!!! I contributed to an open source project. And a while later: wow, they are able to mobilize others to improve their program. Well done Pentaho, and good choice Jaap-André.

 

(originally posted at www.nivocer.com)