Whether one is a journalist using data for an investigation or a government publishing its budget, it is important to be able to assess that data’s quality.
It’s also true that if I’m a user of a data catalogue, it’s very useful to know something about a dataset before I try to download it: not just its quality, but also its characteristics, size and so on.
We have to have the data first before we can measure them. A data catalogue gives us references (URLs) to various datasets and data sources, but to what extent can we actually use them? Here are the properties of data resources that relate to their quality:
- Availability – can it be downloaded by a machine? Or does the server reply with “404 Not Found”? (This is item 1 of the 5 stars of openness.)
- Processability – is it in a convenient format, one that can be machine-processed into structured form, or is it in a closed proprietary format? (This is item 2 of the 5 stars of openness.)
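The two checks above can be automated from an HTTP status code and a `Content-Type` header. Below is a minimal sketch; the function name `assess_resource` and the set of “processable” media types are illustrative assumptions, not a standard:

```python
# Illustrative media types that can be machine-processed into structured form.
# Closed proprietary formats (e.g. an old binary spreadsheet) are simply absent.
MACHINE_PROCESSABLE = {"text/csv", "application/json", "application/xml"}

def assess_resource(status_code: int, content_type: str) -> dict:
    """Score a catalogued resource from its HTTP response metadata."""
    available = status_code == 200            # a 404 means the reference is dead
    processable = content_type in MACHINE_PROCESSABLE
    return {"available": available, "processable": processable}

print(assess_resource(200, "text/csv"))        # available and processable
print(assess_resource(404, "text/csv"))        # dead link
print(assess_resource(200, "application/pdf")) # available, but not processable
```

In practice the status code and content type would come from a HEAD request against the catalogued URL (e.g. with the `requests` library), so that availability can be checked without downloading the whole file.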
Data Quality and Quality Dimensions
Data quality is a composite measure of data properties along various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose.
What are the main dimensions of data quality?
- Completeness – the extent to which the expected attributes of the data are provided. Data do not have to be 100% complete; the dimension is measured against the user’s expectations and the data’s availability. Can be measured in an automated way.
- Accuracy – the extent to which the data reflect the real-world state. For example: the company name is a real company name, and the company identifier exists in the official register of companies. Can be measured in an automated way using various lists and mappings. (NB: data can be complete but not accurate.)
- Credibility – the extent to which the data are regarded as true and believable. It can vary from source to source, and even a single source can contain both automatically and manually entered data. This is not really measurable in an automated way.
- Timeliness (age of data) – the extent to which the data are sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF published today, but containing contracts from three months ago, would not be timely. This can be measured by comparing the publishing date (or scraping date) with the dates inside the data source.
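The automatable dimensions above can be sketched as simple functions. This is an assumed record layout (a dict of attributes) with made-up field names and a stand-in company register; a real check would use the actual official register:

```python
from datetime import date

EXPECTED_FIELDS = ["company_name", "company_id", "amount", "signed_on"]
KNOWN_COMPANY_IDS = {"CZ12345678", "CZ87654321"}  # stand-in for the official register

def completeness(record: dict) -> float:
    """Share of expected attributes that are present and non-empty."""
    filled = sum(1 for f in EXPECTED_FIELDS if record.get(f))
    return filled / len(EXPECTED_FIELDS)

def is_accurate(record: dict) -> bool:
    """Accuracy proxy: the company identifier exists in the reference list."""
    return record.get("company_id") in KNOWN_COMPANY_IDS

def timeliness_lag_days(published: date, newest_fact: date) -> int:
    """Days between the newest date inside the data and its publication date."""
    return (published - newest_fact).days

record = {"company_name": "ACME s.r.o.", "company_id": "CZ12345678",
          "amount": 1000, "signed_on": "2015-01-10"}
print(completeness(record))  # 1.0 – all expected attributes are filled
print(is_accurate(record))   # True – the identifier is in the register
print(timeliness_lag_days(date(2015, 4, 10), date(2015, 1, 10)))  # 90
```

A lag of 90 days, as in the PDF example above, would flag the source as not timely even though it was published today.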
Some other dimensions can also be measured, but they require multiple datasets describing the same things:
- Consistency – do the facts in multiple datasets match? (partly measurable)
- Integrity – can multiple datasets be correctly joined together? Are all references valid? (measurable in an automated way)
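An integrity check of the kind described above boils down to verifying that every cross-dataset reference resolves. A minimal sketch, assuming one dataset of companies and one of contracts that reference them (all names and identifiers invented for illustration):

```python
# Two datasets describing the same domain: contracts reference companies.
companies = {"CZ12345678": "ACME s.r.o.", "CZ87654321": "Globex a.s."}
contracts = [
    {"contract_id": 1, "company_id": "CZ12345678", "amount": 1000},
    {"contract_id": 2, "company_id": "CZ00000000", "amount": 500},  # dangling reference
]

def invalid_references(contracts: list, companies: dict) -> list:
    """Contracts whose company_id cannot be joined to the companies dataset."""
    return [c["contract_id"] for c in contracts if c["company_id"] not in companies]

print(invalid_references(contracts, companies))  # [2] – contract 2 cannot be joined
```

If this list is empty, the two datasets can be joined without loss; each dangling reference is an integrity violation that can be reported automatically.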