Machine readable data needs to be human usable
This is perfectly machine readable data:
```
Month,Year,Region,Value
1,2016,Auckland,1000
1,2016,Wellington,2000
3,2017,Auckland,1500
4,2017,Wellington,
5,2017,Auckland,1.2
```
R, Python, Excel, Stata, etc. will all have no trouble reading such a CSV data file.
Any human trying to use this data will have many questions:
- What do the values represent and where did they come from?
- The value on the 4th row is missing. What could cause a missing value? Is it an error or has it been censored for some reason (confidentiality etc)?
- The value on the 5th row looks suspiciously small compared to other values. What is the typical range of values that we should expect? What could cause a very small value?
- Are the years calendar years, financial years, years ending June, or something else?
- The data appears to be monthly. Are the values totals or averages or some other statistic for each month? Or are they instantaneous observations at a particular point in time each month (eg the 15th of the month)?
- The data frequency seems to be irregular. What’s the reason for that? Are entire rows actually missing from this data?
- The data also appears to be for geographic areas. How exactly are these areas defined? Do they follow standard geographic definitions that would allow this data to be joined to other geographic data? Or if non-standard areas have been used, what are they?
So this data is perfectly readable by a machine, but it is not usable by a human. Since ultimately almost everything that machines do is decided by humans, for data to be useful it needs to be both machine readable and human usable. This applies to open data and to data used within organisations.
Human usability requires careful documentation of the data’s characteristics and quirks. This needs to be recorded separately from the data (ie as metadata and/or documentation) but be easily findable and accessible by all users of the data. It should be in a single place or file, not scattered about, and definitely not only stored in people’s heads. People should be able to find documentation and metadata in the same place that they find data itself. They shouldn’t have to go hunting for it elsewhere on a website or server.
Some metadata can also be machine readable. For example, whether the years are calendar or financial years could be recorded in a standard format, and the machine could “understand” this when reading the data. However, in almost all cases what the machine does with the data still has to be specified by a human, so ultimately a human needs to understand the metadata too. And some of the more important features of a dataset, such as the process by which it was collected, are best recorded as free text that will be difficult for a machine to “understand” anyway. In other words, you can’t write the human out of the equation (not yet …).
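As a sketch of what machine-readable metadata for the example dataset might look like (the field names below are invented for illustration; real options include Frictionless Data's `datapackage.json` or the W3C's CSV on the Web vocabulary):

```python
# A hypothetical machine-readable metadata record for the example dataset.
# Field names are illustrative only, not drawn from any particular standard.
metadata = {
    "year_type": "calendar",                # a machine can branch on this...
    "value_statistic": "monthly total",
    "missing_value_meaning": "suppressed for confidentiality",
    "collection_notes": (
        "Values are supplied by regional offices and may be revised; "
        "see the survey methodology document for details."
    ),                                      # ...but this free text still needs a human
}

# A program can act on the structured fields...
if metadata["year_type"] == "calendar":
    print("Treating Year as a calendar year")

# ...but a human still has to decide what "suppressed for confidentiality"
# or the collection notes imply for any given analysis.
```

The structured fields let software adapt automatically, but the free-text notes illustrate the point above: a human has to read and interpret them.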
Making data machine readable is largely a mechanical process of ensuring it conforms to appropriate standards. Making that data also human usable is more difficult. It requires thinking about and answering the types of questions listed above. If you are already quite familiar with a dataset, it may be hard to know which features are not obvious to a newcomer.
Making data human usable can be a tedious and boring process, but without this work, data is not valuable. It often seems like data providers devote too many resources to technical solutions for making their data machine readable, like complicated APIs, while devoting too few resources to metadata and documentation. In many cases, data would be more valuable and would get more use if the technology for sharing it was simpler but the documentation was better.