furialog

¶ 18 Ways to Think About Data Quality · 12 April 2011 essay/tech

[From a note to the public-lod@w3.org mailing list.]

Data "beauty" might be subjective, and the same data may have different applicability to different tasks, but there are a lot of obvious and straightforward ways of thinking about the quality of a dataset independent of the particular preferences of individual beholders. Here are just some of them:

1. Accuracy: Are the individual nodes that refer to factual information factually and lexically correct. Like, is Chicago spelled "Chigaco" or does the dataset say its population is 2.7?

2. Intelligibility: Are there human-readable labels on things, so you can tell what one is when you're looking at it? Is there a model, so you can tell what questions you can ask? If a thing has multiple labels (or a set of owl:sameAs things have multiple labels), do you know which (or if) one is canonical?

3. Referential Correspondence: If a set of data points represents some set of real-world referents, is there one and only one point per referent? If you have 9,780 data points representing cities, but 5 of them are "Chicago", "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and "Chicagoland", that's bad.

4. Completeness: Where you have data representing a clear finite set of referents, do you have them all? All the countries, all the states, all the NHL teams, etc? And if you have things related to these sets, are those projections complete? Populations of every country? Addresses of arenas of all the hockey teams?

5. Boundedness: Where you have data representing a clear finite set of referents, is it unpolluted by other things? E.g., can you get a list of current real countries, not mixed with former states or fictional empires or adminstrative subdivisions?

6. Typing: Do you really have properly typed nodes for things, or do you just have literals? The first president of the US was not "George Washington"^^xsd:string, it was a person whose name-renderings include "George Washington". Your ability to ask questions will be constrained or crippled if your data doesn't know the difference.

7. Modeling Correctness: Is the logical structure of the data properly represented? Graphs are relational databases without the crutch of "rows"; if you screw up the modeling, your queries will produce garbage.

8. Modeling Granularity: Did you capture enough of the data to actually make use of it. ":us :president :george_washington" isn't exactly wrong, but it's pretty limiting. Model presidencies, with their dates, and you've got much more powerful data.

9. Connectedness: If you're bringing together datasets that used to be separate, are the join points represented properly. Is the US from your country list the same as (or owl:sameAs) the US from your list of presidencies and the US from your list of world cities and their populations?

10. Isomorphism: If you're bring together datasets that used to be separate, are their models reconciled? Does an album contain songs, or does it contain tracks which are publications of recordings of songs, or something else? If each data point answers this question differently, even simple-seeming queries may be intractable.

11. Currency: Is the data up-to-date? As of when?

12. Model Uniformity: Are discretionary modeling decisions made the same way throughout the dataset, so that you don't have to ask many permutations of the same question to get different subsets of the answer? Nobody should have to worry whether some presidents and presidencies are asserted in only one direction and some only the other.

13. Attribution: If your data comes from multiple sources, or in multiple batches, can you tell which came from where? If a source becomes obsolete or discredited or corrupted, can you take its data out again?

14. History: If your data has been edited, can you tell how and when and by whom? Can you undo errors, both individual (no, those two people aren't the same, after all) and programmatic (those two datasets should have been aligned with different keys)?

15. Internal Consistency: Do the populations of your counties add up to the populations of your states? Do the substitutes going into your soccer matches balance the substitutes going out? Would you notice if errors were introduced?

16. Legality: Is the license under which the data can be used clearly defined, ideally in a machine readable way?

17. Sustainability: Is there is some credible basis or evidence for believing the data will be kept available and current? If it's your data, what commitment to its maintenance are you making?

18. Authority: Is the source of the data a credible authority on the subject? Did you find a list of NY Charter Schools, or the list?

[Revision of #12 and addition of 16-18 suggested by Dave Reynolds.]