furialog

¶ Whole Data · 27 June 2008 essay/tech

So there's "the Semantic Web", which has lots of philosophical baggage, and then there's "Linked Data", which is supposed to represent unpacking only the practical stuff from the piles of baggage. A secret password for a cult within a cult. A cult-within-a-cult big enough to have a two-day conference dedicated to it, but not yet big enough that the final panel on the second day wasn't facing a mostly empty room.

And not quite yet a cult-within-a-cult with a coherent, useful agenda, I think. If you want to stage a grassroots revolution, you need to figure out four things:

- What is the big change you're going to bring about?
- What's the work that has to be done?
- Who has to do the work?
- What's in it for them?

Unfortunately, the current Linked Data agenda kind of answers these questions like this:

- The big change is that all "data records" will have universally unique names, which will all also be web addresses so you can look them (or maybe something about them, or maybe kind of both) up with a browser, and when you look them up they will point you to other things.

- The work to be done is that everybody must either convert their data into a list of individual subject-verb-objects assertions, including meta-assertions about those first assertions, or else at least construct the meta-assertions about the assertions and construct the unique names for the original assertions even if they don't actually exist. And all this should be done in a language (and data model) called RDF, most explanations of which begin by apologizing for it, which is understandable because it's the data-modeling equivalent of assembly language for programming, and pretty much nobody voluntarily works with nothing but levers when there's a Home Depot nearby.

- The people who have to do the work are the owners of data.

- The thing that's in it for them is that other people might mash up their data into something else, or discover it serendipitously, or effortlessly integrate it. Except Linked Data doesn't mean Open Linked Data, so you don't have to expose your data to the world, so in that case the benefit is, um, something something ontology something toolchain serendipitous LOOK!! OVER THERE!! IT'S A GIANT GLOBAL GRAPH EATING THE EMPIRE STATE BUILDING!!!!!! THIS IS THE WEB DONE RIGHT!!!!!

This is a pretty bad plan for a revolution. It's hard to understand why the big change is big, never mind why it's good; the work is artificially difficult and deliberately obscure; the people who would have to do it are basically unprepared; and the "benefits" are peripheral more or less by definition.

Let me try a different set of answers:

- The big change is that we can finally afford to store most kinds of information in a way that separates the logical data model from the storage mechanics, so that the only thing that constrains what you can do with your data is your data.

- The work to be done is that someone needs to build a node/arc-oriented database system, with accompanying data-model and query-language, that brings this idea up to the usability level of, say, a simple Excel spreadsheet or Google Base.

- The people who have to do the work are data-system developers, whether new companies or existing ones.

- The things that are in it for them are that the market of people with data is huge and growing, and the market for people with highly interconnected data is newly huge and frantically growing, and the existing RDBMS/SQL-based tools for analyzing and exploring and publishing that data are so awful that we will look back on them like we look back on COBOL and punch-cards. There is both money and human progress to be made.

Mine probably initially seems like a smaller revolution. My enemy is not the web. My revolution is not about URIs or linking or dereferencing or ontology subsumption or transitive closure. I just know that we've wasted human-centuries of time fighting against internal problems introduced by relational-database implementations, and built legions of crappy, inflexible, inhumane data systems because the limitations of our databases constrained what questions we allowed humans to even ask, never mind get answered. And I know that computers are now big enough and fast enough that for increasingly many datasets, the old ugliness is no longer even remotely necessary.

And thus the new ugliness, RDF and SPARQL, is a painfully tragic missed opportunity. I no more want to be buried in triples than I wanted to be boxed into tables. Where SQL and tables made some simple things relatively simple, some simple things very hard, and most hard things really hard, RDF and SPARQL make all things theoretically possible, but all specific trivial things barely feasible, any simple things too cumbersome to attempt without assistance, and hard things moot because the simple things raise so many epistemological issues that all actual work halts.

We can do better. Let's call this new movement Whole Data, like whole food. Let our data be what it is to people, not what it's easier to be to machines. I don't claim this is any kind of new science. Object-oriented databases aren't a new idea, and arguably a binary-relation node/arc data-model is more or less what the original idea of a "relational" database was before mundane practicality spent a few decades abusing it. But we can build it now. We can build it so that people can use it, easily and without having to learn some whole esoteric new trade.

And yes, a node/arc Whole Data database will lend itself incredibly well to publishing that data on the web. Arcs are links, and the web is made of links, and it's deliciously easy to turn links into links. Moreover, arcs are labeled links, and labeled links mean that machines can do something useful with them. But it is not the job of each data owner to rewrite their own database tools. It is not the job of each webmaster to figure out how to "semantically annotate" their pages. This is a revolution in data-modeling abstraction, and it will be fought by software designers and developers.

And linking disparate datasets together, which is supposed to be the whole point of Linked Data, and thus the whole point of this slightly sad and more than slightly confused conference I went to last week, is only barely a technical problem to begin with, and it's not at all clear to me that the technical agenda of this subcult actually makes non-trivial linking any easier. Most of the claimed benefits, even if you don't worry about whom they're benefits for, seem to me like imaginative hand-waving.

- Giving things unique IDs does make it easier to talk about them. But you can have IDs in table records, so that isn't new, and you can refer to things by instructions instead of IDs and most of the time it doesn't make a lot of difference, so IDs aren't even always necessary. And giving things web IDs is a totally orthogonal issue about publishing that isn't about the data itself at all, and makes no more general sense than saying that all children should be named with cell-phone numbers.

- You can "mesh together" RDF sets. Sometimes, modulo some obtuse URI issues. And if the sets' conceptual data models are compatible enough, you might be able to get something out of the combination. But this is more or less true of relational database tables, too. Exchanging data is not difficult. CSV, JSON, XML, whatever. Agreeing what the data should be is difficult, and agreeing what it should be in triples is no easier than agreeing what it should be in tables. (Maybe easier in theory, but definitely harder in current practice.) Integration is never "effortless". It's just that sometimes somebody else has already done the effort of reaching agreement, and sometimes they haven't.

- My adorable little FOAF file can link to yours, and we can both link ours to Tim Berners-Lee's. But this is a toy "data web". It doesn't scale. Currently it doesn't even scale technically, as the web is too slow for queries to be federated out across every foafy server on the planet, but solving the technical problem will only make it faster for us to see how useless the answers are for human reasons. Every level of remove you go through is a chance for context or trust or meaning to be lost. I don't know how you decide who goes in your file, and I certainly don't know how the people in your file decide who goes in theirs. If you find your way to me and then to Tim, it is unlikely to do you very much real good. On one hand, his office is about two blocks away from mine. On the other hand, the only reason it's becoming marginally more likely that he knows me is that I keep complaining about things he's working on. Small Pieces Loosely Joined works great when you have time to examine the joins and the pieces and figure out which ones are useful for some particular personal purpose. Carry them home, trim them, water them, love them, and you might get a garden for your efforts. Send a machine out to gather them all up at once and you'll get a mushy septic field with some fine heirloom tomatoes submerged somewhere in it.

- As long as our cults are small, sometimes we can find each other with nothing but our code words. But thinking that URI standards will yield pervasive serendipity is socially oblivious. Nothing I do in my FOAF file will ensure that you find me when you look at Tim's. Distributed linking is inherently asymmetric, and even if you crawl the entire web to construct all the missing inverses, the numbers are implacable. You cannot center yourself in a social network without cooperation. The space of associativity is curved: you can move from the obscure to the popular easily (indeed often inexorably), and along popularity contours with some effort. The imbalance of attention is a human phenomenon, not a technical one.

But my little revolution, at least, doesn't involving changing human nature. Distrust any that does. Whole Data is not magic, it's just healthier than the bleached, bromated version, and yummier than chewing raw semantic bran. All I'm trying to do is make information tools that allow humans to be less mechanical, and machines to be more helpful.

Here are some helpfulnesses:

- A node/arc database can handle multiple-value and no-value relationships just as easily as single-value relationships. Multiple-value relationships are 90% of the interesting structure of data, yet they're basically the bane of relational-database usability, and RDF only handles them if you don't care what order they go in or how easy it is to find out what they all are. No-value relationships are painful in relational databases, and inexpressible in RDF.

- In a node/arc database you can make data-model modifications without touching any node whose data is not actually involved. The model does not determine the storage. In relational databases the model is the storage. Unless you already consider yourself a "working ontologist" you don't want to know what it takes to express a model in (actually for, because you can't do it in) RDF. Real models change all the time, so this can't be a half-assed after-thought.

- In a node/arc database you can look at any data from the perspective of any other data. There's no "top", there's just "outward from here". In a relational database your tables and indexes limit your perspectives. In RDF there's no "top", but neither is there any "here"; all you can do is scan assertions until your patience runs out, looking for ones that say something else somethings this, or this somethings something else. Being able to stand anywhere and look in any direction is the difference between exploration and a zoo, and thus between discovery and watching an okapi crap.

- A node/arc data structure can be browsed. Relational tables and RDF can mostly only be searched, and browsing is a UI illusion built on top of searching. But browsing, wandering, connecting, following, compiling, discarding: these are how humans organize and comprehend information, and we deserve a database in which those are the cheapest and sturdiest building blocks, not the most expensive and fragile fabrications.

- In a node/arc database with a path-based query language, human browsing can be used as machine training, and human questions can be formulated for machine processing in structurally the same way they would be stated for human evaluation. We think in contexts and connections, not in subgraph matching or index inversion. We do not start every inquiry with a list of all the universe's truths. We start somewhere, and follow paths. We need to be able to send our questions down these paths, down all the paths at once, and have them come back with their own answers.

The project I'm working on has a database system like I describe, and a data-model that's more usable than RDF, and a path-based query-language I wrote that's better than SPARQL. I want this thing to exist. I think we need it to exist. I'd rather it already existed, and if somebody else comes out with one tomorrow that's at least as good as mine, I'll be totally thrilled to use that instead. Mine isn't going to come out tomorrow. I don't know when it will. I don't know who will be totally thrilled, the day mine comes out, to use that one instead of the one they were working on.

But I know these small improvements matter. Information matters, and understanding matters, and the tools for getting from one to the other matter. I think our ability to make use of information is part of how we are going to survive on Earth. So this, for the moment, is what I'm trying to do for us.