16 October 2008 to 9 June 2008
I gave another version of my Whole Data blog post as part of a panel called "Death of the Relational Database?" at the Web 3.0 conference today.  

My answer to the question in the panel-title is that the relational database was always mostly a coping mechanism for a time of relative scarcity, when our data volume and needs exceeded our storage and processing capacities. This era is not entirely past us, but with more memory, faster data-loading, faster processing and better programming tools, there are increasingly many datasets of larger and larger size for which we can now support qualitatively richer abstractions. Where we used to have to break data apart, and thus were constrained to interact with it in fragmented and broken ways, we can now increasingly afford to keep it whole.  

And here, then, to try to explain why that matters, is my current four-point reduction of the Whole Data manifesto:  

1. What logic allows, storage may not constrain.  

The forms of the logical data-model and its inquiries are by definition independent of any extra-logical storage mechanics. That is, a proper data system exists for the express purpose of sustaining the illusion that the logical data-model is real. This must also be true over time: logically-sensible changes in the logical data model cannot be constrained by storage mechanics.  

2. All relationships are lists.  

Not only must multiple-value relationships and no-value relationships be exactly as easily expressed, manipulated and inquired about as single-value relationships, but every fundamental data operation should thrive on multiplicity and sequence.  

3. There is no up or down, only onwards from here.  

Humans impose hierarchy and directionality on data, but different humans recognize different hierarchies in the same data at the same time. The system itself must be agnostic. No fact is inherently the property of another; they are all independent and structurally equal, and although pairs of relationships can be inverses of each other, there is no absolute “up” or “down”, or “forwards” or “backwards”. A database of albums, artists and years, for example, is exactly as much a database of years as it is of artists as it is of albums, not by the generous will of its schema-creator, but by the intrinsic nature of data. Most significantly, you must be able to start at any node and see everything else in this dataset to which, and from which, it is connected. In fact, turn this around and you get the essential practical definition of a “dataset”: it is a collection of data in which all internal connections are expressed.  

4. Human inquiry follows paths; machine inquiry follows all the paths at once.  

The reason to put data into computers is so that computers can answer questions that are meaningful to humans, but faster for computers. A data model and query language exist to communicate human needs to computers, not to communicate machine preferences to humans, and the responsiveness of a data system is its qualitative ability to answer questions with human motivations, not its quantitative throughput.  
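
To make points 2 and 3 a little more concrete, here is a minimal Python sketch of a toy node/arc store in which every relationship is a list and every arc also records its inverse. It is only an illustration under assumptions of my own: the `Dataset` class, its `link` and `follow` methods, and the convention of labeling each relationship by the type of node it points to are all hypothetical, and none of this describes how any real system stores its data.

```python
from collections import defaultdict


class Dataset:
    """Toy node/arc store (hypothetical): nodes are (type, name) tuples."""

    def __init__(self):
        # arcs[node][relationship] is always a list of nodes.
        self.arcs = defaultdict(lambda: defaultdict(list))

    def link(self, source, target):
        """Record an arc and its inverse.

        Assumption for this sketch: a relationship is labeled by the
        type of the node it points to, so the inverse of Album -> Artist
        is Artist -> Album.
        """
        self.arcs[source][target[0]].append(target)
        self.arcs[target][source[0]].append(source)

    def follow(self, node, relationship):
        """Return the (possibly empty) list of nodes reached from node."""
        return list(self.arcs.get(node, {}).get(relationship, []))


data = Dataset()
once = ("Album", "Once")
wishmaster = ("Album", "Wishmaster")
nightwish = ("Artist", "Nightwish")
data.link(once, nightwish)
data.link(wishmaster, nightwish)

print(data.follow(once, "Artist"))      # one value, still a list
print(data.follow(nightwish, "Album"))  # several values, in insertion order
print(data.follow(nightwish, "Label"))  # no value: an empty list
```

Single-value, multiple-value and no-value relationships all come back through the same call, and the Artist node can be used as a starting point even though only the Album side was ever linked explicitly.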
 

At the end of this talk I also did the first (very brief) public demo of Thread, the path-based query language I've written as part of the data-modeling (among other things) system I've been working on at ITA Software. I'm looking forward (to put it mildly) to being able to talk much more about this, but for now I just whirled through a fast series of mostly-unexplicated queries that at least gestured in the direction of the points in my manifesto:  

Album:Year=2000
albums from the year 2000; a one-to-one relationship in the old world, but in the new world the year 2000 is a real thing, too, making this many-to-one

Album:Artist=Nightwish
albums by Nightwish; a more familiar many-to-one relationship

Artist:Album~Dark
Artists who did an album with "Dark" in the name; a one-to-many relationship with the same syntax as above

Label:Album~Dark
Labels that put out albums with "Dark" in the name; same syntax again, but following a different path to Albums

Album:~Dark.Label
same question again, but following the path from Albums, instead of to them

Artist:(.Album.Label:=Spinefarm)
Artists who've had albums on Spinefarm; same pattern as above, but filtering on a compound relationship

Label:=Spinefarm.Album.Artist
or go the other way

Artist:(.Album:#5.Year:>2000)
Artists whose fifth album came out after 2000; multiples and missing values (one of the bands in the sample data had only four albums)

Artist:!(.Album:#5)
artists who don't have a fifth album; filtering on absence
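
For readers who want something runnable to poke at, here is a rough Python approximation of two of the queries above. It is not Thread and says nothing about how Thread is implemented: the function names are hypothetical, the data is just the three artists and album lists from the JSV post further down this page, and I am assuming that `~` means a case-sensitive substring match and that `#5` means the fifth item in a list.

```python
# Sample data: artist -> ordered list of albums (taken from the JSV example
# elsewhere on this page; Thread's own sample data may have differed).
ALBUMS_BY_ARTIST = {
    "Cradle of Filth": [
        "The Principle of Evil Made Flesh", "Dusk...and Her Embrace",
        "Cruelty and the Beast", "Midian", "Damnation and a Day",
        "Nymphetamine", "Thornography",
    ],
    "Nightwish": [
        "Angels Fall First", "Oceanborn", "Wishmaster",
        "Over the Hills and Far Away", "Century Child",
        "Bless the Child", "Once", "Dark Passion Play",
    ],
    "To/Die/For": ["All Eternity", "Epilogue", "Jaded", "IV"],
}


def artists_with_album_matching(text):
    """Roughly Artist:Album~text: artists with any album whose name contains text."""
    return [
        artist
        for artist, albums in ALBUMS_BY_ARTIST.items()
        if any(text in album for album in albums)
    ]


def artists_without_nth_album(n):
    """Roughly Artist:!(.Album:#n): artists whose album list has no nth item."""
    return [
        artist
        for artist, albums in ALBUMS_BY_ARTIST.items()
        if len(albums) < n
    ]


print(artists_with_album_matching("Dark"))  # ['Nightwish']
print(artists_without_nth_album(5))         # ['To/Die/For']
```

The thing to notice is how much scaffolding the Python needs for each new question, where the path syntax only changes by a few characters.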
 

I also mentioned in the talk, but didn't actually show, the hypothetical query "Who are all the living movie directors who ever directed Cary Grant?":  

Actor:=Cary Grant.Film.Director:!(.Date of Death)  

Go ahead and try to answer that in IMDB without a query language. For that matter, try to answer it in Freebase with a query language. (Or, even worse, try to answer it in Freebase yesterday, before I fixed a bunch of dates of death by hand...)  
 

There's also a short article up on semanticweb.com at the moment, which if nothing else gives a transcribed sense of how I explain these ideas differently in a phone interview than I do in writing.  
 

And there's space for discussion here.

When she thinks she's alone, my daughter practices air guitar.
1. I have $1,000, you have nothing, a latte costs $4.
2. I "leverage" my $1,000 to "lend" you "$100,000".
3. You "leverage" "your" "$100,000" to "lend" Enrique "$10,000,000".
4. Enrique "leverages" "his" "$10,000,000" to "lend" me "$1,000,000,000".
5. The three of us go to the coffeeshop and stand in line talking about our "money" until they panic about "inflation" and raise the "price" of a latte to "$4,000,000". I order one, Enrique orders two.
6. When we go to pay, for some reason Visa rejects both of our cards, despite the two of us having a combined net "worth" well in excess of "$2,000,000,000".
7. Enrique calls in his "loan" to me, "forcing" me to call in my "loan" to you.
8. You have nothing to pay me, and are angry about the whole thing because you took out a "$100,000" "loan" and still couldn't buy a single lousy latte. So you declare "bankruptcy". And then so do Enrique and I, and the coffeeshop for good measure.  

Question for Study:  

In the subsequent $2,032,200,000 government bailout, after I get my $1,000,100,000, Enrique gets his $1,010,000,000, you get your measly $10,100,000, and the coffeeshop gets their $12,000,000, how much does the coffeeshop charge us to reheat $12 of cold lattes?  
 

Or the real question, as always:  

Who profits from your fears?
I note, for the record, that I gave an adapted version of my recent Whole Data post as a talk at the Cambridge Semantic Web Gathering tonight.  

Sadly, usual host Tim Berners-Lee was away on vacation, so he didn't get to hear me reveal his secret superhero identity as Graphius.  
 

At the beginning of the evening we went around the room doing introductions, and each person was supposed to say their name, their affiliation, and then three words. Evidence suggests that a lot of Semantic Web people cannot count.  

My three words were Norwegian Death Metal. The relevance of this was not explicated in the talk, but if you're curious, there is actually a connection.  
 

Some additional notes, and space for discussion, here.
I like CSV. It addresses a common data-exchange need, and does it with admirably little conceptual or syntactic overhead. As formats go, it's totally unglamorous, but glamor is very definitely not an end in itself. If you need to exchange a simple table of simple data, one datum per row/column, CSV is a fine tool.  

I also like JSON. Actually, I really like JSON. It has the data-modeling virtues of my old Elemental XML proposal, with a cleaner semantic distinction between hashes and arrays. I would like to see the spec amended to allow keys to be strings or numbers, rather than only strings, but other than that I think the whole thing is a solid, self-consistent, useful, appropriately scaled solution to an impressively large set of data-exchange and serialization problems. Certainly anything you could model with CSV, you could put into JSON instead, and not at all vice versa.  

But the Law of Conservation of Complexity applies here, as in most things. Although JSON syntax is pretty simple, any particular JSON data model can be arbitrarily complex, and the modeling issues in data-exchange are always far more involved than the syntactic ones. The important simplicity of CSV is in constraining the data model to be a table.  

Flexibility, however, can be employed in moderation. It is possible, for example, to render a CSV model in JSON syntax. That is, to turn this:  

Artist,Albums
Cradle of Filth,7
Nightwish,8
To/Die/For,4
 

into this:  

[
["Artist","Albums"],
["Cradle of Filth","7"],
["Nightwish","8"],
["To\/Die\/For","4"]
]
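
The conversion itself is mechanical. As a minimal sketch using only Python's standard `csv` and `json` modules (the function name `csv_to_json_rows` is my own invention), something like this would do it:

```python
import csv
import io
import json

CSV_TEXT = """Artist,Albums
Cradle of Filth,7
Nightwish,8
To/Die/For,4
"""


def csv_to_json_rows(text):
    """Render CSV text as a JSON array of row arrays, one row per line."""
    rows = list(csv.reader(io.StringIO(text)))
    lines = [json.dumps(row, separators=(",", ":")) for row in rows]
    return "[\n" + ",\n".join(lines) + "\n]"


print(csv_to_json_rows(CSV_TEXT))
```

Python's `json` module writes `/` without the backslash, so the last row comes out as `["To/Die/For","4"]`. In JSON the `\/` escape and a bare `/` decode to the same string, so the two renderings are equivalent.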
 

The advantage of doing so will, I hope, become quickly obvious once you realize that JSON then makes it trivially easy to have cell values be arrays. If we want a list of artists, each with their list of albums, CSV forces us to refactor:  

Album,Artist
Cruelty and the Beast,Cradle of Filth
Damnation and a Day,Cradle of Filth
Dusk...and Her Embrace,Cradle of Filth
Midian,Cradle of Filth
Nymphetamine,Cradle of Filth
The Principle of Evil Made Flesh,Cradle of Filth
Thornography,Cradle of Filth
Angels Fall First,Nightwish
Bless the Child,Nightwish
Century Child,Nightwish
Dark Passion Play,Nightwish
Oceanborn,Nightwish
Once,Nightwish
Over the Hills and Far Away,Nightwish
Wishmaster,Nightwish
All Eternity,To/Die/For
Epilogue,To/Die/For
IV,To/Die/For
Jaded,To/Die/For
 

This is annoying at minimum, and unworkable if we also wanted to see other properties of artists. But in JSON, or this modeling reduction of JSON that we might as well call "JSV", we can simply say:  

[
["Artist","Album"],
["Cradle of Filth",["The Principle of Evil Made Flesh","Dusk...and Her Embrace","Cruelty and the Beast","Midian","Damnation and a Day","Nymphetamine","Thornography"]],
["Nightwish",["Angels Fall First","Oceanborn","Wishmaster","Over the Hills and Far Away","Century Child","Bless the Child","Once","Dark Passion Play"]],
["To\/Die\/For",["All Eternity","Epilogue","Jaded","IV"]]
]
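
Going from the refactored CSV back to this shape is a simple grouping step. Here is a hedged Python sketch, again using only the standard library; the names `csv_to_jsv` and `as_list` are hypothetical, and the input is an abbreviated version of the Album,Artist table above.

```python
import csv
import io
import json

# Abbreviated Album,Artist rows from the table above.
ALBUM_CSV = """Album,Artist
Midian,Cradle of Filth
Oceanborn,Nightwish
Once,Nightwish
IV,To/Die/For
"""


def csv_to_jsv(text):
    """Group one-album-per-row CSV into one row per artist with a list cell."""
    reader = csv.reader(io.StringIO(text))
    album_field, artist_field = next(reader)
    albums_by_artist = {}
    for album, artist in reader:
        albums_by_artist.setdefault(artist, []).append(album)
    rows = [[artist_field, album_field]]
    rows.extend([artist, albums] for artist, albums in albums_by_artist.items())
    return rows


def as_list(cell):
    """Read any JSV cell as a list, whether it holds one value or several."""
    return cell if isinstance(cell, list) else [cell]


jsv = csv_to_jsv(ALBUM_CSV)
print(json.dumps(jsv))
print(as_list(jsv[1][0]))  # a single-value cell, read as a one-item list
```

Grouping can only keep the albums in the order the CSV rows arrived, so a chronological list like the one above has to come from data that was chronological to begin with.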
 

JSV retains the modeling simplicity of CSV (a single row of "fields" defined at the top), but allows a cell to be either a single value or a list of them. This is still unglamorous, but to me it's a big gain in expressivity for essentially no technical cost. I suggest that if combining lists and tables isn't a big deal to you (and you're the sort of person to whom a data-serialization format could be a big deal), maybe you've allowed tabular single-mindedness to beat all the natural human list-thinking out of you. Time to reclaim your right to multiplicity. Time to reclaim lots of your rights, really. Yet another argument for more lists, everywhere around us.
1. Grand Magus: "Like the Oar Strikes the Water" (from Iron Will)  

Music pitched to carry across churning seas and urge on flagging longboats, and for suggesting that maybe I could have loved those thankless non-Ozzy/Dio/Gillan Sabbath voyages, too, if I could somehow separate them from where they weren't ever leaving or landing.  

2. Týr: "Fipan Fagra" (from Land)  

Maybe this should be beyond obvious, but one of the simplest ways to be bigger than the sum of your parts is to keep the parts from wasting so much of their energy differing. You can make good work out of one great idea and a few boring ones, but you can also sometimes make a great work out of one good idea and a relentless disregard for distractions.  

3. Gyöngyvér: "Halhatatlan ámok" (from Világok virága)  

And for eliminating distractions, don't underrate ignorance. You don't have to have the slightest idea what something says to have a beckoning vision of what it means. Something about the limits of anticipation, or how sometimes you honor what you allow to recede, or how chaos snarls itself around everything but starts unraveling from any end you have the patience to pull.  

4. Dark Tranquillity: "Below the Radiance" (from Fiction expanded edition)  

So stand on the nearest prow, steer into your fears, and call for speed.  

5. Everon: "South of London" (from North)  

There's always some direction for progress to take you.  

6. DragonForce: "A Flame for Freedom" (from Ultra Beatdown)  

And always some other side to come through to, out into enough sunshine to cheer and bask in for a while even if you're basically more lost than you were before.  

7. Katy Perry: "Waking Up in Vegas" (from One of the Boys)  

Because the question isn't, ultimately, if you're more or less lost. If you ask it right, it isn't even where you are. It's where next?  
 

[Download all 7.]
So there's "the Semantic Web", which has lots of philosophical baggage, and then there's "Linked Data", which is supposed to represent unpacking only the practical stuff from the piles of baggage. A secret password for a cult within a cult. A cult-within-a-cult big enough to have a two-day conference dedicated to it, but not yet big enough that the final panel on the second day wasn't facing a mostly empty room.  

And not quite yet a cult-within-a-cult with a coherent, useful agenda, I think. If you want to stage a grassroots revolution, you need to figure out four things:  

- What is the big change you're going to bring about?
- What's the work that has to be done?
- Who has to do the work?
- What's in it for them?  

Unfortunately, the current Linked Data agenda kind of answers these questions like this:  

- The big change is that all "data records" will have universally unique names, which will all also be web addresses so you can look them (or maybe something about them, or maybe kind of both) up with a browser, and when you look them up they will point you to other things.  

- The work to be done is that everybody must either convert their data into a list of individual subject-verb-object assertions, including meta-assertions about those first assertions, or else at least construct the meta-assertions about the assertions and construct the unique names for the original assertions even if they don't actually exist. And all this should be done in a language (and data model) called RDF, most explanations of which begin by apologizing for it, which is understandable because it's the data-modeling equivalent of assembly language for programming, and pretty much nobody voluntarily works with nothing but levers when there's a Home Depot nearby.  

- The people who have to do the work are the owners of data.  

- The thing that's in it for them is that other people might mash up their data into something else, or discover it serendipitously, or effortlessly integrate it. Except Linked Data doesn't mean Open Linked Data, so you don't have to expose your data to the world, so in that case the benefit is, um, something something ontology something toolchain serendipitous LOOK!! OVER THERE!! IT'S A GIANT GLOBAL GRAPH EATING THE EMPIRE STATE BUILDING!!!!!! THIS IS THE WEB DONE RIGHT!!!!!  
 

This is a pretty bad plan for a revolution. It's hard to understand why the big change is big, never mind why it's good; the work is artificially difficult and deliberately obscure; the people who would have to do it are basically unprepared; and the "benefits" are peripheral more or less by definition.  
 

Let me try a different set of answers:  

- The big change is that we can finally afford to store most kinds of information in a way that separates the logical data model from the storage mechanics, so that the only thing that constrains what you can do with your data is your data.  

- The work to be done is that someone needs to build a node/arc-oriented database system, with accompanying data-model and query-language, that brings this idea up to the usability level of, say, a simple Excel spreadsheet or Google Base.  

- The people who have to do the work are data-system developers, whether new companies or existing ones.  

- The things that are in it for them are that the market of people with data is huge and growing, and the market for people with highly interconnected data is newly huge and frantically growing, and the existing RDBMS/SQL-based tools for analyzing and exploring and publishing that data are so awful that we will look back on them like we look back on COBOL and punch-cards. There is both money and human progress to be made.  
 

Mine probably initially seems like a smaller revolution. My enemy is not the web. My revolution is not about URIs or linking or dereferencing or ontology subsumption or transitive closure. I just know that we've wasted human-centuries of time fighting against internal problems introduced by relational-database implementations, and built legions of crappy, inflexible, inhumane data systems because the limitations of our databases constrained what questions we allowed humans to even ask, never mind get answered. And I know that computers are now big enough and fast enough that for increasingly many datasets, the old ugliness is no longer even remotely necessary.  

And thus the new ugliness, RDF and SPARQL, is a painfully tragic missed opportunity. I no more want to be buried in triples than I wanted to be boxed into tables. Where SQL and tables made some simple things relatively simple, some simple things very hard, and most hard things really hard, RDF and SPARQL make all things theoretically possible, but all specific trivial things barely feasible, any simple things too cumbersome to attempt without assistance, and hard things moot because the simple things raise so many epistemological issues that all actual work halts.  
 

We can do better. Let's call this new movement Whole Data, like whole food. Let our data be what it is to people, not what it's easier to be to machines. I don't claim this is any kind of new science. Object-oriented databases aren't a new idea, and arguably a binary-relation node/arc data-model is more or less what the original idea of a "relational" database was before mundane practicality spent a few decades abusing it. But we can build it now. We can build it so that people can use it, easily and without having to learn some whole esoteric new trade.  

And yes, a node/arc Whole Data database will lend itself incredibly well to publishing that data on the web. Arcs are links, and the web is made of links, and it's deliciously easy to turn links into links. Moreover, arcs are labeled links, and labeled links mean that machines can do something useful with them. But it is not the job of each data owner to rewrite their own database tools. It is not the job of each webmaster to figure out how to "semantically annotate" their pages. This is a revolution in data-modeling abstraction, and it will be fought by software designers and developers.  

And linking disparate datasets together, which is supposed to be the whole point of Linked Data, and thus the whole point of this slightly sad and more than slightly confused conference I went to last week, is only barely a technical problem to begin with, and it's not at all clear to me that the technical agenda of this subcult actually makes non-trivial linking any easier. Most of the claimed benefits, even if you don't worry about whom they're benefits for, seem to me like imaginative hand-waving.  

- Giving things unique IDs does make it easier to talk about them. But you can have IDs in table records, so that isn't new, and you can refer to things by instructions instead of IDs and most of the time it doesn't make a lot of difference, so IDs aren't even always necessary. And giving things web IDs is a totally orthogonal issue about publishing that isn't about the data itself at all, and makes no more general sense than saying that all children should be named with cell-phone numbers.  

- You can "mesh together" RDF sets. Sometimes, modulo some obtuse URI issues. And if the sets' conceptual data models are compatible enough, you might be able to get something out of the combination. But this is more or less true of relational database tables, too. Exchanging data is not difficult. CSV, JSON, XML, whatever. Agreeing what the data should be is difficult, and agreeing what it should be in triples is no easier than agreeing what it should be in tables. (Maybe easier in theory, but definitely harder in current practice.) Integration is never "effortless". It's just that sometimes somebody else has already done the effort of reaching agreement, and sometimes they haven't.  

- My adorable little FOAF file can link to yours, and we can both link ours to Tim Berners-Lee's. But this is a toy "data web". It doesn't scale. Currently it doesn't even scale technically, as the web is too slow for queries to be federated out across every foafy server on the planet, but solving the technical problem will only make it faster for us to see how useless the answers are for human reasons. Every level of remove you go through is a chance for context or trust or meaning to be lost. I don't know how you decide who goes in your file, and I certainly don't know how the people in your file decide who goes in theirs. If you find your way to me and then to Tim, it is unlikely to do you very much real good. On one hand, his office is about two blocks away from mine. On the other hand, the only reason it's becoming marginally more likely that he knows me is that I keep complaining about things he's working on. Small Pieces Loosely Joined works great when you have time to examine the joins and the pieces and figure out which ones are useful for some particular personal purpose. Carry them home, trim them, water them, love them, and you might get a garden for your efforts. Send a machine out to gather them all up at once and you'll get a mushy septic field with some fine heirloom tomatoes submerged somewhere in it.  

- As long as our cults are small, sometimes we can find each other with nothing but our code words. But thinking that URI standards will yield pervasive serendipity is socially oblivious. Nothing I do in my FOAF file will ensure that you find me when you look at Tim's. Distributed linking is inherently asymmetric, and even if you crawl the entire web to construct all the missing inverses, the numbers are implacable. You cannot center yourself in a social network without cooperation. The space of associativity is curved: you can move from the obscure to the popular easily (indeed often inexorably), and along popularity contours with some effort. The imbalance of attention is a human phenomenon, not a technical one.  
 

But my little revolution, at least, doesn't involve changing human nature. Distrust any that does. Whole Data is not magic, it's just healthier than the bleached, bromated version, and yummier than chewing raw semantic bran. All I'm trying to do is make information tools that allow humans to be less mechanical, and machines to be more helpful.  

Here are some helpfulnesses:  

- A node/arc database can handle multiple-value and no-value relationships just as easily as single-value relationships. Multiple-value relationships are 90% of the interesting structure of data, yet they're basically the bane of relational-database usability, and RDF only handles them if you don't care what order they go in or how easy it is to find out what they all are. No-value relationships are painful in relational databases, and inexpressible in RDF.  

- In a node/arc database you can make data-model modifications without touching any node whose data is not actually involved. The model does not determine the storage. In relational databases the model is the storage. Unless you already consider yourself a "working ontologist" you don't want to know what it takes to express a model in (actually for, because you can't do it in) RDF. Real models change all the time, so this can't be a half-assed after-thought.  

- In a node/arc database you can look at any data from the perspective of any other data. There's no "top", there's just "outward from here". In a relational database your tables and indexes limit your perspectives. In RDF there's no "top", but neither is there any "here"; all you can do is scan assertions until your patience runs out, looking for ones that say something else somethings this, or this somethings something else. Being able to stand anywhere and look in any direction is the difference between exploration and a zoo, and thus between discovery and watching an okapi crap.  

- A node/arc data structure can be browsed. Relational tables and RDF can mostly only be searched, and browsing is a UI illusion built on top of searching. But browsing, wandering, connecting, following, compiling, discarding: these are how humans organize and comprehend information, and we deserve a database in which those are the cheapest and sturdiest building blocks, not the most expensive and fragile fabrications.  

- In a node/arc database with a path-based query language, human browsing can be used as machine training, and human questions can be formulated for machine processing in structurally the same way they would be stated for human evaluation. We think in contexts and connections, not in subgraph matching or index inversion. We do not start every inquiry with a list of all the universe's truths. We start somewhere, and follow paths. We need to be able to send our questions down these paths, down all the paths at once, and have them come back with their own answers.  
 

The project I'm working on has a database system like I describe, and a data-model that's more usable than RDF, and a path-based query-language I wrote that's better than SPARQL. I want this thing to exist. I think we need it to exist. I'd rather it already existed, and if somebody else comes out with one tomorrow that's at least as good as mine, I'll be totally thrilled to use that instead. Mine isn't going to come out tomorrow. I don't know when it will. I don't know who will be totally thrilled, the day mine comes out, to use that one instead of the one they were working on.  

But I know these small improvements matter. Information matters, and understanding matters, and the tools for getting from one to the other matter. I think our ability to make use of information is part of how we are going to survive on Earth. So this, for the moment, is what I'm trying to do for us.
Applied natural-language-processing work is motivated by the proposition that it's more practical to teach a few computers to understand human languages than it is to teach a lot of people (including every new one) to speak computer languages. The potential advantage, thus, is numeric and large.  

So far, though, many people have learned to speak computer languages pretty well, and no computers have learned to understand human languages. Indeed, at this point it's still faster to teach computer languages to humans even if you have to create the humans from scratch first...
1. Trinacria: "Make No Mistake"  

Insanely inspired collaboration between viking-metal grind and noise-terror agit-processing, easily the most bracing thing I've heard in years.  

2. Leviathan: "Receive the World"  

Like being shredded in a slow-motion tornado filled with mid-explosion bombs and extrapolated nightmares.  

3. Ihsahn: "Emancipation"  

Progressive and necromantic at once, like court music from a Hades starting to gentrify just a little as it discovers its political strength.  

4. Moonspell: "Dreamless (Lucifer and Lilith)"  

Gothic metal's current standard-bearers.  

5. Charon: "Deep Water"  

A HIM to Moonspell's Sentenced.  

6. Morgion: "Mundane"  

A funeral march for glaciers.  

7. Dalriada: "Tavaszköszöntő"  

And at the end of the march, the unexpected moment when you are caught and carried off by sprint-pogoing goat-horned leprechauns and their seven-armed princess who only ever speaks backwards fast.  
 

(All 14 in AAC; 65MB zip file)
By request:  

"List seven songs you are into right now. No matter what the genre, whether they have words, or even if they're not any good, but they must be songs you're really enjoying now, shaping your spring. Post these instructions in your blog along with your 7 songs. Then tag 7 other people to see what they're listening to."  

The tags: Bethany, Dennis, Heather, Dan, James, Alan, David.  

The songs:  

1. Frightened Rabbit: "Head Rolls Off"  

:05 snippets of :30 clips is no way to fall in love with anything subtle. For Frightened Rabbit it took me a mix received and another one given to stop my scanning long enough to register dawning awe. This may be the best album of tragic-heroism since Del Amitri's Waking Hours, and "Head Rolls Off" is as defiantly hopeful a life anthem as anything since the Waterboys' "I Will Not Follow".  

2. Alanis Morissette: "Underneath"  

"We have the ultimate key to the cause right here." Alanis says awkward, earnest things that I also believe.  

3. Jewel: "Two Become One"  

If you don't have the courage to be kaleidoscopically foolish while you're still young enough, you won't have the chance to turn your "2"s into "Two"s when you're a little older, a little wiser, and a little less afraid of yourself.  

4. Delays: "Love Made Visible"  

Joy made audible.  

5. Runrig: "Protect and Survive"  

There can be echoes, inside blood, of decades and centuries and stone and water.  

6. Okkervil River: "Plus Ones"  

Not only probably the best gimmick song ever, but a gimmick song that actually ennobles other gimmick songs.  

7. M83: "Graveyard Girl"  

Best mid-80s synthpop song since the mid-80s. When my daughter is old enough to gripe about all the old crap I listen to from before she was alive, which thus can't possibly matter, I'll be able to tell her that at least she was around to dance with me to this.
Site contents published by glenn mcdonald under a Creative Commons BY/NC/ND License except where otherwise noted.