What Beatles album is "Day Tripper" on?  

This is partially a trick question, of course, as "Day Tripper" was originally a non-album single, but it has been on several Beatles compilations over the years, including the red 1962-1966 best-of, and in the remastered 2009 catalog it lands on both the mono and stereo versions of Past Masters.  

Starting from scratch, it would take you, as a person, a little while to figure this answer out on the web. There's a Wikipedia page for the song, which is the top search hit for the above question in both Google and Bing, and it explains the non-album-ness and mentions several compilations, but it doesn't (at least as of this moment) clarify current availability, and none of the pages for the non-current compilations refer you explicitly (again, as of this moment) to the right place.  

There are a few paths that lead you to a voluminous page for the Beatles discography, on which "Day Tripper" is again mentioned several times, but this page doesn't itemize the track-listings of compilations, and the "Day Tripper" links just go back to the original song-page. But eventually you might blunder into the separate pages for the Mono and Stereo box sets, and from there you might wander over to Amazon.  

Here even more potential confusion awaits you. The top Google hit for "amazon past masters", at the moment, is the now-obsolete second volume of the old CD edition. Searching on "past masters" on Amazon itself gets you the Remastered edition as the top hit, but idiotically suggests buying it together with the old Volume 2, and you would have to scrutinize the track lists to realize that Past Masters [Remastered] subsumes Past Masters, Vol. 1 and Past Masters, Vol. 2.  

In fact, if you backtrack and try searching for "Day Tripper" on Amazon, the top hit is Rubber Soul, the album on which "Day Tripper" would have appeared, chronologically, but doesn't. And, for good measure, that top hit is the obsolete 1990 CD, not the new remaster.  

Bleh.  
 

But at least you're a person, and via a judicious combination of intuition and stubbornness and asking some friends who might know, you can eventually solve these information problems. If you were a computer, you'd be fucked.  

Which is OK on some existential level, because if you were a computer you probably wouldn't appreciate the song, anyway. But the point of computers is to help people do things, and a computer ought to be a particularly helpful tool when the thing you need to do is sort through some data.  

But for the computer to help you puzzle through this data, the data has to be modeled usefully by people first. There are several prominent sources of meticulously structured data about music, so this should be easy. But here, sadly, people have let us down again. And again, and again. Let's see how.  
 

All Music Guide  

A text-search for "Day Tripper" (there's no other query interface) returns a full page of cryptic results. There's an "Occurrences" column, and although it's not clear exactly what that means, it's obvious that more is supposed to be better, and the first listing has 360 where none of the rest have more than 13, so presumably that's the "right" one.  

Clicking this gets you 8 pages of results, which is annoying in itself (the splitting of them into pages, I mean). They're sorted by Artist, which sounds reasonable enough, except that the ones without artists are sorted first, and thus the first page of results is almost totally crap. There are lots of Beatles releases listed, but they get split between pages 1 and 2 of the results list, making it impossible to look at them all at once.  

But if you drill into one of them at random, and then click on "Day Tripper" in the track listing, you do finally get to a page that lists all (or several, anyway) Beatles releases on which this song appears. There are 24, though, including such things as "Five Nights in a Judo Arena", which human intuition might guess is not a normative release, but a computer would have no basis for dismissing. These releases are in date-order, at least, but this turns out to be worse than worthless for our current question, because All Music has modeled Past Masters [Remastered] not as an album, but as an alternate manifestation of the album Past Masters, Vol. 1 & 2, which means it appears way up in the middle of this list, labeled 1988, because that was the year of its earliest issue (on cassette!).  

Looking through the data, in fact, we see that although All Music has lots of individual detail on most kinds of things, it has essentially nothing that models the relationships of things to each other, or in groups. There is no modeled connection between Past Masters, Vol. 1 & 2, Past Masters, Vol. 2, The Beatles: Mono Box Set and The Beatles: Stereo Box Set, even though v1/2 subsumes v2, and both boxes subsume both volumes.  

And there's no modeling of "in print", or any notion of representing the subset of albums that make up the current core catalog. So a computer can't use this data to answer real questions by itself. Source fail.  
 

MusicBrainz  

This is a database first, not a guide, and thus a more likely candidate for well-structured data anyway, and one where I won't pick at their explicitly-secondary browsing UI.  

The good news is that MusicBrainz has the kind of data we need. They have some relationships between tracks, like one being a mashup of some others, so presumably they could add one to express that the "Day Tripper" on Mono Masters is a different version from the one on Past Masters [Remastered], but the same underlying song. They already have a reconciliation mechanism by which they can say that the "Day Tripper" on 1962-1966 is "the same" as the one on Past Masters, although at the moment the reconciliation data looks too noisy for real use.  

They even have the notion of one release being part of a set, although I didn't find very many examples of sets, and in particular I can't tell if a release can be part of more than one set. But if they can, that might be a mechanism for expressing official catalogs, current availability, and various other kinds of groupings and subsets.  

So current source fail, but at least there's hope here.  
 

Freebase  

Freebase is easily the most sophisticated public attempt at universal data-modeling, at the moment, but this is a caveat as well as a compliment. Freebase models attempt to represent everything that could possibly exist, and thus tend to drift quickly from usable simplicity towards abstractly-correct awkwardness, usually coming to rest far into the latter.  

So if you search on "Day Tripper", you will find that there are results of that name as "Topic", "Composition", "Musical Album" and "Musical Track", with at least dozens of the latter. Freebase fails the usability test even more spectacularly than All Music, as the list of Musical Tracks is presented with no grouping or distinguishing information at all, just "Day Tripper (Musical Track)" after "Day Tripper (Musical Track)", and you have to click on each one to get any clarifying info. "Day Tripper" the composition does not link to any of the tracks, and "Day Tripper" the album turns out to be a compilation of Beatles covers which does not, at least as far as the listed information shows, even contain the song "Day Tripper".  

And if you delve into the internals of the Freebase music schema, you can quickly develop a guess about why the data has not all been filled in: there's too damn much structure. A music/track is a recording of a music/composition. The track can appear on multiple music/releases, each of which is a publication of a particular music/album. Unless you need to model who was in the band during the making of an album, in which case the album links instead to a set of music/recording_contributions, which is each a combination of albums, contributors and roles.  

Oh, except compositions can be recorded "as albums", in which case they link to music/album without going through music/track, and tracks can appear directly on releases without going through music/album. And there's no current property for saying that a given track is an alternate version of another, but from following Freebase modeling discussions I can confidently guess that they'd model that by saying that a music/track links to something like a music/track_derivation, which itself is a combination of original track, derivative track, deriving artist (or music/track_derivation_contribution) and derivation type. And Freebase's query-language doesn't provide any recursion, so if these relationships chain, good luck querying them.  
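
To make the amount of indirection concrete, here is a rough Python rendering of the hops described above. The class names are paraphrased from the prose, not Freebase's actual schema definitions; this is just an illustration of the shape:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Composition:            # the abstract song, e.g. "Day Tripper"
    name: str

@dataclass
class Track:                  # one recording of a composition
    name: str
    recording_of: Composition

@dataclass
class Album:                  # the abstract album
    name: str

@dataclass
class Release:                # one publication of an album; tracks hang off releases
    album: Album
    tracks: List[Track] = field(default_factory=list)

@dataclass
class RecordingContribution:  # who played on an album, as its own separate object
    album: Album
    contributor: str
    role: str

# Getting from a song to something you can buy means walking
# Composition -> Track -> Release -> Album, with personnel living on
# yet another object off to the side.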
 

Music Ontology  

This isn't a database, just an attempt at a model for one. And, grimly, it's another quantum level more elaborately and unusably correct than the Freebase model. Even "MO Basics" (and the "Overview" has 22 more tables of explication beyond these "Basics", without getting into the "details") includes conceptual distinctions between Composition, Arrangement, Recording, Musical Work, Musical Item, MusicalExpression and MusicalManifestation. And then there are pages upon pages of minutely itemized trivia like beginsAtDuration, djmix_of, paid_download (and "paiddownload", which is different in some way I couldn't figure out), AnalogSignal, isFactorOf... This list is bad because it's too long, but the fact that it's in the schema means that it's also bad because no matter how long it is, it will never include every nuance you ever find yourself wanting, and thus over time it will only accumulate more debris.  

A tour-de-force into a cul-de-sac.  
 

The Rest of the Web  

Searching on any particular band or bit of music will unearth dozens or hundreds of other sites that contain bits of the information we need: stores, discographies, databases, forums, fan pages, official sites. Almost universally, these are either unqueriable flat HTML pages, or tree-structured databases with even less interlinking than the above sites. Encyclopaedia Metallum, my favorite metal site, has full track listings for a genuinely mind-boggling number of releases by an astonishing number of bands, but the tracks themselves are not data-objects and a machine can find out nothing about them. There are several lovingly hand-crafted Beatles discographies on the web, all far too detailed for our original casual query, and all essentially useless to a computer attempting to help us.  

So: Ugh. Triple ugh because a) the population of people willing to put time and energy into filling out music-related data-forms is obviously huge, b) the modeling problems are not intractably complicated in any theoretical sense, c) MusicBrainz and Freebase, at least (and the system I'm designing at work, I think), seem to be technically sufficient to represent the data correctly. If only we had a better plan.  
 

DiscO  

So here's my attempt at a better plan. I call it DiscO, for Discographic Ontology; that is, it's a scheme for structuring discographies. It is not an attempt at an abstract physics of human air-vibrating creativity, it is just an outline of a way to write down the information about bands, the music they've made, and how that music was released. It's intended to be simple enough that you can imagine people actually filling in the data, but expressive enough that the data can support interesting queries. And it's specifically intended to model nuance abstractly, so that it can accommodate some kinds of new needs without perpetually having to itself expand.  
 

There are four basic types:  

Artist - The Beatles, Big Country, Megadeth, Frank Zappa, whatever...  

Release - an individual album, single, compilation, whatever; Rubber Soul, Past Masters [Remastered], "Day Tripper"/"We Can Work It Out"...  

Track - an individual version of a song; "Day Tripper", "Day Tripper [mono]", "Day Tripper (performed live on pan flute and triangle by Zamfir and Elmo)", etc.  

Sequence - any collection of releases; Original Albums, Japanese Cassette Singles, 2009 Remasters, etc.  
 

These are related to each other like this:  

Artists mostly have Sequences. Sequences can be anything, but many artists would have some standard ones: Original Albums, Singles, Compilations, Remastered Albums, Current Catalog.  

Sequences have Releases (and Artists).  

Releases have Dates, Labels and Tracks. A Release may have an Artist directly, but more often would have one indirectly via a Sequence.  

Releases may be related to each other via Alternate Version/Original Version links. Thus Past Masters, Vol. 1 & 2 and Past Masters [Remastered] are both Releases, but Past Masters [Remastered] has an Original Version link to Past Masters, Vol. 1 & 2, and Past Masters, Vol. 1 & 2 has an Alternate Version link to Past Masters [Remastered].  

Tracks have Durations. A Track may have an Artist directly (so individual tracks on multi-artist compilations can be attributed correctly), but more often would have one indirectly via Release (which itself may have one indirectly via Sequence).  

Tracks may also be related to each other via Alternate Version/Original Version links. "Day Tripper" and "Day Tripper [mono]" are both Tracks, but "Day Tripper" has an Alternate Version link to "Day Tripper [mono]", and "Day Tripper [mono]" has an Original Version link to "Day Tripper". (We can get into geek arguments about which versions are the same and which are derivations (of which!), if we want, but whatever we decide, we can model.)  

Restated in schema-ish form, that's:  

Artist
- Sequence
- Release
- Track  

Sequence
- Artist
- Release  

Release
- Sequence
- Artist
- Date
- Label
- Track
- Original Version
- Alternate Version  

Track
- Artist
- Duration
- Original Version
- Alternate Version  

I think that's basically enough. What it gives up in expressiveness, it gains in usability. Our Beatles data can now, I think, be modeled both tractably and informatively. We can hook up all the versions of albums and versions of songs. We can create whatever sequences we need, and since the sequences themselves are just data, it's fine to have "Canadian Singles" for the Beatles and "Fanzine Flexis" for The Bedsitters without implying that either band should also have the other.  
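
To make the shape concrete, here is a minimal sketch of a fragment of that Beatles data as plain Python dictionaries. The field names and nesting are mine, the durations and track lists are illustrative rather than complete, and this is just one possible rendering of the model, not a normative serialization:

beatles = {
    "name": "The Beatles",
    "sequences": [
        {"name": "Current Catalog",
         "releases": ["Past Masters [Remastered]", "The Beatles: Mono Box Set"]},
    ],
}

releases = {
    "Past Masters, Vol. 1 & 2": {
        "date": 1988,
        "tracks": ["Day Tripper"],
        "alternate_versions": ["Past Masters [Remastered]"],
    },
    "Past Masters [Remastered]": {
        "date": 2009,
        "tracks": ["Day Tripper"],               # the other 32 tracks elided
        "original_version": "Past Masters, Vol. 1 & 2",
    },
    "The Beatles: Mono Box Set": {
        "date": 2009,
        "tracks": ["Day Tripper [mono]"],        # the other 212 tracks elided
    },
}

tracks = {
    "Day Tripper": {"duration": "2:49",
                    "alternate_versions": ["Day Tripper [mono]"]},
    "Day Tripper [mono]": {"duration": "2:49",
                           "original_version": "Day Tripper"},
}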

And using Thread, the query-language I will (before long, hopefully) be attempting to spread through the universe, we can start to ask our questions in a way the computer can answer:  

Track:=Day Tripper.Release
 

This is our naive query. It gets all the releases that have any track called exactly "Day Tripper". Good for assuring us there's some data in the bucket, but not much help in answering our question.  

Track:=Day Tripper.Release:(.Artist:=The Beatles)
 

That limits our results to albums by the Beatles, but there are still too many. With our fully-interlinked data-model, though, we can now actually ask something that is much closer to what we mean:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track:=Day Tripper)
 

That is, find the artist The Beatles, get their Current Catalog sequence, get that sequence's releases, and filter those releases down to the ones that contain a track called exactly "Day Tripper". This is progress.  

But "called exactly 'Day Tripper'" will exclude "Day Tripper [mono]", which isn't what we want to do. We're trying to ask a musical question about a song, not a typographical question about a title. But this, too, we have the powers to cope with:  

Track|Day Trippers=(...__,Original Version:=Day Tripper)  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)|
DT Versions=(.Track:Day Trippers=?),
Other Tracks=(.Track:Day Trippers=_._Count)
 

This time we first define a new inferred relationship on Track called "Day Trippers", which gets the Track, all its Original Versions, all their Original Versions (recursively), and then filters this set of tracks down to just the ones called "Day Tripper".  

Then we get the Beatles' current catalog releases again, but this time instead of checking each release for a track named "Day Tripper", we use our Day Trippers relationship to check for a track that is, or is derived from, "Day Tripper". And then, for each of the releases that have one, we infer two new relationships: "DT Versions" tells us which track(s) on this release are versions of "Day Tripper", and "Other Tracks" counts the tracks on this release that are not derivations of "Day Tripper".  

I.e.:  

#  Release                     DT Versions          Other Tracks
1  Past Masters [Remastered]   Day Tripper          32
2  Mono Box Set                Day Tripper [mono]   212
3  Stereo Box Set              Day Tripper          238
 

So now we know our choices. It took us so long to find out, but we found out.  
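
For the mechanically inclined, here is roughly what that inference amounts to in ordinary Python, reusing the beatles/releases/tracks dictionaries sketched earlier. The function names are mine, and Thread of course does this declaratively rather than with hand-written loops:

def originals(track_name):
    """The track itself plus everything reachable via Original Version links."""
    chain = [track_name]
    original = tracks[track_name].get("original_version")
    if original:
        chain += originals(original)
    return chain

def day_trippers(track_name):
    """The 'Day Trippers' inferred relationship from the query above."""
    return [t for t in originals(track_name) if t == "Day Tripper"]

current_catalog = next(s for s in beatles["sequences"]
                       if s["name"] == "Current Catalog")
for name in current_catalog["releases"]:
    release = releases[name]
    dt_versions = [t for t in release["tracks"] if day_trippers(t)]
    # other_tracks is 0 here only because the toy data elides the rest of each track list
    other_tracks = len(release["tracks"]) - len(dt_versions)
    if dt_versions:
        print(name, dt_versions, other_tracks)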
 
 
 

Tantalizing Postscript: But now that we have these three options before us, how do their contents overlap or differ, track by track?! We could bring up three different windows and squint at them. Or we could ask the computer:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)
/(.Track...__,Original Version:##1)/=nodes
 

Aaah. I see now. (How nice it will be when I'm allowed to show you...)

In his post explaining his departure from Google, Douglas Bowman says "Yes, it's true that a team at Google couldn't decide between two blues, so they're testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can't operate in an environment like that."  

He's not really trying to have an opinion-changing last word in an argument against data-driven product decisions, but if he were, this is not how to do it. If you believe that the proper width of a border can be tested, then Bowman's refusal to subject his intuitions to quantitative confirmation just sounds like petulant prima-donna nonsense. If you can test 41 shades of blue, this line of reasoning goes, you don't need to guess, so a guessing specialist is an annoying waste of everybody's time.  

The great advantage of testing and data, of course, is that you get precise, decisive answers you can act on. Shade 31, with 3.7%, trouncing runner-up shade 14 with only 3.4%! Apply shade 31, declare progress.  

But the great disadvantage of testing and data is that you get precise, decisive answers you can and will act on, but you almost never know what question you really asked. Sure, the people who saw shade 31 did some measurable thing at some measurable rate. But why? Is it shade 31? Or is it the contrast between shade 31 and the old shade? Or is it the interplay between shade 31 and some other thing you aren't thinking about, or possibly don't even control? Are you going to run your test again in a month to see if the results have changed? Did you cross-correlate them with HTTP_REFERER and index the colors on the pages people came from? What about all the combinations of these 41 shades and 41 backgrounds and 8 fonts and 3 border widths (12 if you vary each side of the box separately!) and 41 line-heights and 19 title-bar wordings and the color of the tie Jon Stewart was wearing the night before? Which things matter? How do you know?  

You don't. And if you need to add some new element, tomorrow, you don't know which of the tests you've already run it invalidates. Are you going to rerun all of them, permuted 41 more new ways?  

No. You are going to sheepishly post a job opening for a new guessing specialist. Bowman already had his last word. It was "Goodbye".

An adequate computer language allows humans to communicate with machines about machine concerns.  

A good computer language also facilitates communication between humans about machine concerns.  

A great language allows machines to participate in conversations between humans about human concerns.  
 

There are not very many of this last sort. As I've mentioned before, I'm trying to write one. I've been calling it a query language, but I've started to think I shouldn't. It's a language for talking about data-relationships, where most other things called "query languages" are for excerpting data, and the two are different qualitative goals even when the individual tasks end up being logistically similar. I'm trying to do for data-relationships what the system for symbolic algebra did for numbers. Not what algebra did for numbers, thankfully, just what we accomplished by making up a written syntax for expressing algebra compactly and precisely.  
 

So here's just one real-world example from yesterday. We were talking, elsewhere, about how you calculate overall ratings for bands in a large reviews database. The simplest thing is just to average all their ratings. In Thread, my data-relationship language, this is:  

Artist|(.Album.Rating._Average)
 

I.e.: For each artist, get their albums, then get those albums' ratings, then average all the ratings. But this is maybe not the best statistic, as it weights albums proportionally to the number of reviews. Maybe we want to average the ratings for each album, and then average the album-averages to get the artist average. That's a hard sentence for a person to read, and the computer can't read it at all. But in Thread it's just:  

Artist|(.Album.(.Rating._Average)._Average)
 

Run this, though, and you see that the top of the list is dominated by bands with very small numbers of very high ratings. Not really what we're trying to find out. So let's include only bands with at least 25 ratings:  

Artist:(.Album.Rating::#25)|(.Album.(.Rating._Average)._Average)
 

This is better, but maybe not as much better as you'd think. It turns out that there are a number of bands for which a small number of people have written a large number of reviews. Maybe what we really want is to average the ratings for each user, not for each album. That way one person giving the same high rating to 8 different albums counts as 1, not 8. And we'll only consider artists with ratings from 25 different users, not just 25 ratings total. This is:  

Artist:|(.Album.Rating/User::#25.(.group._Average)._Average)
 

Better, but it's still pretty easy to game this by creating new accounts and filing one very high rating from each of them. We can mitigate that, though, by trusting only ratings from users who have rated, say, at least 5 different albums, from at least 3 different artists. That's:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)._Average)
 

Better again. But there are still a few pretty obscure things at the top of the list. This doesn't prove that the results are flawed, of course, but scrutinizing them, and thinking about the sample-size effects of rating variation at this scale, reveals that the highest and lowest ratings are having pretty dramatic effects. Perhaps it would be smart to toss out the top and bottom 10% of the per-reviewer averages, averaging only the middle 80%. This keeps one perspective-challenged fan or one vengeful ex-bassist from single-handedly jumping the ratings up or down. Thus:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)#._Trim 10%._Average)
 

The result of this, in fact, is this leaderboard. By these rules Immolation is currently the top-ranked band in the Encyclopaedia Metallum.  
 

The English version of this final formulation is "bands with 25+ reviewers of their full-length albums, counting only reviewers who have filed at least 5 reviews and covered at least 3 bands; scored by averaging the ratings from each reviewer, dropping the top and bottom 10% of these reviewer-averages, and then averaging the remainder". This is a long sentence for people, and a useless sentence for machines, and as long as this is our canonical format, we will be at considerable risk for error every time we retranslate into a computer language. Put this in SPARQL or SQL or MQL, though, and it would be essentially inaccessible to people. So you choose between knowing what you want and not necessarily getting it, or knowing what you're getting but not whether it's what you want.  
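
For comparison, here is roughly what that English sentence comes to as an ordinary general-purpose-language function. This is a hedged sketch, not the production code: it assumes reviews arrive as a flat list of (artist, album, user, rating) tuples, and it ignores the full-length-album qualification:

from collections import defaultdict

def leaderboard(reviews):
    """reviews: a list of (artist, album, user, rating) tuples."""
    # Trusted reviewers: at least 5 distinct albums covering at least 3 distinct artists.
    albums_by_user, artists_by_user = defaultdict(set), defaultdict(set)
    for artist, album, user, rating in reviews:
        albums_by_user[user].add(album)
        artists_by_user[user].add(artist)
    trusted = {u for u in albums_by_user
               if len(albums_by_user[u]) >= 5 and len(artists_by_user[u]) >= 3}

    # Collect each artist's ratings per trusted reviewer.
    by_artist = defaultdict(lambda: defaultdict(list))
    for artist, album, user, rating in reviews:
        if user in trusted:
            by_artist[artist][user].append(rating)

    scores = {}
    for artist, by_user in by_artist.items():
        if len(by_user) < 25:                        # need 25+ distinct reviewers
            continue
        averages = sorted(sum(rs) / len(rs) for rs in by_user.values())
        trim = len(averages) // 10                   # drop top and bottom 10%
        kept = averages[trim:len(averages) - trim]
        scores[artist] = sum(kept) / len(kept)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

Correct, probably, but notice how thoroughly the shape of the question disappears into bookkeeping.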

I think we have to do better. The human stakes for data-comprehension are approaching critical levels, and our tools have not kept up. Worse, the shiny new tools in the big labs are not ready yet and not even that great.  

So Thread is my own personal attempt at doing better. Could it be the language we could actually share, humans and computers, to talk about data? I can't prove it is yet, and the project in which it's embedded is still working towards its public debut, so you can't make up your own mind yet, either. But for the past couple years I've been using it to talk to computers, and to myself, and even to a few coworkers, and the experience at least gives me hope. I know it's powerful, and I know it's compact.  

Like any language, of course, we'd have to learn it. I make no claims of it being "intuitive", whatever meaning that term might have for a symbolic-reasoning language, nor do I claim it's trivially implemented at scale. It's cryptic in its own particular way, and poses its own technical challenges. But I'm not trying to minimize anybody's absolute difficulty, I'm trying to maximize the ratio of power to difficulty. If, reading those examples above, without a formal tutorial or even an actual diagram of the data model in question, you have at least a sense of what might be going on, then it's at least possible I'm getting somewhere.  
 

[Note from a few days later: in re-reading these queries I actually noticed a methodological error! The first time I did this, I neglected to sort the ratings before trimming the first and last few. That is, I did this:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)._Trim 10%._Average)
 

where I should have done this:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)#._Trim 10%._Average)
 

The operative difference is the "#" for sorting right before "._Trim 10%" in the second query, which is what makes the trim function take off the highest and lowest ratings, rather than just the first and last.  

But even this error is kind of my point. The language is a tool for me to talk to myself over time.]

I gave another version of my Whole Data blog post as part of a panel called "Death of the Relational Database?" at the Web 3.0 conference today.  

My answer to the question in the panel-title is that the relational database was always mostly a coping mechanism for a time of relative scarcity, when our data volume and needs exceeded our storage and processing capacities. This era is not entirely past us, but with more memory, faster data-loading, faster processing and better programming tools, there are increasingly many datasets of larger and larger size for which we can now support qualitatively richer abstractions. Where we used to have to break data apart, and thus were constrained to interact with it in fragmented and broken ways, we can now increasingly afford to keep it whole.  

And here, then, to try to explain why that matters, is my current four-point reduction of the Whole Data manifesto:  

1. What logic allows, storage may not constrain.  

The forms of the logical data-model and its inquiries are by definition independent of any extra-logical storage mechanics. That is, a proper data system exists for the express purpose of sustaining the illusion that the logical data-model is real. This must also be true over time: logically-sensible changes in the logical data model cannot be constrained by storage mechanics.  

2. All relationships are lists.  

Not only must multiple-value relationships and no-value relationships be exactly as easily expressed, manipulated and inquired about as single-value relationships, but every fundamental data operation should thrive on multiplicity and sequence.  

3. There is no up or down, only onwards from here.  

Humans impose hierarchy and directionality on data, but different humans recognize different hierarchies in the same data at the same time. The system itself must be agnostic. No fact is inherently the property of another, they are all independent and structurally equal, and although pairs of relationships can be inverses of each other, there is no absolute “up” or “down”, or “forwards” or “backwards”. A database of albums, artists and years, for example, is exactly as much a database of years as it is of artists as it is of albums, not by the generous will of its schema-creator, but by the intrinsic nature of data. Most significantly, you must be able to start at any node and see everything else in this dataset to which, and from which, it is connected. In fact, turn this around and you get the essential practical definition of a “dataset”: it is a collection of data in which all internal connections are expressed.  

4. Human inquiry follows paths; machine inquiry follows all the paths at once.  

The reason to put data into computers is so that computers can answer questions that are meaningful to humans, but faster for computers. A data model and query language exist to communicate human needs to computers, not to communicate machine preferences to humans, and the responsiveness of a data system is its qualitative ability to answer questions with human motivations, not its quantitative throughput.  
 

At the end of this talk I also did the first (very brief) public demo of Thread, the path-based query language I've written as part of the data-modeling (among other things) system I've been working on at ITA Software. I'm looking forward (to put it mildly) to being able to talk much more about this, but for now I just whirled through a fast series of mostly-unexplicated queries that at least gestured in the direction of the points in my manifesto:  

Album:Year=2000
albums from the year 2000; a one-to-one relationship in the old world, but in the new world the year 2000 is a real thing, too, making this many-to-one

Album:Artist=Nightwish
albums by Nightwish; a more familiar many-to-one relationship

Artist:Album~Dark
Artists who did an album with "Dark" in the name; a one-to-many relationship with the same syntax as above

Label:Album~Dark
Labels that put out albums with "Dark" in the name; same syntax again, but following a different path to Albums

Album:~Dark.Label
same question again, but following the path from Albums, instead of to them

Artist:(.Album.Label:=Spinefarm)
Artists who've had albums on Spinefarm; same pattern as above, but filtering on a compound relationship

Label:=Spinefarm.Album.Artist
or go the other way

Artist:(.Album:#5.Year:>2000)
Artists whose fifth album came out after 2000; multiples and missing values (one of the bands in the sample data had only four albums)

Artist:!(.Album:#5)
artists who don't have a fifth album; filtering on absence
 

I also mentioned in the talk, but didn't actually show, the hypothetical query "Who are all the living movie directors who ever directed Cary Grant?":  

Actor:=Cary Grant.Film.Director:!(.Date of Death)  

Go ahead and try to answer that in IMDB without a query language. For that matter, try to answer it in Freebase with a query language. (Or, even worse, try to answer it in Freebase yesterday, before I fixed a bunch of dates of death by hand...)  
 

There's also a short article up on semanticweb.com at the moment, which if nothing else gives a transcribed sense of how I explain these ideas differently in a phone interview than I do in writing.  
 

And there's space for discussion here.

I like CSV. It addresses a common data-exchange need, and does it with admirably little conceptual or syntactic overhead. As formats go, it's totally unglamorous, but glamor is very definitely not an end in itself. If you need to exchange a simple table of simple data, one datum per row/column, CSV is a fine tool.  

I also like JSON. Actually, I really like JSON. It has the data-modeling virtues of my old Elemental XML proposal, with a cleaner semantic distinction between hashes and arrays. I would like to see the spec amended to allow keys to be strings or numbers, rather than only strings, but other than that I think the whole thing is a solid, self-consistent, useful, appropriately scaled solution to an impressively large set of data-exchange and serialization problems. Certainly anything you could model with CSV, you could put into JSON instead, and not at all vice versa.  

But the Law of Conservation of Complexity applies here, as in most things. Although JSON syntax is pretty simple, any particular JSON data model can be arbitrarily complex, and the modeling issues in data-exchange are always far more involved than the syntactic ones. The important simplicity of CSV is in constraining the data model to be a table.  

Flexibility, however, can be employed in moderation. It is possible, for example, to render a CSV model in JSON syntax. That is, to turn this:  

Artist,Albums
Cradle of Filth,7
Nightwish,8
To/Die/For,4
 

into this:  

[
["Artist","Albums"],
["Cradle of Filth","7"],
["Nightwish","8"],
["To\/Die\/For","4"]
]
 

The advantage of doing so will, I hope, become quickly obvious once you realize that JSON then makes it trivially easy to have cell values be arrays. If we want a list of artists, each with their list of albums, CSV forces us to refactor:  

Album,Artist
Cruelty and the Beast,Cradle of Filth
Damnation and a Day,Cradle of Filth
Dusk...and Her Embrace,Cradle of Filth
Midian,Cradle of Filth
Nymphetamine,Cradle of Filth
The Principle of Evil Made Flesh,Cradle of Filth
Thornography,Cradle of Filth
Angels Fall First,Nightwish
Bless the Child,Nightwish
Century Child,Nightwish
Dark Passion Play,Nightwish
Oceanborn,Nightwish
Once,Nightwish
Over the Hills and Far Away,Nightwish
Wishmaster,Nightwish
All Eternity,To/Die/For
Epilogue,To/Die/For
IV,To/Die/For
Jaded,To/Die/For
 

This is annoying at minimum, and unworkable if we also wanted to see other properties of artists. But in JSON, or this modeling reduction of JSON that we might as well call "JSV", we can simply say:  

[
["Artist","Album"],
["Cradle of Filth",["The Principle of Evil Made Flesh","Dusk...and Her Embrace","Cruelty and the Beast","Midian","Damnation and a Day","Nymphetamine","Thornography"]],
["Nightwish",["Angels Fall First","Oceanborn","Wishmaster","Over the Hills and Far Away","Century Child","Bless the Child","Once","Dark Passion Play"]],
["To\/Die\/For",["All Eternity","Epilogue","Jaded","IV"]]
]
 

JSV retains the modeling simplicity of CSV (a single row of "fields" defined at the top), but allows a cell to be either a single value or a list of them. This is still unglamorous, but to me it's a big gain in expressivity for essentially no technical cost. I suggest that if combining lists and tables isn't a big deal to you (and you're the sort of person to whom a data-serialization format could be a big deal), maybe you've allowed tabular single-mindedness to beat all the natural human list-thinking out of you. Time to reclaim your right to multiplicity. Time to reclaim lots of your rights, really. Yet another argument for more lists, everywhere around us.
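
And the collapse from the flat form to the nested one is mechanical enough to hand to the machine. A minimal Python sketch, assuming the flat Album,Artist table above has been saved to a file (the "albums.csv" name is made up):

import csv, json
from collections import defaultdict

albums_by_artist = defaultdict(list)
with open("albums.csv", newline="") as f:
    for row in csv.DictReader(f):                  # header row: Album,Artist
        albums_by_artist[row["Artist"]].append(row["Album"])

jsv = [["Artist", "Album"]] + [[artist, albums]
                               for artist, albums in albums_by_artist.items()]
print(json.dumps(jsv))
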
So there's "the Semantic Web", which has lots of philosophical baggage, and then there's "Linked Data", which is supposed to represent unpacking only the practical stuff from the piles of baggage. A secret password for a cult within a cult. A cult-within-a-cult big enough to have a two-day conference dedicated to it, but not yet big enough that the final panel on the second day wasn't facing a mostly empty room.  

And not quite yet a cult-within-a-cult with a coherent, useful agenda, I think. If you want to stage a grassroots revolution, you need to figure out four things:  

- What is the big change you're going to bring about?
- What's the work that has to be done?
- Who has to do the work?
- What's in it for them?  

Unfortunately, the current Linked Data agenda kind of answers these questions like this:  

- The big change is that all "data records" will have universally unique names, which will all also be web addresses so you can look them (or maybe something about them, or maybe kind of both) up with a browser, and when you look them up they will point you to other things.  

- The work to be done is that everybody must either convert their data into a list of individual subject-verb-objects assertions, including meta-assertions about those first assertions, or else at least construct the meta-assertions about the assertions and construct the unique names for the original assertions even if they don't actually exist. And all this should be done in a language (and data model) called RDF, most explanations of which begin by apologizing for it, which is understandable because it's the data-modeling equivalent of assembly language for programming, and pretty much nobody voluntarily works with nothing but levers when there's a Home Depot nearby.  

- The people who have to do the work are the owners of data.  

- The thing that's in it for them is that other people might mash up their data into something else, or discover it serendipitously, or effortlessly integrate it. Except Linked Data doesn't mean Open Linked Data, so you don't have to expose your data to the world, so in that case the benefit is, um, something something ontology something toolchain serendipitous LOOK!! OVER THERE!! IT'S A GIANT GLOBAL GRAPH EATING THE EMPIRE STATE BUILDING!!!!!! THIS IS THE WEB DONE RIGHT!!!!!  
 

This is a pretty bad plan for a revolution. It's hard to understand why the big change is big, never mind why it's good; the work is artificially difficult and deliberately obscure; the people who would have to do it are basically unprepared; and the "benefits" are peripheral more or less by definition.  
 

Let me try a different set of answers:  

- The big change is that we can finally afford to store most kinds of information in a way that separates the logical data model from the storage mechanics, so that the only thing that constrains what you can do with your data is your data.  

- The work to be done is that someone needs to build a node/arc-oriented database system, with accompanying data-model and query-language, that brings this idea up to the usability level of, say, a simple Excel spreadsheet or Google Base.  

- The people who have to do the work are data-system developers, whether new companies or existing ones.  

- The things that are in it for them are that the market of people with data is huge and growing, and the market for people with highly interconnected data is newly huge and frantically growing, and the existing RDBMS/SQL-based tools for analyzing and exploring and publishing that data are so awful that we will look back on them like we look back on COBOL and punch-cards. There is both money and human progress to be made.  
 

Mine probably initially seems like a smaller revolution. My enemy is not the web. My revolution is not about URIs or linking or dereferencing or ontology subsumption or transitive closure. I just know that we've wasted human-centuries of time fighting against internal problems introduced by relational-database implementations, and built legions of crappy, inflexible, inhumane data systems because the limitations of our databases constrained what questions we allowed humans to even ask, never mind get answered. And I know that computers are now big enough and fast enough that for increasingly many datasets, the old ugliness is no longer even remotely necessary.  

And thus the new ugliness, RDF and SPARQL, is a painfully tragic missed opportunity. I no more want to be buried in triples than I wanted to be boxed into tables. Where SQL and tables made some simple things relatively simple, some simple things very hard, and most hard things really hard, RDF and SPARQL make all things theoretically possible, but all specific trivial things barely feasible, any simple things too cumbersome to attempt without assistance, and hard things moot because the simple things raise so many epistemological issues that all actual work halts.  
 

We can do better. Let's call this new movement Whole Data, like whole food. Let our data be what it is to people, not what it's easier to be to machines. I don't claim this is any kind of new science. Object-oriented databases aren't a new idea, and arguably a binary-relation node/arc data-model is more or less what the original idea of a "relational" database was before mundane practicality spent a few decades abusing it. But we can build it now. We can build it so that people can use it, easily and without having to learn some whole esoteric new trade.  

And yes, a node/arc Whole Data database will lend itself incredibly well to publishing that data on the web. Arcs are links, and the web is made of links, and it's deliciously easy to turn links into links. Moreover, arcs are labeled links, and labeled links mean that machines can do something useful with them. But it is not the job of each data owner to rewrite their own database tools. It is not the job of each webmaster to figure out how to "semantically annotate" their pages. This is a revolution in data-modeling abstraction, and it will be fought by software designers and developers.  

And linking disparate datasets together, which is supposed to be the whole point of Linked Data, and thus the whole point of this slightly sad and more than slightly confused conference I went to last week, is only barely a technical problem to begin with, and it's not at all clear to me that the technical agenda of this subcult actually makes non-trivial linking any easier. Most of the claimed benefits, even if you don't worry about whom they're benefits for, seem to me like imaginative hand-waving.  

- Giving things unique IDs does make it easier to talk about them. But you can have IDs in table records, so that isn't new, and you can refer to things by instructions instead of IDs and most of the time it doesn't make a lot of difference, so IDs aren't even always necessary. And giving things web IDs is a totally orthogonal issue about publishing that isn't about the data itself at all, and makes no more general sense than saying that all children should be named with cell-phone numbers.  

- You can "mesh together" RDF sets. Sometimes, modulo some obtuse URI issues. And if the sets' conceptual data models are compatible enough, you might be able to get something out of the combination. But this is more or less true of relational database tables, too. Exchanging data is not difficult. CSV, JSON, XML, whatever. Agreeing what the data should be is difficult, and agreeing what it should be in triples is no easier than agreeing what it should be in tables. (Maybe easier in theory, but definitely harder in current practice.) Integration is never "effortless". It's just that sometimes somebody else has already done the effort of reaching agreement, and sometimes they haven't.  

- My adorable little FOAF file can link to yours, and we can both link ours to Tim Berners-Lee's. But this is a toy "data web". It doesn't scale. Currently it doesn't even scale technically, as the web is too slow for queries to be federated out across every foafy server on the planet, but solving the technical problem will only make it faster for us to see how useless the answers are for human reasons. Every level of remove you go through is a chance for context or trust or meaning to be lost. I don't know how you decide who goes in your file, and I certainly don't know how the people in your file decide who goes in theirs. If you find your way to me and then to Tim, it is unlikely to do you very much real good. On one hand, his office is about two blocks away from mine. On the other hand, the only reason it's becoming marginally more likely that he knows me is that I keep complaining about things he's working on. Small Pieces Loosely Joined works great when you have time to examine the joins and the pieces and figure out which ones are useful for some particular personal purpose. Carry them home, trim them, water them, love them, and you might get a garden for your efforts. Send a machine out to gather them all up at once and you'll get a mushy septic field with some fine heirloom tomatoes submerged somewhere in it.  

- As long as our cults are small, sometimes we can find each other with nothing but our code words. But thinking that URI standards will yield pervasive serendipity is socially oblivious. Nothing I do in my FOAF file will ensure that you find me when you look at Tim's. Distributed linking is inherently asymmetric, and even if you crawl the entire web to construct all the missing inverses, the numbers are implacable. You cannot center yourself in a social network without cooperation. The space of associativity is curved: you can move from the obscure to the popular easily (indeed often inexorably), and along popularity contours with some effort. The imbalance of attention is a human phenomenon, not a technical one.  
 

But my little revolution, at least, doesn't involving changing human nature. Distrust any that does. Whole Data is not magic, it's just healthier than the bleached, bromated version, and yummier than chewing raw semantic bran. All I'm trying to do is make information tools that allow humans to be less mechanical, and machines to be more helpful.  

Here are some helpfulnesses:  

- A node/arc database can handle multiple-value and no-value relationships just as easily as single-value relationships. Multiple-value relationships are 90% of the interesting structure of data, yet they're basically the bane of relational-database usability, and RDF only handles them if you don't care what order they go in or how easy it is to find out what they all are. No-value relationships are painful in relational databases, and inexpressible in RDF.  

- In a node/arc database you can make data-model modifications without touching any node whose data is not actually involved. The model does not determine the storage. In relational databases the model is the storage. Unless you already consider yourself a "working ontologist" you don't want to know what it takes to express a model in (actually for, because you can't do it in) RDF. Real models change all the time, so this can't be a half-assed after-thought.  

- In a node/arc database you can look at any data from the perspective of any other data. There's no "top", there's just "outward from here". In a relational database your tables and indexes limit your perspectives. In RDF there's no "top", but neither is there any "here"; all you can do is scan assertions until your patience runs out, looking for ones that say something else somethings this, or this somethings something else. Being able to stand anywhere and look in any direction is the difference between exploration and a zoo, and thus between discovery and watching an okapi crap.  

- A node/arc data structure can be browsed. Relational tables and RDF can mostly only be searched, and browsing is a UI illusion built on top of searching. But browsing, wandering, connecting, following, compiling, discarding: these are how humans organize and comprehend information, and we deserve a database in which those are the cheapest and sturdiest building blocks, not the most expensive and fragile fabrications.  

- In a node/arc database with a path-based query language, human browsing can be used as machine training, and human questions can be formulated for machine processing in structurally the same way they would be stated for human evaluation. We think in contexts and connections, not in subgraph matching or index inversion. We do not start every inquiry with a list of all the universe's truths. We start somewhere, and follow paths. We need to be able to send our questions down these paths, down all the paths at once, and have them come back with their own answers.  
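
To ground those claims a little, here is a toy sketch of such a store in Python (all names mine): every fact is a labeled arc between two nodes, multiple values and missing values cost nothing, every arc can be followed from either end, and a question is just a path walked outward from wherever you happen to be standing:

from collections import defaultdict

class Graph:
    def __init__(self):
        self.out = defaultdict(lambda: defaultdict(list))   # node -> label -> [nodes]
        self.inc = defaultdict(lambda: defaultdict(list))   # the inverse arcs, for free

    def add(self, subject, label, obj):
        self.out[subject][label].append(obj)
        self.inc[obj][label].append(subject)

    def follow(self, nodes, label, inverse=False):
        arcs = self.inc if inverse else self.out
        return [n for node in nodes for n in arcs[node][label]]

g = Graph()
g.add("Nightwish", "album", "Oceanborn")
g.add("Nightwish", "album", "Once")
g.add("Oceanborn", "year", 1998)
g.add("Once", "label", "Spinefarm")

print(g.follow(["Nightwish"], "album"))                    # ['Oceanborn', 'Once']
print(g.follow(["Spinefarm"], "label", inverse=True))      # ['Once'] -- a label's albums
print(g.follow(g.follow(["Nightwish"], "album"), "year"))  # [1998] -- no year recorded for Once here, and that's fine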
 

The project I'm working on has a database system like I describe, and a data-model that's more usable than RDF, and a path-based query-language I wrote that's better than SPARQL. I want this thing to exist. I think we need it to exist. I'd rather it already existed, and if somebody else comes out with one tomorrow that's at least as good as mine, I'll be totally thrilled to use that instead. Mine isn't going to come out tomorrow. I don't know when it will. I don't know who will be totally thrilled, the day mine comes out, to use that one instead of the one they were working on.  

But I know these small improvements matter. Information matters, and understanding matters, and the tools for getting from one to the other matter. I think our ability to make use of information is part of how we are going to survive on Earth. So this, for the moment, is what I'm trying to do for us.

1. "Semantic". By starting the name this way, you have essentially, avoidably, uselessly doomed the whole named enterprise before it starts. Most people don't have the slightest idea what this word even means, most of the people who do have an idea think it implies pointless distinctions, and everybody left after you eliminate those two groups will still have to argue about what "semantic" means. This is a rare actual example of begging the question. Or to put it in terms you will understand: congratulations, you've introduced terminological head recursion. Any wonder the program never gets around to doing anything?  

2. "The Semantic Web". The "The" and the "Web" and the capitalization combine to suggest, even before anybody compounds the error by stating it explicitly, that this thing, which nobody can coherently explain, is intended to compete with a thing we already grok and see and fetishize. But this is totally not the point. The web is good. What we're talking about are new tools for how computers work with data. Or, really, what we're talking about are actually old tools for working with data, but ones that a) weren't as valuable or critical until the web made us more aware of our data and more aware of how badly it is serving us, and b) weren't as practical to implement until pretty recently in processor-speed and memory-size history.  

3. "FOAF". There have been worse acronyms, obviously, but this one is especially bad for the mildness of its badness. It sounds like some terrible dessert your friends pressured you to eat at a Renaissance Festival after you finally finished gnawing your baseball-bat-sized Turkey Sinew to death.  

4. FOAF as the stock example. You could have started anywhere, and almost any other start would have been better for explaining the true linked nature of data than this. "Friend" is the second farthest thing from a clean semantic annotation in anybody's daily experience. I'm barely in control of the meaning of my own friend lists, and certainly wouldn't do anything with anybody else's without human context.  

5. Tagging as the stock example. "Tagged as" is the first farthest thing from a clean semantic annotation in anybody's daily experience.  

6. Blogging as the stock example. Even if your hand-typed RDFa annotations are nuggets of precious ontological purity, you can't generate enough of them by hand to matter. Your writing is for humans, not machines, and wasting brains the size of planets on chasing pingbacks is squandering electricity. We already know how to add to humanity's knowledge one fact at a time. The problem is in grasping the facts en masse, in turning data to information to knowledge to wisdom to the icecaps not melting on us.  

7. Anything AI. Natural-language-processing and entity-extraction are interesting information-science problems, and somebody, somewhere, probably ought to be working on them. But those tools are going to pretty much suck for general-purpose uses for a really long time. So keep them out of our way while we try to actually improve the world in the meantime.  

8. "Giant Global" Graph. The "Giant" and "Global" parts are menacing and unnecessary, and maybe ultimately just wrong. In data-modeling, the more giant and global you try to be, the harder it is to accomplish anything. What we're trying to do is make it possible to connect data at the point where humans want it to connect, not make all data connected. We're not trying to build one graph any more than the World Wide Web was trying to build one site.  

9. Giant Global "Graph". This is a classic jargon failure: using an overloaded term with a normal meaning that makes sense in most of the same sentences. I don't know the right answer to this one, since "web" and "network" and "mesh" and "map" are all overloaded, too. We may have to use a new term here just so people know we're talking about something new. "Nodeset", possibly. "Graph" is particularly bad because it plays into the awful idea that "visualization" is all about turning already-elusive meaning into splendidly gradient-filled, non-question-answering splatter-plots.  

10. URIs. Identifying things is a terrific idea, but "Uniform" is part of the same inane pipe-dream distraction as "Giant" and "Global", and "Resource" and the associated crap about protocols and representations munge together so many orthogonal issues that here again the discussions all end up being Zenotic debates over how many pins can be shoved halfway up which dancing angel.  

11. "Metadata". There is no such thing as "metadata". Everything is relative. Everything is data. Every bit of data is meta to everything else, and thus to nothing. It doesn't matter whether the map "is" the terrain, it just matters that you know you're talking about maps when you're talking about maps. (And it usually doesn't matter if the computer knows the difference, regardless...)  

12. RDF. It's insanely brilliant to be able to represent any kind of data structure in a universal lowest-common-denominator form. It's just insane to think that this particular brilliance is of pressing interest to anybody but data-modeling specialists, any more than hungry people want to hear your lecture about the atomic structure of food before they eat. RDF will be the core of the new model in the same way that SGML was the core of the web.  

13. The Open-World Hypothesis. See "Global", above. Acknowledging the ultimate unknowability of knowledge is a profound philosophical and moral project, but not one for which we need computer assistance. Meanwhile, computers could be helping us make use of what we do know in all our little worlds that are already more than closed enough.

There are no dead-ends in data. Everything connects to something, and if anybody tells you otherwise, you should suspect them of hoping you won't figure out the connection they've omitted.  

But we have suffered, for most of humanity's life with data, without good tools for really recognizing the connectivity of all things. We have cut down trees of wood and used them to make trees of data. Trees are full of dead-ends, of narrower and narrower branches. Books are mostly trees. Documents are usually trees. Spreadsheets tend to be trees. Speeches are trees. Trees say "We built 5 solar-power plants this year", and you either trust them, or else you break off the end of the branch and go looking for some other tree this branch could have come from. Trees are ways of telling stories that yearn constantly to end, of telling stories you can circumscribe, and saw through, and burn.  

And stories you can cut down and burn are Evil's favorite medium. Selective partial information is ignorance's fastest friend. They built 5 solar plants, they say. Is that right? And how big were they? And where? And why did they say "we built", past-tense, not "we are now running"? And how many of last year's 5 did they close this year? And how many shoddy coal plants did they bolt together elsewhere, while the PR people were shining the sun in our eyes? There are statements, and then there are facts, and then there is Truth; and Truth is always tied up in the connections.  

Not that we haven't ever tried to fix this, of course. Indices help. Footnotes help. Dictionaries and encyclopedias and catalogs help. Librarians help. Archivists and critics and contrarians and journalists help. Anything helps that lets assertions carry their context, and makes conclusions act always also as beginnings. Human diligence can weave the branches back together a little, knit the trees back into a semblance of the original web of knowledge. But it takes so much effort just to keep from losing what we already knew, effort stolen from time to learn new things, from making connections we didn't already throw away.  

The Web helps, too, by giving us in some big ways the best tools for connection that we've ever had. Now your assertions can be packaged with their context, at least loosely and sometimes, if you make the effort. Now unsupported conclusions can be, if nothing else, terms for the next Google or Wikipedia search. This is more than we had before. It is a little harder for Evil to hide now, harder to lie and get away with it, harder to control the angle from which you don't see the half-truth's frayed ends.  

But these are all still ultimately tenuous triumphs of constant human vigilance. The machines don't care what we say. The machines do not fact-check or cross-reference, of their own volition, and only barely help us when we try to do the work ourselves. The Web ought to be a web, a graph, but mostly it's just more trees. Mostly, any direction you crawl, you keep ending up on the narrowest branches, listening for the crack. All the paths of Truth may exist somewhere, but that doesn't mean you can follow them from any particular here to any specific there.  

And even if linking were thoroughly ubiquitous, and most of the Web weren't SQL dumps occasionally fogged in by tag clouds, this would still be far from enough. The links alone are nowhere near enough, and believing they are is selling out this revolution before it has deposed anything, before it has done much more than make some posters. It is not enough for individual assertions to carry their context. It is not enough for our vocabulary of connection to be reduced to "see also", even if that temporarily seems like an expansion. It is not enough to link the self-aggrandizing press-release about solar plants to the company's web site, and hope you can find their SEC filings under Investor Relations somewhere. It is not enough to link the press-release to the filings, or for your blog-post about their operations in China to make Digg for six hours, or to take down one company or expose one lie. We've built a system that fountains half-truths at an unprecedented speed, and it is nowhere near enough to complete the half-truths one at a time.  

The real revolution in information consists of two fundamental changes, neither of which has really begun yet in anything like the pervasive way they must:  

1. The standard tools and methods for representing and presenting information must understand that everything connects, that "information" is mostly, or maybe exactly, those connections. As easy as it once became to print a document, and easier than it has become to put up web-pages and query-forms and database results-lists, it must become to describe and create and share and augment sets of data in which every connection, from every point in every direction, is inherently present and plainly evident. Not better tools for making links, but tools that understand that the links are already inextricably everywhere.  

2. The standard tools for exploring and consuming and analyzing connected information must move far beyond dealing with the connections one at a time. It is not enough to look up the company that built those plants. It is not enough to look up each of their yearly financial reports, one by one, for however many years you have patience to click. It's time to let the machines actually help us. They've been sitting around mostly wasting their time ever since toasters started flying, and we can't afford that any more. We need to be able to ask "What are the breakdowns of spending by plant-type for all companies that have built solar plants?", and have the machines go do all the clicking and collating and collecting. Otherwise our fancy digital web-pages might as well be illuminated manuscripts in bibliographers' crypts for all the good they do us. Linked pages must give way to linked data even more sweepingly and transformationally than shelved documents have given way to linked pages.  
 

And because we can't afford to wait until machines learn to understand human languages, we will have to begin by speaking to them like machines, like we aren't just hoping they'll magically become us. We will have to shift some of our attention, at least some of us some of the time, from writing sentences to binding fields to actually modeling data, and to modeling the tools for modeling data. From Google to Wikipedia to Freebase, from search terms to query languages to exploration languages, from multimedia to interactive to semantic, from commerce to community to evolving insight. We have not freed ourselves from the tyranny of expertise, we've freed expertise from the obscurity of stacks. Escaping from trees is not escaping from structure, it is freeing structure. It is bringing alive what has been petrified.  

There are no dead-ends in knowledge. Everything we know connects, by definition. We connect it by knowing. We connect. This is what we do, and thus what we must do better, and what we must train and allow our tools to help us do, and the only way Truth ever defeats Evil. Connecting matters. Truths, tools, links, schemata, graph alignment, ontology, semantics, inference: these things matter. The internet matters. This is why the internet matters.
If you're going to waste your time doing obsessive analysis of data on which nobody's life or ecology depends, you ought to at least do it diligently and efficiently.  

About a month ago The Deciblog published Justin Foley's attempt to answer the timeless question "How likely is a metal band to start their name with a particular letter of the alphabet?". For his sample set, Foley took the combined rosters of several metal labels (he doesn't reveal either list), which gave him 814 names, for which he then calculated the first-letter distributions, reaching the startling conclusion that the most likely letter is S.  

Foley put his results in a bar-chart, which I assume means he used a spreadsheet, so hopefully he didn't spend a whole lot of time hand-counting. But he should have spent even less. The Encyclopaedia Metallum is not only just sitting there with a collaboratively-amassed and collectively-moderated database of 50,000+ metal bands, but they've even already split it up by first-letter and there are band-counts right at the top of each letter-page. Add, divide, and you're done.  
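
For the terminally literal, the arithmetic really is the whole job. Something like the following few lines of Python would do it; the counts here are placeholders standing in for the real numbers you'd read off EM's letter pages, so this is a sketch of the method, not a reproduction of the chart below:

# Sketch only: per-letter band counts as read off the Encyclopaedia Metallum
# letter pages. These particular numbers are made up, not the real counts.
counts = {
    "#": 160, "A": 4600, "B": 3000, "C": 3200, "D": 4500,
    "S": 5500, "T": 2300,   # ...and so on through Z
}

total = sum(counts.values())
for letter in sorted(counts):
    pct = 100.0 * counts[letter] / total
    print("%s %.1f%% %s" % (letter, pct, "*" * int(round(pct))))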

Here, then, is the much better-informed version of this still-pointless breakdown:

? % *
# 0.3%
A 9.1% *********
B 5.9% *****
C 6.3% ******
D 8.9% ********
E 4.9% ****
F 3.6% ***
G 3.0% ***
H 3.9% ***
I 3.7% ***
J 0.6%
K 2.2% **
L 3.1% ***
M 7.4% *******
N 4.2% ****
O 2.3% **
P 4.0% ***
Q 0.2%
R 3.3% ***
S 10.8% **********
T 4.5% ****
U 1.3% *
V 2.7% **
W 2.7% **
X 0.3%
Y 0.2%
Z 0.7%


Most of Foley's numbers aren't that far off. His small sample-size leads him to overestimate J and Y, and underestimate Q and V, but the absolute numbers for these letters are small anyway. He also seems to underestimate A, P and R, for reasons which are not apparent in his opaque reporting, but might have to do with language tendencies, as EM's list is probably more global than his.  

But the biggest discrepancy in Foley's numbers, by far, is T, which he credits with 9.3% of the band-names, where EM data indicates less than half that. Here I have a wearyingly mundane but highly plausible theory: Foley has accidentally counted all the bands whose names begin with "The " as T, despite specifically saying that he didn't. This is obviously both philosophically and methodologically repugnant, and although I regret the maelstrom of blogospheric outrage that will undoubtedly accompany my public exposure of this error, I think we owe ourselves (the) Truth.  



Of course, the thousands of metal fans around the world who have put time and effort into building the Encyclopaedia Metallum did it because they care about the music, not the alphabet. The site is the definitive central reference source for most metal-related matters, and certainly the final arbiter of obscurity for the vast unknown majority of the bands it lists.  

It also has, in addition to its factual content, tens of thousands of percentage-scored, user-attributed, peer-moderated reviews of metal recordings. What it does not have is any sort of similarity analysis to make use of the huge data-graph represented by the connections between bands, users and ratings. There is a wearyingly mundane reason for this, too: trying to do similarity analysis with SQL queries will make you want to eat your own neck.  

So here is an actual contribution to the world's knowledge on this admittedly peripheral subject: empath, the missing similarity analysis of EM user/review/band data, accurate as of yesterday. Pick a band, see the other bands that people who like the first band also like. Data wants to form shapes. In a better world, this would be just as easy for EM to do themselves, updating live, as it is for them to serve their raw data into web pages.  
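
The analysis itself is not rocket surgery, either. Stripped of the real data and whatever weighting the scores deserve, the core of it is just co-occurrence counting, something like this Python sketch, with a toy reviews list standing in for the real EM dump:

from collections import defaultdict

# Toy (user, band) pairs standing in for favorable reviews; the real data
# would come from EM's reviews, and real scoring would weight the ratings.
reviews = [
    ("user1", "Sunn O)))"), ("user1", "Boris"),
    ("user2", "Sunn O)))"), ("user2", "Earth"),
    ("user3", "Boris"),     ("user3", "Earth"),
]

# which bands each user has reviewed
bands_by_user = defaultdict(set)
for user, band in reviews:
    bands_by_user[user].add(band)

# how often each pair of bands shares a reviewer
cooccurrence = defaultdict(lambda: defaultdict(int))
for bands in bands_by_user.values():
    for a in bands:
        for b in bands:
            if a != b:
                cooccurrence[a][b] += 1

def similar_to(band, n=5):
    # bands most often reviewed by the same people who reviewed this one
    return sorted(cooccurrence[band].items(), key=lambda kv: -kv[1])[:n]

print(similar_to("Sunn O)))"))   # e.g. [('Boris', 1), ('Earth', 1)]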

Easier, actually. I bet it took me less time to do this analysis than it took Foley to make a bar-chart of first letters. But I have better tools. I have better tools because at the moment I'm paid to design better tools. If I do my job well enough, eventually you'll have my better tools, too. I'm not designing them to tabulate heavy metal, I'm designing them to answer questions. Not all answers turn out to be shaped like Truths, of course. But if you can't answer them, you can't be sure which are which.
Two of the very bad features of the SQL-based conception of relational data-modeling, I think, are these:  

- one-to-many relationships are significantly harder to handle than one-to-one relationships (and, inversely, many-to-one relationships are easiest to handle by defactoring)  

- absence is significantly harder to handle than presence  

It's easy to add an Artist column to an Album table, for example, so we get:  

Album      Artist
Black One  Sunn O)))

But if you want to model the fact that collaboration albums can have multiple artists, you need an Albums table, an Artists table, a join table, and a bunch of IDs:  

Albums
Album      ID
Black One  1
Altar      2

Artists
Artist     ID
Sunn O)))  1
Boris      2

Authorship
AlbumID  ArtistID
1        1
2        1
2        2

This is already a pain in the ass, and thus you get a world filled with cop-outs like this:  

Album      Artist
Black One  Sunn O)))
Altar      Sunn O))) / Boris

and this:  

Album      Artist 1   Artist 2  Artist 3
Black One  Sunn O)))  FALSE     FALSE
Altar      Sunn O)))  Boris     FALSE

both of which are tolerable enough to just look at, but awful when you go to try to get the computer to answer even perfectly sensible questions like what albums Boris has done.  
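
If you doubt the awfulness, try it. Here is roughly what "what albums has Boris done?" turns into against the slash-delimited cop-out; a Python sketch rather than SQL, but the string-picking is the same misery either way:

# The cop-out table, as rows of (Album, Artist-display-string).
albums = [
    ("Black One", "Sunn O)))"),
    ("Altar", "Sunn O))) / Boris"),
]

# "What albums has Boris done?" now means parsing a display string, and
# trusting that every row used the same separator and the same spelling.
boris_albums = [album for album, artists in albums
                if "Boris" in (a.strip() for a in artists.split("/"))]
print(boris_albums)   # ['Altar']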

And then, when you want to add a little more detail, you really ought to do this:  

Albums
Album      ID
Black One  1
Altar      2

Artists
Artist     ID
Sunn O)))  1
Boris      2

Formats
Format    ID
CD        1
LP        2
Download  3

Authorship
AlbumID  ArtistID
1        1
2        1
2        2

Formatship
AlbumID  FormatID
1        1
1        2
1        3
2        1
2        2

but instead you'll probably try to get away with:  

Album      Artist             FormatCode
Black One  Sunn O)))          CD/LP/D
Altar      Sunn O))) & Boris  CD/LP

or  

Album      Artist 1   Artist 2  Artist 3  CD    LP    D
Black One  Sunn O)))  FALSE     FALSE     TRUE  TRUE  TRUE
Altar      Sunn O)))  Boris     FALSE     TRUE  TRUE  FALSE

because if you have the latter, you can do:  

SELECT Album, Artist
FROM Albums
WHERE D=FALSE

which is easy to understand, although it won't work because you forgot you don't have a field called plain "Artist" anymore. Whereas with the five-table form you have to do something like:  

SELECT Albums.Album, Artists.Artist
FROM Albums, Artists, Authorship
WHERE Albums.ID=Authorship.AlbumID
AND Artists.ID=Authorship.ArtistID
AND NOT EXISTS (
    SELECT *
    FROM Formats, Formatship
    WHERE Albums.ID=Formatship.AlbumID
    AND Formats.ID=Formatship.FormatID
    AND Formats.Format='Download'
)

and that's probably still wrong.  
 

I submit that this is all a colossal mess, and it's a wonder the database-driven software we get driven out of it isn't even more woeful than it is. A better paradigm would embrace the opposites of these two flaws:  

- the modeling of a relationship should be the same whether the relationship is one-to-one, one-to-many, many-to-one or many-to-many, and should be the same no matter how many "many"s you have  

- absence and presence should be equally easy to assess  

And I think a useful rough metric for such a new system is that it allows you to model your data without ever needing to say FALSE.  

And I think this, I think, because I think this is how we think and talk about relationships between things when nobody is getting in our way:  

Black One was recorded by Sunn O))), and was released in CD, LP and Download formats. Altar was recorded by Sunn O))) and Boris, and was released in CD and LP formats. All of these albums were released on CD and LP. Altar is the only one that wasn't released in Download format.  

Take these out of sentence form and make the data-types explicit and everything is still perfectly straightforward:  

Album: [Black One]
Artists: [Sunn O)))]
Formats: [CD] [LP] [Download]  

Album: [Altar]
Artists: [Sunn O)))] [Boris]
Formats: [CD] [LP]

or, looking at the same data from a different perspective:  

Format: [CD]
Albums: [Black One] [Altar]  

Format: [LP]
Albums: [Black One] [Altar]  

Format: [Download]
Albums: [Black One]

and then the old questions are also easier:  

FIND Formats WHOSE Albums INCLUDE 'Black One' and 'Altar'

FIND Albums WHOSE Formats DON'T INCLUDE 'Download'

Sublimating the one/many distinction allows us to talk about albums with 3 or 8 or 0 artists just as readily as we talk about albums with 1 or 2. Fixing the query language to handle absence lets us cope with the introduction of a new edition into our dataset, say, without having to add a new field for it and/or go through asserting that everything we already knew about doesn't have that format. I suspect pretty much every database field headed X and filled with TRUEs and FALSEs could at least be more usefully modeled in my new world as a relationship called "Characteristics" that either does or doesn't include X. And usually, as with formats, there's actually a meaningful label that contains information itself. The point is that things have relationships to their characteristics, or their formats. They don't, except in a very existential sense, have a relationship to falseness. FALSE is a machine thing.  
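
To make that a little more concrete, here is a sketch, in Python and purely for illustration, of the same tiny dataset held as plain sets, with both of those queries answered and nothing ever having to say FALSE:

# Each album just has sets of artists and formats; absence is simply absence.
albums = {
    "Black One": {"Artists": {"Sunn O)))"},          "Formats": {"CD", "LP", "Download"}},
    "Altar":     {"Artists": {"Sunn O)))", "Boris"}, "Formats": {"CD", "LP"}},
}

# FIND Formats WHOSE Albums INCLUDE 'Black One' and 'Altar'
shared_formats = albums["Black One"]["Formats"] & albums["Altar"]["Formats"]
print(shared_formats)   # {'CD', 'LP'}

# FIND Albums WHOSE Formats DON'T INCLUDE 'Download'
no_download = [name for name, facts in albums.items()
               if "Download" not in facts["Formats"]]
print(no_download)      # ['Altar']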
 

The relationally astute will note that the five-table version is the fully-normalized representation that would support an SQL-based implementation of my new form as a display abstraction, and presumably my pseudo-query-code could be preprocessed into the messy SQL so I'd never see it. But that's my point: SQL is too low an abstraction, which makes it too hard to do things naturally, which makes it too hard to pass along natural behavior to the victims of the system. I'm mainly arguing for a better abstraction. (Although I also expect it's still always going to be bad that there's no better underlying way to represent lists, so I'm also probably arguing for the better abstraction to not merely be layered on top of the bad one.)  
 

The discographically astute will note that Altar is available on iTunes, and that I have vastly undermodeled the rich and complicated space of Sunn O))) release formats. But if I'd done the examples in real detail, I'd still be typing out the bad versions...