What Beatles album is "Day Tripper" on?  

This is partially a trick question, of course, as "Day Tripper" was originally a non-album single, but it has been on several Beatles compilations over the years, including the red 1962-1966 best-of, and in the remastered 2009 catalog it lands on both the mono and stereo versions of Past Masters.  

Starting from scratch, it would take you, as a person, a little while to figure this answer out on the web. There's a Wikipedia page for the song, which is the top search hit for the above question in both Google and Bing, and it explains the non-album-ness and mentions several compilations, but it doesn't (at least as of this moment) clarify current availability, and none of the pages for the non-current compilations refer you explicitly (again, as of this moment) to the right place.  

There are a few paths that lead you to a voluminous page for the Beatles discography, on which "Day Tripper" is again mentioned several times, but this page doesn't itemize the track-listings of compilations, and the "Day Tripper" links just go back to the original song-page. But eventually you might blunder into the separate pages for the Mono and Stereo box sets, and from there you might wander over to Amazon.  

Here even more potential confusion awaits you. The top Google hit for "amazon past masters", at the moment, is the now-obsolete second volume of the old CD edition. Searching on "past masters" on Amazon itself gets you the Remastered edition as the top hit, but idiotically suggests buying it together with the old Volume 2, and you would have to scrutinize the track lists to realize that Past Masters [Remastered] subsumes Past Masters, Vol. 1 and Past Masters, Vol. 2.  

In fact, if you backtrack and try searching for "Day Tripper" on Amazon, the top hit is Rubber Soul, the album on which "Day Tripper" would have appeared, chronologically, but doesn't. And, for good measure, that top hit is the obsolete 1990 CD, not the new remaster.  

Bleh.  
 

But at least you're a person, and via a judicious combination of intuition and stubbornness and asking some friends who might know, you can eventually solve these information problems. If you were a computer, you'd be fucked.  

Which is OK on some existential level, because if you were a computer you probably wouldn't appreciate the song, anyway. But the point of computers is to help people do things, and a computer ought to be a particularly helpful tool when the thing you need to do is sort through some data.  

But for the computer to help you puzzle through this data, the data has to be modeled usefully by people first. There are several prominent sources of meticulously structured data about music, so this should be easy. But here, sadly, people have let us down again. And again, and again. Let's see how.  
 

All Music Guide  

A text-search for "Day Tripper" (there's no other query interface) returns a full page of cryptic results. There's an "Occurrences" column, and although it's not clear exactly what that means, it's obvious that more is supposed to be better, and the first listing has 360 where none of the rest have more than 13, so presumably that's the "right" one.  

Clicking this gets you 8 pages of results, which is annoying in itself (the splitting of them into pages, I mean). They're sorted by Artist, which sounds reasonable enough, except that the ones without artists are sorted first, and thus the first page of results is almost totally crap. There are lots of Beatles releases listed, but they get split between pages 1 and 2 of the results list, making it impossible to look at them all at once.  

But if you drill into one of them at random, and then click on "Day Tripper" in the track listing, you do finally get to a page that lists all (or several, anyway) Beatles releases on which this song appears. There are 24, though, including such things as "Five Nights in a Judo Arena", which human intuition might guess is not a normative release, but a computer would have no basis for dismissing. These releases are in date-order, at least, but this turns out to be worse than worthless for our current question, because All Music has modeled Past Masters [Remastered] not as an album, but as an alternate manifestation of the album Past Masters, Vol. 1 & 2, which means it appears way up in the middle of this list, labeled 1988, because that was the year of its earliest issue (on cassette!).  

Looking through the data, in fact, we see that although All Music has lots of individual detail on most kinds of things, it has essentially nothing that models the relationships of things to each other, or in groups. There is no modeled connection between Past Masters, Vol. 1 & 2, Past Masters, Vol. 2, The Beatles: Mono Box Set and The Beatles: Stereo Box Set, even though v1/2 subsumes v2, and both boxes subsume both volumes.  

And there's no modeling of "in print", or any notion of representing the subset of albums that make up the current core catalog. So a computer can't use this data to answer real questions by itself. Source fail.  
 

MusicBrainz  

This is a database first, not a guide, and thus a more likely candidate for well-structured data anyway, and one where I won't pick at their explicitly-secondary browsing UI.  

The good news is that MusicBrainz has the kind of data we need. They have some relationships between tracks, like one being a mashup of some others, so presumably they could add one to express that the "Day Tripper" on Mono Masters is a different version from the one on Past Masters [Remastered], but the same underlying song. They already have a reconciliation mechanism by which they can say that the "Day Tripper" on 1962-1966 is "the same" as the one on Past Masters, although at the moment the reconciliation data looks too noisy for real use.  

They even have the notion of one release being part of a set, although I didn't find very many examples of sets, and in particular I can't tell if a release can be part of more than one set. But if they can, that might be a mechanism for expressing official catalogs, current availability, and various other kinds of groupings and subsets.  

So current source fail, but at least there's hope here.  
 

Freebase  

Freebase is easily the most sophisticated public attempt at universal data-modeling, at the moment, but this is a caveat as well as a compliment. Freebase models attempt to represent everything that could possibly exist, and thus tend to drift quickly from usable simplicity towards abstractly-correct awkwardness, usually coming to rest far into the latter.  

So if you search on "Day Tripper", you will find that there are results of that name as "Topic", "Composition", "Musical Album" and "Musical Track", with at least dozens of the latter. Freebase fails the usability test even more spectacularly than All Music, as the list of Musical Tracks is presented with no grouping or distinguishing information at all, just "Day Tripper (Musical Track)" after "Day Tripper (Musical Track)", and you have to click on each one to get any clarifying info. "Day Tripper" the composition does not link to any of the tracks, and "Day Tripper" the album turns out to be a compilation of Beatles covers which does not, at least as far as the listed information shows, even contain the song "Day Tripper".  

And if you delve into the internals of the Freebase music schema, you can quickly develop a guess about why the data has not all been filled in: there's too damn much structure. A music/track is a recording of a music/composition. The track can appear on multiple music/releases, each of which is a publication of a particular music/album. Unless you need to model who was in the band during the making of an album, in which case the album links instead to a set of music/recording_contributions, which is each a combination of albums, contributors and roles.  

Oh, except compositions can be recorded "as albums", in which case they link to music/album without going through music/track, and tracks can appear directly on releases without going through music/album. And there's no current property for saying that a given track is an alternate version of another, but from following Freebase modeling discussions I can confidently guess that they'd model that by saying that a music/track links to something like a music/track_derivation, which itself is a combination of original track, derivative track, deriving artist (or music/track_derivation_contribution) and derivation type. And Freebase's query-language doesn't provide any recursion, so if these relationships chain, good luck querying them.  
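
To make the amount of indirection concrete, here is a rough Python rendering of the hops described above. The class names are paraphrased from the prose, not Freebase's actual schema definitions; this is just an illustration of the shape:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Composition:            # the abstract song, e.g. "Day Tripper"
    name: str

@dataclass
class Track:                  # one recording of a composition
    name: str
    recording_of: Composition

@dataclass
class Album:                  # the abstract album
    name: str

@dataclass
class Release:                # one publication of an album; tracks hang off releases
    album: Album
    tracks: List[Track] = field(default_factory=list)

@dataclass
class RecordingContribution:  # who played on an album, as its own separate object
    album: Album
    contributor: str
    role: str

# Getting from a song to something you can buy means walking
# Composition -> Track -> Release -> Album, with personnel living on
# yet another object off to the side.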
 

Music Ontology  

This isn't a database, just an attempt at a model for one. And, grimly, it's another quantum level more elaborately and unusably correct than the Freebase model. Even "MO Basics" (and the "Overview" has 22 more tables of explication beyond these "Basics", without getting into the "details") includes conceptual distinctions between Composition, Arrangement, Recording, Musical Work, Musical Item, MusicalExpression and MusicalManifestation. And then there are pages upon pages of minutely itemized trivia like beginsAtDuration, djmix_of, paid_download (and "paiddownload", which is different in some way I couldn't figure out), AnalogSignal, isFactorOf... This list is bad because it's too long, but the fact that it's in the schema means that it's also bad because no matter how long it is, it will never include every nuance you ever find yourself wanting, and thus over time it will only accumulate more debris.  

A tour-de-force into a cul-de-sac.  
 

The Rest of the Web  

Searching on any particular band or bit of music will unearth dozens or hundreds of other sites that contain bits of the information we need: stores, discographies, databases, forums, fan pages, official sites. Almost universally, these are either unqueriable flat HTML pages, or tree-structured databases with even less interlinking than the above sites. Encyclopaedia Metallum, my favorite metal site, has full track listings for a genuinely mind-boggling number of releases by an astonishing number of bands, but the tracks themselves are not data-objects and a machine can find out nothing about them. There are several lovingly hand-crafted Beatles discographies on the web, all far too detailed for our original casual query, and all essentially useless to a computer attempting to help us.  

So: Ugh. Triple ugh because a) the population of people willing to put time and energy into filling out music-related data-forms is obviously huge, b) the modeling problems are not intractably complicated in any theoretical sense, c) MusicBrainz and Freebase, at least (and the system I'm designing at work, I think), seem to be technically sufficient to represent the data correctly. If only we had a better plan.  
 

DiscO  

So here's my attempt at a better plan. I call it DiscO, for Discographic Ontology; that is, it's a scheme for structuring discographies. It is not an attempt at an abstract physics of human air-vibrating creativity, it is just an outline of a way to write down the information about bands, the music they've made, and how that music was released. It's intended to be simple enough that you can imagine people actually filling in the data, but expressive enough that the data can support interesting queries. And it's specifically intended to model nuance abstractly, so that it can accommodate some kinds of new needs without perpetually having to itself expand.  
 

There are four basic types:  

Artist - The Beatles, Big Country, Megadeth, Frank Zappa, whatever...  

Release - an individual album, single, compilation, whatever; Rubber Soul, Past Masters [Remastered], "Day Tripper"/"We Can Work It Out"...  

Track - an individual version of a song; "Day Tripper", "Day Tripper [mono]", "Day Tripper (performed live on pan flute and triangle by Zamfir and Elmo)", etc.  

Sequence - any collection of releases; Original Albums, Japanese Cassette Singles, 2009 Remasters, etc.  
 

These are related to each other like this:  

Artists mostly have Sequences. Sequences can be anything, but many artists would have some standard ones: Original Albums, Singles, Compilations, Remastered Albums, Current Catalog.  

Sequences have Releases (and Artists).  

Releases have Dates, Labels and Tracks. A Release may have an Artist directly, but more often would have one indirectly via a Sequence.  

Releases may be related to each other via Alternate Version/Original Version links. Thus Past Masters, Vol. 1 & 2 and Past Masters [Remastered] are both Releases, but Past Masters [Remastered] has an Original Version link to Past Masters, Vol. 1 & 2, and Past Masters, Vol. 1 & 2 has an Alternate Version link to Past Masters [Remastered].  

Tracks have Durations. A Track may have an Artist directly (so individual tracks on multi-artist compilations can be attributed correctly), but more often would have one indirectly via Release (which itself may have one indirectly via Sequence).  

Tracks may also be related to each other via Alternate Version/Original Version links. "Day Tripper" and "Day Tripper [mono]" are both Tracks, but "Day Tripper" has an Alternate Version link to "Day Tripper [mono]", and "Day Tripper [mono]" has an Original Version link to "Day Tripper". (We can get into geek arguments about which versions are the same and which are derivations (of which!), if we want, but whatever we decide, we can model.)  

Restated in schema-ish form, that's:  

Artist
- Sequence
- Release
- Track  

Sequence
- Artist
- Release  

Release
- Sequence
- Artist
- Date
- Label
- Track
- Original Version
- Alternate Version  

Track
- Artist
- Duration
- Original Version
- Alternate Version  

I think that's basically enough. What it gives up in expressiveness, it gains in usability. Our Beatles data can now, I think, be modeled both tractably and informatively. We can hook up all the versions of albums and versions of songs. We can create whatever sequences we need, and since the sequences themselves are just data, it's fine to have "Canadian Singles" for the Beatles and "Fanzine Flexis" for The Bedsitters without implying that either band should also have the other.  
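
To make the shape concrete, here is a minimal sketch of a fragment of that Beatles data as plain Python dictionaries. The field names and nesting are mine, the durations and track lists are illustrative rather than complete, and this is just one possible rendering of the model, not a normative serialization:

beatles = {
    "name": "The Beatles",
    "sequences": [
        {"name": "Current Catalog",
         "releases": ["Past Masters [Remastered]", "The Beatles: Mono Box Set"]},
    ],
}

releases = {
    "Past Masters, Vol. 1 & 2": {
        "date": 1988,
        "tracks": ["Day Tripper"],
        "alternate_versions": ["Past Masters [Remastered]"],
    },
    "Past Masters [Remastered]": {
        "date": 2009,
        "tracks": ["Day Tripper"],               # the other 32 tracks elided
        "original_version": "Past Masters, Vol. 1 & 2",
    },
    "The Beatles: Mono Box Set": {
        "date": 2009,
        "tracks": ["Day Tripper [mono]"],        # the other 212 tracks elided
    },
}

tracks = {
    "Day Tripper": {"duration": "2:49",
                    "alternate_versions": ["Day Tripper [mono]"]},
    "Day Tripper [mono]": {"duration": "2:49",
                           "original_version": "Day Tripper"},
}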

And using Thread, the query-language I will (before long, hopefully) be attempting to spread through the universe, we can start to ask our questions in a way the computer can answer:  

Track:=Day Tripper.Release
 

This is our naive query. It gets all the releases that have any track called exactly "Day Tripper". Good for assuring us there's some data in the bucket, but not much help in answering our question.  

Track:=Day Tripper.Release:(.Artist:=The Beatles)
 

That limits our results to albums by the Beatles, but there are still too many. With our fully-interlinked data-model, though, we can now actually ask something that is much closer to what we mean:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track:=Day Tripper)
 

That is, find the artist The Beatles, get their Current Catalog sequence, get that sequence's releases, and filter those releases down to the ones that contain a track called exactly "Day Tripper". This is progress.  

But "called exactly 'Day Tripper'" will exclude "Day Tripper [mono]", which isn't what we want to do. We're trying to ask a musical question about a song, not a typographical question about a title. But this, too, we have the powers to cope with:  

Track|Day Trippers=(...__,Original Version:=Day Tripper)  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)|
DT Versions=(.Track:Day Trippers=?),
Other Tracks=(.Track:Day Trippers=_._Count)
 

This time we first define a new inferred relationship on Track called "Day Trippers", which gets the Track, all its Original Versions, all their Original Versions (recursively), and then filters this set of tracks down to just the ones called "Day Tripper".  

Then we get the Beatles' current catalog releases again, but this time instead of checking each release for a track named "Day Tripper", we use our Day Trippers relationship to check for a track that is, or is derived from, "Day Tripper". And then, for each of the releases that have one, we infer two new relationships: "DT Versions" tells us which track(s) on this release are versions of "Day Tripper", and "Other Tracks" counts the tracks on this release that are not derivations of "Day Tripper".  

I.e.:  

#  Release                     DT Versions          Other Tracks
1  Past Masters [Remastered]   Day Tripper          32
2  Mono Box Set                Day Tripper [mono]   212
3  Stereo Box Set              Day Tripper          238
 

So now we know our choices. It took us so long to find out, but we found out.  
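
For the mechanically inclined, here is roughly what that inference amounts to in ordinary Python, reusing the beatles/releases/tracks dictionaries sketched earlier. The function names are mine, and Thread of course does this declaratively rather than with hand-written loops:

def originals(track_name):
    """The track itself plus everything reachable via Original Version links."""
    chain = [track_name]
    original = tracks[track_name].get("original_version")
    if original:
        chain += originals(original)
    return chain

def day_trippers(track_name):
    """The 'Day Trippers' inferred relationship from the query above."""
    return [t for t in originals(track_name) if t == "Day Tripper"]

current_catalog = next(s for s in beatles["sequences"]
                       if s["name"] == "Current Catalog")
for name in current_catalog["releases"]:
    release = releases[name]
    dt_versions = [t for t in release["tracks"] if day_trippers(t)]
    # other_tracks is 0 here only because the toy data elides the rest of each track list
    other_tracks = len(release["tracks"]) - len(dt_versions)
    if dt_versions:
        print(name, dt_versions, other_tracks)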
 
 
 

Tantalizing Postscript: But now that we have these three options before us, how do their contents overlap or differ, track by track?! We could bring up three different windows and squint at them. Or we could ask the computer:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)
/(.Track...__,Original Version:##1)/=nodes
 

Aaah. I see now. (How nice it will be when I'm allowed to show you...)

In his post explaining his departure from Google, Douglas Bowman says "Yes, it's true that a team at Google couldn't decide between two blues, so they're testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can't operate in an environment like that."  

He's not really trying to have an opinion-changing last word in an argument against data-driven product decisions, but if he were, this is not how to do it. If you believe that the proper width of a border can be tested, then Bowman's refusal to subject his intuitions to quantitative confirmation just sounds like petulant prima-donna nonsense. If you can test 41 shades of blue, this line of reasoning goes, you don't need to guess, so a guessing specialist is an annoying waste of everybody's time.  

The great advantage of testing and data, of course, is that you get precise, decisive answers you can act on. Shade 31, with 3.7%, trouncing runner-up shade 14 with only 3.4%! Apply shade 31, declare progress.  

But the great disadvantage of testing and data is that you get precise, decisive answers you can and will act on, but you almost never know what question you really asked. Sure, the people who saw shade 31 did some measurable thing at some measurable rate. But why? Is it shade 31? Or is it the contrast between shade 31 and the old shade? Or is it the interplay between shade 31 and some other thing you aren't thinking about, or possibly don't even control? Are you going to run your test again in a month to see if the results have changed? Did you cross-correlate them with HTTP_REFERER and index the colors on the pages people came from? What about all the combinations of these 41 shades and 41 backgrounds and 8 fonts and 3 border widths (12 if you vary each side of the box separately!) and 41 line-heights and 19 title-bar wordings and the color of the tie Jon Stewart was wearing the night before? Which things matter? How do you know?  

You don't. And if you need to add some new element, tomorrow, you don't know which of the tests you've already run it invalidates. Are you going to rerun all of them, permuted 41 more new ways?  

No. You are going to sheepishly post a job opening for a new guessing specialist. Bowman already had his last word. It was "Goodbye".

An adequate computer language allows humans to communicate with machines about machine concerns.  

A good computer language also facilitates communication between humans about machine concerns.  

A great language allows machines to participate in conversations between humans about human concerns.  
 

There are not very many of this last sort. As I've mentioned before, I'm trying to write one. I've been calling it a query language, but I've started to think I shouldn't. It's a language for talking about data-relationships, where most other things called "query languages" are for excerpting data, and the two are different qualitative goals even when the individual tasks end up being logistically similar. I'm trying to do for data-relationships what the system for symbolic algebra did for numbers. Not what algebra did for numbers, thankfully, just what we accomplished by making up a written syntax for expressing algebra compactly and precisely.  
 

So here's just one real-world example from yesterday. We were talking, elsewhere, about how you calculate overall ratings for bands in a large reviews database. The simplest thing is just to average all their ratings. In Thread, my data-relationship language, this is:  

Artist|(.Album.Rating._Average)
 

I.e.: For each artist, get their albums, then get those albums' ratings, then average all the ratings. But this is maybe not the best statistic, as it weights albums proportionally to the number of reviews. Maybe we want to average the ratings for each album, and then average the album-averages to get the artist average. That's a hard sentence for a person to read, and the computer can't read it at all. But in Thread it's just:  

Artist|(.Album.(.Rating._Average)._Average)
 

Run this, though, and you see that the top of the list is dominated by bands with very small numbers of very high ratings. Not really what we're trying to find out. So let's include only bands with at least 25 ratings:  

Artist:(.Album.Rating::#25)|(.Album.(.Rating._Average)._Average)
 

This is better, but maybe not as much better as you'd think. It turns out that there are a number of bands for which a small number of people have written a large number of reviews. Maybe what we really want is to average the ratings for each user, not for each album. That way one person giving the same high rating to 8 different albums counts as 1, not 8. And we'll only consider artists with ratings from 25 different users, not just 25 ratings total. This is:  

Artist:|(.Album.Rating/User::#25.(.group._Average)._Average)
 

Better, but it's still pretty easy to game this by creating new accounts and filing one very high rating from each of them. We can mitigate that, though, by trusting only ratings from users who have rated, say, at least 5 different albums, from at least 3 different artists. That's:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)._Average)
 

Better again. But there are still a few pretty obscure things at the top of the list. This doesn't prove that the results are flawed, of course, but scrutinizing them, and thinking about the sample-size effects of rating variation at this scale, reveals that the highest and lowest ratings are having pretty dramatic effects. Perhaps it would be smart to toss out the top and bottom 10% of the per-reviewer averages, averaging only the middle 80%. This keeps one perspective-challenged fan or one vengeful ex-bassist from single-handedly jumping the ratings up or down. Thus:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)#._Trim 10%._Average)
 

The result of this, in fact, is this leaderboard. By these rules Immolation is currently the top-ranked band in the Encyclopaedia Metallum.  
 

The English version of this final formulation is "bands with 25+ reviewers of their full-length albums, counting only reviewers who have filed at least 5 reviews and covered at least 3 bands; scored by averaging the ratings from each reviewer, dropping the top and bottom 10% of these reviewer-averages, and then averaging the remainder". This is a long sentence for people, and a useless sentence for machines, and as long as this is our canonical format, we will be at considerable risk for error every time we retranslate into a computer language. Put this in SPARQL or SQL or MQL, though, and it would be essentially inaccessible to people. So you choose between knowing what you want and not necessarily getting it, or knowing what you're getting but not whether it's what you want.  
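
For comparison, here is roughly what that English sentence comes to as an ordinary general-purpose-language function. This is a hedged sketch, not the production code: it assumes reviews arrive as a flat list of (artist, album, user, rating) tuples, and it ignores the full-length-album qualification:

from collections import defaultdict

def leaderboard(reviews):
    """reviews: a list of (artist, album, user, rating) tuples."""
    # Trusted reviewers: at least 5 distinct albums covering at least 3 distinct artists.
    albums_by_user, artists_by_user = defaultdict(set), defaultdict(set)
    for artist, album, user, rating in reviews:
        albums_by_user[user].add(album)
        artists_by_user[user].add(artist)
    trusted = {u for u in albums_by_user
               if len(albums_by_user[u]) >= 5 and len(artists_by_user[u]) >= 3}

    # Collect each artist's ratings per trusted reviewer.
    by_artist = defaultdict(lambda: defaultdict(list))
    for artist, album, user, rating in reviews:
        if user in trusted:
            by_artist[artist][user].append(rating)

    scores = {}
    for artist, by_user in by_artist.items():
        if len(by_user) < 25:                        # need 25+ distinct reviewers
            continue
        averages = sorted(sum(rs) / len(rs) for rs in by_user.values())
        trim = len(averages) // 10                   # drop top and bottom 10%
        kept = averages[trim:len(averages) - trim]
        scores[artist] = sum(kept) / len(kept)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

Correct, probably, but notice how thoroughly the shape of the question disappears into bookkeeping.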

I think we have to do better. The human stakes for data-comprehension are approaching critical levels, and our tools have not kept up. Worse, the shiny new tools in the big labs are not ready yet and not even that great.  

So Thread is my own personal attempt at doing better. Could it be the language we could actually share, humans and computers, to talk about data? I can't prove it is yet, and the project in which it's embedded is still working towards its public debut, so you can't make up your own mind yet, either. But for the past couple years I've been using it to talk to computers, and to myself, and even to a few coworkers, and the experience at least gives me hope. I know it's powerful, and I know it's compact.  

Like any language, of course, we'd have to learn it. I make no claims of it being "intuitive", whatever meaning that term might have for a symbolic-reasoning language, nor do I claim it's trivially implemented at scale. It's cryptic in its own particular way, and poses its own technical challenges. But I'm not trying to minimize anybody's absolute difficulty, I'm trying to maximize the ratio of power to difficulty. If, reading those examples above, without a formal tutorial or even an actual diagram of the data model in question, you have at least a sense of what might be going on, then it's at least possible I'm getting somewhere.  
 

[Note from a few days later: in re-reading these queries I actually noticed a methodological error! The first time I did this, I neglected to sort the ratings before trimming the first and last few. That is, I did this:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)._Trim 10%._Average)
 

where I should have done this:  

Album|Trusted Rating=(.Rating:(.User:(.Rating.Album::#5.Artist::#3)))  

Artist|(.Album.Trusted Rating/User::#25.(.group._Average)#._Trim 10%._Average)
 

The operative difference is the "#" for sorting right before "._Trim 10%" in the second query, which is what makes the trim function take off the highest and lowest ratings, rather than just the first and last.  

But even this error is kind of my point. The language is a tool for me to talk to myself over time.]

I gave another version of my Whole Data blog post as part of a panel called "Death of the Relational Database?" at the Web 3.0 conference today.  

My answer to the question in the panel-title is that the relational database was always mostly a coping mechanism for a time of relative scarcity, when our data volume and needs exceeded our storage and processing capacities. This era is not entirely past us, but with more memory, faster data-loading, faster processing and better programming tools, there are increasingly many datasets of larger and larger size for which we can now support qualitatively richer abstractions. Where we used to have to break data apart, and thus were constrained to interact with it in fragmented and broken ways, we can now increasingly afford to keep it whole.  

And here, then, to try to explain why that matters, is my current four-point reduction of the Whole Data manifesto:  

1. What logic allows, storage may not constrain.  

The forms of the logical data-model and its inquiries are by definition independent of any extra-logical storage mechanics. That is, a proper data system exists for the express purpose of sustaining the illusion that the logical data-model is real. This must also be true over time: logically-sensible changes in the logical data model cannot be constrained by storage mechanics.  

2. All relationships are lists.  

Not only must multiple-value relationships and no-value relationships be exactly as easily expressed, manipulated and inquired about as single-value relationships, but every fundamental data operation should thrive on multiplicity and sequence.  

3. There is no up or down, only onwards from here.  

Humans impose hierarchy and directionality on data, but different humans recognize different hierarchies in the same data at the same time. The system itself must be agnostic. No fact is inherently the property of another, they are all independent and structurally equal, and although pairs of relationships can be inverses of each other, there is no absolute “up” or “down”, or “forwards” or “backwards”. A database of albums, artists and years, for example, is exactly as much a database of years as it is of artists as it is of albums, not by the generous will of its schema-creator, but by the intrinsic nature of data. Most significantly, you must be able to start at any node and see everything else in this dataset to which, and from which, it is connected. In fact, turn this around and you get the essential practical definition of a “dataset”: it is a collection of data in which all internal connections are expressed.  

4. Human inquiry follows paths; machine inquiry follows all the paths at once.  

The reason to put data into computers is so that computers can answer questions that are meaningful to humans, but faster for computers. A data model and query language exist to communicate human needs to computers, not to communicate machine preferences to humans, and the responsiveness of a data system is its qualitative ability to answer questions with human motivations, not its quantitative throughput.  
 

At the end of this talk I also did the first (very brief) public demo of Thread, the path-based query language I've written as part of the data-modeling (among other things) system I've been working on at ITA Software. I'm looking forward (to put it mildly) to being able to talk much more about this, but for now I just whirled through a fast series of mostly-unexplicated queries that at least gestured in the direction of the points in my manifesto:  

Album:Year=2000
albums from the year 2000; a one-to-one relationship in the old world, but in the new world the year 2000 is a real thing, too, making this many-to-one

Album:Artist=Nightwish
albums by Nightwish; a more familiar many-to-one relationship

Artist:Album~Dark
Artists who did an album with "Dark" in the name; a one-to-many relationship with the same syntax as above

Label:Album~Dark
Labels that put out albums with "Dark" in the name; same syntax again, but following a different path to Albums

Album:~Dark.Label
same question again, but following the path from Albums, instead of to them

Artist:(.Album.Label:=Spinefarm)
Artists who've had albums on Spinefarm; same pattern as above, but filtering on a compound relationship

Label:=Spinefarm.Album.Artist
or go the other way

Artist:(.Album:#5.Year:>2000)
Artists whose fifth album came out after 2000; multiples and missing values (one of the bands in the sample data had only four albums)

Artist:!(.Album:#5)
artists who don't have a fifth album; filtering on absence
 

I also mentioned in the talk, but didn't actually show, the hypothetical query "Who are all the living movie directors who ever directed Cary Grant?":  

Actor:=Cary Grant.Film.Director:!(.Date of Death)  

Go ahead and try to answer that in IMDB without a query language. For that matter, try to answer it in Freebase with a query language. (Or, even worse, try to answer it in Freebase yesterday, before I fixed a bunch of dates of death by hand...)  
 

There's also a short article up on semanticweb.com at the moment, which if nothing else gives a transcribed sense of how I explain these ideas differently in a phone interview than I do in writing.  
 

And there's space for discussion here.

I like CSV. It addresses a common data-exchange need, and does it with admirably little conceptual or syntactic overhead. As formats go, it's totally unglamorous, but glamor is very definitely not an end in itself. If you need to exchange a simple table of simple data, one datum per row/column, CSV is a fine tool.  

I also like JSON. Actually, I really like JSON. It has the data-modeling virtues of my old Elemental XML proposal, with a cleaner semantic distinction between hashes and arrays. I would like to see the spec amended to allow keys to be strings or numbers, rather than only strings, but other than that I think the whole thing is a solid, self-consistent, useful, appropriately scaled solution to an impressively large set of data-exchange and serialization problems. Certainly anything you could model with CSV, you could put into JSON instead, and not at all vice versa.  

But the Law of Conservation of Complexity applies here, as in most things. Although JSON syntax is pretty simple, any particular JSON data model can be arbitrarily complex, and the modeling issues in data-exchange are always far more involved than the syntactic ones. The important simplicity of CSV is in constraining the data model to be a table.  

Flexibility, however, can be employed in moderation. It is possible, for example, to render a CSV model in JSON syntax. That is, to turn this:  

Artist,Albums
Cradle of Filth,7
Nightwish,8
To/Die/For,4
 

into this:  

[
["Artist","Albums"],
["Cradle of Filth","7"],
["Nightwish","8"],
["To\/Die\/For","4"]
]
 

The advantage of doing so will, I hope, become quickly obvious once you realize that JSON then makes it trivially easy to have cell values be arrays. If we want a list of artists, each with their list of albums, CSV forces us to refactor:  

Album,Artist
Cruelty and the Beast,Cradle of Filth
Damnation and a Day,Cradle of Filth
Dusk...and Her Embrace,Cradle of Filth
Midian,Cradle of Filth
Nymphetamine,Cradle of Filth
The Principle of Evil Made Flesh,Cradle of Filth
Thornography,Cradle of Filth
Angels Fall First,Nightwish
Bless the Child,Nightwish
Century Child,Nightwish
Dark Passion Play,Nightwish
Oceanborn,Nightwish
Once,Nightwish
Over the Hills and Far Away,Nightwish
Wishmaster,Nightwish
All Eternity,To/Die/For
Epilogue,To/Die/For
IV,To/Die/For
Jaded,To/Die/For
 

This is annoying at minimum, and unworkable if we also wanted to see other properties of artists. But in JSON, or this modeling reduction of JSON that we might as well call "JSV", we can simply say:  

[
["Artist","Album"],
["Cradle of Filth",["The Principle of Evil Made Flesh","Dusk...and Her Embrace","Cruelty and the Beast","Midian","Damnation and a Day","Nymphetamine","Thornography"]],
["Nightwish",["Angels Fall First","Oceanborn","Wishmaster","Over the Hills and Far Away","Century Child","Bless the Child","Once","Dark Passion Play"]],
["To\/Die\/For",["All Eternity","Epilogue","Jaded","IV"]]
]
 

JSV retains the modeling simplicity of CSV (a single row of "fields" defined at the top), but allows a cell to be either a single value or a list of them. This is still unglamorous, but to me it's a big gain in expressivity for essentially no technical cost. I suggest that if combining lists and tables isn't a big deal to you (and you're the sort of person to whom a data-serialization format could be a big deal), maybe you've allowed tabular single-mindedness to beat all the natural human list-thinking out of you. Time to reclaim your right to multiplicity. Time to reclaim lots of your rights, really. Yet another argument for more lists, everywhere around us.
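
And the collapse from the flat form to the nested one is mechanical enough to hand to the machine. A minimal Python sketch, assuming the flat Album,Artist table above has been saved to a file (the "albums.csv" name is made up):

import csv, json
from collections import defaultdict

albums_by_artist = defaultdict(list)
with open("albums.csv", newline="") as f:
    for row in csv.DictReader(f):                  # header row: Album,Artist
        albums_by_artist[row["Artist"]].append(row["Album"])

jsv = [["Artist", "Album"]] + [[artist, albums]
                               for artist, albums in albums_by_artist.items()]
print(json.dumps(jsv))
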
So there's "the Semantic Web", which has lots of philosophical baggage, and then there's "Linked Data", which is supposed to represent unpacking only the practical stuff from the piles of baggage. A secret password for a cult within a cult. A cult-within-a-cult big enough to have a two-day conference dedicated to it, but not yet big enough that the final panel on the second day wasn't facing a mostly empty room.  

And not quite yet a cult-within-a-cult with a coherent, useful agenda, I think. If you want to stage a grassroots revolution, you need to figure out four things:  

- What is the big change you're going to bring about?
- What's the work that has to be done?
- Who has to do the work?
- What's in it for them?  

Unfortunately, the current Linked Data agenda kind of answers these questions like this:  

- The big change is that all "data records" will have universally unique names, which will all also be web addresses so you can look them (or maybe something about them, or maybe kind of both) up with a browser, and when you look them up they will point you to other things.  

- The work to be done is that everybody must either convert their data into a list of individual subject-verb-objects assertions, including meta-assertions about those first assertions, or else at least construct the meta-assertions about the assertions and construct the unique names for the original assertions even if they don't actually exist. And all this should be done in a language (and data model) called RDF, most explanations of which begin by apologizing for it, which is understandable because it's the data-modeling equivalent of assembly language for programming, and pretty much nobody voluntarily works with nothing but levers when there's a Home Depot nearby.  

- The people who have to do the work are the owners of data.  

- The thing that's in it for them is that other people might mash up their data into something else, or discover it serendipitously, or effortlessly integrate it. Except Linked Data doesn't mean Open Linked Data, so you don't have to expose your data to the world, so in that case the benefit is, um, something something ontology something toolchain serendipitous LOOK!! OVER THERE!! IT'S A GIANT GLOBAL GRAPH EATING THE EMPIRE STATE BUILDING!!!!!! THIS IS THE WEB DONE RIGHT!!!!!  
 

This is a pretty bad plan for a revolution. It's hard to understand why the big change is big, never mind why it's good; the work is artificially difficult and deliberately obscure; the people who would have to do it are basically unprepared; and the "benefits" are peripheral more or less by definition.  
 

Let me try a different set of answers:  

- The big change is that we can finally afford to store most kinds of information in a way that separates the logical data model from the storage mechanics, so that the only thing that constrains what you can do with your data is your data.  

- The work to be done is that someone needs to build a node/arc-oriented database system, with accompanying data-model and query-language, that brings this idea up to the usability level of, say, a simple Excel spreadsheet or Google Base.  

- The people who have to do the work are data-system developers, whether new companies or existing ones.  

- The things that are in it for them are that the market of people with data is huge and growing, and the market for people with highly interconnected data is newly huge and frantically growing, and the existing RDBMS/SQL-based tools for analyzing and exploring and publishing that data are so awful that we will look back on them like we look back on COBOL and punch-cards. There is both money and human progress to be made.  
 

Mine probably initially seems like a smaller revolution. My enemy is not the web. My revolution is not about URIs or linking or dereferencing or ontology subsumption or transitive closure. I just know that we've wasted human-centuries of time fighting against internal problems introduced by relational-database implementations, and built legions of crappy, inflexible, inhumane data systems because the limitations of our databases constrained what questions we allowed humans to even ask, never mind get answered. And I know that computers are now big enough and fast enough that for increasingly many datasets, the old ugliness is no longer even remotely necessary.  

And thus the new ugliness, RDF and SPARQL, is a painfully tragic missed opportunity. I no more want to be buried in triples than I wanted to be boxed into tables. Where SQL and tables made some simple things relatively simple, some simple things very hard, and most hard things really hard, RDF and SPARQL make all things theoretically possible, but all specific trivial things barely feasible, any simple things too cumbersome to attempt without assistance, and hard things moot because the simple things raise so many epistemological issues that all actual work halts.  
 

We can do better. Let's call this new movement Whole Data, like whole food. Let our data be what it is to people, not what it's easier to be to machines. I don't claim this is any kind of new science. Object-oriented databases aren't a new idea, and arguably a binary-relation node/arc data-model is more or less what the original idea of a "relational" database was before mundane practicality spent a few decades abusing it. But we can build it now. We can build it so that people can use it, easily and without having to learn some whole esoteric new trade.  

And yes, a node/arc Whole Data database will lend itself incredibly well to publishing that data on the web. Arcs are links, and the web is made of links, and it's deliciously easy to turn links into links. Moreover, arcs are labeled links, and labeled links mean that machines can do something useful with them. But it is not the job of each data owner to rewrite their own database tools. It is not the job of each webmaster to figure out how to "semantically annotate" their pages. This is a revolution in data-modeling abstraction, and it will be fought by software designers and developers.  

And linking disparate datasets together, which is supposed to be the whole point of Linked Data, and thus the whole point of this slightly sad and more than slightly confused conference I went to last week, is only barely a technical problem to begin with, and it's not at all clear to me that the technical agenda of this subcult actually makes non-trivial linking any easier. Most of the claimed benefits, even if you don't worry about whom they're benefits for, seem to me like imaginative hand-waving.  

- Giving things unique IDs does make it easier to talk about them. But you can have IDs in table records, so that isn't new, and you can refer to things by instructions instead of IDs and most of the time it doesn't make a lot of difference, so IDs aren't even always necessary. And giving things web IDs is a totally orthogonal issue about publishing that isn't about the data itself at all, and makes no more general sense than saying that all children should be named with cell-phone numbers.  

- You can "mesh together" RDF sets. Sometimes, modulo some obtuse URI issues. And if the sets' conceptual data models are compatible enough, you might be able to get something out of the combination. But this is more or less true of relational database tables, too. Exchanging data is not difficult. CSV, JSON, XML, whatever. Agreeing what the data should be is difficult, and agreeing what it should be in triples is no easier than agreeing what it should be in tables. (Maybe easier in theory, but definitely harder in current practice.) Integration is never "effortless". It's just that sometimes somebody else has already done the effort of reaching agreement, and sometimes they haven't.  

- My adorable little FOAF file can link to yours, and we can both link ours to Tim Berners-Lee's. But this is a toy "data web". It doesn't scale. Currently it doesn't even scale technically, as the web is too slow for queries to be federated out across every foafy server on the planet, but solving the technical problem will only make it faster for us to see how useless the answers are for human reasons. Every level of remove you go through is a chance for context or trust or meaning to be lost. I don't know how you decide who goes in your file, and I certainly don't know how the people in your file decide who goes in theirs. If you find your way to me and then to Tim, it is unlikely to do you very much real good. On one hand, his office is about two blocks away from mine. On the other hand, the only reason it's becoming marginally more likely that he knows me is that I keep complaining about things he's working on. Small Pieces Loosely Joined works great when you have time to examine the joins and the pieces and figure out which ones are useful for some particular personal purpose. Carry them home, trim them, water them, love them, and you might get a garden for your efforts. Send a machine out to gather them all up at once and you'll get a mushy septic field with some fine heirloom tomatoes submerged somewhere in it.  

- As long as our cults are small, sometimes we can find each other with nothing but our code words. But thinking that URI standards will yield pervasive serendipity is socially oblivious. Nothing I do in my FOAF file will ensure that you find me when you look at Tim's. Distributed linking is inherently asymmetric, and even if you crawl the entire web to construct all the missing inverses, the numbers are implacable. You cannot center yourself in a social network without cooperation. The space of associativity is curved: you can move from the obscure to the popular easily (indeed often inexorably), and along popularity contours with some effort. The imbalance of attention is a human phenomenon, not a technical one.  
 

But my little revolution, at least, doesn't involving changing human nature. Distrust any that does. Whole Data is not magic, it's just healthier than the bleached, bromated version, and yummier than chewing raw semantic bran. All I'm trying to do is make information tools that allow humans to be less mechanical, and machines to be more helpful.  

Here are some helpfulnesses:  

- A node/arc database can handle multiple-value and no-value relationships just as easily as single-value relationships. Multiple-value relationships are 90% of the interesting structure of data, yet they're basically the bane of relational-database usability, and RDF only handles them if you don't care what order they go in or how easy it is to find out what they all are. No-value relationships are painful in relational databases, and inexpressible in RDF.  

- In a node/arc database you can make data-model modifications without touching any node whose data is not actually involved. The model does not determine the storage. In relational databases the model is the storage. Unless you already consider yourself a "working ontologist" you don't want to know what it takes to express a model in (actually for, because you can't do it in) RDF. Real models change all the time, so this can't be a half-assed after-thought.  

- In a node/arc database you can look at any data from the perspective of any other data. There's no "top", there's just "outward from here". In a relational database your tables and indexes limit your perspectives. In RDF there's no "top", but neither is there any "here"; all you can do is scan assertions until your patience runs out, looking for ones that say something else somethings this, or this somethings something else. Being able to stand anywhere and look in any direction is the difference between exploration and a zoo, and thus between discovery and watching an okapi crap.  

- A node/arc data structure can be browsed. Relational tables and RDF can mostly only be searched, and browsing is a UI illusion built on top of searching. But browsing, wandering, connecting, following, compiling, discarding: these are how humans organize and comprehend information, and we deserve a database in which those are the cheapest and sturdiest building blocks, not the most expensive and fragile fabrications.  

- In a node/arc database with a path-based query language, human browsing can be used as machine training, and human questions can be formulated for machine processing in structurally the same way they would be stated for human evaluation. We think in contexts and connections, not in subgraph matching or index inversion. We do not start every inquiry with a list of all the universe's truths. We start somewhere, and follow paths. We need to be able to send our questions down these paths, down all the paths at once, and have them come back with their own answers.  
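
To ground those claims a little, here is a toy sketch of such a store in Python (all names mine): every fact is a labeled arc between two nodes, multiple values and missing values cost nothing, every arc can be followed from either end, and a question is just a path walked outward from wherever you happen to be standing:

from collections import defaultdict

class Graph:
    def __init__(self):
        self.out = defaultdict(lambda: defaultdict(list))   # node -> label -> [nodes]
        self.inc = defaultdict(lambda: defaultdict(list))   # the inverse arcs, for free

    def add(self, subject, label, obj):
        self.out[subject][label].append(obj)
        self.inc[obj][label].append(subject)

    def follow(self, nodes, label, inverse=False):
        arcs = self.inc if inverse else self.out
        return [n for node in nodes for n in arcs[node][label]]

g = Graph()
g.add("Nightwish", "album", "Oceanborn")
g.add("Nightwish", "album", "Once")
g.add("Oceanborn", "year", 1998)
g.add("Once", "label", "Spinefarm")

print(g.follow(["Nightwish"], "album"))                    # ['Oceanborn', 'Once']
print(g.follow(["Spinefarm"], "label", inverse=True))      # ['Once'] -- a label's albums
print(g.follow(g.follow(["Nightwish"], "album"), "year"))  # [1998] -- no year recorded for Once here, and that's fine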
 

The project I'm working on has a database system like I describe, and a data-model that's more usable than RDF, and a path-based query-language I wrote that's better than SPARQL. I want this thing to exist. I think we need it to exist. I'd rather it already existed, and if somebody else comes out with one tomorrow that's at least as good as mine, I'll be totally thrilled to use that instead. Mine isn't going to come out tomorrow. I don't know when it will. I don't know who will be totally thrilled, the day mine comes out, to use that one instead of the one they were working on.  

But I know these small improvements matter. Information matters, and understanding matters, and the tools for getting from one to the other matter. I think our ability to make use of information is part of how we are going to survive on Earth. So this, for the moment, is what I'm trying to do for us.

1. "Semantic". By starting the name this way, you have essentially, avoidably, uselessly doomed the whole named enterprise before it starts. Most people don't have the slightest idea what this word even means, most of the people who do have an idea think it implies pointless distinctions, and everybody left after you eliminate those two groups will still have to argue about what "semantic" means. This is a rare actual example of begging the question. Or to put it in terms you will understand: congratulations, you've introduced terminological head recursion. Any wonder the program never gets around to doing anything?  

2. "The Semantic Web". The "The" and the "Web" and the capitalization combine to suggest, even before anybody compounds the error by stating it explicitly, that this thing, which nobody can coherently explain, is intended to compete with a thing we already grok and see and fetishize. But this is totally not the point. The web is good. What we're talking about are new tools for how computers work with data. Or, really, what we're talking about are actually old tools for working with data, but ones that a) weren't as valuable or critical until the web made us more aware of our data and more aware of how badly it is serving us, and b) weren't as practical to implement until pretty recently in processor-speed and memory-size history.  

3. "FOAF". There have been worse acronyms, obviously, but this one is especially bad for the mildness of its badness. It sounds like some terrible dessert your friends pressured you to eat at a Renaissance Festival after you finally finished gnawing your baseball-bat-sized Turkey Sinew to death.  

4. FOAF as the stock example. You could have started anywhere, and almost any other start would have been better for explaining the true linked nature of data than this. "Friend" is the second farthest thing from a clean semantic annotation in anybody's daily experience. I'm barely in control of the meaning of my own friend lists, and certainly wouldn't do anything with anybody else's without human context.  

5. Tagging as the stock example. "Tagged as" is the first farthest thing from a clean semantic annotation in anybody's daily experience.  

6. Blogging as the stock example. Even if your hand-typed RDFa annotations are nuggets of precious ontological purity, you can't generate enough of them by hand to matter. Your writing is for humans, not machines, and wasting brains the size of planets on chasing pingbacks is squandering electricity. We already know how to add to humanity's knowledge one fact at a time. The problem is in grasping the facts en masse, in turning data to information to knowledge to wisdom to the icecaps not melting on us.  

7. Anything AI. Natural-language-processing and entity-extraction are interesting information-science problems, and somebody, somewhere, probably ought to be working on them. But those tools are going to pretty much suck for general-purpose uses for a really long time. So keep them out of our way while we try to actually improve the world in the meantime.  

8. "Giant Global" Graph. The "Giant" and "Global" parts are menacing and unnecessary, and maybe ultimately just wrong. In data-modeling, the more giant and global you try to be, the harder it is to accomplish anything. What we're trying to do is make it possible to connect data at the point where humans want it to connect, not make all data connected. We're not trying to build one graph any more than the World Wide Web was trying to build one site.  

9. Giant Global "Graph". This is a classic jargon failure: using an overloaded term with a normal meaning that makes sense in most of the same sentences. I don't know the right answer to this one, since "web" and "network" and "mesh" and "map" are all overloaded, too. We may have to use a new term here just so people know we're talking about something new. "Nodeset", possibly. "Graph" is particularly bad because it plays into the awful idea that "visualization" is all about turning already-elusive meaning into splendidly gradient-filled, non-question-answering splatter-plots.  

10. URIs. Identifying things is a terrific idea, but "Uniform" is part of the same inane pipe-dream distraction as "Giant" and "Global", and "Resource" and the associated crap about protocols and representations munge together so many orthogonal issues that here again the discussions all end up being Zenotic debates over how many pins can be shoved halfway up which dancing angel.  

11. "Metadata". There is no such thing as "metadata". Everything is relative. Everything is data. Every bit of data is meta to everything else, and thus to nothing. It doesn't matter whether the map "is" the terrain, it just matters that you know you're talking about maps when you're talking about maps. (And it usually doesn't matter if the computer knows the difference, regardless...)  

12. RDF. It's insanely brilliant to be able to represent any kind of data structure in a universal lowest-common-denominator form. It's just insane to think that this particular brilliance is of pressing interest to anybody but data-modeling specialists, any more than hungry people want to hear your lecture about the atomic structure of food before they eat. RDF will be the core of the new model in the same way that SGML was the core of the web.  

13. The Open-World Hypothesis. See "Global", above. Acknowledging the ultimate unknowability of knowledge is a profound philosophical and moral project, but not one for which we need computer assistance. Meanwhile, computers could be helping us make use of what we do know in all our little worlds that are already more than closed enough.

There are no dead-ends in data. Everything connects to something, and if anybody tells you otherwise, you should suspect them of hoping you won't figure out the connection they've omitted.  

But we have suffered, for most of humanity's life with data, without good tools for really recognizing the connectivity of all things. We have cut down trees of wood and used them to make trees of data. Trees are full of dead-ends, of narrower and narrower branches. Books are mostly trees. Documents are usually trees. Spreadsheets tend to be trees. Speeches are trees. Trees say "We built 5 solar-power plants this year", and you either trust them, or else you break off the end of the branch and go looking for some other tree this branch could have come from. Trees are ways of telling stories that yearn constantly to end, of telling stories you can circumscribe, and saw through, and burn.  

And stories you can cut down and burn are Evil's favorite medium. Selective partial information is ignorance's fastest friend. They built 5 solar plants, they say. Is that right? And how big were they? And where? And why did they say "we built", past-tense, not "we are now running"? And how many of last year's 5 did they close this year? And how many shoddy coal plants did they bolt together elsewhere, while the PR people were shining the sun in our eyes? There are statements, and then there are facts, and then there is Truth; and Truth is always tied up in the connections.  

Not that we haven't ever tried to fix this, of course. Indices help. Footnotes help. Dictionaries and encyclopedias and catalogs help. Librarians help. Archivists and critics and contrarians and journalists help. Anything helps that lets assertions carry their context, and makes conclusions act always also as beginnings. Human diligence can weave the branches back together a little, knit the trees back into a semblance of the original web of knowledge. But it takes so much effort just to keep from losing what we already knew, effort stolen from time to learn new things, from making connections we didn't already throw away.  

The Web helps, too, by giving us in some big ways the best tools for connection that we've ever had. Now your assertions can be packaged with their context, at least loosely and sometimes, if you make the effort. Now unsupported conclusions can be, if nothing else, terms for the next Google or Wikipedia search. This is more than we had before. It is a little harder for Evil to hide now, harder to lie and get away with it, harder to control the angle from which you don't see the half-truth's frayed ends.  

But these are all still ultimately tenuous triumphs of constant human vigilance. The machines don't care what we say. The machines do not fact-check or cross-reference, of their own volition, and only barely help us when we try to do the work ourselves. The Web ought to be a web, a graph, but mostly it's just more trees. Mostly, any direction you crawl, you keep ending up on the narrowest branches, listening for the crack. All the paths of Truth may exist somewhere, but that doesn't mean you can follow them from any particular here to any specific there.  

And even if linking were thoroughly ubiquitous, and most of the Web weren't SQL dumps occasionally fogged in by tag clouds, this would still be far from enough. The links alone are nowhere near enough, and believing they are is selling out this revolution before it has deposed anything, before it has done much more than make some posters. It is not enough for individual assertions to carry their context. It is not enough for our vocabulary of connection to be reduced to "see also", even if that temporarily seems like an expansion. It is not enough to link the self-aggrandizing press-release about solar plants to the company's web site, and hope you can find their SEC filings under Investor Relations somewhere. It is not enough to link the press-release to the filings, or for your blog-post about their operations in China to make Digg for six hours, or to take down one company or expose one lie. We've built a system that fountains half-truths at an unprecedented speed, and it is nowhere near enough to complete the half-truths one at a time.  

The real revolution in information consists of two fundamental changes, neither of which has really begun yet in anything like the pervasive way they must:  

1. The standard tools and methods for representing and presenting information must understand that everything connects, that "information" is mostly, or maybe exactly, those connections. As easy as it once became to print a document, and easier than it has become to put up web-pages and query-forms and database results-lists, it must become to describe and create and share and augment sets of data in which every connection, from every point in every direction, is inherently present and plainly evident. Not better tools for making links, but tools that understand that the links are already inextricably everywhere.  

2. The standard tools for exploring and consuming and analyzing connected information must move far beyond dealing with the connections one at a time. It is not enough to look up the company that built those plants. It is not enough to look up each of their yearly financial reports, one by one, for however many years you have patience to click. It's time to let the machines actually help us. They've been sitting around mostly wasting their time ever since toasters started flying, and we can't afford that any more. We need to be able to ask "What are the breakdowns of spending by plant-type for all companies that have built solar plants?", and have the machines go do all the clicking and collating and collecting. Otherwise our fancy digital web-pages might as well be illuminated manuscripts in bibliographers' crypts for all the good they do us. Linked pages must give way to linked data even more sweepingly and transformationally than shelved documents have given way to linked pages.  
 

And because we can't afford to wait until machines learn to understand human languages, we will have to begin by speaking to them like machines, like we aren't just hoping they'll magically become us. We will have to shift some of our attention, at least some of us some of the time, from writing sentences to binding fields to actually modeling data, and to modeling the tools for modeling data. From Google to Wikipedia to Freebase, from search terms to query languages to exploration languages, from multimedia to interactive to semantic, from commerce to community to evolving insight. We have not freed ourselves from the tyranny of expertise, we've freed expertise from the obscurity of stacks. Escaping from trees is not escaping from structure, it is freeing structure. It is bringing alive what has been petrified.  

There are no dead-ends in knowledge. Everything we know connects, by definition. We connect it by knowing. We connect. This is what we do, and thus what we must do better, and what we must train and allow our tools to help us do, and the only way Truth ever defeats Evil. Connecting matters. Truths, tools, links, schemata, graph alignment, ontology, semantics, inference: these things matter. The internet matters. This is why the internet matters.
If you're going to waste your time doing obsessive analysis of data on which nobody's life or ecology depends, you ought to at least do it diligently and efficiently.  

About a month ago The Deciblog published Justin Foley's attempt to answer the timeless question "How likely is a metal band to start their name with a particular letter of the alphabet?". For his sample set, Foley took the combined rosters of several metal labels (he doesn't reveal either list), which gave him 814 names, for which he then calculated the first-letter distributions, reaching the startling conclusion that the most likely letter is S.  

Foley put his results in a bar-chart, which I assume means he used a spreadsheet, so hopefully he didn't spend a whole lot of time hand-counting. But he should have spent even less. The Encyclopaedia Metallum is not only just sitting there with a collaboratively-amassed and collectively-moderated database of 50,000+ metal bands, but they've even already split it up by first-letter and there are band-counts right at the top of each letter-page. Add, divide, and you're done.  
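
For the terminally literal, the arithmetic really is the whole job. Something like the following few lines of Python would do it; the counts here are placeholders standing in for the real numbers you'd read off EM's letter pages, so this is a sketch of the method, not a reproduction of the chart below:

# Sketch only: per-letter band counts as read off the Encyclopaedia Metallum
# letter pages. These particular numbers are made up, not the real counts.
counts = {
    "#": 160, "A": 4600, "B": 3000, "C": 3200, "D": 4500,
    "S": 5500, "T": 2300,   # ...and so on through Z
}

total = sum(counts.values())
for letter in sorted(counts):
    pct = 100.0 * counts[letter] / total
    print("%s %.1f%% %s" % (letter, pct, "*" * int(round(pct))))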

Here, then, is the much better-informed version of this still-pointless breakdown:

? % *
# 0.3%
A 9.1% *********
B 5.9% *****
C 6.3% ******
D 8.9% ********
E 4.9% ****
F 3.6% ***
G 3.0% ***
H 3.9% ***
I 3.7% ***
J 0.6%
K 2.2% **
L 3.1% ***
M 7.4% *******
N 4.2% ****
O 2.3% **
P 4.0% ***
Q 0.2%
R 3.3% ***
S 10.8% **********
T 4.5% ****
U 1.3% *
V 2.7% **
W 2.7% **
X 0.3%
Y 0.2%
Z 0.7%


Most of Foley's numbers aren't that far off. His small sample-size leads him to overestimate J and Y, and underestimate Q and V, but the absolute numbers for these letters are small anyway. He also seems to underestimate A, P and R, for reasons which are not apparent in his opaque reporting, but might have to do with language tendencies, as EM's list is probably more global than his.  

But the biggest discrepancy in Foley's numbers, by far, is T, which he credits with 9.3% of the band-names, where EM data indicates less than half that. Here I have a wearyingly mundane but highly plausible theory: Foley has accidentally counted all the bands whose names begin with "The " as T, despite specifically saying that he didn't. This is obviously both philosophically and methodologically repugnant, and although I regret the maelstrom of blogospheric outrage that will undoubtedly accompany my public exposure of this error, I think we owe ourselves (the) Truth.  



Of course, the thousands of metal fans around the world who have put time and effort into building the Encyclopaedia Metallum did it because they care about the music, not the alphabet. The site is the definitive central reference source for most metal-related matters, and certainly the final arbiter of obscurity for the vast unknown majority of the bands it lists.  

It also has, in addition to its factual content, tens of thousands of percentage-scored, user-attributed, peer-moderated reviews of metal recordings. What it does not have is any sort of similarity analysis to make use of the huge data-graph represented by the connections between bands, users and ratings. There is a wearyingly mundane reason for this, too: trying to do similarity analysis with SQL queries will make you want to eat your own neck.  

So here is an actual contribution to the world's knowledge on this admittedly peripheral subject: empath, the missing similarity analysis of EM user/review/band data, accurate as of yesterday. Pick a band, see the other bands that people who like the first band also like. Data wants to form shapes. In a better world, this would be just as easy for EM to do themselves, updating live, as it is for them to serve their raw data into web pages.  
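
The analysis itself is not rocket surgery, either. Stripped of the real data and whatever weighting the scores deserve, the core of it is just co-occurrence counting, something like this Python sketch, with a toy reviews list standing in for the real EM dump:

from collections import defaultdict

# Toy (user, band) pairs standing in for favorable reviews; the real data
# would come from EM's reviews, and real scoring would weight the ratings.
reviews = [
    ("user1", "Sunn O)))"), ("user1", "Boris"),
    ("user2", "Sunn O)))"), ("user2", "Earth"),
    ("user3", "Boris"),     ("user3", "Earth"),
]

# which bands each user has reviewed
bands_by_user = defaultdict(set)
for user, band in reviews:
    bands_by_user[user].add(band)

# how often each pair of bands shares a reviewer
cooccurrence = defaultdict(lambda: defaultdict(int))
for bands in bands_by_user.values():
    for a in bands:
        for b in bands:
            if a != b:
                cooccurrence[a][b] += 1

def similar_to(band, n=5):
    # bands most often reviewed by the same people who reviewed this one
    return sorted(cooccurrence[band].items(), key=lambda kv: -kv[1])[:n]

print(similar_to("Sunn O)))"))   # e.g. [('Boris', 1), ('Earth', 1)]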

Easier, actually. I bet it took me less time to do this analysis than it took Foley to make a bar-chart of first letters. But I have better tools. I have better tools because at the moment I'm paid to design better tools. If I do my job well enough, eventually you'll have my better tools, too. I'm not designing them to tabulate heavy metal, I'm designing them to answer questions. Not all answers turn out to be shaped like Truths, of course. But if you can't answer them, you can't be sure which are which.
Two of the very bad features of the SQL-based conception of relational data-modeling, I think, are these:  

- one-to-many relationships are significantly harder to handle than one-to-one relationships (and, inversely, many-to-one relationships are easiest to handle by defactoring)  

- absence is significantly harder to handle than presence  

It's easy to add an Artist column to an Album table, for example, so we get:  

Album      Artist
Black One  Sunn O)))

But if you want to model the fact that collaboration albums can have multiple artists, you need an Albums table, an Artists table, a join table, and a bunch of IDs:  

Albums
Album      ID
Black One  1
Altar      2

Artists
Artist     ID
Sunn O)))  1
Boris      2

Authorship
AlbumID  ArtistID
1        1
2        1
2        2

This is already a pain in the ass, and thus you get a world filled with cop-outs like this:  

Album      Artist
Black One  Sunn O)))
Altar      Sunn O))) / Boris

and this:  

Album      Artist 1   Artist 2  Artist 3
Black One  Sunn O)))  FALSE     FALSE
Altar      Sunn O)))  Boris     FALSE

both of which are tolerable enough to just look at, but awful when you go to try to get the computer to answer even perfectly sensible questions like what albums Boris has done.  
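
If you doubt the awfulness, try it. Here is roughly what "what albums has Boris done?" turns into against the slash-delimited cop-out; a Python sketch rather than SQL, but the string-picking is the same misery either way:

# The cop-out table, as rows of (Album, Artist-display-string).
albums = [
    ("Black One", "Sunn O)))"),
    ("Altar", "Sunn O))) / Boris"),
]

# "What albums has Boris done?" now means parsing a display string, and
# trusting that every row used the same separator and the same spelling.
boris_albums = [album for album, artists in albums
                if "Boris" in (a.strip() for a in artists.split("/"))]
print(boris_albums)   # ['Altar']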

And then, when you want to add a little more detail, you really ought to do this:  

Albums
Album      ID
Black One  1
Altar      2

Artists
Artist     ID
Sunn O)))  1
Boris      2

Formats
Format    ID
CD        1
LP        2
Download  3

Authorship
AlbumID  ArtistID
1        1
2        1
2        2

Formatship
AlbumID  FormatID
1        1
1        2
1        3
2        1
2        2

but instead you'll probably try to get away with:  

Album      Artist             FormatCode
Black One  Sunn O)))          CD/LP/D
Altar      Sunn O))) & Boris  CD/LP

or  

Album      Artist 1   Artist 2  Artist 3  CD    LP    D
Black One  Sunn O)))  FALSE     FALSE     TRUE  TRUE  TRUE
Altar      Sunn O)))  Boris     FALSE     TRUE  TRUE  FALSE

because if you have the latter, you can do:  

SELECT Album, Artist
FROM Albums
WHERE D=FALSE

which is easy to understand, although it won't work because you forgot you don't have a field called plain "Artist" anymore. Whereas with the five-table form you have to do something like:  

SELECT Albums.Album, Artists.Artist
FROM Albums, Artists, Authorship
WHERE Albums.ID=Authorship.AlbumID
AND Artists.ID=Authorship.ArtistID
AND NOT EXISTS (
    SELECT *
    FROM Formats, Formatship
    WHERE Albums.ID=Formatship.AlbumID
    AND Formats.ID=Formatship.FormatID
    AND Formats.Format='Download'
)

and that's probably still wrong.  
 

I submit that this is all a colossal mess, and it's a wonder the database-driven software we get driven out of it isn't even more woeful than it is. A better paradigm would embrace the opposites of these two flaws:  

- the modeling of a relationship should be the same whether the relationship is one-to-one, one-to-many, many-to-one or many-to-many, and should be the same no matter how many "many"s you have  

- absence and presence should be equally easy to assess  

And I think a useful rough metric for such a new system is that it allows you to model your data without ever needing to say FALSE.  

And I think this, I think, because I think this is how we think and talk about relationships between things when nobody is getting in our way:  

Black One was recorded by Sunn O))), and was released in CD, LP and Download formats. Altar was recorded by Sunn O))) and Boris, and was released in CD and LP formats. All of these albums were released on CD and LP. Altar is the only one that wasn't released in Download format.  

Take these out of sentence form and make the data-types explicit and everything is still perfectly straightforward:  

Album: [Black One]
Artists: [Sunn O)))]
Formats: [CD] [LP] [Download]  

Album: [Altar]
Artists: [Sunn O)))] [Boris]
Formats: [CD] [LP]

or, looking at the same data from a different perspective:  

Format: [CD]
Albums: [Black One] [Altar]  

Format: [LP]
Albums: [Black One] [Altar]  

Format: [Download]
Albums: [Black One]

and then the old questions are also easier:  

FIND Formats WHOSE Albums INCLUDE 'Black One' and 'Altar'

FIND Albums WHOSE Formats DON'T INCLUDE 'Download'

Sublimating the one/many distinction allows us to talk about albums with 3 or 8 or 0 artists just as readily as we talk about albums with 1 or 2. Fixing the query language to handle absence lets us cope with the introduction of a new edition into our dataset, say, without having to add a new field for it and/or go through asserting that everything we already knew about doesn't have that format. I suspect pretty much every database field headed X and filled with TRUEs and FALSEs could at least be more usefully modeled in my new world as a relationship called "Characteristics" that either does or doesn't include X. And usually, as with formats, there's actually a meaningful label that contains information itself. The point is that things have relationships to their characteristics, or their formats. They don't, except in a very existential sense, have a relationship to falseness. FALSE is a machine thing.  
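
To make that a little more concrete, here is a sketch, in Python and purely for illustration, of the same tiny dataset held as plain sets, with both of those queries answered and nothing ever having to say FALSE:

# Each album just has sets of artists and formats; absence is simply absence.
albums = {
    "Black One": {"Artists": {"Sunn O)))"},          "Formats": {"CD", "LP", "Download"}},
    "Altar":     {"Artists": {"Sunn O)))", "Boris"}, "Formats": {"CD", "LP"}},
}

# FIND Formats WHOSE Albums INCLUDE 'Black One' and 'Altar'
shared_formats = albums["Black One"]["Formats"] & albums["Altar"]["Formats"]
print(shared_formats)   # {'CD', 'LP'}

# FIND Albums WHOSE Formats DON'T INCLUDE 'Download'
no_download = [name for name, facts in albums.items()
               if "Download" not in facts["Formats"]]
print(no_download)      # ['Altar']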
 

The relationally astute will note that the five-table version is the fully-normalized representation that would support an SQL-based implementation of my new form as a display abstraction, and presumably my pseudo-query-code could be preprocessed into the messy SQL so I'd never see it. But that's my point: SQL is too low an abstraction, which makes it too hard to do things naturally, which makes it too hard to pass along natural behavior to the victims of the system. I'm mainly arguing for a better abstraction. (Although I also expect it's still always going to be bad that there's no better underlying way to represent lists, so I'm also probably arguing for the better abstraction to not merely be layered on top of the bad one.)  
 

The discographically astute will note that Altar is available on iTunes, and that I have vastly undermodeled the rich and complicated space of Sunn O))) release formats. But if I'd done the examples in real detail, I'd still be typing out the bad versions...