16 April 2016 to 27 March 2009 · tagged essay/tech
[This is the script from a talk I delivered at the EMP Pop Conference today. It was written to be read aloud at an intentionally headlong pace, with somewhat-carefully timed blasts of interstitial music. I've included playable clip-links for the songs here, but the clips are usually from the middles of the songs, and I was playing the beginnings of them in the talk, so it's different. The whole playlist is here, although playing it as a standalone thing would make no sense at all.]  

 

I used to take software jobs to be able to buy records, but buying records is now a way to hear all the world's music like collecting cars is a way to see more of the solar system.  

So now I work at Spotify as a zookeeper for playlist-making robots. Recommendation robots have existed for a while now, but people have mostly used them for shopping. Go find me things I might want to buy. "You bought a snorkel, maybe you'd like to buy these other snorkels?"  

But what streaming music makes possible, which online music stores did not, is actual programmed music experiences. Instead of trying to sell you more snorkels, these robots can take you out to swim around with the funny-looking fish.  

And as robots begin to craft your actual listening experience, it is reasonable, and maybe even morally imperative, to ask if a playlist robot can have an authorial voice, and, if so, what it is?  

The answer is: No. Robots have no taste, no agenda, no soul, no self. Moreover, there is no robot. I talk about robots because it's funny and gives you something you can picture, but that's not how anything really happens.  

How everything really happens is this: people listen to songs. Different people listen to different songs, and we count which ones, and then try to use computers to do math to find patterns in these numbers. That's what my job actually involves. I go to work, I sit down at my desk (except I actually stand at my fancy Spotify standing desk, because I heard that sitting will kill you and if you die you miss a lot of new releases), and I type computer programs that count the actions of human listeners and do math and produce lists of songs.  

So when anybody talks about a fight between machines and humans in music recommendation, you should know that those people do not know what the fuck they are talking about. Music recommendations are machines "versus" humans in the same way that omelets are spatulas "versus" eggs.  

So the good news is that you can stop worrying that robots are trying to poison your listening. But the bad news is that you can start worrying about food safety and whether the people operating your spatulas have the faintest idea what food is supposed to taste like.  

Because data makes some amazing things possible, but it also makes terrible, incoherent, counter-productive things possible. And I'm going to tell you about some of them.  

Counting is the most basic kind of math, and yet even just counting things usefully, in music streaming, is harder than you probably think. For example, this is the most streamed track by the most streamed artist on Spotify:  

Various Artists "Kelly Clarkson on Annie Lennox"  

Do you recognize the band? They are called "Various Artists", and that is their song "Kelly Clarkson on Annie Lennox", from their album Women in Music - 2015 Stories.  

But OK, that's obviously not what we meant. We just need to exclude short commentary tracks, and then this is the most streamed music track by the most streamed artist on Spotify:  

Various Artists "El Preso"  

Except that's "Various Artists" again. The most streamed music track by an actual artist on Spotify is:  

Rihanna "Work"  

OK, so that's starting to make some sense. Pretty much all exercises in programmatic music discovery begin with this: can you "discover" Rihanna?  

Spotify just launched in Indonesia, and I happen to know that Indonesian music is awesome, because there are people there and they make music, so let's find out what the most popular Indonesian song is.  

Justin Bieber "Love Yourself"  

I kind of wanted to know what the most popular Indonesian song is, not just the song that is most popular in Indonesia. But if I restrict my query to artists whose country of origin is Indonesia, I get this:  

Isyana Sarasvati "Kau Adalah"  

Which seems like it might be the Indonesian Lisa Loeb. It's by Isyana Sarasvati, and I looked her up, and she is Indonesian! She's 23, and her Wikipedia page discusses the scholarship she got from the government of Singapore to study music at an academy there, and lists her solo recitals.  

It turns out that our data about where artists are from is decent where we have it, but a lot of times we just don't. 34 of the top 100 songs in Indonesia are by artists for whom we don't have locations.  

But remember math? Math is cool. In addition to counting listeners in Indonesia, we can compare the listening in Indonesia to the listening in the rest of the world, and find the songs that are most distinctively popular in Indonesia. That gets us to this:  

TheOvertunes "Cinta Adalah"  

That is TheOvertunes, who turn out to be a band of three Indonesian brothers who became famous when one of them won X Factor Indonesia in 2013.  
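A minimal sketch of that "distinctively popular" comparison, on invented data (this is not how Spotify actually computes it; the function, field names, and smoothing are all assumptions):

```python
# Toy "distinctive popularity": compare each song's share of streams
# inside one country to its share in the rest of the world.
# All names and thresholds here are invented for illustration.

def distinctive_popularity(country_streams, world_streams, min_streams=100):
    """Rank songs by how much more popular they are in one country
    than globally, using a simple smoothed share ratio."""
    country_total = sum(country_streams.values())
    world_total = sum(world_streams.values())
    scores = {}
    for song, plays in country_streams.items():
        if plays < min_streams:
            continue  # skip rarely played tracks; their ratios are noise
        local_share = plays / country_total
        # +1 smoothing so songs unseen elsewhere don't divide by zero
        global_share = (world_streams.get(song, 0) + 1) / world_total
        scores[song] = local_share / global_share
    return sorted(scores, key=scores.get, reverse=True)
```

A globally huge song ends up with a ratio near 1 and sinks; a song streamed almost exclusively in the country floats to the top.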

But that's still not really what I wanted. It's like being curious about Indonesian food and buying a bag of Indonesian supermarket-brand potato chips.  

I kind of wanted to hear some, I dunno, Indonesian Indie music. I assume they have some, because they have people, and they have X Factor, and that's bound to piss some people off enough to start their own bands.  

So if we switch from just counting to doing a bit more data analysis -- actually, quite a lot of data analysis -- we can discover that yes, there is an indie scene in Indonesia, and we can computationally model which bands are more or less a part of it, and without ever setting foot in Indonesia, we can produce an algorithmic introduction to The Sound of Indonesian Indie, and it begins with this:  

Sheila on 7 "Dan..."  

That is Sheila on 7 doing "Dan...", and I looked them up, too. Rolling Stone Indonesia said that their debut album was one of the 150 Greatest Indonesian Albums of All Time, and they are the first band to sell more than 1m copies of each of their first 3 albums in Indonesia alone.  

Of course, they're also on Sony Music Indonesia, and I assume that at least some of those millions of people who bought their first 3 albums, before Spotify launched in Indonesia and destroyed the album-sales market, are still alive and still remember them. One of the hard parts about running a global music service from your headquarters in Stockholm and your music-intelligence outpost in Boston is that you need to be able to find Indonesian music that people who already know about Indonesian music don't already know about.  

But once you've modeled the locally-unsurprising canonical core of Indonesian Indie music, you can use that to find people who spend unusually large blocks of their listening time listening to canonical Indonesian Indie music (most of whom are in Indonesia, but they don't have to be; some of them might be off at a music academy in Singapore, where Spotify has been available since 2013), and then you can calculate what music is most distinctively popular among serious Indonesian Indie fans, even if you have no data to tell you where it comes from. And that gets us things like this:  

Sisitipsi "Alkohol"  

That is "Alkohol" by Sisitipsi. A Google search for that song finds only 8400 results, which appear to all be in Indonesian. Their band home page is a wordpress.com site, and they had 263 global Spotify listeners last month.  

PILOTZ "Memang Aku"  

PILOTZ, with a Z. Also from Indonesia! 117 listeners.  

Hellcrust "Janji Api"  

Hellcrust. 44 listeners last month. I looked them up, and yes, they're from Jakarta.  

199x "Goodest Riddance"  

199x. 14 monthly listeners! Also, maybe actually from Malaysia, not Indonesia, but in music recommending it's almost as impressive if you can be a little bit wrong as it is if you can be right, because usually when you're wrong you'll get Polish folk-techno or metalcore with Harry Potter fanfic lyrics.  

So that's what a lot of my days are like. Pose a question, write some code, find some songs, and then try to figure out whether those songs are even vaguely answering the question or not.  

And if the question is about Indonesia, that method kind of works.  
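The fan-based step described a few paragraphs back — find people who spend an unusual share of their listening on the canonical core, then rank what else those fans play — can be sketched in miniature like this. None of this is Spotify's actual code; the function names, thresholds, and data shapes are all invented:

```python
# Toy fan-based discovery: (1) find listeners who devote a large share
# of their plays to a "core" artist set, (2) rank other artists by how
# disproportionately those fans play them. Everything here is invented.
from collections import Counter

def core_fans(user_plays, core_artists, min_share=0.25):
    """Users whose fraction of plays going to core artists is at least min_share."""
    fans = []
    for user, plays in user_plays.items():
        total = sum(plays.values())
        core = sum(n for artist, n in plays.items() if artist in core_artists)
        if total and core / total >= min_share:
            fans.append(user)
    return fans

def distinctive_among_fans(user_plays, fans, core_artists):
    """Artists outside the core, ranked by fan-share vs overall-share."""
    fan_counts, all_counts = Counter(), Counter()
    for user, plays in user_plays.items():
        for artist, n in plays.items():
            all_counts[artist] += n
            if user in fans:
                fan_counts[artist] += n
    fan_total = sum(fan_counts.values()) or 1
    all_total = sum(all_counts.values())
    return sorted((a for a in fan_counts if a not in core_artists),
                  key=lambda a: (fan_counts[a] / fan_total) /
                                (all_counts[a] / all_total),
                  reverse=True)
```

The point of excluding the core at the end is exactly the point in the talk: the fans are defined by the music everybody already knows, and the output is the music they know that you don't.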

But we also have 100 million listeners on Spotify, and we would like to be able to produce personalized listening experiences for each of them. Actually, we'd like to be able to produce multiple listening experiences for each of them. And we can't hire all of our listeners to work for us full-time curating their own individual personal music experiences, because apparently the business model doesn't work? So it's computers or nothing.  

People, it turns out, are somewhat harder than countries.  

For starters, here is the track I have played the most on Spotify:  

Jewel "Twinkle, Twinkle Little Star"  

As humans, we might guess that it is not quite accurate to say that that is my favorite song, and we might have a very specific theory about why that is. As humans, we might guess that the number of times I have played the song after that has a different meaning:  

CHVRCHES "Leave a Trace"  

In the latter case, I love CHVRCHES so much. But in the former case, I love my daughter even more than I love CHVRCHES, and some nights she really needs to hear Jewel sing "Twinkle Twinkle Little Star" at bedtime.  

And if we are still in the early days of algorithmically programmed listening experiences, at all, then we're in what I hope we will look back on as the early- to mid- prehistory of algorithmic personalized listening experiences. I can't tell you exactly how they work, because we're still trying to make them work. But I can tell you 7 things I've learned that I think are principles to guide us towards a future in which dumbfoundingly amazing music you could never find on your own just flows out of the invisible sea of information directly into your ears. When you want it to, I mean; I don't mean you can't shut it off.  

1. No music listener is ever only one thing.  

I mean, you can't assume they are. I have a coworker named Matt who basically only listens to skate-punk music, ever, and we test all personalization things on him first, because you can tell immediately if it's wrong. Right: Warzone "Rebels Til We Die". Wrong: The Damned "Wounded Wolf - Original Mix". But other than him, almost everybody turns out to have some non-obvious combination of tastes. I listen to beepy electronica (Red Cell "Vial of Dreams") and gentle soothing Dark Funeral "Where Shadows Forever Reign" and Kangding Ray "Ardent", and sentimental Southern European arena pop (Gianluca Corrao "Amanti d'estate"), and if you just average that all together it turns out you mostly end up with mopey indie music that I don't like at all: Wyvern Lingo "Beast at the Door"  

2. All information is partial.  

We know what you play on Spotify, but we don't know what you listen to on the radio in the car, or what your spouse plays in your house while you're making dinner, or what you loved as a kid or even what you played incessantly on Rdio before it went bankrupt. For example, this is one of my favorite new albums this year: Magnum "Sacred Blood 'Divine' Lies". I adore Magnum, but I hadn't played them on Spotify at all. But my robot knew they were similar to other things it knew I liked. Sometimes music "discovery" is not about discovering things that you don't know, it's about the computer inferring aspects of your taste that you had previously hidden from it.  

3. Variety is good.  

It is in the interest of listeners and Spotify and music makers if people listen to more and more varied music. If all anybody wanted to hear was this once a day -- Adele "Hello" -- there would be no music business and no streaming and no joy or sunlight. Part of my job is to crack open the shell of the sky. Terabrite "Hello". If you are excited to hear what happens next, you will be more likely to pay us $10, and we will pay the artists more for the music you play, and they will make more of it instead of getting terrible day-jobs working for inbound marketing companies, and the world will be a better place.  

4. People like discovering new music.  

They may hate the song you want them to love. They may have a limited tolerance for doing work to discover music, or for trial-and-erroring through lots of music they don't like in order to find it, but neither of those things means that they wouldn't be thrilled by the right new song if somebody could find it for them. One of you will come up after this to ask me what this song is: Sweden "Stocholm". One of you, probably a different person, will wonder about this: Draper/Prides "Break Over You". I have like a million of those. I mean actually like an actual million of those.  

5. Bernie Sanders is right.  

It is in the interest of the world of music creators if the streaming music business exerts a bit of democratic-socialist pressure against income inequality. The incremental human value of another person listening to "Shake It Off" again is arguably positive, but it's probably also considerably smaller than the value of that person listening to a new song by a new songwriter who doesn't already have enough money to live out the rest of their life inside a Manhattan loft whose walls are covered with thumbdrives full of bitcoins and #1-fan selfies. Anthem Lights "Shake It Off". Taylor, if you're listening, I'm going to keep playing shitty covers of your songs until you put the real ones back on Spotify. That's how it works.  

6. If you're going to try to play people what they actually like, you have to be prepared for whatever that is.  

DJ Loppetiss "Janteloven 2016"  

That's "Russelåter", which is a crazy Norwegian thing where high school kids finish their exams way before the end of the senior year, so in the spring they get together in little gangs, give themselves goofy gang names, purchase actual tour buses from the previous year's gangs, have them repainted with their gang logo, commission terrible crap-EDM gang theme songs from Norwegian producers for whom this is the most profitable local music market, and then spend weeks driving around the suburbs of Oslo in these buses, drinking and never changing their clothes and blasting their appalling theme songs. I did not make this up.  

7. Recommendation incurs responsibility.  

If people are going to give up minutes of their finite lives to listen to something they would otherwise never have been burdened with, it better have the potential, however vague or elusive, to change their life. You can't, however tantalizing the prospect might seem, just play something because you want to. (Aedliga "Futility Has Its Limits") Like I said, you definitely can't do that. If you do that, the robots win.  

Thank you.
Through a roundabout series of connections, I got invited to be part of a roundtable panel at EMP Pop 2015, which ended up (in keeping with this year's themes of Music, Weirdness and Transgression) being a group deliberation on the subject of The Worst Song in the World.  

And since I was going to be there, and conference rules allowed for solo proposals in addition to the group thing, I figured I might as well also try something fun and weird and outside of my usual current data-alchemical domain.  

In the end the thing ended up being not quite free of data-alchemy in the same way that my songs without drums always somehow develop drum tracks. But it's not about data alchemy. At least mostly not.  

All the talks are supposed to eventually be available in audio form, but in the meantime, here is the script I was more or less working from. To reproduce the auditorium experience you should blast at least the first 20 seconds or so of each song as you encounter it in the text, and imagine me intoning the names of the songs in monster-truck-rally announcer-voice, and then saying everything else really fast and excitedly because a) you only get 20 minutes, and b) it was 9:20am on the Sunday morning after the Saturday night conference party and some people might need a little help relocating their attentiveness.  

(Also, be forewarned that neither the talk nor the music discussed is intended for underage audiences or people who are insecure about religion or genuinely frightened by grown men growling like monsters.)  
 

The Satan:Noise Ratio
or
Triangulations of the Abyss  

I grew up in what I wouldn't call a religious community, exactly, but certainly one that was dominated by the assumption of Christianity. My social status was kind of established when I told two members of the football team that the universe was formed out of dust, not Godliness, and it really didn't make any difference whether you liked that idea or not. This was second grade. We had a football team in second grade.  

By the time I discovered heavy metal, I was pretty ready for some kind of comprehensive alternative. Science fiction, existentialism, atheism, algebra, Black Sabbath. These all seemed to frighten people, which suggested they were good and powerful ingredients. But if you're going to fight against football in Texas, you have to have your shit organized. You need a program.  

Obviously as an atheist I wasn't going to believe in Satan any more than I was going to believe in elves, but the idea of Satanism seemed potentially compelling anyway. Like Scientology, but with roots, and better iconography, and fewer videotapes to buy. And I had learned a lot from reading the liner notes to Rush albums, so I dug into Black Sabbath albums with the same enthusiasm.  

Black Sabbath "After Forever"  

[You have to remember that at the time, that was really heavy. But the words go like this:]  

I think it was true it was people like you that crucified Christ
I think it is sad the opinion you had was the only one voiced
Will you be so sure when your day is near, say you don't believe?
You had the chance but you turned it down, now you can't retrieve  

Puzzling. But then, as if realizing they were missing something, they got a new singer whose name was Dio, and made an album called Heaven & Hell.  

Black Sabbath "Heaven & Hell"  

Sing me a song, you're a singer
Do me a wrong, you're a bringer of evil
The Devil is never a maker
The less that you give, you're a taker
So it's on and on and on, it's Heaven and Hell, oh well

Fool, fool! You've got to bleed for the dancer!  

The music: solid. The lyrics? Not exactly "Red Barchetta".  

But OK, what about Judas Priest. Didn't two guys kill themselves after listening to Judas Priest? Now we're getting serious.  

Judas Priest "Saints in Hell"  

Cover your fists
Razor your spears
It's been our possession
For 8,000 years
Fetch the scream eagles
Unleash the wild cats
Set loose the king cobras
And blood sucking bats  

OK, if I wanted a fucking rhyming "evil" version of Noah's Ark...  

But whatever. Before I found the Satanism I was looking for, New Wave happened, and it turned out that androgyny and drum machines scared the football boys way more than Satan.  

And then I left Texas and went to Harvard and took on a very different set of social challenges. So the next time I cycled back into metal, as I always do no matter how many other things I'm into, I wasn't looking for more elaborate pentagrams to shock football boys, I was looking for more hermeneutic nuances to situate and contextualize metal for comparative-lit majors who listened to the Minutemen and the Talking Heads.  

Slayer. The Antichrist. Fucking yes. Slayer makes Sabbath with Ozzy sound like Wings, and Sabbath with Dio sound like Van Halen with Sammy Hagar.  

Slayer "The Antichrist"  

I am the Antichrist
All love is lost
Insanity is what I am
Eternally my soul will rot (rot... rot)  

So, that's not Satanic, that's Christian. I mean, it's sort of ironic; Slayer, of course, were the original modern hipsters.  

But what about Bathory? In Nomine Satanas. Fucking Latin! Or something...  

Bathory "In Nomine Satanas"  

Ink the pen with blood
Now sign your destiny to me  

Jesus fucking christ: more fealty.  

Emperor. These are Norwegian actual church-burning dudes. Although, it's Scandinavia, so the church-burning was actually part of a progressive urban planning scheme with multi-use pentagrams in pleasant, radiant-heated public spaces.  

Emperor "Inno a Satana"  

O' mighty Lord of the Night. Master of beasts. Bringer of awe and derision.
Thou whose spirit lieth upon every act of oppression, hatred and strife.
Thou whose presence dwelleth in every shadow.
Thou who strengthen the power of every quietus.
Thou who sway every plague and storm.
Harkee.  

Satan's uvula! "Harkee"?  

Gorgoroth "Possessed by Satan"  

worldwide revolution has occurred
holy war, execution of sodomy
We are possessed by the moon
We are possessed by evil
We are possessed by Satan
possessed
possessed by satan
and then we rape the nuns with desire  

We rape the nuns with desire? This is a program of sorts, I guess. But not one that offered solutions to any problems I actually had. But after a while, I kind of stopped asking music to solve any problems in my life that weren't about music. As an adult, the main thing I asked from my Satanic Norwegian metal was leads for where I could find more of it. The most constant internal theme in my life has been the desperate gnawing suspicion that all the music I know is only the tiniest sliver of what actually exists.  

And maybe what we fear guides our evasions so inexorably that we always end up confirming our suspicions by our nature, but my love of metal motivated and informed my work designing data-analysis software as much as it haunted my attempts to understand emotional resonance, and gradually over the years my writing about music for people bled into writing about music for computers, and that's how I eventually ended up at Spotify, where we have a lot of computers and the largest mass of data about music that humanity has ever collected. And this makes it possible to find out about a lot of metal that you might not otherwise know about. A lot. And a lot of everything else. So I ended up making this genre map, to try to make some sense of it all.  

 

And having organized the world into 1375 genres (which is approximately 666 times 2), I can now answer some other questions about them. Just a few days ago, in fact, purely coincidentally and in no way because I was writing this talk at the last minute without a really clear idea where I was going with it, I decided to reverse-index all the words in the titles of all the songs in the world, and then, using BLACK MATH, find and rank the words that appear most disproportionately in each genre.  

It wasn't totally obvious whether this would produce a magic quantification of scattered souls, or a polite visit from some Mumford-and-Sons fans in the IT department, but here are some examples of what it produced in a few genres you might know:  

a cappella: medley love somebody your girl home time over will with when need around life what tonight song that don't just  

acoustic blues: blues woman boogie baby mama moan down mississippi gonna ain't going worried chicago shake long don't rider jail poor woogie  

modern country rock: country beer that's that whiskey love good like cowboy truck don't she's carolina back ain't just wanna this with dirt  

east coast hip hop: featuring edited kool explicit rhyme triple hood shit album game check ghetto what streets money flow version that style  

west coast rap: gangsta dogg featuring niggaz nate snoop hood ghetto playa money pimp thang shit smoke game bitch life funk ain't west  

I'd say that shit is doing something. [The whole thing is here.]  
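The BLACK MATH itself is unspecified, but one plausible reading of "words that appear most disproportionately in each genre" is a smoothed share ratio over title words, sketched below on invented data (this is an assumption about the method, not the actual computation):

```python
# A guess at "disproportionate words": for each genre, score the words
# in its song titles by how much more often they appear there than in
# titles overall. Data, smoothing, and scoring are all invented.
from collections import Counter

def distinctive_words(genre_titles, top_n=20):
    """genre_titles: {genre: [song titles]} -> {genre: top distinctive words}."""
    genre_counts = {g: Counter(w for t in titles for w in t.lower().split())
                    for g, titles in genre_titles.items()}
    overall = Counter()
    for counts in genre_counts.values():
        overall.update(counts)
    total = sum(overall.values())
    out = {}
    for genre, counts in genre_counts.items():
        g_total = sum(counts.values())
        # +1 smoothing keeps one-off words from dominating the ranking
        score = {w: (n / g_total) / ((overall[w] + 1) / total)
                 for w, n in counts.items()}
        out[genre] = sorted(score, key=score.get, reverse=True)[:top_n]
    return out
```

Words common everywhere ("love", "you") score near 1 in every genre and fall away; words concentrated in one genre's titles rise to the top of that genre's list.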

Using this, I can finally figure out the most Satanic of all metal subgenres. It is Black Thrash, whose top words go like this:  

satanic blasphemy unholy death infernal antichrist satan hell blood holocaust evil metal nuclear doom vengeance black flames darkness funeral iron  

If Satanism is fucking anywhere, it is here.  

Nifelheim "Envoy of Lucifer"  

OK, no idea what they're saying there.  

Destroyer 666 "Satanic Speed Metal"  

Um.  

Warhammer "The Claw of Religion"  

Since the beginning of time
A weapon was built and protected
To keep the balance in line
To guard the "forces of the light"
Do you hear the cries of all the ones that fell?  

Isn't that actually the narration from the beginning of The Fifth Element?  

Sathanas "Reign of the Antichrist"  

From the fall of grace-I shall rise again
Avenging chosen one-Known as Satan’s son  

Well, it's certainly Satanic. But it's Satanism as mirror-image Christianity. Like, imagine if Jackson Pollock's avant-garde transgression was taking Vermeer paintings and repainting them with left and right reversed!!!! To be fair, that's the usual way in which revolutions collapse into politics, hating the status quo's conclusions but being unable to escape its assumptions.  

However, I have a lot of other metal subgenres to work with, and I can actually reorganize the world as if Black Thrash were its point of origin, and then as we move slowly away from that point, genre by genre, we can start to see the patterns change.  

"Satan" begins to disappear.  

 

"Christ" goes away.  

 

"Damnation" no longer so much of a concern.  

 

"Chaos" starts to appear.  

 

"Darkness" is everywhere.  

 

"Eternal" fascinates us.  

 

As does "Beyond".  

 

"Death", always death.  

 

And over and over, at the top of almost every list that doesn't start with "Death": "Flesh".  

 

Except groove metal, where the number 1 term is "Reissue".  

So my mistake, maybe, was in assuming I was looking for a philosophy that called itself Satanic. Give up that constraint, and ideas start to coalesce after all.  

Entombed "Left Hand Path"  

No one will take my soul away
I carry my own will and make my day  

Enslaved "Ethica Odini"  

You have the key to mystery
Pick up the runes; unveil and see  

Dantalion "Onward to Darkness"  

Existence is your own adversary,
a path full of pain and madness.  

Mitochondrion "Eternal Contempt of Man"  

Now the earth, sea, and sky all have torn
Now a gate from the void hath been born
Both the watchers and the unholy do agree
Eradicate that vermin filth humanity  

Dodecahedron "I, Chronocrator"  

Reigning formulas undone
Oaths sworn into silence
Our world will be without form
Our earth will be void  

We are approaching a version of Nihilism that is not an absence, but an embrace of nothingness, an embrace of the finite, of finity.  

Celtic Frost "Os Abysmi Vel Daath"  

Where I am there is no thing.
No God, no me, no inbetween.  

Totalselfhatred "Enlightenment"  

OK, first of all, the band is called Totalselfhatred, and they sound like this. Dreamy.  

I cannot change your destiny, can only help you think
As far as my horizons lead - your thoughts will be more deep
Hope inside is torturing me - keeps painfully alive
A light inside, a knowledge deep, that shines so bright!  

And then, maybe, the grand masters of this, Deathspell Omega.  

Deathspell Omega "Chaining the Katechon"  

That's a 22-minute song, and it does not fade in.  

The task to be achieved, human vocation
Is to become intensely mortal
Not to shrink back
Before the voices
coming from the gallows tree
A work making increasing sense
By its lack of sense
In the history of times there is
But the truth of bones and dust.  

Here, then, are some potential tenets of a chaotic black metal philosophical program:  

1. Babel. Acceptance of chaos, instead of a futile struggle for order or serenity
2. The Codex. To exist in chaos is to seek complexity over simplicity
3. The Void. There is beauty in darkness
4. The Scythe. There are either no illusions, or all illusions, but either way, only death is real  

Which all adds up, I think, to something that I basically understood in second grade, after all: grimly acknowledged free will. That is the philosophical core of metal, as an art form. That is the exact rebellion I was seeking. To choose Satan, and particularly to choose Satan without giving him any positive qualities, is to assert that the act of choosing is more important than the actual choice. To choose death is to assert that choosing is more important than living. To choose death symbolically is somewhat more powerful than choosing it literally, because you can choose it symbolically more than once, which gives you a chance to refine your symbolism.  

Blut Aus Nord "The Choir of the Dead"  

That is Blut Aus Nord's "The Choir of the Dead", from an album actually called The Work Which Transforms God. What does it say? I dunno. But what does it mean? "Hail Satan" is "Think for yourself" plus noise.  

Thank you, and see you in Hell.  
 

[The whole playlist that I was playing from is on Spotify here: Triangulations of the Abyss.]  

Thanks to the Program Committee and the audience for indulging this whim, and particularly to Eric Weisbard for backing up his early-morning scheduling of this racket by showing up to moderate the session himself.
As part of a conference on Music and Genre at McGill University in Montreal, over this past weekend, I served as the non-academic curiosity at the center of a round-table discussion about the nature of musical genres, and of the natures of efforts to understand genres, and of the natures of efforts to understand the efforts to understand genres. Plus or minus one or two levels of abstraction, I forget exactly.  

My "talk" to open this conversation was not strictly scripted to begin with, and I ended up rewriting my oblique speaking notes more or less from scratch as the day went on, anyway. One section, which I added as I listened to other people talk about the kinds of distinctions that "genres" represent, attempted to list some of the kinds of genres I have in my deliberately multi-definitional genre map. There ended up being so many of these that I mentioned only a selection of them during the talk. So here, for extended (potential) amusement, is the whole list I had on my screen:  
 

Kinds of Genres
(And note that this isn't even one kind of kind of genre...)  

- conventional genre (jazz, reggae)
- subgenre (calypso, sega, samba, barbershop)
- region (malaysian pop, lithumania)
- language (rock en espanol, hip hop tuga, telugu, malayalam)
- historical distance (vintage swing, traditional country)
- scene (slc indie, canterbury scene, juggalo, usbm)
- faction (east coast hip hop, west coast rap)
- aesthetic (ninja, complextro, funeral doom)
- politics (riot grrrl, vegan straight edge, unblack metal)
- aspirational identity (viking metal, gangster rap, skinhead oi, twee pop)
- retrospective clarity (protopunk, classic peruvian pop, emo punk)
- jokes that stuck (crack rock steady, chamber pop, fourth world)
- influence (britpop, italo disco, japanoise)
- micro-feud (dubstep, brostep, filthstep, trapstep)
- technology (c64, harp)
- totem (digeridu, new tribe, throat singing, metal guitar)
- isolationism (faeroese pop, lds, wrock)
- editorial precedent (c86, zolo, illbient)
- utility (meditation, chill-out, workout, belly dance)
- cultural (christmas, children's music, judaica)
- occasional (discofox, qawaali, disco polo)
- implicit politics (chalga, nsbm, dangdut)
- commerce (coverchill, guidance)
- assumed listening perspective (beatdown, worship, comic)
- private community (orgcore, ectofolk)
- dominant features (hip hop, metal, reggaeton)
- period (early music, ska revival)
- perspective of provenance (classical (composers), orchestral (performers))
- emergent self-identity (skweee, progressive rock)
- external label (moombahton, laboratorio, fallen angel)
- gender (boy band, girl group)
- distribution (viral pop, idol, commons, anime score, show tunes)
- cultural institution (tin pan alley, brill building pop, nashville sound)
- mechanism (mashup, hauntology, vaporwave)
- radio format (album rock, quiet storm, hurban)
- multiple dimensions (german ccm, hindustani classical)
- marketing (world music, lounge, modern classical, new age)
- performer demographics (military band, british brass band)
- arrangement (jazz trio, jug band, wind ensemble)
- competing terminology (hip hop, rap; mpb, brazilian pop music)
- intentions (tribute, fake)
- introspective fractality (riddim, deep house, chaotic black metal)
- opposition (alternative rock, r-neg-b, progressive bluegrass)
- otherness (noise, oratory, lowercase, abstract, outsider)
- parallel terminology (gothic symphonic metal, gothic americana, gothic post-punk; garage rock, uk garage)
- non-self-explanatory (fingerstyle, footwork, futurepop, jungle)
- invented distinctions (shimmer pop, shiver pop; soul flow, flick hop)
- nostalgia (new wave, no wave, new jack swing, avant-garde, adult standards)
- defense (relaxative, neo mellow)  
 

That was at the beginning of the talk. At the end I had a different attempt at an amusement prepared, which was a short outline of my mental draft of the paper I would write about genre evolution, if I wrote papers. This is, in its way, also a list of kinds of kinds of things:  
 

The Every-Noise-at-Once Unified Theory of Musical Genre Evolution
  1. There is a status quo;
  2. Somebody becomes dissatisfied with it;
  3. Several somebodies find common ground in their various dissatisfactions;
  4. Somebody gives this common ground a name, and now we have Thing;
  5. The people who made Thing before it was called Thing are now joined by people who know Thing as it is named, and have thus set out to make Thing deliberately, and now we have Thing and Modern Thing, or else Classic Thing and Thing, depending on whether it happened before or after we graduated from college;
  6. Eventually there's enough gravity around Thing for people to start trying to make Thing that doesn't get sucked into the rest of Thing, and thus we get Alternative Thing, which is the non-Thing thing that some people know about, and Deep Thing, which is the non-Thing thing that only the people who make Deep Thing know;
  7. By now we can retroactively identify Proto-Thing, which is the stuff before Thing that sounds kind of thingy to us now that we know Thing;
  8. Thing eventually gets reintegrated into the mainstream, and we get Pop Thing;
  9. Pop Thing tarnishes the whole affair for some people, who head off grumpily into Post Thing;
  10. But Post Thing is kind of dreary, and some people set out to restore the original sense of whatever it was, and we get Neo-Thing;
  11. Except Neo-Thing isn't quite the same as the original Thing, so we get Neo-Traditional Thing, for people who wish none of this ever happened except the original Thing;
  12. But Neo-Thing and Neo-Traditional Thing are both kind of precious, and some people who like Thing still also want to be rock stars, and so we get Nu Thing;
  13. And this is all kind of fractal, so you could search-and-replace Thing with Post Thing or Pop Thing or whatever, and after a couple iterations you can quickly end up with Post-Neo-Traditional Pop Post-Thing.
 

And it would be awesome.  
 
 
 
 

[Also, although I was the one glaringly anomalous non-academic at this academic conference, let posterity record the cover of the conference program.]  

We will look back on these days, I think, as some weird interlude after the invention of computers but before we actually grasped what they meant for us. The Age we are stumbling towards, I am very sure, is the Age of Data. And when we get there, we will be there because we have sublimated the state-machine mechanics of computers beneath the logical structural abstractions of information and relation, and begun to inhabit this new higher world without reference to its substrate.  

I spent 5 years of my life trying to help bring this future about. That is, in a sense I've spent my whole adult life trying to help bring this future about, but for those 5 years I got to work on it very directly. I designed, and our team built, an attempt at a prototype of what a new data exploration system could be like, and at the core of this was my attempt at a draft of a language for discussing data the way algebra is a language for discussing math. These are the elements out of which this new age's alchemies will be constituted. And there were moments, as the system began to come into its own, when I felt the twitches of power awakening. You could conjure shapes out of data with this thing. It made information malleable, made it flow.  

The computer programmers on the team sometimes referred to the project as a system for "non-programmers", and I've come to think of that as both its potential and its downfall. Programmers never say "non-programmers" as a compliment. At best it's merely condescending, at worst it's a euphemism for "idiot" or a semi-aware admission of incomprehension. For programmers, programming is by definition an end, not a means, and therefore the motivations of non-programmers are inherently mysterious and alien. But what we built was for non-programmers in the same way that a bridge is for non-engineers. That is, the whole point of it was to represent a different interaction model between people and information than the ones offered by, at one end, programming languages, and, at the other, spreadsheets and traditional database programs. As I said over and over throughout those 5 years, I was trying to get us to do for hyper-connected datasets what VisiCalc once did for columns of numbers. I wasn't trying to simplify; if anything, I was making some things harder, or at least less familiar. This new age is not a subset of a previous age. It is not for lesser people, and its challenges are not of a simpler character.  

And as Google now shuts that system down, literally unceremoniously, and 5 years of my work and dreams and visions are at least nominally obliterated, I feel a little sadness but mostly relief. I'm still very convinced that our tools -- humanity's tools -- for interacting with data are hopelessly primitive. I'm still convinced that it won't make a whole lot of difference what those tools are if kids don't grow up learning how to think about data in the first place. I'm still convinced that I have a blurry, fractured vision of what it might take to change these things.  

But I also realize two more things.  

First, the system we built was only a beginning, and it had hardened into a premature finality long before its official corporate fate was settled. The query language I invented was cool, but the successor to it, which I'm sketching in my head whether I want to or not, is a different sort of thing yet again. And I was never going to reach it incrementally, arguing over every syntax decision on the way. Sometimes you have to just start over. The next one will not aspire to be the VisiCalc of anything. It's not better business tools we need. The problem is not that we are alienated from our inner accountants. The thing we need first is not even an algebra of data, probably, but an arithmetic of data. We need an inversion of "normalization" in which you don't write data wrong and then endure six Herculean labors to make it obscurely more pleasing to capricious gods, but rather a way of writing it in the first place with an inherent expressive gravity towards truth, because more true is always more powerful. This is a task in applied philosophy, not programming and not engineering and not even science. We need to imagine what Plato would have done when his record collection got too big for his cave.  

Second, I still believe that we all deserve better tools, tools more suited for our actual tasks and needs as people whose lives and choices and options are increasingly functions in, not merely of, information. But in the process of exploring what I mean by that I've become a non-non-programmer myself. At my new job I am an engineer. And sometimes, when you think you know what the better world looks like, you can bring pieces of it up out of your dreams. You can walk where the new paths will be. With enough belief, you can walk where the bridges will be. I will come back to these paths, one way or another, but you never do great things by imagining what people you don't understand might want for purposes you don't grasp or embrace. You should trust your own judgment only where you love beyond reason. Anybody could do nearly anything with Needle, and the business cases for it all involved hypothetical big companies doing hypothetical big things with hypothetical big data that repeatedly never actually materialized (and might have been hypothetical if they had). But left to my own invented devices, I always ended up using it for music data.  

So I have followed my own love, and my own obsessions, deeper into that data. At my new job, I am trying to make sense of the largest music database in the world, which is a lot more fun than what I was doing before, and harder, and of rather more direct and demonstrable relevance to anything. On my own, I will continue the music projects I started in Needle. The Discordance evolved out of empath, and so I've evolved it back in, with less marginalia but maybe more coherence. For the Pazz & Jop I've built a stats site far more specific than I could ever have done in the generalized environment of Needle. These will grow as I play with them, and probably there will be other things. I spent 5 years trying to build fancy tools, but it's pretty amazing what you can do with just a hammer. I was Needle's most dedicated user, but in the end, both sadly and happily, I don't actually need it any more. Nobody will miss it more than I will, but maybe nobody will really miss it very much. The moral, I think, and maybe even the ethic, is that these systems do not matter. This isn't the first system I worked on only to see it shut down, and it won't be the last. Software is the epitome of ephemera, necessary in aggregate but needless in every mundane specific.  

But the things we learn from these systems stay learned. Even the ways of learning remain ways after their original demonstrations disintegrate. This is another phrasing of the point about this Age, in fact: the flow from Data to Information to Knowledge to Wisdom is not a function of syntax or platforms or prevalence or virtualization. It is something we do, to which the technology is merely witness. We must teach our children how to think about data because the data survives where the systems fail. We must teach ourselves to be children again in this new Age, because its most transformative truths still await discovery, and are anything but mundane or needless, and we will never recognize them unless we can recall what it felt like in our hearts when everything was amazing and new and ahead of us, and the act of waking was an invitation to wonder to show us a way.
[May 2012 note: Needle, the database system I used to collect, analyze and show this information, was acquired and shut down by Google. Thus many of the links below go to non-functional snapshots of Needle pages I took before the shutdown. The points should survive.]  
 

Boston Magazine recently published their annual Best Schools ranking. They've been doing this for years, and are known for various other Boston rankings as well (places to live, places to eat...), so by now you'd expect them to be pretty good at it.  

Here's what "pretty good at it" amounts to, in 2011: two lists of 135 school districts, one with some configuration information (enrollment, student/teacher ratio, per-pupil spending, graduation rate, number of sports teams, what kind of pre-k they offer, how many AP classes), the second with test scores, and exactly this much methodological transparency: "we crunched the data and came up with this".  

Some obvious things that you can't do with this information:  

- sort it by any criteria other than the magazine's rank
- see the stuff in the first table alongside the stuff in the second
- understand which figures are actually part of the ranking, in what weights
- fact-check it
- compare it in bulk to any other information about these schools
- compare it to any other information about the towns served by these districts
- figure out why certain towns were included or excluded
- find out what towns are even meant by non-town district names if you don't already happen to know
- evaluate the significance of any individual factor, or the correlations of any set of them  

This is not a proud state of the art. And the quality of secondary journalism around it emphasizes the point further: this article about Salem's low ranking basically just turns a table-row into prose sentences, with no context or analysis, and fails to even realize that the 135 districts in the ranking represent just the immediate vicinity of Boston, not the whole state. This Melrose article claims Melrose "climbed" from 97th last year to 94th, but then has to add a note that last year's ranking was of high schools, not whole districts, and thus not even the same thing. Swampscott exults in making the top 50. Malden fights back at being ranked 119th. But nobody actually knows what the rankings mean or signify, because Boston Magazine doesn't say.  
 
 

In an attempt to improve this situation a little, I imported these two tables of information into Needle:  

 

This in itself was sufficient to unify the two tables and render them malleable, which seems to me like the most basic start. Now at least you can re-sort them yourself, and choose what to look at next to what.  

And a little sorting, in fact, quickly reveals some statistical oddities. North Attleborough was listed with an SAT Reading score of 823, which, since SAT scores only go up to 800, is very obviously wrong. Some trivial research verifies that this was a typo for 523, and while typos happen in journalism all the time, a typo in data journalism is a dangerous indication that some human has been retyping things by hand, which is never good. (This datum has now been fixed in the magazine's table.)  

More interestingly, when you start scrutinizing each district's 5th/8th/10th-grade MCAS scores, you find some surprising skews. Here are the MCAS and SAT scores for Georgetown:  

MCAS 5 English: 74
MCAS 5 Science: 54
MCAS 5 Math: 42  

MCAS 8 English: 81
MCAS 8 Science: 36
MCAS 8 Math: 51  

MCAS 10 English: 92
MCAS 10 Science: 90
MCAS 10 Math: 88  

SAT Reading: 570
SAT Writing: 566
SAT Math: 584  

Boston Magazine says they "looked within those districts to determine how schools were improving (or not) over time". But that's not what these scores are measuring. These aren't time-slices for a single cohort, these are different tests being given to different kids. If you're interested in history, the Department of Education profile of Georgetown includes annual MCAS results for 2006-2009, and all you have to do is scan down the page to spot the weird anomaly that is 8th grade Science. Every other test has healthy dark-blue bars for "Advanced" scores; but in 8th grade Science virtually no kids managed Advanced scores in any year. This pattern repeats in Wellesley in an even more dramatic fashion. An article from Wellesley Patch explains that their 8th grade science curriculum doesn't cover "space", while the MCAS does. It's an interesting ideological question whether curricula should be matched to the standardized tests, but whatever your opinion on that, it seems clearly misleading to interpret this policy issue as a quality issue.  
 
 

A little more sorting repeatedly raised another question: why is Cambridge ranked 25th? In virtually every test-score-based sort it falls close to the bottom of the table. In the magazine's ranking, Cambridge comes in ahead of Westford, at #26. But observe these scores for the two:  

MCAS 5 English: 59 - 88
MCAS 5 Math: 53 - 86
MCAS 5 Science: 45 - 85
MCAS 8 English: 75 - 95
MCAS 8 Math: 45 - 86
MCAS 8 Science: 34 - 78
MCAS 10 English: 70 - 97
MCAS 10 Math: 77 - 95
MCAS 10 Science: 59 - 94
SAT Reading: 498 - 587
SAT Writing: 493 - 582
SAT Math: 503 - 602
Graduation Rate: 85.2 - 94.6  

This doesn't even look close. But then notice these:  

Students per Teacher: 10.5 - 14.6
Per-Pupil Spending: $25,737 - $10,697  

Cambridge's spending per student is remarkable. It's almost 50% higher than the next highest, which is Waltham at $18,960. The 10.5 students per teacher is also the best ratio of the 135 schools listed, with 115th-ranked Salem in second place with 11. These factors seem like they should matter, and clearly they must be part of the magazine's ranking calculation, but if they're so uniformly not translating to better test scores or graduation rates in Cambridge, does this really make any sense?  

At least we ought to be able to say that these, along with the other non-test characteristics in the magazine like the number of sports teams and the number of AP classes, are different sorts of statistics than test scores. This seems increasingly true as you start looking at them in detail. Plymouth is listed as having 94 sports teams, for example. Can you name 94 different sports? I can't, and the Plymouth High School web site only claims they participate in 19. Newton is listed as having 39 AP classes, and Boston as having 155. But there are only 34 AP subjects, so it seems like a pretty safe guess that in these two multi-high-school districts the magazine is adding the totals for each school. It's hard for me to see what that accomplishes.  

So for my own interest, at least, I created my own Quant Score, which is calculated like this:  

- take all 9 of the listed MCAS scores, drop the lowest one, and sum the other 8
- divide each of the three SAT scores by 3, to put them into a range where they're each worth around twice as much as an individual MCAS score, and add those in
- multiply the graduation rate by 2, to put it into a similar range to the SAT scores, and add that in, as well  

These factors are admittedly arbitrary, so you're welcome to try your own variations, but at least I'm telling you exactly what goes into mine, so you know what you're varying against. I deliberately left out all the other descriptive metrics from this calculation, including student/teacher ratio and spending. I then reranked the schools according to these Quant Scores. See the comparison of the magazine's ranking and mine here.  
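The Quant Score arithmetic above is simple enough to write down directly. Here is a minimal sketch of it in Python; the function and argument names are mine, but the weighting (drop the lowest MCAS, SATs at one third, graduation rate doubled) is exactly the one listed. The sample call uses Georgetown's MCAS and SAT figures from the tables above, with a hypothetical graduation rate, since the post doesn't quote Georgetown's.

```python
def quant_score(mcas, sat, grad_rate):
    """mcas: the 9 listed MCAS percentages; sat: the 3 SAT scores (200-800);
    grad_rate: graduation percentage."""
    mcas_part = sum(mcas) - min(mcas)       # drop the lowest of the 9, sum the other 8
    sat_part = sum(s / 3 for s in sat)      # scale each SAT to roughly 2x an MCAS score
    grad_part = grad_rate * 2               # scale graduation rate into the SAT range
    return mcas_part + sat_part + grad_part

# Georgetown's scores from the tables above (grad_rate here is hypothetical):
georgetown = quant_score(
    mcas=[74, 54, 42, 81, 36, 51, 92, 90, 88],
    sat=[570, 566, 584],
    grad_rate=96.0,
)
```

Varying the weights, as invited above, is just a matter of editing those three scale factors.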

The differences are pretty dramatic. Three schools from outside the magazine's top 20 move into my top 10 (and 2 more from outside the magazine's top 10). The magazine's #s 6 and 8 drop to 28 and 30 in my list. Watertown and Waltham drop from 53 and 54 in the magazine to 100 and 114 in my list. Swampscott will be displeased to see that my re-ranking sends them back out of the top 50. Malden will probably not be much appeased that I've bumped them up from 119 to 118. Acton and Winchester will be thinking about staging parades for me. And Cambridge (where I live, and where my pre-K daughter will go to school unless we take some drastic action) plunges from 25th to 107th.  
 
 

But these are not answers, these are more questions. Most obviously: Why? I'm not claiming my Quant Score is definitive in any way, but it measures something, and I'm willing to claim that what it measures is something more coherent than what the magazine's rank measures. So this sets me off on the quest for better explanations, for which we obviously need more data.  

Needle is good at integrating data, so I have integrated a bunch of it: per-capita incomes, town populations, unemployment rates, district demographic breakdowns, lunch subsidy percentages and 2010 election results. Some of these apply to towns, not districts, and several districts serve multiple towns, but Needle loves one-to-one and one-to-many relationships the same, so I've done properly weighted multi-town averages. (Don't try that in a spreadsheet.)  
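The weighted multi-town averaging is the part a spreadsheet makes painful; as arithmetic it's one line. A sketch, with hypothetical town names and numbers, of averaging a town-level figure (here per-capita income) across a district, weighted by town population:

```python
def weighted_average(values, weights):
    """Population-weighted average of per-town figures for one district."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# (name, population, per-capita income) -- hypothetical two-town district:
towns = [("Townville", 42_000, 51_000),
         ("Villageton", 8_000, 38_000)]

district_income = weighted_average(
    values=[income for _, _, income in towns],
    weights=[pop for _, pop, _ in towns],
)
```

The larger town dominates the average in proportion to its population, which is the point: a naive unweighted mean would treat an 8,000-person town as equal to a 42,000-person one.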

And then I started comparing things. Per-pupil spending seems like it ought to matter, but it shows very little statistical correlation to quant scores. Student/teacher ratios, sports-team counts and AP classes also seem like they ought to matter, but the numbers don't support this.  

Per-capita income, on the other hand, matters. The percentage of students receiving lunch subsidies matters even more. In fact, this last factor (the precise calculation I used was adding the percentage of students receiving free lunch and half of the percentage of students receiving partially subsidized lunch) is the single best predictor of quant score that I've found so far. This is depressingly unsurprising: poverty at home is hard to overcome, hard enough for individuals, and even harder in aggregate.  
 
 

With this in mind, then, I ran a quick linear regression of quant score as a strict function of lunch-subsidy percentage, and used that to calculate predicted quant scores for each district. The depressing headline is how small those variations are. In a quant-score range from 1531 to 727, only 10 districts did more than 100 quant points better than predicted, and only 10 districts did more than 100 points worse. If I use the square roots of the lunch-subsidy percentages, instead, only 6 districts beat their predictions by 100, and only 8 miss by 100.  
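The whole prediction exercise fits in a few lines: the lunch-subsidy factor as stated above (free plus half of partially subsidized), an ordinary least-squares line fit of quant score against it, and a count of districts beating or missing the prediction by more than 100 points. The five data rows below are hypothetical stand-ins, not the real 135-district table.

```python
def lunch_subsidy_index(pct_free, pct_partial):
    """Free-lunch percentage plus half the partially-subsidized percentage."""
    return pct_free + 0.5 * pct_partial

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (free %, partial %) pairs and quant scores for five districts:
subsidy = [lunch_subsidy_index(f, p)
           for f, p in [(4, 2), (10, 4), (25, 10), (50, 10), (6, 4)]]
quant = [1480.0, 1350.0, 1100.0, 800.0, 1300.0]

slope, intercept = fit_line(subsidy, quant)
residuals = [y - (slope * x + intercept) for x, y in zip(subsidy, quant)]

overachievers = sum(1 for r in residuals if r > 100)
underachievers = sum(1 for r in residuals if r < -100)
```

On the real data the striking result is how few districts land outside that ±100 band at all.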

If I toss in town unemployment rates, Democratic vote percentages in the 2010 Senate election, and town per-capita income, I can get my predictions so close that only 1 school did more than 100 points better than expected, and only two did more than 100 points worse. This is daunting precision.  

But OK, even if the variations are small, they're there. So surely this is where those aspirational metrics like spending must come into play. Throwing money at students in school may not be able to counteract poverty at home, but doesn't it at least help?  

No.  

Students per Teacher? No.
AP classes? No.
Percentage of minority students? No.  

I'm by no means saying that there isn't an explanation, or more of an explanation, or other factors. But if there are, I haven't found them yet.  

But at least I'm trying. And I give you the data so you can try, too. I submit that this is what data journalism should be trying to do. We are trying to find knowledge in data. Secrecy and opaqueness and non-interactivity are counter-productive. It's more than hard enough to find truth even with all the data arrayed in front of us. If there's an equivalent of the Hippocratic Oath for data journalists, it should be that we will endeavor to never make the truth more obscure.  
 
 

[Space for discussion here.]  
 
 

[Postscript, 10 September: The more I thought about that 823/523 error, the more I worried that there might be other errors that weren't as obvious, so I used Needle to cross-check all the test-scores against the official DOE figures. Two more were wrong. Manchester Essex's SAT Reading score was 559, not 599, which I'm guessing would lower their #6 magazine rank, perhaps considerably. In my rankings it dropped them from 28 to 31. Ashland's SAT Reading score was also wrong, 531 not 537, but this didn't change their rank in my method. Both corrections moved those schools' scores closer to my predictions.]  

[Postscript, 12 September: But charter schools do better relative to expectations, right? Nope.]
[From a note to the public-lod@w3.org mailing list.]  

Data "beauty" might be subjective, and the same data may have different applicability to different tasks, but there are a lot of obvious and straightforward ways of thinking about the quality of a dataset independent of the particular preferences of individual beholders. Here are just some of them:  

1. Accuracy: Are the individual nodes that refer to factual information factually and lexically correct? Like, is Chicago spelled "Chigaco", or does the dataset say its population is 2.7?  

2. Intelligibility: Are there human-readable labels on things, so you can tell what one is when you're looking at it? Is there a model, so you can tell what questions you can ask? If a thing has multiple labels (or a set of owl:sameAs things have multiple labels), do you know which (or if) one is canonical?  

3. Referential Correspondence: If a set of data points represents some set of real-world referents, is there one and only one point per referent? If you have 9,780 data points representing cities, but 5 of them are "Chicago", "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and "Chicagoland", that's bad.  

4. Completeness: Where you have data representing a clear finite set of referents, do you have them all? All the countries, all the states, all the NHL teams, etc? And if you have things related to these sets, are those projections complete? Populations of every country? Addresses of arenas of all the hockey teams?  

5. Boundedness: Where you have data representing a clear finite set of referents, is it unpolluted by other things? E.g., can you get a list of current real countries, not mixed with former states or fictional empires or administrative subdivisions?  

6. Typing: Do you really have properly typed nodes for things, or do you just have literals? The first president of the US was not "George Washington"^^xsd:string, it was a person whose name-renderings include "George Washington". Your ability to ask questions will be constrained or crippled if your data doesn't know the difference.  

7. Modeling Correctness: Is the logical structure of the data properly represented? Graphs are relational databases without the crutch of "rows"; if you screw up the modeling, your queries will produce garbage.  

8. Modeling Granularity: Did you capture enough of the data to actually make use of it? ":us :president :george_washington" isn't exactly wrong, but it's pretty limiting. Model presidencies, with their dates, and you've got much more powerful data.  

9. Connectedness: If you're bringing together datasets that used to be separate, are the join points represented properly? Is the US from your country list the same as (or owl:sameAs) the US from your list of presidencies and the US from your list of world cities and their populations?  

10. Isomorphism: If you're bringing together datasets that used to be separate, are their models reconciled? Does an album contain songs, or does it contain tracks which are publications of recordings of songs, or something else? If each data point answers this question differently, even simple-seeming queries may be intractable.  

11. Currency: Is the data up-to-date? As of when?  

12. Model Uniformity: Are discretionary modeling decisions made the same way throughout the dataset, so that you don't have to ask many permutations of the same question to get different subsets of the answer? Nobody should have to worry whether some presidents and presidencies are asserted in only one direction and some only the other.  

13. Attribution: If your data comes from multiple sources, or in multiple batches, can you tell which came from where? If a source becomes obsolete or discredited or corrupted, can you take its data out again?  

14. History: If your data has been edited, can you tell how and when and by whom? Can you undo errors, both individual (no, those two people aren't the same, after all) and programmatic (those two datasets should have been aligned with different keys)?  

15. Internal Consistency: Do the populations of your counties add up to the populations of your states? Do the substitutes going into your soccer matches balance the substitutes going out? Would you notice if errors were introduced?  

16. Legality: Is the license under which the data can be used clearly defined, ideally in a machine readable way?  

17. Sustainability: Is there some credible basis or evidence for believing the data will be kept available and current? If it's your data, what commitment to its maintenance are you making?  

18. Authority: Is the source of the data a credible authority on the subject? Did you find a list of NY Charter Schools, or the list?  

[Revision of #12 and addition of 16-18 suggested by Dave Reynolds.]
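The granularity point in #8 is easy to make concrete. A toy sketch, in Python rather than any real triple syntax, of the difference between the bare country-to-president assertion and modeled presidencies with dates; all names and structures here are illustrative:

```python
# The bare assertion: true-ish, but it can't answer "president when?"
flat = {"us": {"president": "george_washington"}}

# Modeled presidencies, with dates, as the list above recommends:
presidencies = [
    {"country": "us", "president": "george_washington", "start": 1789, "end": 1797},
    {"country": "us", "president": "john_adams", "start": 1797, "end": 1801},
]

def president_in(country, year):
    """Answer a question the flat version simply cannot express."""
    for p in presidencies:
        if p["country"] == country and p["start"] <= year < p["end"]:
            return p["president"]
    return None
```

The flat form can only ever answer one question; the modeled form answers that one and a whole family of temporal ones besides.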
Stefano Mazzocchi (of Freebase) and David Karger (of MIT) have been holding a slow but interesting conversation about data reconciliation. It's been phrased as a sort of debate between two arbitrarily polarized approaches to the problem of cleaning up data so that you don't have multiple variant references to the same real data-point: either you try to do this cleanup "in the data", so that it's done and you don't have to worry about it any more; or you leave the data alone and figure you'll incorporate the cleanup into your query process when you actually start asking specific questions.  

But I think this is less of a debate than Stefano and (particularly) David are making it seem. Or, rather, that the real polarization is along a slightly different axis. The biggest difference between Stefano's world and David's happens before you even start to worry about when you're going to start worrying about data cleanup: Freebase is attempting to build a single huge structured-data-representation of (eventually) all knowledge; David's work is largely about building lots of little structured-data-representations of individual (relatively) small blobs of knowledge.  

Within a dataset, though, Stefano and David (and I) are in enthusiastic agreement: internal consistency is what makes data meaningful. If your data doesn't know whether "Beatles" and "The Beatles" are one band or two different ones, your answers to any question that touches either one are liable to be so pathetically wrong that people will very quickly stop asking you things. In the classic progression from Data to Information to Knowledge to Wisdom, this is what that first step means: Data is isolated, Information is consolidated. It would be inane to be against this. (Unless, I guess, you were also against Knowledge and Wisdom.)  

It's on the definition of "within", though, that most of the interesting issues under the heading of "the Semantic Web" wobble. David says "I argued in favor of letting individuals make their own [datasets] ... and worry about merging them with other people’s data later." He really does mean individuals, and the MIT-developed tool he has in mind (Exhibit) is designed (and only suited) for fairly small datasets. Freebase is officially a database of everything, but of course within this everything are lots of subsets, and Stefano is really talking more about cleaning up these subsets than the whole universe. Reconciling all the artist references to "Beatles" and "The Beatles" from albums and songs is a straightforward practical task both beneficial and probably necessary for asking questions of a song/album/artist dataset, whether it's embedded in a larger one or not. Reconciling the "The Beatles" who are the artist of the song "Day Tripper" with "The Beatles" who are depicted in cartoon form on a particular vintage lunchbox for sale on eBay, on the other hand, is less conceptually straightforward, and of more obscure practical import.  

The thing I'm working on at ITA falls in between Exhibit and Freebase on this dataset axis, both in capacity and design. We handle datasets much bigger than you could put in Exhibit, and allow you to explore and analyze them in ways that Exhibit cannot; but we separate datasets, partly for scalability but even more importantly to specifically keep out of any unnecessary quagmire of trying to reconcile not just the bits of data but the structures.  

And my own scouting report, from having done a lot of data-reconciliation in a system designed for it, and in other lives of a lot of more-painful data-reconciliation in various systems not really designed for it, is the same as Stefano's report from his experiences at Freebase: getting from data to information, at scale, is hard. It's mainly hard in ways that machines cannot just solve for you, and anybody who thinks RDF is a solution to data-reconciliation is, as Stefano puts it, confusing model mixability with semantic reconciliation, and has probably not noticed because they've only been playing with toy datasets, which is like thinking you're learning Architecture by building a few birdhouses.  

And this is all exactly why I have repeatedly argued for treating the "Semantic Web" technology problem as a database-tech problem first, and a web-tech problem only secondarily. David complains, justifiably, about the pain of combining two small, simple folk-dance datasets he himself cares about. But just as Stefano says in another example, the syntax problems here (e.g. text in a YouTube description box, rather than formally-labeled data records) are fairly trivial compared to the modeling differences. All the URIs and XSL transformations in the world are not going to allow every two datasets to magically operate as one. Some person, with or without tools to magnify their effort, is going to have to rephrase one dataset in the argot of the other.  

And to say another thing I've said before again, the fact that RDF doesn't itself solve these problems is not its terminal design flaw. Its terrible flaw is that it isn't even a sufficient foundation upon which to build the tools that would help solve these problems. It takes the right fundamental idea, that the most basic conceptual structure of data is things and the relationships between them, but then becomes so invertedly obsessed with the theoretical purity of this reduction that it leaves a whole layer of needed practical elaboration unbuilt. We shouldn't need abstruse design patterns to get ordered lists, or rickety reasoning engines just to get relationships that go both directions, or endless syntax officiousness that gets us expensive precision with no gain in accuracy.  

This effort has let us down. And worse, it has constrained smart people so that their earnest efforts cannot help but let us down. After years and years of waiting, we have no Semantic Web of information, we have only Linked Data, where the word "Linked" is so tenuously justified it might as well be replaced with some pink-ink-drinking Seuss pet, and the word "Data" is tragically all too accurate. We have all these computers, living abbreviated lives of quiet depreciation, filled with the data that should become our wisdom, and yearning, if they are allowed to yearn for even one thing, to be able to tell us what they know.
[Fair warning: This is another post about data-modeling and query languages, and it isn't likely to be the last. It may or may not be interesting to people with personal interests in those topics, but I think it's pretty unlikely to be interesting to people without those interests. You have been warned.]  
 

In data-modeling you usually live in fear of the word "usually". Accounting for the subsequent "but sometimes" is usually where a simple, manageable data-model starts its ugly metamorphosis towards tangled and unusable. Same with "mostly", "more often", "occasionally". Most data-modeling contexts are only really ever happy with "always" and "never"; and the real world is usually not so helpfully binary.  

DiscO, my data-model for discographies, notes that Tracks, Releases and Sequences can all have Artists, but that usually Tracks would get their Artist indirectly from their Release, which would in turn get it from its Sequence.  

What this means, in practical terms, is that when we're entering or importing data, we don't necessarily want to have to set Artist individually on every single Sequence, Release and Track. But when we're querying the data, we want to be able to ask about Artist on every one of those.  

Data-modeling people call this "inference", and it's a whole academic subject of its own, deeply entangled with abstract logic and belief-system consistency. The Sequence/Release/Track problem sounds theoretically straightforward at first, but gets very hard very quickly once you realize that some Releases are compilations of Tracks by multiple artists, and some Sequences have Releases by multiple artists. Thus it's not quite true that Sequence "contains" Release, and upon that "not quite" most academic approaches will founder and demand to be released from this vagueness via a rococo expansion of the domain model to separate the notions of Single-Artist and Multiple-Artist Sequences and Releases.  

But "usually" is a good concept. And for lots of real-world data problems it can be handled without flailing into existential abstraction. We can keep our model simple, and fill in the implied data with a very general mechanism: the relationships between "actual" values and "effective" values can themselves be described in queries.  

Releases, we said, may have Artists directly, or may get them implicitly from their Sequence. More specifically, we probably mean something like this:  

- If a Release has an Artist, directly, use that.
- If it doesn't, get all the Sequences in which it occurs.
- Ignore any Sequence that itself has no Artist or multiple Artists.
- If all the remaining (single-Artist) Sequences have the same Artist, use that.
- Otherwise we don't know.  

This can be written out in Thread in pretty much these exact steps:  

Release|Artist=(.Artist;(.Sequence:(.Artist::#=1).Artist::#=1))
 

This isn't a syntax tutorial, but ";" means otherwise, and "::#=1" tests a list to see if it has exactly one entry, so maybe you can sort of see what the query is doing.  
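For comparison, the same five steps can be sketched in ordinary Python. The dict shapes here (a release with an optional "artist" and a list of "sequences", each with an "artists" list) are my illustrative assumptions, not anything Thread or DiscO mandates:

```python
def release_artist(release):
    # 1. If the Release has an Artist directly, use that.
    if release.get("artist"):
        return release["artist"]
    # 2-3. Get its Sequences, ignoring any with no Artist or several.
    candidates = {seq["artists"][0]
                  for seq in release.get("sequences", [])
                  if len(seq.get("artists", [])) == 1}
    # 4-5. If the single-Artist Sequences all agree, use that artist;
    # otherwise we don't know.
    return candidates.pop() if len(candidates) == 1 else None
```

The set comprehension quietly does the "all the same" check: if the surviving Sequences agree, the set has exactly one member.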

A Track, then, follows the same pattern, but has to check one more level:  

- If a Track has an Artist, directly, use that.
- If it doesn't, get all the Releases in which it occurs.
- Ignore any Release that itself has no Artist or multiple Artists.
- If all the remaining (single-Artist) Releases have the same Artist, use that.
- Otherwise, get all the Sequences for all the Releases on which the track appears.
- Ignore any Sequence that itself has no Artist or multiple Artists.
- If all the remaining (single-Artist) Sequences have the same Artist, use that.
- Otherwise we don't know.  

Or, in Thread:  

Track|Artist=(
.Artist;
.(.Release:(.Artist::#=1).Artist::#=1);
.(.Release.Sequence:(.Artist::#=1).Artist::#=1))
 

In this system, once you've learned the query-language (and I'm not saying this is a trivial task, but it's not that hard), you can do most anything. The language with which you ask questions is the same language you use for stipulating answers.  
 

Query-language-geek postscript: And, of course, it's the same language you use for obsessively fiddling with your questions and answers because you just can't help it. That Track query has some redundancy, and while redundancy isn't always bad, it's almost always fun to see if you can get rid of it. In this case we're asking the same question ("What's your artist?") in all three steps of the query. We can rearrange it so that we get all the Tracks and Releases and Sequences first, then ask all the Artist questions at the end:  

Track|Artist=(...__,Release,Sequence:(.Artist::#=1).Artist::#=1)
 

"...__," may initially look a little like ants playing Limbo, but "...__,Release,Sequence" means "get these things, their Releases and their Sequences, and then add those things' Releases and Sequences, etc. until you get to the end. So this version of the query builds up the complete list of all the places from which this Track could derive its artist, keeps only the ones that have a single artist, and then sees if we're left with a single artist at the end of the whole process. Depending on what we want to do if our data has internal contradictions, this might actually be functionally better than the first version, not just shorter to type.  

But DiscO also has all that stuff about Original Versions, and it would be nice if our Artist inference used that information, too. If Past Masters [Remastered] is an alternate version of Past Masters, Vol. 1 & 2, and Past Masters, Vol. 1 & 2 belongs to the Beatles' Compilations sequence, then we should be able to deduce that the version of "We Can Work It Out" on Past Masters [Remastered] is by the Beatles. Our experimental elimination of redundancy now pays off a little bit, because we only have to add in "Original Version" in one place:  

Track|Artist=(...__,Release,Sequence,Original Version:(.Artist::#=1).Artist::#=1)
 

And interestingly, since both Tracks and Releases can have Original Versions, this whole thing actually works for either type, and thus we can combine the two and have even fewer things to worry about:  

Track,Release|Artist=(...__,Release,Sequence,Original Version:(.Artist::#=1).Artist::#=1)
 

Having fewer things to worry about is (usually) good.
What Beatles album is "Day Tripper" on?  

This is partially a trick question, of course, as "Day Tripper" was originally a non-album single, but it has been on several Beatles compilations over the years, including the red 1962-1966 best-of, and in the remastered 2009 catalog it lands on both the mono and stereo versions of Past Masters.  

Starting from scratch, it would take you, as a person, a little while to figure this answer out on the web. There's a Wikipedia page for the song, which is the top search hit for the above question in both Google and Bing, and it explains the non-album-ness and mentions several compilations, but it doesn't (at least as of this moment) clarify current availability, and none of the pages for the non-current compilations refer you explicitly (again, as of this moment) to the right place.  

There are a few paths that lead you to a voluminous page for the Beatles discography, on which "Day Tripper" is again mentioned several times, but this page doesn't itemize the track-listings of compilations, and the "Day Tripper" links just go back to the original song-page. But eventually you might blunder into the separate pages for the Mono and Stereo box sets, and from there you might wander over to Amazon.  

Here even more potential confusion awaits you. The top Google hit for "amazon past masters", at the moment, is the now-obsolete second volume of the old CD edition. Searching on "past masters" on Amazon itself gets you the Remastered edition as the top hit, but idiotically suggests buying it together with the old Volume 2, and you would have to scrutinize the track lists to realize that Past Masters [Remastered] subsumes Past Masters, Vol. 1 and Past Masters, Vol. 2.  

In fact, if you backtrack and try searching for "Day Tripper" on Amazon, the top hit is Rubber Soul, the album on which "Day Tripper" would have appeared, chronologically, but doesn't. And, for good measure, that top hit is the obsolete 1990 CD, not the new remaster.  

Bleh.  
 

But at least you're a person, and via a judicious combination of intuition and stubbornness and asking some friends who might know, you can eventually solve these information problems. If you were a computer, you'd be fucked.  

Which is OK on some existential level, because if you were a computer you probably wouldn't appreciate the song, anyway. But the point of computers is to help people do things, and a computer ought to be a particularly helpful tool when the thing you need to do is sort through some data.  

But for the computer to help you puzzle through this data, the data has to be modeled usefully by people first. There are several prominent sources of meticulously structured data about music, so this should be easy. But here, sadly, people have let us down again. And again, and again. Let's see how.  
 

All Music Guide  

A text-search for "Day Tripper" (there's no other query interface) returns a full page of cryptic results. There's an "Occurrences" column, and although it's not clear exactly what that means, it's obvious that more is supposed to be better, and the first listing has 360 where none of the rest have more than 13, so presumably that's the "right" one.  

Clicking this gets you 8 pages of results, which is annoying in itself (the splitting of them into pages, I mean). They're sorted by Artist, which sounds reasonable enough, except that the ones without artists are sorted first, and thus the first page of results is almost totally crap. There are lots of Beatles releases listed, but they get split between pages 1 and 2 of the results list, making it impossible to look at them all at once.  

But if you drill into one of them at random, and then click on "Day Tripper" in the track listing, you do finally get to a page that lists all (or several, anyway) Beatles releases on which this song appears. There are 24, though, including such things as "Five Nights in a Judo Arena", which human intuition might guess is not a normative release, but a computer would have no basis for dismissing. These releases are in date-order, at least, but this turns out to be worse than worthless for our current question, because All Music has modeled Past Masters [Remastered] not as an album, but as an alternate manifestation of the album Past Masters, Vol. 1 & 2, which means it appears way up in the middle of this list, labeled 1988, because that was the year of its earliest issue (on cassette!).  

Looking through the data, in fact, we see that although All Music has lots of individual detail on most kinds of things, it has essentially nothing that models the relationships of things to each other, or in groups. There is no modeled connection between Past Masters, Vol. 1 & 2, Past Masters, Vol. 2, The Beatles: Mono Box Set and The Beatles: Stereo Box Set, even though v1/2 subsumes v2, and both boxes subsume both.  

And there's no modeling of "in print", or any notion of representing the subset of albums that represent the current core catalog. So a computer can't use this data to answer real questions by itself. Source fail.  
 

MusicBrainz  

This is a database first, not a guide, and thus a more likely candidate for well-structured data anyway, and one where I won't pick at their explicitly-secondary browsing UI.  

The good news is that MusicBrainz has the kind of data we need. They have some relationships between tracks, like one being a mashup of some others, so presumably they could add one to express that the Mono Masters of "Day Tripper" is a different version from the Past Masters [Remastered] one, but the same underlying song. They already have a reconciliation mechanism by which they can say that the "Day Tripper" on 1962-1966 is "the same" as the one on Past Masters, although at the moment the reconciliation data looks too noisy for real use.  

They even have the notion of one release being part of a set, although I didn't find very many examples of sets, and in particular I can't tell if a release can be part of more than one set. But if they can, that might be a mechanism for expressing official catalogs, current availability, and various other kinds of groupings and subsets.  

So current source fail, but at least there's hope here.  
 

Freebase  

Freebase is easily the most sophisticated public attempt at universal data-modeling, at the moment, but this is a caveat as well as a compliment. Freebase models attempt to represent everything that could possibly exist, and thus tend to drift quickly from usable simplicity towards abstractly-correct awkwardness, usually coming to rest far into the latter.  

So if you search on "Day Tripper", you will find that there are results of that name as "Topic", "Composition", "Musical Album" and "Musical Track", with at least dozens of the latter. Freebase fails the usability test even more spectacularly than All Music, as the list of Musical Tracks is presented with no grouping or distinguishing information at all, just "Day Tripper (Musical Track)" after "Day Tripper (Musical Track)", and you have to click on each one to get any clarifying info. "Day Tripper" the composition does not link to any of the tracks, and "Day Tripper" the album turns out to be a compilation of Beatles covers which does not, at least as far as the listed information shows, even contain the song "Day Tripper".  

And if you delve into the internals of the Freebase music schema, you can quickly develop a guess about why the data has not all been filled in: there's too damn much structure. A music/track is a recording of a music/composition. The track can appear on multiple music/releases, each of which is a publication of a particular music/album. Unless you need to model who was in the band during the making of an album, in which case the album links instead to a set of music/recording_contributions, each of which is a combination of albums, contributors and roles.  

Oh, except compositions can be recorded "as albums", in which case they link to music/album without going through music/track, and tracks can appear directly on releases without going through music/album. And there's no current property for saying that a given track is an alternate version of another, but from following Freebase modeling discussions I can confidently guess that they'd model that by saying that a music/track links to something like a music/track_derivation, which itself is a combination of original track, derivative track, deriving artist (or music/track_derivation_contribution) and derivation type. And Freebase's query-language doesn't provide any recursion, so if these relationships chain, good luck querying them.  
 

Music Ontology  

This isn't a database, just an attempt at a model for one. And, grimly, it's another quantum level more elaborately and unusably correct than the Freebase model. Even "MO Basics" (and the "Overview" has 22 more tables of explication beyond these "Basics", without getting into the "details") includes conceptual distinctions between Composition, Arrangement, Recording, Musical Work, Musical Item, MusicalExpression and MusicalManifestation. And then there are pages upon pages of minutely itemized trivia like beginsAtDuration, djmix_of, paid_download (and "paiddownload", which is different in some way I couldn't figure out), AnalogSignal, isFactorOf... This list is bad because it's too long, but the fact that it's in the schema means that it's also bad because no matter how long it is, it will never include every nuance you ever find yourself wanting, and thus over time it will only accumulate more debris.  

A tour-de-force into a cul-de-sac.  
 

The Rest of the Web  

Searching on any particular band or bit of music will unearth dozens or hundreds of other sites that contain bits of the information we need: stores, discographies, databases, forums, fan pages, official sites. Almost universally, these are either unqueriable flat HTML pages, or tree-structured databases with even less interlinking than the above sites. Encyclopaedia Metallum, my favorite metal site, has full track listings for a genuinely mind-boggling number of releases by an astonishing number of bands, but the tracks themselves are not data-objects and a machine can find out nothing about them. There are several lovingly hand-crafted Beatles discographies on the web, all far too detailed for our original casual query, and all essentially useless to a computer attempting to help us.  

So: Ugh. Triple ugh because a) the population of people willing to put time and energy into filling out music-related data-forms is obviously huge, b) the modeling problems are not intractably complicated in any theoretical sense, c) MusicBrainz and Freebase, at least (and the system I'm designing at work, I think), seem to be technically sufficient to represent the data correctly. If only we had a better plan.  
 

DiscO  

So here's my attempt at a better plan. I call it DiscO, for Discographic Ontology; that is, it's a scheme for structuring discographies. It is not an attempt at an abstract physics of human air-vibrating creativity, it is just an outline of a way to write down the information about bands, the music they've made, and how that music was released. It's intended to be simple enough that you can imagine people actually filling in the data, but expressive enough that the data can support interesting queries. And it's specifically intended to model nuance abstractly, so that it can accommodate some kinds of new needs without perpetually having to itself expand.  
 

There are four basic types:  

Artist - The Beatles, Big Country, Megadeth, Frank Zappa, whatever...  

Release - an individual album, single, compilation, whatever; Rubber Soul, Past Masters [Remastered], "Day Tripper"/"We Can Work It Out"...  

Track - an individual version of a song; "Day Tripper", "Day Tripper [mono]", "Day Tripper (performed live on pan flute and triangle by Zamfir and Elmo)", etc.  

Sequence - any collection of releases; Original Albums, Japanese Cassette Singles, 2009 Remasters, etc.  
 

These are related to each other like this:  

Artists mostly have Sequences. Sequences can be anything, but many artists would have some standard ones: Original Albums, Singles, Compilations, Remastered Albums, Current Catalog.  

Sequences have Releases (and Artists).  

Releases have Dates, Labels and Tracks. A Release may have an Artist directly, but more often would have one indirectly via a Sequence.  

Releases may be related to each other via Alternate Version/Original Version links. Thus Past Masters, Vol. 1 & 2 and Past Masters [Remastered] are both Releases, but Past Masters [Remastered] has an Original Version link to Past Masters, Vol. 1 & 2, and Past Masters, Vol. 1 & 2 has an Alternate Version link to Past Masters [Remastered].  

Tracks have Durations. A Track may have an Artist directly (so individual tracks on multi-artist compilations can be attributed correctly), but more often would have one indirectly via Release (which itself may have one indirectly via Sequence).  

Tracks may also be related to each other via Alternate Version/Original Version links. "Day Tripper" and "Day Tripper [mono]" are both Tracks, but "Day Tripper" has an Alternate Version link to "Day Tripper [mono]", and "Day Tripper [mono]" has an Original Version link to "Day Tripper". (We can get into geek arguments about which versions are the same and which are derivations (of which!), if we want, but whatever we decide, we can model.)  

Restated in schema-ish form, that's:  

Artist
- Sequence
- Release
- Track  

Sequence
- Artist
- Release  

Release
- Sequence
- Artist
- Date
- Label
- Track
- Original Version
- Alternate Version  

Track
- Artist
- Duration
- Original Version
- Alternate Version  

I think that's basically enough. What it gives up in expressiveness, it gains in usability. Our Beatles data can now, I think, be modeled both tractably and informatively. We can hook up all the versions of albums and versions of songs. We can create whatever sequences we need, and since the sequences themselves are just data, it's fine to have "Canadian Singles" for the Beatles and "Fanzine Flexis" for The Bedsitters without implying that either band should also have the other.  
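If it helps to see the schema in a more conventional notation, here is one possible rendering of the four types as Python dataclasses. This is purely illustrative; the field types (like duration as seconds) are my guesses, not part of DiscO:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Artist:
    name: str

@dataclass
class Track:
    title: str
    duration: Optional[int] = None       # e.g. seconds; my assumption
    artist: Optional[Artist] = None      # usually inferred via Release
    original_versions: list["Track"] = field(default_factory=list)
    alternate_versions: list["Track"] = field(default_factory=list)

@dataclass
class Release:
    title: str
    date: Optional[str] = None
    label: Optional[str] = None
    artist: Optional[Artist] = None      # usually inferred via Sequence
    tracks: list[Track] = field(default_factory=list)
    original_versions: list["Release"] = field(default_factory=list)
    alternate_versions: list["Release"] = field(default_factory=list)

@dataclass
class Sequence:
    name: str
    artist: Optional[Artist] = None
    releases: list[Release] = field(default_factory=list)
```

The point of the small surface area is visible even here: four classes, no Single-Artist/Multiple-Artist split, and the "usually" relationships are just optional fields that inference fills in at query time.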

And using Thread, the query-language I will (before long, hopefully) be attempting to spread through the universe, we can start to ask our questions in a way the computer can answer:  

Track:=Day Tripper.Release
 

This is our naive query. It gets all the releases that have any track called exactly "Day Tripper". Good for assuring us there's some data in the bucket, but not much help in answering our question.  

Track:=Day Tripper.Release:(.Artist:=The Beatles)
 

That limits our results to albums by the Beatles, but there are still too many. With our fully-interlinked data-model, though, we can now actually ask something that is much closer to what we mean:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track:=Day Tripper)
 

That is, find the artist The Beatles, get their Current Catalog sequence, get that sequence's releases, and filter those releases down to the ones that contain a track called exactly "Day Tripper". This is progress.  

But "called exactly 'Day Tripper'" will exclude "Day Tripper [mono]", which isn't what we want to do. We're trying to ask a musical question about a song, not a typographical question about a title. But this, too, we have the powers to cope with:  

Track|Day Trippers=(...__,Original Version:=Day Tripper)  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)|
DT Versions=(.Track:Day Trippers=?),
Other Tracks=(.Track:Day Trippers=_._Count)
 

This time we first define a new inferred relationship on Track called "Day Trippers", which gets the Track, all its Original Versions, all their Original Versions (recursively), and then filters this set of tracks down to just the ones called "Day Tripper".  

Then we get the Beatles' current catalog releases again, but this time instead of checking each release for a track named "Day Tripper", we use our Day Trippers relationship to check for a track that is, or is derived from, "Day Tripper". And then, for each of the releases that have one, we infer two new relationships: "DT Versions" tells us which track(s) on this release are versions of "Day Tripper", and "Other Tracks" counts the tracks on this release that are not derivations of "Day Tripper".  

I.e.:  

#   Release                     DT Versions          Other Tracks
1   Past Masters [Remastered]   Day Tripper          32
2   Mono Box Set                Day Tripper [mono]   212
3   Stereo Box Set              Day Tripper          238
 

So now we know our choices. It took us so long to find out, but we found out.  
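For the Thread-averse, the "Day Trippers" inference reduces to a couple of small recursive functions. This Python sketch uses invented dict shapes, but the logic is the same: follow Original Version links until you reach (or don't) a track titled exactly "Day Tripper", then tally per release:

```python
def derives_from(track, title):
    # A track "is" the song if it has that exact title, or if any chain
    # of Original Version links eventually reaches a track that does.
    if track["title"] == title:
        return True
    return any(derives_from(orig, title)
               for orig in track.get("original_versions", []))

def report(releases, title="Day Tripper"):
    # For each release containing a version of the song, list the
    # matching track titles ("DT Versions") and count the rest.
    rows = []
    for rel in releases:
        versions = [t["title"] for t in rel["tracks"]
                    if derives_from(t, title)]
        if versions:
            rows.append((rel["title"], versions,
                         len(rel["tracks"]) - len(versions)))
    return rows
```

Fed the three current-catalog releases, this produces the same three rows as the Thread query above.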
 
 
 

Tantalizing Postscript: But now that we have these three options before us, how do their contents overlap or differ, track by track?! We could bring up three different windows and squint at them. Or we could ask the computer:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)
/(.Track...__,Original Version:##1)/=nodes
 

Aaah. I see now. (How nice it will be when I'm allowed to show you...)
In his post explaining his departure from Google, Douglas Bowman says "Yes, it's true that a team at Google couldn't decide between two blues, so they're testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can't operate in an environment like that."  

He's not really trying to have an opinion-changing last word in an argument against data-driven product decisions, but if he were, this is not how to do it. If you believe that the proper width of a border can be tested, then Bowman's refusal to subject his intuitions to quantitative confirmation just sounds like petulant prima-donna nonsense. If you can test 41 shades of blue, this line of reasoning goes, you don't need to guess, so a guessing specialist is an annoying waste of everybody's time.  

The great advantage of testing and data, of course, is that you get precise, decisive answers you can act on. Shade 31, with 3.7%, trouncing runner-up shade 14 with only 3.4%! Apply shade 31, declare progress.  

But the great disadvantage of testing and data is that you get precise, decisive answers you can and will act on, but you almost never know what question you really asked. Sure, the people who saw shade 31 did some measurable thing at some measurable rate. But why? Is it shade 31? Or is it the contrast between shade 31 and the old shade? Or is it the interplay between shade 31 and some other thing you aren't thinking about, or possibly don't even control? Are you going to run your test again in a month to see if the results have changed? Did you cross-correlate them with HTTP_REFERER and index the colors on the pages people came from? What about all the combinations of these 41 shades and 41 backgrounds and 8 fonts and 3 border widths (12 if you vary each side of the box separately!) and 41 line-heights and 19 title-bar wordings and the color of the tie Jon Stewart was wearing the night before? Which things matter? How do you know?  
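To make the scale of that combinatorial objection concrete, here's the back-of-envelope arithmetic in Python (Jon Stewart's tie omitted):

```python
# Number of cells in the full factorial design the paragraph describes,
# using per-side border widths.
from math import prod

variants = {
    "blue shades": 41,
    "backgrounds": 41,
    "fonts": 8,
    "border widths, per side": 12,
    "line-heights": 41,
    "title-bar wordings": 19,
}
print(f"{prod(variants.values()):,}")    # 125,711,904 combinations
```

Over a hundred million cells, before adding a single new variable or rerunning anything next month.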

You don't. And if you need to add some new element, tomorrow, you don't know which of the tests you've already run it invalidates. Are you going to rerun all of them, permuted 41 more new ways?  

No. You are going to sheepishly post a job opening for a new guessing specialist. Bowman already had his last word. It was "Goodbye".
Site contents published by glenn mcdonald under a Creative Commons BY/NC/ND License except where otherwise noted.