16 April 2016 to 27 March 2009 · tagged essay/tech
[This is the script from a talk I delivered at the EMP Pop Conference today. It was written to be read aloud at an intentionally headlong pace, with somewhat-carefully timed blasts of interstitial music. I've included playable clip-links for the songs here, but the clips are usually from the middles of the songs, and I was playing the beginnings of them in the talk, so it's different. The whole playlist is here, although playing it as a standalone thing would make no sense at all.]  

 

I used to take software jobs to be able to buy records, but buying records is now a way to hear all the world's music like collecting cars is a way to see more of the solar system.  

So now I work at Spotify as a zookeeper for playlist-making robots. Recommendation robots have existed for a while now, but people have mostly used them for shopping. Go find me things I might want to buy. "You bought a snorkel, maybe you'd like to buy these other snorkels?"  

But what streaming music makes possible, which online music stores did not, is actual programmed music experiences. Instead of trying to sell you more snorkels, these robots can take you out to swim around with the funny-looking fish.  

And as robots begin to craft your actual listening experience, it is reasonable, and maybe even morally imperative, to ask if a playlist robot can have an authorial voice, and, if so, what it is?  

The answer is: No. Robots have no taste, no agenda, no soul, no self. Moreover, there is no robot. I talk about robots because it's funny and gives you something you can picture, but that's not how anything really happens.  

How everything really happens is this: people listen to songs. Different people listen to different songs, and we count which ones, and then try to use computers to do math to find patterns in these numbers. That's what my job actually involves. I go to work, I sit down at my desk (except I actually stand at my fancy Spotify standing desk, because I heard that sitting will kill you and if you die you miss a lot of new releases), and I type computer programs that count the actions of human listeners and do math and produce lists of songs.  

So when anybody talks about a fight between machines and humans in music recommendation, you should know that those people do not know what the fuck they are talking about. Music recommendations are machines "versus" humans in the same way that omelets are spatulas "versus" eggs.  

So the good news is that you can stop worrying that robots are trying to poison your listening. But the bad news is that you can start worrying about food safety and whether the people operating your spatulas have the faintest idea what food is supposed to taste like.  

Because data makes some amazing things possible, but it also makes terrible, incoherent, counter-productive things possible. And I'm going to tell you about some of them.  

Counting is the most basic kind of math, and yet even just counting things usefully, in music streaming, is harder than you probably think. For example, this is the most streamed track by the most streamed artist on Spotify:  

Various Artists "Kelly Clarkson on Annie Lennox"  

Do you recognize the band? They are called "Various Artists", and that is their song "Kelly Clarkson on Annie Lennox", from their album Women in Music - 2015 Stories.  

But OK, that's obviously not what we meant. We just need to exclude short commentary tracks, and then this is the most streamed music track by the most streamed artist on Spotify:  

Various Artists "El Preso"  

Except that's "Various Artists" again. The most streamed music track by an actual artist on Spotify is:  

Rihanna "Work"  

OK, so that's starting to make some sense. Pretty much all exercises in programmatic music discovery begin with this: can you "discover" Rihanna?  

Spotify just launched in Indonesia, and I happen to know that Indonesian music is awesome, because there are people there and they make music, so let's find out what the most popular Indonesian song is.  

Justin Bieber "Love Yourself"  

I kind of wanted to know what the most popular Indonesian song is, not just the song that is most popular in Indonesia. But if I restrict my query to artists whose country of origin is Indonesia, I get this:  

Isyana Sarasvati "Kau Adalah"  

Which seems like it might be the Indonesian Lisa Loeb. It's by Isyana Sarasvati, and I looked her up, and she is Indonesian! She's 23, and her Wikipedia page discusses the scholarship she got from the government of Singapore to study music at an academy there, and lists her solo recitals.  

It turns out that our data about where artists are from is decent where we have it, but a lot of times we just don't. 34 of the top 100 songs in Indonesia are by artists for whom we don't have locations.  

But remember math? Math is cool. In addition to counting listeners in Indonesia, we can compare the listening in Indonesia to the listening in the rest of the world, and find the songs that are most distinctively popular in Indonesia. That gets us to this:  

TheOvertunes "Cinta Adalah"  

That is TheOvertunes, who turn out to be a band of three Indonesian brothers who became famous when one of them won X Factor Indonesia in 2013.  
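A minimal sketch of that "distinctively popular" comparison, on invented data (this is not how Spotify actually computes it; the function, field names, and smoothing are all assumptions):

```python
# Toy "distinctive popularity": compare each song's share of streams
# inside one country to its share in the rest of the world.
# All names and thresholds here are invented for illustration.

def distinctive_popularity(country_streams, world_streams, min_streams=100):
    """Rank songs by how much more popular they are in one country
    than globally, using a simple smoothed share ratio."""
    country_total = sum(country_streams.values())
    world_total = sum(world_streams.values())
    scores = {}
    for song, plays in country_streams.items():
        if plays < min_streams:
            continue  # skip rarely played tracks; their ratios are noise
        local_share = plays / country_total
        # +1 smoothing so songs unseen elsewhere don't divide by zero
        global_share = (world_streams.get(song, 0) + 1) / world_total
        scores[song] = local_share / global_share
    return sorted(scores, key=scores.get, reverse=True)
```

A globally huge song ends up with a ratio near 1 and sinks; a song streamed almost exclusively in the country floats to the top.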

But that's still not really what I wanted. It's like being curious about Indonesian food and buying a bag of Indonesian supermarket-brand potato chips.  

I kind of wanted to hear some, I dunno, Indonesian Indie music. I assume they have some, because they have people, and they have X Factor, and that's bound to piss some people off enough to start their own bands.  

So if we switch from just counting to doing a bit more data analysis -- actually, quite a lot of data analysis -- we can discover that yes, there is an indie scene in Indonesia, and we can computationally model which bands are more or less a part of it, and without ever setting foot in Indonesia, we can produce an algorithmic introduction to The Sound of Indonesian Indie, and it begins with this:  

Sheila on 7 "Dan..."  

That is Sheila on 7 doing "Dan...", and I looked them up, too. Rolling Stone Indonesia said that their debut album was one of the 150 Greatest Indonesian Albums of All Time, and they are the first band to sell more than 1m copies of each of their first 3 albums in Indonesia alone.  

Of course, they're also on Sony Music Indonesia, and I assume that at least some of those millions of people who bought their first 3 albums, before Spotify launched in Indonesia and destroyed the album-sales market, are still alive and still remember them. One of the hard parts about running a global music service from your headquarters in Stockholm and your music-intelligence outpost in Boston is that you need to be able to find Indonesian music that people who already know about Indonesian music don't already know about.  

But once you've modeled the locally-unsurprising canonical core of Indonesian Indie music, you can use that to find people who spend unusually large blocks of their listening time listening to canonical Indonesian Indie music (most of whom are in Indonesia, but they don't have to be; some of them might be off at a music academy in Singapore, where Spotify has been available since 2013), and then you can calculate what music is most distinctively popular among serious Indonesian Indie fans, even if you have no data to tell you where it comes from. And that gets us things like this:  

Sisitipsi "Alkohol"  

That is "Alkohol" by Sisitipsi. A Google search for that song finds only 8400 results, which appear to all be in Indonesian. Their band home page is a wordpress.com site, and they had 263 global Spotify listeners last month.  

PILOTZ "Memang Aku"  

PILOTZ, with a Z. Also from Indonesia! 117 listeners.  

Hellcrust "Janji Api"  

Hellcrust. 44 listeners last month. I looked them up, and yes, they're from Jakarta.  

199x "Goodest Riddance"  

199x. 14 monthly listeners! Also, maybe actually from Malaysia, not Indonesia, but in music recommending it's almost as impressive if you can be a little bit wrong as it is if you can be right, because usually when you're wrong you'll get Polish folk-techno or metalcore with Harry Potter fanfic lyrics.  

So that's what a lot of my days are like. Pose a question, write some code, find some songs, and then try to figure out whether those songs are even vaguely answering the question or not.  

And if the question is about Indonesia, that method kind of works.  
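The fan-based step described a few paragraphs back — find people who spend an unusual share of their listening on the canonical core, then rank what else those fans play — can be sketched in miniature like this. None of this is Spotify's actual code; the function names, thresholds, and data shapes are all invented:

```python
# Toy fan-based discovery: (1) find listeners who devote a large share
# of their plays to a "core" artist set, (2) rank other artists by how
# disproportionately those fans play them. Everything here is invented.
from collections import Counter

def core_fans(user_plays, core_artists, min_share=0.25):
    """Users whose fraction of plays going to core artists is at least min_share."""
    fans = []
    for user, plays in user_plays.items():
        total = sum(plays.values())
        core = sum(n for artist, n in plays.items() if artist in core_artists)
        if total and core / total >= min_share:
            fans.append(user)
    return fans

def distinctive_among_fans(user_plays, fans, core_artists):
    """Artists outside the core, ranked by fan-share vs overall-share."""
    fan_counts, all_counts = Counter(), Counter()
    for user, plays in user_plays.items():
        for artist, n in plays.items():
            all_counts[artist] += n
            if user in fans:
                fan_counts[artist] += n
    fan_total = sum(fan_counts.values()) or 1
    all_total = sum(all_counts.values())
    return sorted((a for a in fan_counts if a not in core_artists),
                  key=lambda a: (fan_counts[a] / fan_total) /
                                (all_counts[a] / all_total),
                  reverse=True)
```

The point of excluding the core at the end is exactly the point in the talk: the fans are defined by the music everybody already knows, and the output is the music they know that you don't.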

But we also have 100 million listeners on Spotify, and we would like to be able to produce personalized listening experiences for each of them. Actually, we'd like to be able to produce multiple listening experiences for each of them. And we can't hire all of our listeners to work for us full-time curating their own individual personal music experiences, because apparently the business model doesn't work? So it's computers or nothing.  

People, it turns out, are somewhat harder than countries.  

For starters, here is the track I have played the most on Spotify:  

Jewel "Twinkle, Twinkle Little Star"  

As humans, we might guess that it is not quite accurate to say that that is my favorite song, and we might have a very specific theory about why that is. As humans, we might guess that the number of times I have played the song after that has a different meaning:  

CHVRCHES "Leave a Trace"  

In the latter case, I love CHVRCHES so much. But in the former case, I love my daughter even more than I love CHVRCHES, and some nights she really needs to hear Jewel sing "Twinkle Twinkle Little Star" at bedtime.  

And if we are still in the early days of algorithmically programmed listening experiences, at all, then we're in what I hope we will look back on as the early- to mid- prehistory of algorithmic personalized listening experiences. I can't tell you exactly how they work, because we're still trying to make them work. But I can tell you 7 things I've learned that I think are principles to guide us towards a future in which dumbfoundingly amazing music you could never find on your own just flows out of the invisible sea of information directly into your ears. When you want it to, I mean; I don't mean you can't shut it off.  

1. No music listener is ever only one thing.  

I mean, you can't assume they are. I have a coworker named Matt who basically only listens to skate-punk music, ever, and we test all personalization things on him first, because you can tell immediately if it's wrong. Right: Warzone "Rebels Til We Die". Wrong: The Damned "Wounded Wolf - Original Mix". But other than him, almost everybody turns out to have some non-obvious combination of tastes. I listen to beepy electronica (Red Cell "Vial of Dreams") and gentle soothing Dark Funeral "Where Shadows Forever Reign" and Kangding Ray "Ardent", and sentimental Southern European arena pop (Gianluca Corrao "Amanti d'estate"), and if you just average that all together it turns out you mostly end up with mopey indie music that I don't like at all: Wyvern Lingo "Beast at the Door"  

2. All information is partial.  

We know what you play on Spotify, but we don't know what you listen to on the radio in the car, or what your spouse plays in your house while you're making dinner, or what you loved as a kid or even what you played incessantly on Rdio before it went bankrupt. For example, this is one of my favorite new albums this year: Magnum "Sacred Blood 'Divine' Lies". I adore Magnum, but I hadn't played them on Spotify at all. But my robot knew they were similar to other things it knew I liked. Sometimes music "discovery" is not about discovering things that you don't know, it's about the computer inferring aspects of your taste that you had previously hidden from it.  

3. Variety is good.  

It is in the interest of listeners and Spotify and music makers if people listen to more and more varied music. If all anybody wanted to hear was this once a day -- Adele "Hello" -- there would be no music business and no streaming and no joy or sunlight. Part of my job is to crack open the shell of the sky. Terabrite "Hello". If you are excited to hear what happens next, you will be more likely to pay us $10, and we will pay the artists more for the music you play, and they will make more of it instead of getting terrible day-jobs working for inbound marketing companies, and the world will be a better place.  

4. People like discovering new music.  

They may hate the song you want them to love. They may have a limited tolerance for doing work to discover music, or for trial-and-erroring through lots of music they don't like in order to find it, but neither of those things means that they wouldn't be thrilled by the right new song if somebody could find it for them. One of you will come up after this to ask me what this song is: Sweden "Stocholm". One of you, probably a different person, will wonder about this: Draper/Prides "Break Over You". I have like a million of those. I mean actually like an actual million of those.  

5. Bernie Sanders is right.  

It is in the interest of the world of music creators if the streaming music business exerts a bit of democratic-socialist pressure against income inequality. The incremental human value of another person listening to "Shake It Off" again is arguably positive, but it's probably also considerably smaller than the value of that person listening to a new song by a new songwriter who doesn't already have enough money to live out the rest of their life inside a Manhattan loft whose walls are covered with thumbdrives full of bitcoins and #1-fan selfies. Anthem Lights "Shake It Off". Taylor, if you're listening, I'm going to keep playing shitty covers of your songs until you put the real ones back on Spotify. That's how it works.  

6. If you're going to try to play people what they actually like, you have to be prepared for whatever that is.  

DJ Loppetiss "Janteloven 2016"  

That's "Russelåter", which is a crazy Norwegian thing where high school kids finish their exams way before the end of the senior year, so in the spring they get together in little gangs, give themselves goofy gang names, purchase actual tour buses from the previous year's gangs, have them repainted with their gang logo, commission terrible crap-EDM gang theme songs from Norwegian producers for whom this is the most profitable local music market, and then spend weeks driving around the suburbs of Oslo in these buses, drinking and never changing their clothes and blasting their appalling theme songs. I did not make this up.  

7. Recommendation incurs responsibility.  

If people are going to give up minutes of their finite lives to listen to something they would otherwise never have been burdened with, it better have the potential, however vague or elusive, to change their life. You can't, however tantalizing the prospect might seem, just play something because you want to. (Aedliga "Futility Has Its Limits") Like I said, you definitely can't do that. If you do that, the robots win.  

Thank you.
Through a roundabout series of connections, I got invited to be part of a roundtable panel at EMP Pop 2015, which ended up (in keeping with this year's themes of Music, Weirdness and Transgression) being a group deliberation on the subject of The Worst Song in the World.  

And since I was going to be there, and conference rules allowed for solo proposals in addition to the group thing, I figured I might as well also try something fun and weird and outside of my usual current data-alchemical domain.  

In the end the thing ended up being not quite free of data-alchemy in the same way that my songs without drums always somehow develop drum tracks. But it's not about data alchemy. At least mostly not.  

All the talks are supposed to eventually be available in audio form, but in the meantime, here is the script I was more or less working from. To reproduce the auditorium experience you should blast at least the first 20 seconds or so of each song as you encounter it in the text, and imagine me intoning the names of the songs in monster-truck-rally announcer-voice, and then saying everything else really fast and excitedly because a) you only get 20 minutes, and b) it was 9:20am on the Sunday morning after the Saturday night conference party and some people might need a little help relocating their attentiveness.  

(Also, be forewarned that neither the talk nor the music discussed is intended for underage audiences or people who are insecure about religion or genuinely frightened by grown men growling like monsters.)  
 

The Satan:Noise Ratio
or
Triangulations of the Abyss  

I grew up in what I wouldn't call a religious community, exactly, but certainly one that was dominated by the assumption of Christianity. My social status was kind of established when I told two members of the football team that the universe was formed out of dust, not Godliness, and it really didn't make any difference whether you liked that idea or not. This was second grade. We had a football team in second grade.  

By the time I discovered heavy metal, I was pretty ready for some kind of comprehensive alternative. Science fiction, existentialism, atheism, algebra, Black Sabbath. These all seemed to frighten people, which suggested they were good and powerful ingredients. But if you're going to fight against football in Texas, you have to have your shit organized. You need a program.  

Obviously as an atheist I wasn't going to believe in Satan any more than I was going to believe in elves, but the idea of Satanism seemed potentially compelling anyway. Like Scientology, but with roots, and better iconography, and fewer videotapes to buy. And I had learned a lot from reading the liner notes to Rush albums, so I dug into Black Sabbath albums with the same enthusiasm.  

Black Sabbath "After Forever"  

[You have to remember that at the time, that was really heavy. But the words go like this:]  

I think it was true it was people like you that crucified Christ
I think it is sad the opinion you had was the only one voiced
Will you be so sure when your day is near, say you don't believe?
You had the chance but you turned it down, now you can't retrieve  

Puzzling. But then, as if realizing they were missing something, they got a new singer whose name was Dio, and made an album called Heaven & Hell.  

Black Sabbath "Heaven & Hell"  

Sing me a song, you're a singer
Do me a wrong, you're a bringer of evil
The Devil is never a maker
The less that you give, you're a taker
So it's on and on and on, it's Heaven and Hell, oh well

Fool, fool! You've got to bleed for the dancer!  

The music: solid. The lyrics? Not exactly "Red Barchetta".  

But OK, what about Judas Priest. Didn't two guys kill themselves after listening to Judas Priest? Now we're getting serious.  

Judas Priest "Saints in Hell"  

Cover your fists
Razor your spears
It's been our possession
For 8,000 years
Fetch the scream eagles
Unleash the wild cats
Set loose the king cobras
And blood sucking bats  

OK, if I wanted a fucking rhyming "evil" version of Noah's Ark...  

But whatever. Before I found the Satanism I was looking for, New Wave happened, and it turned out that androgyny and drum machines scared the football boys way more than Satan.  

And then I left Texas and went to Harvard and took on a very different set of social challenges. So the next time I cycled back into metal, as I always do no matter how many other things I'm into, I wasn't looking for more elaborate pentagrams to shock football boys, I was looking for more hermeneutic nuances to situate and contextualize metal for comparative-lit majors who listened to the Minutemen and the Talking Heads.  

Slayer. The Antichrist. Fucking yes. Slayer makes Sabbath with Ozzy sound like Wings, and Sabbath with Dio sound like Van Halen with Sammy Hagar.  

Slayer "The Antichrist"  

I am the Antichrist
All love is lost
Insanity is what I am
Eternally my soul will rot (rot... rot)  

So, that's not Satanic, that's Christian. I mean, it's sort of ironic; Slayer, of course, were the original modern hipsters.  

But what about Bathory? In Nomine Satanas. Fucking Latin! Or something...  

Bathory "In Nomine Satanas"  

Ink the pen with blood
Now sign your destiny to me  

Jesus fucking christ: more fealty.  

Emperor. These are Norwegian actual church-burning dudes. Although, it's Scandinavia, so the church-burning was actually part of a progressive urban planning scheme with multi-use pentagrams in pleasant, radiant-heated public spaces.  

Emperor "Inno a Satana"  

O' mighty Lord of the Night. Master of beasts. Bringer of awe and derision.
Thou whose spirit lieth upon every act of oppression, hatred and strife.
Thou whose presence dwelleth in every shadow.
Thou who strengthen the power of every quietus.
Thou who sway every plague and storm.
Harkee.  

Satan's uvula! "Harkee"?  

Gorgoroth "Possessed by Satan"  

worldwide revolution has occurred
holy war, execution of sodomy
We are possessed by the moon
We are possessed by evil
We are possessed by Satan
possessed
possessed by satan
and then we rape the nuns with desire  

We rape the nuns with desire? This is a program of sorts, I guess. But not one that offered solutions to any problems I actually had. But after a while, I kind of stopped asking music to solve any problems in my life that weren't about music. As an adult, the main thing I asked from my Satanic Norwegian metal was leads for where I could find more of it. The most constant internal theme in my life has been the desperate gnawing suspicion that all the music I know is only the tiniest sliver of what actually exists.  

And maybe what we fear guides our evasions so inexorably that we always end up confirming our suspicions by our nature, but my love of metal motivated and informed my work designing data-analysis software as much as it haunted my attempts to understand emotional resonance, and gradually over the years my writing about music for people bled into writing about music for computers, and that's how I eventually ended up at Spotify, where we have a lot of computers and the largest mass of data about music that humanity has ever collected. And this makes it possible to find out about a lot of metal that you might not otherwise know about. A lot. And a lot of everything else. So I ended up making this genre map, to try to make some sense of it all.  

 

And having organized the world into 1375 genres (which is approximately 666 times 2), I can now answer some other questions about them. Just a few days ago, in fact, purely coincidentally and in no way because I was writing this talk at the last minute without a really clear idea where I was going with it, I decided to reverse-index all the words in the titles of all the songs in the world, and then, using BLACK MATH, find and rank the words that appear most disproportionately in each genre.  

It wasn't totally obvious whether this would produce a magic quantification of scattered souls, or a polite visit from some Mumford-and-Sons fans in the IT department, but here are some examples of what it produced in a few genres you might know:  

a cappella: medley love somebody your girl home time over will with when need around life what tonight song that don't just  

acoustic blues: blues woman boogie baby mama moan down mississippi gonna ain't going worried chicago shake long don't rider jail poor woogie  

modern country rock: country beer that's that whiskey love good like cowboy truck don't she's carolina back ain't just wanna this with dirt  

east coast hip hop: featuring edited kool explicit rhyme triple hood shit album game check ghetto what streets money flow version that style  

west coast rap: gangsta dogg featuring niggaz nate snoop hood ghetto playa money pimp thang shit smoke game bitch life funk ain't west  

I'd say that shit is doing something. [The whole thing is here.]  
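The BLACK MATH itself is unspecified, but one plausible reading of "words that appear most disproportionately in each genre" is a smoothed share ratio over title words, sketched below on invented data (this is an assumption about the method, not the actual computation):

```python
# A guess at "disproportionate words": for each genre, score the words
# in its song titles by how much more often they appear there than in
# titles overall. Data, smoothing, and scoring are all invented.
from collections import Counter

def distinctive_words(genre_titles, top_n=20):
    """genre_titles: {genre: [song titles]} -> {genre: top distinctive words}."""
    genre_counts = {g: Counter(w for t in titles for w in t.lower().split())
                    for g, titles in genre_titles.items()}
    overall = Counter()
    for counts in genre_counts.values():
        overall.update(counts)
    total = sum(overall.values())
    out = {}
    for genre, counts in genre_counts.items():
        g_total = sum(counts.values())
        # +1 smoothing keeps one-off words from dominating the ranking
        score = {w: (n / g_total) / ((overall[w] + 1) / total)
                 for w, n in counts.items()}
        out[genre] = sorted(score, key=score.get, reverse=True)[:top_n]
    return out
```

Words common everywhere ("love", "you") score near 1 in every genre and fall away; words concentrated in one genre's titles rise to the top of that genre's list.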

Using this, I can finally figure out the most Satanic of all metal subgenres. It is Black Thrash, whose top words go like this:  

satanic blasphemy unholy death infernal antichrist satan hell blood holocaust evil metal nuclear doom vengeance black flames darkness funeral iron  

If Satanism is fucking anywhere, it is here.  

Nifelheim "Envoy of Lucifer"  

OK, no idea what they're saying there.  

Destroyer 666 "Satanic Speed Metal"  

Um.  

Warhammer "The Claw of Religion"  

Since the beginning of time
A weapon was built and protected
To keep the balance in line
To guard the "forces of the light"
Do you hear the cries of all the ones that fell?  

Isn't that actually the narration from the beginning of The Fifth Element?  

Sathanas "Reign of the Antichrist"  

From the fall of grace-I shall rise again
Avenging chosen one-Known as Satan’s son  

Well, it's certainly Satanic. But it's Satanism as mirror-image Christianity. Like, imagine if Jackson Pollock's avant-garde transgression was taking Vermeer paintings and repainting them with left and right reversed!!!! To be fair, that's the usual way in which revolutions collapse into politics, hating the status quo's conclusions but being unable to escape its assumptions.  

However, I have a lot of other metal subgenres to work with, and I can actually reorganize the world as if Black Thrash were its point of origin, and then as we move slowly away from that point, genre by genre, we can start to see the patterns change.  

"Satan" begins to disappear.  

 

"Christ" goes away.  

 

"Damnation" no longer so much of a concern.  

 

"Chaos" starts to appear.  

 

"Darkness" is everywhere.  

 

"Eternal" fascinates us.  

 

As does "Beyond".  

 

"Death", always death.  

 

And over and over, at the top of almost every list that doesn't start with "Death": "Flesh".  

 

Except groove metal, where the number 1 term is "Reissue".  

So my mistake, maybe, was in assuming I was looking for a philosophy that called itself Satanic. Give up that constraint, and ideas start to coalesce after all.  

Entombed "Left Hand Path"  

No one will take my soul away
I carry my own will and make my day  

Enslaved "Ethica Odini"  

You have the key to mystery
Pick up the runes; unveil and see  

Dantalion "Onward to Darkness"  

Existence is your own adversary,
a path full of pain and madness.  

Mitochondrion "Eternal Contempt of Man"  

Now the earth, sea, and sky all have torn
Now a gate from the void hath been born
Both the watchers and the unholy do agree
Eradicate that vermin filth humanity  

Dodecahedron "I, Chronocrator"  

Reigning formulas undone
Oaths sworn into silence
Our world will be without form
Our earth will be void  

We are approaching a version of Nihilism that is not an absence, but an embrace of nothingness, an embrace of the finite, of finity.  

Celtic Frost "Os Abysmi Vel Daath"  

Where I am there is no thing.
No God, no me, no inbetween.  

Totalselfhatred "Enlightenment"  

OK, first of all, the band is called Totalselfhatred, and they sound like this. Dreamy.  

I cannot change your destiny, can only help you think
As far as my horizons lead - your thoughts will be more deep
Hope inside is torturing me - keeps painfully alive
A light inside, a knowledge deep, that shines so bright!  

And then, maybe, the grand masters of this, Deathspell Omega.  

Deathspell Omega "Chaining the Katechon"  

That's a 22-minute song, and it does not fade in.  

The task to be achieved, human vocation
Is to become intensely mortal
Not to shrink back
Before the voices
coming from the gallows tree
A work making increasing sense
By its lack of sense
In the history of times there is
But the truth of bones and dust.  

Here, then, are some potential tenets of a chaotic black metal philosophical program:  

1. Babel. Acceptance of chaos, instead of a futile struggle for order or serenity
2. The Codex. To exist in chaos is to seek complexity over simplicity
3. The Void. There is beauty in darkness
4. The Scythe. There are either no illusions, or all illusions, but either way, only death is real  

Which all adds up, I think, to something that I basically understood in second grade, after all: grimly acknowledged free will. That is the philosophical core of metal, as an art form. That is the exact rebellion I was seeking. To choose Satan, and particularly to choose Satan without giving him any positive qualities, is to assert that the act of choosing is more important than the actual choice. To choose death is to assert that choosing is more important than living. To choose death symbolically is somewhat more powerful than choosing it literally, because you can choose it symbolically more than once, which gives you a chance to refine your symbolism.  

Blut Aus Nord "The Choir of the Dead"  

That is Blut Aus Nord's "The Choir of the Dead", from an album actually called The Work Which Transforms God. What does it say? I dunno. But what does it mean? "Hail Satan" is "Think for yourself" plus noise.  

Thank you, and see you in Hell.  
 

[The whole playlist that I was playing from is on Spotify here: Triangulations of the Abyss.]  

Thanks to the Program Committee and the audience for indulging this whim, and particularly to Eric Weisbard for backing up his early-morning scheduling of this racket by showing up to moderate the session himself.
As part of a conference on Music and Genre at McGill University in Montreal, over this past weekend, I served as the non-academic curiosity at the center of a round-table discussion about the nature of musical genres, and of the natures of efforts to understand genres, and of the natures of efforts to understand the efforts to understand genres. Plus or minus one or two levels of abstraction, I forget exactly.  

My "talk" to open this conversation was not strictly scripted to begin with, and I ended up rewriting my oblique speaking notes more or less from scratch as the day went on, anyway. One section, which I added as I listened to other people talk about the kinds of distinctions that "genres" represent, attempted to list some of the kinds of genres I have in my deliberately multi-definitional genre map. There ended up being so many of these that I mentioned only a selection of them during the talk. So here, for extended (potential) amusement, is the whole list I had on my screen:  
 

Kinds of Genres
(And note that this isn't even one kind of kind of genre...)  

- conventional genre (jazz, reggae)
- subgenre (calypso, sega, samba, barbershop)
- region (malaysian pop, lithumania)
- language (rock en espanol, hip hop tuga, telugu, malayalam)
- historical distance (vintage swing, traditional country)
- scene (slc indie, canterbury scene, juggalo, usbm)
- faction (east coast hip hop, west coast rap)
- aesthetic (ninja, complextro, funeral doom)
- politics (riot grrrl, vegan straight edge, unblack metal)
- aspirational identity (viking metal, gangster rap, skinhead oi, twee pop)
- retrospective clarity (protopunk, classic peruvian pop, emo punk)
- jokes that stuck (crack rock steady, chamber pop, fourth world)
- influence (britpop, italo disco, japanoise)
- micro-feud (dubstep, brostep, filthstep, trapstep)
- technology (c64, harp)
- totem (digeridu, new tribe, throat singing, metal guitar)
- isolationism (faeroese pop, lds, wrock)
- editorial precedent (c86, zolo, illbient)
- utility (meditation, chill-out, workout, belly dance)
- cultural (christmas, children's music, judaica)
- occasional (discofox, qawaali, disco polo)
- implicit politics (chalga, nsbm, dangdut)
- commerce (coverchill, guidance)
- assumed listening perspective (beatdown, worship, comic)
- private community (orgcore, ectofolk)
- dominant features (hip hop, metal, reggaeton)
- period (early music, ska revival)
- perspective of provenance (classical (composers), orchestral (performers))
- emergent self-identity (skweee, progressive rock)
- external label (moombahton, laboratorio, fallen angel)
- gender (boy band, girl group)
- distribution (viral pop, idol, commons, anime score, show tunes)
- cultural institution (tin pan alley, brill building pop, nashville sound)
- mechanism (mashup, hauntology, vaporwave)
- radio format (album rock, quiet storm, hurban)
- multiple dimensions (german ccm, hindustani classical)
- marketing (world music, lounge, modern classical, new age)
- performer demographics (military band, british brass band)
- arrangement (jazz trio, jug band, wind ensemble)
- competing terminology (hip hop, rap; mpb, brazilian pop music)
- intentions (tribute, fake)
- introspective fractality (riddim, deep house, chaotic black metal)
- opposition (alternative rock, r-neg-b, progressive bluegrass)
- otherness (noise, oratory, lowercase, abstract, outsider)
- parallel terminology (gothic symphonic metal, gothic americana, gothic post-punk; garage rock, uk garage)
- non-self-explanatory (fingerstyle, footwork, futurepop, jungle)
- invented distinctions (shimmer pop, shiver pop; soul flow, flick hop)
- nostalgia (new wave, no wave, new jack swing, avant-garde, adult standards)
- defense (relaxative, neo mellow)  
 

That was at the beginning of the talk. At the end I had a different attempt at an amusement prepared, which was a short outline of my mental draft of the paper I would write about genre evolution, if I wrote papers. This is, in its way, also a list of kinds of kinds of things:  
 

The Every-Noise-at-Once Unified Theory of Musical Genre Evolution
  1. There is a status quo;
  2. Somebody becomes dissatisfied with it;
  3. Several somebodies find common ground in their various dissatisfactions;
  4. Somebody gives this common ground a name, and now we have Thing;
  5. The people who made Thing before it was called Thing are now joined by people who know Thing as it is named, and have thus set out to make Thing deliberately, and now we have Thing and Modern Thing, or else Classic Thing and Thing, depending on whether it happened before or after we graduated from college;
  6. Eventually there's enough gravity around Thing for people to start trying to make Thing that doesn't get sucked into the rest of Thing, and thus we get Alternative Thing, which is the non-Thing thing that some people know about, and Deep Thing, which is the non-Thing thing that only the people who make Deep Thing know;
  7. By now we can retroactively identify Proto-Thing, which is the stuff before Thing that sounds kind of thingy to us now that we know Thing;
  8. Thing eventually gets reintegrated into the mainstream, and we get Pop Thing;
  9. Pop Thing tarnishes the whole affair for some people, who head off grumpily into Post Thing;
  10. But Post Thing is kind of dreary, and some people set out to restore the original sense of whatever it was, and we get Neo-Thing;
  11. Except Neo-Thing isn't quite the same as the original Thing, so we get Neo-Traditional Thing, for people who wish none of this ever happened except the original Thing;
  12. But Neo-Thing and Neo-Traditional Thing are both kind of precious, and some people who like Thing still also want to be rock stars, and so we get Nu Thing;
  13. And this is all kind of fractal, so you could search-and-replace Thing with Post Thing or Pop Thing or whatever, and after a couple iterations you can quickly end up with Post-Neo-Traditional Pop Post-Thing.
 

And it would be awesome.  
 
 
 
 

[Also, although I was the one glaringly anomalous non-academic at this academic conference, let posterity record the cover of the conference program.]  

We will look back on these days, I think, as some weird interlude after the invention of computers but before we actually grasped what they meant for us. The Age we are stumbling towards, I am very sure, is the Age of Data. And when we get there, we will be there because we have sublimated the state-machine mechanics of computers beneath the logical structural abstractions of information and relation, and begun to inhabit this new higher world without reference to its substrate.  

I spent 5 years of my life trying to help bring this future about. That is, in a sense I've spent my whole adult life trying to help bring this future about, but for those 5 years I got to work on it very directly. I designed, and our team built, an attempt at a prototype of what a new data exploration system could be like, and at the core of this was my attempt at a draft of a language for discussing data the way algebra is a language for discussing math. These are the elements out of which this new age's alchemies will be constituted. And there were moments, as the system began to come into its own, when I felt the twitches of power awakening. You could conjure shapes out of data with this thing. It made information malleable, made it flow.  

The computer programmers on the team sometimes referred to the project as a system for "non-programmers", and I've come to think of that as both its potential and its downfall. Programmers never say "non-programmers" as a compliment. At best it's merely condescending, at worst it's a euphemism for "idiot" or a semi-aware admission of incomprehension. For programmers, programming is by definition an end, not a means, and therefore the motivations of non-programmers are inherently mysterious and alien. But what we built was for non-programmers in the same way that a bridge is for non-engineers. That is, the whole point of it was to represent a different interaction model between people and information than the ones offered by, at one end, programming languages, and, at the other, spreadsheets and traditional database programs. As I said over and over throughout those 5 years, I was trying to get us to do for hyper-connected datasets what VisiCalc once did for columns of numbers. I wasn't trying to simplify; if anything, I was making some things harder, or at least less familiar. This new age is not a subset of a previous age. It is not for lesser people, and its challenges are not of a simpler character.  

And as Google now shuts that system down, literally unceremoniously, and 5 years of my work and dreams and visions are at least nominally obliterated, I feel a little sadness but mostly relief. I'm still very convinced that our tools -- humanity's tools -- for interacting with data are hopelessly primitive. I'm still convinced that it won't make a whole lot of difference what those tools are if kids don't grow up learning how to think about data in the first place. I'm still convinced that I have a blurry, fractured vision of what it might take to change these things.  

But I also realize two more things.  

First, the system we built was only a beginning, and it had hardened into a premature finality long before its official corporate fate was settled. The query language I invented was cool, but the successor to it, which I'm sketching in my head whether I want to or not, is a different sort of thing yet again. And I was never going to reach it incrementally, arguing over every syntax decision on the way. Sometimes you have to just start over. The next one will not aspire to be the VisiCalc of anything. It's not better business tools we need. The problem is not that we are alienated from our inner accountants. The thing we need first is not even an algebra of data, probably, but an arithmetic of data. We need an inversion of "normalization" in which you don't write data wrong and then endure six Herculean labors to make it obscurely more pleasing to capricious gods, but rather a way of writing it in the first place with an inherent expressive gravity towards truth, because more true is always more powerful. This is a task in applied philosophy, not programming and not engineering and not even science. We need to imagine what Plato would have done when his record collection got too big for his cave.  

Second, I still believe that we all deserve better tools, tools more suited for our actual tasks and needs as people whose lives and choices and options are increasingly functions in, not merely of, information. But in the process of exploring what I mean by that I've become a non-non-programmer myself. At my new job I am an engineer. And sometimes, when you think you know what the better world looks like, you can bring pieces of it up out of your dreams. You can walk where the new paths will be. With enough belief, you can walk where the bridges will be. I will come back to these paths, one way or another, but you never do great things by imagining what people you don't understand might want for purposes you don't grasp or embrace. You should trust your own judgment only where you love beyond reason. Anybody could do nearly anything with Needle, and the business cases for it all involved hypothetical big companies doing hypothetical big things with hypothetical big data that repeatedly never actually materialized (and might have been hypothetical if they had). But left to my own invented devices, I always ended up using it for music data.  

So I have followed my own love, and my own obsessions, deeper into that data. At my new job, I am trying to make sense of the largest music database in the world, which is a lot more fun than what I was doing before, and harder, and of rather more direct and demonstrable relevance to anything. On my own, I will continue the music projects I started in Needle. The Discordance evolved out of empath, and so I've evolved it back in, with less marginalia but maybe more coherence. For the Pazz & Jop I've built a stats site far more specific than I could ever have done in the generalized environment of Needle. These will grow as I play with them, and probably there will be other things. I spent 5 years trying to build fancy tools, but it's pretty amazing what you can do with just a hammer. I was Needle's most dedicated user, but in the end, both sadly and happily, I don't actually need it any more. Nobody will miss it more than I will, but maybe nobody will really miss it very much. The moral, I think, and maybe even the ethic, is that these systems do not matter. This isn't the first system I worked on only to see it shut down, and it won't be the last. Software is the epitome of ephemera, necessary in aggregate but needless in every mundane specific.  

But the things we learn from these systems stay learned. Even the ways of learning remain ways after their original demonstrations disintegrate. This is another phrasing of the point about this Age, in fact: the flow from Data to Information to Knowledge to Wisdom is not a function of syntax or platforms or prevalence or virtualization. It is something we do, to which the technology is merely witness. We must teach our children how to think about data because the data survives where the systems fail. We must teach ourselves to be children again in this new Age, because its most transformative truths still await discovery, and are anything but mundane or needless, and we will never recognize them unless we can recall what it felt like in our hearts when everything was amazing and new and ahead of us, and the act of waking was an invitation to wonder to show us a way.
[May 2012 note: Needle, the database system I used to collect, analyze and show this information, was acquired and shut down by Google. Thus many of the links below go to non-functional snapshots of Needle pages I took before the shutdown. The points should survive.]  
 

Boston Magazine recently published their annual Best Schools ranking. They've been doing this for years, and are known for various other Boston rankings as well (places to live, places to eat...), so by now you'd expect them to be pretty good at it.  

Here's what "pretty good at it" amounts to, in 2011: two lists of 135 school districts, one with some configuration information (enrollment, student/teacher ratio, per-pupil spending, graduation rate, number of sports teams, what kind of pre-k they offer, how many AP classes), the second with test scores, and exactly this much methodological transparency: "we crunched the data and came up with this".  

Some obvious things that you can't do with this information:  

- sort it by any criteria other than the magazine's rank
- see the stuff in the first table alongside the stuff in the second
- understand which figures are actually part of the ranking, in what weights
- fact-check it
- compare it in bulk to any other information about these schools
- compare it to any other information about the towns served by these districts
- figure out why certain towns were included or excluded
- find out what towns are even meant by non-town district names if you don't already happen to know
- evaluate the significance of any individual factor, or the correlations of any set of them  

This is not a proud state of the art. And the quality of secondary journalism around it emphasizes the point further: this article about Salem's low ranking basically just turns a table-row into prose sentences, with no context or analysis, and fails to even realize that the 135 districts in the ranking represent just the immediate vicinity of Boston, not the whole state. This Melrose article claims Melrose "climbed" from 97th last year to 94th, but then has to add a note that last year's ranking was of high schools, not whole districts, and thus not even the same thing. Swampscott exults in making the top 50. Malden fights back at being ranked 119th. But nobody actually knows what the rankings mean or signify, because Boston Magazine doesn't say.  
 
 

In an attempt to improve this situation a little, I imported these two tables of information into Needle:  

 

This in itself was sufficient to unify the two tables and render them malleable, which seems to me like the most basic start. Now at least you can re-sort them yourself, and choose what to look at next to what.  

And a little sorting, in fact, quickly reveals some statistical oddities. North Attleborough was listed with an SAT Reading score of 823, which, since SAT scores only go up to 800, is very obviously wrong. Some trivial research verifies that this was a typo for 523, and while typos happen in journalism all the time, a typo in data journalism is a dangerous indication that some human has been retyping things by hand, which is never good. (This datum has now been fixed in the magazine's table.)  

More interestingly, when you start scrutinizing each district's 5th/8th/10th-grade MCAS scores, you find some surprising skews. Here are the MCAS and SAT scores for Georgetown:  

MCAS 5 English: 74
MCAS 5 Science: 54
MCAS 5 Math: 42  

MCAS 8 English: 81
MCAS 8 Science: 36
MCAS 8 Math: 51  

MCAS 10 English: 92
MCAS 10 Science: 90
MCAS 10 Math: 88  

SAT Reading: 570
SAT Writing: 566
SAT Math: 584  

Boston Magazine says they "looked within those districts to determine how schools were improving (or not) over time". But that's not what these scores are measuring. These aren't time-slices for a single cohort, these are different tests being given to different kids. If you're interested in history, the Department of Education profile of Georgetown includes annual MCAS results for 2006-2009, and all you have to do is scan down the page to spot the weird anomaly that is 8th grade Science. Every other test has healthy dark-blue bars for "Advanced" scores; but in 8th grade Science virtually no kids managed Advanced scores in any year. This pattern repeats in Wellesley in an even more dramatic fashion. An article from Wellesley Patch explains that their 8th grade science curriculum doesn't cover "space", while the MCAS does. It's an interesting ideological question whether curricula should be matched to the standardized tests, but whatever your opinion on that, it seems clearly misleading to interpret this policy issue as a quality issue.  
 
 

A little more sorting repeatedly raised another question: why is Cambridge ranked 25th? In virtually every test-score-based sort it falls close to the bottom of the table. In the magazine's ranking, Cambridge comes in ahead of Westford, at #26. But observe these scores for the two:  

MCAS 5 English: 59 - 88
MCAS 5 Math: 53 - 86
MCAS 5 Science: 45 - 85
MCAS 8 English: 75 - 95
MCAS 8 Math: 45 - 86
MCAS 8 Science: 34 - 78
MCAS 10 English: 70 - 97
MCAS 10 Math: 77 - 95
MCAS 10 Science: 59 - 94
SAT Reading: 498 - 587
SAT Writing: 493 - 582
SAT Math: 503 - 602
Graduation Rate: 85.2 - 94.6  

This doesn't even look close. But then notice these:  

Students per Teacher: 10.5 - 14.6
Per-Pupil Spending: $25,737 - $10,697  

Cambridge's spending per student is remarkable. It's almost 50% higher than the next highest, which is Waltham at $18,960. The 10.5 students per teacher is also the best ratio of the 135 schools listed, with 115th-ranked Salem in second place with 11. These factors seem like they should matter, and clearly they must be part of the magazine's ranking calculation, but if they're so uniformly not translating to better test scores or graduation rates in Cambridge, does this really make any sense?  

At least we ought to be able to say that these, along with the other non-test characteristics in the magazine like the number of sports teams and the number of AP classes, are different sorts of statistics than test scores. This seems increasingly true as you start looking at them in detail. Plymouth is listed as having 94 sports teams, for example. Can you name 94 different sports? I can't, and the Plymouth High School web site only claims they participate in 19. Newton is listed as having 39 AP classes, and Boston as having 155. But there are only 34 AP subjects, so it seems like a pretty safe guess that in these two multi-high-school districts the magazine is adding the totals for each school. It's hard for me to see what that accomplishes.  

So for my own interest, at least, I created my own Quant Score, which is calculated like this:  

- take all 9 of the listed MCAS scores, drop the lowest one, and sum the other 8
- divide each of the three SAT scores by 3, to put them into a range where they're each worth around twice as much as an individual MCAS score, and add those in
- multiply the graduation rate by 2, to put it into a similar range to the SAT scores, and add that in, as well  

These factors are admittedly arbitrary, so you're welcome to try your own variations, but at least I'm telling you exactly what goes into mine, so you know what you're varying against. I deliberately left out all the other descriptive metrics from this calculation, including student/teacher ratio and spending. I then reranked the schools according to these Quant Scores. See the comparison of the magazine's ranking and mine here.  
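The Quant Score arithmetic above is simple enough to write down directly. Here is a minimal sketch of it in Python; the function and argument names are mine, but the weighting (drop the lowest MCAS, SATs at one third, graduation rate doubled) is exactly the one listed. The sample call uses Georgetown's MCAS and SAT figures from the tables above, with a hypothetical graduation rate, since the post doesn't quote Georgetown's.

```python
def quant_score(mcas, sat, grad_rate):
    """mcas: the 9 listed MCAS percentages; sat: the 3 SAT scores (200-800);
    grad_rate: graduation percentage."""
    mcas_part = sum(mcas) - min(mcas)       # drop the lowest of the 9, sum the other 8
    sat_part = sum(s / 3 for s in sat)      # scale each SAT to roughly 2x an MCAS score
    grad_part = grad_rate * 2               # scale graduation rate into the SAT range
    return mcas_part + sat_part + grad_part

# Georgetown's scores from the tables above (grad_rate here is hypothetical):
georgetown = quant_score(
    mcas=[74, 54, 42, 81, 36, 51, 92, 90, 88],
    sat=[570, 566, 584],
    grad_rate=96.0,
)
```

Varying the weights, as invited above, is just a matter of editing those three scale factors.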

The differences are pretty dramatic. Three schools from outside the magazine's top 20 move into my top 10 (and 2 more from outside the magazine's top 10). The magazine's #s 6 and 8 drop to 28 and 30 in my list. Watertown and Waltham drop from 53 and 54 in the magazine to 100 and 114 in my list. Swampscott will be displeased to see that my re-ranking sends them back out of the top 50. Malden will probably not be much appeased that I've bumped them up from 119 to 118. Acton and Winchester will be thinking about staging parades for me. And Cambridge (where I live, and where my pre-K daughter will go to school unless we take some drastic action) plunges from 25th to 107th.  
 
 

But these are not answers, these are more questions. Most obviously: Why? I'm not claiming my Quant Score is definitive in any way, but it measures something, and I'm willing to claim that what it measures is something more coherent than what the magazine's rank measures. So this sets me off on the quest for better explanations, for which we obviously need more data.  

Needle is good at integrating data, so I have integrated a bunch of it: per-capita incomes, town populations, unemployment rates, district demographic breakdowns, lunch subsidy percentages and 2010 election results. Some of these apply to towns, not districts, and several districts serve multiple towns, but Needle loves one-to-one and one-to-many relationships the same, so I've done properly weighted multi-town averages. (Don't try that in a spreadsheet.)  
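The weighted multi-town averaging is the part a spreadsheet makes painful; as arithmetic it's one line. A sketch, with hypothetical town names and numbers, of averaging a town-level figure (here per-capita income) across a district, weighted by town population:

```python
def weighted_average(values, weights):
    """Population-weighted average of per-town figures for one district."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# (name, population, per-capita income) -- hypothetical two-town district:
towns = [("Townville", 42_000, 51_000),
         ("Villageton", 8_000, 38_000)]

district_income = weighted_average(
    values=[income for _, _, income in towns],
    weights=[pop for _, pop, _ in towns],
)
```

The larger town dominates the average in proportion to its population, which is the point: a naive unweighted mean would treat an 8,000-person town as equal to a 42,000-person one.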

And then I started comparing things. Per-pupil spending seems like it ought to matter, but it shows very little statistical correlation to quant scores. Student/teacher ratios, sports-team counts and AP classes also seem like they ought to matter, but the numbers don't support this.  

Per-capita income, on the other hand, matters. The percentage of students receiving lunch subsidies matters even more. In fact, this last factor (the precise calculation I used was adding the percentage of students receiving free lunch and half of the percentage of students receiving partially subsidized lunch) is the single best predictor of quant score that I've found so far. This is depressingly unsurprising: poverty at home is hard to overcome, hard enough for individuals, and even harder in aggregate.  
 
 

With this in mind, then, I ran a quick linear regression of quant score as a strict function of lunch-subsidy percentage, and used that to calculate predicted quant scores for each district. The depressing headline is how small those variations are. In a quant-score range from 1531 to 727, only 10 districts did more than 100 quant points better than predicted, and only 10 districts did more than 100 points worse. If I use the square roots of the lunch-subsidy percentages, instead, only 6 districts beat their predictions by 100, and only 8 miss by 100.  
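The whole prediction exercise fits in a few lines: the lunch-subsidy factor as stated above (free plus half of partially subsidized), an ordinary least-squares line fit of quant score against it, and a count of districts beating or missing the prediction by more than 100 points. The five data rows below are hypothetical stand-ins, not the real 135-district table.

```python
def lunch_subsidy_index(pct_free, pct_partial):
    """Free-lunch percentage plus half the partially-subsidized percentage."""
    return pct_free + 0.5 * pct_partial

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (free %, partial %) pairs and quant scores for five districts:
subsidy = [lunch_subsidy_index(f, p)
           for f, p in [(4, 2), (10, 4), (25, 10), (50, 10), (6, 4)]]
quant = [1480.0, 1350.0, 1100.0, 800.0, 1300.0]

slope, intercept = fit_line(subsidy, quant)
residuals = [y - (slope * x + intercept) for x, y in zip(subsidy, quant)]

overachievers = sum(1 for r in residuals if r > 100)
underachievers = sum(1 for r in residuals if r < -100)
```

On the real data the striking result is how few districts land outside that ±100 band at all.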

If I toss in town unemployment rates, Democratic vote percentages in the 2010 Senate election, and town per-capita income, I can get my predictions so close that only 1 school did more than 100 points better than expected, and only two did more than 100 points worse. This is daunting precision.  

But OK, even if the variations are small, they're there. So surely this is where those aspirational metrics like spending must come into play. Throwing money at students in school may not be able to counteract poverty at home, but doesn't it at least help?  

No.  

Students per Teacher? No.
AP classes? No.
Percentage of minority students? No.  

I'm by no means saying that there isn't an explanation, or more of an explanation, or other factors. But if there are, I haven't found them yet.  

But at least I'm trying. And I give you the data so you can try, too. I submit that this is what data journalism should be trying to do. We are trying to find knowledge in data. Secrecy and opaqueness and non-interactivity are counter-productive. It's more than hard enough to find truth even with all the data arrayed in front of us. If there's an equivalent of the Hippocratic Oath for data journalists, it should be that we will endeavor to never make the truth more obscure.  
 
 

[Space for discussion here.]  
 
 

[Postscript, 10 September: The more I thought about that 823/523 error, the more I worried that there might be other errors that weren't as obvious, so I used Needle to cross-check all the test-scores against the official DOE figures. Two more were wrong. Manchester Essex's SAT Reading score was 559, not 599, which I'm guessing would lower their #6 magazine rank, perhaps considerably. In my rankings it dropped them from 28 to 31. Ashland's SAT Reading score was also wrong, 531 not 537, but this didn't change their rank in my method. Both corrections moved those schools' scores closer to my predictions.]  

[Postscript, 12 September: But charter schools do better relative to expectations, right? Nope.]
[From a note to the public-lod@w3.org mailing list.]  

Data "beauty" might be subjective, and the same data may have different applicability to different tasks, but there are a lot of obvious and straightforward ways of thinking about the quality of a dataset independent of the particular preferences of individual beholders. Here are just some of them:  

1. Accuracy: Are the individual nodes that refer to factual information factually and lexically correct? Like, is Chicago spelled "Chigaco", or does the dataset say its population is 2.7?  

2. Intelligibility: Are there human-readable labels on things, so you can tell what one is when you're looking at it? Is there a model, so you can tell what questions you can ask? If a thing has multiple labels (or a set of owl:sameAs things have multiple labels), do you know which (or if) one is canonical?  

3. Referential Correspondence: If a set of data points represents some set of real-world referents, is there one and only one point per referent? If you have 9,780 data points representing cities, but 5 of them are "Chicago", "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and "Chicagoland", that's bad.  

4. Completeness: Where you have data representing a clear finite set of referents, do you have them all? All the countries, all the states, all the NHL teams, etc? And if you have things related to these sets, are those projections complete? Populations of every country? Addresses of arenas of all the hockey teams?  

5. Boundedness: Where you have data representing a clear finite set of referents, is it unpolluted by other things? E.g., can you get a list of current real countries, not mixed with former states or fictional empires or administrative subdivisions?  

6. Typing: Do you really have properly typed nodes for things, or do you just have literals? The first president of the US was not "George Washington"^^xsd:string, it was a person whose name-renderings include "George Washington". Your ability to ask questions will be constrained or crippled if your data doesn't know the difference.  

7. Modeling Correctness: Is the logical structure of the data properly represented? Graphs are relational databases without the crutch of "rows"; if you screw up the modeling, your queries will produce garbage.  

8. Modeling Granularity: Did you capture enough of the data to actually make use of it? ":us :president :george_washington" isn't exactly wrong, but it's pretty limiting. Model presidencies, with their dates, and you've got much more powerful data.  

9. Connectedness: If you're bringing together datasets that used to be separate, are the join points represented properly? Is the US from your country list the same as (or owl:sameAs) the US from your list of presidencies and the US from your list of world cities and their populations?  

10. Isomorphism: If you're bringing together datasets that used to be separate, are their models reconciled? Does an album contain songs, or does it contain tracks which are publications of recordings of songs, or something else? If each data point answers this question differently, even simple-seeming queries may be intractable.  

11. Currency: Is the data up-to-date? As of when?  

12. Model Uniformity: Are discretionary modeling decisions made the same way throughout the dataset, so that you don't have to ask many permutations of the same question to get different subsets of the answer? Nobody should have to worry whether some presidents and presidencies are asserted in only one direction and some only the other.  

13. Attribution: If your data comes from multiple sources, or in multiple batches, can you tell which came from where? If a source becomes obsolete or discredited or corrupted, can you take its data out again?  

14. History: If your data has been edited, can you tell how and when and by whom? Can you undo errors, both individual (no, those two people aren't the same, after all) and programmatic (those two datasets should have been aligned with different keys)?  

15. Internal Consistency: Do the populations of your counties add up to the populations of your states? Do the substitutes going into your soccer matches balance the substitutes going out? Would you notice if errors were introduced?  

16. Legality: Is the license under which the data can be used clearly defined, ideally in a machine readable way?  

17. Sustainability: Is there some credible basis or evidence for believing the data will be kept available and current? If it's your data, what commitment to its maintenance are you making?  

18. Authority: Is the source of the data a credible authority on the subject? Did you find a list of NY Charter Schools, or the list?  

[Revision of #12 and addition of 16-18 suggested by Dave Reynolds.]
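The granularity point in #8 is easy to make concrete. A toy sketch, in Python rather than any real triple syntax, of the difference between the bare country-to-president assertion and modeled presidencies with dates; all names and structures here are illustrative:

```python
# The bare assertion: true-ish, but it can't answer "president when?"
flat = {"us": {"president": "george_washington"}}

# Modeled presidencies, with dates, as the list above recommends:
presidencies = [
    {"country": "us", "president": "george_washington", "start": 1789, "end": 1797},
    {"country": "us", "president": "john_adams", "start": 1797, "end": 1801},
]

def president_in(country, year):
    """Answer a question the flat version simply cannot express."""
    for p in presidencies:
        if p["country"] == country and p["start"] <= year < p["end"]:
            return p["president"]
    return None
```

The flat form can only ever answer one question; the modeled form answers that one and a whole family of temporal ones besides.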
Stefano Mazzocchi (of Freebase) and David Karger (of MIT) have been holding a slow but interesting conversation about data reconciliation. It's been phrased as a sort of debate between two arbitrarily polarized approaches to the problem of cleaning up data so that you don't have multiple variant references to the same real data-point: either you try to do this cleanup "in the data", so that it's done and you don't have to worry about it any more; or you leave the data alone and figure you'll incorporate the cleanup into your query process when you actually start asking specific questions.  

But I think this is less of a debate than Stefano and (particularly) David are making it seem. Or, rather, that the real polarization is along a slightly different axis. The biggest difference between Stefano's world and David's happens before you even start to worry about when you're going to start worrying about data cleanup: Freebase is attempting to build a single huge structured-data-representation of (eventually) all knowledge; David's work is largely about building lots of little structured-data-representations of individual (relatively) small blobs of knowledge.  

Within a dataset, though, Stefano and David (and I) are in enthusiastic agreement: internal consistency is what makes data meaningful. If your data doesn't know whether "Beatles" and "The Beatles" are one band or two different ones, your answers to any question that touches either one are liable to be so pathetically wrong that people will very quickly stop asking you things. In the classic progression from Data to Information to Knowledge to Wisdom, this is what that first step means: Data is isolated, Information is consolidated. It would be inane to be against this. (Unless, I guess, you were also against Knowledge and Wisdom.)  

It's on the definition of "within", though, that most of the interesting issues under the heading of "the Semantic Web" wobble. David says "I argued in favor of letting individuals make their own [datasets] ... and worry about merging them with other people’s data later." He really does mean individuals, and the MIT-developed tool he has in mind (Exhibit) is designed (and only suited) for fairly small datasets. Freebase is officially a database of everything, but of course within this everything are lots of subsets, and Stefano is really talking more about cleaning up these subsets than the whole universe. Reconciling all the artist references to "Beatles" and "The Beatles" from albums and songs is a straightforward practical task both beneficial and probably necessary for asking questions of a song/album/artist dataset, whether it's embedded in a larger one or not. Reconciling the "The Beatles" who are the artist of the song "Day Tripper" with "The Beatles" who are depicted in cartoon form on a particular vintage lunchbox for sale on eBay, on the other hand, is less conceptually straightforward, and of more obscure practical import.  

The thing I'm working on at ITA falls in between Exhibit and Freebase on this dataset axis, both in capacity and design. We handle datasets much bigger than you could put in Exhibit, and allow you to explore and analyze them in ways that Exhibit cannot; but we separate datasets, partly for scalability but even more importantly to specifically keep out of any unnecessary quagmire of trying to reconcile not just the bits of data but the structures.  

And my own scouting report, from having done a lot of data-reconciliation in a system designed for it, and in other lives of a lot of more-painful data-reconciliation in various systems not really designed for it, is the same as Stefano's report from his experiences at Freebase: getting from data to information, at scale, is hard. It's mainly hard in ways that machines cannot just solve for you, and anybody who thinks RDF is a solution to data-reconciliation is, as Stefano puts it, confusing model mixability with semantic reconciliation, and has probably not noticed because they've only been playing with toy datasets, which is like thinking you're learning Architecture by building a few birdhouses.  

And this is all exactly why I have repeatedly argued for treating the "Semantic Web" technology problem as a database-tech problem first, and a web-tech problem only secondarily. David complains, justifiably, about the pain of combining two small, simple folk-dance datasets he himself cares about. But just as Stefano says in another example, the syntax problems here (e.g. text in a YouTube description box, rather than formally-labeled data records) are fairly trivial compared to the modeling differences. All the URIs and XSL transformations in the world are not going to allow every two datasets to magically operate as one. Some person, with or without tools to magnify their effort, is going to have to rephrase one dataset in the argot of the other.  

And to say another thing I've said before again, the fact that RDF doesn't itself solve these problems is not its terminal design flaw. Its terrible flaw is that it isn't even a sufficient foundation upon which to build the tools that would help solve these problems. It takes the right fundamental idea, that the most basic conceptual structure of data is things and the relationships between them, but then becomes so invertedly obsessed with the theoretical purity of this reduction that it leaves a whole layer of needed practical elaboration unbuilt. We shouldn't need abstruse design patterns to get ordered lists, or rickety reasoning engines just to get relationships that go both directions, or endless syntax officiousness that gets us expensive precision with no gain in accuracy.  

This effort has let us down. And worse, it has constrained smart people so that their earnest efforts cannot help but let us down. After years and years of waiting, we have no Semantic Web of information, we have only Linked Data, where the word "Linked" is so tenuously justified it might as well be replaced with some pink-ink-drinking Seuss pet, and the word "Data" is tragically all too accurate. We have all these computers, living abbreviated lives of quiet depreciation, filled with the data that should become our wisdom, and yearning, if they are allowed to yearn for even one thing, to be able to tell us what they know.
[Fair warning: This is another post about data-modeling and query languages, and it isn't likely to be the last. It may or may not be interesting to people with personal interests in those topics, but I think it's pretty unlikely to be interesting to people without those interests. You have been warned.]  
 

In data-modeling you usually live in fear of the word "usually". Accounting for the subsequent "but sometimes" is usually where a simple, manageable data-model starts its ugly metamorphosis towards tangled and unusable. Same with "mostly", "more often", "occasionally". Most data-modeling contexts are only really ever happy with "always" and "never"; and the real world is usually not so helpfully binary.  

DiscO, my data-model for discographies, notes that Tracks, Releases and Sequences can all have Artists, but that usually Tracks would get their Artist indirectly from their Release, which would in turn get it from its Sequence.  

What this means, in practical terms, is that when we're entering or importing data, we don't necessarily want to have to set Artist individually on every single Sequence, Release and Track. But when we're querying the data, we want to be able to ask about Artist on every one of those.  

Data-modeling people call this "inference", and it's a whole academic subject of its own, deeply entangled with abstract logic and belief-system consistency. The Sequence/Release/Track problem sounds theoretically straightforward at first, but gets very hard very quickly once you realize that some Releases are compilations of Tracks by multiple artists, and some Sequences have Releases by multiple artists. Thus it's not quite true that Sequence "contains" Release, and upon that "not quite" most academic approaches will founder and demand to be released from this vagueness via a rococo expansion of the domain model to separate the notions of Single-Artist and Multiple-Artist Sequences and Releases.  

But "usually" is a good concept. And for lots of real-world data problems it can be handled without flailing into existential abstraction. We can keep our model simple, and fill in the implied data with a very general mechanism: the relationships between "actual" values and "effective" values can themselves be described in queries.  

Releases, we said, may have Artists directly, or may get them implicitly from their Sequence. More specifically, we probably mean something like this:  

- If a Release has an Artist, directly, use that.
- If it doesn't, get all the Sequences in which it occurs.
- Ignore any Sequence that itself has no Artist or multiple Artists.
- If all the remaining (single-Artist) Sequences have the same Artist, use that.
- Otherwise we don't know.  

This can be written out in Thread in pretty much these exact steps:  

Release|Artist=(.Artist;(.Sequence:(.Artist::#=1).Artist::#=1))
 

This isn't a syntax tutorial, but ";" means otherwise, and "::#=1" tests a list to see if it has exactly one entry, so maybe you can sort of see what the query is doing.  
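For comparison, the same five steps can be sketched in ordinary Python. The dict shapes here (a release with an optional "artist" and a list of "sequences", each with an "artists" list) are my illustrative assumptions, not anything Thread or DiscO mandates:

```python
def release_artist(release):
    # 1. If the Release has an Artist directly, use that.
    if release.get("artist"):
        return release["artist"]
    # 2-3. Get its Sequences, ignoring any with no Artist or several.
    candidates = {seq["artists"][0]
                  for seq in release.get("sequences", [])
                  if len(seq.get("artists", [])) == 1}
    # 4-5. If the single-Artist Sequences all agree, use that artist;
    # otherwise we don't know.
    return candidates.pop() if len(candidates) == 1 else None
```

The set comprehension quietly does the "all the same" check: if the surviving Sequences agree, the set has exactly one member.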

A Track, then, follows the same pattern, but has to check one more level:  

- If a Track has an Artist, directly, use that.
- If it doesn't, get all the Releases in which it occurs.
- Ignore any Release that itself has no Artist or multiple Artists.
- If all the remaining (single-Artist) Releases have the same Artist, use that.
- Otherwise, get all the Sequences for all the Releases on which the track appears.
- Ignore any Sequence that itself has no Artist or multiple Artists.
- If all the remaining (single-Artist) Sequences have the same Artist, use that.
- Otherwise we don't know.  

Or, in Thread:  

Track|Artist=(
.Artist;
.(.Release:(.Artist::#=1).Artist::#=1);
.(.Release.Sequence:(.Artist::#=1).Artist::#=1))
 

In this system, once you've learned the query-language (and I'm not saying this is a trivial task, but it's not that hard), you can do most anything. The language with which you ask questions is the same language you use for stipulating answers.  
 

Query-language-geek postscript: And, of course, it's the same language you use for obsessively fiddling with your questions and answers because you just can't help it. That Track query has some redundancy, and while redundancy isn't always bad, it's almost always fun to see if you can get rid of it. In this case we're asking the same question ("What's your artist?") in all three steps of the query. We can rearrange it so that we get all the Tracks and Releases and Sequences first, then ask all the Artist questions at the end:  

Track|Artist=(...__,Release,Sequence:(.Artist::#=1).Artist::#=1)
 

"...__," may initially look a little like ants playing Limbo, but "...__,Release,Sequence" means "get these things, their Releases and their Sequences, and then add those things' Releases and Sequences, etc. until you get to the end. So this version of the query builds up the complete list of all the places from which this Track could derive its artist, keeps only the ones that have a single artist, and then sees if we're left with a single artist at the end of the whole process. Depending on what we want to do if our data has internal contradictions, this might actually be functionally better than the first version, not just shorter to type.  

But DiscO also has all that stuff about Original Versions, and it would be nice if our Artist inference used that information, too. If Past Masters [Remastered] is an alternate version of Past Masters, Vol. 1 & 2, and Past Masters, Vol. 1 & 2 belongs to the Beatles' Compilations sequence, then we should be able to deduce that the version of "We Can Work It Out" on Past Masters [Remastered] is by the Beatles. Our experimental elimination of redundancy now pays off a little bit, because we only have to add in "Original Version" in one place:  

Track|Artist=(...__,Release,Sequence,Original Version:(.Artist::#=1).Artist::#=1)
 

And interestingly, since both Tracks and Releases can have Original Versions, this whole thing actually works for either type, and thus we can combine the two and have even fewer things to worry about:  

Track,Release|Artist=(...__,Release,Sequence,Original Version:(.Artist::#=1).Artist::#=1)
 

Having fewer things to worry about is (usually) good.
What Beatles album is "Day Tripper" on?  

This is partially a trick question, of course, as "Day Tripper" was originally a non-album single, but it has been on several Beatles compilations over the years, including the red 1962-1966 best-of, and in the remastered 2009 catalog it lands on both the mono and stereo versions of Past Masters.  

Starting from scratch, it would take you, as a person, a little while to figure this answer out on the web. There's a Wikipedia page for the song, which is the top search hit for the above question in both Google and Bing, and it explains the non-album-ness and mentions several compilations, but it doesn't (at least as of this moment) clarify current availability, and none of the pages for the non-current compilations refer you explicitly (again, as of this moment) to the right place.  

There are a few paths that lead you to a voluminous page for the Beatles discography, on which "Day Tripper" is again mentioned several times, but this page doesn't itemize the track-listings of compilations, and the "Day Tripper" links just go back to the original song-page. But eventually you might blunder into the separate pages for the Mono and Stereo box sets, and from there you might wander over to Amazon.  

Here even more potential confusion awaits you. The top Google hit for "amazon past masters", at the moment, is the now-obsolete second volume of the old CD edition. Searching on "past masters" on Amazon itself gets you the Remastered edition as the top hit, but idiotically suggests buying it together with the old Volume 2, and you would have to scrutinize the track lists to realize that Past Masters [Remastered] subsumes Past Masters, Vol. 1 and Past Masters, Vol. 2.  

In fact, if you backtrack and try searching for "Day Tripper" on Amazon, the top hit is Rubber Soul, the album on which "Day Tripper" would have appeared, chronologically, but doesn't. And, for good measure, that top hit is the obsolete 1990 CD, not the new remaster.  

Bleh.  
 

But at least you're a person, and via a judicious combination of intuition and stubbornness and asking some friends who might know, you can eventually solve these information problems. If you were a computer, you'd be fucked.  

Which is OK on some existential level, because if you were a computer you probably wouldn't appreciate the song, anyway. But the point of computers is to help people do things, and a computer ought to be a particularly helpful tool when the thing you need to do is sort through some data.  

But for the computer to help you puzzle through this data, the data has to be modeled usefully by people first. There are several prominent sources of meticulously structured data about music, so this should be easy. But here, sadly, people have let us down again. And again, and again. Let's see how.  
 

All Music Guide  

A text-search for "Day Tripper" (there's no other query interface) returns a full page of cryptic results. There's an "Occurrences" column, and although it's not clear exactly what that means, it's obvious that more is supposed to be better, and the first listing has 360 where none of the rest have more than 13, so presumably that's the "right" one.  

Clicking this gets you 8 pages of results, which is annoying in itself (the splitting of them into pages, I mean). They're sorted by Artist, which sounds reasonable enough, except that the ones without artists are sorted first, and thus the first page of results is almost totally crap. There are lots of Beatles releases listed, but they get split between pages 1 and 2 of the results list, making it impossible to look at them all at once.  

But if you drill into one of them at random, and then click on "Day Tripper" in the track listing, you do finally get to a page that lists all (or several, anyway) Beatles releases on which this song appears. There are 24, though, including such things as "Five Nights in a Judo Arena", which human intuition might guess is not a normative release, but a computer would have no basis for dismissing. These releases are in date-order, at least, but this turns out to be worse than worthless for our current question, because All Music has modeled Past Masters [Remastered] not as an album, but as an alternate manifestation of the album Past Masters, Vol. 1 & 2, which means it appears way up in the middle of this list, labeled 1988, because that was the year of its earliest issue (on cassette!).  

Looking through the data, in fact, we see that although All Music has lots of individual detail on most kinds of things, it has essentially nothing that models the relationships of things to each other, or in groups. There is no modeled connection between Past Masters, Vol. 1 & 2, Past Masters, Vol. 2, The Beatles: Mono Box Set and The Beatles: Stereo Box Set, even though v1/2 subsumes v2, and both boxes subsume both.  

And there's no modeling of "in print", or any notion of representing the subset of albums that represent the current core catalog. So a computer can't use this data to answer real questions by itself. Source fail.  
 

MusicBrainz  

This is a database first, not a guide, and thus a more likely candidate for well-structured data anyway, and one where I won't pick at their explicitly-secondary browsing UI.  

The good news is that MusicBrainz has the kind of data we need. They have some relationships between tracks, like one being a mashup of some others, so presumably they could add one to express that the Mono Masters of "Day Tripper" is a different version from the Past Masters [Remastered] one, but the same underlying song. They already have a reconciliation mechanism by which they can say that the "Day Tripper" on 1962-1966 is "the same" as the one on Past Masters, although at the moment the reconciliation data looks too noisy for real use.  

They even have the notion of one release being part of a set, although I didn't find very many examples of sets, and in particular I can't tell if a release can be part of more than one set. But if they can, that might be a mechanism for expressing official catalogs, current availability, and various other kinds of groupings and subsets.  

So current source fail, but at least there's hope here.  
 

Freebase  

Freebase is easily the most sophisticated public attempt at universal data-modeling, at the moment, but this is a caveat as well as a compliment. Freebase models attempt to represent everything that could possibly exist, and thus tend to drift quickly from usable simplicity towards abstractly-correct awkwardness, usually coming to rest far into the latter.  

So if you search on "Day Tripper", you will find that there are results of that name as "Topic", "Composition", "Musical Album" and "Musical Track", with at least dozens of the latter. Freebase fails the usability test even more spectacularly than All Music, as the list of Musical Tracks is presented with no grouping or distinguishing information at all, just "Day Tripper (Musical Track)" after "Day Tripper (Musical Track)", and you have to click on each one to get any clarifying info. "Day Tripper" the composition does not link to any of the tracks, and "Day Tripper" the album turns out to be a compilation of Beatles covers which does not, at least as far as the listed information shows, even contain the song "Day Tripper".  

And if you delve into the internals of the Freebase music schema, you can quickly develop a guess about why the data has not all been filled in: there's too damn much structure. A music/track is a recording of a music/composition. The track can appear on multiple music/releases, each of which is a publication of a particular music/album. Unless you need to model who was in the band during the making of an album, in which case the album links instead to a set of music/recording_contributions, each of which is a combination of albums, contributors and roles.  

Oh, except compositions can be recorded "as albums", in which case they link to music/album without going through music/track, and tracks can appear directly on releases without going through music/album. And there's no current property for saying that a given track is an alternate version of another, but from following Freebase modeling discussions I can confidently guess that they'd model that by saying that a music/track links to something like a music/track_derivation, which itself is a combination of original track, derivative track, deriving artist (or music/track_derivation_contribution) and derivation type. And Freebase's query-language doesn't provide any recursion, so if these relationships chain, good luck querying them.  
 

Music Ontology  

This isn't a database, just an attempt at a model for one. And, grimly, it's another quantum level more elaborately and unusably correct than the Freebase model. Even "MO Basics" (and the "Overview" has 22 more tables of explication beyond these "Basics", without getting into the "details") includes conceptual distinctions between Composition, Arrangement, Recording, Musical Work, Musical Item, MusicalExpression and MusicalManifestation. And then there are pages upon pages of minutely itemized trivia like beginsAtDuration, djmix_of, paid_download (and "paiddownload", which is different in some way I couldn't figure out), AnalogSignal, isFactorOf... This list is bad because it's too long, but the fact that it's in the schema means that it's also bad because no matter how long it is, it will never include every nuance you ever find yourself wanting, and thus over time it will only accumulate more debris.  

A tour-de-force into a cul-de-sac.  
 

The Rest of the Web  

Searching on any particular band or bit of music will unearth dozens or hundreds of other sites that contain bits of the information we need: stores, discographies, databases, forums, fan pages, official sites. Almost universally, these are either unqueriable flat HTML pages, or tree-structured databases with even less interlinking than the above sites. Encyclopaedia Metallum, my favorite metal site, has full track listings for a genuinely mind-boggling number of releases by an astonishing number of bands, but the tracks themselves are not data-objects and a machine can find out nothing about them. There are several lovingly hand-crafted Beatles discographies on the web, all far too detailed for our original casual query, and all essentially useless to a computer attempting to help us.  

So: Ugh. Triple ugh because a) the population of people willing to put time and energy into filling out music-related data-forms is obviously huge, b) the modeling problems are not intractably complicated in any theoretical sense, c) MusicBrainz and Freebase, at least (and the system I'm designing at work, I think), seem to be technically sufficient to represent the data correctly. If only we had a better plan.  
 

DiscO  

So here's my attempt at a better plan. I call it DiscO, for Discographic Ontology; that is, it's a scheme for structuring discographies. It is not an attempt at an abstract physics of human air-vibrating creativity, it is just an outline of a way to write down the information about bands, the music they've made, and how that music was released. It's intended to be simple enough that you can imagine people actually filling in the data, but expressive enough that the data can support interesting queries. And it's specifically intended to model nuance abstractly, so that it can accommodate some kinds of new needs without perpetually having to itself expand.  
 

There are four basic types:  

Artist - The Beatles, Big Country, Megadeth, Frank Zappa, whatever...  

Release - an individual album, single, compilation, whatever; Rubber Soul, Past Masters [Remastered], "Day Tripper"/"We Can Work It Out"...  

Track - an individual version of a song; "Day Tripper", "Day Tripper [mono]", "Day Tripper (performed live on pan flute and triangle by Zamfir and Elmo)", etc.  

Sequence - any collection of releases; Original Albums, Japanese Cassette Singles, 2009 Remasters, etc.  
 

These are related to each other like this:  

Artists mostly have Sequences. Sequences can be anything, but many artists would have some standard ones: Original Albums, Singles, Compilations, Remastered Albums, Current Catalog.  

Sequences have Releases (and Artists).  

Releases have Dates, Labels and Tracks. A Release may have an Artist directly, but more often would have one indirectly via a Sequence.  

Releases may be related to each other via Alternate Version/Original Version links. Thus Past Masters, Vol. 1 & 2 and Past Masters [Remastered] are both Releases, but Past Masters [Remastered] has an Original Version link to Past Masters, Vol. 1 & 2, and Past Masters, Vol. 1 & 2 has an Alternate Version link to Past Masters [Remastered].  

Tracks have Durations. A Track may have an Artist directly (so individual tracks on multi-artist compilations can be attributed correctly), but more often would have one indirectly via Release (which itself may have one indirectly via Sequence).  

Tracks may also be related to each other via Alternate Version/Original Version links. "Day Tripper" and "Day Tripper [mono]" are both Tracks, but "Day Tripper" has an Alternate Version link to "Day Tripper [mono]", and "Day Tripper [mono]" has an Original Version link to "Day Tripper". (We can get into geek arguments about which versions are the same and which are derivations (of which!), if we want, but whatever we decide, we can model.)  

Restated in schema-ish form, that's:  

Artist
- Sequence
- Release
- Track  

Sequence
- Artist
- Release  

Release
- Sequence
- Artist
- Date
- Label
- Track
- Original Version
- Alternate Version  

Track
- Artist
- Duration
- Original Version
- Alternate Version  

I think that's basically enough. What it gives up in expressiveness, it gains in usability. Our Beatles data can now, I think, be modeled both tractably and informatively. We can hook up all the versions of albums and versions of songs. We can create whatever sequences we need, and since the sequences themselves are just data, it's fine to have "Canadian Singles" for the Beatles and "Fanzine Flexis" for The Bedsitters without implying that either band should also have the other.  
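If it helps to see the schema in a more conventional notation, here is one possible rendering of the four types as Python dataclasses. This is purely illustrative; the field types (like duration as seconds) are my guesses, not part of DiscO:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Artist:
    name: str

@dataclass
class Track:
    title: str
    duration: Optional[int] = None       # e.g. seconds; my assumption
    artist: Optional[Artist] = None      # usually inferred via Release
    original_versions: list["Track"] = field(default_factory=list)
    alternate_versions: list["Track"] = field(default_factory=list)

@dataclass
class Release:
    title: str
    date: Optional[str] = None
    label: Optional[str] = None
    artist: Optional[Artist] = None      # usually inferred via Sequence
    tracks: list[Track] = field(default_factory=list)
    original_versions: list["Release"] = field(default_factory=list)
    alternate_versions: list["Release"] = field(default_factory=list)

@dataclass
class Sequence:
    name: str
    artist: Optional[Artist] = None
    releases: list[Release] = field(default_factory=list)
```

The point of the small surface area is visible even here: four classes, no Single-Artist/Multiple-Artist split, and the "usually" relationships are just optional fields that inference fills in at query time.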

And using Thread, the query-language I will (before long, hopefully) be attempting to spread through the universe, we can start to ask our questions in a way the computer can answer:  

Track:=Day Tripper.Release
 

This is our naive query. It gets all the releases that have any track called exactly "Day Tripper". Good for assuring us there's some data in the bucket, but not much help in answering our question.  

Track:=Day Tripper.Release:(.Artist:=The Beatles)
 

That limits our results to albums by the Beatles, but there are still too many. With our fully-interlinked data-model, though, we can now actually ask something that is much closer to what we mean:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track:=Day Tripper)
 

That is, find the artist The Beatles, get their Current Catalog sequence, get that sequence's releases, and filter those releases down to the ones that contain a track called exactly "Day Tripper". This is progress.  

But "called exactly 'Day Tripper'" will exclude "Day Tripper [mono]", which isn't what we want to do. We're trying to ask a musical question about a song, not a typographical question about a title. But this, too, we have the powers to cope with:  

Track|Day Trippers=(...__,Original Version:=Day Tripper)  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)|
DT Versions=(.Track:Day Trippers=?),
Other Tracks=(.Track:Day Trippers=_._Count)
 

This time we first define a new inferred relationship on Track called "Day Trippers", which gets the Track, all its Original Versions, all their Original Versions (recursively), and then filters this set of tracks down to just the ones called "Day Tripper".  

Then we get the Beatles' current catalog releases again, but this time instead of checking each release for a track named "Day Tripper", we use our Day Trippers relationship to check for a track that is, or is derived from, "Day Tripper". And then, for each of the releases that have one, we infer two new relationships: "DT Versions" tells us which track(s) on this release are versions of "Day Tripper", and "Other Tracks" counts the tracks on this release that are not derivations of "Day Tripper".  

I.e.:  

#   Release                     DT Versions          Other Tracks
1   Past Masters [Remastered]   Day Tripper          32
2   Mono Box Set                Day Tripper [mono]   212
3   Stereo Box Set              Day Tripper          238
 

So now we know our choices. It took us so long to find out, but we found out.  
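For the Thread-averse, the "Day Trippers" inference reduces to a couple of small recursive functions. This Python sketch uses invented dict shapes, but the logic is the same: follow Original Version links until you reach (or don't) a track titled exactly "Day Tripper", then tally per release:

```python
def derives_from(track, title):
    # A track "is" the song if it has that exact title, or if any chain
    # of Original Version links eventually reaches a track that does.
    if track["title"] == title:
        return True
    return any(derives_from(orig, title)
               for orig in track.get("original_versions", []))

def report(releases, title="Day Tripper"):
    # For each release containing a version of the song, list the
    # matching track titles ("DT Versions") and count the rest.
    rows = []
    for rel in releases:
        versions = [t["title"] for t in rel["tracks"]
                    if derives_from(t, title)]
        if versions:
            rows.append((rel["title"], versions,
                         len(rel["tracks"]) - len(versions)))
    return rows
```

Fed the three current-catalog releases, this produces the same three rows as the Thread query above.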
 
 
 

Tantalizing Postscript: But now that we have these three options before us, how do their contents overlap or differ, track by track?! We could bring up three different windows and squint at them. Or we could ask the computer:  

Artist:=The Beatles.Sequence:=Current Catalog.Release:(.Track.Day Trippers)
/(.Track...__,Original Version:##1)/=nodes
 

Aaah. I see now. (How nice it will be when I'm allowed to show you...)
In his post explaining his departure from Google, Douglas Bowman says "Yes, it's true that a team at Google couldn't decide between two blues, so they're testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can't operate in an environment like that."  

He's not really trying to have an opinion-changing last word in an argument against data-driven product decisions, but if he were, this is not how to do it. If you believe that the proper width of a border can be tested, then Bowman's refusal to subject his intuitions to quantitative confirmation just sounds like petulant prima-donna nonsense. If you can test 41 shades of blue, this line of reasoning goes, you don't need to guess, so a guessing specialist is an annoying waste of everybody's time.  

The great advantage of testing and data, of course, is that you get precise, decisive answers you can act on. Shade 31, with 3.7%, trouncing runner-up shade 14 with only 3.4%! Apply shade 31, declare progress.  

But the great disadvantage of testing and data is that you get precise, decisive answers you can and will act on, but you almost never know what question you really asked. Sure, the people who saw shade 31 did some measurable thing at some measurable rate. But why? Is it shade 31? Or is it the contrast between shade 31 and the old shade? Or is it the interplay between shade 31 and some other thing you aren't thinking about, or possibly don't even control? Are you going to run your test again in a month to see if the results have changed? Did you cross-correlate them with HTTP_REFERER and index the colors on the pages people came from? What about all the combinations of these 41 shades and 41 backgrounds and 8 fonts and 3 border widths (12 if you vary each side of the box separately!) and 41 line-heights and 19 title-bar wordings and the color of the tie Jon Stewart was wearing the night before? Which things matter? How do you know?  
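To make the scale of that combinatorial objection concrete, here's the back-of-envelope arithmetic in Python (Jon Stewart's tie omitted):

```python
# Number of cells in the full factorial design the paragraph describes,
# using per-side border widths.
from math import prod

variants = {
    "blue shades": 41,
    "backgrounds": 41,
    "fonts": 8,
    "border widths, per side": 12,
    "line-heights": 41,
    "title-bar wordings": 19,
}
print(f"{prod(variants.values()):,}")    # 125,711,904 combinations
```

Over a hundred million cells, before adding a single new variable or rerunning anything next month.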

You don't. And if you need to add some new element, tomorrow, you don't know which of the tests you've already run it invalidates. Are you going to rerun all of them, permuted 41 more new ways?  

No. You are going to sheepishly post a job opening for a new guessing specialist. Bowman already had his last word. It was "Goodbye".
Site contents published by glenn mcdonald under a Creative Commons BY/NC/ND License except where otherwise noted.