furia furialog · The War Against Silence · Aedliga (songs) · photography · code · other things     ↑vF
25 September 2012 to 24 March 2011
Somewhere around 1992 or so, I had to write my "personal goals" at work for the first time. I put down some stuff about design and usability, which I have long since forgotten, but I distinctly remember that the last item was "Exploit the power of information technology to improve my music collection."  

My job, at the time, had nothing to do with music. It did provide me with an internet connection for downloading Gary Numan discographies from Usenet, and it did once send me to a UI conference in Amsterdam, from which I came back with approximately 130 pounds of European CDs. But the connection from what I was doing to what I was listening to remained a little abstract.  

But that job got me the one after that, which got me the one before this, which got me to now. The scope of "music collection" has expanded quite a bit over this time, obviously, but in fact I am very literally now paid to exploit the power of information technology to improve the world's experience of music.  

The vast majority of things I work on, to this end, involve identification or disambiguation or extrapolation or interpolation. They are corrective or exploratory or contextual. They try to carry you from somewhere to somewhere else, or to rescue you from something squelchy along the way. They try to get computers to understand enough about human cues and contexts and constraints to fill humanless spaces with rough facsimiles of what humans would have suggested if they'd had the time.  

But I just did one that isn't so much like this. One of the computed song metrics I've been working on is Discovery, which attempts to quantify the idea of songs emerging. This is a detection metric, not an aesthetic one. It isn't trying to tell you songs you should listen to, it's trying to find the songs that people are in fact starting to listen to more than the established prominence of the artists can explain. Discovery songs tend to be new, but a fair number of songs get their unexpected breaks after they've already been out for a year or two. And once enough people discover a Discovery, obviously, it stops qualifying.  

There are a lot of fairly arbitrary thresholds and weights involved in this calculation. How much prominence constitutes critical mass, and how much constitutes overexposure? How old can something be and still be sorta new? And the notion is inherently subjective, of course: anything you've heard a dozen times yourself is no longer a discovery to you, whereas a song 1.3 billion people have been dancing to across Asia every morning for the last 6 months still could be. And there is no way to stipulate correct answers against which this score can be quantitatively tested.  

But provable correctness is overrated, or at least sometimes irrelevant. Music discovery is working if you can use it to discover music. And now you can test this particular method yourself. Starting today I'm going to be maintaining an official Echo Nest playlist on Spotify with the top 100 songs of the moment according to this discovery score. It's here:  

The Echo Nest Discovery  

If you subscribe to it, each week's new songs will get flagged for you automatically.  

The songs the robots find cheerfully comply with no particular style or pattern. The set is not coherent in any human sense. It's in rank order, but the ranking logic will not be evident or audible. There will be interminable dubstep remixes next to country laments next to Christian hardcore next to Azorean folk traditionalism. You should expect to hate half of these, and find half of the rest inscrutable. This is definitively impersonal. Somebody, somewhere, liked each of these songs, but we've stripped them of any idea of whom or why.  

And yet, the fact that you don't know those people doesn't mean you won't agree with them. I endorse this discovery experiment on the purely empirical grounds that I have personally discovered things this way myself. So, for the record, I'm also going to maintain a personal playlist of what I liked from the robot one:  

Treasures the Robots Brought Me  

My suspicion, which you're welcome to evaluate for yourself, is that my human picks will be essentially no less random for you than the overall robot list. But if you find something, either way, we both win.  

[I'm not sure how the robots win. Probably they win no matter what. If I have a non-musical goal to add at this job, it's probably to keep trying to make sure the human/robot thing is never zero-sum...]
Just as night begins to fall, you come across some people. They're cold, and they're scared, but they've made a pile of wood and they're trying to start a fire.  

The Salesperson's Dilemma is: How much can you charge them for matches?  

The Businessperson's Dilemma is: If you give them some matches for free tonight, can you sell them new axes in the morning?  

Then they get the fire started.  

And now the Entrepreneur's Dilemma is: Can you persuade them to try the bizarre experiment of rearranging the pile of wood into a big hollow box, called a "house", faster than they can burn the rest of the wood?
I have a post up on the company blog at work today. It's about music or math, depending on your perspective.
We will look back on these days, I think, as some weird interlude after the invention of computers but before we actually grasped what they meant for us. The Age we are stumbling towards, I am very sure, is the Age of Data. And when we get there, we will be there because we have sublimated the state-machine mechanics of computers beneath the logical structural abstractions of information and relation, and begun to inhabit this new higher world without reference to its substrate.  

I spent 5 years of my life trying to help bring this future about. That is, in a sense I've spent my whole adult life trying to help bring this future about, but for those 5 years I got to work on it very directly. I designed, and our team built, an attempt at a prototype of what a new data exploration system could be like, and at the core of this was my attempt at a draft of a language for discussing data the way algebra is a language for discussing math. These are the elements out of which this new age's alchemies will be constituted. And there were moments, as the system began to come into its own, when I felt the twitches of power awakening. You could conjure shapes out of data with this thing. It made information malleable, made it flow.  

The computer programmers on the team sometimes referred to the project as a system for "non-programmers", and I've come to think of that as both its potential and its downfall. Programmers never say "non-programmers" as a compliment. At best it's merely condescending, at worst it's a euphemism for "idiot" or a semi-aware admission of incomprehension. For programmers, programming is by definition an end, not a means, and therefore the motivations of non-programmers are inherently mysterious and alien. But what we built was for non-programmers in the same way that a bridge is for non-engineers. That is, the whole point of it was to represent a different interaction model between people and information than the ones offered by, at one end, programming languages, and at the other spreadsheets and traditional database programs. As I said over and over throughout those 5 years, I was trying to get us to do for hyper-connected datasets what VisiCalc once did for columns of numbers. I wasn't trying to simplify; if anything, I was making some things harder, or at least less familiar. This new age is not a subset of a previous age. It is not for lesser people, and its challenges are not of a simpler character.  

And as Google now shuts that system down, literally unceremoniously, and 5 years of my work and dreams and visions are at least nominally obliterated, I feel a little sadness but mostly relief. I'm still very convinced that our tools -- humanity's tools -- for interacting with data are hopelessly primitive. I'm still convinced that it won't make a whole lot of difference what those tools are if kids don't grow up learning how to think about data in the first place. I'm still convinced that I have a blurry, fractured vision of what it might take to change these things.  

But I also realize two more things.  

First, the system we built was only a beginning, and it had hardened into a premature finality long before its official corporate fate was settled. The query language I invented was cool, but the successor to it, which I'm sketching in my head whether I want to or not, is a different sort of thing yet again. And I was never going to reach it incrementally, arguing over every syntax decision on the way. Sometimes you have to just start over. The next one will not aspire to be the Visicalc of anything. It's not better business tools we need. The problem is not that we are alienated from our inner accountants. The thing we need first is not even an algebra of data, probably, but an arithmetic of data. We need an inversion of "normalization" in which you don't write data wrong and then endure six Herculean labors to make it obscurely more pleasing to capricious gods, but rather a way of writing it in the first place with an inherent expressive gravity towards truth because more true is always more powerful. This is a task in applied philosophy, not programming and not engineering and not even science. We need to imagine what Plato would have done when his record collection got too big for his cave.  

Second, I still believe that we all deserve better tools, tools more suited for our actual tasks and needs as people whose lives and choices and options are increasingly functions in, not merely of, information. But in the process of exploring what I mean by that I've become a non-non-programmer myself. At my new job I am an engineer. And sometimes, when you think you know what the better world looks like, you can bring pieces of it up out of your dreams. You can walk where the new paths will be. With enough belief, you can walk where the bridges will be. I will come back to these paths, one way or another, but you never do great things by imagining what people you don't understand might want for purposes you don't grasp or embrace. You should trust your own judgment only where you love beyond reason. Anybody could do nearly anything with Needle, and the business cases for it all involved hypothetical big companies doing hypothetical big things with hypothetical big data that repeatedly never actually materialized (and might have been hypoethical if they had). But left to my own invented devices, I always ended up using it for music data.  

So I have followed my own love, and my own obsessions, deeper into that data. At my new job, I am trying to make sense of the largest music database in the world, which is a lot more fun than what I was doing before, and harder, and of rather more direct and demonstrable relevance to anything. On my own, I will continue the music projects I started in Needle. The Discordance evolved out of empath, and so I've evolved it back in, with less marginalia but maybe more coherence. For the Pazz & Jop I've built a stats site far more specific than I could ever have done in the generalized environment of Needle. These will grow as I play with them, and probably there will be other things. I spent 5 years trying to build fancy tools, but it's pretty amazing what you can do with just a hammer. I was Needle's most dedicated user, but in the end, both sadly and happily, I don't actually need it any more. Nobody will miss it more than I will, but maybe nobody will really miss it very much. The moral, I think, and maybe even the ethic, is that these systems do not matter. This isn't the first system I worked on only to see it shut down, and it won't be the last. Software is the epitome of ephemera, necessary in aggregate but needless in every mundane specific.  

But the things we learn from these systems stay learned. Even the ways of learning remain ways after their original demonstrations disintegrate. This is another phrasing of the point about this Age, in fact: the flow from Data to Information to Knowledge to Wisdom is not a function of syntax or platforms or prevalence or virtualization. It is something we do, to which the technology is merely witness. We must teach our children how to think about data because the data survives where the systems fail. We must teach ourselves to be children again in this new Age, because its most transformative truths still await discovery, and are anything but mundane or needless, and we will never recognize them unless we can recall what it felt like in our hearts when everything was amazing and new and ahead of us, and the act of waking was an invitation to wonder to show us a way.
· My version (the long form)
· Other people's version  

Metal (ZIP download, 207MB)  

1. Blood Stain Child: "Stargazer" (4:21)
2. Elizium: "Nemesis" (5:29)
3. Unleash the Archers: "Realm of Tomorrow" (5:15)
4. Pantheist: "Be Here" (10:47)
5. Subway to Sally: "Nichts ist fur immer" (3:37)
6. Dornenreich: "Fahrte Der Nacht" (4:49)
7. Jesu: "Sedatives" (5:10)
8. Terra Tenebrosa: "Probing The Abyss" (6:03)
9. Thy Catafalque: "Fekete mezõk" (9:20)
10. Lifelover: "Expandera" (3:49)
11. Wolfchant: "Black Fire" (4:20)
12. Avven: "Ros" (3:13)
13. Heretoir: "Fatigue" (7:17)
14. Alcest: "Elévation (Rerecorded)" (13:26)
15. Agrypnie: "Augenblick" (7:28)
16. Frijgard: "Frijgard" (5:39)
17. Dalriada: "Mennyei Harang" (6:16)
18. Arven: "Dark Red Desire" (4:12)
19. Oak Pantheon: "Architect of the Void Pt II" (6:45)
20. Kampfar: "Bergtatt (In D Major - Bonus Track)" (5:28)
21. In Flames: "Sounds Of A Playground Fading" (4:43)
22. Andraste: "Black Birds Of Carrion" (5:16)
23. Nemesea: "Afterlife" (3:12)
24. Leviathan: "Blood Red And True" (6:57)  

Non-Metal (ZIP download, 187MB)  

1. Airborne Toxic Event: "All At Once" (5:16)
2. Gazelle Twin: "Changelings" (3:23)
3. Joy Formidable: "The Greatest Light Is The Greatest Shade" (5:19)
4. Harris, Emmylou: "Hard Bargain" (3:22)
5. Hatfield, Juliana: "Failure" (3:35)
6. Keene, Tommy: "Already Made Up Your Mind" (3:29)
7. M83: "OK Pal" (3:58)
8. Low: "Nothing But Heart" (8:10)
9. Sounds: "Won't Let Them Tear Us Apart" (4:07)
10. Blondie: "Mother" (3:09)
11. Sloan: "Green Gardens, Cold Montreal" (2:01)
12. The Decemberists: "Calamity Song" (3:50)
13. Mountain Goats: "Damn These Vampires" (3:24)
14. Buckner, Richard: "Escape" (3:32)
15. Roxette: "She's Got Nothing On (But the Radio)" (3:33)
16. Roxette: "Speak to Me" (3:41)
17. Hatfield, Juliana: "Don't Wanna Dance (Brad Walsh Remix)" (3:09)
18. Clarkson, Kelly: "I Forgive You" (3:04)
19. Reik: "No Te Quiero Olvidar" (3:38)
20. Magnum: "Wild Angels" (5:41)
21. Mogwai: "George Square Thatcher Death Party" (3:59)
22. Kerretta: "A Ways to Uprise" (5:29)
23. Emily Bezar: "May In Mesolimbia" (18:21)
24. Ulver: "Stone Angels" (14:53)
25. M83: "Raconte-moi une histoire" (4:07)
[May 2012 note: Needle, the database system I used to collect, analyze and show this information, was acquired and shut down by Google. Thus many of the links below go to non-functional snapshots of Needle pages I took before the shutdown. The points should survive.]  

Boston Magazine recently published their annual Best Schools ranking. They've been doing this for years, and are known for various other Boston rankings as well (places to live, places to eat...), so by now you'd expect them to be pretty good at it.  

Here's what "pretty good at it" amounts to, in 2011: two lists of 135 school districts, one with some configuration information (enrollment, student/teacher ratio, per-pupil spending, graduation rate, number of sports teams, what kind of pre-k they offer, how many AP classes), the second with test scores, and exactly this much methodological transparency: "we crunched the data and came up with this".  

Some obvious things that you can't do with this information:  

- sort it by any criteria other than the magazine's rank
- see the stuff in the first table alongside the stuff in the second
- understand which figures are actually part of the ranking, in what weights
- fact-check it
- compare it in bulk to any other information about these schools
- compare it to any other information about the towns served by these districts
- figure out why certain towns were included or excluded
- find out what towns are even meant by non-town district names if you don't already happen to know
- evaluate the significance of any individual factor, or the correlations of any set of them  

This is not a proud state of the art. And the quality of secondary journalism around it emphasizes the point further: this article about Salem's low ranking basically just turns a table-row into prose sentences, with no context or analysis, and fails to even realize that the 135 districts in the ranking represent just the immediate vicinity of Boston, not the whole state. This Melrose article claims Melrose "climbed" from 97th last year to 94th, but then has to add a note that last year's ranking was of high schools, not whole districts, and thus not even the same thing. Swampscott exults in making the top 50. Malden fights back at being ranked 119th. But nobody actually knows what the rankings mean or signify, because Boston Magazine doesn't say.  

In an attempt to improve this situation a little, I imported these two tables of information into Needle:  


This in itself was sufficient to unify the two tables and render them malleable, which seems to me like the most basic start. Now at least you can re-sort them yourself, and choose what to look at next to what.  

And a little sorting, in fact, quickly reveals some statistical oddities. North Attleborough was listed with an SAT Reading score of 823, which since SAT scores only go up to 800, is very obviously wrong. Some trivial research verifies that this was a typo for 523, and while typos happen in journalism all the time, a typo in data journalism is a dangerous indication that some human has been retyping things by hand, which is never good. (This datum has now been fixed in the magazine's table.)  

More interestingly, when you start scrutinizing each district's 5th/8th/10th-grade MCAS scores, you find some surprising skews. Here are the MCAS and SAT scores for Georgetown:  

MCAS 5 English: 74
MCAS 5 Science: 54
MCAS 5 Math: 42  

MCAS 8 English: 81
MCAS 8 Science: 36
MCAS 8 Math: 51  

MCAS 10 English: 92
MCAS 10 Science: 90
MCAS 10 Math: 88  

SAT Reading: 570
SAT Writing: 566
SAT Math: 584  

Boston Magazine says they "looked within those districts to determine how schools were improving (or not) over time". But that's not what these scores are measuring. These aren't time-slices for a single cohort, these are different tests being given to different kids. If you're interested in history, the Department of Education profile of Georgetown includes annual MCAS results for 2006-2009, and all you have to do is scan down the page to spot the weird anomaly that is 8th grade Science. Every other test has healthy dark-blue bars for "Advanced" scores; but in 8th grade Science virtually no kids managed Advanced scores in any year. This pattern repeats in Wellesley in an even more dramatic fashion. An article from Wellesley Patch explains that their 8th grade science curriculum doesn't cover "space", while the MCAS does. It's an interesting ideological question whether curricula should be matched to the standardized tests, but whatever your opinion on that, it seems clearly misleading to interpret this policy issue as a quality issue.  

A little more sorting repeatedly raised another question: why is Cambridge ranked 25th? In virtually every test-score-based sort it falls close to the bottom of the table. In the magazine's ranking, Cambridge comes in ahead of Westford, at #26. But observe these scores for the two:  

MCAS 5 English: 59 - 88
MCAS 5 Math: 53 - 86
MCAS 5 Science: 45 - 85
MCAS 8 English: 75 - 95
MCAS 8 Math: 45 - 86
MCAS 8 Science: 34 - 78
MCAS 10 English: 70 - 97
MCAS 10 Math: 77 - 95
MCAS 10 Science: 59 - 94
SAT Reading: 498 - 587
SAT Writing: 493 - 582
SAT Math: 503 - 602
Graduation Rate: 85.2 - 94.6  

This doesn't even look close. But then notice these:  

Students per Teacher: 10.5 - 14.6
Per-Pupil Spending: $25,737 - $10,697  

Cambridge's spending per student is remarkable. It's almost 50% higher than the next highest, which is Waltham at $18,960. The 10.5 students per teacher is also the best ratio of the 135 schools listed, with 115th-ranked Salem in second place with 11. These factors seem like they should matter, and clearly they must be part of the magazine's ranking calculation, but if they're so uniformly not translating to better test scores or graduation rates in Cambridge, does this really make any sense?  

At least we ought to be able to say that these, along with the other non-test characteristics in the magazine like the number of sports teams and the number of AP classes, are different sorts of statistics than test scores. This seems increasingly true as you start looking at them in detail. Plymouth is listed as having 94 sports teams, for example. Can you name 94 different sports? I can't, and the Plymouth High School web site only claims they participate in 19. Newton is listed as having 39 AP classes, and Boston as having 155. But there are only 34 AP subjects, so it seems like a pretty safe guess that in these two multi-high-school districts the magazine is adding the totals for each school. It's hard for me to see what that accomplishes.  

So for my own interest, at least, I created my own Quant Score, which is calculated like this:  

- take all 9 of the listed MCAS scores, drop the lowest one, and sum the other 8
- divide each of the three SAT scores by 3, to put them into a range where they're each worth around twice as much as an individual MCAS score, and add those in
- multiply the graduation rate by 2, to put it into a similar range to the SAT scores, and add that in, as well  

These factors are admittedly arbitrary, so you're welcome to try your own variations, but at least I'm telling you exactly what goes into mine, so you know what you're varying against. I deliberately left out all the other descriptive metrics from this calculation, including student/teacher ratio and spending. I then reranked the schools according to these Quant Scores. See the comparison of the magazine's ranking and mine here.  

The differences are pretty dramatic. Three schools from outside the magazine's top 20 move into my top 10 (and 2 more from outside the magazine's top 10). The magazine's #s 6 and 8 drop to 28 and 30 in my list. Watertown and Waltham drop from 53 and 54 in the magazine to 100 and 114 in my list. Swampscott will be displeased to see that my re-ranking them sends then back out of the top 50. Malden will probably not be much appeased that I've bumped them up from 119 to 118. Acton and Winchester will be thinking about staging parades for me. And Cambridge (where I live, and where my pre-K daughter will go to school unless we take some drastic action) plunges from 25th to 107th.  

But these are not answers, these are more questions. Most obviously: Why? I'm not claiming my Quant Score is definitive in any way, but it measures something, and I'm willing to claim that what it measures is something more coherent than what the magazine's rank measures. So this sets me off on the quest for better explanations, for which we obviously need more data.  

Needle is good at integrating data, so I have integrated a bunch of it: per-capita incomes, town populations, unemployment rates, district demographic breakdowns, lunch subsidy percentages and 2010 election results. Some of these apply to towns, not districts, and several districts serve multiple towns, but Needle loves one-to-one and one-to-many relationships the same, so I've done properly weighted multi-town averages. (Don't try that in a spreadsheet.)  

And then I started comparing things. Per-pupil spending seems like it ought to matter, but it shows very little statistical correlation to quant scores. Student/teacher ratios, sports-team counts and AP classes also seem like they ought to matter, but the numbers don't support this.  

Per-capita income, on the other hand, matters. The percentage of students receiving lunch subsidies matters even more. In fact, this last factor (the precise calculation I used was adding the percentage of students receiving free lunch and half of the percentage of students receiving partially subsidized lunch) is the single best predictor of quant score that I've found so far. This is depressingly unsurprising: poverty at home is hard to overcome: hard enough for individuals, and even harder in aggregate.  

With this in mind, then, I ran a quick linear regression of quant score as a strict function of lunch-subsidy percentage, and used that to calculate predicted quant scores for each district. The depressing headline is how small those variations are. In a quant-score range from 1531 to 727, only 10 districts did more than 100 quant points better than predicted, and only 10 districts did more than 100 points worse. If I use the square roots of the lunch-subsidy percentages, instead, only 6 districts beat their predictions by 100, and only 8 miss by 100.  

If I toss in town unemployment rates, Democratic vote percentages in the 2010 Senate election, and town per-capita income, I can get my predictions so close that only 1 school did more than 100 points better than expected, and only two did more than 100 points worse. This is daunting precision.  

But OK, even if the variations are small, they're there. So surely this is where those aspirational metrics like spending must come into play. Throwing money at students in school may not be able to counteract poverty at home, but doesn't it at least help?  


Students per Teacher? No.
AP classes? No.
Percentage of minority students? No.  

I'm by no means saying that there isn't an explanation, or more of an explanation, or other factors. But if there are, I haven't found them yet.  

But at least I'm trying. And I give you the data so you can try, too. I submit that this is what data journalism should be trying to do. We are trying to find knowledge in data. Secrecy and opaqueness and non-interactivity are counter-productive. It's more than hard enough to find truth even with all the data arrayed in front of us. If there's an equivalent of the Hippocratic Oath for data journalists, it should be that we will endeavor to never make the truth more obscure.  

[Space for discussion here.]  

[Postscript, 10 September: The more I thought about that 823/523 error, the more I worried that there might be other errors that weren't as obvious, so I used Needle to cross-check all the test-scores against the official DOE figures. Two more were wrong. Manchester Essex's SAT Reading score was 559, not 599, which I'm guessing would lower their #6 magazine rank, perhaps considerably. In my rankings it dropped them from 28 to 31. Ashland's SAT Reading score was also wrong, 531 not 537, but this didn't change their rank in my method. Both corrections moved those schools' scores closer to my predictions.]  

[Postscript, 12 September: But charter schools do better relative to expectations, right? Nope.]
People stop me on the street, with wild looks in their eyes, and ask "What are the best Scrabble racks?!"  

[From a note to the public-lod@w3.org mailing list.]  

Data "beauty" might be subjective, and the same data may have different applicability to different tasks, but there are a lot of obvious and straightforward ways of thinking about the quality of a dataset independent of the particular preferences of individual beholders. Here are just some of them:  

1. Accuracy: Are the individual nodes that refer to factual information factually and lexically correct. Like, is Chicago spelled "Chigaco" or does the dataset say its population is 2.7?  

2. Intelligibility: Are there human-readable labels on things, so you can tell what one is when you're looking at it? Is there a model, so you can tell what questions you can ask? If a thing has multiple labels (or a set of owl:sameAs things have multiple labels), do you know which (or if) one is canonical?  

3. Referential Correspondence: If a set of data points represents some set of real-world referents, is there one and only one point per referent? If you have 9,780 data points representing cities, but 5 of them are "Chicago", "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and "Chicagoland", that's bad.  

4. Completeness: Where you have data representing a clear finite set of referents, do you have them all? All the countries, all the states, all the NHL teams, etc? And if you have things related to these sets, are those projections complete? Populations of every country? Addresses of arenas of all the hockey teams?  

5. Boundedness: Where you have data representing a clear finite set of referents, is it unpolluted by other things? E.g., can you get a list of current real countries, not mixed with former states or fictional empires or adminstrative subdivisions?  

6. Typing: Do you really have properly typed nodes for things, or do you just have literals? The first president of the US was not "George Washington"^^xsd:string, it was a person whose name-renderings include "George Washington". Your ability to ask questions will be constrained or crippled if your data doesn't know the difference.  

7. Modeling Correctness: Is the logical structure of the data properly represented? Graphs are relational databases without the crutch of "rows"; if you screw up the modeling, your queries will produce garbage.  

8. Modeling Granularity: Did you capture enough of the data to actually make use of it. ":us :president :george_washington" isn't exactly wrong, but it's pretty limiting. Model presidencies, with their dates, and you've got much more powerful data.  

9. Connectedness: If you're bringing together datasets that used to be separate, are the join points represented properly. Is the US from your country list the same as (or owl:sameAs) the US from your list of presidencies and the US from your list of world cities and their populations?  

10. Isomorphism: If you're bring together datasets that used to be separate, are their models reconciled? Does an album contain songs, or does it contain tracks which are publications of recordings of songs, or something else? If each data point answers this question differently, even simple-seeming queries may be intractable.  

11. Currency: Is the data up-to-date? As of when?  

12. Model Uniformity: Are discretionary modeling decisions made the same way throughout the dataset, so that you don't have to ask many permutations of the same question to get different subsets of the answer? Nobody should have to worry whether some presidents and presidencies are asserted in only one direction and some only the other.  

13. Attribution: If your data comes from multiple sources, or in multiple batches, can you tell which came from where? If a source becomes obsolete or discredited or corrupted, can you take its data out again?  

14. History: If your data has been edited, can you tell how and when and by whom? Can you undo errors, both individual (no, those two people aren't the same, after all) and programmatic (those two datasets should have been aligned with different keys)?  

15. Internal Consistency: Do the populations of your counties add up to the populations of your states? Do the substitutes going into your soccer matches balance the substitutes going out? Would you notice if errors were introduced?  

16. Legality: Is the license under which the data can be used clearly defined, ideally in a machine readable way?  

17. Sustainability: Is there is some credible basis or evidence for believing the data will be kept available and current? If it's your data, what commitment to its maintenance are you making?  

18. Authority: Is the source of the data a credible authority on the subject? Did you find a list of NY Charter Schools, or the list?  

[Revision of #12 and addition of 16-18 suggested by Dave Reynolds.]
You could do math and numbers without periods and commas.  

In the US we use commas as thousands separators, like "1,456,012". There's no functional significance of this. "1456012" is the same number, it's just slightly harder for humans to read quickly. A computer can read it fine. You could add 47 to it easily enough. As a syntax element, then, this is both pure and optional.  

(To be really strict about this, the comma is not quite pure syntax, because although you can always take commas out with functional impunity, you can't put them in with the same freedom. Most humans will not recognize "1,4,5,6,01,2" as an alternate rendition of "1,456,012", and it's debatable whether a computer should or shouldn't.)  

In the US we use the period as the decimal point, like "1.25". This is a functional element of the language of math, obviously, as it's critically different from "125" or "12.5". But it's still optional. We could write 1.25 as 1¼ or 1 25/100. The decimal point is, in a sense, a shortcut notation for expressing fractions whose denominators are negative powers of 10. This is a elegantly sensible idea, since the digits of a decimal-point-less number are already expressing powers of 10, and it allows a whole repertoire of arithmetic techniques to be applied to fractions numbers the same way they're used on whole numbers.  

But adding syntax is never free. Giving the period special meaning in numbers introduces some amount of potential confusion when numbers are used in English sentences, where the period has a different syntactic function. Now when I write "I set the intensity to 1.5.", the first period belongs to the number and the second to the sentence. If people started using numbers in names, we might find ourselves saying "I have an appointment to see Dr. 1.5, Jr..", in which there are four periods serving three different functions.  

But as a tradeoff, this seems OK. We don't commonly use numbers in names, and sentence-ending periods are always followed by spaces while decimal points always aren't, so there's no appreciable ambiguity from a human perspective. The overloading of functions makes a computer's task harder if it's necessary to parse English sentences, but there the use of periods for abbreviation is already a much trickier problem, so the incremental effect of adding the decimal-point use doesn't affect the overall complexity very much.  

But if this were a language-design decision we were actually trying to make, we would also consider the alternatives. If we expect to want to embed snippets of our number language in our sentence language, which is what we're doing in language terms, then we have basically four choices: pick symbols that don't conflict, subsume the number language into the sentence language, quote whole snippets of the number language where we embed them into the sentence language, and/or mark the special characters from the sentence language, where they appear in the embedded bits of number language, to indicate that they are not performing their sentence functions in that context (this last is often called "escaping", in the sense that the characters so-marked are escaping from their normal responsibilities).  

Picking symbols that don't conflict is easy, of course, if you can just make up your own symbols. But if we pretend that we're making this decimal-point decision today, in a world filled with already-manufactured keyboards with only certain characters on them, and we agree that we want to use one of these characters, then there aren't a lot of good choices. Almost all the characters on a keyboard already have meaning in sentences, so you're mostly left with ^ ~ and `. Two of these already have math uses, though, and the backwards quote doesn't seem very appealing here.  

The other three solutions, subsumption, quoting and escaping, move the embedding problem out of the embedded language (numbers) into the embedding language (sentences).  

For subsumption we provide constructs in the sentence language for everything in the number language. This, in fact, has exactly been done. You can write numbers and math without using symbols or numerals: "One point two five divided by the sum of nine point four and eleven." That's the sentence version of 1.25/(9.4+11). The cool insight, from the sentence language's perspective, is that anything you can describe in words, you can say in a sentence by writing out those words, and if you do that, you can reserve punctuation for its special sentence functions. The tradeoff is particularly obvious in this case: the number language is very symbol-heavy, so the sentence-language translation is a lot different from the number-language equivalent, and in particular the number-language's careful balance of content (numbers) and function (numeric operations on those numbers) has been destroyed by turning both numerals and punctuation into words. So the cost of eliminating the ambiguity of symbols, by this method, is that translated snippets of number-language have to be mentally translated back by the reader. Given that the symbol confusion that motivated us not very severe, it's probably not worth incurring this cost, at least most of the time, but it's good to know it's an option.  

For quoting we need a way to mark the point where we switch from the sentence language into the number language, and the point where we switch back. The meta-language of paragraphs already has a block-quotation method for separating languages, which we can use for this, so we're just exploring the idea of an additional method for inline quotation, putting number-snippets inside of sentences without block-quotes or paragraph breaks.  

The sentence language already has quotation marks, of course, so we might be tempted to use those. But actually, the quotation marks in the sentence language are doing a complicated mixture of language functions: some quoting, some escaping, some nesting, and often a specific language-internal punctuation role for denoting speech. Parentheses and square brackets also have established sentence-language roles. But braces (the squiggly brackets: { } ) are available on keyboards, and not used (for much) in the sentence language. So we could adopt those as our quoting symbols. This will work well or poorly depending on whether the embedded languages use curly brackets. The number language we're talking about doesn't, but remember we're now talking about a sentence-language feature, so it has to be considered in the context of all the other languages we might want to embed in sentences. I can easily think of a few that use braces (several programming languages, the expanded math language of sets...), so that's not a clear win.  

So, in fact, we adopt a kind of hybrid solution. Snippets of the number language look different enough from the sentence language that we usually just jam them in and hope for the best, and in the rare cases where there are numbery bits of sentence-language in proximity to number-language snippets, we can use the sentence-language's function for rewriting the numbery bits without numerals. So if I'm talking about a numeric-expression collection, I can write "I have two 3+5s and one 9/13", instead of "I have 2 3+5s and 1 9/13". This is pretty complicated to explain in formal language-design terms, but it's easy to do.  

The last approach to consider is escaping. Instead of marking whole blocks of the number language, we could mark just the specific characters in the number-language that would normally have functions in the sentence-language. Two common tactics for this, from machine-oriented languages, are prefacing the characters with backslashes or doubling them. So instead of 1.25, we'd do 1\.25 or 1..25. The former approach has the advantage that we can use it for lots of characters, so we can use it for dashes and slashes and asterisks and other things which also appear in both the number language and the sentence language. But the readability cost of this, for the embedded snippet, can be high. How do you like "1\.25\/\(9\.4\+11\)"? And backslash escaping gets really ugly if you need levels of nested languages, because then you have to start backslashing the blackslashes. So here again, the cost of the solution is high compared to the benefit in ambiguity-reduction.  

So, all things considered, sticking with unadorned decimal points in numbers in sentences seems, pretty clearly, like the best approach, given all the factors we could think to consider.  

Language-design decisions are always provisional like this. If factors change, we might reconsider. Note, for example, how we've mostly changed from writing fractions by stacking tiny numbers vertically to writing normal numerals with slashes like 1/3, because numbers are increasingly expressed by people using keyboards, typing into software oriented around lines of text, and 1/3 is much easier than finding the special ⅓ character or relying on the software you happen to be using having the proper input/typography support for expressing arbitrary fractions the old way.  

So, if you've read this far, hopefully you've learned at least one thing. And probably that one thing is that you do not want to be a language designer.
Site contents published by glenn mcdonald under a Creative Commons BY/NC/ND License except where otherwise noted.