15 May 2010 to 10 February 2010
The World Cup has a long, storied history. 708 matches across 18 tournaments, involving 80ish different countries, more than 5000 players, and more than 2000 goals. That's a lot of soccer.  

It's not really that much data, though. Soccer isn't a sport for actuaries. My AAC file for Shakira's 2010 World Cup theme song is approximately three times the size of my data file containing more or less all salient info about the entire match-history of the Cup finals. When people talk about Big Data, this is not what they mean. This is Small Data.  

Even Small Data can be hard to get right, though. Who scored the second Cuban goal in their 3-3 draw with Romania on 5 June 1938? I'm betting you don't quite remember the guy's name, either.  

FIFA's official stats page for this match claims that the second Cuban goal was scored by Jose Magrina in the 69th minute. The listed Cuban lineup, however, includes no such player among either the starters or the substitutes.  

Planet World Cup's version has the middle Cuban goal scored by "Maquina", whom the FIFA lineup doesn't list either, and PWC doesn't have a lineup of its own to check against. They also have the second Romanian goal coming after the second Cuban goal, not before.  

Scoreshelf has a page for Carlos Maquina Oliveira, which matches FIFA's listing of Carlos Oliveira, so that's something. But Scoreshelf credits him with the second and third Cuban goals, and their match page has fairly different timings for all six goals, and credits the first Romanian goal to a different player.  

InfoFootballOnline's 1938 stats page claims Maquina had 2 goals for Cuba, but it also claims Héctor Socorro had 3, including 2 of the 3 in the 3-3 draw, contradicting other sites' credit of one of the Cuban goals to Tomas Fernández.  

The Wikipedia page for 1938 has yet another set of timings, and expresses its own creativity by giving the third Cuban goal to Juan Tuñas.  

So I can say, with pretty good confidence, that I don't know who scored these goals, nor when they occurred. If you know of an explicably authoritative source for the goal credits for Cuba's 1938 World Cup games, send me the reference. But of all the versions, FIFA's official one is clearly and ironically the least sensible, as it involves a player that none of these sources, including FIFA themselves, list as having been in the game. Machines can't tell us how we're wrong, but they ought to be able to easily tell us when we're not making any sense.  

So for my version of World Cup History in Needle, I've made my own decisions, too, but I can at least say, with computationally verified certainty, that they're internally consistent. Across this whole small sprawling history, there are no goals or cards attributed to unknown players, the itemized goal totals (including own-goals) match the official final scores, and the computed champions match the record books. I've fixed dozens of games where FIFA's stats list starters being replaced by ghosts, or 12th players joining the fray. I found the two cases where players were carded while they weren't even in the game, and looked them up to make sure that's what actually happened. I've checked that there are no goals credited during overtime of games that didn't have any, I fixed the game that was listed as happening in February, and I fixed the extra errors my own non-soccer-specific software introduced (did you know that "NGA", the country-code for Nigeria, is also the Vietnamese word for Russia?).  
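
To make that kind of checking concrete, here is a minimal sketch in Python. It is not how Needle does it. The `Goal` and `Match` shapes, the `find_problems` function, and the toy match with its "Ghost" scorer are all invented for illustration, and it assumes goal minutes above 90 only ever mean extra time.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    scorer: str
    credited_to: str          # team whose score goes up
    minute: int
    own_goal: bool = False


@dataclass
class Match:
    final_score: dict[str, int]      # team -> goals in the official result
    lineups: dict[str, set[str]]     # team -> starters plus substitutes used
    goals: list[Goal] = field(default_factory=list)
    had_extra_time: bool = False


def find_problems(match: Match) -> list[str]:
    """Describe every internal inconsistency found in one match record."""
    problems = []
    for goal in match.goals:
        # A normal goal must come from the credited team's lineup;
        # an own goal must come from the other team's lineup.
        eligible = set()
        for team, players in match.lineups.items():
            if (team == goal.credited_to) != goal.own_goal:
                eligible |= players
        if goal.scorer not in eligible:
            problems.append(
                f"minute {goal.minute}: {goal.scorer} is not in an eligible lineup")
        # Assumption: minutes above 90 are only recorded for extra time.
        if goal.minute > 90 and not match.had_extra_time:
            problems.append(
                f"minute {goal.minute}: goal in extra time of a match that had none")
    for team, official in match.final_score.items():
        itemized = sum(1 for g in match.goals if g.credited_to == team)
        if itemized != official:
            problems.append(
                f"{team}: {itemized} itemized goals but official score is {official}")
    return problems


# Hypothetical toy data, not a real World Cup match.
toy = Match(
    final_score={"Reds": 2, "Blues": 1},
    lineups={"Reds": {"R1", "R2"}, "Blues": {"B1", "B2"}},
    goals=[
        Goal("R1", "Reds", 10),
        Goal("Ghost", "Reds", 69),                 # scorer in nobody's lineup
        Goal("R2", "Blues", 80, own_goal=True),    # a Red scoring for the Blues
    ],
)
problems = find_problems(toy)
print(problems)
```

None of these checks can say who really scored. They can only say when a record contradicts itself, which is exactly the class of error a machine ought to catch.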

It shouldn't have come to this. I'm an art major working for a company that makes airline software. Between FIFA and a dozen sports and news companies, somebody who lives by this data ought to have fixed it all years ago. The worst thing is, I suspect they've been trying. And yet, every source I checked had problems I could tell they couldn't find. When all data had to do was look right in the galley proofs, it was a publishing problem. But bringing publishing tools to data problems is like bringing a Strunk & White to a math fight.

Percentage of baby names that were used for both sexes.
I suggest listening to this solo album by Jónsi of Sigur Rós while poring over this writhing index of World Cup stats.
I was doing some more analysis on Encyclopaedia Metallum data, as I seem to end up doing one way or another most weeks, and it occurred to me to add in country populations so I could do bands per capita. This is not a rigorous statistic, since for some regions of the world it probably says more about the EM contributor-base than it does about the actual distributions of musicians, but the top of the list was pretty striking:  

1. Finland - 471 bands per million people
2. Sweden - 330
3. Norway - 239
4. Iceland - 189
5. Denmark - 113  

Yes, I have rediscovered the existence of Scandinavia statistically!  

For comparison:  

28. United States - 49
29. United Kingdom - 46  

And way down at the very end (among countries with at least 10 bands, at any rate):  

87. China - 0.097
88. India - 0.068  

There are a fair number of places that have, as far as EM knows, no metal bands at all, but I've got all the populations in place, so if anybody starts a band on Pitcairn or Wallis et Futuna next week, I'm ready.  

(Interesting, too, that Finland leads the world in both metal bands and Winter Olympic medals per capita.)  

(In the interest of accuracy, I should note that if you include all countries, Liechtenstein actually comes in third with 251 bands per million people (by which we mean 9 bands for ~36,000 people). Liechtenstein is one of only two countries in the world where there are more Gothic Metal bands than any other style. The other place is Kuwait, where Gothic Metal bands outnumber all other styles, put together, by a commanding margin of 1 to 0. The band is called Nocturna.)
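
The arithmetic is nothing more than bands divided by population, scaled to a million, with an optional minimum-band cutoff for the ranking. A small sketch in Python, assuming function names of my own choosing (`bands_per_million`, `rank`). Only the Liechtenstein figures come from the numbers above, and "Exampleland" is a made-up row included purely to show the cutoff.

```python
def bands_per_million(bands: int, population: int) -> float:
    """Bands per million inhabitants."""
    return bands * 1_000_000 / population


def rank(rows, min_bands=0):
    """Sort (country, bands, population) rows by bands per million, highest first.

    Countries with fewer than min_bands bands are left out of the ranking.
    """
    rates = [
        (country, bands_per_million(bands, population))
        for country, bands, population in rows
        if bands >= min_bands
    ]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


rows = [
    ("Liechtenstein", 9, 36_000),       # roughly the figures quoted above
    ("Exampleland", 20, 1_000_000),     # hypothetical, for illustration only
]

print(rank(rows))                # Liechtenstein first, at 250 per million
print(rank(rows, min_bands=10))  # Liechtenstein drops out under the 10-band cutoff
```

With a population of exactly 36,000 the rate comes out to 250. The 251 quoted above reflects a population slightly under that round number.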
My week to pick three albums for the new ILX Metal Listening Club. Like a book club, but for metal albums. All are welcome to join us in contemplation of Fates Warning's Parallels, Cradle of Filth's Nymphetamine and HIM's Screamworks: Love in Theory and Practice.
I have a post on my work-blog about why geeky-sounding data-modeling issues matter to even simple-looking data.  

The post uses some Oscar-award data as an example, as I just put together a Needle version of Oscar History, so if you ever wanted to be able to answer some obscure statistical question about the Oscars, now you can. (Or you can ask me, and I can...)
From Revisiting HTTP based Linked Data:
What is Linked Data?  

"Data Access by Reference" mechanism for Data Objects (or Entities) on HTTP networks. It enables you to Identify a Data Object and Access its structured Data Representation via a single Generic HTTP scheme based Identifier (HTTP URI). Data Object representation formats may vary; but in all cases, they are hypermedia oriented, fully structured, and negotiable within the context of a client-server message exchange.

If there's anybody in the universe (who isn't already a Semantic Web expert) who reads that and thinks "Yes! I gotta get one of those!", I'd be very surprised.  

For that matter, if there's anybody in the data business, other than the writer, who reads that and thinks "Yes! That's the Grand Quest to which I have given my Life and Loyalty!", I'd be almost as surprised, and rather suspicious.  

Contrast this:
What is Whole Data?  

Whole Data is a method of storing and representing your data so that you can explore it as easily as you browse the web, and examine or analyze it from any perspective at any time. It gives you and the computer a common language for talking about your data, so that the computer can answer your questions by examining your data the way you would, but way faster.

I'm not saying this is the same level as "We hold these truths to be self-evident...", or "I'm not going to pay a lot for this muffler", either. It's a technical appeal, not a Radio Free Earth broadcast. But it's closer to comprehensible, at least, isn't it?
If you'd rather see a glimpse of Needle, my work project, involving sports instead of music, I just did a quick chomp through the medal-winner data from the Vancouver Olympics. Tables, breakdowns, cross-tabulations, etc.!  

Needle - Vancouver 2010 Olympics
There's a startup called Etacts whose software can import your email (headers), and give you a view by person instead of by message. This is pretty obviously something you'd like to be able to see, if you have any significant volume of correspondence or correspondents. And it's pretty obviously a large leap of faith to give some random startup the ability to log into your email account.  

But, even more, this is a perfect example of something you ought to be able to do, at least basically, with your data without needing any extra application. Etacts provides some unique contact-management features, but the fundamental function of rotating data to some other axis is, I contend, rightfully a property of the data. You already own the view of your email by recipient, because it is immanent in your email. But your email service probably doesn't let you see it. I would like it to be the responsibility of a data-holding service to give you full logical access to that data.  
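
As a rough illustration of how little is involved in the basic rotation, here is a sketch in Python using only the standard library's `email` package. It is not how Etacts or Needle does it. The `by_person` function and the sample messages are invented for this example, and it only looks at the From, To and Cc headers.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import getaddresses


def by_person(raw_messages):
    """Rotate a list of raw messages into a mapping of address -> subjects."""
    index = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        headers = (msg.get_all("From", [])
                   + msg.get_all("To", [])
                   + msg.get_all("Cc", []))
        # Each person is counted once per message, even if listed twice.
        people = {addr.lower() for _name, addr in getaddresses(headers) if addr}
        for person in people:
            index[person].append(msg["Subject"])
    return dict(index)


# Hypothetical sample messages, headers only.
sample = [
    "From: alice@example.com\nTo: bob@example.com\nSubject: Lunch\n\n",
    "From: bob@example.com\nTo: carol@example.com\nSubject: Fries\n\n",
]

view = by_person(sample)
for person in sorted(view):
    print(person, view[person])
```

The by-person view needs no information that isn't already in the headers, which is the sense in which it is a property of the data rather than of any particular application.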

This is a huge part of why my work project matters to me. It is my attempt to design a universal data-modeling paradigm, and associated query-language, that could be a candidate standard for what it means to provide "full logical access" to data.
Usage Notes  

"In lieu of" means "instead of", with the general implication that the object of the phrase is absent or unavailable. So "I ordered onion rings in lieu of fries" if they were out of fries and you were forced to choose something else, but "I ordered onion rings instead of fries" is better if it's just your choice. "In place of" works for both, too.  

"In light of" means "as a result of", with the sense of having changed something because you observed (thus the "light", by which you see) something else. So "In light of the overwhelming shift in demand from fries to onion rings, I recommend we reduce our potato order." Or just "Let's get more onions instead of so many potatoes. Everybody hates potatoes now, apparently."  

"In favor of" means "in place of", like "instead of" but in the opposite order, with a sense of some kind of trade-off. "We reduced our potato order in favor of more onions to satisfy rampant onion-ring demand." Or "We bought more onions instead of potatoes, because everybody is ordering onion rings for some weird reason now. Did somebody on Chowhound post something good about our onion rings? Or bad about our fries?!"  

In light of frequent misuse, in lieu of specific needs I recommend eschewing all three in favor of saying what you know you mean.
Site contents published by glenn mcdonald under a Creative Commons BY/NC/ND License except where otherwise noted.