furialog

¶ How to Say Only What You're Saying (With Owls) · 26 October 2009 tech

In data-aggregation we are constantly trying to deduplicate references. One dataset says "Olecito", one says "Olecito -- Cambridge", a third says "Olecito Taqueria", and a fourth says "Cielito Taqueria" in the same city as the others, and we'd really like to know whether we have four restaurants, or one, or they're all imaginary or closed or what.

Aligning the variant names is a problem in itself, but it's not the hard one. The hard one is when the concepts themselves don't quite line up. One dataset may be a list of employers, for example, so one "Olecito" is actually a corporate entity with various business properties, including a business address for their accountant's office, and if we show up there wanting a steak quesadilla we will probably (although I don't want to underestimate the guy) be disappointed.

In Linked Data practice (and I'm not currently arguing that you need to know what that is if you don't already), the most common tool for asserting equivalence is a relationship called "owl:sameAs". Out of context this seems like some kind of Star Wars joke by way of Harry Potter: those were not the droids you were looking for, but these are the owls you want. But the "owl:" part is just a bit of pointlessly off-putting data-modeling pedantry in which every single thing you say has to be individually attributed, like saying "These scrambled (in the Epicurious sense) eggs (in the Wikipedia sense) are (in the OED sense) cold (in the Eastern West Virginia State College Department of Thermopedantics and Breakfast sense)."

But never mind, the point of this goofy label is just to say that two different names are actually names for the same thing. Or concept, or even restaurant. Officially, the relationship is an absolute statement. It is inherently transitive and symmetric, so if we say that "Olecito", the corporation, is the same owls as "Olecito Taqueria", the excellent take-out restaurant, and "Olecito Taqueria" is a name for the same thing somebody once mistakenly typed as "Cielito Taqueria", then we are saying that anything said about any of those is true of the others. And there we are in the CPA's office park again, annoying him by demanding Jarritos and a torta.
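To make the transitive-and-symmetric part concrete, here is a minimal sketch in Python. It is not how any real reasoner is implemented; it just groups names into equivalence classes with a small union-find and then pools every statement across each class, which is roughly what same-owls commits us to. The function names (`same_as_closure`, `merged_facts`) and the statement strings are made up for illustration.

```python
from collections import defaultdict


def same_as_closure(pairs):
    """Group names into equivalence classes under a symmetric, transitive sameAs."""
    parent = {}

    def find(x):
        # Each name starts out as its own representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # direction is irrelevant: sameAs is symmetric

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    return list(groups.values())


def merged_facts(pairs, facts):
    """Return, for every name, all statements made about anything it is sameAs."""
    result = {}
    for group in same_as_closure(pairs):
        pooled = set()
        for name in group:
            pooled |= facts.get(name, set())
        for name in group:
            result[name] = pooled
    return result


# Illustrative assertions only, paraphrasing the example above.
same_as = [
    ("Olecito (corporation)", "Olecito Taqueria"),
    ("Olecito Taqueria", "Cielito Taqueria"),
]
facts = {
    "Olecito (corporation)": {"address: accountant's office"},
    "Olecito Taqueria": {"serves: steak quesadilla"},
}

implied = merged_facts(same_as, facts)
print(implied["Cielito Taqueria"])
```

Nobody ever said anything about "Cielito Taqueria" directly, yet it ends up carrying both the quesadilla and the accountant's address, which is exactly the problem.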

Practitioner arguments about this issue are obscure even by my standards. (You can see one here, but you've been warned.) Towards the end of that scattered argument, though, it occurred to me what I think the official owls need.

Sometimes we really are trying to say that two different names are wholly interchangeable, particularly when we're cleaning up a single dataset. So having a way to say that is necessary and good.

But across datasets what we usually mean is, in English, something more like "For our current purposes, the thing called X in somebody else's dataset can be taken to provide some missing information about Y in our dataset." This is different from the same-owls thing in two essential ways: it's contextual, not absolute; and even more importantly, it's one-way. We are not saying, or even implying, that our information about Y ought to be inserted into anybody's database as information about X or Z or anything. We are making no claim about whether X and Y are the same existentially, or for any other purpose than this.

The relevant web standards are missing this concept. Maybe oddly, maybe not: RDF and OWL arose out of taxonomy, not data-cleanup. So they have a way to say that one whole class of things is derived from another class of things, and that one relationship is a variation on another relationship; but not that one individual data-concept is a contextual elaboration on another.

Adding it, though, would be trivial, at least logically. It could be called "owl:represents", as in "My owl is elsewhere, so I'll send this note with yours." And in the other direction, if you are the recipient of this representation, the relationship would be "owl:isRepresentedBy". The assertion is neither transitive nor symmetric. What I say about your data is completely separate from what you say about mine, or what you say about anybody else's. I have no business saying whether my data matches yours for your purposes, and no desire to have my statements about my purposes conflated with anything broader. I'm only trying to say what I'm saying. More precisely, I'm trying to say only what I'm saying. We each keep our owls, and you know that I used yours to send a message, but that doesn't make my owl your owl, or my message your message, or anything inane like that.
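As a rough sketch of how little machinery this needs, here is the one-way version in Python. Everything here is hypothetical: "owl:isRepresentedBy" is only my proposal, not part of OWL, and the `fill_missing` helper, the dictionaries, and the field values are invented for the example. I am reading the relationship as a mapping from our name Y to their name X, followed for exactly one hop and in one direction only.

```python
def fill_missing(ours, theirs, is_represented_by):
    """Fill gaps in our records from theirs, one-way and one hop only.

    is_represented_by maps our name Y to their name X (a hypothetical,
    contextual assertion made by us, for our purposes only).
    """
    enriched = {}
    for y, record in ours.items():
        merged = dict(record)  # copy, so our originals stay untouched
        x = is_represented_by.get(y)
        if x is not None:
            for field, value in theirs.get(x, {}).items():
                merged.setdefault(field, value)  # fill gaps, never overwrite ours
        enriched[y] = merged
    return enriched


# Illustrative data only.
ours = {"Olecito Taqueria": {"serves": "steak quesadilla"}}
theirs = {
    "Olecito -- Cambridge": {"city": "Cambridge", "serves": "tortas"},
    "Olecito": {"address": "accountant's office"},
}

# Our assertion, for our purposes.
our_links = {"Olecito Taqueria": "Olecito -- Cambridge"}

# Their own assertion about their own data. We deliberately do not follow it.
their_links = {"Olecito -- Cambridge": "Olecito"}

enriched = fill_missing(ours, theirs, our_links)
print(enriched["Olecito Taqueria"])
```

We pick up the city, we keep our own opinion about the quesadilla, and the accountant's address stays where it belongs because their links are never chained onto ours. Nothing is written back into their data either.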

Surely the owls, at least, would appreciate this.