Here is a gift, of unspecified value, to the field of set-comparison math: the Empath Coefficient, an alternate measure of the alignment between two sets. Conceptually this is intended as a rough proxy for measuring the degree to which the unseen or impractical-to-measure motivation behind the membership of set A also informs the membership of set B. But the math is what it is, so the next time you find yourself comparing the Cosine, Dice and Tanimoto coefficients, looking for something faster than TF-IDF to make some sense of your world, here's another thing to try. This is the one I used in empath, my recent similarity-analysis of heavy-metal bands, if you want to see lots of examples of it in action.

At its base, the Empath Coefficient is an asymmetric measure, based on the idea that in a data distribution with some elements that appear in many sets and some that appear in only a few, it is not very interesting to discover that everything is "similar" to the most-popular things. E.g., "People who bought Some Dermatological Diseases of the Domesticated Turtle also bought Harry Potter and the...". In the Empath calculation, then, the size of the Harry Potter set (the one you're comparing) affects the similarity more than the size of the Turtle set (the one you're trying to learn about). I have arrived at a 1:3 weighting through experimenting with a small number of data-sets, and do not pretend to offer any abstract mathematical justification for this ratio, so if you want to parameterize the second-set weight and call that the Npath Coefficient, go ahead.

Where the Dice Coefficient, then, divides the size of the overlap by the average size of the two sets (call A the size of the first set, B the size of the second set, and V the size of the overlap):

V/((A+B)/2)
or
2V/(A+B)

the core of the Empath Coefficient adjusts this to:

V/((A+3B)/4)
or
4V/(A+3B)

By itself, though, that calculation will still be uninformatively dominated by small overlaps between small sets, so I further discount the similarities based on the overlap size. Like this:

(1-1/(V+1)) * V/((A+3B)/4)
or
4V(1-1/(V+1))/(A+3B)

So if the overlap size (V) is only 1, the core score is multiplied by 1/2 [1-1/(1+1)]; if it's 2, the core score is multiplied by 2/3 [1-1/(2+1)]; etc. And then, for good measure, I parameterize the whole thing to allow the assertion of a minimum overlap size, M, which goes into the adjustment numerator like this:

4V(1-M/(V+1))/(A+3B)

This way the sample-size penalties are automatically calibrated to the threshold: at an overlap of M-1 the score drops to zero, and below that it actually goes negative. You can obviously overlay a threshold on the other coefficients in pre- or post-processing, but I think it's much cooler to have the math just take care of it.
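
In case it's easier to read as code than as algebra, here's a minimal Python sketch of the whole calculation (the function name and the default for M are mine, not anything standard):

def empath(a, b, v, m=1):
    # a: size of the set you're trying to learn about (A)
    # b: size of the set you're comparing against (B)
    # v: size of the overlap; m: minimum interesting overlap size
    # Core asymmetric ratio, weighting B three times as heavily as A,
    # discounted by the small-sample penalty (1 - m/(v+1)).
    return 4 * v * (1 - m / (v + 1)) / (a + 3 * b)

print(empath(10, 10, 5, m=3))   # 0.25, overlap comfortably above the threshold
print(empath(10, 10, 1, m=3))   # -0.05, overlap below the threshold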



I also sometimes use another simpler asymmetric calculation, the Subset Coefficient, which produces very similar rankings to Empath's for any given A against various Bs (especially if the sets are all large):

(V-1)/B

The concept here is that we take A as stipulated, and then compare B to A's subset of B, again deducting points for small sample-sizes. The biggest disadvantage of Subset is that scores for As of different sizes are not calibrated against each other, so comparing A1/B1 similarity to A2/B2 similarity won't necessarily give you useful results. But sometimes you don't care about that.  
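
In code, again hedged as a sketch rather than anything official, Subset is barely a calculation at all:

def subset(b, v):
    # B's size and the overlap are all that matter; A is taken as stipulated.
    return (v - 1) / b

print(subset(5, 3))    # 0.4 -- A=10 against B=5 (row 5 of the table below)
print(subset(10, 3))   # 0.2 -- A=5 against B=10 (row 6): the asymmetry in action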

This is the one I used for calculating artist clusters from 2006 music-poll data, where cross-calibration wasn't worth worrying about because the data was so limited to begin with.



Here, then, are the formulae for all five of these coefficients:

Cosine:   V/sqrt(AB)
Dice:     2V/(A+B)
Tanimoto: V/(A+B-V)
Subset:   (V-1)/B
Empath:   4V(1-M/(V+1))/(A+3B)
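
And since they're all one-liners anyway, here's a Python rendering of all five together (names mine, uniform signatures for convenience), which reproduces row 3 of the table below:

from math import sqrt

def cosine(a, b, v):   return v / sqrt(a * b)
def dice(a, b, v):     return 2 * v / (a + b)
def tanimoto(a, b, v): return v / (a + b - v)
def subset(a, b, v):   return (v - 1) / b   # a drops out entirely
def empath(a, b, v, m=1): return 4 * v * (1 - m / (v + 1)) / (a + 3 * b)

a, b, v = 10, 10, 5   # row 3 of the table
for name, f in [("Cosine", cosine), ("Dice", dice), ("Tanimoto", tanimoto),
                ("Subset", subset), ("Empath", empath)]:
    print(name, round(f(a, b, v), 3))
# Cosine 0.5 / Dice 0.5 / Tanimoto 0.333 / Subset 0.4 / Empath 0.417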

And here are some example scores and ranks:  

 #    A    B    V   Dice  Rank  Tanimoto  Rank  Cosine  Rank  Subset  Rank  Empath  Rank
 1  100  100  100  1.000    1     1.000     1    1.000     1   0.990     1   0.990     1
 2   10   10   10  1.000    1     1.000     1    1.000     1   0.900     2   0.909     2
 3   10   10    5  0.500    3     0.333     3    0.500     5   0.400     4   0.417     4
 4   10   10    2  0.200   12     0.111    12    0.200    12   0.100    11   0.133    12
 5   10    5    3  0.400    6     0.250     6    0.424     6   0.400     5   0.360     5
 6    5   10    3  0.400    6     0.250     6    0.424     6   0.200     8   0.257     8
 7   10    5    2  0.267   10     0.154    10    0.283    10   0.200     7   0.213    10
 8    5   10    2  0.267   10     0.154    10    0.283    10   0.100    11   0.152    11
 9    6    6    2  0.333    9     0.200     9    0.333     9   0.167     9   0.222     9
10    6    4    2  0.400    6     0.250     6    0.408     8   0.250     6   0.296     6
11    6    2    2  0.500    3     0.333     3    0.577     3   0.500     3   0.444     3
12    2    6    2  0.500    3     0.333     3    0.577     3   0.167     9   0.267     7
 

A few things to note:  

- In 1 & 2, notice that Dice, Tanimoto and Cosine all produce 1.0 scores for congruent sets, no matter what their size. Subset and Empath only approach 1, and give higher scores to larger sets. The idea is that the larger the two sets are, the more unlikely it is that they coincide by chance.  

- 5 & 6, 7 & 8 and 11 & 12 are reversed pairs, so you can see how the two asymmetric calculations handle them.  

- Empath produces the finest granularity of scores, by far, including no ties even within this limited set of examples. Whether this is good or bad for any particular data-set of yours is up to you to decide.  

- Since all of these work with only the set and overlap sizes, none of them take into account the significance of two sets overlapping at some specific element. If you want to probability-weight, to say that sharing a seldom-shared element is worth more than sharing an often-shared element, then look up term frequency/inverse document frequency (TF-IDF), and plan to spend more calculation cycles. Sometimes you need this. (I used TF-IDF for comparing music-poll voters, where the set of set-sizes was so small that without taking into account the popularity/obscurity of the albums on which voters overlapped, you couldn't get any interesting numbers at all.)



There may or may not be some clear mathematical way to assess the fitness of each of these various measurements for a given data-set, based on its connectedness and distribution, but at any rate I am not going to provide one. If you actually have data with overlapping sets whose similarity you're trying to measure, I suggest trying all five, and examining their implications for some corner of your data you personally understand, where you yourself can meta-evaluate the scores and rankings that the math produces. I do not contend that my equations produce more-objective truths than the other ones; only that the stories they tell me about things I know are plausible, and the stories they have told me about things I didn't know have usually proven to be interesting.