When the Jaccard similarity index isn’t the appropriate device for the job, and what to do as a substitute
I’ve been considering currently about considered one of my go-to information science instruments, one thing we use fairly a bit at Aampe: the Jaccard index. It’s a similarity metric that you simply compute by taking the dimensions of the intersection of two units and dividing it by the dimensions of the union of two units. In essence, it’s a measure of overlap.
For my fellow visible learners:
Many (myself included) have sung the praises of the Jaccard index as a result of it is useful for lots of use instances the place that you must determine the similarity between two teams of parts. Whether or not you’ve obtained a comparatively concrete use case like cross-device identification decision, or one thing extra summary, like characterize latent consumer curiosity classes primarily based on historic consumer habits — it’s actually useful to have a device that quantifies what number of elements two issues share.
However Jaccard just isn’t a silver bullet. Typically it’s extra informative when it’s used together with different metrics than when it’s used alone. Typically it’s downright deceptive.
Let’s take a more in-depth have a look at just a few instances when it’s not fairly acceptable, and what you may need to do as a substitute (or alongside).
The issue: The larger one set is than the opposite (holding the dimensions of the intersection equal), the extra it depresses the Jaccard index.
In some instances, you don’t care if two units are reciprocally related. Possibly you simply need to know if Set A principally intersects with Set B.
Let’s say you’re attempting to determine a taxonomy of consumer curiosity primarily based on looking historical past. You’ve a log of all of the customers who visited http://www.luxurygoodsemporium.com and a log of all of the customers who visited http://superexpensiveyachts.com (neither of that are reside hyperlinks at press time; fingers crossed nobody creepy buys these domains sooner or later).
Say that out of 1,000 customers who browsed for tremendous costly yachts, 900 of them additionally regarded up some luxurious items — however 50,000 customers visited the posh items website. Intuitively, you may interpret these two domains as related. Almost everybody who patronized the yacht area additionally went to the posh items area. Looks like we may be detecting a latent dimension of “high-end buy habits.”
However as a result of the variety of customers who have been into yachts was a lot smaller than the variety of customers who have been into luxurious items, the Jaccard index would find yourself being very small (0.018) regardless that the overwhelming majority of the yacht-shoppers additionally browsed luxurious items!
What to do as a substitute: Use the overlap coefficient.
The overlap coefficient is the dimensions of the intersection of two units divided by the dimensions of the smaller set. Formally:
Let’s visualize why this may be preferable to Jaccard in some instances, utilizing probably the most excessive model of the issue: Set A is a subset of Set B.
When Set B is fairly shut in measurement to Set B, you’ve obtained a good Jaccard similarity, as a result of the dimensions of the intersection (which is the dimensions of Set A) is near the dimensions of the union. However as you maintain the dimensions of Set A relentless and improve the dimensions of Set B, the dimensions of the union will increase too, and…the Jaccard index plummets.
The overlap coefficient doesn’t. It stays yoked to the dimensions of the smallest set. That implies that whilst the dimensions of Set B will increase, the dimensions of the intersection (which on this case is the entire measurement of Set A) will at all times be divided by the dimensions of Set A.
Let’s return to our consumer curiosity taxonomy instance. The overlap coefficient is capturing what we’re concerned about right here — the consumer base for yacht-buying is linked to the posh items consumer base. Possibly the search engine optimisation for the yacht web site is not any good, and that’s why it’s not patronized as a lot as the posh items website. With the overlap coefficient, you don’t have to fret about one thing like that obscuring the connection between these domains.
Professional tip: if all you will have are the sizes of every set and the dimensions of the intersection, yow will discover the dimensions of the union by summing the sizes of every set and subtracting the dimensions of the intersection. Like this:
Additional studying: https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d
The issue: When set sizes are very small, your Jaccard index is lower-resolution, and generally that overemphasizes relationships between units.
Let’s say you’re employed at a start-up that produces cellular video games, and also you’re creating a recommender system that implies new video games to customers primarily based on their earlier enjoying habits. You’ve obtained two new video games out: Mecha-Crusaders of the Cyber Void II: Prisoners of Vengeance, and Freecell.
A spotlight group most likely wouldn’t peg these two as being very related, however your evaluation reveals a Jaccard similarity of .4. No nice shakes, but it surely occurs to be on the upper finish of the opposite pairwise Jaccards you’re seeing — in any case, Bubble Crush and Bubble Exploder solely have a Jaccard similarity of .39. Does this imply your cyberpunk RPG and Freecell are extra intently associated (so far as your recommender is worried) than Bubble Crush and Bubble Exploder?
Not essentially. Since you took a more in-depth have a look at your information, and solely 3 distinctive system IDs have been logged enjoying Mecha-Crusaders, solely 4 have been logged enjoying Freecell, and a pair of of them simply occurred to have performed each. Whereas Bubble Crush and Bubble Exploder have been every visited by lots of of units. As a result of your samples for the 2 new video games are so small, a probably coincidental overlap makes the Jaccard similarity look a lot greater than the true inhabitants overlap would most likely be.
What to do as a substitute: Good information hygiene is at all times one thing to bear in mind right here — you’ll be able to set a heuristic to wait till you’ve collected a sure pattern measurement to contemplate a set in your similarity matrix. Like all estimates of statistical energy, there’s a component of judgment to this, primarily based on the everyday measurement of the units you’re working with, however keep in mind the final statistical finest follow that bigger samples are typically extra consultant of their populations.
However an alternative choice you will have is to log-transform the dimensions of the intersection and the dimensions of the union. This output ought to solely be interpreted when evaluating two modified indices to one another.
In the event you do that for the instance above, you get a rating fairly near what you had earlier than for the 2 new video games (0.431). However since you will have so many extra observations within the Bubble style of video games, the log-transformed intersection and log-transformed union are so much nearer collectively — which interprets to a a lot greater rating.
Caveat: The trade-off right here is that you simply lose some decision when the union has a whole lot of parts in it. Including 100 parts to the intersection of a union with 1000’s of parts might imply the distinction between a daily Jaccard rating of .94 and .99. Utilizing the log remodel strategy may imply that including 100 parts to the intersection solely strikes the needle from a rating of .998 to .999. It is determined by what’s necessary to your use case!
The issue: You’re evaluating two teams of parts, however collapsing the weather into units leads to a lack of sign.
That is why utilizing a Jaccard index to check two items of textual content just isn’t at all times a terrific concept. It may be tempting to take a look at a pair of paperwork and need to get a measure of their similarity primarily based on what tokens are shared between them. However the Jaccard index assumes that the weather within the two teams to be in contrast are distinctive. Which flattens out phrase frequency. And in pure language evaluation, token frequency is usually actually necessary.
Think about you’re evaluating a e-book about vegetable gardening, the Bible, and a dissertation in regards to the life cycle of the white-tailed deer. All three of those paperwork may embody the token “deer,” however the relative frequency of the “deer” token will range dramatically between the paperwork. The a lot greater frequency of the phrase “deer” within the dissertation most likely has a distinct semantic affect than the scarce makes use of of the phrase “deer” within the different paperwork. You wouldn’t need a similarity measure to only overlook about that sign.
What to do as a substitute: Use cosine similarity. It’s not only for NLP anymore! (But additionally it’s for NLP.)
Briefly, cosine similarity is a option to measure how related two vectors are in multidimensional house (no matter the magnitude of the vectors). The path a vector goes in multidimensional house is determined by the frequencies of the scale which are used to outline the house, so details about frequency is baked in.
To make it simple to visualise, let’s say there are solely two tokens we care about throughout the three paperwork: “deer” and “bread.” Every textual content makes use of these tokens a distinct variety of instances. The frequency of those tokens develop into the scale that we plot the three texts in, and the texts are represented as vectors on this two-dimensional airplane. As an example, the vegetable gardening e-book mentions deer 3 instances and bread 5 instances, so we plot a line from the origin to (3, 5).
Right here you need to have a look at the angles between the vectors. θ1 represents the similarity between the dissertation and the Bible; θ2, the similarity between the dissertation and the vegetable gardening e-book; and θ3, the similarity between the Bible and the vegetable gardening e-book.
The angles between the dissertation and both of the opposite texts is fairly massive. We take that to imply that the dissertation is semantically distant from the opposite two — a minimum of comparatively talking. The angle between the Bible and the gardening e-book is small relative to every of their angles with the dissertation, so we’d take that to imply there’s much less semantic distance between the 2 of them than from the dissertation.
However we’re speaking right here about similarity, not distance. Cosine similarity is a metamorphosis of the angle measurement of the 2 vectors into an index that goes from 0 to 1*, with the identical intuitive sample as Jaccard — 0 would imply two teams don’t have anything in frequent, and nearer you get to 1 the extra related the 2 teams are.
* Technically, cosine similarity can go from -1 to 1, however we’re utilizing it with frequencies right here, and there will be no frequencies lower than zero. So we’re restricted to the interval of 0 to 1.
Cosine similarity is famously utilized to textual content evaluation, like we’ve accomplished above, however it may be generalized to different use instances the place frequency is necessary. Let’s return to the posh items and yachts use case. Suppose you don’t merely have a log of which distinctive customers went to every website, you even have the counts of variety of instances the consumer visited. Possibly you discover that every of the 900 customers who went to each web sites solely went to the posh items website a couple of times, whereas they went to their yacht web site dozens of instances. If we consider every consumer as a token, and due to this fact as a distinct dimension in multidimensional house, a cosine similarity strategy may push the yacht-heads a bit of additional away from the posh good patrons. (Be aware you can run into scalability points right here, relying on the variety of customers you’re contemplating.)
Additional studying: https://medium.com/geekculture/cosine-similarity-and-cosine-distance-48eed889a5c4
I nonetheless love the Jaccard index. It’s easy to compute and customarily fairly intuitive, and I find yourself utilizing it on a regular basis. So why write a complete weblog publish dunking on it?
As a result of nobody information science device may give you an entire image of your information. Every of those completely different measures let you know one thing barely completely different. You will get useful info out of seeing the place the outputs of those instruments converge and the place they differ, so long as you already know what the instruments are literally telling you.
Philosophically, we’re towards one-size-fits-all approaches at Aampe. After on a regular basis we’ve spent taking a look at what makes customers distinctive, we’ve realized the worth of leaning into complexity. So we expect the broader the array of instruments you need to use, the higher — so long as you know the way to make use of them.