Comparing TF-IDF, GloVe, and SBERT
I’m exploring better algorithms for findlike. At the moment, only lexical search algorithms have been implemented, but I know for a fact that semantic similarity trumps lexical search by a landslide.
So I ran a quick test to see how large the difference is between a context-aware model (SBERT), a context-free model (GloVe), and statistical algorithms (TF-IDF, BM25).
Algorithm | Pre-trained model |
---|---|
GloVe | glove-wiki-gigaword-50 |
SBERT | all-MiniLM-L6-v2 |
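To make the "context-free" label concrete: a model like GloVe stores exactly one vector per word, and the averaged variant in the results table takes the mean of those vectors as the sentence embedding. The sketch below is only an illustration of that idea. The 3-dimensional vectors and the `TOY_VECTORS` name are made up for the example (the real model has 50 dimensions), and this is not findlike's code.

```python
import math

# Hypothetical 3-d vectors, invented for illustration only.
# Real GloVe vectors are looked up from a pre-trained table.
TOY_VECTORS = {
    "students": [0.9, 0.1, 0.3],
    "make": [0.2, 0.8, 0.1],
    "art": [0.4, 0.3, 0.9],
    "effort": [0.5, 0.7, 0.2],
}


def sentence_vector(tokens):
    """Average the fixed per-word vectors, skipping unknown words."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        raise ValueError("no known words in sentence")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


original = ["students", "make", "effort", "art"]
shuffled = ["art", "students", "make", "effort"]

# Same words in a different order give the same mean vector.
order_similarity = cosine(sentence_vector(original), sentence_vector(shuffled))
print(round(order_similarity, 6))
```

Because the mean ignores word order and context, two sentences built from the same words are indistinguishable to this kind of embedding, whatever they actually say.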
When comparing NLP models and algorithms, people tend to use the same cliché phrase pairs like “I like to watch television / I’m wearing a wristwatch”. So let’s perk things up a bit and use more realistic sentences, preferably ones that could well have come straight out of my digital garden:
Reference) “In fact, writing down what matters is an art, as students make a lot of effort to process information and shape it into something useful.”
Sentence A) “Summarizing and transforming huge chunks of text into meaningful knowledge is a rewarding, albeit demanding craft.”
Sentence B) “The Zettelkasten method is the preferred personal knowledge management system for avid note takers nowadays.”
Sentence C) “As a matter of fact, given this useful information, a lot of art students are down to make some effort and get into shape, he writes.”
Notice that:
- Sentence A conveys more or less the same idea as the reference sentence, but with different wording.
- Sentence B is loosely related to the reference sentence and belongs to the same overarching subject (information processing).
- Sentence C has nothing to do with the reference sentence, but shares a lot of its word roots: “fact”, “write”, “matter”, “art”, “student”, “effort”, “information”, “shape”, “useful”.
Results
Sentence | Expected similarity | TF-IDF | BM25 | GloVe (averaged) | GloVe + SIF | SBERT |
---|---|---|---|---|---|---|
A | medium/high | 0.0 | 0.0 | 0.96 | -0.29 | 0.39 |
B | medium/low | 0.0 | 0.0 | 0.93 | -0.63 | 0.17 |
C | low | 0.75 | 1.59 | 0.99 | 0.47 | 0.69 |
Unsurprisingly, the statistical approaches TF-IDF and BM25 see zero similarity between the reference sentence and both A and B, and a lot of similarity between the reference and C, given that most content words in the former also appear in the latter. (BM25 produces an unbounded relevance score rather than a value capped at 1, hence the 1.59.)
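The lexical behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the code behind the table: the stop-word list, the smoothed IDF formula, and the absence of stemming are all assumptions of mine, so the exact numbers will not match the reported ones.

```python
import math
import re
from collections import Counter

# Tiny stop-word list: an assumption made for this sketch.
STOP_WORDS = {"a", "an", "and", "are", "as", "down", "for", "he", "in",
              "into", "is", "it", "of", "some", "the", "this", "to", "what"}

SENTENCES = {
    "ref": "In fact, writing down what matters is an art, as students make "
           "a lot of effort to process information and shape it into "
           "something useful.",
    "A": "Summarizing and transforming huge chunks of text into meaningful "
         "knowledge is a rewarding, albeit demanding craft.",
    "B": "The Zettelkasten method is the preferred personal knowledge "
         "management system for avid note takers nowadays.",
    "C": "As a matter of fact, given this useful information, a lot of art "
         "students are down to make some effort and get into shape, "
         "he writes.",
}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: tf * idf} dict."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    # Smoothed IDF; one common variant among several.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    return {name: {t: count * idf[t] for t, count in Counter(toks).items()}
            for name, toks in tokens.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


vectors = tfidf_vectors(SENTENCES)
scores = {name: cosine(vectors["ref"], vectors[name]) for name in ("A", "B", "C")}
print(scores)
```

With this stop-word list, A and B share no remaining terms with the reference and score exactly zero, while C scores well above zero purely through shared vocabulary. Adding a stemmer would also merge pairs such as “writing”/“writes” and “matters”/“matter”, which should push C's score higher still.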
However, I confess that I expected more from pre-trained SBERT. It wasn’t able to capture the essence of each sentence, especially for the Reference/C pair: it scored C (0.69) well above A (0.39). Although Reference and C overlap lexically, the two sentences have significantly different themes.