Bruno Arine

Comparing TF-IDF, GloVe, and SBERT

I’m exploring better algorithms for findlike. At the moment, only lexical search algorithms have been implemented, but I know for a fact that semantic search trumps lexical search by a landslide.

So I ran a quick test to see how large the gap is between a context-aware model (SBERT), a context-free model (GloVe), and purely statistical algorithms (TF-IDF, BM25).

Table 1: Algorithms and the pre-trained models used in this experiment.
Algorithm   Pre-trained model
GloVe       glove-wiki-gigaword-50
SBERT       all-MiniLM-L6-v2
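
For reference, both pre-trained models can be pulled with a few lines of Python. The sketch below assumes the gensim downloader and the sentence-transformers package; it’s one way to load them, not necessarily the exact setup behind this experiment.

    import gensim.downloader as api
    from sentence_transformers import SentenceTransformer

    # Context-free word vectors: 50-dimensional GloVe trained on Wikipedia + Gigaword
    glove = api.load("glove-wiki-gigaword-50")

    # Context-aware sentence encoder: MiniLM fine-tuned for sentence similarity
    sbert = SentenceTransformer("all-MiniLM-L6-v2")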

When comparing NLP models and algorithms, people tend to use the same cliché phrase pairs like “I like to watch television / I’m wearing a wristwatch”. So let’s perk things up a bit and use more realistic sentences, preferably ones that could well have come straight out of my digital garden:

Reference) “In fact, writing down what matters is an art, as students make a lot of effort to process information and shape it into something useful.”

Sentence A) “Summarizing and transforming huge chunks of text into meaningful knowledge is a rewarding, albeit demanding craft.”

Sentence B) “The Zettelkasten method is the preferred personal knowledge management system for avid note takers nowadays.”

Sentence C) “As a matter of fact, given this useful information, a lot of art students are down to make some effort and get into shape, he writes.”

Notice that:

  • Sentence A conveys more or less the same idea as the reference sentence, but with different wording.
  • Sentence B is loosely related to the reference sentence and belongs to the same overarching subject (information processing).
  • Sentence C has nothing to do with the reference sentence, but shares a lot of its word roots: “fact”, “write”, “matter”, “art”, “student”, “effort”, “information”, “shape”, “useful”.
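
Before looking at the numbers, here’s a rough sketch of how such scores can be computed. Assumptions (not necessarily the exact pipeline used here): scikit-learn for TF-IDF, the rank_bm25 package for BM25, a plain word-vector average for GloVe, and cosine similarity throughout; the SIF-weighted variant and the original tokenization are omitted, so the outputs won’t match the table exactly.

    import re
    import numpy as np
    import gensim.downloader as api
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    reference = ("In fact, writing down what matters is an art, as students make a lot of "
                 "effort to process information and shape it into something useful.")
    candidates = [
        "Summarizing and transforming huge chunks of text into meaningful knowledge is a rewarding, albeit demanding craft.",
        "The Zettelkasten method is the preferred personal knowledge management system for avid note takers nowadays.",
        "As a matter of fact, given this useful information, a lot of art students are down to make some effort and get into shape, he writes.",
    ]
    docs = [reference] + candidates
    tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]  # crude tokenizer

    # TF-IDF: cosine similarity between sparse bag-of-words vectors
    tfidf = TfidfVectorizer().fit_transform(docs)
    print("TF-IDF:", cosine_similarity(tfidf[0], tfidf[1:]))

    # BM25: each candidate scored against the reference used as the query
    bm25 = BM25Okapi(tokens[1:])
    print("BM25:  ", bm25.get_scores(tokens[0]))

    # GloVe (averaged): mean word vector per sentence, then cosine similarity
    glove = api.load("glove-wiki-gigaword-50")
    def avg_vec(words):
        vecs = [glove[w] for w in words if w in glove]
        return np.mean(vecs, axis=0)
    ref_vec = avg_vec(tokens[0])
    for cand in tokens[1:]:
        vec = avg_vec(cand)
        cos = np.dot(ref_vec, vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(vec))
        print("GloVe: ", round(float(cos), 2))

    # SBERT: cosine similarity between contextual sentence embeddings
    sbert = SentenceTransformer("all-MiniLM-L6-v2")
    emb = sbert.encode(docs)
    print("SBERT: ", util.cos_sim(emb[0], emb[1:]))

The GloVe + SIF column in the table below additionally reweights each word vector by its smooth inverse frequency and removes the first principal component, which may explain why its numbers differ so much from the plain average.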

Results

Sentence   Expected similarity   TF-IDF   BM25   GloVe (averaged)   GloVe + SIF   SBERT
A          medium/high           0.0      0.0    0.96               -0.29         0.39
B          medium/low            0.0      0.0    0.93               -0.63         0.17
C          low                   0.75     1.59   0.99                0.47         0.69

Unsurprisingly, the statistical approaches (TF-IDF and BM25) see zero similarity between the reference sentence and sentences A and B, and plenty of similarity between the reference and C, given that almost every content word in the reference also shows up, in some form, in C.

However, I confess I expected more from the pre-trained SBERT model. It wasn’t able to capture the essence of each sentence, especially when comparing the reference and C: although the two overlap on lexical grounds, their themes are significantly different.