Bruno Arine

findlike: a CLI tool for finding similar documents

I’ve just released findlike, a CLI tool that helps you find documents that are lexically similar to a reference file or an ad-hoc query.

ELI5

Imagine you have a document, and you want to find other documents on your computer that talk about similar things. You can use findlike to do that. It looks at the words and phrases used in your reference document or a specific question you have, and then finds other files that use similar language. It’s a bit like doing a Google search across your own files.

Features

  • Choose between Okapi BM25 and TF-IDF algorithms for the lexical similarity calculation between file contents
  • Recursive search option
  • Control over parameters like maximum number of results, whether to display similarity scores etc.
  • Optionally return results in JSON format
  • Multilingual support
  • Highly configurable, and can be used as a backend for other programs (e.g. personal knowledge management systems, Emacs, etc.)
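
To give a rough idea of what "lexical similarity" means here, below is a minimal sketch of TF-IDF weighting combined with cosine similarity, written in plain Python. It is not findlike's actual code: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`), the smoothed IDF formula and the sample notes are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption): rare terms weigh more than common ones.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(reference, candidates):
    """Sort candidate names by descending similarity to the reference text."""
    names = list(candidates)
    vectors = tfidf_vectors([reference] + [candidates[name] for name in names])
    ref_vec = vectors[0]
    scored = [(name, cosine(ref_vec, vec)) for name, vec in zip(names, vectors[1:])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical notes, standing in for files on disk.
notes = {
    "gardening.txt": "Tomato plants need water, sunlight and rich soil.",
    "emacs.txt": "Org-roam stores notes as plain text files inside Emacs.",
    "pkm.txt": "A personal knowledge management system links notes together.",
}
ranking = rank("How do I link my notes in a knowledge management system?", notes)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

The note that shares the most (and the rarest) words with the query ends up at the top, and a note with no words in common gets a score of zero.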

Installation

Please refer to the official repository for installation instructions and advanced usage.

Examples

Here are some examples of how it works. Suppose you have a directory full of documents, and you want to find the ones that are similar to a reference document. To run this search, type the following in your terminal:

$ findlike reference.txt

reference.txt
candidate_10.txt
candidate_07.txt
candidate_09.txt
candidate_08.txt
candidate_06.txt
candidate_04.txt
candidate_05.txt
candidate_03.txt
candidate_02.txt

It’s as easy as that.

You can also search other directories by passing -d /path/to/another/directory, limit the search to a certain file extension with -f "*.txt", or show the similarity scores with -s. If you are feeding the results to a third-party plugin, though, the most useful option is -F json, which formats the results as JSON. Example:

$ findlike reference.txt -F json -s -m 3 | jq

Output:

[
  {
    "score": 1.0000000000000002,
    "target": "reference.txt"
  },
  {
    "score": 0.6781150718947749,
    "target": "candidate_10.txt"
  },
  {
    "score": 0.6149717839789649,
    "target": "candidate_07.txt"
  }
]
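
If you are writing the plugin side in Python, consuming this output could look roughly like the sketch below. The helper names `parse_results` and `similar_to` are hypothetical, and I'm assuming that the JSON keeps the `score` and `target` keys shown above and that `target` is printed in the same form as the reference path you pass in. The flags are the ones used in the example command. The sample string is the output above, so the sketch runs even without findlike installed.

```python
import json
import subprocess


def parse_results(raw_json, exclude=None, min_score=0.0):
    """Turn findlike-style JSON output into a list of (target, score) tuples."""
    results = json.loads(raw_json)
    return [
        (item["target"], item["score"])
        for item in results
        if item["target"] != exclude and item["score"] >= min_score
    ]


def similar_to(reference, max_results=3):
    """Run findlike and parse its output (requires findlike on your PATH)."""
    command = ["findlike", reference, "-F", "json", "-s", "-m", str(max_results)]
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    # Drop the reference itself, which otherwise shows up as the best match.
    return parse_results(completed.stdout, exclude=reference)


# The sample output from above, pasted in as a string.
sample = """
[
  {"score": 1.0000000000000002, "target": "reference.txt"},
  {"score": 0.6781150718947749, "target": "candidate_10.txt"},
  {"score": 0.6149717839789649, "target": "candidate_07.txt"}
]
"""
matches = parse_results(sample, exclude="reference.txt", min_score=0.65)
print(matches)
```

Filtering out the reference file and applying a score threshold on the caller's side keeps the command line simple and leaves those decisions to each plugin.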

Installation instructions and the set of available options can be found in the official repository.

Motivation

Some say that the greatest benefit of having a personal knowledge management system (PKM) like Obsidian or org-roam is serendipity. You want these systems to surprise you: to make old notes resurface, and to help you discover connections between topics that you wouldn’t have thought possible.

So you’re adding a new note to your PKM system, and you’d like to connect it with your other notes. What if you have thousands of notes? Should you read them one by one? There should be a way to narrow down this search.

That had been my pet peeve ever since I started dabbling in the magical world of PKMs and digital gardens. I wanted the system to do that job for me, and show me the possible connections between my new note and the rest of my knowledge base.

I was not asking for much. Lexical algorithms like TF-IDF and BM25 have been around for a long while, and they have been solid enough to sustain lots of search systems in production. But I couldn’t find any program that implemented them for straightforward usage.

Because Emacs is my go-to text editor and PKM system (with the org-roam package), my first idea was to devise an Emacs package to do the work.

Though Lisp is a fascinating programming language, it doesn’t seem to be ideal for matrix operations. Therefore, I resorted to Python and its seasoned suite of numerical packages to do the heavy lifting. The resulting package looked for documents similar to the current buffer and returned an Org-mode-ready list of results. This is how org-similarity was born.

Mixing Lisp and non-Lisp code in the same Emacs package is not unheard of, so I was cool with that. (Even though handling virtual environments and Python package management from inside Emacs became my personal hell, and unsurprisingly, most of the tickets in the issue tracker were somehow related to that mess.)

Not to mention that the Python script that served as the backend engine for org-similarity couldn’t be reused anywhere else.

So I had enough reasons to separate concerns and turn the engine behind org-similarity into a standalone program. Thus findlike was born, a command-line tool that can be used anywhere.

What I love about command-line programs is that they are universal. You can use them in the shell as standalone programs, or incorporate them into your own scripts and plugins. Hell, you can even make them work for you when serving dynamic web content.

Caveats

TF-IDF and BM25 are solid lexical search algorithms, but they don’t identify semantic relations between documents. In other words, if you have a document about Hurricane Katrina, it’s possible that Bob Dylan’s “Hurricane” lyrics will show up as a possible match.
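
Here is a small sketch of that caveat using a hand-rolled Okapi BM25 scorer. Again, this is not findlike's implementation: the `bm25_scores` function, the `k1` and `b` defaults, the IDF variant and the three toy documents are assumptions chosen for the demonstration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        freqs = Counter(tokens)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            if term not in freqs:
                continue  # only exact word matches contribute
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * freqs[term] * (k1 + 1) / (freqs[term] + length_norm)
        scores[name] = score
    return scores


# Toy documents: one shares a word with the query, one shares only the topic.
docs = {
    "dylan.txt": "Hurricane is a protest song by Bob Dylan about the boxer Rubin Hurricane Carter.",
    "levees.txt": "The storm surge breached the levees and flooded most of New Orleans.",
    "recipe.txt": "Whisk the eggs, then fold in the flour and sugar.",
}
scores = bm25_scores("hurricane katrina", docs)
for name, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {name}")
```

The note about the song wins simply because it contains the word "hurricane", while the note about the flooded levees scores zero despite being about the same event.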

In future updates, I’ll probably include semantic similarity search as an option, provided the algorithm doesn’t take up too much space, is fast enough, and is compatible with the MIT license.