
Improving search relevance with word proximity

18 November 2024 at 01:00

My website search engine uses text search to identify documents relevant to a given term. Up until recently, the search engine treated every word in a term independently.

For example, consider the query “all too well”. Documents would be found that contain any of the words in the query. Then, the results would be ordered according to their lexical relevance, as measured by TF/IDF (which I replaced with BM25).

TF/IDF and BM25 do not account for the proximity of words in documents. This means that a document that mentions “all too well” directly would be treated the same as a document that mentions all three component words separately.
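
To make this concrete, here is a minimal BM25 sketch. It is not the code behind my search engine: the function name `bm25_scores`, the parameters `k1=1.5` and `b=0.75`, the whitespace tokenizer, and the sample documents are all assumptions chosen for illustration. It shows that two documents with the same words in a different order receive the same score.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "i remember it all too well",   # contains the exact phrase
    "well i remember it too all",   # same words, shuffled
    "notes about coffee",           # unrelated
]
scores = bm25_scores("all too well", docs)
# The first two scores are equal: BM25 only sees word counts and lengths.
print(scores)
```

Because the formula only uses term counts and document lengths, word order never enters the calculation.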

I have recently updated my site search engine to take into account word proximity when ranking documents.

The ranking process is as follows:

  1. Find candidate documents that contain words in a query.
  2. Calculate the BM25 scores for each document given the query.
  3. For each candidate document, identify if the words in the query appear directly in sequence at any point in the document.
  4. If words in a query appear together in a document, boost the rank of the document.
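
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not my actual implementation: the names `contains_phrase` and `rank_with_proximity` are hypothetical, the BM25 scores are passed in precomputed, and the multiplicative `boost=2.0` is a made-up value (the post does not say how large the boost is or how it is applied).

```python
def contains_phrase(tokens, phrase):
    """Return True if the phrase words appear directly in sequence in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def rank_with_proximity(query, docs, base_scores, boost=2.0):
    """Rank documents by base score, boosting those containing the exact phrase.

    docs: dict of document id -> text
    base_scores: dict of document id -> precomputed BM25 score (step 2)
    boost: hypothetical multiplier applied when the phrase is found (step 4)
    """
    phrase = query.lower().split()
    query_words = set(phrase)
    ranked = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # Step 1: keep only candidates that contain at least one query word.
        if not query_words & set(tokens):
            continue
        score = base_scores.get(doc_id, 0.0)
        # Steps 3 and 4: boost documents where the words appear in sequence.
        if len(phrase) > 1 and contains_phrase(tokens, phrase):
            score *= boost
        ranked.append((doc_id, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = {
    "a": "i know it all too well",
    "b": "all is well but too late",
    "c": "coffee notes",
}
base_scores = {"a": 1.0, "b": 1.5, "c": 0.0}
print(rank_with_proximity("all too well", docs, base_scores))
# [('a', 2.0), ('b', 1.5)]
```

In this toy example, document "b" has the higher base score, but document "a" contains the exact phrase and so moves to the top after the boost.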

With this process and the search “all too well”, a blog post that contains that exact phrase will be considered more relevant than one that contains the component words.

This approach has a few benefits for my blog search:

  1. If you paste in a blog post title to my search engine, the blog post with that title should show up first. This is because the word proximity boost pushes the blog post up to the top.
  2. Named entities with two or more words (e.g. “All Too Well”) will return more relevant results, because word proximity is considered when ranking.
  3. Generally, it is easier to find documents that contain a phrase.

Let’s walk through an example comparing the old and new algorithms.

We’ll use “all too well” as the example. With this query, the intent is to find my writings related to Taylor Swift’s song “All Too Well”.

For the query “all too well”, BM25 without a proximity boost returns:

  • Beyond Tellerrand 2022
  • Advent of Technical Writing: Facilitating Ideas
  • Announcing Tay Tay Lyric of the Day

The first two results are not related to the query.

For the query “all too well”, BM25 with a proximity boost returns the following as the most relevant results [1]:

  • Taylor Swift Subreddit Acronym Reference
  • Analyzing use of Taylor Swift song name acronyms on Reddit
  • Announcing Tay Tay Lyric of the Day

All three results above are related to the query.

In this comparison, the word proximity boost makes a clear difference: all of the top three results are related to the query, up from one of three without the boost.

In “How to find word collocations in a document”, I walked through how I implemented the logic to find phrases in a document using sets. I recommend reviewing the post if you are interested in learning how I implemented a solution to efficiently check if a document contains a multi-word phrase.
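
As a rough idea of what a set-based phrase check can look like, here is one possible approach. It may differ from what the linked post describes; the names `build_positions` and `contains_phrase` are hypothetical. The sketch records the set of positions at which each word occurs, then intersects those sets after shifting them by each word's offset in the phrase.

```python
from collections import defaultdict


def build_positions(text):
    """Map each word to the set of positions where it occurs in the text."""
    positions = defaultdict(set)
    for index, word in enumerate(text.lower().split()):
        positions[word].add(index)
    return positions


def contains_phrase(positions, phrase):
    """Return True if the phrase words occur consecutively, using set intersection."""
    words = phrase.lower().split()
    if not words:
        return False
    # Candidate start positions: everywhere the first word occurs.
    # Copy the set so the index itself is never modified.
    starts = set(positions.get(words[0], set()))
    for offset, word in enumerate(words[1:], start=1):
        # Shift each later word's positions back by its offset in the phrase,
        # so a match lines up with the start position of the phrase.
        starts &= {pos - offset for pos in positions.get(word, set())}
        if not starts:
            return False
    return bool(starts)


index = build_positions("I remember it all too well")
print(contains_phrase(index, "all too well"))  # True
print(contains_phrase(index, "too all well"))  # False
```

The position index can be built once per document and reused across queries, so each phrase check only costs a few set intersections.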

[1] Ironically, this blog post may become the highest ranked blog post for the query “all too well” because of how many times I use “all too well” in this post.