Improving search relevance with word proximity
My website search engine uses text search to identify documents relevant to a given term. Up until recently, the search engine treated every word in a term independently.
For example, consider the query “all too well”. Documents would be found that contain any of the words in the query. Then, the results would be ordered according to their lexical relevance, as measured by TF/IDF (which I have since replaced with BM25).
TF/IDF and BM25 do not account for the proximity of words in documents. This means that a document that mentions “all too well” directly would be treated the same as a document that mentions all three component words separately.
I have recently updated my site search engine to take word proximity into account when ranking documents.
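To make the limitation concrete, here is a minimal sketch using a textbook BM25 formula. This is not the code that powers my search engine: the function name `bm25_score`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions made for illustration. The two documents contain exactly the same words, but only one contains the phrase.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with a textbook BM25 formula."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    term_counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        # Number of documents in the corpus that contain the term.
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = term_counts[term]
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Hypothetical documents: same words, different order.
phrase_doc = "i know all too well how this ends".split()
scattered_doc = "well i know how all this ends too".split()
corpus = [phrase_doc, scattered_doc, "an unrelated post about coffee".split()]
query = "all too well".split()

phrase_score = bm25_score(query, phrase_doc, corpus)
scattered_score = bm25_score(query, scattered_doc, corpus)

# Word order never enters the formula, so both documents score the same.
print(phrase_score == scattered_score)
```

Because the formula only looks at term counts and document length, nothing in it can tell the two documents apart.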
The ranking process is as follows:
- Find candidate documents that contain words in a query.
- Calculate the BM25 scores for each document given the query.
- For each candidate document, identify if the words in the query appear directly in sequence at any point in the document.
- If words in a query appear together in a document, boost the rank of the document.
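The steps above could be sketched roughly as follows. This is an illustrative sketch, not my production code: the `PHRASE_BOOST` multiplier, the whitespace tokenizer, the BM25 parameters, and the sample documents are all assumptions, and the real boost may be applied differently.

```python
import math
from collections import Counter

PHRASE_BOOST = 2.0  # assumed multiplier; chosen only for this example


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25(query_terms, doc, docs, k1=1.5, b=0.75):
    """Textbook BM25 score of one tokenized document for the query."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    counts = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        tf = counts[term]
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score


def contains_phrase(doc, query_terms):
    """Return True if the query terms appear directly in sequence in the document."""
    n = len(query_terms)
    return any(doc[i:i + n] == query_terms for i in range(len(doc) - n + 1))


def rank(query, documents):
    query_terms = tokenize(query)
    tokenized = {title: tokenize(text) for title, text in documents.items()}
    all_docs = list(tokenized.values())
    # Step 1: candidates are documents containing at least one query word.
    candidates = {t: d for t, d in tokenized.items() if set(query_terms) & set(d)}
    results = []
    for title, doc in candidates.items():
        # Step 2: base BM25 score.
        score = bm25(query_terms, doc, all_docs)
        # Steps 3 and 4: boost documents that contain the exact phrase.
        if len(query_terms) > 1 and contains_phrase(doc, query_terms):
            score *= PHRASE_BOOST
        results.append((title, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical documents keyed by title.
documents = {
    "scattered": "all of us went too far and it did not end well",
    "phrase": "my notes on the song all too well and why i keep replaying it",
    "unrelated": "brewing coffee at home",
}
ranked = rank("all too well", documents)
print(ranked)
```

In this toy corpus the shorter "scattered" document has the higher plain BM25 score, so the phrase boost is what moves the "phrase" document to the top.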
With this process and the search “all too well”, a blog post that contains that exact phrase will be considered more relevant than one that contains the component words.
This approach has a few benefits for my blog search:
- If you paste in a blog post title to my search engine, the blog post with that title should show up first. This is because the word proximity boost pushes the blog post up to the top.
- Named entities with two or more words (e.g. “All Too Well”) will return more relevant results, because word proximity is considered when ranking.
- Generally, it is easier to find documents that contain a phrase.
Let’s walk through an example comparing the old and new algorithms.
We’ll use “all too well” as the example. With this query, the intent is to find my writings related to Taylor Swift’s song “All Too Well”.
For the query “all too well”, BM25 without a proximity boost returns:
- Beyond Tellerrand 2022
- Advent of Technical Writing: Facilitating Ideas
- Announcing Tay Tay Lyric of the Day
The first two results are not related to the query.
For the query “all too well”, BM25 with a proximity boost returns the following as the most relevant results [1]:
- Taylor Swift Subreddit Acronym Reference
- Analyzing use of Taylor Swift song name acronyms on Reddit
- Announcing Tay Tay Lyric of the Day
All three results above are related to the query.
Given the comparison above, the word proximity boost makes a clear difference for this query: all three top results are now on topic, compared with one of three before.
In “How to find word collocations in a document”, I walked through how I used sets to find phrases in a document. I recommend reading that post if you want to learn how I efficiently check whether a document contains a multi-word phrase.
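As a rough illustration of how sets can be used for this kind of check, here is one possible approach. It is an assumption about the general technique, not the implementation from that post: each word maps to the set of positions where it occurs, and a phrase matches wherever the position sets line up after shifting each by its offset in the phrase. The function names `build_position_index` and `phrase_starts` are hypothetical.

```python
from collections import defaultdict


def build_position_index(tokens):
    """Map each word to the set of positions at which it occurs."""
    index = defaultdict(set)
    for position, word in enumerate(tokens):
        index[word].add(position)
    return index


def phrase_starts(index, phrase_terms):
    """Return the set of positions where the phrase starts in the document."""
    if not phrase_terms:
        return set()
    starts = None
    for offset, term in enumerate(phrase_terms):
        # Shift each position back by the term's offset in the phrase, so that
        # every term of a real match maps to the same start position.
        shifted = {p - offset for p in index.get(term, set())}
        starts = shifted if starts is None else starts & shifted
        if not starts:
            return set()
    return starts


tokens = "i remember it all too well and all was well".split()
index = build_position_index(tokens)

print(phrase_starts(index, ["all", "too", "well"]))  # {3}
print(phrase_starts(index, ["all", "well"]))         # set()
```

A non-empty result means the document contains the phrase, which is the signal the proximity boost needs.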
[1] Ironically, this blog post may become the highest ranked blog post for the query “all too well” because of how many times I use “all too well” in this post.