In an earlier article, we reviewed the definition of relevance, which consists of two other metrics: recall and precision. The challenge, as that article also explained, is that there is an inverse relation between recall and precision. Increasing recall by making matching more flexible also introduces more false positives, thus reducing precision.
In a later article, I introduced a strategy for overcoming that challenge by redefining the traditional definition of relevance. The strategy involves improving recall incrementally and then ensuring that the most relevant results are shown at the top and in the appropriate order before improving recall any further.
In this article, I will demonstrate a concrete but basic example of implementing that strategy by introducing the concept of degrees of similarity. Degrees of similarity refers to the multiple ways a piece of text can be similar to another text. By searching with different degrees of similarity we increase recall, and we can also tune precision based on the number and types of degrees of similarity that match.
Degrees of Similarity and Text Analysis
Different degrees of similarity can be created by analyzing the same text in different ways and creating different field variations.
Keyword Analysis
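As a rough illustration, assuming this refers to Elasticsearch's built-in keyword analyzer, the whole input is emitted as one unmodified token, so only exact matches succeed. The sketch below simulates that behavior in plain Python; the function names are hypothetical and nothing here calls Elasticsearch itself.

```python
from typing import List


def keyword_analyze(text: str) -> List[str]:
    """Toy stand-in for keyword analysis: the whole input is one token."""
    return [text]


def keyword_match(query: str, field_value: str) -> bool:
    """A keyword match requires the query and the field value to be identical."""
    return keyword_analyze(query) == keyword_analyze(field_value)


# Identical text matches; a difference in case alone prevents a match.
print(keyword_match("Quick Brown Fox", "Quick Brown Fox"))
print(keyword_match("quick brown fox", "Quick Brown Fox"))
```

This is the strictest degree of similarity: it contributes the least recall and the most precision.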
Standard Analysis
The standard analyzer is the default analyzer used for text fields if no other analyzer is specified. At a high level it performs the following analysis:
Breaks text into tokens based on word separators (e.g. whitespace, periods)
Converts all text to lowercase
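The two steps above can be approximated in a few lines of Python. This is only a sketch: the real standard analyzer follows Unicode text segmentation rules, which the regular expression here merely imitates for simple English text, and the function name is hypothetical.

```python
import re
from typing import List


def standard_analyze(text: str) -> List[str]:
    """Toy approximation: split on non-word characters, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Punctuation and whitespace separate tokens; case differences disappear.
print(standard_analyze("The Quick brown-fox. Jumps!"))
```

Because case and punctuation no longer matter, this variation matches more text than keyword analysis does.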
Stemming Analysis
Stemming is the process of reducing a word down to its root (stem) based on a given language. For example, the root for the word "testing" is "test" and the root for the word "computer" is "comput". Searching with stemming is more flexible since it matches different variations of a word, such as different tenses or the singular and plural forms of a noun.
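To make the idea concrete, here is a deliberately crude suffix stripper. It is not the algorithm Elasticsearch uses and it handles only a handful of English suffixes in a single pass; the suffix list, the minimum stem length and the function name are all assumptions chosen for illustration.

```python
# Suffixes checked in order; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ed", "er", "es", "s")
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Different forms of the same word collapse to one stem.
for word in ("testing", "tested", "tests", "test", "computer"):
    print(word, "->", toy_stem(word))
```

Since all forms share one stem, a query for any of them matches documents containing any other.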
Stemming is applied using a stemmer token filter that is appropriate for the language of the text. The stemmer token filter can be added to another analyzer at the end of its filter chain. The example below shows how to add English stemming on top of the standard analyzer:
(Coming Soon)
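A minimal sketch of what such an analyzer definition might look like, written as a Python dict that serializes to the JSON body of an index creation request. It assumes the standard analyzer can be rebuilt as a custom analyzer from the standard tokenizer and the lowercase filter. The names english_stemmer and standard_stemmed are hypothetical.

```python
import json

# Hypothetical names: "english_stemmer" (filter), "standard_stemmed" (analyzer).
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Stemmer token filter configured for English.
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "standard_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # The stemmer goes last, after lowercasing.
                    "filter": ["lowercase", "english_stemmer"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```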
Putting It all Together
Applying different analyzers to the same field is a natural fit for the fields property (formerly known as multi-fields). The mapping below shows how to analyze a single field using keyword, standard and stemming analysis:
(Coming Soon)
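One possible shape for such a mapping, again as a Python dict that serializes to JSON. Several things here are assumptions: the field name title, the sub-field names std and stem, a custom analyzer named standard_stemmed defined in the index settings, and a recent Elasticsearch version with typeless mappings (older versions nest the properties under a type name).

```python
import json

# Hypothetical field "title" with sub-fields "std" and "stem".
index_mappings = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                # Main field: the whole value as a single token.
                "analyzer": "keyword",
                "fields": {
                    # Same text, tokenized and lowercased.
                    "std": {"type": "text", "analyzer": "standard"},
                    # Same text, tokenized, lowercased and stemmed.
                    "stem": {"type": "text", "analyzer": "standard_stemmed"},
                },
            }
        }
    }
}

print(json.dumps(index_mappings, indent=2))
```

With this layout the three variations would be queried as title, title.std and title.stem.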
Degrees of Similarity and Querying
The strategy requires a query approach that meets the following requirements:
Must allow querying a field analyzed with different analyzers
The more analyzers match, the better the score
Each analyzer can be tuned separately
All of these requirements can be accomplished using a multi-match query because:
It allows querying multiple fields (requirement #1)
Can be configured with the type most_fields so that the scores from each field variation are added together (requirement #2)
Allows for explicit boosting on field variations (requirement #3)
The query would look similar to the example below:
(Coming Soon)
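A sketch of how such a request body might be assembled, assuming field variations named title, title.std and title.stem (hypothetical names). The helper build_query is also hypothetical; it only builds the dict and does not send anything to Elasticsearch.

```python
import json
from typing import Any, Dict, List


def build_query(text: str, fields: List[str]) -> Dict[str, Any]:
    """Build a multi_match body that combines scores across field variations."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": fields,
            }
        }
    }


query = build_query("quick brown fox", ["title", "title.std", "title.stem"])
print(json.dumps(query, indent=2))
```

Each entry in fields can carry its own boost using the field^boost syntax, for example title.stem^0.5.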
Implicit and Explicit Boosting
By using the most_fields type, a result is implicitly boosted when multiple variations of a field match. However, that is not always desirable since some field variations may overlap with each other. In such cases, it is better to reduce the boost of the similar field variations to avoid aggressively boosting low-scoring matches that happen to match multiple but similar variations. Additionally, less weight should be given to field variations with more flexible analyzers. For example, a match against word stems is not as good a match as one against fields analyzed with the standard analyzer.
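The effect can be illustrated with a toy model that simply sums the boosted score of every matching field variation. The per-field scores and field names below are made up for illustration; real scores come from the search engine's similarity function.

```python
from typing import Dict


def most_fields_score(field_scores: Dict[str, float], boosts: Dict[str, float]) -> float:
    """Sum the boosted score of every field variation that matched."""
    return sum(score * boosts.get(field, 1.0) for field, score in field_scores.items())


# Hypothetical scores: doc_a matches two overlapping variations,
# doc_b matches only the stemmed variation.
doc_a = {"title.std": 0.75, "title.stem": 0.75}
doc_b = {"title.stem": 1.0}

equal_boosts = {"title.std": 1.0, "title.stem": 1.0}
tuned_boosts = {"title.std": 1.0, "title.stem": 0.5}

for label, boosts in (("equal", equal_boosts), ("tuned", tuned_boosts)):
    print(label, most_fields_score(doc_a, boosts), most_fields_score(doc_b, boosts))
```

With equal boosts, doc_a benefits fully from matching two similar variations. Lowering the boost of the stemmed variation shrinks both the implicit boost and the weight of stem-only matches.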
Degrees of Similarity and Tuning
In our specific example, the degrees of similarity overlap and the boosting needs to be adjusted explicitly:
Keyword matches will always also match the standard and stem field variations
Standard matches will always also match the stem field variation
The query below shows how the boosting for the different degrees of similarity was adjusted:
(Coming Soon)
The boosting expressed above translates to:
Keyword matches are boosted by 1.5
Standard-only matches are boosted by 1
Stem-only matches are boosted by 0.5
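One way to arrive at these totals, assuming scores are summed across matching variations and ignoring differences in the per-field base scores, is a boost of 0.5 on every variation. Those per-field boosts and the title field names are assumptions for illustration, not values taken from the original query.

```python
from typing import Dict, List

# Assumed per-field boosts (hypothetical, not from the original query).
FIELD_BOOSTS: Dict[str, float] = {"title": 0.5, "title.std": 0.5, "title.stem": 0.5}

# Field variations hit by each kind of match, following the overlap
# described above: keyword implies standard, standard implies stem.
IMPLIED_MATCHES: Dict[str, List[str]] = {
    "keyword": ["title", "title.std", "title.stem"],
    "standard-only": ["title.std", "title.stem"],
    "stem-only": ["title.stem"],
}


def effective_boost(kind: str) -> float:
    """Total boost a match accumulates across all variations it hits."""
    return sum(FIELD_BOOSTS[field] for field in IMPLIED_MATCHES[kind])


# The "fields" list as it would appear in a multi_match query.
fields = [f"{name}^{boost}" for name, boost in FIELD_BOOSTS.items()]
print(fields)

for kind in IMPLIED_MATCHES:
    print(kind, effective_boost(kind))
```

Because the variations are nested, lowering a single per-field boost changes the total for every kind of match that includes it.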
Next Steps
The approach described in this article can be easily expanded with more degrees of similarity by adding more field variations with different analyzers. In future articles, I will explain how to further increase the recall and precision of this basic example by adding fuzzy phrase matching.