Introduction
People click on top items in search and recommendations more often because they are on top, not because of their relevancy. If you order your search results with an ML model, they may eventually degrade in quality because of such a positive self-reinforcing feedback loop. How can this problem be solved?
Search ranking and recommendation systems rely on large amounts of data to train machine learning models, such as Learning-to-Rank (LTR) models, that rank results for a given query. Implicit user feedback (e.g. click data) has become the dominant data source because it is abundant and cheap to collect, particularly for major Internet companies. One disadvantage of this collection method is that the data may be highly skewed. One of the most prominent biases is position bias, which occurs when users are inclined to click on higher-ranked results.
In this article, we’re going to discuss the following topics:
- Which types of bias exist, and how can you measure them?
- Overcoming position bias with Inverse Propensity Weighting, and the downsides of this approach.
- Position-Aware Learning: a way to teach your ML model to account for bias during training.
This article was published as a part of the Data Science Blogathon.
Biases in Ranking
Every time you present a list of things, such as search results or recommendations (or autocomplete suggestions and contact lists), to a human being, they can hardly ever evaluate all the items in the list impartially.
A cascade click model assumes that people evaluate the items in the list sequentially until they find a relevant one. This means that items at the bottom have a smaller chance of being evaluated at all, and hence will organically receive fewer clicks.
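To make the cascade assumption concrete, here is a minimal sketch in Python. It assumes a simplified model in which the user scans top-down and stops at the first click; the function name and the attractiveness values are illustrative, not taken from any dataset.

```python
def cascade_click_probabilities(attractiveness):
    """Return per-position click probabilities under a simple cascade model.

    attractiveness[k] is the assumed probability that a user who examines
    position k clicks on it. The user scans top-down and stops at the
    first click.
    """
    examine = 1.0  # probability that the user reaches the current position
    clicks = []
    for a in attractiveness:
        clicks.append(examine * a)
        examine *= 1.0 - a  # the user continues only if they did not click
    return clicks


# Five equally attractive items: any difference in clicks comes from position alone.
probs = cascade_click_probabilities([0.3] * 5)
for position, p in enumerate(probs, start=1):
    print(f"position {position}: click probability {p:.3f}")
```

Even though all five items are equally attractive here, the click probability drops from 0.300 at position 1 to roughly 0.072 at position 5, purely because fewer users get that far.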
Top items receive more clicks only because of their position; this behavior is called position bias. However, position bias is not the only bias in item lists. There are plenty of other dangerous things to watch out for:
- Presentation bias: For example, in a 3×3 grid layout, an item in position #4 (right under the top item #1) may receive more clicks than item #3 in the corner.
- Model bias: When you train an ML model on historical data generated by the same model.
- Some more obscure biases like clickbait, duration, and popularity – see an excellent overview of such biases for details.
In practice, position bias is the strongest one, and removing it during training may improve your model's reliability.
Experiment: Measuring Position Bias
We conducted a small crowd-sourced study of position bias. Starting from the RankLens dataset, we used the Google Keyword Planner tool to generate a set of queries for finding a particular movie.
With a set of movies and corresponding actual queries, we have a perfect search evaluation dataset: all items are well known to a wide audience, and we know the correct labels in advance.
All major crowd-sourcing platforms like Amazon Mechanical Turk and Toloka.ai have out-of-the-box templates for typical search evaluation.
But there’s a nice trick in such templates, preventing you from shooting yourself in the foot with position bias: each item must be examined independently. Even if multiple items are present on screen, their ordering is random!
But does random item order prevent people from clicking on the first results?
The raw data for the experiment is available on github.com/metarank/msrd. The main observation is that people still click more on the first position, even on randomly-ranked items!
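A quick way to check for this effect in your own logs is to compute the share of clicks per position. The sketch below uses a hypothetical toy list of clicked positions, not the MSRD data, and the function name is our own. If item order is random and there is no position bias, you would expect the shares to be roughly equal.

```python
from collections import Counter


def click_share_by_position(click_positions):
    """Return {position: share of all clicks} for a list of clicked positions."""
    counts = Counter(click_positions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: counts[pos] / total for pos in sorted(counts)}


# Hypothetical toy log: one entry per click, holding the 1-based clicked position.
toy_clicks = [1, 1, 1, 2, 1, 3, 2, 1, 4, 2]
shares = click_share_by_position(toy_clicks)
for pos, share in shares.items():
    print(f"position {pos}: {share:.0%} of clicks")
```

In this toy log half of the clicks land on position 1, which would be a strong hint of position bias if the underlying order were random.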
Inverse Propensity Weighting
But how can you offset the impact of position on implicit feedback you get from clicks? Each time you measure the click probability of an item, you observe the combination of two independent variables:
- Bias: The probability of clicking on a specific position in the list.
- Relevance: The importance of the item within the current context (like a BM25 score coming from Elasticsearch, or cosine similarity in recommendations).
In the MSRD dataset mentioned in the previous section, it's hard to separate the impact of position from BM25 relevance, as you only observe them combined.
For example, 18% of clicks happen on position #1. Does this only happen because we have the most relevant item presented there? Will the same item on position #20 get the same amount of clicks?
The Inverse Propensity Weighting approach suggests that the observed click probability on a position is just a combination of these two independent variables.
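This assumption is often called the examination hypothesis. One common way to write it is as a product. The notation is an assumption of this sketch: $C$ is a click, $E$ means the user examined the position, $R$ means the item is relevant, $q$ is the query, $d$ the item, and $k$ the position.

```latex
P(C = 1 \mid q, d, k)
  = \underbrace{P(E = 1 \mid k)}_{\text{bias (propensity)}}
    \cdot
    \underbrace{P(R = 1 \mid q, d)}_{\text{relevance}}
```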
Then, if you estimate the click probability on each position (the propensity), you can weight each observed click by the inverse of that propensity and get an unbiased estimate of relevance.
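As a rough illustration of the weighting step, the sketch below divides each click by the propensity of the position it happened on. The propensity values, the log, and the `max_weight` cap (a common variance-reduction trick) are all assumptions made for this example.

```python
from collections import defaultdict


def ipw_relevance(log, propensity, max_weight=10.0):
    """Estimate per-item relevance from (item, position, clicked) records.

    Each click is weighted by 1 / propensity[position], capped at max_weight
    to limit the variance coming from rarely examined positions.
    """
    weighted_clicks = defaultdict(float)
    impressions = defaultdict(int)
    for item, position, clicked in log:
        impressions[item] += 1
        if clicked:
            weighted_clicks[item] += min(1.0 / propensity[position], max_weight)
    return {item: weighted_clicks[item] / impressions[item] for item in impressions}


# Hypothetical propensities: probability that a position is examined at all.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# Hypothetical log: item "a" is always shown at position 1, item "b" at position 3.
log = (
    [("a", 1, True)] * 4 + [("a", 1, False)] * 6
    + [("b", 3, True)] * 2 + [("b", 3, False)] * 8
)
naive_ctr = {"a": 4 / 10, "b": 2 / 10}
debiased = ipw_relevance(log, propensity)
print(naive_ctr, debiased)
```

By raw CTR, item "a" looks twice as good as "b". After weighting, "b" scores 0.8 against 0.4 for "a", because its clicks were collected on a position that is examined far less often. Note that weighted estimates are not probabilities and can exceed 1.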
But how can you estimate the propensity in practice? The most common method is introducing a minor shuffling to rankings so that the same items within the same context (e.g., for a search query) will be evaluated on different positions.
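A minimal sketch of the estimation step follows. It assumes a simplified setup in which the order of the logged impressions was fully randomized, so that average relevance is the same at every position and any difference in CTR can be attributed to position. Real systems usually shuffle much less aggressively, and the toy log here is hypothetical.

```python
from collections import defaultdict


def estimate_propensity(shuffled_log):
    """Estimate relative examination propensity per position.

    shuffled_log holds (position, clicked) pairs from impressions whose
    order was randomized. Propensities are normalized so that the top-most
    observed position has propensity 1.0.
    """
    clicks = defaultdict(int)
    shown = defaultdict(int)
    for position, clicked in shuffled_log:
        shown[position] += 1
        clicks[position] += int(clicked)
    ctr = {pos: clicks[pos] / shown[pos] for pos in shown}
    top = ctr[min(ctr)]  # CTR of the top-most observed position
    if top == 0:
        raise ValueError("no clicks at the top position; cannot normalize")
    return {pos: ctr[pos] / top for pos in sorted(ctr)}


# Hypothetical randomized impressions: 100 per position.
toy_log = (
    [(1, True)] * 40 + [(1, False)] * 60
    + [(2, True)] * 20 + [(2, False)] * 80
    + [(3, True)] * 10 + [(3, False)] * 90
)
propensity = estimate_propensity(toy_log)
print(propensity)
```

With this toy log, position 2 is estimated to be examined half as often as position 1, and position 3 a quarter as often.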
But adding extra shuffling will definitely degrade your business metrics like CTR and Conversion Rate. Are there any less invasive alternatives not involving shuffling?
Position-Aware Learning
A position-aware approach to ranking suggests asking your ML model to optimize both ranking relevancy and position impact at the same time:
- At training time, you use item position as an input feature.
- At prediction time, you replace it with a constant value.
In other words, you let your ranking ML model learn how position affects clicks during training, but neutralize this feature during prediction: all the items are scored as if they were presented in the same position.
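The trick can be sketched with a tiny logistic regression written in plain Python. This is an illustration of the idea under simplifying assumptions, not the model from the PAL paper or from any library: the training data is synthetic, the click rate is assumed to be relevance multiplied by a position decay, and `FIXED_POSITION` is a hypothetical choice of constant.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, targets, steps=20000, lr=0.3):
    """Fit weights [bias, w_relevance, w_position] by batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (rel, pos), y in zip(rows, targets):
            x = (1.0, rel, pos)
            err = sigmoid(w[0] + w[1] * rel + w[2] * pos) - y
            for j in range(3):
                grad[j] += err * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, rel, pos):
    return sigmoid(w[0] + w[1] * rel + w[2] * pos)


# Synthetic training data: observed click rate = relevance * position decay.
rows = [(rel, pos) for rel in (0.2, 0.4, 0.6, 0.8) for pos in (1, 2, 3, 4, 5)]
targets = [rel * 0.7 ** (pos - 1) for rel, pos in rows]

# Training time: position is a regular input feature.
w = train_logistic(rows, targets)

# Prediction time: every candidate is scored as if shown at the same position.
FIXED_POSITION = 1
candidates = {"a": 0.4, "b": 0.8, "c": 0.2}
scores = {item: predict(w, rel, FIXED_POSITION) for item, rel in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(w, ranking)
```

The learned position weight comes out negative, so the model has absorbed the "lower positions get fewer clicks" effect into that feature instead of blaming the items. In a linear model like this one the constant does not change the final order; with tree-based rankers such as LambdaMART, where position can interact with other features, the choice of constant matters more.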
But which constant value should you choose? The authors of the PAL paper ran a couple of numerical experiments on selecting the optimal value. The rule of thumb is not to pick a position that is too high, as there's too much noise there.
Practical PAL
The PAL approach is already a part of multiple open-source tools for building recommendation and search systems:
- ToRecSys implements PAL as a bias-elimination approach to train recommender systems on biased data.
- Metarank can use a PAL-driven feature to train an unbiased LambdaMART Learning-to-Rank model.
As the position-aware approach is just a hack around feature engineering, in Metarank it is only a matter of adding yet another feature.
On the MSRD dataset mentioned above, such a PAL-inspired ranking feature has quite a high importance value compared to other ranking features.
Conclusion
The position-aware learning approach is not limited to pure ranking tasks and position de-biasing. You can use the same trick to overcome other types of bias:
- For presentation bias due to a grid layout, you can introduce a pair of features for an item's row and column position during training, then replace them with constants during prediction.
- For model bias, when items presented more often receive more clicks, you can introduce a “number of clicks” training feature and replace it with a constant value at prediction time.
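Both cases boil down to the same pattern: keep the bias features at training time and overwrite them with constants at serving time. Here is a minimal sketch; the feature names (`row`, `col`, `past_clicks`, `bm25`) and the chosen constants are hypothetical, and a real pipeline would use its own schema.

```python
def grid_position(index, columns=3):
    """Map a 0-based slot index to (row, column) in a grid with `columns` columns."""
    return divmod(index, columns)


def neutralize(features, constants):
    """Return a copy of `features` with bias features overwritten by constants."""
    return {**features, **constants}


# Hypothetical constants used for all items at prediction time.
BIAS_CONSTANTS = {"row": 0, "col": 0, "past_clicks": 0}

# Slot #4 of a 3x3 grid (0-based index 3) sits right under the top-left slot.
row, col = grid_position(3)
training_features = {"bm25": 12.5, "row": row, "col": col, "past_clicks": 42}
serving_features = neutralize(training_features, BIAS_CONSTANTS)
print(training_features, serving_features)
```

The training row keeps the real grid coordinates and click count, while the serving row carries the same relevance feature with every bias feature pinned to its constant.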
The ML model trained with the PAL approach should produce an unbiased prediction. Considering the simplicity of the PAL approach, it can also be applied in other areas of ML where biased training data is common.
While conducting this research, we made the following main observations:
- Position bias can be present even in datasets collected with a randomized item order.
- Shuffling-based approaches like IPW can overcome the problem of bias, but the extra jitter they introduce in rankings may cost you a lot by lowering business metrics like CTR.
- Position-aware learning makes your ML model learn the impact of bias, improving prediction quality.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.