Recommendation system using BERT embeddings

Published in

Analytics Vidhya

4 min read · Mar 8, 2021

When you open any social media platform, you are likely to see a lot of recommendations such as “Suggested for you”. These recommendations mostly reflect your current interests, or something that could interest you in the future based on similar interests you have shown before. This sums up the two major methods by which most companies recommend new products to their customers:

Content based filtering: Recommendations from this method correlate strongly with your subjects of interest and their attributes. For example, if you like Arsenal football club and their content on YouTube, you are more likely to see suggestions like AFTV, Premier League etc., since all of these share attributes such as football and Arsenal.
Collaborative filtering: This is recommendation based on multiple users and their interests. For example, suppose your friend likes Manchester United, Real Madrid and Ligue 1 while you like Arsenal, Barcelona and Bundesliga. There is a high chance that your friend could receive suggestions related to Bayern Munich while you might receive recommendations involving PSG, because both of you like football and each of you is shown what the other, similar user likes.

So here I have tried to create a content based recommendation system on the YouTube trending videos dataset, acquired from the following Kaggle source: Trending videos 2021, of which I have used only the UK version. The dataset contains features such as title, description, view count, likes etc. Below is a sample of the dataset:

For any machine learning task involving text, one has to process the text and convert it into numbers for the machine to interpret. Since we will be using only the title column of the dataset for our task, we will do some basic preprocessing such as removing special characters and lowercasing. The snippet below executes the required preprocessing steps.

text-preprocessing
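
A minimal sketch of the preprocessing described above, assuming it amounts to lowercasing and replacing special characters with spaces. The helper name clean_title and the two sample titles are made up for illustration and are not taken from the dataset.

```python
import re


def clean_title(title: str) -> str:
    """Lowercase a title and strip special characters (illustrative helper)."""
    text = title.lower()
    # Replace anything that is not a letter, digit or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


# Made-up titles, only to show the effect of the cleaning step.
titles = ["Arsenal vs. Spurs | HIGHLIGHTS!!", "Taylor Swift - willow (Official Video)"]
cleaned = [clean_title(t) for t in titles]
print(cleaned)
```

On a pandas DataFrame the same helper could be applied to the title column with something like df["title"].apply(clean_title), assuming the column is named title.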

For creating embeddings we will be using pretrained BERT models, which are hosted on TensorFlow Hub and can be downloaded for fine-tuning, transfer learning etc. Please visit tf-hub for more instructions on how to use the various models. Here I have used the uncased BERT preprocessing model, which handles tokenization and input formatting for the encoder, and a smaller version of uncased BERT (small_bert) to create the corresponding embedding vector for each title in our dataset. The resulting output contains both a pooled output for the whole sequence/title and an output for each token in the sequence. Here we will be using only the pooled outputs, both to reduce the computation required and because we are using the pretrained model as is, in an unsupervised fashion, without fine-tuning. The code snippet for the same is given below:

Embedding generation for each sentence.
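
A rough sketch of how the pooled embeddings could be generated with TensorFlow Hub. The two model handles below are assumptions (the article does not say which small_bert variant was used), and the function names load_bert_embedder and embed_titles are hypothetical. TensorFlow is imported lazily, so the batching helper can be tried without it installed.

```python
import numpy as np

# Assumed TF Hub handles; swap in whichever small_bert variant you prefer.
PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"


def load_bert_embedder(preprocess_url=PREPROCESS_URL, encoder_url=ENCODER_URL):
    """Return a function mapping a list of strings to pooled BERT vectors."""
    # Imported here so the rest of the sketch runs without TensorFlow.
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text  # noqa: F401  (registers ops the preprocessor needs)

    preprocess = hub.KerasLayer(preprocess_url)
    encoder = hub.KerasLayer(encoder_url)

    def embed(batch):
        outputs = encoder(preprocess(tf.constant(batch)))
        # "pooled_output" holds one vector per title;
        # "sequence_output" would hold one vector per token.
        return outputs["pooled_output"].numpy()

    return embed


def embed_titles(titles, embed_fn, batch_size=64):
    """Embed titles in batches and stack the results into one (n, dim) array."""
    chunks = [embed_fn(titles[i:i + batch_size])
              for i in range(0, len(titles), batch_size)]
    return np.vstack(chunks)


# Typical use (downloads the models on first call):
# title_vectors = embed_titles(cleaned_titles, load_bert_embedder())
```

Batching keeps memory use bounded when the dataset has many titles; the batch size of 64 is an arbitrary choice.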

As you can see above, I have generated encodings for all the titles present in the dataset. Next we need to create encodings for our words of interest and find the similarity between them and the title encodings. I have used cosine similarity for this. In simple words, cosine similarity is the inner product of two vectors divided by the product of their lengths, i.e. the cosine of the angle between them; the higher the value, the more similar the two vectors are. So now let's query the dataset using our various interests and rank the titles by their cosine similarity scores.
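
The ranking step can be sketched with NumPy as follows. The function names are hypothetical, and the toy 2-D vectors stand in for real BERT embeddings. The query vector is assumed to come from embedding the interest text in the same way as the titles.

```python
import numpy as np


def cosine_similarity(query_vec, title_vecs):
    """Cosine similarity between one query vector and each row of title_vecs."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    title_units = title_vecs / np.linalg.norm(title_vecs, axis=1, keepdims=True)
    # Dot product of unit vectors equals the cosine of the angle between them.
    return title_units @ query_unit


def recommend(query_vec, title_vecs, titles, top_k=5):
    """Return the top_k (title, score) pairs, highest similarity first."""
    scores = cosine_similarity(query_vec, title_vecs)
    order = np.argsort(scores)[::-1][:top_k]
    return [(titles[i], float(scores[i])) for i in order]


# Toy data: made-up titles and 2-D vectors instead of real embeddings.
titles = ["football highlights", "pop music video", "match preview"]
title_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
query_vec = np.array([1.0, 0.0])

top = recommend(query_vec, title_vecs, titles, top_k=2)
print(top)  # "football highlights" ranks first, then "match preview"
```

Normalizing both sides first means a plain matrix-vector product yields all the scores in one go.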

Let us assume our interests are: Action, Hollywood, Thrillers and look at the corresponding recommendations from the model

The results from the model seem pretty satisfactory, with some related to movie trailers and documentaries and some with shadowy content

Now let us see the same for the following interests : Arsenal, Europa league, Premier league

Yep the results are highly relevant here

Let us check one more : Music, Taylor Swift, Imagine Dragons

Surely our model works like a peach :) here. The recommendations made by it are highly reasonable, and it could do wonders with more customization.

So guys, here we have created our own recommendation system using YouTube titles. These videos are only the ones that were trending in the UK; we could do better with more varied data, and by recommending channels rather than videos directly. Thank you for spending your valuable time reading through this. I will be back with something new soon. Please do share the article with others if you liked it.


Written by Vishnu Nandakumar