A quick look at Recommendation Systems

Ogulcan Ertunc
7 min read · Mar 17, 2021
Photo by Dollar Gill on Unsplash

Even when we are not aware of them, recommendation systems are at work throughout our daily lives, responding almost instantly and guiding us when needed. At their core, recommendation systems are built to predict what users might like or need and to present those items when there are many options available.

When there are too many options, they run in the background and index the items ahead of time, so they can present suggestions quickly instead of processing everything on request and keeping the user waiting.

Although building a simple recommendation system is considered fairly easy, the hard part is building one that is logical, influences the user, and adds value or earns money. For example, why do examples of recommendation systems always point to Amazon, Google, Netflix, or Spotify? Because these companies have invested in this area so consistently that it both improves the quality of the service they provide and brings them a return.

Photo by Charles Deluvio on Unsplash

Recommendation systems cover a wide range of fields and can be built with a variety of techniques, from simple to highly complex (e.g., deep learning or models that use unstructured data). As such, they are well suited to the artificial intelligence world, and more specifically to unsupervised learning; however, we can also do this, to a certain extent, in a rule-based manner. As users continue to consume content and provide more data, these systems can be built to give better and better recommendations.

Now that we have a short introduction to the topic, let’s look at the basic types of recommendation systems:

  • Simple recommendations
    - General recommendations made with business knowledge or simple techniques.
    - Systems that recommend the top-rated items in a category, bestsellers, or trending items.
  • Association Rule Learning
  • Content-Based (Filtering)
    - Neighborhood Methods
    - Correlation, Distance, etc...
    - Recommendations drawn from users who generally have similar preferences.
  • Collaborative Filtering
    - User-Based
    - Item-Based
  • and possibly hybrid recommendation systems.
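
As a minimal sketch of the "simple recommendations" idea, the snippet below recommends the best-rated titles among those with enough votes. The toy data, the function name top_rated, and the min_votes parameter are my own assumptions for illustration, not part of the project.

```python
import pandas as pd

# Toy ratings; the column names mirror MovieLens but the values are invented.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 3, 4],
    "title":  ["A", "B", "A", "C", "A", "B", "C", "C"],
    "rating": [5.0, 3.0, 4.0, 2.0, 5.0, 4.0, 3.0, 5.0],
})

def top_rated(ratings, min_votes=2, n=2):
    """Return the n titles with the highest mean rating among titles with enough votes."""
    stats = ratings.groupby("title")["rating"].agg(["count", "mean"])
    popular = stats[stats["count"] >= min_votes]
    return popular.sort_values("mean", ascending=False).head(n).index.tolist()

print(top_rated(ratings))  # ['A', 'B']
```

The vote threshold keeps a title with a single 5-star rating from outranking titles that many people rated well.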

Each of these can be very useful depending on where it is applied, but it would not be wrong to say that the two most important are Content-Based and Collaborative Filtering. A more detailed explanation of these two types is therefore helpful.

Content-Based

Photo by Austin Chan on Unsplash

Content-based systems make recommendations based on the user’s own history, so the success rate is often poor when there is not enough data for the user. As the amount of data belonging to the user grows, the system's success improves. We can divide content-based systems into four types:

  • By Content:
    This is the best-known and most basic content-based recommendation system. It focuses on suggesting content similar to a product or item the user previously preferred. For this to work, the product/service must carry content information (descriptive attributes).
  • Popular Content:
    This approach highlights a product/service by using its popularity and special features to make it attractive to a wide audience. It can be built with attributes like hype, price, features, and popularity. It is generally preferred for new content.
  • Latent (Hidden) Factor Modeling:
    Unlike the content similarity approach, it tries to uncover the user’s personal interests, habits, and shopping style, producing information focused on individual characteristics. For this, information about the user must be gathered from different angles.
  • Topic (Subject) Modeling:
    Essentially, this is a version of Latent Factor Modeling focused on a specific purpose. Interests can be extracted by analyzing unstructured text to identify certain characteristics of the user, for example by examining reviews/comments instead of the service/product purchase history.
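
To make the "By Content" idea concrete, here is a small sketch that describes each item by genre flags and suggests the items closest to one the user liked. The items, the genre flags, and the choice of cosine similarity are assumptions made for illustration; any similarity measure over content attributes could be used instead.

```python
import numpy as np

# Hypothetical items described by binary genre flags: [Action, Comedy, Romance]
item_features = {
    "Movie A": np.array([1, 0, 0]),
    "Movie B": np.array([1, 1, 0]),
    "Movie C": np.array([0, 1, 1]),
    "Movie D": np.array([0, 0, 1]),
}

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_items(liked, n=2):
    """Rank every other item by content similarity to the liked item."""
    scores = {
        title: cosine(item_features[liked], features)
        for title, features in item_features.items()
        if title != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(similar_items("Movie C"))  # ['Movie D', 'Movie B']
```

Note that no rating data is needed here: only the liked item and the content attributes of the catalogue.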

Collaborative Filtering

Photo by Taras Shypka on Unsplash

Collaborative filtering makes suggestions based on the similar actions of different people, using the characteristics found in their histories.

  • User-Based Filtering:
    This strategy compares users’ histories and groups similar users. The target user is assigned to a group and receives suggestions based on it, which helps when advice must be given quickly to a user about whom little is known.
  • Item-Based Filtering:
    This strategy looks at items rather than users and aims to group the items. In other words, suggestions are made to the potential user from among the products in the same group.

We can also build a hybrid model by mixing these two methods. I will try to show this in the sample project.

If you want to access the project repo, you can click here.

Case Study & Descriptions

For the case study, I used the MovieLens 20M Dataset available on Kaggle.

First of all, let’s look at our data. We have two datasets: one contains the movies and the other contains the ratings. As a first step, I merge these two datasets.

1. User Based

import pandas as pd

movie = pd.read_csv('movie.csv')
rating = pd.read_csv('rating.csv')
df = movie.merge(rating, how="left", on="movieId")
df.head()

Next, I remove the year from the titles, and then I simplify the data frame by dropping movies with 1000 or fewer ratings, since these are rare and may distort the later steps.

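# Strip the "(YYYY)" year suffix from titles; regex=True is required in recent pandas
df['title'] = df['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()
# Count ratings per title and drop titles with 1000 or fewer ratings
title_counts = df["title"].value_counts()
rare_movies = title_counts[title_counts <= 1000].index
common_movies = df[~df["title"].isin(rare_movies)]
# Rows are users, columns are titles, cells are ratings (NaN if unrated)
user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
df_ = user_movie_df.copy()
df.head()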
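
To see what these steps produce, here is the same pipeline on a tiny invented table. The ratings, the title "Obscure Film (2001)", and the min_ratings threshold of 2 are assumptions chosen so that the result is small enough to read.

```python
import pandas as pd

# Invented ratings; "Obscure Film (2001)" is a hypothetical rarely rated title.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Toy Story (1995)",
              "Heat (1995)", "Toy Story (1995)", "Obscure Film (2001)"],
    "rating": [4.0, 5.0, 3.5, 4.0, 5.0, 2.0],
})

# Remove the "(YYYY)" suffix and surrounding whitespace
toy["title"] = toy["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

# Keep only titles with at least min_ratings ratings
min_ratings = 2
counts = toy["title"].value_counts()
common = toy[toy["title"].isin(counts[counts >= min_ratings].index)]

# One row per user, one column per title, NaN where the user gave no rating
matrix = common.pivot_table(index="userId", columns="title", values="rating")
print(matrix)
```

User 3 never rated Heat, so that cell is NaN. In the real matrix most cells are NaN, which is why the later steps count and filter non-null values.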

Then I choose a random user to make suggestions.

# random_user = int(pd.Series(user_movie_df.index).sample(1).values[0])
# random user 108170
user_id = 108170

I find the movies watched by the randomly selected user.

user_id_df = user_movie_df[user_movie_df.index == user_id]
user_id_df.head()

movies_watched = user_id_df.columns[user_id_df.notna().any()].tolist()
len(movies_watched)

Next, I look up the IDs of users who watched the same movies. To keep a meaningful relationship between those users and the selected user, I set a threshold: I keep the IDs of users who watched more than 60% of these 186 movies.

movies_watched_df = user_movie_df[movies_watched]
movies_watched_df.head()
movies_watched_df.shape

user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]

perc = len(movies_watched) * 60 / 100
users_same_movies = user_movie_count[user_movie_count["movie_count"] > perc]["userId"]
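
The overlap filter is easier to follow on a small matrix. This is a sketch with invented ratings and user IDs, using the same "more than 60%" rule; unlike the code above, it also excludes the target user from the result.

```python
import numpy as np
import pandas as pd

# Invented user x movie rating matrix (NaN = not watched)
user_movie = pd.DataFrame(
    {
        "A": [5, 4, np.nan, 3],
        "B": [4, np.nan, np.nan, 2],
        "C": [3, 5, 1, np.nan],
        "D": [np.nan, 2, 4, 5],
    },
    index=pd.Index([10, 20, 30, 40], name="userId"),
)

target = 10
# Movies the target user has rated: A, B and C
watched = user_movie.columns[user_movie.loc[target].notna()].tolist()

# For each user, count how many of those movies they have also rated
overlap = user_movie[watched].notna().sum(axis=1)

# Keep users who share more than 60% of the target's movies (here: more than 1.8)
threshold = 0.6 * len(watched)
candidates = overlap[(overlap > threshold) & (overlap.index != target)].index.tolist()
print(candidates)  # [20, 40]
```

User 30 shares only one movie with the target, so any correlation computed for that user would rest on too little data.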

I gathered the data of the selected user and the other users with enough overlap, and then looked at the correlations between them.

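# Filter on the user IDs themselves (the values of users_same_movies), not on its positional index.
# The selected user watched 100% of these movies, so they are already included and no concat is needed.
final_df = movies_watched_df[movies_watched_df.index.isin(users_same_movies)]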
final_df.head()
final_df.shape
final_df.T.corr()

corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df = corr_df.reset_index()
corr_df.head()

I found the most similar users by keeping those with a correlation of at least 0.65 with the selected user and sorting them in descending order.

top_users = corr_df[(corr_df["user_id_1"] == user_id) & (corr_df["corr"] >= 0.65)][["user_id_2", "corr"]]\
.reset_index(drop=True)

top_users = top_users.sort_values(by='corr', ascending=False)
top_users.rename(columns={"user_id_2": "userId"}, inplace=True)
top_users.head()

By creating a data frame called top_users_ratings, I kept the users, correlations, and ratings together. Then I created a variable named weighted_rating by multiplying each correlation by the rating.

top_users_ratings = top_users.merge(rating[["userId", "movieId", "rating"]], how='inner')
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']
top_users_ratings.head()

From this data frame, I made movie recommendations based on the most similar users, using a correlation-weighted average score for each movie.

temp = top_users_ratings.groupby('movieId').sum()[['corr', 'weighted_rating']]
temp.columns = ['sum_corr', 'sum_weighted_rating']
temp.head()
recommendation_df = pd.DataFrame()
recommendation_df['weighted_average_recommendation_score'] = temp['sum_weighted_rating'] / temp['sum_corr']
recommendation_df['movieId'] = temp.index
recommendation_df = recommendation_df.sort_values(by='weighted_average_recommendation_score', ascending=False)
recommendation_df.head(30)
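
The weighted average score can be checked by hand on a small example. The two neighbours, their correlations, and their ratings below are invented; the formula is the same as above, sum(corr * rating) / sum(corr) per movie.

```python
import pandas as pd

# Two hypothetical similar users with their correlation to the target user
neighbours = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "corr":    [0.9, 0.9, 0.7, 0.7],
    "movieId": [100, 200, 100, 300],
    "rating":  [5.0, 3.0, 3.0, 4.0],
})
neighbours["weighted_rating"] = neighbours["corr"] * neighbours["rating"]

# Sum correlations and weighted ratings per movie, then divide
sums = neighbours.groupby("movieId")[["corr", "weighted_rating"]].sum()
scores = (sums["weighted_rating"] / sums["corr"]).sort_values(ascending=False)
print(scores)
# movie 100: (0.9 * 5.0 + 0.7 * 3.0) / (0.9 + 0.7) = 4.125
# movie 300: (0.7 * 4.0) / 0.7 = 4.0
# movie 200: (0.9 * 3.0) / 0.9 = 3.0
```

A movie rated by a single neighbour simply gets that neighbour's rating as its score, so in practice it may be worth requiring a minimum number of neighbours per movie.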

movie = pd.read_csv('movie.csv')
movies_from_user_based = movie.loc[movie['movieId'].isin(recommendation_df['movieId'].head(10))]['title']
movies_from_user_based.head(30)
movies_from_user_based[:5].values

2. Item Based

Unlike the previous steps, this time I will make an item-based recommendation. For that, I import my datasets again.

movie = pd.read_csv('Lectures/Week 10/Dosyalar/movie.csv')
rating = pd.read_csv('Lectures/Week 10/Dosyalar/rating.csv')
df = movie.merge(rating, how="left", on="movieId")
df.head()

Then I reduce the genres variable to its first genre, and finally I convert timestamp to a datetime type so that I can split it into parts.

df["genre"] = df["genres"].apply(lambda x: x.split("|")[0])
df.drop("genres", inplace=True, axis=1)
# Let pandas infer the format; a date-only format fails if the strings also carry a time part
df["timestamp"] = pd.to_datetime(df["timestamp"])
df.info()
# Separate the timestamp variable into year, month and day
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df.head()
df["title"].nunique()
# unique title num is 26213
# Count ratings per title (a Series indexed by title)
title_counts = df["title"].value_counts()
title_counts.head()

rare_movies = title_counts[title_counts <= 1000].index
common_movies = df[~df["title"].isin(rare_movies)]
common_movies.shape
common_movies["title"].nunique()

item_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
item_movie_df.shape
item_movie_df.head(10)
item_movie_df.columns

len(item_movie_df.columns)
common_movies["title"].nunique()

Among the movies that the selected user rated 5 points, I pick the one with the most recent rating and get its ID. Then I print the movies most correlated with it.

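# Most recently rated movie among those the selected user gave 5.0
movieId = rating[(rating["rating"] == 5.0) & (rating["userId"] == user_id)].sort_values(by="timestamp", ascending=False)["movieId"][0:1].values[0]
movie_title = movie[movie["movieId"] == movieId]["title"].str.replace(r'\(\d{4}\)', '', regex=True).str.strip().values[0]

# Use a separate name so the movie data frame is not overwritten
movie_ratings = item_movie_df[movie_title]
movie_item_based = item_movie_df.corrwith(movie_ratings).sort_values(ascending=False)
# Skip the first entry, which is the movie itself (correlation 1.0)
movie_item_based[1:6].index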
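
Here is what corrwith does on a small invented rating matrix. The three movies are hypothetical, and "Movie Z" is deliberately built as the mirror image of "Movie X" so that the result is easy to verify.

```python
import pandas as pd

# Invented ratings: rows are users, columns are movies
ratings = pd.DataFrame({
    "Movie X": [5, 4, 1, 2, 5],
    "Movie Y": [4, 5, 2, 1, 4],
    "Movie Z": [1, 2, 5, 4, 1],   # equals 6 - "Movie X", so perfectly anti-correlated
}, dtype=float)

# Pearson correlation of every column with the chosen movie's ratings
similar = ratings.corrwith(ratings["Movie X"]).sort_values(ascending=False)
print(similar)

# Drop the movie itself instead of relying on it being first in the sorted result
recommendations = similar.drop("Movie X").index.tolist()
print(recommendations)  # ['Movie Y', 'Movie Z']
```

Dropping the movie by name is a little safer than slicing from position 1, because another movie could tie with a correlation of 1.0.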

3. Hybrid Recommendation

By combining the outputs of the item-based and user-based approaches (5 movies from each), we can recommend these 10 movies as a hybrid recommendation.
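
One possible way to combine the two outputs is sketched below. The interleaving strategy, the function name hybrid, and the example titles are my own assumptions; simply concatenating the two lists would work as well.

```python
from itertools import chain, zip_longest

def hybrid(user_based, item_based, n=10):
    """Alternate between the two ranked lists, skipping duplicates, and keep n titles."""
    merged = []
    for title in chain.from_iterable(zip_longest(user_based, item_based)):
        if title is not None and title not in merged:
            merged.append(title)
    return merged[:n]

# Hypothetical outputs of the two recommenders
print(hybrid(["A", "B", "C"], ["B", "D"]))  # ['A', 'B', 'D', 'C']
```

Interleaving keeps the top pick of each method near the top of the final list, and the duplicate check avoids recommending the same movie twice when both methods agree.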

References

  1. https://developers.google.com/machine-learning/recommendation
  2. https://www.sciencedirect.com/science/article/pii/S1110866515000341
  3. https://www.veribilimiokulu.com/

Ogulcan Ertunc

I’m an IT Consultant and a Data Analytics graduate. I’m a Data Enthusiast 💻 😃 passionate about learning and working with new tech. https://github.com/ogulcanertunc