Study | StudyLover

Content-Based Recommender System

Recommender Systems : Collaborative filtering

Unit 2: Guide to Machine Learning Algorithms

A content-based recommender system suggests items that are similar to items a user has liked in the past. The "content" in the name refers to the attributes or features of the items themselves. The core assumption is simple: "If you like an item with certain features, you will probably like other items with similar features."

For example, if you watch a lot of action movies directed by Christopher Nolan, a content-based system will learn that you like the "action" genre and the director "Christopher Nolan." It will then recommend other movies that share these attributes.

How It Works

1. Item Profiling: The first step is to create a profile for each item in the catalog by extracting the key features that describe it. For movies, these could be the genre, director, actors, and plot keywords. Each profile is usually encoded as a numeric feature vector.

2. User Profiling: The system then builds a profile for each user that summarizes their preferences. This user profile is created based on the items the user has positively rated or interacted with. For instance, if a user has liked several action movies, their user profile will have a high score for the "action" genre.

3. Recommendation: To make a recommendation, the system computes a similarity score (often cosine similarity) between the user's profile and the profile of each item the user has not yet seen. The items with the highest scores are recommended to the user.
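
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: the feature names, the four placeholder movies, and the helper names (cosine, build_user_profile, recommend) are all hypothetical, and the user profile is assumed to be a simple average of the liked items' vectors.

```python
import math

# Step 1: item profiles. Each position is one hypothetical binary feature.
FEATURES = ["action", "drama", "sci-fi", "nolan"]
ITEMS = {
    "Movie A": [1.0, 0.0, 1.0, 1.0],
    "Movie B": [1.0, 0.0, 0.0, 1.0],
    "Movie C": [0.0, 1.0, 0.0, 0.0],
    "Movie D": [1.0, 0.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_user_profile(liked_titles):
    """Step 2: average the feature vectors of the items the user liked."""
    vectors = [ITEMS[t] for t in liked_titles]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(liked_titles, top_n=2):
    """Step 3: rank unseen items by similarity to the user profile."""
    profile = build_user_profile(liked_titles)
    scores = {
        title: cosine(profile, vector)
        for title, vector in ITEMS.items()
        if title not in liked_titles
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

profile = build_user_profile(["Movie A", "Movie B"])
ranked = recommend(["Movie A", "Movie B"])
print(profile)  # [1.0, 0.0, 0.5, 1.0]
print(ranked)   # "Movie D" ranks first; "Movie C" shares no features and scores 0.0
```

Averaging is only one way to build the profile; weighting each liked item by its rating is a common alternative.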

Advantages and Disadvantages

Advantages:

No "Cold Start" for Items: It can recommend new items that haven't been rated by any users yet, as long as their features are available. (A brand-new user with no history still has no profile to match against.)
User Independence: It doesn't need data from other users to make a recommendation for a specific user.
Interpretability: The recommendations are easy to explain (e.g., "We're recommending this movie because you like other action movies").
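
The interpretability point can be made concrete with a small sketch. The function below is hypothetical (its name and message wording are assumptions for illustration); it simply reports which genres a candidate shares with the genres the user is known to like.

```python
def explain_recommendation(liked_genres, candidate_genres):
    """Return a plain-language reason based on the genres both sets share."""
    shared = sorted(set(liked_genres) & set(candidate_genres))
    if not shared:
        return "No shared genres with your history."
    return "Recommended because you like " + ", ".join(shared) + " movies."

liked = {"Action", "Crime", "Drama"}
message = explain_recommendation(liked, {"Action", "Sci-Fi"})
print(message)  # Recommended because you like Action movies.
```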

Disadvantages:

Limited Serendipity: It can create a "filter bubble" where the user is only recommended items that are very similar to what they've already seen, making it hard to discover new interests.
Requires Good Feature Extraction: The quality of the recommendations is highly dependent on the quality of the features extracted from the items.

# --- 1. Import Necessary Libraries ---
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# --- 2. Prepare Sample Data ---
# Create a sample dataset of movies with their genres.
data = {
    'movie_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'title': [
        'The Dark Knight', 'Inception', 'Forrest Gump', 'Pulp Fiction',
        'The Shawshank Redemption', 'Goodfellas', 'The Matrix', 'Saving Private Ryan'
    ],
    'genres': [
        # Assumption: the genres for the first three titles are filled in here
        # with commonly listed values so that every title has an entry.
        'Action|Crime|Drama', 'Action|Adventure|Sci-Fi', 'Drama|Romance',
        'Crime|Drama', 'Drama', 'Biography|Crime|Drama',
        'Action|Sci-Fi', 'Drama|War'
    ]
}
movies_df = pd.DataFrame(data)
print("--- Movie Dataset ---")
print(movies_df)

# --- 3. Feature Extraction (TF-IDF) ---
# We need to convert the genres string into a numerical representation.
# TfidfVectorizer does this. It will:
# 1. Treat each genre as a "word". The token_pattern below splits only on "|",
#    so a hyphenated genre such as "Sci-Fi" stays a single token.
# 2. Calculate a weight for each genre in each movie; genres that appear in
#    fewer movies receive a higher weight.
tfidf = TfidfVectorizer(token_pattern=r'[^|]+')
# Replace NaN with an empty string if any genres are missing
movies_df['genres'] = movies_df['genres'].fillna('')
# Create the TF-IDF matrix (one row per movie, one column per genre)
tfidf_matrix = tfidf.fit_transform(movies_df['genres'])
print("\n--- TF-IDF Matrix Shape ---")
print(tfidf_matrix.shape)

# --- 4. Calculate Item-Item Similarity ---
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity and skips the
# redundant normalization step that cosine_similarity would perform.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Create a mapping from movie titles to their index in the DataFrame,
# keeping the first entry if a title appears more than once.
indices = pd.Series(movies_df.index, index=movies_df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

# --- 5. Create the Recommendation Function ---
def get_content_based_recommendations(title, top_n=10):
    """
    Return the titles of up to top_n movies whose genres are most similar
    to those of the given movie.
    """
    # Get the index of the movie that matches the title.
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie,
    # leaving out the movie itself.
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort the movies by similarity score, highest first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top_n most similar movies (fewer if the catalog is small).
    sim_scores = sim_scores[:top_n]

    # Get the movie indices.
    movie_indices = [i for i, _ in sim_scores]

    # Return the titles of the most similar movies.
    return movies_df['title'].iloc[movie_indices]

# --- 6. Get Recommendations ---
# Let's get recommendations for a user who liked "The Dark Knight".
movie_liked = 'The Dark Knight'
print(f"\n--- Recommendations for someone who liked '{movie_liked}' ---")
recommendations = get_content_based_recommendations(movie_liked)
print(recommendations.to_string(index=False))
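
Step 4 of the listing uses linear_kernel in place of cosine_similarity. The short check below is a sketch of why that substitution is reasonable: with TfidfVectorizer's default settings each row has unit L2 norm, so the dot product and the cosine similarity coincide. The four genre strings and the token_pattern are choices made for this illustration, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Small illustrative genre strings.
genres = ["Action|Sci-Fi", "Crime|Drama", "Drama|War", "Action|Crime|Drama"]

# Split on "|" only, so "Sci-Fi" remains one token.
vectorizer = TfidfVectorizer(token_pattern=r"[^|]+")
matrix = vectorizer.fit_transform(genres)

# Each row should have unit length because norm="l2" is the default.
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
norms_ok = bool(np.allclose(row_norms, 1.0))

# With unit-length rows, the dot product is the cosine similarity.
dot = linear_kernel(matrix, matrix)
cos = cosine_similarity(matrix, matrix)
same = bool(np.allclose(dot, cos))

print(norms_ok, same)
```

If the vectorizer were created with norm=None, the rows would no longer be unit length and the two functions would give different results.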
