The Role of Model Evaluation in Recommender Systems

Model evaluation is a critical step in building any machine learning system, but it's especially important—and nuanced—for recommender systems. Its primary role is to answer the fundamental question: "Are the recommendations actually good and useful to the user?"

Without a proper evaluation, you have no objective way of knowing if your system is performing well, if a change you made improved or worsened the recommendations, or how it compares to other potential algorithms.


Why is it Different from Standard Classification/Regression?

You can't use simple metrics like accuracy in the same way you would for a classification problem. This is because:

1.   Implicit Negatives: The data is sparse. Just because a user hasn't rated a movie doesn't mean they dislike it; they probably just haven't seen it. We can't simply treat all unrated items as "negative" examples (the short sketch after this list makes this concrete).

2.   The Goal is Ranking, Not Just Prediction: A good recommender system doesn't just predict if a user will like an item. It needs to present a ranked list of the best items that the user is most likely to engage with. The order of the recommendations matters.
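To see why point 1 breaks plain accuracy, here is a quick illustrative sketch (the numbers are hypothetical, not from any real dataset): on a sparse interaction matrix, a useless model that predicts "no interaction" for every user-item pair still scores near-perfect accuracy.

import numpy as np

# Hypothetical sparse interactions: 1,000 users x 2,000 items, with only
# about 0.5% of the cells observed as positives.
rng = np.random.default_rng(0)
interactions = (rng.random((1000, 2000)) < 0.005).astype(int)

# A "model" that always predicts 0 (recommends nothing) for every pair.
predictions = np.zeros_like(interactions)

# Accuracy looks excellent even though the model is useless.
accuracy = (predictions == interactions).mean()
print(f"Accuracy of the do-nothing model: {accuracy:.3f}")  # roughly 0.995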


Offline Evaluation: How We Test a Recommender

The most common way to evaluate a recommender system without deploying it to real users is offline evaluation (testing on live users is called online evaluation, or A/B testing). The process is:

1.   Split the Data: We take our historical user-item interaction data and split it into a training set and a test set.

2.   Train the Model: We train our recommender system using only the data in the training set.

3.   Test the Model: We then ask the model to generate a list of top-N recommendations for each user. We compare this recommended list to the items in the test set (the "ground truth" of what the user actually liked). A minimal skeleton of this loop is sketched just below.
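Here is that skeleton, assuming a toy pandas DataFrame of interactions and a placeholder popularity-based recommend() function (both are our own illustrative stand-ins, not a specific library API):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical interaction log: one row per item the user liked.
interactions = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3, 3],
    'item_id': ['A', 'B', 'A', 'C', 'D', 'B', 'D'],
})

# 1. Split the historical data.
train, test = train_test_split(interactions, test_size=0.3, random_state=0)

# 2. "Train" a stand-in model on the training set only: here we just
#    rank items by overall popularity in the training data.
popularity = train['item_id'].value_counts()

# 3. Recommend the top-N most popular unseen items, then compare them
#    to what the user actually liked in the test set (the ground truth).
def recommend(user_id, n=2):
    seen = set(train.loc[train['user_id'] == user_id, 'item_id'])
    return [item for item in popularity.index if item not in seen][:n]

for user_id in test['user_id'].unique():
    ground_truth = set(test.loc[test['user_id'] == user_id, 'item_id'])
    print(user_id, recommend(user_id), 'ground truth:', ground_truth)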


Common Evaluation Metrics

We use metrics that are designed to evaluate ranked lists:

  • Precision@k: This is one of the most intuitive metrics. It answers the question: "Out of the top 'k' items we recommended, how many did the user actually like?" If we recommend 10 movies and the user liked 3 of them from our list, the Precision@10 is 30%. This measures the quality and relevance of the recommendations.

  • Recall@k: This metric answers the question: "Out of all the items the user actually liked, how many did we manage to recommend in our top 'k' list?" If the user liked a total of 5 movies in the test set and we recommended 3 of them in our top 10, the Recall@10 is 60%. This measures coverage, or how well we found all the relevant items. Short code sketches of both metrics follow this list.
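Both metrics are straightforward to compute once you have a recommended list and the set of items the user actually liked. A minimal sketch using the worked numbers from the two examples above (the function names are our own, not from any library):

def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommendations the user actually liked."""
    hits = len(set(recommended[:k]) & set(liked))
    return hits / k

def recall_at_k(recommended, liked, k):
    """Fraction of all the items the user liked that made the top-k list."""
    hits = len(set(recommended[:k]) & set(liked))
    return hits / len(liked) if liked else 0.0

# Worked example: we recommend 10 movies; the user liked 3 of them,
# out of 5 liked movies in total in the test set.
recommended = ['m1', 'm2', 'm3', 'm4', 'm5', 'm6', 'm7', 'm8', 'm9', 'm10']
liked = ['m2', 'm5', 'm9', 'm11', 'm12']
print(precision_at_k(recommended, liked, k=10))  # 0.3 -> 30%
print(recall_at_k(recommended, liked, k=10))     # 0.6 -> 60%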

By using these metrics, we can get a reliable, quantitative score for how well our recommender system is performing.

The complete worked example below builds a simple hybrid recommender (a content-based component blended with a collaborative one) and evaluates it with Precision@k:

# --- 1. Import Necessary Libraries ---
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# --- 2. Prepare Sample Data ---
# We'll use a slightly larger dataset to make the evaluation more meaningful.
ratings_data = {
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6],
    'movie_title': [
        'The Dark Knight', 'Inception', 'Forrest Gump',
        'Inception', 'Pulp Fiction', 'The Matrix',
        'The Dark Knight', 'Forrest Gump', 'The Shawshank Redemption',
        'Pulp Fiction', 'The Godfather', 'Goodfellas',
        'Inception', 'The Dark Knight', 'The Matrix',
        'Forrest Gump', 'The Shawshank Redemption'
    ],
    'rating': [5, 5, 4, 5, 4, 5, 5, 5, 5, 5, 4, 5, 4, 5, 4, 5, 5]
}
movies_data = {
    'movie_title': ['The Dark Knight', 'Inception', 'Forrest Gump', 'Pulp Fiction', 'The Godfather', 'The Matrix', 'The Shawshank Redemption', 'Goodfellas'],
    'genres': ['Action|Crime|Drama', 'Action|Adventure|Sci-Fi', 'Comedy|Drama|Romance', 'Crime|Drama', 'Crime|Drama', 'Action|Sci-Fi', 'Drama', 'Biography|Crime|Drama']
}
ratings_df = pd.DataFrame(ratings_data)
movies_df = pd.DataFrame(movies_data)

# --- 3. Split Data for Evaluation ---
# We split the ratings data into a training set and a test set.
# The model will be built using the training set.
# The test set will be used as the "ground truth" to evaluate our recommendations.
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)

# --- 4. Content-Based Filtering Component ---
# This component is built using the full movie list, as content features are static.
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['genres'])
content_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
content_sim_df = pd.DataFrame(content_sim, index=movies_df['movie_title'], columns=movies_df['movie_title'])

# --- 5. Collaborative Filtering Component ---
# This component is built ONLY on the training data.
user_item_matrix_train = train_df.pivot_table(index='user_id', columns='movie_title', values='rating').fillna(0)
# Ensure all movies from the main movies_df are in the columns.
user_item_matrix_train = user_item_matrix_train.reindex(columns=movies_df['movie_title'], fill_value=0)
# Item-item similarity: transpose so each column's rating vector becomes a row.
collab_sim = cosine_similarity(user_item_matrix_train.T)
collab_sim_df = pd.DataFrame(collab_sim, index=user_item_matrix_train.columns, columns=user_item_matrix_train.columns)

# --- 6. Create a Weighted Hybrid Recommender ---
def get_hybrid_recommendations(user_id, alpha=0.5):
    """
    Generates recommendations for a user based on their training data history.
    alpha weights the collaborative score; (1 - alpha) weights the content score.
    """
    # Guard: after a random split, a user may appear only in the test set.
    if user_id not in user_item_matrix_train.index:
        return []

    # Get the list of movies the user has watched in the training set.
    user_ratings = user_item_matrix_train.loc[user_id]
    watched_movies_train = user_ratings[user_ratings > 0].index

    # Accumulate a hybrid score for every movie.
    total_scores = pd.Series(0.0, index=movies_df['movie_title'])

    for movie_title in watched_movies_train:
        content_scores = content_sim_df[movie_title]
        if movie_title in collab_sim_df:
            collab_scores = collab_sim_df[movie_title]
        else:
            collab_scores = pd.Series(0.0, index=content_sim_df.index)

        # Aggregate scores: a weighted blend of the two components.
        total_scores += (alpha * collab_scores) + ((1 - alpha) * content_scores)

    # Sort the movies based on the aggregated hybrid score.
    sorted_scores = total_scores.sort_values(ascending=False)

    # Filter out movies the user has already watched.
    recommendations = [movie for movie in sorted_scores.index if movie not in watched_movies_train]

    return recommendations

# --- 7. Model Evaluation (Precision@k) ---
def calculate_precision_at_k(k=3):
    """
    Calculates the average Precision@k over all users in the test set.
    Users with no training history are skipped.
    """
    precisions = []

    # Get the list of users in the test set.
    test_users = test_df['user_id'].unique()

    for user_id in test_users:
        # The "ground truth": movies the user rated in the test set.
        true_positives = test_df[test_df['user_id'] == user_id]['movie_title'].tolist()

        # Generate top-k recommendations for the user.
        recommendations = get_hybrid_recommendations(user_id, alpha=0.5)[:k]

        # Count how many recommended items appear in the ground truth.
        hits = len(set(recommendations) & set(true_positives))

        # Precision for this user.
        if len(recommendations) > 0:
            precisions.append(hits / len(recommendations))

    # Return the average precision across all users.
    if len(precisions) > 0:
        return sum(precisions) / len(precisions)
    return 0.0

# Calculate and print the evaluation metric.
# Note: With a tiny dataset, the results might be 0 or 1, but this demonstrates the process.
avg_precision = calculate_precision_at_k(k=3)
print("\n--- Model Evaluation ---")
print(f"Average Precision@3 for the model is: {avg_precision:.4f}")

# --- 8. Get Final Recommendations for a User ---
user_to_recommend = 3
print(f"\n--- Top 3 Recommendations for User {user_to_recommend} ---")
final_recommendations = get_hybrid_recommendations(user_to_recommend, alpha=0.5)

if final_recommendations:
    for i, movie in enumerate(final_recommendations[:3]):
        print(f"{i+1}. {movie}")
else:
    print("No new recommendations found.")
