Movie Recommendation System In Python
Recommendation systems have become an integral part of our digital lives, influencing everything from the movies we watch to the products we buy. In this tutorial, we'll delve into the world of building a movie recommendation system using Python. We'll explore the core concepts, implement a content-based filtering approach, and discuss potential improvements and challenges.
Understanding the Code
To build a recommendation system, we need a dataset that contains information about movies and potentially user ratings. For this tutorial, we'll use the movielens dataset.
- Importing Libraries:
pandas (pd)
: Used for data manipulation and analysis (reading CSV, dataframes)numpy (np)
: Used for numerical computations (vectorization)CountVectorizer
fromsklearn.feature_extraction.text
: Converts text data into numerical featurescosine_similarity
fromsklearn.metrics.pairwise
: Calculates similarity between vectors
- Loading and Exploring the Movie Data:
- Reads the movie data from a CSV file into a pandas dataframe (
df
). - Displays the first few rows (
head()
) and statistical summary (describe()
) of the data to understand its content.
- Reads the movie data from a CSV file into a pandas dataframe (
- Feature Selection and Preprocessing:
- Defines a list of features (
features
) relevant for recommendations like genres, cast, and director. - Checks for missing values (NaN) in the
cast
column. - Creates a function
combine_features
to concatenate all chosen features into a single string for each movie, stored in a new columncombined_features
. - Fills missing values in the chosen features with an empty string (
""
) to avoid errors during processing.
- Defines a list of features (
- Text Vectorization and Similarity Calculation:
- Uses
CountVectorizer
to convert the combined features text into numerical vectors (count_matrix
). - Calculates the cosine similarity between each movie based on their vector representations using
cosine_similarity
. Cosine similarity measures how similar two vectors are.
- Uses
- The
get_title_from_index
Function:
This function takes an index as input and returns the corresponding movie title from the DataFrame
df
.df[df.index == index]
: This line filters the DataFrame to rows where the index matches the inputindex
.["title"]
: Selects only thetitle
column from the filtered DataFrame..values[0]
: Extracts the title as a string from the resulting Series.
- The
get_index_from_title
Function:
This function takes a movie title as input and returns its corresponding index in the DataFrame.
df[df.title == title]
: Filters the DataFrame to rows where thetitle
matches the inputtitle
.["index"]
: Selects only theindex
column from the filtered DataFrame..values[0]
: Extracts the index as an integer from the resulting Series.
- Getting User Input and Finding Recommendations
This code block handles user input and retrieves the movie index:
- Prompts the user to enter their favorite movie title.
- Uses
get_index_from_title
to obtain the index of the movie. - Handles potential exceptions (e.g., movie not found) using a
try-except
block.
- Finding Similar Movies & Printing Recommendations
Finding Similar Movies:
- It calculates the similarity scores between the user's favorite movie and all other movies using the
cosine_sim
matrix. - Sorts the similarity scores in descending order to identify the most similar movies.
- Stores the indices of the top 10 similar movies in the
sorted_similar_movies
list.
Printing Recommendations:
- Iterates through the
sorted_similar_movies
list. - For each movie index, retrieves the corresponding movie title using
get_title_from_index
. - Prints the movie title.
- Breaks the loop after printing 10 recommendations.
- It calculates the similarity scores between the user's favorite movie and all other movies using the
This code completes the movie recommendation system, providing a list of movies similar to the user's favorite movie based on content-based filtering.
Output:
Note: This is a basic example and can be improved by incorporating more sophisticated techniques, such as user-based or item-based collaborative filtering, hybrid approaches, and advanced evaluation metrics.
Download