Movie Recommendation System In Python
Recommendation systems have become an integral part of our digital lives, influencing everything from the movies we watch to the products we buy. In this tutorial, we'll delve into the world of building a movie recommendation system using Python. We'll explore the core concepts, implement a content-based filtering approach, and discuss potential improvements and challenges.
Understanding the Code
To build a recommendation system, we need a dataset that contains information about movies and potentially user ratings. For this tutorial, we'll use the movielens dataset.
- Importing Libraries:
pandas (pd): Used for data manipulation and analysis (reading CSV, dataframes)numpy (np): Used for numerical computations (vectorization)CountVectorizerfromsklearn.feature_extraction.text: Converts text data into numerical featurescosine_similarityfromsklearn.metrics.pairwise: Calculates similarity between vectors
- Loading and Exploring the Movie Data:
- Reads the movie data from a CSV file into a pandas dataframe (
df). - Displays the first few rows (
head()) and statistical summary (describe()) of the data to understand its content.
- Reads the movie data from a CSV file into a pandas dataframe (
- Feature Selection and Preprocessing:
- Defines a list of features (
features) relevant for recommendations like genres, cast, and director. - Checks for missing values (NaN) in the
castcolumn. - Creates a function
combine_featuresto concatenate all chosen features into a single string for each movie, stored in a new columncombined_features. - Fills missing values in the chosen features with an empty string (
"") to avoid errors during processing.
- Defines a list of features (
- Text Vectorization and Similarity Calculation:
- Uses
CountVectorizerto convert the combined features text into numerical vectors (count_matrix). - Calculates the cosine similarity between each movie based on their vector representations using
cosine_similarity. Cosine similarity measures how similar two vectors are.
- Uses
- The
get_title_from_indexFunction:
This function takes an index as input and returns the corresponding movie title from the DataFrame
df.df[df.index == index]: This line filters the DataFrame to rows where the index matches the inputindex.["title"]: Selects only thetitlecolumn from the filtered DataFrame..values[0]: Extracts the title as a string from the resulting Series.
- The
get_index_from_titleFunction:
This function takes a movie title as input and returns its corresponding index in the DataFrame.
df[df.title == title]: Filters the DataFrame to rows where thetitlematches the inputtitle.["index"]: Selects only theindexcolumn from the filtered DataFrame..values[0]: Extracts the index as an integer from the resulting Series.
- Getting User Input and Finding Recommendations
This code block handles user input and retrieves the movie index:
- Prompts the user to enter their favorite movie title.
- Uses
get_index_from_titleto obtain the index of the movie. - Handles potential exceptions (e.g., movie not found) using a
try-exceptblock.
- Finding Similar Movies & Printing Recommendations
Finding Similar Movies:
- It calculates the similarity scores between the user's favorite movie and all other movies using the
cosine_simmatrix. - Sorts the similarity scores in descending order to identify the most similar movies.
- Stores the indices of the top 10 similar movies in the
sorted_similar_movieslist.
Printing Recommendations:
- Iterates through the
sorted_similar_movieslist. - For each movie index, retrieves the corresponding movie title using
get_title_from_index. - Prints the movie title.
- Breaks the loop after printing 10 recommendations.
- It calculates the similarity scores between the user's favorite movie and all other movies using the
This code completes the movie recommendation system, providing a list of movies similar to the user's favorite movie based on content-based filtering.
Output:
Note: This is a basic example and can be improved by incorporating more sophisticated techniques, such as user-based or item-based collaborative filtering, hybrid approaches, and advanced evaluation metrics.
Download