Python data analysis - TMDb 5000 Movie Database movie data analysis

I just finished learning Python recently, so I'll take this project to practice. To know that a lot of crawler data can't be processed only by Excel and sql, Python is absolutely the best weapon for data analysis, and data cleaning, data processing and visualization can be completed. Practice to know, although there are many problems in the process, but after learning the ability to improve! This time, I will make a complete record of my projects and share them for future memories and second learning. Also welcome the comments of interested students, hope we can learn together and make progress together!

Article directory

1 project background

1.1 story background

The most important thing about data analysis is to tell a story! How to tell the story? It is nothing more than asking questions, analyzing problems and solving problems. Because the analyzed data solves a problem, only when the story is told can there be selling points, and only when others think your data analysis is valuable, can they be willing to pay for your data analysis. Don't say more, ha ha (⊙ o ⊙) Get back to the point.

My story is that Wang Sicong wants to open up Wanda film market overseas. This time, he is thinking about how to make money by making commercial films? After all, some big movies that cost more than $100 million to make will fail. This issue is more important to the film industry than ever before. Therefore, he invited the company's data analysts to help him solve the problem and give some suggestions. Based on the data, can we analyze whether there is a unified formula for the success of commercial films? To help him make better decisions.

1.2 ask questions

The ultimate problem to be solved is: what are the factors influencing the box office?

Next, we will analyze it in different dimensions:

  • What kind of movie do audiences like? What are the subject keywords?
  • How does movie style change over time?
  • Does the movie budget affect the box office?
  • What are the directors with high box office or high score?
  • When is the best time to release a movie?
  • Is it better to make an original film or to adapt a film?

2 Project Report

International practice starts with a data report:

3 understanding data

3.1 data source

The data used this time is from the kaggle platform (click the link to download TMDb 5000 Movie Database ) It contains the data of nearly 5000 films in the U.S. region from 1916 to 2017, including budget, director, box office, film score and other information.

3.2 data fields

The original dataset contains two files:

  • Tmdb? 5000? Movies: basic movie information, including 20 variables
  • Tmdb? 5000? Credits: actor information, including 4 variables

4 data cleaning

4.1 import data

Import package

# Data import
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

#Visual display in the interface
%matplotlib inline  
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  #Used to display Chinese
plt.rcParams['axes.unicode_minus'] = False  #Used to display negative sign normally
import seaborn as sns
# Learning seaborn reference:

import json
import warnings
from wordcloud import WordCloud, STOPWORDS

movies = pd.read_csv('F:\\tmdb-movie-metadata\\tmdb_5000_movies.csv',encoding = 'utf_8')
credits = pd.read_csv('F:\\tmdb-movie-metadata\\tmdb_5000_credits.csv',encoding = 'utf_8') #see information

Operation result:

Continued table:

From the above information, there are 4803 records.

# Both data frames have a title column, as well as movies.rigid'title
# The above three data columns are duplicate, and two are deleted
del credits['title']
del movies['original_title']

# Connecting two csv files
merged = pd.merge(movies, credits, left_on='id', right_on='movie_id', how='left')

# Delete columns that do not need analysis

Operation result:

It can be seen from the above that data is missing in the release date and run time columns.
Next, we deal with the missing values.

4.2 missing value handling

Only 3 records are missing, and online search is adopted to complete the information.

4.2.1 complete release date

# Find missing value record - release date

The operation result is:

The title of the missing film is America Is Still the Place, dated '2014-06-01'

# Fill in missing values
df['release_date'] = df['release_date'].fillna('2014-06-01')

4.2.2 completion of runtime

# Find missing value record - runtime

Operation result:

The movie runtime of the missing record is 94min and 240min respectively

# Fill in missing values based on row labels
df.loc[2656] = df.loc[2656].fillna('94, limit=1')
df.loc[4140] = df.loc[4140].fillna('240, limit=1')

4.3 repeated value processing


Run result: there are 4803 non duplicate IDS, which can be cons id ered as no duplicate data.

4.4 date value processing

Convert the release date column to a date type:

#Convert date format, add year month day column
#If the date does not meet the timestamp limit, errors ='ignore' will return the original input without an error.
#errors='coerce 'will force the date beyond NaT and return NaT.

df['release_year'] = pd.to_datetime(df.release_date, format = '%Y-%m-%d',errors='coerce').dt.year
df['release_month'] = pd.to_datetime(df.release_date).apply(lambda x: x.month)
df['release_day'] = pd.to_datetime(df.release_date).apply(lambda x:

Operation result:

Develop the habit of viewing data columns after processing~

4.5 screening data

Use one of the data analyst's favorite syntax:


Operation result:

  • The data of box office, budget, popularity and score 0 should be removed;
  • For films with too low scoring number, the scoring is not statistically significant. Here, data with scoring number greater than 50 is screened.
df = df[(df.vote_count >= 50) &(df.budget * df.revenue * df.popularity * df.vote_average !=0)].reset_index(drop = 'True')


To view the filter results:

At this time, there are 2961 data, including 19 fields.

4.6 json data conversion

**Note: the six columns * * genes, keywords, production [companies, production [countries, cast, crew are all json data and need to be processed as a list for analysis.

Treatment method:

  1. json itself is of string type, first convert to dictionary list
  2. Then convert the dictionary list to a string separated by ','
json_column = ['genres','keywords','production_companies','production_countries','cast','crew']

# 1-json itself is of string type, first convert to dictionary list
for i in json_column:
    df[i] = df[i].apply(json.loads)
# Extract name
# 2-convert dictionary list to string separated by ','
def get_name(x):
    return ','.join([i['name'] for i in x]) 
df['cast'] = df['cast'].apply(get_name)

# Extract derector
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

df['crew'] = df['crew'].apply(get_director)

for j in json_column[0:4]:
    df[j] = df[j].apply(get_name)

rename_dict = {'cast':'actor','crew':'director'}
df.rename(columns=rename_dict, inplace=True)

Operation result:

4.7 data backup

# Back up the original DF
org_df = df.copy()

5 data analysis

5.1 why

Want to explore the factors that affect box office, from the film market trends, audience preferences, film directors, release time, ratings and keywords and other dimensions to provide appropriate suggestions for practitioners.

5.2 what

5.2.1 film type

# Define a collection to get all movie types
genre = set()
for i in df['genres'].str.split(','): # Remove the separator between strings to get a single movie type
    genre = set().union(i,genre)    # Set union
    # genre.update(i) #Or use the update method


Operation result:

Note that there are redundant elements in the gene collection: empty single quotes, so you need to remove them

genre.discard('') # Remove redundant elements

Operation result:

#Turn genre into a list
genre_list = list(genre) 

# Create data box - Movie Type
genre_df = pd.DataFrame()  

#One hot encoding of movie types
for i in genre_list:
    # If it contains type i, it is encoded as 1, otherwise it is encoded as 0
    genre_df[i] = df['genres'].str.contains(i).apply(lambda x: 1 if x else 0)    

#Change the index of the data box to the year
genre_df.index = df['release_year']

Operation result: number of movie types (bar chart)

# Calculate the total number of movies of each type and arrange them in descending order
grnre_sum = genre_df.sum().sort_values(ascending = False)
# visualization
plt.rcParams['font.sans-serif'] = ['SimHei']  #Used to display Chinese
plt.title('Total number of different types of movies',fontsize=20)
plt.xlabel('Movie genre',fontsize=16)
plt.savefig("Number of different movie types-Bar chart.png",dpi=300) #Call plt.savefig() before proportion of movie types (pie chart)

gen_shares = grnre_sum / grnre_sum.sum()

# Set the other class. When the proportion of movie types is less than% 1, all of them belong to the other class
others = 0.01
gen_pie = gen_shares[gen_shares >= others]
gen_pie['others'] = gen_shares[gen_shares < others].sum()

# Set split properties
# When the proportion is less than or equal to% 2, increase the percentage of edge deviation radius of each pie piece
explode = (gen_pie <= 0.02)/10

# Pie chart

plt.title('Percentage of different movie types',fontsize=20)
plt.savefig("Percentage of different movie types-Pie chart.png",dpi=300) trend of film type change (line chart)

#The trend of movie type changing with time
gen_year_sum = genre_df.sort_index(ascending = False).groupby('release_year').sum()
gen_year_sum_sub = gen_year_sum[['Drama','Comedy','Thriller','Action','Adventure','Crime','Romance','Science Fiction']]
plt.xlabel('Particular year', fontsize=16)
plt.ylabel('Number', fontsize=16)
plt.title('Trends of different films', fontsize=20)

plt.savefig("Number of different movie types-Broken line figure 2.png",dpi=600) budget / profit of different movie types (combination chart)

# Calculate profit for different movie types
# Step 1 - create a profile ﹣ dataframe
profit_df = pd.DataFrame()
profit_df = pd.concat([genre_df.reset_index(), df['profit']], axis=1)
# Step 2 - create a profit series with the abscissa of gene
# Step 3-find the average profit of each kind of gene
for i in genre_list:
    profit_s.loc[i]=profit_df.loc[:,[i,'profit']].groupby(i, as_index=False).mean().loc[1,'profit']
profit_s = profit_s.sort_values(ascending = True)

# Calculate budget for different types of movies
# Step 1 - create a profile ﹣ dataframe
budget_df = pd.DataFrame()
budget_df = pd.concat([genre_df.reset_index(), df['budget']], axis=1)
# Step 2 - create budget series with the abscissa of gene
# Step 3 - find the budget mean of each kind of gene
for j in genre_list:
    budget_s.loc[j]=budget_df.loc[:,[j,'budget']].groupby(j, as_index=False).mean().loc[1,'budget']

# Next, merge the profile and budget
profit_budget = pd.concat([profit_s, budget_s], axis=1)
profit_budget.columns = ['profit', 'budget']

#Add profit rate column
profit_budget['rate'] = (profit_budget['profit']/profit_budget['budget'])*100
# Descending sort
profit_budget_sort=profit_budget.sort_values(by='budget',ascending = False)

# Draw average budget and profit margin of different types of films (combination chart)
x = profit_budget_sort.index
y1 = profit_budget_sort.budget
y2 = profit_budget_sort.rate
# Returns the number of rows in the profit
length = profit_budget_sort.shape[0]

fig = plt.figure(figsize=(12,9))
# Left axis
ax1 = fig.add_subplot(1,1,1),length),y1,color='b',label='Average budget')
plt.xticks(range(0,length),x,rotation=90, fontsize=12)  # Change the abscissa name
ax1.set_xlabel('Particular year')                   # Set x-axis label, Y-axis label
ax1.set_ylabel('Average budget',fontsize=16)

#Right axis
# Share x axis, generate secondary axis
ax2 = ax1.twinx()
ax2.set_ylabel('Average profit margin',fontsize=16)

# Display profit rate axis in percentage format
import matplotlib.ticker as mtick
yticks = mtick.FormatStrFormatter(fmt)

# Set picture title
ax1.set_title('Average budget and profit margin of different types of films',fontsize=20)
plt.savefig("Average budget of different films+Profit margin.png",dpi=300)

# Draw budgets and profits for different types of films (bar chart)
profit_budget_sort.iloc[:,0:2].plot(kind='bar', figsize=(12,9),color = ['darkorange','b'])
plt.title('Budget and Profit',fontsize = 20)
plt.xlabel('len',fontsize = 16)
plt.savefig('Budget and profit of different types of films-Bar chart.png',dpi=300)

5.2.2 movie keywords

#keywords analysis
keywords_list = []
for i in df['keywords']:
#Concatenate a list of strings into a long string
lis = ''.join(keywords_list)
#Set stop words
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
                background_color = 'black',
                random_state=9, # Set a random seed for random shading
                stopwords = stopwords,
                max_words = 3000,
plt.savefig('Word cloud diagram.png',dpi=300)

Operation result:

5.3 when

Check the type of runtime and find that it is the type of object, that is, string
So, advanced row data transformation

# shifting clause
df.runtime = df.runtime.convert_objects(convert_numeric=True)

Operation result:

5.3.1 film duration

# visualization
sns.distplot(df.runtime,bins = 20)
sns.despine(left = True) # Use the despine() method to remove the axis, the top and right axes are removed by default
plt.savefig('Movie duration histogram.png',dpi=300)

Operation result:

5.3.2 time of issue

fig = plt.figure(figsize=(8,6))

x = list(range(1,13))
y1 = df.groupby('release_month').revenue.size()
y2 = df.groupby('release_month').revenue.mean()# Average box office of single piece per month

# Left axis
ax1 = fig.add_subplot(1,1,1),y1,color='b',label='Number of films')
ax1.set_xlabel('Month')                   # Set x-axis label, Y-axis label
ax1.set_ylabel('Number of films',fontsize=16)

# Right axis
ax2 = ax1.twinx()
plt.plot(x,y2,'ro--',label='Average box office of single piece per month')
ax2.set_ylabel('Average box office of single piece per month',fontsize=16)

plt.savefig('Number of films per month and average box office of single film.png',dpi=300)

Operation result:

5.4 where

This data set collects film data in the U.S. region. For the film production companies and countries, no analysis will be made in the context of this story.

5.5 who

5.5.1 box office distribution and top 10 directors

# Create data box - Director
director_df = pd.DataFrame()
director_df = df[['director','revenue','budget','profit','vote_average']]
director_df = director_df.groupby(by = 'director').mean().sort_values(by='revenue',ascending = False) # Mean value

# Draw box office distribution histogram
director_df['revenue'].plot.hist(bins=100, figsize=(8,6)) 
plt.xlabel('box office')
plt.title('Box office distribution of different directors')
plt.savefig('Box office distribution of different directors.png',dpi = 300)

# Top 10 directors
director_df.revenue.sort_values(ascending = True).tail(10).plot(kind='barh',figsize=(8,6))
plt.xlabel('box office',fontsize = 16)
plt.ylabel('director',fontsize = 16)
plt.title('Box office ranking Top10 Director',fontsize = 20)
plt.savefig('Box office ranking Top10 Director.png',dpi = 300)

5.5.2 rating distribution and directors with top 10 rating

# Draw director rating histogram
director_df['vote_average'].plot.hist(bins=18, figsize=(8,6)) 
plt.title('Rating distribution of different directors')
plt.savefig('Rating distribution of different directors.png',dpi = 300)

# Top 10 directors
director_df.vote_average.sort_values(ascending = True).tail(10).plot(kind='barh',figsize=(8,6))
plt.xlabel('score',fontsize = 16)
plt.ylabel('director',fontsize = 16)
plt.title('Score ranking Top10 Director',fontsize = 20)
plt.savefig('Score ranking Top10 Director.png',dpi = 300)

5.6 how

5.6.1 proportion of original VS adaptation (pie chart)

# Create data box
original_df = pd.DataFrame()
original_df['keywords'] = df['keywords'].str.contains('based on').map(lambda x: 1 if x else 0)
original_df['profit'] = df['profit']
original_df['budget'] = df['budget']

# Calculation
novel_cnt = original_df['keywords'].sum() # Number of adaptations
original_cnt = original_df['keywords'].count() - original_df['keywords'].sum() # Number of original works
# Group by original or not
original_df = original_df.groupby('keywords', as_index = False).mean() # Note that the average of profit and budget is calculated here
# Add series
original_df['count'] = [original_cnt, novel_cnt]
# Calculate profit margin
original_df['profit_rate'] = (original_df['profit'] / original_df['budget'])*100

# Modify index
original_df.index = ['original', 'based_on_novel']
# Calculate percentage
original_pie = original_df['count'] / original_df['count'].sum()

# Pie chart
plt.title('Adaptation VS Original',fontsize=20)
plt.savefig('Adaptation VS Original.png',dpi=300)

5.6.2 original VS adapted budget / profit margin (portfolio chart)

x = original_df.index
y1 = original_df.budget
y2 = original_df.profit_rate

fig= plt.figure(figsize = (8,6))

# Left axis
ax1 = fig.add_subplot(1,1,1),y1,color='b',label='Average budget',width=0.25)
plt.xticks(rotation=0, fontsize=12)  # Change the abscissa name
ax1.set_xlabel('Original VS Adaptation')                   # Set x-axis label, Y-axis label
ax1.set_ylabel('Average budget',fontsize=16)

#Right axis
# Share x axis, generate secondary axis
ax2 = ax1.twinx()
ax2.plot(x,y2,'ro-.',linewidth=5,label='Average profit margin') 
ax2.set_ylabel('Average profit margin',fontsize=16)
ax2.legend(loc=1,fontsize=10) # loc=1,2,3,4 represent four corners respectively, which are in the same order as the four quadrants

# Display profit rate axis in percentage format
import matplotlib.ticker as mtick
yticks = mtick.FormatStrFormatter(fmt)

plt.savefig('Adaptation VS Original budget and profit margin.png',dpi=300)

5.7 how much

5.7.1 calculation of correlation coefficient

# Calculation of correlation coefficient matrix
revenue_corr = df[['runtime','popularity','vote_average','vote_count','budget','revenue']].corr()

            annot=True, # Show dimensions in each cell
            cmap="Blues", # Set fill color: yellow, green, blue
#             cmap="YlGnBu", # Set fill color: yellow, green, blue
#             cmap="coolwarm", # Set fill color: warm and cold
            cbar=True,  # Show color bar
            linewidths=0.5, # Add small space between cells for easy data reading
            # fmt='%.2f%%',  # This is to ensure that the display result is an integer (formatted output). There is a problem here
plt.savefig('Box office correlation matrix.png',dpi=300)

Operation result:

5.7.2 scatter diagram of box office influencing factors

# Draw a scatter diagram
fig = plt.figure(figsize=(17,5))

# # Learning seaborn reference:
ax1 = plt.subplot(1,3,1)
ax1 = sns.regplot(x='budget', y='revenue', data=revenue_df, x_jitter=.1,color='r',marker='x')
# marker: 'x','o','v','^','<'
# Jitter: jitter item, indicating jitter degree

ax2 = plt.subplot(1,3,2)
ax2 = sns.regplot(x='popularity', y='revenue', data=revenue_df, x_jitter=.1,color='g',marker='o')

ax3 = plt.subplot(1,3,3)
ax3 = sns.regplot(x='vote_count', y='revenue', data=revenue_df, x_jitter=.1,color='b',marker='v')

Published 3 original articles, won praise 1, visited 261
Private letter follow

Tags: JSON Lambda encoding Python

Posted on Fri, 13 Mar 2020 02:23:31 -0400 by Calvin770D