Investigating how the IMDb ratings for TV Shows/Series are affected by their length in seasons and episodes
![IMDB](https://private-user-images.githubusercontent.com/143189371/269648159-ee04e645-62f2-4d8a-b749-2565cc1da55a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDM0OTIsIm5iZiI6MTczOTMwMzE5MiwicGF0aCI6Ii8xNDMxODkzNzEvMjY5NjQ4MTU5LWVlMDRlNjQ1LTYyZjItNGQ4YS1iNzQ5LTI1NjVjYzFkYTU1YS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjExJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMVQxOTQ2MzJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jNDNiZGM3MDE5N2FlZTY3YWM1NGQ2M2UzOWI4YTc4ZmUzMDg4OTcyOTY2MjA1MWIwNTRlNjljNGRmYWNiNjIwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.P1-jR4SuvbHJuqCBPrRSJNPzirJqC2v_I5m3T4fzvGc)
How are viewers' IMDb ratings of TV shows/series affected by the length of a series in seasons and episodes? Does that effect change when the popularity of the series is taken into account?
Welcome to our GitHub repository! Today, almost any part of daily life can be rated and reviewed. In entertainment, ratings play an important role in the public's choice between films and series, and they subconsciously shape a viewer's attitude towards a title before watching it. It is therefore worth understanding which factors influence people's opinions, and how those opinions relate to the way they approach films and series.

In this project we study the IMDb ratings of TV shows/series and analyze how those ratings are related to the number of seasons and episodes of each series. We also measure popularity by the number of votes each series has received on the IMDb website, and consider whether its association with the rating differs based on the length of the series. We use a linear regression model to obtain our results, and include plots and graphs to illustrate our findings.
- IMDb (Internet Movie Database) is a go-to online platform for information about movies, TV shows, actors, directors, and more.
- It offers details like titles, release dates, cast info, ratings, and reviews, making it a popular resource for entertainment enthusiasts and professionals.
- Subsets of IMDb data can be accessed for personal/non-commercial purposes.
- This project uses IMDb data to study the relationship between the length of TV shows and their overall ratings.
- Data files for episodes: https://datasets.imdbws.com/title.episode.tsv.gz
- Data files for ratings: https://datasets.imdbws.com/title.ratings.tsv.gz
- Data files for titles: https://datasets.imdbws.com/title.basics.tsv.gz
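The three files linked above are gzip-compressed, tab-separated files in which IMDb marks missing values with `\N`. As a minimal sketch of how such a file could be read in base R (the helper name `read_imdb_tsv` and the demo rows are hypothetical, and the repository's own `Download.R` may work differently):

```r
# Hypothetical helper: read an IMDb-style TSV file (plain or gzip-compressed).
# "\N" is the marker IMDb uses for missing values, so it is mapped to NA.
read_imdb_tsv <- function(path) {
  read.delim(
    path,
    sep = "\t",
    quote = "",              # titles may contain quote characters
    na.strings = "\\N",
    stringsAsFactors = FALSE
  )
}

# Self-contained demonstration on a small temporary gzip file (invented rows)
demo_path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(demo_path, "w")
writeLines(c(
  "tconst\tparentTconst\tseasonNumber\tepisodeNumber",
  "tt0000001\ttt1000000\t1\t1",
  "tt0000002\ttt1000000\t\\N\t\\N"
), con)
close(con)

episodes_demo <- read_imdb_tsv(demo_path)
print(episodes_demo)
```

For the real data, each file would first be saved locally, for example with `download.file(url, destfile, mode = "wb")`, and the local path passed to the helper.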
First, we downloaded the required data from the IMDb Developer website. Our project needs three datasets: episodes, ratings and titles.
Datasets:
- episodes: data on all episodes of all TV series, with identifiers for the series and the episode
- ratings: data on average ratings and number of votes for all series
- titles: data on title, start year and end year for all series
All three datasets contain one common variable: tconst. In the ratings and titles datasets it is the unique identifier of a title (series or movie). In the episodes dataset, however, it identifies an individual episode, while parentTconst identifies the series. Therefore, we drop the episode-level tconst and rename parentTconst to tconst, so that all three datasets can be merged cleanly on the same key.
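The renaming step described above can be sketched in base R as follows. The three small data frames are invented stand-ins for the real IMDb tables, and the repository's scripts may use `data.table` or `dplyr` instead:

```r
# Toy stand-ins for the three IMDb tables (all values invented for illustration)
episodes <- data.frame(
  tconst = c("tt01", "tt02", "tt03", "tt04"),      # episode identifiers
  parentTconst = c("ttA", "ttA", "ttA", "ttB"),    # series identifiers
  seasonNumber = c(1, 1, 2, 1),
  episodeNumber = c(1, 2, 1, 1)
)
ratings <- data.frame(
  tconst = c("ttA", "ttB"),
  averageRating = c(8.1, 6.4),
  numVotes = c(25000, 1200)
)
titles <- data.frame(
  tconst = c("ttA", "ttB"),
  primaryTitle = c("Series A", "Series B")
)

# Drop the episode-level tconst, then rename parentTconst to tconst
episodes$tconst <- NULL
names(episodes)[names(episodes) == "parentTconst"] <- "tconst"

# All three tables now share the series identifier as their key
merged <- merge(merge(episodes, ratings, by = "tconst"), titles, by = "tconst")
print(merged)
```

After the rename, every row of `merged` is an episode carrying the rating and title of its parent series.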
Dataset | Variable | Definition |
---|---|---|
ratings | averageRating | weighted average of all the individual user ratings |
ratings | numVotes | number of votes the title has received |
episodes | parentTconst | alphanumeric identifier of the parent TV Series |
episodes | seasonNumber | season number the episode belongs to |
episodes | episodeNumber | episode number of the tconst in the TV series |
titles | titleType | the type/format of the title (e.g. movie, series) |
titles | primaryTitle | the title used by the filmmakers on promotional materials at the point of release |
titles | originalTitle | original title, in the original language |
titles | isAdult | 0: non-adult title; 1: adult title |
titles | startYear | represents the release year of a title. In the case of TV Series, it is the series start year |
titles | endYear | TV Series end year. ‘\N’ for all other title types |
titles | runtimeMinutes | primary runtime of the title, in minutes |
titles | genres | includes up to three genres associated with the title |
As a next step, we performed several transformations to reach our final dataset:
- calculated the total number of episodes for each series
- split the data into 2 sections:
  - long series (5+ seasons)
  - short series (2-4 seasons)
- filtered out the series with fewer than 1000 votes (as obscure series may have unreliable or biased ratings)
- performed the necessary data cleaning tasks (removed duplicates and erroneous observations)
- merged the data together to create the final dataset, with a dummy variable to distinguish long and short series
Due to some inaccuracies in the data, we decided to exclude single-season series. We also did not include the exact number of seasons as a variable, because series with 10+ seasons showed only 9 seasons in the initial dataset.
In the end, our final dataset contained 5177 observations: 1733 long series and 3444 short ones.
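The transformation steps above can be illustrated with a minimal base R sketch. The toy data and the column names `num_seasons` and `long` are assumptions made for this example; the repository's scripts may name and compute these differently:

```r
# Toy episode-level data (invented): one row per episode
episodes <- data.frame(
  tconst = rep(c("ttA", "ttB", "ttC", "ttD"), times = c(12, 6, 3, 8)),
  seasonNumber = c(rep(1:6, each = 2), rep(1:3, each = 2), rep(1, 3), rep(1:2, each = 4))
)
ratings <- data.frame(
  tconst = c("ttA", "ttB", "ttC", "ttD"),
  averageRating = c(7.9, 8.3, 6.5, 7.0),
  numVotes = c(50000, 4000, 2500, 300)
)

# Total number of episodes and number of seasons per series
num_episodes <- aggregate(seasonNumber ~ tconst, data = episodes, FUN = length)
names(num_episodes)[2] <- "num_episodes"
num_seasons <- aggregate(seasonNumber ~ tconst, data = episodes, FUN = max)
names(num_seasons)[2] <- "num_seasons"

series <- merge(merge(num_episodes, num_seasons, by = "tconst"), ratings, by = "tconst")

# Keep series with at least 1000 votes and at least 2 seasons
series <- series[series$numVotes >= 1000 & series$num_seasons >= 2, ]

# Dummy variable: 1 for long series (5+ seasons), 0 for short series (2-4 seasons)
series$long <- as.integer(series$num_seasons >= 5)
print(series)
```

In this toy input, `ttC` is dropped because it has a single season and `ttD` because it has fewer than 1000 votes, leaving one long and one short series.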
Next, we ran a linear regression to analyze the data. The dependent variable is the average rating of each series. We included 3 independent variables to answer our research question:
- number of episodes - in our opinion the best way to measure the length of the series, since we could not include the number of seasons
- number of votes - to estimate the show's popularity
- interaction between the number of votes and the dummy variable for long series - to see if the effect of the series' length is stronger for more popular series
Call: lm(formula = averageRating ~ num_episodes + numVotes + numVotes * long.x, data = merged_episodes)
![Linear Regression](https://private-user-images.githubusercontent.com/143189371/275078313-a545dcd9-01c4-4e98-838c-cff00a337d54.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDM0OTIsIm5iZiI6MTczOTMwMzE5MiwicGF0aCI6Ii8xNDMxODkzNzEvMjc1MDc4MzEzLWE1NDVkY2Q5LTAxYzQtNGU5OC04MzhjLWNmZjAwYTMzN2Q1NC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjExJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMVQxOTQ2MzJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0zMWU5ZmVhYjJkMDNjYTUyMWVkOTIxZWIzZmQ4OGQzMDE4NTBmM2Q2MTZmZDg1MTg1ZjlkZmYwNzgxYzEyNjRiJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.3a39Y20s3apv4ATkarGp8FysBSWJMNaTOBEDp4DVoiU)
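In R's formula syntax, `a * b` expands to `a + b + a:b`, so the call above also estimates a main effect for `long.x` in addition to the interaction term. The sketch below fits the same formula on simulated data (the data frame `sim` and all its values are invented and carry no information about IMDb) to show which coefficients the model contains and how the slope of `numVotes` differs between the two groups:

```r
set.seed(1)
n <- 200

# Simulated stand-in for the merged dataset (NOT real IMDb data)
sim <- data.frame(
  num_episodes = sample(10:200, n, replace = TRUE),
  numVotes = sample(1000:100000, n, replace = TRUE),
  long.x = rbinom(n, 1, 0.3)
)
sim$averageRating <- 7 + rnorm(n, sd = 0.5)

# Same formula as in the call above
fit <- lm(averageRating ~ num_episodes + numVotes + numVotes * long.x, data = sim)

# The coefficient names show the expanded model terms
print(names(coef(fit)))

# Slope of numVotes for short series (long.x = 0) and long series (long.x = 1)
b <- coef(fit)
slope_short <- b[["numVotes"]]
slope_long <- b[["numVotes"]] + b[["numVotes:long.x"]]
print(c(short = slope_short, long = slope_long))
```

With a negative interaction coefficient, `slope_long` is smaller than `slope_short`, meaning that each additional vote is associated with a smaller change in rating for long series.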
Scatterplot: X = num_episodes, Y = averageRating
![Plot](https://private-user-images.githubusercontent.com/143189371/275078346-29b23ef3-2b92-4581-a80e-bbd114614309.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDM0OTIsIm5iZiI6MTczOTMwMzE5MiwicGF0aCI6Ii8xNDMxODkzNzEvMjc1MDc4MzQ2LTI5YjIzZWYzLTJiOTItNDU4MS1hODBlLWJiZDExNDYxNDMwOS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjExJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMVQxOTQ2MzJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wOTY5Yzc3YmY2ZTQyMWNjZWEzZTU0MWQzMWZjZWVmOTFlMGU3NTU3YjhmMTU5NTExMWRiMzJiNmI2M2M2YWMzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.1C2d_-eS3xjP3b4uswCBL3Mkaa8yo-21J7KJd_KXPLU)
![Box Plot](https://private-user-images.githubusercontent.com/143189371/275078444-29529551-c118-4261-905d-f0e6e193e7c9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDM0OTIsIm5iZiI6MTczOTMwMzE5MiwicGF0aCI6Ii8xNDMxODkzNzEvMjc1MDc4NDQ0LTI5NTI5NTUxLWMxMTgtNDI2MS05MDVkLWYwZTZlMTkzZTdjOS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjExJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMVQxOTQ2MzJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kYmYxNmQ0ODJkNDI1MWNmNGY5NWMzNmY1ZGMxMDI0ZGJmNmJhNmY2N2U4ODA4ZWNiNmUwZjZkYTJlMjg0YjZhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.zwljcsFMADZsgqSh_oO092O93w03i92ImoHZ5feFBmY)
![Histogram](https://private-user-images.githubusercontent.com/143189371/275078472-b349e6bf-e20a-4f9a-98c3-65e04fa52c4c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDM0OTIsIm5iZiI6MTczOTMwMzE5MiwicGF0aCI6Ii8xNDMxODkzNzEvMjc1MDc4NDcyLWIzNDllNmJmLWUyMGEtNGY5YS05OGMzLTY1ZTA0ZmE1MmM0Yy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjExJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMVQxOTQ2MzJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kN2QwNmM3NmFkMWJhZWExZjE4NDAxYjEzNWM5ZmU2N2Q0ZDQxYjc1YzQ1MTI5NDJhODRjZTJlMDRiYWJkZjdhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.bwfzP4FrTMrUC3sQkHyEZGAOkd9BjJSIj5Ol77PiMHw)
All variables in our model are significant, which is great news! We can see that an increasing number of episodes is associated with a lower rating. This is consistent with the hypothesis that after a couple of seasons the formula wears out and viewers get bored. We can also see that series with a higher number of votes have higher ratings, which is to be expected. Interestingly, the coefficient for the interaction is negative: the positive association between the number of votes and the rating is weaker for long series than for short ones.
The scatterplot illustrates these results well: we can see a steady decrease in rating as the number of episodes increases. The box plot and the histogram show the structure of the dataset divided into long and short series and compare the distribution of ratings. Both plots tell a similar story: the distribution does not differ substantially between the two groups.
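For readers who want to reproduce this kind of figure, here is a minimal sketch using base R graphics on simulated data. The function name `draw_rating_plots`, the column names `long` and `group`, and all simulated values are assumptions made for this example, not the code or data behind the figures shown here:

```r
set.seed(42)
n <- 300

# Simulated stand-in data (NOT real IMDb data); the downward slope is built in
sim <- data.frame(
  num_episodes = sample(10:250, n, replace = TRUE),
  long = rbinom(n, 1, 0.35)
)
sim$averageRating <- pmin(10, pmax(1, 7.5 - 0.003 * sim$num_episodes + rnorm(n, sd = 0.6)))
sim$group <- factor(sim$long, levels = c(0, 1), labels = c("short", "long"))

# Hypothetical helper: writes a scatterplot, a box plot and two histograms to a PDF
draw_rating_plots <- function(data, file) {
  pdf(file)
  on.exit(dev.off())

  # Scatterplot with a fitted regression line
  plot(data$num_episodes, data$averageRating,
       xlab = "num_episodes", ylab = "averageRating",
       main = "Rating vs. number of episodes")
  abline(lm(averageRating ~ num_episodes, data = data), col = "red")

  # Box plot of ratings by series length
  boxplot(averageRating ~ group, data = data,
          xlab = "Series length", ylab = "averageRating",
          main = "Ratings of short and long series")

  # Side-by-side histograms of ratings for each group
  old_par <- par(mfrow = c(1, 2))
  for (g in levels(data$group)) {
    hist(data$averageRating[data$group == g],
         main = paste("Ratings:", g, "series"), xlab = "averageRating")
  }
  par(old_par)

  invisible(file)
}

plot_file <- draw_rating_plots(sim, tempfile(fileext = ".pdf"))
```

Base graphics are used here only to keep the sketch free of package dependencies; the same plots can be drawn with `ggplot2`.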
It has to be said, however, that since our project focused on one specific interaction, the model explains only a small percentage of the variance. In reality, numerous other factors influence the rating of a series. In further, more advanced projects it would be interesting to include other variables, such as production budget, genre, director and actors.
Structure and files of the repository:

    ├── README.md
    ├── data
    ├── gen
    │   ├── temp
    │   ├── output
    └── src
        ├── analysis
        ├── data-preparation
        └── paper
Please follow the installation guides on http://tilburgsciencehub.com/.
- Make. Installation guide.
- To knit RMarkdown documents, make sure you have installed Pandoc using the installation guide on their website.
- For R, make sure you have installed the following packages:
install.packages("data.table")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("ggplot2")
To run the code, follow these instructions:
- Fork this repository
- Open your command line / terminal and run the following code:
git clone https://github.com/{your username}/IMDb-Series-Rating-Investigation.git
- Set your working directory to
IMDb-Series-Rating-Investigation
and run the following command:
make
- When make has successfully run all the code,
- To remove all raw and unnecessary data files created during the pipeline, run the following code in the command line / terminal:
make clean
Note: when the command line/terminal is closed, the website will not be available anymore.
An alternative route is to run the scripts manually in the following order:
- ../src/data-preparation -> Download.R
- ../src/data-preparation -> Cleaning.R
- ../src/data-preparation -> Merging.R
- ../src/data-preparation -> Exploration.Rmd
- ../src/analysis -> Plot.R
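The manual order above could also be scripted from an R session. This is only a sketch under several assumptions: the function `run_pipeline` is hypothetical, the paths are taken relative to the repository root, the `rmarkdown` package is assumed to be installed for the `.Rmd` step, and relative paths inside the scripts may require a different working directory than the one used here:

```r
# Hypothetical helper: runs the scripts in the order listed above.
# `root` is assumed to be the path to the cloned repository.
run_pipeline <- function(root = ".") {
  steps <- c(
    "src/data-preparation/Download.R",
    "src/data-preparation/Cleaning.R",
    "src/data-preparation/Merging.R",
    "src/data-preparation/Exploration.Rmd",
    "src/analysis/Plot.R"
  )

  # Work from the repository root and restore the previous directory afterwards
  old_wd <- setwd(root)
  on.exit(setwd(old_wd))

  for (step in steps) {
    if (!file.exists(step)) {
      stop("Missing pipeline step: ", step)
    }
    if (grepl("\\.Rmd$", step)) {
      rmarkdown::render(step, quiet = TRUE)   # knit the RMarkdown document
    } else {
      source(step)                            # run the plain R script
    }
  }
  invisible(steps)
}
```

Calling `run_pipeline("IMDb-Series-Rating-Investigation")` from the parent directory of the clone would then run each step in turn and stop with an error at the first missing file.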
These are the contributors of the project:
- Georgios Oikonomou
- Flip Gootjes
- Aleksander Spisz
- Chon Fai Wong
This project is part of the "Data Preparation & Workflow Management" course at Tilburg University.