---
title: "Projectv1"
output:
  html_document: default
  pdf_document: default
date: "2024-09-06"
---

```{r, eval=FALSE, include=FALSE}
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("tinytex")
library(tinytex)
tinytex::install_tinytex()
```
```{r, include=FALSE}
library(tinytex)
```

```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# 1.1
## Research question
Is a movie's runtime correlated with its IMDb rating, and does this relationship differ by genre?

## Research motivation
This project examines whether a movie's runtime correlates with its IMDb rating, and how that relationship varies by genre. Different genres come with distinct audience expectations, and filmmakers often face tough decisions about how long their films should be. A longer runtime allows for more detailed storytelling, character development, and complex narratives, but it may also test the audience's patience. Shorter films may be more accessible and appealing to time-constrained viewers, but they risk feeling rushed or underdeveloped. Examining how runtime relates to ratings within specific genres can therefore provide insight into audience preferences.

We use regression analysis because it lets us estimate how runtime relates to ratings while controlling for genre, which isolates the relationship within different types of films. The results describe the association between runtime and audience ratings and can offer data-driven guidance for filmmakers: for example, whether a tight, concise runtime is associated with higher ratings in genres like horror, or whether a longer runtime is associated with higher ratings in dramas.
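
As a minimal sketch of what such a regression could look like, the example below fits a linear model with a runtime-by-genre interaction. The data frame `toy` and all of its values are invented for illustration and are not taken from IMDb; it also assumes one genre per row, which the real data only has after splitting the `genres` column.

```r
# Toy data, invented for illustration only (not IMDb values).
# Built so that ratings rise with runtime for Drama and fall for Horror.
toy <- data.frame(
  runtimeMinutes = c(80, 90, 100, 110, 80, 90, 100, 110),
  genre          = rep(c("Drama", "Horror"), each = 4),
  averageRating  = c(6.0, 6.5, 7.0, 7.5, 6.5, 6.0, 5.5, 5.0)
)

# The interaction term (runtimeMinutes * genre) lets the runtime slope
# differ between genres instead of forcing one common slope.
fit <- lm(averageRating ~ runtimeMinutes * genre, data = toy)

# "runtimeMinutes" is the slope for the reference genre (Drama);
# "runtimeMinutes:genreHorror" is how much the Horror slope differs from it.
coef(fit)
```

Without the interaction term (`averageRating ~ runtimeMinutes + genre`) the model would only shift the rating level per genre and could not show that runtime matters differently across genres.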



```{r, include=FALSE}
#2.1
library(tidyverse)
```

```{r}
# Loading the data
urls <- c('https://datasets.imdbws.com/title.basics.tsv.gz', 'https://datasets.imdbws.com/title.ratings.tsv.gz')
# The files are tab-separated and encode missing values as \N
datasets <- lapply(urls, read_delim, delim = '\t', na = '\\N')
basics <- datasets[[1]]
ratings <- datasets[[2]]
# Save local copies of the raw tables
write_csv(basics, "basics.csv")
write_csv(ratings, "ratings.csv")
```



```{r}
# Drop rows with a missing runtimeMinutes
temp <- basics %>% filter(!is.na(runtimeMinutes))

# Keep only the movies
temp1 <- temp %>% filter(titleType == "movie")

# Merge basics and ratings on their shared key, tconst
movies <- merge(temp1, ratings, by = "tconst")

# Remove endYear, as it only contains NAs, and originalTitle, as it is of no use here
movies <- movies %>%
  select(-endYear, -originalTitle)
```
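
The merge above keeps only titles that appear in both tables. The sketch below illustrates this on two small toy tables; the `tconst` values and numbers are hypothetical and only serve to show the join behaviour.

```r
library(dplyr)

# Toy tables, invented for illustration (hypothetical tconst values)
toy_basics <- data.frame(
  tconst = c("tt1", "tt2", "tt3"),
  runtimeMinutes = c(90, 120, 100)
)
toy_ratings <- data.frame(
  tconst = c("tt1", "tt3", "tt4"),
  averageRating = c(7.1, 5.4, 8.0)
)

# merge() performs an inner join by default: only keys present in
# both tables survive, so tt2 (no rating) and tt4 (no basics) are dropped.
merged <- merge(toy_basics, toy_ratings, by = "tconst")

# anti_join() lists the rows of the first table that found no match,
# which is a quick way to see how many movies are lost in the merge.
unrated <- anti_join(toy_basics, toy_ratings, by = "tconst")
```

Checking `nrow(unrated)` on the real tables would show how many movies are excluded simply because they have no rating.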


```{r, include=FALSE}
library(ggplot2)
```


## Visualizing the data
```{r echo=FALSE}
boxplot(temp1$runtimeMinutes,
        main = "Boxplot of the runtime of movies",
        ylab = "Runtime in minutes")
```


This boxplot shows how the runtimes (in minutes) of all the movies are distributed.  
Extreme values dominate the plot, so it offers little insight. We therefore decided to remove the outliers from the dataset.

```{r}
# Outlier bounds for runtimeMinutes: mean plus or minus 3 standard deviations
sd_runtimeMinutes <- sd(movies$runtimeMinutes)
mean_runtimeMinutes <- mean(movies$runtimeMinutes)
lower <- mean_runtimeMinutes - 3 * sd_runtimeMinutes  # lower bound, not applied below
outliers_runtimeMinutes <- mean_runtimeMinutes + 3 * sd_runtimeMinutes  # upper cutoff

# Remove movies whose runtime exceeds the upper cutoff
movies <- movies %>% filter(runtimeMinutes <= outliers_runtimeMinutes)

```
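
To make the three-standard-deviation rule concrete, here is a small sketch on a toy vector of runtimes. The values are invented for illustration; `runtimes`, `upper` and `kept` are names used only in this example.

```r
# Toy runtimes in minutes, invented for illustration;
# 600 plays the role of an extreme value.
runtimes <- c(rep(c(85, 90, 95, 100, 105), times = 4), 600)

# Upper cutoff: mean plus 3 standard deviations
upper <- mean(runtimes) + 3 * sd(runtimes)

# Keep only the runtimes at or below the cutoff
kept <- runtimes[runtimes <= upper]

# Compare the spread before and after removing the extreme value
c(before = sd(runtimes), after = sd(kept))
```

Note that the mean and standard deviation are themselves pulled upward by extreme values, so the cutoff computed this way can be fairly loose. Comparing the standard deviation before and after filtering, as in the last line, gives a rough sense of how strongly the outliers influenced it.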

```{r}
# Making a cutoff for numVotes: keep only movies with at least 30 votes
movies <- movies %>% 
  filter(numVotes >= 30)
```


```{r echo=FALSE}
boxplot(movies$runtimeMinutes,
        main = "Boxplot of the runtime of movies",
        ylab = "Runtime in minutes")
```
This boxplot shows the distribution of movie runtimes without the outliers and gives a clearer picture of the typical length of a movie.

```{r echo=FALSE}
movies_long <- movies %>% separate_rows(genres, sep = ",")

ggplot(movies_long, aes(x = genres)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Genre by number of movies") +
  ylab("Number of movies") +
  xlab("Genres") +
  scale_y_continuous(labels = scales::number_format(big.mark = ",", accuracy = 1))
```
This plot shows how many movies in our dataset belong to each individual genre.
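
The genre counts rely on splitting the comma-separated `genres` column into one row per genre. The sketch below shows the effect on a toy table; the `tconst` values and genre strings are hypothetical.

```r
library(tidyr)
library(dplyr)

# Toy table, invented for illustration (hypothetical tconst values)
toy_movies <- data.frame(
  tconst = c("tt1", "tt2"),
  genres = c("Drama,Romance", "Horror")
)

# separate_rows() turns "Drama,Romance" into two rows,
# one for Drama and one for Romance, both keeping tconst "tt1".
toy_long <- separate_rows(toy_movies, genres, sep = ",")

# Count movies per genre, which is what the bar chart displays
genre_counts <- count(toy_long, genres)
```

Because a movie with several genres contributes one row per genre, the bars in the chart sum to more than the number of movies in the dataset.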

```{r echo=FALSE}
ggplot(movies, aes(x = averageRating)) +
  geom_bar() +
  ggtitle("Average rating by number of movies") +
  ylab("Number of movies") +
  xlab("Average rating") +
  scale_y_continuous(labels = scales::number_format(big.mark = ",", accuracy = 1))
```
The last plot shows how the average ratings of all movies in the dataset are distributed.

```{r}
write_csv(movies, "movies.csv")
```

```{r}
movies <- read_csv("movies.csv")
```