Project Gutenberg with Stylo/Stylo_with_texts_from_PG.Rmd

---
title: "Stylo with Project Gutenberg Texts"
author: "David Joseph Wrisley @DJWrisley"
date: "4/13/2020"
output: html_notebook
---

## Introduction

This notebook continues the [previous one](https://github.com/djwrisley/RLAC/blob/master/GutenbergR.Rmd) when we used the GutenbergR package to analyze texts in Project Gutenberg. 

In this notebook we will work will build a corpus out of texts from Project Gutenberg in order to try out computational stylistics, also known as "stylometry." Corpus preparation and file structure for the stylo package is quite specific and this notebook makes that task easier. 

Using Project Gutenberg (PG) for your stylometry experiments has the same limitations as we encountered in the last notebook.  PG has more than 60K texts, most of them are in English and they predate the 1920s.  That being said you can still do quite a lot with it. 

### Step 1: Data Preparation

The only data that you need to prepare in advance is a csv of the metadata of the files. You need to collect the Project Gutenberg text numbers for those texts you would like to use. Austen's Pride and Prejudice is found at https://www.gutenberg.org/ebooks/1342 and so the book number is 1342. 

When working with programming it is useful to have structured sheets of information about the works you study anyway. Take a look at the [sheet](https://github.com/djwrisley/RLAC/blob/master/Stylo_with_texts_from_Project_Gutenberg.csv) that I built for Austen and the Bronte sisters. 

You can have any metadata in the sheet that you want, but the essentials are the Project Gutenberg number, author name and shortened title. Notice I also added date. You could also generate complex filenames with other metadata as Paul Vierthaler uses in his Python Stylometry [exercise](https://github.com/vierth/nyuabudhabi). 

We use Google sheets in class for the crowd creation of the corpus metadata from Project Gutenberg.

TIP: Use a concatenating formula to build the filename as the Stylo package wants it, as this one we used for our Google Sheet: =CONCATENATE(B2,"_",C2,E2,".txt").


### Step II: Installing Packages

First step, as we have seen in previous notebooks, is to install the needed packages:


```{r packages}
install.packages("gutenbergr")
install.packages("devtools")
install.packages("dplyr")

```

And then to load those libraries:

```{r}
library(gutenbergr)
library(devtools)
library(dplyr)

```

We need to load the stylo package from the project GitHub account (at the time of writing this notebook it was not available in CRAN, normally install.packages("stylo") and library(stylo) would suffice). If you are running iOS you will need to download [XQuartz](https://www.xquartz.org/) and restart your system. You may get a note to update some files.  Make sure you have the console open in RStudio so that you can agree to those updates. 

```{r}
install_github("computationalstylistics/stylo")
# install.packages("stylo") # if available in CRAN
library(stylo)

```

### Step III: Assembling the Corpus 

Define csvlist as the path or URL of the place where you are building your corpus metadata from Project Gutenberg. Make sure you deactivate R's automatic conversion of strings into factors for a data frame. We need to be able to use numbers and strings as such. I have included a link to my Drive.  The file is also available [here](https://github.com/djwrisley/RLAC/blob/master/Stylo_with_texts_from_Project_Gutenberg.csv).


```{r}
csvlist = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT87reVxrWeaYiRL6hP-fnPB9MX6Rq4eZHdEURjbzY5MbY7Q5Y59MWZYc309CqIqLxBmdaRog5BhbWn/pub?gid=0&single=true&output=csv"

booklist <- read.csv(csvlist, header = TRUE, stringsAsFactors=FALSE)
```

Let's take a look look at the properties of the dataframe 'booklist' we have built.  You will see that there are both characters and integers. 

```{r}
str(booklist)

```

Next let's create the folder structure that the stylo package expects.  Assign the desired path and make a directory and a subdirectory.


```{r}

path = "/Users/djw12/Desktop/stylo"
dir.create(path, showWarnings = TRUE)
dir.create(paste0(path, "/corpus"))

```

Now we will save the books as .txt files in the format that the Stylo package expects. This code loops over the dataframe "booklist" mentioned above, downloads the books from Project Gutenberg using the function gutenberg_download that removes the boilerplate language at the beginning and end, as well as other information in the dataframe and saves the texts of the files in the desire folder with the correct names. 


```{r}
path2 <- "/Users/djw12/Desktop/stylo/corpus/"

for (row in 1:nrow(booklist)){
  bookname <- booklist[row,4]
  stylotext <- gutenberg_download(booklist[row,1])
  stylotext <- select(stylotext, text)
  write.table(stylotext, file = paste0(path2, bookname), sep="", row.names = FALSE)
}

```


What you should see is a folder corpus inside a folder stylo with the files. 

![The 17 properly labeled files for Stylo](/Users/djw12/Desktop/corpus_folder.png)


### Step IV: Using the Corpus with Stylo 


Now you can go ahead with stylo and try some commands, adapted from those found in [Eder et al](https://journal.r-project.org/archive/2016/RJ-2016-007/RJ-2016-007.pdf).

Such as a Bootstrap Consensus Tree: 


```{r}
stylo(corpus.dir = path2, analysis.type = "BCT", mfw.min = 100, mfw.max = 3000, custom.graph.title = "Austen vs Bronte sisters", write.png.file = TRUE, gui = FALSE)

```

Or a Principle Component Analysis: 


```{r}

stylo(corpus.dir = path2, analysis.type = "PCR", custom.graph.title = "Austen vs. the Bronte sisters", pca.visual.flavour = "loadings", write.png.file = TRUE, gui = FALSE)

```

Enjoy!