---
title: "Stylo with Project Gutenberg Texts"
author: "David Joseph Wrisley @DJWrisley"
date: "4/13/2020"
output: html_notebook
---
## Introduction
This notebook continues the [previous one](https://github.com/djwrisley/RLAC/blob/master/GutenbergR.Rmd), in which we used the gutenbergr package to analyze texts in Project Gutenberg.
In this notebook we will build a corpus out of texts from Project Gutenberg in order to try out computational stylistics, also known as "stylometry." The stylo package is quite specific about corpus preparation and file structure, and this notebook makes that task easier.
Using Project Gutenberg (PG) for your stylometry experiments has the same limitations that we encountered in the last notebook. PG has more than 60K texts, but most of them are in English and predate the 1920s. That being said, you can still do quite a lot with it.
### Step I: Data Preparation
The only data that you need to prepare in advance is a csv of the metadata of the files. You need to collect the Project Gutenberg text numbers for those texts you would like to use. Austen's Pride and Prejudice is found at https://www.gutenberg.org/ebooks/1342 and so the book number is 1342.
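If you have collected the book URLs rather than the numbers, the number can be pulled out of each URL. This is a minimal sketch that assumes the URLs follow the `/ebooks/<number>` pattern shown above; the variable names are only illustrative.

```{r}
# Book pages on Project Gutenberg (Pride and Prejudice, Jane Eyre)
urls <- c(
  "https://www.gutenberg.org/ebooks/1342",
  "https://www.gutenberg.org/ebooks/1260"
)

# Keep only the digits that follow "/ebooks/" and convert them to integers
pg_ids <- as.integer(sub("^.*/ebooks/([0-9]+).*$", "\\1", urls))
pg_ids
```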
When you work programmatically, it is useful to keep structured sheets of information about the works you study anyway. Take a look at the [sheet](https://github.com/djwrisley/RLAC/blob/master/Stylo_with_texts_from_Project_Gutenberg.csv) that I built for Austen and the Bronte sisters.
You can have any metadata in the sheet that you want, but the essentials are the Project Gutenberg number, author name and shortened title. Notice I also added date. You could also generate complex filenames with other metadata as Paul Vierthaler uses in his Python Stylometry [exercise](https://github.com/vierth/nyuabudhabi).
We use Google sheets in class for the crowd creation of the corpus metadata from Project Gutenberg.
TIP: Use a concatenating formula to build the filename as the Stylo package wants it, as this one we used for our Google Sheet: =CONCATENATE(B2,"_",C2,E2,".txt").
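The same filename can be built in R instead of in the spreadsheet. The sketch below uses a small made-up data frame; the column names (`pg_id`, `author`, `title`, `year`) are assumptions for illustration and will not necessarily match your sheet.

```{r}
# Hypothetical metadata; the column names are illustrative only
meta <- data.frame(
  pg_id  = c(1342, 1260),
  author = c("Austen", "CBronte"),
  title  = c("Pride", "JaneEyre"),
  year   = c(1813, 1847),
  stringsAsFactors = FALSE
)

# Equivalent of the CONCATENATE formula: author, underscore, title, year, ".txt"
meta$filename <- paste0(meta$author, "_", meta$title, meta$year, ".txt")
meta$filename
```

Building the name in R keeps the sheet simpler, but the spreadsheet formula has the advantage that everyone contributing to the sheet can see the resulting filename immediately.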
### Step II: Installing Packages
The first step, as we have seen in previous notebooks, is to install the needed packages:
```{r packages}
install.packages("gutenbergr")
install.packages("devtools")
install.packages("dplyr")
```
And then to load those libraries:
```{r}
library(gutenbergr)
library(devtools)
library(dplyr)
```
We need to install the stylo package from the project's GitHub account. At the time of writing this notebook it was not available on CRAN; normally install.packages("stylo") and library(stylo) would suffice. If you are running macOS you will need to download [XQuartz](https://www.xquartz.org/) and restart your system. You may get a prompt to update some packages. Make sure you have the console open in RStudio so that you can agree to those updates.
```{r}
install_github("computationalstylistics/stylo")
# install.packages("stylo") # if available in CRAN
library(stylo)
```
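To avoid reinstalling packages every time the notebook is run, you can first check which ones are missing. This is a small sketch; `missing_packages` is a helper defined here for illustration and is not part of any package.

```{r}
# Return the packages from `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("gutenbergr", "dplyr", "stylo")
to_install <- missing_packages(needed)
to_install
# install.packages(to_install)  # uncomment to install anything that is missing
```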
### Step III: Assembling the Corpus
Define csvlist as the path or URL of the sheet where you are building your corpus metadata from Project Gutenberg. Make sure you deactivate R's automatic conversion of strings into factors when reading the data frame (since R 4.0.0 this is the default, but setting it explicitly does no harm). We need to be able to use numbers and strings as such. I have included a link to my Drive. The file is also available [here](https://github.com/djwrisley/RLAC/blob/master/Stylo_with_texts_from_Project_Gutenberg.csv).
```{r}
csvlist = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT87reVxrWeaYiRL6hP-fnPB9MX6Rq4eZHdEURjbzY5MbY7Q5Y59MWZYc309CqIqLxBmdaRog5BhbWn/pub?gid=0&single=true&output=csv"
booklist <- read.csv(csvlist, header = TRUE, stringsAsFactors=FALSE)
```
Let's take a look at the properties of the data frame 'booklist' we have built. You will see that it contains both character and integer columns.
```{r}
str(booklist)
```
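Before downloading anything, it can help to check the metadata for common problems, such as a missing book number, a repeated filename, or a filename that does not follow the `author_title.txt` shape. The sketch below runs on a made-up data frame; the column names `pg_id` and `filename` are assumptions and should be replaced with those in your own sheet.

```{r}
# Toy metadata with one deliberately duplicated row
toy <- data.frame(
  pg_id    = c(1342L, 1260L, 1260L),
  filename = c("Austen_Pride.txt", "CBronte_JaneEyre.txt", "CBronte_JaneEyre.txt"),
  stringsAsFactors = FALSE
)

dup_rows  <- toy[duplicated(toy$filename), ]                 # repeated filenames
bad_ids   <- toy[is.na(toy$pg_id), ]                         # missing book numbers
well_made <- grepl("^[^_]+_.+\\.txt$", toy$filename)         # author_title.txt shape

nrow(dup_rows)
nrow(bad_ids)
all(well_made)
```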
Next let's create the folder structure that the stylo package expects. Assign the desired path and make a directory and a subdirectory.
```{r}
path = "/Users/djw12/Desktop/stylo"
dir.create(path, showWarnings = TRUE)
dir.create(paste0(path, "/corpus"))
```
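If you would rather not hard-code a path that only exists on one machine, the same folder structure can be built with `file.path()` and a single recursive `dir.create()` call. This sketch uses a temporary directory as the base, which is an assumption made so that the example runs anywhere; substitute your own location.

```{r}
# Base folder; tempdir() is used here only so the example runs on any machine
base_dir   <- file.path(tempdir(), "stylo")
corpus_dir <- file.path(base_dir, "corpus")

# recursive = TRUE creates "stylo" and "stylo/corpus" in one step
dir.create(corpus_dir, recursive = TRUE, showWarnings = FALSE)
dir.exists(corpus_dir)
```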
Now we will save the books as .txt files in the format that the stylo package expects. This code loops over the data frame "booklist" mentioned above and downloads each book from Project Gutenberg with the function gutenberg_download, which removes the boilerplate language at the beginning and end of the file. It then keeps only the text column, discarding the other information in the downloaded data frame, and saves each text in the desired folder under the correct name.
```{r}
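# Folder where the corpus files will be written (note the trailing slash)
path2 <- "/Users/djw12/Desktop/stylo/corpus/"
for (row in seq_len(nrow(booklist))) {
  # Column 4 is used as the file name; column 1 holds the PG number
  bookname <- booklist[row, 4]
  stylotext <- gutenberg_download(booklist[row, 1])
  stylotext <- select(stylotext, text)
  # Write the text only: no header row, no row names, no quotation marks
  write.table(stylotext, file = paste0(path2, bookname), sep = "",
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}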
```
You should now see a folder named corpus inside the folder stylo, containing the text files.
![The 17 properly labeled files for Stylo](/Users/djw12/Desktop/corpus_folder.png)
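You can also check the folder from R. The sketch below writes two toy files into a temporary corpus folder and lists them; the file contents and the folder location are stand-ins for illustration. It also extracts the part of each filename before the first underscore, which, as far as we understand the stylo naming convention, is what the package uses to group and colour the texts.

```{r}
# Temporary corpus folder used only for this demonstration
demo_dir <- file.path(tempdir(), "stylo_demo", "corpus")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)

# Toy texts: one or two lines each, standing in for full novels
toy_texts <- list(
  "Austen_Pride.txt"     = c("It is a truth universally acknowledged,",
                             "that a single man in possession of a good fortune"),
  "CBronte_JaneEyre.txt" = c("There was no possibility of taking a walk that day.")
)
for (name in names(toy_texts)) {
  writeLines(toy_texts[[name]], file.path(demo_dir, name))
}

# List the files and derive the author label from each filename
files  <- list.files(demo_dir, pattern = "\\.txt$")
groups <- sub("_.*$", "", files)
table(groups)
```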
### Step IV: Using the Corpus with Stylo
Now you can go ahead with stylo and try some commands, adapted from those found in [Eder et al](https://journal.r-project.org/archive/2016/RJ-2016-007/RJ-2016-007.pdf).
Such as a Bootstrap Consensus Tree:
```{r}
stylo(corpus.dir = path2, analysis.type = "BCT", mfw.min = 100, mfw.max = 3000, custom.graph.title = "Austen vs Bronte sisters", write.png.file = TRUE, gui = FALSE)
```
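A Bootstrap Consensus Tree repeats the clustering for a series of most-frequent-word (MFW) cutoffs between mfw.min and mfw.max and then combines the resulting trees. The sketch below only lists those cutoffs; it assumes an increment of 100 (stylo exposes this as the mfw.incr argument, so check the value used by your installed version).

```{r}
mfw_min  <- 100
mfw_max  <- 3000
mfw_incr <- 100  # assumed increment; see the mfw.incr argument of stylo()

# The MFW cutoffs at which a separate clustering would be run
mfw_steps <- seq(mfw_min, mfw_max, by = mfw_incr)
length(mfw_steps)
head(mfw_steps)
```

A smaller increment means more iterations and a longer run, so it is worth starting with a coarse grid on a large corpus.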
Or a Principal Component Analysis:
```{r}
stylo(corpus.dir = path2, analysis.type = "PCR", custom.graph.title = "Austen vs. the Bronte sisters", pca.visual.flavour = "loadings", write.png.file = TRUE, gui = FALSE)
```
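To get a feel for what the PCA is doing, here is a minimal sketch with base R's `prcomp()` on a tiny made-up table of word frequencies. The numbers and text labels are invented for illustration, and this is not how stylo computes its result internally; `scale. = TRUE` standardizes the columns, which corresponds to a PCA on the correlation matrix (our understanding of the "PCR" option).

```{r}
# Invented relative frequencies of three words in four texts
freqs <- matrix(
  c(5.1, 3.0, 2.2,
    4.9, 3.2, 2.0,
    3.0, 4.8, 2.9,
    3.2, 5.0, 3.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Austen_A", "Austen_B", "CBronte_A", "CBronte_B"),
    c("the", "and", "of")
  )
)

# Center and scale each word column, then compute the principal components
pca <- prcomp(freqs, center = TRUE, scale. = TRUE)

# Coordinates of each text on the first two components
round(pca$x[, 1:2], 2)

# Share of the variance captured by each component
summary(pca)$importance["Proportion of Variance", ]
```

In this toy table the two invented authors differ on all three words, so they fall on opposite sides of the first component.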
Enjoy!