---
title: "Data Preparation & Cleaning"
output: html_notebook
---
```{r library, warning=F, message=F}
library(readxl)
library(reshape2)
library(dplyr)
library(magrittr)
library(tm)
library(keras)
library(caret)
library(SnowballC)
```
## 1. Loading The Data
```{r load-dataset, warning=F, message=F, echo = T, results = 'hide'}
# Encode specific kinds of values as NA while reading excel
# Treat all columns as character
non_bill_df <- read_excel("data/december_non-bill_calls.xlsx", na = c("", "---"), col_types = "text")
billed_df <- read_excel("data/december_billed_calls.xlsx", na = c("", "---"), col_types = "text")
```
We will combine `non_bill_df` and `billed_df` into a dataframe called `billing_df`.
```{r}
billing_df <- bind_rows(non_bill_df, billed_df)
glimpse(billing_df)
```
Based on initial discussions and research into the meaning of some of the features in this dataset, we have categorized the following features as __not important__ and will remove them.
```{r, echo=FALSE}
features_to_rm <- c(
"SR Number",
"SR Address Line 1",
"SR State",
"SR City",
"SR Site",
"SR Status",
"SR Contact Date",
"SR Coverage Hours...11",
"SR Coverage Hours...28",
"Activity Status",
"Charges Status",
"Br Area Desc",
"Base Call YN",
"Activity Facts Call Num",
"Activity Completed Date",
"Item Desc",
"SR Serial Number"
)
features_to_rm
```
> The features `SR Coverage Hours...11` and `SR Coverage Hours...28` were created by `readxl` because the Excel file contained two columns named `SR Coverage Hours`; duplicate names are repaired with a `...<position>` suffix.
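This name repair can be sketched with the `vctrs` package, which `readxl` and `tibble` delegate name repair to (the column names below are illustrative, not from the dataset):

```{r name-repair-example}
# Duplicated names get a "...<position>" suffix under repair = "unique"
vctrs::vec_as_names(
  c("SR Coverage Hours", "SR Coverage Hours"),
  repair = "unique"
)
```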
These features are stored in a variable called `features_to_rm`; the next step is to remove these `r length(features_to_rm)` features from the `billing_df` dataset. This step reduces our number of features from __29 to 12__.
```{r}
billing_df <- billing_df %>% select(-all_of(features_to_rm))
glimpse(billing_df)
```
## 2. Cleaning The Data
### 2.1 Encoding The Variables as Factors
We can see that R has miscategorized some of the features in our dataset. Certain features should be read as categorical, such as:
```{r echo=FALSE}
char_to_factors <- c(
"Invoiced (Y/N)",
"Activity Type",
"Activity Trouble Code",
"Coverage Type",
"SR Type",
"SR Device",
"SR Owner (Q#)",
"Br Branch Desc",
"Br Region Desc",
"Cash Vendor & Consumable Contracts"
)
char_to_factors
```
Let's encode the features in `char_to_factors` as factors:
```{r}
billing_df <- billing_df %>% mutate_at(char_to_factors, factor)
glimpse(billing_df)
```
### 2.2 Preprocessing Free Form Text
The free-form text features in our dataset are `Billing Notes` and `Call Text`.
Below is a preview of `Call Text`
```{r}
billing_df$`Call Text` %>% head(3)
```
Below is a preview of `Billing Notes`
```{r}
billing_df$`Billing Notes` %>% extract(c(3, 5, 1))
```
```{r}
call_text <- use_series(billing_df, `Call Text`)
billing_notes <- use_series(billing_df, `Billing Notes`)
```
```{r}
call_text_corpus <- VCorpus(VectorSource(call_text), readerControl = list(language = "en"))
bill_notes_corpus <- VCorpus(VectorSource(billing_notes), readerControl = list(language = "en"))
```
```{r}
call_text_corpus %>% extract(1:3) %>% inspect()
```
```{r}
call_text_corpus %>% head(3) %>% lapply(function (doc) doc$content)
```
To clean our dataset we will have to:

* Convert the text to lower case, so that words like "write" and "Write" are treated as the same word
* Remove numbers
* Remove English stopwords, e.g. "the", "is", "of", etc.
* Remove punctuation, e.g. ",", "?", etc.
* Eliminate extra whitespace
* Stem the text

Using the `tm` package, we will apply these transformations to each text document in `call_text_corpus` and `bill_notes_corpus`.
```{r}
replace_asterix <- function(document) {
gsub(pattern = "\\*", replacement = " ", document)
}
add_space_period <- function(document) {
gsub(pattern = "\\.", replacement = ". ", document)
}
remove_single_chars <- function(document) {
gsub(pattern = "\\s[a-z]\\s", replacement = " ", document)
}
clean_up <- function(corpus) {
corpus %>%
# Convert the text to lower case
tm_map(content_transformer(tolower)) %>%
# Replace asterisks "*" with a space
tm_map(content_transformer(replace_asterix)) %>%
# Add a space after a period
tm_map(content_transformer(add_space_period)) %>%
# Remove numbers
tm_map(removeNumbers) %>%
# Remove common English stopwords
tm_map(removeWords, stopwords("english")) %>%
# Remove words related to time
tm_map(removeWords, c("pm", "am", "edt")) %>%
# Remove punctuation
tm_map(removePunctuation) %>%
# Remove orphaned letters
tm_map(content_transformer(remove_single_chars)) %>%
# Eliminate extra white spaces
tm_map(stripWhitespace) %>%
# strip trailing and leading whitespace
tm_map(content_transformer(trimws)) %>%
# Stem words
tm_map(stemDocument)
}
call_text_cleaned <- clean_up(call_text_corpus)
bill_notes_cleaned <- clean_up(bill_notes_corpus)
```
```{r}
call_text_cleaned %>% lapply(function (doc) doc$content) %>% extract(1:5)
```
```{r}
bill_notes_cleaned %>% lapply(function (doc) doc$content) %>% extract(1:5)
```
```{r}
billing_df$`Call Text` <- call_text_cleaned %>% sapply(function (doc) doc$content)
billing_df$`Billing Notes` <- bill_notes_cleaned %>% sapply(function (doc) doc$content)
```
## 3. Tokenization & Encoding
```{r constants}
CONSTANTS <- list(
# We will only consider the top MAX_WORDS in the dataset
MAX_WORDS = 20000,
# We will cut text after MAX_LEN
MAX_LEN = 200,
BATCH_SIZE_GPU = 256,
BATCH_SIZE_CPU = 128
)
```
### 3.1 Encoding Categorical data
We will start off by encoding the labels of `Invoiced (Y/N)` as a numeric 0/1 matrix:
```{r}
invoiced <- billing_df %>%
use_series("Invoiced (Y/N)") %>%
as.numeric() %>%
subtract(1) %>%
as.matrix()
cat('Shape of label tensor:', dim(invoiced))
```
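As a minimal base-R illustration of the encoding above, with toy labels standing in for `Invoiced (Y/N)`: factor levels sort alphabetically, so `"N"` maps to 1 and `"Y"` to 2, and subtracting 1 yields 0/1 labels.

```{r label-encoding-example}
toy <- factor(c("Y", "N", "Y"))
as.numeric(toy) - 1  # "N" -> 0, "Y" -> 1
# 1 0 1
```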
#### 3.1.1 One Hot Encoding The Categorical Variables
```{r}
categorical_vars <- billing_df %>%
select(all_of(char_to_factors[-1])) %>%
# Treat NAs as a factor
mutate_all(addNA)
glimpse(categorical_vars)
```
```{r one-hot-encoding}
dummy_model <- dummyVars(" ~ .", data = categorical_vars, fullRank = T)
auxillaries <- data.matrix(predict(dummy_model, newdata = categorical_vars))
cat('Shape of auxillary tensor:', dim(auxillaries), "\n")
```
### 3.2 Tokenizing Free Form Text
We will tokenize the free-form text features `Call Text` and `Billing Notes` separately.
#### 3.2.1 Call Text
We will start out by tokenizing `Call Text`:
A `tokenizer` object is created and configured to take only the `MAX_WORDS` most common words into account, then fit on the texts to build the word index. We then turn the texts into lists of integer indices.
```{r}
call_text_df <- billing_df %>% select(c("Call Text"))
call_text_tokenizer <- text_tokenizer(num_words = CONSTANTS$MAX_WORDS) %>%
fit_text_tokenizer(call_text_df$`Call Text`)
call_text_sequences <- texts_to_sequences(call_text_tokenizer, call_text_df$`Call Text`)
cat("Found", length(call_text_tokenizer$word_index), "unique tokens.\n")
```
```{r}
call_text_data <- pad_sequences(call_text_sequences, maxlen = CONSTANTS$MAX_LEN)
cat("Shape of data tensor:", dim(call_text_data), "\n")
```
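With its defaults, `pad_sequences` prepends zeros to sequences shorter than `maxlen` and truncates longer ones from the front. A toy sketch (assuming keras is loaded, as above):

```{r pad-sequences-example}
# Two toy sequences padded to length 4; zeros are prepended by default
pad_sequences(list(c(1, 2, 3), c(4, 5)), maxlen = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    0    1    2    3
# [2,]    0    0    4    5
```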
#### 3.2.2 Billing Notes
We then tokenize `Billing Notes`:
```{r}
billing_notes_df <- billing_df %>% select("Billing Notes")
billing_notes_tokenizer <- text_tokenizer(num_words = CONSTANTS$MAX_WORDS) %>%
fit_text_tokenizer(billing_notes_df$`Billing Notes`)
billing_notes_sequences <- texts_to_sequences(
billing_notes_tokenizer,
billing_notes_df$`Billing Notes`
)
cat("Found", length(billing_notes_tokenizer$word_index), "unique tokens.\n")
```
```{r}
billing_notes_data <- pad_sequences(billing_notes_sequences, maxlen = CONSTANTS$MAX_LEN)
cat("Shape of data tensor:", dim(billing_notes_data), "\n")
```
```{r}
save(
CONSTANTS,
billing_df,
call_text_data,
billing_notes_data,
invoiced,
auxillaries,
file="data/data_preparation.RData"
)
```