---
title: "Automated cleaning of the text of EDH inscriptions for text mining purposes"
author:
- Petra Hermankova^[Aarhus University, petra.hermankova@cas.au.dk, https://orcid.org/0000-0002-6349-0540]
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
theme: united
toc: yes
toc_float: true
number_sections: true
toc_depth: 2
df_print: paged
---
```{r setup, echo=TRUE, message=FALSE}
require("knitr")
#opts_knit$set(root.dir = "~/Github/EDCS_ETL/")
library(tidyverse)
library(jsonlite)
#library(leaflet)
```
## Loading data
Load the dataset from Sciencedata.dk. You don't need any credentials; the JSON file lives in a public folder.
Make a list and tibble from the downloaded dataset
```{r}
EDH <- jsonlite::fromJSON("https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/EDH_attrs_cleaned_2023-04-26.json")
```
Or load a local copy of the dataset
```{r}
getwd()
EDH <- jsonlite::fromJSON("../data/large_data/EDH_attrs_cleaned_2023-04-26.json")
```
```{r}
EDH_clean <- EDH
```
# Cleaning epigraphic text for a text mining analysis (word and sentence centered)
*Aim:* The main purpose of this script is to clean large collections of epigraphic texts all at once in order to create cleaned texts ready for text mining analysis. The output clean texts can be used for a) word centered text mining, also known as the tidytext approach (https://www.tidytextmining.com/) or for b) sentence centered text mining, as part of the Natural Language Processing (https://en.wikipedia.org/wiki/Natural_language_processing).
The presented cleaning process is designed to be generic, fairly modular, and fully customisable. Therefore, with some modification, it can be applied to any epigraphic corpus. Ample examples are provided to illustrate the individual parts of the process, so anyone familiar with _Regular Expressions_ and with a _basic understanding of R_ can build their own cleaning function or modify the existing ones.
The final output of the cleaning function depends on which of the individual cleaning blocks are used and in what sequence they run. Each individual cleaning block represents one pattern occurring repeatedly in the text that can be searched for and modified, depending on the intended outcome. All the cleaning steps depend on the characteristics of the original dataset; therefore, familiarity with the original dataset prior to the cleaning process is recommended. Each dataset can have a different set of symbols and characters to be cleaned, so the cleaning blocks should be adjusted accordingly.
I have created three categories of cleaning blocks, closely linked with the methodological approach and the purpose of the cleaning process:
1. `Conservative cleaning model` producing a text as close to the original as possible
2. `Interpretive cleaning model` producing a text enriched with interpretations of the corpus editor
3. Generic cleaning of patterns common for both previous categories
_Structure of a cleaning block:_
Each of the cleaning blocks maintains the same structure, using Regular expressions to find and replace the searched term or pattern.
```regexpatternname <- c("regexpattern", "substitutionpattern")```
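Each block is applied with `gsub()`. A minimal sketch of the mechanism follows; the `demo_block` name and the sample string are hypothetical (the pattern mirrors the expanded-abbreviations block defined below):

```{r}
# a cleaning block: element [1] is the regex, element [2] the replacement
demo_block <- c("\\([^(]*\\)", "")

# apply it to a sample string; perl = TRUE enables PCRE syntax
gsub(pattern = demo_block[1], replacement = demo_block[2],
     x = "Aur(elius) Valerius", perl = TRUE)
```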
## 1. Cleaning blocks for the conservative model
*The aim of this model is to produce a clean text that is as close to the original text of an inscription as possible, without any editorial input.*
The cleaned output of the conservative model will be as close to the original text of an inscription as possible. In most cases it should resemble a _diplomatic edition_ of the epigraphic text, with spaces between words, lowercase letters, and eliminated brackets and non-UTF-compliant symbols. Interpretive restorations, substitutions, or any other changes of the text that appear in the dataset, made by the editor of the epigraphic corpus, are eliminated from the conservative model.
### 1.1. Expanded abbreviations
**Aim:** All expanded abbreviations that are in the parentheses () will be eliminated from the clean text (substituted with "").
* Example before cleaning: ```Αὐρ(ήλιος) Οὐαλέριος```
* Example after cleaning: ```Αὐρ Οὐαλέριος```
```{r}
expanded_abbreviations_conservative <- c("\\([^(]*\\)", "")
```
### 1.2. Suppression of a text with superscripts
**Aim:** All suppressions that are in the curly braces {} followed by one or more superscript digits will be eliminated from the clean text (substituted with "").
**!!!** It is crucial that the block `suppresion_conservative` does not precede the block `suppresion_superscripts_conservative`, otherwise the Regex pattern would not clean the text properly. This particular pattern is common for the PHI dataset and may or may not appear in other datasets.
* Example before cleaning: ```ἱερεὺς ληφθὶς ὑπὰ {²⁶ὑπὸ}²⁶ τῶν βαρβάρων ```
* Example after cleaning: ```ἱερεὺς ληφθὶς ὑπὰ τῶν βαρβάρων ```
```{r}
suppresion_superscripts_conservative <- c("\\{[^}]*\\}[⁰¹²³⁴⁵⁶⁷⁸⁹]+", "")
```
### 1.3. Suppression of a text
**Aim:** All curly braces {} will be eliminated from the clean text (substituted with ""), while the contents of the braces will remain in the text.
**!!!** It is crucial that the block `suppresion_conservative` does not precede the block `suppresion_superscripts_conservative`, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```Σεβαστοῦ υἱοῦ {θ̣εοῦ Σεβαστοῦ} τύχης ```
* Example after cleaning: ```Σεβαστοῦ υἱοῦ θ̣εοῦ Σεβαστοῦ τύχης ```
```{r}
suppresion_conservative <- c("[{}]", "")
```
### 1.4. Restoration
**Aim:** All restorations that are in the square brackets [] will be eliminated from the clean text (substituted with "").
**!!!** Beware that by eliminating the contents of the brackets you may lose some context - use at your own discretion.
* Example before cleaning: ```[Ν]ανα Ἕλληνο̣[ς] θυγάτηρ καὶ ἡ ἑτέρα [γυνὴ]```
* Example after cleaning: ```ανα Ἕλληνο θυγάτηρ καὶ ἡ ἑτέρα```
```{r}
restoration_conservative <- c("\\[[^[]*\\]", "")
```
### 1.5. Substitution
**Aim:** All substitutions that are in the angular brackets <> will be eliminated from the clean text (substituted with "").
**!!!** Beware that by eliminating the contents of the brackets you may lose some context - use at your own discretion.
* Example before cleaning: ```κωρο<ν Ἀ>ντιόχ<ου> ἡ πατρὶς τειμῆ<ς>```
* Example after cleaning: ```κωρο ντιόχ ἡ πατρὶς τειμῆς```
```{r}
substitution_conservative <- c("\\<[^<]*\\>", "")
```
### 1.6. Substitution in the EDH dataset
**Aim:** All substitutions following the pattern "A=B" will be cleaned in the following way: "B" will remain in the text, while the equals sign and "A" will be eliminated from the clean text.
**!!!** Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion. The `substitution_edh_conservative` block should be run before the `substitution_conservative` block, otherwise the Regex pattern would not clean the text properly. The `substitution_conservative` block will clean the angular brackets in the next step.
* Example before cleaning: ```pos<u=I>erunt bene merenti```
* Example after cleaning: ```pos<I>erunt bene merenti```
```{r}
substitution_edh_conservative <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\2")
```
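The required ordering can be sketched as follows (patterns copied from the blocks in this section, with `substitution_conservative` written without the redundant escapes, since `<` and `>` are literal in PCRE either way; the sample string comes from the example above). Note that this is exactly what the conservative pipeline below does: after the brackets are removed, the final conservative text reads `poserunt`:

```{r}
substitution_edh_conservative <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\2")
substitution_conservative     <- c("<[^<]*>", "")

x <- "pos<u=I>erunt bene merenti"
# step 1: resolve the EDH "A=B" pattern, keeping "B"
x <- gsub(substitution_edh_conservative[1], substitution_edh_conservative[2], x, perl = TRUE)
# step 2: drop the angular brackets and their contents
x <- gsub(substitution_conservative[1], substitution_conservative[2], x, perl = TRUE)
x
```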
## 2. Cleaning blocks for the interpretive model
*The aim of this model is to produce a clean text that is enriched with interpretations of the original text as published by the editor of the corpus. The editorial interpretations include abbreviations, restorations, substitutions and suppressions of the text.*
The output of the interpretive model will be an epigraphic text with as many editorial suggestions, restorations, corrections, and improvements as possible, in order to provide as much of the content of the inscription as possible. The brackets and non-UTF-compliant symbols will be eliminated in the `interpretive model`.
### 2.1. Expanded abbreviations
**Aim:** All parentheses () will be eliminated from the clean text (substituted with ""), while the contents of the parentheses will remain in the text.
* Example before cleaning: ```Αὐρ(ήλιος) Οὐαλέριος```
* Example after cleaning: ```Αὐρήλιος Οὐαλέριος```
```{r}
expanded_abbreviations_interpretive <- c("[()]", "")
```
### 2.2. Suppression of a text with superscripts
**Aim:** The word immediately preceding curly braces {} that are followed by one or more superscript digits will be replaced by the word contained in the braces, and the braces themselves will be eliminated; see the example. Note: The cleaning block will not work if there is no text preceding the curly braces (the pattern will be skipped).
**!!!** This particular pattern is common for the PHI dataset and may or may not appear in other datasets. It is recommended to run the ```suppresion_keep_interpretive``` or ```suppresion_remove_interpretive``` block after the ```suppresion_superscripts_interpretive``` block, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```ἱερεὺς ληφθὶς ὑπὰ {²⁶ὑπὸ}²⁶ τῶν βαρβάρων ```
* Example after cleaning: ```ἱερεὺς ληφθὶς ὑπὸ τῶν βαρβάρων ```
```{r}
suppresion_superscripts_interpretive <- c(" [^ ]+ \\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([^}]+)\\}\\1", " \\2")
```
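A quick check of the backreference logic: the first capture group grabs the superscript digits inside the braces, and the `\1` backreference insists that the same digits close the group. The sample string below is hypothetical (Latin-script, but following the PHI superscript convention):

```{r}
suppresion_superscripts_interpretive <- c(" [^ ]+ \\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([^}]+)\\}\\1", " \\2")

# the word before the braces ("verba") is replaced by the editor's correction ("uerba")
y <- gsub(suppresion_superscripts_interpretive[1],
          suppresion_superscripts_interpretive[2],
          "lapsus verba {²⁶uerba}²⁶ cetera", perl = TRUE)
y
```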
### 2.3. Suppression of a text
**Aim:** All curly braces {} will be eliminated from the clean text (substituted with ""), while the contents of the braces will remain in the text.
**!!!** It is crucial that the ```suppresion_keep_interpretive``` or ```suppresion_remove_interpretive``` block does not precede the ```suppresion_superscripts_interpretive``` block, otherwise the Regex pattern would not clean the text properly. Because editors of epigraphic corpora use {} ambiguously, the exact usage depends on the specific dataset and the way the curly braces were used. Therefore, two options for handling curly braces are provided: if you wish to keep the text within the curly braces and remove only the braces, use the ```suppresion_keep_interpretive``` block; if you wish to remove both the text in the braces and the braces, use the ```suppresion_remove_interpretive``` block.
* Example before cleaning: ```θ̣εοῦ Σεβαστοῦ υἱοῦ {θ̣εοῦ Σεβαστοῦ} τύχης ```
* Example after cleaning (keep text): ```θ̣εοῦ Σεβαστοῦ υἱοῦ θ̣εοῦ Σεβαστοῦ τύχης```
* Example after cleaning (remove text): ```θ̣εοῦ Σεβαστοῦ υἱοῦ τύχης```
```{r}
suppresion_keep_interpretive <- c("[{}]", "")
```
OR if you wish to remove the contents of the braces
```{r}
suppresion_remove_interpretive <- c("\\{[^}]*\\}", "")
```
### 2.4. Restoration
**Aim:** All square brackets [] will be eliminated from the clean text (substituted with ""), while the contents of the brackets will remain in the text.
**!!!** Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion.
* Example before cleaning: ```[Ν]ανα Ἕλληνο̣[ς] θυγάτηρ καὶ ἡ ἑτέρα [γυνὴ]```
* Example after cleaning: ```Νανα Ἕλληνο̣ς θυγάτηρ καὶ ἡ ἑτέρα γυνὴ```
```{r}
restoration_interpretive <- c("[\\[\\]]", "")
```
### 2.5. Substitution
**Aim:** All angular brackets <> will be eliminated from the clean text (substituted with ""), while the contents of the brackets will remain in the text.
**!!!** Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion.
* Example before cleaning: ```κωρο<ν Ἀ>ντιόχ<ου> ἡ πατρὶς τειμῆ<ς>```
* Example after cleaning: ```κωρον Ἀντιόχου ἡ πατρὶς τειμῆς```
```{r}
substitution_interpretive <- c("[<>]", "")
```
### 2.6. Substitution in the EDH dataset
**Aim:** All substitutions following the pattern "A=B" will be cleaned in the following way: "A" will remain in the text, while the equals sign and "B" will be eliminated from the clean text.
**!!!** The `substitution_edh_interpretive` should be run before `substitution_interpretive` block, otherwise the Regex pattern would not clean the text properly. The `substitution_interpretive` block will clean the angular brackets in the next step.
* Example before cleaning: ```pos<u=I>erunt bene merenti```
* Example after cleaning: ```pos<u>erunt bene merenti```
```{r}
substitution_edh_interpretive <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\1")
```
## 3. The generic text cleaning
*The aim of the generic cleaning is to strip the epigraphic text of any non-UTF-compliant symbols and characters that do not adhere to the principles of quantitative text mining.*
The cleaning blocks in this section represent common patterns appearing in any epigraphic text, such as interpunction, lacunae or other representations of an empty space, various editorial notes and comments in the text itself that are not relevant to the text mining, erasures, numerals, and several specific Unicode symbols appearing in the original text. Depending on the characteristics of the original dataset and the intended outcome, anyone can change individual cleaning blocks to better fit their needs. Thorough testing is, however, strongly recommended!
### 3.1. Lacuna 1
**Aim:** All square brackets [] containing one or more "— " will be eliminated from the clean text (substituted with "").
**!!!** The block ```lacuna1``` should be run before ```restoration_conservative``` and ```restoration_interpretive``` blocks, otherwise the Regex pattern would not clean the text properly. Note: If there is a text within the square bracket, e.g. ```προύχον[τος — — —]``` the block ```lacuna1``` will skip the pattern. However, the block ```restoration_interpretive``` will eliminate the square brackets, the script ```interpunction_symbols``` will clean the "—" and the script ```multi_whitespace``` will eliminate the extra whitespaces. Therefore the blocks should be used in combination and in the indicated sequence: (1)```restoration_interpretive```, (2)```interpunction_symbols``` and (3)```multi_whitespace```.
* Example before cleaning: ```[— — —]ης θεῷ Φοίβῳ```
* Example after cleaning: ```ης θεῷ Φοίβῳ```
```{r}
lacuna1 <- c("\\[[— ]+\\]", "")
```
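The sequence described in the note can be sketched as follows. The Latin fragment is hypothetical; only the dash part of ```interpunction_symbols``` is used here, and ```restoration_interpretive``` is written without the redundant escapes:

```{r}
restoration_interpretive <- c("[\\[\\]]", "")  # drop the brackets, keep contents
interpunction_dashes     <- c("—", " ")        # dash subset of interpunction_symbols
multi_whitespace         <- c("\\s+", " ")

x <- "provincia[e — — —] cetera"   # lacuna1 skips this: the brackets contain text
x <- gsub(restoration_interpretive[1], restoration_interpretive[2], x, perl = TRUE)
x <- gsub(interpunction_dashes[1], interpunction_dashes[2], x, perl = TRUE)
x <- gsub(multi_whitespace[1], multi_whitespace[2], x, perl = TRUE)
x
```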
### 3.2. Lacuna 2
**Aim:** All square brackets [] containing one or more "." will be eliminated from the clean text (substituted with "").
**!!!** The block ```lacuna2``` should be run before ```restoration_conservative``` and ```restoration_interpretive``` blocks, otherwise the Regex pattern would not clean the text properly. Note: If there is a text within the square bracket, e.g. ```προύχον[τος...]``` the block ```lacuna2``` will skip the pattern. However, the block ```restoration_interpretive``` will eliminate the square brackets, the script ```interpunction_symbols``` will clean the "." and the script ```multi_whitespace``` will eliminate the extra whitespaces. Therefore the blocks should be used in combination and in the indicated sequence: (1)```restoration_interpretive```, (2)```interpunction_symbols``` and (3)```multi_whitespace```.
* Example before cleaning: ```[․․]ω Διὶ καὶ Ἥρᾳ```
* Example after cleaning: ```ω Διὶ καὶ Ἥρᾳ```
```{r}
lacuna2 <- c("\\[[․]+\\]", "")
```
### 3.3. Vacat
**Aim:** All instances of the following strings "vacat, vac, vac., v." will be replaced by a space (substituted with " "). Any extra whitespace will be cleaned by the ```multi_whitespace``` block in the following steps.
**!!!** If your dataset contains Latin inscriptions, you may want to check whether the ```vacat``` block is eliminating more words than anticipated, e.g. words containing the string "vacat" or "vac". If so, adjust the cleaning block accordingly, i.e. remove "vac", or don't use it.
* Example before cleaning: ```Ἡρακλείδα vacat χαῖρε.```
* Example after cleaning: ```Ἡρακλείδα χαῖρε.```
```{r}
vacat <- c("(vacat|vac\\.|vac|v\\.)", " ")
```
### 3.4. Editorial notes
**Aim:** All instances of the editorial notes in parenthesis such as (vel sim.) will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by ```multi_whitespace``` block in the following steps.
**!!!** The ```editorial_notes``` block should run before the ```expanded_abbreviations_conservative``` and ```expanded_abbreviations_interpretive``` blocks, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```Ἥρωι (vel sim.) Καλλισθένης```
* Example after cleaning: ```Ἥρωι Καλλισθένης```
```{r}
editorial_notes <- c("\\(vel sim\\.\\)", " ")
```
### 3.5. New line
**Aim:** All instances of in-line symbol for new line (|) will be eliminated (substituted with "").
* Example before cleaning: ```Λάμπρη Τ̣ελεσήνορ|ος γυνή.```
* Example after cleaning: ```Λάμπρη Τ̣ελεσήνορος γυνή.```
```{r}
new_line <- c("[|/]", "")
```
### 3.6. Split word over two lines
**Aim:** All instances of words split between two lines with a dash (-) will be eliminated (substituted with "").
* Example before cleaning: ```ἀρχιερέως καὶ εὐποσιάρ-\nχου μηνὸς```
* Example after cleaning: ```ἀρχιερέως καὶ εὐποσιάρχου μηνὸς```
```{r}
split_word_multiline <- c("-\\n", "")
```
### 3.7. Erasure empty
**Aim:** All instances of erased text (〚—〛) will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by ```multi_whitespace``` block in the following steps.
* Example before cleaning: ```Ἀρτέμιδι 〚— — —〛 ἐπηκόοις.```
* Example after cleaning: ```Ἀρτέμιδι ἐπηκόοις.```
```{r}
erasure_empty <- c("〚[— ]+〛", " ")
```
### 3.8. Erasure with new text
**Aim:** All instances of double brackets for erasures (〚 〛) will be eliminated (substituted with "") and the contents of the double brackets will be preserved as part of the clean text.
* Example before cleaning: ```Ἀμύντωρ Νουμηνίου 〚χαῖρε〛. καὶ ἡ γυνὴ αὐτοῦ```
* Example after cleaning: ```Ἀμύντωρ Νουμηνίου χαῖρε. καὶ ἡ γυνὴ αὐτοῦ```
```{r}
erasure_new_text <- c("[〚〛]", "")
```
### 3.9. Dubious dot subscript
**Aim:** All instances of the dubious reading marked by the subscript dot (Unicode 0323) will be eliminated (substituted with "").
**!!!** The ```dubious_dot_subscript``` block should run as the first step of the cleaning, otherwise the letters might shift and the Regex pattern would not clean the text properly.
* Example before cleaning: ``` Ἀ̣πό̣λ̣λ̣ωνος```
* Example after cleaning: ``` Ἀπόλλωνος```
```{r}
dubious_dot_subscript <- c("\u{0323}", "")
```
### 3.10. Interpunction symbols
**Aim:** All instances of listed interpunction symbols (,.!-—#%^&\*/~:;) will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by ```multi_whitespace``` block in the following steps.
**!!!** If you wish to keep sentence separators, such as dots at the bottom of the line, use ```interpunction_keep_sentences```, or eliminate the sentence separators you want to keep in your text from the cleaning block ```interpunction_keep_sentences```.
* Example before cleaning: ```Φιλήτη # θεᾷ Μαλοφόρῳ``` or ```κεῖμαι πρόμοιρος Ἑρμογένης τυμβευθείς. /ἀγὼν```
* Example after cleaning: ```Φιλήτη θεᾷ Μαλοφόρῳ``` or ```κεῖμαι πρόμοιρος Ἑρμογένης τυμβευθείς ἀγὼν```
* Example after cleaning (keep sentences): ```κεῖμαι πρόμοιρος Ἑρμογένης τυμβευθείς. ἀγὼν```
```{r}
interpunction_symbols <- c("[,.․:⋮⁙;!#%^&*~@—–-]", " ")
```
OR
if you wish to preserve sentence separators, such as dots
```{r}
interpunction_keep_sentences <- c("[!#%^&*~@—–-]", " ")
```
### 3.11. Superscript numbers
**Aim:** All instances of superscripted numbers will be eliminated (substituted with "").
**!!!** The ```superscript_numbers``` should not be run before the ```suppresion_superscripts_conservative``` or ```suppresion_superscripts_interpretive``` block, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```Αὐρ(ήλιος) Διονύσιος #⁵⁶ βʹ #⁵⁶```
* Example after cleaning: ```Αὐρ(ήλιος) Διονύσιος # βʹ #```
```{r}
superscript_numbers <- c("[⁰¹²³⁴⁵⁶⁷⁸⁹]+", "")
```
### 3.12. Epigraphic symbols
**Aim:** All instances of the listed specialised epigraphic symbols, such as the haedera (❦), will be eliminated (substituted with "").
* Example before cleaning: ```ἀγαθῆι ❦ τύχηι```
* Example after cleaning: ```ἀγαθῆι τύχηι```
```{r}
epigraphic_symbols <- c("[❦·∙𐆖⏑⏓⏕]", "")
```
### 3.13. Uncertainty symbols
**Aim:** All instances of the listed symbols marking uncertainty (?) will be eliminated (substituted with "").
* Example before cleaning: ```χαῖρε?```
* Example after cleaning: ```χαῖρε```
```{r}
uncertainty_symbols <- c("\\?", "")
```
### 3.14. End of line
**Aim:** All instances of the end-of-line symbol (\n) will be replaced by a space (substituted with " ").
* Example before cleaning: ```καὶ ἄρξαντα\nτοῦ κοινοῦ```
* Example after cleaning: ```καὶ ἄρξαντα τοῦ κοινοῦ```
```{r}
end_line <- c("\\n", " ")
```
### 3.15. Extra blank space
**Aim:** All instances of extra blank space (" ") will be replaced by a single space (substituted with " ").
* Example before cleaning: ```ἀγαθῆι τύχηι.```
* Example after cleaning: ```ἀγαθῆι τύχηι.```
```{r}
extra_blank <- c("[ ]+", " ")
```
### 3.16. Multi-whitespace
**Aim:** All instances of more than one consecutive whitespace character will be replaced by a single space (substituted with " ").
**!!!** The ```multi_whitespace``` block should run as the second-to-last cleaning block to ensure all redundant whitespace is cleaned from the text.
* Example before cleaning: ```Ἡρακλείδα χαῖρε.```
* Example after cleaning: ```Ἡρακλείδα χαῖρε.```
```{r}
multi_whitespace <- c("\\s+", " ")
```
### 3.17. Trailing and leading whitespace
**Aim:** All instances of whitespace " " at the beginning and end of the line will be eliminated (substituted with "").
**!!!** The ```whitespace_endline``` should run as the last cleaning block to ensure all redundant white spaces are cleaned from the text.
* Example before cleaning: ``` χαῖρε ```
* Example after cleaning: ```χαῖρε```
```{r}
whitespace_endline <- c("(^\\s|\\s$)", "")
```
### 3.18. Editorial comments in Latin alphabet
**Aim:** All instances of editorial comments in Latin alphabet that are enclosed in curly braces {} with superscript numbers will be eliminated (substituted with "").
**!!!** If your dataset contains Latin inscriptions, use this block with caution. Verify first that running the block does not eliminate any necessary information or text. This block has been specifically designed for the interpretive cleaning of the PHI Greek Inscription dataset and it should run before the ```suppresion_superscripts_interpretive``` and ```suppresion_keep_interpretive```/```suppresion_remove_interpretive``` blocks, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```ἀγαθῆι τύχηι. {²in parte inferiore altera manu incisa est:}² ὑπὲρ τῆς τοῦ```
* Example after cleaning: ```ἀγαθῆι τύχηι. ὑπὲρ τῆς τοῦ```
```{r}
editorial_comments_latin <- c("\\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([a-zA-Z0-9][^}]+)\\}\\1", "")
```
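A short check of the backreference pattern, chained with ```multi_whitespace``` to mop up the leftover double space (the sample string is hypothetical):

```{r}
editorial_comments_latin <- c("\\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([a-zA-Z0-9][^}]+)\\}\\1", "")
multi_whitespace         <- c("\\s+", " ")

z <- "salve. {²altera manu:}² vale"
z <- gsub(editorial_comments_latin[1], editorial_comments_latin[2], z, perl = TRUE)
z <- gsub(multi_whitespace[1], multi_whitespace[2], z, perl = TRUE)
z
```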
### 3.19. Arabic numerals
**Aim:** All instances of Arabic numerals (0-9) will be eliminated (substituted with "").
**!!!** If your dataset contains Arabic numerals that you would like to keep, use this block with caution. Verify first that running the block does not eliminate any necessary information or text. This block has been specifically designed for the interpretive cleaning of the PHI Greek Inscription dataset and it should run before the ```multi_whitespace``` and ```whitespace_endline``` blocks, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```ἡ γυνὴ αὐτοῦ ΦιλΙ̣ 4 5 καὶ```
* Example after cleaning: ```ἡ γυνὴ αὐτοῦ ΦιλΙ καὶ```
```{r}
arabic_numerals <- c("[0-9]+", "")
```
### 3.20 Unclosed brackets
**Aim:** All instances of unclosed brackets will be eliminated (substituted with "").
**!!!** Use the `unclosed_brackets` block immediately before ```multi_whitespace``` and ```whitespace_endline``` blocks, otherwise the Regex pattern would not clean the text properly.
* Example before cleaning: ```ummio isenna Xv [```
* Example after cleaning: ```ummio isenna Xv ```
```{r}
unclosed_brackets <- c("[\\[\\]{}()]", "")
```
### 3.21 Latin Enclitics
**Aim:** All instances of the Latin enclitic -que will be separated from the host word by a space.
* Example before cleaning: ```libertatisque```
* Example after cleaning: ```libertatis que```
```{r}
latin_que <- c("(\\w+)(que)\\b", "\\1 \\2")
```
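A quick sketch on a hypothetical sample string. Note that the pattern also splits words that merely end in "que", such as "atque" or "quoque", so inspect the output if such words matter for your analysis:

```{r}
latin_que <- c("(\\w+)(que)\\b", "\\1 \\2")

gsub(latin_que[1], latin_que[2], "senatus populusque libertatisque", perl = TRUE)
```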
### 3.22 Roman numerals + vir
**Aim:** All instances of Roman numerals preceding the word 'vir' will be separated from the word by a space.
* Example before cleaning: ```IIIIviribus```
* Example after cleaning: ```IIII viribus```
```{r}
latin_vir <- c("([IVX])(vir)", "\\1 \\2")
```
## Building cleaning functions for specific datasets
Now that we have established the individual building blocks, we can put them together in the right sequence and build cleaning functions in R for the conservative and interpretive models.
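The chaining itself can also be expressed compactly by keeping the blocks in an ordered list and folding `gsub()` over it with `Reduce()`. This is only a sketch of the idea (`apply_blocks` and `mini_pipeline` are hypothetical names, and the mini-pipeline reuses three blocks defined above); the explicit functions written out step by step in this section are easier to reorder and debug:

```{r}
apply_blocks <- function(text, blocks) {
  # fold gsub() over the ordered list of cleaning blocks
  Reduce(function(txt, block) gsub(block[1], block[2], txt, perl = TRUE),
         blocks, init = text)
}

mini_pipeline <- list(
  c("\\([^(]*\\)", ""),    # expanded_abbreviations_conservative
  c("\\s+", " "),          # multi_whitespace
  c("(^\\s|\\s$)", "")     # whitespace_endline
)

apply_blocks("Aur(elius) Valerius ", mini_pipeline)
```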
### Conservative model
*Aim:* to have a clean text that is as close to the original inscription as preserved on the medium - in the case of the EDH dataset, the column `diplomatic_text` should be similar to the output of the `conservative_cleaning` model.
Since the dataset is mostly in Latin, I did not use the following cleaning scripts: `vacat`, `editorial_notes`, `editorial_comments_latin`, since they would eliminate some parts of the text that should not be eliminated. I am not using the `suppresion_superscripts_conservative` script because the structure of the EDH dataset does not contain curly braces followed by superscript numbers. The `unclosed_brackets` script has been added since the EDH dataset contains a lot of unclosed brackets of all kinds. The `substitution_edh_conservative` script was added to clean additional substitution features of the EDH dataset.
```{r}
cleaning_conservative_edh <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_conservative[1], replacement=expanded_abbreviations_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_conservative[1], replacement=suppresion_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_conservative[1], replacement=restoration_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_edh_conservative[1], replacement=substitution_edh_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_conservative[1], replacement=substitution_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_symbols[1], replacement=interpunction_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=latin_que[1], replacement=latin_que[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=latin_vir[1], replacement=latin_vir[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=unclosed_brackets[1], replacement=unclosed_brackets[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
```
#### Example of conservative cleaning:
The `transcription` column of the first five inscriptions before cleaning:
```{r}
print(EDH_clean$transcription[1:5])
```
The `diplomatic_text` column of the first five inscriptions (for comparison with the cleaning output):
```{r}
print(EDH_clean$diplomatic_text[1:5])
```
Output of the ```cleaning_conservative_edh``` function:
```{r}
example_edh <- as.data.frame(cleaning_conservative_edh(EDH_clean$transcription))
example_edh[1:5,]
```
### Interpretive model for 'tidytext' analysis based on the analysis of words
*Aim:* to produce a clean text enriched by editorial interpretations and reconstructions, i.e. to retain as rich a text of each inscription as possible.
Since the dataset is mostly in Latin, I did not use the following cleaning scripts: `vacat`, `editorial_notes`, and `editorial_comments_latin`, since they would eliminate parts of the text that should be kept. I am not using the `suppresion_superscripts_interpretive` script because the structure of the EDH dataset does not contain curly braces followed by superscript numbers. The script `unclosed_brackets` has been added since the EDH dataset contains many unclosed brackets of all kinds. The script `substitution_edh_interpretive` was added to clean additional substitution features of the EDH dataset.
EDH provides its own version of the clean text in the column `text_cleaned`, but without any cleaning script or documentation of the steps that produced it. As a second step I will compare the output of the interpretive cleaning model with the `text_cleaned` version to see which produces better text for text mining.
The output of the function will consist of words separated by a single space, so the data is ready for tidytext analysis. No interpunction will be left in the text.
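All the cleaning blocks rely on the same convention: each rule is a two-element character vector holding a PCRE pattern and its replacement, applied via `gsub(..., perl=TRUE)`. A minimal sketch with an invented rule and invented text (`demo_multi_whitespace` is a hypothetical stand-in; the real `multi_whitespace` definition lives earlier in this notebook and may differ):

```{r}
# Hypothetical rule in the same two-element format as the cleaning blocks:
# element 1 is the PCRE pattern, element 2 the replacement.
demo_multi_whitespace <- c("\\s{2,}", " ")

demo_text <- "Dis    Manibus   sacrum"
cleaned_demo <- gsub(pattern = demo_multi_whitespace[1],
                     replacement = demo_multi_whitespace[2],
                     x = demo_text, perl = TRUE)
cleaned_demo
# "Dis Manibus sacrum"
```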
```{r}
cleaning_interpretive_word_edh <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_interpretive[1], replacement=expanded_abbreviations_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_keep_interpretive[1], replacement=suppresion_keep_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_interpretive[1], replacement=restoration_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_edh_interpretive[1], replacement=substitution_edh_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_interpretive[1], replacement=substitution_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_symbols[1], replacement=interpunction_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=latin_que[1], replacement=latin_que[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=latin_vir[1], replacement=latin_vir[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
```
`Transcription` column of the first five inscriptions before cleaning:
```{r}
print(EDH_clean$transcription[1:5])
```
`Text_edition` column, extracted from the EDH XML files as a clean version of the text, for comparison with the output of the ```cleaning_interpretive_word_edh``` function:
```{r}
print(EDH_clean$text_edition[1:5])
```
Output of the ```cleaning_interpretive_word_edh``` function:
```{r}
example_edh2 <- as.data.frame(cleaning_interpretive_word_edh(EDH_clean$transcription))
example_edh2[1:5,]
```
### Interpretive model for 'tidytext' analysis based on the analysis of sentences
*Aim:* to produce a clean text enriched by editorial interpretations and reconstructions, i.e. to retain as rich a text of each inscription as possible.
Since the dataset is mostly in Latin, I did not use the following cleaning scripts: `vacat`, `editorial_notes`, and `editorial_comments_latin`, since they would eliminate parts of the text that should be kept. I am not using the `suppresion_superscripts_interpretive` script because the structure of the EDH dataset does not contain curly braces followed by superscript numbers. The script `unclosed_brackets` has been added since the EDH dataset contains many unclosed brackets of all kinds. The script `substitution_edh_interpretive` was added to clean additional substitution features of the EDH dataset.
EDH provides its own version of the clean text in the column `text_cleaned`, but without any cleaning script or documentation of the steps that produced it. As a second step I will compare the output of the interpretive cleaning model with the `text_cleaned` version to see which produces better text for text mining.
The output of the function will consist of words separated by a single space, so the data is ready for tidytext analysis. Sentence separators will be left in the text, so individual sentences can be analysed separately. For this reason the block `interpunction_symbols` was replaced by `interpunction_keep_sentences`.
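Keeping the separators makes sentence-level tokenisation straightforward later on. A minimal sketch with invented text, assuming a full stop marks the sentence boundary (the actual separator retained by `interpunction_keep_sentences` may differ):

```{r}
# Split a cleaned text into sentences on full stops (plus trailing whitespace).
# R's strsplit() drops the empty string after a trailing separator.
demo_cleaned <- "Dis Manibus sacrum. Hic situs est."
sentences <- unlist(strsplit(demo_cleaned, "\\.\\s*"))
sentences
# "Dis Manibus sacrum" "Hic situs est"
```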
```{r}
cleaning_interpretive_sentence_edh <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_interpretive[1], replacement=expanded_abbreviations_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_keep_interpretive[1], replacement=suppresion_keep_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_interpretive[1], replacement=restoration_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_edh_interpretive[1], replacement=substitution_edh_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_interpretive[1], replacement=substitution_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_keep_sentences[1], replacement=interpunction_keep_sentences[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=latin_que[1], replacement=latin_que[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=latin_vir[1], replacement=latin_vir[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
```
`Transcription` column of three inscriptions before cleaning:
```{r}
print(EDH_clean$transcription[c(2297, 2444, 3026)])
```
`Text_edition` column, extracted from the EDH XML files as a clean version of the text (for comparison):
```{r}
print(EDH_clean$text_edition[c(2297, 2444, 3026)])
```
Output of the ```cleaning_interpretive_sentence_edh``` function (with interpunction):
```{r}
example_edh3 <- as.data.frame(cleaning_interpretive_sentence_edh(EDH_clean$transcription))
example_edh3[c(2297, 2444, 3026),]
```
### Enriching the full dataset with conservative and interpretive cleaned versions of the text:
```{r}
EDH_clean <- EDH_clean %>%
  mutate(clean_text_conservative = cleaning_conservative_edh(transcription),
         clean_text_interpretive_word = cleaning_interpretive_word_edh(transcription),
         clean_text_interpretive_sentence = cleaning_interpretive_sentence_edh(transcription))
```
```{r}
# if incorrect (for some reason), reverse accordingly...
#not_before <- EDH_clean$not_after
#not_after <- EDH_clean$not_before
#EDH_clean$not_after <- not_after
#EDH_clean$not_before <- not_before
```
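If the bounds do turn out to be reversed, a conditional swap that only touches the affected rows is safer than the blanket exchange sketched above. A sketch on an invented two-row data frame using the same column names as `EDH_clean`:

```{r}
# Swap not_before/not_after only in rows where the bounds are out of order.
toy <- data.frame(not_before = c(-30, 120), not_after = c(14, 70))
flip <- toy$not_before > toy$not_after          # TRUE only for row 2
tmp <- toy$not_before[flip]
toy$not_before[flip] <- toy$not_after[flip]
toy$not_after[flip] <- tmp
toy
# row 2 bounds are now in chronological order (70, 120)
```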
# Saving locally (to large_data)
```{r}
EDH_cleaned_json <- jsonlite::toJSON(EDH_clean, auto_unbox = TRUE)
write(EDH_cleaned_json, file="data/large_data/EDH_text_cleaned_2023-04-26.json")
```
```{r}
user <- readline("your sciencedata username: ")
request("data/large_data/EDH_text_cleaned_2023-04-26.json", path="/sharingout/648597@au.dk/SDAM_root/SDAM_data/EDH/public",
method="PUT", cred=c(user, getPass("your sciencedata password: ")))
```
---
Comparing the text extracted from the XML files (`text_edition`) with the text from the API (`transcription`) that has been cleaned with the cleaning functions above.
```{r}
texts <- EDH_clean %>%
select(text_edition, clean_text_conservative, clean_text_interpretive_word, clean_text_interpretive_sentence, transcription)
```
# Comparing individual outputs qualitatively, side by side
```{r}
textno <- 14
texts$text_edition[textno][1]
texts$clean_text_conservative[textno]
texts$clean_text_interpretive_word[textno]
texts$clean_text_interpretive_sentence[textno]
texts$transcription[textno][1]
```
# Comparing the total number of unique texts
```{r}
length(unique(unlist(texts$text_edition)))
length(unique(texts$clean_text_conservative))
length(unique(texts$clean_text_interpretive_word))
length(unique(texts$clean_text_interpretive_sentence))
length(unique(unlist(texts$transcription)))
```
# Comparing the contents of `text_edition` and `clean_text_interpretive_word`
## How many more unique texts are in `text_edition` than in `clean_text_interpretive_word`?
```{r}
length(unique(unlist(texts$text_edition))) - length(unique(texts$clean_text_interpretive_word))
```
## What do `clean_text_interpretive_word` and `transcription` contain when `text_edition` is "NULL"?
```{r}
texts %>%
select(text_edition, clean_text_interpretive_word, transcription) %>%
filter(text_edition == "NULL") %>%
View()
```
## How many words in total are in `text_edition` and in `clean_text_interpretive_word`, and what is the difference?
```{r}
sum(lengths(gregexpr("\\w+", texts$text_edition)) + 1)
sum(lengths(gregexpr("\\w+", texts$clean_text_interpretive_word)) + 1)
sum(lengths(gregexpr("\\w+", texts$transcription)) + 1)
sum(lengths(gregexpr("\\w+", texts$text_edition)) + 1) - sum(lengths(gregexpr("\\w+", texts$clean_text_interpretive_word)) + 1)
```
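One caveat worth checking before trusting the totals above: `gregexpr()` signals "no match" with `-1`, which `lengths()` still counts as one element, so strings containing no word characters inflate the sum (and the extra `+ 1` per element adds further). A small base-R demonstration:

```{r}
# gregexpr() returns -1 for a string with no word characters,
# so lengths() reports 1 "match" instead of 0.
n_empty <- lengths(gregexpr("\\w+", ""))           # 1, not 0
n_two   <- lengths(gregexpr("\\w+", "duo verba"))  # 2, as expected
# regmatches() handles the no-match case correctly:
n_empty_fixed <- sum(lengths(regmatches("", gregexpr("\\w+", ""))))  # 0
```

This is one reason the `str_count()` totals below can differ from the `gregexpr()` totals.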
## Using a different method of word counting (`stringr`)
```{r}
sum(str_count(texts$text_edition, '\\w+'))
sum(str_count(texts$clean_text_interpretive_word, '\\w+'))
sum(str_count(texts$transcription, '\\w+'))
sum(str_count(texts$text_edition, '\\w+')) - sum(str_count(texts$clean_text_interpretive_word, '\\w+'))
```
# How many texts in `text_edition` contain special characters (that should not be there)?
```{r}
texts$text_edition %>% str_subset(pattern = "\\(") %>% unique() %>% length()
texts$text_edition %>% str_subset(pattern = "\\?") %>% unique() %>% length()
texts$text_edition %>% str_subset(pattern = "\\$") %>% unique() %>% length()
texts$text_edition %>% str_subset(pattern = "\\[") %>% unique() %>% length()
texts$text_edition %>% str_subset(pattern = "\\-") %>% unique() %>% length()
texts$text_edition %>% str_subset(pattern = "\\|") %>% unique() %>% length()
texts$text_edition %>% str_subset(pattern = "\\.") %>% unique() %>% length()
```
# How many texts in `clean_text_interpretive_word` contain special characters (that should not be there)?
```{r}
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\(") %>% unique() %>% length()
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\?") %>% unique() %>% length()
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\$") %>% unique() %>% length()
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\[") %>% unique() %>% length()
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\-") %>% unique() %>% length()
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\|") %>% unique() %>% length()
texts$clean_text_interpretive_word %>% str_subset(pattern = "\\.") %>% unique() %>% length()
```
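The repeated `str_subset()` pipelines above can be condensed into a single pass over a vector of patterns. A base-R sketch on invented data (`count_texts_with` is a hypothetical helper, not part of this notebook):

```{r}
# For each regex pattern, count how many unique texts contain at least one match.
count_texts_with <- function(texts, patterns) {
  sapply(patterns, function(p) length(unique(grep(p, texts, value = TRUE))))
}

demo <- c("pater(?)", "mater", "pater(?)", "[f]ilius")
counts <- count_texts_with(demo, c("\\(", "\\?", "\\["))
counts  # one unique text matches each pattern
```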