Text Analysis

Glossary

Below is a list of key terms and concepts, in alphabetical order, from Michelle A. McSweeney and Rachel Rakov's Text Analysis workshop.

concordance: a view of a word together with the characters on either side of each of its occurrences; an easy way to investigate the context of a certain word across a corpus.
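As a rough illustration of the entry above, the sketch below builds a simple concordance with Python's standard library only. The function name `concordance`, the `width` parameter, and the sample sentence are assumptions made for this example and are not taken from the workshop.

```python
import re

def concordance(text, word, width=20):
    """Return every occurrence of `word` with `width` characters of context on each side."""
    lines = []
    pattern = r"\b" + re.escape(word) + r"\b"
    # Match case-insensitively so "Whale" and "whale" both appear in the view.
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample text, used only for illustration.
sample = "The whale surfaced. A whale is a mammal, and the Whale is the hero."
for line in concordance(sample, "whale", width=15):
    print(line)
```

Each printed line shows one occurrence of the word in brackets, with its surrounding characters on either side.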

corpus: a collection of texts that are somehow related to each other.

lemmatization: one process of collapsing words in an attempt to reduce the number of distinct words and get a realistic understanding of the meaning of a text. Lemmatization looks each word up to find its appropriate root, and can therefore take longer than other processes of collapsing words. See "stemming" in this glossary for another process of collapsing words in a corpus.

lexical density: the number of unique words per total words; a good metric to approximate lexical diversity—the range of vocabulary an author uses.
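A minimal sketch of this ratio, assuming the text has already been split into a list of tokens. The function name `lexical_density` and the sample tokens are illustrative choices, not part of the workshop.

```python
def lexical_density(tokens):
    """Number of unique words divided by the total number of words."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(tokens)) / len(tokens)

# Hypothetical sample: 8 words in total, 5 of them unique.
tokens = "the whale and the sea and the sky".split()
print(lexical_density(tokens))  # 0.625
```

A value closer to 1 means fewer repeated words; a value closer to 0 means the same words are reused often.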

part-of-speech (POS) tagging: a way to identify the grammatical category (noun, verb, adjective, and so on) of each word in a given text.

stemming: one process of collapsing words in an attempt to reduce the number of distinct words and get a realistic understanding of the meaning of a text. Stemming cuts off affixes to find the root (or the stem) of the word. See "lemmatization" in this glossary for another process of collapsing words in a corpus.
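To make the contrast between the two processes concrete, here is a deliberately toy sketch. The suffix list, the lookup table, and the function names `toy_stem` and `toy_lemmatize` are all assumptions made for illustration; real stemmers and lemmatizers (for example, those in NLTK) use far more careful rules and much larger dictionaries.

```python
# Toy suffix list: NOT a real stemming algorithm, just an illustration.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lookup table standing in for the dictionary a real lemmatizer consults.
TOY_LEMMAS = {"whales": "whale", "swimming": "swim", "swam": "swim", "was": "be"}

def toy_lemmatize(word):
    """Look the word up and return its root, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

words = ["whales", "swimming", "swam", "was"]
print([toy_stem(w) for w in words])       # ['whal', 'swimm', 'swam', 'was']
print([toy_lemmatize(w) for w in words])  # ['whale', 'swim', 'swim', 'be']
```

The stems are fast to compute but are not always real words, while the lookup returns real roots at the cost of needing a reference to consult.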

stop words: words that appear frequently in a language, often adding grammatical structure, but little semantic content.
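A small sketch of filtering stop words out of a token list. The `STOP_WORDS` set here is a tiny hand-picked sample assumed for this example; libraries such as NLTK ship much longer lists per language.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "whale", "is", "a", "mammal", "of", "the", "sea"]
content = remove_stop_words(tokens)
print(content)  # ['whale', 'mammal', 'sea']
```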

text normalization: the process of taking a list of words and transforming it into a more uniform sequence.

token: an instance of a type (see "type" in this glossary).

tokenizing: the process of listing all of the tokens in a corpus, often with punctuation removed and all of the words made lowercase.
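One possible way to tokenize a text along these lines, using only a regular expression from the standard library. The function name `tokenize` and the pattern are assumptions for this sketch; dedicated tokenizers handle contractions, hyphens, and non-English text more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    # Keep runs of letters, digits, and apostrophes; everything else separates tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The Whale, the whale!")
print(tokens)  # ['the', 'whale', 'the', 'whale']
```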

type: a distinct word form in a corpus; two words belong to the same type only if they match identically. (Ex: "Whale" and "whale" are different types.)
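The short sketch below counts types in a hypothetical token list, before and after lowercasing, to show why "Whale" and "whale" count as different types unless the text is normalized first. The variable names are illustrative.

```python
# Hypothetical token list: five tokens in total.
tokens = ["Whale", "whale", "whale", "sea", "Sea"]

# A set keeps one copy of each distinct form, so its members are the types.
types_case_sensitive = set(tokens)
types_lowercased = {t.lower() for t in tokens}

print(len(tokens))                # 5 tokens
print(len(types_case_sensitive))  # 4 types: Whale, whale, sea, Sea
print(len(types_lowercased))      # 2 types: whale, sea
```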