📖 Documentation

The main window of Wordless is divided into several sections:

  • 1.1 Menu Bar
    The Menu Bar resides at the top of the main window.

  • 1.2 Work Area
    The Work Area occupies the upper half of the main window, just below the Menu Bar.

    The Work Area is further divided into the Results Area on the left side and the Settings Area on the right side. You can click on the tabs to toggle between different modules.

  • 1.3 File Area
    The File Area occupies the lower half of the main window, just above the Status Bar.

  • 1.4 Status Bar
    The Status Bar resides at the bottom of the main window.

    You can show or hide the Status Bar by checking or unchecking Menu Bar → Preferences → Show Status Bar.

You can modify the global scaling factor and font settings of the user interface via Menu Bar → Preferences → General → User Interface Settings.

In most cases, the first thing to do in Wordless is open and select your files to be processed via Menu Bar → File → Open Files/Folder.

Files are loaded, cached, and selected automatically after being added to the File Table. Only selected files will be processed by Wordless. You can drag and drop files around the File Table to change their order, which will be reflected in the results.

By default, Wordless tries to detect the encoding and language settings of all files for you, but you should double-check and make sure that the settings of each and every file are correct. If you prefer changing file settings manually, you can uncheck Open Files dialog → Auto-detect encodings and/or Open Files dialog → Auto-detect languages. The default file settings can be modified via Menu Bar → Preferences → Settings → Files → Default Settings. Additionally, you need to set the Open Files dialog → Tokenized and Open Files dialog → Tagged options of each file according to whether or not the file has been tokenized or tagged.

  • 2.1 Menu Bar → File

    • 2.1.1 Open Files
      Open the Open Files dialog to add file(s) to the File Table.

    • 2.1.2 Reopen Closed Files
      Add the file(s) that were closed last time back to the File Table.

      * The history of all closed files is erased when Wordless exits.

    • 2.1.3 Select All
      Select all files in the File Table.

    • 2.1.4 Deselect All
      Deselect all files in the File Table.

    • 2.1.5 Invert Selection
      Select files that are not currently selected and deselect files that are currently selected in the File Table.

    • 2.1.6 Close Selected
      Remove files that are currently selected from the File Table.

    • 2.1.7 Close All
      Remove all files from the File Table.

  • 2.2 Open Files dialog

    • 2.2.1 Add files
      Add one or more files to the table.

      * You can use the Ctrl key (Command key on macOS) and/or the Shift key to select multiple files.

    • 2.2.2 Add folder
      Add all files in the folder into the table.

      By default, all files in the chosen folder and the subfolders of the chosen folder (and subfolders of subfolders, and so on) are added to the table. If you do not want to add files in subfolders to the table, you could uncheck Include files in subfolders.

    • 2.2.3 Remove files
      Remove the selected files from the table.

    • 2.2.4 Clear table
      Remove all files from the table.

    • 2.2.5 Auto-detect encodings
      Auto-detect the encodings of all files when they are added into the table. If the detection results are incorrect, you can manually modify encoding settings in the table.

    • 2.2.6 Auto-detect languages
      Auto-detect the languages of all files when they are added into the table. If the detection results are incorrect, you can manually modify language settings in the table.

    • 2.2.7 Include files in subfolders
      When adding a folder to the table, recursively add all files in the chosen folder and subfolders of the chosen folder (and subfolders of subfolders, and so on) into the table.

Profiler

Note

Renamed from Overview to Profiler in Wordless 2.2.0

In Profiler, you can check and compare general linguistic features of different files.

All statistics are grouped into 5 tables for better readability: Readability, Counts, Lexical Density/Diversity, Lengths, and Length Breakdown.

  • 3.1.1 Readability
    Readability statistics of each file calculated according to the different readability tests used. See section 12.4.1 Readability Formulas for more details.

  • 3.1.2 Counts

    • 3.1.2.1 Count of Paragraphs
      The number of paragraphs in each file. Each line in the file is counted as one paragraph. Blank lines and lines containing only spaces, tabs and other invisible characters are not counted.

    • 3.1.2.2 Count of Paragraphs %
      The percentage of the number of paragraphs in each file out of the total number of paragraphs in all files.

    • 3.1.2.3 Count of Sentences
      The number of sentences in each file. Wordless automatically applies the built-in sentence tokenizer according to the language of each file to calculate the number of sentences in each file. You can modify sentence tokenizer settings via Menu Bar → Preferences → Settings → Sentence Tokenization → Sentence Tokenizer Settings.

    • 3.1.2.4 Count of Sentences %
      The percentage of the number of sentences in each file out of the total number of sentences in all files.

    • 3.1.2.5 Count of Sentence Segments
      The number of sentence segments in each file. Each part of a sentence ending with one or more consecutive terminal punctuation marks (as per the Unicode Standard) is counted as one sentence segment. See here for the full list of terminal punctuation marks.

    • 3.1.2.6 Count of Sentence Segments %
      The percentage of the number of sentence segments in each file out of the total number of sentence segments in all files.

    • 3.1.2.7 Count of Tokens
      The number of tokens in each file. Wordless automatically applies the built-in word tokenizer according to the language of each file to calculate the number of tokens in each file. You can modify word tokenizer settings via Menu Bar → Preferences → Settings → Word Tokenization → Word Tokenizer Settings.

      You can specify what should be counted as a "token" via Token Settings in the Settings Area.

    • 3.1.2.8 Count of Tokens %
      The percentage of the number of tokens in each file out of the total number of tokens in all files.

    • 3.1.2.9 Count of Types
      The number of token types in each file.

    • 3.1.2.10 Count of Types %
      The percentage of the number of token types in each file out of the total number of token types in all files.

    • 3.1.2.11 Count of Syllables
      The number of syllables in each file. Wordless automatically applies the built-in syllable tokenizer according to the language of each file to calculate the number of syllables in each file. You can modify syllable tokenizer settings via Menu Bar → Preferences → Settings → Syllable Tokenization → Syllable Tokenizer Settings.

    • 3.1.2.12 Count of Syllables %
      The percentage of the number of syllables in each file out of the total number of syllables in all files.

    • 3.1.2.13 Count of Characters
      The number of single characters in each file. Spaces, tabs and all other invisible characters are not counted.

    • 3.1.2.14 Count of Characters %
      The percentage of the number of characters in each file out of the total number of characters in all files.

  • 3.1.3 Lexical Density/Diversity
    Statistics of lexical density/diversity which reflect the extent to which the vocabulary used in each file varies. See section 12.4.2 Indicators of Lexical Density/Diversity for more details.

  • 3.1.4 Lengths

    • 3.1.4.1 Paragraph Length in Sentences / Sentence Segments / Tokens (Mean)
      The average value of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.2 Paragraph Length in Sentences / Sentence Segments / Tokens (Standard Deviation)
      The standard deviation of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.3 Paragraph Length in Sentences / Sentence Segments / Tokens (Variance)
      The variance of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.4 Paragraph Length in Sentences / Sentence Segments / Tokens (Minimum)
      The minimum of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.5 Paragraph Length in Sentences / Sentence Segments / Tokens (25th Percentile)
      The 25th percentile of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.6 Paragraph Length in Sentences / Sentence Segments / Tokens (Median)
      The median of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.7 Paragraph Length in Sentences / Sentence Segments / Tokens (75th Percentile)
      The 75th percentile of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.8 Paragraph Length in Sentences / Sentence Segments / Tokens (Maximum)
      The maximum of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.9 Paragraph Length in Sentences / Sentence Segments / Tokens (Range)
      The range of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.10 Paragraph Length in Sentences / Sentence Segments / Tokens (Interquartile Range)
      The interquartile range of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.11 Paragraph Length in Sentences / Sentence Segments / Tokens (Modes)
      The mode(s) of paragraph lengths expressed in sentences / sentence segments / tokens.

    • 3.1.4.12 Sentence / Sentence Segment Length in Tokens (Mean)
      The average value of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.13 Sentence / Sentence Segment Length in Tokens (Standard Deviation)
      The standard deviation of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.14 Sentence / Sentence Segment Length in Tokens (Variance)
      The variance of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.15 Sentence / Sentence Segment Length in Tokens (Minimum)
      The minimum of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.16 Sentence / Sentence Segment Length in Tokens (25th Percentile)
      The 25th percentile of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.17 Sentence / Sentence Segment Length in Tokens (Median)
      The median of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.18 Sentence / Sentence Segment Length in Tokens (75th Percentile)
      The 75th percentile of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.19 Sentence / Sentence Segment Length in Tokens (Maximum)
      The maximum of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.20 Sentence / Sentence Segment Length in Tokens (Range)
      The range of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.21 Sentence / Sentence Segment Length in Tokens (Interquartile Range)
      The interquartile range of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.22 Sentence / Sentence Segment Length in Tokens (Modes)
      The mode(s) of sentence / sentence segment lengths expressed in tokens.

    • 3.1.4.23 Token/Type Length in Syllables/Characters (Mean)
      The average value of token / token type lengths expressed in syllables/characters.

    • 3.1.4.24 Token/Type Length in Syllables/Characters (Standard Deviation)
      The standard deviation of token / token type lengths expressed in syllables/characters.

    • 3.1.4.25 Token/Type Length in Syllables/Characters (Variance)
      The variance of token / token type lengths expressed in syllables/characters.

    • 3.1.4.26 Token/Type Length in Syllables/Characters (Minimum)
      The minimum of token / token type lengths expressed in syllables/characters.

    • 3.1.4.27 Token/Type Length in Syllables/Characters (25th Percentile)
      The 25th percentile of token / token type lengths expressed in syllables/characters.

    • 3.1.4.28 Token/Type Length in Syllables/Characters (Median)
      The median of token / token type lengths expressed in syllables/characters.

    • 3.1.4.29 Token/Type Length in Syllables/Characters (75th Percentile)
      The 75th percentile of token / token type lengths expressed in syllables/characters.

    • 3.1.4.30 Token/Type Length in Syllables/Characters (Maximum)
      The maximum of token / token type lengths expressed in syllables/characters.

    • 3.1.4.31 Token/Type Length in Syllables/Characters (Range)
      The range of token / token type lengths expressed in syllables/characters.

    • 3.1.4.32 Token/Type Length in Syllables/Characters (Interquartile Range)
      The interquartile range of token / token type lengths expressed in syllables/characters.

    • 3.1.4.33 Token/Type Length in Syllables/Characters (Modes)
      The mode(s) of token / token type lengths expressed in syllables/characters.

    • 3.1.4.34 Syllable Length in Characters (Mean)
      The average value of syllable lengths expressed in characters.

    • 3.1.4.35 Syllable Length in Characters (Standard Deviation)
      The standard deviation of syllable lengths expressed in characters.

    • 3.1.4.36 Syllable Length in Characters (Variance)
      The variance of syllable lengths expressed in characters.

    • 3.1.4.37 Syllable Length in Characters (Minimum)
      The minimum of syllable lengths expressed in characters.

    • 3.1.4.38 Syllable Length in Characters (25th Percentile)
      The 25th percentile of syllable lengths expressed in characters.

    • 3.1.4.39 Syllable Length in Characters (Median)
      The median of syllable lengths expressed in characters.

    • 3.1.4.40 Syllable Length in Characters (75th Percentile)
      The 75th percentile of syllable lengths expressed in characters.

    • 3.1.4.41 Syllable Length in Characters (Maximum)
      The maximum of syllable lengths expressed in characters.

    • 3.1.4.42 Syllable Length in Characters (Range)
      The range of syllable lengths expressed in characters.

    • 3.1.4.43 Syllable Length in Characters (Interquartile Range)
      The interquartile range of syllable lengths expressed in characters.

    • 3.1.4.44 Syllable Length in Characters (Modes)
      The mode(s) of syllable lengths expressed in characters.

  • 3.1.5 Length Breakdown

    • 3.1.5.1 Count of n-token-long Sentences / Sentence Segments
      The number of n-token-long sentences / sentence segments, where n = 1, 2, 3, etc.

    • 3.1.5.2 Count of n-token-long Sentences / Sentence Segments %
      The percentage of the number of n-token-long sentences / sentence segments in each file out of the total number of n-token-long sentences / sentence segments in all files, where n = 1, 2, 3, etc.

    • 3.1.5.3 Count of n-syllable-long Tokens
      The number of n-syllable-long tokens, where n = 1, 2, 3, etc.

    • 3.1.5.4 Count of n-syllable-long Tokens %
      The percentage of the number of n-syllable-long tokens in each file out of the total number of n-syllable-long tokens in all files, where n = 1, 2, 3, etc.

    • 3.1.5.5 Count of n-character-long Tokens
      The number of n-character-long tokens, where n = 1, 2, 3, etc.

    • 3.1.5.6 Count of n-character-long Tokens %
      The percentage of the number of n-character-long tokens in each file out of the total number of n-character-long tokens in all files, where n = 1, 2, 3, etc.
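The statistics in sections 3.1.4 and 3.1.5 are plain descriptive statistics over lists of counts. Below is a minimal sketch of how such a battery could be computed, with whitespace tokenization standing in for Wordless's language-aware tokenizers; the helper is illustrative, not Wordless's code.

```python
import statistics

def length_stats(lengths):
    """The descriptive statistics used throughout Lengths (3.1.4)."""
    q1, median, q3 = statistics.quantiles(lengths, n=4)  # 25th / 50th / 75th percentiles
    return {
        'mean': statistics.mean(lengths),
        # Population estimators are assumed here; this section does not
        # specify which estimators Wordless uses.
        'standard deviation': statistics.pstdev(lengths),
        'variance': statistics.pvariance(lengths),
        'minimum': min(lengths),
        '25th percentile': q1,
        'median': median,
        '75th percentile': q3,
        'maximum': max(lengths),
        'range': max(lengths) - min(lengths),
        'interquartile range': q3 - q1,
        'modes': statistics.multimode(lengths),
    }

# Paragraph Length in Tokens: each non-blank line counts as one paragraph
# (cf. Count of Paragraphs above).
text = "One.\n\nTwo two.\nThree three three."
paragraphs = [line for line in text.splitlines() if line.strip()]
print(length_stats([len(paragraph.split()) for paragraph in paragraphs]))
```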

Concordancer

In Concordancer, you can search for tokens in different files and generate concordance lines. You can adjust settings for data generation via Generation Settings.

After the concordance lines are generated and displayed in the table, you can sort the results by clicking Sort Results or search in Data Table for parts that might be of interest to you by clicking Search in results. Highlight colors for sorting can be modified via Menu Bar → Preferences → Settings → Tables → Concordancer → Sorting.

You can generate concordance plots for all search terms. You can modify the settings for the generated figure via Figure Settings.

  • 4.1 Left
    The context before each search term, which displays 10 tokens to the left of the Node by default. You can change this behavior via Generation Settings.

  • 4.2 Node
    The search term(s) specified in Search Settings → Search Term.

  • 4.3 Right
    The context after each search term, which displays 10 tokens to the right of the Node by default. You can change this behavior via Generation Settings.

  • 4.4 Sentiment
    The sentiment of the Node combined with its context (Left and Right).

  • 4.5 Token No.
    The position of the first token of the Node in each file.

  • 4.6 Token No. %
    The percentage of the position of the first token of the Node in each file.

  • 4.7 Sentence Segment No.
    The position of the sentence segment where the Node is found in each file.

  • 4.8 Sentence Segment No. %
    The percentage of the position of the sentence segment where the Node is found in each file.

  • 4.9 Sentence No.
    The position of the sentence where the Node is found in each file.

  • 4.10 Sentence No. %
    The percentage of the position of the sentence where the Node is found in each file.

  • 4.11 Paragraph No.
    The position of the paragraph where the Node is found in each file.

  • 4.12 Paragraph No. %
    The percentage of the position of the paragraph where the Node is found in each file.

  • 4.13 File
    The name of the file where the Node is found.

Parallel Concordancer

Note

  1. Added in Wordless 2.0.0
  2. Renamed from Concordancer (Parallel Mode) to Parallel Concordancer in Wordless 2.2.0

In Parallel Concordancer, you can search for tokens in parallel corpora and generate parallel concordance lines. You may leave Search Settings → Search Term blank so as to search for instances of additions and deletions.

You can search in Data Table for parts that might be of interest to you by clicking Search in results.

  • 5.1 Parallel Unit No.
    The position of the alignment unit (paragraph) where the search term is found.

  • 5.2 Parallel Unit No. %
    The percentage of the position of the alignment unit (paragraph) where the search term is found.

  • 5.3 Parallel Units
    The parallel unit (paragraph) where the search term is found in each file.

    Highlight colors for search terms can be modified via Menu Bar → Preferences → Settings → Tables → Parallel Concordancer → Highlight Color Settings.

Dependency Parser

Note

Added in Wordless 3.0.0

In Dependency Parser, you can search for all dependency relations associated with different tokens and calculate their dependency lengths (distances).

You can filter the results by clicking Filter results or search in Data Table for parts that might be of interest to you by clicking Search in results.

You can select lines in the Results Area and then click Generate Figure to show dependency graphs for all selected sentences. You can modify the settings for the generated figure via Figure Settings and decide how the figures should be displayed.

  • 6.1 Head
    The token functioning as the head in the dependency structure.

  • 6.2 Dependent
    The token functioning as the dependent in the dependency structure.

  • 6.3 Dependency Length
    The dependency length (distance) between the head and the dependent in the dependency structure. The dependency length is positive if the head follows the dependent and negative if the head precedes it (see the sketch after this list).

  • 6.4 Dependency Length (Absolute)
    The absolute value of the dependency length (distance) between the head and dependent in the dependency structure. The absolute dependency length is always positive.

  • 6.5 Sentence
    The sentence where the dependency structure is found.

    Highlight colors for the head and the dependent can be modified via Menu Bar → Preferences → Settings → Tables → Dependency Parser → Highlight Color Settings.

  • 6.6 Sentence No.
    The position of the sentence where the dependency structure is found.

  • 6.7 Sentence No. %
    The percentage of the position of the sentence where the dependency structure is found.

  • 6.8 File
    The name of the file where the dependency structure is found.
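A minimal sketch of the sign convention for dependency lengths described above, assuming 1-based token positions within a sentence (illustrative only, not Wordless's code):

```python
def dependency_length(head_pos, dependent_pos):
    """Signed dependency length: positive if the head follows the
    dependent, negative if the head precedes it."""
    return head_pos - dependent_pos

# "She quickly left": "left" (position 3) heads "quickly" (position 2).
print(dependency_length(3, 2))       # 1, the head follows its dependent
print(abs(dependency_length(3, 2)))  # 1, i.e. Dependency Length (Absolute)
```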

Wordlist Generator

Note

Renamed from Wordlist to Wordlist Generator in Wordless 2.2.0

In Wordlist Generator, you can generate wordlists for different files and calculate the raw frequency, relative frequency, dispersion, and adjusted frequency of each token. You can disable the calculation of dispersion and/or adjusted frequency by setting Generation Settings → Measure of Dispersion / Measure of Adjusted Frequency to None.
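A minimal sketch of the Frequency and Number of Files Found columns, with a plain Counter and whitespace tokenization standing in for Wordless's language-aware tokenization and token settings:

```python
from collections import Counter

files = {
    'a.txt': 'the cat saw the dog'.split(),
    'b.txt': 'the dog slept'.split(),
}

for name, tokens in files.items():
    for token, freq in Counter(tokens).most_common():
        # Raw frequency and relative frequency, per file
        print(name, token, freq, freq / len(tokens))

# Number of Files Found (and %): files where the token appears at least once
num_files_found = sum('dog' in tokens for tokens in files.values())
print('dog', num_files_found, num_files_found / len(files) * 100)
```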

You can filter the results by clicking Filter results or search in Data Table for parts that might be of interest to you by clicking Search in results.

You can generate line charts or word clouds for wordlists using any statistics. You can modify the settings for the generated figure via Figure Settings.

  • 7.1 Rank
    The rank of the token sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

  • 7.2 Token
    You can specify what should be counted as a "token" via Token Settings.

  • 7.3 Syllabification
    The syllabified form of each token.

    If the token happens to exist in the vocabulary of multiple languages, all syllabified forms with their applicable languages will be listed.

    If there is no syllable tokenization support for the language where the token is found, "No language support" is displayed instead. To check which languages have syllable tokenization support, please refer to section 12.1 Supported Languages.

  • 7.4 Frequency
    The number of occurrences of the token in each file.

  • 7.5 Dispersion
    The dispersion of the token in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

  • 7.6 Adjusted Frequency
    The adjusted frequency of the token in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

  • 7.7 Number of Files Found
    The number of files in which the token appears at least once.

  • 7.8 Number of Files Found %
    The percentage of the number of files in which the token appears at least once out of the total number of files that are currently selected.

N-gram Generator

Note

Renamed from N-gram to N-gram Generator in Wordless 2.2.0

In N-gram Generator, you can search for n-grams (consecutive tokens) or skip-grams (non-consecutive tokens) in different files, compute the raw and relative frequency of each n-gram/skip-gram, and calculate the dispersion and adjusted frequency of each n-gram/skip-gram using different measures. You can adjust the settings for the generated results via Generation Settings. You can disable the calculation of dispersion and/or adjusted frequency by setting Generation Settings → Measure of Dispersion / Measure of Adjusted Frequency to None. To allow skip-grams in the results, check Generation Settings → Allow skipped tokens and modify the settings. You can also set constraints on the position of search terms in all n-grams via Search Settings → Search Term Position.
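A minimal sketch of n-gram and skip-gram generation; here skip-grams are taken to be all ordered token combinations within a fixed window, which is an assumption — Wordless's exact skip-gram settings may differ:

```python
from itertools import combinations

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, window):
    """Ordered n-token combinations drawn from each window of `window` tokens
    (one possible definition; illustrative only)."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        grams.update(combinations(tokens[i:i + window], n))
    return grams

tokens = 'the quick brown fox'.split()
print(ngrams(tokens, 2))        # ('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')
print(skipgrams(tokens, 2, 3))  # also includes e.g. ('the', 'brown'), ('quick', 'fox')
```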

You can filter the results by clicking Filter results or search in Data Table for parts that might be of interest to you by clicking Search in results.

You can generate line charts or word clouds for n-grams using any statistics. You can modify the settings for the generated figure via Figure Settings.

  • 8.1 Rank
    The rank of the n-gram sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

  • 8.2 N-gram
    You can specify what should be counted as an "n-gram" via Token Settings.

  • 8.3 Frequency
    The number of occurrences of the n-gram in each file.

  • 8.4 Dispersion
    The dispersion of the n-gram in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

  • 8.5 Adjusted Frequency
    The adjusted frequency of the n-gram in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

  • 8.6 Number of Files Found
    The number of files in which the n-gram appears at least once.

  • 8.7 Number of Files Found %
    The percentage of the number of files in which the n-gram appears at least once out of the total number of files that are currently selected.

Collocation Extractor

Note

Renamed from Collocation to Collocation Extractor in Wordless 2.2.0

In Collocation Extractor, you can search for patterns of collocation (tokens that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of collocates, and calculate the Bayes factor and effect size for each pair using different measures. You can adjust the settings for the generated results via Generation Settings. You can disable the calculation of statistical significance and/or Bayes factor and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
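A minimal sketch of counting collocates within the default ±5-token window; positional counting as in the Ln/Rn columns is omitted, and the helper is illustrative, not Wordless's implementation:

```python
from collections import Counter

def collocates(tokens, node, window=5):
    """Count tokens co-occurring with `node` within `window` tokens on either side."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = max(0, i - window)
            right = min(len(tokens), i + window + 1)
            for j in range(left, right):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

tokens = 'strong tea and strong coffee but weak arguments'.split()
print(collocates(tokens, 'strong'))
```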

You can filter the results by clicking Filter results or search in Data Table for parts that might be of interest to you by clicking Search in results.

You can generate line charts, word clouds, and network graphs for patterns of collocation using any statistics. You can modify the settings for the generated figure via Figure Settings.

  • 9.1 Rank
    The rank of the collocating token sorted by the p-value of the significance test conducted on the node and the collocating token in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

  • 9.2 Node
    The search term. You can specify what should be counted as a "token" via Token Settings.

  • 9.3 Collocate
    The collocating token. You can specify what should be counted as a "token" via Token Settings.

  • 9.4 Ln, ..., L3, L2, L1, R1, R2, R3, ..., Rn
    The number of co-occurrences of the node and the collocating token with the collocating token at the given position in each file.

  • 9.5 Frequency
    The total number of co-occurrences of the node and the collocating token with the collocating token at all possible positions in each file.

  • 9.6 Test Statistic
    The test statistic of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

    Please note that the test statistic is not available for some tests of statistical significance.

  • 9.7 p-value
    The p-value of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 9.8 Bayes Factor
    The Bayes factor of the node and the collocating token in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 9.9 Effect Size
    The effect size of the node and the collocating token in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 9.10 Number of Files Found
    The number of files in which the node and the collocating token co-occur at least once.

  • 9.11 Number of Files Found %
    The percentage of the number of files in which the node and the collocating token co-occur at least once out of the total number of files that are currently selected.

Colligation Extractor

Note

Renamed from Colligation to Colligation Extractor in Wordless 2.2.0

In Colligation Extractor, you can search for patterns of colligation (parts of speech that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of parts of speech, and calculate the Bayes factor and effect size for each pair using different measures. You can adjust the settings for the generated data via Generation Settings. You can disable the calculation of statistical significance and/or Bayes factor and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.

Wordless automatically applies its built-in part-of-speech tagger, according to the language of each file, to every file that has not already been part-of-speech-tagged. If part-of-speech tagging is not supported for a given language, you should provide a file that has already been part-of-speech-tagged and make sure that the correct Text Type is set on it.

You can filter the results by clicking Filter results or search in Data Table for parts that might be of interest to you by clicking Search in results.

You can generate line charts or word clouds for patterns of colligation using any statistics. You can modify the settings for the generated figure via Figure Settings.

  • 10.1 Rank
    The rank of the collocating part of speech sorted by the p-value of the significance test conducted on the node and the collocating part of speech in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

  • 10.2 Node
    The search term. You can specify what should be counted as a "token" via Token Settings.

  • 10.3 Collocate
    The collocating part of speech. You can specify what should be counted as a "token" via Token Settings.

  • 10.4 Ln, ..., L3, L2, L1, R1, R2, R3, ..., Rn
    The number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at the given position in each file.

  • 10.5 Frequency
    The total number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at all possible positions in each file.

  • 10.6 Test Statistic
    The test statistic of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

    Please note that the test statistic is not available for some tests of statistical significance.

  • 10.7 p-value
    The p-value of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 10.8 Bayes Factor
    The Bayes factor of the node and the collocating part of speech in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 10.9 Effect Size
    The effect size of the node and the collocating part of speech in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 10.10 Number of Files Found
    The number of files in which the node and the collocating part of speech co-occur at least once.

  • 10.11 Number of Files Found %
    The percentage of the number of files in which the node and the collocating part of speech co-occur at least once out of the total number of files that are currently selected.

Keyword Extractor

Note

Renamed from Keyword to Keyword Extractor in Wordless 2.2

In Keyword Extractor, you can search for candidate keywords (tokens that occur far more or far less frequently in the observed file than in the reference file) in different files given a reference corpus, conduct different tests of statistical significance on each keyword, and calculate the Bayes factor and effect size for each keyword using different measures. You can adjust the settings for the generated data via Generation Settings. You can disable the calculation of statistical significance and/or Bayes factor and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.

You can filter the results by clicking Filter results or search in Data Table for parts that might be of interest to you by clicking Search in results.

You can generate line charts or word clouds for keywords using any statistics. You can modify the settings for the generated figure via Figure Settings.

  • 11.1 Rank
    The rank of the keyword sorted by the p-value of the significance test conducted on the keyword in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

  • 11.2 Keyword
    The potential keyword. You can specify what should be counted as a "token" via Token Settings.

  • 11.3 Frequency (in Reference File)
    The number of occurrences of the keyword in the reference file.

  • 11.4 Frequency (in Observed Files)
    The number of occurrences of the keyword in each observed file.

  • 11.5 Test Statistic
    The test statistic of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

    Please note that the test statistic is not available for some tests of statistical significance.

  • 11.6 p-value
    The p-value of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 11.7 Bayes Factor
    The Bayes factor of the keyword in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 11.8 Effect Size
    The effect size of the keyword in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

  • 11.9 Number of Files Found
    The number of files in which the keyword appears at least once.

  • 11.10 Number of Files Found %
    The percentage of the number of files in which the keyword appears at least once out of the total number of files that are currently selected.

Supported Languages

Language / Sentence Tokenization / Word Tokenization / Syllable Tokenization / Part-of-speech Tagging / Lemmatization / Stop Word List / Dependency Parsing / Sentiment Analysis
Afrikaans ✖️
Albanian ⭕️ ✖️ ✖️ ✖️
Amharic ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Arabic ✖️
Armenian (Classical) ✖️ ✖️ ✖️
Armenian (Eastern) ✖️ ✖️
Armenian (Western) ✖️ ✖️
Assamese ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Asturian ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Azerbaijani ⭕️ ✖️ ✖️ ✖️ ✖️
Basque
Belarusian ✖️
Bengali ⭕️ ✖️ ✖️ ✖️
Bulgarian ✖️
Burmese ✖️ ✖️ ✖️ ✖️ ✖️
Buryat (Russia) ✖️ ✖️ ✖️
Catalan
Chinese (Classical) ✖️ ✖️ ✖️
Chinese (Simplified) ✖️
Chinese (Traditional) ✖️
Church Slavonic (Old) ✖️ ✖️ ✖️
Coptic ✖️ ✖️ ✖️
Croatian ✖️
Czech ✖️
Danish
Dutch
English (Middle) ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
English (Old) ✖️ ✖️ ✖️
English (United Kingdom)
English (United States)
Erzya ✖️ ✖️ ✖️
Esperanto ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️
Estonian ✖️
Faroese ✖️ ✖️ ✖️ ✖️
Finnish ✖️
French
French (Old) ✖️ ✖️ ✖️
Galician ✖️
Georgian ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️
German (Austria)
German (Germany)
German (Switzerland)
Gothic ✖️ ✖️ ✖️
Greek (Ancient) ✖️ ✖️ ✖️
Greek (Modern)
Gujarati ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Hebrew (Ancient) ✖️ ✖️ ✖️
Hebrew (Modern) ✖️
Hindi ✖️ ✖️
Hungarian
Icelandic ✖️
Indonesian
Irish ✖️ ✖️
Italian
Japanese ✖️ ✖️
Kannada ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Kazakh ✖️
Khmer ✖️ ✖️ ✖️ ✖️
Korean ✖️ ✖️
Kurdish (Kurmanji) ✖️ ✖️
Kyrgyz ✖️ ✖️
Lao ✖️ ✖️ ✖️
Latin ✖️ ✖️
Latvian ✖️
Ligurian ✖️ ✖️ ✖️
Lithuanian ✖️
Luganda ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Luxembourgish ⭕️ ✖️ ✖️ ✖️ ✖️
Macedonian ✖️ ✖️
Malay ⭕️ ✖️ ✖️ ✖️ ✖️
Malayalam ✖️ ✖️ ✖️ ✖️ ✖️
Maltese ✖️ ✖️ ✖️
Manx ✖️ ✖️ ✖️
Marathi ✖️ ✖️
Meitei (Meitei script) ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Mongolian ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️
Nepali ⭕️ ✖️ ✖️ ✖️ ✖️
Nigerian Pidgin ✖️ ✖️ ✖️
Norwegian (Bokmål)
Norwegian (Nynorsk) ✖️ ✖️
Odia ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Persian ✖️ ✖️
Polish ✖️
Pomak ✖️ ✖️ ✖️
Portuguese (Brazil)
Portuguese (Portugal)
Punjabi (Gurmukhi script) ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Romanian
Russian
Russian (Old) ✖️ ✖️ ✖️
Sámi (Northern) ✖️ ✖️ ✖️
Sanskrit ✖️ ✖️
Scottish Gaelic ✖️ ✖️
Serbian (Cyrillic script) ⭕️ ✖️ ✖️ ✖️
Serbian (Latin script) ✖️
Sindhi ✖️ ✖️ ✖️ ✖️
Sinhala ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Slovak ✖️
Slovene
Sorbian (Lower) ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️ ✖️
Sorbian (Upper) ✖️ ✖️ ✖️
Spanish
Swahili ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️
Swedish
Tagalog ⭕️ ✖️ ✖️ ✖️ ✖️
Tajik ⭕️ ✖️ ✖️ ✖️ ✖️
Tamil ✖️ ✖️
Tatar ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Telugu ✖️ ✖️
Tetun (Dili) ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️ ✖️
Thai ✖️ ✖️
Tibetan ✖️ ✖️ ✖️ ✖️
Tigrinya ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Tswana ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️ ✖️
Turkish ✖️
Ukrainian ✖️
Urdu ✖️ ✖️
Uyghur ✖️ ✖️
Vietnamese ✖️ ✖️ ✖️
Welsh ✖️ ✖️
Wolof ✖️ ✖️ ✖️
Yoruba ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️
Zulu ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️
Other languages ⭕️ ⭕️ ✖️ ✖️ ✖️ ✖️ ✖️ ✖️

Note

✔: Supported
⭕️: Supported but falls back to the default English (United States) tokenizer
✖️: Not supported

Supported File Types

| File Type | File Extensions | Remarks |
| --- | --- | --- |
| CSV files¹ | *.csv | |
| Excel workbooks¹² | *.xlsx | Legacy Microsoft 97-2003 Excel workbooks (*.xls) are not supported. |
| HTML pages¹² | *.htm, *.html | |
| Lyrics files¹ | *.lrc | Simple LRC and enhanced LRC formats are supported. |
| PDF files¹² | *.pdf | Text can only be extracted from text-searchable PDF files. There is no support for automatically converting scanned PDF files into text-searchable ones. |
| PowerPoint presentations¹² | *.pptx | Legacy Microsoft 97-2003 PowerPoint presentations (*.ppt) are not supported. |
| Text files | *.txt | |
| Translation memory files¹ | *.tmx | |
| Word documents¹² | *.docx | Legacy Microsoft 97-2003 Word documents (*.doc) are not supported. |
| XML files¹ | *.xml | |

Important

  1. Non-TXT files are automatically converted to TXT files when imported into Wordless. You can check the converted files under the folder imports at the installation location of Wordless on your computer (for macOS users, right-click Wordless.app, select Show Package Contents, and navigate to Contents/MacOS/imports/). You can change this location via Menu Bar → Preferences → Settings → General → Import → Temporary Files → Default path.
  2. Directly importing non-text files into Wordless is not recommended; support for doing so is provided only for convenience. The accuracy of text extraction can never be guaranteed and unintended data loss might occur, so users are encouraged to convert their files with specialized tools and decide for themselves which parts of the data should be kept or discarded.

Supported File Encodings

Language / File Encoding / Auto-detection
All languages UTF-8 without BOM
All languages UTF-8 with BOM
All languages UTF-16 with BOM
All languages UTF-16BE without BOM
All languages UTF-16LE without BOM
All languages UTF-32 with BOM
All languages UTF-32BE without BOM
All languages UTF-32LE without BOM
All languages UTF-7
Arabic CP720
Arabic CP864
Arabic ISO-8859-6
Arabic Mac OS
Arabic Windows-1256
Baltic languages CP775
Baltic languages ISO-8859-13
Baltic languages Windows-1257
Celtic languages ISO-8859-14
Chinese GB18030
Chinese GBK
Chinese (Simplified) GB2312
Chinese (Simplified) HZ
Chinese (Traditional) Big-5
Chinese (Traditional) Big5-HKSCS
Chinese (Traditional) CP950
Croatian Mac OS
Cyrillic CP855
Cyrillic CP866
Cyrillic ISO-8859-5
Cyrillic Mac OS
Cyrillic Windows-1251
English ASCII
English EBCDIC 037
English CP437
European HP Roman-8
European (Central) CP852
European (Central) ISO-8859-2
European (Central) Mac OS Central European
European (Central) Windows-1250
European (Northern) ISO-8859-4
European (Southern) ISO-8859-3
European (Southeastern) ISO-8859-16
European (Western) EBCDIC 500
European (Western) CP850
European (Western) CP858
European (Western) CP1140
European (Western) ISO-8859-1
European (Western) ISO-8859-15
European (Western) Mac OS Roman
European (Western) Windows-1252
French CP863
German EBCDIC 273
Greek CP737
Greek CP869
Greek CP875
Greek ISO-8859-7
Greek Mac OS
Greek Windows-1253
Hebrew CP856
Hebrew CP862
Hebrew EBCDIC 424
Hebrew ISO-8859-8
Hebrew Windows-1255
Icelandic CP861
Icelandic Mac OS
Japanese CP932
Japanese EUC-JP
Japanese EUC-JIS-2004
Japanese EUC-JISx0213
Japanese ISO-2022-JP
Japanese ISO-2022-JP-1
Japanese ISO-2022-JP-2
Japanese ISO-2022-JP-2004
Japanese ISO-2022-JP-3
Japanese ISO-2022-JP-EXT
Japanese Shift_JIS
Japanese Shift_JIS-2004
Japanese Shift_JISx0213
Kazakh KZ-1048
Kazakh PTCP154
Korean EUC-KR
Korean ISO-2022-KR
Korean JOHAB
Korean UHC
Nordic languages CP865
Nordic languages ISO-8859-10
Persian/Urdu Mac OS Farsi
Portuguese CP860
Romanian Mac OS
Russian KOI8-R
Tajik KOI8-T
Thai CP874
Thai ISO-8859-11
Thai TIS-620
Turkish CP857
Turkish EBCDIC 1026
Turkish ISO-8859-9
Turkish Mac OS
Turkish Windows-1254
Ukrainian CP1125
Ukrainian KOI8-U
Urdu CP1006
Vietnamese CP1258

Readability Formulas

The readability of a text depends on several variables, including the average sentence length, the average word length in characters, the average word length in syllables, the number of monosyllabic words, the number of polysyllabic words, the number of difficult words, etc.

It should be noted that some readability measures are language-specific or applicable only to texts in languages for which Wordless has built-in syllable tokenization support (see section 12.1 Supported Languages), while others can be applied to texts in all languages.

The following variables are used in the formulas (two worked examples follow the list):
NumSentences: Number of sentences
NumWords: Number of words
NumWordsSyl₁: Number of monosyllabic words
NumWordsSylsₙ₊: Number of words with n or more syllables
NumWordsLtrsₙ₊: Number of words with n or more letters
NumWordsLtrsₙ₋: Number of words with n or fewer letters
NumConjs: Number of conjunctions
NumPreps: Number of prepositions
NumProns: Number of pronouns
NumWordsDale₇₆₉: Number of words outside the Dale list of 769 easy words (Dale, 1931)
NumWordsDale₃₀₀₀: Number of words outside the Dale list of 3000 easy words (Dale & Chall, 1948b)
NumWordsSpache: Number of words outside the Spache word list (Spache, 1974)
NumWordTypes: Number of word types
NumWordTypesBambergerVanecek: Number of word types outside the Bamberger-Vanecek list of 1000 most common words (Bamberger & Vanecek, 1984, pp. 176–179)
NumWordTypesDale₇₆₉: Number of word types outside the Dale list of 769 easy words (Dale, 1931)
NumSyls: Number of syllables
NumSylsLuongNguyenDinh₁₀₀₀: Number of syllables outside the Luong-Nguyen-Dinh list of 1000 most frequent syllables extracted from all easy documents of the corpus of Vietnamese text readability dataset on literature domain (Luong et al., 2018)
NumCharsAll: Number of characters (letters, CJK characters, etc., numerals, and punctuation marks)
NumCharsAlnum: Number of alphanumeric characters (letters, CJK characters, etc., and numerals)
NumCharsAlpha: Number of alphabetic characters (letters, CJK characters, etc.)
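To make the notation concrete, here are the standard published forms of two formulas from the table below, expressed in the variables above. Treat this as a sketch: which character count (alphanumeric vs. alphabetic) each Wordless variant uses is an assumption here, and the variant actually applied can be selected via Menu Bar → Preferences → Settings → Measures → Readability.

```latex
% Automated Readability Index (Smith & Senter, 1967) -- standard coefficients;
% the character count used here (alphanumeric) is an assumption.
\mathrm{ARI} = 4.71 \cdot \frac{\mathit{NumCharsAlnum}}{\mathit{NumWords}}
             + 0.5  \cdot \frac{\mathit{NumWords}}{\mathit{NumSentences}} - 21.43

% Flesch Reading Ease (Flesch, 1948) -- standard coefficients.
\mathrm{RE} = 206.835
            - 1.015 \cdot \frac{\mathit{NumWords}}{\mathit{NumSentences}}
            - 84.6  \cdot \frac{\mathit{NumSyls}}{\mathit{NumWords}}
```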

Readability Formula Formula Supported Languages
Al-Heeti's Readability Prediction Formula¹
(Al-Heeti, 1984, pp. 102, 104, 106)
Formula Arabic
Automated Arabic Readability Index
(Al-Tamimi et al., 2013)
Formula Arabic
Automated Readability Index¹
(Smith & Senter, 1967, p. 8
Navy: Kincaid et al., 1975, p. 14)
Formula All languages
Bormuth's Cloze Mean & Grade Placement
(Bormuth, 1969, pp. 152, 160)
Formula
where C is the cloze criterion score, whose value could be changed via Menu Bar → Preferences → Settings → Measures → Readability → Bormuth's Grade Placement → Cloze criterion score
English
Coleman-Liau Index
(Coleman & Liau, 1975)
Formula All languages
Coleman's Readability Formula¹
(Liau et al., 1976)
Formula All languages²³
Dale-Chall Readability Formula¹
(Dale & Chall, 1948a; Dale & Chall, 1948b
Powers-Sumner-Kearl: Powers et al., 1958
New: Chall & Dale, 1995)
Formula English
Danielson-Bryan's Readability Formula¹
(Danielson & Bryan, 1963)
Formula All languages
Dawood's Readability Formula
(Dawood, 1977)
Formula Arabic
Degrees of Reading Power
(College Entrance Examination Board, 1981)
Formula
where M is Bormuth's cloze mean.
English
Devereux Readability Index
(Smith, 1961)
Formula All languages
Dickes-Steiwer Handformel
(Dickes & Steiwer, 1977)
Formula All languages
Easy Listening Formula
(Fang, 1966)
Formula All languages²
Flesch-Kincaid Grade Level
(Kincaid et al., 1975, p. 14)
Formula All languages²
Flesch Reading Ease¹
(Flesch, 1948
Powers-Sumner-Kearl: Powers et al., 1958
Dutch: Douma, 1960, p. 453; Brouwer, 1963
French: Kandel & Moles, 1958
German: Amstad, 1978
Italian: Franchina & Vacca, 1986
Russian: Oborneva, 2006, p. 13
Spanish: Fernández Huerta, 1959; Szigriszt Pazos, 1993, p. 247
Ukrainian: Partiko, 2001)
Formula All languages²
Flesch Reading Ease (Farr-Jenkins-Paterson)¹
(Farr et al., 1951
Powers-Sumner-Kearl: Powers et al., 1958)
Formula All languages²
FORCAST Grade Level
(Caylor & Sticht, 1973, p. 3)
Formula

* One sample of 150 words would be taken randomly from the text, so the text should be at least 150 words long.
All languages²
Fórmula de comprensibilidad de Gutiérrez de Polini
(Gutiérrez de Polini, 1972)
Formula Spanish
Fórmula de Crawford
(Crawford, 1985)
Formula Spanish²
Fucks's Stilcharakteristik
(Fucks, 1955)
Formula All languages²
Gulpease Index
(Lucisano & Emanuela Piemontese, 1988)
Formula Italian
Gunning Fog Index¹
(English: Gunning, 1968, p. 38
Powers-Sumner-Kearl: Powers et al., 1958
Navy: Kincaid et al., 1975, p. 14
Polish: Pisarek, 1969)
Formula
where NumHardWords is the number of words with 3 or more syllables, except proper nouns and words with 3 syllables ending with -ed or -es, for English texts, and the number of words with 4 or more syllables in their base forms, except proper nouns, for Polish texts.
English & Polish²
Legibilidad µ
(Muñoz Baquedano, 2006)
Formula
where LenWordsAvg is the average word length in letters, and LenWordsVar is the variance of word lengths in letters.
Spanish
Lensear Write
(O’Hayre, 1966, p. 8)
Formula
where NumWords1Syl is the number of monosyllabic words excluding the, is, are, was, were.

* One sample of 100 words would be taken randomly from the text, and if the text is shorter than 100 words, NumWords1Syl and NumSentences would be multiplied by 100 and then divided by NumWords.
English²
Lix
(Björnsson, 1968)
Formula All languages
Lorge Readability Index¹
(Lorge, 1944
Corrected: Lorge, 1948)
Formula English³
Luong-Nguyen-Dinh's Readability Formula
(Luong et al., 2018)
Formula

* The number of syllables is estimated by tokenizing the text by whitespace and counting the number of tokens excluding punctuation marks.
Vietnamese
McAlpine EFLAW Readability Score
(Nirmaldasan, 2009)
Formula English
neue Wiener Literaturformeln¹
(Bamberger & Vanecek, 1984, p. 82)
Formula German²
neue Wiener Sachtextformel¹
(Bamberger & Vanecek, 1984, pp. 83–84)
Formula German²
OSMAN
(El-Haj & Rayson, 2016)
Formula
where NumFaseehWords is the number of words which have 5 or more syllables and contain ء/ئ/ؤ/ذ/ظ or end with وا/ون.

* The number of syllables in each word is estimated by adding up the number of short syllables and twice the number of long and stress syllables in each word.
Arabic
Rix
(Anderson, 1983)
Formula All languages
SMOG Grade
(McLaughlin, 1969
German: Bamberger & Vanecek, 1984, p. 78)
Formula

* A sample would be constructed using the first 10 sentences, the last 10 sentences, and the 10 sentences at the middle of the text, so the text should be at least 30 sentences long.
All languages²
Spache Grade Level¹
(Spache, 1953
Revised: Spache, 1974)
Formula

* Three samples each of 100 words would be taken randomly from the text and the results would be averaged out, so the text should be at least 100 words long.
All languages
Strain Index
(Solomon, 2006)
Formula

* A sample would be constructed using the first 3 sentences in the text, so the text should be at least 3 sentences long.
All languages²
Tränkle & Bailer's Readability Formula¹
(Tränkle & Bailer, 1984)
Formula

* One sample of 100 words would be taken randomly from the text, so the text should be at least 100 words long.
All languages³
Tuldava's Text Difficulty
(Tuldava, 1975)
Formula All languages²
Wheeler & Smith's Readability Formula
(Wheeler & Smith, 1954)
Formula
where NumUnits is the number of sentence segments ending in periods, question marks, exclamation marks, colons, semicolons, and dashes.
All languages²

Note

  1. Variants are available and can be selected via Menu Bar → Preferences → Settings → Measures → Readability
  2. Requires built-in syllable tokenization support
  3. Requires built-in part-of-speech tagging support

Indicators of Lexical Density/Diversity

Lexical density/diversity measures the extent to which the vocabulary used in a text varies.

The following variables are used in the formulas (a worked example follows the list):
fᵢ: Frequency of the i-th token type, with types ranked in descending order of frequency
fₘₐₓ: Maximum frequency among all token types
NumTypes: Number of token types
NumTypesf: Number of token types whose frequencies equal f
NumTokens: Number of tokens
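To make the notation concrete, the three simplest type-based indicators in the table below take the following well-known forms:

```latex
% Type-token Ratio (Johnson, 1944), Root TTR (Guiraud, 1954),
% and Corrected TTR (Carroll, 1964).
\mathrm{TTR} = \frac{\mathit{NumTypes}}{\mathit{NumTokens}} \qquad
\mathrm{RTTR} = \frac{\mathit{NumTypes}}{\sqrt{\mathit{NumTokens}}} \qquad
\mathrm{CTTR} = \frac{\mathit{NumTypes}}{\sqrt{2 \cdot \mathit{NumTokens}}}
```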

Indicator of Lexical Density/Diversity Formula
Brunét's Index
(Brunét, 1978)
Formula
Corrected TTR
(Carroll, 1964)
Formula
Fisher's Index of Diversity
(Fisher et al., 1943)
Formula
where W₋₁ is the -1 branch of the Lambert W function
Herdan's Vₘ
(Herdan, 1955)
Formula
HD-D
(McCarthy & Jarvis, 2010)
For detailed calculation procedures, see reference.
The sample size could be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → HD-D → Sample size.
Honoré's Statistic
(Honoré, 1979)
Formula
Lexical Density
(Ure, 1971)
Formula
where NumContentWords is the number of content words. By default, all tokens whose universal part-of-speech tags assigned by built-in part-of-speech taggers are ADJ (adjectives), ADV (adverbs), INTJ (interjections), NOUN (nouns), PROPN (proper nouns), NUM (numerals), VERB (verbs), SYM (symbols), or X (others) are categorized as content words. For some built-in part-of-speech taggers, this behavior could be changed via Menu Bar → Preferences → Settings → Part-of-speech Tagging → Tagsets → Mapping Settings → Content/Function Words.
LogTTR¹
(Herdan: Herdan, 1960, p. 28
Somers: Somers, 1966
Rubet: Dugast, 1979
Maas: Maas, 1972
Dugast: Dugast, 1978; Dugast, 1979)
Formula
Mean Segmental TTR
(Johnson, 1944)
Formula
where n is the number of equal-sized segments, the length of which can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Mean Segmental TTR → Number of tokens in each segment, NumTypesSegᵢ is the number of token types in the i-th segment, and NumTokensSegᵢ is the number of tokens in the i-th segment.
Measure of Textual Lexical Diversity
(McCarthy, 2005, pp. 95–96, 99–100; McCarthy & Jarvis, 2010)
For detailed calculation procedures, see references.
The factor size could be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Measure of Textual Lexical Diversity → Factor size.
Moving-average TTR
(Covington & McFall, 2010)
Formula
where w is the window size which could be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Moving-average TTR → Window size, NumTypesWindowₚ is the number of token types within the moving window starting at position p, and NumTokensWindowₚ is the number of tokens within the moving window starting at position p.
Popescu-Mačutek-Altmann's B₁/B₂/B₃/B₄/B₅
(Popescu et al., 2008)
Formula
Popescu's R₁
(Popescu, 2009, pp. 18, 30, 33)
For detailed calculation procedures, see reference.
Popescu's R₂
(Popescu, 2009, pp. 35–36, 38)
For detailed calculation procedures, see reference.
Popescu's R₃
(Popescu, 2009, pp. 48–49, 53)
For detailed calculation procedures, see reference.
Popescu's R₄
(Popescu, 2009, p. 57)
For detailed calculation procedures, see reference.
Repeat Rate¹
(Popescu, 2009, p. 166)
Formula
Root TTR
(Guiraud, 1954)
Formula
Shannon Entropy¹
(Popescu, 2009, p. 173)
Formula
Simpson's l
(Simpson, 1949)
Formula
Type-token Ratio
(Johnson, 1944)
Formula
vocd-D
(Malvern et al., 2004, pp. 51, 56–57)
For detailed calculation procedures, see reference.
Yule's Characteristic K
(Yule, 1944, pp. 52–53)
Formula
Yule's Index of Diversity
(Williams, 1970, p. 100)
Formula

Note

  1. Variants are available and can be selected via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity

Measures of Dispersion & Adjusted Frequency

For parts-based measures, each file is divided into n sub-sections (the value of n can be modified via Menu Bar → Preferences → Settings → Measures → Dispersion / Adjusted Frequency → General Settings → Divide each file into subsections), and the frequency of the word in each part is counted and denoted by F₁, F₂, F₃, ..., Fₙ respectively. The total frequency of the word in each file is denoted by F, and the mean value of the frequencies over all sub-sections is denoted by F̄.

For distance-based measures, the distance between each pair of subsequent occurrences of the word is calculated and denoted by d₁, d₂, d₃, ..., dF respectively. The total number of tokens in each file is denoted by N.

Then, the dispersion and adjusted frequency of the word are calculated as follows (a worked example of two parts-based measures is given after the table):

Measure of Dispersion (Parts-based) Measure of Adjusted Frequency (Parts-based) Formula
Carroll's D₂
(Carroll, 1970)
Carroll's Uₘ
(Carroll, 1970)
Formula
  Engwall's FM
(Engwall, 1974)
Formula
where R is the number of sub-sections in which the word appears at least once.
Gries's DP
(Gries, 2008; Lijffijt & Gries, 2012)
Formula

* Normalization is applied by default; you can change this behavior via Menu Bar → Preferences → Settings → Measures → Dispersion → Gries's DP → Apply normalization.
Juilland's D
(Juilland & Chang-Rodrigues, 1964)
Juilland's U
(Juilland & Chang-Rodrigues, 1964)
Formula
  Kromer's UR
(Kromer, 2003)
Formula
where ψ is the digamma function, and C is the Euler–Mascheroni constant.
Lyne's D₃
(Lyne, 1985)
Formula
Rosengren's S
(Rosengren, 1971)
Rosengren's KF
(Rosengren, 1971)
Formula
Zhang's Distributional Consistency
(Zhang, 2004)
Formula
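
As a concrete illustration of the parts-based setup, here is a minimal sketch of Gries's DP with the normalization of Lijffijt & Gries (2012). It assumes equal-sized sub-sections (so each expected proportion is 1/n) and is not Wordless's own code.

```python
# A sketch of Gries's DP with the normalization of Lijffijt & Gries (2012);
# assumes the file has already been divided into n equal-sized sub-sections.

def gries_dp(freqs_by_section, normalize=True):
    # freqs_by_section: F_1, ..., F_n, the word's frequency in each sub-section
    n = len(freqs_by_section)
    f_total = sum(freqs_by_section)
    s = 1 / n  # expected proportion of each equal-sized sub-section
    # DP: half the sum of absolute differences between observed and expected
    # proportions; 0 = perfectly even, values near 1 = highly clumped
    dp = 0.5 * sum(abs(f_i / f_total - s) for f_i in freqs_by_section)
    if normalize:
        dp /= 1 - s  # DP_norm = DP / (1 - min(s_i)); here all s_i equal 1/n
    return dp

print(gries_dp([10, 0, 0, 0, 0]))  # maximally clumped -> 1.0
print(gries_dp([2, 2, 2, 2, 2]))   # perfectly even    -> 0.0
```
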
| Measure of Dispersion (Distance-based) | Measure of Adjusted Frequency (Distance-based) | Formula |
|---|---|---|
| Average Logarithmic Distance (Savický & Hlaváčová, 2002) | Average Logarithmic Distance (Savický & Hlaváčová, 2002) | Formula |
| Average Reduced Frequency (Savický & Hlaváčová, 2002) | Average Reduced Frequency (Savický & Hlaváčová, 2002) | Formula |
| Average Waiting Time (Savický & Hlaváčová, 2002) | Average Waiting Time (Savický & Hlaváčová, 2002) | Formula |
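
For the distance-based setup, here is a sketch of Average Reduced Frequency. Treating the text cyclically (the distance from the last occurrence wraps around to the first, so that there are exactly F distances) follows Savický & Hlaváčová (2002); the function and argument names are ours, not Wordless's.

```python
# A sketch of Average Reduced Frequency (Savický & Hlaváčová, 2002); distances
# between occurrences are measured cyclically, treating the text as a circle.

def arf(positions, num_tokens):
    # positions: sorted 0-based token positions of the word's occurrences
    f = len(positions)
    v = num_tokens / f  # average distance between occurrences
    # d_1, ..., d_F: distances between successive occurrences, wrapping around
    dists = [positions[i] - positions[i - 1] for i in range(1, f)]
    dists.append(num_tokens - positions[-1] + positions[0])
    # Each distance is capped at v, so clustered occurrences count for less
    return sum(min(d, v) for d in dists) / v

print(arf([0, 25, 50, 75], 100))  # evenly spread -> 4.0 (equals raw frequency)
print(arf([0, 1, 2, 3], 100))     # clustered     -> close to 1
```
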

In order to calculate the statistical significance, Bayes factor, and effect size (except for Mann-Whitney U Test, Student's t-test (2-sample), and Welch's t-test) for two words in the same file (collocates) or for one specific word in two different files (keywords), two contingency tables must first be constructed: one for observed values and one for expected values.

As for collocates (in Collocation Extractor and Colligation Extractor):

| Observed Values | Word 2 | Not Word 2 | Row Total |
|---|---|---|---|
| Word 1 | O₁₁ | O₁₂ | O₁ₓ = O₁₁ + O₁₂ |
| Not Word 1 | O₂₁ | O₂₂ | O₂ₓ = O₂₁ + O₂₂ |
| Column Total | Oₓ₁ = O₁₁ + O₂₁ | Oₓ₂ = O₁₂ + O₂₂ | Oₓₓ = O₁₁ + O₁₂ + O₂₁ + O₂₂ |

| Expected Values | Word 2 | Not Word 2 |
|---|---|---|
| Word 1 | E₁₁ | E₁₂ |
| Not Word 1 | E₂₁ | E₂₂ |

O₁₁: Number of occurrences of Word 1 followed by Word 2.
O₁₂: Number of occurrences of Word 1 followed by any word except Word 2.
O₂₁: Number of occurrences of any word except Word 1 followed by Word 2.
O₂₂: Number of occurrences of any word except Word 1 followed by any word except Word 2.
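
The sketch below shows how the expected values can be derived from the observed table in the usual way (each Eᵢⱼ is the product of its row and column totals divided by the grand total) and fed into the log-likelihood ratio test (Dunning, 1993). The counts are invented for illustration, and the same computation applies unchanged to the keyword tables in the next subsection.

```python
# A sketch of the log-likelihood ratio test (Dunning, 1993) computed from the
# observed contingency table above; the counts are hypothetical.
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    oxx = o11 + o12 + o21 + o22
    observed = [o11, o12, o21, o22]
    # Standard expected values: E_ij = row total * column total / grand total
    expected = [
        (o11 + o12) * (o11 + o21) / oxx,  # E11
        (o11 + o12) * (o12 + o22) / oxx,  # E12
        (o21 + o22) * (o11 + o21) / oxx,  # E21
        (o21 + o22) * (o12 + o22) / oxx,  # E22
    ]
    # G2 = 2 * sum(O * ln(O / E)); terms with O = 0 contribute nothing
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o)

# Hypothetical collocation counts: Word 1 and Word 2 co-occur 30 times,
# Word 1 occurs 500 times and Word 2 occurs 80 times in 1,000,000 bigrams
print(log_likelihood_ratio(30, 470, 50, 999_450))
```
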

As for keywords (in Keyword Extractor):

| Observed Values | Observed File | Reference File | Row Total |
|---|---|---|---|
| Word w | O₁₁ | O₁₂ | O₁ₓ = O₁₁ + O₁₂ |
| Not Word w | O₂₁ | O₂₂ | O₂ₓ = O₂₁ + O₂₂ |
| Column Total | Oₓ₁ = O₁₁ + O₂₁ | Oₓ₂ = O₁₂ + O₂₂ | Oₓₓ = O₁₁ + O₁₂ + O₂₁ + O₂₂ |

| Expected Values | Observed File | Reference File |
|---|---|---|
| Word w | E₁₁ | E₁₂ |
| Not Word w | E₂₁ | E₂₂ |

O₁₁: Number of occurrences of Word w in the observed file.
O₁₂: Number of occurrences of Word w in the reference file.
O₂₁: Number of occurrences of all words except Word w in the observed file.
O₂₂: Number of occurrences of all words except Word w in the reference file.

To conduct Mann-Whitney U Test, Student's t-test (2-sample), or Welch's t-test on a specific word, each column total is first divided into n (5 by default) sub-sections. More specifically, in Collocation Extractor and Colligation Extractor, the collocates where Word 1 appears as the node and the collocates where Word 1 does not appear as the node are each divided into n parts; in Keyword Extractor, all tokens in the observed file and all tokens in the reference file are each divided into n equal parts.

The frequencies of Word 2 (in Collocation Extractor and Colligation Extractor) or Word w (in Keyword Extractor) in each sub-section of the 2 column totals are counted and denoted by F₁₁, F₂₁, F₃₁, ..., Fₙ₁ and F₁₂, F₂₂, F₃₂, ..., Fₙ₂ respectively. The total frequencies of Word 2 (in Collocation Extractor and Colligation Extractor) or Word w (in Keyword Extractor) in the 2 column totals are denoted by Fₓ₁ and Fₓ₂ respectively, and the mean values of the frequencies over all sub-sections in the 2 column totals are denoted by F̄ₓ₁ and F̄ₓ₂ respectively.
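
The following sketch illustrates this sub-sectioning for a keyword, handing the two frequency lists to SciPy's mannwhitneyu. Splitting into n equal parts and discarding any remainder tokens is an assumption, as are the sample texts.

```python
# A sketch of the sub-section setup for the Mann-Whitney U test on a keyword;
# n = 5 follows the default described above.
from scipy import stats

def keyword_mann_whitney(tokens_observed, tokens_ref, word, n=5):
    def freqs_per_section(tokens):
        # Divide the tokens into n equal sub-sections (remainder tokens are
        # discarded here, an assumption) and count F_1, ..., F_n for the word
        size = len(tokens) // n
        return [tokens[i * size:(i + 1) * size].count(word) for i in range(n)]

    f_obs = freqs_per_section(tokens_observed)
    f_ref = freqs_per_section(tokens_ref)
    return stats.mannwhitneyu(f_obs, f_ref, alternative='two-sided')

tokens_obs = ('corpus linguistics is fun ' * 100).split()
tokens_ref = ('linguistics is hard work today ' * 100).split()
print(keyword_mann_whitney(tokens_obs, tokens_ref, 'corpus'))
```
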

Then the test statistic, Bayes factor, and effect size are calculated as follows (Python sketches of a few of these measures follow the two tables below):

| Test of Statistical Significance | Measure of Bayes Factor | Formula |
|---|---|---|
| Fisher's Exact Test (Pedersen, 1996) |  | See: Fisher's exact test - Wikipedia |
| Log-likelihood Ratio Test (Dunning, 1993) | Log-likelihood Ratio Test (Wilson, 2013) | Formula |
| Mann-Whitney U Test (Kilgarriff, 2001) |  | See: Mann–Whitney U test - Wikipedia |
| Pearson's Chi-squared Test (Hofland & Johanson, 1982; Oakes, 1998) |  | Formula |
| Student's t-test (1-sample) (Church et al., 1991) |  | Formula |
| Student's t-test (2-sample) (Paquot & Bestgen, 2009) | Student's t-test (2-sample) (Wilson, 2013) | Formula |
| z-score (Dennis, 1964) |  | Formula |
| z-score (Berry-Rogghe) (Berry-Rogghe, 1973) |  | Formula, where S is the average span size on both sides of the node word. |
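As an illustration, Pearson's chi-squared test can be computed from the observed contingency table with SciPy, which derives the expected values internally. Disabling Yates's correction below is an assumption for the sketch, not a statement about Wordless's settings, and the counts reuse the hypothetical collocation example above.

```python
# A generic sketch of Pearson's chi-squared test on the 2x2 contingency table,
# via SciPy; the counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [[30, 470],      # O11, O12
            [50, 999_450]]  # O21, O22
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)
print(expected)  # E11, E12, E21, E22 from row and column totals
```
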
| Measure of Effect Size | Formula |
|---|---|
| %DIFF (Gabrielatos & Marchi, 2012) | Formula |
| Cubic Association Ratio (Daille, 1994, 1995) | Formula |
| Dice's Coefficient (Smadja et al., 1996) | Formula |
| Difference Coefficient (Hofland & Johanson, 1982; Gabrielatos, 2018) | Formula |
| Jaccard Index (Dunning, 1998) | Formula |
| Kilgarriff's Ratio (Kilgarriff, 2009) | Formula, where α is the smoothing parameter, whose value could be changed via Menu Bar → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter. |
| Log Ratio (Hardie, 2014) | Formula |
| Log-Frequency Biased MD (Thanopoulos et al., 2002) | Formula |
| logDice (Rychlý, 2008) | Formula |
| MI.log-f (Kilgarriff & Tugwell, 2002; Lexical Computing, 2015) | Formula |
| Minimum Sensitivity (Pedersen, 1998) | Formula |
| Mutual Dependency (Thanopoulos et al., 2002) | Formula |
| Mutual Expectation (Dias et al., 1999) | Formula |
| Mutual Information (Dunning, 1998) | Formula |
| Odds Ratio (Pojanapunya & Todd, 2016) | Formula |
| Pointwise Mutual Information (Church & Hanks, 1990) | Formula |
| Poisson Collocation Measure (Quasthoff & Wolff, 2002) | Formula |
| Squared Phi Coefficient (Church & Gale, 1991) | Formula |
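
Finally, here are minimal sketches of a few of the effect size measures above, written from the cited formulas rather than from Wordless's source. The default value of the smoothing parameter α in kilgarriff_ratio is a placeholder for the Smoothing Parameter setting, and the example counts are hypothetical.

```python
# Minimal sketches of selected effect size measures; illustrations of the
# cited formulas, not Wordless's exact code.
import math

def pmi(o11, o1x, ox1, oxx):
    # Pointwise Mutual Information (Church & Hanks, 1990):
    # log2 of observed over expected co-occurrence frequency
    return math.log2(o11 * oxx / (o1x * ox1))

def dice(o11, o1x, ox1):
    # Dice's coefficient: co-occurrences relative to both words' totals
    return 2 * o11 / (o1x + ox1)

def log_dice(o11, o1x, ox1):
    # logDice (Rychlý, 2008): 14 + log2 of Dice's coefficient
    return 14 + math.log2(dice(o11, o1x, ox1))

def kilgarriff_ratio(freq_obs, size_obs, freq_ref, size_ref, alpha=1):
    # Kilgarriff's ratio (2009): ratio of frequencies per million tokens,
    # smoothed by alpha (the "Smoothing Parameter" setting)
    return (freq_obs / size_obs * 1_000_000 + alpha) / \
           (freq_ref / size_ref * 1_000_000 + alpha)

print(pmi(30, 500, 80, 1_000_000))           # the "strong tea" example above
print(dice(30, 500, 80), log_dice(30, 500, 80))
print(kilgarriff_ratio(150, 1_000_000, 10, 2_000_000))
```
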

  1. Al-Heeti, K. N. (1984). Judgment analysis technique applied to readability prediction of Arabic reading material [Doctoral dissertation, University of Northern Colorado]. ProQuest Dissertations and Theses Global.
  2. Al-Tamimi, A., Jaradat, M., Aljarrah, N., & Ghanim, S. (2013). AARI: Automatic Arabic readability index. The International Arab Journal of Information Technology, 11(4), 370–378.
  3. Amstad, T. (1978). Wie verständlich sind unsere Zeitungen? [Unpublished doctoral dissertation]. University of Zurich.
  4. Anderson, J. (1983). Lix and Rix: Variations on a little-known readability index. Journal of Reading, 26(6), 490–496.
  5. Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die Schwierigkeitsstufen von Texten in deutscher Sprache. Jugend und Volk.
  6. Berry-Rogghe, G. L. M. (1973). The computation of collocations and their relevance in lexical studies. In A. J. Aiken, R. W. Bailey, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 103–112). Edinburgh University Press.
  7. Bormuth, J. R. (1969). Development of readability analyses. U.S. Department of Health, Education, and Welfare. http://files.eric.ed.gov/fulltext/ED029166.pdf
  8. Björnsson, C.-H. (1968). Läsbarhet. Liber.
  9. Brouwer, R. H. M. (1963). Onderzoek naar de leesmoeilijkheid van Nederlands proza. Paedagogische studiën, 40, 454–464. https://objects.library.uu.nl/reader/index.php?obj=1874-205260&lan=en
  10. Brunet, E. (1978). Le vocabulaire de Jean Giraudoux: Structure et évolution. Slatkine.
  11. Carroll, J. B. (1964). Language and thought. Prentice-Hall.
  12. Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x
  13. Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material. Human Resource Research Organization. https://ia902703.us.archive.org/31/items/ERIC_ED076707/ERIC_ED076707.pdf
  14. Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale-Chall readability formula. Brookline Books.
  15. Church, K. W., & Gale, W. A. (1991, September 29–October 1). Concordances for parallel text [Paper presentation]. Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, Oxford, United Kingdom.
  16. Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). Using statistics in lexical analysis. In U. Zernik (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon (pp. 115–164). Psychology Press.
  17. Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
  18. Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283–284. https://doi.org/10.1037/h0076540
  19. College Entrance Examination Board. (1981). Degrees of reading power brings the students and the text together.
  20. Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100. https://doi.org/10.1080/09296171003643098
  21. Crawford, A. N. (1985). Fórmula y gráfico para determinar la comprensibilidad de textos de nivel primario en castellano. Lectura y Vida, 6(4). http://www.lecturayvida.fahce.unlp.edu.ar/numeros/a6n4/06_04_Crawford.pdf
  22. Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: Statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid=
  23. Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University.
  24. Dale, E. (1931). A comparison of two word lists. Educational Research Bulletin, 10(18), 484–489.
  25. Dale, E., & Chall, J. S. (1948a). A formula for predicting readability. Educational Research Bulletin, 27(1), 11–20, 28.
  26. Dale, E., & Chall, J. S. (1948b). A formula for predicting readability: Instructions. Educational Research Bulletin, 27(2), 37–54.
  27. Danielson, W. A., & Bryan, S. D. (1963). Computer automation of two readability formulas. Journalism Quarterly, 40(2), 201–206. https://doi.org/10.1177/107769906304000207
  28. Dawood, B. A. K. (1977). The relationship between readability and selected language variables [Unpublished master’s thesis]. University of Baghdad.
  29. Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), Proceedings of the symposium on statistical association methods for mechanized documentation (pp. 61–148). National Bureau of Standards.
  30. Dias, G., Guilloré, S., & Pereira Lopes, J. G. (1999). Language independent automatic acquisition of rigid multiword units from unrestricted text corpora. In A. Condamines, C. Fabre, & M. Péry-Woodley (Eds.), TALN'99: 6ème Conférence Annuelle Sur le Traitement Automatique des Langues Naturelles (pp. 333–339). TALN.
  31. Dickes, P., & Steiwer, L. (1977). Ausarbeitung von Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 9(1), 20–28.
  32. Douma, W. H. (1960). De leesbaarheid van landbouwbladen: Een onderzoek naar en een toepassing van leesbaarheidsformules [Readability of Dutch farm papers: A discussion and application of readability-formulas]. Afdeling sociologie en sociografie van de Landbouwhogeschool Wageningen. https://edepot.wur.nl/276323
  33. Dugast, D. (1978). Sur quoi se fonde la notion d’étendue théorique du vocabulaire? Le Français Moderne, 46, 25–32.
  34. Dugast, D. (1979). Vocabulaire et stylistique: I théâtre et dialogue, travaux de linguistique quantitative. Slatkine.
  35. Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
  36. Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847.pdf
  37. El-Haj, M., & Rayson, P. (2016). OSMAN: A novel Arabic readability metric. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 250–255). European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2016/index.html
  38. Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University.
  39. Fang, I. E. (1966). The easy listening formula. Journal of Broadcasting, 11(1), 63–68. https://doi.org/10.1080/08838156609363529
  40. Farr, J. N., Jenkins, J. J., & Paterson, D. G. (1951). Simplification of Flesch reading ease formula. Journal of Applied Psychology, 35(5), 333–337. https://doi.org/10.1037/h0062427
  41. Fernández Huerta, J. (1959). Medidas sencillas de lecturabilidad. Consigna, 214, 29–32.
  42. Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1), 42–58. https://doi.org/10.2307/1411
  43. Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233. https://doi.org/10.1037/h0057532
  44. Franchina, V., & Vacca, R. (1986). Adaptation of Flesch readability index on a bilingual text written by the same author both in Italian and English languages. Linguaggi, 3, 47–49.
  45. Fucks, W. (1955). Unterschied des Prosastils von Dichtern und anderen Schriftstellern: Ein Beispiel mathematischer Stilanalyse. Bouvier.
  46. Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus approaches to discourse: A critical review (pp. 225–258). Routledge.
  47. Gabrielatos, C., & Marchi, A. (2012, September 13–14). Keyness: Appropriate metrics and practical issues [Conference session]. CADS International Conference 2012, University of Bologna, Italy.
  48. Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri
  49. Guiraud, P. (1954). Les caractères statistiques du vocabulaire: Essai de méthodologie. Presses universitaires de France.
  50. Gunning, R. (1968). The technique of clear writing (revised ed.). McGraw-Hill Book Company.
  51. Gutiérrez de Polini, L. E. (1972). Investigación sobre lectura en Venezuela [Paper presentation]. Primeras Jornadas de Educación Primaria, Ministerio de Educación, Caracas, Venezuela.
  52. Hardie, A. (2014, April 28). Log ratio: An informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/
  53. Herdan, G. (1955). A new derivation and interpretation of Yule's ‘Characteristic’ K. Zeitschrift für Angewandte Mathematik und Physik (ZAMP), 6(4), 332–339. https://doi.org/10.1007/BF01587632
  54. Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics. Mouton.
  55. Hofland, K., & Johanson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities.
  56. Honoré, A. (1979). Some simple measures of richness of vocabulary. Association of Literary and Linguistic Computing Bulletin, 7(2), 172–177.
  57. Johnson, W. (1944). Studies in language behavior: I. A program of research. Psychological Monographs, 56(2), 1–15. https://doi.org/10.1037/h0093508
  58. Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton.
  59. Kandel, L., & Moles, A. (1958). Application de l’indice de Flesch à la langue française [Applying Flesch index to French language]. The Journal of Educational Research, 21, 283–287.
  60. Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil
  61. Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (p. 171). University of Liverpool.
  62. Kilgarriff, A., & Tugwell, D. (2002). WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In Proceedings of the 8th Machine Translation Summit (pp. 187–190). European Association for Machine Translation.
  63. Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf
  64. Kromer, V. (2003). A usage measure based on psychophysical relations. Journal of Quantitative Linguistics, 10(2), 177–186. https://doi.org/10.1076/jqul.10.2.177.16718
  65. Lexical Computing. (2015, July 8). Statistics used in Sketch Engine. Sketch Engine. https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/
  66. Liau, T. L., Bassin, C. B., Martin, C. J., & Coleman, E. B. (1976). Modification of the Coleman readability formulas. Journal of Reading Behavior, 8(4), 381–386. https://journals.sagepub.com/doi/pdf/10.1080/10862967609547193
  67. Lijffijt, J., & Gries, S. T. (2012). Correction to Stefan Th. Gries’ “dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics, 17(1), 147–149. https://doi.org/10.1075/ijcl.17.1.08lij
  68. Lorge, I. (1944). Predicting readability. Teachers College Record, 45, 404–419.
  69. Lorge, I. (1948). The Lorge and Flesch readability formulae: A correction. School and Society, 67, 141–142.
  70. Lucisano, P., & Piemontese, M. E. (1988). GULPEASE: A formula for the prediction of the difficulty of texts in Italian. Scuola e Città, 39(3), 110–124.
  71. Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE) (pp. 198–202). IEEE. https://doi.org/10.1109/KSE.2018.8573379
  72. Lyne, A. A. (1985). Dispersion. In The vocabulary of French business correspondence: Word frequencies, collocations, and problems of lexicometric method (pp. 101–124). Slatkine/Champion.
  73. Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan.
  74. Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96.
  75. McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global.
  76. McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. https://doi.org/10.3758/BRM.42.2.381
  77. McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639–646.
  78. Muñoz Baquedano, M. (2006). Legibilidad y variabilidad de los textos. Boletín de Investigación Educacional, Pontificia Universidad Católica de Chile, 21(2), 13–26.
  79. Nirmaldasan. (2009, April 30). McAlpine EFLAW readability score. Readability Monitor. Retrieved November 15, 2022, from https://strainindex.wordpress.com/2009/04/30/mcalpine-eflaw-readability-score/
  80. Oakes, M. P. (1998). Statistics for corpus linguistics. Edinburgh University Press.
  81. Oborneva, I. V. (2006). Автоматизированная оценка сложности учебных текстов на основе статистических параметров [Automated assessment of the complexity of educational texts on the basis of statistical parameters] [Doctoral dissertation, Institute for Strategy of Education Development of the Russian Academy of Education]. Freereferats.ru. https://static.freereferats.ru/_avtoreferats/01002881899.pdf?ver=3
  82. O’Hayre, J. (1966). Gobbledygook has gotta go. U.S. Government Printing Office. https://www.governmentattic.org/15docs/Gobbledygook_Has_Gotta_Go_1966.pdf
  83. Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. Language and Computers, 68, 247–269.
  84. Partiko, Z. V. (2001). Zagal’ne redaguvannja. Normativni osnovi [General editing: Normative foundations]. Afiša.
  85. Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188–200). The South-Central Regional SAS Users' Group.
  86. Pedersen, T. (1998). Dependent bigram identification. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (p. 1197). AAAI Press.
  87. Pisarek, W. (1969). Jak mierzyć zrozumiałość tekstu? [How to measure the comprehensibility of a text?]. Zeszyty Prasoznawcze, 4(42), 35–48.
  88. Pojanapunya, P., & Todd, R. W. (2016). Log-likelihood and odds ratio keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory, 15(1), 133–167. https://doi.org/10.1515/cllt-2015-0030
  89. Popescu, I.-I., Mačutek, J., & Altmann, G. (2008). Word frequency and arc length. Glottometrics, 17, 18–42.
  90. Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter.
  91. Powers, R. D., Sumner, W. A., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99–105. https://doi.org/10.1037/h0043254
  92. Quasthoff, U., & Wolff, C. (2002). The Poisson collocation measure and its applications. In Proceedings of the 2nd International Workshop on Computational Approaches to Collocations. IEEE.
  93. Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée, 1, 103–127.
  94. Rychlý, P. (2008). A lexicographer-friendly association score. In P. Sojka & A. Horák (Eds.), Proceedings of Second Workshop on Recent Advances in Slavonic Natural Languages Processing. Masaryk University.
  95. Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124
  96. Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. https://doi.org/10.1038/163688a0
  97. Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38.
  98. Smith, E. A. (1961). Devereaux readability index. Journal of Educational Research, 54(8), 298–303. https://doi.org/10.1080/00220671.1961.10882728
  99. Smith, E. A., & Senter, R. J. (1967). Automated readability index. Aerospace Medical Research Laboratories. https://apps.dtic.mil/sti/pdfs/AD0667273.pdf
  100. Solomon, N. W. (2006). Qualitative analysis of media language [Unpublished doctoral dissertation]. Madurai Kamaraj University.
  101. Somers, H. H. (1966). Statistical methods in literary analysis. In J. Leeds (Ed.), The computer and literary style (pp. 128–140). Kent State University Press.
  102. Spache, G. (1953). A new readability formula for primary-grade reading materials. Elementary School Journal, 53(7), 410–413. https://doi.org/10.1086/458513
  103. Spache, G. (1974). Good reading for poor readers (Rev. 9th ed.). Garrard.
  104. Szigriszt Pazos, F. (1993). Sistemas predictivos de legibilidad del mensaje escrito: Fórmula de perspicuidad [Doctoral dissertation, Complutense University of Madrid]. Biblos-e Archivo. https://repositorio.uam.es/bitstream/handle/10486/2488/3907_barrio_cantalejo_ines_maria.pdf?sequence=1&isAllowed=y
  105. Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association.
  106. Tränkle, U., & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche Sprache [Cross-validation and recalculation of the readability formulas for the German language]. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231–244.
  107. Tuldava, J. (1975). Ob izmerenii trudnosti tekstov [On measuring the complexity of the text]. Uchenye zapiski Tartuskogo universiteta: Trudy po metodike prepodavaniya inostrannykh yazykov, 345, 102–120.
  108. Ure, J. (1971). Lexical density and register differentiation. In G. E. Perren & J. L. M. Trim (Eds.), Applications of linguistics (pp. 443–452). Cambridge University Press.
  109. Wheeler, L. R., & Smith, E. H. (1954). A practical readability formula for the classroom teacher in the primary grades. Elementary English, 31(7), 397–399.
  110. Williams, C. B. (1970). Style and vocabulary: Numerical studies. Griffin.
  111. Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In M. Bieswanger & A. Koll-Stobbe (Eds.), New approaches to the study of linguistic variability (pp. 3–11). Peter Lang.
  112. Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
  113. Zhang, H., Huang, C., & Yu, S. (2004). Distributional consistency: As a general method for defining a core lexicon. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, & R. Silva (Eds.), Proceedings of Fourth International Conference on Language Resources and Evaluation (pp. 1119–1122). European Language Resources Association.