Adding indentation to markup files (#947)
julian-risch authored Apr 7, 2021
1 parent 8894c4f commit 64ad953
Showing 7 changed files with 427 additions and 427 deletions.
276 changes: 138 additions & 138 deletions docs/_src/api/api/document_store.md

Large diffs are not rendered by default.

156 changes: 78 additions & 78 deletions docs/_src/api/api/file_converter.md
Base class for implementing file converters to transform input documents to text format.
**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.

<a name="base.BaseConverter.convert"></a>
#### convert
supplied metadata like author, url, and external IDs can be passed as a dictionary.
- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)
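
To implement a converter of your own, subclass `BaseConverter` and override `convert`. The following is a minimal sketch, assuming the import path `haystack.file_converter.base` (inferred from the anchor names in this file) and a plain dictionary return value with `text` and `meta` keys; both details are assumptions, not confirmed by this diff.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional

from haystack.file_converter.base import BaseConverter  # assumed import path


class UpperCaseConverter(BaseConverter):
    """Hypothetical converter that reads a plain-text file and upper-cases it."""

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        remove_numeric_tables: Optional[bool] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: str = "utf-8",
    ) -> Dict[str, Any]:
        text = Path(file_path).read_text(encoding=encoding)
        # A real converter would drop numeric table rows here when
        # remove_numeric_tables is set, and check the language via
        # validate_language() when valid_languages is given.
        return {"text": text.upper(), "meta": meta}
```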

<a name="base.BaseConverter.validate_language"></a>
class TextConverter(BaseConverter)
**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.

<a name="txt.TextConverter.convert"></a>
#### convert
Reads text from a txt file and executes optional preprocessing steps.
- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)

**Returns**:
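
As a hedged usage sketch (the import path `haystack.file_converter.txt` is inferred from the anchor above, the file path is hypothetical, and the dictionary return shape is an assumption):

```python
from haystack.file_converter.txt import TextConverter  # assumed import path

converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc = converter.convert(file_path="data/sample.txt",  # hypothetical file
                        meta={"author": "Jane Doe"})
print(doc["text"][:200])  # assumes the result is a dict with a "text" key
```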
Expand Down Expand Up @@ -153,15 +153,15 @@ For compliance with other converters we nevertheless opted for keeping the metho
- `file_path`: Path to the .docx file you want to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.
- `encoding`: Not applicable
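
A sketch of the call described above; the import path `haystack.file_converter.docx` is inferred from the anchors, and the file path and metadata are hypothetical. `encoding` is omitted since it is documented as not applicable for .docx files.

```python
from haystack.file_converter.docx import DocxToTextConverter  # assumed import path

converter = DocxToTextConverter()
doc = converter.convert(
    file_path="data/report.docx",   # hypothetical file
    meta={"source": "intranet"},
    remove_numeric_tables=True,
    valid_languages=["en"],
)
```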

<a name="tika"></a>
class TikaConverter(BaseConverter)

- `tika_url`: URL of the Tika server
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.

<a name="tika.TikaConverter.convert"></a>
#### convert
- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.
- `encoding`: Not applicable

**Returns**:
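
A usage sketch, assuming the import path `haystack.file_converter.tika` and a Tika server listening on its default port 9998; both the server URL and the file path are assumptions.

```python
from haystack.file_converter.tika import TikaConverter  # assumed import path

converter = TikaConverter(tika_url="http://localhost:9998/tika")  # assumed default URL
doc = converter.convert(file_path="data/scanned_report.pdf", meta=None)
```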
class PDFToTextConverter(BaseConverter)
**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.

<a name="pdf.PDFToTextConverter.convert"></a>
#### convert
Extract text from a .pdf file using the pdftotext library (https://www.xpdfreader.com/).

- `file_path`: Path to the .pdf file you want to convert
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
  Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
  The tabular structures in documents might be noise for the reader model if it
  does not have table parsing capability for finding answers. However, tables
  may also have long strings that could be possible candidates when searching
  for answers. The rows containing strings are thus retained when this option is enabled.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
  (https://en.wikipedia.org/wiki/ISO_639-1) format.
  This option can be used to add a test for encoding errors. If the extracted text is
  not one of the valid languages, it is likely an encoding error resulting
  in garbled text.
- `encoding`: Encoding that will be passed as the -enc parameter to pdftotext. "Latin 1" is the default encoding
  of pdftotext. While this works well on many PDFs, you might need to switch to "UTF-8" or
  another encoding if your document contains special characters (e.g. German umlauts, Cyrillic characters, ...),
  as shown in the sketch below.
  Note: with "UTF-8" we experienced cases where a simple "fi" gets wrongly parsed as
  "\xef\xac\x81c" (see test cases). That's why we keep "Latin 1" as the default here.
  (See the list of available encodings by running `pdftotext -listencodings` in the terminal.)
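
A sketch of switching the encoding for a document with German umlauts, assuming the import path `haystack.file_converter.pdf`; the file path is hypothetical, and "UTF-8" is one of the encodings reported by `pdftotext -listencodings`.

```python
from haystack.file_converter.pdf import PDFToTextConverter  # assumed import path

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de"])
# "UTF-8" instead of the "Latin 1" default, for umlauts etc. (see note above)
doc = converter.convert(file_path="data/bericht.pdf", encoding="UTF-8")
```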

4 changes: 2 additions & 2 deletions docs/_src/api/api/generator.md
See https://huggingface.co/transformers/model_doc/rag.html for more details
**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g.
  'facebook/rag-token-nq' or 'facebook/rag-sequence-nq'.
  See https://huggingface.co/models for a full list of available models.
- `model_version`: The version of the model to use from the HuggingFace model hub. Can be a tag name, branch name, or commit hash.
- `retriever`: `DensePassageRetriever` used to embed passages
- `generator_type`: Which RAG generator implementation to use, RAG-TOKEN or RAG-SEQUENCE (see the sketch below)
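
A construction sketch based on the arguments above. The import path `haystack.generator.transformers`, the `RAGeneratorType` enum, and the retriever being optional at construction time are all assumptions from Haystack's layout at the time, not confirmed by this diff.

```python
from haystack.generator.transformers import RAGenerator, RAGeneratorType  # assumed path

# In practice you would also pass retriever=<a DensePassageRetriever>,
# which the generator uses to embed passages.
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    generator_type=RAGeneratorType.TOKEN,  # or RAGeneratorType.SEQUENCE
)
```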