Adding translator with many generic input parameter support #782

Merged: 22 commits merged into deepset-ai:master on Feb 12, 2021
Conversation

@lalitpagaria (Contributor) commented on Jan 27, 2021

To complete #704.

Pending tasks:

  • Integrate it with the pipeline

The following tasks can be done in a separate PR:

  • Docstrings
  • Add an example or tutorial

@lalitpagaria (Contributor Author)

@tholor Please review this.
My main design choice was to make it generic, so it can be used at the beginning of the pipeline (to translate the query) or at the end of the pipeline (to translate the output of any node, i.e. generator, retriever, summarizer, etc.); see the sketch below.
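
A rough sketch of that idea using Pipeline.add_node; the node names, the retriever, and the OPUS checkpoints are illustrative assumptions, not part of this PR:

from haystack.pipeline import Pipeline
from haystack.translator import TransformersTranslator

# Translate incoming German queries to English before retrieval...
query_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-de-en",
    input_language_code="de",
    output_language_code="en",
)
# ...and translate the node's output back to German at the end.
output_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-en-de",
    input_language_code="en",
    output_language_code="de",
)

pipeline = Pipeline()
pipeline.add_node(component=query_translator, name="QueryTranslator", inputs=["Query"])
pipeline.add_node(component=retriever, name="Retriever", inputs=["QueryTranslator"])  # 'retriever' assumed to exist already
pipeline.add_node(component=output_translator, name="OutputTranslator", inputs=["Retriever"])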

@lalitpagaria (Contributor Author) commented on Jan 27, 2021

Also, can anyone take care of the pending tasks? And please suggest a better alternative for the mypy-related issue with the translate method.

It seems the transformers pipeline for translation is not consistent. It is not correctly picking up the "de" to "en" translation model. I tried multiple ways to integrate the model, but the translation model APIs are not consistent with regard to parameters.

@tholor what is your view on using the translator or translation packages (which use an external API for translation) for this task?

@tholor self-assigned this on Feb 2, 2021
@tholor (Member) commented on Feb 4, 2021

I just did a first review & testing round. I think the code design is great and fits well into the rest of Haystack.
My biggest concern is the quality of the translations and the chosen model.
From a quick test, the HF translation pipeline using t5-base seems to work decently on single sentences but starts summarizing / truncating if you pass longer texts. I am not talking about really long texts that exceed the transformers max_seq_len, but rather 3-4 sentences.

Example:

from haystack import Document
from haystack.translator import TransformersTranslator

# Two English documents of a few sentences each
DOCS = [
    Document(
        text="""PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.""",
    ),
    Document(
        text="""The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."""
    )
]

# Translate both documents from English to German with t5-base
trans = TransformersTranslator(
        model_name_or_path="t5-base",
        input_language_code="en",
        output_language_code="de",
        use_gpu=-1
)
translated = trans.translate(documents=DOCS)
print(translated)

The results basically only cover a part of the input:

[
   {
      "text":"PG&E erklärte, dass es die Stromausfälle als Reaktion auf Prognosen für hohe Winde unter trockenen Bedingungen geplant hatte, um das Risiko von Waldbränden zu verringern.",
      "id":"b4176841-33e2-403c-9a04-b33480a45cb5",
      "score":"None",
      "probability":"None",
      "question":"None",
      "meta":{
         
      },
      "embedding":"None"
   },
   {
      "text":"Der Eiffelturm ist die zweithöchste freistehende Struktur Frankreichs nach dem Millau Viaduct.",
      "id":"cd2980f0-bbce-4899-be64-ce0e8c1ae867",
      "score":"None",
      "probability":"None",
      "question":"None",
      "meta":{
         
      },
      "embedding":"None"
   }
]

I thought it might be due to the max_length param that you can pass to the t5 model. However, this defaults to 300 tokens, and even setting it to 1000 didn't change anything.

If we don't find a fix here, I'd suggest we use the OPUS models instead (e.g. https://huggingface.co/Helsinki-NLP/opus-mt-en-de). They are rather small, are available for most language pairs, and the translation results look decent.
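
For reference, a minimal sketch of swapping the OPUS checkpoint into the snippet above, assuming the TransformersTranslator constructor shown there accepts it unchanged:

from haystack.translator import TransformersTranslator

# Same setup as the t5-base example, but with the suggested OPUS model
trans = TransformersTranslator(
        model_name_or_path="Helsinki-NLP/opus-mt-en-de",
        input_language_code="en",
        output_language_code="de",
        use_gpu=-1
)
translated = trans.translate(documents=DOCS)  # DOCS as defined in the snippet above
print(translated)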

@tholor what is your view on using the translator or translation packages (which use an external API for translation) for this task?

I would not really like to integrate an external service here. Maybe later as a second option, but our default should always be free and open source.

@lalitpagaria (Contributor Author)

Thanks, I will update the PR to use the OPUS models.

@lalitpagaria (Contributor Author)

@tholor I have made the changes you suggested and also added an end-to-end translation pipeline. Please review these changes.
Can you please take care of the remaining tasks, i.e. docstrings and a tutorial?

@lalitpagaria (Contributor Author)

@tholor can you please look into this?

I don't want to have to resolve merge conflicts after seeing the list of current PRs 🙂

@tholor (Member) commented on Feb 10, 2021

@lalitpagaria yep, I am on it - but writing the docs always takes quite some time ;)

@lalitpagaria (Contributor Author)

Overall, what is your view on the pipeline changes?
I tried to keep it generic so people can plug the translator node into any type of pipeline (QA, FAQ, generator, summarizer, etc.).

Review threads on haystack/translator/transformers.py (outdated, resolved)
@tholor (Member) commented on Feb 11, 2021

Overall, what is your view on the pipeline changes?
I tried to keep it generic so people can plug the translator node into any type of pipeline (QA, FAQ, generator, summarizer, etc.).

I like the idea of "wrapping" an existing pipeline. I am not sure whether users will really use it rather than creating a new pipeline, but it's worth a try, and we can ask users for feedback :)
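
A hedged sketch of what such a wrapping pipeline could look like; the class and argument names (TranslationWrapperPipeline, input_translator, output_translator, pipeline) are assumptions about the interface added in this PR, and reader / retriever are assumed to already exist:

from haystack.pipeline import ExtractiveQAPipeline, TranslationWrapperPipeline
from haystack.translator import TransformersTranslator

# Translate German queries to English on the way in...
input_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-de-en",
    input_language_code="de",
    output_language_code="en",
)
# ...and translate the English answers back to German on the way out.
output_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-en-de",
    input_language_code="en",
    output_language_code="de",
)

base_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)  # reader/retriever assumed to exist
pipeline_with_translation = TranslationWrapperPipeline(
    input_translator=input_translator,
    output_translator=output_translator,
    pipeline=base_pipeline,
)
result = pipeline_with_translation.run(query="Wie hoch ist der Eiffelturm?")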

@lalitpagaria (Contributor Author)

Really liked the CI changes by @oryx1729.
Very fast feedback on PRs now. Great!!

@lalitpagaria (Contributor Author)

@tholor all green now
Need your green signal :)

@tholor merged commit 5bd94ac into deepset-ai:master on Feb 12, 2021
@lalitpagaria deleted the translation branch on February 12, 2021 15:19