Adding translator with many generic input parameter support #782

Merged: 22 commits merged into deepset-ai:master on Feb 12, 2021
Conversation

@lalitpagaria (Contributor) commented on Jan 27, 2021

To complete #704.

Pending tasks:

  • Integrate it with the pipeline

The following tasks can be done in a separate PR:

  • Docstrings
  • Add an example or tutorial

@lalitpagaria (Contributor Author)

@tholor Please review this.
My main design choice was to make it generic, so it can be used at the beginning of the pipeline (to translate the query) or at the end of the pipeline (to translate the output of any node, i.e. generator, retriever, summarizer, etc.); see the sketch below.
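
A rough sketch of that idea using Pipeline.add_node; the node names, the retriever, and the OPUS checkpoints are illustrative assumptions, not part of this PR:

from haystack.pipeline import Pipeline
from haystack.translator import TransformersTranslator

# Translate incoming German queries to English before retrieval...
query_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-de-en",
    input_language_code="de",
    output_language_code="en",
)
# ...and translate the node's output back to German at the end.
output_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-en-de",
    input_language_code="en",
    output_language_code="de",
)

pipeline = Pipeline()
pipeline.add_node(component=query_translator, name="QueryTranslator", inputs=["Query"])
pipeline.add_node(component=retriever, name="Retriever", inputs=["QueryTranslator"])  # 'retriever' assumed to exist already
pipeline.add_node(component=output_translator, name="OutputTranslator", inputs=["Retriever"])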

@lalitpagaria (Contributor Author) commented on Jan 27, 2021

Also, can anyone take care of the pending tasks? And please suggest a better alternative for the mypy-related issue with the translate method.

It seems the transformers pipeline for translation is not consistent. It is not correctly picking up the "de" to "en" translation model. I tried multiple ways to integrate the model, but the translation model APIs are not consistent with regard to parameters.

@tholor what is your view on using the translator or translation packages (which use an external API for translation) for this task?

@tholor self-assigned this on Feb 2, 2021
@tholor (Member) commented on Feb 4, 2021

I just did a first review & testing round. I think the code design is great and fits well into the rest of Haystack.
My biggest concern is the quality of the translations and the chosen model.
From a quick test, the HF translation pipeline using t5-base seems to work decently on single sentences but starts summarizing / truncating if you pass longer texts. I am not talking about really long texts that exceed the transformers max_seq_len, but rather 3-4 sentences.

Example:

from haystack import Document
from haystack.translator import TransformersTranslator

# Two English documents of a few sentences each
DOCS = [
    Document(
        text="""PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.""",
    ),
    Document(
        text="""The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."""
    )
]

# Translate both documents from English to German with t5-base
trans = TransformersTranslator(
        model_name_or_path="t5-base",
        input_language_code="en",
        output_language_code="de",
        use_gpu=-1
)
translated = trans.translate(documents=DOCS)
print(translated)

The results basically only cover a part of the input:

[
   {
      "text":"PG&E erklärte, dass es die Stromausfälle als Reaktion auf Prognosen für hohe Winde unter trockenen Bedingungen geplant hatte, um das Risiko von Waldbränden zu verringern.",
      "id":"b4176841-33e2-403c-9a04-b33480a45cb5",
      "score":"None",
      "probability":"None",
      "question":"None",
      "meta":{
         
      },
      "embedding":"None"
   },
   {
      "text":"Der Eiffelturm ist die zweithöchste freistehende Struktur Frankreichs nach dem Millau Viaduct.",
      "id":"cd2980f0-bbce-4899-be64-ce0e8c1ae867",
      "score":"None",
      "probability":"None",
      "question":"None",
      "meta":{
         
      },
      "embedding":"None"
   }
]

I thought it might be due to the max_length param that you can pass to the t5 model. However, this defaults to 300 tokens, and even setting it to 1000 didn't change anything.

If we don't find a fix here, I'd suggest we use the OPUS models instead (e.g. https://huggingface.co/Helsinki-NLP/opus-mt-en-de). They are rather small, are available for most language pairs, and the translation results look decent.
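
For reference, a minimal sketch of swapping the OPUS checkpoint into the snippet above, assuming the TransformersTranslator constructor shown there accepts it unchanged:

from haystack.translator import TransformersTranslator

# Same setup as the t5-base example, but with the suggested OPUS model
trans = TransformersTranslator(
        model_name_or_path="Helsinki-NLP/opus-mt-en-de",
        input_language_code="en",
        output_language_code="de",
        use_gpu=-1
)
translated = trans.translate(documents=DOCS)  # DOCS as defined in the snippet above
print(translated)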

@tholor what is your view on using the translator or translation packages (which use an external API for translation) for this task?

I would not really like to integrate an external service here. Maybe later as a second option, but our default should always be free and open source.

@lalitpagaria (Contributor Author)

Thanks, I will update the PR to use the OPUS models.

@lalitpagaria (Contributor Author)

@tholor I have made the changes you suggested and also added an end-to-end translation pipeline. Please review these changes.
Can you please take care of the remaining tasks, i.e. docstrings and a tutorial?

@lalitpagaria (Contributor Author)

@tholor can you please look into this?

I don't want to have to resolve merge conflicts after seeing the list of current PRs 🙂

@tholor (Member) commented on Feb 10, 2021

@lalitpagaria yep, I am on it - but writing the docs always takes quite some time ;)

@lalitpagaria (Contributor Author)

Overall, what is your view on the pipeline changes?
I tried to keep it generic so people can plug the translator node into any type of pipeline (QA, FAQ, generator, summarizer, etc.).

Review threads on haystack/translator/transformers.py (outdated, resolved)
@tholor (Member) commented on Feb 11, 2021

Overall, what is your view on the pipeline changes?
I tried to keep it generic so people can plug the translator node into any type of pipeline (QA, FAQ, generator, summarizer, etc.).

I like the idea of "wrapping" an existing pipeline. I am not sure whether users will really use it rather than creating a new pipeline, but it's worth a try, and we can ask users for feedback :)
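
A hedged sketch of what such a wrapping pipeline could look like; the class and argument names (TranslationWrapperPipeline, input_translator, output_translator, pipeline) are assumptions about the interface added in this PR, and reader / retriever are assumed to already exist:

from haystack.pipeline import ExtractiveQAPipeline, TranslationWrapperPipeline
from haystack.translator import TransformersTranslator

# Translate German queries to English on the way in...
input_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-de-en",
    input_language_code="de",
    output_language_code="en",
)
# ...and translate the English answers back to German on the way out.
output_translator = TransformersTranslator(
    model_name_or_path="Helsinki-NLP/opus-mt-en-de",
    input_language_code="en",
    output_language_code="de",
)

base_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)  # reader/retriever assumed to exist
pipeline_with_translation = TranslationWrapperPipeline(
    input_translator=input_translator,
    output_translator=output_translator,
    pipeline=base_pipeline,
)
result = pipeline_with_translation.run(query="Wie hoch ist der Eiffelturm?")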

@lalitpagaria (Contributor Author)

Really liked the CI changes by @oryx1729.
Very fast feedback on PRs now. Great!!

@lalitpagaria (Contributor Author)

@tholor all green now
Need your green signal :)

@tholor merged commit 5bd94ac into deepset-ai:master on Feb 12, 2021
@lalitpagaria deleted the translation branch on February 12, 2021 15:19