Made-up words in translations #843

Open
Tracked by #216
eu9ene opened this issue Sep 12, 2024 · 4 comments
Labels
quality Improving robustness and translation quality

Comments

@eu9ene
Collaborator

eu9ene commented Sep 12, 2024

We recently got a report about made-up words in Turkish translations.

I also tested the new ru-en model and noticed a lot of non-existent words there as well.

For example: https://habr.com/ru/articles/842924/

The title is translated as "Language environment and teacher-sdocument. Non-evous moments".
It should be "Language environment and native-speaking teachers. Non-obvious points".

We should investigate why this happens. It might be related to the recent robustness fixes or to using a shortlist.

@eu9ene added the quality label Sep 12, 2024
@eu9ene
Collaborator Author

eu9ene commented Sep 12, 2024

Translation by the teacher model:
"Language environment and native teachers. Non-obvious moments".

I don't see any made-up words in the full text either, although the fluency is far from perfect. Only a couple of informal abbreviations were carried over as is, for example "дз" (homework) as "DZ" and "выпускники инязов" (graduates of universities that specialize in foreign languages) as "graduates of inaz". The text is written in a very informal style, so I guess lower overall translation quality is expected.

This means the made-up words issue is specific to the student model.
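
A minimal sketch of how the teacher and student outputs could be compared automatically, assuming a reference English word list is available. The `find_unknown_words` helper and the tiny `vocabulary` set are hypothetical and exist only for illustration; a real check would load a full dictionary and would need to handle proper nouns and abbreviations.

```python
import re

def find_unknown_words(text, vocabulary):
    """Return words in `text` that are absent from `vocabulary`, in order of appearance."""
    # Match runs of letters (with an optional apostrophe part), so hyphenated
    # compounds such as "teacher-sdocument" are checked part by part.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    return [w for w in words if w.lower() not in vocabulary]

# Hypothetical, tiny word list; a real check would use a full English dictionary.
vocabulary = {
    "language", "environment", "and", "teacher", "teachers",
    "native", "non", "obvious", "moments",
}

student = "Language environment and teacher-sdocument. Non-evous moments"
teacher = "Language environment and native teachers. Non-obvious moments"

print(find_unknown_words(student, vocabulary))  # words flagged in the student output
print(find_unknown_words(teacher, vocabulary))  # words flagged in the teacher output
```

Running a check like this over a larger test set for both models would give a rough count of how often each one produces non-existent words.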

@eu9ene
Collaborator Author

eu9ene commented Sep 12, 2024

I don't see any made-up words in translations of news articles, where more formal language is used.

@gregtatum
Member

I wonder if this could be caused by the decoder being too shallow or too small. It would be good to experiment with this, and also to measure the difference in model performance.

@marco-c
Collaborator

marco-c commented Sep 17, 2024

Maybe lexical shortlisting could also be affecting this?
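
A toy illustration of why a shortlist might matter, under the assumption that the shortlist restricts which output subwords the decoder may choose at each step. This is not the translation engine's actual implementation; `pick_token`, the scores, and the subwords are invented for the example.

```python
def pick_token(scores, shortlist=None):
    """Return the highest-scoring token, optionally restricted to a shortlist."""
    if shortlist is None:
        candidates = scores
    else:
        # Keep only the tokens the shortlist allows.
        candidates = {tok: s for tok, s in scores.items() if tok in shortlist}
    return max(candidates, key=candidates.get)

# Hypothetical decoder scores for the next subword after "Non-".
scores = {"obvious": 2.1, "ev": 1.4, "trivial": 0.3}

print(pick_token(scores))                                # full vocabulary
print(pick_token(scores, shortlist={"ev", "trivial"}))   # "obvious" is not in the shortlist
```

If the preferred subword is missing from the shortlist, the decoder falls back to the next best allowed piece, which could start a non-existent word. Translating the same inputs with and without the shortlist would show whether this is actually happening.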
