en-tr translation quality feedback #816
Hi Selim! Thank you for the detailed analysis! So this story is a dialogue, likely an excerpt from a book. The language doesn't look too complex, but I suspect the model might be bad at translating dialogue, and the quotation marks might make it worse. However, I see there are a couple of sections without dialogue, just descriptions, and they still have critical errors in your annotation. I'd be curious to hear your feedback on translating regular news, something like a story from https://www.cnn.com/. Is it also full of major/critical errors and completely unusable?
Looking at the evals, we have ML-based COMET and lexical BLEU, the latter of which is supposed to be less reliable. COMET scores are similar for Google and Microsoft, but interestingly the BLEU scores for Google are a lot lower than Microsoft's. Based on the analysis of the story, where we see some critical and major errors in Google's translation, it might be that their model is quite bad too, so the -5% diff criterion doesn't work well here. Another hypothesis is that COMET is not great at Turkish. The evaluation datasets:
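For context, here is a minimal sketch of how such a metric comparison could be scripted with sacrebleu and Unbabel's comet package. The checkpoint name, the toy data, and the relative reading of the "-5% diff" criterion are assumptions here, not the project's actual evaluation setup:

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Toy parallel data (placeholders, not from the evaluation datasets above).
sources = ["The devil is in the details."]
hypotheses = ["Şeytan ayrıntıda gizlidir."]
references = ["Şeytan ayrıntıda gizlidir."]

# Lexical metric: corpus-level BLEU via sacrebleu.
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score

# Neural metric: COMET; wmt22-comet-da is one publicly available checkpoint.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = model.predict(
    [{"src": s, "mt": h, "ref": r}
     for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8,
    gpus=0,
).system_score

# One possible reading of the "-5% diff" criterion: ship only if our COMET
# score is no more than 5% (relative) below the strong baseline's score.
def within_diff(ours: float, baseline: float, max_rel_drop: float = 0.05) -> bool:
    return ours >= baseline * (1.0 - max_rel_drop)

GOOGLE_COMET = 0.85  # placeholder for the baseline system's score
print(f"BLEU={bleu:.1f} COMET={comet:.4f} ship={within_diff(comet, GOOGLE_COMET)}")
```

The failure mode described in this thread is exactly what such a relative check can't catch: if the baseline system is itself bad for the language pair, being within 5% of it says little about usability.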
Sure thing. Here's an article from CNN. It's slightly better but still not acceptable, IMO.
I've also checked some articles from Wikipedia and theverge.com, but the quality seems to be consistently low. If there's any particular article you're interested in, just let me know.
I see, thanks! Just to be clear about your notation: do those minor/major/critical errors correspond to wrong meaning or to grammatical errors? For example, for the en-ru model, which is not meeting our criteria yet, I see mostly correct meaning, but the fluency can be pretty bad (word choice, grammar, etc.).
Minor errors correspond to fluency issues and bad word choices. Major errors might include a few mistranslated words or grammar mistakes (such as verb tense or syntax), but you still get the general meaning of the sentence. Critical errors are incorrect or meaningless translations where you can't comprehend what the sentence actually means.
@selimsum could you please have a look at the same pieces translated by our large teacher model? (We later compress it to the small model that you use in the browser.) It can help us identify whether the issue is in teacher training or in compression. Story 1:
Story 2:
@gregtatum btw, how do we do segment splitting for the first one (https://learnenglish.britishcouncil.org/general-english/story-zone/a2-b1-stories/devils-details-a2/b1)? Is it by paragraph/newline or by sentence? It would be hard to split by sentence correctly, since the periods sit inside the quotes and are sometimes missing altogether. I split just by newline when translating with the teacher.
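To make the trade-off concrete, here's a toy sketch, not the pipeline's actual splitter; the regex is a deliberately naive stand-in for a real sentence segmenter:

```python
import re

text = (
    '"Where is it?" he asked.\n'
    '"I don\'t know," she said "maybe upstairs"\n'
    'She went up to look.'
)

# Newline splitting is trivial and never cuts a quote in half,
# but it can lump several sentences into one segment.
by_newline = text.split("\n")

# A naive splitter that breaks after ". ", "? ", "! " misfires here:
# the "?" hides inside a closing quote, and one period is missing entirely.
by_sentence = re.split(r"(?<=[.?!])\s+", text.replace("\n", " "))

print(by_newline)   # 3 segments, one per source line
print(by_sentence)  # only 2 segments: both dialogue boundaries are missed
```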
These are much better and do not include made-up words. I'd consider these acceptable and shippable.
@eu9ene Do you think this would be fixable with an update soon? If it's going to take some time, we should consider pulling en-tr from the release, at least for now.
Hi Selim, I discussed this with the team. We're investigating and working on a fix, but it will take some time to roll out a new model. We will consider pulling the model based on the results of the investigation. The whole feature is under a BETA label, and mistakes are expected for some models to some extent. We're still figuring out where the border between "minimal usefulness" and "completely unacceptable" lies. Thanks again for the detailed feedback!
I've been playing around with en-tr translations and I'd like to share some feedback.
I chose this story for a detailed comparison with Google Translate.
In the Google Docs linked below, I've highlighted each sentence with colors corresponding to four levels of translation quality. Obviously, these are my own interpretations.
Our model also generated some made-up words (13 out of 800 words). I've marked them as well.
It's quite interesting that our COMET score turns out to be within 5% of Google Translate's. I would rate our translation quality much lower than that gap suggests, which makes me think COMET shouldn't be considered a reliable metric for Turkish.
In my opinion, our model's current translation quality in general is not good enough to be shipped yet.
The devil's in the details - Firefox Translations
The devil's in the details - Google Translate