List of languages in development #91

Closed
rkcosmos opened this issue Jul 10, 2020 · 91 comments
Labels
help wanted Extra attention is needed Language Request Request for new language support

Comments

@rkcosmos
Contributor

rkcosmos commented Jul 10, 2020

I will update/edit this issue to track the development process of new languages. The current list is:

Group 1 (Arabic script)

  • Arabic (DONE, August 5, 2020)
  • Uyghur (DONE, August 5, 2020)
  • Persian (DONE, August 5, 2020)
  • Urdu (DONE, August 5, 2020)

Group 2 (Latin script)

  • Serbian-latin (DONE, July 12, 2020)
  • Occitan (DONE, July 12, 2020)

Group 3 (Devanagari)

  • Hindi (DONE, July 24, 2020)
  • Marathi (DONE, July 24, 2020)
  • Nepali (DONE, July 24, 2020)
  • Rajasthani (NEED HELP)
  • Awadhi, Haryanvi, Sanskrit (if possible)

Group 4 (Cyrillic script)

  • Russian (DONE, July 29, 2020)
  • Serbian-cyrillic (DONE, July 29, 2020)
  • Bulgarian (DONE, July 29, 2020)
  • Ukrainian (DONE, July 29, 2020)
  • Mongolian (DONE, July 29, 2020)
  • Belarusian (DONE, July 29, 2020)
  • Tajik (DONE, April 20, 2021)
  • Kyrgyz (NEED HELP)

Group 5

  • Telugu (DONE, November 17, 2020)
  • Kannada (DONE, November 17, 2020)

Group 6 (Languages that don't share characters with others)

  • Tamil (DONE, August 10, 2020)
  • Hebrew (ready to train)
  • Malayalam (ready to train)
  • Bengali + Assamese (DONE, August 23, 2020)
  • Punjabi (ready to train)
  • Abkhaz (ready to train)

Group 7 (Improvement and possible extra models)

  • Japanese version 2 (DONE, March 21, 2021) + vertical text
  • Chinese version 2 (DONE, March 21, 2021) + vertical text
  • Korean version 2 (DONE, March 21, 2021)
  • Latin version 2 (DONE, March 21, 2021)
  • Math + Greek?
  • Number + symbol only

Guideline for new language request

To request support for a new language, please send a PR with the following two files:

  1. In the folder easyocr/character, we need 'yourlanguagecode_char.txt', which contains a list of all characters. Please see the format/examples in the other files in that folder.
  2. In the folder easyocr/dict, we need 'yourlanguagecode.txt', which contains a list of words in your language. On average we have ~30,000 words per language, with more than 50,000 words for popular ones. More is better in this file.
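
Before sending the PR, it may help to check that every character used in the word list also appears in the character list. The following is only a minimal sketch: it assumes one character per line in the character file and one word per line in the dictionary file, and the paths with the `xx` language code as well as the helper name `missing_characters` are placeholders, not part of EasyOCR.

```python
from collections import Counter
from pathlib import Path


def missing_characters(char_lines, word_lines):
    """Count characters that occur in the words but not in the character list.

    Assumes one character per line in char_lines and one word per line in
    word_lines (an assumption about the file layout, not a documented format).
    """
    known = {line.strip() for line in char_lines if line.strip()}
    missing = Counter()
    for word in word_lines:
        for ch in word.strip():
            if ch not in known:
                missing[ch] += 1
    return missing


if __name__ == "__main__":
    # Placeholder paths: replace "xx" with your language code.
    char_path = Path("easyocr/character/xx_char.txt")
    dict_path = Path("easyocr/dict/xx.txt")
    if char_path.exists() and dict_path.exists():
        chars = char_path.read_text(encoding="utf-8").splitlines()
        words = dict_path.read_text(encoding="utf-8").splitlines()
        for ch, count in missing_characters(chars, words).most_common():
            print(f"U+{ord(ch):04X} {ch!r}: used {count} times, not in the character list")
```

An empty result means the two files are consistent under these assumptions; anything reported is either a character to add or a word to clean up.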

If your language has unique elements (such as 1. Arabic: characters change form when attached to each other, and text is written from right to left; 2. Thai: some characters need to be above the line and some below), please educate me to the best of your ability and/or give useful links. It is important to take care of the details to achieve a system that really works.

Lastly, please understand that my priority has to go to popular languages or sets of languages that share most of their characters (also tell me if your language shares a lot of characters with others). It takes me at least a week to work on a new model. You may have to wait a while for a new model to be released.

@rkcosmos rkcosmos added the Language Request Request for new language support label Jul 10, 2020
@rkcosmos rkcosmos pinned this issue Jul 10, 2020
@madrugado

For group 4 you could add Ukrainian, Bulgarian, and maybe Mongolian; although it is not Slavic, it uses the Cyrillic script.

@edloginova

Do you plan to only work with human languages? It would be amazing to add a model to recognize mathematical formulas.

@manohar-cyber

I think Tamil and Telugu can be added to one group because they belong to a language family called 'Dravidian', meaning they relate to each other in terms of grammar and word arrangement. Two other languages popular in India that belong to that family can also be added to that group: Kannada and Malayalam (for further info: https://en.m.wikipedia.org/wiki/Dravidian_languages). Moreover, Telugu and Kannada share some common letters and words. I will be adding the alphabet and words of the Kannada language for a language request.
Great project, keep it up👍

@bgmastermind

For Group 4
Bulgarian
dict bg.txt
char bg_char.txt

@upadhyayprakash

upadhyayprakash commented Jul 10, 2020

I'd highly recommend supporting the Devanagari script (Wiki - https://en.wikipedia.org/wiki/Devanagari), which is the fourth most widely adopted writing system in the world. Please go through the Wikipedia link to understand its widespread usage across languages including Sanskrit, Hindi, Marathi, Awadhi, and Haryanvi.

I see you have included "Hindi" as a target language, which of course, is the most spoken language in the Indian Subcontinent.

If you could let me know the current word count you have (maybe share the "dict" and "alphabets" directories), I can continue with the research and share more details about the language, as it's my first language.

Hindi has 47 primary alphabets (including 14 Vowels & 33 Consonants).

You can contact me @ prakash.upadhyay93@yahoo.com

@arashjafari

Can I help with the Persian (Farsi) language? I can supply some popular words and characters.

@junaidgirkar

Can I contribute in any way? I am fluent in Hindi as well as English. I may also be of help with programming: I know Python, C, and Java, and I am good at front-end work with HTML, CSS, and JavaScript (basic).

@manmeet3591

I recommend adding the Punjabi language, which is the 10th most spoken language in the world.
pb_char.txt

@rkcosmos rkcosmos added the help wanted Extra attention is needed label Jul 11, 2020
@rkcosmos
Contributor Author

@edloginova After doing human language, we can explore math as well.

@upadhyayprakash Lists are here easyocr/character and easyocr/dict

@arashjafari looks like we already have both words and char. You can recheck if everything is alright.

@junaidgirkar sounds good, I'll keep in mind. May call you for help.

@rahilwazir

rahilwazir commented Jul 12, 2020

@rkcosmos Can we add support for Urdu? It is very similar to Persian and Arabic (without many of the complexities of Arabic, though).

@fnasim
Contributor

fnasim commented Jul 12, 2020

@rkcosmos For Group 1, could you please add Urdu to that group? Urdu is very similar to Arabic and Persian and I've just submitted the PR for the character list and a dictionary. So it should be ready to go!

cc: @rahilwazir

@loayamin
Contributor

This might help for Arabic:

https://github.com/OSINTAI/Arabic_Words

@sardasumit
Contributor

sardasumit commented Jul 13, 2020

I added the Marathi character and dictionary data set files; please train it.
mr.txt

@rkcosmos
Contributor Author

@sardasumit did you forget a link for mr_char.txt?

@sardasumit
Contributor

@rkcosmos It is the same as the Hindi character set.
mr_char.txt

@nishad
Contributor

nishad commented Jul 15, 2020

@rkcosmos
Malayalam (https://en.wikipedia.org/wiki/Malayalam), belongs to Group 6.
#143
This PR contains character and word lists.

@imvladikon
Contributor

imvladikon commented Jul 15, 2020

Hi! Thanks for your work. Some notes about Hebrew: some letters have final forms (that is, a letter changes its form when it is placed at the end of a word): https://en.wikipedia.org/wiki/Final_form. There are also diacritical signs (https://en.wikipedia.org/wiki/Niqqud) that are used to represent vowels or to distinguish between alternative pronunciations of letters (Arabic also has final forms, among others, and diacritical signs). I didn't provide the diacritical signs; I assume it's better to train without them first (everyday writing consists of plain letters without diacritical signs).

@imvladikon
Contributor

I remembered an important thing. In Hebrew there is a cursive script (https://en.wikipedia.org/wiki/Cursive_Hebrew), and sometimes people mix it with regular writing, even in printed matter. It's the same letters (chars), but, let's say, in another font (e.g. https://opensiddur.org/wp-content/uploads/fonts/display-font-charmap.php?fnt=DorianCLM-Italic ). Maybe it's also better not to implement this immediately; I don't know.

@rkcosmos
Contributor Author

@nishad Malayalam and Tamil are both Dravidian but do not use the same script, so I have to build two models.
@imvladikon ok, will try to keep this in mind when building Hebrew model.

@rkcosmos
Contributor Author

Question for Indian-language speakers: I'm looking into the Hindi char and dict files, and there are a lot of characters that appear in the word list but not in the char list. Examples are
['ा', '्', 'ि', 'ी', 'ं', 'ो', 'ु', 'ँ', 'ू', 'ड़', 'ै']. What are these symbols?

@Vijayabhaskar96
Contributor

@rkcosmos Those are parts of the existing alphabet; when combined with a letter, they create a new letter. I think the technical term is grapheme, but I'm not sure. I would like to know whether they render fine or whether something happens like it did with Tamil.
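
For reference, the symbols in the question can be inspected with Python's standard `unicodedata` module, which reports them as combining marks (dependent vowel signs, the virama, and nasalization signs) rather than standalone letters. This is only a small sketch: the code points are written as escapes, and the entry that renders as ड़ is assumed here to be the base letter plus the nukta sign U+093C, so only the nukta is listed.

```python
import unicodedata

# Code points of the symbols from the question, written as escapes so the
# example does not depend on font rendering. U+093C (nukta) stands in for
# the 'ड़' entry, which is assumed to be a base letter plus nukta.
SYMBOLS = ["\u093e", "\u094d", "\u093f", "\u0940", "\u0902", "\u094b",
           "\u0941", "\u0901", "\u0942", "\u093c", "\u0948"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)


def is_combining_mark(ch):
    """True for characters in the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


if __name__ == "__main__":
    for ch in SYMBOLS:
        print(*describe(ch), is_combining_mark(ch))
```

Because these marks occur in ordinary words, a character list that should cover the word list would need to include them as entries of their own.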

@rkcosmos
Contributor Author

@Vijayabhaskar96 So far, Devanagari doesn't have any problems; it works well with Unicode.

@imvladikon
Contributor

Another addition about Hebrew ;) and it's important. Some diacritic signs matter, like geresh and gershayim. Geresh is used with ג ז צ for sounds that are not represented in the alphabet (j, g, ch), and the double geresh (gershayim) is used for widespread short phrases and abbreviations (kitsur); the most famous is תנ"ך (Tanakh). Sometimes people use ordinary quotation marks or apostrophes instead of typing geresh or gershayim (e.g. תנ''ך).
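
One way to handle the substitution described above is to normalize apostrophes and quotation marks to the dedicated geresh (U+05F3) and gershayim (U+05F4) characters before building a word list. The sketch below is an illustration only: the function name is hypothetical, and the rule is a heuristic that would also convert a genuine closing single quote placed right after a Hebrew letter.

```python
import re

HEBREW_LETTER = "[\u05d0-\u05ea]"
GERESH = "\u05f3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05f4"  # HEBREW PUNCTUATION GERSHAYIM

# Two apostrophes or one double quote between Hebrew letters -> gershayim.
_GERSHAYIM_RE = re.compile(f"(?<={HEBREW_LETTER})(?:''|\")(?={HEBREW_LETTER})")
# A single apostrophe directly after a Hebrew letter -> geresh.
_GERESH_RE = re.compile(f"(?<={HEBREW_LETTER})'")


def normalize_geresh(text):
    """Replace ASCII stand-ins with geresh/gershayim (heuristic, see above)."""
    # Gershayim first, so that "''" is not turned into two geresh marks.
    text = _GERSHAYIM_RE.sub(GERSHAYIM, text)
    return _GERESH_RE.sub(GERESH, text)
```

Whether a model should be trained on the dedicated characters, the ASCII stand-ins, or both is a separate decision that this sketch does not settle.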

@tsaidevin

Excuse me, will there be a chi_tra version 2?

@rkcosmos
Contributor Author

@tsaidevin yes.

@abhishekvermasg

Are the model training scripts not available? If somebody wants to train on a new language, how can they contribute to improving the model?

@DaniSubodh

Hi @rkcosmos, really impressive work with the OCR framework.
I couldn't find a code for Greek language here, but I see (Greek + math) in the development list above. Do you know if the Greek language itself will be supported anytime soon?

@PhonVanna

Hi @rkcosmos , really impressive work.
Will you be adding the Khmer language?

@JoeDelK
Contributor

JoeDelK commented Jun 14, 2021

Hi @rkcosmos,

We created a dictionary and dict_char for the Greek language; you can find it here

@rkcosmos rkcosmos unpinned this issue Jul 17, 2021
@KarmaCST

KarmaCST commented Aug 1, 2021

Can you add Dzongkha? Dzongkha is the national language of Bhutan, and it is similar to the Tibetan language. Like Thai, it is written continuously from left to right with no whitespace between words. The following paper discusses next-syllable prediction for Dzongkha: https://doi.org/10.1016/j.jksuci.2021.01.001

@weihaulee

Hi @rkcosmos,
Is there any plan to add English (vertical text) and digits (vertical text)?

Thanks in advance.

@vneseresearcher

vneseresearcher commented Sep 29, 2021

Hi @rkcosmos , Can you share the Japanese dataset you used to train? Thanks a lot!

@ajeetmishra

Hi @rkcosmos. Thank you very much for the efforts you are taking. Is there a plan to include the Indian languages - Gujarati and Oriya ?

@amirashe

Hi @rkcosmos, do you know if Hebrew will be ready soon? Thanks a lot!

@FtmsdtHosseini

FtmsdtHosseini commented Dec 5, 2021

Hello @rkcosmos, thank you for your great work.
I ran version 1.4.1 for Persian (Farsi), but the problem I am facing is that Persian is written and read from right to left, while the model detects words from left to right. This confuses text detection and reduces accuracy.
Is there any solution to fix this issue?

@babraham123
Contributor

Hi @rkcosmos, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people.
#616

One potential issue is that Amharic words contain a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation, and so on. Thus, a single verb may morph in a number of ways that are not all included in the dictionary.

@babraham123
Contributor

Hi @rkcosmos, I also submitted a PR for the Tigrinya language, which is similar to Amharic and spoken by over 10 million people.
#615

It has the same mutation issue as Amharic. Also, Arabic numerals are very common, even though the language has its own numeral system.

@98-Jane

98-Jane commented Feb 10, 2022

@rkcosmos Question: Why is the Chinese dict in pinyin rather than Chinese characters? In the dict folder, I cannot find a Chinese dict that is not pinyin. How is this mapping achieved? If I want to add some words to the Chinese dict, how do I add training data and dict entries?

@reemastha706

@rkcosmos Has the Greek language been updated? I saw someone contributing Greek files in the comments.

@rkcosmos rkcosmos closed this as completed Aug 7, 2022
@Saranga99

Does EasyOCR support the Sinhala language?

@amits-ds

Hey, thank you for this Repo 🙏
Is there an update on a model for Hebrew OCR?

@minamohamadii

I want to use the Farsi language, but I see that the model is not fine-tuned on the Farsi 5 and is fine-tuned on the Arabic 5 instead:
5 farsi = ۵
5 arabic = ٥
This has become a problem for me because, for Persian text, it shows me 0 or ٥. Can you please improve the Farsi language support? It would help me so much.
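
Until the model itself distinguishes the two digit sets, one possible workaround is to post-process recognized strings and map Arabic-Indic digits (U+0660 to U+0669) to the Extended Arabic-Indic digits used in Persian (U+06F0 to U+06F9). This is a sketch of that idea only; the function name is hypothetical, it is not part of EasyOCR, and it does not help when a digit is misread as a different character such as 0.

```python
# Map U+0660..U+0669 (Arabic-Indic digits) to U+06F0..U+06F9
# (Extended Arabic-Indic digits, the forms used in Persian).
_ARABIC_TO_PERSIAN_DIGITS = {0x0660 + i: 0x06F0 + i for i in range(10)}


def to_persian_digits(text):
    """Return text with Arabic-Indic digits replaced by their Persian forms."""
    return text.translate(_ARABIC_TO_PERSIAN_DIGITS)


if __name__ == "__main__":
    recognized = "\u0665"  # Arabic-Indic digit five
    print(to_persian_digits(recognized) == "\u06f5")
```

The function would be applied to each recognized string after OCR; text that contains no Arabic-Indic digits is returned unchanged.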

@ashudhatma

For Group 3 (Devanagari)
Request to add Gujarati Language
gu.txt
gu_char.txt

@nmermigas

Hey, I saw that the issue about support for the Greek language is completed, and I can see the two required .txt documents for Greek. However, I cannot find the Greek language code ('gre' in the repo) in the list of supported languages on the website. Is Greek actually supported?

@haroon423

Is the Urdu language not supported by the EasyOCR model?

@AinazRafiei

There is a mistake in the name of Group 1. It should be "Persian script". If you search, you will see that Persian is the mother language of the others, and that the rest (Arabic, Urdu, and Uyghur) were taken from it (the Persian language).

@bereket42

bereket42 commented May 21, 2024

Please let me know if Amharic or Tigrinya can be added, thanks! @AinazRafiei
#91 (comment)
#91 (comment)

@DanielVegaVega

@nmermigas looking for Greek as well. Could you find a way to "train" EasyOCR for it? Or is it something that the developer team must train?

@IordanisSap

I am also looking for Greek.

@snimavat

Gujarati language please
