List of languages in development #91

rkcosmos · 2020-07-10T04:20:14Z

I will update/edit this issue to track the development process of new languages. The current list is:

Group 1 (Arabic script)

Arabic (DONE, August 5, 2020)
Uyghur (DONE, August 5, 2020)
Persian (DONE, August 5, 2020)
Urdu (DONE, August 5, 2020)

Group 2 (Latin script)

Serbian-Latin (DONE, July 12, 2020)
Occitan (DONE, July 12, 2020)

Group 3 (Devanagari)

Hindi (DONE, July 24, 2020)
Marathi (DONE, July 24, 2020)
Nepali (DONE, July 24, 2020)
Rajasthani (NEED HELP)
Awadhi, Haryanvi, Sanskrit (if possible)

Group 4 (Cyrillic script)

Russian (DONE, July 29, 2020)
Serbian-Cyrillic (DONE, July 29, 2020)
Bulgarian (DONE, July 29, 2020)
Ukrainian (DONE, July 29, 2020)
Mongolian (DONE, July 29, 2020)
Belarusian (DONE, July 29, 2020)
Tajik (DONE, April 20, 2021)
Kyrgyz (NEED HELP)

Group 5

Telugu (DONE, November 17, 2020)
Kannada (DONE, November 17, 2020)

Group 6 (Languages that don't share characters with others)

Tamil (DONE, August 10, 2020)
Hebrew (ready to train)
Malayalam (ready to train)
Bengali + Assamese (DONE, August 23, 2020)
Punjabi (ready to train)
Abkhaz (ready to train)

Group 7 (Improvement and possible extra models)

Japanese version 2 (DONE, March 21, 2021) + vertical text
Chinese version 2 (DONE, March 21, 2021) + vertical text
Korean version 2 (DONE, March 21, 2021)
Latin version 2 (DONE, March 21, 2021)
Math + Greek?
Number+symbol only

Guideline for new language request

To request support for a new language, I need you to send a PR with the following 2 files:

In the folder easyocr/character, we need 'yourlanguagecode_char.txt', which contains a list of all characters. Please see the format/example in other files in that folder.
In the folder easyocr/dict, we need 'yourlanguagecode.txt', which contains a list of words in your language. On average we have ~30000 words per language, with more than 50000 words for popular ones. More is better in this file.
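
Before opening a PR, it may help to check that every character used in the word list also appears in the character list. The sketch below is only an illustration: the helper names `missing_characters` and `check_files` are made up here, and it assumes both files are UTF-8 text with one entry per line, which should be confirmed against the existing files in those folders.

```python
from collections import Counter
from pathlib import Path


def missing_characters(char_lines, word_lines):
    """Count characters used in the words that are absent from the char list.

    Both arguments are iterables of strings (assumed: one entry per line).
    """
    known = set()
    for line in char_lines:
        known.update(line.strip())
    counts = Counter()
    for word in word_lines:
        for ch in word.strip():
            if ch not in known:
                counts[ch] += 1
    return counts


def check_files(char_path, dict_path):
    """Run the same check on two files on disk (paths are placeholders)."""
    chars = Path(char_path).read_text(encoding="utf-8").splitlines()
    words = Path(dict_path).read_text(encoding="utf-8").splitlines()
    return missing_characters(chars, words)


if __name__ == "__main__":
    # Tiny in-memory example: the first word contains a virama (U+094D)
    # and a vowel sign (U+0947) that the sample char list does not include.
    sample_chars = ["\u0928", "\u092e", "\u0938", "\u0924"]
    sample_words = ["\u0928\u092e\u0938\u094d\u0924\u0947", "\u092e\u0928"]
    for ch, n in missing_characters(sample_chars, sample_words).most_common():
        print(f"U+{ord(ch):04X} seen {n} time(s)")
```

Characters reported by a check like this are either genuinely missing from the character list or are noise in the word list, so each one is worth a manual look.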

If your language has unique elements (for example, 1. Arabic: characters change form when attached to each other, and text is written from right to left; 2. Thai: some characters sit above the line and some below), please educate me to the best of your ability and/or give useful links. It is important to take care of the details to achieve a system that really works.

Lastly, please understand that my priority will have to go to popular languages or to sets of languages that share most of their characters (also tell me if your language shares a lot of characters with others). It takes me at least a week to work on a new model. You may have to wait a while for a new model to be released.

madrugado · 2020-07-10T05:59:44Z

For Group 4 you could add Ukrainian, Bulgarian, and maybe Mongolian; although it is not Slavic, it uses the Cyrillic script.

edloginova · 2020-07-10T06:27:46Z

Do you plan to only work with human languages? It would be amazing to add a model to recognize mathematical formulas.

manohar-cyber · 2020-07-10T11:04:45Z

I guess Tamil and Telugu can be added to one group because they belong to a language family called 'Dravidian', meaning they relate to each other in terms of grammar and word arrangement. Two other languages that are popular in India and belong to that family can also be added to that group: Kannada and Malayalam (for further info: https://en.m.wikipedia.org/wiki/Dravidian_languages). Moreover, Telugu and Kannada share some common letters and words. I will be adding the alphabet and words of the Kannada language for the language request.
Great project, keep it up👍

bgmastermind · 2020-07-10T11:58:10Z

For Group 4
Bulgarian
dict bg.txt
char bg_char.txt

upadhyayprakash · 2020-07-10T12:13:34Z

I'd highly recommend supporting the Devanagari script (Wiki - https://en.wikipedia.org/wiki/Devanagari), which is the fourth most widely adopted writing system in the world. Please go through the Wikipedia link to understand its widespread usage across most ancient languages, including Sanskrit, Hindi, Marathi, Awadhi, and Haryanvi.

I see you have included "Hindi" as a target language, which, of course, is the most spoken language in the Indian subcontinent.

If you could let me know the current word count you have (maybe share the "dict" and "alphabets" directories), I can continue the research and share more details about the language, as it's my first language.

Hindi has 47 primary letters (14 vowels and 33 consonants).

You can contact me @ prakash.upadhyay93@yahoo.com

arashjafari · 2020-07-10T15:01:32Z

Can I help with the Persian (Farsi) language? I can supply some popular words and characters.

@rkcosmos

junaidgirkar · 2020-07-10T15:55:11Z

Can I contribute in any way? I am fluent in Hindi as well as English. I may also be of help with the programming: I know Python, C, and Java, and I am good at front-end work with HTML, CSS, and JavaScript (basic).

manmeet3591 · 2020-07-10T18:43:29Z

I recommend adding the Punjabi language, which is the 10th most spoken language in the world.
pb_char.txt

rkcosmos · 2020-07-11T07:42:25Z

@edloginova After doing human languages, we can explore math as well.

@upadhyayprakash The lists are here: easyocr/character and easyocr/dict

@arashjafari It looks like we already have both the word list and the character list. You can recheck whether everything is all right.

@junaidgirkar Sounds good, I'll keep that in mind. I may call on you for help.

rahilwazir · 2020-07-12T01:59:11Z

@rkcosmos Can we add support for Urdu? It is very similar to Persian and Arabic (though without many of the complexities of Arabic).

fnasim · 2020-07-12T05:00:33Z

@rkcosmos For Group 1, could you please add Urdu to that group? Urdu is very similar to Arabic and Persian and I've just submitted the PR for the character list and a dictionary. So it should be ready to go!

cc: @rahilwazir

loayamin · 2020-07-12T21:53:04Z

This might help for Arabic:

https://github.com/OSINTAI/Arabic_Words

sardasumit · 2020-07-13T15:47:06Z

I added the Marathi character and dictionary data set files; please train it.
mr.txt

rkcosmos · 2020-07-14T01:37:32Z

@sardasumit did you forget a link for mr_char.txt?

sardasumit · 2020-07-14T01:51:59Z

@rkcosmos It is the same as the Hindi character list.
mr_char.txt

nishad · 2020-07-15T10:35:52Z

@rkcosmos
Malayalam (https://en.wikipedia.org/wiki/Malayalam), belongs to Group 6.
#143
This PR contains character and word lists.

imvladikon · 2020-07-15T18:04:19Z

Hi! Thanks for your work. Some notes about Hebrew: some letters have a final form (the letter changes its shape when it is placed at the end of a word), see https://en.wikipedia.org/wiki/Final_form. There are also diacritical signs (https://en.wikipedia.org/wiki/Niqqud) that are used to represent vowels or to distinguish between alternative pronunciations of letters. (Arabic also has final forms, among others, and diacritical signs.) I didn't provide the diacritical signs; I assume it's better to train without them first, since everyday writing consists of plain letters without diacritical signs.

imvladikon · 2020-07-15T23:18:12Z

I remembered an important thing. In Hebrew there is a cursive script (https://en.wikipedia.org/wiki/Cursive_Hebrew), and people sometimes mix it with regular writing, even in printed matter. It uses the same letters (characters), but it is effectively another font (e.g. https://opensiddur.org/wp-content/uploads/fonts/display-font-charmap.php?fnt=DorianCLM-Italic ). Maybe it's also better not to implement this immediately; I don't know.

rkcosmos · 2020-07-16T08:45:15Z

@nishad Malayalam and Tamil are both Dravidian but do not use the same script, so I have to build 2 models.
@imvladikon OK, I will try to keep this in mind when building the Hebrew model.

rkcosmos · 2020-07-16T15:41:34Z

Question for Indian contributors: I'm looking into the Hindi char and dict files, and there are a lot of characters that appear in the word list but not in the char list. Examples are
['ा', '्', 'ि', 'ी', 'ं', 'ो', 'ु', 'ँ', 'ू', 'ड़', 'ै']. What are these symbols?

Vijayabhaskar96 · 2020-07-16T15:49:02Z

@rkcosmos Those are parts of the existing alphabet; when combined with a letter, they create a new letter. I think the technical term is grapheme? I'm not sure. I would like to know whether they render fine or whether something goes wrong as it did with Tamil.
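
One way to see what these symbols are is to ask Python's standard `unicodedata` module for their names and general categories. The sketch below is an illustration only: the code points are a subset of the examples quoted above, written as escapes, and `split_clusters` is a simplified, hypothetical helper rather than a full grapheme segmenter.

```python
import unicodedata

# A subset of the symbols from the question, written as escapes.
SAMPLES = ["\u093e", "\u094d", "\u093f", "\u0940", "\u0902", "\u094b", "\u0941"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )


def is_combining_mark(ch):
    """True for categories Mn, Mc and Me (non-spacing, spacing, enclosing)."""
    return unicodedata.category(ch).startswith("M")


def split_clusters(word):
    """Group each base character with the marks that follow it.

    A simplification: full grapheme segmentation (Unicode UAX #29) also
    handles conjuncts, joiners and other cases not covered here.
    """
    clusters = []
    for ch in word:
        if clusters and is_combining_mark(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    for ch in SAMPLES:
        print(*describe(ch))
    # The word "Hindi" in Devanagari: 5 code points, 2 visible clusters.
    print(split_clusters("\u0939\u093f\u0902\u0926\u0940"))
```

Characters in the mark categories attach to a preceding base letter, which would explain why they show up in the word list even though an alphabet chart of standalone letters omits them.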

rkcosmos · 2020-07-17T03:19:05Z

@Vijayabhaskar96 So far, Devanagari hasn't had any problems. Unicode supports it well.

imvladikon · 2020-07-17T14:23:07Z

Another addition about Hebrew ;) and it's important. Some diacritic signs matter, such as geresh and gershayim. Geresh is used with ג ז צ for sounds that are not represented in the alphabet (j, g as in "genre", and ch), and the double geresh (gershayim) is used in widely used abbreviations (kitsur); the most famous is תנ"ך (Tanakh). Sometimes people use ordinary quotation marks (apostrophes) instead of typing geresh or gershayim (e.g. תנ''ך).
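
If the word list mixes ASCII quotes with the dedicated Hebrew punctuation, one option is to normalize it before training. This is a rough sketch under assumptions: `normalize_geresh` is a hypothetical helper, it only looks at the basic Hebrew letter range U+05D0..U+05EA, and it will also convert an apostrophe that closes a quoted Hebrew word, so the output needs review.

```python
import re

GERESH = "\u05f3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05f4"  # HEBREW PUNCTUATION GERSHAYIM
HEBREW_LETTER = "[\u05d0-\u05ea]"

# Two apostrophes or one double quote between Hebrew letters -> gershayim.
_GERSHAYIM_RE = re.compile(
    "(?<=" + HEBREW_LETTER + ")(?:''|\")(?=" + HEBREW_LETTER + ")"
)
# A single apostrophe directly after a Hebrew letter -> geresh.
_GERESH_RE = re.compile("(?<=" + HEBREW_LETTER + ")'")


def normalize_geresh(text):
    """Replace ASCII stand-ins with the Hebrew geresh/gershayim characters."""
    text = _GERSHAYIM_RE.sub(GERSHAYIM, text)
    return _GERESH_RE.sub(GERESH, text)


if __name__ == "__main__":
    for raw in ["\u05ea\u05e0\"\u05da", "\u05ea\u05e0''\u05da", "\u05d2'"]:
        fixed = normalize_geresh(raw)
        print([f"U+{ord(c):04X}" for c in fixed])
```

Whichever form is chosen, the character list and the dictionary should agree on it, so the model is not asked to predict two different characters for the same mark.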

tsaidevin · 2021-04-15T08:40:58Z

Excuse me, will there be a chi_tra version 2?

rkcosmos · 2021-04-15T10:46:29Z

@tsaidevin yes.

abhishekvermasg · 2021-04-20T03:19:17Z

Are the model training scripts not available, in case somebody wants to train on a new language? How can one contribute to improving the model?

DaniSubodh · 2021-04-21T18:04:38Z

Hi @rkcosmos, really impressive work with the OCR framework.
I couldn't find a code for the Greek language here, but I see (Greek + math) in the development list above. Do you know if the Greek language itself will be supported anytime soon?

PhonVanna · 2021-05-14T14:38:47Z

Hi @rkcosmos , really impressive work.
Do you plan to add the Khmer language to it?

JoeDelK · 2021-06-14T10:20:20Z

Hi @rkcosmos,

We created a dictionary and dict_char for the Greek language; you can find them here

KarmaCST · 2021-08-01T05:09:07Z

Can you add Dzongkha? Dzongkha is the national language of Bhutan and is similar to Tibetan. Like Thai, it is written continuously from left to right and does not put whitespace between words. The following paper discusses next-syllable prediction for Dzongkha: https://doi.org/10.1016/j.jksuci.2021.01.001

weihaulee · 2021-09-06T03:57:38Z

Hi @rkcosmos,
Is there any plan to add English (vertical text) and digits (vertical text)?

Thanks in advance.

vneseresearcher · 2021-09-29T02:04:19Z

Hi @rkcosmos , Can you share the Japanese dataset you used to train? Thanks a lot!

ajeetmishra · 2021-10-17T18:27:18Z

Hi @rkcosmos. Thank you very much for the efforts you are taking. Is there a plan to include the Indian languages - Gujarati and Oriya ?

amirashe · 2021-11-10T16:42:30Z

Hi @rkcosmos, do you know if Hebrew will be ready soon? Thanks a lot!

FtmsdtHosseini · 2021-12-05T16:36:31Z

Hello @rkcosmos, thank you for your great work.
I ran version 1.4.1 for Persian (Farsi), but the problem I am facing is that Persian is written and read from right to left, while the model detects words from left to right. This confuses text detection and reduces the accuracy.
Is there any solution to fix this issue?
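
As a possible workaround for the ordering part of this, the detected boxes can be re-sorted after recognition. The sketch below is not an EasyOCR feature: `rtl_reading_order` is a hypothetical helper that assumes results shaped like the `(bbox, text, confidence)` tuples returned by `readtext`, with each bbox given as four [x, y] corner points, and it uses a simple line-grouping heuristic. It does not change what the recognizer outputs inside each box.

```python
from statistics import mean


def _center(bbox):
    """Center (x, y) of a box given as four [x, y] corner points."""
    return mean(p[0] for p in bbox), mean(p[1] for p in bbox)


def _height(bbox):
    ys = [p[1] for p in bbox]
    return max(ys) - min(ys)


def rtl_reading_order(detections, line_tolerance=0.5):
    """Reorder (bbox, text, confidence) tuples top-to-bottom, right-to-left.

    Boxes whose vertical centers differ by at most line_tolerance times the
    box height are treated as one line (a heuristic assumption).
    """
    lines = []  # each entry: [reference_y, [detections on that line]]
    for det in sorted(detections, key=lambda d: _center(d[0])[1]):
        _, cy = _center(det[0])
        if lines and abs(cy - lines[-1][0]) <= line_tolerance * _height(det[0]):
            lines[-1][1].append(det)
        else:
            lines.append([cy, [det]])
    ordered = []
    for _, dets in lines:
        # Larger x first, so the rightmost word on a line is read first.
        ordered.extend(sorted(dets, key=lambda d: _center(d[0])[0], reverse=True))
    return ordered


if __name__ == "__main__":
    sample = [
        ([[0, 0], [10, 0], [10, 10], [0, 10]], "left1", 0.9),
        ([[20, 1], [30, 1], [30, 11], [20, 11]], "right1", 0.9),
        ([[0, 30], [10, 30], [10, 40], [0, 40]], "left2", 0.9),
        ([[20, 30], [30, 30], [30, 40], [20, 40]], "right2", 0.9),
    ]
    print(" ".join(text for _, text, _ in rtl_reading_order(sample)))
```

Skewed or multi-column pages would need a more careful line grouping than this.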

babraham123 · 2021-12-09T09:53:44Z

Hi @rkcosmos, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people.
#616

One potential issue is that Amharic words take a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation, and so on. Thus, a single verb may morph in a number of ways that are not all included in the dictionary.

babraham123 · 2021-12-09T09:57:05Z

Hi @rkcosmos, I also submitted a PR for the Tigrinya language, which is similar to Amharic and spoken by over 10 million people.
#615

It has the same mutation issue as Amharic. Also, Arabic numerals are very common even though the language has its own numeral system.

98-Jane · 2022-02-10T04:09:01Z

@rkcosmos Question: Why is the Chinese dict in pinyin rather than in Chinese characters? In the dict folder I cannot find a Chinese dict that is not pinyin. How is this mapping achieved? If I want to add some words to the Chinese dict, how do I add the training data and dict entries?

reemastha706 · 2022-08-02T15:53:03Z

@rkcosmos Has the Greek language been updated? I saw someone contributing for Greek in the comments.

Saranga99 · 2022-09-01T09:36:08Z

Does EasyOCR support the Sinhala language?

amits-ds · 2023-04-24T08:10:55Z

Hey, thank you for this Repo 🙏
Is there an update on a model for Hebrew OCR?

minamohamadii · 2023-05-02T06:35:17Z

I want to use the Farsi language model, but I see that it is not fine-tuned on the Farsi digit 5, only on the Arabic digit 5:
5 farsi = ۵
5 arabic = ٥
This has become a problem for me because for Persian text it shows me 0 or ٥. Can you please improve the Farsi language model? It would help me so much.
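
Until the model itself distinguishes the two digit sets, a post-processing step can map Arabic-Indic digits (U+0660 to U+0669) to their Persian counterparts (U+06F0 to U+06F9). This is only a sketch of that workaround: `to_persian_digits` is a hypothetical helper, and it cannot recover a digit that was misread as a different digit such as 0.

```python
# Arabic-Indic digits U+0660..U+0669 and Persian
# (Extended Arabic-Indic) digits U+06F0..U+06F9, both in order 0..9.
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

_TO_PERSIAN = str.maketrans(ARABIC_DIGITS, PERSIAN_DIGITS)


def to_persian_digits(text):
    """Replace Arabic-Indic digits with Persian digits; leave the rest as is."""
    return text.translate(_TO_PERSIAN)


if __name__ == "__main__":
    recognized = "\u0665"  # Arabic-Indic digit five, as in the report above
    fixed = to_persian_digits(recognized)
    print(f"U+{ord(recognized):04X} -> U+{ord(fixed):04X}")
```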

ashudhatma · 2023-11-28T11:13:08Z

For Group 3 (Devanagari)
Request to add Gujarati Language
gu.txt
gu_char.txt

nmermigas · 2024-02-14T20:56:42Z

Hey, I saw that the issue about support for the Greek language is completed, and I can see the two required .txt documents for Greek. However, I cannot find the Greek language code ('gre' in the repo) in the list of supported languages on the website. Is Greek actually supported?

haroon423 · 2024-04-08T04:28:20Z

Is the Urdu language not supported by the EasyOCR model?

AinazRafiei · 2024-05-18T14:23:12Z

There is a mistake in the name of Group 1. It has to be Persian script. If you search, you will see that Persian is the mother language of the others, and the rest (Arabic, Urdu, and Uyghur) were taken from it (the Persian language).

bereket42 · 2024-05-21T07:19:40Z

Please let me know if Amharic or Tigrinya can be added, thanks! @AinazRafiei
#91 (comment)
#91 (comment)

DanielVegaVega · 2024-07-04T08:48:55Z

@nmermigas looking for Greek as well. Could you find a way to "train" EasyOCR for it? Or is it something that the developer team must train?

IordanisSap · 2024-07-09T06:36:14Z

I am also looking for Greek.

snimavat · 2024-09-11T05:27:02Z

Gujarati language please

rkcosmos added the Language Request Request for new language support label Jul 10, 2020
