Support Python 3 #39

madalu · 2020-01-03T15:31:04Z

Thanks so much for this excellent software! I have been using it for years to run OCR on scans and it has never failed me.

Would it be possible to add Python 3 support? Unfortunately, Python 2 development has been officially frozen and Python 2 will no longer receive updates: https://www.python.org/doc/sunset-python-2/

blaueente · 2020-08-21T21:30:13Z

This issue has become worse, as Ubuntu 20.04 no longer offers pip for Python 2, so even a "manual" non-package installation is becoming very difficult.

stweil · 2020-11-26T07:53:27Z

@jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python2 by Python3 support?

bastien-roucaries · 2021-06-02T15:12:58Z

> @jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python2 by Python3 support?

@stweil go ahead, I will include this patch for Debian and Ubuntu if it is not accepted upstream.

jsbien · 2021-06-02T16:33:35Z

Great!

bastien-roucaries · 2021-06-03T07:46:51Z

@jsbien @stweil I can take a pull request from here if that is more convenient, but I lack the time to write the patch myself.

bastien-roucaries · 2021-08-06T14:26:22Z

Fixed by pull request

bastien-roucaries · 2021-08-06T14:27:31Z

@jsbien @stweil @jwilk @madalu Could you test and review?

jsbien · 2021-08-13T09:59:15Z

A quick test is OK, thanks.
BTW, please update the doc/dependencies file.

Dominic-Mayers · 2021-08-19T03:03:26Z

@bastien-roucaries In my case, I could not extract the hOCR from a DjVu file using djvu2hocr. It complained that the argument to write was bytes instead of str. Note that the encode method converts a str to bytes in the given encoding, while sys.stdout.write in Python 3 expects a str. I had to make the following modifications to lib/cli/djvu2hocr.py:

At line 331, replace
sys.stdout.write(hocr_header.encode('UTF-8'))
with
sys.stdout.write(hocr_header)

At line 345, replace
sys.stdout.write(hocr_footer.encode('UTF-8'))
with
sys.stdout.write(hocr_footer)

At line 277, replace
tree.write(sys.stdout)
with
tree.write(sys.stdout.buffer)
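
The three changes above follow one rule: in Python 3 a text stream such as sys.stdout accepts only str, while its underlying .buffer accepts only bytes. Below is a minimal sketch of that rule, not the actual djvu2hocr code. The helper name write_hocr is hypothetical, and the standard-library xml.etree.ElementTree stands in for whatever XML library the project uses.

```python
import io
import xml.etree.ElementTree as ET


def write_hocr(text_stream, header, tree, footer):
    """Write str parts to the text layer and the XML tree to the binary layer.

    Hypothetical helper for illustration; text_stream must expose .buffer,
    as sys.stdout does when it is a regular text stream.
    """
    text_stream.write(header)           # str goes to the text stream
    text_stream.flush()                 # keep ordering before using .buffer
    tree.write(text_stream.buffer, encoding='UTF-8')  # bytes go to .buffer
    text_stream.buffer.flush()
    text_stream.write(footer)
    text_stream.flush()


# In-memory stand-in for sys.stdout, so the sketch runs anywhere.
demo = io.TextIOWrapper(io.BytesIO(), encoding='UTF-8')
body = ET.Element('body')
ET.SubElement(body, 'p').text = 'hello'
write_hocr(demo, '<html>', ET.ElementTree(body), '</html>')
print(demo.buffer.getvalue().decode('UTF-8'))
```

Passing sys.stdout as text_stream should give the behaviour described in the modifications above; writing bytes to the text layer raises the TypeError shown in the traceback further down.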

faridcher · 2021-08-19T04:39:41Z

@bastien-roucaries yes, @Dominic-Mayers's changes are needed to work around an error. Now it works fine on my Debian machine.

> ~/src/py/ocrodjvu$ djvu2hocr ~/99tech.djvu 
Converting /home/farid/fin/stock/books/murphy/99tech.djvu:
Traceback (most recent call last):
  File "/usr/local/bin/djvu2hocr", line 26, in <module>
    cli.main(sys.argv)
  File "/usr/local/share/ocrodjvu/lib/cli/djvu2hocr.py", line 331, in main
    sys.stdout.write(hocr_header.encode('UTF-8'))
TypeError: write() argument must be str, not bytes

jsbien · 2021-08-19T10:17:24Z

I confirm.

FYI, I tried to convert the resulting hOCR with hocr2djvused and got
lib.errors.MalformedHocr: malformed hOCR document: page without bounding box information
I understand this is not related to the Python version.

rmast · 2022-01-08T18:39:09Z

I saw four forks with a Python 3 conversion, and I merged the successful parts.

The remaining issues are string/bytes issues with the optional ocrad and gocr engines. I guess something has to be done with TextIOWrapper in common.py to adapt the output of tesseract.py, cuneiform.py, ocrad.py and gocr.py.

You can see the remaining issues with

make test

or more specifically:

nosetests tests.ocrodjvu.test_integration:test_ocr
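
For readers unfamiliar with the TextIOWrapper idea mentioned above, here is a minimal sketch of how a binary stream, such as the stdout pipe of an OCR engine subprocess, can be decoded into str. It is an illustration only: the function name read_engine_output and the UTF-8 default are assumptions, not code from common.py.

```python
import io
import subprocess
import sys


def read_engine_output(binary_stream, encoding='UTF-8'):
    """Decode a binary stream into a list of str lines.

    Hypothetical helper; the encoding is an assumption and would have to
    match what the engine actually emits.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, errors='replace')
    return [line.rstrip('\n') for line in text_stream]


if __name__ == '__main__':
    # A child Python process stands in for an OCR engine here.
    command = [sys.executable, '-c', "print('recognized text')"]
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        print(read_engine_output(proc.stdout))
```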

rmast · 2022-01-09T20:03:19Z

I made the tests for gocr and ocrad work as well. For the gocr output I used BytesIO instead of StringIO.
All tests run fine now, and I updated the coverage. As far as I'm concerned anyone could try the python3 branch in my fork.
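
The BytesIO versus StringIO distinction mentioned above can be sketched as follows. This is a generic illustration, not the code from the fork; the helper name buffer_for is hypothetical.

```python
import io


def buffer_for(data):
    """Return an in-memory file object matching the type of data.

    Hypothetical helper: bytes need io.BytesIO, str needs io.StringIO.
    """
    if isinstance(data, bytes):
        return io.BytesIO(data)
    return io.StringIO(data)


raw = b'engine output\n'            # bytes, as read from a subprocess pipe
print(buffer_for(raw).readline())   # a bytes line

try:
    io.StringIO(raw)                # StringIO rejects bytes in Python 3
except TypeError as exc:
    print('TypeError:', exc)
```

In Python 2 a StringIO could hold byte strings, which is why code ported from Python 2 tends to hit this mismatch.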

rmast · 2022-01-24T23:40:39Z

We should probably try to get it working on Python 3.10 as well:
jwilk-archive/python-djvulibre#13

FriedrichFroebel · 2022-09-25T15:29:13Z

Inside the @rmast fork, GitHub Actions have not been integrated yet, as well as some upstream changes, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) from both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice.

I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably need some work to incorporate the upstream changes nevertheless, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although this Python 3 fork might become obsolete as well in the far future, due to gamera4 relying on the deprecated distutils package). While I can imagine at least maintaining a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment.

rmast · 2022-09-28T20:41:06Z

> Inside the @rmast fork, GitHub Actions have not been integrated yet, as well as some upstream changes, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) from both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice.

I forked some stuff to protect it against a maintainer taking it offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with GitHub Actions. We could try to do it together. I spent my last summer holiday improving the MRC compression of ocrmypdf using the DjVu tricks from these jwilk repos, using not only tesseract but also easyocr to segment the details of text parts into the foreground. Unfortunately my first proof of concept was delayed because I struggled with Cython and memory management in the custom Otsu histograms, so my holiday was over before the POC was live.

By the way, didjvu is not mentioned in the jwilk-retirement-message.

> I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably need some work to incorporate the upstream changes nevertheless, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although this Python 3 fork might become obsolete as well in the far future, due to gamera4 relying on the deprecated distutils package). While I can imagine at least maintaining a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment.

I'm not an active user of any of these tools, only melancholic about losing useful algorithms thought out before for MRC-compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it. I tend to work on getting comparable open-source functionality into similar PDF MRC compression. PDF is what I use when I scan a document and share it with my peers.

Gamera4 still has minimal maintenance, so distutils might still be taken care of.

I was able to revive a functional pip-installer for python 2.7 as the main pip-download doesn't support 2.7 anymore.

rmast · 2022-09-28T20:56:16Z

I read your issues in the Gamera-4 repo. There are more issues and support might be dropped as well, mostly due to Python as a moving target, just as with these jwilk-repos. We might inspect the dependencies of didjvu on Gamera-4. As far as my attention has been concerned until now the main dependency is on the djvu-binarizer. Do you actively use any of those other binarizers? My effort this summer was giving life to yet another binarizer, based on Otsu thresholding of easyocr segments.

FriedrichFroebel · 2022-09-29T08:11:23Z

> I forked some stuff to protect it against a maintainer taking it offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with GitHub Actions. We could try to do it together.

There is GitHub Actions support in this upstream repository now, so this might be limited to mostly copy-and-paste, although some changes are required (see my didjvu fork for example). I might have a look at it and might decide to "modernize" the code as I did for didjvu as well, in case I find enough time to do so.

> By the way, didjvu is not mentioned in the jwilk-retirement-message.

I am aware of that.

> I'm not an active user of any of these tools, only melancholic about losing useful algorithms thought out before for MRC-compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it.

I use both didjvu and ocrodjvu on a regular basis at the moment - and I might keep maintaining at least the bits which I actually use as far as I am able to. While I am rather familiar with Python development, the actual DJVU and image processing stuff is something I only have a rough understanding of.

> Gamera4 still has minimal maintenance, so distutils might still be taken care of.

If you look at the corresponding issue there, the future is not really clear. I just started fixing some deprecated stuff to test Python 3.11 compatibility, but especially with distutils the migration path for some functionality is not even clear to the upstream developers.

> We might inspect the dependencies of didjvu on Gamera-4. As far as my attention has been concerned until now the main dependency is on the djvu-binarizer. Do you actively use any of those other binarizers?

I just looked through the code of didjvu: It seems like the only important imports are from gamera.plugins.threshold and gamera.plugins.binarization, while I only use the default djvu_threshold implementation in my cases. But the didjvu stuff is out of scope here anyway.

FriedrichFroebel · 2022-09-29T18:16:34Z

I just did my first real test with the aggregated Python 3 port. Apart from the fact that the requirements.txt file is missing the regex and future modules, at least my tests worked without any issues.

rmast · 2022-09-29T19:14:10Z

Nice. We did those upgrade activities to make those repos survive the deprecation of python 2.7, and I'm glad they do.

rmast · 2022-10-01T11:38:58Z

The main reason for deprecating, for example, the jwilk and Gamera repos would be Python as a moving target, which is too tedious to follow. I wonder whether converting them to a more solid language would help preserve them. It's probably too much of an effort right now, but there are AI solutions for translating Python to C++ or Java nowadays: https://morioh.com/p/81aa0e33b28a

FriedrichFroebel · 2022-10-02T06:21:52Z

This is still a matter of taste and of the actual code base. If the code has been modernized, there should not be any real issues for plain Python code. The biggest problems mostly arise from Python 2 code which has been made compatible with Python 3, but never actually modernized. In my experience, Python 3 tends to be rather stable, except that its C/C++ APIs might change (as we see for gamera-4). For this reason, maintaining Python code should mostly be easy enough.

madalu mentioned this issue Jan 3, 2020

Keep maintaining the package in Debian and get the program back to the distribution #38

Open

jwilk added the wontfix label Jan 20, 2020

ericonr mentioned this issue Nov 1, 2020

New package: ocrodjvu-0.11 (with dependency python-djvulibre-0.8.4) void-linux/void-packages#11300

Closed

bastien-roucaries mentioned this issue Aug 6, 2021

Port to python3 #41

Open

v-- mentioned this issue Jan 31, 2022

Add support for python 3 kcroker/dpsprep#6

Merged
