This repository has been archived by the owner on Oct 3, 2022. It is now read-only.

Support Python 3 #39

Open
madalu opened this issue Jan 3, 2020 · 22 comments

@madalu

madalu commented Jan 3, 2020

Thanks so much for this excellent software! I have been using it for years to run OCR on scans and it has never failed me.

Would it be possible to add Python 3 support? Unfortunately, Python 2 development has been officially frozen and Python 2 will no longer receive updates: https://www.python.org/doc/sunset-python-2/

@blaueente

This issue has become worse: Ubuntu 20.04 no longer offers pip for Python 2, so even a "manual" non-package installation is becoming very difficult.

@stweil

stweil commented Nov 26, 2020

@jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python 2 support with Python 3 support?

@bastien-roucaries

> @jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python 2 support with Python 3 support?

@stweil Go ahead. I will include the patch in Debian and Ubuntu if it is not accepted upstream.

@jsbien

jsbien commented Jun 2, 2021

Great!

@bastien-roucaries

@jsbien @stweil I can take a pull request from here if that is more convenient, but I lack the time to write the patch myself.

@bastien-roucaries

Fixed by pull request

@bastien-roucaries

@jsbien @stweil @jwilk @madalu Could you test and review?

@jsbien

jsbien commented Aug 13, 2021

A quick test is OK, thanks.
BTW, please update the doc/dependencies file.

@Dominic-Mayers

Dominic-Mayers commented Aug 19, 2021

@bastien-roucaries In my case, I could not extract the hOCR from a DjVu file using djvu2hocr. It complained that the argument to write was bytes instead of str. Note that the encode method converts a string to bytes in the given encoding, whereas in Python 3 sys.stdout.write expects a str. I had to make the following modifications to lib/cli/djvu2hocr.py:

At line 331, replace
sys.stdout.write(hocr_header.encode('UTF-8'))
with
sys.stdout.write(hocr_header)

At line 345, replace
sys.stdout.write(hocr_footer.encode('UTF-8'))
with
sys.stdout.write(hocr_footer)

At line 277, replace
tree.write(sys.stdout)
with
tree.write(sys.stdout.buffer)
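
A minimal sketch of the distinction behind these three changes, assuming Python 3: text goes to the text layer of a stream as str, while serialized XML bytes go to the underlying binary buffer. The names write_document, HEADER and demo are hypothetical, the standard-library xml.etree.ElementTree stands in for whatever tree type djvu2hocr actually uses, and the explicit flush calls are an added precaution that is not part of the changes above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for hocr_header; a str, not bytes.
HEADER = '<!-- hypothetical header -->\n'


def write_document(text_stream, header, tree):
    """Write a str header and then a serialized XML tree to one stream."""
    # Text goes to the text layer as str, without calling .encode().
    text_stream.write(header)
    # Flush so the header reaches the byte layer before the tree does.
    text_stream.flush()
    # The tree is serialized to bytes, so it goes to the binary buffer.
    tree.write(text_stream.buffer, encoding='utf-8')
    text_stream.buffer.flush()


def demo():
    """Run write_document against an in-memory stand-in for sys.stdout."""
    raw = io.BytesIO()
    fake_stdout = io.TextIOWrapper(raw, encoding='utf-8')
    tree = ET.ElementTree(ET.fromstring('<p>caf\u00e9</p>'))
    write_document(fake_stdout, HEADER, tree)
    return raw.getvalue()
```

With the real sys.stdout, the call would be write_document(sys.stdout, header, tree), since sys.stdout.buffer is the binary layer underneath the text stream.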

@faridcher

faridcher commented Aug 19, 2021

@bastien-roucaries Yes, @Dominic-Mayers's changes are needed to work around an error. With them, it now works fine on my Debian machine. Without them I got:

> ~/src/py/ocrodjvu$ djvu2hocr ~/99tech.djvu 
Converting /home/farid/fin/stock/books/murphy/99tech.djvu:
Traceback (most recent call last):
  File "/usr/local/bin/djvu2hocr", line 26, in <module>
    cli.main(sys.argv)
  File "/usr/local/share/ocrodjvu/lib/cli/djvu2hocr.py", line 331, in main
    sys.stdout.write(hocr_header.encode('UTF-8'))
TypeError: write() argument must be str, not bytes

@jsbien

jsbien commented Aug 19, 2021

I confirm.

FYI, I tried to convert the resulting hOCR with hocr2djvused and got
lib.errors.MalformedHocr: malformed hOCR document: page without bounding box information
I understand this is not related to the Python version.

@rmast

rmast commented Jan 8, 2022

I saw four forks with a Python 3 conversion and merged the successful parts.

The remaining issues are string/bytes issues with the optional ocrad and gocr engines. I guess something has to be done with TextIOWrapper in common.py to adapt the output of tesseract.py, cuneiform.py, ocrad.py and gocr.py.

You can see the remaining issues with

make test

or more specifically:

nosetests tests.ocrodjvu.test_integration:test_ocr
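
The TextIOWrapper idea mentioned above could look roughly like the following sketch, which decodes the binary output of a child process into str. The function name read_engine_output and the fake engine command are hypothetical; this is not the code in common.py, only an illustration of the technique using standard-library calls.

```python
import io
import subprocess
import sys


def read_engine_output(argv, encoding='utf-8'):
    """Run a command and return its standard output decoded to str."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    # proc.stdout is a binary stream; TextIOWrapper decodes it to text.
    with io.TextIOWrapper(proc.stdout, encoding=encoding) as text:
        output = text.read()
    proc.wait()
    return output


# A Python one-liner stands in for a real OCR engine here.
fake_engine = [sys.executable, '-c', "print('hello hOCR')"]
result = read_engine_output(fake_engine)
```

An engine whose output is not valid text in the assumed encoding would need to stay in bytes instead.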

@rmast

rmast commented Jan 9, 2022

I made the tests for gocr and ocrad work as well. For the gocr output I used BytesIO instead of StringIO.
All tests run fine now, and I updated the coverage. As far as I'm concerned anyone could try the python3 branch in my fork.
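
For readers unfamiliar with the difference, here is a small sketch of why BytesIO fits where StringIO does not, assuming the engine output arrives as bytes. The helper names and the sample chunks are hypothetical and do not come from the fork.

```python
import io


def collect_bytes(chunks):
    """Gather byte chunks into an in-memory binary stream, rewound to 0."""
    stream = io.BytesIO()
    for chunk in chunks:
        stream.write(chunk)  # BytesIO accepts bytes
    stream.seek(0)
    return stream


def stringio_accepts_bytes():
    """Report whether io.StringIO accepts bytes (it does not in Python 3)."""
    try:
        io.StringIO().write(b'raw')
    except TypeError:
        return False
    return True


# Hypothetical engine output, split into two chunks.
stream = collect_bytes([b'line 1\n', b'line 2\n'])
text = stream.read().decode('ascii')
```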

@rmast

rmast commented Jan 24, 2022

We should probably try to get it working on Python 3.10 as well:
jwilk-archive/python-djvulibre#13

@FriedrichFroebel

In the @rmast fork, neither GitHub Actions nor some of the upstream changes have been integrated yet, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) of both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice.

I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably need some work to incorporate the upstream changes nevertheless, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although this Python 3 fork might become obsolete as well in the far future, due to gamera4 relying on the deprecated distutils package). While I can imagine at least maintaining a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment.

@rmast

rmast commented Sep 28, 2022

> In the @rmast fork, neither GitHub Actions nor some of the upstream changes have been integrated yet, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) of both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice.

I forked some stuff to protect it from a maintainer taking it offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with GitHub Actions. We could try to do it together. I spent my last summer holiday on improving the MRC compression of ocrmypdf with the DjVu tricks from these jwilk repos, using not only tesseract but also easyocr for segmenting the details of text parts into the foreground. Unfortunately my first proof of concept was delayed by struggles with Cython and memory management during custom Otsu histograms, so my holiday was over before the POC was live.

By the way, didjvu is not mentioned in the jwilk retirement message.

> I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably need some work to incorporate the upstream changes nevertheless, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although this Python 3 fork might become obsolete as well in the far future, due to gamera4 relying on the deprecated distutils package). While I can imagine at least maintaining a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment.

I'm not an active user of any of these tools, only melancholic about losing useful algorithms that were thought out earlier for MRC compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it. I tend to work on getting comparable open-source functionality into similar PDF MRC compression. PDF is what I use when I scan a document and share it with my peers.

Gamera4 still has minimal maintenance, so distutils might still be taken care of.

I was able to revive a working pip installer for Python 2.7, as the main pip download no longer supports 2.7.

@rmast

rmast commented Sep 28, 2022

I read your issues in the Gamera-4 repo. There are more issues, and support might be dropped there as well, mostly due to Python being a moving target, just as with these jwilk repos. We might inspect the dependencies of didjvu on Gamera-4. As far as I have looked into it, the main dependency is on the djvu binarizer. Do you actively use any of the other binarizers? My effort this summer was giving life to yet another binarizer, based on Otsu thresholding of easyocr segments.

@FriedrichFroebel

> I forked some stuff to protect it from a maintainer taking it offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with GitHub Actions. We could try to do it together.

There is GitHub Actions support in this upstream repository now, so this might be limited to mostly copy-and-paste, although some changes are required (see my didjvu fork for example). I might have a look at it and might decide to "modernize" the code as I did for didjvu as well in the case I find enough time to do so.

> By the way, didjvu is not mentioned in the jwilk retirement message.

I am aware of that.

> I'm not an active user of any of these tools, only melancholic about losing useful algorithms that were thought out earlier for MRC compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it.

I use both didjvu and ocrodjvu on a regular basis at the moment, and I might keep maintaining at least the bits which I actually use as far as I am able to. While I am rather familiar with Python development, the actual DjVu and image processing stuff is something I only have a rough understanding of.

> Gamera4 still has minimal maintenance, so distutils might still be taken care of.

If you look at the corresponding issue there, the future is not really clear. I just started fixing some deprecated stuff to test Python 3.11 compatibility, but especially with distutils the migration path for some functionality is not clear even to the upstream developers.

> We might inspect the dependencies of didjvu on Gamera-4. As far as I have looked into it, the main dependency is on the djvu binarizer. Do you actively use any of the other binarizers?

I just looked through the code of didjvu: It seems like the only important imports are from gamera.plugins.threshold and gamera.plugins.binarization, while I only use the default djvu_threshold implementation in my cases. But the didjvu stuff is out of scope here anyway.

@FriedrichFroebel

I just did my first real test with the aggregated Python 3 port. Apart from the fact that the requirements.txt file is missing the regex and future modules, my tests worked without any issues.
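
A quick way to spot this kind of gap before running the tools is to check whether the modules can be found at all. This is only a sketch: missing_modules is a hypothetical helper, and the import names regex and future are assumed to match the package names mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# Module names taken from the comment above; the result depends on
# what is installed in the current environment.
not_installed = missing_modules(['regex', 'future'])
```

Any name in not_installed would then be a candidate for adding to requirements.txt.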

@rmast

rmast commented Sep 29, 2022 via email

@rmast

rmast commented Oct 1, 2022 via email

@FriedrichFroebel

This is still a matter of taste and of the actual code base. If the code has been modernized, there should not be any real issues for plain Python code. The biggest problems mostly arise from Python 2 code which has been made compatible with Python 3, but never actually modernized. In my experience, Python 3 tends to be rather stable, except that its C/C++ APIs might change (as we see for gamera-4). For this reason, maintaining Python code should mostly be easy enough.
