Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add cchardet support and use chardet as fallback #4115

Closed
wants to merge 7 commits into from
Closed

add cchardet support and use chardet as fallback #4115

wants to merge 7 commits into from

Conversation

2mf
Copy link

@2mf 2mf commented May 30, 2017

I created this pull-request for #4112

Copy link
Member

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of duplicated logic in your change request. Can any of it be de-duplicated?

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to try to deduplicate this logic a lot. Do you mind doing that?

@2mf
Copy link
Author

2mf commented May 30, 2017

sure :-)
I have to update the test settings as well to ensure chardet and cchardet run both

if isinstance(_package, tuple):
package = _package[1]
try:
locals()[package] = __import__(package[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like this line is correct, should be __import__(_package[0])? Might want to unpack the tuple right away and then use proper variable names so the logic is easier to follow.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an issue here with the warning that we're raising.

assert minor >= 1
assert patch >= 0
except AssertionError:
warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0! Falling back to chardet')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't specifying the upper pin, which is < 3.0.0.

Copy link
Member

@nateprewitt nateprewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @2mf, thanks for putting this together! While I'm not opposed to supporting cchardet, I do have some questions/comments on the current implementation.

tox.ini Outdated

[testenv]

deps =
py{26,27,33,34,35,36}-chardet: chardet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super fond of doubling the number of test instances we're running.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure, how I can test against ccharset and charset using existing cases

@@ -56,18 +56,37 @@
except AssertionError:
raise RuntimeError('Requests dependency \'urllib3\' must be version >= 1.21.1, < 1.22!')

import warnings

def get_version(package):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to abstract this out to a function, maybe we add it to the urllib3 import to reduce the duplicated code to a single place?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure I'll do that

warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0! Falling back to chardet')
raise

except (ImportError, SyntaxError, AssertionError) as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the scenario that raises a SyntaxError? Also if we're not using e for anything, we probably don't need to set it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied that from other places in requests, but from my view point it is not needed

locals()[package] = __import__(package)
for package in ('urllib3', 'idna', ('chardet', 'cchardet')):
if isinstance(package, tuple):
package, alt_package = package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to be extremely careful about touching this code, at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This looks safe, but it'd be so much better if this were testable.

@kennethreitz
Copy link
Contributor

This whole thing scares me.

@Lukasa
Copy link
Member

Lukasa commented May 30, 2017

I for one am not wedded to doing it: I'm nervous about the amount of code it adds as well.

@kennethreitz
Copy link
Contributor

@2mf do you have any metrics on how much faster cchardet is than chardet? e.g. is it worth the trouble?

@2mf
Copy link
Author

2mf commented May 30, 2017

@kennethreitz a benchmark can be seen at https://pypi.python.org/pypi/cchardet

Python 2.7.12
Request (call/s)
chardet 0.26
cchardet 1341.81

Python 3.6.0
Request (call/s)
chardet 0.26
cchardet 1472.43

@2mf
Copy link
Author

2mf commented May 30, 2017

@kennethreitz @Lukasa
If you both don't like the idea of including cchardet. I don't have a problem with that but then there must be a possibility to use cchardet by monkey patching but this is prevented by other things like:

shouldn't be the version matching in the requirements of setup.py?

# Check chardet for compatibility.
import chardet
major, minor, patch = chardet.__version__.split('.')[:3]
major, minor, patch = int(major), int(minor), int(patch)
# chardet >= 3.0.2, < 3.1.0
try:
    assert major == 3
    assert minor < 1
    assert patch >= 2
except AssertionError:
    raise RuntimeError('Requests dependency \'chardet\' must be version >= 3.0.2, < 3.1.0!')

and

for package in ('urllib3', 'idna', 'chardet'):
    locals()[package] = __import__(package)
    # This traversal is apparently necessary such that the identities are
    # preserved (requests.packages.urllib3.* is urllib3.*)
    for mod in list(sys.modules):
        if mod == package or mod.startswith(package + '.'):
            sys.modules['requests.packages.' + mod] = sys.modules[mod]

@kennethreitz
Copy link
Contributor

That is significantly faster!

mfranke added 2 commits May 30, 2017 19:57
remove unnecessary SyntaxError
pin cchardet to >=2.1.0 <2.2.0
This reverts commit b231590.
@sigmavirus24
Copy link
Contributor

shouldn't be the version matching in the requirements of setup.py?

No, @2mf it shouldn't. Requests needs very specific versions of dependencies. Some of those dependencies are popular and we need a way to tell the user "You're using a version that will horribly break your version of Requests"

assert minor < 2
assert patch >= 0
except AssertionError:
warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0, < 2.2.0! Falling back to chardet')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a dependency of Requests, it's an optional speed-up. Please revise

# cchardet >= 2.1.0
try:
assert major == 2
assert minor >= 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not:

assert 1 <= minor < 2

except (ImportError, AssertionError):
# Check chardet for compatibility.
import chardet
major, minor, patch = get_version(chardet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chardet shouldn't be imported in this except block. This will cause extremely confusing tracebacks for users if they can't import either.

@@ -93,6 +93,7 @@ def run_tests(self):
cmdclass={'test': PyTest},
tests_require=test_requirements,
extras_require={
'cchardet:(python_version == "2.7" or python_version >= "3.4")': ['cchardet>=2.1.0'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're enforcing (in __init__.py) that cchardet be between 2.1.0 and 2.2.0 then that should be written here.

locals()[package] = __import__(package)
for package in ('urllib3', 'idna', ('chardet', 'cchardet')):
if isinstance(package, tuple):
package, alt_package = package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This looks safe, but it'd be so much better if this were testable.

@sigmavirus24
Copy link
Contributor

That is significantly faster!

And harder to install and the benchmark, as I understand it, is based off a very old version of chardet and hasn't been updated since.

@kennethreitz
Copy link
Contributor

It sounds like we're against this change overall. Sorry @2mf.

@sigmavirus24
Copy link
Contributor

I'm on board with doing this, but the current methodology isn't 100% correct and I'm merely pointing out that those benchmarks are older and potentially misleading. @dan-blanchard has done a lot of fantastic work on chardet recently

@kennethreitz kennethreitz reopened this May 30, 2017
@dan-blanchard
Copy link
Contributor

dan-blanchard commented May 30, 2017

I feel the need to point out that chardet is in the midst of gaining support for more codecs and languages than cchardet supports. I'm retraining all the models for greater accuracy too. Speed isn't everything with encoding detection.

@2mf
Copy link
Author

2mf commented May 30, 2017

Just for the sake of a recent benchmark
I did the same benchmark for chardet and cchardet that where done by @PyYoshi (on my machine) and the difference is still in the same dimension.

Python 3.6.1
chardet 3.0.3
cchardet 2.1.0

python test_chardet.py
chardet: 0.24908731023612882 call(s)/s

python test_cchardet.py
chardet: 1161.9214360906421 call(s)/s

anyway, It is not worth to continue on this one while running against so many walls

@2mf 2mf closed this May 30, 2017
@dan-blanchard
Copy link
Contributor

I hadn't actually looked at the benchmark code cChardet uses before, but it just runs the same small SJIS text file through both modules 100 times. I know cChardet will pretty much always be faster than chardet—in fact, if you look at old chardet issues, you'll even see that I once toyed with the idea of trying to merge the projects—but that is not an accurate benchmark, especially for something like requests where a good amount of the time is spent downloading the page content anyway. A requests-based benchmark would make sense for deciding if it makes a difference here.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants