Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line-length-rules don't work correctly with german umlauts #310

Closed
dakuenne opened this issue Jul 28, 2022 · 9 comments · Fixed by #325 or #348
Closed

Line-length-rules don't work correctly with german umlauts #310

dakuenne opened this issue Jul 28, 2022 · 9 comments · Fixed by #325 or #348
Labels
bug User-facing bugs windows Windows related issues
Milestone

Comments

@dakuenne
Copy link

We have configured the length-rules for title and body and all of them seem to count german umlauts as two characters. I suppose this is an issue of the python len-function which uses the byte-length of a string.

We are currently using python 3.10.4 and gitlint 0.17.0.

Example:
123456789012345678901234567890123456789012345678901234567890123456789012 -> evaluates as 72 characters
12345678901234567890123456789012345678901234567890123456789012345678901ä -> evaluates as 73 characters

@jorisroovers
Copy link
Owner

AFAIK, that’s only the case in Python 2, not python 3. See below:

$ python
Python 3.10.2 (main, Feb 13 2022, 14:37:03) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("123456789012345678901234567890123456789012345678901234567890123456789012")
72
>>> len("12345678901234567890123456789012345678901234567890123456789012345678901ä")
72

$ python2
Python 2.7.18 (default, Mar  8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len("123456789012345678901234567890123456789012345678901234567890123456789012")
72
>>> len("12345678901234567890123456789012345678901234567890123456789012345678901ä")
73

I double checked this with git and gitlint, just to check nothing else is going on and I can’t reproduce this.

$ git log -1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
commit d42be31a32c1c60b517d06b11dc70b4af2a07dee (HEAD -> master) ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Author: gïtlint-test-user <gitlint@test.com>
Date:   Mon Aug 1 09:18:51 2022 +0200

    12345678901234567890123456789012345678901234567890123456789012345678901ä

$ gitlint -c T1.line-length=50
1: T1 Title exceeds max length (72>50): "12345678901234567890123456789012345678901234567890123456789012345678901ä"
3: B6 Body message is missing

Can you provide a reproducible example? Thanks!

@dakuenne
Copy link
Author

dakuenne commented Aug 1, 2022

We have added gitlint as hook via gitlint install-hook and only the hook seems to count the characters wrong. If we execute python and gitlint on the command-line everything works fine and we can see the same results as you in your tests.
I have added a python --version to the hook to see if there is an old Python 2 in use, but the output shows the expected version 3.10.4.
And I forgot to mention, that we are running Gitlint on Windows 10 machines.

@jorisroovers
Copy link
Owner

jorisroovers commented Aug 3, 2022

And I forgot to mention, that we are running Gitlint on Windows 10 machines.

This might be clue! Gitlint has known issues with unicode on Windows; it's complicated (see #96).

For debugging, can you set your .gitlint config like something this, just to ensure the line length is always printed as a violation.

[title-max-length]
line-length=5

Please then rerun your test, but set GITLINT_DEBUG=1 so we can get a little more info :)

GITLINT_DEBUG=1

Please paste all the hook output here after committing. Thanks!

@jorisroovers jorisroovers added the windows Windows related issues label Aug 3, 2022
@dakuenne
Copy link
Author

dakuenne commented Aug 3, 2022

If it can't be fixed, I can live with that. Gitlint is a great tool for improving our commit messages with umlaut-support and without.

From my understanding Gitlint uses gits COMMIT_EDITMSG-file to analyse the message. I have opened this file in Notepad++ and it counts the same way as Gitlint. Umlauts are counted as two characters in Notepad++ too.

I get this output:

DEBUG: gitlint.cli To report issues, please visit https://github.com/jorisroovers/gitlint/issues
DEBUG: gitlint.cli Platform: Windows-10-10.0.19044-SP0
DEBUG: gitlint.cli Python version: 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)]
DEBUG: gitlint.git ('--version',)
DEBUG: gitlint.cli Git version: git version 2.36.0.windows.1
DEBUG: gitlint.cli Gitlint version: 0.17.0
DEBUG: gitlint.cli GITLINT_USE_SH_LIB: [NOT SET]
DEBUG: gitlint.cli DEFAULT_ENCODING: UTF-8
DEBUG: gitlint.cli Configuration
config-path: C:\Repositories\smart-vms-client\.gitlint
[GENERAL]
extra-path: C:\Repositories\smart-vms-client\tools\gitlint
contrib: []
ignore: B6
ignore-merge-commits: True
ignore-fixup-commits: True
ignore-squash-commits: True
ignore-revert-commits: True
ignore-stdin: False
staged: True
fail-without-commits: False
verbosity: 3
debug: True
target: C:\Repositories\smart-vms-client
[RULES]
  I1: ignore-by-title
     ignore=all
     regex=None
  I2: ignore-by-body
     ignore=all
     regex=None
  I3: ignore-body-lines
     regex=None
  I4: ignore-by-author-name
     ignore=all
     regex=None
  T1: title-max-length
     line-length=5
  T2: title-trailing-whitespace
  T6: title-leading-whitespace
  T3: title-trailing-punctuation
  T4: title-hard-tab
  T5: title-must-not-contain-word
     words=WIP
  T7: title-match-regex
     regex=None
  T8: title-min-length
     min-length=5
  B1: body-max-line-length
     line-length=5
  B5: body-min-length
     min-length=20
  B6: body-is-missing
     ignore-merge-commits=True
  B2: body-trailing-whitespace
  B3: body-hard-tab
  B4: body-first-line-empty
  B7: body-changed-file-mention
     files=
  B8: body-match-regex
     regex=None
  M1: author-valid-email
     regex=[^@ ]+@[^@ ]+\.[^@ ]+

gitlint: checking commit message...
DEBUG: gitlint.cli Fetching additional meta-data from staged commit
DEBUG: gitlint.cli Using --msg-filename.
DEBUG: gitlint.git ('config', '--get', 'core.commentchar')
DEBUG: gitlint.cli Linting 1 commit(s)
DEBUG: gitlint.lint Linting commit [SHA UNKNOWN]
DEBUG: gitlint.git ('config', '--get', 'user.name')
DEBUG: gitlint.git ('config', '--get', 'user.email')
DEBUG: gitlint.git ('rev-parse', '--abbrev-ref', 'HEAD')
DEBUG: gitlint.git ('diff', '--staged', '--name-only', '-r')
DEBUG: gitlint.lint Commit Object
--- Commit Message ----
12345ü
--- Meta info ---------
Author: 
Date:   2022-08-03 10:48:03 +0200
is-merge-commit:  False
is-fixup-commit:  False
is-squash-commit: False
is-revert-commit: False
Branches: ['develop']
Changed Files: ['.gitlint']
-----------------------
1: T1 Title exceeds max length (7>5): "12345ü"
DEBUG: gitlint.cli Exit Code = 1

@jorisroovers
Copy link
Owner

Ok, thanks for the debug output.

The fact that Notepad++ is giving the same result seems to indicate "it's complicated" and not just a gitlint specific issue. I will keep this open but likely won't immediately work on this given I've tried (and failed) to fix Unicode on Windows multiple times. It also involves working in a virtual windows environment (I'm not a daily windows user), which is a bit tedious in terms of workflow for me.

Thanks for the bug report!

@jorisroovers jorisroovers added bug User-facing bugs and removed status: needs-information labels Aug 4, 2022
@dakuenne
Copy link
Author

I had a chance to look at the sourcecode the last days myself and i think i found the issue - at least for windows.
The click-File option looses the encoding of the incoming datastream. If i change line 230 in cli.py from @click.option('--msg-filename', type=click.File(), help="Path to a file containing a commit-msg.") into @click.option('--msg-filename', type=click.File(encoding='UTF-8'), help="Path to a file containing a commit-msg.") the encoding is fine and the umlauts are counted as one character.
But i don't know if this has any sideeffects on other platforms or if it is a good way to fix this issue

@jorisroovers
Copy link
Owner

Great find! Give me some time to try this out and get back to you, probably next week.

@jorisroovers jorisroovers added this to the 0.18.0 milestone Aug 12, 2022
jorisroovers added a commit that referenced this issue Aug 25, 2022
This fixes encoding issues which occur more commonly when invoking the
commit-msg hook on Windows.

Fixes #310
@jorisroovers
Copy link
Owner

Ok, I just pushed a variation of this suggested fix to https://github.com/jorisroovers/gitlint/tree/issue-310-umlauts-windows.

My fix uses encoding=gitlint.utils.DEFAULT_ENCODING which falls back to UTF-8 if no other encoding is set. This ensures we respect any environmental overrides.
All tests still pass after that change, so I expect this will work.

You can try this out using:

virtualenv gitlint-issue310
source gitlint-issue310/bin/activate
pip install git+https://github.com/jorisroovers/gitlint.git@issue-310-umlauts-windows#subdirectory=gitlint-core`
gitlint --debug # use gitlint as normal

I'll try this out myself on Windows before merging the branch.

jorisroovers added a commit that referenced this issue Aug 26, 2022
This fixes encoding issues which occur more commonly when invoking the
commit-msg hook on Windows.

Fixes #310
@jorisroovers
Copy link
Owner

Manually verified on Windows, merged the fix. Will go out with 0.18.0, whenever that is :) Thanks for the bug report!

jorisroovers added a commit that referenced this issue Oct 24, 2022
No longer used for development purposes by maintainer, is only adding
maintenance overhead at this point.

Fixes #310
jorisroovers added a commit that referenced this issue Oct 25, 2022
No longer used for development purposes by maintainer, is only adding
maintenance overhead at this point.

Fixes #310
jorisroovers added a commit that referenced this issue Nov 16, 2022
- Python 3.11 support
- Last release to support Python 3.6
- Behavior Change: In a future release, gitlint will be switching to use
  `re.search` instead of `re.match` semantics for all rules. (#254)
- gitlint no longer uses the `sh` library by default in an attempt to reduce
  external dependencies.
- `--commits` now also accepts a comma-separated list of commit hashes,
  making it possible to lint a list of non-contiguous commits without invoking
  gitlint multiple times (#283)
- Improved handling of branches that have no commits (#188)
- Support for `GITLINT_CONFIG` env variable (#189)
- Added a new `gitlint-ci` pre-commit hook, making it easier to run gitlint
  through pre-commit in CI (#191)
- Contrib Rules:
  - New `contrib-disallow-cleanup-commits` rule (#312)
  - New `contrib-allowed-authors` rule (#358)
- User Defined rules:
  - Gitlint now recognizes `fixup=amend` commits, available as
    `commit.is_fixup_amend_commit=True`
  - Gitlint now parses diff **stat** information, available in
    `commit.changed_files_stats` (#314)
- Bugfixes:
  - Use correct encoding when using `--msg-filename` parameter (#310)
  - Various documentation fixes (#244) (#263) (#266) (#294) (#295) (#347) (#364)
- Under-the-hood:
  - Dependencies updated
  - Moved to blacked for formatting
  - Fixed nasty CI issue (#298)
  - Unit tests fix (#256)
  - Vagrant box removed in favor of github dev containers (#348)
  - Removed a few lingering references to the `master` branch in favor of `main`
  - Moved roadmap and project planning to github projects

Full Release details in CHANGELOG.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug User-facing bugs windows Windows related issues
Projects
Status: 0.18.0
2 participants