euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

TakWolf · 2023-02-13T05:32:03Z

Bug report

char '0x3164' can be encode('ksx1001'), but can not decode('ksx1001')

def main():
    code_point = 0x3164
    c = chr(code_point)
    raw = c.encode('ksx1001')
    c2 = raw.decode('ksx1001')  # <--- this cause error 
    print(f'{c} {c2}')

if __name__ == '__main__':
    main()

Traceback (most recent call last):
  File "/Users/takwolf/Develop/FontDev/fusion-pixel-font/build.py", line 11, in <module>
    main()
  File "/Users/takwolf/Develop/FontDev/fusion-pixel-font/build.py", line 6, in main
    c2 = raw.decode('ksx1001')
         ^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'euc_kr' codec can't decode bytes in position 0-1: incomplete multibyte sequence

The char is Hangul Compatibility Jamo -> Hangul Filler

https://unicode-table.com/en/3164/

The following code is get the zone in ks-x-1001:

def main():
    code_point = 0x3164
    c = chr(code_point)
    raw = c.encode('ksx1001')
    block_offset = 0xA0
    zone_1 = raw[0] - block_offset
    zone_2 = raw[1] - block_offset
    print(f'{zone_1} {zone_2}')


if __name__ == '__main__':
    main()

zone_1 = 4 
zone_2 = 52

https://en.wikipedia.org/wiki/KS_X_1001#Hangul_Filler

other chars in ksx1001 encode an decode is ok, but only this.

Your environment

CPython versions tested on: Python 3.11.1
Operating system and architecture: macOS 13.0

Linked PRs

gh-101863: Fix wrong comments in EUC-KR codec #102417

The text was updated successfully, but these errors were encountered:

arhadthedev · 2023-02-13T05:46:17Z

@ezio-melotti (as an active Unicode expert)

corona10 · 2023-02-17T10:08:01Z

I will take a look at this issue as the internal OSS sprint issue.
I will delegate this issue to my mentees.

b8choi · 2023-02-20T09:19:53Z

I will try to resolve this issue.

TakWolf · 2023-02-23T09:27:48Z

I'm not a python expert.

str.encode() and str.decode() is a builtin fun?

I found the coding logic maybe here:

https://github.com/python/cpython/tree/main/Lib/encodings

https://github.com/python/cpython/blob/main/Lib/encodings/euc_kr.py#L7-L10

that register a codec named 'euc_kr' to provide encode('euc_kr')

and import _codecs_kr is a c source?

https://github.com/python/cpython/blob/main/Modules/cjkcodecs/_codecs_kr.c#L7-L9

and mappings_kr.h

https://github.com/python/cpython/blob/main/Modules/cjkcodecs/mappings_kr.h

It seems that some arrays are provided here to encode/decode query.

static const ucs2_t __ksx1001_decmap[8264]
static const struct dbcs_index ksx1001_decmap[256]
static const ucs2_t __cp949ext_decmap[9650]
static const struct dbcs_index cp949ext_decmap[256] 
static const DBCHAR __cp949_encmap[33133]
static const struct unim_index cp949_encmap[256]

and mappings_kr.h is Generate by https://github.com/python/cpython/blob/main/Tools/unicode/genmap_korean.py

https://github.com/python/cpython/blob/main/Tools/unicode/genmap_korean.py#L12-L58

I don't understand why the cp949encmap size not equals the ksx1001decmap size?

Is euc_kr bi-directional in rules? That if can be encoded, it must can be decoded?

Is CP949 equals to ksx1001?

corona10 · 2023-02-23T09:30:43Z

Is CP949 equals to ksx1001?

Not equals

corona10 · 2023-02-23T09:36:41Z

I don't understand why the cp949encmap size not equals the ksx1001decmap size?

Because CP949 is a kind of extended version of KSX1001

TakWolf · 2023-02-23T09:44:31Z

iterate unicode :

def main():
    encode_count = 0
    decode_count = 0
    for code_point in range(0, 0x10FFFF + 1):
        c = chr(code_point)
        try:
            raw = c.encode('ksx1001')
            encode_count += 1
            try:
                raw.decode('ksx1001')
                decode_count += 1
            except:
                pass
        except:
            pass
    print(f'encode: {encode_count}')
    print(f'decode: {decode_count}')

encode: 17176
decode: 17175

iterate ksx1001 :

def main():
    decode_count = 0
    for zone_1 in range(1, 94 + 1):
        for zone_2 in range(1, 94 + 1):
            try:
                bytes([zone_1 + 0xA0, zone_2 + 0xA0]).decode('ksx1001')
                decode_count += 1
            except UnicodeDecodeError:
                pass
    print(f'decode: {decode_count}')

decode: 8225

TakWolf · 2023-02-23T10:11:43Z

def main():
    not_2_bytes_alphabet = list()
    not_2_bytes_count = 0
    ksx1001_other_count = 0
    ksx1001_syllable_count = 0
    ksx1001_word_count = 0
    ksx1001_but_not_a_char_count = 0
    for code_point in range(0, 0x10FFFF + 1):
        c = chr(code_point)
        try:
            raw = c.encode('ksx1001')
        except UnicodeEncodeError:
            continue

        if len(raw) != 2:
            not_2_bytes_count += 1
            not_2_bytes_alphabet.append(c)
            continue

        zone_1 = raw[0] - 0xA0
        if 1 <= zone_1 <= 12:
            ksx1001_other_count += 1
        elif 16 <= zone_1 <= 40:
            ksx1001_syllable_count += 1
        elif 42 <= zone_1 <= 93:
            ksx1001_word_count += 1
        else:  # zone_1 is 13~15, 41, 94
            ksx1001_but_not_a_char_count += 1

    print(f'ksx other + syllable + word :{ksx1001_other_count + ksx1001_syllable_count + ksx1001_word_count}')
    print(f'other: {ksx1001_other_count}')
    print(f'syllable: {ksx1001_syllable_count}')
    print(f'word: {ksx1001_word_count}')
    print(f'in ksx but not a char: {ksx1001_but_not_a_char_count}')

    print(f'not 2 len bytes: {not_2_bytes_count}')
    print(f'not 2 len bytes + syllable: {not_2_bytes_count + ksx1001_syllable_count}')
    print('unicode all syllable: 11172')
    print(''.join(not_2_bytes_alphabet))

ksx other + syllable + word :8226     # 8225 + 1(Hangul Filler)
other: 988
syllable: 2350
word: 4888
in ksx but not a char: 0

not 2 len bytes: 8950     # <---- what's this ? Include other syllables not in ksx1001 2350?
not 2 len bytes + syllable: 11300
unicode all syllable: 11172

not_2_bytes_alphabet:

�������	
������������������ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~�갂갃갅갆갋 ......  # Behind them are all syllables

b8choi · 2023-03-02T06:48:23Z

From my investigation so far, there are some unclear aspects of the decoding process related to Korean characters encoded in EUC-KR.

FYI, the KS X 1001 standard defines the Korean character set, while EUC-KR is an encoding method for this set. In Python, "ksx1001" is an alias of "euc_kr".

Korean alphabet, also known as Hangul, expresses one syllable consisting of consonants and vowels as one letter. The KS X 1001 standard defines a precomposed set of characters that correspond to one syllable rather than individual consonants and vowels. However, to express a character not included in this set, a method of composing consonants and vowels is defined. The "Hangul Filler" character is used for this purpose, and it, along with the following three consonants and vowels, are arranged to express one character in a total of 8 bytes.

The KS X 1001 standard does not provide any specific guidelines other than the method for expressing one character using the "Hangul Filler".

Korean characters not included in this standard are decomposed into consonants and vowels, and the Korean alphabet code value of the special character area corresponding to each part is used for information exchange. To differentiate it from information exchange for individual consonant and vowel characters, one "Hangul Filler" is added in front of each syllable.

The main issue is whether to allow an independent "Hangul Filler" that is not followed by valid consonants and vowels to compose a single character. Currently, the EUC-KR encoder does not validate strings containing "Hangul Filler" and just convert "Hangul Filler" of KS X 1001 to the corresponding Unicode "Hangul Filler".

I think there are three possible solutions to this issue, which need to be discussed:

Not fix anything: allow encoding to contain independent "Hangul Filler" characters, while decoding will treat an sequence starting with "Hangul Filler" as an error if it does not represent a valid combination of Korean characters.
Fix the encoder: strings that contain independent "Hangul Filler" will be treated as an error (this may cause new errors in the existing codes).
Fix the decoder: a valid sequence will be converted into one composed character, and in other cases, the code of each character will be simply converted into the corresponding code in Unicode.

~~I have tried the third method and confirmed that no errors have occurred.~~

For more information, please refer to the KS X 1001 specification and a detailed description of "Hangul Filler" character.

corona10 · 2023-03-02T11:21:40Z

I prefer supporting the round-trip approach since we already support encoding.
But I would need to listen to the advice from the other core devs :)

cc @ezio-melotti @vstinner

corona10 · 2023-03-02T12:30:18Z

@b8choi

I have tried the third method and confirmed that no errors have occurred.

Would you like to submit the patch anyway? We may need to analyze the side effect when we support the decoding for the character.
But please submit the patch with a nice test code.

* main: (21 commits) pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472) pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455) pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467) pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667) pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664) pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665) pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445) pythonGH-102341: Improve the test function for pow (python#102342) Fix unused classes in a typing test (pythonGH-102437) pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318) pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426) Move around example in to_bytes() to avoid confusion (python#101595) pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421) pythongh-96821: Add config option `--with-strict-overflow` (python#96823) pythongh-101992: update pstlib module documentation (python#102133) pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699) pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417) pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303) pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180) pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896) ...

* main: (37 commits) pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472) pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455) pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467) pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667) pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664) pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665) pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445) pythonGH-102341: Improve the test function for pow (python#102342) Fix unused classes in a typing test (pythonGH-102437) pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318) pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426) Move around example in to_bytes() to avoid confusion (python#101595) pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421) pythongh-96821: Add config option `--with-strict-overflow` (python#96823) pythongh-101992: update pstlib module documentation (python#102133) pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699) pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417) pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303) pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180) pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896) ...

corona10 · 2023-03-14T16:55:21Z

Since the decoding implementation for this case is undefined and difficult to support decoding for this edge case due to streaming issues.
I close this issue as not a bug.

If someone think that this is a bug please let me know.

TakWolf · 2023-03-14T18:02:21Z

So far, I still haven't got an explanation that chr(0x3164).encode('ksx1001').decode('ksx1001') will get a exception is or not a expected behavior (Why python do like this? What is the document or implements reference?).

At least in my understanding, the encoder should be idempotent (That if some char can be encoded, it should be decoded).
For example there are some control characters. They have special purposes in the display layer, but this does not affect that they can be encoded and decoded.

I also deal with other encoding standards ( gb2312, shitf-jis, big5), that have not the similar behaviors (They are all idempotent).

And this problem does not exist in Java. (I will test later in Rust , Go and node-Javascript)

EUC was used to solve computer coding in the early stage, if they can't convert with unicode(utf-8) equivalent, it seems incredible to me.

TakWolf · 2023-03-14T18:42:52Z

My opinion, encoding is a storage structure, data should not be changed through coders but keep it as it is.
So for b8choi's proposal 2 and 3, they are not suitable.
If the char is in the scope, coder should not throw a exception. And coder should not changed the data.
(Imagine that you just browse the ksx file in a utf8 system, the data is changed, this is not reasonable)
Ksx can be fully mapped to unicode, convert the hangul is meaningless. There is also a Hangul Filler in Unicode. Should we convert all the Hangul Filler combin chars to hangul in utf-8 data layer?

An evidence that may support my opinion, there is a Emoji Combinations in Unicode.
For example: 👩‍❤️‍💋‍👨 = 👩+❤️+‍💋+‍👨
This just a display, on data they are equal.

Hangul Filler ’s role should be similar to Ideographic Description Characters only effect on display layer.
If the text engine support Hangul Filler, it will display hangul only on display. If not display as char is.

TakWolf added the type-bug An unexpected behavior, bug, or error label Feb 13, 2023

arhadthedev added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Feb 13, 2023

corona10 self-assigned this Feb 13, 2023

TakWolf mentioned this issue Feb 16, 2023

KS-X-1001 字符分布 - 其他字符统计进度超过 100% TakWolf/fusion-pixel-font#12

Closed

corona10 assigned b8choi Feb 20, 2023

b8choi mentioned this issue Mar 4, 2023

gh-101863: Fix wrong comments in EUC-KR codec #102417

Merged

corona10 pushed a commit that referenced this issue Mar 4, 2023

gh-101863: Fix wrong comments in EUC-KR codec (gh-102417)

77a3196

hugovk pushed a commit to hugovk/cpython that referenced this issue Mar 6, 2023

pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417)

2d6e5b8

corona10 closed this as completed Mar 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

TakWolf commented Feb 13, 2023 •

edited by bedevere-bot

Loading

arhadthedev commented Feb 13, 2023

corona10 commented Feb 17, 2023

b8choi commented Feb 20, 2023

TakWolf commented Feb 23, 2023 •

edited

Loading

corona10 commented Feb 23, 2023

corona10 commented Feb 23, 2023

TakWolf commented Feb 23, 2023

TakWolf commented Feb 23, 2023

b8choi commented Mar 2, 2023 •

edited

Loading

corona10 commented Mar 2, 2023 •

edited

Loading

corona10 commented Mar 2, 2023

corona10 commented Mar 14, 2023

TakWolf commented Mar 14, 2023 •

edited

Loading

TakWolf commented Mar 14, 2023 •

edited

Loading

euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

Comments

TakWolf commented Feb 13, 2023 • edited by bedevere-bot Loading

Bug report

Your environment

Linked PRs

arhadthedev commented Feb 13, 2023

corona10 commented Feb 17, 2023

b8choi commented Feb 20, 2023

TakWolf commented Feb 23, 2023 • edited Loading

corona10 commented Feb 23, 2023

corona10 commented Feb 23, 2023

TakWolf commented Feb 23, 2023

TakWolf commented Feb 23, 2023

b8choi commented Mar 2, 2023 • edited Loading

corona10 commented Mar 2, 2023 • edited Loading

corona10 commented Mar 2, 2023

corona10 commented Mar 14, 2023

TakWolf commented Mar 14, 2023 • edited Loading

TakWolf commented Mar 14, 2023 • edited Loading

TakWolf commented Feb 13, 2023 •

edited by bedevere-bot

Loading

TakWolf commented Feb 23, 2023 •

edited

Loading

b8choi commented Mar 2, 2023 •

edited

Loading

corona10 commented Mar 2, 2023 •

edited

Loading

TakWolf commented Mar 14, 2023 •

edited

Loading

TakWolf commented Mar 14, 2023 •

edited

Loading