bpo-42236: os.device_encoding() respects UTF-8 Mode #23119

vstinner · 2020-11-02T22:03:10Z

On Unix, the os.device_encoding() function now returns 'UTF-8' rather
than the device encoding if the Python UTF-8 Mode is enabled.

https://bugs.python.org/issue42236
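
A minimal sketch of how the pieces involved can be observed, assuming Python 3.7 or later where the `-X utf8` option exists. The helper name `probe_child` is hypothetical and not part of this PR. The child's stdout is a pipe here, not a TTY, so `os.device_encoding(1)` reports `None`; the behavior discussed in this PR only applies when the descriptor is a terminal.

```python
import subprocess
import sys


def probe_child(extra_args=()):
    """Run a child interpreter and return "<utf8_mode> <device_encoding(1)>"."""
    code = (
        "import os, sys; "
        "print(sys.flags.utf8_mode, os.device_encoding(1))"
    )
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", code],
        capture_output=True,  # stdout becomes a pipe, so it is not a TTY
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # UTF-8 Mode is enabled in the child, but fd 1 is a pipe.
    print(probe_child(["-X", "utf8"]))
```

To see the TTY case, the same one-liner would have to be run directly in a terminal with and without `-X utf8`.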

vstinner · 2020-11-02T22:12:19Z

@methane: Would you mind having a look at this change?

I'm not sure if it's correct to replace nl_langinfo(CODESET) with _Py_GetLocaleEncodingObject() in os.device_encoding(). One effect is that it returns UTF-8 if the Python UTF-8 Mode is enabled. But there is also another effect: on Android and VxWorks, os.device_encoding() now also returns UTF-8 (if the fd is a TTY) rather than nl_langinfo(CODESET).

I'm not sure if it's the same on Android or not?

IMO this change makes os.device_encoding(), and so indirectly open(), more consistent with encoding choices in Python.

By the way, I deeply reworked the documentation on encodings, especially the locale encoding, the filesystem encoding and error handler (docs.python.org was not updated yet).

eryksun · 2020-11-03T01:06:39Z

Python/fileutils.c

+    const PyPreConfig *preconfig = &_PyRuntime.preconfig;
+    if (preconfig->utf8_mode) {
+        return PyUnicode_FromString("UTF-8");
+    }


If UTF-8 mode doesn't override PYTHONLEGACYWINDOWSSTDIO, then this ends up using UTF-8 with io.FileIO instances opened for console files. This will read mixed garbage and "surrogatepass" from console input and print garbage to the screen.

Currently in 3.9 PYTHONLEGACYWINDOWSSTDIO is broken separately from UTF-8 mode. Somehow it ends up incorrectly using the process ANSI codepage instead of the console input and output codepages:

C:\>chcp 850
Active code page: 850
C:\>set PYTHONLEGACYWINDOWSSTDIO=1
C:\>py -3.9 -c "import sys; print(sys.stdin.encoding)"
cp1252

Note that even for non-legacy operation, with or without UTF-8 mode, the console input and output code pages are still used with os.read and os.write. Overriding the result from os.device_encoding as "UTF-8" loses important information about the correct encoding to use in those cases:

>>> os.write(1, 'αβψ\n'.encode('UTF-8'))
a├í?
5

methane · 2020-11-03

I don't think it is a problem. PYTHONLEGACYWINDOWSSTDIO is disabled by default. This option should be enabled only when a program uses some hack that relies on the legacy stdout behavior. So trying to make PYTHONLEGACYWINDOWSSTDIO perfect would be a waste of time.

In this case, overriding the option means we are taking one combination of options away from users. I don't think that is a nice idea.

For example, some programs use PYTHONLEGACYWINDOWSSTDIO to redirect stdout to a file with a hack. See here:

https://github.com/dagster-io/dagster/blob/4b4d9eea4ef42ab0d6b55aa123d70f6d4150d4fc/python_modules/dagster/dagster/core/execution/compute_logs.py#L33-L54

If a user wants to redirect stdout to a file and use UTF-8 in the file, combining PYTHONLEGACYWINDOWSSTDIO and PYTHONUTF8 makes sense, although it seems a really ugly hack.

eryksun · 2020-11-03

I'm wondering what the rationale is for forcing console bytes I/O to use UTF-8. A console in Windows is not an agnostic bytes medium, unlike a new disk file, pipe, or socket. Internally it uses 16-bit characters. The console bytes API uses best-fit code-page translation between native wide-character strings and byte strings. If the console says the input code page is 437 or output code page is 850, then there is no doubt that bytes I/O (e.g. os.read and os.write) is limited to those code pages.

Because the console bytes API uses a best-fit translation, the result doesn't even round trip necessarily to what the user enters, which can be surprising. For example, with the input and output code pages set to 850:

>>> s = os.read(0, 5).decode('utf-8', 'surrogateescape')
αβψ

Based on visual feedback it looks like this worked, but on inspection we see that the console mapped αβψ to aß? when asked to read it as a byte string.

>>> s
'a\udce1?\r\n'
>>> n = os.write(1, s.encode('utf-8', 'surrogateescape'))
aß?

The situation in POSIX is similar if the terminal uses a legacy encoding, but I don't recall encountering a best-fit translation in POSIX. If the terminal encoding doesn't support a typed or pasted character, it just gets ignored, which is immediately obvious to the user. As far as I know, detecting the terminal encoding is not even possible in POSIX. Anyway, configuring a POSIX terminal with anything but UTF-8 nowadays is rare. The application-level locale might be some legacy setting or "C" / "POSIX", but the terminal emulator is likely to use UTF-8, so overriding the locale to force using UTF-8 is generally an improvement.

Since _Py_device_encoding in Windows is a low-level device query, not an application-level locale query, I think it should always return the encoding for a console file, which is the encoding that should be used with os.read or os.write. Also, the supported domain should be expanded beyond file descriptors 0-2. See _get_console_type in winconsoleio.c for an example of using GetNumberOfConsoleInputEvents to determine input vs output files. The latter should replace isatty in Windows. This enhancement of _Py_device_encoding would also support opening "CONIN$" and "CONOUT$" in legacy mode with the correct default encoding.

methane · 2020-11-03

Please use b.p.o for such discussion. A PR is the place for code review.

I have already objected to overriding os.device_encoding() on b.p.o.

vstinner · 2020-11-03

@eryksun:

If UTF-8 mode doesn't override PYTHONLEGACYWINDOWSSTDIO, then this ends up using UTF-8 with io.FileIO instances opened for console files. This will read mixed garbage and "surrogatepass" from console input and print garbage to the screen.

I wrote this change with Unix in mind. On Unix, a terminal has no encoding. Only the LC_CTYPE locale matters. But the purpose of the UTF-8 Mode is to ignore the LC_CTYPE locale (the "locale encoding") on purpose.

I would be perfectly fine with keeping the current behavior on Windows. It's fine to have a different behavior depending on the platform.

@methane:

If a user wants to redirect stdout to a file and use UTF-8 in the file, combining PYTHONLEGACYWINDOWSSTDIO and PYTHONUTF8 makes sense, although it seems a really ugly hack.

On Windows, os.device_encoding(fd) returns None if fd > 2 or if the file descriptor is not a Windows console. I expect that if you redirect stdout into a file, os.device_encoding(sys.stdout.fileno()) returns None.

@methane: If os.device_encoding(fd) returns None when fd is a file and not a console, would you be OK with not implementing the UTF-8 Mode in device_encoding()?

device_encoding() is an important function since it is tested first by open() when the encoding is omitted! It is tested before using locale.getpreferredencoding(False).
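
The lookup order described above can be sketched as follows. This is only an illustration of the order stated in this comment (device encoding first, then `locale.getpreferredencoding(False)`), not the actual `open()` implementation, and the helper name `default_text_encoding` is hypothetical.

```python
import locale
import os
import tempfile


def default_text_encoding(fd):
    """Pick an encoding in the order described in the comment above."""
    device = os.device_encoding(fd)  # None unless fd refers to a terminal
    if device is not None:
        return device
    # Fall back to the locale encoding when fd is not a terminal.
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as raw:
        # A regular file has no device encoding, so the fallback is used.
        print(os.device_encoding(raw.fileno()))
        print(default_text_encoding(raw.fileno()))
```

For a regular file, the first branch is never taken, which is why the redirect case discussed here depends on what the descriptor pointed to at the time the encoding was chosen.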

methane · 2020-11-03

First of all, I only objected to overriding the PYTHONLEGACYWINDOWSSTDIO option. After that, I also objected to overriding os.device_encoding() on b.p.o.

On Windows, os.device_encoding(fd) returns None if fd > 2 or if the file descriptor is not a Windows console. I expect that if you redirect stdout into a file, os.device_encoding(sys.stdout.fileno()) returns None.
@methane: If os.device_encoding(fd) returns None when fd is a file and not a console, would you be OK with not implementing the UTF-8 Mode in device_encoding()?

See https://github.com/dagster-io/dagster/blob/4b4d9eea4ef42ab0d6b55aa123d70f6d4150d4fc/python_modules/dagster/dagster/core/execution/compute_logs.py#L33-L54

This hack replaces the fd after the TextIOWrapper is created. So os.device_encoding() is called for the console, but the returned encoding is used for writing to the file.

Users may want to use UTF-8 mode to change the default text file encoding used by open(). If we enforce PYTHONLEGACYWINDOWSSTDIO=0, this (ugly) hack will be broken. That's why I objected to overriding PYTHONLEGACYWINDOWSSTDIO.

vstinner · 2020-11-03

PYTHONLEGACYWINDOWSFSENCODING=1 disables the UTF-8 Mode:
https://docs.python.org/dev/c-api/init_config.html#c.PyPreConfig.legacy_windows_fs_encoding

Maybe we can do the same for PYTHONLEGACYWINDOWSSTDIO? If PYTHONLEGACYWINDOWSSTDIO=1 is used, turn off UTF-8 Mode?
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.legacy_windows_stdio

But anyway, I will leave the Windows implementation unchanged, since you wrote:

I don't think UTF-8 mode should override os.device_encoding() on Windows.

https://bugs.python.org/issue42236#msg380261

:-)

eryksun · 2020-11-03

This hack replaces the fd after the TextIOWrapper is created. So os.device_encoding() is called for the console, but the returned encoding is used for writing to the file.

Since 3.8, the sys.std* files do not have the correct code page in legacy mode. Regardless of how legacy mode is used by projects, what it MUST implement is the legacy behavior of 3.5. sys.stdin for console input should use the console input code page, and sys.stdout and sys.stderr for console output should use the console output code page.

New behavior is also needed to support file descriptors above 2 in _Py_device_encoding in all cases. If it's desired for legacy mode to continue to be wrong in this case for the sake of compatibility, that should be implemented as a higher-level policy in TextIOWrapper, not by unnecessarily limiting the usage of os.device_encoding in a way that's not indicated by the doc string or documentation.

vstinner · 2020-11-03

Since 3.8, the sys.std* files do not have the correct code page in legacy mode.

I'm not aware of this issue. Would you mind opening a separate issue at bugs.python.org?


vstinner · 2020-11-03T17:23:41Z

@methane @eryksun: I updated my PR to only change the behavior on Unix (non-Windows platforms).

vstinner · 2020-11-04T10:23:48Z

Example showing the impact of this change in practice, on the stdout TTY:

import sys
assert sys.stdout.isatty()
print("stdout encoding:", sys.stdout.encoding)
reopen_stdout = open(sys.stdout.fileno(), closefd=False)
print("reopen encoding:", reopen_stdout.encoding)
reopen_stdout.close()

Python 3.9 output (old):

stdout encoding: utf-8
reopen encoding: ISO-8859-1

Python 3.10 output (new):

stdout encoding: utf-8
reopen encoding: UTF-8

IMHO the new behavior is more consistent.

If you want the old behavior, you can explicitly pass encoding=locale.getpreferredencoding(False) to open().
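
A minimal sketch of passing the encoding explicitly, as suggested above. The helper name `reopen_with_locale_encoding` is hypothetical. What `locale.getpreferredencoding(False)` returns depends on the interpreter configuration (the `locale` documentation states that it returns 'UTF-8' when the UTF-8 Mode is enabled), so this only shows the mechanics of naming the encoding instead of relying on the default chosen by open().

```python
import locale
import os
import tempfile


def reopen_with_locale_encoding(fd, mode="r"):
    """Wrap an existing fd in a text file with an explicitly chosen encoding."""
    return open(
        fd,
        mode,
        encoding=locale.getpreferredencoding(False),
        closefd=False,  # leave the underlying descriptor open for its owner
    )


if __name__ == "__main__":
    with tempfile.TemporaryFile() as raw:
        text = reopen_with_locale_encoding(raw.fileno(), "w")
        print("explicit encoding:", text.encoding)
        text.close()
```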

vstinner · 2020-11-04T10:25:27Z

Thanks for the reviews @eryksun and @methane.

@eryksun: The final change leaves Windows unchanged. If you consider that Python has issues on a specific use case, please open a new issue.

bedevere-bot · 2020-11-04T11:18:54Z

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x RHEL8 LTO 3.x has failed when building commit 3529718.

What do you need to do:

Don't panic.
Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/567/builds/936) and take a look at the build logs.
Check if the failure is related to this commit (3529718) or if it is a false positive.
If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/567/builds/936

Summary of the results of the build (if available):

== Tests result: ENV CHANGED ==

409 tests OK.

10 slowest tests:

test_concurrent_futures: 2 min 53 sec
test_peg_generator: 2 min 52 sec
test_gdb: 1 min 42 sec
test_multiprocessing_spawn: 1 min 11 sec
test_multiprocessing_forkserver: 1 min
test_multiprocessing_fork: 54.2 sec
test_signal: 47.4 sec
test_asyncio: 46.2 sec
test_tokenize: 37.3 sec
test_io: 35.9 sec

1 test altered the execution environment:
test_asyncio

14 tests skipped:
test_devpoll test_ioctl test_kqueue test_msilib test_nis
test_ossaudiodev test_startfile test_tix test_tk test_ttk_guionly
test_winconsoleio test_winreg test_winsound test_zipfile64

Total duration: 5 min 8 sec

Traceback logs:

Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 321, in __del__
    self.close()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 316, in close
    self._ssl_protocol._start_shutdown()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 590, in _start_shutdown
    self._abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 731, in _abort
    self._transport.abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/selector_events.py", line 680, in abort
    self._force_close(None)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/selector_events.py", line 731, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/base_events.py", line 746, in call_soon
    self._check_closed()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

* master:
  bpo-42260: Add _PyInterpreterState_SetConfig() (pythonGH-23158)
  Disable peg generator tests when building with PGO (pythonGH-23141)
  bpo-1635741: _sqlite3 uses PyModule_AddObjectRef() (pythonGH-23148)
  bpo-1635741: Fix PyInit_pyexpat() error handling (pythonGH-22489)
  bpo-42260: Main init modify sys.flags in-place (pythonGH-23150)
  bpo-1635741: Fix ref leak in _PyWarnings_Init() error path (pythonGH-23151)
  bpo-1635741: _ast uses PyModule_AddObjectRef() (pythonGH-23146)
  bpo-1635741: _contextvars uses PyModule_AddType() (pythonGH-23147)
  bpo-42260: Reorganize PyConfig (pythonGH-23149)
  bpo-1635741: Add PyModule_AddObjectRef() function (pythonGH-23122)
  bpo-42236: os.device_encoding() respects UTF-8 Mode (pythonGH-23119)
  bpo-42251: Add gettrace and getprofile to threading (pythonGH-23125)
  Enable signing of nuget.org packages and update to supported timestamp server (pythonGH-23132)
  Fix incorrect links in ast docs (pythonGH-23017)
  Add _PyType_GetModuleByDef (pythonGH-22835)
  Post 3.10.0a2
  bpo-41796: Call _PyAST_Fini() earlier to fix a leak (pythonGH-23131)
  bpo-42249: Fix writing binary Plist files larger than 4 GiB. (pythonGH-23121)
  bpo-40077: Convert mmap.mmap static type to a heap type (pythonGH-23108)
  Python 3.10.0a2


the-knights-who-say-ni added the CLA signed label Nov 2, 2020

bedevere-bot added the awaiting core review label Nov 2, 2020

eryksun reviewed Nov 3, 2020


bpo-42236: os.device_encoding() respects UTF-8 Mode

4288a46

On Unix, the os.device_encoding() function now returns 'UTF-8' rather than the device encoding if the Python UTF-8 Mode is enabled.

Update the UTF-8 Mode documentation

c5e768b

methane approved these changes Nov 3, 2020


bedevere-bot added awaiting merge and removed awaiting core review labels Nov 3, 2020

vstinner merged commit 3529718 into python:master Nov 4, 2020

bedevere-bot removed the awaiting merge label Nov 4, 2020

vstinner deleted the device_encoding branch November 4, 2020 10:20
