bpo-42236: os.device_encoding() respects UTF-8 Mode #23119

Merged
merged 2 commits into from
Nov 4, 2020

vstinner
Member

@vstinner vstinner commented Nov 2, 2020

On Unix, the os.device_encoding() function now returns 'UTF-8' rather
than the device encoding if the Python UTF-8 Mode is enabled.

https://bugs.python.org/issue42236
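
As a rough illustration of the behavior described above, the following Python sketch models the Unix logic. It is a simplified model and not the C implementation: the helper name `unix_device_encoding_model` is hypothetical, and using `locale.nl_langinfo(locale.CODESET)` to stand in for the previous device-encoding lookup is an assumption.

```python
import locale
import os
import sys


def unix_device_encoding_model(fd):
    """Simplified model (hypothetical helper) of the Unix behavior in this PR."""
    # Like os.device_encoding(), report nothing for non-terminal descriptors.
    if not os.isatty(fd):
        return None
    # New in this change: the UTF-8 Mode takes precedence over the locale.
    if sys.flags.utf8_mode:
        return "UTF-8"
    # Assumption: the locale's CODESET stands in for the old lookup (Unix only).
    return locale.nl_langinfo(locale.CODESET)
```

The UTF-8 Mode can be enabled with `python -X utf8` or `PYTHONUTF8=1`, in which case `sys.flags.utf8_mode` is non-zero.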

@vstinner
Member Author

vstinner commented Nov 2, 2020

@methane: Would you mind having a look at this change?

I'm not sure if it's correct to replace nl_langinfo(CODESET) with _Py_GetLocaleEncodingObject() in os.device_encoding(). One effect is that it returns UTF-8 if the Python UTF-8 Mode is enabled. But there is also another effect: on Android and VxWorks, os.device_encoding() now also returns UTF-8 (if the fd is a TTY) rather than nl_langinfo(CODESET).

I'm not sure if it's the same on Android or not?

IMO this change makes os.device_encoding(), and so indirectly open(), more consistent with the encoding choices in Python.

By the way, I deeply reworked the documentation on encodings, especially the locale encoding and the filesystem encoding and error handler (docs.python.org has not been updated yet).

const PyPreConfig *preconfig = &_PyRuntime.preconfig;
if (preconfig->utf8_mode) {
    return PyUnicode_FromString("UTF-8");
}
Contributor

If UTF-8 mode doesn't override PYTHONLEGACYWINDOWSSTDIO, then this ends up using UTF-8 with io.FileIO instances opened for console files. This will read mixed garbage and "surrogatepass" from console input and print garbage to the screen.

Currently in 3.9 PYTHONLEGACYWINDOWSSTDIO is broken separately from UTF-8 mode. Somehow it ends up incorrectly using the process ANSI codepage instead of the console input and output codepages:

C:\>chcp 850
Active code page: 850
C:\>set PYTHONLEGACYWINDOWSSTDIO=1
C:\>py -3.9 -c "import sys; print(sys.stdin.encoding)"
cp1252

Note that even for non-legacy operation, with or without UTF-8 mode, the console input and output code pages are still used with os.read and os.write. Overriding the result from os.device_encoding as "UTF-8" loses important information about the correct encoding to use in those cases:

>>> os.write(1, 'αβψ\n'.encode('UTF-8'))
aá?
5

Member

I don't think it is a problem. PYTHONLEGACYWINDOWSSTDIO is disabled by default. This option should be enabled only when the program uses some hack that relies on the legacy stdout behavior. So trying to make PYTHONLEGACYWINDOWSSTDIO perfect would be a waste of time.

In this case, overriding the option means we are taking one combination of options away from users. I don't think that is a good idea.

For example, some programs use PYTHONLEGACYWINDOWSSTDIO as a hack to redirect stdout to a file. See here:

https://github.com/dagster-io/dagster/blob/4b4d9eea4ef42ab0d6b55aa123d70f6d4150d4fc/python_modules/dagster/dagster/core/execution/compute_logs.py#L33-L54

If a user wants to redirect stdout to a file and use UTF-8 in the file, combining PYTHONLEGACYWINDOWSSTDIO and PYTHONUTF8 makes sense, although it seems like a really ugly hack.

Contributor

I'm wondering what the rationale is for forcing console bytes I/O to use UTF-8. A console in Windows is not an agnostic bytes medium, unlike a new disk file, pipe, or socket. Internally it uses 16-bit characters. The console bytes API uses best-fit code-page translation between native wide-character strings and byte strings. If the console says the input code page is 437 or output code page is 850, then there is no doubt that bytes I/O (e.g. os.read and os.write) is limited to those code pages.

Because the console bytes API uses a best-fit translation, the result doesn't even round trip necessarily to what the user enters, which can be surprising. For example, with the input and output code pages set to 850:

>>> s = os.read(0, 5).decode('utf-8', 'surrogateescape')
αβψ

Based on visual feedback it looks like this worked, but on inspection we see that the console mapped αβψ to aß? when asked to read it as a byte string.

>>> s
'a\udce1?\r\n'
>>> n = os.write(1, s.encode('utf-8', 'surrogateescape'))
aß?
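
The round trip in the session above can be reproduced on any platform without a console. The sketch below starts from the bytes implied by the value of `s` shown above and only uses the standard codecs; the variable names are illustrative.

```python
# Bytes the console returned for the typed text under code page 850
# (implied by the value of `s` above, including the trailing CR LF).
raw = b"a\xe1?\r\n"

# 0xE1 is not valid UTF-8 here, so surrogateescape maps it to U+DCE1.
decoded = raw.decode("utf-8", "surrogateescape")
print(ascii(decoded))

# Encoding with the same error handler restores the original bytes.
assert decoded.encode("utf-8", "surrogateescape") == raw

# Interpreted as code page 850, the same bytes are the best-fit text.
as_cp850 = raw.decode("cp850")
print(ascii(as_cp850))
```

Decoding the bytes as UTF-8 therefore never recovers the typed characters; it only preserves the best-fit bytes so that they can be written back unchanged.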

The situation in POSIX is similar if the terminal uses a legacy encoding, but I don't recall encountering a best-fit translation in POSIX. If the terminal encoding doesn't support a typed or pasted character, it just gets ignored, which is immediately obvious to the user. As far as I know, detecting the terminal encoding is not even possible in POSIX. Anyway, configuring a POSIX terminal with anything but UTF-8 nowadays is rare. The application-level locale might be some legacy setting or "C" / "POSIX", but the terminal emulator is likely to use UTF-8, so overriding the locale to force using UTF-8 is generally an improvement.


Since _Py_device_encoding in Windows is a low-level device query, not an application-level locale query, I think it should always return the encoding for a console file, which is the encoding that should be used with os.read or os.write. Also, the supported domain should be expanded beyond file descriptors 0-2. See _get_console_type in winconsoleio.c for an example of using GetNumberOfConsoleInputEvents to determine input vs output files. The latter should replace isatty in Windows. This enhancement of _Py_device_encoding would also support opening "CONIN$" and "CONOUT$" in legacy mode with the correct default encoding.
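
For reference, the console input and output code pages mentioned above can be queried from Python through ctypes. This is only a sketch: the helper name `console_code_pages` is hypothetical, it relies on the Win32 `GetConsoleCP` and `GetConsoleOutputCP` functions, and it simply returns `None` on other platforms or when no console is attached.

```python
import sys


def console_code_pages():
    """Return (input_cp, output_cp) as 'cpNNN' names, or None if unavailable."""
    if sys.platform != "win32":
        return None  # the Win32 console API only exists on Windows
    import ctypes

    kernel32 = ctypes.windll.kernel32
    input_cp = kernel32.GetConsoleCP()
    output_cp = kernel32.GetConsoleOutputCP()
    # Both calls return 0 when the process has no attached console.
    if not input_cp or not output_cp:
        return None
    return f"cp{input_cp}", f"cp{output_cp}"
```

After `chcp 850` in a console session, this would be expected to report code page 850 for both directions, independently of the UTF-8 Mode.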

Member

@methane methane Nov 3, 2020

Please use b.p.o for such discussion.
A PR is the place for code review.

I have already objected to overriding os.device_encoding() on b.p.o.

Member Author

@eryksun:

If UTF-8 mode doesn't override PYTHONLEGACYWINDOWSSTDIO, then this ends up using UTF-8 with io.FileIO instances opened for console files. This will read mixed garbage and "surrogatepass" from console input and print garbage to the screen.

I wrote this change with Unix in mind. On Unix, a terminal has no encoding. Only the LC_CTYPE locale matters. But the purpose of the UTF-8 Mode is to ignore the LC_CTYPE locale (the "locale encoding") on purpose.

I would be perfectly fine with keeping the current behavior on Windows. It's fine to have a different behavior depending on the platform.

@methane:

If a user wants to redirect stdout to a file and use UTF-8 in the file, combining PYTHONLEGACYWINDOWSSTDIO and PYTHONUTF8 makes sense, although it seems like a really ugly hack.

On Windows, os.device_encoding(fd) returns None if fd > 2 or if the file descriptor is not a Windows console. I expect that if you redirect stdout into a file, os.device_encoding(sys.stdout.fileno()) returns None.

@methane: If os.device_encoding(fd) returns None when fd is a file and not a console, would you be OK with not implementing the UTF-8 Mode in device_encoding()?

device_encoding() is an important function since it is tested first by open() when the encoding is omitted! It is tested before using locale.getpreferredencoding(False).
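
The lookup order described above can be sketched in Python as follows. This approximates the behavior discussed in this thread and is not a copy of the io implementation; the helper name `default_text_encoding` is hypothetical, and later Python versions may choose the default differently.

```python
import locale
import os
import tempfile


def default_text_encoding(fd):
    """Approximate how open() picks an encoding when none is given."""
    # Step 1: ask the device; this is None unless fd is a terminal/console.
    encoding = os.device_encoding(fd)
    if encoding is None:
        # Step 2: fall back to the locale's preferred encoding.
        encoding = locale.getpreferredencoding(False)
    return encoding


# A regular file is not a terminal, so the locale fallback is used.
with tempfile.TemporaryFile() as f:
    print(os.device_encoding(f.fileno()))
    print(default_text_encoding(f.fileno()))
```

Because the device query comes first, whatever os.device_encoding() returns for a terminal wins over the locale encoding.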

Member

First of all, I only objected to overriding the PYTHONLEGACYWINDOWSSTDIO option. After that, I also objected to overriding os.device_encoding() on b.p.o.

On Windows, os.device_encoding(fd) returns None if fd > 2 or if the file descriptor is not a Windows console. I expect that if you redirect stdout into a file, os.device_encoding(sys.stdout.fileno()) returns None.
@methane: If os.device_encoding(fd) returns None when fd is a file and not a console, would you be OK with not implementing the UTF-8 Mode in device_encoding()?

See https://github.com/dagster-io/dagster/blob/4b4d9eea4ef42ab0d6b55aa123d70f6d4150d4fc/python_modules/dagster/dagster/core/execution/compute_logs.py#L33-L54

This hack replaces the fd after the TextIOWrapper is created. So os.device_encoding() is called for the console, but the returned encoding is used for writing to the file.

Users may want to use UTF-8 mode to change the default text file encoding used by open(). If we enforce PYTHONLEGACYWINDOWSSTDIO=0, this (ugly) hack will break. That's why I objected to overriding PYTHONLEGACYWINDOWSSTDIO.
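
The mechanism behind this concern can be shown in isolation. The sketch below is not the dagster code; it uses temporary files instead of a console, and the helper name `write_after_redirect` is hypothetical. It demonstrates that the encoding of a text wrapper is fixed when the wrapper is created, so replacing the file descriptor afterwards with os.dup2() does not change it.

```python
import os
import tempfile


def write_after_redirect(text, encoding):
    """Create a text wrapper, redirect its fd, then write through it."""
    with tempfile.TemporaryFile() as first, tempfile.TemporaryFile() as second:
        fd = os.dup(first.fileno())
        # The encoding is chosen here, while fd still refers to `first`.
        wrapper = open(fd, "w", encoding=encoding)
        # Redirect the descriptor to `second` after the wrapper exists.
        os.dup2(second.fileno(), fd)
        wrapper.write(text)
        wrapper.flush()
        # The bytes land in `second`, encoded with the original choice.
        second.seek(0)
        data = second.read()
        wrapper.close()
        return wrapper.encoding, data


encoding, data = write_after_redirect("\xe9", "latin-1")
print(encoding, data)
```

In the real hack the wrapper is sys.stdout and the encoding comes from the console, which is why the choice made for the console ends up being applied to the file.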

Member Author

PYTHONLEGACYWINDOWSFSENCODING=1 disables the UTF-8 Mode:
https://docs.python.org/dev/c-api/init_config.html#c.PyPreConfig.legacy_windows_fs_encoding

Maybe we can do the same for PYTHONLEGACYWINDOWSSTDIO? If PYTHONLEGACYWINDOWSSTDIO=1 is used, turn off the UTF-8 Mode?
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.legacy_windows_stdio

... But anyway, I will leave the Windows implementation unchanged, since you wrote:

I don't think UTF-8 mode should override os.device_encoding() on Windows.

https://bugs.python.org/issue42236#msg380261

:-)

Contributor

@eryksun eryksun Nov 3, 2020

This hack replaces the fd after the TextIOWrapper is created. So os.device_encoding() is called for the console, but the returned encoding is used for writing to the file.

Since 3.8, the sys.std* files do not have the correct code page in legacy mode. Regardless of how legacy mode is used by projects, what it MUST implement is the legacy behavior of 3.5. sys.stdin for console input should use the console input code page, and sys.stdout and sys.stderr for console output should use the console output code page.

New behavior is also needed to support file descriptors above 2 in _Py_device_encoding in all cases. If it's desired for legacy mode to continue to be wrong in this case for the sake of compatibility, that should be implemented as a higher-level policy in TextIOWrapper, not by unnecessarily limiting the usage of os.device_encoding in a way that's not indicated by the doc string or documentation.

Member Author

Since 3.8, the sys.std* files do not have the correct code page in legacy mode.

I'm not aware of this issue. Would you mind opening a separate issue at bugs.python.org?

On Unix, the os.device_encoding() function now returns 'UTF-8' rather
than the device encoding if the Python UTF-8 Mode is enabled.
@vstinner
Member Author

vstinner commented Nov 3, 2020

@methane @eryksun: I updated my PR to only change the behavior on Unix (non-Windows platforms).

@vstinner vstinner merged commit 3529718 into python:master Nov 4, 2020
@vstinner vstinner deleted the device_encoding branch November 4, 2020 10:20
@vstinner
Member Author

vstinner commented Nov 4, 2020

Example showing the impact of this change in practice, on the stdout TTY:

import sys

# The example only makes sense when stdout is a terminal.
assert sys.stdout.isatty()
print("stdout encoding:", sys.stdout.encoding)
# Reopen the same file descriptor without an explicit encoding.
reopen_stdout = open(sys.stdout.fileno(), closefd=False)
print("reopen encoding:", reopen_stdout.encoding)
reopen_stdout.close()

Python 3.9 output (old):

stdout encoding: utf-8
reopen encoding: ISO-8859-1

Python 3.10 output (new):

stdout encoding: utf-8
reopen encoding: UTF-8

IMHO the new behavior is more consistent.

If you want the old behavior, you can explicitly pass encoding=locale.getpreferredencoding(False) to open().

@vstinner
Member Author

vstinner commented Nov 4, 2020

Thanks for the reviews @eryksun and @methane.

@eryksun: The final change leaves Windows unchanged. If you consider that Python has issues on a specific use case, please open a new issue.

@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x RHEL8 LTO 3.x has failed when building commit 3529718.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/567/builds/936) and take a look at the build logs.
  4. Check if the failure is related to this commit (3529718) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/567/builds/936

Summary of the results of the build (if available):

== Tests result: ENV CHANGED ==

409 tests OK.

10 slowest tests:

  • test_concurrent_futures: 2 min 53 sec
  • test_peg_generator: 2 min 52 sec
  • test_gdb: 1 min 42 sec
  • test_multiprocessing_spawn: 1 min 11 sec
  • test_multiprocessing_forkserver: 1 min
  • test_multiprocessing_fork: 54.2 sec
  • test_signal: 47.4 sec
  • test_asyncio: 46.2 sec
  • test_tokenize: 37.3 sec
  • test_io: 35.9 sec

1 test altered the execution environment:
test_asyncio

14 tests skipped:
test_devpoll test_ioctl test_kqueue test_msilib test_nis
test_ossaudiodev test_startfile test_tix test_tk test_ttk_guionly
test_winconsoleio test_winreg test_winsound test_zipfile64

Total duration: 5 min 8 sec

Click to see traceback logs
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 321, in __del__
    self.close()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 316, in close
    self._ssl_protocol._start_shutdown()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 590, in _start_shutdown
    self._abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/sslproto.py", line 731, in _abort
    self._transport.abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/selector_events.py", line 680, in abort
    self._force_close(None)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/selector_events.py", line 731, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/base_events.py", line 746, in call_soon
    self._check_closed()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto/build/Lib/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

shihai1991 added a commit to shihai1991/cpython that referenced this pull request Nov 5, 2020
* master:
  bpo-42260: Add _PyInterpreterState_SetConfig() (pythonGH-23158)
  Disable peg generator tests when building with PGO (pythonGH-23141)
  bpo-1635741: _sqlite3 uses PyModule_AddObjectRef() (pythonGH-23148)
  bpo-1635741: Fix PyInit_pyexpat() error handling (pythonGH-22489)
  bpo-42260: Main init modify sys.flags in-place (pythonGH-23150)
  bpo-1635741: Fix ref leak in _PyWarnings_Init() error path (pythonGH-23151)
  bpo-1635741: _ast uses PyModule_AddObjectRef() (pythonGH-23146)
  bpo-1635741: _contextvars uses PyModule_AddType() (pythonGH-23147)
  bpo-42260: Reorganize PyConfig (pythonGH-23149)
  bpo-1635741: Add PyModule_AddObjectRef() function (pythonGH-23122)
  bpo-42236: os.device_encoding() respects UTF-8 Mode (pythonGH-23119)
  bpo-42251: Add gettrace and getprofile to threading (pythonGH-23125)
  Enable signing of nuget.org packages and update to supported timestamp server (pythonGH-23132)
  Fix incorrect links in ast docs (pythonGH-23017)
  Add _PyType_GetModuleByDef (pythonGH-22835)
  Post 3.10.0a2
  bpo-41796: Call _PyAST_Fini() earlier to fix a leak (pythonGH-23131)
  bpo-42249: Fix writing binary Plist files larger than 4 GiB. (pythonGH-23121)
  bpo-40077: Convert mmap.mmap static type to a heap type (pythonGH-23108)
  Python 3.10.0a2
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
On Unix, the os.device_encoding() function now returns 'UTF-8' rather
than the device encoding if the Python UTF-8 Mode is enabled.