Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReadConsoleW fails with non-BMP characters #4628

Closed
eryksun opened this issue Feb 18, 2020 · 10 comments · Fixed by #15783
Closed

ReadConsoleW fails with non-BMP characters #4628

eryksun opened this issue Feb 18, 2020 · 10 comments · Fixed by #15783
Assignees
Labels
Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. In-PR This issue has a related PR Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Conhost For issues in the Console codebase
Milestone

Comments

@eryksun
Copy link

eryksun commented Feb 18, 2020

Environment

Microsoft Windows [Version 10.0.18363.657]
conhost.exe builtin console, V2
wt.exe terminal, V0.9.433.0

Steps to reproduce

readsp.zip

Extract, compile and run the attached readsp.c program under the V2 console. This programs exercises directly writing a non-BMP character to the input buffer via WriteConsoleInputW and reading it back via ReadConsoleW, first with echo enabled and then with it disabled. Run the program with -v (e.g. readsp -v) to show the input key-event records that each step tries to read. It tries a normal key down/up event pair as well as the Alt+Numpad sequence that the console uses for pasted text. The latter uses 6 key events per wide-character and thus 12 key events for a surrogate pair. I included the paste sequence to try to clarify a related issue in which manually pasting a non-BMP character produces a different incorrect result, but it didn't help. I'll discuss that related issue in a comment, in case it's all due to the same underlying issue.

Expected behavior

ReadConsoleW should be able to correctly read supplementary-plane (i.e. non-BMP) characters such as "😞" (U+1F61E), regardless of whether they are typed or pasted into the terminal window, or written directly to the input buffer, or whether echo is enabled. Since the wide-character API uses 16-bit characters, the non-BMP character should be read as a UTF-16 surrogate pair, e.g. U+1F61E should be encoded as {0xD83D, 0xDE1E}.

ReadConsoleW works as expected with the legacy (V1) console. For example:

Test normal with ECHO ON
😞
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e        "

Test paste with ECHO ON
😞
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e        "

Test normal with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

It almost works correctly with Windows Terminal version 0.9.433.0:

Test normal with ECHO ON
��
stream (4) = L"\ud83d\ude1e\u000d\u000a"
screen = L"\ufffd\ufffd        "

Test paste with ECHO ON
��
stream (4) = L"\ud83d\ude1e\u000d\u000a"
screen = L"\ufffd\ufffd        "

Test normal with ECHO OFF
stream (4) = L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4) = L"\ud83d\ude1e\u000d\u000a"

Apparently a cooked read under Windows Terminal has a bug in which a non-BMP character gets echoed as two replacement characters, U+FFFD. But at least the ReadConsoleW result is correct.

Actual behavior

In the output below, not only does the cooked read fail with ERROR_INVALID_PARAMETER (87) when echo is enabled, but the echoed text contains only the first surrogate code of the surrogate pair, 0xD83D.

Test normal with ECHO ON
�
ReadConsoleW failed (87)
screen: L"\ud83d         "

Test paste with ECHO ON
�
ReadConsoleW failed (87)
screen: L"\ud83d         "

Test normal with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Since it's not a valid Unicode character, I've replaced this lone surrogate code in the pasted text with the Unicode replacement character, U+FFFD, but the "screen" text, which gets read directly from the screen buffer, shows that the code displayed on the console is 0xD83D.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Feb 18, 2020
@eryksun
Copy link
Author

eryksun commented Feb 18, 2020

A seemingly related issue is that when a non-BMP character is manually pasted into the console, ReadConsoleW echoes and returns only the first code in the UTF-16 encoded surrogate pair. For example, with "😞" only 0xD83D is returned. Here's a simple example in Python:

>>> from win32console import *
>>> h = GetStdHandle(STD_INPUT_HANDLE)
>>> h.ReadConsole(4)
�
'\ud83d\r\n'

(As before, I've replaced the echoed 0xD83D surrogate code with U+FFFD in order to avoid problems with pasting an invalid code point.)

This prevents pasting supplementary-plane characters on the command line in the CMD shell, which relies on the console's cooked read for its command-line interface. For example:

Legacy console:

C:\>echo 😞
😞

Windows Terminal:

C:\>echo ��
😞

V2 console:

C:\>echo �
�

PowerShell uses low-level ReadConsoleInputW instead of ReadConsoleW, so it's not subject to this bug, though external programs that inherit PowerShell's console may be.

@DHowett-MSFT DHowett-MSFT added Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conhost For issues in the Console codebase labels Feb 18, 2020
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Feb 18, 2020
@DHowett-MSFT DHowett-MSFT removed the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Feb 18, 2020
@DHowett-MSFT DHowett-MSFT added this to the 21H1 milestone Feb 18, 2020
@DHowett-MSFT
Copy link
Contributor

@miniksa /cc for legacy v1 compat break

@BDisp
Copy link

BDisp commented Dec 9, 2022

I'm having this issue gui-cs/Terminal.Gui#2250 (comment). Please see the difference on displaying non-BMP code points in a Windows Console Host and in a Windows Terminal. The non-BMP, 𝔽𝕆𝕆𝔹𝔸 appear with an additional space to make it aligned, but in the Windows Terminal that additional space are ignored causing misalignment.

@DHowett
Copy link
Member

DHowett commented Dec 9, 2022

I'm having this issue gui-cs/Terminal.Gui#2250 (comment).

Is Terminal.Gui using ReadConsoleW to display these characters? ReadConsoleW is primarily concerned with user input, not output to the screen. 😄

@BDisp
Copy link

BDisp commented Dec 9, 2022

Is Terminal.Gui using ReadConsoleW to display these characters? ReadConsoleW is primarily concerned with user input, not output to the screen. 😄

No, of course, it's using WriteConsoleOutputW to output to the console :-)
But it was related with non-BMP and I used it, sorry.

@tig
Copy link

tig commented May 8, 2023

Can we get a call on whether the WT team thinks this issue is mis-named and should be about the fact that WT does not seem to deal with non-BMP codepoints correctly when WriteConsoleOutputW and ReadConsoleW are used?

Or should we create a new issue?

As of now, Terminal.Gui needs to disable rendering of all non-BMP codepoints when using these APIs.

Note that when Terminal.Gui uses the .NET Console API, these codepoints work fine.

@zadjii-msft zadjii-msft added the Needs-Discussion Something that requires a team discussion before we can proceed label May 8, 2023
@DHowett
Copy link
Member

DHowett commented May 8, 2023

Alright, so I've got a couple updates.

  1. @eryksun As of (probably) @lhecker's work on the conhostv2 input stack, your test case exhibits the "almost works correctly" behavior you observed in Windows Terminal, but in conhost. I'm calling this a success: ReadConsoleW no longer returns an error with your test application.
    1. I understand why it's failing to display the surrogate pairs on the input line when ECHO is enabled. An ancient, terrible function we call WriteCharsLegacy took a dependency on one codepoint occupying one wchar_t (rather than one code unit) during some Unicode refactoring we did a while ago.
    2. I'll write up notes on that and include it in our "Fix WriteCharsLegacy" meta-issue; we've been tracking good progress against the beast for the past couple of years.
  2. Leonard's work on [Epic] Text Buffer rewrite  #8000 and the adjacent cluster of issues is going to ensure that when you do read back the contents of the screen in the impacted region you do get roughly the right text out of it, within the limits of the Win32 console API.
  3. @tig I can't completely figure out how your issue is related to this one, but I will try. This issue is not mis-titled, and it very specifically (and fairly singularly) reports that ReadConsoleW does not work properly in conhostv2 when you insert something that contains surrogate pairs or requires multiple code units. As in 1 above, that issue is now fixed.
    1. If you are reporting that you cannot reliably determine the size of a character: get in line! we hate it too! wcwidth and wcswidth are woefully inadequate. There is no way to ask a terminal emulator how big a character is or was (and I don't think there should be, because what if the terminal emulator is tmux and it has many different heads? Or what if the font changes, and therefore the perceived size of the glyph would change, but the application can't know about it because it is font-agnostic? What about over a slow link like SSH, where answerback could time out and then the application would be no better off (though perhaps worse off, because it would need to handle a late synchronous reply)?)
    2. If you are reporting that some characters have a wrong or unexpected width: We routinely regenerate the display sizes of characters based on the values found in the Unicode UCD. We are up to 15.0. We use it as the true normative source of character widths.
    3. If you are reporting that WriteConsoleOutputW does not work properly: WriteConsoleOutput works entirely cellwise, and there is no guarantee that you could ever emit a non-BMP character that requires a surrogate pair using it. I'm sorry to say that the cellwise APIs cannot represent the full gamut of text. For example: U+1F574 MAN IN BUSINESS SUIT LEVITATING only occupies one column but requires two code units. The antiquated mapping of columns to code units just doesn't account for this case. 😄 However, writing known-to-be-wide characters that require two code units by way of two CHAR_INFO structs should work. If that is not working, it would be worth filing a new issue for it.
      1. I'd love to entertain discussion about how to make the old cellwise APIs work for non-cellwise content; with the work being done around text measurement in terminal emulators, it might be a nice time for us to modernize (or at least discuss modernization.)

@tig
Copy link

tig commented May 9, 2023

It's "iii".

I get it now and will back off this Issue.

What Terminal.Gui is going to do (eventually) is retarget our WindowsDriver away from using WriteConsoleOutput to using the new Console Virtual Terminal Sequences API, which I assume with both let us support non-BMP and will still be nicely performant.

@tig
Copy link

tig commented May 9, 2023

BTW, I appreciate your thorough responses!

@o-sdn-o
Copy link

o-sdn-o commented May 10, 2023

Why the following behavior is expected (copy from description)

Test normal with ECHO ON
😞
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e        "

Test paste with ECHO ON
😞
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e        "

Test normal with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

instead of

Test normal with ECHO ON
😞 
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e\u0000       "

Test paste with ECHO ON
?? 
stream (4): L"??\u000d\u000a"
screen: L"??\u0000       "

Test normal with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4): L"??\u000d\u000a"
Test paste with ECHO OFF
input records:
    vk: 12, kd: 1, ks: 0002, uc: 0000
    vk: 66, kd: 1, ks: 0002, uc: 0000
    vk: 66, kd: 0, ks: 0002, uc: 0000
    vk: 63, kd: 1, ks: 0002, uc: 0000
    vk: 63, kd: 0, ks: 0002, uc: 0000
    vk: 12, kd: 0, ks: 0000, uc: d83d  <- broken pair + '?' inserted
    vk: 12, kd: 1, ks: 0002, uc: 0000
    vk: 66, kd: 1, ks: 0002, uc: 0000
    vk: 66, kd: 0, ks: 0002, uc: 0000
    vk: 63, kd: 1, ks: 0002, uc: 0000
    vk: 63, kd: 0, ks: 0002, uc: 0000
    vk: 12, kd: 0, ks: 0000, uc: de1e  <- broken pair + '?' inserted
    vk: 00, kd: 1, ks: 0000, uc: 000d
    vk: 00, kd: 0, ks: 0000, uc: 000d
stream (4): L"??\u000d\u000a"

despite the fact that the input buffer contains two Alt+Num'ed question marks and the halves of the surrogate pair are not consecutive?

Should the nonzero uc payload override all previous Alt+Num input when Alt is released?

Can a keyboard event with vk=VK_NUMPADx exist without ks=NUMLOCK_ON | ...? (ks=LEFT_ALT_PRESSED in the attached readsp.c)

@lhecker lhecker self-assigned this May 15, 2023
@DHowett DHowett assigned lhecker and unassigned DHowett and lhecker May 17, 2023
@carlos-zamora carlos-zamora removed the Needs-Discussion Something that requires a team discussion before we can proceed label Jun 5, 2023
@carlos-zamora carlos-zamora modified the milestones: 22H2, Backlog Jun 5, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot added the In-PR This issue has a related PR label Aug 2, 2023
DHowett pushed a commit that referenced this issue Aug 25, 2023
This massive refactoring has two goals:
* Enable us to go beyond UCS-2 support for input editing
* Bring clarity into `COOKED_READ_DATA`'s inner workings

Unfortunately, over time, knowledge about its exact operation was lost.
While the new code is still complex it reduces the amount of code by 4x
which will make preserving knowledge hopefully significantly easier.

The new implementation is simpler and slower than the old one in a way,
because every time the input line is modified it's rewritten to the text
buffer from scratch. This however massively simplifies the underlying
algorithm and the amount of state that needs to be tracked and results
in a significant reduction in code size. It also makes it more robust,
because there's less code now that can be incorrect.

This "optimization laziness" can be afforded due the recent >10x
improvements to `TextBuffer`'s text ingestion performance.
For short inputs (<1000 characters) I still expect this implementation
to outperform the conhost from the past.
It has received one optimization already however: While reading text
from the `InputBuffer` we'll now defer writing into the `TextBuffer`
until we've stopped reading. This improves the overhead of pasting text
from O(n^2) to O(n), which is immediately noticeable for inputs >100kB.

Resizing the text buffer still ends up corrupting the input line
however, which unfortunately cannot be fixed in `COOKED_READ_DATA`.
The issue occurs due to bugs in `TextBuffer::Reflow` itself, as it
misplaces the cursor if the prompt is on the last line of the buffer.

Closes #1377
Closes #1503
Closes #4628
Closes #4975
Closes #5033
Closes #8008

This commit is required to fix #797

## Validation Steps Performed
* ASCII input ✅
* Chinese input (中文維基百科) ❔
  * Resizing the window properly wraps/unwraps wide glyphs ❌
    Broken due to `TextBuffer::Reflow` bugs
* Surrogate pair input (🙂) ❔
  * Resizing the window properly wraps/unwraps surrogate pairs ❌
    Broken due to `TextBuffer::Reflow` bugs
* In cmd.exe
  * Create 2 file: "a😊b.txt" and "a😟b.txt"
  * Press tab: Autocompletes "a😊b.txt" ✅
  * Navigate the cursor right past the "a"
  * Press tab twice: Autocompletes "a😟b.txt" ✅
* Backspace deletes preceding glyphs ✅
* Ctrl+Backspace deletes preceding words ✅
* Escape clears input ✅
* Home navigates to start ✅
* Ctrl+Home deletes text between cursor and start ✅
* End navigates to end ✅
* Ctrl+End deletes text between cursor and end ✅
* Left navigates over previous code points ✅
* Ctrl+Left navigates to previous word-starts ✅
* Right and F1 navigate over next code points ✅
  * Pressing right at the end of input copies characters
    from the previous command ✅
* Ctrl+Right navigates to next word-ends ✅
* Insert toggles overwrite mode ✅
* Delete deletes next code point ✅
* Up and F5 cycle through history ✅
  * Doesn't crash with no history ✅
  * Stops at first entry ✅
* Down cycles through history ✅
  * Doesn't crash with no history ✅
  * Stops at last entry ✅
* PageUp retrieves the oldest command ✅
* PageDown retrieves the newest command ✅
* F2 starts "copy to char" prompt ✅
  * Escape dismisses prompt ✅
  * Typing a character copies text from the previous command up
    until that character into the current buffer (acts identical
    to F3, but with automatic character search) ✅
* F3 copies the previous command into the current buffer,
  starting at the current cursor position,
  for as many characters as possible ✅
  * Doesn't erase trailing text if the current buffer
    is longer than the previous command ✅
  * Puts the cursor at the end of the copied text ✅
* F4 starts "copy from char" prompt ✅
  * Escape dismisses prompt ✅
  * Erases text between the current cursor position and the
    first instance of a given char (but not including it) ✅
* F6 inserts Ctrl+Z ✅
* F7 without modifiers starts "command list" prompt ✅
  * Escape dismisses prompt ✅
  * Minimum size of 40x10 characters ✅
  * Width expands to fit the widest history command ✅
  * Height expands up to 20 rows with longer histories ✅
  * F9 starts "command number" prompt ✅
  * Left/Right paste replace the buffer with the given command ✅
    * And put cursor at the end of the buffer ✅
  * Up/Down navigate selection through history ✅
    * Stops at start/end with <10 entries ✅
    * Stops at start/end with >20 entries ✅
    * Wide text rendering during pagination with >20 entries ✅
  * Shift+Up/Down moves history items around ✅
  * Home navigates to first entry ✅
  * End navigates to last entry ✅
  * PageUp navigates by 20 items at a time or to first ✅
  * PageDown navigates by 20 items at a time or to last ✅
* Alt+F7 clears command history ✅
* F8 cycles through commands that start with the same text as
  the current buffer up until the current cursor position ✅
  * Doesn't crash with no history ✅
* F9 starts "command number" prompt ✅
  * Escape dismisses prompt ✅
  * Ignores non-ASCII-decimal characters ✅
  * Allows entering between 1 and 5 digits ✅
  * Pressing Enter fetches the given command from the history ✅
* Alt+F10 clears doskey aliases ✅
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. In-PR This issue has a related PR Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Conhost For issues in the Console codebase
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants