Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V2 of refactor LiveText to use diff-match-patch #11639

Merged
merged 62 commits into from
Dec 31, 2020

Conversation

codeofdusk
Copy link
Contributor

@codeofdusk codeofdusk commented Sep 19, 2020

Link to issue number:

Supersedes #11500. Closes #3200. Closes #11498.

Summary of the issue:

LiveText objects use difflib to calculate text changes:

  • Difflib is slow, making 10,000-line objects completely unusable and slowing down or locking up NVDA for other large objects.
  • Comparisons are made at the line level, so edits in the middle of lines cause characters to the right of the caret to be read out.

Description of how this pull request fixes the issue:

This PR changes diffing functions to operate at the string rather than line level and adds diff-match-patch (DMP) as an optional diffing algorithm for LiveText objects. It is anticipated that DMP will become the default in a future NVDA release pending positive user testing.

Unlike #11500, this PR does not import or dynamically link to DMP due to licensing issues. Instead, a Python application is run in another process that calls the DMP extension, and communicates over standard IO.

Testing performed:

  • Tested this approach, difflib, and a dynamically linked DMP Python extension on very large (millions of characters) blocks of text. Observed that while DMP was able to return diffs in about 10 seconds (even with insignificant IPC overhead), Difflib ran for several minutes (I aborted the diff before it could complete).
  • Tested deleting and inserting characters in the middle of a line both with difflib and DMP. With difflib the characters to the right of the caret were read out (as stated in better support for editing characters in the middle of lines in command prompt/terminal #3200). With DMP only the changed characters were read.

Known issues with pull request:

None.

Change log entry:

== Bug Fixes ==

== Changes For Developers ==

@codeofdusk
Copy link
Contributor Author

Cc @camlorn, @feerrenrut, @LeonarddeR, @michaelDCurran, @tspivey.

@feerrenrut
Copy link
Contributor

Thanks for the update @codeofdusk.

I don't think it is a good idea to try to rush this, it would be nice to get it into 2020.4, but we have to be realistic about the amount of work left to do. If it is ready in time we will include it.

  • We privately discussed creating the separate application in Python, did you explore this? If so, what stopped this approach?
    • It's great that you have been able to verify this approach using the C++ application.
    • Looking at the source quickly, there will be some work to turn this into something ready for production, using python here will likely save you a lot of time.
  • For the sake of simplicity in this initial approach I'd suggest using standard input / output instead of named pipes, especially since there seems to be open questions about them. See https://stackoverflow.com/a/8475367 for more information. I can't speak to the performance differences with certainty, though I strongly suspect they performance costs will be very similar. I wouldn't spend time looking at it now, but here are some alternative IPC approaches to consider for future reference https://docs.microsoft.com/en-us/windows/win32/ipc/interprocess-communications

@codeofdusk

This comment has been minimized.

@LeonarddeR
Copy link
Collaborator

Is this winapi module you used also part of the c++ code?
I agree with @feerrenrut that it isn't very helpful to introduce named pipes as a communication layer to NVDA that is very different from other ways of inter process communication like stdin/stdout and rpc. I'd prefer one of those.
If there is c++ code working with named pipes, it shouldn't be that difficult to make it work with RPC, I assume.

@codeofdusk

This comment has been minimized.

@feerrenrut
Copy link
Contributor

Sorry, I was a little unclear from the linked answer on where the issues with named pipes are? They don't seem to be discussed at all?

The open questions are in the current diff for this PR, in particular synchronization for their creation. The link was showing you how to use standard input from python.

Using Python for what amounts to I/O seems like a lot of extra overhead (I'd just be calling a DMP Python extension written in C++).

The benefits of doing this (that I can immediately identify):

  • is that you don't have to get the memory management correct and robust in C++, currently there are problems.
  • you don't have to do the work to build the executable, a python script is enough.
  • you can match the IPC mechanism on the DMP side with the NVDA side, you don't have to learn two ways of doing things.

I think there is likely to be minimal overhead for this. Most of the work will be done by the native parts of python anyway.

Could you please provide some additional comments on this? Where is work necessary, and how would Python save me time?

If you still wish me to review that code then I'll set some time aside to write things up properly, in the meantime you could look at some modern C++ examples specifically on memory management.

@codeofdusk

This comment has been minimized.

@feerrenrut
Copy link
Contributor

Might I want to consider looking into RPC? (it seems like it could be easier, but might be "too intimate" for licensing...)

Perhaps, eventually. Please start by getting this delivered using standard IO.

Is there an equivalent of Valgrind on Windows? How do you suggest I debug memory leaks/access problems in general?

This is not necessary, it's a very simple program, but it is not following modern approaches, this makes it error prone. Judging by the minimalism of the code, I suspect it doesn't handle failures well. To learn more, the following is a good resource: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines

Please try to write this in python instead.

@codeofdusk

This comment has been minimized.

@feerrenrut
Copy link
Contributor

a Python implementation of NVDA_dmp, that depends on the DMP C++ extension.

Just to be clear, this is using JoshData/diff_match_patch-python right?

However, calls to sys.stdout.buffer.write don't seem to work. How should I write binary data to stdout?

What happens? How do these fail? There are some other suggestions on how to do this on stackoverflow, but it would be good to understand why this fails.
Though note, from https://stackoverflow.com/a/8475367 which states that communicate should be used:

Use communicate() rather than .stdin.write, .stdout.read or .stderr.read
to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.

Also, how do I call the included Python interpreter as a subprocess? sys.executable from an installed copy of NVDA just points to nvda.exe

Yes, we'll have to add a way to run an external script to NVDA. Eventually it should be a separate executable, but for now you could add an option to NVDA to execute a script with execfile('file.py'). eg you call nvda.exe --exec file.py This will allow you to test the approach.

Finally, an alternative would be to create a python extension using the LGPL-2.1 licensed opencor/libxdiff or similar libraries (eg xdiff from libgit). This still uses the same Meyers diff algorithm used by diff match patch. The advantage is it could be run in the same process.

@lukaszgo1
Copy link
Contributor

Yes, we'll have to add a way to run an external script to NVDA. Eventually

Can't we just compile the Python script to exe similar to how nvda_eoaProxy and nvda_slave is build?

it should be a separate executable, but for now you could add an option to NVDA to execute a script with execfile('file.py'). eg you call nvda.exe --exec file.py This will allow you to test the approach.

I guess it would be simpler to add such option to nvda_slave.

@codeofdusk

This comment has been minimized.

@feerrenrut
Copy link
Contributor

communicate blocks until the process returns

Ok, that is something to deal with. That said the warning to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process should be noted.

Looking at the sample you posted:

  • have you confirmed this does what you expect if it is not binary encoded?
  • can you print some messages to show what stage nvda_dmp is at?
  • It would be good to confirm that nvda_dmp is doing what it looks like it should be doing.
  • Do you know for sure that the shell / idle wherever you are running this code, will actually show binary data piped to stdout?
  • You might instead capture it and print the string?

@codeofdusk

This comment has been minimized.

@codeofdusk

This comment has been minimized.

@AppVeyorBot
Copy link

See test results for failed build of commit 1e65d27cef

@codeofdusk
Copy link
Contributor Author

@feerrenrut I've updated the PR and description.

This also adds the DMP Python extension to the repo, but I'm unsure if it should be in a different location.
@AppVeyorBot
Copy link

@codeofdusk
Copy link
Contributor Author

The build is failing, but I can't see why. What's going on here? Cc @feerrenrut, @LeonarddeR, @michaelDCurran.

@michaelDCurran
Copy link
Member

michaelDCurran commented Sep 26, 2020 via email

feerrenrut
feerrenrut previously approved these changes Dec 1, 2020
Copy link
Contributor

@feerrenrut feerrenrut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok. I think it will need significant testing, so I'm not comfortable to include it at this late stage in the 2020.4 release. I'll merge it for release with 2021.1 as soon as we branch off for the beta.

@feerrenrut feerrenrut added this to the 2021.1 milestone Dec 1, 2020
@feerrenrut

This comment has been minimized.

@feerrenrut feerrenrut added the deprecated/2021.1 Label used to track deprecations due for removal in the 2021.1 release label Dec 31, 2020
feerrenrut
feerrenrut previously approved these changes Dec 31, 2020
Copy link
Contributor

@feerrenrut feerrenrut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this @codeofdusk

feerrenrut added a commit to codeofdusk/nvda that referenced this pull request Dec 31, 2020
@feerrenrut feerrenrut merged commit 9889285 into nvaccess:master Dec 31, 2020
@codeofdusk codeofdusk deleted the livetext-dmp-cpp branch December 31, 2020 10:56
@codeofdusk
Copy link
Contributor Author

There seems to be a deadlock somewhere in this code, making autoread stop working sometimes in some consoles. I've had a tough time finding it. Hopefully someone will report the issue with a clear set of steps to reproduce.

@zstanecic
Copy link
Contributor

zstanecic commented Jan 1, 2021 via email

@feerrenrut
Copy link
Contributor

@codeofdusk see if you can get a stack trace / log for it. That will give a lot of clues about what is going on.

@feerrenrut
Copy link
Contributor

It looks like the 'in nvda process' code doesn't flush stdin / stdout. This stack overflow answer suggests both processes need to flush.

More research is required to determine if it necessary to flush for each "step" in the communication.

@codeofdusk
Copy link
Contributor Author

I've found a set of steps to reproduce on my particular system. Adding flush calls at various points in the communication process does not appear to change behaviour in any way.

@codeofdusk
Copy link
Contributor Author

#11998 includes a possible patch for the described diff hang issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecated/2021.1 Label used to track deprecations due for removal in the 2021.1 release
Projects
None yet
7 participants