Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide on recommendation for edit distance algorithm #149

Open
3 tasks
mweidling opened this issue Jan 23, 2023 · 2 comments
Open
3 tasks

Decide on recommendation for edit distance algorithm #149

mweidling opened this issue Jan 23, 2023 · 2 comments

Comments

@mweidling
Copy link
Collaborator

In the QA specs we claim that we provide a recommendation for the edit distance algorithm to be used.

For this we have to

  • collect the most popular / widely used algorithms for this task
  • evaluate them
  • decide which one(s) we recommend
@kba kba mentioned this issue Jan 23, 2023
@bertsky
Copy link

bertsky commented Jan 23, 2023

FYI I did some comprehensive testing of Python-available character alignment packages roughly 5 yrs ago, and the most important result was that you absolutely cannot trust but need to test them, thoroughly. Issues:

  • robustness (crashes on rare circumstances or large problems)
  • stack limitation (i.e. bad implementation of the recursion)
  • average/benign performance
  • worst-case memory and time complexity
  • whether or not you can get a coarse, cheap estimate first - or can pass some upper limit on the distance
  • whether or not the actual alignment (or just the score) is returned
  • whether or not multiple (or just the single-best) result is returned
  • whether or not combining codepoints are glued to the base characters (i.e. grapheme clusters), or are treated as independent (in the alignment and in the distance metric)
  • whether and how efficient and parallel 1:n or even n:m variants are implemented
  • whether or not custom weights or cost functions can be passed in (but we could always make that a post-processing step as long as we get the alignment path itself)

Tools besides rapidfuzz and jellyfish you might want to list:

@mweidling
Copy link
Collaborator Author

That's very valuable information, thank you! I will keep this in mind when we think about a recommendation (if we can break it down to one, actually).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants