Add an ability to convert result of surface() method to normalized variants by specifying a projection #230

Closed
eiennohito opened this issue May 12, 2023 · 7 comments · Fixed by #234

@eiennohito
Collaborator

A projection can be one of the normalization transforms specified by chiTra or a user-supplied callable.

The original surface will always be available as raw_surface().

Specification

  • Add a projection argument to Dictionary.create that accepts either a string or a callable.
  • When a projection is passed, Morpheme.surface() returns the projected string instead of its usual result.
  • Projected strings are not cached; the projection is recomputed on every call to surface().
  • The original string is available through the raw_surface() method on Morpheme.
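
The behaviour in the list above can be sketched with a toy model. This is only an illustration of the proposed semantics, not Sudachi.rs or sudachipy code: ToyMorpheme, resolve_projection, NAMED_PROJECTIONS, and the two projection names in the table are hypothetical names invented for this sketch.

```python
from typing import Callable, Dict, Union


class ToyMorpheme:
    """Hypothetical stand-in for a morpheme; not the sudachipy class."""

    def __init__(self, raw: str, normalized: str,
                 project: Callable[["ToyMorpheme"], str]) -> None:
        self._raw = raw
        self._normalized = normalized
        self._project = project

    def raw_surface(self) -> str:
        # Always the original text, whatever projection is active.
        return self._raw

    def normalized_form(self) -> str:
        return self._normalized

    def surface(self) -> str:
        # Not cached: the projection runs again on every call.
        return self._project(self)


Projection = Callable[[ToyMorpheme], str]

# Assumed names for illustration; the real set of strings may differ.
NAMED_PROJECTIONS: Dict[str, Projection] = {
    "surface": lambda m: m.raw_surface(),
    "normalized": lambda m: m.normalized_form(),
}


def resolve_projection(projection: Union[str, Projection, None] = None) -> Projection:
    """Turn a string name or a callable into a projection function."""
    if projection is None:
        return NAMED_PROJECTIONS["surface"]
    if callable(projection):
        return projection
    try:
        return NAMED_PROJECTIONS[projection]
    except KeyError:
        raise ValueError(f"unknown projection: {projection!r}") from None


# Usage: the same morpheme data under three different projections.
plain = ToyMorpheme("附属", "付属", resolve_projection())
by_name = ToyMorpheme("附属", "付属", resolve_projection("normalized"))
custom = ToyMorpheme("附属", "付属", resolve_projection(lambda m: m.raw_surface() + "!"))
```

In this sketch the string and callable cases share one code path after resolution, so surface() does not need to know which kind of projection was passed.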
@eiennohito eiennohito self-assigned this May 12, 2023
@hiroshi-matsuda-rit

@eiennohito
Collaborator Author

In the long term, though, it would be better to submit a patch to Hugging Face transformers so that all Japanese tokenizers accept **kwargs with options.

eiennohito added a commit that referenced this issue Aug 14, 2023
also, add projection functionality

Fixes #230
@eiennohito
Collaborator Author

eiennohito commented Aug 14, 2023

@hiroshi-matsuda-rit see https://github.com/WorksApplications/sudachi.rs/blob/da5aca62e3cef8892ceaf64e7ac4f9ef25c2f8d1/python/tests/test_projection.py, any comments on the functionality?
Also see #234

@hiroshi-matsuda-rit

@eiennohito Thank you for all your efforts on this issue. The new implementation looks very nice.
In SudachiTra's WordFormTypes, some of the word form types fall back to the surface field.
I don't think we need all of the fallback logic in WordFormTypes.
However, normalized_and_surface, normalized_nouns, and dictionary_and_surface are used in SudachiTra or in Megagon Labs' ELECTRA model, so I think it would be better for compatibility if these were supported.
(I'm sorry that I don't have time to review the entire source code to identify the appropriate implementation points.)

@eiennohito
Collaborator Author

No problem. I missed the requirement about supporting the chiTra forms. I will add that implementation as well.

@eiennohito
Collaborator Author

@hiroshi-matsuda-rit I will add support only for normalized_and_surface, normalized_nouns, and dictionary_and_surface. Please tell me if other types are also needed (they require more work).

@hiroshi-matsuda-rit

hiroshi-matsuda-rit commented Aug 15, 2023

Excellent! Thanks for the additional implementations. (I think those three options are enough for us.) @eiennohito
