Releases: WorksApplications/sudachi.rs

0.6.8

14 Dec 01:45

Highlights

Surface Projections

  • For chiTra compatibility, SudachiPy can now directly produce different token forms in the surface field.
  • The original surface is accessible via the Morpheme.raw_surface() method.
  • The projection can be customized dictionary-wide, via a Config object passed at dictionary creation, or for a single pre-tokenizer.
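
A minimal sketch of selecting a projection from Python. It assumes SudachiPy 0.6.8 or later with a system dictionary package installed, that Config accepts a projection keyword, and that "normalized" is a valid projection name. The helper name surfaces_with_projection is hypothetical.

```python
def surfaces_with_projection(text, projection="normalized"):
    """Return (projected surface, raw surface) pairs for each morpheme.

    Assumption: Config(projection=...) and Dictionary(config=...) are
    accepted by the installed SudachiPy version.
    """
    # Imported lazily so the sketch can be loaded without SudachiPy installed.
    from sudachipy import Config, Dictionary

    dic = Dictionary(config=Config(projection=projection))
    tokenizer = dic.create()
    return [(m.surface(), m.raw_surface()) for m in tokenizer.tokenize(text)]
```

Calling surfaces_with_projection(some_text) should then show the projected form next to the original surface for every token.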

0.6.7

15 Feb 07:32
07ad881

Highlights

  • Provide binary wheels for Python 3.11
  • Add Dictionary.lookup() method which allows you to enumerate morphemes from the dictionary without performing analysis.

0.6.6

25 Jul 05:50

Highlights

MacOS

  • Binary builds are universal2
  • Caveat: we don't run tests on ARM because there are no public ARM instances, so builds may break without any warning

0.6.5

21 Jun 01:07

Highlights

  • Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.

0.6.4

16 Jun 00:25

Highlights

  • Remove Python 3.6 support which reached end-of-life status on 2021-12-23
  • OOV handler plugins support user-defined POS, similar to the Java version
  • Added Regex OOV handler

Regex OOV Handler

  • For details, see the Java version changelog
  • In Rust/Python, regexes do not support backtracking or backreferences
  • The maxLength setting defines the maximum length in Unicode codepoints, not in UTF-8 bytes as in Java (will be changed to codepoints later)
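
To illustrate why the unit of maxLength matters, the sketch below compares the two ways of measuring a string. It is plain Python and does not call Sudachi; the helper names are hypothetical.

```python
def length_in_codepoints(text: str) -> int:
    # len() on a Python str counts Unicode codepoints.
    return len(text)


def length_in_utf8_bytes(text: str) -> int:
    # Encoding first counts bytes instead.
    return len(text.encode("utf-8"))


for sample in ["abc", "東京都", "\u0130stanbul"]:
    print(sample, length_in_codepoints(sample), length_in_utf8_bytes(sample))
```

For example, 東京都 is 3 codepoints but 9 UTF-8 bytes, so the same maxLength value covers a different amount of text depending on the unit.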

0.6.3

10 Feb 05:22

Highlights

  • Fixed path resolution algorithm for resources. They are now resolved in the following order (first existing file wins):
    1. Absolute paths stay as they are
    2. Relative to the "path" value of the config file
    3. Relative to the "resource_dir" parameter of the config object during creation
      • For SudachiPy, it is the resource_dir parameter of the Dictionary constructor
    4. Relative to the location of the configuration file
    5. Relative to the current directory
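
The lookup order above can be sketched as follows. This is an illustration of the described order only, not the Sudachi.rs implementation; the function name, its parameters, and the fallback when nothing exists are assumptions.

```python
from pathlib import Path
from typing import Optional, Union

PathArg = Union[str, Path]


def resolve_resource(
    name: PathArg,
    *,
    config_path_value: Optional[PathArg] = None,  # "path" value of the config file
    resource_dir: Optional[PathArg] = None,       # resource_dir parameter
    config_file: Optional[PathArg] = None,        # location of the config file itself
    cwd: Optional[PathArg] = None,                # defaults to the current directory
) -> Path:
    candidate = Path(name)
    # 1. Absolute paths stay as they are.
    if candidate.is_absolute():
        return candidate

    # 2-5. Build the list of base directories in priority order.
    bases = []
    if config_path_value is not None:
        bases.append(Path(config_path_value))
    if resource_dir is not None:
        bases.append(Path(resource_dir))
    if config_file is not None:
        bases.append(Path(config_file).parent)
    bases.append(Path(cwd) if cwd is not None else Path.cwd())

    # The first existing file wins.
    for base in bases:
        resolved = base / candidate
        if resolved.exists():
            return resolved

    # Assumption: the notes do not say what happens when nothing exists,
    # so this sketch returns the input unchanged.
    return candidate
```

A resource that exists in both the "path" directory and resource_dir would resolve to the copy under "path", since that base is tried first.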

Python

  • Dictionary now has __repr__() function which displays absolute paths to dictionaries in use.
  • Dictionary now has pos_of() function which returns a POS tuple for a given POS id.
  • PosMatcher supports set operations
    • union (m1 | m2)
    • intersection (m1 & m2)
    • difference (m1 - m2)
    • negation (~m1)
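
A short sketch of combining matchers with these operators. It assumes Dictionary.pos_matcher() accepts a callable that receives the POS tuple and that the resulting matcher can be called on a morpheme; the helper names and the sample sentence are illustrative.

```python
def build_matchers(dic):
    """Build combined POS matchers from a sudachipy.Dictionary."""
    nouns = dic.pos_matcher(lambda pos: pos[0] == "名詞")
    proper = dic.pos_matcher(lambda pos: pos[1] == "固有名詞")
    verbs = dic.pos_matcher(lambda pos: pos[0] == "動詞")
    return {
        "noun_or_verb": nouns | verbs,   # union
        "proper_noun": nouns & proper,   # intersection
        "common_noun": nouns - proper,   # difference
        "not_noun": ~nouns,              # negation
    }


def main():
    # Requires SudachiPy and a system dictionary to be installed.
    from sudachipy import Dictionary

    dic = Dictionary()
    matchers = build_matchers(dic)
    tokenizer = dic.create()
    for m in tokenizer.tokenize("東京都へ行く"):
        print(m.surface(), matchers["noun_or_verb"](m))
```

Run main() in an environment with SudachiPy installed to see which tokens match.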

0.6.2

09 Dec 05:41

Highlights

  • Fixed analysis differences from 0.5.4
    • Central dot ・ is handled correctly
    • Catch-all OOV handler is no longer used when other OOV handlers can produce results

0.6.1

08 Dec 08:45
e13bf75

Highlights

  • Added fuzzing (see the sudachi-fuzz subdirectory). Sudachi.rs seems to be pretty robust against arbitrary inputs (no crashes or panics)
    • Issues like #182 should no longer occur
  • ~5% analysis speed improvement over 0.6.0
  • Added support for Unicode combining symbols. Sudachi.rs/py should now handle emoji (🎅🏾) and more complex Unicode (İstanbul) much better

Rust

  • Added partial dictionary read functionality: it is now possible to skip reading certain fields if they are not needed
  • Improved startup times, especially for debug builds

Python

  • Morpheme.part_of_speech method now returns a tuple of POS components instead of a list.
  • Partial Dictionary Read
  • HuggingFace PreTokenizer support
    • We provide a built-in HuggingFace-compatible pre-tokenizer
    • API: Dictionary.pre_tokenizer()
    • It is multithreading-compatible and supports customization
  • Memory allocation reuse
    • It is possible to reduce re-allocation overhead by using out parameters which accept MorphemeLists
    • Supported API: Tokenizer.tokenize(), Morpheme.split()
    • This is now the recommended way to use both of these APIs
  • PosMatcher
    • New API for checking if a morpheme has a POS tag from a set
    • Strongly prefer using it instead of string comparison of POS components
  • Performance
    • Greatly decreased cost of accessing POS components
  • len(Morpheme) now returns the length of the morpheme in Unicode codepoints. Use it instead of len(m.surface())
  • Morpheme.split() has a new add_single parameter, which can be used to check whether the split has produced anything
    • E.g. with if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)
    • add_single=True keeps the existing behavior: a list containing the current morpheme is returned
  • Morpheme/MorphemeList now have readable __repr__ and __str__
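
The out parameters and add_single can be combined as in the sketch below. It assumes that passing out=None is accepted and that both calls return the list they wrote into; the helper names are hypothetical.

```python
def split_surfaces(tokenizer, split_mode, texts):
    """Tokenize each text and split every morpheme, reusing result lists."""
    morphemes = None  # MorphemeList reused by tokenize()
    parts = None      # MorphemeList reused by split()
    result = []
    for text in texts:
        morphemes = tokenizer.tokenize(text, out=morphemes)
        for m in morphemes:
            parts = m.split(split_mode, out=parts, add_single=False)
            if parts:
                # The split produced smaller units.
                result.append([p.surface() for p in parts])
            else:
                # Nothing to split: keep the morpheme itself.
                result.append([m.surface()])
    return result


def main():
    # Requires SudachiPy and a system dictionary to be installed.
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary().create(SplitMode.C)
    print(split_surfaces(tokenizer, SplitMode.A, ["東京都へ行く"]))
```

Surfaces are copied into plain Python lists before the next call, because the reused lists are overwritten on every iteration.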

0.6.0

11 Nov 08:03
6a76930

Highlights

  • Full feature parity with Java version
  • ~15% analysis speed improvement over 0.6.0-rc1
  • SudachiPy compatible Python bindings
  • ~30x speed improvement over original SudachiPy

Rust

  • No public API at the moment (contact us if you want to use the Rust version directly; internals will change significantly and names are not finalized)
  • Added dictionary build functionality
  • Added an option to perform analysis without sentence splitting
    • Use it with --split-sentences=no

Python

  • Added bindings for dictionary build (undocumented and not supported as API).
  • sudachipy build and sudachipy ubuild should work again
    • The report on build times and dictionary part sizes may differ from the original SudachiPy

0.6.0-rc1

26 Oct 02:24
1cf62ec
Pre-release

Highlights

  • First release of Sudachi.rs
  • SudachiPy compatible Python bindings
  • ~30x speed improvement over original SudachiPy
  • Dictionary build mode will be done before 0.6.0 final (See #13)

Rust

  • Analysis: feature parity with the Python and Java versions
  • Dictionary build is not supported in rc1
  • ~2x faster than Java version (with sentence splitting)
  • No public API at the moment (contact us if you want to use the Rust version directly; internals will change significantly and names are not finalized)

Python

Known Issues

  • Deprecated SudachiPy APIs:
    • MorphemeList.empty(dict: Dictionary)
      • This also needs a dictionary as an argument.
    • Morpheme.split(mode: SplitMode)
    • Morpheme.get_word_info()
    • Most instance attributes are not exported: e.g. Dictionary.grammar, Dictionary.lexicon.
  • Dictionary build is not supported: sudachipy build and sudachipy ubuild will not work. Until the feature is implemented (#13), please use 0.5.3 in a separate virtual environment.