0.6.1 release (#195)
eiennohito committed Dec 8, 2021
1 parent 0cdee2a commit e13bf75
Showing 18 changed files with 64 additions and 45 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-upload-test.yml
@@ -140,10 +140,10 @@ jobs:
           python-version: ${{ matrix.python-version }}

       - name: Install our module from TestPyPi
-        run: python -m pip install --pre -U -i https://test.pypi.org/simple/ sudachipy[test]
+        run: python -m pip install --pre -U -i https://test.pypi.org/simple/ sudachipy

       - name: Install dependencies
-        run: python -m pip install sudachidict_core
+        run: python -m pip install sudachidict_core tokenizers

       - name: Run test
         working-directory: ./python
19 changes: 17 additions & 2 deletions CHANGELOG.md
@@ -1,3 +1,18 @@
+# [0.6.1](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.1) (2021-12-08)
+
+## Highlights
+* Added fuzzing (see the `sudachi-fuzz` subdirectory); Sudachi.rs seems to be pretty robust against arbitrary inputs (no crashes or panics)
+  * Issues like https://github.com/WorksApplications/sudachi.rs/issues/182 should no longer occur
+* ~5% analysis speed improvement over 0.6.0
+* Added support for Unicode combining symbols; Sudachi.rs/py should now handle emoji (🎅🏾) and more complex Unicode (İstanbul) much better
+
+## Rust
+* Added partial dictionary read functionality; it is now possible to skip reading certain fields if they are not needed
+* Improved startup times, especially for debug builds
+
+## Python
+* See [Python changelog](./python/CHANGELOG.md)
+
 # [0.6.0](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.0) (2021-11-11)
 ## Highlights
 * Full feature parity with Java version
@@ -16,15 +31,15 @@
 * Report on build times and dictionary part sizes can differ from the original SudachiPy


-# [0.6.0-rc1](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.0-rc1) (2021-10-26)
+# [0.6.0-rc1](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.0-rc1) (2021-10-26)
 ## Highlights

 * First release of Sudachi.rs
 * SudachiPy compatible Python bindings
 * ~30x speed improvement over original SudachiPy
 * Dictionary build mode will be done before 0.6.0 final (See #13)

-## Rust
+## Rust

 * Analysis: feature parity with Python and Java version
 * Dictionary build is not supported in rc1
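
The 0.6.1 highlights mention emoji such as 🎅🏾 and text such as İstanbul. As a rough illustration of why such inputs are awkward, the sketch below uses only the Python standard library (not Sudachi itself) to show that one visible symbol can span several code points, and that lowercasing can introduce a combining mark. The helper name `describe` is made up for this example.

```python
import unicodedata


def describe(text: str) -> list:
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# One visible emoji, but two code points: Santa plus a skin-tone modifier.
santa = "\U0001F385\U0001F3FE"  # 🎅🏾
print(len(santa), describe(santa))

# Lowercasing U+0130 (İ) yields "i" followed by a combining dot above,
# so the lowercased word is one code point longer than the original.
city = "\u0130stanbul"  # İstanbul
lowered = city.lower()
print(len(city), len(lowered), describe(lowered[:2]))
```

An analyzer that works on bytes or single code points has to keep such sequences together, which is presumably what the changelog entry refers to.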
14 changes: 7 additions & 7 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions README.ja.md
@@ -4,13 +4,13 @@

 sudachi.rs は日本語形態素解析器 [Sudachi](https://github.com/WorksApplications/Sudachi) のRust実装です。

-[English README](README.md)
+[English README](README.md) [SudachiPy Documentation](https://worksapplications.github.io/sudachi.rs/python)

 ## TL;DR

 SudachiPyとして使うには
 ```bash
-$ pip install --upgrade 'sudachipy>=0.6.0'
+$ pip install --upgrade 'sudachipy>=0.6.1'
 ```

 ```bash
6 changes: 3 additions & 3 deletions README.md
@@ -2,19 +2,19 @@

 [![Rust](https://github.com/WorksApplications/sudachi.rs/actions/workflows/rust.yml/badge.svg)](https://github.com/WorksApplications/sudachi.rs/actions/workflows/rust.yml)

-**2021-11-11 UPDATE**: [First release of SudachiPy-compatible bindings](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.0)
+**2021-12-08 UPDATE**: [0.6.1 Release](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.1)

 Try it:
 ```shell
-pip install --upgrade 'sudachipy>=0.6.0'
+pip install --upgrade 'sudachipy>=0.6.1'
 ```


 <p align="center"><img width="100" src="logo.png" alt="sudachi.rs logo"></p>

 sudachi.rs is a Rust implementation of [Sudachi](https://github.com/WorksApplications/Sudachi), a Japanese morphological analyzer.

-[日本語 README](README.ja.md)
+[日本語 README](README.ja.md) [SudachiPy Documentation](https://worksapplications.github.io/sudachi.rs/python)

 ## TL;DR

2 changes: 1 addition & 1 deletion plugin/input_text/default_input_text/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "default_input_text"
-version = "0.6.1-a1"
+version = "0.6.1"
 authors = ["Works Applications <sudachi@worksap.co.jp>"]
 edition = "2018"
 license = "Apache-2.0"
2 changes: 1 addition & 1 deletion plugin/oov/simple_oov/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "simple_oov"
-version = "0.6.1-a1"
+version = "0.6.1"
 authors = ["Works Applications <sudachi@worksap.co.jp>"]
 edition = "2018"
 license = "Apache-2.0"
2 changes: 1 addition & 1 deletion plugin/path_rewrite/join_katakana_oov/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "join_katakana_oov"
-version = "0.6.1-a1"
+version = "0.6.1"
 authors = ["Works Applications <sudachi@worksap.co.jp>"]
 edition = "2018"
 license = "Apache-2.0"
2 changes: 1 addition & 1 deletion plugin/path_rewrite/join_numeric/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "join_numeric"
-version = "0.6.1-a1"
+version = "0.6.1"
 authors = ["Works Applications <sudachi@worksap.co.jp>"]
 edition = "2018"
 license = "Apache-2.0"
19 changes: 11 additions & 8 deletions python/CHANGELOG.md
@@ -8,25 +8,28 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - `-d` option of `sudachipy` cli does nothing.
 - `sudachipy.Tokenizer` will ignore the provided logger.
 - Ref: [#76]
-- `Morpheme.part_of_speech` method now returns Tuple of POS components instead of a list.
+
+## [0.6.1] - 2021/12/08
+
+- [`Morpheme.part_of_speech`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Morpheme.part_of_speech) method now returns Tuple of POS components instead of a list.
 - [Partial Dictionary Read](https://worksapplications.github.io/sudachi.rs/python/topics/subsetting.html)
   - It is possible to ask for a subset of morpheme fields instead of all fields
-  - Supported API: `Dictionary.create()`, `Dictionary.pre_tokenizer()`
+  - Supported API: [`Dictionary.create()`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Dictionary.create), [`Dictionary.pre_tokenizer()`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Dictionary.pre_tokenizer)
 - HuggingFace PreTokenizer support
   - We provide a built-in HuggingFace-compatible pre-tokenizer
-  - API: `Dictionary.pre_tokenizer()`
+  - API: [`Dictionary.pre_tokenizer()`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Dictionary.pre_tokenizer)
   - It is multithreading-compatible and supports customization
-- Memory allocation reuse
+- [Memory allocation reuse](https://worksapplications.github.io/sudachi.rs/python/topics/out_param.html)
   - It is possible to reduce re-allocation overhead by using `out` parameters which accept `MorphemeList`s
-  - Supported API: `Tokenizer.tokenize()`, `Morpheme.split()`
-  - It is now a recommended way to use both those APIs
-- PosMatcher
+  - Supported API: [`Tokenizer.tokenize()`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Tokenizer.tokenize), [`Morpheme.split()`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Morpheme.split)
+  - It is now a recommended way to use both those APIs
+- [PosMatcher](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Dictionary.pos_matcher)
   - New API for checking if a morpheme has a POS tag from a set
   - Strongly prefer using it instead of string comparison of POS components
 - Performance
   - Greatly decreased cost of accessing POS components
 - `len(Morpheme)` now returns the length of the morpheme in Unicode codepoints. Use it instead of `len(m.surface())`
-- `Morpheme.split()` has a new `add_single` parameter, which can be used to check whether the split has produced anything
+- [`Morpheme.split()`](https://worksapplications.github.io/sudachi.rs/python/api/sudachipy.html#sudachipy.Morpheme.split) has a new `add_single` parameter, which can be used to check whether the split has produced anything
   - E.g. with `if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)`
   - With `add_single=True`, a split that produces nothing returns a list containing the current morpheme; this is the current behavior
 - `Morpheme`/`MorphemeList` now have readable `__repr__` and `__str__`
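
The `out` parameter and the `add_single=False` idiom from the changelog can be illustrated without SudachiPy installed. The sketch below is a plain-Python stand-in: `split_units` and its "/"-based splitting rule are hypothetical and are not SudachiPy's API or implementation. Only the calling pattern (reuse one list, then test the result for truthiness) mirrors the changelog entry.

```python
from typing import List, Optional


def split_units(text: str, out: Optional[List[str]] = None,
                add_single: bool = True) -> List[str]:
    """Hypothetical splitter: breaks `text` on '/' and fills `out` in place."""
    result = out if out is not None else []
    result.clear()  # reuse the caller's list instead of allocating a new one
    parts = text.split("/")
    if len(parts) > 1:
        result.extend(parts)
    elif add_single:
        result.append(text)  # nothing to split: hand back the input itself
    return result


buf: List[str] = []  # allocated once, reused for every call
for word in ["選挙/管理/委員/会", "東京"]:
    if split_units(word, out=buf, add_single=False):
        print(word, "->", buf)
    else:
        print(word, "was not split")
```

With `add_single=False` an empty (falsy) result signals that nothing was split, so the caller does not need to compare the result against the input.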
2 changes: 1 addition & 1 deletion python/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "sudachipy"
-version = "0.6.1-a1"
+version = "0.6.1"
 edition = "2018"
 description = "Python bindings of sudachi.rs, the Japanese Morphological Analyzer"
 homepage = "https://github.com/WorksApplications/sudachi.rs"
21 changes: 11 additions & 10 deletions python/README.md
@@ -1,17 +1,18 @@
 # SudachiPy
 [![PyPi version](https://img.shields.io/pypi/v/sudachipy.svg)](https://pypi.python.org/pypi/sudachipy/)
 [![](https://img.shields.io/badge/python-3.6+-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[Documentation](https://worksapplications.github.io/sudachi.rs/python)

 SudachiPy is a Python version of [Sudachi](https://github.com/WorksApplications/Sudachi), a Japanese morphological analyzer.

-This is not a pure Python implementation, but bindings for the
+This is not a pure Python implementation, but bindings for the
 [Sudachi.rs](https://github.com/WorksApplications/sudachi.rs).

 ## Binary wheels

 We provide binary builds for macOS (10.14+), Windows and Linux only for x86_64 architecture.
 x86 32-bit architecture is not supported and is not tested.
-MacOS source builds seem to work on ARM-based (Aarch64) Macs,
+MacOS source builds seem to work on ARM-based (Aarch64) Macs,
 but this architecture is also not tested and requires installing the Rust toolchain and Cargo.

 More information [here](https://worksapplications.github.io/sudachi.rs/python/wheels.html).
@@ -206,7 +207,7 @@ tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
 There are three editions of Sudachi Dictionary, namely, `small`, `core`, and `full`. See [WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict) for the detail.
-SudachiPy uses `sudachidict_core` by default.
+SudachiPy uses `sudachidict_core` by default.
 Dictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`.
@@ -251,19 +252,19 @@ class Dictionary(config_path=None, resource_dir=None, dict_type=None)
 from sudachipy import Dictionary
 # default: sudachidict_core
-tokenizer_obj = Dictionary().create()
+tokenizer_obj = Dictionary().create()
 # The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
-tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json").create()
+tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json").create()
 # The dictionary specified by `dict_type` will be set.
 tokenizer_obj = Dictionary(dict_type="core").create() # sudachidict_core (same as default)
 tokenizer_obj = Dictionary(dict_type="small").create() # sudachidict_small
 tokenizer_obj = Dictionary(dict_type="full").create() # sudachidict_full
 # The dictionary specified by `dict_type` overrides those defined in the config path.
-# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
-tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
+# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
+tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
 ```
@@ -282,7 +283,7 @@ The default setting file is [sudachi.json](https://github.com/WorksApplications/
 ```bash
 $ sudachipy -r path/to/sudachi.json
-```
+```
 ## User Dictionary
@@ -300,7 +301,7 @@ Then specify your `sudachi.json` with the `-r` option.
 ```bash
 $ sudachipy -r path/to/sudachi.json
-```
+```
 You can build a user dictionary with the subcommand `ubuild`.
@@ -358,7 +359,7 @@ Then specify your `sudachi.json` with the `-r` option.
 ```bash
 $ sudachipy -r path/to/sudachi.json
-```
+```
 ## For Developers
2 changes: 1 addition & 1 deletion python/docs/source/conf.py
@@ -22,7 +22,7 @@
 author = 'Works Applications'

 # The full version, including alpha/beta/rc tags
-release = '0.6.1-a1'
+release = '0.6.1'


 # -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion python/py_src/sudachipy/__init__.py
@@ -11,7 +11,7 @@
 from importlib.util import find_spec
 from pathlib import Path

-__version__ = "0.6.1-a1"
+__version__ = "0.6.1"

 _DEFAULT_RESOURCEDIR = Path(__file__).resolve().parent / 'resources'
 _DEFAULT_SETTINGFILE = _DEFAULT_RESOURCEDIR / 'sudachi.json'
2 changes: 1 addition & 1 deletion python/py_src/sudachipy/command_line.py
@@ -60,7 +60,7 @@ def run(tokenizer, input_, output, print_all, morphs, is_stdout):
         for m in tokenizer.tokenize(line, out=mlist):
             list_info = [
                 m.surface(),
-                ",".join(morphs[m.part_of_speech_id()]),
+                morphs[m.part_of_speech_id()],
                 m.normalized_form()]
             if print_all:
                 list_info += [
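
The change above drops a per-morpheme `",".join(...)` in favour of a plain lookup. One plausible reading, and it is only an assumption since the caller is not part of this hunk, is that `morphs` now holds strings joined once per POS id. The sketch below shows that pattern with made-up data; `POS_TABLE`, `POS_STRINGS` and `format_row` are hypothetical names, not part of SudachiPy.

```python
from typing import List, Tuple

# Hypothetical POS table: the list index plays the role of the POS id.
POS_TABLE: List[Tuple[str, ...]] = [
    ("名詞", "普通名詞", "一般", "*", "*", "*"),
    ("動詞", "一般", "*", "*", "五段-カ行", "終止形-一般"),
]

# Join every entry once, up front, instead of once per morpheme.
POS_STRINGS: List[str] = [",".join(pos) for pos in POS_TABLE]


def format_row(surface: str, pos_id: int, normalized: str) -> str:
    """Build one tab-separated output row using the pre-joined POS string."""
    return "\t".join([surface, POS_STRINGS[pos_id], normalized])


print(format_row("猫", 0, "猫"))
```

Because a text usually contains far more morphemes than distinct POS ids, joining once per id moves the string building out of the inner loop.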
2 changes: 1 addition & 1 deletion python/setup.py
@@ -17,7 +17,7 @@

 setup(
     name="SudachiPy",
-    version="0.6.1-a1",
+    version="0.6.1",
     description="Python version of Sudachi, the Japanese Morphological Analyzer",
     long_description=open('README.md', encoding='utf-8').read(),
     long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion sudachi-cli/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "sudachi-cli"
-version = "0.6.1-a1"
+version = "0.6.1"
 authors = ["Works Applications <sudachi@worksap.co.jp>"]
 edition = "2018"
 description = "Rust version of Sudachi, the Japanese Morphological Analyzer"
2 changes: 1 addition & 1 deletion sudachi/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "sudachi"
-version = "0.6.1-a1"
+version = "0.6.1"
 authors = ["Works Applications <sudachi@worksap.co.jp>"]
 edition = "2018"
 description = "Rust version of Sudachi, the Japanese Morphological Analyzer"
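
This commit bumps the same version string by hand in several `Cargo.toml` files, `setup.py`, `__init__.py` and `conf.py`. A small check along the following lines could catch a file that was missed. It is a sketch, not part of the repository: the file contents are inlined as strings so it runs anywhere, and `FILES`, `find_versions` and `is_consistent` are hypothetical names.

```python
import re

# Inlined stand-ins for the real files; a real check would read them from disk.
FILES = {
    "sudachi/Cargo.toml": 'name = "sudachi"\nversion = "0.6.1"\n',
    "python/setup.py": 'setup(\n    name="SudachiPy",\n    version="0.6.1",\n',
    "python/py_src/sudachipy/__init__.py": '__version__ = "0.6.1"\n',
    "python/docs/source/conf.py": "release = '0.6.1'\n",
}

# Matches `version = "..."`, `__version__ = "..."` or `release = '...'` at line start.
VERSION_RE = re.compile(
    r"""^\s*(?:version|__version__|release)\s*=\s*["']([^"']+)["']""",
    re.MULTILINE,
)


def find_versions(files: dict) -> dict:
    """Map each path to the first version string found in its text."""
    found = {}
    for path, text in files.items():
        match = VERSION_RE.search(text)
        if match:
            found[path] = match.group(1)
    return found


def is_consistent(files: dict) -> bool:
    """True when every file that declares a version declares the same one."""
    return len(set(find_versions(files).values())) == 1


print(find_versions(FILES), is_consistent(FILES))
```

Replacing one entry with the old `0.6.1-a1` string makes `is_consistent` return `False`, which is the situation the check is meant to flag.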