Port regex-based oov handler (#211)
* store at-boundary created words during lattice creation

* support user-defined POS in OOV handlers

* add regex-based OOV handler

* unsupport python 3.6

* remove overchecks in Grammar.get_pos_id

* add changelog entries
eiennohito committed Jun 16, 2022
1 parent 2d73a9e commit 9b1876d
Showing 30 changed files with 1,078 additions and 257 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build-python-wheels.yml
@@ -67,7 +67,7 @@ jobs:
    strategy:
      matrix:
        os: [windows-latest, macOS-latest]
-        python-version: [ "3.6", "3.7", "3.8", "3.9" , "3.10" ]
+        python-version: [ "3.7", "3.8", "3.9", "3.10" ]

    steps:
      - uses: actions/checkout@v2
22 changes: 18 additions & 4 deletions CHANGELOG.md
@@ -1,4 +1,18 @@
-# [0.6.3](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.2) (2020-02-10)
+# [0.6.4](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.3) (2022-06-16)
+
+## Highlights
+
+* Remove Python 3.6 support which reached end-of-life status on [2021-12-23](https://endoflife.date/python)
+* OOV handler plugins support user-defined POS, [similar to Java version](https://github.com/WorksApplications/Sudachi/releases/tag/v0.6.0)
+* Added Regex OOV handler
+
+## Regex OOV Handler
+
+* For details, see [Java version changelog](https://github.com/WorksApplications/Sudachi/releases/tag/v0.6.0)
+* In Rust/Python, regexes do not support backtracking and backreferences
+* `maxLength` setting defines maximum length in Unicode codepoints, not in UTF-16 code units as in Java (the Java version will be changed to codepoints later)
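The `maxLength` note above depends on which unit a string length is counted in. The following is a minimal sketch, using only the Rust standard library, of how the same text measures differently in Unicode codepoints, UTF-16 code units, and UTF-8 bytes. The names `Lengths`, `measure`, and `within_max_length` are hypothetical and exist only for this illustration; they are not part of sudachi.rs, and the actual handler may apply the limit differently.

```rust
/// Lengths of the same text under three different units.
/// Hypothetical helper type for illustration, not part of sudachi.rs.
#[derive(Debug, PartialEq, Eq)]
pub struct Lengths {
    pub codepoints: usize,
    pub utf16_units: usize,
    pub utf8_bytes: usize,
}

/// Measures `text` in codepoints, UTF-16 code units, and UTF-8 bytes.
pub fn measure(text: &str) -> Lengths {
    Lengths {
        // `chars()` iterates over Unicode scalar values (codepoints).
        codepoints: text.chars().count(),
        // `encode_utf16()` yields one item per UTF-16 code unit.
        utf16_units: text.encode_utf16().count(),
        // `len()` on `&str` is the size in UTF-8 bytes.
        utf8_bytes: text.len(),
    }
}

/// Assumed shape of a `maxLength`-style check that counts codepoints.
pub fn within_max_length(text: &str, max_length: usize) -> bool {
    text.chars().count() <= max_length
}
```

Characters outside the Basic Multilingual Plane, such as emoji, are where the units diverge: one codepoint takes two UTF-16 code units and four UTF-8 bytes, so a limit counted in codepoints admits text that a limit counted in code units would reject.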

+# [0.6.3](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.3) (2022-02-10)

## Highlights

@@ -20,13 +34,13 @@
* difference (`m1 - m2`)
* negation (`~m1`)

-# [0.6.2](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.2) (2020-12-09)
+# [0.6.2](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.2) (2021-12-09)

## Fixes

* Fix analysis differences with 0.5.4

-# [0.6.1](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.1) (2020-12-08)
+# [0.6.1](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.1) (2021-12-08)

## Highlights
* Added Fuzzing (see `sudachi-fuzz` subdirectory), Sudachi.rs seems to be pretty robust towards arbitrary inputs (no crashes and panics)
@@ -41,7 +55,7 @@
## Python
* See [Python changelog](./python/CHANGELOG.md)

-# [0.6.0](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.0) (2020-11-11)
+# [0.6.0](https://github.com/WorksApplications/sudachi.rs/releases/tag/v0.6.0) (2021-11-11)
## Highlights
* Full feature parity with Java version
* ~15% analysis speed improvement over 0.6.0-rc1
37 changes: 19 additions & 18 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion python/build-wheels-manylinux-pgo.sh
@@ -31,7 +31,7 @@ cd "$DIR"
export RUSTFLAGS='-C profile-use=/tmp/sudachi-profdata.merged -C opt-level=3'
export CARGO_BUILD_TARGET=x86_64-unknown-linux-gnu

-for PYBIN in /opt/python/cp{36,37,38,39,310}*/bin; do
+for PYBIN in /opt/python/cp{37,38,39,310}*/bin; do
    "${PYBIN}/pip" install -U setuptools wheel setuptools-rust
    find . -iname 'sudachipy*.so'
    rm -f build/lib/sudachipy/sudachipy*.so
5 changes: 3 additions & 2 deletions sudachi/Cargo.toml
@@ -14,12 +14,13 @@ aho-corasick = "0.7" # MIT/Apache 2.0
bitflags = "1.3" # MIT/Apache 2.0
csv = "1.1" # Unlicense/MIT
fancy-regex = "0.10" # MIT
indexmap = "1.7" # # MIT/Apache 2.0
indexmap = "1.7" # MIT/Apache 2.0
itertools = "0.10" # MIT/Apache 2.0
lazy_static = "1.4" # MIT/Apache 2.0
libloading = "0.7" # ISC (MIT-compatible)
nom = "7" # MIT
memmap2 = "0.5" # MIT/Apache 2.0
regex = "1.5" # MIT/Apache 2.0
regex = "1" # MIT/Apache 2.0
serde = { version = "1.0", features = ["derive"] } # MIT/Apache 2.0
serde_json = "1.0" # MIT/Apache 2.0
thiserror = "1.0" # MIT/Apache 2.0
122 changes: 122 additions & 0 deletions sudachi/src/analysis/created.rs
@@ -0,0 +1,122 @@
/*
 * Copyright (c) 2021 Works Applications Co., Ltd.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

use std::cmp::min;

type Carrier = u64;

/// Bitset that records which word lengths have already been created.
/// Lattice construction fills this bitset and passes it to the OOV providers,
/// which lets them check very cheaply whether a word of a specific length was created.
///
/// Unfortunately, for words of `MAX_VALUE` or more characters, handlers need to fall back to the usual linear-time check.
#[derive(Copy, Clone, Eq, PartialEq, Default, Debug)]
#[repr(transparent)]
pub struct CreatedWords(Carrier);

#[derive(Eq, PartialEq, Copy, Clone, Debug)]
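pub enum HasWord {
    /// A word of the requested length was created.
    Yes,
    /// No word of the requested length was created.
    No,
    /// The requested length is `MAX_VALUE` or more and some long word exists,
    /// so the bitset alone cannot tell; a linear-time check is needed.
    Maybe,
}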

impl CreatedWords {
    /// Maximum supported length of the word; longer words all share the top bit.
    pub const MAX_VALUE: Carrier = 64;
    const MAX_SHIFT: Carrier = CreatedWords::MAX_VALUE - 1;

    /// Creates a bitset with no words recorded.
    pub fn empty() -> CreatedWords {
        Default::default()
    }

    /// Creates a bitset containing a single word of the given (positive) length.
    pub fn single<Pos: Into<i64>>(length: Pos) -> CreatedWords {
        let raw = length.into();
        debug_assert!(raw > 0);
        let raw = raw as Carrier;
        // Length `n` maps to bit `n - 1`; lengths of `MAX_VALUE` or more are clamped to the top bit.
        let shift = min(raw.saturating_sub(1), CreatedWords::MAX_SHIFT);
        let bits = (1 as Carrier) << shift;
        CreatedWords(bits)
    }

    /// Returns a copy of this bitset with a word of the given length added.
    #[must_use]
    pub fn add_word<P: Into<i64>>(&self, length: P) -> CreatedWords {
        let mask = CreatedWords::single(length);
        self.add(mask)
    }

    /// Returns the union of this bitset and `other`.
    #[must_use]
    pub fn add(&self, other: CreatedWords) -> CreatedWords {
        CreatedWords(self.0 | other.0)
    }

    /// Checks whether a word of the given length was recorded.
    pub fn has_word<P: Into<i64> + Copy>(&self, length: P) -> HasWord {
        let mask = CreatedWords::single(length);
        if (self.0 & mask.0) == 0 {
            HasWord::No
        } else if length.into() >= CreatedWords::MAX_VALUE as _ {
            HasWord::Maybe
        } else {
            HasWord::Yes
        }
    }

    pub fn is_empty(&self) -> bool {
        self.0 == 0
    }

    pub fn not_empty(&self) -> bool {
        !self.is_empty()
    }
}

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn simple() {
        let mask = CreatedWords::single(1);
        assert_eq!(mask.has_word(1), HasWord::Yes);
    }

    #[test]
    fn add() {
        let mask1 = CreatedWords::single(5);
        let mask2 = mask1.add_word(10);
        assert_eq!(mask2.has_word(5), HasWord::Yes);
        assert_eq!(mask2.has_word(10), HasWord::Yes);
        assert_eq!(mask2.has_word(15), HasWord::No);
    }

    #[test]
    fn long_value_present() {
        let mask1 = CreatedWords::single(100);
        assert_eq!(HasWord::No, mask1.has_word(62));
        assert_eq!(HasWord::No, mask1.has_word(63));
        assert_eq!(HasWord::Maybe, mask1.has_word(64));
    }

    #[test]
    fn long_value_absent() {
        let mask1 = CreatedWords::single(62);
        assert_eq!(HasWord::Yes, mask1.has_word(62));
        assert_eq!(HasWord::No, mask1.has_word(63));
        assert_eq!(HasWord::No, mask1.has_word(64));
    }
}
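As a rough illustration of how an OOV provider might consume the bitset from `created.rs`, the sketch below re-implements a simplified stand-in and uses the three-way answer to decide whether a candidate word still needs to be emitted. The names `Created` and `should_emit_oov`, and the idea of passing the existing word lengths as a slice, are assumptions made for this example; the real handlers in sudachi.rs may be structured differently.

```rust
/// Answer to "was a word of this length already created?".
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum HasWord {
    Yes,
    No,
    Maybe,
}

/// Simplified stand-in for `CreatedWords` (hypothetical, for illustration only).
/// Bit `n - 1` is set when a word of length `n` exists; lengths >= 64 share the top bit.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Created(u64);

impl Created {
    pub const MAX_VALUE: u64 = 64;

    /// Maps a positive length to its bit, clamping long lengths to the top bit.
    fn bit(length: u64) -> u64 {
        debug_assert!(length > 0);
        1u64 << length.saturating_sub(1).min(Self::MAX_VALUE - 1)
    }

    /// Returns a copy with a word of the given length recorded.
    pub fn add_word(self, length: u64) -> Created {
        Created(self.0 | Self::bit(length))
    }

    /// Constant-time lookup; `Maybe` means the bitset cannot decide.
    pub fn has_word(self, length: u64) -> HasWord {
        if self.0 & Self::bit(length) == 0 {
            HasWord::No
        } else if length >= Self::MAX_VALUE {
            HasWord::Maybe
        } else {
            HasWord::Yes
        }
    }
}

/// Hypothetical handler decision: emit an OOV candidate only if no word of
/// the same length exists yet. `existing_lengths` stands in for the lattice
/// nodes that a real handler would scan in the slow path.
pub fn should_emit_oov(created: Created, candidate_len: u64, existing_lengths: &[u64]) -> bool {
    match created.has_word(candidate_len) {
        HasWord::No => true,
        HasWord::Yes => false,
        // All long words share one bit, so fall back to a linear scan.
        HasWord::Maybe => !existing_lengths.contains(&candidate_len),
    }
}
```

The point of the three-way result is that the common case (short words) is answered with a single bitwise AND, and the linear scan is needed only for lengths of 64 or more.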
1 change: 1 addition & 0 deletions sudachi/src/analysis/mod.rs
@@ -21,6 +21,7 @@ use mlist::MorphemeList;

use crate::error::SudachiResult;

+pub mod created;
mod inner;
pub mod lattice;
pub mod mlist;