Bundle unidic instead of ipadic #38

Closed
zackw opened this issue Dec 24, 2019 · 6 comments

Comments

@zackw
Collaborator

zackw commented Dec 24, 2019

Per https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html , UniDic may be a better choice for the dictionary bundled with the PyPI package than the current IPADic. The primary benefit seems to be that UniDic is actively maintained, whereas IPADic has not been updated since 2007. On the other hand, it's much larger (we cannot reasonably bundle 2.2GB of data, so we'd have to stick with 2.2.0 at the latest) and it may produce surprising tokenizations.

@polm, your thoughts?

@polm
Collaborator

polm commented Dec 24, 2019

Definitely approve of bundling UniDic over IPADic. UniDic has issues, but none of them are serious enough that using IPADic makes sense in 2019.

On the other hand, it might be best to keep the core package light and non-bundled, for people with a well-configured C++ MeCab, and make a separate bundled package for people who want to handle everything through pip.

@zackw
Collaborator Author

zackw commented Dec 24, 2019

The Python packaging community seems to prefer that compiled-code libraries be bundled; see for instance the guidance in the various "manylinux" PEPs. I also recall old bugs that boiled down to people being stuck on MeCab 0.7.

Unbundling the dictionary and removing the wacky code in __init__.py to synthesize a mecabrc is a tempting notion, since we know people have had problems with it; on the other hand, we also know people have expected pip install mecab-python3 to work with no further configuration. I think you understand the user community and its needs a lot better than I do, so I'll let you make the call.

I would hesitate to upload a 2.2GB anything to PyPI without discussing it with the archive maintainers first.

@polm
Collaborator

polm commented Dec 25, 2019

It's true that bundling everything is useful in many situations, but I think we should unbundle the dictionary from the core package and offer a separate package like mecab-unidic-bundled.

The main issue with bundling the dictionary is that it overrides user settings and there's no good way to provide notice of that, which causes confusion. In contrast, if someone installs a non-bundled library and it doesn't work, we can provide an error that gives instructions on what to do. (See how spaCy handles trying to run Japanese without dependencies.)
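
A rough sketch of that fail-with-instructions pattern follows. The module name `mecab_unidic_bundled`, its `DICDIR` attribute, and the error text are all hypothetical, chosen only for illustration; this is not existing mecab-python3 code.

```python
import importlib

# Hypothetical message; the real wording would be up to the maintainers.
INSTALL_HINT = (
    "No MeCab dictionary package was found. Install one with pip "
    "(for example `pip install mecab-unidic-bundled`), or pass the path "
    "of an existing dictionary explicitly."
)


def find_dictionary_dir(candidates=("mecab_unidic_bundled",)):
    """Return the dictionary directory of the first importable package.

    Each candidate is assumed to expose a DICDIR attribute holding the
    path to its dictionary files. If none can be imported, fail loudly
    with instructions instead of silently falling back.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        dicdir = getattr(module, "DICDIR", None)
        if dicdir:
            return dicdir
    raise RuntimeError(INSTALL_HINT)
```

The point of the design is that a missing dictionary becomes an explicit, actionable error at startup rather than a silent change in which dictionary gets used.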

Ignoring user settings could be particularly pernicious in the case where a user combines a base dictionary like UniDic with a custom dictionary, which is a common situation in industry. Since most vocabulary is in the base dictionary, behavior won't change in any obvious way, but the user will silently lose all custom dictionary entries.

Other issues:

File size. Including a dictionary makes the library large, which makes installing it difficult in some situations (AWS Lambda is a particular example). This is also wasteful if the user needs a different dictionary.

MeCab forks. I wasn't aware of this previously, but it looks like mecab-ko is actively used for Korean. I think mecab-python3 works with it as-is (the interfaces are the same but space handling is slightly different), but that wouldn't work if we bundled a binary.

System consistency. Using system MeCab means mecabrc is used system-wide, including in non-Python systems. An example I've seen before is using ElasticSearch and Python-based systems with the same mecabrc. Setting MECABRC is an easy workaround for that, but there's no good way to tell users it's necessary now when it wasn't before.
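
For the MECABRC workaround mentioned under system consistency, here is a minimal sketch, assuming MeCab consults the `MECABRC` environment variable when a tagger is created. The helper name and the path in the usage comment are placeholders, not part of any existing package.

```python
import os


def use_shared_mecabrc(path, environ=os.environ):
    """Point MeCab at a shared mecabrc without overriding an existing choice.

    setdefault leaves any MECABRC the user has already exported untouched
    and only fills in `path` when the variable is unset.
    """
    return environ.setdefault("MECABRC", path)


# Usage (call before creating any Tagger; the path is a placeholder):
#   use_shared_mecabrc("/usr/local/etc/mecabrc")
```

Respecting an existing value keeps the Python side consistent with whatever the rest of the system already uses.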


Wanting to install everything with one pip command is a common use case for new projects and we should support it, but I think there are many cases where a non-bundled library is preferable, and since there's no good way to inform existing users of the change, it's best to keep the core package light.

The way spaCy handles models is probably a good reference here - the core is small, and models and language data can be installed as pip packages and selected at runtime.

@polm
Collaborator

polm commented Apr 15, 2020

As an update on this, I have packaged UniDic for installation via PyPI. unidic-lite is an older version of UniDic that fits entirely in a PyPI package, while unidic is a version that needs you to download the data separately (via a single command) after installing a PyPI package.

I integrated support for these into fugashi, but since all they do is download the data files and provide the path to them it should be easy to integrate here. To that end I'll work on removing IPAdic integration, adding integration for these UniDic packages, and, when that's working, releasing a new version in a way that can avoid confusion. Maybe it's time for 1.0.
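
As a rough illustration of how thin that integration can be, here is a sketch that turns a dictionary package into MeCab arguments. It assumes the package exposes its data directory as a `DICDIR` attribute and may ship a `mecabrc` file inside that directory; the function name is made up for this example, and this is not the code in fugashi or mecab-python3.

```python
import importlib
import os


def dictionary_args(package="unidic_lite"):
    """Build a MeCab argument string from a dictionary package.

    Assumes the package exposes DICDIR. If a mecabrc sits next to the
    dictionary files it is passed with -r, so no system-wide mecabrc is
    needed. Paths containing spaces would need quoting, which this
    sketch does not handle.
    """
    module = importlib.import_module(package)
    dicdir = module.DICDIR
    args = ["-d", dicdir]
    mecabrc = os.path.join(dicdir, "mecabrc")
    if os.path.isfile(mecabrc):
        args = ["-r", mecabrc] + args
    return " ".join(args)


# Usage, with MeCab and a dictionary package installed:
#   import MeCab
#   tagger = MeCab.Tagger(dictionary_args())
```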

@polm
Collaborator

polm commented Apr 23, 2020

I have this working now in the use-unidic branch.

@polm
Collaborator

polm commented Jun 29, 2020

These changes were integrated into the main branch and are in the 1.0 release today.

polm closed this as completed Jun 29, 2020