Bundle unidic instead of ipadic #38
Comments
Definitely approve of bundling UniDic over IPADic. UniDic has issues, but none of them are serious enough that using IPADic makes sense in 2019. On the other hand, it might be best to keep the core package light and non-bundled, for people with a well-configured C++ MeCab, and make a separate bundled package for people who want to handle everything through pip.
The Python packaging community seems to prefer that compiled-code libraries be bundled; see, for instance, the guidance in the various "manylinux" PEPs. I also recall old bugs that boiled down to people being stuck on MeCab 0.7. Unbundling the dictionary and removing the wacky code in

I would hesitate to upload a 2.2GB anything to PyPI without discussing it with the archive maintainers first.
It's true that bundling everything is useful in many situations, but I think we should unbundle the dictionary from the core package and offer a separate package like

The main issue with bundling the dictionary is that it overrides user settings and there's no good way to provide notice of that, which causes confusion. In contrast, if someone installs a non-bundled library and it doesn't work, we can provide an error that gives instructions on what to do. (See how spaCy handles trying to run Japanese without dependencies.)

Ignoring user settings could be particularly pernicious in the case where a user uses a base dictionary like UniDic plus a custom dictionary, which is a common situation in industry. Since most vocabulary is in the base dictionary, behavior won't change in any obvious way, but the user will lose all custom dictionary entries.

Other issues:

File size. Including a dictionary makes the library large, which makes installing it difficult in some situations (Amazon Lambda is a particular example). This is also wasteful if the user needs a different dictionary.

MeCab forks. I wasn't aware of this previously, but it looks like mecab-ko is actively used for Korean. I think mecab-python3 works with it as-is (the interfaces are the same but space handling is slightly different), but that wouldn't work if we bundled a binary.

System consistency. Using system MeCab means

Wanting to install everything with one pip command is a common use case for new projects and we should support it, but I think that there are many cases where a non-bundled library is preferable, and since there's no good way to inform existing users of the change, it's best to keep the core package light. The way spaCy handles models is probably a good reference here: the core is small, and models and language data can be installed as pip packages and selected at runtime.
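To make the "don't override user settings, and fail with instructions" idea concrete, here is a minimal sketch. It is not code from mecab-python3: the helper names `user_chose_dictionary` and `resolve_args` and the wording of the hint are hypothetical, and the check only looks at explicit `-d`/`--dicdir` and `-r`/`--rcfile` arguments, so it would not notice a dictionary configured in a system mecabrc.

```python
import shlex
from typing import Optional

# Hypothetical message; the real wording would be up to the maintainers.
INSTALL_HINT = (
    "No MeCab dictionary found. Install one (for example "
    "`pip install unidic-lite`) or pass `-d /path/to/dictionary`."
)


def user_chose_dictionary(args: str) -> bool:
    """Return True if the argument string explicitly names a dictionary or rc file.

    Simplified check: only the separate (`-d DIR`) and long (`--dicdir=DIR`)
    forms are recognized.
    """
    for token in shlex.split(args):
        if token in ("-d", "--dicdir", "-r", "--rcfile"):
            return True
        if token.startswith(("--dicdir=", "--rcfile=")):
            return True
    return False


def resolve_args(args: str, default_dicdir: Optional[str]) -> str:
    """Build the final argument string without overriding the user's choice."""
    if user_chose_dictionary(args):
        return args  # explicit user settings always win
    if default_dicdir is None:
        # Nothing bundled and nothing chosen: fail loudly with instructions.
        raise RuntimeError(INSTALL_HINT)
    return f'-d "{default_dicdir}" {args}'.strip()
```

With this shape, a user who passes `-d /opt/mydic -u custom.dic` keeps both the base and the custom dictionary, while a user with no dictionary at all gets an actionable error instead of a silent fallback.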
As an update on this, I have packaged UniDic for installation via PyPI. unidic-lite is an older version of UniDic that fits entirely in a PyPI package, while unidic is a version that needs you to download the data separately (via a single command) after installing a PyPI package. I integrated support for these into fugashi, but since all they do is download the data files and provide the path to them, it should be easy to integrate here.

To that end I'll work on removing IPAdic integration, adding integration for these UniDic packages, and, when that's working, releasing a new version in a way that can avoid confusion. Maybe it's time for 1.0.
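Since the dictionary packages only need to provide a path, integration can be sketched as a lookup over installed packages. This is an illustration rather than the actual fugashi or mecab-python3 code: `find_dicdir` is a hypothetical helper, and it assumes each dictionary package exposes its data directory as a module-level `DICDIR` attribute (which, to my knowledge, is how `unidic` and `unidic_lite` do it).

```python
import importlib


def find_dicdir(packages=("unidic", "unidic_lite")):
    """Return (package_name, dicdir) for the first installed dictionary package.

    Packages are tried in order, so the full `unidic` is preferred over
    `unidic_lite` when both are installed. Returns None if none is available.
    """
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        # Assumption: the package exposes its data directory as DICDIR.
        dicdir = getattr(module, "DICDIR", None)
        if dicdir:
            return name, dicdir
    return None
```

The returned directory could then be handed to the tagger as its dictionary directory (for instance via the `-d` option), leaving the core package free of any bundled data.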
I have this working now in the
These changes were integrated into the main branch and are in the 1.0 release today.
Per https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html , UniDic may be a better choice for the dictionary bundled with the PyPI package than the current IPADic. The primary benefit seems to be that UniDic is actively maintained, whereas IPADic has not been updated since 2007. On the other hand, it's much larger (we cannot reasonably bundle 2.2GB of data, so we'd have to stick with 2.2.0 at the latest) and it may produce surprising tokenizations.
@polm, your thoughts?