
Limit Request: sudachidict-{core,full} - {75, 160MB} #131

Closed
sorami opened this issue Jan 6, 2020 · 2 comments

sorami commented Jan 6, 2020

Project

  • sudachidict-core
  • sudachidict-full

Size of release

  • sudachidict-core: 75MB
  • sudachidict-full: 160MB

Which indexes

Both PyPI and Test PyPI.

Reasons for the request

Sudachi is a Japanese natural language processing tool. These packages contain a large amount of vocabulary information for language analysis, so the binary size is large.

We update this language resource regularly (every few months). While we add new vocabulary with each release, we also refine and remove entries, so we believe the packages won't exceed the requested size limits in the future.

Contributor

jamadden commented Jan 6, 2020

Hi @sorami! It appears that the projects you've linked to (and also https://pypi.org/project/SudachiDict-small/) do not include any Python code. They just package a data file. (Note how __init__.py is an empty file.)

$ unzip -l ./SudachiDict_small-20191030-py3-none-any.whl
Archive:  ./SudachiDict_small-20191030-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  11-01-2019 09:47   sudachidict_small/__init__.py
122041043  11-01-2019 09:47   sudachidict_small/resources/system.dic
...
---------                     -------
122056130                     7 files

Unfortunately, distributing large data files isn't what PyPI is intended for. In addition, large packages stress PyPI's infrastructure as well as that of its mirrors, and they tend to produce a poor user experience. That's especially true if the large package is updated frequently.

Since you seem to already have a mechanism for distributing these data files (as witnessed by the links to ZIP files on the project pages) I would encourage leveraging that instead. There are a few ways in which that's commonly done. One is to have your package on PyPI simply include a command that the user is expected to run in order to download or update the data; this command could be a module or a setuptools console entry point which the user would run as python -m the_module.update_data or update_data, respectively. That lets you distribute wheels and works well if the data is regularly updated. It could also let you simplify distribution, by having one package that can be used to choose and download any of your three datasets.
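A minimal sketch of that first approach might look like the following. (The module layout, URL, and the update_data entry point name here are all hypothetical, not part of any existing SudachiDict package.)

# sudachidict/download.py — hypothetical helper module; run it as
# `python -m sudachidict.download core`, or expose main() as a
# console_scripts entry point, e.g.
#   entry_points={"console_scripts": ["update_data = sudachidict.download:main"]}
import argparse
import shutil
import urllib.request
from pathlib import Path

# Placeholder URL template; a real package would pin exact, versioned URLs.
DATA_URL = "https://example.com/sudachidict/{edition}/system.dic"
DATA_DIR = Path(__file__).parent / "resources"

def main():
    parser = argparse.ArgumentParser(description="Download a SudachiDict edition.")
    parser.add_argument("edition", choices=["small", "core", "full"])
    args = parser.parse_args()

    dest_path = DATA_DIR / args.edition / "system.dic"
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(DATA_URL.format(edition=args.edition)) as src:
        with open(dest_path, "wb") as dest:
            # Stream to disk rather than buffering; these files are 75-160MB.
            shutil.copyfileobj(src, dest)
    print(f"Downloaded the {args.edition} dictionary to {dest_path}")

if __name__ == "__main__":
    main()

After installing the package, a user would run update_data core (or python -m sudachidict.download core) to fetch or refresh the data.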

Another common approach is to have your setup.py script automatically download the data when it is run (if it doesn't exist); this works well if the data is relatively static or has very strong versioning requirements (but doesn't let you distribute wheels):

# setup.py

import os
import shutil
import urllib.request

import setuptools

# Fetch the data file at build/install time if it isn't already present.
# ('path/to/data' and the URL are placeholders for the real locations.)
if not os.path.exists('path/to/data'):
    with urllib.request.urlopen('https://location/of/data') as src:
        with open('path/to/data', 'wb') as dest:
            # Stream the download to disk instead of buffering it all in memory.
            shutil.copyfileobj(src, dest)

setuptools.setup(…)

I hope this helps.

Author

sorami commented Jan 6, 2020

Hi @jamadden, thank you very much for your reply!

I see; I understand that PyPI is not intended for hosting large data files.

Thank you very much for the detailed explanation of how we can distribute the files in other ways. We will consider these approaches.

sorami closed this as completed Jan 6, 2020