
Limit Request: sudachidict-{core,full} - {75, 160MB} #131

Closed
sorami opened this issue Jan 6, 2020 · 2 comments

sorami commented Jan 6, 2020

Project

  • sudachidict-core
  • sudachidict-full

Size of release

  • sudachidict-core: 75MB
  • sudachidict-full: 160MB

Which indexes

Both PyPI and Test PyPI.

Reasons for the request

Sudachi is a Japanese natural language processing tool. These packages contain a large amount of vocabulary information for language analysis, so the binary size is large.

We update this language resource regularly (every few months). While we add new vocabulary with each release, we also refine and remove entries, so we believe the packages won't exceed the requested size limits in the future.

Contributor

jamadden commented Jan 6, 2020

Hi @sorami! It appears that the projects you've linked to (and also https://pypi.org/project/SudachiDict-small/) do not include any Python code. They just package a data file. (Note how __init__.py is an empty file.)

$ unzip -l ./SudachiDict_small-20191030-py3-none-any.whl
Archive:  ./SudachiDict_small-20191030-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  11-01-2019 09:47   sudachidict_small/__init__.py
122041043  11-01-2019 09:47   sudachidict_small/resources/system.dic
...
---------                     -------
122056130                     7 files

Unfortunately, distributing large data files isn't what PyPI is intended for. In addition, large packages stress PyPI's infrastructure as well as that of its mirrors, and they tend to produce a poor user experience. That's especially true if the large package is updated frequently.

Since you seem to already have a mechanism for distributing these data files (as witnessed by the links to ZIP files on the project pages) I would encourage leveraging that instead. There are a few ways in which that's commonly done. One is to have your package on PyPI simply include a command that the user is expected to run in order to download or update the data; this command could be a module or a setuptools console entry point which the user would run as python -m the_module.update_data or update_data, respectively. That lets you distribute wheels and works well if the data is regularly updated. It could also let you simplify distribution, by having one package that can be used to choose and download any of your three datasets.
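A minimal sketch of that first approach might look like the following. (The module layout, URL, and the update_data entry point name here are all hypothetical, not part of any existing SudachiDict package.)

# sudachidict/download.py — hypothetical helper module; run it as
# `python -m sudachidict.download core`, or expose main() as a
# console_scripts entry point, e.g.
#   entry_points={"console_scripts": ["update_data = sudachidict.download:main"]}
import argparse
import shutil
import urllib.request
from pathlib import Path

# Placeholder URL template; a real package would pin exact, versioned URLs.
DATA_URL = "https://example.com/sudachidict/{edition}/system.dic"
DATA_DIR = Path(__file__).parent / "resources"

def main():
    parser = argparse.ArgumentParser(description="Download a SudachiDict edition.")
    parser.add_argument("edition", choices=["small", "core", "full"])
    args = parser.parse_args()

    dest_path = DATA_DIR / args.edition / "system.dic"
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(DATA_URL.format(edition=args.edition)) as src:
        with open(dest_path, "wb") as dest:
            # Stream to disk rather than buffering; these files are 75-160MB.
            shutil.copyfileobj(src, dest)
    print(f"Downloaded the {args.edition} dictionary to {dest_path}")

if __name__ == "__main__":
    main()

After installing the package, a user would run update_data core (or python -m sudachidict.download core) to fetch or refresh the data.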

Another common approach is to have your setup.py script automatically download the data when it is run (if it doesn't exist); this works well if the data is relatively static or has very strong versioning requirements (but doesn't let you distribute wheels):

# setup.py

import os
import shutil
import urllib.request

import setuptools

# Fetch the data file at build/install time if it isn't already present.
# ('path/to/data' and the URL are placeholders for the real locations.)
if not os.path.exists('path/to/data'):
    with urllib.request.urlopen('https://location/of/data') as src:
        with open('path/to/data', 'wb') as dest:
            # Stream the download to disk instead of buffering it all in memory.
            shutil.copyfileobj(src, dest)

setuptools.setup(…)

I hope this helps.

Author

sorami commented Jan 6, 2020

Hi @jamadden, thank you very much for your reply!

I see; I understand that PyPI is not intended for hosting large data files.

Thank you very much for the detailed explanation of how we can distribute the files in other ways. We will consider these approaches.

sorami closed this as completed Jan 6, 2020