Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python package #267

Merged
merged 1 commit into from
Feb 28, 2022
Merged

Python package #267

merged 1 commit into from
Feb 28, 2022

Conversation

acxz
Copy link
Contributor

@acxz acxz commented Nov 1, 2020

@JoshData I'm back haha, (This might be my last one in a while)

This patchset allows users to install congress as a python package. This offers several benefits:

  1. dependencies: can be directly specified in the setup.py instead of having a requirements file. This ensures that users who want to use congress will have the requirements instead running an extra command.

  2. exposing entry points: the main run and test/run files can be run anywhere as long as your python environment is the same one where you installed congress in, providing more convenience to users say in writing their own shell scripts without hardcoding the congress directory. Of course you can still use ./run and python run as before, so documentation does not need to be changed.

  3. Makes it easier for package managers to work with. While this does not directly reslve PyPi package available? #188 if someone wants to, one can easily upload congress to PyPi now, saving users the step to even clone the repo. Distributions that currently package this codebase, such as Arch Linux, currently manually install the scripts to the usr/bin folders. (https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=congress-git) Now they can use built in python build instructions such as python setup.py install.

If this is merged, this can fixes #192

setup.py Outdated Show resolved Hide resolved
@acxz
Copy link
Contributor Author

acxz commented Nov 9, 2020

@JoshData friendly ping

1 similar comment
@acxz
Copy link
Contributor Author

acxz commented Nov 17, 2020

@JoshData friendly ping

setup.py Outdated Show resolved Hide resolved
@JoshData
Copy link
Member

Hey,

Sorry I'm actually confused by the changes. If one follows the README steps, I think you will get a copy of the library installed in the usual Python packages location, but then it says to run "./run" which will run the version that has been cloned. Or will it? Will it get the modules in the working directory or will it go to the installed package?

@acxz
Copy link
Contributor Author

acxz commented Nov 27, 2020

If we following the README as is, ./run will execute the version that has been cloned. Typing just run will execute the version that has been installed.

@JoshData
Copy link
Member

I don't think this is a good idea...

@acxz
Copy link
Contributor Author

acxz commented Nov 27, 2020

Do you mean to say that the documentation should be changed to run or that run should not be a command that should be called from anywhere?

Just to clarify ./run will only work if you are in the cloned directory.

@JoshData
Copy link
Member

The documentation needs to present a clear way to use the package, so it has to either use an installed package or not install it. (It shouldn't install a package but then not use it.) It shouldn't create two local copies of the package either (the git clone and then the one installed by pip). And "run" is a bad name for a globally-installed script.

@acxz
Copy link
Contributor Author

acxz commented Nov 27, 2020

The documentation needs to present a clear way to use the package, so it has to either use an installed package or not install it. (It shouldn't install a package but then not use it.)

Thats sensible, I'll change the local ./runs to runs.

It shouldn't create two local copies of the package either (the git clone and then the one installed by pip).

This is a harder one to tackle. As the only way to install the package is to clone it first, unless you are using a package manager. There is no way around this unless you symlink the two with pip install -e . which is commonly done when developing on python code. I think this should stay the way it is.

And "run" is a bad name for a globally-installed script.

Technically it will only work in the virtual env the package is installed, but yeah. Totally agree. Do you have a proposition for a better name?

@acxz
Copy link
Contributor Author

acxz commented Dec 16, 2020

@JoshData any further thoughts?

  1. Or should I go ahead and not include the scripts to be installable
  2. Or should we go ahead and just scrap the idea of having congress as a python package?

@JoshData
Copy link
Member

Or should we go ahead and just scrap the idea of having congress as a python package?

Is this PR solving a real problem that you are having, or is this just for fun?

@acxz
Copy link
Contributor Author

acxz commented Dec 19, 2020

I think I have outlined the real world problems this resolves in the OP.

For most users however, this brings us one step closer to resolve #188, i.e. able to install unitedstates/congress with a pip install united-states as well as being able to pull dependencies properly instead of using requirements.txt.

I would very much like for this PR to be merged in as I believe it is useful. I just suggested the second point as ultimately I don't want to keep pushing this unless you believe there is merit in this PR, since I should respect your wishes as being the current maintainer of this project.

@aih
Copy link

aih commented Dec 28, 2020

@JoshData, @acxz I also think that having this as a Python package is helpful. As a general principle, I've moved to having all of my Python scripts as packages so that I can call their modules without remembering the paths or setting environment variables.

In this case I think @JoshData has a valid concern about namespacing. Using run or runs is just not clear outside of this context. The command should be something like usc-run. More broadly, for me, the greatest value of a package is not the command-line, but being able to from uscongress.bills import get_bills_to_process. This would require a more significant change to create a unique namespace. I think this can be done by changing the 'tasks' directory to 'uscongress'.

This may be disruptive to current users, but I think it provides more value than being able to call the run command from anywhere; if that were my main problem I'd add an alias to .bash_profile.

If @JoshData is open to the more significant change above, I would suggest a few updates (in both the code and documentation):

  1. Create a namespace (e.g. uscongress) to call the scripts as modules (e.g. from uscongress.bills import get_bills_to_process).
  2. Clearly document how the modules of this package can be called from other python scripts (i.e. what is the namespace in Skip bills that are Reserved for the Speaker #1). Identify any breaking changes this will introduce for current users.
  3. Handle the namespace issue for the command-line. Any global installation should use something clearly distinctive, like usc-run. I think it shoud be possible to still use './run' from within the directory so that any users' cronjobs or other current uses are not affected. This is, to me, less confusing than ./run vs run.

@acxz
Copy link
Contributor Author

acxz commented Dec 29, 2020

The command should be something like usc-run.

usc-run is a good suggestion, I'll go ahead and update the PR to reflect that.

More broadly, for me, the greatest value of a package is not the command-line, but being able to from uscongress.bills import get_bills_to_process.

That was actually on my to do once the setup.py is integrated in this PR. Wanted to take it a step at a time ;)

Points 1 and 2 should be addressed in a follow up PR. I'll go ahead and address Point 3 in this PR.

Thanks for your input @aih!

@acxz
Copy link
Contributor Author

acxz commented Dec 29, 2020

hmm turns out I can't alias the run script with setup.py's scripts field. I would have to use console_scripts of entry_points, but that requires a python module rather than a script, which means I would have to change the structure such that run is a module. Hence requiring points 1 and 2. @aih if you think you can accomplish installing the run script under a usc-run with minimal changes, send a PR over at my fork and this branch.

Otherwise, personally I think this PR should be merged in as it is, and we can address @aih's concerns in a later (possibly more disruptive, in terms of file structure, PR). Just getting a setup.py in this repo's pipeline is a good atomic PR imo.

@JoshData if you could let us know if you are still interested in this feature that would be appreciated.

@Andrew-Chen-Wang
Copy link

Definitely useful to have this set as a package:

  1. Easy to upgrade in case the scraped websites change formats, especially if you're using this with several other applications.
  2. Useful to simply pip install rather than use git clone. Something like pur and dependabot helps a lot knowing when you can upgrade.
  3. What if you want customized code or customized parsers? Well, that's why it's great to have a tag release with a CHANGELOG and breaking changes and mini functions in the code for easy extensibility.
  4. Knowing which Python versions are supported per release would be great! In general, git tags would be helpful.

I agree with @JoshData for the PR itself; I don't think it's a good idea to include the run script, or a "command" script at all. The modules separated as is is good enough for people to simply import (e.g. the vote.py file is meant for vote scraping, and that's it, just like firebase admin); it'll just need extra documentation. +1 on making this a package though!

@acxz
Copy link
Contributor Author

acxz commented Jan 29, 2022

Handle the namespace issue for the command-line. Any global installation should use something clearly distinctive, like usc-run. I think it shoud be possible to still use './run' from within the directory so that any users' cronjobs or other current uses are not affected. This is, to me, less confusing than ./run vs run.

With the latest commit this is taken care of.

@acxz
Copy link
Contributor Author

acxz commented Jan 30, 2022

To really make the python methods and modules importable, I tried to create the setup.py file so that the directory structure could be kept the same. However, after spending quite a bit of time I wasn't able to get it to work. I decided to follow python packaging convention (this involves moving files into a congress/ subfolder) and imports now work properly. Due to some heavy changes I also squashed the commits for easy readability.

With this @aih's concerns are addressed and with a pip install ., congress can be installed where you can call the scripts and load/run methods located within the tasks/ files.

Associated doc changes have been added as well.

I would urge anyone interested in this endeavor to test out the package for their particular use case. Please feel to critique and state any concerns!

change directory structure to make python package conventional
add setup.py file to specify deps

guide users to use the installed `usc-run` command
associated changes to other scripts
make scripts installable when package is installed

add a symlink for congress/run.py to run for backwards compat

remove redundant requirements file
@JoshData
Copy link
Member

JoshData commented Feb 6, 2022

This seems OK to me. Would be great if someone else could try it out next.

@acxz
Copy link
Contributor Author

acxz commented Feb 14, 2022

Would be great if someone else could try it out next.

pinging @stevesdawg @Andrew-Chen-Wang @aih

If you guys can test it out for your use case and report back if it works or doesn't please do so.

Copy link

@Andrew-Chen-Wang Andrew-Chen-Wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if the package was versioned by date. It makes more sense as a scraper of a constantly changing website.

I also have stopped using this repo for work, but scrapers are in need all the time so please do merge! Thanks again for the lovely work:)

@acxz
Copy link
Contributor Author

acxz commented Feb 14, 2022

Thx for the review!

I would prefer if the package was versioned by date.

That actually makes a lot of sense. I'll add the change depending on @JoshData's opinion.

@aih
Copy link

aih commented Feb 14, 2022

This is great work, and I will test it over the next couple of weeks. My main use case is to: 1. Download bulk data (from ProPublica, here), then pip install, and then update the data with usc-run bills. @acxz is there any reason this would not work?

@acxz
Copy link
Contributor Author

acxz commented Feb 14, 2022

is there any reason this would not work?

None at all

@Andrew-Chen-Wang
Copy link

@aih things thatre worth testing seems like the CLI command, general importing of the package, and a single use case

@sseshan7
Copy link
Contributor

sseshan7 commented Feb 16, 2022

The test plan I ran for these changes, from inside the repo root:

  1. pip install .
  2. usc-run votes --congress=117 --session=2022 --force=True
  3. usc-run bills to process bills I've already downloaded.

These worked normally as expected! 🎉

One side-effect however, when I call usc-run from somewhere other than the repo root, it creates the data/ and cache/ dirs in the directory where I called it. I feel like this is undesirable, and it should always download to some default location unless a download destination is provided.

What are people's thoughts on this?

@michaelblyons
Copy link
Contributor

One side-effect however, when I call usc-run from somewhere other than the repo root, it creates the data/ and cache/ dirs in the directory where I called it. I feel like this is undesirable, and it should always download to some default location unless a download destination is provided.

What are people's thoughts on this?

I agree with you, but I think that can be a subsequent PR and need not block merge. Right now, you can only really run from within the repo.

@Andrew-Chen-Wang
Copy link

If you made a system install and implemented something where the folders are created at "top level", you wouldn't see the folders at all anymore since they'd be next to or even inside the package content (ie wherever the environment is). I think this behavior is desirable.

@sseshan7
Copy link
Contributor

Command line importing looks good 👍🏾

Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 23:10:56) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import congress
>>> congress
<module 'congress' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/__init__.py'>
>>> from congress import tasks
>>> tasks
<module 'congress.tasks' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/__init__.py'>
>>> congress.tasks
<module 'congress.tasks' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/__init__.py'>
>>> from congress.tasks import bills
>>> bills
<module 'congress.tasks.bills' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/bills.py'>
>>> from congress.tasks import votes
>>> votes
<module 'congress.tasks.votes' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/votes.py'>
>>> votes.vote_ids_for_house
<function vote_ids_for_house at 0x7fa0bcf91c80>

This looks good to me.

@acxz
Copy link
Contributor Author

acxz commented Feb 24, 2022

@JoshData it has been some time, since this PR has been tested by folks here. LGTM?

@JoshData JoshData merged commit c10772e into unitedstates:master Feb 28, 2022
@JoshData
Copy link
Member

Thanks all!

@acxz acxz deleted the python-package branch February 28, 2022 06:09
@sseshan7 sseshan7 mentioned this pull request Apr 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants