Congressional Hearing Parser #290

Open
wants to merge 77 commits into
base: main
from

Conversation

connorjoleary
Contributor

@connorjoleary connorjoleary commented Oct 16, 2022

Overview

This project gathers transcripts made available by
the US Government Publishing Office and uses them
to attribute who said what during federal
congressional hearings. The data can then be used
to gather insights on the speaking patterns of each
representative.

How to use

  1. Follow the installation instructions in the main README
    to install the correct Python libraries

  2. Go to https://api.govinfo.gov/docs/ and create an API key

  3. Create a .env file in this folder with the key

     GOV_INFO_API_KEY=<gov_info_key>

  4. Run python congress/contrib/congressional_hearing_info/grab_congressional_hearings.py --num 10
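
To check the .env file before running the script, something like the following could help. This is only a sketch: `load_env_key` is a hypothetical helper, not part of this PR, and it assumes the file holds plain `KEY=VALUE` lines like the one shown in the steps above.

```python
from pathlib import Path


def load_env_key(env_path, name="GOV_INFO_API_KEY"):
    """Return the value of `name` from a simple KEY=VALUE .env file.

    Hypothetical helper for illustration; the PR's script may load the
    key differently.
    """
    for raw_line in Path(env_path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            value = value.strip().strip("'\"")
            # Catch the unedited <gov_info_key> placeholder early.
            if not value or value.startswith("<"):
                raise ValueError(f"{name} still looks like a placeholder")
            return value
    raise KeyError(f"{name} not found in {env_path}")
```

Calling `load_env_key(".env")` from this folder returns the key, or raises if the placeholder was never replaced.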

@dwillis
Member

dwillis commented Oct 18, 2022

@connorjoleary thank you for this - it's a really useful area for us to go in. I'd like to hear from @JoshData about it, and in particular about having a dependency on ProPublica's API (full disclosure, I currently run that API, but I'm not full-time at ProPublica and I can't guarantee that I'd be able to immediately address errors or downtime in every case). It appears that this PR uses the API to get current members of the House and Senate; I suspect there might be other ways to do that (using the congress-legislators repository, for example).

@connorjoleary
Contributor Author

Oh, a very good point. I'd be happy to switch out ProPublica's API for that. It should be a fairly straightforward change, assuming they both use the same ids.
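
A rough sketch of what that lookup might look like, assuming records shaped like the entries in congress-legislators' legislators-current.yaml (an `id.bioguide` field, a `name.last` field, and a list of `terms` whose `type` is `sen` or `rep`). The function name and the sample records are made up for illustration; the real file would need to be parsed with a YAML library first.

```python
def current_members_by_bioguide(legislators):
    """Index legislator records by Bioguide id, using each person's latest term."""
    members = {}
    for person in legislators:
        latest_term = person["terms"][-1]  # assumes terms are in chronological order
        members[person["id"]["bioguide"]] = {
            "last_name": person["name"]["last"],
            "chamber": "senate" if latest_term["type"] == "sen" else "house",
            "state": latest_term["state"],
        }
    return members


# Made-up records in the assumed shape; not real legislators or real ids.
SAMPLE_LEGISLATORS = [
    {
        "id": {"bioguide": "X000001"},
        "name": {"first": "Alex", "last": "Example"},
        "terms": [
            {"type": "rep", "state": "WI"},
            {"type": "sen", "state": "WI"},
        ],
    },
    {
        "id": {"bioguide": "X000002"},
        "name": {"first": "Sam", "last": "Sample"},
        "terms": [{"type": "rep", "state": "OH"}],
    },
]

members = current_members_by_bioguide(SAMPLE_LEGISLATORS)
```

Whether the ids line up with the ones the parser currently gets back would still need to be checked against real data.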

@DanielSchuman

DanielSchuman commented Oct 19, 2022 via email

@JoshData
Member

Thanks for sharing this, @connorjoleary.

Yeah, it would be nice if all of the data were fetched in a consistent way throughout this repository: legislators from congress-legislators, GPO documents from the fdsys scraper. But I won't block it based on that.

I'd like incoming code to remain maintained by its maintainer for some reasonable period of time and to be documented in a similar way to other tools in this repo (in the main README and the GitHub wiki section). And if that's the case, there's no need to put things inside a contrib directory - it can just be alongside everything else here. (i.e. I would like to avoid this repo becoming a landing place for unmaintained code. That creates a burden for the rest of us.)

@connorjoleary
Contributor Author

Thank you all very much for the comments. I'm happy to maintain this code for a while after it is in place, but I do worry that the quality of this code is not up to snuff. The transcripts do not always follow consistent formatting, sometimes names are misspelled, and attributing who is speaking can be difficult (for example, one hearing had two people with the same last name, distinguished only by gender). This means that the output of this hearing parser is not always accurate.
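
To make the attribution problem concrete, here is a simplified sketch, not the parser in this PR. It assumes a speaker turn starts with a title and last name followed by a period (for example "Mr. Johnson."), and it keys speakers by title plus last name so that two people who share a last name can still be told apart when their titles differ. The regex, the title list, and the function name are all assumptions for illustration.

```python
import re

# Assumed turn marker: optional indent, a title, a capitalized last name, a period.
SPEAKER_RE = re.compile(
    r"^\s*(Mr\.|Mrs\.|Ms\.|Dr\.|Chairman|Chairwoman|Senator) "
    r"([A-Z][A-Za-z'\-]+)\.\s+(.*)$"
)


def attribute_speakers(lines):
    """Return a list of (speaker, text) turns from transcript lines."""
    turns = []
    for line in lines:
        match = SPEAKER_RE.match(line)
        if match:
            title, last_name, text = match.groups()
            # Title plus last name is the key, so "Mr. Johnson" != "Ms. Johnson".
            turns.append([f"{title} {last_name}", text.strip()])
        elif turns and line.strip():
            # No marker: treat the line as a continuation of the current speaker.
            turns[-1][1] += " " + line.strip()
    return [(speaker, text) for speaker, text in turns]


example = [
    "    Mr. Johnson. Thank you, Chairman.",
    "I yield back.",
    "    Ms. Johnson. I have a question for the witness.",
]
turns = attribute_speakers(example)
```

A misspelled name or a turn that does not follow this pattern would silently be merged into the previous speaker's text, which is exactly the kind of inaccuracy described above.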

With that being said, do you all still feel like this code would fit in alongside everything else?
Also, if any of you happen to have a way to contact the people who write these transcripts, it would be very helpful if you could ask them to please adopt a consistent format 😆.

@connorjoleary
Contributor Author

Hey all, update on this project. I created a website to easily search and visualize this data. I'll likely continue to make updates to my fork of this repo, as well as use it to collect more text data. Please let me know if you would like this data to be available from this project by approving this PR.

Link to website: congresstext.com

@michaelblyons
Contributor

Depending on how far back you want to go, you may be interested in #236.


5 participants