Congressional Hearing Parser #290

connorjoleary · 2022-10-16T22:09:08Z

Overview

This project gathers transcripts made available by
the US Government Publishing Office and uses them
to attribute who said what during federal
congressional meetings. The data can then be used
to gather insights on the speaking patterns of each
representative.

How to use

1. Follow the installation instructions in the main README to install the required Python libraries.
2. Go to https://api.govinfo.gov/docs/ and create an API key.
3. Create a .env file in this folder containing the key:

GOV_INFO_API_KEY=<gov_info_key>

4. Run python congress/contrib/congressional_hearing_info/grab_congressional_hearings.py --num 10
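
For anyone who wants to check their key before running the script, here is a minimal sketch of reading the .env file and building a govinfo request URL. It is not the code in this PR. The helper names (read_env_file, hearings_url) are hypothetical, and the collection code CHRG and the pageSize and offsetMark parameters reflect my reading of the govinfo API docs, so please verify them at https://api.govinfo.gov/docs/ before relying on this.

```python
from pathlib import Path
from urllib.parse import urlencode


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def hearings_url(api_key, start_date, page_size=10):
    """Build a collections URL for hearings modified since start_date.

    Assumption: CHRG is the congressional hearings collection and the
    endpoint accepts pageSize, offsetMark, and api_key query parameters.
    """
    base = "https://api.govinfo.gov/collections/CHRG/" + start_date
    query = urlencode(
        {"pageSize": page_size, "offsetMark": "*", "api_key": api_key}
    )
    return f"{base}?{query}"
```

Calling hearings_url(read_env_file(".env")["GOV_INFO_API_KEY"], "2022-01-01T00:00:00Z") gives a URL that can be fetched with any HTTP client. Building the URL separately from fetching it keeps the key handling easy to test offline.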

dwillis · 2022-10-18T23:51:29Z

@connorjoleary thank you for this - it's a really useful area for us to go in. I'd like to hear from @JoshData about it, in particular about having a dependency on ProPublica's API (full disclosure: I currently run that API, but I'm not full-time at ProPublica and I can't guarantee that I'd be able to immediately address errors or downtime in every case). It appears that this PR uses the API to get current members of the House and Senate; I suspect there might be other ways to do that (using the congress-legislators repository, for example).

connorjoleary · 2022-10-19T00:06:02Z

Oh, a very good point. I'd be happy to switch out ProPublica's API for that. It should be a fairly straightforward change, assuming they both use the same IDs.
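
As a rough sketch of what that swap could look like: the congress-legislators data stores each person with an id block (including a Bioguide ID), a name block, and a list of terms ordered oldest first. The function below indexes already-loaded records by last name. The function name and the sample records are made up for illustration, and this is not the code in this PR.

```python
from collections import defaultdict


def index_current_members(legislators):
    """Map lowercase last name -> list of (bioguide_id, chamber, state)."""
    by_last_name = defaultdict(list)
    for person in legislators:
        latest_term = person["terms"][-1]  # terms are listed oldest first
        by_last_name[person["name"]["last"].lower()].append(
            (person["id"]["bioguide"], latest_term["type"], latest_term["state"])
        )
    return dict(by_last_name)


# Illustrative records only; the IDs and names are invented.
sample = [
    {"id": {"bioguide": "X000001"}, "name": {"first": "Ann", "last": "Johnson"},
     "terms": [{"type": "rep", "state": "TX"}]},
    {"id": {"bioguide": "X000002"}, "name": {"first": "Bob", "last": "Johnson"},
     "terms": [{"type": "rep", "state": "OH"}, {"type": "sen", "state": "OH"}]},
]
members = index_current_members(sample)
```

Keeping a list per last name, rather than a single entry, makes shared surnames visible so the parser can fall back on state or first name to pick the right person.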

DanielSchuman · 2022-10-19T00:42:52Z

There's also the new official congress.gov API, which should also have a list of current members of the House and Senate. https://www.congress.gov/help/using-data-offsite

JoshData · 2022-10-26T18:06:03Z

Thanks for sharing this, @connorjoleary.

Yeah, it would be nice if all of the data were fetched in a consistent way throughout this repository: legislators from congress-legislators, GPO documents from the fdsys scraper. But I won't block it based on that.

I'd like incoming code to remain maintained by its maintainer for some reasonable period of time and to be documented in a similar way to other tools in this repo (in the main README and the GitHub wiki section). And if that's the case, there's no need to put things inside a contrib directory - it can just be alongside everything else here. (That is, I would like to avoid this repo becoming a landing place for unmaintained code. That creates a burden for the rest of us.)

connorjoleary · 2022-10-30T20:53:36Z

Thank you all very much for the comments. I'm happy to maintain this code for a while after it is in place, but I do worry that the quality of this code is not up to snuff. The transcripts do not always follow consistent formatting, names are sometimes misspelled, and attributing who is speaking can be difficult (for example, one hearing had two people with the same last name and distinguished them only by gender). This means that the output of this hearing parser is not always accurate.

With that being said, do you all still feel like this code would fit in alongside everything else?
Also, if any of you happen to have a way to contact the people who write these transcripts, it would be very helpful if you could ask them to please adopt a consistent format 😆.
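
To make the attribution problem above concrete, here is a simplified sketch of splitting transcript lines into speaker turns. It is not the parser in this PR. It assumes a turn starts with a four-space indent, a title, a capitalized last name, and a period, which is only one of the ways speakers are indicated, and the pattern and function names are hypothetical.

```python
import re

# Assumed turn opener, e.g. "    Mr. Smith. Thank you." (not a GPO spec).
SPEAKER_RE = re.compile(
    r"^\s{4}(?P<title>Mr|Mrs|Ms|Dr|Chairman|Chairwoman|Senator)\.? "
    r"(?P<name>[A-Z][A-Za-z'\-]+)\. (?P<text>.*)$"
)


def split_turns(lines):
    """Group transcript lines into (title, name, text) speaker turns."""
    turns = []
    for line in lines:
        match = SPEAKER_RE.match(line)
        if match:
            turns.append([match["title"], match["name"], match["text"].strip()])
        elif turns and line.strip():
            # Continuation line: append to the most recent speaker's text.
            turns[-1][2] += " " + line.strip()
    return [tuple(turn) for turn in turns]


transcript = [
    "    Mr. Johnson. Thank you, Madam Chair.",
    "I yield back.",
    "    Ms. Johnson. I thank the gentleman.",
]
turns = split_turns(transcript)
```

In this example the two speakers share a last name, so the title is the only thing in the text that tells them apart. Keeping the title in each turn lets a later linking step use it, though misspelled names and inconsistent formatting would still slip past a pattern this strict.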

connorjoleary · 2023-02-11T22:32:52Z

Hey all, update on this project. I created a website to easily search and visualize this data. I'll likely continue to make updates to my fork of this repo, as well as use it to collect more text data. Please let me know if you would like this data to be available from this project by approving this PR.

Link to website: congresstext.com

michaelblyons · 2023-02-12T21:42:19Z

Depending on how far back you want to go, you may be interested in #236.

connorjoleary added 30 commits June 2, 2022 19:00

grab hearings from govinfo and setup parser

125b20d

add parse hearings

3d073c4

add group parser

5f8e33a

add type hinting

3beadda

Merge branch 'unitedstates:main' into main

f949f0f

start on text cleaner

70c40f2

add grabbing congress info

9bb329d

add edge case

c814a5b

add data classes

e95d574

format

9eca885

move files to nested folder

88c9c7f

edge case

ffcd352

map state initials to state

941271b

add start of link representative function

d1d1d0f

add hearing parser test

6ed1ccf

small fixes to cover corner cases

71f3f8e

add tests

a215e02

add one offs

6d7691a

add identify_people_present func

ccd86b2

match on more complex names

6094c69

expand link func

4b5a91f

add link code to hearing parser

95e0e07

small bug fixes and add todo

4067a32

restructure main hearing parser

6900894

move link congress member to file

b84864e

use size param

1b17925

add ways speakers are indicated

01cf455

output to pickle files

68a2af3

link by authority id and cleanup

2ead2b1

remove unused

9d859aa

connorjoleary added 14 commits October 10, 2022 21:37

allow lowercase and check for state

a31d145

move states list out

db8091d

return unique congress info

f818360

use stack overflow suggestion

78e4265

add check for matching first and last names

1de3f3f

return list not dict

b15fdc3

check if state at the start of member line

43b580a

special case for virginia

977324f

add md and split on tabs

ce2a9de

attempt to split on statement patterns

5258dda

add state initals todo

d0c5654

wrap up attempts

9a6cd95

format

fc6370a

add usage directions

113d096

switch out congress members source for local

b302181

connorjoleary added 7 commits November 1, 2022 18:36

update readme to not ask for propublica api

eda2c9a

add extra titles

943a6d1

Add statement of as speaker break

af72eef

format

7461cd0

simple linking of statement to speaker

2e7fb38

use most recent hearings

9759026

wait, nevermind it is last day, not first

b2f13bf

Merge branch 'main' into main

f556f1c
