BadZipfile: File is not a zip file #247

Closed
dsal1951 opened this issue Dec 27, 2019 · 4 comments

Comments

dsal1951 commented Dec 27, 2019

I'm trying to pull bill text for previous congresses for some data science research I'm working on. When I run the following command:

./run govinfo --collections=BILLS --extract=mods,text,xml,pdf --congress=114

I get a BadZipfile error on every bill (example for s29 below). If I try to open package.zip manually, I end up in the never-ending zip -> cpgz -> zip cycle.

The strange thing is that if I delete all of the text-versions subdirectories and then rerun the same command, it works fine for the vast majority of bills (~90%). I haven't found any pattern to this behavior, but I can confirm that I've observed it on both macOS and Ubuntu and across many congresses.

Error fetching package 114s29is in collection BILLS from https://www.govinfo.gov/app/details/BILLS-114s29is.
Traceback (most recent call last):
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 174, in update_sitemap2
    mirror_results = mirror_package(collection, package_name, lastmod, lastmod_cache.setdefault("packages", {}), options)
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 313, in mirror_package
    extracted_files = extract_package_files(collection, package_name, file_path, lastmod_cache, options)
  File "/Users/trent/Documents/congress/tasks/govinfo.py", line 371, in extract_package_files
    with zipfile.ZipFile(package_file) as package:
  File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/anaconda2/envs/congress2/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
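
For anyone trying to narrow this down, here is a minimal diagnostic sketch, not part of the congress codebase. It walks a directory tree, lists every package.zip that Python does not recognize as a zip archive, and lets you look at the first bytes of a bad file to see what was actually downloaded (an HTML error page, for example). The helper names `find_bad_zips` and `peek` are hypothetical, and the sketch only assumes that the cached packages are files named package.zip somewhere under the directory you pass in.

```python
import os
import zipfile


def find_bad_zips(root, filename="package.zip"):
    """Return sorted paths under `root` named `filename` that are not valid zips."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            # is_zipfile() returns False for a non-zip file instead of raising,
            # so one corrupt package does not stop the scan.
            if not zipfile.is_zipfile(path):
                bad.append(path)
    return sorted(bad)


def peek(path, n=64):
    """Return the first n bytes of a file, to see what it really contains."""
    with open(path, "rb") as f:
        return f.read(n)


if __name__ == "__main__":
    # "data" is an assumed location for the downloaded packages; adjust as needed.
    for bad_path in find_bad_zips("data"):
        print(bad_path, peek(bad_path))
```

Deleting the files this reports and rerunning the fetch is one way to test whether the failures come from stale cached downloads rather than from the extraction step itself.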

@dsal1951
Author

For some additional context, below is a table showing the fraction of bills whose text was properly extracted, broken down by congress. It looks like roughly 22% of bills failed with the BadZipfile error in each congress, except for the 114th, which had no failures for some strange reason.

Congress  Fraction extracted
103       0.784871
104       0.784470
105       0.783371
106       0.785635
107       0.772524
108       0.779064
109       0.773925
110       0.784199
111       0.777987
112       0.772009
113       0.777428
114       1.000000
115       0.783106
116       0.767378

JoshData commented Dec 29, 2019 via email

@JoshData
Member

Sorry for the delay. This seems to be working for me: #255

dwillis commented Apr 2, 2020

Merged #255, so deeming this resolved.
