Add support for compressed JSON (gz or bz2) in input #1

Open
gilliek opened this issue Mar 29, 2015 · 5 comments

@gilliek
Member

gilliek commented Mar 29, 2015

Since we are dealing with a huge amount of data, re-parsing all the projects with the source code parsers every time we update the source analyzer is very slow. It therefore makes sense to store the intermediate JSON. However, the JSON files are very large and take up a lot of disk space, so it would be useful to compress them.

gilliek self-assigned this Mar 29, 2015
@rolinh
Member

rolinh commented Mar 29, 2015

Hmm, what is preventing us from storing the intermediate JSON in a compressed format, then decompressing it and streaming it to srcanlzr directly? Why does this need to be handled by srcanlzr?

@gilliek
Member Author

gilliek commented Mar 29, 2015

Nothing is preventing us from doing so. The main advantage of doing it directly in Go is performance, IMHO. The bzip2 package of the Go standard library (http://golang.org/pkg/compress/bzip2/) provides a decompressing reader that implements the io.Reader interface, and the JSON decoder can read JSON directly from any reader. That way, the JSON is decompressed and decoded in a single streaming pass.

Besides, it only takes a few lines of code to implement that option. Since everything comes from the standard library, it does not require extra testing. So I see no reason not to implement it :)

@rolinh
Member

rolinh commented Mar 29, 2015

Fair enough.
It'd be interesting to micro-benchmark something like bzcat foo.json.bz2 | srcanlzr ... against having srcanlzr handle it all through bzip2 from the standard library using the reader interface. Just out of curiosity. :)

@gilliek
Member Author

gilliek commented Mar 29, 2015

Yeah for sure :)

@gilliek
Member Author

gilliek commented Mar 29, 2015

I bet that the pure Go version will be faster. Even if the Go standard library implementation is much slower than bzcat(1), the bzcat solution still has to read the bzipped file and write out the uncompressed JSON, which srcanlzr then has to read in turn, instead of the bzipped file simply being read once :)
