zip support? #4
Comments
So you have one zip file with multiple CSV files inside, and you want to query individual CSV files inside that? Like you have a [...] If so, you currently can't do that easily with [...]
create virtual table temp.students_reader using csv_reader(
id int,
name text,
birthdate text
);
select *
from temp.students_reader(
(select data from zipfile('bundle.zip') where name = 'students.csv')
);
If your zipfile had a single CSV file inside it (ex a [...]) [...]
I have a few features planned for the near future that'll offer a really flexible/extensible way to read CSVs from any source (filesystem/HTTP/S3/compression archives etc). Once that's complete, I could see something like this:
create virtual table temp.students_reader using csv_reader(
id int,
name text,
birthdate text
);
select *
from temp.students_reader(
zipfile_reader('bundle.zip', 'students.csv')
);
Where [...]
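Until something like that exists, one way to avoid both extracting the archive and holding a whole CSV in memory is to stream the member from outside SQL. The following is a minimal Python sketch, not part of this project: it uses only the standard library (`zipfile`, `csv`, `sqlite3`), and the helper name `load_csv_member`, the target table name, and the column list are assumptions borrowed from the `students.csv` example above.

```python
import csv
import io
import sqlite3
import zipfile


def load_csv_member(conn, zip_path, member, table, columns):
    """Stream one CSV member of a zip archive into a SQLite table.

    Hypothetical helper (not an API of any library). Rows are decompressed
    and inserted incrementally, so the CSV is never fully held in memory
    and nothing is extracted to disk.
    """
    # Identifiers are interpolated directly, so they must come from trusted code.
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"create table if not exists {table} ({col_list})")
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open returns a binary file object that decompresses lazily.
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text)  # first CSV row is used as the header
            rows = (tuple(row[c] for c in columns) for row in reader)
            conn.executemany(
                f"insert into {table} ({col_list}) values ({placeholders})",
                rows,
            )
    conn.commit()
```

Usage would look like `load_csv_member(conn, "bundle.zip", "students.csv", "students", ["id", "name", "birthdate"])`, after which `students` is an ordinary table. The trade-off is that the data is copied into the database rather than queried in place, and all values arrive as text.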
Thank you so much for your detailed explanation. I need to think about whether loading it fully into memory is a good idea. My original idea was to mount each user extract as virtual tables, union-all the users together in another view, and then hope predicate pushdown could buy back some performance. It sounds like I'll need to experiment a bit. I could imagine that being pathologically bad for joining comments from one user to posts from another. But I also don't want to re-implement Reddit; I just want some queries for data cleaning and analysis.
Just for future readers, here is what the data looks like for each user. There will be many of these zip files:
One other gotcha for this particular data set is that you have to add the username column back to each table; it's not included in a single user extract.
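The plan described above (one table per user extract, all users combined, username added back) could also be prototyped outside SQL by loading every archive into a single table. This is a rough Python sketch under loudly stated assumptions: the username is taken from the zip file's name, every archive contains a member called `comments.csv` with the same header, and the helper `load_user_exports` is hypothetical. None of these details are taken from the actual Reddit export layout.

```python
import csv
import io
import sqlite3
import zipfile
from pathlib import Path


def quote_ident(name):
    """Quote a SQL identifier so arbitrary CSV header names are safe to use."""
    return '"' + name.replace('"', '""') + '"'


def load_user_exports(conn, zip_dir, member="comments.csv", table="comments"):
    """Load the same CSV member from every zip in zip_dir into one table,
    prefixing each row with a username column (hypothetical helper)."""
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        username = zip_path.stem  # assumption: archives are named <username>.zip
        with zipfile.ZipFile(zip_path) as archive:
            if member not in archive.namelist():
                continue  # this user has no such CSV
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                header = next(reader, None)
                if header is None:
                    continue  # empty file
                cols = ", ".join(quote_ident(c) for c in ["username"] + header)
                marks = ", ".join("?" for _ in range(len(header) + 1))
                # Assumes every archive shares the header of the first one seen.
                conn.execute(
                    f"create table if not exists {quote_ident(table)} ({cols})"
                )
                conn.executemany(
                    f"insert into {quote_ident(table)} ({cols}) values ({marks})",
                    ([username] + row for row in reader),
                )
    conn.commit()
```

Because the result is a regular table, a filter such as `where username = ?` can be served by an ordinary index instead of relying on predicate pushdown through a union-all view. The cost is a one-time copy of every extract into the database.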
I was wondering if zip support could be implemented.
I am working with Reddit Data Export, which arrives as a zip file with a couple of different CSVs, one zip file per user. Some of them can be quite large depending on user activity, so it would be nice to use the virtual table facilities w/o having to extract the CSVs.
Or if you have other ideas, I'd appreciate it. Thanks much.