Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zip support? #4

Open
nfultz opened this issue Jul 5, 2023 · 2 comments
Open

zip support? #4

nfultz opened this issue Jul 5, 2023 · 2 comments

Comments

@nfultz
Copy link

nfultz commented Jul 5, 2023

I was wondering if zip support could be implemented.

I am working with Reddit Data Export, which arrives as a zip file with a couple different CSVs, one zip file per user - some of them can be quite large depending on user activity, so would be nice to use the virtual table facilities w/o having to extract the csvs.

Or if you have other ideas, I'd appreciate it. Thanks much.

@asg017
Copy link
Owner

asg017 commented Jul 5, 2023

So you have one zip file with multiple CSVs file inside, and you want to query individual CSV files inside that? Like you have a bundle.zip which contains ./a.csv, ./b.csv, ./c.csv etc, and you want to be able to do something like "select all rows from the a.csv file inside bundle.zip"?

If so, you currently can't do that easily with sqlite-xsv alone, although you could potentially combine the SQLite zipfile extension with csv_reader to do what you'd like. The downside is that this will read the entire CSV file into memory, and won't work for files larger than 1GB. Here's what that might look like:

create virtual table temp.students_reader using csv_reader(
  id int,
  name text,
  birthdate text
);

select *
from temp.students_reader(
  (select data from zipfile("bundle.zip") where name = "students.csv")
);

If your zipfile had a single CSV file inside it (ex a a.zip that only contained a sole a.csv file), then sqlite-xsv could support that with relative ease. But since you have multiple CSV files in the zip file, then it'll be a bit awkward to specify which file entry inside the zipfile should be used.

I have a few features planned for the near-future that'll offer a really flexible/extensible way read CSVs from any source (filesystem/HTTP/S3/compression archives etc). Once that's complete, I could see something like this:

create virtual table temp.students_reader using csv_reader(
  id int,
  name text,
  birthdate text
);

select * 
from temp.student_reader(
  zipfile_reader('bundle.zip', 'students.csv')
);

Where zipfile_reader would be a function from another future SQLite extension sqlite-zip, which would efficiently read a specific entry in a zipfile and pass it along to sqlite-xsv for CSV processing.

@nfultz
Copy link
Author

nfultz commented Jul 6, 2023

Thank you so much for your detailed explanation.

I need to think about whether loading it fully into memory is a good idea. My original idea was to mount each user extract as virtual tables, union-all the users together in another view, and then hope predicate pushdown could buy back some performance. It sounds like I'll need to experiment a bit, I could imagine that being pathologically bad for joining comments from one user, to posts from another.

But I also don't want to re-implement reddit, I just want some queries for data cleaning and analysis.

Just for future readers, here is what the data looks like for each user. There will be many of these zip files:

downloads$unzip -l export_nfultz_20230612.zip 
Archive:  export_nfultz_20230612.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       31  2023-07-03 23:10   account_gender.csv
       46  2023-07-03 23:10   approved_submitter_subreddits.csv
   328813  2023-07-03 23:10   chat_history.csv
    76706  2023-07-03 23:10   comment_headers.csv
    94506  2023-07-03 23:10   comment_votes.csv
   133707  2023-07-03 23:10   comments.csv
      142  2023-07-03 23:10   drafts.csv
     1810  2023-07-03 23:10   friends.csv
       32  2023-07-03 23:10   gilded_comments.csv
       32  2023-07-03 23:10   gilded_posts.csv
      671  2023-07-03 23:10   hidden_posts.csv
    12893  2023-07-03 23:10   ip_logs.csv
       22  2023-07-03 23:10   linked_identities.csv
       14  2023-07-03 23:10   linked_phone_number.csv
    14559  2023-07-03 23:10   message_headers.csv
   316397  2023-07-03 23:10   messages.csv
       36  2023-07-03 23:10   moderated_subreddits.csv
     1374  2023-07-03 23:10   multireddits.csv
      440  2023-07-03 23:10   poll_votes.csv
  1315419  2023-07-03 23:10   post_headers.csv
   636504  2023-07-03 23:10   post_votes.csv
  1744510  2023-07-03 23:10   posts.csv
       48  2023-07-03 23:10   reddit_gold_information.csv
     1371  2023-07-03 23:10   saved_comments.csv
     7840  2023-07-03 23:10   saved_posts.csv
       71  2023-07-03 23:10   scheduled_posts.csv
      189  2023-07-03 23:10   statistics.csv
      868  2023-07-03 23:10   subscribed_subreddits.csv
       10  2023-07-03 23:10   twitter.csv
     2106  2023-07-03 23:10   user_preferences.csv
     2542  2023-07-03 23:10   checkfile.csv
---------                     -------
  4693709                     31 files

One other gotcha for this particular data set is that you have to add the username column back to each table, it's not included in a single user extract.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants