Add possibility to preview tables #59

Closed
hagenw opened this issue Feb 23, 2024 · 2 comments · Fixed by #97
Labels
enhancement New feature or request

Comments

@hagenw
Member

hagenw commented Feb 23, 2024

It might be of interest to allow an interactive preview of tables on the datacard.

E.g. one solution could be to pre-load the first 10 lines of every table and add them to the static web page.
Another solution might be to provide an interface for selecting a table to preview, so that the first 10 lines of the table are read (and maybe downloaded) only when requested.
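
As a rough illustration of the first option, a minimal sketch of rendering the first 10 rows of a CSV table as static HTML could look like the following. The function name `preview_table_html` and the HTML layout are assumptions made for illustration only; how the datacard is actually generated is not covered here.

```python
import csv
import html
import io
import itertools


def preview_table_html(csv_text, n_rows=10):
    """Render the header and first n_rows of CSV text as an HTML table.

    Hypothetical helper, not part of any existing package.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # islice stops after n_rows, so the rest of the table is never parsed
    rows = itertools.islice(reader, n_rows)
    head = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Made-up example table with 20 rows
example = "file,speaker\n" + "".join(f"audio/{i:04d}.wav,s<{i}>\n" for i in range(20))
snippet = preview_table_html(example)
```

The resulting string could be embedded in the static page at build time, at the cost of reading every table once during the build.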

@hagenw
Member Author

hagenw commented Feb 23, 2024

For the preview of tables, it would also be interesting to see whether we could benefit from storing the tables in a different format (e.g. parquet) in the repository, e.g. if it were possible to stream just the first 10 lines from the repo when requested instead of downloading the whole table.

@hagenw
Member Author

hagenw commented Jun 21, 2024

Good news: when storing tables as PARQUET files on the backend, we can preview them without downloading the whole file.

The following example demonstrates this with a dependency table from our internal server, as we don't have a real table published on the server yet (copied from audeering/audformat#376 (comment)):

import aiohttp
import fsspec
import pyarrow.parquet as parquet

import audbackend


host = "https://artifactory.audeering.com/artifactory"
auth = audbackend.backend.Artifactory.get_authentication(host)
repository = "data-public-local"

# Prepare fsspec https file-system to communicate with Artifactory
fs = fsspec.filesystem("https", auth=aiohttp.BasicAuth(auth[0], auth[1]))

# Preview dependency table of casual-conversations-v2 dataset
dataset = "casual-conversations-v2"
version = "1.0.0"
url = f"{host}/{repository}/{dataset}/db/{version}/db-{version}.parquet"
file = parquet.ParquetFile(url, filesystem=fs)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())

which returns

                                      file                               archive  bit_depth  channels  ... removed  sampling_rate type  version
0                      db.disabilities.csv                          disabilities          0         0  ...       0              0    0    1.0.0
1                             db.files.csv                                 files          0         0  ...       0              0    0    1.0.0
2               db.physical-adornments.csv                   physical-adornments          0         0  ...       0              0    0    1.0.0
3               db.physical-attributes.csv                   physical-attributes          0         0  ...       0              0    0    1.0.0
4                         db.recording.csv                             recording          0         0  ...       0              0    0    1.0.0
5                         db.skin-tone.csv                             skin-tone          0         0  ...       0              0    0    1.0.0
6                           db.speaker.csv                               speaker          0         0  ...       0              0    0    1.0.0
7  audio/0000_portuguese_nonscripted_1.wav  f76b3d4a-a172-63ee-22f2-fb2255d692ee         16         1  ...       0          48000    1    1.0.0
8  audio/0000_portuguese_nonscripted_2.wav  81db070f-69a1-ab92-a365-ca95ac36c893         16         1  ...       0          48000    1    1.0.0
9  audio/0000_portuguese_nonscripted_3.wav  d4572eb1-d458-7717-2145-a7861208b8da         16         1  ...       0          48000    1    1.0.0

[10 rows x 11 columns]

This means it should now be much easier to integrate a fast table preview feature, at least for tables we store in PARQUET.
For the CSV tables it might be slightly more complicated, as those are stored inside a ZIP file and we would need to read the first 10 rows of the CSV file from within the ZIP archive. I think this should also be possible, but I don't know how yet.
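
One possible direction for the CSV case, as an untested sketch: Python's `zipfile` module works on any seekable file object and decompresses a member incrementally, so reading only the first rows should not require decompressing the whole CSV file. The helper name `preview_csv_in_zip` is hypothetical, and the example below uses an in-memory ZIP with made-up content so that it runs without a server. Whether passing a remote file handle (e.g. from `fs.open(url)` with the fsspec file-system above) avoids downloading the whole archive from Artifactory is an assumption that would still need to be checked.

```python
import csv
import io
import itertools
import zipfile


def preview_csv_in_zip(fileobj, member, n_rows=10):
    """Return the header and first n_rows of a CSV file stored in a ZIP.

    Hypothetical helper; fileobj must be a seekable binary file object.
    """
    with zipfile.ZipFile(fileobj) as archive:
        # archive.open() returns a stream that decompresses on demand
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader)
            rows = list(itertools.islice(reader, n_rows))
    return header, rows


# Build a small in-memory ZIP with made-up content for demonstration
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    lines = ["file,speaker"] + [f"audio/{i:04d}.wav,s{i}" for i in range(100)]
    archive.writestr("db.files.csv", "\n".join(lines) + "\n")
buffer.seek(0)

header, rows = preview_csv_in_zip(buffer, "db.files.csv")
```

The member name inside the archive would have to be known in advance or looked up with `archive.namelist()`.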

/cc @ChristianGeng
