There should be a built-in way to list prefixes (directories) in a bucket #294

Closed
lrowe opened this issue Oct 13, 2020 · 7 comments · Fixed by #837
Labels
api: storage Issues related to the googleapis/python-storage API. type: cleanup An internal cleanup or hygiene concern. type: docs Improvement to the documentation for an API.

Comments

@lrowe

lrowe commented Oct 13, 2020

The REST API provides a way to list prefixes (directories) in a bucket but this does not seem to be supported by the client library.

According to #192, the iterator returned by list_blobs has a prefixes field which is filled in as you iterate over the blobs. This should be better documented, since it is only mentioned in passing:

include_trailing_delimiter (boolean) – (Optional) If true, objects that end in exactly one instance of delimiter will have their metadata included in items in addition to prefixes.

https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.list_blobs

As well as a method to list prefixes alone, it would be helpful to have an iterator that returned both prefixes and objects in order, to produce ordered listings.

Workaround found on Stack Overflow. https://stackoverflow.com/a/59008580

def list_prefixes(client, bucket_name, prefix, delimiter):
    """Return an iterator over the prefixes ("directories") under `prefix`."""
    # Adapted from https://stackoverflow.com/a/59008580
    from google.api_core import page_iterator

    return page_iterator.HTTPIterator(
        client=client,
        # Note: _connection is a private attribute of the client.
        api_request=client._connection.api_request,
        path=f"/b/{bucket_name}/o",
        # Read the "prefixes" key of each response page instead of "items".
        items_key="prefixes",
        item_to_value=lambda iterator, item: item,
        extra_params={
            "projection": "noAcl",
            "prefix": prefix,
            "delimiter": delimiter,
        },
    )
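
For the combined, ordered listing requested above, here is a minimal sketch. It assumes the prefixes and object names have already been collected as plain strings; `merge_listing` is a hypothetical helper for illustration and not part of the client library.

```python
def merge_listing(prefixes, object_names):
    """Return (kind, name) pairs for prefixes and objects in one sorted order.

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    entries = [(name, "prefix") for name in prefixes]
    entries += [(name, "object") for name in object_names]
    # Sort by name so "directories" and objects interleave lexicographically.
    return [(kind, name) for name, kind in sorted(entries)]
```

Sorting on the name alone keeps the result in plain string order, which may differ from how a file browser would group directories first.
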
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label Oct 13, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Oct 14, 2020
@HemangChothani HemangChothani added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Oct 15, 2020
@HemangChothani HemangChothani self-assigned this Oct 20, 2020
@frankyn
Member

frankyn commented Oct 21, 2020

Hi @lrowe,
Thanks for filing the issue.

Would it help to document an example instead of changing the pydocs? Here's an example:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/storage/cloud-client/storage_list_files_with_prefix.py#L49-L65

@lrowe
Author

lrowe commented Oct 21, 2020

@frankyn An example would be helpful too, but I think the prefixes property should be included in the API docs, since they are expected to be exhaustive.

Perhaps the :returns: section of list_blobs could be expanded to mention the special prefixes property of the iterator, see list_blobs docstring:

        :rtype: :class:`~google.api_core.page_iterator.Iterator`
        :returns: Iterator of all :class:`~google.cloud.storage.blob.Blob`
                  in this bucket matching the arguments.

@lrowe
Author

lrowe commented Oct 21, 2020

I do worry though that the current list_blobs Iterator.prefixes set might be problematic when there are a very large number of prefixes since it would grow unbounded.
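
One possible way to consume prefixes incrementally is to walk the iterator page by page. This is only a sketch: it assumes each page object exposes a `prefixes` attribute, which is not part of the documented interface, and `iter_prefixes_by_page` is a hypothetical name. It lets the caller avoid holding every prefix at once, but it does not stop the iterator's own `prefixes` set from accumulating.

```python
def iter_prefixes_by_page(client, bucket_name, prefix="", delimiter="/"):
    """Yield prefixes one API page at a time.

    Hypothetical helper; assumes each page carries a `prefixes` attribute.
    """
    blobs = client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)
    for page in blobs.pages:
        # getattr guards against a page that lacks the assumed attribute.
        for name in getattr(page, "prefixes", ()):
            yield name
```
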

@frankyn
Member

frankyn commented Oct 21, 2020

Thanks for clarifying @lrowe, how many prefixes are you expecting right now? We'd need to run tests to see how the library behaves.

@HemangChothani could you update your PR to provide better documentation instead for now?

@lrowe
Author

lrowe commented Oct 21, 2020

I ran into this while building a Cloud Function storage trigger that generates directory listings for a bucket exposed as a static website. The bucket contains output from programmatically generated analysis runs. Some of these 'directories' have ~50,000 subdirectories, and I can imagine future cases with an order of magnitude more, so I will need to split the listing over multiple pages to keep it usable.
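
Splitting a long listing into fixed-size pages does not depend on the client library. A minimal sketch, where `paginate` and `page_size` are hypothetical names chosen for illustration:

```python
from itertools import islice


def paginate(names, page_size):
    """Yield lists of at most page_size names from any iterable."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    iterator = iter(names)
    while True:
        # islice pulls the next chunk without materializing the whole input.
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page
```

Because it accepts any iterable, it can be fed directly from a prefix iterator such as the workaround above, so only one page of names is held in memory at a time.
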

@frankyn frankyn added api: docs Issues related to the Docs API API. type: cleanup An internal cleanup or hygiene concern. type: docs Improvement to the documentation for an API. and removed type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: docs Issues related to the Docs API API. labels Jun 3, 2021
@frankyn
Member

frankyn commented Jun 3, 2021

The Python documentation for list_blobs() needs to be updated to clarify that the returned iterator has a prefixes property. Right now the documentation only states that an Iterator type is returned.

Returns:
Iterator of all :class:`~google.cloud.storage.blob.Blob`
in this bucket matching the arguments.

@amirbtb

amirbtb commented Dec 27, 2021

Hi,

I believe there is an issue with the prefixes property of the iterator.
The set returned is empty right after I call the client:

result = client.list_blobs(
    bucket_or_name='my-bucket',
    prefix='',
    delimiter='/'
)

result.prefixes
# set()

But once I loop over the result iterator, I can access the prefixes:

result = client.list_blobs(
    bucket_or_name='my-bucket',
    prefix='',
    delimiter='/'
)

# Iterate over the result itself; this fetches the pages and fills in result.prefixes.
for element in result:
    pass

result.prefixes
# {'folder1/', 'folder2/'}
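
The behavior shown above can be wrapped in a small helper that drains the iterator before reading `prefixes`. This is a sketch, not a library feature: `collect_prefixes` is a hypothetical name, and the call still lists the objects that sit directly under `prefix`.

```python
def collect_prefixes(client, bucket_name, prefix="", delimiter="/"):
    """Return the sorted prefixes found directly under `prefix`.

    Hypothetical helper; relies on `prefixes` being filled in during iteration.
    """
    blobs = client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)
    for _ in blobs:
        pass  # Consume every page so that blobs.prefixes is fully populated.
    return sorted(blobs.prefixes)
```
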

It would be great to have a built-in way to list prefixes directly without listing files, since listing files can be expensive in time and compute.

Thank you in advance (and Merry Christmas 🎄)

@cojenco cojenco self-assigned this Aug 9, 2022
gcf-merge-on-green bot pushed a commit that referenced this issue Aug 11, 2022
Improve documentation as part of 294
- clarify `prefixes` entity exists as part of the response
- add link to sample browser ["List the objects in a bucket using a prefix filter"](https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix#storage_list_files_with_prefix-python)

Fixes #294 🦕
6 participants