[FEA]Improve the file reading by using local file caching #1435

Closed
4 tasks done
GaryShen2008 opened this issue Dec 29, 2020 · 2 comments
Assignees
Labels
performance A performance related task/issue

Comments

@GaryShen2008
Collaborator

GaryShen2008 commented Dec 29, 2020

Is your feature request related to a problem? Please describe.
When the same data files are loaded from a remote data source multiple times, a local file caching mechanism would significantly improve read performance.

Describe the solution you'd like
Alluxio is an open source project that provides exactly this kind of caching. We'd like to use Alluxio as the file caching service alongside our plugin to provide a solution for frequent remote reads.
We'd also like to optimize file reading and partitioning in our plugin around Alluxio's core.
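
As a rough illustration of the idea (not the plugin's actual implementation), a read path could be redirected to the cache by rewriting remote URI prefixes to Alluxio URIs before the file is opened. The function name `to_alluxio_path`, the host `alluxio-master`, the bucket names, and the mapping below are all hypothetical; 19998 is assumed here as the Alluxio master port.

```python
# Minimal sketch: rewrite remote paths to Alluxio paths via prefix mapping.
# All names, hosts, and buckets here are hypothetical placeholders.

ALLUXIO_MASTER = "alluxio://alluxio-master:19998"  # assumed host and port

# Hypothetical mapping from remote prefixes to Alluxio mount points
REPLACEMENTS = {
    "s3://my-bucket": ALLUXIO_MASTER + "/my-bucket",
    "gs://other-bucket": ALLUXIO_MASTER + "/other-bucket",
}


def to_alluxio_path(path, replacements):
    """Return the Alluxio equivalent of `path` if a prefix rule matches.

    Paths with no matching prefix are returned unchanged, so reads from
    sources that are not cached keep going to the original location.
    """
    # Check longer prefixes first so the most specific rule wins
    for prefix in sorted(replacements, key=len, reverse=True):
        if path.startswith(prefix):
            return replacements[prefix] + path[len(prefix):]
    return path


cached = to_alluxio_path("s3://my-bucket/data/part-0.parquet", REPLACEMENTS)
untouched = to_alluxio_path("hdfs://namenode/data/part-0.parquet", REPLACEMENTS)
```

With a mapping like this, repeated reads of the same remote files would go through the cache, while paths that match no rule fall through untouched.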

Tasks:

  • Productionize plugin to work with different filesystems (AWS, Azure, GCP, DBFS)
  • Create tests for Alluxio with different filesystems
  • Create a user guide for Alluxio settings
  • Verify Alluxio can run in an on-prem cluster
@GaryShen2008 GaryShen2008 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Dec 29, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 5, 2021
@wbo4958
Collaborator

wbo4958 commented Feb 26, 2021

For now, we have merged PR #1562 for the V1 data source. We will discuss whether we need to add Alluxio support for the V2 data source, so I am closing this issue for now. I will open another issue if we plan to support the V2 data source.

@wbo4958 wbo4958 closed this as completed Feb 26, 2021
@tgravescs
Collaborator

Please open an issue to track the V2 data source; we can prioritize it as needed.

@sameerz sameerz added performance A performance related task/issue and removed feature request New feature or request labels Mar 2, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
[auto-merge] bot-auto-merge-branch-23.10 to branch-23.12 [skip ci] [bot]
4 participants