Upcoming major improvements #209
I would vote for "coords generation", especially with FITS.
"parquet storage (from preffs)" sounds nifty, but for what it's worth, I did discuss with @jakirkham what it would look like to use zarr for the storage of kerchunk itself 😄
I would certainly love to hear your ideas, and the thought had occurred to me. In favour of parquet:
In favour of zarr:
Moving to parquet or zarr sounds like a great idea.
@emfdavid can you give us an update here? I'm hitting memory issues trying to generate/use kerchunk on the NWM 1km gridded CONUS dataset from https://registry.opendata.aws/nwm-archive/. Creating/loading the consolidated JSON for just 10 years of this 40-year dataset takes 16GB of RAM.
@rsignell-usgs, are you using tree reduction? Since there is a lot of redundancy between the individual files, that should need less peak memory.
@martindurant, yes, there are 100,000+ individual JSONs that cover the 40-year period. I use 40 workers that each consolidate a single year. Access to an individual single-year JSON (which takes 1.5GB of memory) is shown here: https://nbviewer.org/gist/26f42b8556bf4cab5df81fc924342d5d I don't have enough memory on the ESIP qhub to combine the 40 yearly JSONs into a single JSON. :(
You might still be able to tree-reduce further: try combining in batches of 5 or 8, and then combining those?
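The batched tree reduction being suggested can be sketched in plain Python. This is a generic skeleton, not kerchunk's API: `combine` is a placeholder callable; with kerchunk you would presumably wrap something like `MultiZarrToZarr(batch, ...).translate()` there, so that only one small batch of reference sets is in memory at a time.

```python
def tree_combine(items, combine, batch_size=8):
    """Repeatedly combine `items` in batches of `batch_size`, so only
    one batch of reference sets needs to be held in memory at a time."""
    while len(items) > 1:
        items = [
            combine(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)
        ]
    return items[0]

# Toy demo: "combining" reference dicts by merging them.
def merge_refs(batch):
    out = {}
    for refs in batch:
        out.update(refs)
    return out

# 40 stand-in yearly reference sets, as in the NWM example above.
yearly = [{f"var/{y}": f"file_{y}.json"} for y in range(1980, 2020)]
combined = tree_combine(yearly, merge_refs, batch_size=8)
```

With 40 inputs and `batch_size=8`, the first pass produces 5 intermediate sets and the second pass produces the final one, so peak memory scales with the batch size rather than the total number of years.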
After creating the filesystem for one year, I see 1.2GB in use. I'll look into it.
I am indeed working on the parquet backend, which should give a better memory footprint per reference set; but strings are still strings, so all those paths add up once in memory unless templates are only applied at access time. Hm.
However, it may be possible, instead, to make the combine process not need to load all the reference sets up front.
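The point about path strings dominating memory, and templates helping only if applied at access time, can be illustrated with a rough sketch. The bucket and file names here are made up; the idea is to store one shared prefix plus short per-chunk suffixes, and expand to a full URL only when a reference is actually accessed.

```python
import sys

# Hypothetical references: every chunk repeats the same long prefix.
full_paths = [
    f"s3://example-bucket/model_output/1979/file_{i:06d}.comp"
    for i in range(10_000)
]
raw_bytes = sum(sys.getsizeof(p) for p in full_paths)

# Template applied at access time: the prefix is stored once,
# and only the short suffixes are kept per reference.
template = "s3://example-bucket/model_output/1979/"
suffixes = [f"file_{i:06d}.comp" for i in range(10_000)]
templated_bytes = sys.getsizeof(template) + sum(
    sys.getsizeof(s) for s in suffixes
)

def resolve(i):
    """Expand a reference to its full URL only when it is accessed."""
    return template + suffixes[i]
```

Materializing `full_paths` up front costs the full string size per chunk, while the templated form pays for each long prefix only once.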
Hi Rich
I am travelling this week. Martin is ahead of me anyway though.
I am working on open-sourcing the HRRR aggregation I built, but I was
over-optimistic about doing it while travelling for work.
I am doing a four-step tree process:
1. scan_grib to extract the metadata and write raw objects one to one
with the original forecast hour grib files
2. Daily multizarr aggregations from each individual forecast hour
3. Monthly multizarr aggregations from each of the daily aggregations
4. All time multizarr aggregations from the monthly aggregations
I do this on a per-forecast-horizon basis, so I end up with aggregations
for the 0, 1, 2 ... 17 & 18 hour horizons. Then I get 6-hour aggregations out to the
48-hour horizon, because HRRR only runs a full 48-hour model every six hours:
19-24, 25-30, 31-36, 37-42, 43-48.
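The horizon banding described above can be sketched as a small helper (a hypothetical function name, just to make the cut-offs concrete: hourly groups out to 18 hours, then 6-hour bands to 48):

```python
def horizon_group(h):
    """Map a HRRR forecast horizon (hours) to its aggregation group:
    hourly groups for 0..18, then 6-hour bands 19-24 ... 43-48."""
    if not 0 <= h <= 48:
        raise ValueError(f"horizon {h} outside supported range")
    if h <= 18:
        return str(h)
    lo = 19 + 6 * ((h - 19) // 6)
    return f"{lo}-{lo + 5}"
```

Each group would then get its own chain of daily, monthly, and all-time aggregations.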
I am thinking I will drop the all time aggregation and use the
multizarr tool on the fly to build the date range that I need from the
monthly chunks.
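Building a date range on the fly from the monthly chunks could look like this stdlib sketch; the `'YYYYMM'` keys are an assumed naming scheme for the monthly aggregation files, each of which would be one input to a multizarr combine.

```python
from datetime import date

def monthly_keys(start: date, end: date):
    """Yield 'YYYYMM' keys for every month overlapping [start, end];
    each key would select one monthly aggregation to combine on the fly."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield f"{y}{m:02d}"
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

# Example: which monthly aggregations cover mid-Nov 2021 to early Feb 2022?
keys = list(monthly_keys(date(2021, 11, 15), date(2022, 2, 3)))
```

Only the handful of monthly reference sets touching the requested range need to be loaded, instead of the all-time aggregation.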
I am on the train right now with terrible wifi - I will try and grab memory
use stats for you later.
Best
David
Thanks for the update @emfdavid. Are those HRRR JSONs in a publicly accessible bucket (perhaps requester-pays)? Do you have an example notebook?
@martindurant I was able to create four 10-year combined JSONs from the 40 individual yearly JSON files. The process to create each of these 10-year files took 16GB of the 32GB of memory on the Xlarge instance at https://jupyer.qhub.esipfed.org. I was unable to create the 40-year combined file from these four 10-year files, though -- it blew past the 32GB of memory.
Try #272 |
With the latest commits in #272, I could combine 13 years directly with peak memory around 13GB.
Just to make sure I've got the right version, I have this. You?
yes |
FYI, the Pangeo ML augmentation with support for some of these tasks through the NASA ACCESS 2019 program is on FigShare.
Stuff that would be cool to get done and is well within our capacity. Please vote if you have favourites here!
meta_array
zarr-developers/zarr-python#1131