-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ensemble name filter to get_units_from_file_object #169
Conversation
This issue affects some of the Traceback (most recent call last):
File "/usr/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/lib/python3.8/site-packages/flask_cors/extension.py", line 161, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/ce/views.py", line 12, in api_request
return ce.api.call(db.session, *args, **kwargs)
File "/app/ce/api/__init__.py", line 97, in call
rv = func(session, **args)
File "/app/ce/api/data.py", line 156, in data
"units": get_units_from_run_object(
File "/app/ce/api/util.py", line 42, in get_units_from_run_object
raise Exception(
Exception: File list [EXTREMELY LONG LIST REMOVED] does not have consistent units {'%', 'days'}
2020-10-01 20:52:37 [102] [INFO] 172.18.0.1 - - [01/Oct/2020:20:52:37 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=tx90pETCCDI&emission=historical,+rcp85×cale=monthly&time=0&area= HTTP/1.1" 500 290 "https://services.pacificclimate.org/pcex/app/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0" I built a demo server from this branch, and tested it with the same tx90p URL, but got a different error. Traceback (most recent call last):
File "/usr/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/ce/views.py", line 12, in api_request
return ce.api.call(db.session, *args, **kwargs)
File "/app/ce/api/__init__.py", line 97, in call
rv = func(session, **args)
File "/app/ce/api/data.py", line 156, in data
"units": get_units_from_run_object(
File "/app/ce/api/util.py", line 40, in get_units_from_run_object
units = {
File "/app/ce/api/util.py", line 41, in <setcomp>
get_units_from_file_object(file_, varname, ensemble_name) for file_ in files
File "/app/ce/api/util.py", line 33, in get_units_from_file_object
raise Exception(
Exception: Variable tx90pETCCDI is not indexed for file /storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CanESM2_historical+rcp85_r1i1p1/tx90pETCCDI_yr_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp85_r1i1p1_1950-2100.nc
2020-10-01 20:58:46 [7] [INFO] 172.17.0.1 - - [01/Oct/2020:20:58:46 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=tx90pETCCDI&emission=historical,+rcp85×cale=monthly&time=0&area= HTTP/1.1" 500 290 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0" I'm not sure yet if this is a code issue or if there's something wrong with the specified |
I haven't yet had time to double-check the database encoding of the |
ce/api/util.py
Outdated
@@ -25,18 +25,21 @@ def get_units_from_netcdf_file(nc, variable): | |||
return nc.variables[variable].units | |||
|
|||
|
|||
def get_units_from_file_object(file_, varname): | |||
def get_units_from_file_object(file_, varname, ensemble_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the wrong place to add the extra filtering functionality.
This function is passed a datafile and asked to return the units of the requested variable in that file.
So for example if the user was seeking information about tx90pETCCDI
with the r1i1p1
run in the ce_files
data collection ("ensemble") used by the Climate Explore Website, get_units_from_run_object()
would collect all files that have the variable tx90pETCCDI
and the run r1i1p1
. Then it would get the set of all units from each file, in order to make sure they all had the same units and were comparable, and if so, the rest of the API would go about returning that comparison information.
So in this case get_units_from_file_object
is called on every file as long as that file has the following two properties:
- it contains some
tx90pETCCDI
data - it has the run
r1i1p1
So if you call this function with, say, /storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CanESM2_historical+rcp85_r1i1p1/tx90pETCCDI_yr_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp85_r1i1p1_1950-2100.nc
- this file does have tx90pETCCDI
data, and it does have the r1i1p1
run, but it is not in the ce_files
group, because it's a weird 150 year file, not a 30-year summary file, and we don't display the 150 year files on the web because it takes one million years, but we do do some other things with them.
(The presence of this file and others like it is the reason we need to add the capability to filter unit-fetches by data collection in the first place - in addition to being too big and noisy to show on the website, it has different units than the files we actually are looking for in the ce_files
group. Currently you can't view any tx90pETCCDI
data at all with the data
API because the unmodified version of get_units_from_run_object()
thinks there's a units conflict, even though the files with different units are in different data collections and can't actually be viewed at the same time).
So when get_units_from_file_object()
is asked what units are in this file, it iterates over the variables in the file (the first for
loop, line 29) and over the data collections this file is a member of (the second for
loop, line 30) but it never returns anything, because the file it is being asked about is not a member of the ce_files
data collection, so the second half of the line 31 if
is never true. Instead it reaches the end of the function and raises an error, halting calculations, which is not what we want.
I think the easiest way forward would be to move filtering-by-data-group up into get_units_from_run_object()
or maybe get_files_from_run_variable()
, though there are definitely other options, if you wanted to do something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also add that these kinds of operations on database objects can be extremely slow, as they issue separate queries to the database to access relation-based values like file_.data_file_variables
and dfv.ensembles
. These are convenient for programming but can be very inefficient. The better way is to issue queries with appropriate joins in the first place and rather walk through the database after a more limited query. The balance of course is between programming convenience and flexibility on the one hand and efficient use of the database on the other.
ce/api/util.py
Outdated
and ensemble.name == ensemble_name | ||
): | ||
files.append(file_) | ||
return files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a tighter and more efficient way of accomplishing this:
return list(filter(
lambda file_: any(
variable == dfv.netcdf_variable_name and
any(ensemble.name == ensemble_name for ensemble in dfv.ensembles)
for dfv in file_.data_file_variables
),
run.files
))
Generally speaking, as in R, loops are a code smell, or at least a reason to look at alternatives. Almost always there is a comprehension or a filter
expression that will do the same job as a something that builds up a result by weeding through another iterable.
You don't have to wrap the filter
in list
unless the result must be a list instead of a generic iterable.
Not sure if Black will agree with my spacing there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an alternative that uses a comprehension instead of a filter. It may be easier to read:
return (
file_ for file_ in run.files if any(...)
)
The parentheses (return (...)
) make this a generator. If you used brackets (return [...]
) it would be a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are pretty much identical formulations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the advice! I have just been working on this when my other projects are slow.
0224566
to
78a7da4
Compare
It works! Data for tx90p: {
"r1i1p1": {
"data": {
"1977-01-15T00:00:00Z": 10.529546415815528,
"1986-01-15T00:00:00Z": 12.998391385738454,
"1997-01-15T00:00:00Z": 14.331152288677236,
"2025-01-15T00:00:00Z": 20.63730692712576,
"2055-01-15T00:00:00Z": 34.882464440127826,
"2085-01-15T00:00:00Z": 56.33927564383733
},
"units": "%",
"modtime": "2019-09-11T16:04:29Z"
}
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! The code works, it's clear, and you added a new test. Nice.
@@ -530,11 +538,44 @@ def multitime_db(cleandb,): | |||
) | |||
for file_ in files | |||
] | |||
# Create file with different units |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New tests, awesome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This pull request should resolve #106
get_units_from_run_variable
inutil.py