Add ensemble name filter to get_units_from_file_object #169

cairosanders · 2020-09-25T20:51:58Z

This pull request should resolve #106

Adds filter by ensemble name to get_units_from_run_variable in util.py
Adds test instance in conftest.py by adding 2 datafilevariables with the same variable name but different units and ensembles

corviday · 2020-10-01T21:37:59Z

This issue affects some of the tx90p data, as shown on this call to the production server, which results in the following error message:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.8/site-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/ce/views.py", line 12, in api_request
    return ce.api.call(db.session, *args, **kwargs)
  File "/app/ce/api/__init__.py", line 97, in call
    rv = func(session, **args)
  File "/app/ce/api/data.py", line 156, in data
    "units": get_units_from_run_object(
  File "/app/ce/api/util.py", line 42, in get_units_from_run_object
   raise Exception(
Exception: File list [EXTREMELY LONG LIST REMOVED] does not have consistent units {'%', 'days'}
2020-10-01 20:52:37 [102] [INFO] 172.18.0.1 - - [01/Oct/2020:20:52:37 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=tx90pETCCDI&emission=historical,+rcp85&timescale=monthly&time=0&area= HTTP/1.1" 500 290 "https://services.pacificclimate.org/pcex/app/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"

I built a demo server from this branch, and tested it with the same tx90p URL, but got a different error.

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/ce/views.py", line 12, in api_request
    return ce.api.call(db.session, *args, **kwargs)
  File "/app/ce/api/__init__.py", line 97, in call
    rv = func(session, **args)
  File "/app/ce/api/data.py", line 156, in data
    "units": get_units_from_run_object(
  File "/app/ce/api/util.py", line 40, in get_units_from_run_object
    units = {
  File "/app/ce/api/util.py", line 41, in <setcomp>
    get_units_from_file_object(file_, varname, ensemble_name) for file_ in files
  File "/app/ce/api/util.py", line 33, in get_units_from_file_object
    raise Exception(
Exception: Variable tx90pETCCDI is not indexed for file /storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CanESM2_historical+rcp85_r1i1p1/tx90pETCCDI_yr_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp85_r1i1p1_1950-2100.nc
2020-10-01 20:58:46 [7] [INFO] 172.17.0.1 - - [01/Oct/2020:20:58:46 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=tx90pETCCDI&emission=historical,+rcp85&timescale=monthly&time=0&area= HTTP/1.1" 500 290 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"

I'm not sure yet if this is a code issue or if there's something wrong with the specified tx90p files. Needs further investigation. A call to a different variable has no trouble, so it may be a tx90p issue.

corviday · 2020-10-05T21:07:10Z

I haven't yet had time to double-check the database encoding of the tx90p files and determine whether the second error message in my previous comment requires 1) a data correction, or 2) additional changes on this branch, 3) a completely unrelated new issue to be resolved at a later date. Apologies for the delay.

corviday · 2020-10-15T00:24:45Z

ce/api/util.py

@@ -25,18 +25,21 @@ def get_units_from_netcdf_file(nc, variable):
    return nc.variables[variable].units


-def get_units_from_file_object(file_, varname):
+def get_units_from_file_object(file_, varname, ensemble_name):


I think this is the wrong place to add the extra filtering functionality.

This function is passed a datafile and asked to return the units of the requested variable in that file.

So for example if the user was seeking information about tx90pETCCDI with the r1i1p1 run in the ce_files data collection ("ensemble") used by the Climate Explore Website, get_units_from_run_object() would collect all files that have the variable tx90pETCCDI and the run r1i1p1. Then it would get the set of all units from each file, in order to make sure they all had the same units and were comparable, and if so, the rest of the API would go about returning that comparison information.

So in this case get_units_from_file_object is called on every file as long as that file has the following two properties:

it contains some tx90pETCCDI data

it has the run r1i1p1

So if you call this function with, say, /storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CanESM2_historical+rcp85_r1i1p1/tx90pETCCDI_yr_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp85_r1i1p1_1950-2100.nc - this file does have tx90pETCCDI data, and it does have the r1i1p1 run, but it is not in the ce_files group, because it's a weird 150 year file, not a 30-year summary file, and we don't display the 150 year files on the web because it takes one million years, but we do do some other things with them.

(The presence of this file and others like it is the reason we need to add the capability to filter unit-fetches by data collection in the first place - in addition to being too big and noisy to show on the website, it has different units than the files we actually are looking for in the ce_files group. Currently you can't view any tx90pETCCDI data at all with the data API because the unmodified version of get_units_from_run_object() thinks there's a units conflict, even though the files with different units are in different data collections and can't actually be viewed at the same time).

So when get_units_from_file_object() is asked what units are in this file, it iterates over the variables in the file (the first for loop, line 29) and over the data collections this file is a member of (the second for loop, line 30) but it never returns anything, because the file it is being asked about is not a member of the ce_files data collection, so the second half of the line 31 if is never true. Instead it reaches the end of the function and raises an error, halting calculations, which is not what we want.

I think the easiest way forward would be to move filtering-by-data-group up into get_units_from_run_object() or maybe get_files_from_run_variable(), though there are definitely other options, if you wanted to do something else.

I'd also add that these kinds of operations on database objects can be extremely slow, as they issue separate queries to the database to access relation-based values like file_.data_file_variables and dfv.ensembles. These are convenient for programming but can be very inefficient. The better way is to issue queries with appropriate joins in the first place and rather walk through the database after a more limited query. The balance of course is between programming convenience and flexibility on the one hand and efficient use of the database on the other.

rod-glover · 2020-10-23T22:46:45Z

ce/api/util.py

+                    and ensemble.name == ensemble_name
+                ):
+                    files.append(file_)
+    return files


There is a tighter and more efficient way of accomplishing this:

return list(filter( lambda file_: any( variable == dfv.netcdf_variable_name and any(ensemble.name == ensemble_name for ensemble in dfv.ensembles) for dfv in file_.data_file_variables ), run.files ))

Generally speaking, as in R, loops are a code smell, or at least a reason to look at alternatives. Almost always there is a comprehension or a filter expression that will do the same job as a something that builds up a result by weeding through another iterable.

You don't have to wrap the filter in list unless the result must be a list instead of a generic iterable.

Not sure if Black will agree with my spacing there.

There's an alternative that uses a comprehension instead of a filter. It may be easier to read:

return ( file_ for file_ in run.files if any(...) )

The parentheses (return (...)) make this a generator. If you used brackets (return [...]) it would be a list.

These are pretty much identical formulations.

Thank you for the advice! I have just been working on this when my other projects are slow.

corviday · 2020-10-30T22:00:52Z

It works! Data for tx90p:

{
  "r1i1p1": {
    "data": {
      "1977-01-15T00:00:00Z": 10.529546415815528, 
      "1986-01-15T00:00:00Z": 12.998391385738454, 
      "1997-01-15T00:00:00Z": 14.331152288677236, 
      "2025-01-15T00:00:00Z": 20.63730692712576, 
      "2055-01-15T00:00:00Z": 34.882464440127826, 
      "2085-01-15T00:00:00Z": 56.33927564383733
    }, 
    "units": "%", 
    "modtime": "2019-09-11T16:04:29Z"
  }
}

corviday

Looks good to me! The code works, it's clear, and you added a new test. Nice.

corviday · 2020-10-30T22:04:15Z

ce/tests/conftest.py

@@ -530,11 +538,44 @@ def multitime_db(cleandb,):
        )
        for file_ in files
    ]
+    # Create file with different units


New tests, awesome.

rod-glover

LGTM!

Add ensemble name filter to get_units_from_file_object

92c1b56

cairosanders requested a review from corviday September 25, 2020 21:32

corviday requested changes Oct 15, 2020

View reviewed changes

Move ensemble filter to get_files_from_run_variable()

107484f

rod-glover reviewed Oct 23, 2020

View reviewed changes

cairosanders added 3 commits October 29, 2020 13:00

Improve code quality with any()

c838ea7

Add variable with diff units to test

f4ed27f

Test with old code vs new code

78a7da4

cairosanders force-pushed the bugfix/106 branch from 0224566 to 78a7da4 Compare October 30, 2020 20:09

cairosanders requested a review from corviday October 30, 2020 21:09

corviday approved these changes Oct 30, 2020

View reviewed changes

cairosanders requested a review from rod-glover October 30, 2020 22:29

rod-glover approved these changes Nov 2, 2020

View reviewed changes

cairosanders merged commit ec7261a into master Nov 2, 2020

cairosanders deleted the bugfix/106 branch November 2, 2020 17:49

corviday mentioned this pull request Feb 10, 2021

New release slower than previous release #178

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ensemble name filter to get_units_from_file_object #169

Add ensemble name filter to get_units_from_file_object #169

cairosanders commented Sep 25, 2020 •

edited

Loading

corviday commented Oct 1, 2020 •

edited

Loading

corviday commented Oct 5, 2020 •

edited

Loading

corviday Oct 15, 2020 •

edited

Loading

rod-glover Oct 15, 2020

rod-glover Oct 23, 2020

rod-glover Oct 23, 2020

rod-glover Oct 23, 2020

cairosanders Oct 29, 2020

corviday commented Oct 30, 2020

corviday left a comment

corviday Oct 30, 2020

rod-glover left a comment

Add ensemble name filter to get_units_from_file_object #169

Add ensemble name filter to get_units_from_file_object #169

Conversation

cairosanders commented Sep 25, 2020 • edited Loading

corviday commented Oct 1, 2020 • edited Loading

corviday commented Oct 5, 2020 • edited Loading

corviday Oct 15, 2020 • edited Loading

Choose a reason for hiding this comment

rod-glover Oct 15, 2020

Choose a reason for hiding this comment

rod-glover Oct 23, 2020

Choose a reason for hiding this comment

rod-glover Oct 23, 2020

Choose a reason for hiding this comment

rod-glover Oct 23, 2020

Choose a reason for hiding this comment

cairosanders Oct 29, 2020

Choose a reason for hiding this comment

corviday commented Oct 30, 2020

corviday left a comment

Choose a reason for hiding this comment

corviday Oct 30, 2020

Choose a reason for hiding this comment

rod-glover left a comment

Choose a reason for hiding this comment

cairosanders commented Sep 25, 2020 •

edited

Loading

corviday commented Oct 1, 2020 •

edited

Loading

corviday commented Oct 5, 2020 •

edited

Loading

corviday Oct 15, 2020 •

edited

Loading