Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ensemble name filter to get_units_from_file_object #169

Merged
merged 5 commits into from
Nov 2, 2020

Conversation

cairosanders
Copy link
Contributor

@cairosanders cairosanders commented Sep 25, 2020

This pull request should resolve #106

  • Adds filter by ensemble name to get_units_from_run_variable in util.py
  • Adds test instance in conftest.py by adding 2 datafilevariables with the same variable name but different units and ensembles

@corviday
Copy link
Contributor

corviday commented Oct 1, 2020

This issue affects some of the tx90p data, as shown on this call to the production server, which results in the following error message:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.8/site-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/ce/views.py", line 12, in api_request
    return ce.api.call(db.session, *args, **kwargs)
  File "/app/ce/api/__init__.py", line 97, in call
    rv = func(session, **args)
  File "/app/ce/api/data.py", line 156, in data
    "units": get_units_from_run_object(
  File "/app/ce/api/util.py", line 42, in get_units_from_run_object
   raise Exception(
Exception: File list [EXTREMELY LONG LIST REMOVED] does not have consistent units {'%', 'days'}
2020-10-01 20:52:37 [102] [INFO] 172.18.0.1 - - [01/Oct/2020:20:52:37 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=tx90pETCCDI&emission=historical,+rcp85&timescale=monthly&time=0&area= HTTP/1.1" 500 290 "https://services.pacificclimate.org/pcex/app/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"

I built a demo server from this branch, and tested it with the same tx90p URL, but got a different error.

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/ce/views.py", line 12, in api_request
    return ce.api.call(db.session, *args, **kwargs)
  File "/app/ce/api/__init__.py", line 97, in call
    rv = func(session, **args)
  File "/app/ce/api/data.py", line 156, in data
    "units": get_units_from_run_object(
  File "/app/ce/api/util.py", line 40, in get_units_from_run_object
    units = {
  File "/app/ce/api/util.py", line 41, in <setcomp>
    get_units_from_file_object(file_, varname, ensemble_name) for file_ in files
  File "/app/ce/api/util.py", line 33, in get_units_from_file_object
    raise Exception(
Exception: Variable tx90pETCCDI is not indexed for file /storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CanESM2_historical+rcp85_r1i1p1/tx90pETCCDI_yr_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp85_r1i1p1_1950-2100.nc
2020-10-01 20:58:46 [7] [INFO] 172.17.0.1 - - [01/Oct/2020:20:58:46 +0000] "GET /api/data?ensemble_name=ce_files&model=CanESM2&variable=tx90pETCCDI&emission=historical,+rcp85&timescale=monthly&time=0&area= HTTP/1.1" 500 290 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"

I'm not sure yet if this is a code issue or if there's something wrong with the specified tx90p files. Needs further investigation. A call to a different variable has no trouble, so it may be a tx90p issue.

@corviday
Copy link
Contributor

corviday commented Oct 5, 2020

I haven't yet had time to double-check the database encoding of the tx90p files and determine whether the second error message in my previous comment requires 1) a data correction, or 2) additional changes on this branch, 3) a completely unrelated new issue to be resolved at a later date. Apologies for the delay.

ce/api/util.py Outdated
@@ -25,18 +25,21 @@ def get_units_from_netcdf_file(nc, variable):
return nc.variables[variable].units


def get_units_from_file_object(file_, varname):
def get_units_from_file_object(file_, varname, ensemble_name):
Copy link
Contributor

@corviday corviday Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the wrong place to add the extra filtering functionality.

This function is passed a datafile and asked to return the units of the requested variable in that file.

So for example if the user was seeking information about tx90pETCCDI with the r1i1p1 run in the ce_files data collection ("ensemble") used by the Climate Explore Website, get_units_from_run_object() would collect all files that have the variable tx90pETCCDI and the run r1i1p1. Then it would get the set of all units from each file, in order to make sure they all had the same units and were comparable, and if so, the rest of the API would go about returning that comparison information.

So in this case get_units_from_file_object is called on every file as long as that file has the following two properties:

  • it contains some tx90pETCCDI data
  • it has the run r1i1p1

So if you call this function with, say, /storage/data/climate/downscale/CMIP5/BCCAQ/climdex/CanESM2_historical+rcp85_r1i1p1/tx90pETCCDI_yr_BCCAQ+ANUSPLIN300+CanESM2_historical+rcp85_r1i1p1_1950-2100.nc - this file does have tx90pETCCDI data, and it does have the r1i1p1 run, but it is not in the ce_files group, because it's a weird 150 year file, not a 30-year summary file, and we don't display the 150 year files on the web because it takes one million years, but we do do some other things with them.

(The presence of this file and others like it is the reason we need to add the capability to filter unit-fetches by data collection in the first place - in addition to being too big and noisy to show on the website, it has different units than the files we actually are looking for in the ce_files group. Currently you can't view any tx90pETCCDI data at all with the data API because the unmodified version of get_units_from_run_object() thinks there's a units conflict, even though the files with different units are in different data collections and can't actually be viewed at the same time).

So when get_units_from_file_object() is asked what units are in this file, it iterates over the variables in the file (the first for loop, line 29) and over the data collections this file is a member of (the second for loop, line 30) but it never returns anything, because the file it is being asked about is not a member of the ce_files data collection, so the second half of the line 31 if is never true. Instead it reaches the end of the function and raises an error, halting calculations, which is not what we want.

I think the easiest way forward would be to move filtering-by-data-group up into get_units_from_run_object() or maybe get_files_from_run_variable(), though there are definitely other options, if you wanted to do something else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also add that these kinds of operations on database objects can be extremely slow, as they issue separate queries to the database to access relation-based values like file_.data_file_variables and dfv.ensembles. These are convenient for programming but can be very inefficient. The better way is to issue queries with appropriate joins in the first place and rather walk through the database after a more limited query. The balance of course is between programming convenience and flexibility on the one hand and efficient use of the database on the other.

ce/api/util.py Outdated
and ensemble.name == ensemble_name
):
files.append(file_)
return files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a tighter and more efficient way of accomplishing this:

return list(filter(
    lambda file_: any(
        variable == dfv.netcdf_variable_name and 
            any(ensemble.name == ensemble_name for ensemble in dfv.ensembles)
        for dfv in file_.data_file_variables
    ),
    run.files
))

Generally speaking, as in R, loops are a code smell, or at least a reason to look at alternatives. Almost always there is a comprehension or a filter expression that will do the same job as a something that builds up a result by weeding through another iterable.

You don't have to wrap the filter in list unless the result must be a list instead of a generic iterable.

Not sure if Black will agree with my spacing there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an alternative that uses a comprehension instead of a filter. It may be easier to read:

return (
    file_ for file_ in run.files if any(...)
)

The parentheses (return (...)) make this a generator. If you used brackets (return [...]) it would be a list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are pretty much identical formulations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the advice! I have just been working on this when my other projects are slow.

@corviday
Copy link
Contributor

It works! Data for tx90p:

{
  "r1i1p1": {
    "data": {
      "1977-01-15T00:00:00Z": 10.529546415815528, 
      "1986-01-15T00:00:00Z": 12.998391385738454, 
      "1997-01-15T00:00:00Z": 14.331152288677236, 
      "2025-01-15T00:00:00Z": 20.63730692712576, 
      "2055-01-15T00:00:00Z": 34.882464440127826, 
      "2085-01-15T00:00:00Z": 56.33927564383733
    }, 
    "units": "%", 
    "modtime": "2019-09-11T16:04:29Z"
  }
}

Copy link
Contributor

@corviday corviday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! The code works, it's clear, and you added a new test. Nice.

@@ -530,11 +538,44 @@ def multitime_db(cleandb,):
)
for file_ in files
]
# Create file with different units
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New tests, awesome.

Copy link
Contributor

@rod-glover rod-glover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@cairosanders cairosanders merged commit ec7261a into master Nov 2, 2020
@cairosanders cairosanders deleted the bugfix/106 branch November 2, 2020 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unit consistency checks are not ensemble-constrained
3 participants