Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

double encoding of opendap URLs (works with 4.6.0 but fails in 4.7.3) #1876

Closed
Alexander-Barth opened this issue Oct 30, 2020 · 14 comments
Closed

Comments

@Alexander-Barth
Copy link
Contributor

  • the version of the software with which you are encountering an issue

netcdf library version 4.7.3

  • environmental information (i.e. Operating System, compiler info, java version, python version, etc.)

Ubuntu 20.04, gcc 9.3.0

  • a description of the issue with the steps needed to reproduce it

Accessing this OPENDAP URL, works fine in version netcdf 4.6.0 (from Ubuntu 18.04).

ncdump -h 'http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet1-domains/tmp%20test.nc'

This URL with the space encoded as %20 is reported by thredds and I think it is correct.

However, ncdump -h fails in version 4.7.3 with the error NetCDF: file not found

Inspecting the thredds log files shows the servers looks for the file tmp%2520test.nc.dds (%20 is again encoded).

Directly accessing the dds resource works, which makes me think that this issue could be in netcdf-c:

curl 'http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet1-domains/tmp%20test.nc.dds'

Thank you very much for your work on NetCDF which I used for more than 15 years!

@DennisHeimbigner
Copy link
Collaborator

This has been an ongoing issue because the DAP2 protocol chose to
use URL encoding for its escape mechanism.

The actual URL that is being sent to Thredds is double encoded,
so it looks like this:

http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet1-domains/tmp%2520test.nc.dds

My understanding was that Apache and/or Tomcat should do the
initial decode to the URL. This should mean that the DAP
processor in Thredds sees this:

http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet1-domains/tmp%20test.nc.dds

The DAP code should decode a second time to give this:

http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet1-domains/tmp test.nc.dds

It looks like the first decode by Apache/Tomcat is not happening.

When you use curl, can you tell from the logs, what URL is being
shown in the Thredds log?

I also considered what would happen if you sent the unencoded form:

tmp test.nc.dds

It appears that libcurl (which netcdf uses) encodes this to be:

tmp+test.nc.dds

So that is no help.

I will see if Sean has any insight.

@Alexander-Barth
Copy link
Contributor Author

For your information, when I use curl:

curl 'http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet1-domains/tmp%20test.nc.dds'

I see the following in the server logs:

172.17.0.1 - - [01/Nov/2020:20:47:09 +0000] "GET /thredds/dodsC/data/emodnet1-domains/tmp%20test.nc.dds HTTP/1.1" 200 3349

@DennisHeimbigner
Copy link
Collaborator

DennisHeimbigner commented Nov 2, 2020

That is consistent with the fact that curl does not encode urls by default.

@DennisHeimbigner
Copy link
Collaborator

This is probably related to this PR #1835
which has this note:

The DAP2 standard specifies the use of URL %xx escaping as also
the escape mechanism for DAP2 itself. This can cause some confusion.
To try to simplify this, the netcdf-c library DAP2 code assumes
that all URL path and constraint information is unescaped. It is assumed
that just before transmission, the constructed URL will be properly escaped.

I notice that one change in that PR was to use %20 instead of + for
encoding spaces. This change has not made it into any release yet.
Additionally, according to the above note, you should not attempt to pre-encode what you give to ncdump.
This is all a bit of mess, but I am stuck with it.

@Alexander-Barth
Copy link
Contributor Author

OK, thanks. I tried out c8a3c51 and ncdump does work when I replace %20 by space:

LD_LIBRARY_PATH=$HOME/opt/netcdf-c8a3c51c/lib/ ~/opt/netcdf-c8a3c51c/bin/ncdump -h 'http://opendap2.oceanbrowser.net/thredds/dodsC/data/emodnet-domains-1/tmp test.nc'
# netcdf library version 4.8.0-development of Nov  3 2020 09:55:54 $

I get, the correct result:

netcdf tmp\ test {
dimensions:
	depth = 13 ;
...

Currently, the "Data URL" thredds OPeNDAP Dataset Access Form presents the encoded URLs (including %20 instead of space). Should thredds therefore present the URL to the user in an unencoded form ?
Or should the user (or higher-level python, julia, ... library) make this substitution?

(Sorry for my confusion).

@DennisHeimbigner
Copy link
Collaborator

Good question. The problem is the client being used. So it if uses netcdf library,
then thos %'s need changing. But for other clients, e.g. curl, they need to stay.
So from Thredds point of view, there is no right answer.
Frankly I am still unsatisfied with the way that netcdf handles this because
of the confusion. One possibility is that we scan the URL and if any %xx are detected,
we unescape them internally.

@Alexander-Barth
Copy link
Contributor Author

For what it is worth, I checked with the current version of the netCDF4 module in python. It works for the URL-encoded OPeNDAP links, but not with links where the %20 is replaced by a space. It seems that the python module bundles its own version of the NetCDF library and uses currently version 4.6.3. This is thus consistent with ncdump of version 4.6.x.

I am wondering if this old behavior should not be the preferred one. To me its seems the one with the least surprises for the users. But of course, I do not have the same view of the project and I know what kind of problem it creates elsewhere.

Accepting both (URL-encoded and non-URL-encoded strings) might be indeed a good practical solution.

@DennisHeimbigner
Copy link
Collaborator

I think the problem is the server side. The netcdf library has to generate something
that the server will properly handle.
The original problem in Unidata/netcdf4-python#1041
was that this URL was failing:

http://iridl.ldeo.columbia.edu/SOURCES/.Indices/.soi/.c8110/.anomaly/T/%28Jan%201979%29/VALUE/dods
Can you see if your python still fails on it? How about netcdf-4.6.0?
Unfortunately, we cannot see what the server logs show, so it was unclear
what was happening on the server side.

@DennisHeimbigner
Copy link
Collaborator

Never mind, I see that server is down either temporary or permanent.

@DennisHeimbigner
Copy link
Collaborator

In reviewing all this, I am now more confused than ever.
I can understand the columbia server URL problem. That needs special casing.
What I cannot understand is why the oceanbrowser.net server is not
handling the double encoding. It must mean that Apache+Tomcat is not
doing a URL decode.
This is all to much for me. I think Sean's original suggestion was correct.
I am going to add a control parameter that tells netcdf either to encode
or to not-encode the url &/oor query with the default being to not-encode the url part
but to encode the query part. Ugh!

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Nov 5, 2020
re:  Unidata#1876
and: Unidata#1835
and: Unidata/netcdf4-python#1041

The change in PR 1835 was correct with respect to using %20 instead of '+'
for encoding blanks. However, it was a mistake to assume everything was
unencoded and then to do encoding ourselves. The problem is that
different servers do different things, with Columbia being an outlier.

So, I have added a set of client controls that can at least give
the caller some control over this. The caller can append
the following fragment to his URL to control what gets encoded before
sending it to the server. The syntax is as follows:
````
https://<host>/<path>/<query>#encode=path|query|all|none
````

The possible values:
* path  -- URL encode (i.e. %xx encode) as needed in the path part of the URL.
* query -- URL encode as needed in the query part of the URL.
* all   -- equivalent to ````#encode=path,query````.
* none  -- do not url encode any part of the URL sent to the server; not strictly necessary, so mostly for completeness.

Note that if "encode=" is used, then before it is processed, all encoding
is turned of so that ````#encode=path```` will only encode the path
and not the query.

The default is ````#encode=query````, so the path is left untouched,
but the query is always encoded.

Internally, this required changes to pass the encode flags down into
the OC2 library.

Misc. Unrelated Changes:
* Shut up those irritating warning from putget.m4
@DennisHeimbigner
Copy link
Collaborator

Mitigated by PR #1880

@Alexander-Barth
Copy link
Contributor Author

Thanks a lot for your patch!

@DennisHeimbigner
Copy link
Collaborator

I accidentally deleted this branch before I got it merged to master.
Since I would prefer not to recreate it, do you by any chance have a copy of
the branch that you can send me. If nothing else, a zip file
of the checked-out branch would be enough.

@DennisHeimbigner
Copy link
Collaborator

Never mind; I found a copy on another machine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants