Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Default URL path encoding differs between main & 2.x #1103

Closed
al-niessner opened this issue Jul 23, 2024 · 12 comments · Fixed by #1109
Closed

[BUG] Default URL path encoding differs between main & 2.x #1103

al-niessner opened this issue Jul 23, 2024 · 12 comments · Fixed by #1109
Assignees
Labels
bug Something isn't working

Comments

@al-niessner
Copy link
Contributor

What is the bug?

See unit test updates. They work on main but not 2.x.

How can one reproduce the bug?

Run EndpointTest on main (passes) and then 2.x (fails)

What is the expected behavior?

Should behave the same if all code changes were moved to 2.x

What is your host/environment?

Eclipse on Linux

Do you have any screenshots?

No but can provide if critical

Do you have any additional context?

Updated unit tests on the way. Related to #1068 and #999 and #1012 and possibly others.

@al-niessner al-niessner added bug Something isn't working untriaged labels Jul 23, 2024
@Xtansia Xtansia removed the untriaged label Jul 24, 2024
@Xtansia Xtansia changed the title [BUG] main branch and 2.x branch behave differently [BUG] Default URL path encoding differs between main & 2.x Jul 24, 2024
@Xtansia Xtansia self-assigned this Jul 24, 2024
@Xtansia
Copy link
Collaborator

Xtansia commented Jul 24, 2024

This is due to main preferring Apache HttpClient v5, where as 2.x prefers v4 and they differ in their URL encoding. I'm working on making tests and implementation that gets the same encoding result regardless of HttpClient version used.

@al-niessner
Copy link
Contributor Author

Understood. One of the other various tickets asked for a unit test. Contribution requested a ticket (default in the unknown). Thanks for working it. If I can help let me know. Not sure why you are supporting 2 httpClient versions, but if you want some help supporting only 1 I am happy to lend a hand. I just need some rather direct do this and then do that.

@Xtansia
Copy link
Collaborator

Xtansia commented Jul 25, 2024

So I've narrowed the difference down to the fact v5 is more conservative and encodes everything not in the UNRESERVED char class, whereas v4 is looser and uses the PATHSAFE char class (which includes some punctuation such as :). Unfortunately v4's underlying encode method and char class bitsets are private, which makes reflection access to these less than ideal to be able to get the same behaviour.

I think this is a fairly strong argument to bring in a version of v5's PercentCodec class and remove the need for reflection and have consistent behaviour. It would also reduce dependency if we had a future transport implementation that's completely independent of Apache HttpClient. I know something like this was the original approach in #999, however there were some reservations which led to the reflection approach.

What are your thoughts @reta & @dblock?

@dblock
Copy link
Member

dblock commented Jul 25, 2024

I am in full support of a consistent approach @Xtansia. I would potentially preserve the option of using one or the other library for backwards compatibility, hopefully without reflection as a configuration option.

@al-niessner
Copy link
Contributor Author

Why dance around it with far away imports of PercentCodec and just update PathEcoder to use the core5 URIEncoder (replacement for the utils class currently used for backward compatibility-ish)?

@reta
Copy link
Collaborator

reta commented Jul 28, 2024

What are your thoughts @reta & @dblock?

Thanks @Xtansia , if we bring v5 impl than we would be breaking (potentially) the existing apps that relies on v4 impl (explicitly or implicitly), right? Shouldn't we embed more relaxed v4 encoder for both transports?

@Xtansia
Copy link
Collaborator

Xtansia commented Jul 28, 2024

Why dance around it with far away imports of PercentCodec and just update PathEcoder to use the core5 URIEncoder (replacement for the utils class currently used for backward compatibility-ish)?

@al-niessner 2.x does not take a hard dependency on v5, it by default depends on v4, so v5 is not always available. Additionally URIEncoder is marked as deprecated in v5 and just calls straight through to PercentCodec. This is talking about source-code copying a percent encoding implementation, not calling into a library.

@Xtansia
Copy link
Collaborator

Xtansia commented Jul 29, 2024

Thanks @Xtansia , if we bring v5 impl than we would be breaking (potentially) the existing apps that relies on v4 impl (explicitly or implicitly), right? Shouldn't we embed more relaxed v4 encoder for both transports?

@reta The only thing we could break by being more conservative is anything that is sniffing the HTTP request in flight and doing matching without decoding the URL path. In the RFC the colon specifically is only safe to be not encoded when there's a schema present in a URI. I'm unsure how this goes once we get to HTTP message level etc. But this obviously causes weird interactions with things such as the load balancers etc in AOSS (as seen in #1068) which I can only guess makes different assumptions. I can also imagine this may not be limited to AOSS but also other managed providers, load balancers, HTTP proxies etc.

If we do want to be ultra safe, we can do as @dblock and create an IUriEncoder interface with a v4 legacy impl and then copied safer impl, and have v4 be default on 2.x, then new one default on 3.x with v4 impl still available but deprecated?
We also don't necessarily need to make the "v4" impl actually use HttpClientV4's impl, and just copy over the "PATHSAFE" char set and re-use the new copied encoding routine.

(I guess technically removing the PathEncoder class added in #999 would be a breaking change in 2.x as it's a public class, even if not really a publicly supported api of the client as we don't have any annotation or differentiation for non-public apis. So we can @Deprecate that, have it use legacy impl and stop using it within the client)

@reta
Copy link
Collaborator

reta commented Jul 29, 2024

Thanks @Xtansia

If we do want to be ultra safe, we can do as @dblock and create an IUriEncoder interface with a v4 legacy impl and then copied safer impl, and have v4 be default on 2.x, then new one default on 3.x with v4 impl still available but deprecated?

+1 to keep it safe as @dblock suggested

(I guess technically removing the PathEncoder class added in #999 would be a breaking change in 2.x as it's a public class, even if not really a publicly supported api of the client as we don't have any annotation or differentiation for non-public apis. So we can @deprecate that, have it use legacy impl and stop using it within the client)

I think we could repurpose it to rely on IUriEncoder instead (it surely depends on the implementation), in worst case - take the @Deprecate route as you suggested,

@Xtansia
Copy link
Collaborator

Xtansia commented Jul 29, 2024

I think we could repurpose it to rely on IUriEncoder instead (it surely depends on the implementation), in worst case - take the @Deprecate route as you suggested,

In all other cases we could attach the IUriEncoder instance to something like TransportOptions and pass it through to the request endpoint builder, but the PathEncoder methods are static so can't rely on that. Which would mean we either need to use a mutable static to configure the implementation or configuring via a system property which doesn't feel like a good developer experience to me personally.

@Xtansia
Copy link
Collaborator

Xtansia commented Jul 29, 2024

After thinking/digging a bit further, the instance based approach in my previous comment does not work for 2.x as it'd require changing the signature of Endpoint.requestUrl to allow passing in the encoder

@Xtansia
Copy link
Collaborator

Xtansia commented Aug 1, 2024

A fix/feature to support this has been released as part of v2.13.0. For safety/backwards-compatibility reasons we couldn't change the default behavior in 2.x, however you should be able to opt into the HttpClient5 behavior (without needing a dependency on any version of HttpClient), by specifying the system property: org.opensearch.path.encoding=HTTP_CLIENT_V5_EQUIV

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants