Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v20.0.0 ERR_SOCKET_CONNECTION_TIMEOUT when sending http request to some domains #47822

Closed
lpgera opened this issue May 2, 2023 · 23 comments · Fixed by #47860
Closed

v20.0.0 ERR_SOCKET_CONNECTION_TIMEOUT when sending http request to some domains #47822

lpgera opened this issue May 2, 2023 · 23 comments · Fixed by #47860
Labels
http Issues or PRs related to the http subsystem. net Issues and PRs related to the net subsystem. v20.x v20.x Issues that can be reproduced on v20.x or PRs targeting the v20.x-staging branch.

Comments

@lpgera
Copy link

lpgera commented May 2, 2023

Version

v20.0.0

Platform

macOS 13.3.1, Linux linux 5.15.84-v8+ # 1613 SMP PREEMPT Thu Jan 5 12:03:08 GMT 2023 aarch64 GNU/Linux

Subsystem

http

What steps will reproduce the bug?

After switching to v20.0.0, I cannot send requests to https://api.airvisual.com/v2 anymore and I always get an ERR_SOCKET_CONNECTION_TIMEOUT after about one second.

Reproduction 1:

await fetch('https://api.airvisual.com/v2')

Reproduction 2:

require('https').get('https://api.airvisual.com/v2', (resp) => {
  let data = ''
  resp.on('data', (chunk) => data += chunk)
  resp.on('end', () => console.log(data))
})

How often does it reproduce? Is there a required condition?

The issue is only present on v20.0.0, previous versions work fine.

What is the expected behavior? Why is that the expected behavior?

The API should respond with the following JSON body: {"status":"fail","data":{"message":"incorrect_api_key"}}.

What do you see instead?

> await fetch('https://api.airvisual.com/v2')
Uncaught TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11457:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async REPL18:1:33 {
  cause: Error [ERR_SOCKET_CONNECTION_TIMEOUT]: Socket connection timeout
      at new NodeError (node:internal/errors:399:5)
      at internalConnectMultiple (node:net:1098:20)
      at Timeout.internalConnectMultipleTimeout (node:net:1637:3)
      at listOnTimeout (node:internal/timers:575:11)
      at process.processTimers (node:internal/timers:514:7) {
    code: 'ERR_SOCKET_CONNECTION_TIMEOUT'
  }
}

Additional information

Tried a couple of other domains, such as example.com, google.com, cloudflare.com and the requests work fine.

Switching back to v19.8.1 makes the requests to api.airvisual.com work again, so the error seems to be node 20 specific.

I tried talking to the HTTP api instead of using HTTPS and I receive the same error in node.

I tried running curl from every system on which I experience the error and the curl command always succeeded.

Tried looking into the network traffic with Wireshark and what I can see is that the node process starts connecting to both resolved IPv4 addresses, and then the connection is immediately closed by the client:

image

@bnoordhuis
Copy link
Member

cc @ShogunPanda - looks like autoSelectFamily fallout?

@bnoordhuis bnoordhuis added http Issues or PRs related to the http subsystem. net Issues and PRs related to the net subsystem. v20.x v20.x Issues that can be reproduced on v20.x or PRs targeting the v20.x-staging branch. labels May 3, 2023
@ShogunPanda
Copy link
Contributor

Yes, it seems to be the case.

@lpgera Can you please post here how is that endpoint resolved by your DNS?

@lpgera
Copy link
Author

lpgera commented May 3, 2023

These are the outputs from dig:

$ dig api.airvisual.com

; <<>> DiG 9.10.6 <<>> api.airvisual.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27494
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;api.airvisual.com.             IN      A

;; ANSWER SECTION:
api.airvisual.com.      24      IN      A       54.178.241.236
api.airvisual.com.      24      IN      A       18.178.42.77

;; Query time: 15 msec
;; SERVER: [[redacted]]
;; WHEN: Wed May 03 09:10:19 CEST 2023
;; MSG SIZE  rcvd: 78
$ dig AAAA api.airvisual.com

; <<>> DiG 9.10.6 <<>> AAAA api.airvisual.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37500
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;api.airvisual.com.             IN      AAAA

;; Query time: 5 msec
;; SERVER: [[redacted]]
;; WHEN: Wed May 03 09:12:25 CEST 2023
;; MSG SIZE  rcvd: 46

So it seems like there's no IPv6 address and there are two IPv4 addresses resolved, both of which match the TCP handshakes in the Wireshark dump.

@silverwind
Copy link
Contributor

silverwind commented May 24, 2023

I still see these ERR_SOCKET_CONNECTION_TIMEOUT in v20.2.0. v18.16.0 is fine. Have no reproduction yet, but there issue seems to not be fully resolved. I'm on IPv4 only and the errors happens when doing a few parallel fetches, usually within 1-2 seconds.

@unilynx
Copy link

unilynx commented May 25, 2023

I'm seeing the same issue attempting to talk to https://api.linode.com/v4/linode/instance - both node-fetch and native fetch are affected. I'm on node v20.2.0 (Intel mac, homebrew)

api.linode.com curently resolves as follows

api.linode.com has address 72.14.191.203
api.linode.com has address 69.164.200.203
api.linode.com has address 72.14.180.203
api.linode.com has IPv6 address 2600:3c00::23
api.linode.com has IPv6 address 2600:3c00::33
api.linode.com has IPv6 address 2600:3c00::13

as far as I can tell my ipv6 is currently broken (no dancing turtle on https://www.kame.net/) and if edit /etc/hosts to force an IPv4 addresses for api.linode.com the issue goes away

@silverwind
Copy link
Contributor

silverwind commented May 25, 2023

FYI, The fix for this (#47860) was not yet backported to v20, so that is why it's still happening.

@shivak
Copy link

shivak commented May 26, 2023

For those waiting on a fix: the workaround is to add dns-result-order=ipv4first and/or no-network-family-autoselection to node's command line options or the NODE_OPTIONS environment variable.

@davidmurdoch
Copy link

davidmurdoch commented Jun 1, 2023

This works:

const net = require("net");
// work around a node v20 bug: https://github.com/nodejs/node/issues/47822#issuecomment-1564708870
if (net.setDefaultAutoSelectFamily) {
  net.setDefaultAutoSelectFamily(false);
}

and it seems to prevent the ERR_SOCKET_CONNECTION_TIMEOUT error in Node v20.

A more forward-compatible version (since this bug is fixed in 20.3.0):

// Work around a node v20.0.0, v20.1.0, and v20.2.0 bug. The issue was fixed
// in v20.3.0.
// https://github.com/nodejs/node/issues/47822#issuecomment-1564708870
// Safe to remove once support for Node v20 is dropped.
if (
  // !process.env.IS_BROWSER && // uncomment this line if you use a bundler that sets env.IS_BROWSER during build time
  process.versions &&
  // check for `node` in case we want to use this in "exotic" JS envs
  process.versions.node &&
  process.versions.node.match(/20\.[0-2]\.0/)
) {
  require("net").setDefaultAutoSelectFamily(false);
}

@godofgeeks23
Copy link

I am still facing this issue. I am on node v20.2.0. Is there any workaround for getting this to work?

@godofgeeks23
Copy link

I am still facing this issue. I am on node v20.2.0. Is there any workaround for getting this to work?

Oh sorry! Nevermind. the solution by @davidmurdoch works fine. Just including this code on the top fixed my issue.

@xNatsuu
Copy link

xNatsuu commented Jul 11, 2023

I am still facing this issue. I am on node v20.2.0. Is there any workaround for getting this to work?

Oh sorry! Nevermind. the solution by @davidmurdoch works fine. Just including this code on the top fixed my issue.

where did you add this

@davidmurdoch
Copy link

I am still facing this issue. I am on node v20.2.0. Is there any workaround for getting this to work?

Oh sorry! Nevermind. the solution by @davidmurdoch works fine. Just including this code on the top fixed my issue.

where did you add this

Add it anywhere that will be hit once on start up.

@godofgeeks23
Copy link

I am still facing this issue. I am on node v20.2.0. Is there any workaround for getting this to work?

Oh sorry! Nevermind. the solution by @davidmurdoch works fine. Just including this code on the top fixed my issue.

where did you add this

I added it on the very top of the file. At the very beginning

@konlanx
Copy link

konlanx commented Oct 27, 2023

I encountered this issue in Node 20.7.0 (Running in Docker with node:20.7-alpine3.17).

I was able to resolve the issue with the workaround provided by @davidmurdoch in #47822 (comment) even though the comment states the problem is fixed in version 20.3.0.

@davidmurdoch
Copy link

davidmurdoch commented Oct 27, 2023

It was fixed in 20.3.0 by this PR: https://github.com/nodejs/node/pull/47860/files

You should open a new issue if it has regressed.

@ShogunPanda
Copy link
Contributor

This should also have been fixed in #51045. Once it gets in 21.x or 20.x please let me know if you have additional problems.

@ZoukEnot
Copy link

ZoukEnot commented May 2, 2024

Just stumbled on the issue:

linux 6.6.26-1-MANJARO
node.js v20.12.2

It seems like the problem will not appear if my ping is less than 250ms. However what concerns me is that setDefaultAutoSelectFamily somehow affects timeouts, not only protocol autoselection.

This is how network communication looks in node_20.12.2 / @AWS-SDK / @smythy / 'node:https'.request (ip addresses are hidden for security reasons):

01 at     0ms ipv4 [     SYN] local_ipv4 => server_ipv4
02 at   255ms ipv6 [     SYN] local_ipv6 => server_ipv6
03 at   281ms ipv4 [     SYN] local_ipv4 => server_ipv4
04 at   509ms ipv4 [SYN, ACK] server_ipv4 => local_ipv4
05 at   509ms ipv4 [     RST] local_ipv4 => server_ipv4
06 at   532ms ipv6 [     SYN] local_ipv6 => server_ipv6
07 at   650ms ipv4 [     SYN] local_ipv4 => server_ipv4
08 at   901ms ipv6 [     SYN] local_ipv6 => server_ipv6
09 at  1176ms ipv4 [SYN, ACK] server_ipv4 => local_ipv4
10 at  1176ms ipv4 [     RST] local_ipv4 => server_ipv4
11 at  1942ms ipv4 [SYN, ACK] server_ipv4 => local_ipv4
12 at  1942ms ipv4 [     RST] local_ipv4 => server_ipv4

As a result, https.request throws AggregateError

Error: connect ETIMEDOUT server_ipv4:443
Error: connect ENETUNREACH server_ipv6:443

This is what happens when setDefaultAutoSelectFamily is set to false:

01 at     0ms ipv4 [     SYN] local_ipv4 => server_ipv4
02 at   544ms ipv4 [SYN, ACK] server_ipv4 => local_ipv4
03 at   544ms ipv4 [     ACK] local_ipv4 => server_ipv4
04 at   545ms ipv4 [ACK, PSH] local_ipv4 => server_ipv4
05 at  1057ms ipv4 [     ACK] server_ipv4 => local_ipv4
06 at  1057ms ipv4 [ACK, PSH] server_ipv4 => local_ipv4
07 at  1057ms ipv4 [     ACK] local_ipv4 => server_ipv4
08 at  1059ms ipv4 [ACK, PSH] 162.159.134.234 => local_ipv4
09 at  1059ms ipv4 [     ACK] local_ipv4 => 162.159.134.234

Results successful network response.

@ShogunPanda
Copy link
Contributor

@ZoukEnot Your output above is helpful but slightly confusing. I saw you obfuscated the IP addresses but it seems like the client is connecting to the same IP several times, which should not happen. If that's not the case, can you please use notation like server_ipv4_1 to let me distinguish the right timeline?

@ZoukEnot
Copy link

ZoukEnot commented May 3, 2024

Your output above is helpful but slightly confusing. I saw you obfuscated the IP addresses but it seems like the client is connecting to the same IP several times, which should not happen.

Thank you for your time, @ShogunPanda.

In the example below I've carefully changed the ip adresses to not expose any sensitive information but to look just as real.

Below

  • 192.168.15.137 is local ipv4 address
  • 211.190.191.243 is server ipv4 address
  • fa10::3c51:4b1b:3421:a2e1 is local ipv6 address
  • 2a01:1a5::1d4 is server ipv6 address

DNS response returns exactly one ipv4 address and one ipv6 address.

Apr 27 09:53:33 dnsmasq[37664]: forwarded s3.example.com to 1.1.1.1
Apr 27 09:53:33 dnsmasq[37664]: forwarded s3.example.com to 8.8.8.8
Apr 27 09:53:33 dnsmasq[37664]: query[AAAA] s3.example.com from 127.0.0.1
Apr 27 09:53:33 dnsmasq[37664]: forwarded s3.example.com to 1.1.1.1
Apr 27 09:53:33 dnsmasq[37664]: reply s3.example.com is 211.190.191.243
Apr 27 09:53:33 dnsmasq[37664]: reply s3.example.com is 2a01:1a5::1d4

Executing node.js code :

01 at     0ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
02 at   255ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
03 at   281ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
04 at   509ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
05 at   509ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
06 at   532ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
07 at   650ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
08 at   901ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
09 at  1176ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
10 at  1176ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
11 at  1942ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
12 at  1942ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243

Multiple requests to the server happen due to client code receiving ETIMEDOUT and firing a retry.

Second request is fired at 281ms, indicating that ETIMEDOUT already happened on the first one. SYN+ACK for the very first request is received at 509ms, but it is already too late. Similarly at 1176ms SYN+ACK is received for the second ipv4 request, while third outgoing ipv4 request has been already fired at 650ms.

However when setDefaultAutoSelectFamily(false) is invoked, no ETIMEDOUT happened even after 544ms from the request start time and then communication successfully continued.

The code that triggers https request looks like this (constant values are hardcoded according to the values in debugger):

import { request, HttpResponse } from 'https';

const req = request(nodeHttpsOptions, res => {
  const response = new HttpResponse({
    statusCode: res.statusCode || -1,
    reason: res.statusMessage,
    headers: res.headers,
    body: res,
  });
  resolve({ response });
});
req.on('error', err => {
  // fall here with `ETIMEDOUT`
  // then trigger retry in a consuming code
  if (['ECONNRESET', 'EPIPE', 'ETIMEDOUT'].includes(err.code)) {
    reject(Object.assign(err, { name: 'TimeoutError' }));
  } else {
    reject(err);
  }
});
// does not affect the behavior of the code
req.on("socket", (socket) => {
  socket.setKeepAlive(true, 1000);
});

Reproduced the behaviour is these node versions:

  • 20.10.0
  • 20.12.2
  • 22.1.0

@ShogunPanda
Copy link
Contributor

I see, I can follow it now.
Can you try two things:

  1. Change the default timeout (via setDefaultAutoSelectFamilyAttemptTimeout) to 100ms
  2. Change the default timeout (via setDefaultAutoSelectFamilyAttemptTimeout) to 500ms

And in both case post the sequence like you did above?

@ZoukEnot
Copy link

ZoukEnot commented May 3, 2024

I see, I can follow it now. Can you try two things:

1. Change the default timeout (via `setDefaultAutoSelectFamilyAttemptTimeout`) to 100ms

2. Change the default timeout (via `setDefaultAutoSelectFamilyAttemptTimeout`) to 500ms

And in both case post the sequence like you did above?

Assuming DNS response is the same as in previous example:

s3.example.com is 211.190.191.243
s3.example.com is 2a01:1a5::1d4

Below are the results of calling setDefaultAutoSelectFamilyAttemptTimeout(N) with different parameters.

Autoselect timeout 50 (ETIMEDOUT):

01 at     0ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
02 at    53ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
03 at   126ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
04 at   177ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
05 at   365ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
06 at   416ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
07 at   439ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
08 at   439ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
09 at   542ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
10 at   542ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
11 at   859ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
12 at   859ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243

Autoselect timeout 100 (ETIMEDOUT):

01 at     0ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
02 at   103ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
03 at   196ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
04 at   298ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
05 at   506ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
06 at   506ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
07 at   677ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
08 at   680ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
09 at   680ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
10 at   778ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
11 at  1122ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
12 at  1122ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243

Autoselect timeout 150 (ETIMEDOUT):

01 at     0ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
02 at   154ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
03 at   194ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
04 at   346ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
05 at   422ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
06 at   434ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
07 at   434ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
08 at   573ms ipv6 [     SYN] fa10::3c51:4b1b:3421:a2e1 => 2a01:1a5::1d4
09 at   638ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
10 at   638ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243
11 at   952ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
12 at   952ms ipv4 [     RST] 192.168.15.137 => 211.190.191.243

Autoselect timeout 500 (success):

01 at     0ms ipv4 [     SYN] 192.168.15.137 => 211.190.191.243
02 at   404ms ipv4 [SYN, ACK] 211.190.191.243 => 192.168.15.137
03 at   404ms ipv4 [     ACK] 192.168.15.137 => 211.190.191.243
04 at   406ms ipv4 [ACK, PSH] 192.168.15.137 => 211.190.191.243
05 at   906ms ipv4 [     ACK] 211.190.191.243 => 192.168.15.137
06 at   906ms ipv4 [ACK, PSH] 211.190.191.243 => 192.168.15.137
07 at   906ms ipv4 [ACK, PSH] 211.190.191.243 => 192.168.15.137
08 at   906ms ipv4 [     ACK] 192.168.15.137 => 211.190.191.243
09 at   906ms ipv4 [     ACK] 192.168.15.137 => 211.190.191.243
10 at   917ms ipv4 [ACK, PSH] 192.168.15.137 => 211.190.191.243
11 at  1418ms ipv4 [ACK, PSH] 211.190.191.243 => 192.168.15.137
12 at  1418ms ipv4 [ACK, PSH] 211.190.191.243 => 192.168.15.137

@ShogunPanda
Copy link
Contributor

What is confusing me here is that when receiving a timeout the system is somehow retrying it.
First of all, the last attempted record does not has the autoselect timeout applied so I don't see how the thing is relevant here.

Second of all, an address is only attempted once.

How did you produce the output above? Which tool are you using?
If possible, can you please try without smithy?

@soryy708
Copy link

soryy708 commented Aug 12, 2024

So in which Node 20 versions is this fixed?

It was said:

It was fixed in 20.3.0 by this PR: https://github.com/nodejs/node/pull/47860/files

But then

Reproduced the behaviour is these node versions:
20.10.0
20.12.2
22.1.0

So was there a regression?

zackpollard added a commit to immich-app/geoshenanigans that referenced this issue Aug 23, 2024
zackpollard added a commit to immich-app/geoshenanigans that referenced this issue Aug 27, 2024
* ci: fix kv-warmer, format files and run tests in CI

* chore: pin node version

* ci: fix incorrect node path

* ci: fix linting and test running

* ci: build tiles code

* chore: remove cf version metadata as it can't be used from terraform

* chore: remove file hash and use file name instead as unique identifier

* ci: deploy tiles gha

* ci: deploy tiles gha

* ci: deploy tiles gha

* ci: deploy tiles gha

* chore: test push-o-matic

* Update tile server file to 20240823.pmtiles

* ci: deploy tiles gha

* ci: run kv warmer and check if already warm

* chore: revert to node 18 because of network problems
nodejs/node#47822

* chore: revert back to v1.pmtiles until v4 spec is finalised by protomaps

* Update deployment/modules/cloudflare/tiles/workers.tf

Fix incorrect environment variable in worker deployment

Co-authored-by: bo0tzz <git@bo0tzz.me>

* fix: kv-warming deploy on main only

---------

Co-authored-by: zackpollard <zackpollard@users.noreply.github.com>
Co-authored-by: bo0tzz <git@bo0tzz.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. net Issues and PRs related to the net subsystem. v20.x v20.x Issues that can be reproduced on v20.x or PRs targeting the v20.x-staging branch.
Projects
None yet
Development

Successfully merging a pull request may close this issue.