Writeup of router kill issue #3320

Closed
whyrusleeping opened this issue Oct 18, 2016 · 147 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@whyrusleeping
Member

So we know that ipfs can kill people's routers. We should do a quick write-up of what the causes are, which routers are normally affected, and maybe propose a couple of ideas for solutions.

@Kubuxu do you think you could handle doing this at some point?

@whyrusleeping whyrusleeping added this to the Dont Kill Routers milestone Oct 18, 2016
@donothesitate

donothesitate commented Oct 23, 2016

My theory is that it exhausts or overloads the NAT table, which causes lockups on some routers.
UDP on the same routers can keep working without problems, as can TCP connections that were already open when the lockup occurred.

Possible solution: have a switch to limit the number of peers/connections.
Related #3311

@ghost

ghost commented Oct 23, 2016

That sounds highly likely. nf_conntrack_max on my edge router is set to 1024 by default and ipfs eats 700 of those on its own, per computer I'm running it on.

A lot of those are dead connections too: if I open the webui which tries to ping them it quickly drops to 90 or so.
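
For anyone who wants to check the same thing on a Linux-based router or host, here is a minimal sketch. It assumes the netfilter conntrack module is loaded, in which case the kernel exposes the current and maximum entry counts under /proc/sys/net/netfilter. The function names are made up for illustration.

```python
from pathlib import Path

# Present only when the nf_conntrack module is loaded (assumption about the host).
PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count: int, limit: int) -> float:
    """Return the fraction of the conntrack table that is in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def read_conntrack(proc_dir: Path = PROC_DIR) -> tuple[int, int]:
    """Read (current entries, maximum entries) from procfs."""
    count = int((proc_dir / "nf_conntrack_count").read_text().strip())
    limit = int((proc_dir / "nf_conntrack_max").read_text().strip())
    return count, limit


if __name__ == "__main__":
    # Figures quoted above: about 700 entries used by ipfs in a 1024-entry table.
    print(f"{conntrack_usage(700, 1024):.0%} of the table")
```

With the numbers above, a single ipfs node takes roughly two thirds of the table, so a second node on the same network would already exceed it.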

@Kubuxu Kubuxu added the status/deferred Conscious decision to pause or backlog label Nov 28, 2016
@hsanjuan
Contributor

Running 5 daemons on local network with a well-known hash (they were pinning dist) kills my Fritzbox.

AFAIK everyone holds Fritzboxes in high esteem as very good routers, not some shitty hardware. Reports on the internet put the NAT table size at around 7000. I find the problem is exacerbated when my nodes are pinning popular content (I suspect this not only consumes all the bandwidth but also increases the number of connections when other peers try to download these blocks?).

@Kubuxu
Member

Kubuxu commented Mar 23, 2017

So my idea of what happens is that the conntracker table fills up (it is small in cheapo routers, bigger in good ones) and the router starts throwing out other connections. @hsanjuan can you repeat the test, kill the ipfs daemons and check if it comes back online?

@hsanjuan
Contributor

@Kubuxu yeah yeah things are back up immediately when I kill them. Only once I had the router reboot itself, which worried me more.

@Kubuxu
Member

Kubuxu commented Mar 23, 2017

So another possibility is that cheapo routers have a bigger conntracker limit than their RAM can handle, and they kernel panic or lock up. Not sure how to check it.

@whyrusleeping
Member Author

Does UDP eat up conntracker entries? We're moving quickly towards having support for QUIC.

@Kubuxu
Member

Kubuxu commented Mar 23, 2017

AFAIK, yes. At least from the time my services were DDoSed with UDP packets and they were much more destructive because of low conntracker limits.

@hsanjuan
Contributor

Is it possible that this problem got much worse in the latest releases (i.e. >=0.4.5)? I used to be able to run 4 nodes without problems and now it seems I can't, even after cleaning their contents.

@kakra

kakra commented May 2, 2017

I'm having issues, too. Maybe ipfs should keep two connection pools and migrate peer connections from a bad-quality pool to a good-quality pool by applying some heuristics to the peers. Peers with higher delays, lower bandwidth and short lives would live in the "bad pool" and be easily replaced by new peers if connection limits are hit. Better peers would migrate to the "good pool" and only be replaced by better peers if limits are hit. Having both pools gives slow peers a chance to be part of the network without being starved by higher-quality peers, which is important for a p2p distributed network.
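
To make the two-pool idea concrete, here is a rough sketch. Nothing in it comes from ipfs or libp2p: the `Peer` fields, the scoring formula, the `promote_at` threshold and the eviction rules are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    peer_id: str
    latency_ms: float
    bandwidth_kbps: float
    uptime_s: float

    @property
    def score(self) -> float:
        # Hypothetical quality score: bandwidth and uptime help, latency hurts.
        return self.bandwidth_kbps * self.uptime_s / (1.0 + self.latency_ms)


class TwoPools:
    def __init__(self, good_limit: int, bad_limit: int, promote_at: float) -> None:
        if good_limit < 1 or bad_limit < 1:
            raise ValueError("pool limits must be at least 1")
        self.good: dict[str, Peer] = {}
        self.bad: dict[str, Peer] = {}
        self.good_limit = good_limit
        self.bad_limit = bad_limit
        self.promote_at = promote_at

    def admit(self, peer: Peer) -> None:
        """New peers enter the bad pool; its oldest entry is evicted when full."""
        if peer.peer_id in self.good:
            return
        if peer.peer_id not in self.bad and len(self.bad) >= self.bad_limit:
            oldest = next(iter(self.bad))  # dicts keep insertion order
            del self.bad[oldest]
        self.bad[peer.peer_id] = peer

    def promote(self) -> None:
        """Move qualifying peers to the good pool, displacing only worse peers."""
        for peer in sorted(self.bad.values(), key=lambda p: p.score, reverse=True):
            if peer.score < self.promote_at:
                break  # candidates are sorted, so the rest score lower still
            if len(self.good) >= self.good_limit:
                worst = min(self.good.values(), key=lambda p: p.score)
                if worst.score >= peer.score:
                    break
                del self.good[worst.peer_id]  # displaced peer is dropped
            del self.bad[peer.peer_id]
            self.good[peer.peer_id] = peer
```

In this sketch a slow peer can only ever be pushed out by newer arrivals in the bad pool, never by fast peers, which is the starvation protection described above.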

BTW, udp also needs connection tracking, this wouldn't help here, and usually udp tracking tables are much smaller and much more short-lived which adds a lot of new problems. But udp could probably lower the need for bandwidth as there's no implicit retransmission and no ack. Of course, the protocol has to be designed in a way to handle packet loss, and it must take into account that NAT gateways usually drop udp connection table entries much faster. It doesn't make sense to deploy udp and then reimplement retransfers and keep-alive, as this would replicate tcp with no benefit (probably it would even lower performance).

Also, ipfs should limit the amount of outstanding packets, not the amount of connections itself. If there are too many packets in-flight, it should throttle further communication with peers, maybe prioritizing some over others. This way, it could also auto-tune to the available bandwidth but I'm not sure.

Looking at what BBR does for network queues, it may be better to throw away some requests instead of queuing up a huge backlog. This can improve overall network performance, bloating buffers is a performance killer. I'd like to run ipfs 24/7 but if it increases my network latency, I simply cannot, which hurts widespread deployment.

Maybe ipfs needs to measure latency and throw away slowly responding peers. For this to properly work, it needs to auto-adjust to the bandwidth, because once network queues fill, latency will exponentially spike up and the former mentioned latency measurement is useless.

These big queues are also a problem with many routers as they tend to use huge queues to increase total bandwidth for benchmarks but it totally kills latency, and thus kills important services like DNS to properly work.

I'm running a 400/25 Mbps asymmetric link here, and as soon as "ipfs stats bw" gets beyond a certain point, everything else chokes: browsers become unusable, waiting tens of seconds for websites or failing with DNS errors. Once a web request comes through in such a situation, the website appears almost immediately and completely (minus assets hosted on different hosts), so this is clearly an upstream issue with queues and buffers filling up and improper prioritization (ACKs still seem to pass early through the queues, otherwise downloads would be reduced, too).

I don't know if QUIC would really help here... It just reduces initial round-trip times (which HTTP/2 also does) which is not really an issue here as I consider ipfs a bulk-transfer tool, not a latency-sensitive one like web browsing.

Does ipfs properly use TOS/QoS flags in IP packets?

PS: ipfs should not try to bypass TCP/IP's auto-tuning capabilities by moving to UDP. Instead it should be nice to competing traffic by keeping latency below a sane limit and let TCP do the bandwidth tuning. And it should be nice to edge-router equipment (which is most of the time cheap and cannot be avoided) by limiting outstanding requests and the total number of connections. I remember when Windows XP tried to fix this in the TCP/IP stack by limiting outstanding TCP handshakes to ten, globally blocking everything else. This was a silly idea but it was thinking in the right direction, I guess.

@dsvi

dsvi commented May 25, 2017

I think you might as well not do anything at all, since routers are getting consistently better at supporting higher numbers of connections. My 5-year-old router struggled with supporting 2 ipfs nodes (about 600 connections each) + torrent (500 connections). I've just got a cheap Chinese one, and it works like a charm. Most routers nowadays, even cheap ones, have hardware NAT. They don't much care how many connections you throw at them.
Also, switching to UDP doesn't help: when I unleashed torrent far beyond the 500-connection limit, it killed the old router just as well as ipfs did. And torrent uses only UDP.

@ghost

ghost commented May 26, 2017

@dsvi: I'd rather not have to pay hard cash just to use IPFS on the pretence that it's fine to be badly behaved because some other software can be misconfigured to crash routers. A lot of people don't even have the luxury of being allowed to connect to their ISP using their own hardware.

And what a strawman you've picked: a BitTorrent client! A system that evolved its defaults based on fifteen years of real-world experience for precisely this reason!

No thanks, just fix the code.

@kakra

kakra commented May 26, 2017

@dsvi I wonder if they use their own routers because the page times out upon request... ;-)

But please do not suggest that: Many people are stuck with what is delivered by their providers with no chance to swap that equipment for better stuff. Ipfs has not only to be nice to such equipment but to overall network traffic on that router, too: If it makes the rest of my traffic demands unusable, there's no chance for ipfs to evolve because nobody or only very few could run it 24/7. Ipfs won't reach its goal if it is started by people only on demand.

@dsvi

dsvi commented May 26, 2017

Sorry guys, I should have expressed it better. I'll try this time from another direction ;)

  1. Internet world is becoming decentralized in general. This is a natural trend which is everywhere, from secure instant messaging, filesharing, decentralized email systems and so on.
    And creating tons of connections is a natural part of such systems. They are distributed, and to work effectively they have to support tons of connections (distribution channels). It's unavoidable in general. There can be improvements here and there, but it's fundamentally "unfixable".
  2. Hardware vendors have acknowledged that already. Modern router chipsets are way better in that regard nowadays, since all the hardware review sites have included at least torrent tests in their review test suites. So nowadays you don't really need something costing $200+ to work well with it. And a year from now it will only get way better, since vendors tend to offload a lot of routing work to hardware.
    So it already is not a problem, and will be even less so with every year.

And what about people who are stuck with relic hardware for whatever reason? Well, I feel sorry for some of them, but progress will go on, with them or without.

@Calmarius

@dsvi

"Internet world is becoming decentralized in general. "

Nope! It's becoming centralized. Almost the whole internet is served by a handful of datacenter companies.
For most people search means Google, e-mail means Gmail, social interactions mean Facebook, videos mean YouTube, chat means Facebook Messenger, picture sharing means Instagram.
The rest of the web is hosted at one of the several largest datacenter companies.

At the beginning we used to have Usenet and IRC servers running on our computers at home.
Then services got more and more centralized.

I don't see signs of any decentralization. But I see signs of further centralization.
For example some ISPs don't even give you public IP addresses anymore (for example 4G networks).

"And creating tons of connections is a natural part of such systems."

Having too many simultaneous connections makes the system inefficient.
If you have enough peers to saturate your bandwidth it's pointless to add more.

Currently my IPFS daemon opens 2048 connections within several hours to peers then runs out of file descriptors and becomes useless. This should be fixed.

@vext01

vext01 commented Sep 22, 2018

I'm using a crappy TalkTalk router provided by the ISP and I've been unable to find a configuration where IPFS doesn't drag my internet connection to its knees.

Using ifstat I see usually between 200kb/s and 1MB up and down whilst ipfs is connected to a couple of hundred peers.

I'd like to try connecting to fewer peers, but even with:

      "LowWater": 20,
      "HighWater": 30,

ipfs still connects to hundreds.

@vext01

vext01 commented Dec 18, 2018

Perhaps this is a dumb question, but why don't you make it so that IPFS stops connecting to more peers once the high water mark is reached?

@Stebalien
Member

We should implement a max connections but high/low water are really designed to be target bounds.
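
To illustrate the difference between a target bound and a hard limit, here is a simplified sketch of a trim pass. It is an assumption-laden model of how a high/low water connection manager behaves, not the actual libp2p code: `Conn`, `trim` and the `score` field are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    peer_id: str
    opened_at: float  # seconds on any monotonic clock
    score: int = 0    # higher means more useful (hypothetical)


def trim(conns: list[Conn], now: float, low_water: int, high_water: int,
         grace_period: float) -> list[Conn]:
    """Return the connections that survive one trim pass."""
    if len(conns) <= high_water:
        return list(conns)  # at or below the target bound: nothing is closed
    # Connections younger than the grace period are immune from trimming.
    candidates = [c for c in conns if now - c.opened_at >= grace_period]
    candidates.sort(key=lambda c: c.score)  # least useful first
    excess = max(0, len(conns) - low_water)
    to_close = {id(c) for c in candidates[:excess]}
    return [c for c in conns if id(c) not in to_close]
```

Under this model, 300 connections that were all opened within the grace period survive a trim pass even with HighWater set to 30, because nothing stops new connections from being opened in the first place. That would match the report above of hundreds of peers despite low water marks.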

The libp2p team is currently refactoring the "dialer" system in a way that'll make it easy for us to configure a maximum number of outbound connections. Unfortunately, there's really nothing we can do about inbound connections except kill them as soon as we can. On the other hand, having too many connections usually comes from dialing.

@Stebalien
Member

Note: there's actually another issue here. I'm not sure if limiting the max number of open connections will really fix this problem. I haven't tested this but I'm guessing that many routers have problems with connection velocity (the rate at which we (try to) establish connections) not simply having a bunch of connections. That's because routers often need to remember connections even after they've closed (for a period of time).

@vyzo's work on NAT detection and autorelay should help quite a bit, unless I'm mistaken.

@kakra

kakra commented Dec 18, 2018

A work-around could be to limit the number of opening connections (in contrast to opened connections) - thus reducing the number of connection attempts running at the same time. I think this could be much more important than limiting the number of total connections.

If such a change propagated through the network, it should also reduce the amount of overwhelming incoming connection attempts - especially those with slow handshaking because the sending side is not that busy with opening many connections at the same time.

@Stebalien
Member

We actually do that (mostly to avoid running out of file descriptors). We limit ourselves to opening at most 160 TCP connections at the same time.
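
For readers unfamiliar with the technique, capping in-flight connection attempts can be sketched with a semaphore. This is only an illustration, not the go-libp2p dialer: `dial_all` and the `dial` callback are hypothetical, and only the figure of 160 comes from the comment above.

```python
import asyncio

MAX_CONCURRENT_DIALS = 160  # figure quoted above


async def dial_all(addrs, dial, limit=MAX_CONCURRENT_DIALS):
    """Dial every address with at most `limit` attempts in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(addr):
        async with gate:  # waits here while `limit` dials are already running
            return await dial(addr)

    # Failed dials are returned as exception objects instead of aborting the batch.
    return await asyncio.gather(*(guarded(a) for a in addrs), return_exceptions=True)
```

Note that this bounds how many handshakes run at the same moment, not how many connections end up open, so it addresses connection velocity rather than the total count.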

@kakra

kakra commented Dec 18, 2018

@Stebalien Curious, since when? Because I noticed a while ago that running IPFS no longer chokes DNS resolution of my router...

@EsEnZeT

EsEnZeT commented Oct 9, 2022

Hello in 2022, same as above with a CH7465LG-ZG. In my whole life I have never encountered such issues with software like this. Anyways, dropped, waste of my time.

@vyzo
Contributor

vyzo commented Oct 9, 2022

The real problem is the large number of concurrent connections, typically from dht and bitswap; that's what needs to be fixed.

They both have a tendency to create connection avalanches, which apparently overflow router queues and make the routers crash.

Blanket disabling reuseport will throw the baby out with the bathwater as it is really necessary for hole punching.
Maybe some smarts could be added to use it initially for NAT type detection and for dialing other NATed peers so that hole punching can succeed, but this isn't exactly trivial to get right.

@markg85
Contributor

markg85 commented Oct 9, 2022

Here's a thing we need to experiment with to give us more of a direction to know where the actual fault is.
I too was on the bandwagon of "ton of connections = bad = crash" but i doubt it's that simple.

Could someone who has this very issue, and for whom disabling reuseport appears to fix it, try one simple thing?
Try it on Windows, give it a couple of hours, and report back with your findings.

Thus far only @urbenlegend mentions windows as not having issues there. I'd like for someone else to confirm that. Nothing against @urbenlegend but i just need to know if this is a pattern or an exception.

I'm asking because I just remembered that all my testing (and that of 99% of the people in this thread) was on Linux. I don't even have a Windows installation to test this with. So can someone confirm whether this very same issue exists on Windows too? That would be very helpful!

Note that this request might seem weird at first glance because it is the router that crashes. But it's your computer that is asking the router to do things it doesn't like!

If it exists on windows too then the bug can still be anywhere.
If it doesn't appear on windows then it might be linux specific.

@Jorropo
Contributor

Jorropo commented Oct 10, 2022

I too was on the bandwagon of "ton of connections = bad = crash" but i doubt it's that simple.

👍 @markg85 I remember trying this with you, and we set the connection number to 60 and yet it still crashed (it took more time).

It would be nice to see some evidence about connection number being high being a problem.

@markg85
Contributor

markg85 commented Oct 10, 2022

+1 @markg85 I remember trying this with you, and we set the connection number to 60 and yet it still crashed (it took more time).

In all fairness there, that was just the low/hi water setting adjustment. It's not a limit on the number of connections it makes. If i recall correctly it still had a gazillion connections over time. What might help, an option we didn't have back then, is using the Swarm.ResourceMgr to really limit things.

Hypothetically, even if using that fixes it, you still won't know if you fixed the cause or just reduced the symptoms to be so rare that the problem no longer appears to occur. More research is needed!

@jligeza

jligeza commented Nov 20, 2022

IPFS used to kill my router within 15-30 minutes when using IPv4, and in twice that time when using IPv6.

I found 2 solutions to this (tested for a couple of weeks):

  1. Set flag LIBP2P_TCP_REUSEPORT=false.
  2. Use only quic protocol in Addresses.Swarm, and disable Swarm.Transports.Network.TCP.

What did not work was just limiting the number of connections. It only took longer to kill my router.

Other than that, I had to limit maximum number of open connections, because when it reached above 500, the router was clogged (google.com opening in 5-10 seconds).

@kakra

kakra commented Nov 20, 2022

You can try blackholing private subnet routes. I think the biggest issue is ipfs trying to connect to non-routed private subnets via TCP. Taking that burden away from the router should fix a lot of stability problems already:

ip route add blackhole 10.0.0.0/8
ip route add blackhole 172.16.0.0/12
ip route add blackhole 192.168.0.0/16

If you have actually reachable private subnets behind your router, you should add more specific routes (longer prefix) so it still gets routed - or add the blackhole routes to the router. But for a single private subnet, these routes should just work.

@ajbouh

ajbouh commented Nov 20, 2022

Is it possible to have ipfs perform this sort of behavior automatically?

For example, are there (userspace) network mapping techniques that we can use to understand which private networks are actually routable?

Even without automatic mapping, users might prefer to apply address filtering within ipfs itself to avoid making doomed connection attempts altogether.
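
As a rough illustration of what such in-process filtering could look like, here is a sketch that rejects dial targets in the three private IPv4 ranges mentioned above unless the caller lists them as reachable. The function name, the simplified multiaddr parsing and the `reachable` parameter are assumptions, not part of ipfs.

```python
import ipaddress

# The private IPv4 ranges blackholed in the routes above.
PRIVATE_V4 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_doomed(multiaddr: str, reachable=()) -> bool:
    """True if the multiaddr targets a private IPv4 range that is not reachable."""
    parts = multiaddr.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "ip4":
        return False  # only IPv4 is filtered in this sketch
    addr = ipaddress.ip_address(parts[1])
    if any(addr in ipaddress.ip_network(net) for net in reachable):
        return False  # for example the node's own LAN
    return any(addr in net for net in PRIVATE_V4)
```

For example, `is_doomed("/ip4/10.1.2.3/tcp/4001")` is true, while the same check with `reachable=["10.1.2.0/24"]` lets the dial through. Working out the `reachable` list automatically is the open question raised above.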

@ttax00

ttax00 commented Dec 18, 2022

1. Set flag `LIBP2P_TCP_REUSEPORT=false`.
2. Use only `quic` protocol in `Addresses.Swarm`, and disable `Swarm.Transports.Network.TCP`.

Doesn't seem to work for me; my router seems to choke even when there are only ~60-80 connections.
I launch the daemon with reuseport set to false:

LIBP2P_TCP_REUSEPORT=false ipfs daemon

Output:

Initializing daemon...
Kubo version: 0.17.0
Repo version: 12
System version: amd64/linux
Golang version: go1.19.1
2022/12/18 16:40:26 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
Swarm listening on /ip4/127.0.0.1/udp/4001/quic
Swarm listening on /ip4/172.17.0.1/udp/4001/quic
Swarm listening on /ip4/172.18.0.1/udp/4001/quic
Swarm listening on /ip4/172.19.0.1/udp/4001/quic
Swarm listening on /ip4/172.20.0.1/udp/4001/quic
Swarm listening on /ip4/192.168.10.101/udp/4001/quic
Swarm listening on /ip6/::1/udp/4001/quic
Swarm listening on /p2p-circuit
Swarm announcing /ip4/113.43.201.170/udp/4001/quic
Swarm announcing /ip4/127.0.0.1/udp/4001/quic
Swarm announcing /ip4/192.168.10.101/udp/4001/quic
Swarm announcing /ip6/::1/udp/4001/quic
API server listening on /ip4/127.0.0.1/tcp/5001
WebUI: http://127.0.0.1:5001/webui
Gateway (readonly) server listening on /ip4/127.0.0.1/tcp/8080
Daemon is ready

And settings:

"Addresses": {
	"Swarm": [
		"/ip4/0.0.0.0/udp/4001/quic",
		"/ip6/::/udp/4001/quic"
	]
},
"Swarm": {
	"Transports": {
		"Multiplexers": {},
		"Network": {
			"TCP": false
		},
		"Security": {}
	}
}

@okanisis

I've disabled TCP:

"Transports": {
      "Network": {
        "TCP": false
      },
      "Security": {},
      "Multiplexers": {}
    }

Quic as swarm:

"Swarm": [
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip6/::/udp/4001/quic"
    ]

And I still get the error with connections and streams capped:

"ConnMgr": {
      "Type": "basic",
      "LowWater": 10,
      "HighWater": 15,
      "GracePeriod": "30s"
    },
    "ResourceMgr": {
      "Limits": {
        "System": {
          "Conns": 50,
          "Streams": 50
        }
      }
    }

Operating System:

% uname -a
Linux cryptsus 6.1.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 21 Dec 2022 22:27:55 +0000 x86_64 GNU/Linux

Router/Modem (it's from Telus and they use a homemade variation of OpenWrt):

Model Name: TELUS Wi-Fi Hub
Firmware Version: v3.00.24 build10
Boot Code Version: 0.00.01
Hardware Version: 01

@williamalvarezdev

Still not fixed?

@hadim

hadim commented Apr 8, 2023

It took me a while to find this ticket, but it seems like I am hitting the exact same bug. A combination of QUIC only + disabling TCP + configuring ConnMgr seems to make it stable so far.

@ttax00

ttax00 commented Jun 14, 2023

Reporting back after a year of running an IPFS node. Issue still occurs occasionally even with very conservative configurations.

What makes it special is how my qBittorrent connections run smoothly without any trouble but IPFS starts to fry routers.

@Jorropo
Contributor

Jorropo commented Jun 14, 2023

@TechTheAwesome can you please try one of the solutions proposed above and report on whether it works (disabling reuseport or disabling TCP)?

@ttax00

ttax00 commented Jun 17, 2023

@Jorropo Unfortunately, I did try both solutions and neither seemed to work. The daemon runs for 1-2 minutes, gets up to 300 peers, and then my internet connection starts to get cut off.

Is there any way I can export an IPFS log of some sort?

System:

  • Windows 11
  • Ryzen 7 5700U laptop

IPFS:

  • kubo: 0.20.0
  • ipfs-desktop: 0.28.0

@kakra

kakra commented Jun 17, 2023

I've found that after some update, the connection limits of kubo were totally out of control. Putting the default settings into the Swarm section fixed this:

		"ConnMgr": {
			"GracePeriod": "20s",
			"HighWater": 96,
			"LowWater": 32,
			"Type": "basic"
		},

It now hovers around 20 to 30 connections.

@ttax00

ttax00 commented Jun 18, 2023

@kakra Below is my ConnMgr

		"ConnMgr": {
			"GracePeriod": "1m0s",
			"HighWater": 40,
			"LowWater": 20,
			"Type": "basic"
		},

And my entire config, including setting TCP to false. But IPFS still leaves my laptop unable to connect to websites.

{
	"API": {
		"HTTPHeaders": {
			"Access-Control-Allow-Origin": [
				"https://webui.ipfs.io",
				"http://webui.ipfs.io.ipns.localhost:8080"
			]
		}
	},
	"Addresses": {
		"API": "/ip4/127.0.0.1/tcp/5001",
		"Announce": [],
		"AppendAnnounce": [],
		"Gateway": "/ip4/127.0.0.1/tcp/8080",
		"NoAnnounce": [],
		"Swarm": [
			"/ip4/0.0.0.0/tcp/4001",
			"/ip6/::/tcp/4001",
			"/ip4/0.0.0.0/udp/4001/quic",
			"/ip4/0.0.0.0/udp/4001/quic-v1",
			"/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
			"/ip6/::/udp/4001/quic",
			"/ip6/::/udp/4001/quic-v1",
			"/ip6/::/udp/4001/quic-v1/webtransport"
		]
	},
	"AutoNAT": {},
	"Bootstrap": [
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
		"/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
		"/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
		"/ip4/45.76.100.74/udp/4001/quic/p2p/12D3KooWK2DoikedHm2jQgbknGMhR2SSrGKuFoWN2Xj1EUpi1nYW",
		"/ip4/45.76.244.78/udp/4001/quic/p2p/12D3KooWPgSN1V3PKhroXQ1LBTN9LCHmk5jqvxuYbr12BKuYxFYG"
	],
	"DNS": {
		"Resolvers": {}
	},
	"Datastore": {
		"BloomFilterSize": 0,
		"GCPeriod": "1h",
		"HashOnRead": false,
		"Spec": {
			"mounts": [
				{
					"child": {
						"path": "blocks",
						"shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
						"sync": true,
						"type": "flatfs"
					},
					"mountpoint": "/blocks",
					"prefix": "flatfs.datastore",
					"type": "measure"
				},
				{
					"child": {
						"compression": "none",
						"path": "datastore",
						"type": "levelds"
					},
					"mountpoint": "/",
					"prefix": "leveldb.datastore",
					"type": "measure"
				}
			],
			"type": "mount"
		},
		"StorageGCWatermark": 90,
		"StorageMax": "10GB"
	},
	"Discovery": {
		"MDNS": {
			"Enabled": true
		}
	},
	"Experimental": {
		"AcceleratedDHTClient": false,
		"FilestoreEnabled": false,
		"GraphsyncEnabled": false,
		"Libp2pStreamMounting": false,
		"P2pHttpProxy": false,
		"StrategicProviding": false,
		"UrlstoreEnabled": false
	},
	"Gateway": {
		"APICommands": [],
		"HTTPHeaders": {
			"Access-Control-Allow-Headers": [
				"X-Requested-With",
				"Range",
				"User-Agent"
			],
			"Access-Control-Allow-Methods": [
				"GET"
			],
			"Access-Control-Allow-Origin": [
				"*"
			]
		},
		"NoDNSLink": false,
		"NoFetch": false,
		"PathPrefixes": [],
		"PublicGateways": null,
		"RootRedirect": "",
		"Writable": false
	},
	"Identity": {
		"PeerID": "12D3KooWKM6QU7jdvcf6M96RWGUNmCAGJ7aCRKU9odEbbs5ddJuX"
	},
	"Internal": {},
	"Ipns": {
		"RecordLifetime": "",
		"RepublishPeriod": "",
		"ResolveCacheSize": 128
	},
	"Migration": {
		"DownloadSources": [],
		"Keep": ""
	},
	"Mounts": {
		"FuseAllowOther": false,
		"IPFS": "/ipfs",
		"IPNS": "/ipns"
	},
	"Peering": {
		"Peers": [
			{
				"Addrs": [
					"/ip4/35.78.51.148/udp/4001/quic"
				],
				"ID": "12D3KooWBsyKEDH1x4GhSjXUNwXGfb9HXbvTzeHBert2AevcyFnx"
			}
		]
	},
	"Pinning": {
		"RemoteServices": {}
	},
	"Plugins": {
		"Plugins": null
	},
	"Provider": {
		"Strategy": ""
	},
	"Pubsub": {
		"DisableSigning": false,
		"Router": ""
	},
	"Reprovider": {},
	"Routing": {
		"Methods": null,
		"Routers": null
	},
	"Swarm": {
		"AddrFilters": null,
		"ConnMgr": {
			"GracePeriod": "1m0s",
			"HighWater": 40,
			"LowWater": 20,
			"Type": "basic"
		},
		"DisableBandwidthMetrics": false,
		"DisableNatPortMap": false,
		"RelayClient": {},
		"RelayService": {},
		"ResourceMgr": {},
		"Transports": {
			"Multiplexers": {},
			"Network": {
				"TCP": false
			},
			"Security": {}
		}
	}
}

@kakra

kakra commented Jun 18, 2023

Below is my ConnMgr

@TechTheAwesome Maybe reduce your grace time: As far as I understand, it sets how long a connection is at least kept regardless of the high water mark.

Also try setting address filters: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmaddrfilters

"Swarm": {
    "AddrFilters": [
        "/ip4/10.0.0.0/ipcidr/8",
        "/ip4/172.16.0.0/ipcidr/12",
        "/ip4/192.168.0.0/ipcidr/16"
    ],
    "... your remainder swarm config here": "..."
}

This will prevent your node from connecting to local machines but it will also prevent that your node tries to connect to seemingly local nodes which will then in turn be routed via your router - which will probably just route it to its default gateway and create a useless NAT mapping: These local networks are not routable on the WAN network of your router.

IPv6 works better here because it knows the routing scopes of your addresses: It won't try to route site scope addresses via the WAN interface. So no need to bother with filters for IPv6.

After saving the changes, restart your node and maybe also your router (so you don't carry any artifacts over).

@markg85
Contributor

markg85 commented Jun 18, 2023

Contrary to apparently popular belief, the low/high water marks mean nothing in this specific case.

How it "roughly" works is that your node asks other peers for their peer list.
Your node gets these peers (could be thousands in mere seconds) and your node will try to make a connection to all of them.

Do take this with a little bit of salt. I don't know the exact internals and might be off on the specifics here. But in general it does work this way. You can see this yourself. On linux, install a package called "ttyplot" and run the following command:
{ while true; do ipfs swarm peers | wc -l; sleep 0.3; done } | ttyplot

Or if you have no ttyplot, just do:
{ while true; do ipfs swarm peers | wc -l; sleep 0.3; done }

And copy its output to some plot/chart tool of your choice.

What you'll see is the actual connections your node has open. At the grace period you'll see a sharp decline (it killed connections above the highWater mark).

Moral of the story, highWater/lowWater/grace have nothing to do with fixing this issue. They can, at best, make it occur less.
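
If you would rather have numbers than a plot, the sampled counts can be summarised with a few lines of Python. This is just a sketch: `summarize` is a made-up helper, and the sample values in the usage note are invented for illustration, not measured.

```python
def summarize(samples: list[int], high_water: int) -> dict:
    """Summarise peer counts that were sampled at a fixed interval."""
    if not samples:
        raise ValueError("no samples")
    over = sum(1 for s in samples if s > high_water)
    # Largest fall between two consecutive samples, typical of a trim pass.
    biggest_drop = max((a - b for a, b in zip(samples, samples[1:])), default=0)
    return {
        "peak": max(samples),
        "share_above_high_water": over / len(samples),
        "biggest_drop": max(biggest_drop, 0),
    }
```

Feeding it an invented series such as `[35, 120, 310, 290, 42, 60]` with a HighWater of 40 gives a peak of 310, five of six samples above the mark, and a biggest drop of 248. That is the shape described above: the count overshoots freely and only falls back when connections are trimmed.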

@thedavidmeister

thedavidmeister commented Jun 27, 2023

So I discovered my ipfs node thrashing over 5000 connections, which impacted the network so badly i couldn't even do a simple git pull. I noticed it because other devices on the network were also struggling to hit just 100kb/s.

More or less default network settings for IPFS, i have a static IP and this is all dockerised, i haven't had any issues in the last few months with connectivity broadly. I've disabled TCP as per above suggestions to see if that helps, and have bumped from v0.19 to v0.20 but

  • 5000 seems like an unnecessarily high number of connections, even for a p2p system, an active eth full node "only" needs 100 or so connections from what i've seen
  • my IPFS node has perhaps a few hundred relatively tiny pins and that's it, no gateway or any other reason for a lot of resource usage afaics, there's nothing else running on this machine other than ubuntu
  • Almost all the connections seemed to be doing nothing except keeping themselves going, just a whole lot of apparent noise
  • I'm fairly sure I didn't have this issue as recently as a few days ago as I synced several hundred GB of a testnet to an eth node on the same network, before it suddenly stopped pulling blocks due to the IPFS node
  • This appears to have been a sustained issue, I'm not sure exactly how long but i'd estimate 0.5-2 days from when it started until i discovered it, so it can't be explained by a "once off" heavy task

Is it possible that there is some kind of regression/bug in a recent version of IPFS that is causing connections to thrash, or have I just been lucky for the last few months until now?

@Jorropo
Contributor

Jorropo commented Jun 27, 2023

This issue is very old, and some workarounds exist but are buried in the middle of the thread; it's also a collection of various un-actionable opinions.
Saying a lot has changed in that area since 2016 is an understatement.

I've created a new issue to collect a table of remaining problems #9998

@Jorropo Jorropo closed this as not planned Jun 27, 2023
@marten-seemann
Member

Note that the dial prioritization logic we introduced in the v0.28 go-libp2p release (disabled by default) will dramatically reduce the number of spurious dial attempts (especially on TCP, which is probably what creates the most problems with routers).

go-libp2p v0.29 will enable dial prioritization by default, and will be included in the next Kubo release.

@kieransimkin

The current BT internet routers for VDSL in the UK are definitely susceptible to this, they freeze up, with the lights still indicating no problem but no response over wifi or LAN. Mine has done this probably 10 times in the last day.

I suspect the problem is too many open connections - I'm experimenting with the Swarm settings now to see if I can narrow this down, but I did apply a bandwidth cap with wondershaper and it didn't seem to fix the problem, so now I'm experimenting with the High Water and Low Water settings in ipfs to see if this can fix the problem.

To be specific, I have the BT Business Smart Hub 2. The consumer level hub is basically the same hardware, so that's most likely susceptible too. This is a brand-new router and very common in the UK - after a nice conversation with a helpful guy in their second line technical support, he basically said; "yeah, the routers are not good, I recommend you replace it". I've ordered a Draytek, I imagine that will fix the problem.

@kakra

kakra commented Nov 18, 2023

I think one way to prevent such routers from locking up is to stop them from routing private destinations to the WAN interface in the first place. Unless you can add blackhole routing entries in the router itself, you should instruct your PC running kubo not to route private destinations to your WAN router. You'd need to add 10/8, 172.16/12 and 192.168/16 to be blackholed in the routing table. Those destinations won't be reachable on the internet anyway, but most routers don't care and just fill NAT tables with junk until they eventually lock up or kill valid running connections by invalidating connection tracking states early.
