Endorse htlc and local reputation #2716

Open · wants to merge 4 commits into master

Conversation

@thomash-acinq (Member) commented Jul 21, 2023

We do not drop HTLCs yet; for now, the purpose is to collect data.

We add

  1. An endorsement bit to UpdateAddHtlc. This follows blip-0004: experimental endorsement signaling in update_add_htlc lightning/blips#27.
  2. A local reputation system: for each pair (origin node, endorsement value), we compute a reputation as the total fees that were paid divided by the total fees that would have been paid if all HTLCs had fulfilled (see the sketch after this list). When considering an HTLC to relay, we only forward it if the reputation of its source is higher than the occupancy of the outgoing channel.
  3. A limit on the number of small HTLCs per channel. We only allow a small number of low-value HTLCs per channel, so that it is not possible to block large HTLCs using only small ones (similar to the channel congestion control mechanism of #2330, but continuous).
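
A rough sketch of that scoring rule (illustrative only, not the exact code in this PR; names and types are made up):

```scala
// Illustrative sketch of the reputation rule described above (not the actual eclair code).
// One instance is kept per (origin node, endorsement value) pair.
case class Reputation(feesEarnedMsat: Long, feesOfferedMsat: Long) {
  // total fees that were paid / total fees that would have been paid if all HTLCs had fulfilled
  def score: Double = if (feesOfferedMsat == 0) 0.0 else feesEarnedMsat.toDouble / feesOfferedMsat

  // called when an HTLC from this origin settles
  def record(feeMsat: Long, fulfilled: Boolean): Reputation =
    Reputation(if (fulfilled) feesEarnedMsat + feeMsat else feesEarnedMsat, feesOfferedMsat + feeMsat)

  // relay decision: forward only if this reputation exceeds the occupancy of the outgoing channel
  // (occupancy = how full the outgoing channel already is, in HTLC slots or value in flight)
  def shouldRelay(outgoingChannelOccupancy: Double): Boolean = score > outgoingChannelOccupancy
}
```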

@codecov-commenter commented Jul 21, 2023

Codecov Report

Merging #2716 (deab085) into master (12adf87) will increase coverage by 0.04%.
Report is 1 commit behind head on master.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #2716      +/-   ##
==========================================
+ Coverage   85.82%   85.86%   +0.04%     
==========================================
  Files         216      218       +2     
  Lines       18126    18209      +83     
  Branches      771      749      -22     
==========================================
+ Hits        15556    15636      +80     
- Misses       2570     2573       +3     
| Files | Coverage Δ |
|---|---|
| ...re/src/main/scala/fr/acinq/eclair/NodeParams.scala | 93.47% <100.00%> (+0.08%) ⬆️ |
| ...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 75.29% <100.00%> (+0.14%) ⬆️ |
| ...in/scala/fr/acinq/eclair/channel/ChannelData.scala | 100.00% <ø> (ø) |
| ...in/scala/fr/acinq/eclair/channel/Commitments.scala | 96.93% <100.00%> (+0.11%) ⬆️ |
| ...ain/scala/fr/acinq/eclair/channel/Monitoring.scala | 96.15% <100.00%> (+0.23%) ⬆️ |
| ...in/scala/fr/acinq/eclair/channel/fsm/Channel.scala | 85.80% <100.00%> (+0.15%) ⬆️ |
| ...ain/scala/fr/acinq/eclair/payment/Monitoring.scala | 98.30% <100.00%> (+0.09%) ⬆️ |
| .../scala/fr/acinq/eclair/payment/PaymentPacket.scala | 90.82% <100.00%> (ø) |
| ...a/fr/acinq/eclair/payment/relay/ChannelRelay.scala | 96.03% <100.00%> (+0.16%) ⬆️ |
| ...fr/acinq/eclair/payment/relay/ChannelRelayer.scala | 100.00% <100.00%> (ø) |
| ... and 9 more | |

... and 3 files with indirect coverage changes

thomash-acinq and others added 2 commits August 1, 2024 16:47
We must allow node operators to run plain blip-0004 without our
experimental reputation recorder. It's also a stop-gap in case
there is a bug in it that can be exploited.

We also refactor to introduce more types and documentation, without
changing the reputation algorithm itself.

We also fix a few issues mentioned on the main PR comments.
@t-bast (Member) left a comment

Thanks, it's now clearer to me how we assign reputation to our peers. I've made a few comments on the code itself, some of which (the easy ones) I fixed in #2893.

The reputation algorithm itself looks good to me; let's try it out and see what results we get in practice and during simulations.

However, I don't think the way we interact with the reputation recorder makes the most sense.
You are storing a relay attempt as soon as we start relaying, before we know whether we actually send HTLCs out or not.
This leads to the weird CancelRelay command and an inconsistency between channel relay and trampoline relay.
In the trampoline case, if we can't find a route or can't send outgoing HTLCs, we will treat this as a failure, which is incorrect.
This can probably even be used to skew our reputation algorithm.
It's also pretty invasive, especially in the NodeRelay component...

It seems to me that it would make more sense if we implemented the following flow:

  1. Once we start relaying (ChannelRelay / NodeRelay), we obtain the confidence value with GetConfidence and will include it in CMD_ADD_HTLC.
  2. At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
  3. In Channel.scala, when we actually send an outgoing UpdateAddHtlc, we emit an OutgoingHtlcAdded event to the event stream, that contains the outgoing HTLC and its Origin.Hot.
  4. In Channel.scala, when an outgoing HTLC is failed or fulfilled, we emit an OutgoingHtlcFailed / OutgoingHtlcFulfilled event to the event stream.
  5. The reputation recorder listens to those events, and updates the internal reputation state accordingly.
  6. We don't use the relayId but rather the outgoing channel_id and htlc_id, combined with the origin to group HTLCs.
  7. For trampoline payments, since the reputation recorder has the Origin information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.

I believe this better matches what we're trying to accomplish: the only thing the reputation recorder actually needs to know to update reputation is when outgoing HTLCs are sent and when they're settled.
It also provides more accurate relay data to ensure we're updating the reputation correctly, and has much less impact on the ChannelRelay / NodeRelay actors (which should simplify testing).
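
A rough sketch of that wiring (the event and message names follow the list above; everything else is simplified and hypothetical, not eclair's actual types):

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.eventstream.EventStream
import akka.actor.typed.scaladsl.Behaviors

object ReputationRecorderSketch {
  // Events the channel would publish to the event stream when sending / settling outgoing HTLCs.
  case class OutgoingHtlcAdded(channelId: String, htlcId: Long, originNodeId: String, feeMsat: Long)
  case class OutgoingHtlcFulfilled(channelId: String, htlcId: Long)
  case class OutgoingHtlcFailed(channelId: String, htlcId: Long)

  sealed trait Command
  private case class WrappedAdded(e: OutgoingHtlcAdded) extends Command
  private case class WrappedFulfilled(e: OutgoingHtlcFulfilled) extends Command
  private case class WrappedFailed(e: OutgoingHtlcFailed) extends Command

  def apply(): Behavior[Command] = Behaviors.setup { context =>
    // the recorder only listens to channel events, it is not on the relay's critical path
    context.system.eventStream ! EventStream.Subscribe(context.messageAdapter[OutgoingHtlcAdded](WrappedAdded))
    context.system.eventStream ! EventStream.Subscribe(context.messageAdapter[OutgoingHtlcFulfilled](WrappedFulfilled))
    context.system.eventStream ! EventStream.Subscribe(context.messageAdapter[OutgoingHtlcFailed](WrappedFailed))
    run(pending = Map.empty)
  }

  private def run(pending: Map[(String, Long), OutgoingHtlcAdded]): Behavior[Command] =
    Behaviors.receiveMessagePartial[Command] {
      case WrappedAdded(e) =>
        run(pending + ((e.channelId, e.htlcId) -> e))
      case WrappedFulfilled(e) =>
        // here the recorder would credit the origin's reputation with the fee (omitted in this sketch)
        run(pending - ((e.channelId, e.htlcId)))
      case WrappedFailed(e) =>
        // here the recorder would count the offered fee without crediting it (omitted in this sketch)
        run(pending - ((e.channelId, e.htlcId)))
    }
}
```

With this shape, ChannelRelay / NodeRelay would only ever query the recorder for a confidence value, and all bookkeeping would happen off the relay path.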

Can you try that, or let me know if you think that it wouldn't be as good as the currently implemented flow?

Comment on lines 248 to 249
// Pending HTLCs are counted as failed, and because they could potentially stay pending for a very long time, the
// following multiplier is applied.

Thanks, this is much clearer. I'm wondering how you came up with this default value though? Can you explain it and provide some guidance on what node operators should take into account when they change that default value?

*/
def record(relayId: UUID, isSuccess: Boolean, feeOverride: Option[MilliSatoshi] = None, now: TimestampMilli = TimestampMilli.now()): Reputation = {
val d = decay(now)
var p = pending.getOrElse(relayId, Pending(MilliSatoshi(0), now))
@t-bast (Member) commented Aug 2, 2024

I don't think it makes sense to record something if we can't find the HTLC in the pending map, does it? Are there cases where we won't find the HTLC in the map? I've changed that in #2893, let me know if that's correct.
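
For reference, a minimal standalone version of that guard (simplified types, not eclair's actual Reputation class):

```scala
import java.util.UUID

// Simplified sketch of the suggestion above: only record a settlement when the relayId is
// actually in the pending map, and leave the reputation unchanged otherwise.
case class Reputation(paidFeesMsat: Long, offeredFeesMsat: Long, pending: Map[UUID, Long]) {
  def record(relayId: UUID, isSuccess: Boolean): Reputation =
    pending.get(relayId) match {
      case Some(feeMsat) =>
        Reputation(
          paidFeesMsat = if (isSuccess) paidFeesMsat + feeMsat else paidFeesMsat,
          offeredFeesMsat = offeredFeesMsat + feeMsat,
          pending = pending - relayId)
      case None => this // unknown relayId: nothing to record
    }
}
```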

@@ -124,8 +131,6 @@ class ChannelRelay private(nodeParams: NodeParams,
private case class PreviouslyTried(channelId: ByteVector32, failure: RES_ADD_FAILED[ChannelException])

def relay(previousFailures: Seq[PreviouslyTried]): Behavior[Command] = {
Behaviors.receiveMessagePartial {
@t-bast (Member) commented Aug 2, 2024

Being in a Behaviors.receiveMessagePartial is important, because that's how we correctly get MDC fields into the logging context. If you remove this, the log lines won't contain the relayId and other important MDC fields. Restored in #2893.

@@ -360,7 +361,8 @@ class Setup(val datadir: File,
offerManager = system.spawn(Behaviors.supervise(OfferManager(nodeParams, router, paymentTimeout = 1 minute)).onFailure(typed.SupervisorStrategy.resume), name = "offer-manager")
paymentHandler = system.actorOf(SimpleSupervisor.props(PaymentHandler.props(nodeParams, register, offerManager), "payment-handler", SupervisorStrategy.Resume))
triggerer = system.spawn(Behaviors.supervise(AsyncPaymentTriggerer()).onFailure(typed.SupervisorStrategy.resume), name = "async-payment-triggerer")
relayer = system.actorOf(SimpleSupervisor.props(Relayer.props(nodeParams, router, register, paymentHandler, triggerer, Some(postRestartCleanUpInitialized)), "relayer", SupervisorStrategy.Resume))
reputationRecorder = system.spawn(Behaviors.supervise(ReputationRecorder(nodeParams.relayParams.peerReputationConfig, Map.empty)).onFailure(typed.SupervisorStrategy.resume), name = "reputation-recorder")
@t-bast (Member) commented Aug 2, 2024

We must make this optional and let node operators disable it entirely if they'd like to run the standard blip-0004 relay. I've done this in #2893.
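
Something along these lines in Setup.scala, where the `enabled` setting name is an assumption and the spawn mirrors the quoted line:

```scala
// Hypothetical sketch: only spawn the reputation recorder when enabled in the configuration
// ('enabled' is an assumed setting name), and pass an Option down so that callers skip
// reputation queries entirely when it is absent.
reputationRecorder_opt =
  if (nodeParams.relayParams.peerReputationConfig.enabled)
    Some(system.spawn(
      Behaviors.supervise(ReputationRecorder(nodeParams.relayParams.peerReputationConfig, Map.empty))
        .onFailure(typed.SupervisorStrategy.resume),
      name = "reputation-recorder"))
  else None
```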


case class Confidence(value: Double)

def apply(reputationConfig: ReputationConfig, reputations: Map[(PublicKey, Int), Reputation]): Behavior[Command] = {

I see why you're keeping a per-incoming-endorsement reputation, but aren't you afraid that this will split the data points and we'll end up with some endorsement values for which we don't have enough entries to make a proper decision?

Why can't you instead maintain a single Reputation instance per incoming peer, but take their endorsement value into account in the scoring? That would be more robust, wouldn't it?
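
For illustration, a single per-peer reputation where the endorsement only weights each HTLC's contribution could look like this (the weighting itself is made up):

```scala
// Hypothetical sketch: one reputation per incoming peer, with the endorsement value only used to
// weight each HTLC's contribution to the score, instead of splitting data points across
// (peer, endorsement) keys. The exact weighting below is made up.
case class PeerReputation(weightedPaid: Double, weightedOffered: Double) {
  def score: Double = if (weightedOffered == 0) 0.0 else weightedPaid / weightedOffered

  def record(feeMsat: Long, endorsement: Int, fulfilled: Boolean): PeerReputation = {
    val weight = (endorsement + 1).toDouble / 8 // assuming endorsement values in [0, 7]
    PeerReputation(
      weightedPaid = if (fulfilled) weightedPaid + feeMsat * weight else weightedPaid,
      weightedOffered = weightedOffered + feeMsat * weight)
  }
}
```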

Comment on lines 568 to 574
// Jamming protection
// Must be the last checks so that they can be ignored for shadow deployment.
for ((amountMsat, i) <- incomingHtlcs.toSeq.map(_.amountMsat).sorted.zipWithIndex) {
if ((amountMsat.toLong < 1) || (math.log(amountMsat.toLong.toDouble) * params.localParams.maxAcceptedHtlcs / math.log(params.localParams.maxHtlcValueInFlightMsat.toLong.toDouble / params.localParams.maxAcceptedHtlcs) < i)) {
return Left(TooManySmallHtlcs(params.channelId, number = i + 1, below = amountMsat))
}
}
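
For reference (assuming the quoted condition is accurate), the check is equivalent to requiring the i-th smallest incoming HTLC (0-indexed) to carry at least `(maxHtlcValueInFlightMsat / maxAcceptedHtlcs)^(i / maxAcceptedHtlcs)` msat, so the allowed amounts grow exponentially with the index: with illustrative values maxAcceptedHtlcs = 30 and maxHtlcValueInFlightMsat = 1,000,000,000, the HTLC at index 15 must carry at least roughly 5,800 msat.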

We can't have jamming protection in receiveAdd: an error here would trigger a force-close, and errors here are only used when our peer has violated channel constraints they should be aware of. This isn't the case for jamming, where we apply a local (potentially dynamic) heuristic: we should accept those HTLCs and fail them instead of relaying them, but not raise an error in receiveAdd.

Comment on lines +883 to +900
val canSendAdds = active.map(_.canSendAdd(add.amountMsat, params, changes1, feerates, feeConf, cmd.confidence))
// Log only for jamming protection.
canSendAdds.collectFirst {
case Left(f: TooManySmallHtlcs) =>
log.info("TooManySmallHtlcs: {} outgoing HTLCs are below {}}", f.number, f.below)
Metrics.dropHtlc(f, Tags.Directions.Outgoing)
case Left(f: ConfidenceTooLow) =>
log.info("ConfidenceTooLow: confidence is {}% while channel is {}% full", (100 * f.confidence).toInt, (100 * f.occupancy).toInt)
Metrics.dropHtlc(f, Tags.Directions.Outgoing)
}
canSendAdds.flatMap { // TODO: We ignore jamming protection, delete this flatMap to activate jamming protection.
case Left(_: TooManySmallHtlcs) | Left(_: ConfidenceTooLow) => None
case x => Some(x)
}
.collectFirst { case Left(f) =>
Metrics.dropHtlc(f, Tags.Directions.Outgoing)
Left(f)
}

This seems overly complex: can't you keep sendAdd unchanged, and in canSendAdd instead of returning TooManySmallHtlcs or ConfidenceTooLow, simply write the log line and increment the metric?

This will duplicate the log/metric when we have pending splices, but pending splices shouldn't happen frequently enough to skew the data?

Or even better, since HTLCs are the same in all active commitments, you can remove the changes to canSendAdd entirely: here in sendAdd, when canSendAdd returns a success, evaluate the jamming conditions and log / increment the metric if needed.

@thomash-acinq (Member, Author) replied:

I've done this in #2898, however I don't like it because it just duplicates a lot of canSendAdd's code. I believe that jamming checks, like the other checks, should be done in canSendAdd. This piece of code may look overly complex, but the idea is to get rid of it when we deploy jamming protection, at which point TooManySmallHtlcs and ConfidenceTooLow will just be treated like any other error.

@thomash-acinq (Member, Author) commented:

The reason for the weird CancelRelay is that we need to take pending HTLCs into account in the reputation. If we receive two HTLCs at once, we don't want both of them to enjoy the same reputation; the second one should be penalized. If we only update the reputation after we have taken our decision to relay, we can get a data race.
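
To illustrate with made-up numbers (ignoring the pending-HTLC multiplier): if a peer has earned us 1,000 sat of fees out of 1,000 sat offered, its reputation is 1.0. If a first HTLC offering 100 sat of fees is immediately counted as pending (i.e. as failed for now), a second HTLC arriving concurrently sees a reputation of 1,000 / 1,100 ≈ 0.91 rather than 1.0, so only one of the two can clear a tight occupancy check; if the reputation were only updated after the relay decision, both would see 1.0.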

@t-bast (Member) commented Aug 2, 2024

> The reason for the weird CancelRelay is that we need to take pending HTLCs into account in the reputation. If we receive two HTLCs at once, we don't want both of them to enjoy the same reputation; the second one should be penalized. If we only update the reputation after we have taken our decision to relay, we can get a data race.

But you're not doing this for trampoline relay, so that race can already be exploited anyway? I don't think this matters much in practice though, because:

  • that race is hard to exploit, because between the call to the ReputationRecorder and the outgoing HTLC, there will be at most a few milliseconds
  • exploiting that race requires ensuring that we receive the incoming update_add_htlc at exactly the same time, and network delays cannot be trivially manipulated
  • at some point we will add a randomized delay before forwarding HTLCs, because it's good for privacy (and was discussed in Oakland), which will make this race almost impossible to exploit

@thomash-acinq (Member, Author) commented:

> 1. Once we start relaying (`ChannelRelay` / `NodeRelay`), we obtain the confidence value with `GetConfidence` and will include it in `CMD_ADD_HTLC`.
> 2. At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
> 3. In `Channel.scala`, when we actually send an outgoing `UpdateAddHtlc`, we emit an `OutgoingHtlcAdded` event to the event stream, that contains the outgoing HTLC and its `Origin.Hot`.
> 4. In `Channel.scala`, when an outgoing HTLC is failed or fulfilled, we emit an `OutgoingHtlcFailed` / `OutgoingHtlcFulfilled` event to the event stream.
> 5. The reputation recorder listens to those events, and updates the internal reputation state accordingly.
> 6. We don't use the `relayId` but rather the outgoing `channel_id` and `htlc_id`, combined with the origin to group HTLCs.
> 7. For trampoline payments, since the reputation recorder has the `Origin` information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.

I've tried doing that in #2897.
For channel relays it works fine; however, for trampoline relays I'm running into some problems:

  • We can't wait for outgoing HTLCs to be settled to compute the fees: we need to update the reputation as soon as we start relaying, and for that we need to know the fees.
  • Even when all HTLCs associated with a trampoline relay fail, it's not necessarily the end of this relay, because we may retry.

It seems to me that solving this would require adding more complexity than this refactoring was removing.

- change default parameters and explain them better
- remove checks on incoming HTLCs
@t-bast (Member) commented Aug 9, 2024

> We can't wait for outgoing HTLCs to be settled to compute the fees: we need to update the reputation as soon as we start relaying, and for that we need to know the fees.

It seems to me that we're trying to make trampoline fit into a box where it doesn't actually fit. One important aspect of trampoline is that the sender does not choose the outgoing channels and does not choose the fees: they allocate a total fee budget for the trampoline node, which tries to relay within that budget. The trampoline node will ensure that it earns at least its usual channel routing fees, otherwise it won't relay the payment. If the trampoline node is well connected, or the sender over-allocated fees, the trampoline node earns more than its usual routing fees: but I'm not sure that this extra fee should count towards reputation?

So I think we could handle trampoline relays in a simplified way that gets rid of those issues, by using the channel routing fees instead of trying to take the extra trampoline fees into account: when sending an outgoing HTLC with a trampoline origin, the fees we allocate to it in the reputation algorithm should just be this outgoing channel's routing fees (which can be included in the relay event since we have access to our channel update in the channel data).
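
A sketch of the fee the reputation algorithm would then be given for a trampoline-origin outgoing HTLC, using the standard channel relay fee formula from BOLT 7 (the helper name is made up):

```scala
// Illustrative helper (not eclair's actual code): under the proposal above, the fee attributed to
// the reputation algorithm for a trampoline-origin HTLC is simply the outgoing channel's routing
// fee for that amount, i.e. fee_base_msat + amount_msat * fee_proportional_millionths / 1e6.
def channelRoutingFeeMsat(amountMsat: Long, feeBaseMsat: Long, feeProportionalMillionths: Long): Long =
  feeBaseMsat + amountMsat * feeProportionalMillionths / 1000000

// e.g. relaying 100000000 msat over a channel charging 1000 msat base + 100 ppm:
// channelRoutingFeeMsat(100000000L, 1000L, 100L) == 11000 msat
```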

If the payment succeeds and we want to give a reputation bonus when we earned more fees than our channel routing fees, this should be easy to do as well, by splitting the extra fee between all the outgoing channels? But I'm not sure we should do this, because we can't really match an outgoing HTLC to a specific incoming channel 1-to-1, so it's probably better to just count our channel routing fees?

Do you think that model would make sense, or am I missing something?

@thomash-acinq (Member, Author) commented:

It seems like a good solution indeed, I'll try it.
