Endorse htlc and local reputation #2716

Open · wants to merge 4 commits into master

Conversation

@thomash-acinq (Member) commented Jul 21, 2023

We do not drop HTLCs yet; for now, the purpose is to collect data.

We add

  1. An endorsement bit to UpdateAddHtlc. This follows blip-0004: experimental endorsement signaling in update_add_htlc lightning/blips#27.
  2. A local reputation system: for each pair (origin node, endorsement value), we compute a reputation as the total fees that were paid divided by the total fees that would have been paid if all HTLCs had fulfilled (see the sketch after this list). When considering an HTLC to relay, we only forward it if the reputation of its source is higher than the occupancy of the outgoing channel.
  3. A limit on the number of small HTLCs per channel. We only allow a small number of low-value HTLCs per channel, so that it is not possible to block large HTLCs using only small ones (similar to the channel congestion control mechanism of #2330, but continuous).
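
A rough sketch of that scoring rule (illustrative only, not the exact code in this PR; names and types are made up):

```scala
// Illustrative sketch of the reputation rule described above (not the actual eclair code).
// One instance is kept per (origin node, endorsement value) pair.
case class Reputation(feesEarnedMsat: Long, feesOfferedMsat: Long) {
  // total fees that were paid / total fees that would have been paid if all HTLCs had fulfilled
  def score: Double = if (feesOfferedMsat == 0) 0.0 else feesEarnedMsat.toDouble / feesOfferedMsat

  // called when an HTLC from this origin settles
  def record(feeMsat: Long, fulfilled: Boolean): Reputation =
    Reputation(if (fulfilled) feesEarnedMsat + feeMsat else feesEarnedMsat, feesOfferedMsat + feeMsat)

  // relay decision: forward only if this reputation exceeds the occupancy of the outgoing channel
  // (occupancy = how full the outgoing channel already is, in HTLC slots or value in flight)
  def shouldRelay(outgoingChannelOccupancy: Double): Boolean = score > outgoingChannelOccupancy
}
```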

@codecov-commenter commented Jul 21, 2023

Codecov Report

Merging #2716 (deab085) into master (12adf87) will increase coverage by 0.04%.
Report is 1 commit behind head on master.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #2716      +/-   ##
==========================================
+ Coverage   85.82%   85.86%   +0.04%     
==========================================
  Files         216      218       +2     
  Lines       18126    18209      +83     
  Branches      771      749      -22     
==========================================
+ Hits        15556    15636      +80     
- Misses       2570     2573       +3     
| Files | Coverage Δ |
|---|---|
| ...re/src/main/scala/fr/acinq/eclair/NodeParams.scala | 93.47% <100.00%> (+0.08%) ⬆️ |
| ...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 75.29% <100.00%> (+0.14%) ⬆️ |
| ...in/scala/fr/acinq/eclair/channel/ChannelData.scala | 100.00% <ø> (ø) |
| ...in/scala/fr/acinq/eclair/channel/Commitments.scala | 96.93% <100.00%> (+0.11%) ⬆️ |
| ...ain/scala/fr/acinq/eclair/channel/Monitoring.scala | 96.15% <100.00%> (+0.23%) ⬆️ |
| ...in/scala/fr/acinq/eclair/channel/fsm/Channel.scala | 85.80% <100.00%> (+0.15%) ⬆️ |
| ...ain/scala/fr/acinq/eclair/payment/Monitoring.scala | 98.30% <100.00%> (+0.09%) ⬆️ |
| .../scala/fr/acinq/eclair/payment/PaymentPacket.scala | 90.82% <100.00%> (ø) |
| ...a/fr/acinq/eclair/payment/relay/ChannelRelay.scala | 96.03% <100.00%> (+0.16%) ⬆️ |
| ...fr/acinq/eclair/payment/relay/ChannelRelayer.scala | 100.00% <100.00%> (ø) |
| ... and 9 more | |

... and 3 files with indirect coverage changes

thomash-acinq and others added 2 commits August 1, 2024 16:47
We must allow node operators to run plain blip-0004 without our
experimental reputation recorder. It's also a stop-gap in case
there is a bug in it that can be exploited.

We also refactor to introduce more types and documentation, without
changing the reputation algorithm itself.

We also fix a few issues mentioned on the main PR comments.
@t-bast (Member) left a comment

Thanks, it's now clearer to me how we assign reputation to our peers. I've made a few comments on the code itself, some of which (the easy ones) I fixed in #2893.

The reputation algorithm itself looks good to me; let's try it out and see what results we get in practice and during simulations.

However, I don't think the way we interact with the reputation recorder makes the most sense.
You are storing a relay attempt as soon as we start relaying, before we know whether we actually send HTLCs out or not.
This leads to the weird CancelRelay command and an inconsistency between channel relay and trampoline relay.
In the trampoline case, if we can't find a route or can't send outgoing HTLCs, we will treat this as a failure, which is incorrect.
This can probably even be used to skew our reputation algorithm.
It's also pretty invasive, especially in the NodeRelay component...

It seems to me that it would make more sense if we implemented the following flow:

  1. Once we start relaying (ChannelRelay / NodeRelay), we obtain the confidence value with GetConfidence and will include it in CMD_ADD_HTLC.
  2. At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
  3. In Channel.scala, when we actually send an outgoing UpdateAddHtlc, we emit an OutgoingHtlcAdded event to the event stream, that contains the outgoing HTLC and its Origin.Hot.
  4. In Channel.scala, when an outgoing HTLC is failed or fulfilled, we emit an OutgoingHtlcFailed / OutgoingHtlcFulfilled event to the event stream.
  5. The reputation recorder listens to those events, and updates the internal reputation state accordingly.
  6. We don't use the relayId but rather the outgoing channel_id and htlc_id, combined with the origin to group HTLCs.
  7. For trampoline payments, since the reputation recorder has the Origin information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.

I believe this better matches what we're trying to accomplish: the only thing the reputation recorder actually needs to know to update reputation is when outgoing HTLCs are sent and when they're settled.
It also provides more accurate relay data to ensure we're updating the reputation correctly, and has much less impact on the ChannelRelay / NodeRelay actors (which should simplify testing).
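
A rough sketch of that wiring (the event and message names follow the list above; everything else is simplified and hypothetical, not eclair's actual types):

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.eventstream.EventStream
import akka.actor.typed.scaladsl.Behaviors

object ReputationRecorderSketch {
  // Events the channel would publish to the event stream when sending / settling outgoing HTLCs.
  case class OutgoingHtlcAdded(channelId: String, htlcId: Long, originNodeId: String, feeMsat: Long)
  case class OutgoingHtlcFulfilled(channelId: String, htlcId: Long)
  case class OutgoingHtlcFailed(channelId: String, htlcId: Long)

  sealed trait Command
  private case class WrappedAdded(e: OutgoingHtlcAdded) extends Command
  private case class WrappedFulfilled(e: OutgoingHtlcFulfilled) extends Command
  private case class WrappedFailed(e: OutgoingHtlcFailed) extends Command

  def apply(): Behavior[Command] = Behaviors.setup { context =>
    // the recorder only listens to channel events, it is not on the relay's critical path
    context.system.eventStream ! EventStream.Subscribe(context.messageAdapter[OutgoingHtlcAdded](WrappedAdded))
    context.system.eventStream ! EventStream.Subscribe(context.messageAdapter[OutgoingHtlcFulfilled](WrappedFulfilled))
    context.system.eventStream ! EventStream.Subscribe(context.messageAdapter[OutgoingHtlcFailed](WrappedFailed))
    run(pending = Map.empty)
  }

  private def run(pending: Map[(String, Long), OutgoingHtlcAdded]): Behavior[Command] =
    Behaviors.receiveMessagePartial[Command] {
      case WrappedAdded(e) =>
        run(pending + ((e.channelId, e.htlcId) -> e))
      case WrappedFulfilled(e) =>
        // here the recorder would credit the origin's reputation with the fee (omitted in this sketch)
        run(pending - ((e.channelId, e.htlcId)))
      case WrappedFailed(e) =>
        // here the recorder would count the offered fee without crediting it (omitted in this sketch)
        run(pending - ((e.channelId, e.htlcId)))
    }
}
```

With this shape, ChannelRelay / NodeRelay would only ever query the recorder for a confidence value, and all bookkeeping would happen off the relay path.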

Can you try that, or let me know if you think that it wouldn't be as good as the currently implemented flow?

Comment on lines 248 to 249
// Pending HTLCs are counted as failed, and because they could potentially stay pending for a very long time, the
// following multiplier is applied.

Thanks, this is much clearer. I'm wondering how you came up with this default value though? Can you explain it and provide some guidance on what node operators should take into account when they change that default value?

*/
def record(relayId: UUID, isSuccess: Boolean, feeOverride: Option[MilliSatoshi] = None, now: TimestampMilli = TimestampMilli.now()): Reputation = {
val d = decay(now)
var p = pending.getOrElse(relayId, Pending(MilliSatoshi(0), now))
@t-bast (Member) commented Aug 2, 2024

I don't think it makes sense to record something if we can't find the HTLC in the pending map, does it? Are there cases where we won't find the HTLC in the map? I've changed that in #2893, let me know if that's correct.
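
For reference, a minimal standalone version of that guard (simplified types, not eclair's actual Reputation class):

```scala
import java.util.UUID

// Simplified sketch of the suggestion above: only record a settlement when the relayId is
// actually in the pending map, and leave the reputation unchanged otherwise.
case class Reputation(paidFeesMsat: Long, offeredFeesMsat: Long, pending: Map[UUID, Long]) {
  def record(relayId: UUID, isSuccess: Boolean): Reputation =
    pending.get(relayId) match {
      case Some(feeMsat) =>
        Reputation(
          paidFeesMsat = if (isSuccess) paidFeesMsat + feeMsat else paidFeesMsat,
          offeredFeesMsat = offeredFeesMsat + feeMsat,
          pending = pending - relayId)
      case None => this // unknown relayId: nothing to record
    }
}
```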

@@ -124,8 +131,6 @@ class ChannelRelay private(nodeParams: NodeParams,
private case class PreviouslyTried(channelId: ByteVector32, failure: RES_ADD_FAILED[ChannelException])

def relay(previousFailures: Seq[PreviouslyTried]): Behavior[Command] = {
Behaviors.receiveMessagePartial {
@t-bast (Member) commented Aug 2, 2024

Being in a Behaviors.receiveMessagePartial is important, because that's how we correctly get MDC fields into the logging context. If you remove this, the log lines won't contain the relayId and other important MDC fields. Restored in #2893.

@@ -360,7 +361,8 @@ class Setup(val datadir: File,
offerManager = system.spawn(Behaviors.supervise(OfferManager(nodeParams, router, paymentTimeout = 1 minute)).onFailure(typed.SupervisorStrategy.resume), name = "offer-manager")
paymentHandler = system.actorOf(SimpleSupervisor.props(PaymentHandler.props(nodeParams, register, offerManager), "payment-handler", SupervisorStrategy.Resume))
triggerer = system.spawn(Behaviors.supervise(AsyncPaymentTriggerer()).onFailure(typed.SupervisorStrategy.resume), name = "async-payment-triggerer")
relayer = system.actorOf(SimpleSupervisor.props(Relayer.props(nodeParams, router, register, paymentHandler, triggerer, Some(postRestartCleanUpInitialized)), "relayer", SupervisorStrategy.Resume))
reputationRecorder = system.spawn(Behaviors.supervise(ReputationRecorder(nodeParams.relayParams.peerReputationConfig, Map.empty)).onFailure(typed.SupervisorStrategy.resume), name = "reputation-recorder")
@t-bast (Member) commented Aug 2, 2024

We must make this optional and let node operators disable it entirely if they'd like to run the standard blip-0004 relay. I've done this in #2893.
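
Something along these lines in Setup.scala, where the `enabled` setting name is an assumption and the spawn mirrors the quoted line:

```scala
// Hypothetical sketch: only spawn the reputation recorder when enabled in the configuration
// ('enabled' is an assumed setting name), and pass an Option down so that callers skip
// reputation queries entirely when it is absent.
reputationRecorder_opt =
  if (nodeParams.relayParams.peerReputationConfig.enabled)
    Some(system.spawn(
      Behaviors.supervise(ReputationRecorder(nodeParams.relayParams.peerReputationConfig, Map.empty))
        .onFailure(typed.SupervisorStrategy.resume),
      name = "reputation-recorder"))
  else None
```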


case class Confidence(value: Double)

def apply(reputationConfig: ReputationConfig, reputations: Map[(PublicKey, Int), Reputation]): Behavior[Command] = {

I see why you're keeping a per-incoming-endorsement reputation, but aren't you afraid that this will split the data points and we'll end up with some endorsement values for which we don't have enough entries to make a proper decision?

Why can't you instead maintain a single Reputation instance per incoming peer, but take their endorsement value into account in the scoring? That would be more robust, wouldn't it?
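
For illustration, a single per-peer reputation where the endorsement only weights each HTLC's contribution could look like this (the weighting itself is made up):

```scala
// Hypothetical sketch: one reputation per incoming peer, with the endorsement value only used to
// weight each HTLC's contribution to the score, instead of splitting data points across
// (peer, endorsement) keys. The exact weighting below is made up.
case class PeerReputation(weightedPaid: Double, weightedOffered: Double) {
  def score: Double = if (weightedOffered == 0) 0.0 else weightedPaid / weightedOffered

  def record(feeMsat: Long, endorsement: Int, fulfilled: Boolean): PeerReputation = {
    val weight = (endorsement + 1).toDouble / 8 // assuming endorsement values in [0, 7]
    PeerReputation(
      weightedPaid = if (fulfilled) weightedPaid + feeMsat * weight else weightedPaid,
      weightedOffered = weightedOffered + feeMsat * weight)
  }
}
```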

Comment on lines 568 to 574
// Jamming protection
// Must be the last checks so that they can be ignored for shadow deployment.
for ((amountMsat, i) <- incomingHtlcs.toSeq.map(_.amountMsat).sorted.zipWithIndex) {
if ((amountMsat.toLong < 1) || (math.log(amountMsat.toLong.toDouble) * params.localParams.maxAcceptedHtlcs / math.log(params.localParams.maxHtlcValueInFlightMsat.toLong.toDouble / params.localParams.maxAcceptedHtlcs) < i)) {
return Left(TooManySmallHtlcs(params.channelId, number = i + 1, below = amountMsat))
}
}
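
For reference (assuming the quoted condition is accurate), the check is equivalent to requiring the i-th smallest incoming HTLC (0-indexed) to carry at least `(maxHtlcValueInFlightMsat / maxAcceptedHtlcs)^(i / maxAcceptedHtlcs)` msat, so the allowed amounts grow exponentially with the index: with illustrative values maxAcceptedHtlcs = 30 and maxHtlcValueInFlightMsat = 1,000,000,000, the HTLC at index 15 must carry at least roughly 5,800 msat.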

We can't have jamming protection in receiveAdd: an error here would trigger a force-close, and errors here are only used when our peer has violated channel constraints they should be aware of. This isn't the case for jamming, where we apply a local (potentially dynamic) heuristic: we should accept those HTLCs and fail them instead of relaying them, but not raise an error in receiveAdd.

Comment on lines +883 to +900
val canSendAdds = active.map(_.canSendAdd(add.amountMsat, params, changes1, feerates, feeConf, cmd.confidence))
// Log only for jamming protection.
canSendAdds.collectFirst {
case Left(f: TooManySmallHtlcs) =>
log.info("TooManySmallHtlcs: {} outgoing HTLCs are below {}}", f.number, f.below)
Metrics.dropHtlc(f, Tags.Directions.Outgoing)
case Left(f: ConfidenceTooLow) =>
log.info("ConfidenceTooLow: confidence is {}% while channel is {}% full", (100 * f.confidence).toInt, (100 * f.occupancy).toInt)
Metrics.dropHtlc(f, Tags.Directions.Outgoing)
}
canSendAdds.flatMap { // TODO: We ignore jamming protection, delete this flatMap to activate jamming protection.
case Left(_: TooManySmallHtlcs) | Left(_: ConfidenceTooLow) => None
case x => Some(x)
}
.collectFirst { case Left(f) =>
Metrics.dropHtlc(f, Tags.Directions.Outgoing)
Left(f)
}

This seems overly complex: can't you keep sendAdd unchanged, and in canSendAdd instead of returning TooManySmallHtlcs or ConfidenceTooLow, simply write the log line and increment the metric?

This will duplicate the log/metric when we have pending splices, but pending splices shouldn't happen frequently enough to skew the data?

Or even better, since HTLCs are the same in all active commitments, you can remove the changes to canSendAdd entirely: here in sendAdd, when canSendAdd returns a success, evaluate the jamming conditions and log / increment the metric if needed.

@thomash-acinq (Member, Author) replied:

I've done this in #2898, however I don't like it because it just duplicates a lot of canSendAdd's code. I believe that jamming checks, like the other checks, should be done in canSendAdd. This piece of code may look overly complex, but the idea is to get rid of it when we deploy jamming protection, at which point TooManySmallHtlcs and ConfidenceTooLow will just be treated like any other error.

@thomash-acinq (Member, Author) commented:

The reason for the weird CancelRelay is that we need to take pending HTLCs into account in the reputation. If we receive two HTLCs at once, we don't want both of them to enjoy the same reputation; the second one should be penalized. If we only update the reputation after we have taken our decision to relay, we can get a data race.
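
To illustrate with made-up numbers (ignoring the pending-HTLC multiplier): if a peer has earned us 1,000 sat of fees out of 1,000 sat offered, its reputation is 1.0. If a first HTLC offering 100 sat of fees is immediately counted as pending (i.e. as failed for now), a second HTLC arriving concurrently sees a reputation of 1,000 / 1,100 ≈ 0.91 rather than 1.0, so only one of the two can clear a tight occupancy check; if the reputation were only updated after the relay decision, both would see 1.0.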

@t-bast (Member) commented Aug 2, 2024

> The reason for the weird CancelRelay is that we need to take pending HTLCs into account in the reputation. If we receive two HTLCs at once, we don't want both of them to enjoy the same reputation; the second one should be penalized. If we only update the reputation after we have taken our decision to relay, we can get a data race.

But you're not doing this for trampoline relay, so that race can already be exploited anyway? I don't think this matters much in practice though, because:

  • that race is hard to exploit, because between the call to the ReputationRecorder and the outgoing HTLC, there will be at most a few milliseconds
  • exploiting that race requires ensuring that we receive the incoming update_add_htlc at exactly the same time, and network delays cannot be trivially manipulated
  • at some point we will add a randomized delay before forwarding HTLCs, because it's good for privacy (and was discussed in Oakland), which will make this race almost impossible to exploit

@thomash-acinq (Member, Author) commented:

> 1. Once we start relaying (`ChannelRelay` / `NodeRelay`), we obtain the confidence value with `GetConfidence` and will include it in `CMD_ADD_HTLC`.
> 2. At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
> 3. In `Channel.scala`, when we actually send an outgoing `UpdateAddHtlc`, we emit an `OutgoingHtlcAdded` event to the event stream, that contains the outgoing HTLC and its `Origin.Hot`.
> 4. In `Channel.scala`, when an outgoing HTLC is failed or fulfilled, we emit an `OutgoingHtlcFailed` / `OutgoingHtlcFulfilled` event to the event stream.
> 5. The reputation recorder listens to those events, and updates the internal reputation state accordingly.
> 6. We don't use the `relayId` but rather the outgoing `channel_id` and `htlc_id`, combined with the origin to group HTLCs.
> 7. For trampoline payments, since the reputation recorder has the `Origin` information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.

I've tried doing that in #2897.
For channel relays it works fine; however, for trampoline relays I'm running into some problems:

  • We can't wait for outgoing HTLCs to be settled to compute the fees: we need to update the reputation as soon as we start relaying, and for that we need to know the fees.
  • Even when all HTLCs associated with a trampoline relay fail, it's not necessarily the end of this relay, because we may retry.

It seems to me that solving this would require adding more complexity than this refactoring was removing.

- change default parameters and explain them better
- remove checks on incoming HTLCs
@t-bast (Member) commented Aug 9, 2024

> We can't wait for outgoing HTLCs to be settled to compute the fees: we need to update the reputation as soon as we start relaying, and for that we need to know the fees.

It seems to me that we're trying to make trampoline fit into a box where it doesn't actually fit. One important aspect of trampoline is that the sender does not choose the outgoing channels and does not choose the fees: they allocate a total fee budget for the trampoline node, which tries to relay within that budget. The trampoline node will ensure that it earns at least its usual channel routing fees, otherwise it won't relay the payment. If the trampoline node is well connected, or the sender over-allocated fees, the trampoline node earns more than its usual routing fees: but I'm not sure that this extra fee should count towards reputation?

So I think we could handle trampoline relays in a simplified way that gets rid of those issues, by using the channel routing fees instead of trying to take the extra trampoline fees into account: when sending an outgoing HTLC with a trampoline origin, the fees we allocate to it in the reputation algorithm should just be this outgoing channel's routing fees (which can be included in the relay event since we have access to our channel update in the channel data).
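
A sketch of the fee the reputation algorithm would then be given for a trampoline-origin outgoing HTLC, using the standard channel relay fee formula from BOLT 7 (the helper name is made up):

```scala
// Illustrative helper (not eclair's actual code): under the proposal above, the fee attributed to
// the reputation algorithm for a trampoline-origin HTLC is simply the outgoing channel's routing
// fee for that amount, i.e. fee_base_msat + amount_msat * fee_proportional_millionths / 1e6.
def channelRoutingFeeMsat(amountMsat: Long, feeBaseMsat: Long, feeProportionalMillionths: Long): Long =
  feeBaseMsat + amountMsat * feeProportionalMillionths / 1000000

// e.g. relaying 100000000 msat over a channel charging 1000 msat base + 100 ppm:
// channelRoutingFeeMsat(100000000L, 1000L, 100L) == 11000 msat
```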

If the payment succeeds and we want to give a reputation bonus when we earned more fees than our channel routing fees, this should be easy to do as well, by splitting the extra fee between all the outgoing channels? But I'm not sure we should do this, because we can't really match an outgoing HTLC to a specific incoming channel 1-to-1, so it's probably better to just count our channel routing fees?

Do you think that model would make sense, or am I missing something?

@thomash-acinq (Member, Author) commented:

It seems like a good solution indeed, I'll try it.
