Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reload pending outbound payments from ChannelMonitors on startup #1104

Merged

Conversation

TheBlueMatt
Copy link
Collaborator

When we add payment retries, we need some kind of on-disk consistency story. I think a compelling one is described, roughly at [1] - make the retryer persist first, before calling into the ChannelManager, then when reloading we either have a pending payment there or its pending nowhere. Once the send_payment method returns, we'll either have a pending payment persisted in both the retryer and a ChannelMonitor, or in neither. Note that we may not have anything in a ChannelManager until much later, preventing us from retrying the payment if we crash.

In such a case, we'll force-close the relevant channel (its ChannelMonitor has run ahead of the manager), but the actual HTLC will still be pending, just not present in ChannelManager at all. Ideally, we'd then be able to retry it over a whole new path.

This PR addresses this situation by having ChannelManager examine the available ChannelMonitors at startup, rebuilding pending payment entries on the basis of what's in its ChannelMonitors.

Note that we aren't particularly concerned with the payment resolution end of things - if ChannelManager thinks a payment is still pending when it has been fully resolved, we'll eventually consider it sent and generate a fresh PaymentSent when the ChannelMonitor resolves the current state.

This should fix #1102, but generally needs lots more tests (its not tested at all, aside from a hacky thing at the end to compare the loaded-from-disk and the rebuilt set of payments, which passes most tests).

[1] #1059 (comment)

@TheBlueMatt TheBlueMatt added this to the 0.0.102 milestone Oct 4, 2021
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from e97930f to a0fe042 Compare October 5, 2021 05:59
@codecov
Copy link

codecov bot commented Oct 5, 2021

Codecov Report

Merging #1104 (7af5d12) into main (107c6c7) will increase coverage by 0.06%.
The diff coverage is 96.03%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
+ Coverage   90.40%   90.47%   +0.06%     
==========================================
  Files          68       68              
  Lines       34796    35176     +380     
==========================================
+ Hits        31458    31826     +368     
- Misses       3338     3350      +12     
Impacted Files Coverage Δ
lightning/src/util/ser.rs 89.02% <80.00%> (-0.12%) ⬇️
lightning/src/ln/channelmanager.rs 83.86% <90.06%> (+0.34%) ⬆️
lightning/src/ln/channel.rs 88.37% <96.22%> (+0.05%) ⬆️
lightning/src/chain/channelmonitor.rs 91.11% <97.56%> (+0.17%) ⬆️
lightning/src/ln/payment_tests.rs 99.09% <99.26%> (+0.25%) ⬆️
lightning/src/ln/functional_test_utils.rs 95.09% <100.00%> (+<0.01%) ⬆️
lightning/src/ln/functional_tests.rs 97.37% <100.00%> (-0.04%) ⬇️
lightning/src/ln/onion_route_tests.rs 96.60% <100.00%> (ø)
lightning/src/ln/features.rs 99.43% <0.00%> (+<0.01%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 107c6c7...7af5d12. Read the comment docs.

@TheBlueMatt
Copy link
Collaborator Author

This is now based on #1108, which should go first.

@TheBlueMatt
Copy link
Collaborator Author

Depends on #1109 as well now.

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from a0fe042 to f821c93 Compare October 5, 2021 23:26
@TheBlueMatt
Copy link
Collaborator Author

Now includes at least a naive test of payment data reloading, which works 🎉

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch 4 times, most recently from 6e60c4f to a7cea3d Compare October 11, 2021 00:53
@TheBlueMatt TheBlueMatt modified the milestones: 0.0.102, 0.0.103 Oct 15, 2021
@TheBlueMatt TheBlueMatt marked this pull request as ready for review October 20, 2021 01:25
@TheBlueMatt
Copy link
Collaborator Author

Rebased on upstream!

@jkczyz jkczyz self-requested a review October 20, 2021 17:56
Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly wondering how we handle failures that were previously fulfilled

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about any confusion around terminology. I don't have as a firm of grasp as I'd like to around some of this.

Comment on lines 344 to 363
pub(super) struct RAAUpdates {
pub commitment_update: Option<msgs::CommitmentUpdate>,
pub to_forward_htlcs: Vec<(PendingHTLCInfo, u64)>,
pub failed_htlcs: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>,
pub finalized_claim_htlcs: Vec<HTLCSource>,
pub monitor_update: ChannelMonitorUpdate,
pub holding_cell_failed_htlcs: Vec<(HTLCSource, PaymentHash)>,
}

/// The return value of `monitor_updating_resotred`
pub(super) struct MonitorRestoreUpdates {
pub raa: Option<msgs::RevokeAndACK>,
pub commitment_update: Option<msgs::CommitmentUpdate>,
pub order: RAACommitmentOrder,
pub forwards: Vec<(PendingHTLCInfo, u64)>,
pub failures: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>,
pub finalized_claims: Vec<HTLCSource>,
pub funding_broadcastable: Option<Transaction>,
pub funding_locked: Option<msgs::FundingLocked>,
}
Copy link
Contributor

@jkczyz jkczyz Oct 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where it makes sense, could we use consistent field names between these two structs?

@@ -2899,8 +2927,14 @@ impl<Signer: Sign> Channel<Signer> {
}
self.monitor_pending_forwards.append(&mut to_forward_infos);
self.monitor_pending_failures.append(&mut revoked_htlcs);
self.monitor_pending_finalized_fulfills.append(&mut finalized_claim_htlcs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of terminology, is there a difference between a "fulfill" and a "claim"? Does it depend on the context?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mostly just depends on when the code was written :p

commitment_update: Some(commitment_update),
finalized_claim_htlcs,
to_forward_htlcs: to_forward_infos,
failed_htlcs: revoked_htlcs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question wrt "failed" and "revoked". Just want to make sure I understand if it is context dependent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same answer, there really isn't.

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
Comment on lines 2018 to 2023
if let hash_map::Entry::Occupied(payment) = &payment_entry {
if !payment.get().is_retryable() {
return Err(APIError::RouteError {
err: "Payment already completed"
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me why this shouldn't be if payment.get().is_complete() { as indicated by the error message. Do we want to hit this for legacy payments, too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, we probably should hit this for legacy payments. I mean its unreachable for legacy payments, I believe (retry_payment refuses to call send_payment_internal at all), but nothing wrong with an extra check here.

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
Comment on lines 419 to 444
/// When a pending payment completes, we continue tracking it until all pendings HTLCs have
/// been resolved. This ensures we don't look up pending payments in ChannelMonitors on restart
/// and add a pending payment that was already completed.
Fulfilled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminology question related to my earlier comment: is there a difference between "completes" and "fulfilled" as it pertains to the first sentence in the docs and the variant name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I just missed the docs in the rename - once we get one HTLC fulfillment we consider the payment fulfilled.

@TheBlueMatt
Copy link
Collaborator Author

Pushed a fix for the lockorder comment, leaving other stuff until we settle on naming.

lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
@TheBlueMatt
Copy link
Collaborator Author

Ok, I believe I've addressed all feedback and renamed things to be more clear.

lightning/src/chain/channelmonitor.rs Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
// We only rebuild the pending payments map if we were most recently serialized by
// 0.0.102+
for (_, monitor) in args.channel_monitors {
if by_id.get(&monitor.get_funding_txo().0.to_channel_id()).is_none() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't see why we don't load pending outbound payments for open channels as well. Couldn't we still have a situation where the ChannelManager doesn't get persisted after initiating an outbound payment, but the ChannelMonitor still knows about it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channel monitor knowing about a payment that the ChannelManager does not implies by definition that the ChannelMonitor has some newer state than the ChannelManager. It's possible I missed some edge case but it feels pretty straightforward?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right 🤦‍♀️

Fwiw, when I add this diff, it gets hit a fair amount of times and all tests pass:

diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index af8a65b9..5d7ee3a5 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -5939,7 +5939,23 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
                                                        }
                                                }
                                        }
-                               }
+                               } else {
+                                       for (htlc_source, htlc) in monitor.get_pending_outbound_htlcs() {
+                                               if let HTLCSource::OutboundRoute { payment_id, session_priv, path, payment_secret, .. } = htlc_source {
+                                                       let path_amt = path.last().unwrap().fee_msat;
+                                                       let mut session_priv_bytes = [0; 32];
+                                                       session_priv_bytes[..].copy_from_slice(&session_priv[..]);
+                                                       match pending_outbound_payments.as_mut().unwrap().entry(payment_id) {
+                                                               hash_map::Entry::Occupied(mut entry) => {
+                                                                       assert!(!entry.get_mut().insert(session_priv_bytes, path_amt));
+                                                               },
+                                                               hash_map::Entry::Vacant(entry) => {
+                                                                       panic!("no entry for this payment");
+                                                               }
+                                                       }
+                                               }
+                                       }
+}
                        }
                }

which is reassuring as well

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think I'm ACK mod outstanding feedback. Might give it one more glance in the morning. Thanks for the additional test coverage!

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
// We only rebuild the pending payments map if we were most recently serialized by
// 0.0.102+
for (_, monitor) in args.channel_monitors {
if by_id.get(&monitor.get_funding_txo().0.to_channel_id()).is_none() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right 🤦‍♀️

Fwiw, when I add this diff, it gets hit a fair amount of times and all tests pass:

diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index af8a65b9..5d7ee3a5 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -5939,7 +5939,23 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
                                                        }
                                                }
                                        }
-                               }
+                               } else {
+                                       for (htlc_source, htlc) in monitor.get_pending_outbound_htlcs() {
+                                               if let HTLCSource::OutboundRoute { payment_id, session_priv, path, payment_secret, .. } = htlc_source {
+                                                       let path_amt = path.last().unwrap().fee_msat;
+                                                       let mut session_priv_bytes = [0; 32];
+                                                       session_priv_bytes[..].copy_from_slice(&session_priv[..]);
+                                                       match pending_outbound_payments.as_mut().unwrap().entry(payment_id) {
+                                                               hash_map::Entry::Occupied(mut entry) => {
+                                                                       assert!(!entry.get_mut().insert(session_priv_bytes, path_amt));
+                                                               },
+                                                               hash_map::Entry::Vacant(entry) => {
+                                                                       panic!("no entry for this payment");
+                                                               }
+                                                       }
+                                               }
+                                       }
+}
                        }
                }

which is reassuring as well

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
@@ -266,3 +274,410 @@ fn no_pending_leak_on_initial_send_failure() {

assert!(!nodes[0].node.has_pending_payments());
}

#[test]
fn retry_with_no_persist() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the new version of this test passes with the code that had the find_map bug

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Yep, added even more coverage :)

Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also pretty close on an ACK after remaining feedback is addressed.

@valentinewallace Nice work on finding the test gaps. I was staring at retry_with_no_persist trying to figure out if the bug was being exercised. Mutation testing FTW lol.

I wonder if there's any improvements we can make to our testing strategy to make reviewing a bit easier. Some of the tests are pretty lengthy/verbose but I'm not sure if they can be broken up into smaller tests and still exercise the behavior.

@@ -3097,11 +3108,16 @@ impl<Signer: Sign> Channel<Signer> {
/// which failed. The messages which were generated from that call which generated the
/// monitor update failure must *not* have been sent to the remote end, and must instead
/// have been dropped. They will be regenerated when monitor_updating_restored is called.
pub fn monitor_update_failed(&mut self, resend_raa: bool, resend_commitment: bool, mut pending_forwards: Vec<(PendingHTLCInfo, u64)>, mut pending_fails: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>) {
pub fn monitor_update_failed(&mut self, resend_raa: bool, resend_commitment: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move these parameters down a line, aligned with the other parameters

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a small function that doesn't seem to look any better to my eye? Feels very "rustfmt wants your code to use as many lines as possible" to me :).

lightning/src/ln/functional_tests.rs Outdated Show resolved Hide resolved
Comment on lines +642 to +686
// Now reload nodes[1]...
persister = test_utils::TestPersister::new();
let keys_manager = &chanmon_cfgs[1].keys_manager;
new_chain_monitor = test_utils::TestChainMonitor::new(Some(nodes[1].chain_source), nodes[1].tx_broadcaster.clone(), nodes[1].logger, node_cfgs[1].fee_estimator, &persister, keys_manager);
nodes[1].chain_monitor = &new_chain_monitor;
let mut chan_0_monitor_read = &chan_0_monitor_serialized.0[..];
let (_, mut chan_0_monitor) = <(BlockHash, ChannelMonitor<EnforcingSigner>)>::read(
&mut chan_0_monitor_read, keys_manager).unwrap();
assert!(chan_0_monitor_read.is_empty());

let (_, nodes_1_deserialized_tmp) = {
let mut channel_monitors = HashMap::new();
channel_monitors.insert(chan_0_monitor.get_funding_txo().0, &mut chan_0_monitor);
<(BlockHash, ChannelManager<EnforcingSigner, &test_utils::TestChainMonitor, &test_utils::TestBroadcaster, &test_utils::TestKeysInterface, &test_utils::TestFeeEstimator, &test_utils::TestLogger>)>
::read(&mut io::Cursor::new(&chan_manager_serialized.0[..]), ChannelManagerReadArgs {
default_config: Default::default(),
keys_manager,
fee_estimator: node_cfgs[1].fee_estimator,
chain_monitor: nodes[1].chain_monitor,
tx_broadcaster: nodes[1].tx_broadcaster.clone(),
logger: nodes[1].logger,
channel_monitors,
}).unwrap()
};
nodes_1_deserialized = nodes_1_deserialized_tmp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty verbose and looks repeated from retry_with_no_persist. Wondering if we could make a utility function for reloading so the tests are easier to read.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, its repeated in quite a few tests. I believe we could may be macro it, but it requires a few variables declared at top of function scope. I'll look at it as a followup, its not a new problem.

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

lightning/src/ln/payment_tests.rs Show resolved Hide resolved
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch 2 times, most recently from b67f15c to d9a85b3 Compare October 22, 2021 18:06
@TheBlueMatt
Copy link
Collaborator Author

Pushed yet more test coverage, I believe all outstanding comments have been addressed.

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK after squash

Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK d9a85b3

This substantially improves readability at the callsite and in the
function.
This improves readability at the callsite and in the function.
This allows us to read a `HashMap` that has values which may be
skipped if they are some backwards-compatibility type.

We also take this opportunity to fail deserialization if keys are
duplicated.
When an HTLC has been failed, we track it up until the point there
exists no broadcastable commitment transaction which has the HTLC
present, at which point Channel returns the HTLCSource back to the
ChannelManager, which fails the HTLC backwards appropriately.

When an HTLC is fulfilled, however, we fulfill on the backwards path
immediately. This is great for claiming upstream HTLCs, but when we
want to track pending payments, we need to ensure we can check with
ChannelMonitor data to rebuild pending payments. In order to do so,
we need an event similar to the HTLC failure event, but for
fulfills instead.

Specifically, if we force-close a channel, we remove its off-chain
`Channel` object entirely, at which point, on reload, we may notice
HTLC(s) which are not present in our pending payments map (as they
may have received a payment preimage, but not fully committed to
it). Thus, we'd conclude we still have a retryable payment, which
is untrue.

This commit does so, informing the ChannelManager via a new return
element where appropriate of the HTLCSource corresponding to the
failed HTLC.
In the next commit, we will reload lost pending payments from
ChannelMonitors during restart. However, in order to avoid
re-adding pending payments which have already been fulfilled, we
must ensure that we do not fully remove pending payments until all
HTLCs for the payment have been fully removed from their
ChannelMonitors.

We do so here, introducing a new PendingOutboundPayment variant
called `Completed` which only tracks the set of pending HTLCs.
If we go to send a payment, add the HTLC(s) to the channel(s),
commit the ChannelMonitor updates to disk, and then crash, we'll
come back up with no pending payments but HTLC(s) ready to be
claim/failed.

This makes it rather impractical to write a payment sender/retryer,
as you cannot guarantee atomicity - you cannot guarantee you'll
have retry data persisted even if the HTLC(s) are actually pending.

Because ChannelMonitors are *the* atomically-persisted data in LDK,
we lean on their current HTLC data to figure out what HTLC(s) are a
part of an outbound payment, rebuilding the pending payments list
on reload.
test_dup_htlc_onchain_fails_on_reload is now more of a
payment_test than a functional_test, testing for handling of
pending payments.
Peers probably shouldn't do this, but if they want to give us free
money, we should take it and not generate any spurious events.
@TheBlueMatt
Copy link
Collaborator Author

Squashed without diff, will land after CI 🎉

$ git diff-tree -U1 d9a85b35b 7af5d127a
$

@TheBlueMatt TheBlueMatt merged commit 0a31c12 into lightningdevkit:main Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants