Reload pending outbound payments from ChannelMonitors on startup #1104

TheBlueMatt · 2021-10-04T05:42:33Z

When we add payment retries, we need some kind of on-disk consistency story. I think a compelling one is described, roughly at [1] - make the retryer persist first, before calling into the ChannelManager, then when reloading we either have a pending payment there or its pending nowhere. Once the send_payment method returns, we'll either have a pending payment persisted in both the retryer and a ChannelMonitor, or in neither. Note that we may not have anything in a ChannelManager until much later, preventing us from retrying the payment if we crash.

In such a case, we'll force-close the relevant channel (its ChannelMonitor has run ahead of the manager), but the actual HTLC will still be pending, just not present in ChannelManager at all. Ideally, we'd then be able to retry it over a whole new path.

This PR addresses this situation by having ChannelManager examine the available ChannelMonitors at startup, rebuilding pending payment entries on the basis of what's in its ChannelMonitors.

Note that we aren't particularly concerned with the payment resolution end of things - if ChannelManager thinks a payment is still pending when it has been fully resolved, we'll eventually consider it sent and generate a fresh PaymentSent when the ChannelMonitor resolves the current state.

This should fix #1102, but generally needs lots more tests (its not tested at all, aside from a hacky thing at the end to compare the loaded-from-disk and the rebuilt set of payments, which passes most tests).

[1] #1059 (comment)

codecov · 2021-10-05T06:26:04Z

Codecov Report

Merging #1104 (7af5d12) into main (107c6c7) will increase coverage by 0.06%.
The diff coverage is 96.03%.

@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
+ Coverage   90.40%   90.47%   +0.06%     
==========================================
  Files          68       68              
  Lines       34796    35176     +380     
==========================================
+ Hits        31458    31826     +368     
- Misses       3338     3350      +12

Impacted Files	Coverage Δ
lightning/src/util/ser.rs	`89.02% <80.00%> (-0.12%)`	⬇️
lightning/src/ln/channelmanager.rs	`83.86% <90.06%> (+0.34%)`	⬆️
lightning/src/ln/channel.rs	`88.37% <96.22%> (+0.05%)`	⬆️
lightning/src/chain/channelmonitor.rs	`91.11% <97.56%> (+0.17%)`	⬆️
lightning/src/ln/payment_tests.rs	`99.09% <99.26%> (+0.25%)`	⬆️
lightning/src/ln/functional_test_utils.rs	`95.09% <100.00%> (+<0.01%)`	⬆️
lightning/src/ln/functional_tests.rs	`97.37% <100.00%> (-0.04%)`	⬇️
lightning/src/ln/onion_route_tests.rs	`96.60% <100.00%> (ø)`
lightning/src/ln/features.rs	`99.43% <0.00%> (+<0.01%)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 107c6c7...7af5d12. Read the comment docs.

TheBlueMatt · 2021-10-05T17:34:54Z

This is now based on #1108, which should go first.

TheBlueMatt · 2021-10-05T19:28:56Z

Depends on #1109 as well now.

TheBlueMatt · 2021-10-05T23:27:29Z

Now includes at least a naive test of payment data reloading, which works 🎉

TheBlueMatt · 2021-10-20T01:26:03Z

Rebased on upstream!

valentinewallace

Mostly wondering how we handle failures that were previously fulfilled

lightning/src/ln/channelmanager.rs

lightning/src/chain/channelmonitor.rs

jkczyz

Sorry about any confusion around terminology. I don't have as a firm of grasp as I'd like to around some of this.

jkczyz · 2021-10-20T21:06:40Z

lightning/src/ln/channel.rs

+pub(super) struct RAAUpdates {
+	pub commitment_update: Option<msgs::CommitmentUpdate>,
+	pub to_forward_htlcs: Vec<(PendingHTLCInfo, u64)>,
+	pub failed_htlcs: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>,
+	pub finalized_claim_htlcs: Vec<HTLCSource>,
+	pub monitor_update: ChannelMonitorUpdate,
+	pub holding_cell_failed_htlcs: Vec<(HTLCSource, PaymentHash)>,
+}
+
+/// The return value of `monitor_updating_resotred`
+pub(super) struct MonitorRestoreUpdates {
+	pub raa: Option<msgs::RevokeAndACK>,
+	pub commitment_update: Option<msgs::CommitmentUpdate>,
+	pub order: RAACommitmentOrder,
+	pub forwards: Vec<(PendingHTLCInfo, u64)>,
+	pub failures: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>,
+	pub finalized_claims: Vec<HTLCSource>,
+	pub funding_broadcastable: Option<Transaction>,
+	pub funding_locked: Option<msgs::FundingLocked>,
+}


Where it makes sense, could we use consistent field names between these two structs?

jkczyz · 2021-10-20T21:08:17Z

lightning/src/ln/channel.rs

@@ -2899,8 +2927,14 @@ impl<Signer: Sign> Channel<Signer> {
 			}
 			self.monitor_pending_forwards.append(&mut to_forward_infos);
 			self.monitor_pending_failures.append(&mut revoked_htlcs);
+			self.monitor_pending_finalized_fulfills.append(&mut finalized_claim_htlcs);


In terms of terminology, is there a difference between a "fulfill" and a "claim"? Does it depend on the context?

It mostly just depends on when the code was written :p

jkczyz · 2021-10-20T21:10:03Z

lightning/src/ln/channel.rs

+					commitment_update: Some(commitment_update),
+					finalized_claim_htlcs,
+					to_forward_htlcs: to_forward_infos,
+					failed_htlcs: revoked_htlcs,


Similar question wrt "failed" and "revoked". Just want to make sure I understand if it is context dependent.

Same answer, there really isn't.

lightning/src/ln/channelmanager.rs

jkczyz · 2021-10-20T23:08:03Z

lightning/src/ln/channelmanager.rs

+		if let hash_map::Entry::Occupied(payment) = &payment_entry {
+			if !payment.get().is_retryable() {
+				return Err(APIError::RouteError {
+					err: "Payment already completed"
+				});
+			}


It's unclear to me why this shouldn't be if payment.get().is_complete() { as indicated by the error message. Do we want to hit this for legacy payments, too?

Hmm, yes, we probably should hit this for legacy payments. I mean its unreachable for legacy payments, I believe (retry_payment refuses to call send_payment_internal at all), but nothing wrong with an extra check here.

lightning/src/ln/channelmanager.rs

jkczyz · 2021-10-20T23:26:59Z

lightning/src/ln/channelmanager.rs

+	/// When a pending payment completes, we continue tracking it until all pendings HTLCs have
+	/// been resolved. This ensures we don't look up pending payments in ChannelMonitors on restart
+	/// and add a pending payment that was already completed.
+	Fulfilled {


Terminology question related to my earlier comment: is there a difference between "completes" and "fulfilled" as it pertains to the first sentence in the docs and the variant name?

No, I just missed the docs in the rename - once we get one HTLC fulfillment we consider the payment fulfilled.

TheBlueMatt · 2021-10-20T23:59:50Z

Pushed a fix for the lockorder comment, leaving other stuff until we settle on naming.

lightning/src/chain/channelmonitor.rs

TheBlueMatt · 2021-10-21T19:16:19Z

Ok, I believe I've addressed all feedback and renamed things to be more clear.

lightning/src/chain/channelmonitor.rs

lightning/src/ln/channel.rs

lightning/src/ln/channelmanager.rs

lightning/src/chain/channelmonitor.rs

valentinewallace · 2021-10-21T21:56:01Z

lightning/src/ln/channelmanager.rs

+			// We only rebuild the pending payments map if we were most recently serialized by
+			// 0.0.102+
+			for (_, monitor) in args.channel_monitors {
+				if by_id.get(&monitor.get_funding_txo().0.to_channel_id()).is_none() {


Hmm, I don't see why we don't load pending outbound payments for open channels as well. Couldn't we still have a situation where the ChannelManager doesn't get persisted after initiating an outbound payment, but the ChannelMonitor still knows about it?

The channel monitor knowing about a payment that the ChannelManager does not implies by definition that the ChannelMonitor has some newer state than the ChannelManager. It's possible I missed some edge case but it feels pretty straightforward?

Ah, right 🤦‍♀️

Fwiw, when I add this diff, it gets hit a fair amount of times and all tests pass:

diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs index af8a65b9..5d7ee3a5 100644 --- a/lightning/src/ln/channelmanager.rs +++ b/lightning/src/ln/channelmanager.rs @@ -5939,7 +5939,23 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> } } } - } + } else { + for (htlc_source, htlc) in monitor.get_pending_outbound_htlcs() { + if let HTLCSource::OutboundRoute { payment_id, session_priv, path, payment_secret, .. } = htlc_source { + let path_amt = path.last().unwrap().fee_msat; + let mut session_priv_bytes = [0; 32]; + session_priv_bytes[..].copy_from_slice(&session_priv[..]); + match pending_outbound_payments.as_mut().unwrap().entry(payment_id) { + hash_map::Entry::Occupied(mut entry) => { + assert!(!entry.get_mut().insert(session_priv_bytes, path_amt)); + }, + hash_map::Entry::Vacant(entry) => { + panic!("no entry for this payment"); + } + } + } + } +} } }

which is reassuring as well

valentinewallace

Think I'm ACK mod outstanding feedback. Might give it one more glance in the morning. Thanks for the additional test coverage!

lightning/src/ln/channelmanager.rs

valentinewallace · 2021-10-21T23:57:13Z

lightning/src/ln/channelmanager.rs

+			// We only rebuild the pending payments map if we were most recently serialized by
+			// 0.0.102+
+			for (_, monitor) in args.channel_monitors {
+				if by_id.get(&monitor.get_funding_txo().0.to_channel_id()).is_none() {


Ah, right 🤦‍♀️

Fwiw, when I add this diff, it gets hit a fair amount of times and all tests pass:

diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs index af8a65b9..5d7ee3a5 100644 --- a/lightning/src/ln/channelmanager.rs +++ b/lightning/src/ln/channelmanager.rs @@ -5939,7 +5939,23 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> } } } - } + } else { + for (htlc_source, htlc) in monitor.get_pending_outbound_htlcs() { + if let HTLCSource::OutboundRoute { payment_id, session_priv, path, payment_secret, .. } = htlc_source { + let path_amt = path.last().unwrap().fee_msat; + let mut session_priv_bytes = [0; 32]; + session_priv_bytes[..].copy_from_slice(&session_priv[..]); + match pending_outbound_payments.as_mut().unwrap().entry(payment_id) { + hash_map::Entry::Occupied(mut entry) => { + assert!(!entry.get_mut().insert(session_priv_bytes, path_amt)); + }, + hash_map::Entry::Vacant(entry) => { + panic!("no entry for this payment"); + } + } + } + } +} } }

which is reassuring as well

lightning/src/ln/channelmanager.rs

valentinewallace · 2021-10-22T00:24:28Z

lightning/src/ln/payment_tests.rs

@@ -266,3 +274,410 @@ fn no_pending_leak_on_initial_send_failure() {

 	assert!(!nodes[0].node.has_pending_payments());
 }
+
+#[test]
+fn retry_with_no_persist() {


Hmm, the new version of this test passes with the code that had the find_map bug

Ah! Yep, added even more coverage :)

jkczyz

Also pretty close on an ACK after remaining feedback is addressed.

@valentinewallace Nice work on finding the test gaps. I was staring at retry_with_no_persist trying to figure out if the bug was being exercised. Mutation testing FTW lol.

I wonder if there's any improvements we can make to our testing strategy to make reviewing a bit easier. Some of the tests are pretty lengthy/verbose but I'm not sure if they can be broken up into smaller tests and still exercise the behavior.

jkczyz · 2021-10-21T23:43:48Z

lightning/src/ln/channel.rs

@@ -3097,11 +3108,16 @@ impl<Signer: Sign> Channel<Signer> {
 	/// which failed. The messages which were generated from that call which generated the
 	/// monitor update failure must *not* have been sent to the remote end, and must instead
 	/// have been dropped. They will be regenerated when monitor_updating_restored is called.
-	pub fn monitor_update_failed(&mut self, resend_raa: bool, resend_commitment: bool, mut pending_forwards: Vec<(PendingHTLCInfo, u64)>, mut pending_fails: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>) {
+	pub fn monitor_update_failed(&mut self, resend_raa: bool, resend_commitment: bool,


nit: move these parameters down a line, aligned with the other parameters

For a small function that doesn't seem to look any better to my eye? Feels very "rustfmt wants your code to use as many lines as possible" to me :).

lightning/src/ln/functional_tests.rs

jkczyz · 2021-10-22T00:55:28Z

lightning/src/ln/payment_tests.rs

+	// Now reload nodes[1]...
+	persister = test_utils::TestPersister::new();
+	let keys_manager = &chanmon_cfgs[1].keys_manager;
+	new_chain_monitor = test_utils::TestChainMonitor::new(Some(nodes[1].chain_source), nodes[1].tx_broadcaster.clone(), nodes[1].logger, node_cfgs[1].fee_estimator, &persister, keys_manager);
+	nodes[1].chain_monitor = &new_chain_monitor;
+	let mut chan_0_monitor_read = &chan_0_monitor_serialized.0[..];
+	let (_, mut chan_0_monitor) = <(BlockHash, ChannelMonitor<EnforcingSigner>)>::read(
+		&mut chan_0_monitor_read, keys_manager).unwrap();
+	assert!(chan_0_monitor_read.is_empty());
+
+	let (_, nodes_1_deserialized_tmp) = {
+		let mut channel_monitors = HashMap::new();
+		channel_monitors.insert(chan_0_monitor.get_funding_txo().0, &mut chan_0_monitor);
+		<(BlockHash, ChannelManager<EnforcingSigner, &test_utils::TestChainMonitor, &test_utils::TestBroadcaster, &test_utils::TestKeysInterface, &test_utils::TestFeeEstimator, &test_utils::TestLogger>)>
+			::read(&mut io::Cursor::new(&chan_manager_serialized.0[..]), ChannelManagerReadArgs {
+				default_config: Default::default(),
+				keys_manager,
+				fee_estimator: node_cfgs[1].fee_estimator,
+				chain_monitor: nodes[1].chain_monitor,
+				tx_broadcaster: nodes[1].tx_broadcaster.clone(),
+				logger: nodes[1].logger,
+				channel_monitors,
+			}).unwrap()
+	};
+	nodes_1_deserialized = nodes_1_deserialized_tmp;


This is pretty verbose and looks repeated from retry_with_no_persist. Wondering if we could make a utility function for reloading so the tests are easier to read.

Yea, its repeated in quite a few tests. I believe we could may be macro it, but it requires a few variables declared at top of function scope. I'll look at it as a followup, its not a new problem.

lightning/src/ln/channelmanager.rs

valentinewallace

LGTM 🚀

lightning/src/ln/payment_tests.rs

TheBlueMatt · 2021-10-22T18:06:37Z

Pushed yet more test coverage, I believe all outstanding comments have been addressed.

valentinewallace

ACK after squash

jkczyz

ACK d9a85b3

This substantially improves readability at the callsite and in the function.

This improves readability at the callsite and in the function.

This allows us to read a `HashMap` that has values which may be skipped if they are some backwards-compatibility type. We also take this opportunity to fail deserialization if keys are duplicated.

When an HTLC has been failed, we track it up until the point there exists no broadcastable commitment transaction which has the HTLC present, at which point Channel returns the HTLCSource back to the ChannelManager, which fails the HTLC backwards appropriately. When an HTLC is fulfilled, however, we fulfill on the backwards path immediately. This is great for claiming upstream HTLCs, but when we want to track pending payments, we need to ensure we can check with ChannelMonitor data to rebuild pending payments. In order to do so, we need an event similar to the HTLC failure event, but for fulfills instead. Specifically, if we force-close a channel, we remove its off-chain `Channel` object entirely, at which point, on reload, we may notice HTLC(s) which are not present in our pending payments map (as they may have received a payment preimage, but not fully committed to it). Thus, we'd conclude we still have a retryable payment, which is untrue. This commit does so, informing the ChannelManager via a new return element where appropriate of the HTLCSource corresponding to the failed HTLC.

In the next commit, we will reload lost pending payments from ChannelMonitors during restart. However, in order to avoid re-adding pending payments which have already been fulfilled, we must ensure that we do not fully remove pending payments until all HTLCs for the payment have been fully removed from their ChannelMonitors. We do so here, introducing a new PendingOutboundPayment variant called `Completed` which only tracks the set of pending HTLCs.

If we go to send a payment, add the HTLC(s) to the channel(s), commit the ChannelMonitor updates to disk, and then crash, we'll come back up with no pending payments but HTLC(s) ready to be claim/failed. This makes it rather impractical to write a payment sender/retryer, as you cannot guarantee atomicity - you cannot guarantee you'll have retry data persisted even if the HTLC(s) are actually pending. Because ChannelMonitors are *the* atomically-persisted data in LDK, we lean on their current HTLC data to figure out what HTLC(s) are a part of an outbound payment, rebuilding the pending payments list on reload.

test_dup_htlc_onchain_fails_on_reload is now more of a payment_test than a functional_test, testing for handling of pending payments.

Peers probably shouldn't do this, but if they want to give us free money, we should take it and not generate any spurious events.

TheBlueMatt · 2021-10-22T18:42:19Z

Squashed without diff, will land after CI 🎉

$ git diff-tree -U1 d9a85b35b 7af5d127a
$

TheBlueMatt added this to the 0.0.102 milestone Oct 4, 2021

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from e97930f to a0fe042 Compare October 5, 2021 05:59

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from a0fe042 to f821c93 Compare October 5, 2021 23:26

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch 4 times, most recently from 6e60c4f to a7cea3d Compare October 11, 2021 00:53

TheBlueMatt modified the milestones: 0.0.102, 0.0.103 Oct 15, 2021

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from a7cea3d to 7d23578 Compare October 20, 2021 01:25

TheBlueMatt marked this pull request as ready for review October 20, 2021 01:25

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from 7d23578 to df8b291 Compare October 20, 2021 02:54

jkczyz self-requested a review October 20, 2021 17:56

valentinewallace reviewed Oct 20, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from df8b291 to 42a854c Compare October 20, 2021 21:13

jkczyz reviewed Oct 20, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from 2aef42c to ca6ae4d Compare October 21, 2021 00:00

valentinewallace reviewed Oct 21, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from ca6ae4d to a050236 Compare October 21, 2021 19:15

valentinewallace reviewed Oct 21, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from a050236 to f47acdb Compare October 21, 2021 20:41

valentinewallace reviewed Oct 21, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from f47acdb to 4687624 Compare October 21, 2021 23:14

valentinewallace reviewed Oct 22, 2021

View reviewed changes

jkczyz reviewed Oct 22, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from 4687624 to 4348833 Compare October 22, 2021 05:24

valentinewallace reviewed Oct 22, 2021

View reviewed changes

lightning/src/ln/payment_tests.rs Show resolved Hide resolved

valentinewallace approved these changes Oct 22, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch 2 times, most recently from b67f15c to d9a85b3 Compare October 22, 2021 18:06

valentinewallace approved these changes Oct 22, 2021

View reviewed changes

jkczyz approved these changes Oct 22, 2021

View reviewed changes

TheBlueMatt added 11 commits October 22, 2021 18:41

Make Channel::revoke_and_ack's return tuple a struct

c54d470

This substantially improves readability at the callsite and in the function.

Make Channel::monitor_updating_restored's return tuple a struct

fcb178f

This improves readability at the callsite and in the function.

Implement HashMap read for MaybeReadable values

f5b6e7f

This allows us to read a `HashMap` that has values which may be skipped if they are some backwards-compatibility type. We also take this opportunity to fail deserialization if keys are duplicated.

Rename payment object vars to refer to payments and not session IDs

2727c52

Add PaymentSecrets to HTLCSource::OutboundRoute objects

0d50508

Add some basic test coverage of monitor payment data reloading

5296a4a

Move test_dup_htlc_onchain_fails_on_reload to payment_tests

251eb3b

test_dup_htlc_onchain_fails_on_reload is now more of a payment_test than a functional_test, testing for handling of pending payments.

Add a test of an HTLC being fulfilled and then later failed

7af5d12

Peers probably shouldn't do this, but if they want to give us free money, we should take it and not generate any spurious events.

TheBlueMatt force-pushed the 2021-10-payment-id-in-monitors branch from d9a85b3 to 7af5d12 Compare October 22, 2021 18:41

TheBlueMatt merged commit 0a31c12 into lightningdevkit:main Oct 22, 2021

TheBlueMatt mentioned this pull request Oct 25, 2021

Store Payee info with HTLCs #1138

Merged

Reload pending outbound payments from ChannelMonitors on startup #1104

Reload pending outbound payments from ChannelMonitors on startup #1104

Conversation

TheBlueMatt commented Oct 4, 2021

codecov bot commented Oct 5, 2021 • edited Loading

Codecov Report

TheBlueMatt commented Oct 5, 2021

TheBlueMatt commented Oct 5, 2021

TheBlueMatt commented Oct 5, 2021

TheBlueMatt commented Oct 20, 2021

valentinewallace left a comment

Choose a reason for hiding this comment

jkczyz left a comment

Choose a reason for hiding this comment

jkczyz Oct 20, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TheBlueMatt commented Oct 20, 2021

TheBlueMatt commented Oct 21, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

valentinewallace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkczyz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

valentinewallace left a comment

Choose a reason for hiding this comment

TheBlueMatt commented Oct 22, 2021

valentinewallace left a comment

Choose a reason for hiding this comment

jkczyz left a comment

Choose a reason for hiding this comment

TheBlueMatt commented Oct 22, 2021

codecov bot commented Oct 5, 2021 •

edited

Loading

jkczyz Oct 20, 2021 •

edited

Loading