-
Notifications
You must be signed in to change notification settings - Fork 895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pay hands back failures to gossipd #638
pay hands back failures to gossipd #638
Conversation
9b4bfda
to
0805958
Compare
Rebase on latest, remove |
0805958
to
450da48
Compare
Added saving of route nodes and route channels to db, rebased on master to get #670 into this PR also. Next step (actually report on failure) should be easier. @cdecker @rustyrussell I would like to request to review especially the new functions Sorry for the massive delay, this saving of route, was a bigger adventure than I expected! |
@rustyrussell : sorry, I would like to ask more detailed help on BOLT#4 and gossipd routing... Below is my understanding: The Now, when a node at |
450da48
to
b5d16e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice PR so far, only minor nits.
gossipd/gossip.c
Outdated
@@ -1769,6 +1769,28 @@ static struct io_plan *handle_txout_reply(struct io_conn *conn, | |||
return daemon_conn_read_next(conn, &daemon->master); | |||
} | |||
|
|||
static struct io_plan *routing_failure_req(struct io_conn *conn, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: _req
suggests that there is a reply to this, just routing_failure
should be fine.
gossipd/gossip.c
Outdated
{ | ||
struct pubkey erring_node; | ||
struct short_channel_id erring_channel; | ||
u16 failcode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering whether this can be enum onion_type
right away. Would save us the cast down below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onion_type
itself seems generated, and does not seem to play nice when used in a csv file source. Or maybe some other way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it works, just need with #include <wire/gen_onion_wire.h> in the CSV file.
gossipd/routing.h
Outdated
@@ -21,6 +22,8 @@ struct node_connection { | |||
|
|||
/* Is this connection active? */ | |||
bool active; | |||
/* Is this connection permanently inactive? */ | |||
bool perminactive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Permanent failure means that the channel will not be able to recover from this error, or the channel is unusable for our purposes, so we should rather remove them from our network view. That may result in us re-adding it when syncing gossip with a new peer, but we'll eventually not do the gossip dump.
wallet/db.c
Outdated
return true; | ||
} | ||
struct short_channel_id * | ||
sqlite3_column_short_channel_id_array(const tal_t *ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could simplify this a lot by using the uint64
serialization format, so that each short_channel_id
is exactly 8 bytes in size. No need for separator. That'd be a bit inconsistent with the way we serialize the scid
in the channels
table, but much simpler here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one concern is also byte order? That seems to be the main reason for why scid is serialized as string rather than directly as blob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The byte order is fixed in the protocol, otherwise we couldn't communicate them at all. I just used the string serialization because I had it handy at the time, and we were stringifying everything else as well. We have since migrated to the sqlite3_bind_*
interfaces for binaries, which are much safer and more performant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I was confused by what was meant by "uint64
serialization format", are you referring to fromwire_short_channel_id
and towire_short_channel_id
? If so, probably I should indeed use that.
It seems I can chain multiple calls to towire_short_channel_id
, so maybe:
for (i = 0; i < num; ++i) {
towire_short_channel_id(&allser, &id[i]);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's it 😃
b5d16e6
to
b3804a0
Compare
Fixed up as per review:
|
b3804a0
to
b6d0fce
Compare
Fixed up as per review:
Added commit:
|
Now report payment routing failures to gossip. I assume my thoughts in #638 (comment) are correct and that is what is implemented. Re
In specifics, I want to add a Boolean field, |
@rustyrussell one thing I want to bring up for |
@rustyrussell So I propose the below new result value for
|
Conflicts with #756 btw, nothing too bad |
Since I created the conflict, I'm also happy to fix it once one of the two gets merged ^^ |
I just crashed in routing_failure:
some context: I was trying to pay an invoice and it was spamming:
I killed the process and brought it up again, and then it crashed with the log above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this is really good; just some stylistic nits where you've written a little too much code :)
gossipd/gossip.c
Outdated
{ | ||
struct pubkey erring_node; | ||
struct short_channel_id erring_channel; | ||
u16 failcode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it works, just need with #include <wire/gen_onion_wire.h> in the CSV file.
gossipd/routing.c
Outdated
} | ||
|
||
return NULL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very specialized function, which only makes sense because of the current caller.
How about a 'get_nc_from_node(rstate, node, scid)' which simply returns the pointer to the node_connection; then do the work below.
gossipd/routing.c
Outdated
|
||
u64 num_ins = tal_count(node->in); | ||
u64 num_outs = tal_count(node->out); | ||
u64 num = num_ins + num_outs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tal_count returns size_t, which is pretty natural for sizes; u64 looks a bit weird here.
gossipd/routing.c
Outdated
* consideration. | ||
* | ||
*/ | ||
if ((failcode & NODE) != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (failcode & NODE)
is a little more idiomatic.
gossipd/routing.c
Outdated
get_all_node_connections_of(const tal_t *ctx, | ||
struct routing_state *rstate, | ||
const struct node *node) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would open code this, see below.
gossipd/routing.c
Outdated
if ((failcode & NODE) != 0) | ||
ncs = get_all_node_connections_of(tmpctx, rstate, node); | ||
else | ||
ncs = get_out_node_connection_of(tmpctx, rstate, node, scid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
if (failcode & NODE) {
for (i = 0; i < tal_count(node->in); i++)
handle_route_fail(rstate, node->in[i], failcode);
for (i = 0; i < tal_count(node->out); i++)
handle_route_fail(rstate, node->out[i], failcode);
} else {
struct node_connection *nc = get_nc_from_node(node, scid);
/* Possible, if something else deleted nc since we sent payment */
if (!nc)
status_trace("Error %s on node %s scid %s: no such connection any more");
else
handle_route_fail(rstate, nc, failcode);
gossipd/routing.c
Outdated
* this after deactivating, so that if the | ||
* channel_update is newer it will be | ||
* reactivated. */ | ||
if ((failcode & UPDATE) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor style nit: if (failcode & UPDATE) {
is more idiomatic.
lightningd/pay.c
Outdated
if (failure->origin_index >= tal_count(payment->route_nodes)) { | ||
/* Is it permanent? */ | ||
if ((failcode & PERM) != 0) | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should still hand to gossipd!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what will we hand to gossipd? In particular, what outgoing channel will we report via gossip_routing_failure
message?
From my reading of BOLT 4, below is the list of possible errors a final payee node can return:
invalid_realm
- permanent non-node error, but this "should" only happen if we send a realm that the recipient is ignorant of, and that indicates a bug on our side as presumably future realm types (that old nodes might not recognize) should be enabled in a global features bit. Arguably this should be treated a a node error, as apparently we think that node supports some new realm byte, but the node itself denies cognizance of that realm byte.temporary_node_failure
- node error, so outgoing channel ignored, no problem.permanent_node_failure
- node error, so outgoing channel ignored, no problem.required_node_feature_missing
- similar toinvalid_realm
(permanent non-node error, this time the receiver thinks we support some feature but we ourselves are ignorant of that feature and did not pay attention to the even feature bit).unknown_payment_hash
- permanent non-node error, but may happen for example if payee is c-lightning and the invoice expired before the payment reached it.incorrect_payment_amount
- permanent non-node error, but is arguably a bug on our side where some roundoff or some other issue made us go beyond the limits of the invoice being paid.final_expiry_too_soon
- temporary non-node error, but we can actually reuse the exact same route, we just need to adjust the CLTVs.final_incorrect_cltv_expiry
- temporary non-node error, but maybe this indicates a bug on our side?final_incorrect_htlc_amount
- temporary non-node error, similar tofinal_incorrect_cltv_expiry
.
So for now, I think, it is sensible to report node-level errors on the final node to gossip, but I am uncertain what channel to deactivate for the non-node errors; non-node errors at the final destination seem to suggest a bug or disagreement between us and the final node.
I will fix the other review items for now but I want to know your opinion for the final node case in more detail first.
lightningd/pay.c
Outdated
failure->msg, NULL, | ||
&channel_update)) | ||
/* No channel update. */ | ||
channel_update = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so ugly it needs to be in its own function, I think!
As to sendpay, no, we need to return an error when there's a routing problem. However, we should be jsonrpc 2.0 compliant, which allows a 'data' field in errors; we should put in the information about exactly what happened in this case. There remain two problems (which DO NOT need to be fixed in this series!):
|
094ad5d
to
8ebe1be
Compare
@rustyrussell itl looks like we currently violate json rpc 2.0 as we return a string |
Okay, for |
d060bd8
to
3de856f
Compare
…e of immediate channeld error.
3de856f
to
b10fa9a
Compare
#858 merged, so rebased on top of master. Should now be OK for merging if it passes CI. |
CI OK. @cdecker @rustyrussell please merge! |
Removing the WIP label, but it's too late for me to review now. Will do it first thing tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congratulations! This is an excellent patch set. I'm applying it now..
|
||
json_pay_failed(hout->cmd, NULL, failcode, "reply from remote"); | ||
json_pay_failed(hout->cmd, failcode, failmsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future work: we should provide more detailed failmsg: unlike logging, that's returned to the caller who is definitely interested!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue to track this: #866
good stuff @ZmnSCPxj ! |
Fixes: #550
Eventually. Remaining:
sendpay
should differ frompay
. Presumablysendpay
only tries the specific route specified and should probably return the result of the attempt immediately without retrying, whilepay
does a best-effort and keeps retrying until a permanent failure at the payee or no-route-found from gossipd.ImplementDo this in a later PR instead, it is big enough already!data
error return onsendpay
if a routing error occurred.