Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pay hands back failures to gossipd #638

Merged
merged 10 commits into from
Feb 1, 2018

Conversation

ZmnSCPxj
Copy link
Collaborator

@ZmnSCPxj ZmnSCPxj commented Jan 17, 2018

Fixes: #550

Eventually. Remaining:

  • Save the route somewhere somehow on lightningd's side. Maybe as a blob on the db? Or just in-memory (with an intervening shutdown/restart losing the route information -> an intervening shutdown/restart will restart gossipd's mapping information too anyway, so why retain (possibly obsolete) routing information?).
  • Have the route failure reported on routing failure.
  • Hash out exactly how sendpay should differ from pay. Presumably sendpay only tries the specific route specified and should probably return the result of the attempt immediately without retrying, while pay does a best-effort and keeps retrying until a permanent failure at the payee or no-route-found from gossipd.
  • Report failures of local channels also.
  • Implement data error return on sendpay if a routing error occurred. Do this in a later PR instead, it is big enough already!
  • Create test for routing reporting.

@ZmnSCPxj
Copy link
Collaborator Author

Rebase on latest, remove gossip_routing_failure_reply.

@ZmnSCPxj
Copy link
Collaborator Author

Added saving of route nodes and route channels to db, rebased on master to get #670 into this PR also.

Next step (actually report on failure) should be easier.

@cdecker @rustyrussell I would like to request to review especially the new functions sqlite3_column_short_channel_id_array, sqlite3_bind_short_channel_id_array, sqlite3_column_pubkey_array, and sqlite3_bind_pubkey_array.

Sorry for the massive delay, this saving of route, was a bigger adventure than I expected!

@ZmnSCPxj
Copy link
Collaborator Author

@rustyrussell : sorry, I would like to ask more detailed help on BOLT#4 and gossipd routing...

Below is my understanding: The gossip_getroute_request returns an array of route_hop structures, each of which contains a nodeid and a channel_id. For each route_hop, the channel_id is the channel that feeds into the nodeid beside it. So for example, for route[0], the channel_id connects the local node to the first hop node. Is my understanding correct?

Now, when a node at origin_index reports a channel-level error, in actuality, the error is from the next channel in the route. For example, if route[origin_index].nodeid reported a channel-level error, we should inactivate the route[origin_index + 1].channel_id channel. Is my understanding correct? I want to avoid a jgarzik off-by-one error.

Copy link
Member

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR so far, only minor nits.

gossipd/gossip.c Outdated
@@ -1769,6 +1769,28 @@ static struct io_plan *handle_txout_reply(struct io_conn *conn,
return daemon_conn_read_next(conn, &daemon->master);
}

static struct io_plan *routing_failure_req(struct io_conn *conn,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: _req suggests that there is a reply to this, just routing_failure should be fine.

gossipd/gossip.c Outdated
{
struct pubkey erring_node;
struct short_channel_id erring_channel;
u16 failcode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering whether this can be enum onion_type right away. Would save us the cast down below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onion_type itself seems generated, and does not seem to play nice when used in a csv file source. Or maybe some other way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it works, just need with #include <wire/gen_onion_wire.h> in the CSV file.

@@ -21,6 +22,8 @@ struct node_connection {

/* Is this connection active? */
bool active;
/* Is this connection permanently inactive? */
bool perminactive;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Permanent failure means that the channel will not be able to recover from this error, or the channel is unusable for our purposes, so we should rather remove them from our network view. That may result in us re-adding it when syncing gossip with a new peer, but we'll eventually not do the gossip dump.

wallet/db.c Outdated
return true;
}
struct short_channel_id *
sqlite3_column_short_channel_id_array(const tal_t *ctx,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could simplify this a lot by using the uint64 serialization format, so that each short_channel_id is exactly 8 bytes in size. No need for separator. That'd be a bit inconsistent with the way we serialize the scid in the channels table, but much simpler here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one concern is also byte order? That seems to be the main reason for why scid is serialized as string rather than directly as blob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The byte order is fixed in the protocol, otherwise we couldn't communicate them at all. I just used the string serialization because I had it handy at the time, and we were stringifying everything else as well. We have since migrated to the sqlite3_bind_* interfaces for binaries, which are much safer and more performant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I was confused by what was meant by "uint64 serialization format", are you referring to fromwire_short_channel_id and towire_short_channel_id? If so, probably I should indeed use that.

It seems I can chain multiple calls to towire_short_channel_id, so maybe:

for (i = 0; i < num; ++i) {
    towire_short_channel_id(&allser, &id[i]);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's it 😃

@ZmnSCPxj
Copy link
Collaborator Author

Fixed up as per review:

  1. Rename routing_failure_req to handle_routing_failure.
  2. For PERM failures, instead of setting a perminactive flag that prevents active from being set ever, outright delete affected nod_connections.

@ZmnSCPxj
Copy link
Collaborator Author

Fixed up as per review:

  1. Use fromwire_short_channel_id and towire_short_channel_id to serialize portably each item of the channel ID array.

Added commit:

  1. Also receive channel_update on the gossip_routing_failure message, and handle them in routing_failure function.

@ZmnSCPxj
Copy link
Collaborator Author

Now report payment routing failures to gossip. I assume my thoughts in #638 (comment) are correct and that is what is implemented.


Re sendpay and pay, my thoughts are:

  1. sendpay tries only the specific route.
  2. pay retries again and again until unable to find route, or permanent failure at destination.

In specifics, I want to add a Boolean field, repeat, whether the JSON-RPC sendpay was used or JSON-RPC pay, to wallet_payment. In addition, for voyeurism, failed_tries count, which is incremented each time pay fails and loops. However, send_payment function deletes existing payment entry... uncertain.

@ZmnSCPxj
Copy link
Collaborator Author

@rustyrussell one thing I want to bring up for sendpay is that instead of erroring on failure, perhaps it should return the error information (erring node, erring channel, and returned error code and message) i.e. it reports to the controller as well as to gossipd. This is to allow the controller to implement its own routing algorithm. The controller has its own routing algorithm, runs it, uses sendpay, gets feedback it feeds into its own routing algorithm, reruns routing, uses sendpay, and so on. This seems to be a more useful low-level interface than the current sendpay that simply returns an undifferentiated error.

@ZmnSCPxj
Copy link
Collaborator Author

@rustyrussell So I propose the below new result value for sendpay command:

{ "succeeded": true } // The payment succeeded
    or
{ "succeeded": false
, "failcode": 42 // the failcode reported
, "erring_node": "<hexid>"
, "erring_channel": "<short-channel-id>"
, "message": "<hexbytes>" // The complete error message
}

sendpay will now only error in case of malformed arguments.

@jb55
Copy link
Collaborator

jb55 commented Jan 25, 2018

Conflicts with #756 btw, nothing too bad

@cdecker
Copy link
Member

cdecker commented Jan 25, 2018

Conflicts with #756 btw, nothing too bad

Since I created the conflict, I'm also happy to fix it once one of the two gets merged ^^

@jb55
Copy link
Collaborator

jb55 commented Jan 25, 2018

I just crashed in routing_failure:

      1 lightning_channeld(31552): TRACE: Can't send commit: waiting for revoke_and_ack
      1 lightningd(31428): Adding block 1260392: 00000000000005cf2f569187c1fad48b0ae07c73e44d061219b04cd786aeb5f6
      2 lightning_channeld(31552): TRACE: Can't send commit: waiting for revoke_and_ack
      1 lightningd(31428): Adding block 1260393: 000000000000066443e02e03dd7284ce95a379ea0efa50f160d2c189aeb58c85
      1 lightning_channeld(31552): TRACE: Can't send commit: waiting for revoke_and_ack
      1 lightningd(31428): Adding block 1260394: 0000000000000511ab5f0b70ea0619aa0f5022bfbf706f08933ca35193532e40
      1 lightning_channeld(31552): TRACE: Can't send commit: waiting for revoke_and_ack
      1 lightning_channeld(31552): TRACE: Read decrypt 0085020a0b4b2efce9660e0f6a302fac19fb3e9a3f9f64036fb5d5090146df058abdbf067335db19f489e010547aef751b3505761b6f9ebdc288c8a8554b8ab334a802e8b8ac45e34a184fc3d78eb822307b9d8508e57c599b3abe83c281fc3226c311
      1 lightning_channeld(31552): TRACE: peer_in WIRE_REVOKE_AND_ACK
      1 lightning_channeld(31552): TRACE: Received revoke_and_ack
      1 lightning_channeld(31552): TRACE: htlc 3: SENT_REMOVE_ACK_COMMIT->RCVD_REMOVE_ACK_REVOCATION
      1 lightning_channeld(31552): TRACE: rcvd_revoke_and_ack: HTLC LOCAL 3 = RCVD_REMOVE_ACK_REVOCATION/SENT_REMOVE_ACK_REVOCATION FAILED
      1 lightning_channeld(31552): TRACE: No commits outstanding after recv revoke_and_ack
      1 lightning_channeld(31552): TRACE: HTLC 3[LOCAL] => RCVD_REMOVE_ACK_REVOCATION
      1 lightning_channeld(31552): TRACE: Sending master 1022
      1 lightning_channeld(31552): UPDATE WIRE_CHANNEL_GOT_REVOKE
      1 lightningd(31428): peer 03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134: got revoke 14: 1 changed
      1 lightningd(31428): peer 03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134: HTLC out 3 SENT_REMOVE_ACK_COMMIT->RCVD_REMOVE_ACK_REVOCATION
      1 lightningd(31428): peer 03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134: Removing out HTLC 3 state RCVD_REMOVE_ACK_REVOCATION REMOTEFAIL
      1 lightningd(31428): peer 03933884aaf1d6b108397e5efe5c86bcf2d8ca8d2f700eda99db9214fc2712b134: htlc 3 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
      1 lightning_channeld(31552): TRACE: ... , awaiting 1122
      1 lightning_channeld(31552): TRACE: Got it!
      1 lightningd(31428): Adding block 1260395: 00000000000005179d3d88287a69b5ed3f732a1c6816231fa1b5b735cd4f77f0
      1 lightning_channeld(31552): TRACE: revoke_and_ack LOCAL: remote_per_commit = 02e8b8ac45e34a184fc3d78eb822307b9d8508e57c599b3abe83c281fc3226c311, old_remote_per_commit = 02929255b30205baee27a5d0e2f68abc9c42a63e5c927590fef24f913c9a1ed15e
      1 lightning_channeld(31552): TRACE: Commit timer already running...
      1 lightning_channeld(31552): TRACE: Trying commit
      1 lightning_channeld(31552): TRACE: Can't send commit: nothing to send
      1 lightning_gossipd(31447): TRACE: req: type WIRE_GOSSIP_ROUTING_FAILURE len 175
      1 lightning_gossipd(31447): TRACE: backtrace: common/subdaemon.c:33 (crashdump) 0x410ca4
      1 lightning_gossipd(31447): TRACE: backtrace: (null):0 ((null)) 0x7f2cd19dc2df
      1 lightning_gossipd(31447): TRACE: backtrace: gossipd/routing.c:1068 (routing_failure) 0x40e578
      1 lightning_gossipd(31447): TRACE: backtrace: gossipd/gossip.c:1875 (handle_routing_failure) 0x404e4b
      1 lightning_gossipd(31447): TRACE: backtrace: gossipd/gossip.c:1933 (recv_req) 0x404e4b
      1 lightning_gossipd(31447): TRACE: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x424d16
      1 lightning_gossipd(31447): TRACE: backtrace: ccan/ccan/io/io.c:397 (io_ready) 0x42518a
      1 lightning_gossipd(31447): TRACE: backtrace: ccan/ccan/io/poll.c:305 (io_loop) 0x42656b
      1 lightning_gossipd(31447): TRACE: backtrace: gossipd/gossip.c:1993 (main) 0x402ada
      1 lightning_gossipd(31447): TRACE: backtrace: (null):0 ((null)) 0x7f2cd19c903f
      1 lightning_gossipd(31447): TRACE: backtrace: ../sysdeps/x86_64/start.S:120 ((null)) 0x402b39
      1 lightning_gossipd(31447): TRACE: backtrace: (null):0 ((null)) 0xffffffffffffffff
      1 lightning_gossipd(31447): STATUS_FAIL_INTERNAL_ERROR: FATAL SIGNAL 11

some context: I was trying to pay an invoice and it was spamming:

TRACE: Can't send commit: waiting for revoke_and_ack

I killed the process and brought it up again, and then it crashed with the log above.

Copy link
Contributor

@rustyrussell rustyrussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this is really good; just some stylistic nits where you've written a little too much code :)

gossipd/gossip.c Outdated
{
struct pubkey erring_node;
struct short_channel_id erring_channel;
u16 failcode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it works, just need with #include <wire/gen_onion_wire.h> in the CSV file.

}

return NULL;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very specialized function, which only makes sense because of the current caller.

How about a 'get_nc_from_node(rstate, node, scid)' which simply returns the pointer to the node_connection; then do the work below.


u64 num_ins = tal_count(node->in);
u64 num_outs = tal_count(node->out);
u64 num = num_ins + num_outs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tal_count returns size_t, which is pretty natural for sizes; u64 looks a bit weird here.

* consideration.
*
*/
if ((failcode & NODE) != 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (failcode & NODE) is a little more idiomatic.

get_all_node_connections_of(const tal_t *ctx,
struct routing_state *rstate,
const struct node *node)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would open code this, see below.

if ((failcode & NODE) != 0)
ncs = get_all_node_connections_of(tmpctx, rstate, node);
else
ncs = get_out_node_connection_of(tmpctx, rstate, node, scid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

if (failcode & NODE) {
     for (i = 0; i < tal_count(node->in); i++)
        handle_route_fail(rstate, node->in[i], failcode);
    for (i = 0; i < tal_count(node->out); i++)
        handle_route_fail(rstate, node->out[i], failcode);
} else {
     struct node_connection *nc = get_nc_from_node(node, scid);
     /* Possible, if something else deleted nc since we sent payment */
     if (!nc)
          status_trace("Error %s on node %s scid %s: no such connection any more");
     else
          handle_route_fail(rstate, nc, failcode);

* this after deactivating, so that if the
* channel_update is newer it will be
* reactivated. */
if ((failcode & UPDATE) != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style nit: if (failcode & UPDATE) { is more idiomatic.

lightningd/pay.c Outdated
if (failure->origin_index >= tal_count(payment->route_nodes)) {
/* Is it permanent? */
if ((failcode & PERM) != 0)
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still hand to gossipd!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what will we hand to gossipd? In particular, what outgoing channel will we report via gossip_routing_failure message?

From my reading of BOLT 4, below is the list of possible errors a final payee node can return:

  1. invalid_realm - permanent non-node error, but this "should" only happen if we send a realm that the recipient is ignorant of, and that indicates a bug on our side as presumably future realm types (that old nodes might not recognize) should be enabled in a global features bit. Arguably this should be treated a a node error, as apparently we think that node supports some new realm byte, but the node itself denies cognizance of that realm byte.
  2. temporary_node_failure - node error, so outgoing channel ignored, no problem.
  3. permanent_node_failure - node error, so outgoing channel ignored, no problem.
  4. required_node_feature_missing - similar to invalid_realm (permanent non-node error, this time the receiver thinks we support some feature but we ourselves are ignorant of that feature and did not pay attention to the even feature bit).
  5. unknown_payment_hash - permanent non-node error, but may happen for example if payee is c-lightning and the invoice expired before the payment reached it.
  6. incorrect_payment_amount - permanent non-node error, but is arguably a bug on our side where some roundoff or some other issue made us go beyond the limits of the invoice being paid.
  7. final_expiry_too_soon - temporary non-node error, but we can actually reuse the exact same route, we just need to adjust the CLTVs.
  8. final_incorrect_cltv_expiry - temporary non-node error, but maybe this indicates a bug on our side?
  9. final_incorrect_htlc_amount - temporary non-node error, similar to final_incorrect_cltv_expiry.

So for now, I think, it is sensible to report node-level errors on the final node to gossip, but I am uncertain what channel to deactivate for the non-node errors; non-node errors at the final destination seem to suggest a bug or disagreement between us and the final node.

I will fix the other review items for now but I want to know your opinion for the final node case in more detail first.

lightningd/pay.c Outdated
failure->msg, NULL,
&channel_update))
/* No channel update. */
channel_update = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so ugly it needs to be in its own function, I think!

@rustyrussell
Copy link
Contributor

As to sendpay, no, we need to return an error when there's a routing problem. However, we should be jsonrpc 2.0 compliant, which allows a 'data' field in errors; we should put in the information about exactly what happened in this case.

There remain two problems (which DO NOT need to be fixed in this series!):

  1. It's not clear what to do on temporary failures; disabling until we receive an update is a little harsh, esp. if the issue was the amount was too large. We could record the amount we tried as an upper bound (when do we clear that?), and/or have a 30 second timeout on the disabling (unless it's otherwise disabled in the mean time).

  2. We can get a completely invalid error reply. In this case we don't know which node to blame, so we need to divide an conquer. This might be like the 'temporary disable' above, only we'd reenable them as we retried.

@ZmnSCPxj ZmnSCPxj force-pushed the report-routing-failure branch 3 times, most recently from 094ad5d to 8ebe1be Compare January 26, 2018 04:31
@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Jan 26, 2018

@rustyrussell itl looks like we currently violate json rpc 2.0 as we return a string error field, but json rpc 2.0 spec requires error to be an object with code and message fields, and the optional data.

@ZmnSCPxj
Copy link
Collaborator Author

Okay, for sendpay basically we will still error, but also send the routing error via the optional data field of the error object. We do not actually comply with JSON-RPC 2.0 for error objects, so I am now writing a separate PR to make us comply. After that I will modify to comply with @rustyrussell's review, then start planning out how to make pay repeat and implement the data return for sendpay.

@ZmnSCPxj
Copy link
Collaborator Author

Done. Will not pass unless on top of #858 though. Will rebase when #858 is merged.

@ZmnSCPxj
Copy link
Collaborator Author

#858 merged, so rebased on top of master. Should now be OK for merging if it passes CI.

@ZmnSCPxj
Copy link
Collaborator Author

CI OK. @cdecker @rustyrussell please merge!

@cdecker
Copy link
Member

cdecker commented Jan 31, 2018

Removing the WIP label, but it's too late for me to review now. Will do it first thing tomorrow.

Copy link
Contributor

@rustyrussell rustyrussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations! This is an excellent patch set. I'm applying it now..


json_pay_failed(hout->cmd, NULL, failcode, "reply from remote");
json_pay_failed(hout->cmd, failcode, failmsg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future work: we should provide more detailed failmsg: unlike logging, that's returned to the caller who is definitely interested!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue to track this: #866

@jb55
Copy link
Collaborator

jb55 commented Feb 1, 2018

good stuff @ZmnSCPxj !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants