
Pay: take channel capacity into account for route selection #4771

Merged

Conversation

@rustyrussell (Contributor) commented Sep 8, 2021

If this seems sane, I'd like to see if @cdecker can help produce any evidence to show that it helps? :)

@cdecker (Member) commented Sep 8, 2021

Needs a changelog, otherwise perfect 👍

ACK rustyrussell@c75f74b

@vincenzopalazzo (Collaborator) left a comment

Super cool 💯

ack c75f74b

@rustyrussell (Contributor, Author)

> Needs a changelog, otherwise perfect 👍
>
> ACK rustyrussell@c75f74b

But does it work??

Let's coordinate a before/after test using paytest?

@rustyrussell (Contributor, Author)

OK, so using paytest with Christian's node, which I seem to have trouble routing to:

Before:
30,000sat: 92 attempts, failed.
3,000sat: 11 attempts, success.
1,500sat: 6 attempts, success.

After:
30,000sat: 90 attempts, failed.
3,000sat: 4 attempts, success.
1,500sat: 4 attempts, success.

So it does make a difference!

@rustyrussell rustyrussell marked this pull request as ready for review September 22, 2021 11:05
@rustyrussell rustyrussell added this to the v0.10.2 milestone Sep 22, 2021
@cdecker (Member) commented Sep 22, 2021

ACK 4b9f4bf

@renepickhardt (Collaborator)

Didn't fully review the code yet, but please note that the linearized score is amt*(1/(c+1) + \mu*r) with r = ppm. Now ppm is an integer larger than 1, and 1/(c+1) is several orders of magnitude smaller than 1, so you need to select \mu (instead of 1/2) so that the capacity term has a similar order of magnitude and actually becomes significant. You could try 0.00001 to begin with.

Also note that instead of the linearized amt/(c+1) you could use the actual negative log probabilities: -log((c+1-amt)/(c+1)). \mu should probably be of a similar size as for the linearized version.
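
To compare the two terms concretely, here is a standalone sketch (illustrative C, not this PR's code):

```c
#include <math.h>
#include <stdint.h>

/* Linearized penalty: the first-order Taylor term of -log P(success). */
static double linearized_penalty(uint64_t amt, uint64_t c)
{
	return (double)amt / (double)(c + 1);
}

/* Exact negative log probability under uniformly distributed balances:
 * P(success) = (c + 1 - amt) / (c + 1), so the penalty diverges as
 * amt approaches the capacity c. */
static double neg_log_prob_penalty(uint64_t amt, uint64_t c)
{
	if (amt > c)
		return INFINITY; /* channel cannot carry the amount */
	return -log((double)(c + 1 - amt) / (double)(c + 1));
}
```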

@rustyrussell (Contributor, Author)

The question is: what is the capacity influence vs the fee influence? If we make the capacity influence << fee influence, it only has an effect as a tiebreaker.

Of course, the current calculation uses the fee for this particular hop to scale the capacity "cost", which is clearly wrong: using the median fee would make more sense. That's currently 1000 msat + 100ppm (though we could actually estimate it with reasonable efficiency in case it changes).

We could simply add to the cost function: (1 - (amt / capacity)) * (1000 + 100 * amt / 1000000). This gives it the same selection power as median fees if capacity is infinite, which splits the difference fairly.

@renepickhardt (Collaborator) commented Sep 23, 2021

Let me try to be very brief (I am happy to elaborate though):

Clarification about your formula

cost = (1 - (amt / capacity)) * (1000 + 100 * amt / 1000000)

the cost will be really low if the amount reaches the capacity, as cap/cap = 1 and 1 - 1 = 0. We thus get 0 cost for channels that we fully saturate, independently of the fees that we wish to pay. Thus your suggestion seems inherently flawed, or am I missing something? I do think, however, that you have the right intuition by going to the product. Thus I will write about averages of scores.

Combining various scores

If we don't try to find a min-cost flow but just single paths, we don't need a convex or linear cost function; thus the question remains which weights to take and how to combine them properly. If I recall correctly you currently use 3 features in your cost function:

  1. x_1 = f = fees coming from the routing fees
  2. x_2 = r = cltv encoded as the risk factor
  3. x_3 = p = success probability encoded in the linear term of the Taylor expansion of negative log probabilities.

then what you do is take the arithmetic mean of the features, potentially with some weights:

cost = f + w_r*r + w_p*p

IIRC w_r is encoded in the risk factor via config, CLI and default value, and w_p is the Lagrange multiplier (\mu) from our formula.

In our paper we also suggest the weighted arithmetic mean, as it helps our cost function stay linear/convex.

However, as noted before, in the case of Dijkstra we don't need that. I thus propose that instead of the arithmetic mean you use the harmonic mean, for which (as described in the linked Wikipedia article) a weighted version also exists:

[image: weighted harmonic mean, H = (w_1 + … + w_n) / (w_1/x_1 + … + w_n/x_n)]

The reasoning is the same as I provided to the LND autopilot.

In our case the costs from probabilities are always in the [0,1] range, and fees as well as CLTV deltas can be normalized by dividing by the theoretical max that could be seen on the network.

I think the multiplicative nature of the harmonic mean should give a much better interaction between our 3 different features, especially since they might vary over several orders of magnitude, without the necessity for us to learn optimal weights w_r and w_p.
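
To make that concrete, a minimal sketch of such a weighted harmonic mean over the three features (illustrative C, with an assumed epsilon smoothing; not code from this PR):

```c
/* Weighted harmonic mean of the three features, each normalized into
 * (0, 1]: fee f, risk r, and probability cost p.  The epsilon term is
 * an assumed smoothing so zero-fee channels don't divide by zero. */
static double weighted_harmonic_mean(double f, double r, double p,
				     double w_f, double w_r, double w_p)
{
	const double eps = 1e-9;

	return (w_f + w_r + w_p) /
	       (w_f / (f + eps) + w_r / (r + eps) + w_p / (p + eps));
}
```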

fun fact:

In my former life as a data science consultant I was once hired to help with a predictive model, as it performed poorly and much worse than expected. After a couple of days of dissecting the existing model and features I came to the conclusion that it was modelled properly and the only problem was that they used the arithmetic mean as an average of their features instead of the harmonic mean. After changing this one formula everything performed the way it was supposed to and the job was completed. Given that experience I might be slightly biased towards harmonic means in situations like ours. Of course I cannot tell you that the trick will also work here, but I believe the setting/reasons should be similar.

@rustyrussell (Contributor, Author)

> Clarification about your formula
>
> cost = (1 - (amt / capacity)) * (1000 + 100 * amt / 1000000)

Doh, brain fart! That 1- is bogus, as you point out. (amt / capacity) * (1000 + 100 * amt / 1000000) is the correct naive formula.

> x_1 = f = fees coming from the routing fees
> x_2 = r = cltv encoded as the risk factor
> x_3 = p = success probability encoded in the linear term of the Taylor expansion of negative log probabilities.
>
> then what you do is take the arithmetic mean of the features, potentially with some weights:
>
> cost = f + w_r*r + w_p*p
>
> However, as noted before, in the case of Dijkstra we don't need that. I thus propose that instead of the arithmetic mean you use the harmonic mean, for which (as described in the linked Wikipedia article) a weighted version also exists.

This is likely to cause the capacity to dominate, since the harmonic mean emphasizes the smallest value. I'm more comfortable with a simple sum of the three at this stage. Deeper changes would require deeper testing.

> In our case the costs from probabilities are always in the [0,1] range, and fees as well as CLTV deltas can be normalized by dividing by the theoretical max that could be seen on the network.

There is no useful theoretical max, but there is a perfectly reasonable expectation: the medians. For CLTV delta this is 40. At riskfactor 10, this works out to a risk cost of 760msat on a 10,000sat payment. The median fee on that payment is 1001msat.
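
(Working that out, assuming the usual riskfactor definition risk = amt * riskfactor * cltv_delta / (blocks_per_year * 100) with blocks_per_year = 52596:)

$$\text{risk} = \frac{10{,}000{,}000 \times 10 \times 40}{52596 \times 100}\ \text{msat} \approx 760\ \text{msat}$$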

For the moment I've altered it to simply add "median_fee(amt) * (amt / (capacity + 1))". This makes it comparable to the effect of fees in the normal case, which seems reasonable.

It's a simple heuristic, and I look forward to more sophisticated approaches!
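
In code, the heuristic looks roughly like this (a sketch with assumed helper names; the real implementation is capacity_bias() in plugins/libplugin-pay.c):

```c
/* Median network fee for a given amount: 1000msat base + 100ppm
 * (medians hardcoded for this sketch; they could be estimated from
 * gossip instead). */
static u64 median_fee_msat(u64 amt_msat)
{
	return 1000 + amt_msat / 10000; /* 100ppm == amt / 10000 */
}

/* Linear capacity bias: scale the median fee by the fraction of the
 * channel we would use.  Saturating the channel doubles the effective
 * fee; using half the capacity adds 50%.  The +1 avoids division by
 * zero (overflow is ignored in this sketch). */
static u64 linear_capacity_bias(u64 amt_msat, u64 capacity_msat)
{
	return median_fee_msat(amt_msat) * amt_msat / (capacity_msat + 1);
}
```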

@renepickhardt (Collaborator)

First of all, I am sorry that I forgot to mention in my last reply that I really liked your approach of using the median instead of the max (as max values can easily have strong outliers). However, I am a bit confused now:

You write that you don't like the harmonic mean and that you would rather prefer the arithmetic one (simple sum); however, even when omitting your 1-, the formula that you propose, cost = (amt / capacity) * (1000 + 100 * amt / 1000000), has the structure:

cost = p * f

I don't see the risk factor as a feature in there, and I will ignore it for now, as my argument and confusion are independent of the 3rd feature. (BTW, maybe you know something that I don't, but I never fully understood why one would want to optimize for a low CLTV anyway.)

The harmonic mean of two numbers p and f is H = 2*p*f/(p+f) (note the product in the numerator), while the arithmetic mean is A = (p+f)/2. What you propose is neither the harmonic nor the arithmetic mean, though intuitively I believe it will behave closer to the harmonic than to the arithmetic mean.

Actually, what I see is that the Rusty way is R = H * A = 2*p*f/(p+f) * (p+f)/2 = p*f. While I propose the harmonic mean and you wrote that you like the arithmetic one, it seems that you actually propose the product of the harmonic and arithmetic means.
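
Written out:

$$R = H \cdot A = \frac{2pf}{p+f} \cdot \frac{p+f}{2} = pf$$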

BTW, looking at your actual code I am even more confused. In line 719

return fee.millisatoshis /* Raw: complex math & laziness */

you compute the capacity feature as the product of the fee and the probability,

but then in line 734 you just linearly add it to the other values

+ capacity_bias(global_gossmap, c, dir, cost);

this means your cost function now looks like this:

C = f + r + p*f = f*(1+p) + r

Independently of your actual choice of how to compute the total cost, I think it might make sense to not have the median values as constants in the code but rather learn them from gossip, either at node start or once in a while. Also I want to emphasize that

@cdecker (Member) commented Sep 28, 2021

I spent the weekend measuring the three variants we have here:

  • "original": the code currently deployed in v0.10.1
  • "variant1-4b9f4bf": the first version of this PR
  • "variant2-d939e3": the current version of the PR

With the help of @renepickhardt's and @rustyrussell's nodes running the paytest plugin, we performed 6747 test payments with varying amounts (1000 * 2**x) and the variants above. The following are some preliminary plots:

[plot: scatter]

Simple scatterplot of completion times vs amount, not really all that helpful.

[plot: median completion time]

Median time to completion for the various variants and amounts. Taking the channel size into consideration definitely helps, except maybe for tiny amounts. The hump in the middle is likely a topological issue with my surroundings, resulting in some bad channels being attempted before succeeding.

[plot: success rate]

My favorite plot: we hugely increase the success probability. The dip for the original seems to be the exemptfee causing us to pick smaller & cheaper channels, wasting time on unsuccessful attempts.

[plot: median attempts]

And finally, the median number of attempts to complete. Again we need fewer attempts, except maybe for tiny amounts.

More measurements would definitely be nice, to analyze the details of what's going on.

IMHO this can be merged as is, and we can defer tweaking parameters, or do so incrementally.

@renepickhardt (Collaborator)

Why not make additional experiments? At least one version where you use -log((c+1-amt)/(c+1)) instead of the linearized amt/(c+1), and also at least one experiment using this plus the harmonic mean to combine the features in the cost function. I expect much better results if you do so, for all the reasons discussed above.

Also, I am a bit confused by the experimental setup. Are you actually always using Rusty's and my node as destination, or do you force intermediary nodes? If you always use the same nodes as destinations I would not be surprised that the curves are so close to each other.

While this is obviously for you to decide, I would at this point not merge the linearized and arithmetic mean variant that still emphasizes fees heavily (even though it seems to be an improvement).

@cdecker (Member) commented Sep 29, 2021

> Why not make additional experiments? At least one version where you use -log((c+1-amt)/(c+1)) instead of the linearized amt/(c+1), and also at least one experiment using this plus the harmonic mean to combine the features in the cost function. I expect much better results if you do so, for all the reasons discussed above.

If you code the variant up I'll test it. I don't really trust myself not to introduce a random error into the formula 😉

> Also, I am a bit confused by the experimental setup. Are you actually always using Rusty's and my node as destination, or do you force intermediary nodes? If you always use the same nodes as destinations I would not be surprised that the curves are so close to each other.

Yes, this is very simplistic, given I only have 2 destinations to test against for now; hopefully more users/devs will run the plugin over time, allowing us to perform more realistic tests with more diversity. As for the second point: in order to compare the results we want the test conditions to be very similar, so any difference can be attributed to our change. I am not following how testing the same scenario while only changing one variable under our control could possibly be bad here.

> While this is obviously for you to decide, I would at this point not merge the linearized and arithmetic mean variant that still emphasizes fees heavily (even though it seems to be an improvement).

Obviously we want the best possible variant to make it into the release, but if we don't have data to corroborate the improvement we'll merge the one that we're confident has shown improvement. It's all a matter of how well we can test the variants.

@renepickhardt (Collaborator)

I have just created a small notebook at https://github.com/renepickhardt/probabilistic-pathfinding/blob/main/Probabilistic%20Pathfinding%20Simulation.ipynb which I plan to extend with a small simulation of various cost functions before I make a concrete proposal for a mainnet test. I am happy to provide you with the C code of a modified cost function, as those changes don't seem large and should be rather easy to build in.

Note I also copy the pictures from the current version of the notebook here:

[plots: arithmeticNoCLTV, arithmeticWithCLTV, harmonicNoCLTV, harmonicWithCLTV, Rusty]

I would argue that we can already see that the harmonic mean has a nice impact on the shape of the cost function, but of course that does not say anything about how experiments/simulations will actually work.

@renepickhardt (Collaborator) commented Sep 29, 2021

OK, I have updated the notebook. (I am not sure if I compute the risk factor correctly, especially with respect to putting it on the same order of magnitude as the fees.) Other than that, I have now tested in a simulation a version with a smoothed harmonic mean (to avoid division by zero if fees are zero) and the method proposed by Rusty.

At least in the simulation, the harmonic mean with negative log probabilities works much better than Rusty's formula, which heavily focuses on minimizing fees:

probabilistic with CLTV success rate: 91.87%
probabilistic no CLTV success rate: 91.46%
rusty's method success rate: 71.54%

As we can see, the CLTV/risk factor seems not to make a huge difference, but as noted before I am not sure if I compute this properly as of now.

What I did in this experiment: I randomly assigned balance values to the LN network (following a uniform distribution, which as of now is the best we can do); for all three cost functions I used the same instance of the randomly assigned balance values. Then I tried to deliver 100,000 sats (1 mBTC) from my node to every node in the current bos-score list. I only did one attempt, along the shortest path according to the cost function. While I am not fond of the bos score, in this quick test it was the easiest way for me to have plausible/active destination nodes.
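
The per-channel sampling step of that simulation could look like this (a self-contained C sketch of the procedure described above, ignoring fees along the path):

```c
#include <stdint.h>
#include <stdlib.h>

/* Draw a uniformly distributed balance for a channel of the given
 * capacity (rand() is good enough for an illustration). */
static uint64_t uniform_balance(uint64_t capacity)
{
	return (uint64_t)(((double)rand() / RAND_MAX) * (double)capacity);
}

/* A path succeeds iff every hop's sampled balance covers the amount. */
static int path_succeeds(const uint64_t *balances, size_t num_hops,
			 uint64_t amt)
{
	for (size_t i = 0; i < num_hops; i++)
		if (balances[i] < amt)
			return 0;
	return 1;
}
```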

@renepickhardt (Collaborator)

Just leaving a rough (untested) implementation of the harmonic mean with probabilistic pathfinding and risk factor.

No normalization is needed due to the harmonic mean, but scores are cast to longs and this might lose accuracy (renepickhardt@a24f363), thus some refactoring of method signatures might be necessary.

I couldn't test the code right now, so please view this merely as documentation that should be tested on mainnet.

@ZmnSCPxj (Collaborator) commented Oct 5, 2021

> The question is: what is the capacity influence vs the fee influence? If we make the capacity influence << fee influence, it only has an effect as a tiebreaker.

https://lists.linuxfoundation.org/pipermail/lightning-dev/2021-August/003191.html ? Especially since this is "just" Dijkstra and not min-cost flow, so a convex (or is it concave?) cost function is not needed.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We bias each channel linearly by capacity, scaled by the median fee.
This means that we effectively double the fee if we would use the
entire capacity, and only increase it by 50% if we would use only
1/2 the capacity.

This should drive us towards larger channels.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Plugins: `pay` now biases towards larger channels, improving success probability.
@cdecker force-pushed the guilt/pay-capacity-weight-hack branch from d939e35 to bdcb9f6 on October 21, 2021 16:09
Changelog-Changed: pay: The route selection will now use log-probability-based channel selection to increase success rate and reduce time to completion
@cdecker (Member) commented Oct 21, 2021

Updated the pull request to use the log-probabilities, since they perform consistently better than the linear bias of the original version. Rebased on master and happy to merge as soon as it passes CI.

@renepickhardt could you check that by cleaning up the last commit I didn't accidentally mess up the logic?
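
For reference, the rough shape of the combined score (an assumed smoothed product-over-sum form for illustration; the authoritative code is route_score() in plugins/libplugin-pay.c):

```c
/* Combine fee cost, risk and capacity bias (all msat-scaled) into a
 * single Dijkstra score.  The +1 smooths away a division by zero when
 * all terms are zero; the exact form here is an assumption, not a
 * verbatim copy of the merged code. */
static u64 combined_score(u64 cmsat, u64 rmsat, u64 bias)
{
	return (cmsat * rmsat * bias) / (cmsat + rmsat + bias + 1);
}
```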

@vincenzopalazzo (Collaborator) left a comment

LGTM.

ack 8378ed2

@renepickhardt (Collaborator) left a comment

Delighted to write ACK (:

Though I have a few nitpicking comments on naming, and one issue with the harmonic mean formula, which I messed up before our evaluation. But I think this should be left to future evaluation.

plugins/libplugin-pay.c (resolved)
plugins/libplugin-pay.c (resolved)
{
	u64 cmsat = cost.millisatoshis; /* Raw: lengthy math */
	u64 rmsat = risk.millisatoshis; /* Raw: lengthy math */
	u64 bias = capacity_bias(global_gossmap, c, dir, cost);

@renepickhardt (Collaborator):

I understand that historically this was thought of as a capacity bias, but for future readability I would actually rename this to negative_uniform_logprob.

}

/* Prioritize costs over distance, but bias to larger channels. */
static u64 route_score(u32 distance,

@renepickhardt (Collaborator):

nit: maybe rename to cost instead of score, just to be closer to the terms used in the literature (cf. min-cost flow instead of minimum-score flow).

@renepickhardt (Collaborator):

I think channel_cost or cost_function would be a better name, also at the point where you make the Dijkstra call.


/* Prioritize costs over distance, but bias to larger channels. */
static u64 route_score(u32 distance,
		       struct amount_msat cost,

@renepickhardt (Collaborator):

nit: same here; this cost I would actually name fees, as the fees are just one feature in the final cost function.
