Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimagining histogram autobin #3001

Closed
alexcjohnson opened this issue Sep 12, 2018 · 6 comments · Fixed by #3044
Closed

Reimagining histogram autobin #3001

alexcjohnson opened this issue Sep 12, 2018 · 6 comments · Fixed by #3044
Labels
bug something broken feature something new

Comments

@alexcjohnson
Copy link
Collaborator

Histogram autobin works relatively well for a single trace, but could be better. But for multiple traces, despite several attempts to clean it up over time (#2028, #1944, #1901, ...), it has a bunch of problems:

  • If you leave out all binning information, we'll push autobinx: true back to the first trace, but autobin: false to all subsequent traces. This doesn't have any immediate impact, but if you then try to alter bins later, the results depend on which trace you edit. In particular if you edit the first one, (which seems logical, right?) the later ones will keep their original bin sizes https://codepen.io/alexcjohnson/pen/pOVrGj?editors=0010, but if instead you edit the other two (so change [0] to [1,2] in the restyle call at the end) the first one will get this new size as well.
  • Stacked/grouped histograms can have different bin sizes or incompatible start positions (this is the resulting situation in the codepen above). This results in a misleading plot (you can make peaks shift, so two matching peaks look separated, for example, and make a flat distribution look like it has gaps), and I would argue even if explicitly supplied this way we should not allow this situation, and take the size only from the first one that explicitly specifies it, similar to how we handle stacked area options. It's fine though to have independent sizes & starts if barmode='overlay'. I initially thought we might need a "bingroup" attribute similar to scatter's stackgroup but now I think it's better to just enforce a match across the already known group.
  • Match hist bins #1944, while matching up autobin sizes across histograms, made what I now think is the wrong decision: for multiple autobinned histograms the bin size "is the minimum any of them were auto-assigned". I think a much better solution would be to concatenate all the data together and autobin it as a single unit. That will generally result in roughly the largest bin size of any of the constituent traces, but sometimes it will be bigger than any of the individual traces (if they have a bigger range together than separately), sometimes smaller (if they have similar ranges but the total sample size is now big enough that we choose a smaller bin), and it becomes very clear how to shift the start to reduce ambiguity (to minimize data exactly at bin edges). Initially I was thinking we should make this optional, but I've come to think if we clean up the first two points above, the autobin size should just be changed to be the composite size.
  • We're mutating gd.data (autobinx, xbins.(start|end|size)) - and doing so in buggy ways at that.

Proposal:

  • Drop autobin(x|y) entirely, and just use the (improved) autobin routine to fill in whatever gaps there are in the explicitly specified attributes. So if you have explicitly specified an attribute and would like it to revert to auto, instead of turning on autobin you would delete the value.
  • For backward compatibility with restyle we can convert autobinx: true into xbins: null. I'll have to investigate the existing behavior when autobinx: true and xbins are both specified upfront to figure out what we would want to do in cleanData.
  • Coerce nbinsx iff an explicit xbins.size is not found (in any trace in the group)
  • Determine any needed auto values for xbins, and stash these and only these in fullTrace._xbins, but have both supplyDefaults and calc ensure the two are merged in fullTrace.xbins. So for example, if you specify trace.xbins: {end: 30} we'll set fullTrace._xbins: {start: 0, size: 5} and fullTrace.xbins: {start: 0, end: 30, size: 5}. The reason for this is Plotly.react: we want the full set in fullTrace, and supplyDefaults needs access to the auto-generated values, but in case you delete an explicit value from trace we don't want supplyDefaults to be able to fill it back in, so we will see the change and trigger a calc.

@etpinard I know this is a fairly big change, potentially breaking for some users, but the existing behavior is sufficiently broken and problematic already that I think we should consider it. This would also allow us to take a broader view of what "autobin" means, realizing it's not just a boolean but size, start, and end can all be independently auto or explicit.

Note that start does not need to match exactly from one trace to the next but does need to be compatible (all of them must be the same modulo size). end can be completely independent from trace to trace. So this logic will still be a little bit intricate...

@etpinard
Copy link
Contributor

etpinard commented Sep 12, 2018

Ok, wow. I didn't realize how much our autobin logic got complex in #1944 🔬

As you might expect, I'm leading toward calling the proposal a breaking change, but I could be convinced otherwise. Here are some questions and comments:

we'll push autobinx: true back to the first trace, but autobin: false to all subsequent traces.

Was this a conscious decision in #1944? Would things break if all those "subsequent" traces get autobinx: true?

similar to how we handle stacked area options

I'm glad the (heavy) stacked area option logic has some parallels elsewhere in the library 👍

I think a much better solution would be to concatenate all the data together and autobin it as a single unit. That will generally result in roughly the largest bin size of any of the constituent traces, but sometimes it will be bigger than any of the individual traces

I agree 💯 % here

Drop autobin(x|y) entirely, and just use the (improved) autobin routine to fill in whatever gaps there are in the explicitly specified attributes.

Is dropping autobin(x|y) entirely necessary to get your proposal to work? Sounds to me like autobin(x|y) will still be valuable in e.g. the chart-editor where one could easily toggle between the auto values and their set values even if we make all non-overlaid histogram traces share the same bins.


All in all, it sounds to me the the proposed algorithm could be an opt-on feature (in v1.x) via a bingroup attribute. That is, if a user sets bingroup, they get:

  • shared bin values (using logic similar to stacked area trace)
  • auto-bins computed by ways of concatenate all the data together
  • no breaking changes 🎉

@alexcjohnson
Copy link
Collaborator Author

Was this a conscious decision in #1944? Would things break if all those "subsequent" traces get autobinx: true?

No, pure bug. It wouldn't break things, but given that this is already broken I feel we have some extra flexibility to improve things, as long as we don't change things that currently work.

Is dropping autobin(x|y) entirely necessary to get your proposal to work?

Perhaps not, but I'm not quite sure how autobinx should behave if we keep it - just ignore traceIn.xbins entirely?

Sounds to me like autobin(x|y) will still be valuable in e.g. the chart-editor where one could easily toggle between the auto values and their set values even if we make all non-overlaid histogram traces share the same bins.

I feel like there's actually more potential for confusion if we have autobin(x|y) attributes, at least after we stop mutating these pieces of gd.data, but would be curious to hear @nicolaskruchten's take.

If autobinx just means we ignore traceIn.xbins, then when it's on you would see values for xbins.(start|end|size) but either you wouldn't be able to change them, or if changing one turned off autobinx, then the other two would also suddenly jump to whatever non-auto values you had previously given them. Or perhaps changing one would turn off autobinx and then also unset the other two values...

And if we get that far, the entire autobinx attribute, at least as far as the GUI is concerned, could just be replaced by a button with no associated attribute (other than "do we have a value for start, end, or size") saying "unset start, end, and size." But in the scheme I'm proposing it's meaningful to set/unset each of these attributes individually, so while "unset all" might be nice, it's just one of several possible unsetting operations.

Is there something else autobinx could mean that would make more sense?

All in all, it sounds to me the the proposed algorithm could be an opt-on feature (in v1.x) via a bingroup attribute.

Unless we make some option for multiple overlay groups (where each group could have independent binning but we stack or something within each group... would be strange), I think this has to be all-or-nothing, at least per-subplot. So I suppose we could make a boolean to opt into this... but I'm hopeful that we can design it without that in such a way that the initial render is close enough to current behavior (ie just fixes bugs) and changes are mostly confined to dynamic behavior.

@etpinard
Copy link
Contributor

I feel like there's actually more potential for confusion if we have autobin(x|y) attributes, at least after we stop mutating these pieces of gd.data, but would be curious to hear @nicolaskruchten's take.

A few things to consider if we stopped mutating autobin values into gd.data:

  • this will bring in a situation similar to violin bandwidth and span (cc Fields missing in fullData #2834) where their the bins' start/end/size value won't be known after Plots.supplyDefaults and hence can't just be pasted into fullData during calc (this would throw off the Plotly.react diffing routine). So either (1) the editor would need to dive into gd.calcdata to grab the auto bin values or (2) we'll have to modify our supplyDefaults pipeline (probably by moving relinkPrivateKeys around). Option (2) sounds worthwhile here, as the end result of mutating less things in user data would be a major win.

It will be interesting to see how 🔪 autobin(x|y) works out. If does work out well, this could push similar attributes like autocontours, cauto (and its evil-twin zauto) to their deathbed.

So I suppose we could make a boolean to opt into this.

This gets my vote, but if

we can design it without that in such a way that the initial render is close enough to current behavior (ie just fixes bugs) and changes are mostly confined to dynamic behavior.

turns out ot be true when the PR is ready for review, I might change my mind.

@etpinard etpinard added bug something broken feature something new labels Sep 13, 2018
@alexcjohnson
Copy link
Collaborator Author

alexcjohnson commented Sep 14, 2018

For the record, some aspects of the current disaster behavior:

Single trace

Starting from {type: 'histogram', x: xArray} add the following:

  1. {} (full implied auto):
    We add autobin: true, xbins: {start, end, size} to gd.data. Future data changes cause new autobin.

  2. {autobinx: false} (specify not auto, but no values):
    We keep autobinx: false but add xbins: {start, end, size} to gd.data. So we autobin once but future data changes will not autobin again.

  3. {autobinx: ?, xbins: {size}} (specify any auto, incomplete bin values, any combination except all 3 bin attributes specified):
    Partial bin info is discarded, future autobin as in above two cases. Unspecified autobinx gets set to true.

  4. {xbins: {start, end, size}} (autobin not included, complete bin values):
    We add autobin: false to gd.data, no autobin ever happens.

  5. {autobinx: false, xbins: {start, end, size}} (specify not auto, complete values):
    No mutation to gd.data, no autobin ever happens.

  6. {autobinx: true, xbins: {start, end, size}} (specify auto AND complete values):
    New autobin values overwrite xbins. Future data changes cause new autobin.

As proposed above, the only change to first draw behavior would be case 3, where partial bin specification would respect the specified parts and auto-determine the others. But all gd.data mutations would disappear except the cleanData step of clearing xbins if autobinx=true then removing autobinx
On updates, the new proposal has no "autobin once" possibility. In fact I believe that's theoretically impossible without mutating gd.data. I suppose we could keep an explicit autobinx=false around until we finish autobinning, then push the results back to gd.data and delete autobinx... This would be similar to how we handle xaxis.autorange: 'reversed' and turn it into autorange: true, range: [high, low]

Multiple traces

  1. [{}, {}, {}] (full implied auto):
    xbins gets chosen to match across all traces (minimum size), but autobinx is set true in the first and false in others.

  2. [{autobinx:?}, {autobinx:?}, {autobinx:?}] (no bin defs but explicit autobinx either true or false):
    xbins as in implied auto, autobinx takes the requested value for each trace.

  3. [{xbins:{size}}, {xbins:{size}}, {xbins:{size}}] (partial bin defs):
    autobinx=true for all traces, but xbins.start,end are NOT filled in anywhere. PLOT FAILS TO RENDER. Same result if even one trace has partial bin defs and others have no bin defs (though specifying start and end with no size seems to nearly work.

  4. [{autobinx:true, xbins:{?}}, {autobinx:true, xbins:{?}}, {autobinx:true, xbins:{?}}] (explicit autobin, any or no bin defs):
    xbins gets chosen to match across all traces, autobinx is respected.

  5. [{}, {xbins:{start,end,size}}, {}] (full bin def for one trace):
    xbins of other traces sometimes get chosen to match the one with a full def, sometimes smaller bin sizes (we seem to have thought this acceptable at some point, it seems to always be the defined bin size divided by an integer? Whatever, it really makes no sense). autobinx is set false for the trace with defined bins, and otherwise true for the first trace and false for others (as in case 1)

  6. [{xbins:{start,end,size}}, {xbins:{start,end,size}}, {}] (full bin def for more than one trace):
    The defined bins are allowed, even if they conflict. Any undefined bin(s) get sized to the first defined bin size divided by some integer, always small enough to be less than the smallest defined bin size.

All of these scenarios have problems, other than (3) explicit and complete autobin (though even that I think we agree we should pick the bin size differently and usually larger). I'd say the multi-trace behavior is broken enough that anyone who's currently using it successfully must either have data that happens to work fine fully auto, or must have inserted complete manual (and explicitly matching!) bin specs. Neither of those cases would break using the above proposal.

@etpinard
Copy link
Contributor

Referencing #1318 which shares some thoughts about autobins and CDF histograms.

@etpinard
Copy link
Contributor

Referencing #1282 for some "old" thoughts on auto* attributes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something broken feature something new
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants