
category_modulo and category_binning #927

Merged
merged 9 commits on Nov 11, 2020

Conversation

o-smirnov
Contributor

I suggested this in #907 -- I would like to humbly submit a proposed implementation.

Background

I often need to colourize (categorize) data using (a) indices from an integer column (possibly modulo some preset number of categories), or (b) by binning a float-valued column and assigning categories based on which bin a value falls into.

The data is typically too big to fit in core, and is computed or loaded lazily using dask dataframes and https://github.com/ska-sa/dask-ms. I have not found a way to map this into a Categorical column without triggering a computation (and the ensuing disk I/O), which defeats the purpose of the lazy-evaluation dask layer.

In addition, in #907 @maihde requested a mechanism for mapping a large number of categories into a reduced number of colours.

Proposal

This PR does the following:

  • Tweaks the by() reduction so that it deals with an abstract "categorizer" object (rather than explicitly using category_codes or category_values)

  • Implements two new categorizers, category_modulo and category_binning, derived from category_codes. These assign categories based on the two use cases described above.

  • The by constructor will now accept a categorizer object, as an alternative to a column name. Thus e.g.

    by(category_binning('x', 0, 10, 16))
    

    constructs a by-reduction using 16 categories, with category 0 being 0<=x<10, category 1 being 10<=x<20, etc.

    Calling the constructor with a column name implicitly constructs a category_codes, thus retaining the old behaviour.

TODO

  • The CUDA code path of these two new categorizers currently throws a NotImplementedError. It's probably trivial to implement in CUDA (at a minimum, categorization can be done on the CPU and then to_gpu_array() called), but I don't have the relevant knowledge.

  • @maihde's simplify_categories() function can be implemented as another categorizer.

@o-smirnov
Contributor Author

Hmm ok, I gotta get the tests to work first, clearly...

@jbednar
Member

jbednar commented Jun 16, 2020

This looks great, thanks! I recently proposed implementing binning of numeric dimensions in #875 (comment) to make 3D aggregations, and it's fun to see it actually appear! Once the tests pass I'm happy to try it out and give feedback.

@o-smirnov
Contributor Author

Thanks, @jbednar! I'm quite keen for this and #926 to go in, because then I can release shadems on PyPI.

This PR should also allow other interesting user-defined categorizers, for example an outer product over multiple categories, or a category remapper...

I might need some help with the Travis build though. It seems everything passes except this one thing:

Traceback (most recent call last):
  File "/home/travis/miniconda/envs/3.6/bin/datashader", line 11, in <module>
    load_entry_point('datashader', 'console_scripts', 'datashader')()
  File "/home/travis/build/holoviz/datashader/datashader/__main__.py", line 9, in main
    return pyct.cmd.substitute_main('datashader',args=args)
  File "/home/travis/miniconda/envs/3.6/lib/python3.6/site-packages/pyct/cmd.py", line 455, in substitute_main
    args.func(args) if hasattr(args,'func') else parser.error("must supply command to run")
  File "/home/travis/miniconda/envs/3.6/lib/python3.6/site-packages/pyct/cmd.py", line 394, in <lambda>
    parser.set_defaults(func=lambda args: fn(name, **{k: getattr(args,k) for k in vars(args) if k!='func'} ))
TypeError: fetch_data() got an unexpected keyword argument 'verbose'

...which has absolutely nothing to do with any code I touched, so I'm at a loss as to where to start here.

@o-smirnov
Contributor Author

I might need some help with the Travis build though. It seems everything passes except this one thing:

Actually it looks like master is currently failing in exactly the same way, so I'll just wait for you to fix it and merge master in again...

@jbednar
Member

jbednar commented Jun 18, 2020

Ok, I've asked @kebowen730 to look into that. Stay tuned!

@o-smirnov
Contributor Author

@jbednar with the latest changes to master all the tests pass. Could you please look into merging this?

@jbednar
Member

jbednar commented Jul 8, 2020

Sure, but at the moment I'm distracted by SciPy2020, and haven't had a chance to look at it yet. Soon!

@o-smirnov
Contributor Author

@jbednar any chance to look at this?

Member

@jbednar jbednar left a comment


Looks good; sorry for the delay in reviewing! It's hard to test it out without examples and tests, though. Would it be possible to add those?

(3 review comments on datashader/reductions.py, all resolved)
@o-smirnov
Contributor Author

Thanks for the review @jbednar, I've implemented the suggested fixes, and there's now a whole bunch of tests for the reductions added in. They pass for me locally.

@o-smirnov
Contributor Author

I think master itself is currently not passing tests (I also see something like "fixture 'benchmark' not found" failing elsewhere). So once master is fixed, this should be good to go.

@jbednar
Member

jbednar commented Nov 11, 2020

I've fixed tests on master, so could you please rebase?

@o-smirnov
Contributor Author

Good to go!

@jbednar jbednar merged commit 86d4498 into holoviz:master Nov 11, 2020
@jbednar
Member

jbednar commented Nov 11, 2020

Thanks so much for the contribution; this is really cool stuff! I'll have to think about how to highlight it in the docs; any suggestions or sample bits of code welcome!

@jbednar jbednar changed the title suggested implementation for category_modulo and category_binning category_modulo and category_binning Nov 11, 2020
@o-smirnov
Contributor Author

Sure, shall I just write something up right here and you can cut-and-paste appropriately?

@jbednar
Member

jbednar commented Nov 11, 2020

That would be great, thanks!

@o-smirnov
Contributor Author

Let's say you have a dataframe with columns named x, y, gender, age and weight. Gender is a categorical column with N categories, age is integer, the others are floats.

Counting categories

The traditional count_cat aggregator can be used to aggregate points categorized by gender.

    agg = canvas.points(df, 'x', 'y', agg=ds.count_cat('gender'))

The resulting cube has three axes: x, y and gender. Each pixel in the cube will contain a count of the points of the appropriate gender that fall into the corresponding x,y bin.

The by aggregator is a generalization of this. It is constructed with two arguments: a column, and a reduction function. The following is an exact equivalent of ds.count_cat('gender'):

    agg = canvas.points(df, 'x', 'y', agg=ds.by('gender', ds.count()))

Aggregating statistics by category

However, more elaborate reduction functions can also be supplied:

    agg = canvas.points(df, 'x', 'y', agg=ds.by('gender', ds.mean('weight')))

This returns a 3D cube where each x, y, gender pixel gives the mean weight of that gender over the x, y bin.

Categorizing by non-categorical columns

The examples above only work with a categorical column type. What if one wanted to categorize by a non-categorical column such as age (or even weight)? This can be done by creating a categorizer object for that column and passing it to by:

    cat = ds.category_modulo('age', modulo=10, offset=16)
    agg = canvas.points(df, 'x', 'y', agg=ds.by(cat, ds.mean('weight')))

This returns a 3D cube containing 10 slices. The category is computed as (age - 16)%10. Thus, the first slice will aggregate the mean weight (over an x, y bin) for ages 16, 26, 36, ..., the second slice ages 17, 27, 37, ...., etc.
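As a plain-NumPy illustration of the modulo rule (the modulo_categories function below is invented for this sketch and is not part of datashader):

```python
import numpy as np

def modulo_categories(values, modulo, offset=0):
    """Mimic the category_modulo rule: category index is (value - offset) % modulo."""
    return (np.asarray(values) - offset) % modulo

ages = np.array([16, 17, 26, 36, 27])
cats = modulo_categories(ages, modulo=10, offset=16)
# -> [0, 1, 0, 0, 1]: ages 16, 26, 36 share slice 0; ages 17, 27 share slice 1
```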

The previous example is admittedly contrived. Here is something more realistic: let us look at the standard deviation in weight for particular age brackets:

    cat = ds.category_binning('age', lower=20, higher=100, nbins=8, include_under=False, include_over=False)
    agg = canvas.points(df, 'x', 'y', agg=ds.by(cat, ds.std('weight')))

This returns a 3D cube containing 9 (nbins+1) slices. Slice 0 gives the stddev in weight (per x, y bin) for ages [20,30), slice 1 for ages [30,40), ..., slice 7 for ages [90,100). The last slice, #8, is the "odd bin": it catches all "other" categories -- in this case, it gives the stddev in weight for ages below 20, ages of 100 and over, and NaN ages (the latter only being possible if age is a float column, of course).

If we were to give include_under=True (default), ages below 20 would be included in the first bin, nominally [20,30). Likewise, with include_over=True, ages >= 100 will be included in the last bin, [90,100). In this case, the "odd bin" will only contain points with NaN ages.
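The bin and odd-bin rules described above can be sketched in plain NumPy. This is an illustrative reimplementation, not datashader's actual code; the binning_categories name and signature are invented for the sketch:

```python
import numpy as np

def binning_categories(values, lower, higher, nbins,
                       include_under=True, include_over=True):
    """Illustrative sketch of the category_binning rules: values map to
    nbins bins of width (higher - lower) / nbins; out-of-range and NaN
    values fall into the extra "odd bin" with index nbins, unless
    include_under/include_over fold them into the edge bins."""
    values = np.asarray(values, dtype=float)
    width = (higher - lower) / nbins
    out = np.full(values.shape, nbins, dtype=int)  # default: the odd bin
    ok = ~np.isnan(values)
    raw = np.floor((values[ok] - lower) / width).astype(int)
    if include_under:
        raw = np.maximum(raw, 0)           # fold under-range into bin 0
    if include_over:
        raw = np.minimum(raw, nbins - 1)   # fold over-range into last bin
    # anything still out of range goes to the odd bin
    raw = np.where((raw < 0) | (raw >= nbins), nbins, raw)
    out[ok] = raw
    return out

# the age brackets from the example above: lower=20, higher=100, nbins=8
cats = binning_categories([25, 95, 15, 105, np.nan], 20, 100, 8,
                          include_under=False, include_over=False)
# -> [0, 7, 8, 8, 8]: 15, 105 and NaN all land in the odd bin (#8)
```

With the default include_under=True and include_over=True, the same 15 and 105 would instead fold into bins 0 and 7 respectively, leaving only NaN in the odd bin.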

Binning can also be done over a float-valued column:

    cat = ds.category_binning('weight', lower=0, higher=200, nbins=10)
    agg = canvas.points(df, 'x', 'y', agg=ds.by(cat, ds.max('age')))

This returns a 3D cube containing 11 slices. Each point gives the maximum age over a particular x, y and weight bin. The last slice (#10) will catch negative and NaN weights, as well as weights >= 200.

Custom categorizers

It is possible to implement your own custom categorizers. You must derive a subclass from category_codes, category_modulo or category_binning, and then implement the following methods: __init__, _hashable_inputs, categories, validate, and apply. See e.g. the implementation of category_modulo in reductions.py for an example.
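For a rough idea of the shape of such a class, here is a standalone sketch of the interface. This does not subclass the real datashader classes, and the category_parity name and method bodies are invented for illustration; consult reductions.py for the actual signatures:

```python
import numpy as np

# Hypothetical standalone sketch of a categorizer; the real ones subclass
# category_codes (etc.) in datashader.reductions.
class category_parity:
    """Categorize an integer column into 'even'/'odd' slices."""

    def __init__(self, column):
        self.column = column

    def _hashable_inputs(self):
        # used to make the categorizer hashable/comparable for caching
        return (type(self).__name__, self.column)

    def categories(self, input_dshape=None):
        # labels of the output slices along the category axis
        return ['even', 'odd']

    def validate(self, in_dshape):
        # a real implementation would check that self.column exists
        # in the dataframe and has an integer dtype
        pass

    def apply(self, df):
        # map each row of the column to its category index (0 or 1)
        return np.asarray(df[self.column]) % 2
```

With a real subclass in place, it would then be used like any other categorizer, e.g. ds.by(category_parity('age'), ds.count()).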
