Separate `OutcomeSpace` and `ProbabilitiesEstimator` #285

kahaaga · 2023-08-11T10:18:18Z

This PR supersedes #279. Due to type renaming and folder structure changes, the diff is huge. The main changes are summarised below. I may have forgotten some things in this list. To be updated.

This is still WIP, but the API changes can be reviewed. Changes are best viewed in the updated docs.

Changes

I've been experimenting with a few different solutions to #280 and #237. This PR provides a combined solution to both by redesigning the probabilities-based code. Changes are best understood by a few examples:

Examples (comparing old and new API)

Probabilities can now be estimated either 1) using an OutcomeSpace directly (which uses MLE estimation by default, and is equivalent to how probabilities were estimated before), or 2) using a dedicated ProbabilitiesEstimator, which computes probabilities in some fancy way from raw counts. The probabilities estimator always takes the OutcomeSpace as the first input.

# Old example
x = rand(100)
o = SymbolicPermutation(m = 3) # an ordinal pattern based outcome space
ps = probabilities(o, x) # old syntax

# New 
ps = probabilities(MLE(o), x) # equivalent to the above, so non-breaking
ps_bayes = probabilities(Bayes(o), x) # new `Bayes` probabilities
ps_shrink = probabilities(Shrinkage(o), x) # new `Shrinkage` probabilities estimator

This also works with information and friends. We can now arbitrarily combine any DiscreteInfoEstimator with any ProbabilitiesEstimator, which enables a bunch of new estimators, many of which have not been seen in the literature before.

x = rand(100)
o = SymbolicPermutation(m = 3)
# Using a definition and and outcome space (old syntax) is equivalent to using the `PlugIn` and `MLE` estimators.
h_s = information(Shannon(), o, x) #  old 
h_s = information(PlugIn(Shannon()), o, x) # new. equivalent to the above, so non-breaking
h_s = information(PlugIn(Shannon()), MLE(o), x) # new, equivalent to the above, so non-breaking

# We can now combine any `DiscreteInfoEstimator` with any `ProbabilitiesEstimator` to compute entropies
h_sb = information(PlugIn(Shannon()), Bayes(o), x)
h_ss = information(MillerMadow(Shannon()), Shrinkage(o), x)
h_rs = information(Jackknife(Renyi()), Shrinkage(o), x)

# The generic `PlugIn` and `Jackknife` estimator can of course be used with any measure
jr_jack_mle = information(PlugIn(RenyiExtropy()), MLE(o), x)
jt_jack_shrinkage = information(Jackknife(TsallisExtropy()), Shrinkage(o), x)

Naming:

I'm happy to just keep the names current proposal, since it doesn't really matter for the end user. They only use concrete types, not the abstract ones, so it doesn't really matter what the abstract types are called, as long as they are not misleading. Dealing with Was information measure a mistake? #284 should, if done, be performed after this PR is merged.

Codebase:

ProbabilitiesEstimator becomes OutcomeSpace. The OutcomeSpace defines the possible outcomes, and defines how encoding/discretization of input data is done.
ProbabilitiesEstimator still exists, but has the role of converting outcome counts (or pseudo-counts, in the case of e.g. WaveletOverlap, PowerSpectrum or TransferOperator) to probabilities. The basic estimator is MLE (maximum likelihood estimator), which estimates probabilities using relative counts. We also now have the Bayes and Shrinkage probabilities estimators, which through parameterisation covers a lot of literature estimators.
The transition is done seamlessly by introducing the functions counts_and_outcomes/allcounts_and_outcomes, which are completely analogous to probabilities_and_outcomes/allprobabilities_and_outcomes, except they return raw counts instead of probabilities.
This transition is necessary because I realized that probabilities are not the fundamental quantity we though. It is actually the counts of outcomes that are the fundamental quantity, because these define how probabilities are estimated, using some ProbabilitiesEstimator. Not all OutcomeSpaces are counting-compatible (e.g. WaveletOverlap), but in this implementation we can still use these OutcomeSpaces directly with the naive MLE probabilities. The justification is that they still return some form of "normalized relative counts", so it fits with the MLE estimator, even though they are not strictly counts. This avoids breaking any code, and the design choice is documented.
The separation into OutcomeSpace and ProbabilitiesEstimator is backwards-compatible, because using an OutcomeSpace directly is equivalent to using the MLE estimator.

Documentation

Moved the Encoding API to under "Probabilities" in the docs
Re-arranged a bunch of content, so that it fits more logically in with the new API.
I've rewritten the docstrings a bit, to better highlight how the different components come together. These descriptions are up for discussion, but I'm pretty happy with them as is.

TODO:

Make probabilities/probabilities_and_outcomes generic. There's no need to specialize methods unless something unique happens for a particular OutcomeSpace.
Add a dedicated Renyi discrete estimator
Add a sample-coverage based ProbabilitiesEstimator (the Good-Turing correction)
Ensure test coverage doesn't decrease.
Thoroughly read docs and make sure they make sense and are updated.
Check all doc links.

Fixes #283

Fixes a bug too, where repeated outcomes were returned by `probabilities_and_outcomes`. This wasn't caught before, because the tests were wrong too.

codecov · 2023-08-11T10:29:33Z

Codecov Report

Merging #285 (16950d2) into main (68fc7fa) will increase coverage by 0.52%.
The diff coverage is 90.71%.

@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
+ Coverage   85.88%   86.40%   +0.52%     
==========================================
  Files          57       71      +14     
  Lines        1438     1751     +313     
==========================================
+ Hits         1235     1513     +278     
- Misses        203      238      +35

Files Changed	Coverage Δ
src/ComplexityMeasures.jl	`100.00% <ø> (ø)`
src/complexity_measures/statistical_complexity.jl	`100.00% <ø> (ø)`
src/core/information_measures.jl	`83.33% <0.00%> (-10.00%)`	⬇️
...rc/encoding_implementations/rectangular_binning.jl	`91.56% <0.00%> (+0.10%)`	⬆️
src/outcome_spaces/spatial/utils.jl	`94.11% <ø> (ø)`
...rc/outcome_spaces/transfer_operator/GroupSlices.jl	`66.66% <ø> (ø)`
src/outcome_spaces/symbolic_permutation.jl	`81.94% <66.66%> (ø)`
src/outcome_spaces/value_histogram.jl	`76.19% <72.72%> (ø)`
.../spatial_permutation/SpatialSymbolicPermutation.jl	`82.69% <82.60%> (ø)`
src/discrete_info_estimators/schurmann.jl	`85.00% <85.00%> (ø)`
... and 24 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

kahaaga · 2023-08-11T11:05:34Z

It will take me a few days to finalize the remaining estimators I have for this PR, so if you have any feedback @Datseris, please give it, and I will try to incorporate it in the meanwhile.

Datseris · 2023-08-11T12:18:07Z

Hi @kahaaga ,

thanks a lot for the PR. Thanks for taking the time to write an extensive PR description that outlines and motivates major changes. I can't stress enough how extremely helpful this description is to review the PR.

I went through the changes and the new docs. I have not went through the source code yet, will do only when the PR is final.

I think this is incredible and I accept all changes (minor comments only).

When merged, I will make a high level tutorial just like I have done for Agents.jl and DynamicalSystems.jl. This library is now so extensive that it really needs such a tutorial.

`probabilities` and `counts` source

How does probabilities work now? What is its source code? Is it a generic function or is it outcome space dependent? What does the counts function do for outcome spaces that don't have counts? I guess I can get the answers to these questions myself once I read the source, but you already have the answers so might as well give the mto me.

About `information`

I think how to tackle #284 is simple. information and the whole baggage stays as is. Howewver, it is documented in the Dev Docs page. The Entropy page remains as is in currect stable release and replaces the Information page and measures. This allows us to have soemthing simple for users, yet extendable for developers or advanced users. The information interface will be mentioned in the entropy docstring of course. Extropy becoems a subsection in the Entropy docs.

The source code is not affected by this : highlevel functions entropy, extropy just forward to information.

Do we agree on this? It keeps all your code as is and only changes the way we document it and expose it to users.

Minor comments

Do we really need allprobabilities_and_outcomes? and same for allcounts_and_outcomes? I would vote no.
Docstring of OutcomeSpace: say "rules for mapping input data to each outcome" (no parenthesis on dimension)
Same docstring: "outcome space model" should be just "outcome space".
MLE should be renamed to RelativeCounts or something similar and not use abbtreviations. In its formula I would aruge to also rename f_k to n_k.
In general I would argue to rename f to n when we are referring to counts, e.g., in the docstring of counts_and_outcomes.
in the Probabilities docpage, the OutComeSpace section should be before the probability estimators.
There was a significant drop of test coverage in this PR, I hope we can recover it, if not in this PR then in the near future with a new PR.
In Bayes estimator docstring, you should say "if used with allprobablitis, then length(a) == total_outcomes(o, x) instead of sing length(outocmes(o, x)).
It is not clear what m is in the Bayes estimator in the Assumptions section. Is it Ω?
Mathematical symbol m is also used in Shrinkage but not clear what it is . Let's try to keep symbol variability to absolute minimum. Is m the length of Ω? In OutcomeSpace this is denoted as L. Let's make it explicit by deciding L to be the symbol. In general, let's use N as data length, L as outcome space cardinality, and D as dimensionality of input data. Latter two follow the general DynamicalSystems.jl convention.

Datseris · 2023-08-11T12:24:19Z

in general please be careful with symbols. I see eg in counts_and_outcomes source code "Ωᵢ ∈ Ω" but we consistently have used small ωι for the elements of Ω in other places. Let's stick with consistency as much as possible.

kahaaga · 2023-08-11T12:44:37Z

I'll address when I have more time, but for now:

When merged, I will make a high level tutorial just like I have done for Agents.jl and DynamicalSystems.jl. This library is now so extensive that it really needs such a tutorial.

Agreed.

I think how to tackle #284 is simple. information and the whole baggage stays as is. Howewver, it is documented in the Dev Docs page. The Entropy page remains as is in currect stable release and replaces the Information page and measures. This allows us to have soemthing simple for users, yet extendable for developers or advanced users. The information interface will be mentioned in the entropy docstring of course. Extropy becoems a subsection in the Entropy docs.
The source code is not affected by this : highlevel functions entropy, extropy just forward to information.
Do we agree on this? It keeps all your code as is and only changes the way we document it and expose it to user

Yes, agreed.

How does probabilities work now?

probabilities/probabilities_and_outcomes is now a generic wrapper to counts/counts_and_outcomes (which is outcome space specific). The wrapper simply does

function probabilities_and_outcomes(est::ProbabilitiesEstimator{<:OutcomeSpace}, x)
    cts, outs = counts_and_outcomes(est.outcomespace, x) 
    return Probabilities(cts), out
end

Currently, the function is explicitly implemented for each outcome space, but we can make it generic. I'll add this to the TODO-list.

What does the counts function do for outcome spaces that don't have counts?

Outcome spaces that don't have counts simply don't implement counts/counts_and_outcomes. They implement probabilities/probabilities_and_outcomes directly. As an example, here's the source code for WaveletOverlap:

# MLE estimation on "pseudo-counts"
function probabilities_and_outcomes(est::WaveletOverlap, x)
    x isa AbstractVector{<:Real} || error("`WaveletOverlap` only works for timeseries input!")
    freqs = time_scale_density(x, est.wl) # relative energies ("pseudo-counts")
    return Probabilities(freqs), 1:length(freqs)
end

This way, we just consider all "relative counts"/"relative energies"/"relative whatever" as pre-computed probabilities (at least after normalization). What it means in practice is that count-based outcome spaces will work with any ProbabilitiesEstimator, while those not based on counting will just work with MLE (or whatever we decide to name it).

Datseris · 2023-08-11T12:47:45Z

let me tackle the information change in the same PR as the high level tutorial (after this is merged).

kahaaga · 2023-08-11T12:49:14Z

let me tackle the information change in the same PR as the high level tutorial (after this is merged).

So I keep the code as is for now on a macro-level, and we deal with #284 after?

It is worth noting that entropy(::DiscreteInfoEstimator, ::ProbabilitiesEstimator, x) already works (due to deprecations), it's just not documented anywhere.

Datseris · 2023-08-11T13:03:38Z

So I keep the code as is for now on a macro-level, and we deal with #284 after?

yes/

kahaaga · 2023-08-12T09:13:56Z

MLE should be renamed to RelativeCounts or something similar and not use abbreviations.

In the literature, this procedure is almost exclusively referred to as the "maximum likelihood" estimator. Maybe MaxLikelihood/MaximumLikelihood? There are good theoretical reasons for calling it this, related to the multinomial distribution (which is what connects counts and probabilities, and leads to the estimator). But perhaps we should avoid it, since we use it for counts and relative quantities that are not actually counts.

RelativeCounts is also ok for the counting-based outcome spaces, but it doesn't quite cover the non-counting-based outcome spaces. Maybe just Relative, to cover both relative frequencies/energies/whatever and relative counts?

Edit: maybe Empirical could work too?

Datseris · 2023-08-12T10:15:01Z

Hm, I am not sure I like Empirical; aren't all estimators empirical in some sense? And Relative is too generic of a word. On the other hand, isn't MaxLikelihood something that we also have with entropies? There we called it PlugIn though, I am not sure why.

I think I like RelativeAmount the most: it is not as abstract as Relative and is not limited to only ount-based.

kahaaga · 2023-08-12T11:05:23Z

On the other hand, isn't MaxLikelihood something that we also have with entropies? There we called it PlugIn though, I am not sure why.

Yes, so the issue is that maximum likelihood estimation is a generic principle that can be applied to many things. So there is a maximum likelihood estimator of entropy, and there is a maximum likelihood estimator of probabilities, and for many other quantities. So in principle, we could use the same name at both levels (this would work, because the type parameterization is generic), but I think that would be confusing from an educational perspective.

I think I like RelativeAmount the most: it is not as abstract as Relative and is not limited to only count-based.

I'm not totally happy withRelativeAmount either, but I can't come up with anything better myself, so I think RelativeAmount is fine. In its docstring state that for counting-based outcome spaces, the estimator is the maximum likelihood estimator. We also document that for other outcome spaces, it means pre-normalized quantities. Sounds good?

Datseris · 2023-08-12T13:55:01Z

i don't know what you mean by "pre-normalized". The quantities are just normalized when turned into probabilities. What does the "pre" convey?

Datseris · 2023-08-12T13:56:02Z

If we made a generic MaxLikelihood type, how would the code work for using the same type for both entropy and probabilities?

Datseris · 2023-08-18T11:48:02Z

I don't know where the functions frequencies and frequencies_and_outcomes are defined, but let's please remove them. We already have too many functions in this spirit, surely with probabilities and counts we may obtain frequencies. In fact, isn't frequencies the same as calling probabilities with RelativeAmount?

Datseris · 2023-08-18T11:50:20Z

okay, turns out source code of SampleCoverage was just wrong so I've fixed that (still no test though, feel free to write here a quick test I can add)

kahaaga · 2023-08-18T11:50:53Z

Forgot to mention:

I found what I believe to be a bug in the WaveletOverlap. When getting the coefficients for the different levels, the returned matrix had one more column than the number of transform levels.

The following point should be added to the changelog:

- WaveletOverlap now computes probabilities (relative energies) over the correct number of transform levels. Previously, the *scaling *coefficients for the max transform level were incorrectly included, as an extra set of coefficients in addition to the (correctly included) wavelet coefficients. This caused a lot of energy to be concentrated at low frequencies, even for high-frequency signals. Thus the corresponding Probabilities had an extra element which in many cases dominated the rest of the distribution.

kahaaga · 2023-08-18T11:52:12Z

okay, turns out source code of SampleCoverage was just wrong so I've fixed that (still no test though, feel free to write here a quick test I can add)

Can we just remove SampleCoverage completely? I was supposed to delete it, but forgot to do so. I am working on a generic Good-Turing probabilities estimator which will replace it. There so many literature estimators that do similar things that I think it is better to just have a single parameterised estimator for it.

kahaaga · 2023-08-18T11:52:53Z

I don't know where the functions frequencies and frequencies_and_outcomes are defined, but let's please remove them. We already have too many functions in this spirit, surely with probabilities and counts we may obtain frequencies. In fact, isn't frequencies the same as calling probabilities with RelativeAmount?

This ties in to my previous comment - we should just delete SampleCoverage completely. It is old code, and these functions no longer exist.

Datseris · 2023-08-18T11:56:12Z

ok done.

Datseris · 2023-08-18T11:57:48Z

I have to go now but I ]ll be back in 3-4 hours. Test passing is what remains to be done.

kahaaga · 2023-08-18T12:08:31Z

I have to go now but I ]ll be back in 3-4 hours. Test passing is what remains to be done.

I'll try to fix the remaining tests in the meanwhile. If tests pass when you're back, feel free to merge. I will likely be unavailable until Sunday evening CET.

kahaaga · 2023-08-18T12:18:37Z

The remaining tests fail due to SymbolicWeightedPermutation and SymbolicAmplitudeAdjustedPermutation.

The reason is the definition abstract type PermutationOutcomeSpace{m} <: CountBasedOutcomeSpace end. Only SymbolicPermutation is strictly counting based. SymbolicWeightedPermutation and SymbolicAmplitudeAdjustedPermutation don't return integer-valued histograms. They return scaled, normalized histograms (due to the weights). Thus, this subtyping won't work.

I think this is why I went with the is_counting_based function instead of using CountBasedOutcome dispatch.

I actually have a paper I'm working on where I modify these outcome spaces such that they can be counted, but it isn't included here at the moment.

Either we need an additional abstract type to dispatch on, or we need another solution.

Datseris · 2023-08-18T12:25:29Z

well the simplest solution is to explicitly extend the function is_counting_based to flse for weighted andamplitude wieghted. it isn't too bad.

kahaaga · 2023-08-18T13:03:59Z

well the simplest solution is to explicitly extend the function is_counting_based to flse for weighted andamplitude wieghted. it isn't too bad.

Perfect! I though you removed is_counting_based completely, but that is obviously not the case. Simply extending this method explicitly for SymbolicWeightedPermutation and SymbolicAmplitudeAwarePermutation works.

kahaaga · 2023-08-18T13:40:42Z

This now looks good from my side.

Datseris

oh man, what a change. but this is awesome! WIll contribute tutorial later tonight!!!!

kahaaga added 8 commits August 1, 2023 04:51

Discrete estimators and start of API

32a720a

Total rewrite for OutcomeSpace and ProbabilitiesEstimator`

3604513

Convert to SSSet if input is array

f19855f

Silently deprecate. Remove warning.

4220a6c

Fixes #283

New API for SpatialSymbolicPermutation too

c227f19

Fixes a bug too, where repeated outcomes were returned by `probabilities_and_outcomes`. This wasn't caught before, because the tests were wrong too.

Test outcome spaces

2f5552f

Update deprecations

387796c

Remove link to deleted file

ed07dcc

kahaaga changed the title ~~Probs estimators new~~ Separate OutcomeSpace and ProbabilitiesEstimator Aug 11, 2023

kahaaga mentioned this pull request Aug 11, 2023

WIP: discrete estimators #279

Closed

8 tasks

kahaaga added 7 commits August 11, 2023 15:08

Notation

7adc26a

Simplify docstring

b9c740e

"Outcome spaces" before "Probabilities"

4c8fdce

Clarify what m is

9f85ef0

Consistent symbols

6d0fac5

Update probabilities docstring

25018b5

Update summary page

a60b873

Datseris added 4 commits August 18, 2023 12:34

reorg probabilities file

98d6800

make default all = true for missing outcomes

a3d69db

non-duplicate estimation of L in missing outcomes

6086d2b

seprate counts core into new file

4f272ef

Datseris force-pushed the probs_estimators_new branch from 833cb06 to 4f272ef Compare August 18, 2023 11:38

simplify internal encoded_space_cardinality function

c0415ac

fix Sample Coverage

daf84f0

correct usage of verify_counting_based

5171d27

Datseris added 2 commits August 18, 2023 12:55

typo in Shrinkage docs

5271352

delete samplecoverage

029242e

Test file has changed name

02db322

kahaaga added 2 commits August 18, 2023 15:13

Fix tests for weighted permutation based outcome spaces

a186729

Use correct terminology in OutcomeSpace docstrings

16950d2

Datseris approved these changes Aug 18, 2023

View reviewed changes

Datseris merged commit 9d6030b into main Aug 18, 2023
4 checks passed

Datseris deleted the probs_estimators_new branch August 18, 2023 13:57

kahaaga mentioned this pull request Aug 19, 2023

Add changelog entry for WaveletOverlap bug fix for next release #292

Closed

Datseris mentioned this pull request Aug 20, 2023

Clean up source code of probabilities: should just dispatch on count; see discussion #297

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate `OutcomeSpace` and `ProbabilitiesEstimator` #285

Separate `OutcomeSpace` and `ProbabilitiesEstimator` #285

kahaaga commented Aug 11, 2023 •

edited

Loading

codecov bot commented Aug 11, 2023 •

edited

Loading

kahaaga commented Aug 11, 2023 •

edited by Datseris

Loading

Datseris commented Aug 11, 2023

Datseris commented Aug 11, 2023

kahaaga commented Aug 11, 2023 •

edited

Loading

Datseris commented Aug 11, 2023

kahaaga commented Aug 11, 2023 •

edited

Loading

Datseris commented Aug 11, 2023

kahaaga commented Aug 12, 2023 •

edited

Loading

Datseris commented Aug 12, 2023

kahaaga commented Aug 12, 2023 •

edited

Loading

Datseris commented Aug 12, 2023

Datseris commented Aug 12, 2023

Datseris commented Aug 18, 2023

Datseris commented Aug 18, 2023

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

Datseris commented Aug 18, 2023

Datseris commented Aug 18, 2023

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

Datseris commented Aug 18, 2023 •

edited

Loading

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

Datseris left a comment

Separate OutcomeSpace and ProbabilitiesEstimator #285

Separate OutcomeSpace and ProbabilitiesEstimator #285

Conversation

kahaaga commented Aug 11, 2023 • edited Loading

Changes

Examples (comparing old and new API)

TODO:

codecov bot commented Aug 11, 2023 • edited Loading

Codecov Report

kahaaga commented Aug 11, 2023 • edited by Datseris Loading

Datseris commented Aug 11, 2023

probabilities and counts source

About information

Minor comments

Datseris commented Aug 11, 2023

kahaaga commented Aug 11, 2023 • edited Loading

Datseris commented Aug 11, 2023

kahaaga commented Aug 11, 2023 • edited Loading

Datseris commented Aug 11, 2023

kahaaga commented Aug 12, 2023 • edited Loading

Datseris commented Aug 12, 2023

kahaaga commented Aug 12, 2023 • edited Loading

Datseris commented Aug 12, 2023

Datseris commented Aug 12, 2023

Datseris commented Aug 18, 2023

Datseris commented Aug 18, 2023

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

Datseris commented Aug 18, 2023

Datseris commented Aug 18, 2023

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

Datseris commented Aug 18, 2023 • edited Loading

kahaaga commented Aug 18, 2023

kahaaga commented Aug 18, 2023

Datseris left a comment

Choose a reason for hiding this comment

Separate `OutcomeSpace` and `ProbabilitiesEstimator` #285

Separate `OutcomeSpace` and `ProbabilitiesEstimator` #285

kahaaga commented Aug 11, 2023 •

edited

Loading

codecov bot commented Aug 11, 2023 •

edited

Loading

kahaaga commented Aug 11, 2023 •

edited by Datseris

Loading

`probabilities` and `counts` source

About `information`

kahaaga commented Aug 11, 2023 •

edited

Loading

kahaaga commented Aug 11, 2023 •

edited

Loading

kahaaga commented Aug 12, 2023 •

edited

Loading

kahaaga commented Aug 12, 2023 •

edited

Loading

Datseris commented Aug 18, 2023 •

edited

Loading