Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate OutcomeSpace and ProbabilitiesEstimator #285

Merged
merged 74 commits into from
Aug 18, 2023
Merged

Conversation

kahaaga
Copy link
Member

@kahaaga kahaaga commented Aug 11, 2023

This PR supersedes #279. Due to type renaming and folder structure changes, the diff is huge. The main changes are summarised below. I may have forgotten some things in this list. To be updated.

This is still WIP, but the API changes can be reviewed. Changes are best viewed in the updated docs.

Changes

I've been experimenting with a few different solutions to #280 and #237. This PR provides a combined solution to both by redesigning the probabilities-based code. Changes are best understood by a few examples:

Examples (comparing old and new API)

Probabilities can now be estimated either 1) using an OutcomeSpace directly (which uses MLE estimation by default, and is equivalent to how probabilities were estimated before), or 2) using a dedicated ProbabilitiesEstimator, which computes probabilities in some fancy way from raw counts. The probabilities estimator always takes the OutcomeSpace as the first input.

# Old example
x = rand(100)
o = SymbolicPermutation(m = 3) # an ordinal pattern based outcome space
ps = probabilities(o, x) # old syntax

# New 
ps = probabilities(MLE(o), x) # equivalent to the above, so non-breaking
ps_bayes = probabilities(Bayes(o), x) # new `Bayes` probabilities
ps_shrink = probabilities(Shrinkage(o), x) # new `Shrinkage` probabilities estimator

This also works with information and friends. We can now arbitrarily combine any DiscreteInfoEstimator with any ProbabilitiesEstimator, which enables a bunch of new estimators, many of which have not been seen in the literature before.

x = rand(100)
o = SymbolicPermutation(m = 3)
# Using a definition and and outcome space (old syntax) is equivalent to using the `PlugIn` and `MLE` estimators.
h_s = information(Shannon(), o, x) #  old 
h_s = information(PlugIn(Shannon()), o, x) # new. equivalent to the above, so non-breaking
h_s = information(PlugIn(Shannon()), MLE(o), x) # new, equivalent to the above, so non-breaking

# We can now combine any `DiscreteInfoEstimator` with any `ProbabilitiesEstimator` to compute entropies
h_sb = information(PlugIn(Shannon()), Bayes(o), x)
h_ss = information(MillerMadow(Shannon()), Shrinkage(o), x)
h_rs = information(Jackknife(Renyi()), Shrinkage(o), x)

# The generic `PlugIn` and `Jackknife` estimator can of course be used with any measure
jr_jack_mle = information(PlugIn(RenyiExtropy()), MLE(o), x)
jt_jack_shrinkage = information(Jackknife(TsallisExtropy()), Shrinkage(o), x)

Naming:

  • I'm happy to just keep the names current proposal, since it doesn't really matter for the end user. They only use concrete types, not the abstract ones, so it doesn't really matter what the abstract types are called, as long as they are not misleading. Dealing with Was information measure a mistake? #284 should, if done, be performed after this PR is merged.

Codebase:

  • ProbabilitiesEstimator becomes OutcomeSpace. The OutcomeSpace defines the possible outcomes, and defines how encoding/discretization of input data is done.

  • ProbabilitiesEstimator still exists, but has the role of converting outcome counts (or pseudo-counts, in the case of e.g. WaveletOverlap, PowerSpectrum or TransferOperator) to probabilities. The basic estimator is MLE (maximum likelihood estimator), which estimates probabilities using relative counts. We also now have the Bayes and Shrinkage probabilities estimators, which through parameterisation covers a lot of literature estimators.

  • The transition is done seamlessly by introducing the functions counts_and_outcomes/allcounts_and_outcomes, which are completely analogous to probabilities_and_outcomes/allprobabilities_and_outcomes, except they return raw counts instead of probabilities.

  • This transition is necessary because I realized that probabilities are not the fundamental quantity we though. It is actually the counts of outcomes that are the fundamental quantity, because these define how probabilities are estimated, using some ProbabilitiesEstimator. Not all OutcomeSpaces are counting-compatible (e.g. WaveletOverlap), but in this implementation we can still use these OutcomeSpaces directly with the naive MLE probabilities. The justification is that they still return some form of "normalized relative counts", so it fits with the MLE estimator, even though they are not strictly counts. This avoids breaking any code, and the design choice is documented.

  • The separation into OutcomeSpace and ProbabilitiesEstimator is backwards-compatible, because using an OutcomeSpace directly is equivalent to using the MLE estimator.

Documentation

  • Moved the Encoding API to under "Probabilities" in the docs
  • Re-arranged a bunch of content, so that it fits more logically in with the new API.
  • I've rewritten the docstrings a bit, to better highlight how the different components come together. These descriptions are up for discussion, but I'm pretty happy with them as is.

TODO:

  • Make probabilities/probabilities_and_outcomes generic. There's no need to specialize methods unless something unique happens for a particular OutcomeSpace.
  • Add a dedicated Renyi discrete estimator
  • Add a sample-coverage based ProbabilitiesEstimator (the Good-Turing correction)
  • Ensure test coverage doesn't decrease.
  • Thoroughly read docs and make sure they make sense and are updated.
  • Check all doc links.

@kahaaga kahaaga changed the title Probs estimators new Separate OutcomeSpace and ProbabilitiesEstimator Aug 11, 2023
@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Merging #285 (16950d2) into main (68fc7fa) will increase coverage by 0.52%.
The diff coverage is 90.71%.

@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
+ Coverage   85.88%   86.40%   +0.52%     
==========================================
  Files          57       71      +14     
  Lines        1438     1751     +313     
==========================================
+ Hits         1235     1513     +278     
- Misses        203      238      +35     
Files Changed Coverage Δ
src/ComplexityMeasures.jl 100.00% <ø> (ø)
src/complexity_measures/statistical_complexity.jl 100.00% <ø> (ø)
src/core/information_measures.jl 83.33% <0.00%> (-10.00%) ⬇️
...rc/encoding_implementations/rectangular_binning.jl 91.56% <0.00%> (+0.10%) ⬆️
src/outcome_spaces/spatial/utils.jl 94.11% <ø> (ø)
...rc/outcome_spaces/transfer_operator/GroupSlices.jl 66.66% <ø> (ø)
src/outcome_spaces/symbolic_permutation.jl 81.94% <66.66%> (ø)
src/outcome_spaces/value_histogram.jl 76.19% <72.72%> (ø)
.../spatial_permutation/SpatialSymbolicPermutation.jl 82.69% <82.60%> (ø)
src/discrete_info_estimators/schurmann.jl 85.00% <85.00%> (ø)
... and 24 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@kahaaga
Copy link
Member Author

kahaaga commented Aug 11, 2023

It will take me a few days to finalize the remaining estimators I have for this PR, so if you have any feedback @Datseris, please give it, and I will try to incorporate it in the meanwhile.

@kahaaga kahaaga mentioned this pull request Aug 11, 2023
8 tasks
@Datseris
Copy link
Member

Hi @kahaaga ,

thanks a lot for the PR. Thanks for taking the time to write an extensive PR description that outlines and motivates major changes. I can't stress enough how extremely helpful this description is to review the PR.

I went through the changes and the new docs. I have not went through the source code yet, will do only when the PR is final.

I think this is incredible and I accept all changes (minor comments only).

When merged, I will make a high level tutorial just like I have done for Agents.jl and DynamicalSystems.jl. This library is now so extensive that it really needs such a tutorial.

probabilities and counts source

How does probabilities work now? What is its source code? Is it a generic function or is it outcome space dependent? What does the counts function do for outcome spaces that don't have counts? I guess I can get the answers to these questions myself once I read the source, but you already have the answers so might as well give the mto me.

About information

I think how to tackle #284 is simple. information and the whole baggage stays as is. Howewver, it is documented in the Dev Docs page. The Entropy page remains as is in currect stable release and replaces the Information page and measures. This allows us to have soemthing simple for users, yet extendable for developers or advanced users. The information interface will be mentioned in the entropy docstring of course. Extropy becoems a subsection in the Entropy docs.

The source code is not affected by this : highlevel functions entropy, extropy just forward to information.

Do we agree on this? It keeps all your code as is and only changes the way we document it and expose it to users.

Minor comments

  • Do we really need allprobabilities_and_outcomes? and same for allcounts_and_outcomes? I would vote no.

  • Docstring of OutcomeSpace: say "rules for mapping input data to each outcome" (no parenthesis on dimension)

  • Same docstring: "outcome space model" should be just "outcome space".

  • MLE should be renamed to RelativeCounts or something similar and not use abbtreviations. In its formula I would aruge to also rename f_k to n_k.

  • In general I would argue to rename f to n when we are referring to counts, e.g., in the docstring of counts_and_outcomes.

  • in the Probabilities docpage, the OutComeSpace section should be before the probability estimators.

  • There was a significant drop of test coverage in this PR, I hope we can recover it, if not in this PR then in the near future with a new PR.

  • In Bayes estimator docstring, you should say "if used with allprobablitis, then length(a) == total_outcomes(o, x) instead of sing length(outocmes(o, x)).

  • It is not clear what m is in the Bayes estimator in the Assumptions section. Is it Ω?

  • Mathematical symbol m is also used in Shrinkage but not clear what it is . Let's try to keep symbol variability to absolute minimum. Is m the length of Ω? In OutcomeSpace this is denoted as L. Let's make it explicit by deciding L to be the symbol. In general, let's use N as data length, L as outcome space cardinality, and D as dimensionality of input data. Latter two follow the general DynamicalSystems.jl convention.

@Datseris
Copy link
Member

in general please be careful with symbols. I see eg in counts_and_outcomes source code "Ωᵢ ∈ Ω" but we consistently have used small ωι for the elements of Ω in other places. Let's stick with consistency as much as possible.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 11, 2023

I'll address when I have more time, but for now:

When merged, I will make a high level tutorial just like I have done for Agents.jl and DynamicalSystems.jl. This library is now so extensive that it really needs such a tutorial.

Agreed.

I think how to tackle #284 is simple. information and the whole baggage stays as is. Howewver, it is documented in the Dev Docs page. The Entropy page remains as is in currect stable release and replaces the Information page and measures. This allows us to have soemthing simple for users, yet extendable for developers or advanced users. The information interface will be mentioned in the entropy docstring of course. Extropy becoems a subsection in the Entropy docs.
The source code is not affected by this : highlevel functions entropy, extropy just forward to information.
Do we agree on this? It keeps all your code as is and only changes the way we document it and expose it to user

Yes, agreed.

How does probabilities work now?

probabilities/probabilities_and_outcomes is now a generic wrapper to counts/counts_and_outcomes (which is outcome space specific). The wrapper simply does

function probabilities_and_outcomes(est::ProbabilitiesEstimator{<:OutcomeSpace}, x)
    cts, outs = counts_and_outcomes(est.outcomespace, x) 
    return Probabilities(cts), out
end

Currently, the function is explicitly implemented for each outcome space, but we can make it generic. I'll add this to the TODO-list.

What does the counts function do for outcome spaces that don't have counts?

Outcome spaces that don't have counts simply don't implement counts/counts_and_outcomes. They implement probabilities/probabilities_and_outcomes directly. As an example, here's the source code for WaveletOverlap:

# MLE estimation on "pseudo-counts"
function probabilities_and_outcomes(est::WaveletOverlap, x)
    x isa AbstractVector{<:Real} || error("`WaveletOverlap` only works for timeseries input!")
    freqs = time_scale_density(x, est.wl) # relative energies ("pseudo-counts")
    return Probabilities(freqs), 1:length(freqs)
end

This way, we just consider all "relative counts"/"relative energies"/"relative whatever" as pre-computed probabilities (at least after normalization). What it means in practice is that count-based outcome spaces will work with any ProbabilitiesEstimator, while those not based on counting will just work with MLE (or whatever we decide to name it).

@Datseris
Copy link
Member

let me tackle the information change in the same PR as the high level tutorial (after this is merged).

@kahaaga
Copy link
Member Author

kahaaga commented Aug 11, 2023

let me tackle the information change in the same PR as the high level tutorial (after this is merged).

So I keep the code as is for now on a macro-level, and we deal with #284 after?

It is worth noting that entropy(::DiscreteInfoEstimator, ::ProbabilitiesEstimator, x) already works (due to deprecations), it's just not documented anywhere.

@Datseris
Copy link
Member

So I keep the code as is for now on a macro-level, and we deal with #284 after?

yes/

@kahaaga
Copy link
Member Author

kahaaga commented Aug 12, 2023

MLE should be renamed to RelativeCounts or something similar and not use abbreviations.

In the literature, this procedure is almost exclusively referred to as the "maximum likelihood" estimator. Maybe MaxLikelihood/MaximumLikelihood? There are good theoretical reasons for calling it this, related to the multinomial distribution (which is what connects counts and probabilities, and leads to the estimator). But perhaps we should avoid it, since we use it for counts and relative quantities that are not actually counts.

RelativeCounts is also ok for the counting-based outcome spaces, but it doesn't quite cover the non-counting-based outcome spaces. Maybe just Relative, to cover both relative frequencies/energies/whatever and relative counts?

Edit: maybe Empirical could work too?

@Datseris
Copy link
Member

Hm, I am not sure I like Empirical; aren't all estimators empirical in some sense? And Relative is too generic of a word. On the other hand, isn't MaxLikelihood something that we also have with entropies? There we called it PlugIn though, I am not sure why.

I think I like RelativeAmount the most: it is not as abstract as Relative and is not limited to only ount-based.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 12, 2023

On the other hand, isn't MaxLikelihood something that we also have with entropies? There we called it PlugIn though, I am not sure why.

Yes, so the issue is that maximum likelihood estimation is a generic principle that can be applied to many things. So there is a maximum likelihood estimator of entropy, and there is a maximum likelihood estimator of probabilities, and for many other quantities. So in principle, we could use the same name at both levels (this would work, because the type parameterization is generic), but I think that would be confusing from an educational perspective.

I think I like RelativeAmount the most: it is not as abstract as Relative and is not limited to only count-based.

I'm not totally happy withRelativeAmount either, but I can't come up with anything better myself, so I think RelativeAmount is fine. In its docstring state that for counting-based outcome spaces, the estimator is the maximum likelihood estimator. We also document that for other outcome spaces, it means pre-normalized quantities. Sounds good?

@Datseris
Copy link
Member

i don't know what you mean by "pre-normalized". The quantities are just normalized when turned into probabilities. What does the "pre" convey?

@Datseris
Copy link
Member

If we made a generic MaxLikelihood type, how would the code work for using the same type for both entropy and probabilities?

@Datseris
Copy link
Member

I don't know where the functions frequencies and frequencies_and_outcomes are defined, but let's please remove them. We already have too many functions in this spirit, surely with probabilities and counts we may obtain frequencies. In fact, isn't frequencies the same as calling probabilities with RelativeAmount?

@Datseris
Copy link
Member

okay, turns out source code of SampleCoverage was just wrong so I've fixed that (still no test though, feel free to write here a quick test I can add)

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

Forgot to mention:

I found what I believe to be a bug in the WaveletOverlap. When getting the coefficients for the different levels, the returned matrix had one more column than the number of transform levels.

The following point should be added to the changelog:

- WaveletOverlap now computes probabilities (relative energies) over the correct number of transform levels. Previously, the *scaling *coefficients for the max transform level were incorrectly included, as an extra set of coefficients in addition to the (correctly included) wavelet coefficients. This caused a lot of energy to be concentrated at low frequencies, even for high-frequency signals. Thus the corresponding Probabilities had an extra element which in many cases dominated the rest of the distribution.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

okay, turns out source code of SampleCoverage was just wrong so I've fixed that (still no test though, feel free to write here a quick test I can add)

Can we just remove SampleCoverage completely? I was supposed to delete it, but forgot to do so. I am working on a generic Good-Turing probabilities estimator which will replace it. There so many literature estimators that do similar things that I think it is better to just have a single parameterised estimator for it.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

I don't know where the functions frequencies and frequencies_and_outcomes are defined, but let's please remove them. We already have too many functions in this spirit, surely with probabilities and counts we may obtain frequencies. In fact, isn't frequencies the same as calling probabilities with RelativeAmount?

This ties in to my previous comment - we should just delete SampleCoverage completely. It is old code, and these functions no longer exist.

@Datseris
Copy link
Member

ok done.

@Datseris
Copy link
Member

I have to go now but I ]ll be back in 3-4 hours. Test passing is what remains to be done.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

I have to go now but I ]ll be back in 3-4 hours. Test passing is what remains to be done.

I'll try to fix the remaining tests in the meanwhile. If tests pass when you're back, feel free to merge. I will likely be unavailable until Sunday evening CET.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

The remaining tests fail due to SymbolicWeightedPermutation and SymbolicAmplitudeAdjustedPermutation.

The reason is the definition abstract type PermutationOutcomeSpace{m} <: CountBasedOutcomeSpace end. Only SymbolicPermutation is strictly counting based. SymbolicWeightedPermutation and SymbolicAmplitudeAdjustedPermutation don't return integer-valued histograms. They return scaled, normalized histograms (due to the weights). Thus, this subtyping won't work.

I think this is why I went with the is_counting_based function instead of using CountBasedOutcome dispatch.

I actually have a paper I'm working on where I modify these outcome spaces such that they can be counted, but it isn't included here at the moment.

Either we need an additional abstract type to dispatch on, or we need another solution.

@Datseris
Copy link
Member

Datseris commented Aug 18, 2023

well the simplest solution is to explicitly extend the function is_counting_based to flse for weighted andamplitude wieghted. it isn't too bad.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

well the simplest solution is to explicitly extend the function is_counting_based to flse for weighted andamplitude wieghted. it isn't too bad.

Perfect! I though you removed is_counting_based completely, but that is obviously not the case. Simply extending this method explicitly for SymbolicWeightedPermutation and SymbolicAmplitudeAwarePermutation works.

@kahaaga
Copy link
Member Author

kahaaga commented Aug 18, 2023

This now looks good from my side.

Copy link
Member

@Datseris Datseris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh man, what a change. but this is awesome! WIll contribute tutorial later tonight!!!!

@Datseris Datseris merged commit 9d6030b into main Aug 18, 2023
4 checks passed
@Datseris Datseris deleted the probs_estimators_new branch August 18, 2023 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants