Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting behavior: What to do with elements that occurred before the predefined splitting bins? #267

Open
bockthom opened this issue Jul 31, 2024 · 1 comment

Comments

@bockthom
Copy link
Collaborator

bockthom commented Jul 31, 2024

This issue has been detected by @MaLoefUDS in PR #263. In the following, I quote two comments that @MaLoefUDS had posted in PR #263:

Original posts by @MaLoefUDS:


[...] I have some nice insight on the issue regarding the splitting in showcase.R that "breaks" which @Leo-Send found here:

When testing without commit interactions in showcase and setting 'browser()' statements in the function 'split.data.by.time.or.bins', I found that the 'Re-arranging data' part around line 1025 sets the content of 'data.split' to NULL. It does this regardless of the commit interactions. Is this intended?

First of all, this does have nothing to do with the selected data source or if commit interactions are present (as Leo already mentioned). Interestingly this broken application has been in showcase.R for a really long time. I guess nobody really cared because the only use of the split data is plotting it once, which does not break directly.

The reason for the break is that for some data sources, some of their elements are before the predefined bins by which we split. When we split by given bins, we use the base::findInterval method to separate all data elements into the corresponding bins they belong, i.e., it returns a vector of integers, of which each integer describes the index of the bin in which we should place an item (e.g., 1 1 1 2 2 3 3 3 3 3 6). Then we split according to this vector and rename all splits according to the datestring of the bin.

However, the findInterval function returns 0 for datapoints that do not lie in any bin (i.e., before the first bin). Consequently, when trying to find the name for said split, we index into the vector of bin-labels with 0 resulting in character(0). This in the end leads to the breakage observed. This might actually require a fix from our side. I believe, spitting should not break like this when some elements are not in any bin.

Originally posted by @MaLoefUDS in #263 (comment)


Regarding fixing this problem of broken splitting which may or may not be the sole cause of Leo's problems, I have two possible fixes in mind.

But first of: Currently we do not have any problems with data elements that lie beyond the last bin. The bins we provide are always just the first date bound. The next following bin is then the last date bound, while also being the first date of its own new bin. In the current implementation, the last bin is then holds all elements from this point on until infinity. Obviously, we cannot simply include all elements before the first bin into the first bin as well, as this would lead to unintuitive behavior (i.e., what would it mean to provide only one bin when splitting?).

Here are my ideas for fixes:

  1. Ignore all elements before the first bin and disregard them. We can simply filter the bin assignment vector and filter out all 0s.
  2. Create an artificial bin before the first actual bin with some made up name (?) that holds all elements before the first bin. This is also easy to implement as we can just rename the bin of character(0) to whatever.

I don't have any preference on this, since I am not sure which idiomatic values of coronet are at stake 😅

Originally posted by @MaLoefUDS in #263 (comment)

@bockthom
Copy link
Collaborator Author

We need to figure out in which situations the elements before the first bin are actually needed. This is the key question to this problem, and I don't have an answer to this question. If there are no such situations, we could potentially remove them. Otherwise we need to find a way to keep them. But first we need to figure out whether there is any use case in which the elements before the first bin are relevant.

Originally posted by @bockthom in #263 (comment)

@bockthom bockthom added this to the v4.5 milestone Jul 31, 2024
@bockthom bockthom mentioned this issue Jul 31, 2024
7 tasks
@bockthom bockthom changed the title Splitting behavior: What to do with elements before the predefined splitting bins? Splitting behavior: What to do with elements that occurred before the predefined splitting bins? Jul 31, 2024
MaLoefUDS added a commit to MaLoefUDS/coronet that referenced this issue Aug 27, 2024
This change is in-line with the current treatment of the elements that
occur after the last bin.

Further, this fixes a minor "bug" where bin indecees are used to
determine bin labels by indexing into a list of labels. Elements before
the first bin received bin index 0, however, R is 1-indexed. Note: This
"bug" did not cause problems because the bin with the incorrect label
will be ignored later anyways.

Works towards se-sic#267.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
MaLoefUDS added a commit to MaLoefUDS/coronet that referenced this issue Aug 27, 2024
This change is in-line with the current treatment of the elements that
occur after the last bin.

Further, this fixes a minor "bug" where bin indecees are used to
determine bin labels by indexing into a list of labels. Elements before
the first bin received bin index 0, however, R is 1-indexed. Note: This
"bug" did not cause problems because the bin with the incorrect label
will be ignored later anyways.

Works towards se-sic#267.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
MaLoefUDS added a commit to MaLoefUDS/coronet that referenced this issue Sep 2, 2024
This change is in-line with the current treatment of the elements that
occur after the last bin.

Further, this fixes a minor "bug" where bin indecees are used to
determine bin labels by indexing into a list of labels. Elements before
the first bin received bin index 0, however, R is 1-indexed. Note: This
"bug" did not cause problems because the bin with the incorrect label
will be ignored later anyways.

Works towards se-sic#267.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
MaLoefUDS added a commit to MaLoefUDS/coronet that referenced this issue Sep 10, 2024
This change is in-line with the current treatment of the elements that
occur after the last bin.

Further, this fixes a minor "bug" where bin indecees are used to
determine bin labels by indexing into a list of labels. Elements before
the first bin received bin index 0, however, R is 1-indexed. Note: This
"bug" did not cause problems because the bin with the incorrect label
will be ignored later anyways.

Works towards se-sic#267.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
MaLoefUDS added a commit to MaLoefUDS/coronet that referenced this issue Sep 10, 2024
This change is in-line with the current treatment of the elements that
occur after the last bin.

Further, this fixes a minor "bug" where bin indecees are used to
determine bin labels by indexing into a list of labels. Elements before
the first bin received bin index 0, however, R is 1-indexed. Note: This
"bug" did not cause problems because the bin with the incorrect label
will be ignored later anyways.

Works towards se-sic#267.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants