-
Notifications
You must be signed in to change notification settings - Fork 392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added value to OpVectorColumnMetadata for numeric column descriptors #89
Conversation
I propose to keep the name |
Existing models will not load with this release anyway because of the model selector change |
@@ -75,6 +79,9 @@ case class OpVectorColumnMetadata | |||
assert(parentFeatureName.length == parentFeatureType.length, | |||
s"must provide both type and name for every parent feature," + | |||
s" names: $parentFeatureName and types: $parentFeatureType do not have the same length") | |||
assert(indicatorValue.isEmpty || descriptorValue.isEmpty, "cannot have both an indicatorValue and a," + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cannot have both an indicatorValue and descriptorValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are making breaking changes anyways, I keep thinking that a base trait + two specific classes makes even more sense. This way we would avoid asserts like this ^
trait OpVectorColumnMetadata
(
def parentFeatureName: Seq[String]
def parentFeatureType: Seq[String]
def grouping: Option[String]
def index: Int
)
case class CategoricalVectorColumnMetadata(
parentFeatureName: Seq[String],
parentFeatureType: Seq[String],
grouping: Option[String],
indicatorValue: Option[String],
index: Int = 0
) extends OpVectorColumnMetadata
case class ContinuousVectorColumnMetadata(
parentFeatureName: Seq[String],
parentFeatureType: Seq[String],
grouping: Option[String],
descriptorValue: Option[String]
index: Int = 0
) extends OpVectorColumnMetadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would break our internal usage - look at sanity checker @tovbinm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make a separate ticket do do this as part of metadata restructuring
@Jauntbox thoughts on indicatorGroup versus grouping? |
Codecov Report
@@ Coverage Diff @@
## master #89 +/- ##
==========================================
+ Coverage 85.47% 85.98% +0.51%
==========================================
Files 294 294
Lines 8792 9555 +763
Branches 309 533 +224
==========================================
+ Hits 7515 8216 +701
- Misses 1277 1339 +62
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #89 +/- ##
==========================================
+ Coverage 85.62% 86.11% +0.49%
==========================================
Files 292 292
Lines 8763 9490 +727
Branches 306 316 +10
==========================================
+ Hits 7503 8172 +669
- Misses 1260 1318 +58
Continue to review full report at Codecov.
|
I don't think we should continue using The only case I have questions about now is how we would deal with cases where we may split a single feature into multiple groupings. I can't think of any examples where we'd want multiple indicator groups or multiple numeric groups, but we might have some in the future. One way is to generate multiple different separate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -465,10 +465,10 @@ class SanityChecker(uid: String = UID[SanityChecker]) | |||
|
|||
// Only perform Cramer's V calculation on columns that have an indicatorGroup and indicatorValue defined (right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update comment to match new name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leahmcguire I think this is the only one left. try grepping the codebase indicatorGroup
@@ -319,23 +319,6 @@ case object SanityCheckerSummary { | |||
) | |||
} | |||
|
|||
@deprecated("CategoricalStats replaced by Array[CategoricalGroupStats]", "3.3.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yay!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor comment left, lgtm!! let's create an issue to refactor for specific vector metadata classes later then.
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
Related issues
We use the indicatorGroup and indicatorValue in OpVectorColumnMetadata not being empty to decide when we need to run sanity checker categorical tests in metadata. This is a problem for cases where we have information about a derived field that we want to represent but the field is not a boolean indicator (we have no place to put them). We have hacked around this by just putting this information in the indicatorValue anyway and not having anything in the indicator group. However this doesn't work if we have the same transformation for maps (which we need to add for feature complete).
Describe the proposed solution
Right now our column metadata is structured like this
we will add a value to hold numeric column name information and rename indicatorGroup to grouping to make it's usage more clear.
Describe alternatives you've considered
put in an check for parent feature type to exclude specific types from the sanity checker categorical tests