-
Notifications
You must be signed in to change notification settings - Fork 392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify categorical variables in metadata #120
Conversation
import org.scalatest.Matchers | ||
import org.scalatest.junit.JUnitRunner | ||
|
||
object AttributeTestUtils extends Matchers{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- better use a trait - it will be cleaner and simpler to use.
- make return type the
Assertion
here it is:
trait AttributeAsserts {
self: Matchers =>
final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = ???
}
then mixin it like this:
@RunWith(classOf[JUnitRunner])
class BinaryVectorizerTest extends OpTransformerSpec[OPVector, BinaryVectorizer] with AttributeAsserts { ... }
@@ -72,6 +75,8 @@ class OpVectorMetadata private | |||
newColumns: Array[OpVectorColumnMetadata] | |||
): OpVectorMetadata = OpVectorMetadata(name, newColumns, history) | |||
|
|||
val textTypes = Seq(MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextMap, Binary, BinaryMap, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- how is
Binary
andBinaryMap
are testTypes? - also instead please use
FeatureType.shortTypeName[Text]
etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want the package name as well. E.g com.salesforce.features.types. MultiPickList
As it is specified in OpVectorColumnMetadata.parentFeatureType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then FeatureType.typeName
@@ -31,15 +31,18 @@ | |||
package com.salesforce.op.utils.spark | |||
|
|||
import com.salesforce.op.FeatureHistory | |||
import com.salesforce.op.features.types.{Binary, BinaryMap, MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextList, TextMap} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import com.salesforce.op.features.types._
Codecov Report
@@ Coverage Diff @@
## master #120 +/- ##
==========================================
+ Coverage 86.18% 86.21% +0.02%
==========================================
Files 297 297
Lines 9670 9676 +6
Branches 334 539 +205
==========================================
+ Hits 8334 8342 +8
+ Misses 1336 1334 -2
Continue to review full report at Codecov.
|
* @param expectedNominal Expected array of booleans. True if the field is nominal, false if not. | ||
*/ | ||
final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = { | ||
val attributes = AttributeGroup.fromStructField(schema).attributes.get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid .get
, rather do attributes.map(_.map(_.isNominal)) shouldBe Some(expectedNominal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, that is a lot of updated tests! My only real question is if those are the only types we want to make sure are flagged. Also text is not actually a binary attribute by default it is a count of occurrences of the hash. Is there another attribute group type for counts? or should they be numeric?
val categoricalTypes = Seq(FeatureType.typeName[MultiPickList], FeatureType.typeName[MultiPickListMap], | ||
FeatureType.typeName[Text], FeatureType.typeName[TextArea], FeatureType.typeName[TextAreaMap], | ||
FeatureType.typeName[TextMap], FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap], | ||
FeatureType.typeName[TextList]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
picklist? Combo box? country, state, city, id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah If it is only hashing + count, let's remove all these Text types. Do we only do hashing to Combo box, country, state, city, id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do pivot - so they should be picked up automatically. I think we also do pivot on multiPickList. So you may want to remove the categorical types check completely and only rely on the indicatorValue
.map { case (_, g) => g.head -> g.map(_.index) } | ||
val colMeta = colData.map { case (c, i) => | ||
c.toMetadata(i) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit unnecessary new line hanging => is bad
.putMetadataArray(OpVectorMetadata.ColumnsKey, colMeta.toArray) | ||
.putMetadata(OpVectorMetadata.HistoryKey, FeatureHistory.toMetadata(history)) | ||
.build() | ||
val attributes = columns.map { c => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val attributes = columns.map {
case c if c.indicatorValue.isDefined || categoricalTypes.exists(c.parentFeatureType.contains) =>
BinaryAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
case c =>
NumericAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@michaelweilsalesforce please update the description of the PR to clarify the reasons and the solution. |
Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this. |
Related issues
The one-hot encoding provided by Transmogrify doesn't specify that the created columns are categorical. As a consequence, Decision Tree and Random Forest will treat these columns as numerics.
Describe the proposed solution
For each feature engineering transformation that creates categorical columns, we add the
Binary
attribute to their metadata.