[ML] adds new feature_processors field for data frame analytics #60528

Merged
merged 15 commits into elastic:master from feature/ml-dfa-add-processors
Aug 14, 2020

benwtrent
Member

feature_processors allow users to create custom features from
individual document fields.

These feature_processors are the same object as the trained model's pre_processors.

They are passed to the native process and the native process then appends them to the
pre_processor array in the inference model.

closes #59327
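
To make the idea of "custom features from individual document fields" concrete, here is a minimal sketch of a one-hot style processor. It is not the Elasticsearch implementation: `FeatureProcessor`, `OneHot`, and the `color` field are hypothetical names, and the interface only mirrors the `process(inputs)` / `outputFields()` shape seen in the review snippets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OneHotSketch {

    // Hypothetical stand-in for a pre-processor: reads one field, writes derived features.
    interface FeatureProcessor {
        List<String> outputFields();

        void process(Map<String, Object> fields);
    }

    static final class OneHot implements FeatureProcessor {
        private final String field;
        private final Map<String, String> hotMap; // field value -> output feature name

        OneHot(String field, Map<String, String> hotMap) {
            this.field = field;
            this.hotMap = new LinkedHashMap<>(hotMap); // keep a stable output order
        }

        @Override
        public List<String> outputFields() {
            return new ArrayList<>(hotMap.values());
        }

        @Override
        public void process(Map<String, Object> fields) {
            Object value = fields.get(field);
            if (value == null) {
                return; // nothing to encode when the source field is missing
            }
            for (Map.Entry<String, String> entry : hotMap.entrySet()) {
                // 1 for the matching category, 0 for every other category
                fields.put(entry.getValue(), entry.getKey().equals(value.toString()) ? 1 : 0);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> hotMap = new LinkedHashMap<>();
        hotMap.put("red", "color_red");
        hotMap.put("blue", "color_blue");
        FeatureProcessor processor = new OneHot("color", hotMap);

        Map<String, Object> doc = new HashMap<>();
        doc.put("color", "red");
        processor.process(doc);

        System.out.println(doc.get("color_red"));  // prints 1
        System.out.println(doc.get("color_blue")); // prints 0
    }
}
```

The processor writes its outputs back into the same map it reads from, which is one simple way to let several processors run over the same document in sequence.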

@benwtrent
Member Author

Need to write tests still, but the overall design is hammered out.

@benwtrent benwtrent force-pushed the feature/ml-dfa-add-processors branch 3 times, most recently from bb3d527 to 53730cd Compare August 4, 2020 13:46
@benwtrent benwtrent force-pushed the feature/ml-dfa-add-processors branch from 53730cd to 497c7a8 Compare August 4, 2020 14:11
@benwtrent benwtrent marked this pull request as ready for review August 4, 2020 14:17
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@dimitris-athanasiou left a comment
Contributor

Looks good! Just need to work through a few minor comments and some simplifications if possible.

@dimitris-athanasiou
Contributor

Almost there! I think the last bit missing is covering the changes in ExtractedFieldsDetector with unit tests in ExtractedFieldsDetectorTests.

@benwtrent
Member Author

@elasticmachine update branch

    Set<String> duplicatedFields = new HashSet<>();
    for (ProcessedField processedField : processedFields) {
        for (String output : processedField.getOutputFieldNames()) {
            if(processedFeatures.add(output) == false) {
Contributor

nit: space after if
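
For readers skimming the thread, the duplicate check in the snippet above relies on `Set.add` returning `false` when the element is already present. A self-contained sketch of that pattern follows; `DuplicateOutputCheck` and `findDuplicates` are illustrative names, not code from this PR.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateOutputCheck {

    // Returns every output field name that is produced more than once.
    static Set<String> findDuplicates(List<List<String>> outputsPerProcessor) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicated = new LinkedHashSet<>();
        for (List<String> outputs : outputsPerProcessor) {
            for (String output : outputs) {
                // Set.add returns false when the element was already in the set.
                if (seen.add(output) == false) {
                    duplicated.add(output);
                }
            }
        }
        return duplicated;
    }

    public static void main(String[] args) {
        List<List<String>> outputs = List.of(
            List.of("color_red", "color_blue"),
            List.of("size_bucket", "color_red"));
        System.out.println(findDuplicates(outputs)); // prints [color_red]
    }
}
```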

@@ -52,7 +52,7 @@ public ProcessedField(PreProcessor processor) {
             }
         }
         preProcessor.process(inputs);
-        return preProcessor.outputFields().stream().map(inputs::get).toArray();
+        return preProcessor.outputFields().stream().map(inputs::get).filter(Objects::nonNull).toArray();
Contributor

Does this work correctly? If we filter out null objects, won't we break the correspondence between the values and the output fields?

Member Author

Let me think on this more.

We don't want to return partial lists, for sure. But we also don't want to emit empty or missing values unless the caller supports them...
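
The concern can be reproduced outside the codebase. The sketch below is only an illustration under assumed names (`NullFilterAlignment`, `f_a`, `f_b`, `f_c`); it compares the unfiltered mapping with the `filter(Objects::nonNull)` variant when one output field has no value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class NullFilterAlignment {

    // Keeps one slot per output field; a missing value stays as null.
    static Object[] positional(List<String> outputFields, Map<String, Object> values) {
        return outputFields.stream().map(values::get).toArray();
    }

    // Drops nulls, so the array can be shorter than the list of output fields.
    static Object[] filtered(List<String> outputFields, Map<String, Object> values) {
        return outputFields.stream().map(values::get).filter(Objects::nonNull).toArray();
    }

    public static void main(String[] args) {
        List<String> outputFields = List.of("f_a", "f_b", "f_c");
        Map<String, Object> values = new HashMap<>();
        values.put("f_a", 1.0);
        values.put("f_c", 3.0); // "f_b" was never produced

        System.out.println(Arrays.toString(positional(outputFields, values))); // prints [1.0, null, 3.0]
        System.out.println(Arrays.toString(filtered(outputFields, values)));   // prints [1.0, 3.0]
    }
}
```

In the filtered result, index 1 holds the value for `f_c` rather than `f_b`, so a caller that pairs values with output fields by position would read the wrong feature. Which behaviour is right depends on whether the caller can handle missing values, which is the open question in this thread.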

@@ -472,12 +479,100 @@ public void testGetCategoricalFields() {
             containsInAnyOrder("field_keyword", "field_text", "field_boolean"));
     }

     public void testWithProcessedFeatures_FieldInfo() {
Contributor

Rename to testGetFieldNames_GivenProcessedFeatures?

@@ -551,4 +646,70 @@ protected SearchResponse executeSearchScrollRequest(String scrollId) {
             return searchResponse;
         }
     }

     static class CategoricalPreProcessor implements PreProcessor {
Contributor

shall we make this private?

@@ -943,6 +949,196 @@ public void testDetect_GivenAnalyzedFieldExcludesObjectField() {
         assertThat(e.getMessage(), equalTo("analyzed_fields must not include or exclude object fields: [object_field]"));
     }

     public void testDetect_givenFeatureProcessorsFailures() {
Contributor

I think there is a lot of value in keeping unit tests targeted at a very specific piece of functionality when possible. When a test fails, it is really helpful if it makes clear what the problem was. I would suggest splitting this test into individual tests with names that indicate the validation being tested. It also makes the tests serve as live documentation.

I realise this is a subjective preference. If you are not convinced by the argument, you can of course leave it as is :-)

Member Author

😭

This PR is going to end up being near 2k lines.

@dimitris-athanasiou left a comment
Contributor

LGTM

@benwtrent
Member Author

run elasticsearch-ci/packaging-sample-windows

@benwtrent benwtrent merged commit de3107a into elastic:master Aug 14, 2020
@benwtrent benwtrent deleted the feature/ml-dfa-add-processors branch August 14, 2020 12:01
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 14, 2020
…tic#60528)
benwtrent added a commit that referenced this pull request Aug 14, 2020
…) (#61148)
Successfully merging this pull request may close these issues.

[ML] Add new feature_processing field to Data frame analytics config
4 participants