Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fingerprint ingest processor #68415

Merged
merged 6 commits into from
Feb 9, 2021

Conversation

danhermann
Copy link
Contributor

Adds a fingerprint processor that computes hashes of document content for content fingerprinting use cases.

E.g.:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "fingerprint": {
          "fields": ["user"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "user": {
          "last_name": "Smith",
          "first_name": "John",
          "date_of_birth": "1980-01-15",
          "is_active": true
        }
      }
    }
  ]
}

Which produces:

"_source" : {
  "fingerprint" : "WbSUPW4zY1PBPehh2AA/sSxiRjw=",
  "user" : {
    "last_name" : "Smith",
    "first_name" : "John",
    "date_of_birth" : "1980-01-15",
    "is_active" : true
  }
}

Supports any number of document fields, nested document content, any hash from [MD5, SHA-1, SHA-256, SHA-512], and a per-processor salt.

Closes #53578 though it addresses only content fingerprinting and not anonymization use cases.

@danhermann danhermann added >feature :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.12.0 labels Feb 2, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Feb 2, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann
Copy link
Contributor Author

@graphaelli, could you or a member of your team look this over to verify that it provides the functionality that the APM use case requires?

@danhermann
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@graphaelli
Copy link
Member

Thanks @danhermann! @elastic/apm-server - can you provide feedback here?

@mark-vieira
Copy link
Contributor

@elasticmachine update branch

@axw
Copy link
Member

axw commented Feb 3, 2021

@danhermann thank you, this will do nicely for APM.

We'll need MD5 (which I see is available) for now, but later I would like to migrate to something faster, such as Murmurhash3 (non-cryptographic would be fine). Any reason why we couldn't add that later? Doesn't look like it to me, just wanted to double check.

@danhermann
Copy link
Contributor Author

@danhermann thank you, this will do nicely for APM.

We'll need MD5 (which I see is available) for now, but later I would like to migrate to something faster, such as Murmurhash3 (non-cryptographic would be fine). Any reason why we couldn't add that later? Doesn't look like it to me, just wanted to double check.

Yes, adding support for the murmur3 hash should be doable.

@dakrone dakrone self-requested a review February 4, 2021 16:44
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @danhermann, I left a few really minor comments, but I think we need to add regular docs also right?

Comment on lines 98 to 101
var setList = set.stream().sorted().collect(Collectors.toList());
for (int k = setList.size() - 1; k >= 0; k--) {
values.push(setList.get(k));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streams are super great, but they aren't quite as fast compared to foreach clauses yet, and since this is performance sensitive code, what do you think about doing:

Suggested change
var setList = set.stream().sorted().collect(Collectors.toList());
for (int k = setList.size() - 1; k >= 0; k--) {
values.push(setList.get(k));
}
var setList = new ArrayList<?>(set);
setList.sort(Comparator.naturalOrder());
for (Object thing : setList) {
values.push(thing);
}

?

(I might be missing something about why you're using a for loop indexed rather than foreach, as I think the foreach compiles to the same bytecode, but I could totally be wrong)

Copy link
Contributor Author

@danhermann danhermann Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The for loop iterates backward over the sorted list to push the values onto the stack. That means that the values are processed in sorted order after being popped off the stack which is not strictly necessary since only a stable hash is required, but having the values in sorted order is certainly nicer when debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did replace the usage of streams both here and below.

@danhermann
Copy link
Contributor Author

Thanks for the review, @dakrone. I'll open a separate PR with the docs for this one shortly.

@danhermann
Copy link
Contributor Author

@dakrone, I've updated this with all your suggestions except the indexed for loop. Let me know if you're ok with the reverse iteration so elements from the stack are processed in sorted order.

@danhermann
Copy link
Contributor Author

@elasticmachine update branch

@danhermann
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/2

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'm totally cool with the reverse ordering, thanks for iterating!

@danhermann
Copy link
Contributor Author

cc: @elastic/es-ui in case Kibana auto-complete needs to be updated with this new processor.

@danhermann danhermann merged commit 0083e9c into elastic:master Feb 9, 2021
@danhermann danhermann deleted the 53578_fingerprint_processor branch February 9, 2021 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature Team:Data Management Meta label for data/management team v7.12.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add "fingerprint" ingest processor
7 participants