Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLRC: ML Post Data #33443

Merged
merged 10 commits into from
Sep 7, 2018
Merged

Conversation

benwtrent
Copy link
Member

Adds the ability to Post Data to an ML job in the HLRC.

This relates to (#29827)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start. I left a few notes and ideas. I'll also ask the rest of the team to think about some of these design decisions.

/**
* Specifies the end of the bucket resetting range
*
* @param resetEnd String representation of a timestamp; may be an epoch seconds, epoch millis or an ISO string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be best to say ISO 8601 instead of just ISO.

Also, this can be set to the string "now".

for(int i = 0; i < 10; i++) {
Map<String, Object> hashMap = new HashMap<>();
hashMap.put("total", randomInt(1000));
postDataRequest.addDoc(hashMap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if these docs contained the job's time field, with the loop generating some ascending values for it. Then the server could validate that they were received in ascending order.

To make this change an overload of buildJob() will have to be added that sets the time field and time format to specific values rather than random values, so that the loop knows what to add.

}
for (Map<String, Object> objectMap : objectMaps) {
builder.map(objectMap);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a problem here because if data has been added as a mixture of bytes references and maps then all the maps will be sent after all the bytes references, and that could mean the data is not sent in ascending time order.

Two possible solutions I can think of are:

  1. Document that all data must be supplied in the same format - either bytes references or maps - and enforce this in the addDoc() methods.
  2. Convert the maps to bytes references in addDoc() so that there are only bytes references being stored.

This also shows that we should say somewhere in the Javadocs that docs will be processed by the job in the order they're added to the request and that therefore they should be added to the request in ascending time order.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of doing option 2 and simply calling the bytesReferences overload of addDoc from the object map one.

The downside of serializing down to simply use the BytesReference overload is performance. Though, I may be preoptimizing here.

* @param bytesReference document to add to bulk request, format must match the set XContentType
*/
public void addDoc(BytesReference bytesReference) {
this.bytesReferences.add(Objects.requireNonNull(bytesReference, "bytesReferences must not be null"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw an exception if content != null? Or otherwise return early to ignore as the Javadoc says.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thought, Since they set the the whole bulk content earlier, there is no need to continue to collect the individual docs.

* @param objectMap document object to add to bulk request
*/
public void addDoc(Map<String, Object> objectMap) {
this.objectMaps.add(Objects.requireNonNull(objectMap, "objectMap must not be null"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw an exception if content != null? Or otherwise return early to ignore as the Javadoc says.

}

/**
* Set the total content to post.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add again that this takes precedence over any individual docs already added.

It could also .clear() the lists of individual docs.

An alternative to consider would be to throw an exception if individual docs have already been added. It seems like client code is badly designed if it adds individual docs then wipes them out by overriding with externally formatted content.

@dimitris-athanasiou
Copy link
Contributor

dimitris-athanasiou commented Sep 6, 2018

I've had a thought about this. Here is what I propose:

PostDataRequest should have the following constructors:

  1. PostDataRequest(BytesReference, XContentType): this accepts the full load and is equivalent to what we currently have (checks BWC)
  2. PostDataRequest(byte[], XContentType): this accepts the full load as a byte[]. This is nice to add as client users should not have to learn elasticsearch's BytesReference class set.
  3. PostDataRequest(JsonBuilder): This specifically allows using a new JsonBuilder which allows doc-by-doc creation in a user-friendly but error-free way.

All these constructors end up building the final BytesReference that we'll end up sending over.

Then the JsonBuilder - an inner static class - has the following methods:

  1. public JsonBuilder addDoc(String): this allows a String which is convenient for users who have their JSON stored as a String directly

  2. public JsonBuilder addDoc(byte[]): for it is possibly a common format users will be able to convert their docs into

  3. public JsonBuilder addDoc(Map<String, Object>): for supporting maps

  4. private BytesReference build(): which combines all docs and returns a single BytesReference. This will only be called by the constructor of the PostDataRequest.

The above keeps means the users cannot possibly create an invalid PostDataRequest, keeps BWC and improves usability for JSON.

What do you think?

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new way of structuring the request object is much better. I just left a few more minor comments.

public PostDataRequest(String jobId, XContentType xContentType, BytesReference content) {
this.jobId = Objects.requireNonNull(jobId, "job_id must not be null");
this.xContentType = Objects.requireNonNull(xContentType, "content_type must not be null");
this.content = content;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also requireNonNull for content?

this.xContentType = Objects.requireNonNull(xContentType, "content_type must not be null");
ByteBuffer buffer = ByteBuffer.wrap(content);
ByteBuffer[] buffers = new ByteBuffer[]{ buffer };
this.content = BytesReference.fromByteBuffers(buffers);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid the ByteBuffer[] by using this.content = new ByteArray(content)? IIRC ByteArray extends BytesReference.


@Override
public int hashCode() {
return Objects.hash(jobId, resetStart, resetEnd, xContentType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add a comment that content is deliberately left out as it is on the server side too. (I'm not convinced that was a good decision as it means two radically different posts can be the equal, but we are where we are.) But at least by having a comment it avoids someone adding it in in one place but not the other in the future.

Same for the comparison in equals() below.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit 9230a48 into elastic:master Sep 7, 2018
@benwtrent benwtrent deleted the feature/hlrc-ml-post-data branch September 7, 2018 12:04
benwtrent added a commit that referenced this pull request Sep 7, 2018
* HLRC: ML Post data
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 8, 2018
* master: (30 commits)
  Include fallback settings when checking dependencies (elastic#33522)
  [DOCS] Fixing formatting issues in breaking changes
  CRUD: Disable wait for refresh tests with delete
  Test: Fix test name (elastic#33510)
  HLRC: split ingest request converters (elastic#33435)
  Logging: Configure the node name when we have it (elastic#32983)
  HLRC: split xpack request converters (elastic#33444)
  HLRC: split watcher request converters (elastic#33442)
  HLRC: add enable and disable user API support (elastic#33481)
  [DOCS] Fixes formatting error
  TEST: Ensure merge triggered in _source retention test (elastic#33487)
  [ML] Add a file structure determination endpoint (elastic#33471)
  HLRC: ML Forecast Job (elastic#33506)
  HLRC: split migration request converters (elastic#33436)
  HLRC: split snapshot request converters (elastic#33439)
  Make Watcher validation message copy/pasteable
  Removes redundant test method in SQL tests (elastic#33498)
  HLRC: ML Post Data (elastic#33443)
  Pass Directory instead of DirectoryService to Store (elastic#33466)
  Collapse package structure for metrics aggs (elastic#33463)
  ...
@benwtrent benwtrent removed the :ml Machine learning label Oct 25, 2018
@jimczi jimczi removed the v7.0.0 label Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants