Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include unindexed field in FieldStats response #21821

Merged
merged 5 commits into from
Dec 6, 2016
Merged

Include unindexed field in FieldStats response #21821

merged 5 commits into from
Dec 6, 2016

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Nov 28, 2016

This change adds non-searchable fields to the FieldStats response. These fields do not have min/max informations but they can be aggregatable. Fields that are only stored in _source (store:no, index:no, doc_values:no) will still be missing since they do not have any useful information to show. Indices and clients must be at least on V_5_2_0 to see this change.

Closes #21952

@jimczi jimczi added :Data Management/Stats Statistics tracking and retrieval APIs >enhancement review v5.2.0 labels Nov 28, 2016
protected T minValue;
protected T maxValue;

FieldStats(byte type, long maxDoc, long docCount, long sumDocFreq, long sumTotalTermFreq,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add Javadoc for these two ctors now?

@@ -550,27 +625,50 @@ public static FieldStats readFrom(StreamInput in) throws IOException {
long sumTotalTermFreq = in.readLong();
boolean isSearchable = in.readBoolean();
boolean isAggregatable = in.readBoolean();

boolean hasMinMax = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is weird to see this pre-NamedWriteable stuff still around. It might make sense to convert these to NamedWriteable in master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure of this, NamedWriteable is ok when you want to serialize objects that have nothing in common but that's not the case here. We just want to serialize the min/max value differently for each specialization of the class. Rewriting this as a NamedWriteable would be completely different. We would need to create a class to encapsulate the min/max info and then use NamedWriteable to serialize these classes. I can do that but I prefer to do it in another PR since it's not related to the change that we're trying to make here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, another PR is totally fine. No need to hold this up. And I'm not 100% sure it makes sense to convert them, just that it makes sense to think about it.

We have lots of NamedWriteables that write things in common - queries and aggregations do it all the time. They defer to superclasses to load the common stuff.

@@ -398,6 +456,12 @@ public String getMaxValueAsString() {
private FormatDateTimeFormatter formatter;

public Date(long maxDoc, long docCount, long sumDocFreq, long sumTotalTermFreq,
boolean isSearchable, boolean isAggregatable) {
super((byte) 2, maxDoc, docCount, sumDocFreq, sumTotalTermFreq, isSearchable, isAggregatable);
this.formatter = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeMinMax isn't going to work for this one, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ? writeMinMax is only called when min/max infos are present. For this case there is no min/max so we don't call writeMinMax at all. Does it make sense ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. That is fine then.

.setFields("field_index", "field_dv", "field_stored", "field_source").get();
assertThat(result.getAllFieldStats().size(), equalTo(3));
for (String field : new String[] {"field_index", "field_dv", "field_stored"}) {
assertThat(result.getAllFieldStats().get(field).getMaxDoc(), equalTo(11L));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this'll be more readable if you assign result.getAllFieldStats().get(field) to a variable.

assertThat(result.getAllFieldStats().get(field).getDisplayType(),
equalTo("string"));
if ("field_index".equals(field)) {
assertThat(result.getAllFieldStats().get(field).getMinValue(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a fan of assertThat(..., equalTo(...)). I feel like it is usually more clear to do assertEquals(...). Like, if you have lots of assertions and you think they should line up then fine, but in this case I think they are all equalTo so maybe switch? Not a big deal either way.

@@ -519,49 +600,84 @@ public void testMetaFieldsNotIndexed() {
}

public void testSerialization() throws IOException {
Version version = randomBoolean() ? Version.CURRENT : Version.V_5_1_0_UNRELEASED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd test both in the same test run because it is fast.

}
}

/**
* creates a random field stats which does not guarantee that {@link FieldStats#maxValue} is greater than {@link FieldStats#minValue}
**/
private FieldStats randomFieldStats() throws UnknownHostException {
private FieldStats randomFieldStats(boolean withNullMinMax) throws UnknownHostException {
int type = randomInt(5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it isn't a part of this change but I'm not a fan of the int type = randomInt(5); switch (type) {...} idiom because I think it is easy to make a mistake that excludes a branch. I'd prefer randomFrom(() -> {...}, () -> {...}).get() because it is more obvious that you haven't forgot a branch. Is that crazy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to use the switch statement as long as the default throws an exception ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine but it makes me worried because I have to double check that randomInt is inclusive and the last element of the switch statement lines up.

@jimczi
Copy link
Contributor Author

jimczi commented Nov 28, 2016

Thanks @nik9000
I pushed a commit to address your comments.
If it's ok for you I'd like to push this change (modulo the changes that you requested) and do the NamedWriteable stuff in another PR.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a question.

if (hasMinMax == false) {
hasMinMax = true;
minValue = (T) other.minValue;
maxValue = (T) other.maxValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels wrong to say that we know the min/max unless the information is available for all indices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed be82024

Though I think we should distinguish null values because the shard is empty and null values because the field is not indexed. I could not do it in this commit because SeqNoFieldType#stats computes min/max values on a non-searchable field.
@bleskes @jasontedor what is the plan for this SeqNoFieldType#stats ? Are we going to index this field and remove the min/max retrieval ?

@@ -272,6 +335,9 @@ public final void writeTo(StreamOutput out) throws IOException {
* otherwise <code>false</code> is returned
*/
public boolean match(IndexConstraint constraint) {
if (hasMinMax == false) {
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we throw an exception if min/max info is not available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the current behavior for fields that are not indexed so maybe better to just ignore these fields for now ? The index constraint works on indexed fields and always returns false on non-indexed fields, I can add a sentence like this in the docs if needed ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me. Thanks for the explanation.

This change adds non-searchable fields to the FieldStats response. These fields do not have min/max informations but they can be aggregatable. Fields that are only stored in _source (store:no, index:no, doc_values:no) will still be missing since they do not have any useful information to show. Indices and clients must be at least on V_5_2_0 to see this change.
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimczi jimczi merged commit b42ca6b into elastic:master Dec 6, 2016
@jimczi jimczi deleted the field_stats_field_infos branch December 6, 2016 12:32
@jimczi
Copy link
Contributor Author

jimczi commented Dec 6, 2016

Thanks @jpountz and @nik9000 !
I'll merge this to 5.2.0 now

jimczi added a commit that referenced this pull request Dec 6, 2016
* Include unindexed field in FieldStats response

This change adds non-searchable fields to the FieldStats response. These fields do not have min/max informations but they can be aggregatable. Fields that are only stored in _source (store:no, index:no, doc_values:no) will still be missing since they do not have any useful information to show. Indices and clients must be at least on V_5_2_0 to see this change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants