Dots in field names exception #853

Open
jimmyjones2 opened this issue Sep 25, 2016 · 22 comments

@jimmyjones2

spark-1.6.2-bin-hadoop2.6, elasticsearch-5.0.0-beta1, elasticsearch-hadoop-5.0.0-beta1

curl -XPOST localhost:9200/test4/test -d '{"b":0,"e":{"f.g":"hello"}}'
./bin/pyspark --driver-class-path=../elasticsearch-hadoop-5.0.0-beta1/dist/elasticsearch-hadoop-5.0.0-beta1.jar
>>> df1 = sqlContext.read.format("org.elasticsearch.spark.sql").load("test4/test")
>>> df1.printSchema()
root
 |-- b: long (nullable = true)
 |-- e: struct (nullable = true)
 |    |-- f: struct (nullable = true)
 |    |    |-- g: string (nullable = true)

>>> df1.show()
---8<--- snip ---8<--- 
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'e.f.g' not found in row; typically this is caused by a mapping inconsistency
    at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:45)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:14)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:94)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
@jbaiera
Member

jbaiera commented Sep 28, 2016

Thanks for opening this. I was able to reproduce it and it's definitely a bug. When the document is inserted, the mapping on the type registers f.g as the nested fields f and g (as follows):

"mappings" : {
  "test" : {
    "properties" : {
      "b" : {
        "type" : "long"
      },
      "e" : {
        "properties" : {
          "f" : {
            "properties" : {
              "g" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

When ES-Hadoop discovers the mapping, it correctly parses the schema as follows:

StructType(
  StructField(b,LongType,true),
  StructField(e,StructType(StructField(f,StructType(StructField(g,StringType,true)),true)),true)
)

However, when reading the values from the scroll, the original source value is used:

{
  "_index" : "test4",
  "_type" : "test",
  "_id" : "AVdvSwQTnlURBA5E_yk9",
  "_score" : 1.0,
  "_source" : {
    "b" : 0,
    "e" : {
      "f.g" : "hello"
    }
  }
}

During document parsing, the parser gets confused because it cannot find a valid schema field named f.g. The reader will have to be updated to handle dots in field names by splitting each dotted name into a separate map layer per segment.
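
For illustration only, here is a minimal Python sketch of that expansion (a hypothetical helper, not the ES-Hadoop reader itself): each dotted key in the parsed _source is split and rebuilt as nested maps so the document lines up with the discovered schema.

def expand_dots(value):
    """Recursively expand dotted keys into nested maps, e.g. {"f.g": 1} -> {"f": {"g": 1}}."""
    if not isinstance(value, dict):
        return value
    expanded = {}
    for key, inner in value.items():
        parts = key.split(".")
        target = expanded
        for part in parts[:-1]:
            # Walk or create an intermediate map layer for each dotted segment.
            target = target.setdefault(part, {})
        target[parts[-1]] = expand_dots(inner)
    return expanded

print(expand_dots({"b": 0, "e": {"f.g": "hello"}}))
# {'b': 0, 'e': {'f': {'g': 'hello'}}}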

@sandeep-telsiz

Is there a workaround for this issue?

@jbaiera
Member

jbaiera commented Jun 6, 2017

@sandeep-telsiz unfortunately this is not a straightforward fix. The parsing code for reading scrolls needs to be completely rebuilt to support this. Rest assured that this is a big item on our radar that we hope to tackle soon.

@liveangel-js

same issue

@danielyahn

+1

@liansghuaifan

+1

@vladimirkore

+1

@upupfeng

same issue

@SouhailBenAli1

+1! Any workaround for this issue?

@vsethi13

vsethi13 commented Mar 2, 2020

Any workaround in the meantime? @jbaiera

@jbaiera
Member

jbaiera commented Mar 2, 2020

As mentioned above, this is not a straightforward change to make as it will require a large rewrite of the document parsing code. One possible workaround is configuring the library to read data from Elasticsearch as raw JSON data and performing the parsing yourself before operating on the data. Unfortunately, this workaround would only be feasible on MR and Spark where you can run arbitrary code on the data.
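
As a rough sketch of that raw-JSON approach in PySpark (the index name, node address, and Hadoop key/value classes below are assumptions based on the usual EsInputFormat examples, not a tested recipe), something along these lines should return each hit as a JSON string that you can parse and restructure yourself:

import json

conf = {
    "es.nodes": "localhost:9200",
    "es.resource": "test4/test",
    "es.output.json": "true",   # ask the connector for raw JSON instead of parsed documents
}

raw = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf=conf,
)

# Each record arrives as (key, json_string); parse it and expand the dotted keys yourself.
docs = raw.mapValues(json.loads)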

@robinsonmhj

+1

@arjansh

arjansh commented Jul 3, 2020

+1

arjansh pushed a commit to arjansh/metamodel that referenced this issue Jul 3, 2020
… indexed by Elasticsearch which contains dots in its fieldnames. Note that this is actually caused by Elasticsearch, because the mapping returned for such a document by Elasticsearch doesn't match the source returned by Elasticsearch when getting the source of a search hit. (see elastic/elasticsearch-hadoop#853 and related issues for more info on this).
@QuentinAdt

+1

@masseyke
Member

As @jbaiera has said, this one is complicated. When we write {"f.g":"hello"} as a field, Elasticsearch treats f as an object field and g as a text field within it. So the mapping is the same as if you had written {"f":{"g":"hello"}}, and internally it is indexed the same way. But Elasticsearch keeps the original source, so when es-hadoop queries the data it gets back the same {"f.g":"hello"} that was written. Right now the parser blows up on that. I believe we can fix the parser to handle this, but that means effectively translating the source from {"f.g":"hello"} to {"f":{"g":"hello"}}, because that is what Spark expects, since that is what the mapping declares. This means that if we write the data back into Elasticsearch, we write the changed source. It means the same thing to Elasticsearch, but it could break an application that depends on the exact format of the source.
So "fixing" this one (at least without some more thought) might actually do more harm than not fixing it.

@masseyke
Member

masseyke commented Feb 4, 2022

After discussing with the team, we decided that it would be dangerous for es-hadoop to support dots in field names, because es-hadoop would be silently rewriting _source data. So I've put up a PR to document that we do not support them and to provide a better error message.

@masseyke
Member

masseyke commented Feb 4, 2022

Here is the ticket where dot support was added (back) to Elasticsearch -- elastic/elasticsearch#15951.

masseyke added a commit that referenced this issue Feb 8, 2022
Es-hadoop does not support fields with dots in their names (#853). Adding support is likely to cause more problems
than it fixes. So this commit documents that we do not support them, and adds a better error message.
@masseyke
Member

masseyke commented Feb 8, 2022

I'm closing this as one that we will intentionally not fix. Unfortunately the safest option is to use something like the Dot Processor to convert your field names with dots to object structures.
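
A hedged sketch of that conversion at ingest time, using the dot_expander ingest processor (which appears to be the processor meant here); the pipeline name and the specific field are illustrative for the example document from this issue:

curl -XPUT localhost:9200/_ingest/pipeline/expand-dots -H 'Content-Type: application/json' -d '
{
  "description": "Expand dotted field names into object structures before indexing",
  "processors": [
    { "dot_expander": { "path": "e", "field": "f.g" } }
  ]
}'

curl -XPOST 'localhost:9200/test4/test?pipeline=expand-dots' -d '{"b":0,"e":{"f.g":"hello"}}'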

@asapegin

asapegin commented Feb 14, 2022

> I'm closing this as one that we will intentionally not fix. Unfortunately the safest option is to use something like the Dot Processor to convert your field names with dots to object structures.

Sorry, but what do you mean by "convert your field names with dots"?! The fields with dots are YOUR (Elastic) standard field names defined in the ECS Field Reference (https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html). 99% of all field names there contain dots.

@asapegin

asapegin commented Feb 14, 2022

But then all related SIEM functionality in Elastic Security would also need to be converted, including rules, detections, alerts, the .siem-signals index, etc.

@jbaiera
Member

jbaiera commented May 12, 2022

I am going to go ahead and re-open this since it seems like this "problem" of dots in field names is less of a "problem" and more just where things are trending toward in the data integration space. It would be unwise of us to ignore this issue given recent developments across existing solutions.

That said, this issue is not an easy fix and requires some adjusting of invariants that we have treated very carefully over the years - most notably that _source is sacred and should only be changed judiciously. Additionally, document update logic likely will need looking at (just try running a partial document update using normalized JSON in the request against a document containing dotted field names).
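
For instance (a hypothetical sequence; the document id and values are made up), something like the following is where it gets awkward: "f.g" and "f" are different JSON keys even though they name the same indexed field, so a partial-update merge can plausibly leave both spellings of the field in _source.

curl -XPUT localhost:9200/test4/test/1 -d '{"b":0,"e":{"f.g":"hello"}}'
curl -XPOST localhost:9200/test4/test/1/_update -d '{"doc":{"e":{"f":{"g":"world"}}}}'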

@tsikerdekis

Any updates on this issue? I have ingestors that pull data off of Suricata's eve.json, and the documents contain dotted field names used by all sorts of default and non-default dashboards in Kibana. I can't just rename them, and I'm not sure how to stop PySpark from complaining about these fields. It makes elasticsearch-hadoop unusable unless someone has found a workaround.
