Dots in field names exception #853

Open
jimmyjones2 opened this issue Sep 25, 2016 · 22 comments

@jimmyjones2

spark-1.6.2-bin-hadoop2.6, elasticsearch-5.0.0-beta1, elasticsearch-hadoop-5.0.0-beta1

curl -XPOST localhost:9200/test4/test -d '{"b":0,"e":{"f.g":"hello"}}'
./bin/pyspark --driver-class-path=../elasticsearch-hadoop-5.0.0-beta1/dist/elasticsearch-hadoop-5.0.0-beta1.jar
>>> df1 = sqlContext.read.format("org.elasticsearch.spark.sql").load("test4/test")
>>> df1.printSchema()
root
 |-- b: long (nullable = true)
 |-- e: struct (nullable = true)
 |    |-- f: struct (nullable = true)
 |    |    |-- g: string (nullable = true)

>>> df1.show()
---8<--- snip ---8<--- 
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'e.f.g' not found in row; typically this is caused by a mapping inconsistency
    at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:45)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:14)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:94)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
@jbaiera
Member

jbaiera commented Sep 28, 2016

Thanks for opening this. I was able to reproduce it and it's definitely a bug. When the document is inserted, the mapping on the type registers f.g as the nested fields f and g (as follows):

"mappings" : {
  "test" : {
    "properties" : {
      "b" : {
        "type" : "long"
      },
      "e" : {
        "properties" : {
          "f" : {
            "properties" : {
              "g" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

When ES-Hadoop discovers the mapping, it correctly parses the schema as follows:

StructType(
  StructField(b,LongType,true),
  StructField(e,StructType(StructField(f,StructType(StructField(g,StringType,true)),true)),true)
)

However, when reading the values from the scroll, the original source value is used:

{
  "_index" : "test4",
  "_type" : "test",
  "_id" : "AVdvSwQTnlURBA5E_yk9",
  "_score" : 1.0,
  "_source" : {
    "b" : 0,
    "e" : {
      "f.g" : "hello"
    }
  }
}

During document parsing, the parser gets confused because it cannot find a valid schema field named f.g. The reader will have to be updated to handle dots in field names by splitting each dotted name into a separate map layer per segment.
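
For illustration only, here is a minimal Python sketch of that expansion (a hypothetical helper, not the ES-Hadoop reader itself): each dotted key in the parsed _source is split and rebuilt as nested maps so the document lines up with the discovered schema.

def expand_dots(value):
    """Recursively expand dotted keys into nested maps, e.g. {"f.g": 1} -> {"f": {"g": 1}}."""
    if not isinstance(value, dict):
        return value
    expanded = {}
    for key, inner in value.items():
        parts = key.split(".")
        target = expanded
        for part in parts[:-1]:
            # Walk or create an intermediate map layer for each dotted segment.
            target = target.setdefault(part, {})
        target[parts[-1]] = expand_dots(inner)
    return expanded

print(expand_dots({"b": 0, "e": {"f.g": "hello"}}))
# {'b': 0, 'e': {'f': {'g': 'hello'}}}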

@sandeep-telsiz

Is there a workaround for this issue?

@jbaiera
Member

jbaiera commented Jun 6, 2017

@sandeep-telsiz unfortunately this is not a straightforward fix. The parsing code for reading scrolls needs to be completely rebuilt to support this. Rest assured that this is a big item on our radar that we hope to tackle soon.

@liveangel-js

same issue

@danielyahn

+1

@liansghuaifan

+1

@vladimirkore

+1

@upupfeng

same issue

@SouhailBenAli1

+1! Any workaround for this issue?

@vsethi13

vsethi13 commented Mar 2, 2020

Any workaround in the meantime? @jbaiera

@jbaiera
Member

jbaiera commented Mar 2, 2020

As mentioned above, this is not a straightforward change to make as it will require a large rewrite of the document parsing code. One possible workaround is configuring the library to read data from Elasticsearch as raw JSON data and performing the parsing yourself before operating on the data. Unfortunately, this workaround would only be feasible on MR and Spark where you can run arbitrary code on the data.
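
As a rough sketch of that raw-JSON approach in PySpark (the index name, node address, and Hadoop key/value classes below are assumptions based on the usual EsInputFormat examples, not a tested recipe), something along these lines should return each hit as a JSON string that you can parse and restructure yourself:

import json

conf = {
    "es.nodes": "localhost:9200",
    "es.resource": "test4/test",
    "es.output.json": "true",   # ask the connector for raw JSON instead of parsed documents
}

raw = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf=conf,
)

# Each record arrives as (key, json_string); parse it and expand the dotted keys yourself.
docs = raw.mapValues(json.loads)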

@robinsonmhj

+1

@arjansh

arjansh commented Jul 3, 2020

+1

arjansh pushed a commit to arjansh/metamodel that referenced this issue Jul 3, 2020
… indexed by Elasticsearch which contains dots in its fieldnames. Note that this is actually caused by Elasticsearch, because the mapping returned for such a document by Elasticsearch doesn't match the source returned by Elasticsearch when getting the source of a search hit. (see elastic/elasticsearch-hadoop#853 and related issues for more info on this).
@QuentinAdt

+1

@masseyke
Member

As @jbaiera has said, this one is complicated. When we write {"f.g":"hello"} as a field, Elasticsearch treats f as an object field and g as a text field within it. So the mapping is the same as if you had written {"f":{"g":"hello"}}, and internally it is indexed the same way. But Elasticsearch keeps the original source, so when es-hadoop queries the data it gets back the same {"f.g":"hello"} that was written. Right now the parser blows up on that. I believe we can fix the parser to handle this, but that means effectively translating the source from {"f.g":"hello"} to {"f":{"g":"hello"}}, because that is what Spark expects, since that is what the mapping declares. This means that if we write the data back into Elasticsearch, we write the changed source. It means the same thing to Elasticsearch, but it could break an application that depends on the exact format of the source.
So "fixing" this one (at least without some more thought) might actually do more harm than not fixing it.

@masseyke
Member

masseyke commented Feb 4, 2022

After discussing with the team, we decided that it would be dangerous for es-hadoop to support dots in field names, because es-hadoop would be silently rewriting _source data. So I've put up a PR to document that we do not support them and to provide a better error message.

@masseyke
Member

masseyke commented Feb 4, 2022

Here is the ticket where dot support was added (back) to Elasticsearch -- elastic/elasticsearch#15951.

masseyke added a commit that referenced this issue Feb 8, 2022
Es-hadoop does not support fields with dots in their names (#853). Adding support is likely to cause more problems
than it fixes. So this commit documents that we do not support them, and adds a better error message.
@masseyke
Member

masseyke commented Feb 8, 2022

I'm closing this as one that we will intentionally not fix. Unfortunately the safest option is to use something like the Dot Processor to convert your field names with dots to object structures.
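
A hedged sketch of that conversion at ingest time, using the dot_expander ingest processor (which appears to be the processor meant here); the pipeline name and the specific field are illustrative for the example document from this issue:

curl -XPUT localhost:9200/_ingest/pipeline/expand-dots -H 'Content-Type: application/json' -d '
{
  "description": "Expand dotted field names into object structures before indexing",
  "processors": [
    { "dot_expander": { "path": "e", "field": "f.g" } }
  ]
}'

curl -XPOST 'localhost:9200/test4/test?pipeline=expand-dots' -d '{"b":0,"e":{"f.g":"hello"}}'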

@asapegin

asapegin commented Feb 14, 2022

> I'm closing this as one that we will intentionally not fix. Unfortunately the safest option is to use something like the Dot Processor to convert your field names with dots to object structures.

Sorry, but what do you mean by "convert your field names with dots"?! The fields with dots are YOUR (Elastic) standard field names defined in the ECS Field Reference (https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html). 99% of all field names there contain dots.

@asapegin

asapegin commented Feb 14, 2022

But then all related SIEM functionality in Elastic Security would also need to be converted, including rules, detections, alerts, the .siem-signals index, etc.

@jbaiera
Member

jbaiera commented May 12, 2022

I am going to go ahead and re-open this since it seems like this "problem" of dots in field names is less of a "problem" and more just where things are trending toward in the data integration space. It would be unwise of us to ignore this issue given recent developments across existing solutions.

That said, this issue is not an easy fix and requires some adjusting of invariants that we have treated very carefully over the years - most notably that _source is sacred and should only be changed judiciously. Additionally, document update logic likely will need looking at (just try running a partial document update using normalized JSON in the request against a document containing dotted field names).
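
For instance (a hypothetical sequence; the document id and values are made up), something like the following is where it gets awkward: "f.g" and "f" are different JSON keys even though they name the same indexed field, so a partial-update merge can plausibly leave both spellings of the field in _source.

curl -XPUT localhost:9200/test4/test/1 -d '{"b":0,"e":{"f.g":"hello"}}'
curl -XPOST localhost:9200/test4/test/1/_update -d '{"doc":{"e":{"f":{"g":"world"}}}}'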

@tsikerdekis

Any updates on this issue? I have ingestors that pull data off of Suricata's eve.json, and the documents contain dotted field names used by all sorts of default and non-default dashboards in Kibana. I can't just rename them, and I'm not sure how to stop PySpark from complaining about these fields. It makes elasticsearch-hadoop unusable unless someone has found a workaround.
