1.0.68 introduces memory leak #546

TJC · 2022-04-04T05:42:05Z

Version 1.0.68 introduces a memory leak.

When called repeatedly, the JSON Schema validator consumes all available memory until it crashes with java.lang.OutOfMemoryError: Java heap space.

The following script demonstrates it (uses Ammonite on JDK 17, Scala 2.13).
Run the script after setting JAVA_OPTS=-Xmx256M

This assumes you're parsing some fairly large JSON files (500 KB in my case) and that the JSON Schema is of some complexity. I assume you have some examples like this around to use.

Note that if you revert the version back to 1.0.67, the script runs successfully.

#!/usr/bin/env amm

interp.load.ivy("com.fasterxml.jackson.core" % "jackson-databind" % "2.13.2.1")
interp.load.ivy("com.networknt" % "json-schema-validator" % "1.0.68")

@

import java.nio.file.{Files, Path}

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.networknt.schema.{JsonSchema, JsonSchemaFactory, SpecVersion}

val bigFile = Files.readString(Path.of("big-json-file.json"))

// In Scala, an object is similar to Java static classes, I think?
object JsonSchemaCheck {
  private val jsonSchemaContent = Files.readString(Path.of("schema.json"))
  private val validator = getJsonSchema(jsonSchemaContent)

  private def getJson(content: String): JsonNode = {
    val mapper = new ObjectMapper
    mapper.readTree(content)
  }

  private def getJsonSchema(schema: String): JsonSchema = {
    val factory = JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7)
    factory.getSchema(schema)
  }

  def check(rawDoc: String) = {
    val json = getJson(rawDoc)
    validator.validate(json)
  }
}

@main
def main(): Unit = {
  Range(1,500).foreach { i =>
    println(i)
    JsonSchemaCheck.check(bigFile)
  }
  println("Done")
}


TJC · 2022-04-04T07:14:45Z

Upon further testing, it seems the memory leak was introduced in 1.0.68. Reverting back to .67 fixes the problem.

TJC · 2022-04-04T09:05:23Z

After some git bisecting, the cause of the memory leak seems to be this commit: 5007242 in #534

AndreasALoew · 2022-04-04T09:47:12Z

Could you please still try starting the JDK 17 JVM, which you seem to be using as the Scala/Ammonite runtime platform, with the java command-line option "-XX:+HeapDumpOnOutOfMemoryError"? 😄

For details, see:
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/clopts001.html

This would allow Java developers to pinpoint the place where the memory is lost rather than looking for a needle in a haystack... 😢

AndreasALoew · 2022-04-04T09:49:16Z

Note:
By default, then, the heap dump is created in a file called java_pid<pid>.hprof in the working directory of the VM, as in the example above. You can specify an alternative file name or directory with the -XX:HeapDumpPath= option. For example -XX:HeapDumpPath=/disk2/dumps will cause the heap dump to be generated in the /disk2/dumps directory.

TJC · 2022-04-04T15:24:19Z

Thanks for the quick response. I can't send a heap dump based on the current JSON source that I'm using; I'll need to run something over the JSON to replace many strings with random characters. But I can work on that.

I was hoping that you might have some JSON files and Schemas nearby yourself, to use with the script I posted, as I doubt the exact content of my JSON matters for triggering the bug. Just that the file is fairly large.

If you're on a Mac, then it's just brew install ammonite-repl and then chmod +x my-test-script && ./my-test-script to use it

(It looks for a big-json-file.json and schema.json in the current directory)

AndreasALoew · 2022-04-04T15:31:16Z

If you can't disclose the whole heap dump due to the private contents of its Strings, can you maybe just list the major findings from opening the heap dump in a heap dump analyzer (like Eclipse MAT etc) so that we have a chance to see objects of which class/in which map etc. (regardless of particular content) do accumulate without limits over the course of time?

TJC · 2022-04-04T15:34:21Z

Sure, I can do that. It's 1:30am here though, so it'll be "tomorrow" for me when I get to it. Thanks

TJC · 2022-04-05T00:29:29Z

Looks like almost all the memory is consumed by an ArrayList from AnyOfValidator.java

It contains four million elements, which is far more than the number of elements in my source JSON. It seems like it is not clearing that array between validations?

TJC · 2022-04-05T00:31:47Z

Also, I have built a redacted JSON file, so I'm happy to share a heap dump with you privately if you like.

AndreasALoew · 2022-04-05T09:00:48Z

Sorry, I myself don't have time until this evening (Tue, and I'm in Europe/MET) to look into it (just in case nobody else wants to take over in the meantime... 😀), but the heap dump should compress quite well, so I assume you can/should even be able to reduce -Xmx and attach the redacted/"anonymized" one here... Use the max compression level of e.g. xz, bz2 or zip and attach the compressed dump (max file size here seems to be 25MB). If that's not sufficient, can you prepare/provide a download link for the heap dump? Thx! 😄

But I think even the info from the screenshot in your previous comment will most likely help to find out what's going wrong here in the AnyOfValidator, it's a pretty good starting point... 👍

AndreasALoew · 2022-04-07T00:20:43Z

Sorry - I still haven't received a heap dump from @TJC ... 😞

Please be informed that without a "readily prepared" heap dump to analyze, I will only find some time to reproduce the issue by myself over the weekend at the earliest...

Maybe @prashanthjos / @prashanth-chaitanya as the authors of the most likely faulty (by result of bisect) commit mentioned above can help earlier?

TJC · 2022-04-08T07:55:03Z

> Sorry - I still haven't received a heap dump from @TJC ... 😞

My previous offer to share it with you privately still stands. And as I said in an earlier comment, I strongly suspect this issue will be easy to reproduce using your own collection of JSON and JSON Schemas. Maybe just ensure your test schema includes a couple of nested anyOf conditions, since the issue appears to originate there.

AndreasALoew · 2022-04-08T08:48:25Z

@TJC so please indeed share it privately with me - how do you plan to do it? I can provide you an https upload link to my GMX web space, or you put it onto some location where I can download it via https.

I'll immediately send a private e-mail to the address in your GitHub user profile offering my upload URL. In order to separate credentials from the link for security reasons: the password for the (only privately shared) upload URL contained in the e-mail is "networkNTjson". 😉

prashanthjos · 2022-04-08T08:59:29Z

Hi @AndreasALoew, @networknt/json-schema-validator, apologies for my late entry into the issue. I made the change that adds evaluated and unevaluatedProperties to CollectorContext (which is a thread-level context that uses ThreadLocal). It looks like you are repeatedly calling JSON schema validate in a loop. Can you try resetting the CollectorContext before every call to validate, using CollectorContext.getInstance().reset()?

CollectorContext is not automatically reset or removed from ThreadLocal; that decision was deliberately left to the user of the framework.

Please let me know your thoughts, always happy to help.
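
To illustrate the behaviour described above, here is a minimal, self-contained sketch that does not use the library at all. DemoCollectorContext and DemoValidator are hypothetical stand-ins invented for illustration; the only real call named in this thread is CollectorContext.getInstance().reset(). The sketch assumes each validation appends its evaluated properties to a thread-local list, which is roughly what the heap dump findings suggest.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a thread-local collector; NOT the library's real class.
object DemoCollectorContext {
  private val evaluated: ThreadLocal[ArrayBuffer[String]] =
    new ThreadLocal[ArrayBuffer[String]] {
      override def initialValue(): ArrayBuffer[String] = ArrayBuffer.empty[String]
    }

  def add(path: String): Unit = { evaluated.get() += path; () }
  def size: Int = evaluated.get().size
  def reset(): Unit = evaluated.get().clear()
}

// Hypothetical validator: records one entry per property it evaluates.
object DemoValidator {
  def validate(properties: Seq[String]): Unit =
    properties.foreach(DemoCollectorContext.add)
}

object LeakDemo {
  private val doc = Seq("$.id", "$.name", "$.tags")

  // Without a reset, entries from every earlier call stay on the calling thread.
  def withoutReset(calls: Int): Int = {
    DemoCollectorContext.reset() // start from a clean thread-local state
    (1 to calls).foreach(_ => DemoValidator.validate(doc))
    DemoCollectorContext.size
  }

  // Resetting before each call keeps only the latest call's entries.
  def withReset(calls: Int): Int = {
    (1 to calls).foreach { _ =>
      DemoCollectorContext.reset()
      DemoValidator.validate(doc)
    }
    DemoCollectorContext.size
  }

  def main(args: Array[String]): Unit = {
    println(s"without reset: ${withoutReset(1000)} entries")
    println(s"with reset:    ${withReset(1000)} entries")
  }
}
```

In the reporter's script, the equivalent change would presumably be a reset call immediately before validator.validate(json) inside check.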

AndreasALoew · 2022-04-08T09:13:26Z

Hello @prashanthjos , many thanks for the explanation 😄

My immediate follow-up question to you would be: what is the advantage of not resetting i.e. "reusing" a CollectorContext for two subsequent calls to validate()?

If there is no such advantage, wouldn't it make more sense to clean the thread-local CollectorContext in a finally() block immediately before returning from any call to validate()?

Or, in case there is a valid scenario, I'd like to propose that we change the default behaviour for the CollectorContext to be cleaned immediately before returning from validate, but we add an additional validate() call passing a boolean parameter that would explicitly tell the framework to keep and NOT clean the CollectorContext at the end of validate(), and thereby leave the decision when to clean it to the caller (as you seem to have intended).

What do you think?
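
The proposal above could be sketched roughly as follows. This is only an illustration of the API shape under discussion, not the library's code: SketchContext, SketchValidator and the resetContext parameter name are hypothetical, and the "validation" is reduced to flagging blank field names.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical thread-local context; a stand-in for illustration only.
object SketchContext {
  private val collected: ThreadLocal[ArrayBuffer[String]] =
    new ThreadLocal[ArrayBuffer[String]] {
      override def initialValue(): ArrayBuffer[String] = ArrayBuffer.empty[String]
    }

  def add(entry: String): Unit = { collected.get() += entry; () }
  def snapshot: List[String] = collected.get().toList
  def reset(): Unit = collected.remove() // drop this thread's data entirely
}

object SketchValidator {
  // Default variant: same call shape as before, context is always cleaned.
  def validate(fields: Seq[String]): List[String] =
    validate(fields, resetContext = true)

  // Opt-out variant: the caller decides when to clean the context.
  def validate(fields: Seq[String], resetContext: Boolean): List[String] =
    try {
      fields.foreach(SketchContext.add)
      // Stand-in "errors": report blank field names.
      fields.filter(_.trim.isEmpty).map(_ => "blank field name").toList
    } finally {
      // Runs on normal return and on exceptions alike.
      if (resetContext) SketchContext.reset()
    }
}
```

With this shape, existing callers get the cleanup without code changes, while a caller who wants to collect data across several documents passes resetContext = false and calls SketchContext.reset() when done.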

prashanthjos · 2022-04-08T09:15:44Z

Agree with you @AndreasALoew, we can add additional validate and walk methods that will NOT reset the CollectorContext, based on user input.

@TJC can you quickly verify whether the reset of CollectorContext fixes your issue?

AndreasALoew · 2022-04-08T09:20:37Z

... such that the default (i.e. if you continue to use the framework like it has been in previous releases) cleans the context, and if you explicitly would like to keep the CollectorContext (still the question is: why would you want to do so - please explain), then this can be expressed by an additional optional boolean parameter as proposed earlier...

Big advantage of doing it this way is that it won't break (i.e. cause OutOfMemory) all existing code using the validator...

So @prashanthjos , would you be willing to propose such an additional PR? That'd be just great! 👍

prashanthjos · 2022-04-08T09:32:01Z

Information collected in CollectorContext is used by us beyond validation. A simple example could be collecting information about references ($ref) and their data while walking over the schema, or we may want to store specific node information in our database after we validate. Since the validation already walks over every schema node and its corresponding data node, we can collect this information while validating or walking. We have several custom keywords that do this kind of data collection while validating, and we later use that data in downstream code.

I am out of town till Monday (11th April 2022). I will be happy to create a PR once I am back.

In the meantime I would request @TJC to test with the CollectorContext reset fix and see if it works.

TJC · 2022-04-08T09:39:02Z

> @TJC so please indeed share it privately with me - I can provide you an https upload link to my GMX web space, or you put it onto some location where I can download it via https.

Thanks for helping out with that - if you check your email now, you should find details about how to access the heap dump :)

prashanthjos · 2022-04-08T10:51:30Z

@TJC can you please confirm here once you test if CollectorContext.getInstance().reset() works. Thanks!

TJC · 2022-04-08T13:46:25Z

> Looks like you are repeatedly calling json schema validate in a loop. Can you try resetting the CollectorContext before every call to validate using CollectorContext.getInstance().reset() ?

My demonstration script is calling .validate() in a loop, but that was for demonstration. In production code, the .validate() call happens each time a webserver receives an HTTP request. Which can happen quite frequently, but it still takes hours in normal use before all memory is consumed by this bug.

We deployed some code to production that included several dependency updates, including the minor patch version of the JSON Schema validator. In the middle of the night, the systems started running out of memory, causing us to get people out of bed to investigate the issue.

> CollectorContext is not automatically reset or removed from ThreadLocal, It was a decision left to the user of the framework deliberately.

Normally when a significant breaking change like this is made to software, the major version number gets incremented, so that users of the software know to pay more attention. eg. Changing the version from 1.0.67 to 2.0.0. Because the version just incremented from 1.0.67 to 1.0.68, we thought it only included minor changes and bugfixes, and would not require us to make changes to how we use the software.

I also note that the documentation doesn't seem to reflect this new requirement.

TJC · 2022-04-08T13:48:01Z

I can confirm that calling reset on the CollectorContext does indeed seem to fix the memory issue.

However I would encourage you to make the reset a default experience rather than an optional one, as I'm sure my company is not the only one that will step on this landmine.

jochenberger · 2022-04-08T13:53:46Z

No, it isn't. 😉
Luckily, our test suite caught this.

prashanthjos · 2022-04-08T16:27:19Z

@TJC sorry for the inconvenience, and thank you for confirming. CollectorContext was added a long time ago, and the documentation notes that its data is stored in ThreadLocal. Please note the issue could be caused by the UnevaluatedProperties accumulation in CollectorContext. CollectorContext is used in this library for several use cases, like adding defaults.

Also, HTTP servers based on Java Servlets tend to reuse the same threads for processing requests, so the underlying ThreadLocal may be reused as well, which means the older CollectorContext is not cleared from ThreadLocal. It is always recommended to reset the CollectorContext explicitly.
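
The thread-reuse effect can be seen with a small sketch that uses only the JDK. Nothing here comes from the library: PoolReuseDemo and its members are hypothetical, and a single-thread pool stands in for a servlet container's worker pool under the assumption that each "request" adds data to thread-local storage.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.mutable.ArrayBuffer

object PoolReuseDemo {
  // Hypothetical per-thread storage standing in for collected validation data.
  private val perThread: ThreadLocal[ArrayBuffer[Int]] =
    new ThreadLocal[ArrayBuffer[Int]] {
      override def initialValue(): ArrayBuffer[Int] = ArrayBuffer.empty[Int]
    }

  // Handles `requests` (>= 1) tasks on one pooled worker thread and returns
  // how many entries that thread holds while serving the final request.
  def handleRequests(requests: Int, resetAfterEach: Boolean): Int = {
    val pool = Executors.newFixedThreadPool(1) // one worker, reused for every task
    try {
      val sizes = (1 to requests).map { id =>
        val task: Callable[Int] = () => {
          val seen = perThread.get()
          seen += id
          val sizeNow = seen.size
          if (resetAfterEach) perThread.remove() // clean up before the thread is reused
          sizeNow
        }
        pool.submit(task).get() // run the requests one after another
      }
      sizes.last
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    println(s"no reset:   ${handleRequests(5, resetAfterEach = false)}")
    println(s"with reset: ${handleRequests(5, resetAfterEach = true)}")
  }
}
```

Because the worker thread outlives each task, its thread-local data outlives each task too, unless something removes it.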

Please note that another class from the library, ValidatorState, is also kept in ThreadLocal.

Also, I see that not every newly added keyword is documented. I can add documentation for this one, though.

I would also recommend testing in your test/perf environments with an approximate load replicating production in future.

Yes, reset will be the default behaviour in my PR next week.

TJC · 2022-04-09T02:22:05Z

> It is always recommended to reset the CollectorContext explicitly.

When you say it was always recommended... where was that recommendation made?

Let me check the documentation again...

If I follow the doc link from the very top of this repo, I get to this page: https://doc.networknt.com/library/json-schema-validator/ but that doesn't mention it.
If I look at the docs within this repo, I can check https://github.com/networknt/json-schema-validator/blob/master/doc/quickstart.md and https://github.com/networknt/json-schema-validator/blob/master/doc/collector-context.md and note that neither mentions the word "reset".
In fact I can do a search of the whole repo for reset() and it also comes back with no results apart from the actual method itself.

TJC · 2022-04-09T02:27:23Z

Anyway, thank you for making a fix. I look forward to it.

prashanthjos · 2022-04-09T04:08:28Z

Yes, it was supposed to be updated. However, we did clearly mention that there are ThreadLocal usages in the library. I will try to make it more explicit in the documentation.

AndreasALoew · 2022-04-10T23:17:02Z

@prashanthjos I can now finally definitely confirm from heap dump analysis that the Memory Leak observed is indeed caused by the "com.networknt.schema.CollectorKey" CollectorContext instance and its "collectorMap" HashMap entry with key "com.networknt.schema.UnEvaluatedPropertiesValidator.EvaluatedProperties" pointing to an ArrayList that grows indefinitely in thread-local storage when executing multiple JSON validations in a loop from the same worker thread.

See the following screenshot from YourKit profiler for the details:

So I fully second the request to do an implicit default reset of the CollectorContext in a finally block immediately before returning from the (common) default validate (and possibly walk) method(s), and adding an additional variant of the validate (and possibly walk) method with a boolean "reset" flag that can be used for the non-default (uncommon) use case of explicitly wanting to collect CollectorContext information throughout the sequential validation of several JSON documents in a row.

Many thanks! 😄

stevehu · 2022-04-11T12:56:40Z

FYI. There are some discussions in another thread about how to fix this memory leak issue and proposed design solutions. Please let us know what you think about the solution.

#544

stevehu · 2022-04-12T01:52:44Z

I have merged @prashanthjos' code to the master branch. With the newly introduced config flag, most users won't need to worry about resetting the context. Advanced users can set the flag to false to allow the context to live across multiple validators. I am wondering if we should update the document for this flag. Also, let me know if we are ready to have a quick release. Thank you everybody for taking the effort to get it implemented.

TJC · 2022-04-12T07:54:47Z

Running with the current master version, I'm not seeing memory exhaustion. That's good.

prashanthjos · 2022-04-12T13:58:29Z

@stevehu thank you for merging. I will update the documentation in a couple of days. Thank you @TJC and @AndreasALoew for your patience and for testing with the master change.

stevehu · 2022-04-18T20:13:34Z

This issue is resolved in the 1.0.69 release. Thank you to everyone who worked hard to get this issue resolved.

TJC changed the title ~~Appears to have memory leak~~ 1.0.68 introduces memory leak Apr 4, 2022

TJC mentioned this issue Apr 4, 2022

Refactoring-code #539

Merged

This was referenced Apr 4, 2022

Adding Unevaluated properties keyword. #534

Merged

Issue535: OneOf validation gives unnecessary errors #537

Merged

stevehu mentioned this issue Apr 6, 2022

Validator Consistent OutOfMemoryError #549

Closed

stevehu closed this as completed Apr 18, 2022

AndreasALoew mentioned this issue May 6, 2022

AllOfValidator runs out of memory under a concurrency scenario. #568

Closed
