Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.0.68 introduces memory leak #546

Closed
TJC opened this issue Apr 4, 2022 · 33 comments
Closed

1.0.68 introduces memory leak #546

TJC opened this issue Apr 4, 2022 · 33 comments

Comments

@TJC
Copy link

TJC commented Apr 4, 2022

Version 1.0.68 introduces a memory leak.

When called repeatedly, we see the JSON Schema validator consumes all available memory, until it crashes with java.lang.OutOfMemoryError: Java heap space

The following script demonstrates it (uses Ammonite on JDK 17, Scala 2.13).
Run the script after setting JAVA_OPTS=-Xmx256M

This assumes you're parsing some fairly large JSON files (500kbyte in my case), and the JSON Schema is of some complexity. I assume you have some examples around like this to use.

Note that if you revert the version back to 1.0.67, the script runs successfully.

#!/usr/bin/env amm

interp.load.ivy("com.fasterxml.jackson.core" % "jackson-databind" % "2.13.2.1")
interp.load.ivy("com.networknt" % "json-schema-validator" % "1.0.68")

@

import java.nio.file.{Files, Path}

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.networknt.schema.{JsonSchema, JsonSchemaFactory, SpecVersion}

val bigFile = Files.readString(Path.of("big-json-file.json"))

// In Scala, an object is similar to Java static classes, I think?
object JsonSchemaCheck {
  private val jsonSchemaContent = Files.readString(Path.of("schema.json"))
  private val validator = getJsonSchema(jsonSchemaContent)

  private def getJson(content: String): JsonNode = {
    val mapper = new ObjectMapper
    mapper.readTree(content)
  }

  private def getJsonSchema(schema: String): JsonSchema = {
    val factory = JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7)
    factory.getSchema(schema)
  }

  def check(rawDoc: String) = {
    val json = getJson(rawDoc)
    validator.validate(json)
  }
}

@main
def main(): Unit = {
  Range(1,500).foreach { i =>
    println(i)
    JsonSchemaCheck.check(bigFile)
  }
  println("Done")
}
@TJC TJC changed the title Appears to have memory leak 1.0.68 introduces memory leak Apr 4, 2022
@TJC
Copy link
Author

TJC commented Apr 4, 2022

Upon further testing, it seems the memory leak was introduced in 1.0.68. Reverting back to .67 fixes the problem.

@TJC TJC mentioned this issue Apr 4, 2022
@TJC
Copy link
Author

TJC commented Apr 4, 2022

After some git bisecting, the cause of the memory leak seems to be this commit: 5007242 in #534

@AndreasALoew
Copy link
Contributor

AndreasALoew commented Apr 4, 2022

Can you still please try to run the JDK 17 JVM you seem to be using as the Scala/ammonite runtime platform to start with java command line option "-XX:+HeapDumpOnOutOfMemoryError" 😄

For details, see:
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/clopts001.html

This would allow Java developers to pinpoint the place where the memory is lost rather than looking for a needle in a haystack... 😢

@AndreasALoew
Copy link
Contributor

Note:
By default, then, the heap dump is created in a file called java_pidpid.hprof in the working directory of the VM, as in the example above. You can specify an alternative file name or directory with the -XX:HeapDumpPath= option. For example -XX:HeapDumpPath=/disk2/dumps will cause the heap dump to be generated in the /disk2/dumps directory.

@TJC
Copy link
Author

TJC commented Apr 4, 2022

Thanks for the quick response. I can't send a heap dump based on the current JSON source that I'm using; I'll need to run something over the JSON to replace many strings with random characters. But I can work on that.

I was hoping that you might have some JSON files and Schemas nearby yourself, to use with the script I posted, as I doubt the exact content of my JSON matters for triggering the bug. Just that the file is fairly large.

If you're on a Mac, then it's just brew install ammonite-repl and then chmod +x my-test-script && ./my-test-script to use it

(It looks for a big-json-file.json and schema.json in the current directory)

@AndreasALoew
Copy link
Contributor

If you can't disclose the whole heap dump due to the private contents of its Strings, can you maybe just list the major findings from opening the heap dump in a heap dump analyzer (like Eclipse MAT etc) so that we have a chance to see objects of which class/in which map etc. (regardless of particular content) do accumulate without limits over the course of time?

@TJC
Copy link
Author

TJC commented Apr 4, 2022

Sure, I can do that. It's 1:30am here though, so it'll be "tomorrow" for me when I get to it. Thanks

@TJC
Copy link
Author

TJC commented Apr 5, 2022

Screen Shot 2022-04-05 at 10 25 10 am

Looks like almost all the memory is consumed by an ArrayList from AnyOfValidator.java

It contains four million elements, which is far more than the number of elements in my source JSON. It seems like it is not clearing that array between validations?

@TJC
Copy link
Author

TJC commented Apr 5, 2022

Also, I have built a redacted JSON file, so I'm happy to share a heap dump with you privately if you like.

@AndreasALoew
Copy link
Contributor

Sorry, I myself don't have time until this evening (Tue, and I'm in Europe/MET) to look into it (just in case nobody else wants to take over in the meantime... 😀), but the heap dump should compress quite well, so I assume you can/should even be able to reduce -Xmx and attach the redacted/"anonymized" one here... Use the max compression level of e.g. xz, bz2 or zip and attach the compressed dump (max file size here seems to be 25MB). If that's not sufficient, can you prepare/provide a download link for the heap dump? Thx! 😄

But I think even the info from the screenshot in your previous comment will most likely help to find out what's going wrong here in the AnyOfValidator, it's a pretty good starting point... 👍

@AndreasALoew
Copy link
Contributor

Sorry - I still haven't received a heap dump from @TJC ... 😞

Please be informed that without a "readily prepared" heap dump to analyze, I will only find some time to reproduce the issue by myself over the weekend at the earliest...

Maybe @prashanthjos / @prashanth-chaitanya as the authors of the most likely faulty (by result of bisect) commit mentioned above can help earlier?

@TJC
Copy link
Author

TJC commented Apr 8, 2022

Sorry - I still haven't received a heap dump from @TJC ... 😞

My previous offer to share it with you privately still stands. And as I said in an earlier comment - I strongly suspect this issue will be easy to reproduce using your own collection of JSON and JSON Schemas. Maybe just ensure your test schema includes a couple of nested anyOf conditions, since the issues appears to originate there.

@AndreasALoew
Copy link
Contributor

@TJC so please indeed share it privately with me - how do you plan to do it? I can provide you an https upload link to my GMX web space, or you put it onto some location where I can download it via https.

I'll immediately send a private e-mail to the adress in your GitHub user profile offering my upload URL. In order to separate credentials from the link for security reasons: the password for the (only privately shared) upload URL contained in the e-mail is "networkNTjson". 😉

@prashanthjos
Copy link
Member

prashanthjos commented Apr 8, 2022

hi @AndreasALoew, @networknt/json-schema-validator apologies for my late entry into the issue, I have made change to add evaluated and unevaluatedProperties to CollectorContext (Which is thread level context that uses ThreadLocal). Looks like you are repeatedly calling json schema validate in a loop. Can you try resetting the CollectorContext before every call to validate using CollectorContext.getInstance().reset() ?

CollectorContext is not automatically reset or removed from ThreadLocal, It was a decision left to the user of the framework deliberately.

Please let me know your thoughts, always happy to help.

@AndreasALoew
Copy link
Contributor

AndreasALoew commented Apr 8, 2022

Hello @prashanthjos , many thanks for the explanation 😄

My immediate follow-up question to you would be: what is the advantage of not resetting i.e. "reusing" a CollectorContext for two subsequent calls to validate()?

If there is no such advantage, wouldn't it make more sense to clean the thread-local CollectorContext in a finally() block immediately before returning from any call to validate()?

Or, in case there is a valid scenario, I'd like to propose that we change the default behaviour for the CollectorContext to be cleaned immediately before returning from validate, but we add an additional validate() call passing a boolean parameter that would explicitly tell the framework to keep and NOT clean the CollectorContext at the end of validate(), and thereby leave the decision when to clean it to the caller (as you seem to have intended).

What do you think?

@prashanthjos
Copy link
Member

prashanthjos commented Apr 8, 2022

Agree with you @AndreasALoew we can add an additional validate and walk methods that will NOT reset the CollectorContext based on user input.

@TJC can you quickly verify if the reset of CollectorContext fixes your issue.

@AndreasALoew
Copy link
Contributor

AndreasALoew commented Apr 8, 2022

... such that the default (i.e. if you continue to use the framework like it has been in previous releases) cleans the context, and if you explicitly would like to keep the CollectorContext (still the question is: why would you want do do so - please explain), then this can be expressed by an additional optional boolean parameter as proposed earlier...

Big advantage of doing it this way is that it won't break (i.e. cause OutOfMemory) all existing code using the validator...

So @prashanthjos , would you be willing to propose such an additional PR? That'd be just great! 👍

@prashanthjos
Copy link
Member

Information collected in CollectorContext is used by us beyond validation. A simple example could be collecting information about references($ref) and there data while walking over the schema or we may want to store specific node information into our database after we validate, Since the validation already walks over every schema node and its corresponding data node
we can collect this information while validating or walking. We have several custom keywords that do this kind of data collection while validating and we later use that data in downstream code.

I am out of town till Monday(11th April 2022). I will be happy to create a PR once I am back.

In the mean time I would request @TJC to test with CollectorContext reset fix and see if it works.

@TJC
Copy link
Author

TJC commented Apr 8, 2022

@TJC so please indeed share it privately with me - I can provide you an https upload link to my GMX web space, or you put it onto some location where I can download it via https.

Thanks for helping out with that - if you check your email now, you should find details about how to access the heap dump :)

@prashanthjos
Copy link
Member

@TJC can you please confirm here once you test if CollectorContext.getInstance().reset() works. Thanks!

@TJC
Copy link
Author

TJC commented Apr 8, 2022

Looks like you are repeatedly calling json schema validate in a loop. Can you try resetting the CollectorContext before every call to validate using CollectorContext.getInstance().reset() ?

My demonstration script is calling .validate() in a loop, but that was for demonstration. In production code, the .validate() call happens each time a webserver receives an HTTP request. Which can happen quite frequently, but it still takes hours in normal use before all memory is consumed by this bug.

We deployed some code to production that included several dependency updates, including the minor patch version of the JSON Schema validator. In the middle of the night, the systems started running out of memory, causing us to get people out of bed to investigate the issue.

CollectorContext is not automatically reset or removed from ThreadLocal, It was a decision left to the user of the framework deliberately.

Normally when a significant breaking change like this is made to software, the major version number gets incremented, so that users of the software know to pay more attention. eg. Changing the version from 1.0.67 to 2.0.0. Because the version just incremented from 1.0.67 to 1.0.68, we thought it only included minor changes and bugfixes, and would not require us to make changes to how we use the software.

I also note that the documentation doesn't seem to reflect this new requirement.

@TJC
Copy link
Author

TJC commented Apr 8, 2022

I can confirm that calling reset on the CollectorContext does indeed seem to fix the memory issue.

However I would encourage you to make the reset a default experience rather than an optional one, as I'm sure my company is not the only one that will step on this landmine.

@jochenberger
Copy link

No, it isn't. 😉
Luckily, our test suite caught this.

@prashanthjos
Copy link
Member

prashanthjos commented Apr 8, 2022

@TJC sorry for the inconvenience and thank you for confirming. CollectorContext was added long before and it is noted in the documentation that data is stored in ThreadLocal. Please note the issue could be because of the UnevaluatedProperties accumulation in CollectorContext. CollectorContext is being used in this library for several use-cases like adding defaults.

Also http servers based on Java Servlets , could tend to reuse same threads for processing requests so the underlying ThreadLocal also might be reused thus not clearing the older CollectorContext from ThreadLocal. It is always recommended to reset the CollectorContext explicitly.

Please note another class from the Library ValidatorState is also on ThreadLocal.

Also I see there is no documentation for every new keyword that is added. I can add one for this though.

I would also recommend testing in your test/perf environments with an approximate load replicating production in future.

Yes reset will be a default experience in my PR next week.

@TJC
Copy link
Author

TJC commented Apr 9, 2022

It is always recommended to reset the CollectorContext explicitly.

When you say it was always recommended.. Where was that recommendation made?

Let me check the documentation again..

@TJC
Copy link
Author

TJC commented Apr 9, 2022

Anyway, thank you for making a fix. I look forward to it.

@prashanthjos
Copy link
Member

Yes it was supposed to be updated. But however we clearly mentioned that there are ThreadLocal usages in the library. I will try to make it more explicit in the documentation.

@AndreasALoew
Copy link
Contributor

AndreasALoew commented Apr 10, 2022

@prashanthjos I can now finally definitely confirm from heap dump analysis that the Memory Leak observed is indeed caused by the "com.networknt.schema.CollectorKey" CollectorContext instance and its "collectorMap" HashMap entry with key "com.networknt.schema.UnEvaluatedPropertiesValidator.EvaluatedProperties" pointing to an ArrayList that grows indefinitely in thread-local storage when executing multiple JSON validations in a loop from the same worker thread.

See the following screenshot from YourKit profiler for the details:

LeakUnEvaluatedPropertiesValidator

So I fully second the request to do an implicit default reset of the CollectorContext in a finally block immediately before returning from the (common) default validate (and possibly walk) method(s), and adding an additional variant of the validate (and possibly walk) method with a boolean "reset" flag that can be used for the non-default (uncommon) use case of explicitly wanting to collect CollectorContext information throughout the sequential validation of several JSON documents in a row.

Many thanks! 😄

@stevehu
Copy link
Contributor

stevehu commented Apr 11, 2022

FYI. There are some discussions in another thread about how to fix this memory leak issue and proposed design solutions. Please let us know what you think about the solution.

#544

@stevehu
Copy link
Contributor

stevehu commented Apr 12, 2022

I have merged @prashanthjos' code to the master branch. With the newly introduced config flag, most users won't need to worry about resetting the context. Advanced users can set the flag to false to allow the context to live across multiple validators. I am wondering if we should update the document for this flag. Also, let me know if we are ready to have a quick release. Thank you everybody for taking the effort to get it implemented.

@TJC
Copy link
Author

TJC commented Apr 12, 2022

Running with the current master version, I'm not seeing memory exhaustion. That's good.

@prashanthjos
Copy link
Member

@stevehu thanking you for merging. I will update the documentation in a couple of days. Thank you @TJC and @AndreasALoew for your patience and testing with master change.

@stevehu
Copy link
Contributor

stevehu commented Apr 18, 2022

This issue is resolved in the 1.0.69 release. Thank you everyone who are working hard to get this issue resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants