performance: Deserialization should be shorter than building the model from sources #1983
Awesome! A pull request would be highly welcome :-)
I think it is possible to considerably reduce the size of Spoon's serialized model by creating a custom serialization technique. I did a small test on the Spoon model that counts the number of duplicate strings in the model (looking at all the fields of all the classes in the model).
Results: 5,334 unique strings. Basically, only about 1% of the strings in the model are unique.
Structure idea:
{
  "uniqueString": ["myPackage", "MyClass", "toto"],
  "tree": {
    "type": "root",
    "packages": [{
      "type": "CtPackageImpl",
      "simpleName": 0, // index in uniqueString
      "otherFieldName": ...,
      "classes": [{
        "type": "CtClassImpl",
        "simpleName": 1
      }]
    }]
  }
}
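A minimal sketch of the string-pool part of that idea, independent of Spoon's actual serializer (the `StringPool` class and its method names are illustrative assumptions, not Spoon API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative string pool: every distinct string is stored once,
// and the serialized tree only carries integer indices into it.
class StringPool {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    // Returns the index of the string, adding it to the pool if unseen.
    int intern(String s) {
        return indexOf.computeIfAbsent(s, key -> {
            strings.add(key);
            return strings.size() - 1;
        });
    }

    // The "uniqueString" array of the proposed format.
    List<String> table() {
        return strings;
    }
}
```

With only about 1% of the strings being unique, writing integer indices instead of repeated strings should shrink the serialized file considerably.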
This is quite related to the method FactoryImpl#dedup.
No more activity here, closing the issue. Don't hesitate to re-open if appropriate.
We currently have an issue with deserialization in Spoon: it takes longer than building the model itself.
For the moment I have played a bit with caching the strings in FactoryImpl, without any success. I have also played with the G1 GC and string deduplication JVM options, without any results so far.
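For reference, the G1 and string-deduplication options referred to here are presumably the standard HotSpot flags (string deduplication only works with the G1 collector); the heap size and jar name below are placeholders:

```
java -XX:+UseG1GC -XX:+UseStringDeduplication -Xmx8g -jar my-spoon-app.jar
```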
I played with it a little; the I/O does not seem to have a big impact on deserialization. I get pretty much the same results when I read the file from a tmpfs partition.
I noticed that it is much faster (about 3x) to read the file completely and then deserialize it than to feed a FileInputStream directly to the deserializer.
Maybe a FileInputStream is not buffered?
It is not; if I use a BufferedInputStream, it is much faster: 2,847 ms instead of 10,723 ms.
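For context, a minimal sketch of the buffered read, assuming plain Java serialization of the model (the `model.ser` path is a placeholder):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.ObjectInputStream;

public class DeserializeModel {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Unbuffered: every small read issued by ObjectInputStream hits the file system.
        // try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("model.ser"))) { ... }

        // Buffered: reads go through an in-memory buffer, which is what produced
        // the ~2.8 s instead of ~10.7 s measurement above.
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream("model.ser"), 1 << 16))) {
            Object model = in.readObject();
            System.out.println("Read an instance of " + model.getClass().getName());
        }
    }
}
```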
Another library for (possibly faster) serialization: https://github.com/protostuff/protostuff
I tested with 6.3.0-SNAPSHOT, which added the wrapping in a BufferedInputStream. For a model that takes 40 minutes to build and serializes to a 5.5 GB file, it now takes only 300 seconds to read! This is a huge gain, so for me the problem is solved (it was taking something like 1h30 before). The fact that the serialized object is very redundant is still a problem, but that is another problem.
The redundancy can be easily addressed by GZip compression. It will produce a smaller file, and it will probably run faster too.
That's great, closing the issue then! Thanks for the bug report @mehdi-kaytoue. We can open a new issue about the size of the serialized model.
Just tried normal GZIP compression with 7-Zip (ad hoc): it reduced my model from 5.5 GB to 0.7 GB, and it reads back with a basic GZIPInputStream. Great news: the read time remains the same, so this solves the storage space issue for now! The best option would be to rework the serialization schema, but as you said @monperrus, that is another story. Thank you all, this REALLY helps :)
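A minimal sketch of combining buffering and GZIP with plain Java serialization (the class name and paths are placeholders, not Spoon API):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipModelIO {
    // Compress transparently while serializing.
    static void write(Object model, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(path))))) {
            out.writeObject(model);
        }
    }

    // Decompress transparently while deserializing.
    static Object read(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(path))))) {
            return in.readObject();
        }
    }
}
```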
See #2094
Dear all,
following the advice of @monperrus in #1526
I am wondering whether there is a performance issue during deserialization.
From my other post:
@tdurieux suggested testing https://github.com/RuedigerMoeller/fast-serialization
but I wonder if it is not the factory assignment that takes the time.
Thanks a lot!!
Mehdi
Edit: in my first tests, using FST indeed looks promising... a 10x reduction!
Edit 2: apparently, there is a 1.3 GB size limit for FST-serialized objects. Mine is 4.3 GB :)
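For completeness, a sketch of what the FST experiment might look like, based on the stream-oriented API shown in the fast-serialization README (the class name and paths are placeholders); whether this particular usage runs into the size limit mentioned above is not something the sketch settles:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.nustaq.serialization.FSTConfiguration;
import org.nustaq.serialization.FSTObjectInput;
import org.nustaq.serialization.FSTObjectOutput;

public class FstModelIO {
    // One shared configuration, as recommended by the FST documentation.
    static final FSTConfiguration CONF = FSTConfiguration.createDefaultConfiguration();

    static void write(Object model, String path) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(path)) {
            FSTObjectOutput out = CONF.getObjectOutput(fos);
            out.writeObject(model);
            // Flush, but do not close the shared FSTObjectOutput; only the stream is closed.
            out.flush();
        }
    }

    static Object read(String path) throws Exception {
        try (FileInputStream fis = new FileInputStream(path)) {
            FSTObjectInput in = CONF.getObjectInput(fis);
            return in.readObject();
        }
    }
}
```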