
performance: Deserialization should be shorter than building the model from sources #1983

Closed
mehdi-kaytoue opened this issue May 3, 2018 · 16 comments
Labels
performance related with performance issues / improvements

Comments

@mehdi-kaytoue
Contributor

mehdi-kaytoue commented May 3, 2018

Dear all,
Following the advice of @monperrus in #1526, I am wondering whether there is a performance issue during deserialization.

From my other post:

However, deserialization takes twice as long as building the model from the sources.
Looking at SerializationModelStreamer.load(), I wonder if we can make it more efficient.
I am not sure I fully understand it, but do we need to assign the factory to each CtElement of the model?

@tdurieux suggested testing https://github.com/RuedigerMoeller/fast-serialization,
but I wonder whether it is not rather the factory assignment that takes the time.

Thanks a lot!!
Mehdi

Edit: in my first tests, using FST indeed looks promising... a 10x reduction!
Edit 2: apparently, there is a 1.3 GB size limit for FST-serialized objects. Mine is 4.3 GB :)
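A minimal sketch of such an FST experiment, assuming the whole model is serialized through its Factory the way SerializationModelStreamer does (the input path is illustrative):

import org.nustaq.serialization.FSTConfiguration;

import spoon.Launcher;
import spoon.reflect.factory.Factory;

public class FstExperiment {
    // FST recommends reusing one configuration instance (it is thread-safe).
    private static final FSTConfiguration FST = FSTConfiguration.createDefaultConfiguration();

    public static void main(String[] args) {
        Launcher launcher = new Launcher();
        launcher.addInputResource("src/main/java"); // illustrative path
        launcher.buildModel();
        Factory factory = launcher.getFactory();

        // Round-trip the factory (and thus the model) through FST.
        byte[] bytes = FST.asByteArray(factory);
        Factory copy = (Factory) FST.asObject(bytes);
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}

Note that asByteArray materializes the whole serialized form as a single byte[], which fits the size limit mentioned in Edit 2 for very large models.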

@monperrus
Collaborator

in my first tests, using FST indeed looks promising... a 10x reduction!

awesome! a pull request would be highly welcome :-)

@tdurieux
Collaborator

tdurieux commented May 5, 2018

I think it is possible to considerably reduce the size of Spoon's serialized model by creating a custom serialization technique.

I did a small test on the Spoon model that counts the number of duplicate strings in the model (it looks at all the fields of all the classes in the model; a sketch is shown below).

Results:

Nb unique strings: 5 334
Nb string fields:  553 971

Basically, only about 1% of the strings in the model are unique.
I think we can use this to create a serialization format with good performance.
I don't have expertise in serialization, it is just an idea. WDYT?
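A rough reconstruction of such a counting test (a guess at the methodology, not the code actually used; the input path is illustrative):

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Set;

import spoon.Launcher;
import spoon.reflect.CtModel;
import spoon.reflect.declaration.CtElement;
import spoon.reflect.visitor.filter.TypeFilter;

public class StringStats {
    public static void main(String[] args) throws Exception {
        Launcher launcher = new Launcher();
        launcher.addInputResource("src/main/java"); // illustrative path
        CtModel model = launcher.buildModel();

        Set<String> unique = new HashSet<>();
        long stringFields = 0;
        for (CtElement element : model.getElements(new TypeFilter<>(CtElement.class))) {
            // Walk up the class hierarchy to reach inherited declared fields too.
            for (Class<?> c = element.getClass(); c != null; c = c.getSuperclass()) {
                for (Field field : c.getDeclaredFields()) {
                    if (field.getType() != String.class || Modifier.isStatic(field.getModifiers())) {
                        continue;
                    }
                    field.setAccessible(true);
                    String value = (String) field.get(element);
                    if (value != null) {
                        stringFields++;
                        unique.add(value);
                    }
                }
            }
        }
        System.out.println("Nb unique strings: " + unique.size());
        System.out.println("Nb string fields:  " + stringFields);
    }
}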

Structure idea

{
	"uniqueString": ["myPacakge", "MyClass", "toto"]
	"tree": {
		"type": "root",
		"packages": [{
			"type": "CtPackageImpl",
			"simpleName": 0, // index in uniqueString
			"otherFieldName": ...,
			"classes": [{
				"type": "CtClassImpl",
				"simpleName": 1
			}]
		}]
	}
}
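A minimal Java sketch of the string-table part of this idea (the names are illustrative, not an existing Spoon API). One reason such a table helps: a plain java.io.ObjectOutputStream back-references only the very same String instance, so distinct-but-equal strings are written out in full each time.

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Writes each unique string once; the tree then refers to strings by index.
class StringTable {
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    // Returns the index of s, registering it on first use.
    int intern(String s) {
        return indexOf.computeIfAbsent(s, k -> indexOf.size());
    }

    // Writes the table: the count, then the strings in index order.
    void write(DataOutputStream out) throws IOException {
        out.writeInt(indexOf.size());
        for (String s : indexOf.keySet()) {
            out.writeUTF(s);
        }
    }
}

// While serializing an element, one would then write
//     out.writeInt(table.intern(element.getSimpleName()));
// instead of
//     out.writeUTF(element.getSimpleName());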

@monperrus
Collaborator

monperrus commented May 6, 2018 via email

@monperrus monperrus changed the title Deserialization is twice longer than building the model from sources performance: Deserialization is twice longer than building the model from sources May 11, 2018
@monperrus
Collaborator

no more activity here, closing the issue. don't hesitate to re-open if appropriate.

@monperrus monperrus changed the title performance: Deserialization is twice longer than building the model from sources performance: Deserialization should be shorter than building the model from sources Jun 4, 2018
@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

@monperrus

We currently have an issue with deserialization in Spoon: it takes longer than building the model itself.

Task                    Time (ms)
Creation of the model        7793
Serialization                7318
Deserialization             10723

@tdurieux tdurieux reopened this Jun 8, 2018
@mehdi-kaytoue
Contributor Author

For the moment I have played a bit with caching the strings in FactoryImpl, without any success. I also played with the G1 GC and string-deduplication JVM options (-XX:+UseG1GC -XX:+UseStringDeduplication), without any results, for now!

@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

I played a little bit; the I/O does not seem to have a big impact on deserialization: I get pretty much the same results when I read the file from a tmpfs partition.
But it does have an impact on serialization, which is reduced to 4121 ms.

@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

I noticed that reading the file completely into memory and then deserializing it is about 3 times faster than deserializing directly from a FileInputStream.
I don't know why.
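For reference, the "read the file completely first" variant could look like this, assuming SerializationModelStreamer.load(InputStream) as mentioned above (the path is illustrative):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import spoon.reflect.factory.Factory;
import spoon.support.SerializationModelStreamer;

public class ReadFullyThenDeserialize {
    public static Factory load(String path) throws IOException {
        // Pull the whole file into memory, then deserialize from the byte array.
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return new SerializationModelStreamer().load(new ByteArrayInputStream(bytes));
    }
}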

@msteinbeck
Contributor

Maybe a FileInputStream is not buffered?

@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

It is not; if I use a BufferedInputStream it is much faster: 2847 ms instead of 10723 ms.
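A sketch of that buffered variant (again assuming SerializationModelStreamer.load(InputStream)):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import spoon.reflect.factory.Factory;
import spoon.support.SerializationModelStreamer;

public class BufferedLoad {
    public static Factory load(File file) throws IOException {
        // Buffering batches the many small reads issued during deserialization
        // into a few large ones, instead of one system call per read.
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            return new SerializationModelStreamer().load(in);
        }
    }
}

This would also explain the earlier observation: reading the whole file into memory first is just the extreme form of buffering.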

@surli surli added the performance related with performance issues / improvements label Jun 14, 2018
@monperrus
Collaborator

Another library for (possibly faster) serialization: https://github.com/protostuff/protostuff

@mehdi-kaytoue
Contributor Author

I tested with 6.3.0-SNAPSHOT, which adds the wrapping in a BufferedInputStream. For a model that takes 40 minutes to build and serializes to a 5.5 GB file, reading it back takes only 300 seconds! This is a huge gain, so for me the problem is solved (it took something like 1h30 before!).

This is huge :)

The fact that the serialized object is very redundant is still a problem, but a separate one.

@pvojtechovsky
Collaborator

The redundancy can easily be solved by GZip compression. The file will be smaller and it will probably run faster too.
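A sketch of that suggestion, layering GZIP streams over the buffered file streams (assuming the SerializationModelStreamer save/load API as above):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import spoon.reflect.factory.Factory;
import spoon.support.SerializationModelStreamer;

public class GzippedModelIO {
    public static void save(Factory factory, File file) throws IOException {
        // Compress while writing; the highly redundant strings compress very well.
        try (OutputStream out = new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(file)))) {
            new SerializationModelStreamer().save(factory, out);
        }
    }

    public static Factory load(File file) throws IOException {
        // Decompress transparently while reading.
        try (InputStream in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(file)))) {
            return new SerializationModelStreamer().load(in);
        }
    }
}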

@monperrus
Collaborator

that's great, closing the issue then!

thanks for the bug report @mehdi-kaytoue

we can open a new issue about the size of the serialized model

@mehdi-kaytoue
Contributor Author

mehdi-kaytoue commented Jun 21, 2018

Just tried normal GZIP compression with 7-Zip (ad hoc): it reduced my serialized model from 5.5 GB to 0.7 GB, and it reads back fine with a basic GZIPInputStream.

Great news: the read time remains the same, so this solves the storage-space issue for now!

The best option would be to rework the serialization schema, but as you said @monperrus, that is another story.

Thank you all, this REALLY helps :)

@monperrus
Collaborator

monperrus commented Jun 21, 2018 via email
