
performance: Deserialization should be shorter than building the model from sources #1983

Closed
mehdi-kaytoue opened this issue May 3, 2018 · 16 comments
Labels
performance related with performance issues / improvements

Comments

@mehdi-kaytoue
Contributor

mehdi-kaytoue commented May 3, 2018

Dear all,
Following the advice of @monperrus in #1526, I am wondering whether there is a performance issue during deserialization.

From my other post:

However, deserialization takes twice as long as building the model from the sources.
Looking at SerializationModelStreamer.load(), I wonder if we can make it more efficient.
I am not sure I fully understand it, but do we need to assign the factory to each CtElement of the model?

@tdurieux suggested testing https://github.com/RuedigerMoeller/fast-serialization,
but I wonder whether it is not rather the factory assignment that takes the time.

Thanks a lot!!
Mehdi

Edit: in my first tests, using FST indeed looks promising... a 10x reduction!
Edit 2: apparently, there is a 1.3 GB size limit for FST-serialized objects. Mine is 4.3 GB :)
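A minimal sketch of such an FST experiment, assuming the whole model is serialized through its Factory the way SerializationModelStreamer does (the input path is illustrative):

import org.nustaq.serialization.FSTConfiguration;

import spoon.Launcher;
import spoon.reflect.factory.Factory;

public class FstExperiment {
    // FST recommends reusing one configuration instance (it is thread-safe).
    private static final FSTConfiguration FST = FSTConfiguration.createDefaultConfiguration();

    public static void main(String[] args) {
        Launcher launcher = new Launcher();
        launcher.addInputResource("src/main/java"); // illustrative path
        launcher.buildModel();
        Factory factory = launcher.getFactory();

        // Round-trip the factory (and thus the model) through FST.
        byte[] bytes = FST.asByteArray(factory);
        Factory copy = (Factory) FST.asObject(bytes);
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}

Note that asByteArray materializes the whole serialized form as a single byte[], which fits the size limit mentioned in Edit 2 for very large models.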

@monperrus
Collaborator

in my first tests, using FST indeed looks promising... a 10x reduction!

awesome! a pull request would be highly welcome :-)

@tdurieux
Collaborator

tdurieux commented May 5, 2018

I think it is possible to considerably reduce the size of Spoon's serialized model by creating a custom serialization technique.

I did a small test on the Spoon model that counts the number of duplicate strings in the model (it looks at all the fields of all the classes in the model; a sketch is shown below).

Results:

Nb unique strings: 5 334
Nb string fields:  553 971

Basically, only about 1% of the strings in the model are unique.
I think we can use this to create a serialization format with good performance.
I don't have expertise in serialization, it is just an idea. WDYT?
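A rough reconstruction of such a counting test (a guess at the methodology, not the code actually used; the input path is illustrative):

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Set;

import spoon.Launcher;
import spoon.reflect.CtModel;
import spoon.reflect.declaration.CtElement;
import spoon.reflect.visitor.filter.TypeFilter;

public class StringStats {
    public static void main(String[] args) throws Exception {
        Launcher launcher = new Launcher();
        launcher.addInputResource("src/main/java"); // illustrative path
        CtModel model = launcher.buildModel();

        Set<String> unique = new HashSet<>();
        long stringFields = 0;
        for (CtElement element : model.getElements(new TypeFilter<>(CtElement.class))) {
            // Walk up the class hierarchy to reach inherited declared fields too.
            for (Class<?> c = element.getClass(); c != null; c = c.getSuperclass()) {
                for (Field field : c.getDeclaredFields()) {
                    if (field.getType() != String.class || Modifier.isStatic(field.getModifiers())) {
                        continue;
                    }
                    field.setAccessible(true);
                    String value = (String) field.get(element);
                    if (value != null) {
                        stringFields++;
                        unique.add(value);
                    }
                }
            }
        }
        System.out.println("Nb unique strings: " + unique.size());
        System.out.println("Nb string fields:  " + stringFields);
    }
}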

Structure idea

{
	"uniqueString": ["myPacakge", "MyClass", "toto"]
	"tree": {
		"type": "root",
		"packages": [{
			"type": "CtPackageImpl",
			"simpleName": 0, // index in uniqueString
			"otherFieldName": ...,
			"classes": [{
				"type": "CtClassImpl",
				"simpleName": 1
			}]
		}]
	}
}
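A minimal Java sketch of the string-table part of this idea (the names are illustrative, not an existing Spoon API). One reason such a table helps: a plain java.io.ObjectOutputStream back-references only the very same String instance, so distinct-but-equal strings are written out in full each time.

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Writes each unique string once; the tree then refers to strings by index.
class StringTable {
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    // Returns the index of s, registering it on first use.
    int intern(String s) {
        return indexOf.computeIfAbsent(s, k -> indexOf.size());
    }

    // Writes the table: the count, then the strings in index order.
    void write(DataOutputStream out) throws IOException {
        out.writeInt(indexOf.size());
        for (String s : indexOf.keySet()) {
            out.writeUTF(s);
        }
    }
}

// While serializing an element, one would then write
//     out.writeInt(table.intern(element.getSimpleName()));
// instead of
//     out.writeUTF(element.getSimpleName());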

@monperrus
Collaborator

monperrus commented May 6, 2018 via email

@monperrus monperrus changed the title Deserialization is twice longer than building the model from sources performance: Deserialization is twice longer than building the model from sources May 11, 2018
@monperrus
Collaborator

no more activity here, closing the issue. don't hesitate to re-open if appropriate.

@monperrus monperrus changed the title performance: Deserialization is twice longer than building the model from sources performance: Deserialization should be shorter than building the model from sources Jun 4, 2018
@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

@monperrus

We currently have an issue with deserialization in Spoon: it takes longer than building the model itself.

Task                    Time (ms)
Creation of the model        7793
Serialization                7318
Deserialization             10723

@tdurieux tdurieux reopened this Jun 8, 2018
@mehdi-kaytoue
Contributor Author

For the moment I have played a bit with caching the strings in FactoryImpl, without any success. I also played with the G1 GC and string-deduplication JVM options (-XX:+UseG1GC -XX:+UseStringDeduplication), without any results, for now!

@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

I played a little bit; the I/O does not seem to have a big impact on deserialization: I get pretty much the same results when I read the file from a tmpfs partition.
But it does have an impact on serialization, which is reduced to 4121 ms.

@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

I noticed that reading the file completely into memory and then deserializing it is about 3 times faster than deserializing directly from a FileInputStream.
I don't know why.
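For reference, the "read the file completely first" variant could look like this, assuming SerializationModelStreamer.load(InputStream) as mentioned above (the path is illustrative):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import spoon.reflect.factory.Factory;
import spoon.support.SerializationModelStreamer;

public class ReadFullyThenDeserialize {
    public static Factory load(String path) throws IOException {
        // Pull the whole file into memory, then deserialize from the byte array.
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return new SerializationModelStreamer().load(new ByteArrayInputStream(bytes));
    }
}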

@msteinbeck
Contributor

Maybe a FileInputStream is not buffered?

@tdurieux
Collaborator

tdurieux commented Jun 8, 2018

It is not; if I use a BufferedInputStream it is much faster: 2847 ms instead of 10723 ms.
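A sketch of that buffered variant (again assuming SerializationModelStreamer.load(InputStream)):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import spoon.reflect.factory.Factory;
import spoon.support.SerializationModelStreamer;

public class BufferedLoad {
    public static Factory load(File file) throws IOException {
        // Buffering batches the many small reads issued during deserialization
        // into a few large ones, instead of one system call per read.
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            return new SerializationModelStreamer().load(in);
        }
    }
}

This would also explain the earlier observation: reading the whole file into memory first is just the extreme form of buffering.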

@surli surli added the performance related with performance issues / improvements label Jun 14, 2018
@monperrus
Collaborator

Another library for (possibly faster) serialization: https://github.com/protostuff/protostuff

@mehdi-kaytoue
Contributor Author

I tested with 6.3.0-SNAPSHOT, which adds the wrapping in a BufferedInputStream. For a model that takes 40 minutes to build and serializes to a 5.5 GB file, reading it back takes only 300 seconds! This is a huge gain, so for me the problem is solved (it took something like 1h30 before!).

This is huge :)

The fact that the serialized object is very redundant is still a problem, but a separate one.

@pvojtechovsky
Collaborator

The redundancy can easily be solved by GZip compression. The file will be smaller and it will probably run faster too.
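A sketch of that suggestion, layering GZIP streams over the buffered file streams (assuming the SerializationModelStreamer save/load API as above):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import spoon.reflect.factory.Factory;
import spoon.support.SerializationModelStreamer;

public class GzippedModelIO {
    public static void save(Factory factory, File file) throws IOException {
        // Compress while writing; the highly redundant strings compress very well.
        try (OutputStream out = new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(file)))) {
            new SerializationModelStreamer().save(factory, out);
        }
    }

    public static Factory load(File file) throws IOException {
        // Decompress transparently while reading.
        try (InputStream in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(file)))) {
            return new SerializationModelStreamer().load(in);
        }
    }
}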

@monperrus
Collaborator

that's great, closing the issue then!

thanks for the bug report @mehdi-kaytoue

we can open a new issue about the size of the serialized model

@mehdi-kaytoue
Contributor Author

mehdi-kaytoue commented Jun 21, 2018

Just tried normal GZIP compression with 7-Zip (ad hoc): it reduced my serialized model from 5.5 GB to 0.7 GB, and it reads back fine with a basic GZIPInputStream.

Great news: the read time remains the same, so this solves the storage-space issue for now!

The best option would be to rework the serialization schema, but as you said @monperrus, that is another story.

Thank you all, this REALLY helps :)

@monperrus
Collaborator

monperrus commented Jun 21, 2018 via email
