Skip to content
This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Use msgpack instead of json for serializing events #17

Closed
wants to merge 5 commits into from

Conversation

deactivated
Copy link
Contributor

First: this is a fairly intrusive patch, and I don't really expect it to be merged as is.

Several bugs that I ran into in early IDArling use were caused because binary data wasn't encoded for transit via json (e.g. 2bd7be7). The mismatch between IDA strings and JSON also results in a fair amount of boilerplate spread throughout events.py.

This patch addresses these issues by using msgpack for serialization. The msgpack data model is very similar to JSON, but it allows unicode strings to be distinguished from binary data.

The patch addresses python 2/3 interoperability by explicitly using unicode for event keys.

A small script is included for migrating databases from JSON to msgpack.

@patateqbool
Copy link
Member

Indeed, it's a lot of change :).
JSON does not seem to be very well suited to our needs, I do not see any major drawback to switch for msgpack, what do you think @NeatMonster
For database migration, I do not think it's necessary to merge this script into upstream, as IDArling is not yet stable, users can just delete and redo their modifications ?

@NeatMonster
Copy link
Member

Here's my two cents:

  • msgpack certainly is more elegant, compact and practical than JSON, but it also introduces an extra dependency on the user, which is something we have tried to avoid at all costs since the beggining.
  • While it allows to get rid of [en|de]code_bytes calls, we still have to call [en|de]code on strings, which in my opinion is the biggest issue currenlty and isn't solved by this change.
  • By switching everything to unicode, requiring to use the unicode literals and adding some encoding/decoding calls in other places, I'm not sure the code is less error-prone.

@NyaMisty
Copy link
Contributor

NyaMisty commented Aug 15, 2018

Maybe we can also use something like pickle?
PS: personally I think that JSON is easier to debug

@NeatMonster
Copy link
Member

Maybe we can also use something like pickle?

I'm not quite sure that pickle is inter-operable between Python 2 and 3.

@NyaMisty
Copy link
Contributor

Unlike JSON, MessagePack is able to preserve binary data. Using
MessagePack for event serialization means that native binary data can
be transmitted without modification and Event.encode_bytes is no
longer necessary.

Preserving the bytestring/unicode distinction requires some extra
steps to preserve Python 2/3 interop. Specifically, literal keys are
marked as unicode explicitly (i.e. using u'...'). Branch and repo
names are also encoded with UTF-8.

The database schema is modified to use BLOB instead of TEXT to record
events. A migration script is provided to update a database in place.
Metaclass type names need to be made using "str(...)" to maintain
python 2/3 compat
All IDA platforms use UTF-8 internally. Decoding strings should only
be necessary if they're displayed to the user.
@deactivated
Copy link
Contributor Author

deactivated commented Aug 16, 2018

@NeatMonster I agree with all of those issues:

msgpack certainly is more elegant, compact and practical than JSON, but it also introduces an extra dependency on the user, which is something we have tried to avoid at all costs since the beggining.

u-msg-pack is a reasonably efficient, pure-python implementation of msgpack distributed in a single file. It would be possible to include a copy of it and use it if the msgpack library isn't available. Would you be ok with that?

While it allows to get rid of [en|de]code_bytes calls, we still have to call [en|de]code on strings, which in my opinion is the biggest issue currenlty and isn't solved by this change.

Now that it's clear that all IDA platforms use UTF-8 internally, it should be possible to completely remove string encoding/decoding for events. Strings only need to be decoded if they're directly shown to the user. I made this change in d78643e.

By switching everything to unicode, requiring to use the unicode literals and adding some encoding/decoding calls in other places, I'm not sure the code is less error-prone.

Using explicit unicode literals really is annoying and error prone. By using str(...) for the metaclass type name, I was able to use from __future__ import unicode_literals in commands.py, packets.py, and models.py in d78643e. I'm pretty sure this bytes/unicode issue will also crop up if pickle is used.

@NyaMisty
Copy link
Contributor

bytes/Unicode issue will also crop up if pickle is used
Why? According to the documentation, the pickle offers option for 'str' type handling.
Also, as IDA are using UTF-8 and Unicode everywhere on every platform itself, encoding aren’t a big problem now.
Instead, more transparency will be a better choice, which avoid lots of serialization code when the event become more and more complicated.

@NeatMonster NeatMonster added the enhancement New feature or request label Aug 17, 2018
@NeatMonster
Copy link
Member

Would you be ok with that?

It still bothers me a little. I've scrolled through the u-msg-pack file quickly, as it is not very long, and it got me thinking. Why not just use msgpack as our format for sending the packets over the network and implement only the necessary bits for packing into/unpacking from this format?

Now that it's clear that all IDA platforms use UTF-8 internally

Are you 100% sure about that? I was able to verify it in most cases, but I would love to have an official source on this. Still, it makes our lives much easier to not have to encode strings/binary strings.

I was able to use [...] in d78643e

That's a lot better in my opinion, thanks!

The one question that remains for me is: can we keep the same level of detail shown to the user and inter-operability if we switch to msgpack? For example, I'd like the user be able to see which packet is sent and received by the client and the server. I don't think that would be an issue for the client, but it might require additional modifications for the server. I think inter-operability will be okay if it is confirmed that UTF-8 is used everywhere.

@NyaMisty
Copy link
Contributor

Yeah I can give you the source of that.
In https://www.hex-rays.com/products/ida/7.0/index.shtml
Now IDA is a truly international application that can speak all languages of the world because it uses UTF-8 everywhere. All scripts and plugins can use it. You can use UTF-8 in the disassembly listing, including comments or even the function names. This is not what we advise, therefore odd characters in names require some fine tuning. See the IDA 7.0: Automatic discovery of string literals during auto-analysis page for all the gory details.

@NeatMonster
Copy link
Member

Thanks @NyaMisty. We may not need to worry about the encoding of string anymore if we get rid of JSON then. But to display the packets to the user, I'm afraid that will still be necessary.

@NyaMisty
Copy link
Contributor

Maybe we should close the this PR and move it into a separate issue?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants