
Excessive memory consumption #90

Closed · pkgw opened this issue Oct 4, 2022 · 12 comments · Fixed by #173

Comments

@pkgw (Owner) commented Oct 4, 2022

As reported in the discussion of #66, it seems that this program sometimes eats up an amount of memory that is totally out of line with what it should be doing. E.g., "Analyzing a 123MB binary OOMs my 24GB RAM machine."

I don't use this tool myself anymore and unfortunately I'm not in a position to spend time on it, but maybe the proposal in #47 would help with addressing this. The approach taken by the code for this program is about as straightforward as it can get, so I feel like there must be a problem with one of the underlying libraries (or their Rust bindings).

CC @rowanworth @Ristovski @oldherl

@HanabishiRecca (Contributor) commented Oct 21, 2023

I managed to reduce the memory consumption dramatically by splitting the section data into chunks of 1 MiB.

E.g., currently I am not able to analyze an Electron binary of ~158 MiB; the process gets killed by the OOM killer on my 32 GB RAM machine.

Doing something like

for chunk in data.chunks(1024 * 1024) {
    let insns = cs
        .disasm_all(chunk, 0)
        .expect("couldn't disassemble section");
    // ... scan `insns` for instruction groups as before ...
}

reduces memory consumption down to ~500 MiB, and the analysis finishes successfully.

A fairly dumb and straightforward solution.

The only downside I can think of: some instructions may fall on a chunk boundary and be split in half, which means they will be skipped as invalid.
That seems quite unlikely though, and since there is usually more than one instruction of a given kind, I don't see any difference in practice.

@pkgw (Owner) commented Oct 21, 2023

I'm still quite surprised about this memory consumption issue, but if a chunking approach helps, then great!

If you overlap the chunks by a little bit, would that avoid the pitfall of missing instructions that fall on a boundary? I suppose there is then a possibility that an incomplete instruction will be mis-parsed?

Alternatively, maybe there's a way to figure out the byte offset and size of the final decoded instruction, and start the next chunk at a known boundary?
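
Something like this, perhaps (a completely untested sketch, reusing the `data` and `cs` names from the snippet above):

let mut offset = 0;
while offset < data.len() {
    let end = data.len().min(offset + 1024 * 1024);
    let insns = cs
        .disasm_all(&data[offset..end], 0)
        .expect("couldn't disassemble section");
    let Some(last) = insns.iter().last() else {
        break; // nothing decoded at all; bail out
    };
    // ... process `insns` as before ...

    // address() is relative to the 0 we passed to disasm_all, so this
    // is the byte offset just past the last decoded instruction,
    // i.e. a known instruction boundary to restart the next chunk from.
    offset += last.address() as usize + last.bytes().len();
}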

@HanabishiRecca (Contributor)

> I'm still quite surprised about this memory consumption issue

Well, detail mode stores a large amount of information for every instruction. Disabling it reduces memory consumption significantly, but then we obviously lose the ability to identify instruction groups.
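
For reference, detail mode is the capstone-rs setting that makes the group information available in the first place:

// Group info requires detail mode; turning it off is what saves the
// memory, at the cost of insn_detail() no longer being usable.
cs.set_detail(true).expect("couldn't enable detail mode");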

> If you overlap the chunks by a little bit, would that avoid the pitfall of missing instructions that fall on a boundary? I suppose there is then a possibility that an incomplete instruction will be mis-parsed?

Maybe. When shrinking down to 4 KiB chunks, I am able to see some discrepancies.

> Alternatively, maybe there's a way to figure out the byte offset and size of the final decoded instruction, and start the next chunk at a known boundary?

I think it is technically possible.
But in that case I would rather look at cs.disasm_count. I don't know why the capstone-rs API is so impractical, but maybe we can do something with it.

@HanabishiRecca (Contributor) commented Oct 21, 2023

Actually yes, disasm_count seems to work very well. The solution I came up with is

let data = sect.data().expect("couldn't get section data");
let mut offset = 0;

// `cs` is the Capstone handle, `seen_groups` a HashSet of group ids,
// and `describe_group` a lookup helper from the surrounding program.
loop {
    let rest = &data[offset..];
    if rest.is_empty() {
        break;
    }

    // Decode exactly one instruction from the remaining bytes.
    let insns = cs
        .disasm_count(rest, 0, 1)
        .expect("couldn't disassemble section");

    // Capstone stops at a byte it cannot decode, returning zero
    // instructions; skip one byte so `offset` cannot get stuck.
    if insns.is_empty() {
        offset += 1;
        continue;
    }

    for insn in insns.iter() {
        offset += insn.bytes().len();

        let Ok(detail) = cs.insn_detail(insn) else {
            continue;
        };

        // Report each instruction group the first time it is seen.
        for group_code in detail.groups() {
            if seen_groups.insert(group_code.0) {
                if let Some(mnemonic) = insn.mnemonic() {
                    if let Some(desc) = describe_group(&group_code.0) {
                        println!("{} ({})", desc, mnemonic);
                    }
                }
            }
        }
    }
}

Reading instructions one by one works like a charm.
Memory consumption drops even lower than with the chunked approach (< 100 MiB), and it is also significantly faster (~9 secs for one-by-one vs ~28 secs for chunks).

In fact it is generally 3x faster than the current 0.6.1 release on my machine, even for small binaries. disasm_all seems to be inefficient for some reason.

@pkgw (Owner) commented Oct 21, 2023

I am also very surprised that going one-by-one is faster, but I'll take it! Maybe there are arrays that are realloc'ed with inefficient copying around, or something.

It might be a little more helpful to break when disasm_count() returns an error, rather than panicking (something that this program should be better at in general).

@HanabishiRecca (Contributor)

> Maybe there are arrays that are realloc'ed with inefficient copying around, or something.

I guess so, because performance only degrades as the number of instructions requested via disasm_count() increases. That is counterintuitive, since batch operations are usually cheaper.

> It might be a little more helpful to break when disasm_count() returns an error

Maybe, but I don't really know in which situations it can fail. Invalid instructions are simply skipped, as we know.
It is also tricky because in that case we don't know how many bytes to advance offset by. Maybe a simple offset += 1 would suffice, though; I don't really know.
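
Something like this, perhaps (a hypothetical replacement for the expect() in the loop above):

// Bail out of this section on a hard disassembler error instead of
// panicking; bumping `offset` by 1 to resync would be the alternative.
let insns = match cs.disasm_count(rest, 0, 1) {
    Ok(insns) => insns,
    Err(e) => {
        eprintln!("disassembly failed at offset {offset}: {e}");
        break;
    }
};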

@HanabishiRecca (Contributor)

To summarize, I was unable to find an explanation of when disasm_count() can fail. I assume it only happens when the underlying C library cannot allocate memory or something similar, which means we should exit the app in that case anyway.

So I am going to push the PR, as the current implementation seems good according to my testing.

@HanabishiRecca (Contributor)

Btw, there is cs_disasm_iter in the original C library, which does effectively what we want, but it is not implemented by capstone-rs: capstone-rust/capstone-rs#1

It may be worth trying to call it directly via capstone-sys; that could result in another significant performance uplift.

@HanabishiRecca (Contributor) commented Oct 22, 2023

Well, my test shows that cs_disasm_iter is only slightly faster: roughly a 9 sec -> 8 sec improvement for a big binary.
I don't think such a marginal speedup is worth it right now, as it introduces a lot of new complexity and unsafety. (Here you can see how horrendous it is.)

So for now I will go with the current PR.

@pkgw (Owner) commented Oct 22, 2023

Thank you for investigating and documenting your findings!

pkgw closed this as completed in #173 on Oct 22, 2023
@HanabishiRecca (Contributor) commented Oct 22, 2023

Btw, I noticed that simply updating dependencies to the latest versions improves the performance a bit.
So consider doing cargo update before the next release. 👍

@HanabishiRecca (Contributor)

I posted too late 🙃
