Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer encoding regression. #1024

Closed
ChALkeR opened this issue Mar 2, 2015 · 10 comments
Closed

Buffer encoding regression. #1024

ChALkeR opened this issue Mar 2, 2015 · 10 comments
Assignees
Labels
buffer Issues and PRs related to the buffer subsystem.

Comments

@ChALkeR
Copy link
Member

ChALkeR commented Mar 2, 2015

var s0 = 'тест ';
s0 += new Array(1031913 + 1 - s0.length).join('Z');
var s1 = new Buffer(s0, 'ucs2').toString('ucs2');
var s2 = new Buffer(s1, 'utf-8').toString('utf-8');
console.log(s0 === s2);
console.log(s0.substr(0, 200));
console.log(s2.substr(0, 200));

This works under node 0.10 (s0 equals to s2, the example logs 2 equal strings), but breaks in iojs 1.2.0 and 1.4.2 (s0 is not equal to s2, the second logged string is not correct).

When broken, s2.substr(0, 200) somewhy equals to new Buffer(s0, 'ucs2').toString('utf-8').substr(0, 200).

1031913 is the critical length, non-latin strings with length greater or equal to 1031913 trigger the bug, strings with length equal or less than 1031912 do not trigger the bug.

@Fishrock123 Fishrock123 added the buffer Issues and PRs related to the buffer subsystem. label Mar 2, 2015
@Fishrock123
Copy link
Contributor

cc @trevnorris?

@ChALkeR
Copy link
Member Author

ChALkeR commented Mar 2, 2015

Reproducible in nodejs 0.12 with different critical string length: 515957, which is exactly 1/2 of the iojs critical string length. Using nodejs 0.12.0-3 from Archlinux x86_64 packages.

I mean that the max safe length in iojs is 1031912, the max safe length is nodejs 0.12 is 515956, 515956 * 2 == 1031912.

@trevnorris
Copy link
Contributor

Must have to do with the use of externalized strings. I'll investigate.

@trevnorris trevnorris self-assigned this Mar 3, 2015
@bnoordhuis
Copy link
Member

The tipping point is exactly EXTERN_APEX bytes so yes, must be something to do with externalized (two-byte?) strings. I'll take a look as well, I already spotted at least one (possibly unrelated) bug.

@bnoordhuis
Copy link
Member

Okay, here is a simplified test case:

var assert = require('assert');

var a = Array(1031914).join('x');
var b = Buffer(a, 'ucs2').toString('ucs2');
var c = Buffer(b, 'utf8').toString('utf8');

assert.equal(a.length, b.length);
assert.equal(b.length, c.length);
assert.equal(a, b);
assert.equal(b, c);

And when you run it, the last assert fails like this, making it quite obvious what's going on.

$ out/Release/iojs tmp/issue1024-2.js

assert.js:87
  throw new assert.AssertionError({
        ^
AssertionError: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx == 'x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x\u0000x
    at Object.<anonymous> (/home/bnoordhuis/src/v1.x/tmp/issue1024-2.js:10:8)
    at Module._compile (module.js:444:26)
    at Object.Module._extensions..js (module.js:462:10)
    at Module.load (module.js:339:32)
    at Function.Module._load (module.js:294:12)
    at Function.Module.runMain (module.js:485:10)
    at startup (node.js:112:16)
    at node.js:863:3

The call to Buffer(b, 'utf8') creates a buffer that looks like 78 00 78 00 78 00 .. but only when b is an externalized two-byte string.

@bnoordhuis
Copy link
Member

Found it: https://github.com/iojs/io.js/blob/6433ad1/src/string_bytes.cc#L305-312

StringBytes::Write() does a plain memcpy() when is_extern is true but that's wrong when the source is two-byte and the destination is one-byte or UTF-8 (and probably the other way around as well.)

@vkurchatkin
Copy link
Contributor

related nodejs/node-v0.x-archive#8683 ?

@bnoordhuis
Copy link
Member

@vkurchatkin Yes, looks to be the same issue. I'll try to come up with a patch later today.

bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Mar 5, 2015
StringBytes::Write() did a plain memcpy() when is_extern is true but
that's wrong when the source is a two-byte string and the destination
a one-byte or UTF-8 string.

The impact is limited to strings > 1,031,913 bytes because those are
normally the only strings that are externalized, although the use of
the 'externalize strings' extension (--expose_externalize_string) can
also trigger it.

This commit also cleans up the bytes versus characters confusion in
StringBytes::Write() because that was closely intertwined with the
UCS-2 encoding regression.  One wasn't fixable without the other.

Fixes: nodejs#1024
Fixes: nodejs/node-v0.x-archive#8683
PR-URL: nodejs#1042
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
@bnoordhuis
Copy link
Member

Fixed in 1640ded.

@ChALkeR
Copy link
Member Author

ChALkeR commented Mar 5, 2015

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem.
Projects
None yet
Development

No branches or pull requests

5 participants