Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Integration test test_re_replace_all fails with a corner case #9731

Open
ttnghia opened this issue Nov 15, 2023 · 10 comments
Open

[BUG] Integration test test_re_replace_all fails with a corner case #9731

ttnghia opened this issue Nov 15, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@ttnghia
Copy link
Collaborator

ttnghia commented Nov 15, 2023

Due to random seed generation for testing, we uncovered a new corner case that test_re_replace_all fails:

SPARK_RAPIDS_TEST_DATAGEN_SEED=1700077791 ./integration_tests/run_pyspark_from_build.sh -k test_re_replace_all

...
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1055,7 +1055,7 @@
 Row(regexp_replace(a, .*$, PROD, 1)='zm\ngy\nPROD\nPROD')
 Row(regexp_replace(a, .*$, PROD, 1)='lv\nvo\nPROD\nPROD')
 Row(regexp_replace(a, .*$, PROD, 1)='tu\n\nPRODPROD')
-Row(regexp_replace(a, .*$, PROD, 1)='PRODPROD\x85PROD')
+Row(regexp_replace(a, .*$, PROD, 1)='PROD\x85PROD')
@ttnghia ttnghia added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 15, 2023
@abellina abellina removed the ? - Needs Triage Need team to review and classify label Nov 15, 2023
@andygrove andygrove self-assigned this Nov 15, 2023
@andygrove
Copy link
Contributor

I modified the test to show the input that causes this error:

-Row(a='oNÍ[\x87\x01áe>\x85', regexp_replace(a, .*$, PROD, 1)='PRODPROD\x85PROD')
+Row(a='oNÍ[\x87\x01áe>\x85', regexp_replace(a, .*$, PROD, 1)='PROD\x85PROD')

@andygrove
Copy link
Contributor

andygrove commented Nov 20, 2023

Simpler repro:

    .with_special_case('a\x85')
-Row(a='a\x85', regexp_replace(a, .*$, PROD, 1)='PRODPROD\x85PROD')
+Row(a='a\x85', regexp_replace(a, .*$, PROD, 1)='PROD\x85PROD')

Note that a\x84 works fine, so this appears to be specific to certain characters.

@andygrove andygrove added the ? - Needs Triage Need team to review and classify label Dec 19, 2023
@andygrove
Copy link
Contributor

Adding the triage label back now that there is a summary of the issue.

@andygrove andygrove removed their assignment Dec 19, 2023
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 26, 2023
@sameerz
Copy link
Collaborator

sameerz commented Dec 26, 2023

Need to investigate why we got \x85 as an input string, since it is not a valid UTF-8 string.

@revans2
Copy link
Collaborator

revans2 commented Dec 26, 2023

It looks like this might be related to how Spark/Python interprets the string 'a\x85'.

import pyspark.sql.types

df = spark.createDataFrame(SparkContext.getOrCreate().parallelize([("a\x85")]), pyspark.sql.types.StringType())
spark.conf.set("spark.rapids.sql.enabled", False)
df.selectExpr('CAST(value as BINARY)').show()
+----------+
|     value|
+----------+
|[61 C2 85]|
+----------+

[C2 85] is the UTF-8 encoded character that is the same as "\u0085", which is Next Line - NEL. The result, when it is pulled back into Spark converts the [C2 85] back to latin1 as an "\x85" in python.

So it looks like a kind of odd situation where .*$ is behaving differently for NEL compared to \n.

import pyspark.sql.types
df = spark.createDataFrame(SparkContext.getOrCreate().parallelize([("a\x85"), ("a\n")]), pyspark.sql.types.StringType())
spark.conf.set("spark.rapids.sql.enabled", True)
df.selectExpr("CAST(regexp_replace(value, '.*$', 'P', 1) AS BINARY) as p").show(truncate=False)
+-------------+
|p            |
+-------------+
|[50 C2 85 50]|
|[50 50 0A 50]|
+-------------+

spark.conf.set("spark.rapids.sql.enabled", False)
df.selectExpr("CAST(regexp_replace(value, '.*$', 'P', 1) AS BINARY) as p").show(truncate=False)
+----------------+
|p               |
+----------------+
|[50 50 C2 85 50]|
|[50 50 0A 50]   |
+----------------+

@NVnavkumar
Copy link
Collaborator

Note that \u0085 is recognized as a valid line terminator: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029).

So the replace in this case is not actually working properly on the GPU.

Also, Python 3 strings are by default UTF-8, so a\x85 and a\u0085 are equivalent as strings, Python just displays as 'a\x85', so it's not latin1.

@andygrove
Copy link
Contributor

Note that we transpile the pattern .*$ to [^\n\r\u0085\u2028\u2029]*(?:\r|\u0085|\u2028|\u2029|\r\n)?$ before passing to cuDF, so maybe we need to update this to handle NEL. I am investigating.

@andygrove
Copy link
Contributor

cuDF repro:

@Test
void testStringReplaceEdgeCase() {
    TableDebug debug = TableDebug.builder().build();
    RegexProgram target = new RegexProgram(
            "[^\n\r\u0085\uc285\u2028\u2029]*(?:\r|\u0085|\uc285|\u2028|\u2029|\r\n)?$");
    try (ColumnVector input = ColumnVector.fromStrings("a\n", "a\u0085");
         ColumnVector expected = ColumnVector.fromStrings("PRODPROD\nPROD", "PRODPROD\u0085PROD");
         Scalar replace = Scalar.fromString("PROD");
         ColumnVector output = input.replaceRegex(target, replace)) {
        debug.debug("input", input);
        debug.debug("output", output);
        assertColumnsAreEqual(expected, output);
    }
}

Output:

GPU COLUMN input - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f7e86000000, length=5, id=-1} VAL: null
COLUMN input - STRING
0 "a
" 610a
1 "a<85>" 61c285
GPU COLUMN output - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f7e86000a00, length=21, id=-1} VAL: null
COLUMN output - STRING
0 "PRODPROD
PROD" 50524f4450524f440a50524f44
1 "PRODPROD" 50524f4450524f44
HOST PADDING SIZE: 32

@andygrove
Copy link
Contributor

@NVnavkumar based on the test I posted here, I am not sure if this is really a bug in cuDF or not. What do you think?

@andygrove andygrove removed their assignment Apr 1, 2024
@andygrove andygrove added the ? - Needs Triage Need team to review and classify label Apr 1, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 2, 2024
@NVnavkumar
Copy link
Collaborator

Re-evaluate after implementation of #11554

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants