[c++] Dictionary columns do not read back correctly when passed timestamp #2879

Open
nguyenv opened this issue Aug 12, 2024 · 4 comments

nguyenv commented Aug 12, 2024

Describe the bug
When a dictionary column is written at a given timestamp, it cannot be read back correctly at that same timestamp. The issue is in libtiledbsoma and affects both tiledbsoma-py and tiledbsoma-r.

Background
When we initially create the ArraySchema, the Enumeration values are empty. At the first write, we read the dictionary values from the passed-in Arrow column and use ArraySchemaEvolution to add them by extending the enumeration. Currently we always evolve the schema at the current timestamp, as the C++ API does not contain a setter for the timestamp. This means that reading an array with enumerations at a given timestamp gives 'unexpected' results (see the reproduction below).
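The mechanism described above can be illustrated with a toy, pure-Python model. This is only a sketch under stated assumptions: `ToyArray`, its methods, and the `NOW` value are hypothetical names invented for illustration and are not part of TileDB or TileDB-SOMA. The model assumes a read at timestamp T sees only schema versions and data stamped at or before T.

```python
class ToyArray:
    """Toy stand-in (not TileDB) for an array with timestamped schema versions and data."""

    def __init__(self):
        self.schema_versions = []  # (timestamp, enumeration values)
        self.fragments = []        # (timestamp, dictionary indices)

    def evolve_schema(self, timestamp, enumeration_values):
        # Record a schema version that becomes visible from `timestamp` onward.
        self.schema_versions.append((timestamp, list(enumeration_values)))

    def write(self, timestamp, indices):
        # Record data that becomes visible from `timestamp` onward.
        self.fragments.append((timestamp, list(indices)))

    def read(self, timestamp):
        # Use the latest schema version stamped at or before `timestamp`.
        visible = [sv for sv in self.schema_versions if sv[0] <= timestamp]
        dictionary = max(visible, key=lambda sv: sv[0])[1] if visible else []
        # Collect indices from all fragments stamped at or before `timestamp`.
        indices = []
        for ts, frag in sorted(self.fragments, key=lambda f: f[0]):
            if ts <= timestamp:
                indices.extend(frag)
        return dictionary, indices


NOW = 10**12  # hypothetical "current" wall-clock timestamp, far later than 2

arr = ToyArray()
arr.evolve_schema(1, [])            # array created at timestamp 1, empty enumeration
arr.evolve_schema(NOW, ["a", "b"])  # evolution stamped "now" instead of the write timestamp
arr.write(2, [0, 1, 0])             # data written at timestamp 2

print(arr.read(2))    # dictionary is empty, indices are present
print(arr.read(NOW))  # dictionary and indices are both present
```

In this model the data written at timestamp 2 is visible to a read at timestamp 2, but the enumeration values are not, because the schema version carrying them is stamped later.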

To Reproduce

import tiledbsoma as soma
import pyarrow as pa
import pandas as pd
import tempfile

with tempfile.TemporaryDirectory() as uri:
    asch = pa.schema([("foo", pa.dictionary(pa.int8(), pa.large_string()))])
    soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()

    pydict = {}
    pydict["soma_joinid"] = [0, 1, 2]
    pydict["foo"] = pd.Series(['a', 'b', 'a'], dtype='category')
    rb = pa.Table.from_pydict(pydict)

    with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
        sdf.write(rb)

    pydict = {}
    pydict["soma_joinid"] = [3, 4]
    pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
    rb = pa.Table.from_pydict(pydict)
                              
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
        sdf.write(rb)

    with soma.DataFrame.open(uri) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=1) as sdf:
        table = sdf.read().concat()
        print(table)
        
    with soma.DataFrame.open(uri, tiledb_timestamp=2) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=3) as sdf:
        table = sdf.read().concat()
        print(table)

Notice that the values are not read back correctly at the timestamps at which they were written, whereas the read at the current timestamp looks fine.

pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[]]
foo: [  -- dictionary:
[]  -- indices:
[]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2]]
foo: [  -- dictionary:
[]  -- indices:
[0,1,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
[]  -- indices:
[0,1,0,0,0]]
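The symptom in the output above (indices that point into an empty dictionary) can be flagged programmatically. The helper below is a minimal sketch on plain Python lists; `undecodable_positions` is a hypothetical name, not a TileDB-SOMA or pyarrow function. With pyarrow, the inputs could presumably be taken from each chunk of the column via `chunk.dictionary.to_pylist()` and `chunk.indices.to_pylist()`.

```python
def undecodable_positions(dictionary, indices):
    """Return the positions whose index has no matching entry in the dictionary."""
    return [
        pos
        for pos, idx in enumerate(indices)
        # None represents a null cell, which needs no dictionary entry.
        if idx is not None and not 0 <= idx < len(dictionary)
    ]


# Values copied from the output above.
at_timestamp_2 = undecodable_positions([], [0, 1, 0])
at_current = undecodable_positions(["a", "b"], [0, 1, 0, 0, 0])

print(at_timestamp_2)  # every position is undecodable
print(at_current)      # no position is undecodable
```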

Versions (please complete the following information):

>>> tiledbsoma.show_package_versions()
tiledbsoma.__version__              1.12.0rc0.post72.dev619176400
TileDB core version (libtiledbsoma) 2.25.0
python version                      3.11.0.final.0
OS version                          Linux 4.19.128-microsoft-standard

Additional context
There is a timestamp setter for ArraySchemaEvolution in the C API. One issue with this approach is that it would require mixing C++ and C code. There are also other nuances with setting timestamps that need further discussion.

@johnkerl

> There is a timestamp setter for the ArraySchemaEvolution in the C API. Issues with this approach are the fact we would require mixing C++ and C code. There are also other nuances with setting timestamps that need further discussion.

This also exists in the C++ API:
https://github.com/TileDB-Inc/TileDB/blob/2.25.0/tiledb/sm/cpp_api/array_schema_evolution.h#L248-L256

See also #2897 which uses this.
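To illustrate why stamping the schema evolution with the write's own timestamp would be expected to help, here is a toy pure-Python sketch. It is not TileDB code: `extend_enumeration` and `enumeration_at` are hypothetical helpers, and the model simply assumes that a read at timestamp T sees the latest schema version stamped at or before T.

```python
def extend_enumeration(versions, new_values, timestamp):
    """Return versions plus a new schema version at `timestamp` that adds unseen values."""
    current = max(versions, key=lambda v: v[0])[1] if versions else []
    merged = current + [v for v in new_values if v not in current]
    return versions + [(timestamp, merged)]


def enumeration_at(versions, timestamp):
    """Return the enumeration values of the latest version stamped at or before `timestamp`."""
    visible = [v for v in versions if v[0] <= timestamp]
    return max(visible, key=lambda v: v[0])[1] if visible else []


# Array created at timestamp 1 with an empty enumeration.
versions = [(1, [])]
# Each evolution is stamped with the timestamp of the write that triggered it.
versions = extend_enumeration(versions, ["a", "b"], timestamp=2)
versions = extend_enumeration(versions, ["b"], timestamp=3)

for ts in (1, 2, 3):
    print(ts, enumeration_at(versions, ts))
```

Under this assumption, reads at timestamps 2 and 3 both see the values added by the write at timestamp 2, while a read at timestamp 1 still sees the empty enumeration.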

johnkerl changed the title from "[Bug] Dictionary columns do not read back correctly when passed timestamp" to "[c++] Dictionary columns do not read back correctly when passed timestamp" on Aug 14, 2024
@johnkerl

#2895 is merged.

Please feel free to re-open if, for some reason I've missed, I've closed this in error.

@johnkerl

[sc-52956]

johnkerl reopened this on Aug 26, 2024
@johnkerl

Re-opening until #2920 is resolved
