Describe the bug
When a dictionary column is written at a given timestamp, it cannot be read back at that same timestamp. This issue is in libtiledbsoma and affects both tiledbsoma-py and tiledbsoma-r.
Background
When we initially create our ArraySchema, the Enumeration values are empty. At the first write, we read what the dictionary values are from the passed-in Arrow column and use ArraySchemaEvolution to add the values by extending the enumeration. Currently we always evolve the schema at the current timestamp as the C++ API does not contain a setter for the timestamp. This means reading an array with enumerations at a given timestamp errors out 'unexpectedly' (see code below).
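The mechanism described above can be sketched with a toy model in plain Python. This is not the TileDB or libtiledbsoma implementation: `ToyArray`, `schema_at`, `read`, `try_read`, and `NOW` are hypothetical names, and the rule "a read at timestamp t sees the latest schema stamped at or before t" is an assumption made for illustration.

```python
from dataclasses import dataclass, field

NOW = 10**12  # stand-in for wall-clock time; the exact value is arbitrary


@dataclass
class ToyArray:
    """Toy stand-in for an array with one enumerated column (hypothetical)."""
    schemas: list = field(default_factory=list)    # (timestamp, enumeration values)
    fragments: list = field(default_factory=list)  # (timestamp, integer codes)

    def schema_at(self, ts):
        # Latest schema version stamped at or before ts.
        visible = [s for s in self.schemas if s[0] <= ts]
        return max(visible, key=lambda s: s[0])[1]

    def read(self, ts):
        values = self.schema_at(ts)
        codes = [c for t, cs in self.fragments if t <= ts for c in cs]
        # Raises IndexError when a code has no enumeration value yet.
        return [values[c] for c in codes]


arr = ToyArray()
arr.schemas.append((1, []))             # created at timestamp 1, empty enumeration
arr.fragments.append((2, [0, 1, 0]))    # codes written at timestamp 2
arr.schemas.append((NOW, ["a", "b"]))   # enumeration extended, but stamped NOW


def try_read(ts):
    try:
        return arr.read(ts)
    except IndexError:
        return "error"
```

In this model, `try_read(NOW)` decodes the codes, while `try_read(2)` finds the codes but only the empty enumeration from timestamp 1, which mirrors the failure reported here.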
To Reproduce
import tiledbsoma as soma
import pyarrow as pa
import pandas as pd
import tempfile

with tempfile.TemporaryDirectory() as uri:
    # Create the DataFrame at timestamp 1 with a dictionary (enumerated) column.
    asch = pa.schema([("foo", pa.dictionary(pa.int8(), pa.large_string()))])
    soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()

    # First write, at timestamp 2.
    pydict = {}
    pydict["soma_joinid"] = [0, 1, 2]
    pydict["foo"] = pd.Series(['a', 'b', 'a'], dtype='category')
    rb = pa.Table.from_pydict(pydict)
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
        sdf.write(rb)

    # Second write, at timestamp 3.
    pydict = {}
    pydict["soma_joinid"] = [3, 4]
    pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
    rb = pa.Table.from_pydict(pydict)
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
        sdf.write(rb)

    # Read back at the current timestamp, then at timestamps 1, 2, and 3.
    with soma.DataFrame.open(uri) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=1) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=2) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=3) as sdf:
        table = sdf.read().concat()
        print(table)
Notice that the values are not read back correctly at the timestamps at which they were written, but they are read back correctly at the current timestamp.
Versions (please complete the following information):
>>> tiledbsoma.show_package_versions()
tiledbsoma.__version__ 1.12.0rc0.post72.dev619176400
TileDB core version (libtiledbsoma) 2.25.0
python version 3.11.0.final.0
OS version Linux 4.19.128-microsoft-standard
Additional context
There is a timestamp setter for the ArraySchemaEvolution in the C API. The drawback of this approach is that it would require mixing C++ and C code. There are also other nuances with setting timestamps that need further discussion.
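To make the effect of choosing the evolution timestamp concrete, here is a minimal toy rule in Python. It is a sketch under an assumption, not TileDB's actual visibility logic: `can_decode`, `NOW`, `current`, and `proposed` are hypothetical names, and the rule assumes that both the written codes and the enumeration values must be stamped at or before the read timestamp.

```python
NOW = 10**12  # stand-in for wall-clock time; the exact value is arbitrary


def can_decode(read_ts, write_ts, evolution_ts):
    """Toy rule: the read must see both the data and the enumeration values."""
    data_visible = write_ts <= read_ts
    enum_visible = evolution_ts <= read_ts
    return data_visible and enum_visible


read_timestamps = (2, 3, NOW)

# Current behavior: the schema evolution is stamped with wall-clock time.
current = [can_decode(t, write_ts=2, evolution_ts=NOW) for t in read_timestamps]

# With a timestamp setter: the evolution is stamped with the write timestamp.
proposed = [can_decode(t, write_ts=2, evolution_ts=2) for t in read_timestamps]
```

Under this rule, stamping the evolution with the write timestamp makes the data decodable at every read timestamp from 2 onward, whereas stamping it with wall-clock time leaves only the current-time read working.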
johnkerl changed the title from "[Bug] Dictionary columns do not read back correctly when passed timestamp" to "[c++] Dictionary columns do not read back correctly when passed timestamp" on Aug 14, 2024