BUG: index_col in read_csv ignores dtype argument #59077

Closed
3 tasks done
seisman opened this issue Jun 24, 2024 · 7 comments · Fixed by #59316

seisman commented Jun 24, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import io
import numpy as np
import pandas as pd

data = io.StringIO("345.5 519.5 0\n519.5 726.5 1\n")


df = pd.read_csv(
    data,
    sep=r"\s+",
    header=None,
    names=["start", "stop", "bin_id"],
    dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
    index_col="bin_id"
)
print(df.index.dtype)

Issue Description

df.index.dtype is int64 with pandas 3.0.0.dev0+1132.ga5e812d86d, even though the dtype argument requests np.uint32 for that column. This is similar to the older issue #9435.

Expected Behavior

Pandas 2.x correctly returns uint32, which is the expected behavior.
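
Until the regression is fixed, one possible stopgap is to cast the index explicitly after reading. This is only a sketch and was not proposed in the thread; it assumes pandas 2.0 or later, where an Index can hold non-64-bit integer dtypes.

```python
import io

import numpy as np
import pandas as pd

# Same two-row sample as in the reproducible example.
data = io.StringIO("345.5 519.5 0\n519.5 726.5 1\n")

df = pd.read_csv(
    data,
    sep=r"\s+",
    header=None,
    names=["start", "stop", "bin_id"],
    dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
    index_col="bin_id",
)

# Assumption: an explicit cast is acceptable as a workaround. It does not
# rely on read_csv applying the dtype mapping to the index column.
df.index = df.index.astype(np.uint32)
print(df.index.dtype)
```

The cast is redundant on versions where read_csv already honors dtype for the index column, so it should be safe to leave in place.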

Installed Versions

```
INSTALLED VERSIONS
------------------
commit : a5e812d
python : 3.12.3.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.11-300.fc40.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Mon May 27 14:53:33 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1132.ga5e812d86d
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.0.0
pip : 24.0
Cython : None
pytest : 8.2.1
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0.dev
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : None
matplotlib : 3.10.0.dev202+gd901275d7c
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2024.5.1.dev6+g12123be8
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
```

seisman added the Bug and Needs Triage labels on Jun 24, 2024

xouyang1 commented Jul 9, 2024

take

xouyang1 commented Jul 9, 2024

This seems to be a behavior change in the default C engine only.
The Python engine also returns int64, which is incorrect, but it already did that in v2.2.2.
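
A quick way to compare the two engines side by side is to run the same read with each `engine` value. This is a minimal sketch; the helper name `index_dtype_by_engine` is hypothetical, and the result depends on the installed pandas version, so no expected output is shown.

```python
import io

import numpy as np
import pandas as pd

# Same two-row sample as in the reproducible example.
CSV_TEXT = "345.5 519.5 0\n519.5 726.5 1\n"


def index_dtype_by_engine(engines=("c", "python")):
    """Return {engine name: dtype of the resulting index} for the sample data."""
    result = {}
    for engine in engines:
        df = pd.read_csv(
            io.StringIO(CSV_TEXT),
            sep=r"\s+",
            header=None,
            names=["start", "stop", "bin_id"],
            dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
            index_col="bin_id",
            engine=engine,
        )
        result[engine] = df.index.dtype
    return result


for engine, dtype in index_dtype_by_engine().items():
    print(f"{engine}: {dtype}")
```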

@MarcoGorelli
Member

thanks for the report - we should probably first do a git bisect to see which commit introduced the bug

running it now: https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=187955574
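
For reference, a bisect check for this bug could look roughly like the sketch below. It is not the script from the linked notebook; the function name `bisect_exit_code` is hypothetical. It follows the `git bisect run` convention that exit code 0 marks a good commit and exit code 1 marks a bad one.

```python
import io

import numpy as np
import pandas as pd


def bisect_exit_code() -> int:
    """Return 0 if the index keeps uint32 (good commit), 1 otherwise (bad commit)."""
    # Note the escaped newlines: the sample needs both rows to parse correctly.
    data = io.StringIO("345.5 519.5 0\n519.5 726.5 1\n")
    df = pd.read_csv(
        data,
        sep=r"\s+",
        header=None,
        names=["start", "stop", "bin_id"],
        dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
        index_col="bin_id",
    )
    return 0 if df.index.dtype == np.uint32 else 1


# In an actual bisect script the last line would typically be:
#     raise SystemExit(bisect_exit_code())
print(bisect_exit_code())
```

A real run would also need to rebuild pandas at each commit before calling the check; that step is omitted here.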

@xouyang1

@MarcoGorelli Hi Marco, I tried rerunning the regression example with just data = io.StringIO("345.5 519.5 0) and it always lands at the latest commit, which is incorrect. Can you help take a look?

@MarcoGorelli
Member

thanks @xouyang1 - I was missing an escape character in the script, and so it wasn't running properly

having tried again, I'm getting #57943 as the commit that introduced it, but I don't know if that seems reasonable, haven't looked closer yet

@xouyang1

> thanks @xouyang1 - I was missing an escape character in the script, and so it wasn't running properly
>
> having tried again, I'm getting #57943 as the commit that introduced it, but I don't know if that seems reasonable, haven't looked closer yet

It looks like the right commit, specifically this change to pandas/core/indexes/base.py ded256d#diff-c34a28314fc8cb12f0d2aa710f1c15f06cdfe3e48f03e658f01f99a43d4f5d09

@MarcoGorelli
Member

> It looks like the right commit, specifically this change to pandas/core/indexes/base.py ded256d#diff-c34a28314fc8cb12f0d2aa710f1c15f06cdfe3e48f03e658f01f99a43d4f5d09

cool, thanks for checking (cc @mroeschke just fyi, no blame 🤗 )

MarcoGorelli removed the Needs Triage label on Jul 25, 2024