Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame(::PyPandasDataFrame) converts date & datetime to bytes #293

Open
tomdstone opened this issue Apr 13, 2023 · 18 comments · May be fixed by #509
Open

DataFrame(::PyPandasDataFrame) converts date & datetime to bytes #293

tomdstone opened this issue Apr 13, 2023 · 18 comments · May be fixed by #509

Comments

@tomdstone
Copy link

When doing some work involving dataframes in python via PythonCall, it seems like DataFrame(PyTable(p)) where p is a pandas data table converts the date and datetime columns into byte vectors. Is this issue related to the issue #265 with milliseconds vs microseconds, or due to a missing part of the DataFrame(::PyPandasDataFrame) implementation?

Here are a few minimal examples, in a conda environment with pandas.

using PythonCall
using Dates
using DataFrames

a = DataFrame(x = [now()])  # julia dataframe
b = pytable(a)              # pandas dataframe
c = PyTable(b)              # PyPandasDataFrame
d = DataFrame(c)            # julia dataframe again

This results in:

julia> c
1×1 PyPandasDataFrame
                        x
0 2023-04-13 14:36:13.939

julia> d
1×1 DataFrame
 Row │ x
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc0, 0x62, 0x31, 0x8c, 0x…

The same thing happens when initially defining b as a pandas dataframe, so the microsecond issue in #265 seems to not be the problem?

julia> b = pd.DataFrame([[dt.datetime.now()]])
Python DataFrame:
                           0
0 2023-04-13 14:46:57.940077

julia> c = PyTable(b)
1×1 PyPandasDataFrame       
                           0
0 2023-04-13 14:46:57.940077

julia> d = DataFrame(c)
1×1 DataFrame
 Row │ 0
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc8, 0xf9, 0xa5, 0x7d, 0x…
@tomdstone tomdstone changed the title DataFrame(::PyPandasDataFrame) converts datetime to bytes DataFrame(::PyPandasDataFrame) converts datetime to bytes Apr 13, 2023
@tomdstone tomdstone changed the title DataFrame(::PyPandasDataFrame) converts datetime to bytes DataFrame(::PyPandasDataFrame) converts date & datetime to bytes Apr 13, 2023
@tomdstone
Copy link
Author

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

@cjdoris
Copy link
Collaborator

cjdoris commented Apr 24, 2023

Just looking at this now. It all comes down to the fact that numpy arrays containing numpy.datetime64 cannot be wrapped as a Julia array yet. Therefore it's falling back on converting each element, and because numpy.datetime64 actually implements the buffer protocol with itemtype=1byte, that gets converted to a PyArray{UInt8} as you're seeing.

So there's two things that could be fixed here:

  • Support wrapping arrays containing datetime64
  • Support converting datetime64 to something more useful than PyArray{UInt8}.

@cjdoris
Copy link
Collaborator

cjdoris commented Apr 24, 2023

(PS thanks for the report and the MWE)

@hhaensel
Copy link

hhaensel commented Jun 29, 2023

Just stumbled across that problem and solved it this way:

function pytimestamp_to_datetime(t::PyArray)
    DateTime(1970) + Second(reinterpret(Int64, t)[1] / 1000000000)
end

function pytimestamp_to_datetime(v::AbstractVector{<:PyArray})
    pytimestamp_to_datetime.(v)
end

EDIT: This only holds for datetime64[ns], otherwise the factor is different. I just filed a tentative PR for a general solution

@hhaensel
Copy link

hhaensel commented Jun 30, 2023

@cjdoris
If I eliminate .values in line 61 of PyPandasDataFrame everything runs smoothly. Would that be an option or is there some performance issue or other issue that I am not aware of?

UPDATE: That doesn't work, columns will be vectors of Py objects

@hhaensel
Copy link

hhaensel commented Jul 1, 2023

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

@tomdstone
Could you check if also your use case where you had difficulties with time is working with the PR in place?
Otherwise, please share an MWE.
I could try to cover your use case as well.

@hhaensel
Copy link

hhaensel commented Jul 2, 2023

Just added support for timedelta and timedelta64 conversion to Dates.CompoundPeriod.

@tomdstone
Copy link
Author

@hhaensel sorry, I won't have time to implement it due to how work is right now (we are in the middle of a move). I might follow up in a few months, but I definitely wouldn't hold your breath waiting for me. I appreciate the thought though!

@hhaensel
Copy link

hhaensel commented Jul 3, 2023

Thanks for the reply. No need to hurry. I just filed the PR as I was in need of a solution for my use case.

@hhaensel
Copy link

hhaensel commented Jul 4, 2023

@cjdoris
I added automatic conversion from Periods/CompoundPeriods to numpy:timedelta64 as this supports nanoseconds as well. When I use pytimedeltatype, I can only go down to microseconds. Is that a good decision? It seems that pandas can convert the timedelta64 to a timestamp.timedelta that handles nanoseconds correctly.
For conversion from timedelta to Julia I used Dates.CompoundPeriod in order to be type stable. Otherwise one could use Nanosecond, but that seemed a bit weired to me.
I know you're working for version 1 to come out soon. I hope my proposal fits in there somehow.

@hhaensel
Copy link

hhaensel commented Jul 4, 2023

Just want to share what's working now with the PR

julia> x = Py(Second(1))
Python: numpy.timedelta64(1,'s')

julia> pyconvert(Second, x)
1 second

julia> x = Py(Second(1) + Nanosecond(1))
Python: numpy.timedelta64(1000000001,'ns')

julia> y = pyconvert(Any, x)
1 second, 1 nanosecond

julia> typeof(y)
Dates.CompoundPeriod


julia> jdf = DataFrame(x = [now() + Second(rand(1:1000)) for _ in 1:100], y = [Second(n) for n in 1:100]);
julia> pdf = pytable(jdf)
Python:
                         x               y
0  2023-07-04 11:29:27.781 0 days 00:00:01
1  2023-07-04 11:30:02.781 0 days 00:00:02
2  2023-07-04 11:40:17.781 0 days 00:00:03
3  2023-07-04 11:31:11.781 0 days 00:00:04
           ... 5 more lines ...
98 2023-07-04 11:37:21.781 0 days 00:01:39
99 2023-07-04 11:35:53.781 0 days 00:01:40

[100 rows x 2 columns]

julia> jdf2 = DataFrame(PyTable(pdf))
100×2 DataFrame
 Row │ x                        y
     │ DateTime                 Compound   
─────┼──────────────────────────────────────
   12023-07-04T11:29:27.781  1 second
   22023-07-04T11:30:02.781  2 seconds
   32023-07-04T11:40:17.781  3 seconds
                    
  992023-07-04T11:37:21.781  99 seconds
 1002023-07-04T11:35:53.781  100 seconds
                             95 rows omitted

@github-actions
Copy link
Contributor

This issue has been marked as stale because it has been open for 30 days with no activity. If the issue is still relevant then please leave a comment, or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues about to be auto-closed label Aug 22, 2023
@hhaensel
Copy link

@cjdoris I think this is still relevant

@github-actions github-actions bot removed the stale Issues about to be auto-closed label Aug 23, 2023
@klwlevy
Copy link

klwlevy commented Aug 28, 2023

@cjdoris I think this is still relevant

I agree. Just stumbled on this when using pandas functionality to convert (legacy) xls data to xlsx. The dates became PyArray{UInt8}.

@hhaensel
Copy link

I'd be ready to adapt the my PR to the latest changes of PythonCall. Should I continue to work on it? Is the refactoring complete or should I still wait a bit?

@cjdoris
Copy link
Collaborator

cjdoris commented Jun 13, 2024

Yep the refactor is done. Can you clarify what the PR will change?

@hhaensel
Copy link

I think I summarised everything quite nicely above (#293 (comment))

The related PR is #334, where you commented that you are refactoring.

@hhaensel
Copy link

@cjdoris After refactoring I decided that it is easier to submit a new PR #509

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants