
BUG: concat of series of dtype category converting to object dtype (GH8641) #8714

Merged: 1 commit into pandas-dev:master from series_concat, Nov 9, 2014

Conversation

@jreback (Contributor) commented Nov 2, 2014

closes #8641
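For reference, the behavior being fixed can be sketched as follows (written against a modern pandas install; before this change, concatenating two category Series silently came back as object dtype):

```python
import pandas as pd

# Two Series sharing identical categories; the fix makes pd.concat
# preserve the category dtype instead of coercing to object.
s1 = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b", "c"]))
s2 = pd.Series(pd.Categorical(["b", "c"], categories=["a", "b", "c"]))

result = pd.concat([s1, s2], ignore_index=True)
print(result.dtype)  # category
```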

@jreback jreback added Bug Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas labels Nov 2, 2014
@jreback jreback added this to the 0.15.1 milestone Nov 2, 2014
jreback commented Nov 2, 2014

cc @immerrr

maybe you can think of a better way to do this. 'feels' a bit hacky

immerrr commented Nov 4, 2014

It does indeed. Let me have a closer look; I don't have my Emacs at hand right now.

immerrr commented Nov 5, 2014

There are two things that bother me.

The first is that concatenation code has somehow crept back into the Block class. Concatenating Categoricals (with ndarrays or with other Categoricals) seems like a valid scenario regardless of Blocks. It creates those really weird workflows where you need to create Blocks just to concatenate two values (I'm looking at you, DatetimeIndex.hour). It seems simpler to just add proper special-case code to core.common._concat_compat, or to put the special-case concatenation close to the class definitions and do something like:

def concatenate_arrays(arrays):
    for fn in [pd.core.categorical.maybe_concatenate_categoricals,
               pd.core.sparse.array.maybe_concatenate_sparsearrays]:
        retval = fn(arrays)
        if retval is not None:
            return retval
    return np.concatenate([maybe_getvalues(a) for a in arrays])

This particular PR could probably get away with just fixing _concat_compat and leaving the rest of the code intact (if the Series constructor supports Categorical data). And speaking of _concat_compat, it's quite possible that the special-casing of datetimes/timedeltas there is no longer necessary now that the minimum numpy is 1.7+.
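The sketch above can be made concrete with a toy, numpy-only version (the handler name and the Categorical stand-in here are hypothetical, purely to illustrate the dispatch pattern):

```python
import numpy as np

class ToyCategorical:
    """Minimal stand-in for a pandas Categorical."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=object)

def maybe_concatenate_toycategoricals(arrays):
    # Handle the concat only if at least one input is a ToyCategorical;
    # otherwise signal "not mine" by returning None.
    if not any(isinstance(a, ToyCategorical) for a in arrays):
        return None
    parts = [a.values if isinstance(a, ToyCategorical)
             else np.asarray(a, dtype=object) for a in arrays]
    return ToyCategorical(np.concatenate(parts))

def concatenate_arrays(arrays):
    # Try each special-case handler; fall back to plain np.concatenate.
    for fn in [maybe_concatenate_toycategoricals]:
        retval = fn(arrays)
        if retval is not None:
            return retval
    return np.concatenate([np.asarray(a) for a in arrays])

out = concatenate_arrays([ToyCategorical(["a", "b"]), ["b", "c"]])
print(list(out.values))  # ['a', 'b', 'b', 'c']
```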

The other thing is that Series being doubly special strikes back again: not only does its first axis not allow type variation (special case #1), but there is also an implicit axis=1 which allows Series to be stacked horizontally (special case #2). I wonder how much it would take to stop pretending and make Series internally a DataFrame, and whether that would reduce the amount of special-casing. I also wonder if there's a place under the sun for type-varying (aka non-single-block) Series; one use case that immediately comes to mind is that frame.iterrows could finally stop losing type information.

jreback commented Nov 5, 2014

I think your first point is very valid, though (aside from the obvious impl issues) why do you think concatenation should be functional rather than live in Block? Is there a conceptual/fundamental reason, or just the exhibited code complexity?

Your second point, though, confuses me: you want to allow multiple blocks on the same axis? That seems very complex and unwieldy. Maybe you can expound on the perceived benefits.

axis=1 is not a special case of Series but rather a special case of merge: an impl detail that can be done trivially rather than going through the merge code. It is not a function of Series itself.

jreback commented Nov 5, 2014

Can you show the DatetimeIndex.hour issue/example from above?

jreback commented Nov 5, 2014

So you are proposing (sort of in between 1 and 2) to effectively remove SingleBlockManager and make it a pure BlockManager.

I think you could (and probably not lose perf, though you do lose the opportunity for some optimizations; not sure how much that would matter).

Want to create an issue for that?

immerrr commented Nov 5, 2014

The second point is rather a thought experiment and I probably should not have mentioned it here. Let's take it off the table right away and move it to another issue (or perhaps to the mailing list).

immerrr commented Nov 5, 2014

re: datetimeindex:

Provided I have an array of dates and I want their respective day values, the easiest way to do that is via a DatetimeIndex:

In [21]: arr = np.array([datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)])

In [22]: pd.DatetimeIndex(arr).day
Out[22]: array([1, 2, 3], dtype=int32)

I find it weird that I have to create an Index to perform an operation on an array. I know, it's a case of Index fixing Array shortcomings, and I haven't yet found time to fix this.
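Incidentally, there is a numpy-only workaround for the array-level case (a sketch using datetime64 arithmetic, not a pandas API): day-of-month is the date's offset from the start of its month, plus one.

```python
import numpy as np
from datetime import datetime

arr = np.array([datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)])

# Truncate to day and month resolution; subtracting yields the zero-based
# day offset within the month, so add 1 to get the day-of-month.
dates = np.array(arr, dtype="datetime64[D]")
days = (dates - dates.astype("datetime64[M]")).astype(int) + 1
print(days)  # [1 2 3]
```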

immerrr commented Nov 5, 2014

As for whether concatenation should be functional or not, it's really a matter of preference; but as long as one cannot simply do arrs[0].concatenate(arrs[1:]), because arrs[0] may be an ndarray and those don't have a member function for that, I find it easier to write concatenate(arrs) than to add conditionals and loops to find the happy owner of a concatenate method every time I need it.

immerrr commented Nov 5, 2014

And just to be clear: in terms of this particular PR, I think it should work to implement the special-casing for categorical concatenation in _concat_compat and revert most (maybe all) of the other changes.

immerrr commented Nov 5, 2014

And in general, I think that ideally Blocks should only maintain ref_locs and placement consistency with the data; concatenation belongs rather to array-related code.

jreback commented Nov 5, 2014

@immerrr

Series(pd.date_range('20140101',periods=4)).dt.day is new in 0.15.0

I agree with all of your points:

  • going to fix _concat_compat instead (IIRC the datetime special-casing in there was failing when I tried to take it out before; maybe let's have a look)
  • agreed that Blocks are involved with merges but not direct block concatenation.
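The .dt accessor mentioned above, as a runnable example (on any pandas with the 0.15+ API):

```python
import pandas as pd

# The .dt accessor gives array-level datetime fields on a Series
# without constructing a DatetimeIndex by hand.
s = pd.Series(pd.date_range("20140101", periods=4))
print(s.dt.day.tolist())  # [1, 2, 3, 4]
```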

jreback commented Nov 5, 2014

updated and pushed. pls take a look.

    def convert(x):
        if is_categorical_dtype(x.dtype):
            return x.get_values().ravel()
        return x.ravel()
Contributor:

Why does it have to ravel everything here (and on the previous line)?

Contributor Author:

Because Categorical currently expects 1-d input (it isn't actually needed on the categorical line, but the object line does need it).

Contributor Author:

fixed

    elif is_bool_dtype(dtype):
        typ = 'bool'
    if typ is not None:
        typs.add(typ)
Contributor:

I wonder if this code could be merged with the type detection in core.internals.get_empty_dtype_and_na, since the latter can be thought of as a "forecast" of the dtype of an array-like concatenation (special-casing untyped null array-likes).

Contributor Author:

I agree it could. Want to take it? Or maybe push it to another issue.

Contributor:

Let's make it another issue then.

immerrr commented Nov 5, 2014

Looks solid. I mean, true, there are now noticeable logic blocks in concat_compat that could be moved to sub-functions, but I'm not sure that's so bad for readability. Maybe summon a fresh pair of eyes? :)

immerrr commented Nov 5, 2014

Series(pd.date_range('20140101',periods=4)).dt.day is new in 0.15.0

Yeah, that's much better. Seems a tad slower because of all the stuff Series ctor does, but definitely looks more intuitive to me.

jreback commented Nov 7, 2014

@immerrr ok, finally fixed this.

We could dispatch to the sub-types to avoid having some of this code all in common, e.g. Categorical, SparseArray (and maybe in tseries/common.py for timedelta/datetime)?

jreback commented Nov 7, 2014

@immerrr I fixed the impl to use my suggestion above; much cleaner IMHO.

immerrr commented Nov 8, 2014

I fixed the impl to use my suggestion above, much cleaner IMHO.

Much cleaner indeed. Although I'm still a bit worried that this PR makes a few too many collateral changes (e.g. the bool handling) without any tests for them.

jreback commented Nov 8, 2014

OK, going to delay this, though all of the 'special cases' actually seem to be necessary just to get various things to pass.

E.g. the bool and int casting is a special case where np.concatenate does the wrong thing, and np.find_common_type is basically useless.

So we probably need a systematic test of all dtypes in combination (by 2s, 3s), with and without empties.
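The bool/int problem is easy to demonstrate, and the systematic sweep can be sketched with itertools (numpy-only; the dtype list here is illustrative, not pandas' actual test matrix):

```python
import itertools
import numpy as np

# np.concatenate silently coerces bool to int when mixed with ints,
# losing the distinction pandas wants to preserve (pandas uses object).
mixed = np.concatenate([np.array([True, False]), np.array([1, 2])])
print(mixed.dtype.kind, mixed.tolist())  # i [1, 0, 1, 2]

# A systematic sweep over dtype pairs, with and without empties, just
# records what np.concatenate produces for each combination.
dtypes = ["bool", "int64", "float64", "object"]
samples = {dt: np.array([1, 0], dtype=dt) for dt in dtypes}
results = {}
for a, b in itertools.product(dtypes, repeat=2):
    results[(a, b)] = np.concatenate([samples[a], samples[b]]).dtype
    results[(a, b, "empty")] = np.concatenate([samples[a], samples[b][:0]]).dtype
print(results[("bool", "int64")])  # int64: the coercion under test
```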

@jreback jreback modified the milestones: 0.15.1, 0.15.2 Nov 8, 2014
@jreback jreback force-pushed the series_concat branch 3 times, most recently from 15b73e8 to 78f19fd on November 9, 2014 13:10
jreback commented Nov 9, 2014

@immerrr I updated it, hopefully catching all of these empty-empty combo cases. Please have a look.

jreback commented Nov 9, 2014

OK, I think it's fixed up. The only weird one is float + bool, which is a pre-existing condition. But everything works as it does currently (and I think is tested correctly).

lmk

immerrr commented Nov 9, 2014

Added a couple more nitpicks, but generally it looks fine.

jreback added a commit that referenced this pull request Nov 9, 2014
BUG: concat of series of dtype category converting to object dtype (GH8641)
@jreback jreback merged commit 99a555b into pandas-dev:master Nov 9, 2014
Successfully merging this pull request may close these issues.

ERR: concat of 2 Series categories does not preserve category dtype