
BUG: concat of series of dtype category converting to object dtype (GH8641) #8714

Merged: 1 commit into pandas-dev:master from series_concat, Nov 9, 2014

Conversation

@jreback (Contributor) commented Nov 2, 2014

closes #8641
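For reference, the behavior being fixed can be sketched as follows (written against a modern pandas install; before this change, concatenating two category Series silently came back as object dtype):

```python
import pandas as pd

# Two Series sharing identical categories; the fix makes pd.concat
# preserve the category dtype instead of coercing to object.
s1 = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b", "c"]))
s2 = pd.Series(pd.Categorical(["b", "c"], categories=["a", "b", "c"]))

result = pd.concat([s1, s2], ignore_index=True)
print(result.dtype)  # category
```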

@jreback jreback added Bug Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas labels Nov 2, 2014
@jreback jreback added this to the 0.15.1 milestone Nov 2, 2014
jreback commented Nov 2, 2014

cc @immerrr

maybe you can think of a better way to do this. 'feels' a bit hacky

immerrr commented Nov 4, 2014

It does indeed. Let me have a closer look; I don't have my Emacs at hand right now.

immerrr commented Nov 5, 2014

There are two things that bother me.

The first is that concatenation code has somehow crept back into the Block class. Concatenating Categoricals (with ndarrays or with other Categoricals) seems like a valid scenario regardless of Blocks. It creates those really weird workflows where you need to create Blocks just to concatenate two values (I'm looking at you, DatetimeIndex.hour). It seems simpler to just add proper special-case code to core.common._concat_compat, or to put the special-case concatenation close to the class definitions and do something like:

def concatenate_arrays(arrays):
    for fn in [pd.core.categorical.maybe_concatenate_categoricals,
               pd.core.sparse.array.maybe_concatenate_sparsearrays]:
        retval = fn(arrays)
        if retval is not None:
            return retval
    return np.concatenate([maybe_getvalues(a) for a in arrays])

This particular PR could probably get away with just fixing _concat_compat and leaving the rest of the code intact (if the Series constructor supports Categorical data). And speaking of _concat_compat, it's quite possible that the special-casing of datetimes/timedeltas there is no longer necessary now that the minimum numpy is 1.7+.
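The sketch above can be made concrete with a toy, numpy-only version (the handler name and the Categorical stand-in here are hypothetical, purely to illustrate the dispatch pattern):

```python
import numpy as np

class ToyCategorical:
    """Minimal stand-in for a pandas Categorical."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=object)

def maybe_concatenate_toycategoricals(arrays):
    # Handle the concat only if at least one input is a ToyCategorical;
    # otherwise signal "not mine" by returning None.
    if not any(isinstance(a, ToyCategorical) for a in arrays):
        return None
    parts = [a.values if isinstance(a, ToyCategorical)
             else np.asarray(a, dtype=object) for a in arrays]
    return ToyCategorical(np.concatenate(parts))

def concatenate_arrays(arrays):
    # Try each special-case handler; fall back to plain np.concatenate.
    for fn in [maybe_concatenate_toycategoricals]:
        retval = fn(arrays)
        if retval is not None:
            return retval
    return np.concatenate([np.asarray(a) for a in arrays])

out = concatenate_arrays([ToyCategorical(["a", "b"]), ["b", "c"]])
print(list(out.values))  # ['a', 'b', 'b', 'c']
```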

The other thing is that Series being doubly special strikes back again: not only does its first axis not allow type variation (special case #1), but there is also an implicit axis=1 which allows Series to be stacked horizontally (special case #2). I wonder how much it would take to stop pretending and make Series internally a DataFrame, and whether that would reduce the amount of special-casing. I also wonder if there's a place under the sun for type-varying (aka non-single-block) Series; one use case that immediately comes to mind is that frame.iterrows could finally stop losing type information.

jreback commented Nov 5, 2014

I think your first point is very valid, though (aside from the obvious impl issues) why do you think concatenation should be functional rather than live in Block? Is there a conceptual/fundamental reason, or just the exhibited code complexity?

Your second point, though, confuses me: you want to allow multiple blocks on the same axis? That seems very complex and unwieldy. Maybe you can expound on the perceived benefits.

axis=1 is not a special case of Series but rather a special case of merge: an impl detail that can be done trivially rather than going through the merge code. It is not a function of Series itself.

jreback commented Nov 5, 2014

Can you show the DatetimeIndex.hour issue/example from above?

jreback commented Nov 5, 2014

So you are proposing (sort of in between 1 and 2) to effectively remove SingleBlockManager and make it a pure BlockManager.

I think you could (and probably not lose perf, though you do lose the opportunity for some optimizations; not sure how much that would matter).

Want to create an issue for that?

immerrr commented Nov 5, 2014

The second point is rather a thought experiment and I probably should not have mentioned it here. Let's take it off the table right away and move it to another issue (or perhaps to the mailing list).

immerrr commented Nov 5, 2014

re: datetimeindex:

Provided I have an array of dates and I want their respective day values, the easiest way to do that is via a DatetimeIndex:

In [21]: arr = np.array([datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)])

In [22]: pd.DatetimeIndex(arr).day
Out[22]: array([1, 2, 3], dtype=int32)

I find it weird that I have to create an Index to perform an operation on an array. I know, it's a case of Index fixing Array shortcomings, and I haven't yet found time to fix this.
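Incidentally, there is a numpy-only workaround for the array-level case (a sketch using datetime64 arithmetic, not a pandas API): day-of-month is the date's offset from the start of its month, plus one.

```python
import numpy as np
from datetime import datetime

arr = np.array([datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)])

# Truncate to day and month resolution; subtracting yields the zero-based
# day offset within the month, so add 1 to get the day-of-month.
dates = np.array(arr, dtype="datetime64[D]")
days = (dates - dates.astype("datetime64[M]")).astype(int) + 1
print(days)  # [1 2 3]
```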

immerrr commented Nov 5, 2014

As for whether concatenation should be functional or not, it's really a matter of preference; but as long as one cannot simply do arrs[0].concatenate(arrs[1:]), because arrs[0] may be an ndarray and those don't have a member function for that, I find it easier to write concatenate(arrs) than to add conditionals and loops to find the happy owner of a concatenate method every time I need it.

immerrr commented Nov 5, 2014

And just to be clear: in terms of this particular PR, I think it should work to implement the special-casing for categorical concatenation in _concat_compat and revert most (maybe all) of the other changes.

immerrr commented Nov 5, 2014

And in general, I think that ideally Blocks should only maintain ref_locs and placement consistency with the data; concatenation belongs rather to array-related code.

jreback commented Nov 5, 2014

@immerrr

Series(pd.date_range('20140101',periods=4)).dt.day is new in 0.15.0

I agree with all of your points:

  • going to fix _concat_compat instead (IIRC the datetime special-casing in there was failing when I tried to take it out before; maybe let's have a look)
  • agreed that Blocks are involved with merges but not direct block concatenation.
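The .dt accessor mentioned above, as a runnable example (on any pandas with the 0.15+ API):

```python
import pandas as pd

# The .dt accessor gives array-level datetime fields on a Series
# without constructing a DatetimeIndex by hand.
s = pd.Series(pd.date_range("20140101", periods=4))
print(s.dt.day.tolist())  # [1, 2, 3, 4]
```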

jreback commented Nov 5, 2014

updated and pushed. pls take a look.

    def convert(x):
        if is_categorical_dtype(x.dtype):
            return x.get_values().ravel()
        return x.ravel()
Contributor:

Why does it have to ravel everything here (and on the previous line)?

Contributor Author:

Because Categorical currently expects 1-d input (it isn't actually needed on the categorical line, but the object line does need it).

Contributor Author:

fixed

    elif is_bool_dtype(dtype):
        typ = 'bool'
    if typ is not None:
        typs.add(typ)
Contributor:

I wonder if this code could be merged with the type detection in core.internals.get_empty_dtype_and_na, since the latter can be thought of as a "forecast" of the dtype of an array-like concatenation (special-casing untyped null array-likes).

Contributor Author:

I agree it could. Want to take it? Or maybe push it to another issue.

Contributor:

Let's make it another issue then.

immerrr commented Nov 5, 2014

Looks solid. I mean, true, there are now noticeable logic blocks in concat_compat that could be moved to sub-functions, but I'm not sure that's so bad for readability. Maybe summon a fresh pair of eyes? :)

immerrr commented Nov 5, 2014

Series(pd.date_range('20140101',periods=4)).dt.day is new in 0.15.0

Yeah, that's much better. Seems a tad slower because of all the stuff Series ctor does, but definitely looks more intuitive to me.

jreback commented Nov 7, 2014

@immerrr ok, finally fixed this.

We could dispatch to the sub-types to avoid having some of this code all in common, e.g. Categorical, SparseArray (and maybe in tseries/common.py for timedelta/datetime)?

jreback commented Nov 7, 2014

@immerrr I fixed the impl to use my suggestion above; much cleaner IMHO.

immerrr commented Nov 8, 2014

I fixed the impl to use my suggestion above, much cleaner IMHO.

Much cleaner indeed. Although I'm still a bit worried that this PR makes a few too many collateral changes (e.g. the bool handling) without any tests for them.

jreback commented Nov 8, 2014

OK, going to delay this, though all of the 'special cases' actually seem to be necessary just to get various things to pass.

E.g. the bool and int casting is a special case where np.concatenate does the wrong thing, and np.find_common_type is basically useless.

So we probably need a systematic test of all dtypes in combination (by 2s, 3s), with and without empties.
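The bool/int problem is easy to demonstrate, and the systematic sweep can be sketched with itertools (numpy-only; the dtype list here is illustrative, not pandas' actual test matrix):

```python
import itertools
import numpy as np

# np.concatenate silently coerces bool to int when mixed with ints,
# losing the distinction pandas wants to preserve (pandas uses object).
mixed = np.concatenate([np.array([True, False]), np.array([1, 2])])
print(mixed.dtype.kind, mixed.tolist())  # i [1, 0, 1, 2]

# A systematic sweep over dtype pairs, with and without empties, just
# records what np.concatenate produces for each combination.
dtypes = ["bool", "int64", "float64", "object"]
samples = {dt: np.array([1, 0], dtype=dt) for dt in dtypes}
results = {}
for a, b in itertools.product(dtypes, repeat=2):
    results[(a, b)] = np.concatenate([samples[a], samples[b]]).dtype
    results[(a, b, "empty")] = np.concatenate([samples[a], samples[b][:0]]).dtype
print(results[("bool", "int64")])  # int64: the coercion under test
```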

@jreback jreback modified the milestones: 0.15.1, 0.15.2 Nov 8, 2014
@jreback jreback force-pushed the series_concat branch 3 times, most recently from 15b73e8 to 78f19fd on November 9, 2014 13:10
jreback commented Nov 9, 2014

@immerrr I updated it, hopefully catching all of these empty-empty combo cases. Please have a look.

jreback commented Nov 9, 2014

OK, I think it's fixed up. The only weird one is float + bool, which is a pre-existing condition. But everything works as it does currently (and I think is tested correctly).

lmk

immerrr commented Nov 9, 2014

Added a couple more nitpicks, but generally it looks fine.

jreback added a commit that referenced this pull request Nov 9, 2014
BUG: concat of series of dtype category converting to object dtype (GH8641)
@jreback jreback merged commit 99a555b into pandas-dev:master Nov 9, 2014
Successfully merging this pull request may close these issues.

ERR: concat of 2 Series categories does not preserve category dtype