BUG: groupby then resample on column gives incorrect results if the index is out of order #59350

knowecho · 2024-07-30T08:17:07Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(dict(
    datetime=[pd.to_datetime('2024-07-30T00:00Z'), pd.to_datetime('2024-07-30T00:01Z')],
    group=['A', 'A'],
    value=[100, 200],
), index=[1, 0])

df.groupby('group').resample('1min', on='datetime').aggregate(dict(value='sum'))

Issue Description

The example above gives the following incorrect output:

                                 value
group datetime                        
A     2024-07-30 00:00:00+00:00    200
      2024-07-30 00:01:00+00:00    100

Expected Behavior

The correct output is:

                                 value
group datetime                        
A     2024-07-30 00:00:00+00:00    100
      2024-07-30 00:01:00+00:00    200

The correct output can be obtained in either of two ways: reset the index, or set the datetime column as the index.

df.reset_index().groupby('group').resample('1min', on='datetime').aggregate(dict(value='sum'))
df.set_index('datetime').groupby('group').resample('1min').aggregate(dict(value='sum'))
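
As a minimal sketch of the first workaround, the reset can be wrapped in a small helper. The function name `resample_sum_by_group` and its parameters are hypothetical and not part of pandas; it uses `reset_index(drop=True)` rather than the plain `reset_index()` shown above, on the assumption that the old index labels are not needed afterwards.

```python
import pandas as pd


def resample_sum_by_group(df, group_col, time_col, value_col, freq):
    """Hypothetical helper: group, resample on a column, and sum one value column."""
    # Discard the original index so that row labels match row positions.
    clean = df.reset_index(drop=True)
    return (
        clean.groupby(group_col)
        .resample(freq, on=time_col)
        .aggregate({value_col: "sum"})
    )


# The frame from the report, with its out-of-order index [1, 0].
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2024-07-30T00:00Z", "2024-07-30T00:01Z"]),
        "group": ["A", "A"],
        "value": [100, 200],
    },
    index=[1, 0],
)

result = resample_sum_by_group(df, "group", "datetime", "value", "1min")
print(result)
```

Because the helper never exposes the original index to `resample`, it should not depend on whether the installed pandas version is affected by this bug.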

It seems the out-of-order index ([1, 0] instead of [0, 1]) affects the resampling, even though the index should be ignored when the on keyword argument is passed to resample. This may be related to #35275, where the index also appears to affect resampling on a column (in that case raising an IndexError if an index value is not less than the length of the data frame).
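
One way to check the expectation that the index is ignored is to compare the result computed on the frame as given with the result computed after discarding its index. The sketch below does this; `resample_ignores_index` is a hypothetical name, and the default column names are simply those from the example above.

```python
import pandas as pd


def resample_ignores_index(df, group_col="group", time_col="datetime",
                           value_col="value", freq="1min"):
    """Hypothetical check: True if the aggregated values do not depend on df's index."""

    def run(frame):
        out = (
            frame.groupby(group_col)
            .resample(freq, on=time_col)
            .aggregate({value_col: "sum"})
        )
        return out[value_col].tolist()

    # Same data, once with the original index and once with a fresh RangeIndex.
    return run(df) == run(df.reset_index(drop=True))


in_order = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2024-07-30T00:00Z", "2024-07-30T00:01Z"]),
        "group": ["A", "A"],
        "value": [100, 200],
    }
)
# Same rows, relabelled with the out-of-order index from the report.
out_of_order = in_order.set_axis([1, 0], axis=0)

print(resample_ignores_index(in_order))
# According to this report, the next check fails on pandas 2.2.2.
print(resample_ignores_index(out_of_order))
```

A check of this shape could serve as the basis for a regression test, since it makes no assumption about which output is correct, only that the index should not change it.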

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.9.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-284.11.1.el9_2.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Tue May 9 17:09:15 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.3.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.2
hypothesis : 6.108.2
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.1
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

aram-cinnamon · 2024-07-30T10:50:07Z

take

Amit-0905 · 2024-07-30T12:31:30Z

To solve this issue, please modify the DatetimeIndexResampler class in resample.py, located at "pandas/pandas/core/resample.py".

Here is the corrected code for the DatetimeIndexResampler class:
`class DatetimeIndexResampler(Resampler):
    ax: DatetimeIndex

    @property
    def _resampler_for_grouping(self) -> type[DatetimeIndexResamplerGroupby]:
        return DatetimeIndexResamplerGroupby

    def _get_resampler_for_grouping(self, groupby, how, fill_method, limit, kind):
        """
        Return a resampler for groupby object.
        """
        self.groupby = groupby
        self._on = getattr(groupby, 'on', None)
        if self._on is not None:
            groupby._selected_obj = groupby._selected_obj.reset_index(drop=True).set_index(self._on)
        return self._get_resampler(how, fill_method, limit, kind)

    def _get_resampler(self, how, fill_method, limit, kind):
        """
        Return a resampler for non-groupby object.
        """
        if self._on is not None:
            self._selected_obj = self._selected_obj.reset_index(drop=True).set_index(self._on)
        return super()._get_resampler(how, fill_method, limit, kind)

    # Existing methods...

    def _downsample(self, how, **kwargs):
        """
        Downsample the cython defined function.

        Parameters
        ----------
        how : string / cython mapped function
        **kwargs : kw args passed to how function
        """
        ax = self.ax

        # Excludes `on` column when provided
        obj = self._obj_with_exclusions

        if not len(ax):
            # reset to the new freq
            obj = obj.copy()
            obj.index = obj.index._with_freq(self.freq)
            assert obj.index.freq == self.freq, (obj.index.freq, self.freq)
            return obj

        # do we have a regular frequency

        # error: Item "None" of "Optional[Any]" has no attribute "binlabels"
        if (
            (ax.freq is not None or ax.inferred_freq is not None)
            and len(self._grouper.binlabels) > len(ax)
            and how is None
        ):
            # let's do an asfreq
            return self.asfreq()

        # we are downsampling
        # we want to call the actual grouper method here
        result = obj.groupby(self._grouper).aggregate(how, **kwargs)
        return self._wrap_result(result)

    def _adjust_binner_for_upsample(self, binner):
        """
        Adjust our binner when upsampling.

        The range of a new index should not be outside specified range
        """
        if self.closed == "right":
            binner = binner[1:]
        else:
            binner = binner[:-1]
        return binner

    def _upsample(self, method, limit: int | None = None, fill_value=None):
        """
        Parameters
        ----------
        method : string {'backfill', 'bfill', 'pad',
            'ffill', 'asfreq'} method for upsampling
        limit : int, default None
            Maximum size gap to fill when reindexing
        fill_value : scalar, default None
            Value to use for missing values
        """
        if self._from_selection:
            raise ValueError(
                "Upsampling from level= or on= selection "
                "is not supported, use .set_index(...) "
                "to explicitly set index to datetime-like"
            )

        ax = self.ax
        obj = self._selected_obj
        binner = self.binner
        res_index = self._adjust_binner_for_upsample(binner)

        # if we have the same frequency as our axis, then we are equal sampling
        if (
            limit is None
            and to_offset(ax.inferred_freq) == self.freq
            and len(obj) == len(res_index)
        ):
            result = obj.copy()
            result.index = res_index
        else:
            if method == "asfreq":
                method = None
            result = obj.reindex(
                res_index, method=method, limit=limit, fill_value=fill_value
            )

        return self._wrap_result(result)

    def _wrap_result(self, result):
        result = super()._wrap_result(result)

        # we may have a different kind that we were asked originally
        # convert if needed
        if isinstance(self.ax, PeriodIndex) and not isinstance(
            result.index, PeriodIndex
        ):
            if isinstance(result.index, MultiIndex):
                # GH 24103 - e.g. groupby resample
                if not isinstance(result.index.levels[-1], PeriodIndex):
                    new_level = result.index.levels[-1].to_period(self.freq)
                    result.index = result.index.set_levels(new_level, level=-1)
            else:
                result.index = result.index.to_period(self.freq)
        return result

`

Update to solve :- BUG: groupby then resample on column gives incorrect results if the index is out of order pandas-dev#59350

knowecho added the Bug and Needs Triage labels Jul 30, 2024

github-actions bot assigned aram-cinnamon Jul 30, 2024

Amit-0905 added a commit to Amit-0905/pandas that referenced this issue Jul 30, 2024

Update resample.py

01d3810

Update to solve :- BUG: groupby then resample on column gives incorrect results if the index is out of order pandas-dev#59350

Amit-0905 mentioned this issue Jul 30, 2024

Update resample.py #59353

Closed

5 tasks

rhshadrach added the Resample and Groupby labels and removed the Needs Triage label Jul 31, 2024

aram-cinnamon linked a pull request Aug 4, 2024 that will close this issue

BUG: groupby then resample on column gives incorrect results if the index is out of order #59408

Open

5 tasks
