BUG: groupby then resample on column gives incorrect results if the index is out of order #59350

Open
2 of 3 tasks
knowecho opened this issue Jul 30, 2024 · 2 comments · May be fixed by #59408

knowecho commented Jul 30, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(dict(
    datetime=[pd.to_datetime('2024-07-30T00:00Z'), pd.to_datetime('2024-07-30T00:01Z')],
    group=['A', 'A'],
    value=[100, 200],
), index=[1, 0])

df.groupby('group').resample('1min', on='datetime').aggregate(dict(value='sum'))

Issue Description

The example above gives the following incorrect output:

                                 value
group datetime                        
A     2024-07-30 00:00:00+00:00    200
      2024-07-30 00:01:00+00:00    100

Expected Behavior

The correct output is:

                                 value
group datetime                        
A     2024-07-30 00:00:00+00:00    100
      2024-07-30 00:01:00+00:00    200
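
As an index-independent cross-check of the expected values, the sums can be computed without resample by flooring each timestamp to the bin width and grouping on the result. This is only a sketch: the `bucket` column name is an assumption made for illustration, and unlike resample this approach does not emit rows for empty bins.

```python
import pandas as pd

df = pd.DataFrame(
    dict(
        datetime=pd.to_datetime(["2024-07-30T00:00Z", "2024-07-30T00:01Z"]),
        group=["A", "A"],
        value=[100, 200],
    ),
    index=[1, 0],  # deliberately out of order, as in the report
)

# Floor each timestamp to its 1-minute bin; "bucket" is a hypothetical column name.
bucketed = df.assign(bucket=df["datetime"].dt.floor("1min"))

# A plain groupby groups rows by column values, so the row index labels play no role.
reference = bucketed.groupby(["group", "bucket"])["value"].sum()
print(reference)
```

If the groupby-resample result disagrees with `reference` on the bins they share, the index is leaking into the resampling.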

The correct output can be obtained in either of two ways: reset the index, or use the datetime column as the index.

df.reset_index().groupby('group').resample('1min', on='datetime').aggregate(dict(value='sum'))
df.set_index('datetime').groupby('group').resample('1min').aggregate(dict(value='sum'))

It seems the out-of-order index ([1, 0] instead of [0, 1]) affects the resampling, even though the index should be ignored when the on keyword argument is passed to resample. This may be related to #35275, where the index also appears to affect resampling on a column (in that case raising an IndexError if an index value is not less than the length of the data frame).
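
Until this is fixed, one way to guard against the problem is to wrap the first workaround in a small helper that discards the existing index before resampling. The sketch below is an illustration only: `resample_sum_by_group` and `make_frame` are hypothetical helpers (not pandas APIs), and the helper uses `reset_index(drop=True)` rather than the plain `reset_index()` shown above, on the assumption that the original index values are not needed.

```python
import pandas as pd


def resample_sum_by_group(df, group_col, time_col, value_col, freq):
    """Hypothetical helper: sum value_col per group and time bin, ignoring df's index."""
    # Replace whatever index the caller passed with a fresh 0..n-1 RangeIndex,
    # so resample(on=...) sees index labels that match row positions.
    clean = df.reset_index(drop=True)
    return (
        clean.groupby(group_col)
        .resample(freq, on=time_col)
        .aggregate({value_col: "sum"})
    )


def make_frame(index):
    # Same data as the report, with a caller-chosen index.
    return pd.DataFrame(
        dict(
            datetime=pd.to_datetime(["2024-07-30T00:00Z", "2024-07-30T00:01Z"]),
            group=["A", "A"],
            value=[100, 200],
        ),
        index=index,
    )


shuffled = resample_sum_by_group(make_frame([1, 0]), "group", "datetime", "value", "1min")
ordered = resample_sum_by_group(make_frame([0, 1]), "group", "datetime", "value", "1min")
print(shuffled)
```

With the index discarded first, the out-of-order and in-order frames should give the same result.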

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.9.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-284.11.1.el9_2.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Tue May 9 17:09:15 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.3.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.2
hypothesis : 6.108.2
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.1
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

knowecho added the Bug and Needs Triage labels Jul 30, 2024
@aram-cinnamon
Contributor

take

@Amit-0905

To solve this issue, please modify the DatetimeIndexResampler class in resample.py, located at "pandas/pandas/core/resample.py".

Here is the corrected code for DatetimeIndexResampler:
```python
class DatetimeIndexResampler(Resampler):
    ax: DatetimeIndex

    @property
    def _resampler_for_grouping(self) -> type[DatetimeIndexResamplerGroupby]:
        return DatetimeIndexResamplerGroupby

    def _get_resampler_for_grouping(self, groupby, how, fill_method, limit, kind):
        """
        Return a resampler for groupby object.
        """
        self.groupby = groupby
        self._on = getattr(groupby, 'on', None)
        if self._on is not None:
            groupby._selected_obj = groupby._selected_obj.reset_index(drop=True).set_index(self._on)
        return self._get_resampler(how, fill_method, limit, kind)

    def _get_resampler(self, how, fill_method, limit, kind):
        """
        Return a resampler for non-groupby object.
        """
        if self._on is not None:
            self._selected_obj = self._selected_obj.reset_index(drop=True).set_index(self._on)
        return super()._get_resampler(how, fill_method, limit, kind)

    # Existing methods...

    def _downsample(self, how, **kwargs):
        """
        Downsample the cython defined function.

        Parameters
        ----------
        how : string / cython mapped function
        **kwargs : kw args passed to how function
        """
        ax = self.ax

        # Excludes `on` column when provided
        obj = self._obj_with_exclusions

        if not len(ax):
            # reset to the new freq
            obj = obj.copy()
            obj.index = obj.index._with_freq(self.freq)
            assert obj.index.freq == self.freq, (obj.index.freq, self.freq)
            return obj

        # do we have a regular frequency

        # error: Item "None" of "Optional[Any]" has no attribute "binlabels"
        if (
            (ax.freq is not None or ax.inferred_freq is not None)
            and len(self._grouper.binlabels) > len(ax)
            and how is None
        ):
            # let's do an asfreq
            return self.asfreq()

        # we are downsampling
        # we want to call the actual grouper method here
        result = obj.groupby(self._grouper).aggregate(how, **kwargs)
        return self._wrap_result(result)

    def _adjust_binner_for_upsample(self, binner):
        """
        Adjust our binner when upsampling.

        The range of a new index should not be outside specified range
        """
        if self.closed == "right":
            binner = binner[1:]
        else:
            binner = binner[:-1]
        return binner

    def _upsample(self, method, limit: int | None = None, fill_value=None):
        """
        Parameters
        ----------
        method : string {'backfill', 'bfill', 'pad',
            'ffill', 'asfreq'} method for upsampling
        limit : int, default None
            Maximum size gap to fill when reindexing
        fill_value : scalar, default None
            Value to use for missing values
        """
        if self._from_selection:
            raise ValueError(
                "Upsampling from level= or on= selection "
                "is not supported, use .set_index(...) "
                "to explicitly set index to datetime-like"
            )

        ax = self.ax
        obj = self._selected_obj
        binner = self.binner
        res_index = self._adjust_binner_for_upsample(binner)

        # if we have the same frequency as our axis, then we are equal sampling
        if (
            limit is None
            and to_offset(ax.inferred_freq) == self.freq
            and len(obj) == len(res_index)
        ):
            result = obj.copy()
            result.index = res_index
        else:
            if method == "asfreq":
                method = None
            result = obj.reindex(
                res_index, method=method, limit=limit, fill_value=fill_value
            )

        return self._wrap_result(result)

    def _wrap_result(self, result):
        result = super()._wrap_result(result)

        # we may have a different kind that we were asked originally
        # convert if needed
        if isinstance(self.ax, PeriodIndex) and not isinstance(
            result.index, PeriodIndex
        ):
            if isinstance(result.index, MultiIndex):
                # GH 24103 - e.g. groupby resample
                if not isinstance(result.index.levels[-1], PeriodIndex):
                    new_level = result.index.levels[-1].to_period(self.freq)
                    result.index = result.index.set_levels(new_level, level=-1)
            else:
                result.index = result.index.to_period(self.freq)
        return result
```

Amit-0905 added a commit to Amit-0905/pandas that referenced this issue Jul 30, 2024
Update to solve :-

BUG: groupby then resample on column gives incorrect results if the index is out of order pandas-dev#59350
@Amit-0905 Amit-0905 mentioned this issue Jul 30, 2024
5 tasks
rhshadrach added the Resample and Groupby labels and removed the Needs Triage label Jul 31, 2024