Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100

Merged
merged 2 commits into from
Feb 18, 2024

Conversation

mmuesly
Copy link
Contributor

@mmuesly mmuesly commented Jan 2, 2024

Description

We love the implementation for characterizing tetrades in REE patterns on a dataframe. For performance considerations, we require to wrap the implementation into an apply function on a dataframe. In its current form, it is not possible to compute lambdas only on a single series. I suggest to change the implementation to also support the core computation of lambdas on series objects. As a consequence, it can be used inside an apply function and offers easy parallelization in combination with the pandarallel package.

Moreover, I would love to stop the pyrochem.REE to drop PM if it is already in the dataset. Preparing plots gets more complicated if the already added column is dropped by this column.

Related Issue

#99

Motivation and Context

Apart from solving a performance issue, it also allows to define in a dataset excluded elements for each sample and which samples should be fitted using the LTE effect or not. When integrating lambda fitting into larger pipelines, it is easier to work on a per sample base than splitting the dataset into multiple data frames upfront.

We have many pipelines that use pyrochem.REE for selecting the REE pattern before plotting. We love to maintain the PM column in the pattern, if it is already present in the data set.

Types of changes

  • [ x] New feature (non-breaking change which adds functionality)
  • [ x] Breaking change (fix or feature that would cause existing functionality to change)
    Keeping the PM column if present is a breaking change. Currently, it is always dropped using pyrochem.REE

Checklist:

  • [ x] My code follows the code style of this project.
  • My change requires a change to the documentation.
  • [] I have updated the documentation accordingly. (Eventually, there is some more to add to pyrochem.REE. Not sure.)
  • I have added tests to cover my changes.
  • [ x] All new and existing tests passed.

I am happy to make further changes to this PR.

@coveralls
Copy link

coveralls commented Jan 10, 2024

Pull Request Test Coverage Report for Build 7845529512

Details

  • -5 of 111 (95.5%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 90.478%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pyrolite/util/lambdas/opt.py 53 54 98.15%
pyrolite/util/lambdas/params.py 2 3 66.67%
pyrolite/util/lambdas/helpers.py 13 16 81.25%
Totals Coverage Status
Change from base Build 6859306770: 0.06%
Covered Lines: 6186
Relevant Lines: 6837

💛 - Coveralls

Copy link
Owner

@morganjwilliams morganjwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mmuesly, thanks again for the PR! I've had a look through these change, and added some comments here around the dataframe/series accessors.

For the lambdas side of things, I had a look at a few changes to adapt existing functions to achieve the same result, without adding much code/extra functions:

  • The two helper functions can be condensed into one which has a switch for series/dataframes.
  • For each of the algorithms/methods, I've made a few small modifications which allow them to be used with series, together with the helper function to wrap everything together. This makes it a bit easier to maintain and in most cases gives fewer places in the docs to go chasing information.

If OK with you I could push them to your branch, or otherwise post some of it here and we can discuss/edit after merging?

After this is merged I'll also add a few tests to pass series to these functions, and down the line perhaps modify the pyrolite.util.lambdas.calc_lambdas() function to deal with series too.

@@ -44,8 +44,10 @@ def list_elements(self):
-------
The list will have the same ordering as the source DataFrame.
"""
fltr = self._obj.columns.isin(_common_elements)
return self._obj.columns[fltr].tolist()
if isinstance(self._obj, pd.Series):
Copy link
Owner

@morganjwilliams morganjwilliams Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the modifications in geochem.__init__.py I wonder if there's a simpler way to do this without having if/else branches under each function - perhaps by assigning a primary selection index under pyrochem.__init__() and using it subsequently. There would be more to this for some functions.. but might work well for simple cases. Not yet tested, but something along these lines:

class pyrochem(object):
    def __init__(self, obj):
        """Custom dataframe accessor for pyrolite geochemistry."""
        self._validate(obj)
        self._obj = obj
        if isinstance(self._obj, pd.DataFrame):
            self._selection_index = self._obj.columns
        else: # elif isinstance(self._obj, pd.Series):
            self._selection_index = self._obj.index
...

    def list_oxides(self):
        fltr = self._selection_index.isin(_common_oxides)
        return self._selection_index[fltr].tolist()

    @property
    def oxides(self):
        return self._obj[self.list_oxides]

    @oxides.setter
    def oxides(self, df):
        self._obj[self.list_oxides] = df

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, I changed the PR.

@@ -76,7 +80,16 @@ def list_REE(self):
-------
The returned list will reorder REE based on atomic number.
"""
return [i for i in REE() if i in self._obj.columns]
if isinstance(self._obj, pd.Series):
if "Pm" in self._obj.index:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified slightly to return [i for i in REE(dropPm=("Pm" not in self._obj.index)) if i in self._obj.index] with the suggestion above around having a selection index, for both series and dataframes it would be return [i for i in REE(dropPm=("Pm" not in self._selection_index)) if i in self._selection_index].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, there is a limit to the amount of filter conditions in a list comprehension before readability gets harmed, but I have no strong opinion here. I have changed it in the PR.

@@ -22,3 +22,4 @@ comments, testing, bug reports, or feature requests.
* `Sarah Shi <https://github.com/sarahshi>`__
* `Ondrej Lexa <https://github.com/ondrolexa>`__
* `Bob Myhill <https://github.com/bobmyhill>`__
* `Malte Mues <https://github.com/mmuesly>`__
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding your name here!

Running the lambdas calculation on the series instead of the dataframe allows to embedd it easily into an apply function on the dataframe.
In combination with pandarallel, this allows easy parallelisation of the task on larger datasets.
@mmuesly
Copy link
Contributor Author

mmuesly commented Jan 17, 2024

Thank you for your feedback @morganjwilliams. Sorry for my late response. I got distracted by a paper deadline.

I am happy, to see a smaller change set doing the same result. We need a version that works on Series. It took me some time to understand how the DataFrame implementation might work. Feel free to push your changes.

I might help looking into adapting the test cases.

Comparing the results from the existing code for testing the DataFrame implementation with a version that computes lambdas on each series should be a small modification to add an additional test that covers the important parts of the change.

@mmuesly
Copy link
Contributor Author

mmuesly commented Feb 9, 2024

@morganjwilliams I pushed my proposal for unit tests.

The invocation with apply for the lambda calculation is what is needed to plugin pandarallel later for performance reason.
I am open for feedback and happy to change anything in this PR.

Sadly, I could not fix the docs generation error. I do not understand what the broken code is supposed to do.

@morganjwilliams
Copy link
Owner

@mmuesly thanks for adding the test suggestions also, I'll have another read through and look to get this in (apologies, I've been traveling and also had some recent big deadlines). Unless there's anything specific around the high-level API, any small tweaks to organization etc I might look to just update on develop subsequently. I'll also chase up the docs-generation related error, it looks like a numpy type conversion thing, so should be easily solvable.

@morganjwilliams morganjwilliams merged commit 7f18072 into morganjwilliams:develop Feb 18, 2024
14 of 18 checks passed
@mmuesly mmuesly deleted the smaller_improvements branch February 26, 2024 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants