Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100

mmuesly · 2024-01-02T07:55:24Z

Description

We love the implementation for characterizing tetrades in REE patterns on a dataframe. For performance considerations, we require to wrap the implementation into an apply function on a dataframe. In its current form, it is not possible to compute lambdas only on a single series. I suggest to change the implementation to also support the core computation of lambdas on series objects. As a consequence, it can be used inside an apply function and offers easy parallelization in combination with the pandarallel package.

Moreover, I would love to stop the pyrochem.REE to drop PM if it is already in the dataset. Preparing plots gets more complicated if the already added column is dropped by this column.

Related Issue

#99

Motivation and Context

Apart from solving a performance issue, it also allows to define in a dataset excluded elements for each sample and which samples should be fitted using the LTE effect or not. When integrating lambda fitting into larger pipelines, it is easier to work on a per sample base than splitting the dataset into multiple data frames upfront.

We have many pipelines that use pyrochem.REE for selecting the REE pattern before plotting. We love to maintain the PM column in the pattern, if it is already present in the data set.

Types of changes

[ x] New feature (non-breaking change which adds functionality)
[ x] Breaking change (fix or feature that would cause existing functionality to change)
Keeping the PM column if present is a breaking change. Currently, it is always dropped using pyrochem.REE

Checklist:

[ x] My code follows the code style of this project.
My change requires a change to the documentation.
[] I have updated the documentation accordingly. (Eventually, there is some more to add to pyrochem.REE. Not sure.)
I have added tests to cover my changes.
[ x] All new and existing tests passed.

I am happy to make further changes to this PR.

coveralls · 2024-01-10T00:16:51Z

Pull Request Test Coverage Report for Build 7845529512

Details

-5 of 111 (95.5%) changed or added relevant lines in 5 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.06%) to 90.478%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pyrolite/util/lambdas/opt.py	53	54	98.15%
pyrolite/util/lambdas/params.py	2	3	66.67%
pyrolite/util/lambdas/helpers.py	13	16	81.25%

Totals
Change from base Build 6859306770:	0.06%
Covered Lines:	6186
Relevant Lines:	6837

💛 - Coveralls

morganjwilliams

Hey @mmuesly, thanks again for the PR! I've had a look through these change, and added some comments here around the dataframe/series accessors.

For the lambdas side of things, I had a look at a few changes to adapt existing functions to achieve the same result, without adding much code/extra functions:

The two helper functions can be condensed into one which has a switch for series/dataframes.
For each of the algorithms/methods, I've made a few small modifications which allow them to be used with series, together with the helper function to wrap everything together. This makes it a bit easier to maintain and in most cases gives fewer places in the docs to go chasing information.

If OK with you I could push them to your branch, or otherwise post some of it here and we can discuss/edit after merging?

After this is merged I'll also add a few tests to pass series to these functions, and down the line perhaps modify the pyrolite.util.lambdas.calc_lambdas() function to deal with series too.

morganjwilliams · 2024-01-12T00:11:08Z

pyrolite/geochem/__init__.py

@@ -44,8 +44,10 @@ def list_elements(self):
        -------
        The list will have the same ordering as the source DataFrame.
        """
-        fltr = self._obj.columns.isin(_common_elements)
-        return self._obj.columns[fltr].tolist()
+        if isinstance(self._obj, pd.Series):


For the modifications in geochem.__init__.py I wonder if there's a simpler way to do this without having if/else branches under each function - perhaps by assigning a primary selection index under pyrochem.__init__() and using it subsequently. There would be more to this for some functions.. but might work well for simple cases. Not yet tested, but something along these lines:

class pyrochem(object): def __init__(self, obj): """Custom dataframe accessor for pyrolite geochemistry.""" self._validate(obj) self._obj = obj if isinstance(self._obj, pd.DataFrame): self._selection_index = self._obj.columns else: # elif isinstance(self._obj, pd.Series): self._selection_index = self._obj.index ... def list_oxides(self): fltr = self._selection_index.isin(_common_oxides) return self._selection_index[fltr].tolist() @property def oxides(self): return self._obj[self.list_oxides] @oxides.setter def oxides(self, df): self._obj[self.list_oxides] = df

Fair point, I changed the PR.

morganjwilliams · 2024-01-12T00:14:48Z

pyrolite/geochem/__init__.py

@@ -76,7 +80,16 @@ def list_REE(self):
        -------
        The returned list will reorder REE based on atomic number.
        """
-        return [i for i in REE() if i in self._obj.columns]
+        if isinstance(self._obj, pd.Series):
+            if "Pm" in self._obj.index:


This could be simplified slightly to return [i for i in REE(dropPm=("Pm" not in self._obj.index)) if i in self._obj.index] with the suggestion above around having a selection index, for both series and dataframes it would be return [i for i in REE(dropPm=("Pm" not in self._selection_index)) if i in self._selection_index].

IMHO, there is a limit to the amount of filter conditions in a list comprehension before readability gets harmed, but I have no strong opinion here. I have changed it in the PR.

morganjwilliams · 2024-01-12T01:32:22Z

docs/source/dev/contributors.rst

@@ -22,3 +22,4 @@ comments, testing, bug reports, or feature requests.
 * `Sarah Shi <https://github.com/sarahshi>`__
 * `Ondrej Lexa <https://github.com/ondrolexa>`__
 * `Bob Myhill <https://github.com/bobmyhill>`__
+* `Malte Mues <https://github.com/mmuesly>`__


Thanks for adding your name here!

Running the lambdas calculation on the series instead of the dataframe allows to embedd it easily into an apply function on the dataframe. In combination with pandarallel, this allows easy parallelisation of the task on larger datasets.

mmuesly · 2024-01-17T15:17:25Z

Thank you for your feedback @morganjwilliams. Sorry for my late response. I got distracted by a paper deadline.

I am happy, to see a smaller change set doing the same result. We need a version that works on Series. It took me some time to understand how the DataFrame implementation might work. Feel free to push your changes.

I might help looking into adapting the test cases.

Comparing the results from the existing code for testing the DataFrame implementation with a version that computes lambdas on each series should be a small modification to add an additional test that covers the important parts of the change.

mmuesly · 2024-02-09T14:45:05Z

@morganjwilliams I pushed my proposal for unit tests.

The invocation with apply for the lambda calculation is what is needed to plugin pandarallel later for performance reason.
I am open for feedback and happy to change anything in this PR.

Sadly, I could not fix the docs generation error. I do not understand what the broken code is supposed to do.

morganjwilliams · 2024-02-16T00:06:46Z

@mmuesly thanks for adding the test suggestions also, I'll have another read through and look to get this in (apologies, I've been traveling and also had some recent big deadlines). Unless there's anything specific around the high-level API, any small tweaks to organization etc I might look to just update on develop subsequently. I'll also chase up the docs-generation related error, it looks like a numpy type conversion thing, so should be easily solvable.

morganjwilliams self-requested a review January 10, 2024 00:12

morganjwilliams reviewed Jan 12, 2024

View reviewed changes

mmuesly force-pushed the smaller_improvements branch from 330dceb to b4a8a6b Compare January 17, 2024 15:08

Add proposal for test

602013a

morganjwilliams merged commit 7f18072 into morganjwilliams:develop Feb 18, 2024
14 of 18 checks passed

mmuesly deleted the smaller_improvements branch February 26, 2024 18:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100

Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100

mmuesly commented Jan 2, 2024

coveralls commented Jan 10, 2024 •

edited

Loading

morganjwilliams left a comment

morganjwilliams Jan 12, 2024 •

edited

Loading

mmuesly Jan 17, 2024

morganjwilliams Jan 12, 2024

mmuesly Jan 17, 2024

morganjwilliams Jan 12, 2024

mmuesly commented Jan 17, 2024

mmuesly commented Feb 9, 2024

morganjwilliams commented Feb 16, 2024

Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100

Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100

Conversation

mmuesly commented Jan 2, 2024

Description

Related Issue

Motivation and Context

Types of changes

Checklist:

coveralls commented Jan 10, 2024 • edited Loading

Pull Request Test Coverage Report for Build 7845529512

Details

💛 - Coveralls

morganjwilliams left a comment

Choose a reason for hiding this comment

morganjwilliams Jan 12, 2024 • edited Loading

Choose a reason for hiding this comment

mmuesly Jan 17, 2024

Choose a reason for hiding this comment

morganjwilliams Jan 12, 2024

Choose a reason for hiding this comment

mmuesly Jan 17, 2024

Choose a reason for hiding this comment

morganjwilliams Jan 12, 2024

Choose a reason for hiding this comment

mmuesly commented Jan 17, 2024

mmuesly commented Feb 9, 2024

morganjwilliams commented Feb 16, 2024

coveralls commented Jan 10, 2024 •

edited

Loading

morganjwilliams Jan 12, 2024 •

edited

Loading