Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added an example within the documentation for custom readers supporting pandas DataFrames. #707

Merged
merged 25 commits into from
Mar 8, 2023

Conversation

BenjaminFraser
Copy link
Contributor

@BenjaminFraser BenjaminFraser commented Aug 31, 2022

Added a new example (Custom_Pandas_Dataloader.py) within the documentation in docs/examples for the definition of custom Readers that support pandas DataFrames.

This allows a wide range of data formats supported by pandas to be taken advantage of for Ground Truth Readers and Detection Readers, without the need manually define custom data ingestion processes for each type, e.g. JSON, XML, Parquet, HDF5, .txt, .zip.

Given its similarity to the requirements of the custom reader documentation example (#354), I've linked this pull request to that, which hopefully is not a problem.

These classes do have the disadvantage of requiring the entire dataset in memory. However, it seems that the ability to directly use pandas DataFrames is a feature several users of Stonesoup have shown interest in, which is understandable given the flexibility and processing functionalities this can provide.

The example in Custom_Pandas_Dataloader.py includes the definitions of DataFrameGroundTruthReader and DataFrameDetectionReader classes. Each of these inherit from the existing GroundTruthReader class, along with a custom defined _DataFrameReader class.

These classes operate similarly to the existing CSVGroundTruthReader and CSVDetectionReader classes, except they take as input a pandas DataFrame already read into memory, rather than a path to .csv file. They also have modified generator functions for producing the time and paths / detections.

These have been useful for some work I've done using Stonesoup for some UAV-based non-cooperative radar research, and so hopefully they are also of value to other members of the community!

Update on progression and fixes to aspects of this PR, as of 22 Oct 22:

  1. Added pandas to dev of setup.cfg.
  2. Updated references to Stone Soup to be consistent - two words throughout documentation.
  3. Added demonstration of ground truth reader after initialisation by outputting first iteration to docs.
  4. Added support for fields already in DateTime format.
  5. Added pandas_reader.py within reader directory with the three new classes: _DataFrameReader, DataFrameGroundTruthReader, and DataFrameDetectionReader. Tests are still to be developed for these (hence failing on the draft commits currently).
  6. Added tests in test_pandas_reader.py within stonesoup/reader/tests.

A point noted with the tests is that there is currently full coverage of all classes defined in pandas_reader.py, however Codecov flags the pandas import check (which raises an import error if pandas is not installed) as failed.

To-do / enhancements:

  1. Take advantage of pandas grouping to make code more efficient (as suggested by Steven below).
  2. Link documentation example to the classes defined within pandas_reader.py, using something such as inspect.getsource.

… definition of custom Readers that support pandas DataFrames. This has the benefit of being able to take advantage of the range of data formats that pandas supports for ground truth and detection data being read into stonesoup. The example includes the custom definition of a DataFrameGroundTruthReader and DataFrameDetectionReader class. Both of these inherit from the GroundTruthReader class, along with a custom defined _DataFrameReader class. Each of these classes supports reading of pandas dataframes that are already read into memory, in a similar way to the CSVGroundTruthReader and CSVDetectionReader [issue 354].
@BenjaminFraser BenjaminFraser requested a review from a team as a code owner August 31, 2022 11:15
@BenjaminFraser BenjaminFraser requested review from sdhiscocks and orosoman-dstl and removed request for a team August 31, 2022 11:15
@sdhiscocks
Copy link
Member

Thanks for the contribution @BenjaminFraser.

I see docs are failing to build due to pandas being missing dependency. If you could add pandas the dev dependencies in setup.py that should resolve it:

Stone-Soup/setup.py

Lines 31 to 35 in 435883a

extras_require={
'dev': [
'pytest-flake8', 'pytest-cov', 'pytest-remotedata', 'flake8<5',
'Sphinx', 'sphinx_rtd_theme', 'sphinx-gallery>=0.10.1', 'pillow', 'folium', 'plotly',
],

It'd be good to have the readers in the main code base (probably with an optional dependency on pandas) so users can easily access them. And also good to keep the example you've created as both a how to use them, but also, in reference to #354, to show how to create custom readers. (Minor issue of if they are modified, we'll have to be sure to update in both places, unless in the example could do something with inspect.getsource)

@sdhiscocks
Copy link
Member

(Minor issue of if they are modified, we'll have to be sure to update in both places, unless in the example could do something with inspect.getsource)

Or use of Sphinx literalinclude directive, which can add some syntax highlighting.

@BenjaminFraser
Copy link
Contributor Author

That's no problem at all, and including the Readers within the main code base sounds like a good idea! The only sticking point was including it with pandas as an optional dependency, but I'll look into that, which should hopefully be straightforward enough.

I'll take a look later when I have the chance and put together another PR for those points!

@sdhiscocks
Copy link
Member

The only sticking point was including it with pandas as an optional dependency, but I'll look into that, which should hopefully be straightforward enough.

We've done this before by simply raising an error on importing of dependencies.

try:
import requests
from requests.compat import urljoin
except ImportError as error:
raise ImportError(
"Usage of opensky requires the dependency 'requests' is installed. ") from error

@codecov
Copy link

codecov bot commented Aug 31, 2022

Codecov Report

Base: 94.81% // Head: 94.84% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (af4c77b) compared to base (f27eaeb).
Patch coverage: 97.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #707      +/-   ##
==========================================
+ Coverage   94.81%   94.84%   +0.02%     
==========================================
  Files         169      170       +1     
  Lines        8221     8296      +75     
  Branches     1216     1230      +14     
==========================================
+ Hits         7795     7868      +73     
- Misses        316      318       +2     
  Partials      110      110              
Flag Coverage Δ
integration 68.50% <0.00%> (-0.63%) ⬇️
unittests 92.69% <97.33%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
stonesoup/reader/pandas_reader.py 97.33% <97.33%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

docs/examples/Custom_Pandas_Dataloader.py Outdated Show resolved Hide resolved
groundtruth_dict = {}
updated_paths = set()
previous_time = None
for row in self.dataframe.to_dict(orient="records"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, but wondering if you could take advantage of pandas to group by time field (and path_id field), such that you can simplify the logic below (i.e. no need for the previous_time code)?

Copy link
Contributor Author

@BenjaminFraser BenjaminFraser Oct 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure we could do this easily enough, although I've not included it in the current commit yet. Before doing this, I could do with confirming the precise functionality to avoid accidentally changing the current generator logic.

If we simply order by path_id and time, and then iteratively yield each time and updated_paths (detections for DataFrameDetectionReader), is that doing exactly the same functionality as the current groundtruth_paths_gen (and detections_gen)?

docs/examples/Custom_Pandas_Dataloader.py Show resolved Hide resolved
docs/examples/Custom_Pandas_Dataloader.py Outdated Show resolved Hide resolved
def detections_gen(self):
detections = set()
previous_time = None
for row in self.dataframe.to_dict(orient="records"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could possibly also group by time field here.

…ord, for consistency with the remainder of all Stone Soup documentation.
…DataFrameReader, as suggested by sdhiscocks.
…reader.py'. This includes a new _DataFrameReader class, which inherits from Reader and allows a Pandas DataFrame to be read. Two new classes are also developed for reading ground-truth data from pandas (DataFrameGroundTruthReader) and reading detections from pandas (DataFrameDetectionReader). Each of these new classes inherit from DetectionReader and _DataFrameReader, and yield outputs in the same way as the basic GroundTruthReader (yields previous time and updated_paths) and DetectionReader (yields previous time and detections) classes within Stone Soup. Tests are still to be generated for these classes.
…ns within both Custom_Pandas_Dataloader.py (documents example), and reader/pandas_reader.py.
…hin both Custom_Pandas_Dataloader.py (documents example), and reader/pandas_reader.py.
…ory. Tests now check both ground-truth reader and detection reader functionality.
…ectory. Now includes seperate tests for DataFrameGroundTruthReader and DataFrameDetectionReader.
… pandas_reader.py. Developed the same tests as defined within test_generic.py.
…st_pandas_reader.py to check for case where pandas column contains string formatted datetimes, rather than being Timestamp (already formatted in pandas before creating reader).
@BenjaminFraser BenjaminFraser marked this pull request as ready for review October 23, 2022 16:35
@PACarniglia PACarniglia self-requested a review March 7, 2023 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants