Replace KDTree library with alternative #383

Closed
kain88-de opened this issue Aug 5, 2015 · 33 comments · Fixed by #395

Comments

@kain88-de
Member

Is there something against switching the KDTree library?

Currently MDAnalysis uses a pretty old C implementation. It would, for example, be possible
to use the scikit-learn KDTree.

@hainm
Contributor

hainm commented Aug 5, 2015

what's wrong with 'old'?

@richardjgowers
Member

There's also scipy.spatial.KDTree, which wouldn't introduce a new dependency.

So I guess the things to prove would be:

  • benchmarks of 250,000 3d coordinates, both creating the Tree and querying it
  • checking all methods have a counterpart (same functionality)

The other thing is that all we currently use the KDTree for is distance-based selections, I think... and the downside is that it doesn't handle PBC. So a better solution might be to introduce a capped distance search method (using domain decomposition into cells). I was playing around with making one last month: https://github.com/richardjgowers/cellgrid

This would also be awesome for lots of common stuff in very large systems
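
For illustration, a toy sketch of such a cell-based capped distance search (my own sketch, not the cellgrid code; the function name capped_distance_pairs is made up here, and PBC is ignored): points are binned into cubic cells of edge length r_max, so only pairs in the same or neighbouring cells ever get a distance check.

import itertools
import numpy as np

def capped_distance_pairs(coords, r_max):
    """Return all index pairs (i, j) with distance <= r_max (no PBC)."""
    # bin every point into a cubic cell of edge length r_max
    cell_ids = np.floor(coords / r_max).astype(int)
    cells = {}
    for idx, cid in enumerate(map(tuple, cell_ids)):
        cells.setdefault(cid, []).append(idx)

    pairs = []
    # only the 27 surrounding cells can contain partners within r_max
    for cid, members in cells.items():
        for offset in itertools.product((-1, 0, 1), repeat=3):
            neighbour = tuple(c + o for c, o in zip(cid, offset))
            for i in members:
                for j in cells.get(neighbour, ()):
                    if j <= i:  # count every pair only once
                        continue
                    if np.linalg.norm(coords[i] - coords[j]) <= r_max:
                        pairs.append((i, j))
    return pairs

coords = np.random.random((1000, 3)).astype(np.float32)
print(len(capped_distance_pairs(coords, 0.05)))

Because the cap is known in advance, distant pairs are never examined at all, which is what makes this kind of search attractive for very large systems.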

@kain88-de
Member Author

Its SWIG interface currently doesn't support Python 3. scikit-learn supports both py2 and py3. I also have good experience with its performance. It would also mean one less thing that has to be maintained.

@kain88-de
Member Author

@richardjgowers what would you suggest is a good test system?

@richardjgowers
Member

Timeit in IPython is good enough, and we can assume that the particles are randomly spaced, so something like...

import numpy as np
from scipy import spatial
from MDAnalysis.lib import KDTree

# 250,000 random 3D coordinates (float32, as used internally by MDAnalysis)
coords = np.random.random(750000).reshape(250000, 3).astype(np.float32)

# bundled KDTree: 3 dimensions, bucket size 10
t = KDTree.KDTree(3, 10)
%timeit t.set_coords(coords)

%timeit t2 = spatial.KDTree(coords)

I get this:

In [13]: %timeit t.set_coords(coords)
1 loops, best of 3: 833 ms per loop

In [16]: %timeit t2 = spatial.KDTree(coords)
1 loops, best of 3: 805 ms per loop

I think building the KDTree will be the largest cost, but it's also worth checking that all methods are available in the replacement. We don't use the KDTree directly, but wrap it in MDAnalysis.lib.KDTree.NeighborSearch.
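
To also compare the radius searches rather than just tree construction, something along these lines could be run in IPython as well (a sketch only; the query subset and the radius of 0.05 are arbitrary choices, and query_ball_point / query_radius are the radius-search methods of the scipy and scikit-learn trees):

import numpy as np
from scipy import spatial
from sklearn.neighbors import KDTree as SklearnKDTree

coords = np.random.random(750000).reshape(250000, 3).astype(np.float32)
queries = coords[:1000]   # search around a subset of the points
radius = 0.05

scipy_tree = spatial.KDTree(coords)
sklearn_tree = SklearnKDTree(coords)

%timeit scipy_tree.query_ball_point(queries, radius)
%timeit sklearn_tree.query_radius(queries, r=radius)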

@kain88-de
Member Author

http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4

It looks like the sklearn implementation is a lot faster than the one from scipy.

@richardjgowers
Member

Hmm ok, and the sklearn KDTree has 96%+ test coverage and supports py3.

To be fair, the MDA KDTree search was 2x as fast, so after 700,000 searches it comes out quicker :D

OK, if you can rewrite NeighborSearch to use the sklearn KDTree then this should be fine, unless there's something I've missed.

@orbeckst
Member

orbeckst commented Aug 6, 2015

One reason for bundling KDTree was that we didn't really like introducing scipy or sklearn as hard dependencies for core MDAnalysis, because in the past these were awful to just pip install.

By default, KDTree is only used for the around selection; point uses distance_matrix if I remember correctly. However, building manual KDTree selections has been pretty useful.

Issue #137 was looking at an implementation of KDTree with PBC.

@richardjgowers
Member

Re: #137, I don't think there's a way to add PBC at the C level. My best solution was to add extra coordinates to the KDTree (replicate the periodic images), which is done outside of the KDTree itself. So having our own KDTree doesn't affect that issue.
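
A minimal sketch of that image-replication idea, under the assumption of an orthorhombic box (the helper replicate_images is made up for this example): replicate the coordinates into the 26 neighbouring images, build an ordinary KDTree on the enlarged set, and map hits back to the original atoms with a modulo.

import itertools
import numpy as np
from scipy import spatial

def replicate_images(coords, box):
    """Return coords plus their 26 periodic images (orthorhombic box only)."""
    images = [coords + np.asarray(shift) * box
              for shift in itertools.product((-1, 0, 1), repeat=3)]
    return np.concatenate(images)

box = np.array([10.0, 10.0, 10.0])
coords = np.random.random((1000, 3)) * box
all_coords = replicate_images(coords, box)        # 27x as many points

tree = spatial.cKDTree(all_coords)
hits = tree.query_ball_point(coords[0], 2.5)      # neighbours across the boundary too
original_indices = sorted({i % len(coords) for i in hits})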

@kain88-de
Member Author

I'm currently porting the selection object to use the sklearn KDTree. Some of the tests are failing right now, but just seeing that the tests in test_analysis:TestHydrogenBondingAnalysisChecking fail doesn't tell me much about what the problem is, due to the test design. It is not even easy to figure out what the test does exactly.

@mnmelo
Member

mnmelo commented Aug 9, 2015

I hit a similar test failure in my half-way cleanup of the selection module (I left that on stand-by pending the development of the new AtomGroup model and use of pyparsing; see #371).

I had reflowed much of the point and around code to reduce duplication, and was not surprised I broke something. But now you find a failure in the same tests. Maybe:

  1. Those are the only tests that really use the KDTree feature, and both our fixes are indeed borked; or
  2. That module has a bug somewhere.

I'm not sure which I prefer to be true...

Here's my test error log (notice that something else also fails for test_atomgroup.py).
Let me know if you'd like me to upload my half-done branch.

@kain88-de
Member Author

Well, then we should first refactor the tests to be readable. I have serious trouble understanding what exactly the current tests are doing.

@kain88-de
Member Author

OK, the tests pass now. The test_atomgroup.py tests are much better for finding out what happens and what is wrong.

@mnmelo
Member

mnmelo commented Aug 9, 2015

So, was it something in your code, then?

@kain88-de
Member Author

Yup, I didn't understand the original code. Now everything is working. You can check out my progress here: https://github.com/kain88-de/mdanalysis/tree/sklearn-kdtree

kain88-de mentioned this issue Aug 16, 2015
@dotsdl
Member

dotsdl commented Aug 16, 2015

Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.

@orbeckst
Member

On 16 Aug, 2015, at 12:00, David Dotson wrote:

Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.

I am against introducing such a heavy dependency unless there's a REALLY good reason.

If the code is well encapsulated, we could simply take that code and replace our KD-tree with it.

Alternatively, a grid-search type algorithm (written in Cython ourselves) might be a good option.


@kain88-de
Member Author

I am against introducing such a heavy dependency unless there's a REALLY good reason.

It's faster than the BioPython version and very well maintained.

If the code is well encapsulated, we could simply take the code and replace the KD-tree one.

I don't like this approach. It just puts the burden on MDAnalysis to maintain the forked code and merge upstream bug fixes. It is a better and more pythonic approach to use the other libraries directly.

I have switched the KDTree in my branch to use the BioPython KDTree directly. This will also work for Python 3.

@mnmelo
Member

mnmelo commented Aug 17, 2015

Then, if switching in place is that easy, I'd say we depend on either BioPython or scikit-learn. If neither is present, we install the lighter BioPython; if scikit-learn is available, we use that. I'm not sure, though, how to teach this kind of dependency to setuptools or pip.

The code then uses scikit-learn if available; otherwise it falls back to BioPython. We write a big notice in the install instructions, and maybe even issue a warning when BioPython is used, saying that speed can be gained if scikit-learn is installed.

This is all assuming it's easy to just switch frameworks. @kain88-de might share more insight on that (I imagine it won't be as straightforward as having a try: except: on the imports...).
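
For reference, the import fallback would in the simplest case look something like the sketch below; as noted in the next comment, the real obstacle is that the two libraries expose different KDTree APIs, so a thin common wrapper would still be needed on top of this.

try:
    # prefer the faster scikit-learn tree when it is installed
    from sklearn.neighbors import KDTree
    HAS_SKLEARN = True
except ImportError:
    # fall back to the lighter BioPython dependency
    from Bio.KDTree import KDTree
    HAS_SKLEARN = False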

@kain88-de
Member Author

@mnmelo it is not that easy, since the BioPython KDTree has a different API than the scikit-learn KDTree. Besides, I would prefer to use just one to keep the code simple.

@mnmelo
Member

mnmelo commented Aug 17, 2015

Ah, that's a shame. But I understand that putting effort into making wrappers around equivalent foreign libraries shouldn't really be our priority here.

@orbeckst
Member

Introducing sklearn as a dependency requires some serious discussion. I definitely don't want to rip out the old KD-Tree for 0.11.0.

For a start, one thing that vanilla Python (I mean pip/setuptools) is decidedly not good at is making sure that you can easily install packages. Have you ever tried to get a package to work somewhere that does not have the full scientific Python stack installed? To make this installation hell sort of manageable you actually need third-party stuff like EPD or conda, or at least a good understanding of your distribution's package management system. To be facetious: in some cases "pythonic" does not equal "user friendly". Therefore, I think that the inclusion of third-party libs into more complicated packages is actually the right thing to do in many instances, because it shifts load from users to developers. And ultimately, I feel that, first and foremost, MDAnalysis must make it easy for users to get work done.

I don't think that inclusion of sklearn as a hard dependency is justified at the moment because we are using only a tiny part of it to provide a tiny part of MDAnalysis functionality. If you look at the other hard dependencies, then they are either essential and provide a lot of functionality (numpy) or small-ish and/or install cleanly (networkx, GridDataFormats) or optional to provide very specific functionality (netCDF4 for Amber NCDF format). You could make a case that sklearn is similar to biopython, which also only contributes a fraction of functionality (one version of the PDBReader, sequence handling), but at least we never had any problems with biopython as a dependency as it always builds cleanly in my experience. (But I would like to get rid of that dependency, too, especially with a view towards Py3k!) Basically, I want to avoid one more stumbling block that prevents a user from easily installing the package (because these are the users that never come back and leave no feedback on what went wrong).

I am eager to listen to everyone's opinions.

@hainm
Contributor

hainm commented Aug 19, 2015

I agree with @orbeckst about not depending too much on sklearn. There is no point in replacing the old lib if it works fine. Not all users are Python experts, and not all of them (currently) know about conda.

@dotsdl
Member

dotsdl commented Aug 20, 2015

I agree with @orbeckst as well. Although I very much respect scikit-learn and its high development standards, I don't consider it to be low-level enough in the Python stack to make sense as a piece of the core of MDAnalysis. It provides high-quality implementations of common machine-learning algorithms with a consistent API; that is great, but I don't think it fits this case.

I think a Cythonized KDTree that is easily able to handle periodic boundary conditions is something we would find desirable. That is something we will have to tackle, and I think that's the only reasonable choice.

@richardjgowers
Member

We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)

I don't see a problem with using sklearn; it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn, and it looks painless enough?

@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that

@hainm
Contributor

hainm commented Aug 20, 2015

I've looked at the benchmark for 'Test finding in radius of time'; MDA has the lowest time.


@kain88-de
Member Author

PR #395 currently uses the BioPython KDTree. That solution is slightly slower than the scikit-learn version, but that shouldn't be noticeable unless one is doing a lot of different selections in a loop. (scikit-learn would be faster overall since it builds the tree incredibly fast, so a slightly slower search doesn't matter that much here.) As far as Py3 is concerned, BioPython supports Python 2.6–3.4, so we are good there as well.
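
For anyone comparing, a radius search with the BioPython tree looks roughly like the sketch below (Bio.KDTree API as I understand it for Biopython of that era; the actual wrapper in #395 may differ in detail).

import numpy as np
from Bio.KDTree import KDTree

coords = np.random.random((250000, 3)).astype(np.float32)

kdt = KDTree(3)                # 3-dimensional tree
kdt.set_coords(coords)         # building the tree is the expensive step

centre = np.array([0.5, 0.5, 0.5], dtype=np.float32)
kdt.search(centre, 0.05)       # find everything within radius 0.05 of centre
indices = kdt.get_indices()    # indices into coords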

I personally have scikit-learn installed by default anyway; it offers a huge variety of dimensionality-reduction and clustering algorithms. But I also understand if you want to keep the dependencies for MDAnalysis light.

With regard to installing things: in my experience pip works pretty well today if you use it consistently. I always install pip packages only for my user (with --user) and have any system-wide installation handled by my distribution's package manager. But I have also seen people end up with a messed-up environment when they used pip to install packages system-wide and then updated with the distribution package manager. That's a problem people will have independently of MDAnalysis, though.

richardjgowers added this to the 0.12 milestone Aug 20, 2015
@orbeckst
Member

I've looked at the benchmark for 'Test finding in radius of time'; MDA has the lowest time.

Yes, but tree setup time really matters, as @richardjgowers pointed out in an earlier comment.

@orbeckst
Member

@kain88-de, using the BioPython KD-tree implementation seems like a good compromise. Can you clean up #395 (or submit a new, clean PR)? At the moment it does not pass the tests.

However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools that depend on MDAnalysis, so I'd like to keep it. (I am mentioning this because you asked about it in #395.)

@kain88-de
Member Author

Only the atom neighbor search is used there. That class still exists, but in its own module now. @richardjgowers just has to tell me where that module should ideally live, and then I'll fix the failing test.

@richardjgowers
Member

Keep it in lib; analysis is more for whole tools than for components.

@kain88-de
Member Author

However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).

Oh, I missed analysis.density because it doesn't have unit tests. I changed the API, and I'm quite confident that it still works, but I can't test that currently. I wanted to quickly run the example given in the docstring, but that doesn't work anymore.

orbeckst referenced this issue Feb 1, 2016
Further standardized XDR error message reporting and exceptions
(now mostly IOError).

Removed reference to the old SWIG numpy.i in the credits.
orbeckst added a commit that referenced this issue Sep 15, 2016
- changed "Maintained by Oliver Beckstein and Elizabeth Denning" to
  "Maintained by the MDAnalysis Core Devs Team"
- added Wouter Boomsma (for PR #976)
- removed explicit mentions for KDTree (Hamelryck) because since 0.11.0
  we do no longer include the actual code (see PR #395 and #383)
- removed explicit mentions for helanal (Hall) because Ben Hall is
  a contributing author
abiedermann pushed a commit to abiedermann/mdanalysis that referenced this issue Jan 5, 2017