BigResultSet #573
👍 and perhaps with
Hey @marcharper, have you played around with HDF5? http://docs.h5py.org/en/latest/index.html

Looks like this would involve storing the file in a different format (perhaps we could do that with a

Just a thought.
My next idea was to load the data using generators, computing wins, score differences, and such to reduce the memory footprint, possibly going over the file more than once. I'm open to other options of course. I previously tried a somewhat similar approach with
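A generator-based reader along those lines might look like this. This is a minimal sketch only: the `read_interactions` name and the CSV layout (one row per match, with the two players' moves interleaved) are assumptions for illustration, not the library's actual format.

```python
import csv

def read_interactions(filename):
    """Yield one match record at a time rather than loading the whole
    file into memory at once.

    Assumes a hypothetical CSV layout of
    ``player_index, opponent_index, actions`` per row, with actions
    interleaved as P1, P2, P1, P2, ...; the real file format may differ.
    """
    with open(filename) as f:
        for row in csv.reader(f):
            p1, p2, actions = int(row[0]), int(row[1]), row[2]
            # Pair up the two players' moves: "CDDC" -> [('C', 'D'), ('D', 'C')]
            plays = list(zip(actions[::2], actions[1::2]))
            yield (p1, p2), plays
```

Because this yields one record at a time, memory use stays roughly constant in the number of matches, at the cost of re-reading the file for each full pass.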
I think the idea of generators sounds better to me. Saw this on a thread on reddit so thought it was worth mentioning :) Perhaps something to fall back on if generators don't do the trick :)
Definitely something to consider when we build out an API that saves large numbers of results over time.
FYI, I've had a little go at this, still very much a work in progress and mainly just playing around. The main input to the Here it is:
This means you can simply plug that in to the
Those last three checks ensure we have the behaviour that the You can then create a
(There I'm 'borrowing' the players list from

This is SUPER SLOW at the moment and I haven't checked that memory is definitely not being used. If the heat of my laptop on my lap while writing to and from disk is anything to go by, I think it's not...? Could be that we can add some optimisations in the

Anyway, I'm going to keep playing around with this but just thought I'd let you know that I was giving it a try. :) 👍

(This could be the completely wrong way to go about this.)
That's probably a good idea. Some thoughts:
One more thought -- we're computing matches in batches now, so we could compute (all?) the desired quantities in memory before writing to disk. Maybe write all this to a second data file, keeping interactions in their current file. Shelves might be a good solution here (CSV is probably also fine, or a pandas dataframe if you can tolerate the dependency). Then we'd enhance

You can probably see why I haven't written this yet. To do it right will probably take a significant amount of refactoring and, in any case, several hours of uninterrupted time to figure it all out.
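The shelve idea could be sketched like this. Everything here is illustrative: the `write_batch_summaries` helper, the `(match_key, plays)` record shape, and the summary fields are made up, not part of the library.

```python
import shelve

def write_batch_summaries(batches, path):
    """Persist per-match summaries to disk as they are computed, so
    only one record is held in memory at a time.

    ``batches`` yields hypothetical (match_key, plays) pairs; the
    summary computed here (cooperation count and number of turns)
    is purely illustrative.
    """
    with shelve.open(path) as db:
        for key, plays in batches:
            coops = sum(1 for move, _ in plays if move == "C")
            # Shelve keys must be strings, so stringify the match key.
            db[str(key)] = {"cooperations": coops, "turns": len(plays)}
```

A shelf gives dictionary-style random access to the stored summaries later, without ever rebuilding the full in-memory dictionary.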
Just heading off for the night so I'll read through your comment in detail tomorrow (thank you for it, I am not at all dismissing it: in fact just the opposite!). FWIW it doesn't look like it's as slow as I thought; I had in fact just stalled because I hadn't implemented a

I'll do some analysis of the memory and time consumption of what I've got. Perhaps if the memory consumption is indeed low and the time isn't ridiculously slower, this could be worth optimising...

Anyway, will read through your comment tomorrow and get back to you. 👍
I'm guessing you could cut the run time of this implementation by a third or so by combining the functions in
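Combining the per-statistic functions into one loop could look roughly like this. This is a sketch only: the `summarise` function, the record layout, and the standard Prisoner's Dilemma payoffs (R, P, S, T) = (3, 1, 0, 5) are assumptions, not the library's actual code.

```python
# Map a pair of moves to the (player 1, player 2) payoffs, using the
# conventional Prisoner's Dilemma values (R, P, S, T) = (3, 1, 0, 5).
PAYOFFS = {("C", "C"): (3, 3), ("D", "D"): (1, 1),
           ("C", "D"): (0, 5), ("D", "C"): (5, 0)}

def summarise(interactions):
    """Accumulate several summary statistics in a single pass over the
    interactions, instead of one full pass per statistic."""
    totals = {"score_1": 0, "score_2": 0, "cooperations_1": 0, "wins_1": 0}
    for plays in interactions:
        s1 = s2 = 0
        for move1, move2 in plays:
            p1, p2 = PAYOFFS[(move1, move2)]
            s1 += p1
            s2 += p2
            totals["cooperations_1"] += move1 == "C"
        totals["score_1"] += s1
        totals["score_2"] += s2
        totals["wins_1"] += s1 > s2
    return totals
```

Every extra statistic then only adds an accumulator, not another pass over the data.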
Yup :)
Yeah, this is brutally terrible. Have thought more about it overnight and this is just stupid.
Is there a need to wrap the
Cool, batch size could be a tuneable parameter also perhaps?
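A tunable batch size could be exposed along these lines. The `batched` helper is hypothetical, not part of the library; it just groups a streaming iterator into fixed-size chunks so the memory/speed trade-off becomes a parameter.

```python
from itertools import islice

def batched(records, batch_size=100):
    """Group an iterator of records into lists of at most
    ``batch_size`` items, yielding one batch at a time."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

A larger `batch_size` trades memory for fewer passes through the loop machinery; `batch_size=1` degenerates to fully streaming behaviour.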
I think this is ultimately what needs to happen. The fact that the current results set goes through interactions multiple times made sense from a readability point of view but also because they're dictionaries. Certainly doesn't make sense the way I've implemented it here (loop through for every single tuple lookup... lol this really is terrible). I think it shouldn't be too hard (famous last words) to get it to do this over two or three passes (possibly even just the one...). I'll have a go at this :)
Yeah this is to/from ssd. I think there are possibly some things that can be done in parallel. Could be helpful if I drew a map of all the results perhaps (http://axelrod.readthedocs.io/en/latest/tutorials/getting_started/visualising_results.html) to see what needs what... 💭
I'm hesitant about computing the results during the tournament run:
I think I'm just saying that I'd like to avoid this if possible but that's my initial 'slept on it' thinking... There might be a way to do it that doesn't mess with my worries about 1. and 2.
Yeah, my having a try isn't at all meant as a prod in your direction. I know you're busy :) I'm going to give this a go, just modifying the
Don't answer that :) Figured it out, playing around with reading in in batches :)
I think it's doable in two passes :) Should have something 'showable' (I'm basically just 'doodling' in a notebook right now) by the end of today :)
Great, yeah I think working through the various
So my attempt at this is here: branch 573. It's very raw, so not well documented and (not at all) tested etc. (sorry!)... There are currently two commits (and this PR: #671) on that branch ahead of master:
Assuming this is any good (I didn't figure out how to profile memory properly, but it actually looks like it might be faster???) I reckon the readability could be made not too bad with some nice refactoring. It could possibly be worth replacing

I'm signing off for the weekend (or at least Saturday) so if anyone wanted to play with it please do :)
If I'm reading below correctly this is quite good:
For example, using the basic strategies, the normal read takes 9.47 MB, the 'big_read' (using the

EDIT: FYI https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling
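The linked repository uses memory_profiler for these numbers; for a quick stdlib-only sanity check, something like `tracemalloc` can give a rough peak-memory figure for a single call. The `peak_memory` helper below is a made-up convenience, not part of either library.

```python
import tracemalloc

def peak_memory(func, *args, **kwargs):
    """Return (result, peak_bytes) for one call to ``func``.

    A coarse stdlib alternative to the line-by-line report that
    memory_profiler produces: it only tracks Python-level allocations
    made while the call runs.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Comparing `peak_memory` for the eager and generator-based readers on the same file would give a quick (if rough) confirmation of the memory_profiler numbers.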
So I've run the profiler on the following:
The output is below:
Assuming I am reading that correctly, it looks like the memory consumption of the

I'm going to rerun it but inverting the order in which each is done (so run the

I've started refactoring and adding tests. I'll open a PR once that's done; I suggest that after that we could look at some timing experiments and potentially (I'm not convinced the
Great! This is actually what I expect from using generators. Any sense about the speed?
Yeah, you did call this! :) I actually think it might be faster (basically just the one loop as opposed to 15-odd...) :) Going to refactor (will finish tomorrow I reckon) then I'll do some time experiments too :)
Have re-run in opposite order (just to be sure). Results are the same (can be seen here: https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling/blob/master/memory_profiler.log).
So this seems to be much faster:
There are some tweaks I still want to do and no doubt improvements that will come from the PR review.
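One way such timing experiments could be set up with the stdlib is sketched below. It contrasts an eager list build with a generator on a toy workload; the functions are illustrative stand-ins, not the library's actual readers, and real timings will of course depend on the machine and data.

```python
import timeit

def eager_total(n):
    # Build the whole list first (analogous to loading every interaction).
    data = [i % 2 for i in range(n)]
    return sum(data)

def lazy_total(n):
    # Stream values with a generator expression (analogous to the
    # generator-based reading approach).
    return sum(i % 2 for i in range(n))

def compare(n=100000, number=5):
    """Time both styles on the same workload and return the two totals
    in seconds (eager, lazy)."""
    t_eager = timeit.timeit(lambda: eager_total(n), number=number)
    t_lazy = timeit.timeit(lambda: lazy_total(n), number=number)
    return t_eager, t_lazy
```

Running both variants several times via `timeit` and checking they agree on the answer keeps the speed comparison honest.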
Just FYI. Have just pushed some more commits to the 573 branch. Amongst some refactoring and tests, this also includes a much more precise progress bar but also a suggestion for how to 'plug' in the

My plan is:
Should be able to do that by end of tomorrow (famous last words).
For large tournaments, loading all the interactions into memory isn't feasible. @drvinceknight had the idea on #520 to have a special class for this case.