Rewrite of the ResultsSetFromFile #672
Conversation
Mainly useful for tests.
This is very very raw. 1. Needs to be profiled (is this any good?) 2. Needs to be tested (have done some basic tests: all results seem correct) 3. Needs to be refactored. The terribly long method that does everything can be tidied substantially. 4. Progress bar and other niceties need to be added.
Now need to refactor: 1. Make modular methods; 2. Progress bar; 3. Pass players and interactions from tournament.
This checks that the tournament gives the same results.
Still need to do more and add tests.
Lacks tests.
This means that a tournament can pass these to the results set. Thus the BigResultSet only actually needs one read of the data.
If set to True, the 'old' ResultSet will be used, which does read all the interactions into memory, making them available.
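The read-once idea can be pictured with a small sketch (hypothetical helper names, not the library's actual code): interactions are yielded one row at a time from the CSV, so nothing accumulates in memory unless the caller asks to keep it.

```python
import csv

def read_interactions(filename):
    """Yield one interaction row at a time from a tournament CSV.

    Each row is assumed (for illustration) to be:
    player_index, opponent_index, player_actions, opponent_actions.
    Nothing is held in memory beyond the current row.
    """
    with open(filename, newline="") as f:
        for row in csv.reader(f):
            p1, p2, actions1, actions2 = row[0], row[1], row[2], row[3]
            yield int(p1), int(p2), actions1, actions2
```

A results class can consume this generator directly, updating its metrics per row, while a `keep_interactions`-style flag would simply decide whether to also append each row to a list.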
In particular, docs that show how to look through interactions.
I think memory is still accumulating somewhere. Try the following code:
This creates a data file that is about 1.2 GB, and the analysing step takes up 1-2 GB of RAM. Any ideas?
I'll investigate :) Could just be that with that many reps all the metrics are large. Will see how the profiler compares that script with master... Will get to it.
I failed to wait till then... Here is the profiler with the current master for comparison (not to your script @marcharper but to the stuff I was running before).
Just setting the script you had there to go overnight (on master; I think it'll take that long on the small machine I'm using). I expect it's mainly going to be the size of some of the metrics (lots of them have one dimension corresponding to the repetitions), but it would be great if we could find another spot where memory is accumulating...
With the master branch, 100000 repetitions on an 8GB machine crashed. Here are the empty metrics (that get updated as the data is read):
There are quite a few metrics that would have had 100000 columns or rows (or a third dimension). I believe that's where the high memory use is (and it would be no matter how we calculate things). I'm rerunning the new branch with this many repetitions just to check, and will fish around to see if I can spot anything else... I'll perhaps give profiling the class itself a go...
Maybe we don't need to do this though -- for score diffs we're really only using the mean and median, which we should be able to calculate sequentially, right? In any case this PR is a big improvement; if you are ready to merge we can dig around more later.
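Calculating sequentially is straightforward for the mean (and variance) using Welford's one-pass algorithm; only the exact median genuinely needs some record of the values. A minimal sketch of the one-pass part (illustrative, not the library's code):

```python
def running_mean_variance(values):
    """Welford's one-pass algorithm: mean and population variance
    computed from a stream, without keeping a list of the values."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n          # update the running mean
        m2 += delta * (x - mean)   # accumulate squared deviations
    variance = m2 / n if n else 0.0
    return mean, variance
```

So the score-diff stream could feed an accumulator like this while being read from disk, with no per-repetition storage at all for the mean/variance.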
"""A class to hold the results of a tournament. Reads in a CSV file produced | ||
by the tournament class. | ||
""" | ||
Read the result set directly from file. |
Technically we're reading the interactions from the file (not e.g. a pickled result set).
Fixed: 285bf2c
We could improve the memory usage on some of these by using a dictionary perhaps. In other words I would guess that the score_diff list is mostly repeated values (and certainly is for deterministic strategies). We can easily compute the mean, median, and variance from the counts of each difference rather than a list of all the values. Another idea: manually call the garbage collector. The memory builds up over time so maybe we can strategically call it.
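The counts idea can be sketched like this (a hypothetical helper, assuming a `{value: count}` mapping such as a `collections.Counter` of score differences): mean, median and variance all fall out of the counts without ever materialising the full list.

```python
def stats_from_counts(counts):
    """Mean, median and population variance from a {value: count} mapping,
    avoiding a full list of (mostly repeated) score differences."""
    n = sum(counts.values())
    mean = sum(v * c for v, c in counts.items()) / n
    variance = sum(c * (v - mean) ** 2 for v, c in counts.items()) / n
    # Median: walk the sorted distinct values until the cumulative count
    # passes the index of each middle element of the sorted expansion.
    lo_idx, hi_idx = (n - 1) // 2, n // 2
    seen, lo, hi = 0, None, None
    for v in sorted(counts):
        seen += counts[v]
        if lo is None and seen > lo_idx:
            lo = v
        if seen > hi_idx:
            hi = v
            break
    return mean, (lo + hi) / 2, variance
```

For deterministic strategies the mapping would often have a single key, so this is close to constant memory per pair.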
""" | ||
Build the result set (used by the play method) | ||
|
||
Returns | ||
------- | ||
axelrod.ResultSet | ||
axelrod.BigResultSet |
ResultsSetFromFile?
Fixed: 285bf2c
For completeness: the profiler just ran for the large rep tournament with this branch (the new ResultSet) and it seemed to handle it ok:
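For readers wanting to repeat this kind of check without a third-party profiler, peak allocation can be measured with the standard library's `tracemalloc` (a generic sketch, not the profiler actually used in this thread):

```python
import tracemalloc

def peak_memory_of(func, *args):
    """Run func(*args) and return (result, peak allocated memory in MB)."""
    tracemalloc.start()
    result = func(*args)
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, peak / 1e6
```

Wrapping the result-set construction in `peak_memory_of` gives a quick read on whether a change actually moves the peak.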
I suggest that this could be a further issue and PR? Only because the various formats of the result set metrics feed into the plot object. EDIT: Just a passing thought: a 100000 repetition tournament is very much an extreme case, right? Not saying we should ignore it, and it's a great test to see where else we can save space; just mentioning it to suggest that we shouldn't perhaps overreact. Just a thought.
No harm in trying; my hunch is that this won't make a huge difference, as I think it is just the size of the various metrics we (currently?) want to keep. Will play with it and see how the profiler looks... Will report back. :)
Have added the garbage collector to the generator.
Just ran the profiler with a garbage collection: 0f52e2d (collecting after each yield by the generator, which ensures it happens pretty regularly, after every calculation really). There is no real gain; here's the diff (for the test cases I had previously, not your large repetition example): drvinceknight/Axelrod-Python-Memory-Profiling@1c866f8. I would suggest reverting 0f52e2d. I think we're simply in the case that the result metrics take space... We could open another issue to discuss changing that...
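The experiment in 0f52e2d amounted to something like this sketch (illustrative, not the actual commit): force a collection after each yielded row, so any garbage from the preceding calculation is reclaimed promptly.

```python
import gc

def interactions_with_gc(rows):
    """Yield rows from an iterable, forcing a garbage collection after
    each yield. (Tried in 0f52e2d; measured no real gain, so reverted.)"""
    for row in rows:
        yield row
        gc.collect()
```

As the profiling showed, `gc.collect()` only helps when unreachable cycles are piling up; it cannot shrink metrics that are still live, which is why there was no real gain here.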
Currently re-running with the high rep example.
Sure, let's leave this optimization (admittedly an extreme case) to a future PR. I'm happy that the 100k rep file actually loads now on my machine 😄
This reverts commit 0f52e2d.
👍 (I've just reverted the garbage collector.) |
These were commented out during dev and no longer make sense.
Closes #573
This changes the `ResultsSetFromFile` to read from disk (potentially just once) and do all the calculations as that happens. On #573 the suggestion was originally to have this class as a separate `BigResultClass`, but since it's (much) quicker I just suggest we make this the default.

If creating via `tournament.play` then the data will only be read once. If creating from a file and the number of interactions, list of players and number of repetitions is not known, then you will read through the file 3 times (but it's still fast).

There is a `keep_interactions` option which can be passed to the `tournament.play` method.

One negative thing: running the tests spits out a lot of gunk. This seems to have something to do with the two hypothesis tests (that basically check that the results line up with the base results set) and the files not being closed properly. Not sure what can be done here. Not sure if this is a real problem (all errors/failures are reported after the gunk, it's just a bit more messy).
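The "read through the file 3 times" case can be pictured with a sketch (hypothetical helper, not the library code): a first cheap pass discovers the players and repetition count without keeping any rows, so later passes know the shapes of the metrics in advance.

```python
import csv

def infer_shape(filename):
    """First pass over an interactions CSV (assumed row layout:
    player_index, opponent_index, actions...): discover how many players
    there are and the number of repetitions per pair, keeping no rows."""
    players, pair_counts = set(), {}
    with open(filename, newline="") as f:
        for p1, p2, *_ in csv.reader(f):
            players.update((int(p1), int(p2)))
            key = (int(p1), int(p2))
            pair_counts[key] = pair_counts.get(key, 0) + 1
    repetitions = max(pair_counts.values()) if pair_counts else 0
    return len(players), repetitions
```

When the tournament object passes the player list and repetition count in directly (the `tournament.play` case), this pass is unnecessary, which is why that path only reads the file once.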
Here's the memory profile (just running the new class). Note that analysing these 5 tournaments took a total of 8 minutes on a small 2-core machine (this is very fast relative to before: I think this points out how bad the previous way of doing things, which I wrote, was):
TLDR: the largest tournament (all strategies, 50 tournaments, 20 repetitions) takes 0.003 GB of RAM <- That assumes I'm reading this correctly (?). I'm rerunning that now with the master branch just to compare one last time (this will take a couple of hours).