Rewrite of the ResultsSetFromFile #672
Conversation
Mainly useful for tests.
This is very very raw. 1. Needs to be profiled (is this any good?) 2. Needs to be tested (have done some basic tests: all results seem correct) 3. Needs to be refactored. The terribly long method that does everything can be tidied substantially. 4. Progress bar and other niceties need to be added.
Now need to refactor: 1. Make modular methods; 2. Progress bar; 3. Pass players and interactions from tournament.
This checks that the tournament gives the same results.
Still need to do more and add tests.
Lacks tests.
This means that a tournament can pass these to the results set. Thus the BigResultSet only actually needs one read of the data.
If set to True, the 'old' ResultSet will be used, which does read all the interactions into memory, making them available.
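The read-once idea can be pictured with a small sketch (hypothetical helper names, not the library's actual code): interactions are yielded one row at a time from the CSV, so nothing accumulates in memory unless the caller asks to keep it.

```python
import csv

def read_interactions(filename):
    """Yield one interaction row at a time from a tournament CSV.

    Each row is assumed (for illustration) to be:
    player_index, opponent_index, player_actions, opponent_actions.
    Nothing is held in memory beyond the current row.
    """
    with open(filename, newline="") as f:
        for row in csv.reader(f):
            p1, p2, actions1, actions2 = row[0], row[1], row[2], row[3]
            yield int(p1), int(p2), actions1, actions2
```

A results class can consume this generator directly, updating its metrics per row, while a `keep_interactions`-style flag would simply decide whether to also append each row to a list.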
In particular, docs that show how to look through interactions.
I think memory is still accumulating somewhere. Try the following code:
This creates a data file that is about 1.2 GB, and the analysing step takes up 1-2 GB of RAM. Any ideas?
I'll investigate :) Could just be that with that many reps all the metrics are large. Will see how the profiler compares that script with master... Will get to it.
I failed to wait till then... Here is the profiler with the current master for comparison (not to your script @marcharper but to the stuff I was running before).
Just setting the script you had there to go overnight (on master; I think it'll take that long on the small machine I'm using). I expect it's mainly going to be the size of some of the metrics (lots of them have one dimension corresponding to the repetitions), but it would be great if we could find another spot where memory is accumulating...
With the master branch, 100000 repetitions on an 8GB machine crashed. Here are the empty metrics (that get updated as the data is read):
There are quite a few metrics that would have had 100000 columns or rows (or a third dimension). I believe that's where the high memory use is (and it would be no matter how we calculate things). I'm rerunning the new branch with this many repetitions just to check, and will fish around to see if I can spot anything else... I'll perhaps give profiling the class itself a go...
Maybe we don't need to do this though -- for score diffs we're really only using the mean and median, which we should be able to calculate sequentially, right? In any case this PR is a big improvement; if you are ready to merge we can dig around more later.
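Calculating sequentially is straightforward for the mean (and variance) using Welford's one-pass algorithm; only the exact median genuinely needs some record of the values. A minimal sketch of the one-pass part (illustrative, not the library's code):

```python
def running_mean_variance(values):
    """Welford's one-pass algorithm: mean and population variance
    computed from a stream, without keeping a list of the values."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n          # update the running mean
        m2 += delta * (x - mean)   # accumulate squared deviations
    variance = m2 / n if n else 0.0
    return mean, variance
```

So the score-diff stream could feed an accumulator like this while being read from disk, with no per-repetition storage at all for the mean/variance.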
"""A class to hold the results of a tournament. Reads in a CSV file produced | ||
by the tournament class. | ||
""" | ||
Read the result set directly from file. |
Technically we're reading the interactions from the file (not e.g. a pickled result set).
Fixed: 285bf2c
We could improve the memory usage on some of these by using a dictionary perhaps. In other words I would guess that the score_diff list is mostly repeated values (and certainly is for deterministic strategies). We can easily compute the mean, median, and variance from the counts of each difference rather than a list of all the values. Another idea: manually call the garbage collector. The memory builds up over time so maybe we can strategically call it.
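The counts idea can be sketched like this (a hypothetical helper, assuming a `{value: count}` mapping such as a `collections.Counter` of score differences): mean, median and variance all fall out of the counts without ever materialising the full list.

```python
def stats_from_counts(counts):
    """Mean, median and population variance from a {value: count} mapping,
    avoiding a full list of (mostly repeated) score differences."""
    n = sum(counts.values())
    mean = sum(v * c for v, c in counts.items()) / n
    variance = sum(c * (v - mean) ** 2 for v, c in counts.items()) / n
    # Median: walk the sorted distinct values until the cumulative count
    # passes the index of each middle element of the sorted expansion.
    lo_idx, hi_idx = (n - 1) // 2, n // 2
    seen, lo, hi = 0, None, None
    for v in sorted(counts):
        seen += counts[v]
        if lo is None and seen > lo_idx:
            lo = v
        if seen > hi_idx:
            hi = v
            break
    return mean, (lo + hi) / 2, variance
```

For deterministic strategies the mapping would often have a single key, so this is close to constant memory per pair.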
""" | ||
Build the result set (used by the play method) | ||
|
||
Returns | ||
------- | ||
axelrod.ResultSet | ||
axelrod.BigResultSet |
ResultsSetFromFile?
Fixed: 285bf2c
For completeness: the profiler just ran for the large rep tournament with this branch (the new ResultSet) and it seemed to handle it ok:
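For readers wanting to repeat this kind of check without a third-party profiler, peak allocation can be measured with the standard library's `tracemalloc` (a generic sketch, not the profiler actually used in this thread):

```python
import tracemalloc

def peak_memory_of(func, *args):
    """Run func(*args) and return (result, peak allocated memory in MB)."""
    tracemalloc.start()
    result = func(*args)
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, peak / 1e6
```

Wrapping the result-set construction in `peak_memory_of` gives a quick read on whether a change actually moves the peak.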
I suggest that this could be a further issue and PR? Only because the various formats of the result set metrics feed into the plot object. EDIT: Just a passing thought: a 100000 repetition tournament is very much an extreme case, right? Not saying we should ignore it, and it's a great test to see where else we can save space; just mentioning it to suggest that we shouldn't perhaps overreact. Just a thought.
No harm in trying; my hunch is that this won't make a huge difference, as I think it is just the size of the various metrics we (currently?) want to keep. Will play with it and see how the profiler looks... Will report back. :)
Have added the garbage collector to the generator.
Just ran the profiler with a garbage collection: 0f52e2d (collecting after each yield by the generator, which ensures it happens pretty regularly, after every calculation really). There is no real gain; here's the diff (for the test cases I had previously, not your large repetition example): drvinceknight/Axelrod-Python-Memory-Profiling@1c866f8. I would suggest reverting 0f52e2d. I think we're simply in the case that the result metrics take space... We could open another issue to discuss changing that...
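The experiment in 0f52e2d amounted to something like this sketch (illustrative, not the actual commit): force a collection after each yielded row, so any garbage from the preceding calculation is reclaimed promptly.

```python
import gc

def interactions_with_gc(rows):
    """Yield rows from an iterable, forcing a garbage collection after
    each yield. (Tried in 0f52e2d; measured no real gain, so reverted.)"""
    for row in rows:
        yield row
        gc.collect()
```

As the profiling showed, `gc.collect()` only helps when unreachable cycles are piling up; it cannot shrink metrics that are still live, which is why there was no real gain here.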
Currently re-running with the high rep example.
Sure, let's leave this optimization (admittedly an extreme case) to a future PR. I'm happy that the 100k rep file actually loads now on my machine 😄
This reverts commit 0f52e2d.
👍 (I've just reverted the garbage collector.) |
These were commented out during dev and no longer make sense.
Closes #573
This changes the `ResultsSetFromFile` to read from disk (potentially just once) and do all the calculations as that happens. On #573 the suggestion was originally to have this class as a separate `BigResultClass`, but since it's (much) quicker I just suggest we make this the default.

If creating via `tournament.play` then the data will only be read once. If creating from a file and the number of interactions, list of players and number of repetitions is not known, then you will read through the file 3 times (but it's still fast).

There is a `keep_interactions` option which can be passed to the `tournament.play` method.

One negative thing: running the tests spits out a lot of gunk. This seems to have something to do with the two hypothesis tests (that basically check that the results line up with the base results set) and the files not being closed properly. Not sure what can be done here. Not sure if this is a real problem (all errors/failures are reported after the gunk, it's just a bit more messy).
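The "read through the file 3 times" case can be pictured with a sketch (hypothetical helper, not the library code): a first cheap pass discovers the players and repetition count without keeping any rows, so later passes know the shapes of the metrics in advance.

```python
import csv

def infer_shape(filename):
    """First pass over an interactions CSV (assumed row layout:
    player_index, opponent_index, actions...): discover how many players
    there are and the number of repetitions per pair, keeping no rows."""
    players, pair_counts = set(), {}
    with open(filename, newline="") as f:
        for p1, p2, *_ in csv.reader(f):
            players.update((int(p1), int(p2)))
            key = (int(p1), int(p2))
            pair_counts[key] = pair_counts.get(key, 0) + 1
    repetitions = max(pair_counts.values()) if pair_counts else 0
    return len(players), repetitions
```

When the tournament object passes the player list and repetition count in directly (the `tournament.play` case), this pass is unnecessary, which is why that path only reads the file once.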
Here's the memory profile (just running the new class). Note that analysing these 5 tournaments took a total of 8 minutes on a small 2-core machine (this is very fast relative to before: I think this points out how bad the previous way of doing things, which I wrote, was):
TLDR: the largest tournament (all strategies, 50 tournaments, 20 repetitions) takes 0.003 GB of RAM <- That assumes I'm reading this correctly (?). I'm rerunning that now with the master branch just to compare one last time (this will take a couple of hours).