Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate ABC-SMC estimation for compartmental (SIR) models #42

Closed
ArtPoon opened this issue Feb 16, 2017 · 35 comments
Closed

Validate ABC-SMC estimation for compartmental (SIR) models #42

ArtPoon opened this issue Feb 16, 2017 · 35 comments
Assignees

Comments

@ArtPoon
Copy link
Contributor

ArtPoon commented Feb 16, 2017

Just like kernel-ABC validation

@ArtPoon ArtPoon changed the title Validate method on birth-death trees Validate method on birth-death or coalescent trees Feb 16, 2017
@ArtPoon ArtPoon changed the title Validate method on birth-death or coalescent trees Validate ABC-SMC estimation for compartmental (SIR) models Jun 5, 2017
gtng92 added a commit that referenced this issue Jun 8, 2017
--Tammy Ng
@gtng92
Copy link
Collaborator

gtng92 commented Jun 8, 2017

This is more of a minor annoyance, but I'm unsure about the format of variable sampleStates.
I created an example using of SIR non-dynamic compartmental model in pkg/examples/example-compartmental.R and this warning shows on the console:

Warning message:  
In if (!is.na(sampleStates)) 
    if (!is.matrix(sampleStates)) stop("sampleStates must be a matrix (not a data.frame)") :  
      the condition has length > 1 and only the first element will be used

Looking into compartmental-models.R, I've set sampleStates as:

# sample states
  sampleStates <- matrix(1, nrow=tips, ncol=length(demes))
  colnames(sampleStates) <- demes
  rownames(sampleStates) <- 1:tips

This code was derived from kamphir/simulate.SI.R. It creates a matrix with one column and many rows, all with the same value of 1.

In rcolgem.R (Erik Volz's package), I traced this warning back to line 119. It displays regardless, even though sampleStates is a matrix.

I figured when we run the compartmental.model function, we don't want this warning to pop up every single time it simulates a tree.

gtng92 added a commit that referenced this issue Jun 9, 2017
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 12, 2017

This is a bug in rcolgem that I fixed a while back but didn't make it into this release for some reason. We need to replace:

if (!is.na(sampleStates))

with

if (all(!is.na(sampleStates))

This should result in a condition of length 1. We could make this change to our fork of rcolgem and direct people to that version.
The alternative is to to use the R function suppressWarnings so that we don't get this warning message in our wrapper function.
I think the best approach is to make our fork of rcolgem a submodule of Kaphi and edit this line, since there will be other issues that arise.

gtng92 added a commit that referenced this issue Jun 12, 2017
@gtng92
Copy link
Collaborator

gtng92 commented Jun 16, 2017

I've simulated a target tree, and was able to calculate kernel distances for a varying parameter.
I'm able to initialize workspace. When I run.smc I keep getting the following error:

Error in sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) : 
  negative probability 

The traceback error:

11.sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) at rcolgem.R#1473
10.simulate.binary.dated.tree.fgy(fgy[[1]], fgy[[2]], fgy[[3]], 
    fgy[[4]], sampleTimes, sampleStates, integrationMethod = integrationMethod) at compartmental-models.R#38
9..call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates, births, 
    migrations, deaths, nonDemeDynamics, parms, fgyResolution, 
    integrationMethod) 
8.FUN(X[[i]], ...) 
7.lapply(X = X, FUN = FUN, ...) 
6.sapply(integer(n), eval.parent(substitute(function(...) expr)), 
    simplify = simplify) 
5.replicate(nsim, .call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates, 
    births, migrations, deaths, nonDemeDynamics, parms, fgyResolution, 
    integrationMethod), simplify = FALSE) at compartmental-models.R#211
4.config$model(theta = theta, nsim = config$nsample, tips = workspace$tip.heights, 
    model = model, seed = seed, labels = workspace$tip.labels, 
    ...) 
3.simulate.trees(ws, ws$particles[i, ], model = model, ...) 
2.initialize.smc(ws, model) 
1.run.smc(ws, trace.file = "pkg/examples/example-compartmental.tsv", 
    model = "sir.dynamic") 

I noticed that when I calculated kernel distances for varying t.end, the same error is thrown if:

  1. I first set t.end = 30.*52 in theta for the target tree
  2. then as I vary t.end, the t.end value exceeds the value 30.*52

One possible reason for this may be how I've set my prior distribution for t.end. Currently it's a random normal distribution with mean=100 and sd=5.
N is set as lognormal, and the rest of the parameters are gamma distributions. I will see if this error shows up with these ones too.

@gtng92
Copy link
Collaborator

gtng92 commented Jun 19, 2017

rcolgem.R has a clause in function make.fgy that assigns the following:

demeNames <- rownames(births)
m <- nrow(births)
nonDemeNames <- names(nonDemeDynamics)
mm <- length(nonDemeNames)

and then checks

if (length(x0)!=m + mm) stop('initial conditons incorrect dimension', x0, m, mm) 

births = (ODE expr) therefore m = 1
nonDemeNames = ('S') therefore mm = 1
x0 = (S=S, I=I) therefore length(x0) = 2

This is okay for the sir.nondynamic model but not for the rest, since I'm including compartment R in the ODE expressions as well as compartment E for the seir model. This changes length(x0) to either 3 or 4.
Do I then put the recovered and exposed expressions together with the births expression?
I would keep the part of the expression that is strictly "deaths" in the deaths compartment.

Before:

births <- rbind(c('parms$beta * S * I / (S+I) - (parms$gamma + parms$mu) * I'))
migrations <- rbind(c('0'))
deaths <- rbind(c('parms$gamma * I - parms$mu * R'))
nonDemeDynamics <- rbind(c('parms$mu * (I+R) - parms$beta * S * I / (S+I)'))

to:

births <- cbind(c('parms$beta * S * I / (S+I) - (parms$gamma + parms$mu) * I ', '- parms$mu * R'))
migrations <- rbind(c('0'))
deaths <- rbind(c('parms$gamma * I '))
nonDemeDynamics <- rbind(c('parms$mu * (I+R) - parms$beta * S * I / (S+I)'))

I would also have to change births to cbind instead of rbind so rcolgem's nrow(births) registers the extra expression.
Or I could change rcolgem.R to other specifications. But I don't know how much I should mess around with it.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 20, 2017

We don't want to incorporate the recovered compartment in the rcolgem expressions, so change this:

deaths <- rbind(c('parms$gamma * I - parms$mu * R'))

to this:

deaths <- rbind(c('parms$gamma * I'))

and exclude R from the initial conditions vector x0.

@gtng92
Copy link
Collaborator

gtng92 commented Jun 21, 2017

Update on the following error:

Error in sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) : 
  negative probability 

Parameter t.end = 30.*52
make.fgy returns

( 
  fgy[[1]],  # time axis of ODE solution
  fgy[[2]],  # births
  fgy[[3]],  # migrations
  fgy[[4]],  # deme sizes
  fgy[[5]]   # ODE solution
)
Plot for `x=fgy[[2]], y=fgy[[1]]`

rplot

I changed t.end to 1500 and it works ok for a single tree. Later when I run.smc, need to be more stringent up to about 1300 so you won't get negative probability.

@gtng92
Copy link
Collaborator

gtng92 commented Jun 28, 2017

Individual validation of each parameter:
Target tree parameters (true values):

theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0)

Configuration settings of priors:

> config$priors
$t.end
[1] "rnorm(n=1,mean=50,sd=5)"

$N
[1] "rnorm(n=1,mean=5000,sd=100)"

$beta
[1] "rgamma(n=1,shape=50,rate=1)"

$gamma
[1] "rgamma(n=1,shape=20,rate=1)"

$mu
[1] "rgamma(n=1,shape=20,rate=1)"

$alpha
[1] "rnorm(n=1,mean=1,sd=0.1)"

Kernel distances for varying N, using seq(100, 10100, 1000) # (from, to, step)
5000--100-10100-1000
Kernel distances for varying beta using seq(0.01, 0.36, 0.025)
0 1--0 01-0 36-0 025b
Kernel distances for varying gamma using seq(0.001, 0.08, 0.0025)
0 002--0 001-0 08-0 0025
Kernel distances for varying mu using seq(0.0001, 0.01, 0.0005)
0 00027--0 0001-0 01-0 0005

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 28, 2017 via email

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 28, 2017

Okay, now that I understand these are kernel distances I think we need to redo the analyses for mu and gamma to account for their low target values (0.00192 and 0.000274, respectively). Might also be worth applying a log-transform on the x-axis for these figures.

I also expect some of these parameters to be confounded, like beta and N, and mu and gamma. Eventually it will be informative to vary these pairs jointly and then look the distribution of kernel distances: see Rosemary's paper on contact networks and SMC-ABC for reference.

@gtng92
Copy link
Collaborator

gtng92 commented Jun 28, 2017

I've set gamma to 1/520. and mu to 1/3640. Should I switch the values?
norm.mode is set to 'NONE'

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 28, 2017 via email

@gtng92
Copy link
Collaborator

gtng92 commented Jun 29, 2017

Kernel distances for varying N, for nsim=100 and seq(1000, 10000, 1000)
In Rosemary's paper, it says that parameter N and m had little to no identifiability with ABC. Is this for a different kind of scenario?
n5000--1000-10000-1000
Kernel distances for beta for nsim=100 and seq(0.01, 0.28, 0.025)
beta0 1--0 01-0 28-0 025
Kernel distances for gamma for nsim=100 and seq(0.0005, 0.03, 0.0015)
gamma0 0019--0 0005-0 03-0 0015

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 29, 2017 via email

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jun 29, 2017

By the way, plots are much improved, thanks for refining the range on beta and gamma.
Until Monday, can you proceed with running ABC-SMC to assess performance and also to identify usability problems?

@gtng92
Copy link
Collaborator

gtng92 commented Jun 29, 2017

Yep, I'll look for a theta that reproduces the negative probability error.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 3, 2017

Try using the following approach to examine the distribution of kernel score over two model parameters:

plot(x, y, cex=sqrt(z))  # where x and y are two parameters and z is kernel score/distance

You can also use the rgb argument to plug in kernel distance as a red, green or blue argument (or two simultaneously).

@gtng92
Copy link
Collaborator

gtng92 commented Jul 5, 2017

Am I doing this correctly?
kdists4varyingbeta beta n
kdists4varyingn beta n

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 5, 2017 via email

@MathiasRenaud
Copy link
Collaborator

I have started a grid search script in pkg/examples/example-bd.R for birth-death model, which can be adapted to other models. Still working on the plot and testing.

@gtng92
Copy link
Collaborator

gtng92 commented Jul 11, 2017

This is the heatmap for N and beta with the different kernel distances. I can create a more specific one as well.
True value for beta is 0.1, and true value for N is 5000.
heatmap-beta-n

@gtng92
Copy link
Collaborator

gtng92 commented Jul 12, 2017

Heatmap for pairwise comparisons of gamma and mu.
True value for gamma is 0.00192, and true value for mu is 0.00027. It looks like the gamma values are consistently overestimated.
heatmap-gamma-mu

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 17, 2017

There are a few things to try here:

  1. Increase the resolution of numerical approximation for ODE (fgy.resol parameter) to reduce chance of numerical error causing negative probability.
  2. Ignore simulations where negative probability error arises and re-attempt simulation.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 17, 2017

There's a bug in compartmental-models.R, deaths term should not appear in births expression; for example:

    births <- rbind(c('parms$beta * S * I / (S+I) - parms$gamma * I'))

should be

    births <- rbind(c('parms$beta * S * I / (S+I)'))

The only place that parms$gamma * I should appear is in the subsequent deaths expression.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jul 24, 2017

Correcting the births matrix specification did not resolve the problem. Currently when the pseudorandom number generator is fixed with a given seed, the sixth particle being updated in SMC runs into a "negative probability" error (presumably associated with coalescent simulation).

  • Post the parameter values of particle 6 here.
  • Examine population trajectories associated with the parameter values of particle 6. If F, G and Y are negative, we need to determine why. Do the parameter values make sense?

@gtng92
Copy link
Collaborator

gtng92 commented Jul 24, 2017

The parameter values for some actually didn't make sense. In a previous comment, I've set the target tree with theta of:

theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0)

Configuration settings of priors beta, gamma, and mu in particular were previously:

> config$priors
$beta
[1] "rgamma(n=1,shape=50,rate=1)"
$gamma
[1] "rgamma(n=1,shape=20,rate=1)"
$mu
[1] "rgamma(n=1,shape=20,rate=1)"

And now that I've changed them to:

> config$priors
$beta
[1] "rgamma(n=1,shape=1,rate=5)"
$gamma
[1] "rgamma(n=1,shape=1,rate=10)"
$mu
[1] "rgamma(n=1,shape=0.2,rate=50)"

I can now initialize.smc for 100 particles (so far as I've tested).
Thanks!

@gtng92
Copy link
Collaborator

gtng92 commented Aug 17, 2017

Ran with parameters in commit 4d37ca4 executed with 500 particles finished in approximately 12.39 hours
Step 110 epsilon: 0.1023741 ESS: 187.2476 accept: 0.2204082 elapsed: 44595.8 s

Trajectories of the mean estimates of parameters in theta are able to be viewed in this file:
example-compartmental.pdf

In the plots it looks like there's still room to converge for parameters beta, gamma, and mu. But if I lower my accept rate, the runtime would take a while longer.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Aug 21, 2017

Check whether collecting samples with invalid values of t.end is skewing the other parameter estimates by fixing t.end to the actual value and then comparing to the current set of results (where invalid t.end can result in negative probabilities in the coalescent simulation, but only throws a warning and not an exception).

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Aug 28, 2017

@gtng92 observed that fixing to actual t.end is producing the negative probability error, so let's try using a uniform distribution with a lower bound at actual t.end and upper bound some small amount above this value, and varying these bounds until we no longer get this error.

@gtng92
Copy link
Collaborator

gtng92 commented Oct 30, 2017

From Issue #120, I decided to implement the idea of incorporating a dummy tree upon an error encountered in rcolgem. From there I was able to run a grid-search over a wide range of varying parameters for all params in theta (except alpha, which isn't used in an sir.dynamic model --> still working on stripping that part out when it's irrelevant to the model)

True Params:

theta <- c(t.end=200, N=10000, beta=0.1, gamma=0.2, mu=0.01, alpha=5)

Grid-search Params:

t.end <- seq(50, 400, 50)
N <- seq(7000, 14000, 1000)
beta <- seq(0.01, 0.36, 0.05)
gamma <- seq(0.01, 0.36, 0.05)
mu <- seq(0.001, 0.071, 0.01)
alpha <- 5

Average of top 10 thetas with lowest kernel distances:

avg.theta <- c(t.end=215, N=10900, beta=0.12, gamma=0.235, mu=0.03451, alpha=5)

Average of top 20 thetas:

avg.theta <- c(t.end=217.5, N=10277.78, beta=0.115, gamma=0.23, mu=0.0365, alpha=5)

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 30, 2017

Ok, careful about this though -- if there was no real difference among trees with respect to kernel score, one could arbitrarily pick 10 or 20 trees and the mean parameter estimates would still be around these values based on the way you've designed the grid. It will be more convincing when we see the actual distribution of distances over the grid. Thanks!

@gtng92
Copy link
Collaborator

gtng92 commented Oct 30, 2017

From viewing the distributions, varying t.end and N seem to have little to no effect on the kernel distance. beta and gamma in particular seemed the most estimable (agrees with what was estimable before).
beta-gamma-distance

It's a little hard to see the distribution in this plot, but I can show the distributions at the dev meeting with a 3D scatterplot that is rotatable.

@gtng92
Copy link
Collaborator

gtng92 commented Nov 6, 2017

Trajectory of Mean t.end seems to be improving, although admittedly my priors were relatively narrow. It does look better than the t.end trajectory in the PDF file linked in the comment from Aug 17, 2017. Parameter beta still seems to have a strange spike up to 40 iterations.
trajectory of mean t end

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 6, 2017

I think we're in danger of fishing for results here. Let's see what distances are like using other metrics. Thanks!

@gtng92
Copy link
Collaborator

gtng92 commented Nov 6, 2017

To check, I widened the prior to a normal distribution with a mean of 300 and a standard deviation of 100. True value was 200.
t end-mean300-sd100

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 14, 2017

Deprecated by #123

@ArtPoon ArtPoon closed this as completed Nov 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants