
Training Accuracy much lower than Validation Accuracy #9

Closed
Tiiiger opened this issue Jan 15, 2019 · 15 comments

Comments

@Tiiiger

Tiiiger commented Jan 15, 2019

While running the sample code, I found that the training accuracy is much lower than the validation accuracy, which is different from training GraphSAGE on Reddit in their repo. Is this normal?

For example, the logging I got:
Epoch: 0042 train_loss= 1.72849 train_acc= 0.66406 val_loss= 3.17200 val_acc= 0.90848 time per batch= 0.01104
Epoch: 0043 train_loss= 1.84603 train_acc= 0.59375 val_loss= 3.18259 val_acc= 0.90506 time per batch= 0.01108
Epoch: 0044 train_loss= 1.86952 train_acc= 0.60156 val_loss= 3.17415 val_acc= 0.90324 time per batch= 0.01116

Also, the paper reports the F1 measure. How can I get the F1 score with this codebase?

@Tiiiger
Author

Tiiiger commented Jan 15, 2019

Hi,

I tried to compute the F1 score, but with the default (released) hyperparameters the best F1 score I get with the final model is only about 92. However, when I lower the learning rate to 0.0001, the best F1 score is 93.2.
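(The sketch below shows roughly how such a micro-F1 can be computed outside the TensorFlow graph, assuming the test logits and one-hot labels are exported as NumPy arrays; the helper name micro_f1 is just illustrative.)

import numpy as np
from sklearn.metrics import f1_score

def micro_f1(logits, labels_onehot):
    # logits, labels_onehot: NumPy arrays of shape (num_nodes, num_classes)
    y_pred = np.argmax(logits, axis=1)
    y_true = np.argmax(labels_onehot, axis=1)
    return f1_score(y_true, y_pred, average="micro")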

Could you kindly tell me the correct hyperparameters to replicate the result reported in the paper?

Thank you!

@matenure
Owner

matenure commented Jan 15, 2019

Trying these files instead will probably help you get the correct results:
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J

As for the hyperparameters, the most important one is the learning rate. The best learning rate can differ across datasets; we select it from [0.01, 0.001, 0.0001]. 0.001 is the default and always works in the sense that it at least achieves decent results. On Reddit, 0.0001 might be the best choice.
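A minimal sketch of such a selection loop (train_and_evaluate below is a hypothetical helper standing in for one run of the released training script that returns the validation score):

best_lr, best_score = None, -1.0
for lr in [0.01, 0.001, 0.0001]:
    val_score = train_and_evaluate(lr)   # hypothetical: one full training run at this lr
    if val_score > best_score:
        best_lr, best_score = lr, val_score
print("selected learning rate:", best_lr)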

@Tiiiger
Author

Tiiiger commented Jan 15, 2019

Thank you for the quick response. I tried the attached files, but that did not solve the problem with the training accuracy. May I ask which version of TensorFlow you used to run these scripts?

@matenure
Owner

Sorry, I need to take back my words. I checked our log, and the training accuracy is indeed around 0.65, the same as yours, so it is not a problem caused by the TensorFlow version. Your result is correct.
Here is the possible explanation:

It is possible for the training accuracy to be lower than the validation accuracy because, in the current implementation, the former is computed with sampling (for efficiency) while the latter is not. Sampling yields an approximation that is consistent but not unbiased, and hence the approximation may be far from the truth. In the paper we propose to use sampling only for learning the model parameters, not for computing predictions; i.e., in the test/validation phase we do not sample.
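As a toy illustration (not from the repo) of how far a small sample can be from the exact result in a single propagation step, one can importance-sample columns of a random matrix and compare against the exact product; the error shrinks as the sample size grows:

import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16
A = (rng.random((256, n)) < 0.01).astype(float)   # random 0/1 "adjacency" block
H = rng.standard_normal((n, d))                   # node representations
exact = A.dot(H)

p = A.sum(axis=0) / A.sum()                       # column-norm sampling probabilities
for s in [25, 100, 400, 1600]:
    q = rng.choice(n, size=s, p=p)                # sampled columns (with replacement)
    approx = A[:, q].dot(np.diag(1.0 / (p[q] * s))).dot(H[q, :])
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(s, round(rel_err, 3))                   # relative error shrinks as s grows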

@Tiiiger
Author

Tiiiger commented Jan 17, 2019

Thank you! With learning rate 0.001 I successfully replicated the F1 reported in the paper. Great job!

However, I tried to benchmark FastGCN and GraphSAGE on a GPU (a 1080 Ti) and got some unexpected results. Using the default early-stopping criteria:
FastGCN takes ~300s
GraphSAGE-mean (small) ~80s
GraphSAGE-LSTM ~300s.

Is this the correct behavior? I see in the paper that you only report the wall clock time on CPU. Did you also measure the GPU performance?

Thank you again!

@matenure
Owner

matenure commented Jan 25, 2019

@Tiiiger Sorry for my late reply due to a conference deadline. We did not compare performance on GPU. All the early experiments (including the hyperparameter tuning) were done on a machine without a GPU, so we ran the remaining experiments on CPU as well. The code and hyperparameters were not optimized for GPU.
Moreover, the total training time is largely affected by the optimization algorithm, learning rate, batch size, and stopping criteria, which is why we also care about the running time per epoch (sometimes even more). For example, if you increase the learning rate, lower the patience of the stopping criterion, or change the optimization algorithm (e.g. Adam -> RMSProp), the accuracy may be slightly lower, but the number of epochs needed to complete the experiments will be much smaller, with the time per epoch almost unchanged.
BTW, I noticed that the sample size in the released code is "None", which has to be changed...

@Tiiiger
Author

Tiiiger commented Jan 30, 2019

Thank you!

@Tiiiger Tiiiger closed this as completed Jan 30, 2019
@Zian-Zhou

Zian-Zhou commented Jul 10, 2019

Here is the possible explanation:

It is possible for the training accuracy to be lower than the validation accuracy because, in the current implementation, the former is computed with sampling (for efficiency) while the latter is not. Sampling yields an approximation that is consistent but not unbiased, and hence the approximation may be far from the truth. In the paper we propose to use sampling only for learning the model parameters, not for computing predictions; i.e., in the test/validation phase we do not sample.

@matenure
Hello! I don't quite understand your explanation, but I think this problem really does arise during the sampling process. When I studied the logic of the sampling code in pubmed_inductive_appr2layers.py, I found the following problem. Let's have a look (please execute the code after loading the Pubmed dataset~).

# Assumes the Pubmed dataset and the helpers from pubmed_inductive_appr2layers.py
# (iterate_minibatches_listinputs, column_prop, sparse_to_tuple, normADJ_train,
# y_train, train_mask, train_features, numNode_train) are already loaded.
import numpy as np
import scipy.sparse as sp

rank0, rank1 = 100, 100
for batch in iterate_minibatches_listinputs([normADJ_train, y_train, train_mask], batchsize=256, shuffle=True):
    [normADJ_batch, y_train_batch, train_mask_batch] = batch
    break  # inspect a single minibatch

p1 = column_prop(normADJ_batch)
q1 = np.random.choice(np.arange(numNode_train), rank1, p=p1)   # layer-1 samples
support1 = sparse_to_tuple(normADJ_batch[:, q1].dot(sp.diags(1.0 / (p1[q1] * rank1))))
p2 = column_prop(normADJ_train[q1, :])
q0 = np.random.choice(np.arange(numNode_train), rank0, p=p2)   # layer-0 samples
support0 = sparse_to_tuple(normADJ_train[q1, :][:, q0])
features_inputs = sp.diags(1.0 / (p2[q0] * rank0)).dot(train_features[q0, :])

A = normADJ_train[q1, :][:, q0].toarray()
X = features_inputs
W = np.random.rand(500, 3)                   # random weights simulating one GCN layer

res = A.dot(X.dot(W))                        # sampled propagation result
np.count_nonzero(res[:, 0]) / res.shape[0]   # ratio of nonzero rows

The result of 'A.dot(X.dot(W))' usually has lots of zero rows, which means there will be zero rows in the sampled GCN output for the batch nodes, and these rows carry no information. As a result, in each batch some nodes are never actually used for training, and the training accuracy has an upper bound.

Adjusting the sampling sizes rank0 and rank1 helps, but does not solve the problem completely. You can try it and compute the ratio of nonzero rows with 'np.count_nonzero(res[:,0])/res.shape[0]'. When rank0 >> rank1 >> batch_size, the ratio is close to 1, and then the upper bound on the training accuracy is close to 1 too!
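For example, the snippet above can be re-run with different sample sizes (a sketch reusing the variables already loaded from pubmed_inductive_appr2layers.py):

for rank0, rank1 in [(100, 100), (400, 200), (1600, 400)]:
    p1 = column_prop(normADJ_batch)
    q1 = np.random.choice(np.arange(numNode_train), rank1, p=p1)
    p2 = column_prop(normADJ_train[q1, :])
    q0 = np.random.choice(np.arange(numNode_train), rank0, p=p2)
    A = normADJ_train[q1, :][:, q0].toarray()
    X = sp.diags(1.0 / (p2[q0] * rank0)).dot(train_features[q0, :])
    res = A.dot(X.dot(np.random.rand(500, 3)))
    print(rank0, rank1, np.count_nonzero(res[:, 0]) / res.shape[0])   # nonzero-row ratio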

I hope this helps explain the problem we mentioned~

@matenure
Owner

@Zian-Zhou Thank you for the information. That is indeed a problem. Actually, in our final model "train_batch_multiRank_inductive_reddit_Mixlayers_sampleA" you can see that we already solved that problem and sample only nonzero rows.
I just re-ran the code, and the training accuracy looks correct.

@Zian-Zhou

Zian-Zhou commented Jul 11, 2019

@matenure
I ran the code on the Pubmed data, but the training accuracy is still low and hits an upper bound. So I checked the outputs during the process, and there are also lots of zero rows.
I think the sampling code makes sure that the rank1 chosen nodes are connected to the batch nodes (in other words, no sampled node is disconnected from all of the batch nodes). However, the sampling results show that some batch nodes are not connected to any of the rank1 sampled nodes, so there are still lots of zero rows.

A = normADJ_train[q1, :][:, q0].toarray()
X = features_inputs
w = np.random.rand(500, 3)
res = A.dot(X.dot(w))                        # simulated sampled GCN output
np.count_nonzero(res[:, 0]) / res.shape[0]   # ratio of nonzero rows

"A.dot(X.dot(w))" simulate a result of GCN with sampling. Hope you check it by yourself with Pubmed dataset and Reddit~

@matenure
Owner

@Zian-Zhou Is this result from the "appr2layers" model? Then you are right; we have not changed the code there. Actually, it is even more consistent with our theory, but in practice, as you said, avoiding zero rows is better. You can refer to the code in "**sampleA", where we sample only non-zero rows to avoid this problem.
I may also check the code again and probably change "appr2layers" later.

@Zian-Zhou

@matenure
No, I tried the new sampling code on the Pubmed data.

# Same assumptions as before: the Pubmed data and the script's helpers are already loaded.
rank1 = 100

for batch in iterate_minibatches_listinputs([normADJ_train, y_train], batchsize=256, shuffle=True):
    [normADJ_batch, y_train_batch] = batch
    break  # inspect a single minibatch

p0 = column_prop(normADJ_train)
distr = np.nonzero(np.sum(normADJ_batch, axis=0))[1]   # columns connected to the batch
q1 = np.random.choice(distr, rank1, replace=False, p=p0[distr] / sum(p0[distr]))
support1 = sparse_to_tuple(normADJ_batch[:, q1].dot(sp.diags(1.0 / (p0[q1] * rank1))))

features_inputs = train_features[q1, :]      # selected nodes for approximation

A = normADJ_batch[:, q1].toarray()
X = features_inputs
w = np.random.rand(500, 3)                   # random weights simulating one GCN layer

res = A.dot(X.dot(w))                        # simulated sampled GCN output
np.count_nonzero(res[:, 0]) / res.shape[0]   # ratio of nonzero rows

The results of "res = A.dot(X.dot(w))" and "np.count_nonzero(res[:,0])/res.shape[0]" will show the issue.

@Zian-Zhou

Zian-Zhou commented Jul 11, 2019

@matenure I built a toy dataset. Please have a look~

import numpy as np

rank1 = 2
batch_size = 8
normADJ_train = np.array([[1,1,0,0,0,0,0,0],
                          [0,1,1,0,0,0,0,0],
                          [0,0,1,1,0,0,0,1],
                          [0,0,0,1,0,0,0,0],
                          [0,0,0,0,1,1,0,0],
                          [0,0,0,0,0,1,1,0],
                          [0,0,0,0,0,0,1,1],
                          [0,0,0,0,0,0,0,1]
                          ])
train_features = np.random.rand(8,10)

def column_prop(adj):
    column_norm = np.sum(adj,axis=0)
    norm_sum = np.sum(column_norm)
    return column_norm/norm_sum

normADJ_train+=normADJ_train.T

normADJ_batch = normADJ_train[0:batch_size,:]

p0 = column_prop(normADJ_batch)

if rank1 is None:
    pass  # no sampling in this case:
    # support1 = sparse_to_tuple(normADJ_batch)
    # features_inputs = train_features
else:
    distr = np.nonzero(np.sum(normADJ_batch, axis=0))[0]
    if rank1 > len(distr):
        q1 = distr
    else:
        q1 = np.random.choice(distr, rank1, replace=False, p=p0[distr]/sum(p0[distr]))

A = normADJ_batch[:, q1]
X = train_features[q1,:]
w = np.random.rand(10,3)


res = A.dot(X.dot(w))                        # simulated sampled GCN output
np.count_nonzero(res[:, 0]) / res.shape[0]   # ratio of nonzero rows

Change rank1 or batch_size and run the code; I think it will show the issue.

@matenure
Owner

@Zian-Zhou I got it. Yes, you are right: there will be zero rows with our sampling method, and I do not think it is easy to avoid this with batch training (except for batch size 1). But I do not think it should affect the training accuracy...
Then I checked my local code, compared it with GitHub, and found that they differ in how the training accuracy is calculated... In my local code it has been changed to
"train_cost, train_acc, _ = evaluate(train_features, sparse_to_tuple(normADJ_train), y_train, placeholders)"
i.e., we use the "evaluate" method to get the training accuracy. In the code on GitHub, the training accuracy is the accuracy over the sampled batches; of course, the zero rows matter a lot in that case. But one batch's accuracy does not mean much: because we resample nodes every batch and every epoch, we eventually use all nodes. That is why our sampling method still leads to an optimization result consistent with the original GCN.

Anyway, thank you for the discussion. I suggest using "evaluate" to get the "real" training accuracy.
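A sketch of where such a call could sit in the training loop (only the evaluate call itself comes from the line above; the loop and the other names are placeholders for the script's own variables):

full_train_support = sparse_to_tuple(normADJ_train)   # full, unsampled training graph
for epoch in range(num_epochs):                       # num_epochs: placeholder
    # ... the usual sampled minibatch updates for this epoch ...
    train_cost, train_acc, _ = evaluate(train_features, full_train_support, y_train, placeholders)
    print("epoch", epoch, "real train_acc =", train_acc)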

@Zian-Zhou

@matenure I completely agree!

I had intended to mask the batch nodes corresponding to the zero rows, but that is unnecessary. I will try using evaluate() to check the training accuracy, and I believe it will be right. If I want to run the experiment on a larger graph, though, it may take some time.

Thanks for your kind help! What great work!
