
NaNs when training with the provided hyperparameters #8

Open
benjaminwilson opened this issue Mar 26, 2018 · 1 comment

@benjaminwilson

Hi,

I am having trouble training the WordNet nouns model with the hyperparameters provided in train-nouns.sh. Here is the state of the repo:

ubuntu@hyperbolic-words-2:~/poincare-embeddings$ git diff
diff --git a/train-nouns.sh b/train-nouns.sh
index c106f4a..df4118f 100755
--- a/train-nouns.sh
+++ b/train-nouns.sh
@@ -2,7 +2,7 @@

 # Get number of threads from environment or set to default
 if [ -z "$NTHREADS" ]; then
-   NTHREADS=5
+   NTHREADS=64
 fi

 echo "Using $NTHREADS threads"
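(Since the script only assigns NTHREADS when the variable is unset, the same effect could be had without the diff by exporting NTHREADS at launch. A minimal sketch of that default-handling logic from train-nouns.sh:)

```shell
# Sketch of the NTHREADS default logic in train-nouns.sh:
# an NTHREADS value set in the environment overrides the built-in default of 5.
if [ -z "$NTHREADS" ]; then
    NTHREADS=5
fi
echo "Using $NTHREADS threads"
```

(So e.g. `NTHREADS=64 nohup ./train-nouns.sh &` would reproduce this run with an unmodified script.)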

I am running the script with nohup ./train-nouns.sh &. As the output below shows, the loss becomes NaN from epoch 84 onwards.

ubuntu@hyperbolic-words-2:~/poincare-embeddings$ cat nohup.out
Using 64 threads
slurp: objects=82115, edges=743086
Indexing data
json_conf: {"distfn": "poincare", "dim": 10, "lr": 1, "batchsize": 50, "negs": 50}
Burnin: lr=0.01
json_log: {"epoch": 0, "loss": 1.7510217754905981, "elapsed": 116.03712247902877}
Burnin: lr=0.01
json_log: {"epoch": 1, "loss": 1.654937060126994, "elapsed": 118.86135302501498}
Burnin: lr=0.01
json_log: {"epoch": 2, "loss": 1.5647581015296042, "elapsed": 118.4134448269906}
Burnin: lr=0.01
json_log: {"epoch": 3, "loss": 1.4822902187269897, "elapsed": 117.75066023998079}
Burnin: lr=0.01
json_log: {"epoch": 4, "loss": 1.4054055172298818, "elapsed": 119.41065871602041}
Burnin: lr=0.01
json_log: {"epoch": 5, "loss": 1.3346420868443154, "elapsed": 119.11423735501012}
Burnin: lr=0.01
json_log: {"epoch": 6, "loss": 1.2689210829544222, "elapsed": 118.58688614101266}
Burnin: lr=0.01
json_log: {"epoch": 7, "loss": 1.2099971524190973, "elapsed": 118.22312185697956}
Burnin: lr=0.01
json_log: {"epoch": 8, "loss": 1.1548116808904323, "elapsed": 118.3734846469888}
Burnin: lr=0.01
json_log: {"epoch": 9, "loss": 1.1028053120125545, "elapsed": 119.84133693398326}
Burnin: lr=0.01
json_log: {"epoch": 10, "loss": 1.0538397419021133, "elapsed": 118.92894380999496}
Burnin: lr=0.01
json_log: {"epoch": 11, "loss": 1.0086800653686467, "elapsed": 118.51857428599033}
Burnin: lr=0.01
json_log: {"epoch": 12, "loss": 0.9661437821506956, "elapsed": 118.79905583200161}
Burnin: lr=0.01
json_log: {"epoch": 13, "loss": 0.9271851564789986, "elapsed": 119.18357229899266}
Burnin: lr=0.01
json_log: {"epoch": 14, "loss": 0.8898532564651916, "elapsed": 118.5679669189849}
Burnin: lr=0.01
json_log: {"epoch": 15, "loss": 0.8561349531560194, "elapsed": 118.50844964102725}
Burnin: lr=0.01
json_log: {"epoch": 16, "loss": 0.8238338795120215, "elapsed": 119.0086042000039}
Burnin: lr=0.01
json_log: {"epoch": 17, "loss": 0.7933640109146769, "elapsed": 118.66447035799501}
Burnin: lr=0.01
json_log: {"epoch": 18, "loss": 0.7658562077606146, "elapsed": 118.73805539900786}
Burnin: lr=0.01
json_log: {"epoch": 19, "loss": 0.7403765493110794, "elapsed": 118.37916503899032}
json_log: {"epoch": 20, "loss": 2.370282818574938, "elapsed": 377.01101255300455}
json_log: {"epoch": 21, "loss": 1.332919087618401, "elapsed": 373.0545672760054}
json_log: {"epoch": 22, "loss": 1.1209379775260888, "elapsed": 373.76439789202414}
json_log: {"epoch": 23, "loss": 0.9983822026230719, "elapsed": 372.82547582898405}
json_log: {"epoch": 24, "loss": 0.9136303973428944, "elapsed": 374.4872254100046}
json_log: {"epoch": 25, "loss": 0.853509014305552, "elapsed": 372.2331004269945}
json_log: {"epoch": 26, "loss": 0.8096088365475078, "elapsed": 373.190142884996}
json_log: {"epoch": 27, "loss": 0.7833678124650717, "elapsed": 364.4548651619989}
json_log: {"epoch": 28, "loss": 0.7738845424189945, "elapsed": 362.44732139699045}
json_log: {"epoch": 29, "loss": 0.7732465087031701, "elapsed": 363.7966738270188}
json_log: {"epoch": 30, "loss": 0.7745720795530572, "elapsed": 365.75540012700367}
json_log: {"epoch": 31, "loss": 0.7754874658808333, "elapsed": 364.4030886440014}
json_log: {"epoch": 32, "loss": 0.7791964033661933, "elapsed": 365.59433861100115}
json_log: {"epoch": 33, "loss": 0.7783978579808019, "elapsed": 363.1505782940076}
json_log: {"epoch": 34, "loss": 0.7788809554177886, "elapsed": 366.6854563309753}
json_log: {"epoch": 35, "loss": 0.7807947874874831, "elapsed": 362.40433710900834}
json_log: {"epoch": 36, "loss": 0.7802273132815123, "elapsed": 363.4012530000182}
json_log: {"epoch": 37, "loss": 0.7821110229706861, "elapsed": 363.1637753259856}
json_log: {"epoch": 38, "loss": 0.7834545301763879, "elapsed": 362.778968936007}
json_log: {"epoch": 39, "loss": 0.7821672492099232, "elapsed": 365.47163956501754}
json_log: {"epoch": 40, "loss": 0.7812606035833877, "elapsed": 372.4652839000046}
json_log: {"epoch": 41, "loss": 0.7831733638117153, "elapsed": 371.5480394010083}
json_log: {"epoch": 42, "loss": 0.7805352591396997, "elapsed": 365.6498189200065}
json_log: {"epoch": 43, "loss": 0.7841648086514452, "elapsed": 366.00326509500155}
json_log: {"epoch": 44, "loss": 0.7817570621436887, "elapsed": 364.9565300550021}
json_log: {"epoch": 45, "loss": 0.7792914231417969, "elapsed": 364.8312062040204}
json_log: {"epoch": 46, "loss": 0.7813605962990672, "elapsed": 364.26285549401655}
json_log: {"epoch": 47, "loss": 0.7832389085209808, "elapsed": 364.54356124799233}
json_log: {"epoch": 48, "loss": 0.7828953792766546, "elapsed": 362.8009843980253}
json_log: {"epoch": 49, "loss": 0.7830643432524169, "elapsed": 364.27069370500976}
json_log: {"epoch": 50, "loss": 0.7832194386006741, "elapsed": 362.9516918490117}
json_log: {"epoch": 51, "loss": 0.7803435126326296, "elapsed": 365.32360184399295}
json_log: {"epoch": 52, "loss": 0.7814182463081865, "elapsed": 364.3582421159954}
json_log: {"epoch": 53, "loss": 0.7808202591523791, "elapsed": 364.49702233899734}
json_log: {"epoch": 54, "loss": 0.7814219246435845, "elapsed": 366.30092610701104}
json_log: {"epoch": 55, "loss": 0.7812639302362918, "elapsed": 364.321928276011}
json_log: {"epoch": 56, "loss": 0.7805546298960221, "elapsed": 365.4223571420007}
json_log: {"epoch": 57, "loss": 0.7827168783831445, "elapsed": 369.57942718299455}
json_log: {"epoch": 58, "loss": 0.7826227238666216, "elapsed": 370.0409598569968}
json_log: {"epoch": 59, "loss": 0.7840471950221404, "elapsed": 369.9050895939872}
json_log: {"epoch": 60, "loss": 0.7821681343282576, "elapsed": 372.4341457049886}
json_log: {"epoch": 61, "loss": 0.7791859689612924, "elapsed": 370.6871205380012}
json_log: {"epoch": 62, "loss": 0.7812333104954267, "elapsed": 367.2430337770202}
json_log: {"epoch": 63, "loss": 0.7819590478411862, "elapsed": 371.23542637602077}
json_log: {"epoch": 64, "loss": 0.7797660082275362, "elapsed": 369.08145095501095}
json_log: {"epoch": 65, "loss": 0.7808701695771313, "elapsed": 371.47186528297607}
json_log: {"epoch": 66, "loss": 0.7825124721070259, "elapsed": 370.3029446750006}
json_log: {"epoch": 67, "loss": 0.7822453611017892, "elapsed": 370.0708697150112}
json_log: {"epoch": 68, "loss": 0.781655567370507, "elapsed": 372.9709478849836}
json_log: {"epoch": 69, "loss": 0.7807679479705355, "elapsed": 369.12473262500134}
json_log: {"epoch": 70, "loss": 0.7785301718368571, "elapsed": 367.71985385299195}
json_log: {"epoch": 71, "loss": 0.7821944001183523, "elapsed": 368.5803866839851}
json_log: {"epoch": 72, "loss": 0.7792014827696869, "elapsed": 369.225650538021}
json_log: {"epoch": 73, "loss": 0.7821118975729462, "elapsed": 370.6435987050063}
json_log: {"epoch": 74, "loss": 0.7816132332723827, "elapsed": 371.09823266702006}
json_log: {"epoch": 75, "loss": 0.780557655646636, "elapsed": 370.9333890530106}
json_log: {"epoch": 76, "loss": 0.7819239331816404, "elapsed": 368.4207367969793}
json_log: {"epoch": 77, "loss": 0.7809382741582769, "elapsed": 369.501590519998}
json_log: {"epoch": 78, "loss": 0.7819559281778887, "elapsed": 369.5081132330233}
json_log: {"epoch": 79, "loss": 0.7785417012889864, "elapsed": 369.03904611899634}
json_log: {"epoch": 80, "loss": 0.7799045569593641, "elapsed": 370.648248100013}
json_log: {"epoch": 81, "loss": 0.7813012339836487, "elapsed": 368.1522722489899}
json_log: {"epoch": 82, "loss": 0.7809676399123354, "elapsed": 368.78767832298763}
json_log: {"epoch": 83, "loss": 0.7801517064626365, "elapsed": 372.1423245120095}
json_log: {"epoch": 84, "loss": nan, "elapsed": 365.5702549460111}
json_log: {"epoch": 85, "loss": nan, "elapsed": 363.09633759901044}
json_log: {"epoch": 86, "loss": nan, "elapsed": 363.81353292398853}
json_log: {"epoch": 87, "loss": nan, "elapsed": 362.3099420409999}
json_log: {"epoch": 88, "loss": nan, "elapsed": 364.5262840819778}
json_log: {"epoch": 89, "loss": nan, "elapsed": 361.63322074498865}
json_log: {"epoch": 90, "loss": nan, "elapsed": 364.6625888540002}
json_log: {"epoch": 91, "loss": nan, "elapsed": 364.0965977109736}
json_log: {"epoch": 92, "loss": nan, "elapsed": 364.00835764498333}
json_log: {"epoch": 93, "loss": nan, "elapsed": 362.5855255270144}
json_log: {"epoch": 94, "loss": nan, "elapsed": 363.579538193997}
json_log: {"epoch": 95, "loss": nan, "elapsed": 362.10383523098426}
json_log: {"epoch": 96, "loss": nan, "elapsed": 365.4729586149915}
json_log: {"epoch": 97, "loss": nan, "elapsed": 364.0538249380188}
json_log: {"epoch": 98, "loss": nan, "elapsed": 362.8323765830137}
Process Process-66:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "embed.py", line 68, in control
    mrank, mAP = ranking(types, model, distfn)
  File "embed.py", line 38, in ranking
    ap_scores.append(average_precision_score(_labels, -_dists))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 199, in average_precision_score
    sample_weight=sample_weight)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 191, in _binary_uninterpolated_average_precision
    y_true, y_score, sample_weight=sample_weight)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 441, in precision_recall_curve
    sample_weight=sample_weight)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 324, in _binary_clf_curve
    assert_all_finite(y_score)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 54, in assert_all_finite
    _assert_all_finite(X.data if sp.issparse(X) else X)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
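(For what it's worth, the crash itself comes from sklearn's input validation: once the distances contain NaN, average_precision_score refuses them via assert_all_finite. A stdlib-only sketch of the kind of pre-check that would let the evaluation step skip gracefully instead of killing the worker process; the names here are illustrative, not taken from embed.py:)

```python
import math

def all_finite(values):
    """True when every value is finite -- the same condition that
    sklearn's assert_all_finite enforces before computing AP."""
    return all(math.isfinite(v) for v in values)

# Hypothetical distances once training has diverged:
dists = [0.31, 0.78, float("nan")]
if not all_finite(dists):
    print("skipping ranking evaluation: non-finite distances")
```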
...
@bheinzerling

For certain combinations of embedding dimension and thread count, I'm getting NaNs too.

For example, NaNs occur with 2-d embeddings and 24 threads, but not with 10-d embeddings and 24 threads, and not with 2-d embeddings and 8 threads.

So a workaround seems to be to reduce the number of threads or to increase the embedding dimensionality.
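(This pattern is consistent with asynchronous multi-threaded updates pushing low-dimensional points onto the boundary of the unit ball, where the Poincaré distance diverges. The usual stabilization, described in the Poincaré embeddings paper, is to project points back strictly inside the ball after each update. A stdlib sketch of that projection, with an illustrative eps:)

```python
import math

EPS = 1e-5  # illustrative margin keeping points strictly inside the ball

def project(vec, eps=EPS):
    """Rescale a point back inside the unit ball if an update pushed
    it out; Poincare distances blow up (NaN) at norm >= 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm >= 1.0:
        scale = (1.0 - eps) / norm
        return [x * scale for x in vec]
    return list(vec)
```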
