
vector_norm and similarity value incorrect #522

Closed
xuanyiguang opened this issue Oct 11, 2016 · 4 comments

@xuanyiguang

Somehow vector_norm is incorrectly calculated.

import spacy
import numpy as np
nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0

vector_norm is then used in similarity, which consequently always returns half of the correct value:

def similarity(self, other):
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

That is acceptable if the use case is only to rank similarity scores for synonyms, but the cosine similarity value itself is incorrect.
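For reference, a minimal workaround sketch that computes the cosine similarity directly from the raw vectors with NumPy, bypassing the cached vector_norm (the nlp object is the one loaded above; u"oranges" is just an illustrative second token):

    import numpy as np
    import spacy

    nlp = spacy.load("en")

    def cosine(u, v):
        # Compute norms from scratch instead of trusting the
        # cached (and, per this issue, wrong) vector_norm.
        norm_u = np.sqrt(np.dot(u, u))
        norm_v = np.sqrt(np.dot(v, v))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return np.dot(u, v) / (norm_u * norm_v)

    apples = nlp.vocab[u"apples"]
    oranges = nlp.vocab[u"oranges"]
    print(cosine(apples.vector, oranges.vector))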

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Oct 11, 2016
@honnibal (Member)

Thanks! Will figure this out.

honnibal added a commit that referenced this issue Oct 23, 2016
@honnibal (Member)

I think this is fixed in 1.0, but this bug makes me uneasy because I don't feel like I really understand what was wrong. I haven't had time to test 0.101.0 yet, but: you say the cosine was always half? I can't figure out why that should be...

What I've come up with is that this calculation looks unreliable:

        for orth, lex_addr in self._by_orth.items():
            lex = <LexemeC*>lex_addr
            if lex.lower < vectors.size():
                lex.vector = vectors[lex.lower]
                for i in range(vec_len):
                    lex.l2_norm += (lex.vector[i] * lex.vector[i])
                lex.l2_norm = math.sqrt(lex.l2_norm)
            else:
                lex.vector = EMPTY_VEC

The lex.l2_norm value is possibly uninitialised, so there may be a problem there. Passing a 32-bit float to the Python function math.sqrt is also suspicious. But if that were the problem, the results should have been unreliable and arbitrarily wrong. Always half?? Unsettling!
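If that hypothesis is right, the minimal fix would be to start the accumulator from an explicit zero rather than whatever the struct already holds. A plain-Python sketch of the intended computation (not the actual Cython patch):

    import math

    def l2_norm(vector):
        # Accumulate from an explicit 0.0 so a stale value left over
        # from deserialisation can never leak into the new norm.
        total = 0.0
        for x in vector:
            total += x * x
        return math.sqrt(total)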

@honnibal (Member)

Got it now.

The previous default vectors were already normalised, so a value of lex.l2_norm = 1 was stored in the lexemes.bin file. This was then read back into the LexemeC struct when the vocabulary was deserialised.

Later, I added the capability to load custom word vectors, which meant the L2 norm had to be recalculated. However, I didn't initialise lex.l2_norm to 0 before computing the new norm. Since the default vectors were normalised, their squared components sum to 1, and the accumulator already held the stale value 1, so the eventual norm was always sqrt(1 + 1) = sqrt(2). Both norms in the similarity denominator were therefore sqrt(2), making the denominator 2 and the similarity consistently half.
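A quick arithmetic check of that explanation, with illustrative unit vectors rather than spaCy's actual data:

    import numpy as np

    v = np.array([0.6, 0.8])   # unit-length, like the old default vectors
    w = np.array([0.8, 0.6])   # another unit vector

    stale = 1.0                                  # l2_norm read back from lexemes.bin
    buggy_norm = np.sqrt(stale + np.dot(v, v))   # sqrt(1 + 1) = sqrt(2)
    print(buggy_norm)                            # 1.4142..., matching the report above

    true_sim = np.dot(v, w)                       # correct cosine: 0.96
    buggy_sim = np.dot(v, w) / (buggy_norm ** 2)  # denominator is 2, not 1
    print(true_sim, buggy_sim)                    # 0.96 0.48 -- exactly half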

No tests checked the exact value returned by the similarity function. They only sanity-checked relative values. This has since been addressed.
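For completeness, a sketch of the kind of regression test that would catch this; the test name and tolerance here are hypothetical, not the test actually added:

    import numpy as np
    import spacy

    nlp = spacy.load("en")

    def test_similarity_exact_value():
        apples = nlp.vocab[u"apples"]
        # Check the absolute value, not just the ranking: a lexeme's
        # similarity with itself must be 1.0 within float tolerance.
        assert abs(apples.similarity(apples) - 1.0) < 1e-5
        # And vector_norm must agree with a norm computed from scratch.
        expected = np.sqrt(np.dot(apples.vector, apples.vector))
        assert abs(apples.vector_norm - expected) < 1e-5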

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018