
vector_norm and similarity value incorrect #522

Closed
xuanyiguang opened this issue Oct 11, 2016 · 4 comments

@xuanyiguang

Somehow vector_norm is incorrectly calculated.

import spacy
import numpy as np
nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0

vector_norm is then used in similarity, which consequently always returns half of the correct value:

def similarity(self, other):
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

That is acceptable if the use case is only to rank similarity scores for synonyms, but the cosine similarity value itself is incorrect.
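For reference, a minimal workaround sketch that computes the cosine similarity directly from the raw vectors with NumPy, bypassing the cached vector_norm (the nlp object is the one loaded above; u"oranges" is just an illustrative second token):

    import numpy as np
    import spacy

    nlp = spacy.load("en")

    def cosine(u, v):
        # Compute norms from scratch instead of trusting the
        # cached (and, per this issue, wrong) vector_norm.
        norm_u = np.sqrt(np.dot(u, u))
        norm_v = np.sqrt(np.dot(v, v))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return np.dot(u, v) / (norm_u * norm_v)

    apples = nlp.vocab[u"apples"]
    oranges = nlp.vocab[u"oranges"]
    print(cosine(apples.vector, oranges.vector))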

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Oct 11, 2016
@honnibal (Member)

Thanks! Will figure this out.

honnibal added a commit that referenced this issue Oct 23, 2016
@honnibal (Member)

I think this is fixed in 1.0, but this bug makes me uneasy because I don't feel like I really understand what was wrong. I haven't had time to test 0.101.0 yet, but: you say the cosine was always half? I can't figure out why that should be...

What I've come up with is that this calculation looks unreliable:

        for orth, lex_addr in self._by_orth.items():
            lex = <LexemeC*>lex_addr
            if lex.lower < vectors.size():
                lex.vector = vectors[lex.lower]
                for i in range(vec_len):
                    lex.l2_norm += (lex.vector[i] * lex.vector[i])
                lex.l2_norm = math.sqrt(lex.l2_norm)
            else:
                lex.vector = EMPTY_VEC

The lex.l2_norm value is possibly uninitialised, so there may be a problem there. Passing a 32-bit float to the Python function math.sqrt is also suspicious. But if that were the problem, the results should have been unreliable and arbitrarily wrong. Always half?? Unsettling!
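If that hypothesis is right, the minimal fix would be to start the accumulator from an explicit zero rather than whatever the struct already holds. A plain-Python sketch of the intended computation (not the actual Cython patch):

    import math

    def l2_norm(vector):
        # Accumulate from an explicit 0.0 so a stale value left over
        # from deserialisation can never leak into the new norm.
        total = 0.0
        for x in vector:
            total += x * x
        return math.sqrt(total)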

@honnibal (Member)

Got it now.

The previous default vectors were already normalised, so a value of lex.l2_norm = 1 was stored in the lexemes.bin file. This was then read back into the LexemeC struct when the vocabulary was deserialised.

Later, I added the capability to load custom word vectors, which meant the L2 norm had to be recalculated. However, I didn't initialise lex.l2_norm to 0 before computing the new norm. Since the default vectors were normalised, their squared components sum to 1, and the accumulator already held the stale value 1, so the eventual norm was always sqrt(1 + 1) = sqrt(2). Both norms in the similarity denominator were therefore sqrt(2), making the denominator 2 and the similarity consistently half.
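A quick arithmetic check of that explanation, with illustrative unit vectors rather than spaCy's actual data:

    import numpy as np

    v = np.array([0.6, 0.8])   # unit-length, like the old default vectors
    w = np.array([0.8, 0.6])   # another unit vector

    stale = 1.0                                  # l2_norm read back from lexemes.bin
    buggy_norm = np.sqrt(stale + np.dot(v, v))   # sqrt(1 + 1) = sqrt(2)
    print(buggy_norm)                            # 1.4142..., matching the report above

    true_sim = np.dot(v, w)                       # correct cosine: 0.96
    buggy_sim = np.dot(v, w) / (buggy_norm ** 2)  # denominator is 2, not 1
    print(true_sim, buggy_sim)                    # 0.96 0.48 -- exactly half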

No tests checked the exact value returned by the similarity function. They only sanity-checked relative values. This has since been addressed.
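For completeness, a sketch of the kind of regression test that would catch this; the test name and tolerance here are hypothetical, not the test actually added:

    import numpy as np
    import spacy

    nlp = spacy.load("en")

    def test_similarity_exact_value():
        apples = nlp.vocab[u"apples"]
        # Check the absolute value, not just the ranking: a lexeme's
        # similarity with itself must be 1.0 within float tolerance.
        assert abs(apples.similarity(apples) - 1.0) < 1e-5
        # And vector_norm must agree with a norm computed from scratch.
        expected = np.sqrt(np.dot(apples.vector, apples.vector))
        assert abs(apples.vector_norm - expected) < 1e-5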

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018