Confused by vocab in process_data.py, need help #46

Open
Larry955 opened this issue Sep 13, 2018 · 2 comments

Comments


Larry955 commented Sep 13, 2018

I'm a newcomer to Sentiment Analysis, and recently I've been trying to apply CNNs to it. Yoon's paper helped me a lot and I really appreciate that.

I want to understand every piece of code in this repo, but I ran into some trouble reading process_data.py. The variable vocab is a dictionary that should store the frequency of each word occurring in the MR data, i.e. {word: word_frequency}. But in the function build_data_cv, Yoon uses a set to store the words of each line, which means duplicate words are removed. In that case, how can we count the number of occurrences of each word?

    vocab = defaultdict(float)   # dict mapping each word to its frequency
    with open(pos_file, "rb") as f:
        for line in f:
            rev = []
            rev.append(line.strip())
            if clean_string:
                orig_rev = clean_str(" ".join(rev))
            else:
                orig_rev = " ".join(rev).lower()
            words = set(orig_rev.split()) # a set, so duplicate words within the current line are removed
            for word in words:
                vocab[word] += 1          # incremented once per line, not once per occurrence

Can anybody help me? Thanks a lot!
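To illustrate the behavior being asked about, here is a minimal, self-contained sketch (using a hypothetical two-review corpus, not the actual MR data) showing that incrementing over `set(words)` yields a *document* frequency, i.e. the number of reviews containing a word, rather than the word's total occurrence count:

```python
from collections import defaultdict

# Hypothetical mini-corpus of two reviews (illustration only)
reviews = ["great great great movie", "great acting"]

vocab = defaultdict(float)
for review in reviews:
    words = set(review.split())   # deduplicate within one review
    for word in words:
        vocab[word] += 1          # +1 per review, not per occurrence

print(vocab["great"])   # 2.0 -> "great" appears in 2 reviews (4 occurrences total)
```

So vocab[w] answers "in how many reviews does w appear?", which is the quantity the code actually uses downstream.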


talevy23 commented Sep 15, 2018

It seems we count frequency per review.
A word W is more likely to be an indicator of bad reviews if it appears in many bad reviews, rather than appearing many times in a single review.

This is used later when adding 'unknown words'.
If you scroll down the code, you'll find:

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.    
    0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25,0.25,k)  

Here, words that appear in fewer than min_df reviews don't get a vector.
I think it would have been clearer with a higher threshold.
For example: filter out words that appear in fewer than 10 reviews.
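A quick runnable sketch of that idea (the function is repeated from above so the example is self-contained; the vocab counts are made up for illustration). With min_df=10, a word seen in only 2 reviews gets no vector:

```python
import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """Create a random vector for words that occur in at least min_df reviews."""
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)

# Hypothetical document frequencies (number of reviews each word appears in)
vocab = {"good": 120.0, "bad": 95.0, "rare": 2.0}

word_vecs = {}
add_unknown_words(word_vecs, vocab, min_df=10, k=5)
print(sorted(word_vecs))   # ['bad', 'good'] -> "rare" is filtered out
```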

@Larry955
Author

@talevy23
Thanks a lot! Your explanation really inspired me and resolved my confusion. Filtering out words that appear in fewer than 10 (or any other number of) reviews is a good way to put it. From that we can conclude that the code only cares about how many reviews a word appears in, not about its frequency within a single review, right?
