I'm a newbie in sentiment analysis, and recently I've been trying to apply CNNs to it. Yoon's paper helped me a lot and I really appreciate it.
I want to understand every piece of code in this repo, but I ran into trouble while reading process_data.py. The variable vocab is a dictionary that should store the frequency of each word occurring in the MR data, i.e. {word: word_frequency}. But in the function build_data_cv, Yoon uses a set to store the words of each line, which means duplicate words are removed. In that case, how can we count the number of occurrences of each word?
vocab = defaultdict(float)  # dict mapping each word to its frequency
with open(pos_file, "rb") as f:
    for line in f:
        rev = []
        rev.append(line.strip())
        if clean_string:
            orig_rev = clean_str(" ".join(rev))
        else:
            orig_rev = " ".join(rev).lower()
        words = set(orig_rev.split())  # a set, so duplicate words within the current line are removed
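To see what that set() actually produces, here is a minimal sketch (the two-review corpus and variable names are invented for illustration): because each word is counted at most once per line, vocab ends up holding the document frequency of each word (the number of reviews containing it), not the total occurrence count.

```python
from collections import defaultdict

# Hypothetical two-review corpus for illustration
reviews = ["great great movie", "great plot"]

vocab = defaultdict(float)
for line in reviews:
    for word in set(line.split()):  # the set counts each word once per review
        vocab[word] += 1

print(vocab["great"])  # 2.0 -- "great" occurs in 2 reviews, though 3 times in total
```

So vocab["great"] is 2.0, not 3.0: duplicates are removed per review, but the counts still accumulate across reviews.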
Can anybody help me? Thanks a lot!
It seems we get a per-review frequency, i.e. a document frequency.
A word W is more likely to be an indicator of bad reviews if it appears in many bad reviews than if it appears many times in a single review.
This is later used when adding "unknown words".
If you scroll down the code, you'll find:
def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) the same variance as the pre-trained ones.
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)
Here, words that appear in fewer than min_df reviews get no vector at all.
I think it would have been clearer with a higher threshold, for example filtering out words that appear in fewer than 10 reviews.
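To sketch that filtering effect (the document-frequency counts below are invented): with min_df=10, a word seen in only a few reviews never receives a random vector, so it is effectively dropped from the vocabulary.

```python
import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """Give a random vector to every word occurring in at least min_df documents."""
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)

# Hypothetical document-frequency counts
vocab = {"rare": 3.0, "common": 42.0}
word_vecs = {}
add_unknown_words(word_vecs, vocab, min_df=10, k=5)
print(sorted(word_vecs))  # ['common'] -- "rare" appears in too few reviews
```

Note that the counts being document frequencies (thanks to the set()) is exactly what makes this threshold meaningful: it measures how many reviews a word touches, not how often it repeats.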
@talevy23
Thanks a lot! Your answer really inspired me and resolved my confusion. Filtering out words that appear in fewer than 10 (or any other number of) reviews is a good explanation. From that we can conclude that the code only cares about how many reviews a word appears in, and not about its frequency within a single review, right?