-
Notifications
You must be signed in to change notification settings - Fork 0
/
sec_setup.tex
139 lines (123 loc) · 8.18 KB
/
sec_setup.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
\subsection{Data}
\label{subsec:Data}
We exploit user-generated language from an online forum of football fans,
namely, the r/LiverpoolFC subreddit, one of the many communities hosted
by the Reddit platform.\footnote{\url{https://www.reddit.com}. We downloaded Reddit data using the Python package Praw: \url{https://pypi.python.org/pypi/praw/}.}
\newcite{del2018road} showed that this subreddit presents many characteristics that favour the creation and spread of linguistic innovations, such as a topic that reflects a strong external interest and high density of the connections among its users. This makes it a good candidate for our investigation.
We focus on a short period of eight years, between 2011 and 2017.
In order to enable a clearer observation of short-term meaning shift, we define two
non-consecutive time bins: the first one ($t_1$) contains data from
2011--2013 and the second one ($t_2$) from 2017.\footnote{These choices
ensure that the samples in these two time bins are approximately of the same size -- see Table~\ref{tab:data}. The
r/LiverpoolFC subreddit exists since 2009, but very little content
was produced in 2009--2010.}
We also use a large sample of community-independent language for the
initialization of the word vectors, namely, a random crawl from Reddit
in 2013.
Table~\ref{tab:data} shows the size of each sample.
\subsection{Model}
\label{subsec:Model}
We adopt the model proposed by \newcite{kim2014temporal}, a representative method for computing diachronic meaning shift.\footnote{The model was implemented using the Python package Gensim: \url{https://pypi.python.org/pypi/gensim/}.} While other methods might be equally suitable (see Section \ref{sect:Related_Work}), we expect our results not to be method-specific, because they concern general properties of short-term shift, as we show in Sections~\ref{sec:types} and~\ref{sec:results}.
In the model by \newcite{kim2014temporal}, word
embeddings for the first time bin $t_1$ are initialized randomly; then,
given a sequence of time-related samples, embeddings for $t_i$
are initialized using the embeddings of $t_{i-1}$ and further
updated.
If at $t_i$ the word is used in the same contexts as in $t_{i-1}$, its embedding will only be marginally updated, whereas a major change in the context of use will lead to a stronger update of the embedding. The model makes embeddings across time bins directly comparable.
We implement the following steps:
First, we create randomly initialized word embeddings with the large sample \redd\ to obtain
meaning representations that are community-independent.
We then use these embeddings
to initialize those in LiverpoolFC$_{13}$, update the vectors on this
sample, and thus obtain embeddings for time $t_1$. This step
adapts the general embeddings %from \redd\ to the
to the LiverpoolFC community. Finally, we
initialize the word embeddings for LiverpoolFC$_{17}$ with those of
$t_1$, train on this sample, and get embeddings for $t_2$.
%\footnote{We implement the model
%using the \texttt{gensim} library.}
%To obtain word embeddings for time $t_2$, we
%initialize the word embeddings for LiverpoolFC$_{17}$ with those of $t_1$,
%and train skip-gram further on the
%LiverpoolFC$_{17}$ data.
%We define the vocabulary to be represented as the intersection of the vocabularies of the three samples (\redd, LiverpoolFC$_{13}$, LiverpoolFC$_{17}$) which results in a vocabulary of 157k words.
The vocabulary is defined as the intersection of the
vocabularies of the three samples (\redd, LiverpoolFC$_{13}$,
LiverpoolFC$_{17}$), and includes 157k words.
%\todo{What is the vocabulary size?}
%\redd~embeddings are initialized randomly.
For \redd, we include only words that occur at least 20 times in the
sample, so as to ensure meaningful representations for each word,
while for the other two samples we do not use any frequency
threshold: Since the embeddings used for the initialization of
LiverpoolFC$_{13}$ encode community-independent meanings, if a word doesn't occur in
LiverpoolFC$_{13}$ its representation will simply be as in \redd,
which reflects the idea that if a word is not used in a community, then its meaning is not altered within
that community.
%Instead, if a word's meaning changes in the
%community, then we expect the word embedding to change accordingly
%after training on the community-specific data.
%Analogous reasoning applies to LiverpoolFC$_{17}$.
We train with standard skip-gram parameters \cite{levy2015improving}: window 5, learning rate 0.01, embedding dimension 200, hierarchical softmax.
%----- TABLE ----------------------------------------------------
\begin{table}[t!]
\centering
\begin{tabular}{lccc}
\bf sample & \bf time bin & \bf million tokens \\
\hline
\redd & 2013 & {\raise.17ex\hbox{$\scriptstyle\sim$}}900 \\
LiverpoolFC$_{13}$ & 2011--13 & ~ 8.5\\
LiverpoolFC$_{17}$ & 2017 & 11.9\\ \hline
\end{tabular}
\caption{Time bin and size of the datasets.}
\label{tab:data}
%\vspace*{-12pt}
\end{table}
%----- END TABLE -----------------------------------------
%===== EVALUATION DATASET ===============
%\paragraph{Evaluation dataset.}
\subsection{Evaluation dataset}
\label{subsec:Evaluation dataset}
Our dataset consists of 97 words from the r/LiverpoolFC subreddit with
annotations by members of the subreddit ---that is, community members
with domain knowledge (needed for this task) but no linguistic
background.
To ensure that we would get enough cases of semantic shift to enable a
meaningful analysis, we started out from content words that increase
their relative frequency between $t_1$ and $t_2$.\footnote{Frequency
increase has been shown to positively correlate with meaning change
\cite{wijaya2011understanding,kulkarni2015statistically}; although
it is not a necessary condition, it is a reasonable starting point,
as a random selection of words would contain very few positive
examples. Our dataset is thus biased towards precision over recall.}
A threshold of 2 standard deviations above the mean yielded
$\sim$200 words.~The first author manually identified 34 semantic
shift candidates among these words by analyzing their contexts of use
in the r/LiverpoolFC data. Semantic shift is defined here as a change
in the ontological type that a word denotes, which takes place when the word starts to be used to denote an entity which is different from the one originally denoted and the new use spreads among the members of a community (see examples in Sec.~\ref{sec:types}).
We added two types of confounders: 33 words
with a significant frequency increase but not marked as meaning
shift candidates, and 33 words with constant frequency between $t_1$
and $t_2$, included as a sanity check. All words have
absolute frequency in range [50--500].
The participants were shown the 100 words (in randomized order)
together with randomly chosen contexts of usage from each time period
($\mu$=4.7 contexts per word) and, for simplicity, were asked to make
a binary decision about whether there was a change in meaning. In order to have the redditors familiarize with the concept of meaning change, we first provide them with an intuitive, non-technical definition, and then a set of cases that exemplify it. The instructions to participants can be found in the project's GitHub repository (see footnote \ref{note1}).
Semantic shift is arguably a graded notion. In line with a suggestion by \newcite{KutuzovEtal-coling2018} to account for this fact, we aggregate the annotations into a graded \emph{semantic shift index},
ranging from 0 (no shift) to 1 (shift) depending on how many subjects
spotted semantic change . The shift index is exclusively based
on the judgments of the redditors, and does not consider the
preliminary candidate selection done by us. Overall, 26 members
of r/LiverpoolFC participated in the survey, and each word received on
average 8.8 judgements.
%Three words were discarded after analysis of
%the redditor data.\footnote{The words are: `discord' and `owls' due to
% the homonymy with proper names not detected during survey's
% implementation; `tracking' because the chosen examples clearly
% mislead the judgements of the redditors.}
Further details about the dataset are in Appendix \ref{sec:further_details_data_model}.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: