abstract
(1) Introduction [background]
* evaluation [esp. automatic evaluation]
* popular current measures BLEU, TER
* challenges - motivate search for better measures
* other extensions:
- METEOR - stemming and synonym tables; tuned but not generally tuneable
- TERp
- development-set tuning by user
- stemming and synonym tables
- paraphrase tables
* prior work with (partial) syntactic locality:
- Liu & Gildea
- Roark et al. Sparseval
- Owczarzak et al. 2007
* Evaluation of evaluation measures
- correlation vs. human fluency/adequacy
- correlation vs. human-in-the-loop measures (HTER)
* one challenge of metric evaluation: underlying document difficulty
- describe mean-subtraction (sketched below)
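[ sketch: mean-subtraction -- a minimal Python sketch, assuming scores
arrive as a map from (document, system) pairs to raw metric or HTER
values; all names are illustrative ]
    from collections import defaultdict

    def mean_subtract(scores):
        """Remove each document's mean across systems, so residuals
        reflect system quality rather than underlying document
        difficulty."""
        by_doc = defaultdict(list)
        for (doc, system), value in scores.items():
            by_doc[doc].append(value)
        doc_mean = {d: sum(vs) / len(vs) for d, vs in by_doc.items()}
        return {(doc, system): value - doc_mean[doc]
                for (doc, system), value in scores.items()}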
(2) Experimental paradigm
This section explains the space of measures to explore
* F-measure over various bags of components of parse tree
[ Figure describing Sparseval-ish variant ]
* Listing possible component valences
- dependent + labeled link + head
- outbound only (dependent + label)
- inbound only (label + head)
- word-word (skipping label)
- 1-grams
- 2-grams
- ...
Note: inbound links & full pairs "overweight" head words; this is a
good thing (see the decomposition sketch after the figure below).
[ figure demonstrating F-measure over d+l+h decomposition of toy
trees ]
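[ sketch: the d+l+h decomposition and bag-level F -- a minimal Python
sketch, assuming each parse is given as labeled dependency triples
(dependent, label, head); names are illustrative ]
    from collections import Counter

    def decompose(triples, valences=("dl", "lh")):
        """Break each labeled link into the requested partial-valence
        items. Head words recur in every lh and dlh item, so they are
        intentionally counted more often than leaf words."""
        bag = Counter()
        for d, l, h in triples:
            if "dlh" in valences: bag[("dlh", d, l, h)] += 1
            if "dl" in valences:  bag[("dl", d, l)] += 1   # outbound
            if "lh" in valences:  bag[("lh", l, h)] += 1   # inbound
            if "dh" in valences:  bag[("dh", d, h)] += 1   # word-word
        return bag

    def f_measure(hyp_bag, ref_bag):
        """Harmonic mean of clipped bag precision and recall."""
        overlap = sum((hyp_bag & ref_bag).values())
        p = overlap / max(sum(hyp_bag.values()), 1)
        r = overlap / max(sum(ref_bag.values()), 1)
        return 2 * p * r / (p + r) if p + r else 0.0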
* Tree features may be extracted from a 1-best tree, or from an
n-best tree list
- counts are an expectation over the forest
- n parameter: how many trees in the forest
- gamma parameter for flattening overconfident parse hypotheses
note intentional similarities to Owczarzak et al. (2007); a sketch of
the expected counts follows
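[ sketch: expected counts over an n-best list -- reuses decompose()
from the sketch above; the exp(gamma * logprob) weighting is an
assumption about how the flattening is applied ]
    import math
    from collections import Counter

    def expected_bag(nbest, gamma=0.25, n=50):
        """nbest: best-first list of (logprob, triples) pairs.
        gamma=0 weights all n trees uniformly; gamma=1 keeps the
        parser's (often overconfident) distribution."""
        hyps = nbest[:n]
        best = max(lp for lp, _ in hyps)          # for stability
        weights = [math.exp(gamma * (lp - best)) for lp, _ in hyps]
        z = sum(weights)
        bag = Counter()
        for w, (_, triples) in zip(weights, hyps):
            for item, count in decompose(triples).items():
                bag[item] += (w / z) * count      # fractional counts
        return bag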
* Components in the basic DPM measure are combined in various ways
(both schemes are sketched after this list):
- combination of precision, recall scores for different
components (on analogy with BLEU combination), naive
assumption: all weighted equally
- simple F over items of all valences (naive assumption: all
weighted equally)
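[ sketch: the two naive combinations -- reuses decompose() and
f_measure() from above; whether prmeans takes an arithmetic or
geometric mean is an assumption (arithmetic here) ]
    def pooled_f(hyp_triples, ref_triples, valences):
        """Single F over the pooled items of all valences."""
        return f_measure(decompose(hyp_triples, valences),
                         decompose(ref_triples, valences))

    def pr_means(hyp_triples, ref_triples, valences):
        """Equal-weight average of per-valence precision and recall,
        on analogy with BLEU's combination of n-gram precisions."""
        scores = []
        for v in valences:
            h = decompose(hyp_triples, (v,))
            r = decompose(ref_triples, (v,))
            overlap = sum((h & r).values())
            scores.append(overlap / max(sum(h.values()), 1))  # P
            scores.append(overlap / max(sum(r.values()), 1))  # R
        return sum(scores) / len(scores)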
* Tuneable weights:
- recall that TERp has a tuneable system over a small number of parameters
- insertion weight
- deletion weight
- substitution weight
- synonym weight (motivated with WordNet)
- learn weights for separate precision, recall for various
components
* Implementation
- Charniak parser
- n=1..50
- head-finding, with some modifications
- label-construction
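[ sketch: head-finding and label construction from the parser's
constituency output -- the toy head-rule table and the child/parent
label scheme are assumptions, not the modified rules used here ]
    # Trees as (label, children): children is a list of subtrees, or a
    # string at preterminals, e.g. ("NP", [("DT", "the"), ("NN", "dog")]).
    HEAD_RULES = {"S": "VP", "VP": "VBD", "NP": "NN"}  # toy subset

    def find_head(tree):
        """Return the lexical head word of a constituent."""
        label, children = tree
        if isinstance(children, str):
            return children
        want = HEAD_RULES.get(label)
        for child in children:
            if child[0] == want:
                return find_head(child)
        return find_head(children[-1])  # fallback: rightmost child

    def dependencies(tree, deps=None):
        """Emit (dependent, label, head) triples; the link label is
        built from the child and parent categories."""
        if deps is None:
            deps = []
        label, children = tree
        if isinstance(children, str):
            return deps
        head = find_head(tree)
        for child in children:
            child_head = find_head(child)
            if child_head != head:
                deps.append((child_head, child[0] + "/" + label, head))
            dependencies(child, deps)
        return deps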
(3) Correlation with human judgments of fluency & adequacy
Corpus: LDC Multiple Translation Chinese corpus, parts 3 & 4
treat each segment and each judgment as an independent data point
(Pearson's r over these points; sketched below)
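[ sketch: the correlation reported in the tables below -- plain
Pearson's r over paired metric scores and human judgments ]
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5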
[EXPT]
using n=1 parses for the DPM variants
metric       r
--------   ------
F[dl,lh]    0.226
BLEU4       0.218  # +1 smoothing
F[dlh]      0.185
TER        -0.173
point: using partial labeling (dl, lh) works much better than the full
link. Also, using the Charniak parser as a labeler seems to work; an
LFG parser is not necessary
[EXPT]
using other variants
F[1g,2g,dl,lh]   0.237
F[1g,dl,lh]      0.234
F[1g,2g]         0.227
F[dl,lh]         0.226
F[1g,dl,dlh]     0.227  # increasing-length items: idea from Liu & Gildea
point: using (short) n-grams is also a good thing
[EXPT]
combining precision, recall naively vs. combining items
F[1g,2g,dl,lh]         0.237
prmeans[1g,2g,dl,lh]   0.217
F[dl,lh]               0.226
prmeans[dl,lh]         0.208
point: combining individual items naively (before computing
precision and recall) seems better than naively combining precisions
and recalls (too many chances for zeroes?)
[EXPT]
expanding n, with gamma=0
F[1g,2g,dl,lh], n=50   0.239
F[1g,2g,dl,lh], n=1    0.237
F[dl,lh], n=50         0.234
F[dl,lh], n=1          0.226
point: larger n -> better r
(holds for other comparisons too, but these are good examples)
[EXPT]
setting gamma
Tuning experiments find that gamma = 0.25 is slightly better
(especially when using F[dl,lh]) than other gamma values in the range
0-1 (these differences are probably not significant)
(4) Correlations with HTER
Corpus: GALE translation corpus 2.5; 3 sites, 2 source languages, 4
genres each
prediction target: mean-subtracted HTER per document
base measure EDPM: F[1g,2g,dl,lh], n=50, gamma=0.25
[EXPT]
          Ar      Zh      All
TER       0.51    0.32    0.44
BLEU_4   -0.42   -0.33   -0.37
EDPM     -0.60*  -0.39   -0.50*
point: base EDPM does better over all documents, and on Arabic; the
difference is not quite significant at the p=0.05 level on the Chinese
documents
Note 1: also tried pairwise deltas per-document, with similar
results, as presented in the MetricsMATR competition submission.
Note 2: when using per-segment mean subtraction, r values are smaller,
and *Arabic* was the language where EDPM did not perform best.
          Ar      Zh      All
TER       0.38*   0.04    0.19
BLEU_4   -0.10   -0.16   -0.14
EDPM     -0.31   -0.19*  -0.24*
(Go into details of per-genre differences here? or leave out, for
space reasons?)
(5) Tuning weights
TERp optimizer [Matt, a brief description here]
TERp comes with a variety of features (described earlier)
optionally include new features
Corpus: GALE 2.5 data, again
documents randomly split into two groups, evenly split across
language and genre
tuning target: weighted correlation with mean-removed segment HTER
[we use weighted correlation to avoid overemphasizing short segments
(which hurts document-level and system-level utility); sketched below]
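[ sketch: the weighted correlation -- assuming per-segment weights
(e.g. reference length; the exact weighting scheme is an assumption) ]
    def weighted_pearson(x, y, w):
        """Pearson's r with per-observation weights."""
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        cov = sum(wi * (xi - mx) * (yi - my)
                  for wi, xi, yi in zip(w, x, y)) / sw
        vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
        vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
        return cov / (vx * vy) ** 0.5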
Select features from set:
E: features from EDPM, specifically, P,R for inbound, outbound,
full-links, and unlabeled links (8 features)
T: features from basic TERp
(inserts, deletes, substitutions, shifts) (7 features)
P: features from TERp paraphraser
(4 features)
N: precision and recall for 1-grams, 2-grams
B: brevity/prolixity (2 features: one for longer-than-ref, one for
shorter-than-ref)
tune weights on one set, test on the other (and vice versa),
reporting the average r between the two (a tuning-loop sketch follows)
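[ sketch: the two-fold tuning loop -- the actual TERp optimizer
differs (description above); the linear model, scipy optimizer, and
names here are assumptions; reuses weighted_pearson() from above ]
    import numpy as np
    from scipy.optimize import minimize

    def tune_and_test(F_tune, h_tune, w_tune, F_test, h_test, w_test):
        """Fit linear feature weights to maximize weighted r on the
        tuning split; report weighted r on the held-out split. Run
        both directions and average the two test r values."""
        def neg_wr(theta, F, h, w):
            return -weighted_pearson(F @ theta, h, w)
        theta0 = np.ones(F_tune.shape[1])
        theta = minimize(neg_wr, theta0,
                         args=(F_tune, h_tune, w_tune)).x
        return weighted_pearson(F_test @ theta, h_test, w_test)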
== Tuned for weighted mean-removed Pearson (WGT_MEAN_RM_SEG_PEAR) ==
Cond   Avg. test r
----   -----------
ETPB   0.4096
ETP    0.4038
TPB    0.4010
TP     0.3946
TPNB   0.3940
TPN    0.3891
ETB    0.3871
ET     0.3827
TB     0.3815
ETNB   0.3814
ETN    0.3804
T      0.3758
TNB    0.3708
TN     0.3680
E      0.3287
EB     0.3267
ENB    0.3149
NB     0.3058
EN     0.2973
N      0.2636
B      0.1638
(maybe don't report all these numbers!)
point: E+T > T, E+TB > TB, E+TP > TP; information is available in the
syntax that is not captured by the other measures.
point: syntax is *not* just an expensive way to get at n-grams, and
n-grams are not a substitute for syntax -- TP > TP+N, but TP < TP+E
(6) Conclusion
expected dependency pair matching gets at something new
troubling area: speed of analysis -- current implementation requires
running a complete n-best parser
possible use: as a late-pass evaluation, to identify how systems
perform overall
future work: explore ways to get at syntactic information without
the expense of a full parser:
- this work moves from an LFG parser (Owczarzak et al.) to the
substantially simpler and more robust Charniak system; keep going
and look at simpler partial-parsing approaches (forests?)
- conversely: how important is it that the parses be of high-quality?