abstract
(1) Introduction [background]
* evaluation [esp. automatic evaluation]
* popular current measures BLEU, TER
* challenges - motivate search for better measures
* other extensions:
- METEOR - stemming and synonym tables; tuned but not generally tuneable
- TERp
- development-set tuning by user
- stemming and synonym tables
- paraphrase tables
* prior work with (partial) syntactic locality:
- Liu & Gildea
- Roark et al. Sparseval
- Owczarzak et al. 2007
* Evaluation of evaluation measures
- correlation vs. human fluency/adequacy
- correlation vs. human-in-the-loop measures (HTER)
* one challenge of metric evaluation: underlying document difficulty
- describe mean-subtraction (sketched below)
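[ sketch: mean-subtraction -- a minimal Python sketch, assuming scores
arrive as a map from (document, system) pairs to raw metric or HTER
values; all names are illustrative ]
    from collections import defaultdict

    def mean_subtract(scores):
        """Remove each document's mean across systems, so residuals
        reflect system quality rather than underlying document
        difficulty."""
        by_doc = defaultdict(list)
        for (doc, system), value in scores.items():
            by_doc[doc].append(value)
        doc_mean = {d: sum(vs) / len(vs) for d, vs in by_doc.items()}
        return {(doc, system): value - doc_mean[doc]
                for (doc, system), value in scores.items()}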
(2) Experimental paradigm
This section explains the space of measures to explore
* F-measure over various bags of components of parse tree
[ Figure describing Sparseval-ish variant ]
* Listing possible component valences
- dependent + labeled link + head
- outbound only (dependent + label)
- inbound only (label + head)
- word-word (skipping label)
- 1-grams
- 2-grams
- ...
Note: inbound links & full pairs "overweight" head words; this is a
good thing (see the decomposition sketch after the figure below).
[ figure demonstrating F-measure over d+l+h decomposition of toy
trees ]
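[ sketch: the d+l+h decomposition and bag-level F -- a minimal Python
sketch, assuming each parse is given as labeled dependency triples
(dependent, label, head); names are illustrative ]
    from collections import Counter

    def decompose(triples, valences=("dl", "lh")):
        """Break each labeled link into the requested partial-valence
        items. Head words recur in every lh and dlh item, so they are
        intentionally counted more often than leaf words."""
        bag = Counter()
        for d, l, h in triples:
            if "dlh" in valences: bag[("dlh", d, l, h)] += 1
            if "dl" in valences:  bag[("dl", d, l)] += 1   # outbound
            if "lh" in valences:  bag[("lh", l, h)] += 1   # inbound
            if "dh" in valences:  bag[("dh", d, h)] += 1   # word-word
        return bag

    def f_measure(hyp_bag, ref_bag):
        """Harmonic mean of clipped bag precision and recall."""
        overlap = sum((hyp_bag & ref_bag).values())
        p = overlap / max(sum(hyp_bag.values()), 1)
        r = overlap / max(sum(ref_bag.values()), 1)
        return 2 * p * r / (p + r) if p + r else 0.0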
* Tree features may be extracted from a 1-best tree, or from an
n-best tree list
- counts are an expectation over the forest
- n parameter: how many trees in the forest
- gamma parameter for flattening overconfident parse hypotheses
note intentional similarities to Owczarzak et al. (2007); a sketch of
the expected counts follows
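[ sketch: expected counts over an n-best list -- reuses decompose()
from the sketch above; the exp(gamma * logprob) weighting is an
assumption about how the flattening is applied ]
    import math
    from collections import Counter

    def expected_bag(nbest, gamma=0.25, n=50):
        """nbest: best-first list of (logprob, triples) pairs.
        gamma=0 weights all n trees uniformly; gamma=1 keeps the
        parser's (often overconfident) distribution."""
        hyps = nbest[:n]
        best = max(lp for lp, _ in hyps)          # for stability
        weights = [math.exp(gamma * (lp - best)) for lp, _ in hyps]
        z = sum(weights)
        bag = Counter()
        for w, (_, triples) in zip(weights, hyps):
            for item, count in decompose(triples).items():
                bag[item] += (w / z) * count      # fractional counts
        return bag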
* Components in the basic DPM measure are combined in various ways
(both schemes are sketched after this list):
- combination of precision, recall scores for different
components (on analogy with BLEU combination), naive
assumption: all weighted equally
- simple F over items of all valences (naive assumption: all
weighted equally)
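[ sketch: the two naive combinations -- reuses decompose() and
f_measure() from above; whether prmeans takes an arithmetic or
geometric mean is an assumption (arithmetic here) ]
    def pooled_f(hyp_triples, ref_triples, valences):
        """Single F over the pooled items of all valences."""
        return f_measure(decompose(hyp_triples, valences),
                         decompose(ref_triples, valences))

    def pr_means(hyp_triples, ref_triples, valences):
        """Equal-weight average of per-valence precision and recall,
        on analogy with BLEU's combination of n-gram precisions."""
        scores = []
        for v in valences:
            h = decompose(hyp_triples, (v,))
            r = decompose(ref_triples, (v,))
            overlap = sum((h & r).values())
            scores.append(overlap / max(sum(h.values()), 1))  # P
            scores.append(overlap / max(sum(r.values()), 1))  # R
        return sum(scores) / len(scores)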
* Tuneable weights:
- recall that TERp has a tuneable system over a small number of parameters
- insertion weight
- deletion weight
- substitution weight
- synonym weight (motivated with WordNet)
- learn weights for separate precision, recall for various
components
* Implementation
- Charniak parser
- n=1..50
- head-finding, with some modifications
- label-construction
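[ sketch: head-finding and label construction from the parser's
constituency output -- the toy head-rule table and the child/parent
label scheme are assumptions, not the modified rules used here ]
    # Trees as (label, children): children is a list of subtrees, or a
    # string at preterminals, e.g. ("NP", [("DT", "the"), ("NN", "dog")]).
    HEAD_RULES = {"S": "VP", "VP": "VBD", "NP": "NN"}  # toy subset

    def find_head(tree):
        """Return the lexical head word of a constituent."""
        label, children = tree
        if isinstance(children, str):
            return children
        want = HEAD_RULES.get(label)
        for child in children:
            if child[0] == want:
                return find_head(child)
        return find_head(children[-1])  # fallback: rightmost child

    def dependencies(tree, deps=None):
        """Emit (dependent, label, head) triples; the link label is
        built from the child and parent categories."""
        if deps is None:
            deps = []
        label, children = tree
        if isinstance(children, str):
            return deps
        head = find_head(tree)
        for child in children:
            child_head = find_head(child)
            if child_head != head:
                deps.append((child_head, child[0] + "/" + label, head))
            dependencies(child, deps)
        return deps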
(3) Correlation with human judgments of fluency & adequacy
Corpus: LDC Multiple Translation Chinese corpus, parts 3 & 4
treat each segment and each judgment as an independent data point
(Pearson's r over these points; sketched below)
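[ sketch: the correlation reported in the tables below -- plain
Pearson's r over paired metric scores and human judgments ]
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5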
[EXPT]
using n=1 parses for the DPM variants
metric       r
--------   ------
F[dl,lh]    0.226
BLEU4       0.218  # +1 smoothing
F[dlh]      0.185
TER        -0.173
point: using partial labeling (dl, lh) works much better than the full
link. Also, using the Charniak parser as a labeler seems to work; an
LFG parser is not necessary
[EXPT]
using other variants
F[1g,2g,dl,lh]   0.237
F[1g,dl,lh]      0.234
F[1g,2g]         0.227
F[dl,lh]         0.226
F[1g,dl,dlh]     0.227  # increasing-length items: idea from Liu & Gildea
point: using (short) n-grams is also a good thing
[EXPT]
combining precision, recall naively vs. combining items
F[1g,2g,dl,lh]         0.237
prmeans[1g,2g,dl,lh]   0.217
F[dl,lh]               0.226
prmeans[dl,lh]         0.208
point: combining individual items naively (before computing
precision and recall) seems better than naively combining precisions
and recalls (too many chances for zeroes?)
[EXPT]
expanding n, with gamma=0
F[1g,2g,dl,lh], n=50   0.239
F[1g,2g,dl,lh], n=1    0.237
F[dl,lh], n=50         0.234
F[dl,lh], n=1          0.226
point: larger n -> better r
(holds for other comparisons too, but these are good examples)
[EXPT]
setting gamma
Tuning experiments find that gamma = 0.25 is slightly better
(especially when using F[dl,lh]) than other gamma values in the range
0-1 (these differences are probably not significant)
(4) Correlations with HTER
Corpus: GALE translation corpus 2.5; 3 sites, 2 source languages, 4
genres each
prediction target: mean-subtracted HTER per document
base measure EDPM: F[1g,2g,dl,lh], n=50, gamma=0.25
[EXPT]
          Ar      Zh      All
TER       0.51    0.32    0.44
BLEU_4   -0.42   -0.33   -0.37
EDPM     -0.60*  -0.39   -0.50*
point: base EDPM does better over all documents, and on Arabic; the
difference is not quite significant at the p=0.05 level on the Chinese
documents
Note 1: also tried pairwise deltas per-document, with similar
results, as presented in the MetricsMATR competition submission.
Note 2: when using per-segment mean subtraction, r values are smaller,
and *Arabic* was the language where EDPM did not perform best.
          Ar      Zh      All
TER       0.38*   0.04    0.19
BLEU_4   -0.10   -0.16   -0.14
EDPM     -0.31   -0.19*  -0.24*
(Go into details of per-genre differences here? or leave out, for
space reasons?)
(5) Tuning weights
TERp optimizer [Matt, a brief description here]
TERp comes with a variety of features (described earlier)
optionally include new features
Corpus: GALE 2.5 data, again
documents randomly split into two groups, evenly split across
language and genre
tuning target: weighted correlation with mean-removed segment HTER
[we use weighted correlation to avoid overemphasizing short segments
(which hurts document-level and system-level utility); sketched below]
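[ sketch: the weighted correlation -- assuming per-segment weights
(e.g. reference length; the exact weighting scheme is an assumption) ]
    def weighted_pearson(x, y, w):
        """Pearson's r with per-observation weights."""
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        cov = sum(wi * (xi - mx) * (yi - my)
                  for wi, xi, yi in zip(w, x, y)) / sw
        vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
        vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
        return cov / (vx * vy) ** 0.5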
Select features from set:
E: features from EDPM, specifically, P,R for inbound, outbound,
full-links, and unlabeled links (8 features)
T: features from basic TERp
(inserts, deletes, substitutions, shifts) (7 features)
P: features from TERp paraphraser
(4 features)
N: precision and recall for 1-grams, 2-grams
B: brevity/prolixity (2 features: one for longer-than-ref, one for
shorter-than-ref)
tune weights on one set, test on the other (and vice versa),
reporting the average r between the two (a tuning-loop sketch follows)
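[ sketch: the two-fold tuning loop -- the actual TERp optimizer
differs (description above); the linear model, scipy optimizer, and
names here are assumptions; reuses weighted_pearson() from above ]
    import numpy as np
    from scipy.optimize import minimize

    def tune_and_test(F_tune, h_tune, w_tune, F_test, h_test, w_test):
        """Fit linear feature weights to maximize weighted r on the
        tuning split; report weighted r on the held-out split. Run
        both directions and average the two test r values."""
        def neg_wr(theta, F, h, w):
            return -weighted_pearson(F @ theta, h, w)
        theta0 = np.ones(F_tune.shape[1])
        theta = minimize(neg_wr, theta0,
                         args=(F_tune, h_tune, w_tune)).x
        return weighted_pearson(F_test @ theta, h_test, w_test)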
== Tuned for weighted mean-removed Pearson (WGT_MEAN_RM_SEG_PEAR) ==
Cond   Avg. test r
----   -----------
ETPB   0.4096
ETP    0.4038
TPB    0.4010
TP     0.3946
TPNB   0.3940
TPN    0.3891
ETB    0.3871
ET     0.3827
TB     0.3815
ETNB   0.3814
ETN    0.3804
T      0.3758
TNB    0.3708
TN     0.3680
E      0.3287
EB     0.3267
ENB    0.3149
NB     0.3058
EN     0.2973
N      0.2636
B      0.1638
(maybe don't report all these numbers!)
point: E+T > T, E+TB > TB, E+TP > TP; information is available in the
syntax that is not captured by the other measures.
point: syntax is *not* just an expensive way to get at n-grams, and
n-grams are not a substitute for syntax -- TP > TP+N, but TP < TP+E
(6) Conclusion
expected dependency pair matching gets at something new
troubling area: speed of analysis -- current implementation requires
running a complete n-best parser
possible use: as a late-pass evaluation, to identify how systems
perform overall
future work: explore ways to get at syntactic information without
the expense of a full parser:
- this work moves from an LFG parser (Owczarzak et al.) to the
substantially simpler and more robust Charniak system; keep going
and look at simpler partial-parsing approaches (forests?)
- conversely: how important is it that the parses be of high-quality?