Proposal: Add textcat to spacy train CLI #4038

Closed · 7 tasks
adrianeboyd opened this issue Jul 29, 2019 · 4 comments
Labels
enhancement (Feature requests and improvements) · feat / textcat (Feature: Text Classifier) · proposal (Proposal specs for new features) · training (Training and updating models)


It would be useful for spacy train to support textcat components. Some of the main questions are:

  1. How to extend the current JSON training format
  2. How to provide model settings to spacy train
  3. How to score textcat results
  4. How to provide textcat results to users (especially for multilabel tasks)

I think it would make sense to adopt slightly different terminology for mutually exclusive vs. non-mutually exclusive classes (as in https://spacy.io/api/textcategorizer#init) and instead call them:

  • mutually exclusive -> multiclass
  • non-mutually exclusive -> multilabel

JSON Training Format

The format currently looks like this:

[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }]
}]

cats could be added at the document level like this:

[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }],
    "cats": [{
        "label": string,
        "value": number
    }]
}]

With the data spread across paragraphs, I don't think it makes sense to try to support subdocument textcats with character offsets (the (start, end, label) keys supported in GoldParse, see https://spacy.io/api/goldparse#attributes), but if you wanted to, it could easily be extended like this:

"cats": [{
    "label": string,
    "value": number,
    "start": int,
    "end": int
}]

If you were extremely sure that you did not want to support subdocument textcats, then a slightly simpler version would be with the labels as keys:

"cats": {
    string: number,
    string: number,
    ...
}

This simple version is what is proposed in #2928, but I think it would be better for it to be a list so that it is more like the other types of annotation and so that it is more extensible.
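
For a concrete (made-up) example, a document annotated for the labels SPORTS and POLITICS would look like this in the list-based format:

"cats": [{
    "label": "SPORTS",
    "value": 1.0
}, {
    "label": "POLITICS",
    "value": 0.0
}]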

Joining Paragraphs

With the current paragraph-based format, you would need to decide how to join paragraphs for training purposes. Something like \n\n? Should this be a language-specific setting?
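
As a rough sketch (the function name and separator are just placeholders), joining could simply concatenate the raw paragraph texts:

# Hypothetical sketch: join the "raw" text of each paragraph in a JSON
# training document into one document-level text for textcat training.
# The "\n\n" separator is only one possible choice and could be made
# configurable or language-specific.
def join_paragraphs(doc_json, separator="\n\n"):
    return separator.join(p["raw"] for p in doc_json["paragraphs"])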

Model/Task Information

The following information is needed in order to initialize the model. It could be included directly as command-line options to spacy train or in a separate JSON file (e.g., --meta meta.json):

{
    "labels": [string, ... ]    # list of all labels
    "type": string,             # multiclass vs. multilabel (default: multiclass)
    "sparse": boolean,          # true: missing labels are 0.0, false: missing labels are None
                                #    (default: false)
}

I think for most typical use cases with spacy train, this information could be automatically detected. This isn't true for general-purpose textcat training, where you might not be able to make an initial pass through all your data, but I think it's okay to simplify things for spacy train and have it autodetect these settings. Users should be able to override the autodetected settings with command-line options if needed.

Autodetection of Model Settings

If training a new model, autodetection would examine the training data (a rough sketch follows the lists below):

  • labels: all labels present in the training data
  • type: multiclass if each text has exactly one positive label, otherwise multilabel
  • sparse: True if not all labels are present on every text in the data

If extending an existing model:

  • labels: union of all labels in the model and training data
  • type: multiclass vs. multilabel would be detected from the existing model
  • sparse: True if not all labels are present on every text in the training data (?)
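
A minimal sketch of what autodetection for a new model could look like (all names are hypothetical, and each training text is assumed to carry the list-based cats annotation proposed above):

# Hypothetical sketch of autodetecting textcat settings from training data.
# `all_cats` has one entry per training text, each entry being a list like
# [{"label": ..., "value": ...}, ...].
def autodetect_textcat_settings(all_cats):
    labels = set()
    multiclass = True
    for cats in all_cats:
        labels.update(c["label"] for c in cats)
        positive = [c for c in cats if c["value"] is not None and c["value"] > 0.5]
        if len(positive) != 1:
            multiclass = False
    # sparse: not every label is annotated on every text
    sparse = any({c["label"] for c in cats} != labels for cats in all_cats)
    return {
        "labels": sorted(labels),
        "type": "multiclass" if multiclass else "multilabel",
        "sparse": sparse,
    }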

Binary Tasks

A binary task can be represented as either one-class multilabel or two-class multiclass. With one-class multilabel, the positive label is the single provided label; with two-class multiclass, you'd need to know which label counts as positive in order to provide a better evaluation. This could potentially be added to the info in meta.json or as a command-line option.
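
For illustration (label names are made up), the same binary example could be annotated either way:

# one-class multilabel: only the positive label is annotated
"cats": [{"label": "SPAM", "value": 1.0}]

# two-class multiclass: the positive label has to be identified for evaluation
"cats": [{"label": "SPAM", "value": 1.0}, {"label": "NOT_SPAM", "value": 0.0}]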

GoldParse / GoldCorpus

GoldParse supports the textcat annotations as .cats.

gold.json_to_tuple() would need to be updated to read in the cats information for GoldParse/GoldCorpus.
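
As a minimal sketch of that conversion step (assuming GoldParse keeps accepting a cats dict keyed by label):

# Hypothetical helper: convert the proposed list-based "cats" annotation into
# the {label: value} dict shape used for GoldParse.cats.
def cats_list_to_dict(cats_list):
    return {c["label"]: c["value"] for c in cats_list}

# e.g. GoldParse(doc, cats=cats_list_to_dict(doc_json["cats"]))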

(Are the jsonl and msg file input options in spacy.gold just a sketch at this point?)

Scorer

There are multiple options for scoring. I think some kind of precision/recall/f-scores are probably okay for most use cases, but feedback/suggestions are welcome.

The main question with f-scores is how to average across labels/instances, especially for multilabel tasks. If I had to pick one option, I might pick the weighted macro average (described as weighted in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), but weighted scores get tricky, and a plain macro average might be more straightforward for a typical user to interpret, especially if the per-label scores are also easy to inspect.

Micro and macro averaging are already covered by PRFScore; the only additional functionality needed would be the weighted average.
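
To make the averaging options concrete, here's a small standalone illustration using scikit-learn directly (not PRFScore), with made-up binary indicator arrays where rows are texts and columns are labels:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])

for average in ("micro", "macro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=average)
    print(f"{average}: p={p:.2f} r={r:.2f} f={f:.2f}")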

Scorer could be extended with the properties cats_p/r/f and cats_per_cat similar to NER.

Alternative metrics:

  • AUC ROC (also decide how to average across labels)
  • a misclassification cost matrix
  • accuracy for multiclass tasks
  • ???

doc.cats and Multilabel Thresholds

In order to optimize results for a multilabel classification task given a particular evaluation metric (e.g., f0.5-score), you might want to find/store probability thresholds on a per-label basis. I'm not sure I know enough about where/how to do this sensibly, but my initial suggestion would be to use a supplied evaluation metric along with the dev set in spacy train to find thresholds and store them in the model as a default. (As something like cfg['default_thresholds']?)

I think an alternative to doc.cats that just provides a set of positive labels could be useful. In the multiclass case, argmax provides the positive label. In the multilabel case, the stored thresholds or provided thresholds could be applied to doc.cats to provide a set of positive labels. I'm not sure whether this should be stored in Doc or provided as a separate utility function like util.get_positive_cats(nlp, doc, thresholds=thresholds), where nlp provides the thresholds unless alternate thresholds are specified. (Preferably with a better name than get_positive_cats, but I can't think of anything better right now.)
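
A rough sketch of what per-label threshold selection could look like (optimizing F1 here purely as an example; all names and signatures are illustrative, not a proposed API):

# Hypothetical sketch: pick a per-label threshold that maximizes F1 on the dev
# set. `dev_scores` maps each label to a list of
# (predicted_probability, gold_is_positive) pairs collected from the dev data.
def find_thresholds(dev_scores, candidates=None):
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]
    thresholds = {}
    for label, pairs in dev_scores.items():
        best_f, best_t = -1.0, 0.5
        for t in candidates:
            tp = sum(1 for p, gold in pairs if p >= t and gold)
            fp = sum(1 for p, gold in pairs if p >= t and not gold)
            fn = sum(1 for p, gold in pairs if p < t and gold)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            if f1 > best_f:
                best_f, best_t = f1, t
        thresholds[label] = best_t
    return thresholds

# Hypothetical helper: apply per-label thresholds to a doc.cats-style
# {label: score} dict and return the set of positive labels.
def positive_labels(doc_cats, thresholds, default=0.5):
    return {label for label, score in doc_cats.items()
            if score >= thresholds.get(label, default)}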

Tasks

  • Add cats to JSON training format
  • Export cats in JSON training format for Doc (see discussion in: Fix gold.docs_to_json() to match JSON train format #4013)
  • Import cats from JSON training format into GoldParse/GoldCorpus
  • Add threshold logic for multilabel tasks (in TextCategorizer? in Scorer?)
  • Add textcat scoring to Scorer
  • Add positive_cats-type set output?
  • Add textcat to the pipeline options in spacy train including autodetection of model settings

Related/Future Tasks


honnibal commented Aug 12, 2019

Thanks for doing this so thoroughly, and sorry it's taken me a while to get to it.

mutually exclusive --> multilabel / multiclass

I think it's fine to use that terminology in documentation and descriptions, but I'm nervous about having it in an API or in config settings. To me the most natural interpretation of multiclass is "not single class". The contrast with multilabel vs multiclass is known within machine learning, but if you haven't heard of this before, it could be quite surprising.

format of cats

I think the list-based format you've proposed is very nice:

        "label": string,
        "value": number
    }]

I like that it's more extensible, so we can support ranged categories if necessary. Optionally, we could also allow a cats object at the paragraph and sentence level?

Joining Paragraphs

As we discussed, this is one of the more surprising details of the training format currently. We currently emit a (doc, gold) pair for a paragraph. So you're right that we'd need to join up the paragraphs somehow.

Autodetect

I think autodetecting is a fine solution, so long as we're sure to print the results to the command line so it's clear what's going on. Definitely we should support providing the answers explicitly as well.

Multilabel thresholds

I'm not sure I'm keen on this. It feels like the ratio of library complexity added to user convenience delivered is low. Ultimately the user needs [cat for cat, score in doc.cats.items() if score >= threshold]. We can't tell them what threshold they should use, and we can't tell them whether the same threshold should apply to every document. So I would probably prefer to leave this to the user.


adrianeboyd commented Aug 14, 2019

Thanks for doing this so thoroughly, and sorry it's taken me a while to get to it.

mutually exclusive --> multilabel / multiclass

I think it's fine to use that terminology in documentation and descriptions, but I'm nervous about having it in an API or in config settings. To me the most natural interpretation of multiclass is "not single class". The contrast with multilabel vs multiclass is known within machine learning, but if you haven't heard of this before, it could be quite surprising.

Would the terminology "mutually exclusive" vs. "multilabel" be okay? I don't think the API needs to change, I mainly just don't want to write "non-mutually exclusive".

Optionally, we could also allow a cats object at the paragraph and sentence level?

For the train CLI I think that subdocument cats could be straightforward enough for training but very hard for evaluation. What dev subdocuments do you evaluate with? Some random selection of document substrings from the dev data? If a subdocument doesn't exactly correspond with an existing labeled subdocument the category should be false? Or if it overlaps a bit the category is true? Or maybe if it overlaps more than 50% (measured how?) it's true? I think it would be very difficult to do well.

To keep the evaluation simple, I'd prefer to keep the cats at the document level. I don't think using the sentence level is a great idea because spacy has doc.cats and not span.cats.

However, since the current format basically treats paragraphs as docs, I could see using paragraphs instead of documents as an interim solution until the new format is supported. That would also simplify the question of how to handle cats in gold.docs_to_json().

(The subdocument evaluation problems are the same for the new format, though.)

Autodetect

I think autodetecting is a fine solution, so long as we're sure to print the results to the command line so it's clear what's going on. Definitely we should support providing the answers explicitly as well.

That sounds good.

Multilabel thresholds

I'm not sure I'm keen on this. It feels like the ratio of library complexity added to user convenience delivered is low. Ultimately the user needs [cat for cat, score in doc.cats.items() if score >= threshold]. We can't tell them what threshold they should use, and we can't tell them whether the same threshold should apply to every document. So I would probably prefer to leave this to the user.

I understand why that makes sense (and it's fine for doc.cats in general), but I think we need to do some threshold selection during training, use those thresholds during the training-internal evaluation, and present them as suggestions to the user. I'm worried that users will see such bad evaluations during training with 0.5 thresholds that they'll conclude the model is terrible and give up, without realizing it could be useful with other thresholds.


ines commented Sep 17, 2019

See #4226 🎉

ines closed this as completed Sep 17, 2019