Proposal: Add textcat to spacy train CLI #4038

Closed · 7 tasks
adrianeboyd opened this issue Jul 29, 2019 · 4 comments
Labels
enhancement (Feature requests and improvements) · feat / textcat (Feature: Text Classifier) · proposal (Proposal specs for new features) · training (Training and updating models)


It would be useful for spacy train to support textcat components. Some of the main questions are:

  1. How to extend the current JSON training format
  2. How to provide model settings to spacy train
  3. How to score textcat results
  4. How to provide textcat results to users (especially for multilabel tasks)

I think it would make sense to adopt slightly different terminology for mutually exclusive vs. non-mutually exclusive classes (as in https://spacy.io/api/textcategorizer#init) and instead call them:

  • mutually exclusive -> multiclass
  • non-mutually exclusive -> multilabel

JSON Training Format

The format currently looks like this:

[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }]
}]

cats could be added at the document level like this:

[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }],
    "cats": [{
        "label": string,
        "value": number
    }]
}]

With the data spread across paragraphs, I don't think it makes sense to try to support subdocument textcats with character offsets (the (start, end, label) keys supported in GoldParse, see https://spacy.io/api/goldparse#attributes), but if you wanted to, it could easily be extended like this:

"cats": [{
    "label": string,
    "value": number,
    "start": int,
    "end": int
}]

If you were extremely sure that you did not want to support subdocument textcats, then a slightly simpler version would be with the labels as keys:

"cats": {
    string: number,
    string: number,
    ...
}

This simple version is what is proposed in #2928, but I think it would be better for it to be a list so that it is more like the other types of annotation and so that it is more extensible.
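
For a concrete (made-up) example, a document annotated for the labels SPORTS and POLITICS would look like this in the list-based format:

"cats": [{
    "label": "SPORTS",
    "value": 1.0
}, {
    "label": "POLITICS",
    "value": 0.0
}]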

Joining Paragraphs

With the current paragraph-based format, you would need to decide how to join paragraphs for training purposes. Something like \n\n? Should this be a language-specific setting?
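
As a rough sketch (the function name and separator are just placeholders), joining could simply concatenate the raw paragraph texts:

# Hypothetical sketch: join the "raw" text of each paragraph in a JSON
# training document into one document-level text for textcat training.
# The "\n\n" separator is only one possible choice and could be made
# configurable or language-specific.
def join_paragraphs(doc_json, separator="\n\n"):
    return separator.join(p["raw"] for p in doc_json["paragraphs"])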

Model/Task Information

The following information is needed in order to initialize the model. It could be included directly as command-line options to spacy train or in a separate JSON file (e.g., --meta meta.json):

{
    "labels": [string, ... ]    # list of all labels
    "type": string,             # multiclass vs. multilabel (default: multiclass)
    "sparse": boolean,          # true: missing labels are 0.0, false: missing labels are None
                                #    (default: false)
}

I think for most typical use cases with spacy train, this information could be automatically detected. This isn't true for general-purpose textcat training, where you might not be able to make an initial pass through all your data, but I think it's okay to simplify things for spacy train and have it autodetect these settings. Users should be able to override the autodetected settings with command-line options if needed.

Autodetection of Model Settings

If training a new model, autodetection would examine the training data (a rough sketch follows the lists below):

  • labels: all labels present in the training data
  • type: multiclass if each text has exactly one positive label, otherwise multilabel
  • sparse: True if not all labels are present on every text in the data

If extending an existing model:

  • labels: union of all labels in the model and training data
  • type: multiclass vs. multilabel would be detected from the existing model
  • sparse: True if not all labels are present on every text in the training data (?)
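
A minimal sketch of what autodetection for a new model could look like (all names are hypothetical, and each training text is assumed to carry the list-based cats annotation proposed above):

# Hypothetical sketch of autodetecting textcat settings from training data.
# `all_cats` has one entry per training text, each entry being a list like
# [{"label": ..., "value": ...}, ...].
def autodetect_textcat_settings(all_cats):
    labels = set()
    multiclass = True
    for cats in all_cats:
        labels.update(c["label"] for c in cats)
        positive = [c for c in cats if c["value"] is not None and c["value"] > 0.5]
        if len(positive) != 1:
            multiclass = False
    # sparse: not every label is annotated on every text
    sparse = any({c["label"] for c in cats} != labels for cats in all_cats)
    return {
        "labels": sorted(labels),
        "type": "multiclass" if multiclass else "multilabel",
        "sparse": sparse,
    }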

Binary Tasks

A binary task can be represented as either one-class multilabel or two-class multiclass. With one-class multilabel, the positive label is the single provided label; with two-class multiclass, you'd need to know which label counts as positive in order to provide a better evaluation. This could potentially be added to the info in meta.json or as a command-line option.
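
For illustration (label names are made up), the same binary example could be annotated either way:

# one-class multilabel: only the positive label is annotated
"cats": [{"label": "SPAM", "value": 1.0}]

# two-class multiclass: the positive label has to be identified for evaluation
"cats": [{"label": "SPAM", "value": 1.0}, {"label": "NOT_SPAM", "value": 0.0}]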

GoldParse / GoldCorpus

GoldParse supports the textcat annotations as .cats.

gold.json_to_tuple() would need to be updated to read in the cats information for GoldParse/GoldCorpus.
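
As a minimal sketch of that conversion step (assuming GoldParse keeps accepting a cats dict keyed by label):

# Hypothetical helper: convert the proposed list-based "cats" annotation into
# the {label: value} dict shape used for GoldParse.cats.
def cats_list_to_dict(cats_list):
    return {c["label"]: c["value"] for c in cats_list}

# e.g. GoldParse(doc, cats=cats_list_to_dict(doc_json["cats"]))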

(Are the jsonl and msg file input options in spacy.gold just a sketch at this point?)

Scorer

There are multiple options for scoring. I think some kind of precision/recall/f-scores are probably okay for most use cases, but feedback/suggestions are welcome.

The main question with f-scores is how to average across labels/instances, especially for multilabel tasks. If I had to pick one option, I might pick the weighted macro average (described as weighted in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), but weighted scores get tricky, and a plain macro average might be more straightforward for a typical user to interpret, especially if the per-label scores are also easy to inspect.

Micro and macro averaging are already covered by PRFScore; the only additional functionality needed would be the weighted average.
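
To make the averaging options concrete, here's a small standalone illustration using scikit-learn directly (not PRFScore), with made-up binary indicator arrays where rows are texts and columns are labels:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])

for average in ("micro", "macro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=average)
    print(f"{average}: p={p:.2f} r={r:.2f} f={f:.2f}")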

Scorer could be extended with the properties cats_p/r/f and cats_per_cat similar to NER.

Alternative metrics:

  • AUC ROC (also decide how to average across labels)
  • a misclassification cost matrix
  • accuracy for multiclass tasks
  • ???

doc.cats and Multilabel Thresholds

In order to optimize results for a multilabel classification task given a particular evaluation metric (e.g., f0.5-score), you might want to find/store probability thresholds on a per-label basis. I'm not sure I know enough about where/how to do this sensibly, but my initial suggestion would be to use a supplied evaluation metric along with the dev set in spacy train to find thresholds and store them in the model as a default. (As something like cfg['default_thresholds']?)

I think an alternative to doc.cats that just provides a set of positive labels could be useful. In the multiclass case, argmax provides the positive label. In the multilabel case, the stored thresholds or provided thresholds could be applied to doc.cats to provide a set of positive labels. I'm not sure whether this should be stored in Doc or provided as a separate utility function like util.get_positive_cats(nlp, doc, thresholds=thresholds), where nlp provides the thresholds unless alternate thresholds are specified. (Preferably with a better name than get_positive_cats, but I can't think of anything better right now.)
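
A rough sketch of what per-label threshold selection could look like (optimizing F1 here purely as an example; all names and signatures are illustrative, not a proposed API):

# Hypothetical sketch: pick a per-label threshold that maximizes F1 on the dev
# set. `dev_scores` maps each label to a list of
# (predicted_probability, gold_is_positive) pairs collected from the dev data.
def find_thresholds(dev_scores, candidates=None):
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]
    thresholds = {}
    for label, pairs in dev_scores.items():
        best_f, best_t = -1.0, 0.5
        for t in candidates:
            tp = sum(1 for p, gold in pairs if p >= t and gold)
            fp = sum(1 for p, gold in pairs if p >= t and not gold)
            fn = sum(1 for p, gold in pairs if p < t and gold)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            if f1 > best_f:
                best_f, best_t = f1, t
        thresholds[label] = best_t
    return thresholds

# Hypothetical helper: apply per-label thresholds to a doc.cats-style
# {label: score} dict and return the set of positive labels.
def positive_labels(doc_cats, thresholds, default=0.5):
    return {label for label, score in doc_cats.items()
            if score >= thresholds.get(label, default)}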

Tasks

  • Add cats to JSON training format
  • Export cats in JSON training format for Doc (see discussion in: Fix gold.docs_to_json() to match JSON train format #4013)
  • Import cats from JSON training format into GoldParse/GoldCorpus
  • Add threshold logic for multilabel tasks (in TextCategorizer? in Scorer?)
  • Add textcat scoring to Scorer
  • Add positive_cats-type set output?
  • Add textcat to the pipeline options in spacy train including autodetection of model settings

Related/Future Tasks


honnibal commented Aug 12, 2019

Thanks for doing this so thoroughly, and sorry it's taken me a while to get to it.

mutually exclusive --> multilabel / multiclass

I think it's fine to use that terminology in documentation and descriptions, but I'm nervous about having it in an API or in config settings. To me the most natural interpretation of multiclass is "not single class". The contrast with multilabel vs multiclass is known within machine learning, but if you haven't heard of this before, it could be quite surprising.

format of cats

I think the list-based format you've proposed is very nice:

        "label": string,
        "value": number
    }]

I like that it's more extensible, so we can support ranged categories if necessary. Optionally, we could also allow a cats object at the paragraph and sentence level?

Joining Paragraphs

As we discussed, this is one of the more surprising details of the training format currently. We currently emit a (doc, gold) pair for a paragraph. So you're right that we'd need to join up the paragraphs somehow.

Autodetect

I think autodetecting is a fine solution, so long as we're sure to print the results to the command line so it's clear what's going on. Definitely we should support providing the answers explicitly as well.

Multilabel thresholds

I'm not sure I'm keen on this. It feels like the ratio of library complexity added to user convenience delivered is low. Ultimately the user needs [cat for cat, score in doc.cats.items() if score >= threshold]. We can't tell them what threshold they should use, and we can't tell them whether the same threshold should apply to every document. So I would probably prefer to leave this to the user.


adrianeboyd commented Aug 14, 2019

Thanks for doing this so thoroughly, and sorry it's taken me a while to get to it.

mutually exclusive --> multilabel / multiclass

I think it's fine to use that terminology in documentation and descriptions, but I'm nervous about having it in an API or in config settings. To me the most natural interpretation of multiclass is "not single class". The contrast with multilabel vs multiclass is known within machine learning, but if you haven't heard of this before, it could be quite surprising.

Would the terminology "mutually exclusive" vs. "multilabel" be okay? I don't think the API needs to change, I mainly just don't want to write "non-mutually exclusive".

Optionally, we could also allow a cats object at the paragraph and sentence level?

For the train CLI I think that subdocument cats could be straightforward enough for training but very hard for evaluation. What dev subdocuments do you evaluate with? Some random selection of document substrings from the dev data? If a subdocument doesn't exactly correspond with an existing labeled subdocument the category should be false? Or if it overlaps a bit the category is true? Or maybe if it overlaps more than 50% (measured how?) it's true? I think it would be very difficult to do well.

To keep the evaluation simple, I'd prefer to keep the cats at the document level. I don't think using the sentence level is a great idea because spacy has doc.cats and not span.cats.

However, since the current format basically treats paragraphs as docs, I could see using paragraphs instead of documents as an interim solution until the new format is supported. That would also simplify the question of how to handle cats in gold.docs_to_json().

(The subdocument evaluation problems are the same for the new format, though.)

Autodetect

I think autodetecting is a fine solution, so long as we're sure to print the results to the command line so it's clear what's going on. Definitely we should support providing the answers explicitly as well.

That sounds good.

Multilabel thresholds

I'm not sure I'm keen on this. It feels like the ratio of library complexity added to user convenience delivered is low. Ultimately the user needs [cat for cat, score in doc.cats.items() if score >= threshold]. We can't tell them what threshold they should use, and we can't tell them whether the same threshold should apply to every document. So I would probably prefer to leave this to the user.

I understand why that makes sense (and it's fine for doc.cats in general), but I think we need to do some threshold selection during training, use those thresholds during the training-internal evaluation, and present them as suggestions to the user. I'm worried that users will see such bad evaluations during training with 0.5 thresholds that they'll conclude the model is terrible and give up, without realizing it could be useful with other thresholds.


ines commented Sep 17, 2019

See #4226 🎉

ines closed this as completed Sep 17, 2019