# Proposal: Add `textcat` to `spacy train` CLI #4038
## Comments
Thanks for doing this so thoroughly, and sorry it's taken me a while to get to it.
I think it's fine to use that terminology in documentation and descriptions, but I'm nervous about having it in an API or in config settings. To me, the most natural interpretation of …
I think the list-based format you've proposed is very nice. I like that it's more extensible, so we can support ranged categories if necessary. Optionally, we could also allow a …

As we discussed, this is one of the more surprising details of the training format currently. We currently emit a …
I think autodetecting is a fine solution, so long as we're sure to print the results to the command line so it's clear what's going on. Definitely we should support providing the answers explicitly as well.
I'm not sure I'm keen on this. It feels like the ratio of library complexity added to user convenience delivered is low. Ultimately the user needs …

---
Would the terminology "mutually exclusive" vs. "multilabel" be okay? I don't think the API needs to change; I mainly just don't want to write "non-mutually exclusive".

For the train CLI, I think that subdocument cats could be straightforward enough for training but very hard for evaluation. What dev subdocuments do you evaluate with? Some random selection of document substrings from the dev data? If a subdocument doesn't exactly correspond with an existing labeled subdocument, is the category false? Or if it overlaps a bit, is the category true? Or maybe it's true if it overlaps more than 50% (measured how?)? I think it would be very difficult to do well. To keep the evaluation simple, I'd prefer to keep the evaluation at the document level.

However, since the current format basically treats paragraphs as docs, I could see using paragraphs instead of documents with the current format as an interim solution until the new format is supported. This would simplify the problems with how to handle joining paragraphs. (The subdocument evaluation problems are the same for the new format, though.)

---
That sounds good.
I understand why that makes sense (and it's fine for …

---
See #4226 🎉

---
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
## Issue description

It would be useful for `spacy train` to support `textcat` components. Some of the main questions are:

- how to add `cats` to the JSON training format
- how to provide the model/task information needed by `spacy train`
- how to evaluate `textcat` results
- how to present `textcat` results to users (especially for multilabel tasks)

I think it would make sense to adopt slightly different terminology for mutually exclusive vs. non-mutually exclusive classes (as in https://spacy.io/api/textcategorizer#init) and instead call them:
- `mutually exclusive` -> `multiclass`
- `non-mutually exclusive` -> `multilabel`
## JSON Training Format
The format currently looks like this:
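The original example block was not preserved here; below is a sketch of the v2 JSON training format (tags, heads, and deps are illustrative):

```json
[{
    "id": 0,
    "paragraphs": [{
        "raw": "How are you?",
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "How", "tag": "WRB", "head": 1, "dep": "advmod", "ner": "O"},
                {"id": 1, "orth": "are", "tag": "VBP", "head": 0, "dep": "ROOT", "ner": "O"},
                {"id": 2, "orth": "you", "tag": "PRP", "head": -1, "dep": "nsubj", "ner": "O"},
                {"id": 3, "orth": "?", "tag": ".", "head": -2, "dep": "punct", "ner": "O"}
            ],
            "brackets": []
        }]
    }]
}]
```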
`cats` could be added at the document level like this:
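A sketch of the proposed list-based format (the labels and values are illustrative):

```json
{
    "id": 0,
    "cats": [
        {"label": "POSITIVE", "value": 1.0},
        {"label": "NEGATIVE", "value": 0.0}
    ],
    "paragraphs": ["..."]
}
```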
With the data spread across paragraphs, I don't think it makes sense to try to support subdocument textcats with character offsets (the `(start, end, label)` keys supported in `GoldParse`; see https://spacy.io/api/goldparse#attributes), but if you wanted to, it could easily be extended like this:
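A sketch of how ranged categories might look with character offsets (a hypothetical extension, not an existing format; offsets and label are made up):

```json
{
    "cats": [
        {"label": "POSITIVE", "value": 1.0, "start": 0, "end": 123}
    ]
}
```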
If you were extremely sure that you did not want to support subdocument textcats, then a slightly simpler version would be with the labels as keys:
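A sketch of the simpler labels-as-keys variant (labels illustrative):

```json
{
    "cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}
}
```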
This simple version is what is proposed in #2928, but I think it would be better for it to be a list, so that it is more like the other types of annotation and so that it is more extensible.

## Joining Paragraphs
With the current paragraph-based format, you would need to decide how to join paragraphs for training purposes. Something like `\n\n`? Should this be a language-specific setting?

## Model/Task Information
The following information is needed in order to initialize the model. It could be included directly as command-line options to `spacy train` or in a separate JSON file (e.g., `--meta meta.json`):

…

I think for most typical use cases with `spacy train`, this information could be automatically detected. This isn't true for general-purpose `textcat` training, where you might not be able to make an initial pass through all your data, but I think it might be okay to simplify this for `spacy train` and have `spacy train` autodetect these settings. Users should be able to override the autodetected settings with command-line options if needed.

## Autodetection of Model Settings
If training a new model, autodetect would examine the training data (a rough sketch follows these lists):

- `multiclass` if each text has exactly one positive label
- … `True` if all labels are not present on all texts in the data

If extending an existing model:

- `multiclass` vs. `multilabel` would be detected from the existing model
- … `True` if all labels are not present on all texts in the training data (?)
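A minimal sketch of what the autodetection might look like, assuming `examples` is a list of `(text, cats)` pairs where `cats` maps labels to 0.0/1.0 values; the function and the `sparse_labels` setting name are hypothetical:

```python
def autodetect_textcat_settings(examples):
    """Guess textcat model settings from gold annotations (hypothetical)."""
    labels = set()
    multiclass = True
    for text, cats in examples:
        labels.update(cats)
        positive = [label for label, value in cats.items() if value >= 0.5]
        if len(positive) != 1:
            # at least one text has zero or several positive labels
            multiclass = False
    # True if some label is not annotated on some text
    sparse_labels = any(labels - set(cats) for text, cats in examples)
    return {
        "labels": sorted(labels),
        "mode": "multiclass" if multiclass else "multilabel",
        "sparse_labels": sparse_labels,
    }
```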
## Binary Tasks

Binary tasks can be represented as either one-class `multilabel` or two-class `multiclass`. With one-class `multilabel`, the positive label would be the one provided label, but with two-class `multiclass` you'd need to know the positive label to provide a better evaluation. This could potentially be added to the info in `meta.json`
or as a command-line option.

## `GoldParse`/`GoldCorpus`

`GoldParse` supports the textcat annotations as `.cats`. `gold.json_to_tuple()` would need to be updated to read in the `cats` information for `GoldParse`/`GoldCorpus`.

(Are the `jsonl` and `msg` file input options in `spacy.gold` just a sketch at this point?)
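For reference, `GoldParse` already accepts this annotation via its `cats` keyword argument (a minimal example against the v2 API; the label names are made up):

```python
import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp("This is a text about technology.")
# cats maps label -> value, as documented for GoldParse
gold = GoldParse(doc, cats={"TECHNOLOGY": 1.0, "SPORTS": 0.0})
print(gold.cats)  # {'TECHNOLOGY': 1.0, 'SPORTS': 0.0}
```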
## Scorer

There are multiple options for scoring. I think some kind of precision/recall/f-scores are probably okay for most use cases, but feedback/suggestions are welcome.

The main question with f-scores is how to average across labels/instances, especially for multilabel tasks. If I had to pick one option, I might pick the weighted macro average, described as `weighted` here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html, but weighted scores get tricky, and possibly a `macro` average would be more straightforward for a typical user to interpret, especially if the per-label scores are also easy to inspect. `micro` and `macro` averaging would be supported by `PRFScore`, and the only additional functionality would be the weighted average. `Scorer` could be extended with the properties `cats_p/r/f` and `cats_per_cat`, similar to NER.
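A small sketch comparing the averaging options with the scikit-learn function linked above (the arrays are toy multilabel indicator data, not spaCy output):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

for average in ("micro", "macro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(average, p, r, f)

# per-label scores, i.e. what a cats_per_cat property might expose
print(precision_recall_fscore_support(y_true, y_pred, average=None, zero_division=0))
```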
Alternative metrics: …

## `doc.cats` and Multilabel Thresholds

In order to optimize results for a multilabel classification task given a particular evaluation metric (e.g., F0.5-score), you might want to find and store probability thresholds on a per-label basis. I'm not sure I know enough about where/how to do this sensibly, but my initial suggestion would be to use a supplied evaluation metric along with the dev set in `spacy train` to find thresholds and store them in the model as a default. (As something like `cfg['default_thresholds']`?)
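A sketch of how the per-label threshold search on the dev set might work, assuming `dev_scores[label]` holds predicted probabilities (e.g., collected from `doc.cats`) and `dev_truth[label]` the matching gold 0/1 values; the function name and threshold grid are made up:

```python
from sklearn.metrics import fbeta_score

def find_thresholds(dev_scores, dev_truth, beta=0.5):
    thresholds = {}
    for label in dev_scores:
        best_t, best_f = 0.5, -1.0
        # simple grid search over candidate thresholds
        for t in [i / 20 for i in range(1, 20)]:
            pred = [int(score >= t) for score in dev_scores[label]]
            f = fbeta_score(dev_truth[label], pred, beta=beta, zero_division=0)
            if f > best_f:
                best_t, best_f = t, f
        thresholds[label] = best_t
    # could be stored in the model as cfg['default_thresholds']
    return thresholds
```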
I think an alternative to `doc.cats` that just provides a set of positive labels could be useful. In the multiclass case, `argmax` provides the positive label. In the multilabel case, the stored or provided thresholds could be applied to `doc.cats` to produce a set of positive labels. I'm not sure whether this should be stored in `Doc` or provided as a separate utility function like `util.get_positive_cats(nlp, doc, thresholds=thresholds)`, where `nlp` provides the thresholds unless alternate thresholds are specified. (Preferably with a better name than `get_positive_cats`, but I can't think of anything better right now.)
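A sketch of what the proposed utility could look like (`get_positive_cats` is not an existing spaCy function; the `multilabel` flag stands in for however the setting would be looked up from `nlp`):

```python
def get_positive_cats(doc, multilabel=False, thresholds=None):
    if not multilabel:
        # multiclass: argmax provides the single positive label
        return {max(doc.cats, key=doc.cats.get)}
    thresholds = thresholds or {}
    # multilabel: apply stored or provided per-label thresholds to doc.cats
    return {label for label, score in doc.cats.items()
            if score >= thresholds.get(label, 0.5)}
```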
## Tasks

- [ ] add `cats` to JSON training format
- [ ] output `cats` in JSON training format for `Doc` (see discussion in: Fix gold.docs_to_json() to match JSON train format #4013)
- [ ] read `cats` from JSON training format into `GoldParse`/`GoldCorpus`
- [ ] … (in `TextCategorizer`? in `Scorer`?)
- [ ] add `textcat` scoring to `Scorer`
- [ ] `positive_cats`-type set output?
- [ ] add `textcat` to the pipeline options in `spacy train`, including autodetection of model settings

## Related/Future Tasks

…