Annotating BILOU tags from another system #461

Closed
viksit opened this issue Jul 26, 2016 · 5 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@viksit
Contributor

viksit commented Jul 26, 2016

I have a domain specific NER system that generates BILOU tags for a given sentence. What would be the best way to integrate this information into spacy?

In #187, there's an example of how to train the system on new data. But I'm not entirely sure if there's way to do something like,

doc = nlp(u"this is a lion")
custom_ents = get_custom_ents(doc)
# >>  ['O', 'O', 'O', 'U-ANIMAL']
# function called annotate to combine this information into spacy's tokens/spans
annotate(doc, custom_ents) # how do we write this?
print([(i.text, i.label_) for i in doc.ents])
# >> [(lion, 'ANIMAL')]
@syllog1sm
Contributor

You should be able to do:

doc.ents = [(label, start, end) for (label, start, end) in ents]

Example --- label "best buy" as a retailer:

nlp.entity.add_label(u'RETAILER')
retailer = nlp.vocab.strings[u'RETAILER']
doc = nlp(u'best buy is a pretty bad store')
doc.ents = [(retailer, 0, 2)]
span = doc[0:2]
best_buy = list(doc.ents)[0]
assert span.start == best_buy.start == 0
assert span.end == best_buy.end == 2

The API here isn't so polished. I'm surprised that doc.ents = [] doesn't clear the entities. Assignment only adds entities here, which should really be changed.

Here's some more detailed usage description:

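  • Label should be an integer encoding of the label, as looked up in the string store in the example above. You should register it with the NER as well.
  • Start is an integer indicating the start of the slice, i.e. the index of the first token of the entity within the document. Watch out for changed indices from .merge() operations.
  • End is an integer indicating the end of the range. It is exclusive, as in a Python slice: "best buy" above is (retailer, 0, 2) and matches doc[0:2].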
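
To connect this back to the original question, here is a minimal sketch of turning per-token BILOU tags from an external tagger into (label, start, end) triples with an exclusive end. The helper name bilou_to_spans is hypothetical and not part of spaCy, and the labels stay as plain strings here. It assumes one tag per spaCy token and rejects malformed sequences rather than guessing.

```python
def bilou_to_spans(tags):
    """Convert per-token BILOU tags into (label, start, end) triples.

    `end` is exclusive, like a Python slice. Hypothetical helper,
    not part of spaCy.
    """
    spans = []
    start, label = None, None  # currently open multi-token entity, if any
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if start is None:
            # Outside an entity: only O, U-x and B-x are valid here.
            if prefix == "O":
                continue
            elif prefix == "U":
                spans.append((tag_label, i, i + 1))
            elif prefix == "B":
                start, label = i, tag_label
            else:
                raise ValueError("unexpected tag %r at token %d" % (tag, i))
        else:
            # Inside an entity: only I-x and L-x with the same label are valid.
            if prefix == "I" and tag_label == label:
                continue
            elif prefix == "L" and tag_label == label:
                spans.append((label, start, i + 1))
                start, label = None, None
            else:
                raise ValueError("unexpected tag %r at token %d" % (tag, i))
    if start is not None:
        raise ValueError("entity opened at token %d is never closed" % start)
    return spans


print(bilou_to_spans(["O", "O", "O", "U-ANIMAL"]))
# prints [('ANIMAL', 3, 4)]
```

With the API described in this thread, each string label would still need to be mapped to its integer ID (nlp.vocab.strings[label], as in the example above) before the triples are assigned to doc.ents.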

Finally, here's the relevant code:

https://github.com/spacy-io/spaCy/blob/master/spacy/tokens/doc.pyx#L178

@viksit
Contributor Author

viksit commented Jul 26, 2016

@syllog1sm awesome, thanks for the information.

@viksit
Contributor Author

viksit commented Aug 3, 2016

@syllog1sm a couple of follow-ups.


animal = nlp.vocab.strings[u"ANIMAL"]
doc1 = nlp(u"this is a lion and that is a royal bengal tiger that Michael Collins loved on the Apollo 11")

print()
print(list(doc1.ents))
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])


>> [Michael Collins, Apollo]
>> [(u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'B', 'O']

old = [(i.label, i.start, i.end) for i in doc1.ents]

# derived from external ner
new = [(animal, 3, 4), (animal, 8, 11)]

doc1.ents = old + new
print()
print([(i.text, i.label_) for i in doc1.ents])
print([i.ent_iob_ for i in doc1])
print("entities: ", list(doc1.ents))

>> [(u'lion', u'ANIMAL'), (u'royal bengal tiger', u'ANIMAL'), (u'Michael Collins', u'PERSON'), (u'Apollo', u'ORG')]
>> ['', '', '', 'B', '', '', '', '', 'B', 'I', 'I', '', 'B', 'I', '', '', '', 'B', '']
>> entities:  [lion, royal bengal tiger, Michael Collins, Apollo]

lion = doc1[3:4]
rbt = doc1[8:11]
lion_ent, rbt_ent, mc, apollo = list(doc1.ents)

assert lion_ent.start == lion.start
assert rbt_ent.start == rbt.start

Questions:

  • It looks like spacy loses the 'O' tag after adding new entities. Is this on purpose?
  • I don't see L or U tags anywhere - why's that?
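
For comparison, here is a small sketch of the IOB tags one would expect if untagged tokens kept their 'O' marker after the new entities were added. It is pure Python and does not reflect how spaCy computes ent_iob_ internally. The helper name spans_to_iob is hypothetical, and the token count of 19 is taken from the length of the tag lists printed above.

```python
def spans_to_iob(n_tokens, spans):
    """Build a bare IOB tag list from (start, end) token spans, end exclusive.

    Hypothetical helper for illustration, not part of spaCy.
    """
    iob = ["O"] * n_tokens  # every token starts outside any entity
    for start, end in spans:
        iob[start] = "B"  # first token of the entity
        for i in range(start + 1, end):
            iob[i] = "I"  # remaining tokens of the entity
    return iob


# Spans from the example: lion, royal bengal tiger, Michael Collins, Apollo
spans = [(3, 4), (8, 11), (12, 14), (17, 18)]
expected = spans_to_iob(19, spans)
print(expected)
```

The result has 'B' and 'I' in the same positions as the output above, with 'O' where the empty strings appear.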

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Oct 21, 2016
@honnibal
Member

Thanks for the report — fixed.

"I don't see L or U tags anywhere - why's that?"

Currently the ent_iob field stores the IOB markers, even though the model is trained with BILUO tags. Maybe this should change — if you want to advocate for that, it's best if we start a new thread.
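
Until then, L and U markers can be derived from the stored IOB markers, since an entity ends wherever the next token is not 'I'. The following is a minimal sketch under that assumption. The helper name iob_to_biluo is hypothetical and not part of spaCy, it works on bare markers such as the ent_iob_ values shown earlier in the thread, and it treats anything other than 'B' or 'I' (including an empty string) as 'O'.

```python
def iob_to_biluo(iob_tags):
    """Convert bare IOB markers ('B', 'I', 'O') into BILUO markers.

    Hypothetical helper for illustration, not part of spaCy.
    """
    biluo = []
    n = len(iob_tags)
    for i, tag in enumerate(iob_tags):
        # The current entity continues only if the next marker is 'I'.
        continues = i + 1 < n and iob_tags[i + 1] == "I"
        if tag == "B":
            biluo.append("B" if continues else "U")
        elif tag == "I":
            biluo.append("I" if continues else "L")
        else:
            biluo.append("O")
    return biluo


print(iob_to_biluo(["O", "B", "I", "O", "B", "O"]))
# prints ['O', 'B', 'L', 'O', 'U', 'O']
```

The entity label for each token would still come from the token's entity type; this sketch only rewrites the position markers.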

honnibal added a commit that referenced this issue Oct 23, 2016
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018