[NER] More fine-grained set definition regarding locations #59

proycon · 2018-08-21T17:20:13Z

Currently the NER module in Frog distinguishes persons, locations, events, products(?) and miscellaneous.

Since the module has been enhanced with gazetteers, I think we can do better than this coarse division. Various named entities are perfectly enumerable: countries, cities, street names, postal codes, rivers, forests, mountains... Gazetteers serve well here, and it would be a waste to lose this information by subsuming it all under "location". We already have a FoLiA set definition (https://github.com/proycon/folia/blob/master/setdefinitions/namedentities.foliaset.ttl) from a prior project that allows for a more fine-grained taxonomy regarding locations, which is compatible with (i.e. a superset of) our current set.

Databases such as Geonames also contain this information, and we currently don't make use of it. I propose we migrate to a more fine-grained set (and include a few more gazetteers where possible). What do you think @kosloot @antalvdb @Irishx ?

Context: this is relevant for our 112-project (@HenkvdHeuvel), where we need to know whether a location is a street, city, etc. I think we can include a lot of these gazetteer-based improvements in the Frog data itself, i.e. the generic Dutch model (as it's not sensitive data).

(technicality: this is more of a frogdata issue than a Frog issue as such, but I guess it's more visible here)

kosloot · 2018-08-22T07:42:16Z

As far as I can see, the software itself doesn't impose restrictions, so this is indeed a data question.
I did use a small part of Geonames to test, and it is usable. But there are a lot of ugly details to consider.
The data can be polluted and (very) ambiguous.
So using this data might need some investigation, and probably preprocessing.

proycon · 2018-08-22T09:35:27Z

Another good (secondary) source for location data is OpenStreetMap; I experimented with that yesterday. It's fairly easy to extract all streets and cities/towns.
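To illustrate the kind of extraction meant here, below is a minimal sketch that pulls street and settlement names out of OpenStreetMap XML. It is not the script used for frogdata. It assumes the usual OSM tagging conventions (streets are ways carrying a `highway` tag, settlements are nodes with `place` set to `city`, `town` or `village`). The function name `extract_names` and the inline sample are hypothetical, and the sample is a hand-written toy rather than real OSM data.

```python
import xml.etree.ElementTree as ET

# Hand-written toy input in the shape of an OSM XML export (not real data).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1">
    <tag k="place" v="city"/>
    <tag k="name" v="Amsterdam"/>
  </node>
  <node id="2">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Some Cafe"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Waal"/>
  </way>
</osm>"""

SETTLEMENT_TYPES = ("city", "town", "village")


def extract_names(osm_xml):
    """Return (streets, places) as two sets of names found in the OSM XML."""
    root = ET.fromstring(osm_xml)
    streets, places = set(), set()
    for elem in root:
        if elem.tag not in ("node", "way"):
            continue
        # Collect the key/value tags attached to this element.
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        name = tags.get("name")
        if not name:
            continue
        if elem.tag == "way" and "highway" in tags:
            streets.add(name)
        elif elem.tag == "node" and tags.get("place") in SETTLEMENT_TYPES:
            places.add(name)
    return streets, places


streets, places = extract_names(SAMPLE_OSM)
```

Note that `highway` also covers footpaths, cycleways and the like, so a real extraction would probably want to filter on the tag value as well.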

proycon · 2018-11-15T11:09:04Z

This is also relevant for @Irishx (frog evaluation) and @HenkvdHeuvel (112 project), and perhaps @antalvdb:

Okay, things are a bit more complex. We have some ordering problems. The current situation:

[Kobus] if it's ambiguous, the last gazetteer wins, I think
I think it goes like this:
IF no timbl tag has been assigned
and gazetteer info IS known
then take that

[Kobus] everything gets stuffed into one big hash
the last one counts
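The behaviour described above can be sketched roughly as follows. This is a toy model under stated assumptions, not Frog's actual code: the function names `load_gazetteers` and `tag_entity` are hypothetical, a missing context tag is represented as `None`, and the gazetteer order shown is only an example.

```python
from typing import Dict, List, Optional, Tuple


def load_gazetteers(gazetteers: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Merge all gazetteers into one dict, in the order given."""
    known: Dict[str, str] = {}
    for label, entries in gazetteers:
        for entry in entries:
            # A later gazetteer silently overwrites an earlier one.
            known[entry] = label
    return known


def tag_entity(text: str, context_tag: Optional[str],
               known: Dict[str, str]) -> Optional[str]:
    """Consult the gazetteers only if the context-based tagger gave no tag."""
    if context_tag is not None:
        return context_tag
    return known.get(text)


# Example ordering (an assumption): the city list comes last, so it wins.
gazetteers = [
    ("loc.water.sea", ["Noordzee"]),
    ("loc.street", ["Waal"]),
    ("loc.city", ["Amsterdam", "Noordzee"]),
]
known = load_gazetteers(gazetteers)
```

With this ordering, `known["Noordzee"]` ends up as `loc.city` because the city list is loaded last, and any tag from the context-based tagger shadows the gazetteers entirely.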

I have a test sentence:

De Maas en de Waal stromen niet door Amsterdam, maar monden wel uit in de Noordzee

This results in four loc detections (from the context-based module), which is correct but doesn't make use of the gazetteers so we don't get any of the fine-grained categories, which was kind of the whole point of this exercise.

If I use a Frog trained on the much more limited model from the 112 project, the gazetteers do kick in and I now get:

De Maas -- per (error from the context-based module, I presume)
Waal -- loc.street
Amsterdam -- loc.city
de Noordzee -- loc.city

There's a street named Waal, which is not surprising as there are streets named after pretty much everything, so streets should get a lower priority. Apparently there's also a village called "Noordzee", which happens to take precedence over loc.water.sea.

I'm trying to find the 'optimal' ordering for ners.known, which is tricky enough as there is always ambiguity and you can never get it really right. The bigger problem is that I can't override the context-based NER module here. It would help (feature request) if we had a parameter to give the context-based module the highest priority, the lowest priority, or to disable it completely (the latter might be interesting if you want to rely on gazetteers only, for instance for speed, which may be an important factor in the 112 project).
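To make the feature request concrete, here is a minimal sketch of what such a parameter could do. Everything in it is hypothetical: Frog has no `ContextPriority` setting or `resolve` function that I know of, and the names are only for illustration.

```python
from enum import Enum
from typing import Optional


class ContextPriority(Enum):
    """Hypothetical setting for how much the context-based module counts."""
    HIGHEST = "highest"    # context tag wins, gazetteer is the fallback
    LOWEST = "lowest"      # gazetteer tag wins, context is the fallback
    DISABLED = "disabled"  # gazetteers only


def resolve(context_tag: Optional[str], gazetteer_tag: Optional[str],
            mode: ContextPriority) -> Optional[str]:
    """Pick the final tag for an entity according to the chosen mode."""
    if mode is ContextPriority.DISABLED:
        return gazetteer_tag
    if mode is ContextPriority.HIGHEST:
        return context_tag if context_tag is not None else gazetteer_tag
    # LOWEST: prefer the gazetteer, fall back on the context-based tag.
    return gazetteer_tag if gazetteer_tag is not None else context_tag
```

In the test sentence, `resolve("loc", "loc.city", ContextPriority.LOWEST)` would let the fine-grained gazetteer label through for Amsterdam, whereas `HIGHEST` corresponds to the current situation in which the coarse `loc` wins.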

Opinions?

Do we want to merge the new gazetteers into frogdata master despite the problems (the new lists are technically superior, i.e. more complete, with more fine-grained categories)? Or do we keep the old status quo for now?

Irishx · 2018-11-15T11:24:17Z

I would like to test the NER with and without these gazetteers to see what the effect is.

proycon · 2018-11-15T16:00:08Z

That sounds like a good idea, yes. You can at least disable the gazetteers yourself by editing the Frog configuration file and ners.known.

kosloot · 2023-03-07T08:20:24Z

@proycon and @Irishx Can we close this as "solved" for now? Or?

Irishx · 2023-03-07T08:37:24Z

Hi, yes, we can close this. Regards, Iris

proycon added enhancement NER labels Aug 21, 2018

proycon self-assigned this Aug 21, 2018

proycon added a commit to LanguageMachines/frogdata that referenced this issue Aug 22, 2018

introducing more fine-grained labels (LanguageMachines/frog#59), renamed some files to make clear it's about organisations rather than locations, added a provinces list

e5e8278

proycon added a commit to LanguageMachines/frogdata that referenced this issue Aug 23, 2018

adapted FoLiA set definition (LanguageMachines/frog#59)

11a181a

proycon added a commit to proycon/folia that referenced this issue Aug 23, 2018

expanding named entity set definition (LanguageMachines/frog#59)

d3717e8

proycon added a commit to proycon/folia that referenced this issue Aug 23, 2018

named entity set definition fixes and enhancements (LanguageMachines/frog#59)

fa59079

proycon added a commit to LanguageMachines/frogdata that referenced this issue Aug 30, 2018

split geonames into various subparts (removed the old ones), added wereldsteden. Using the new tagset (LanguageMachines/frog#59)

87f7297
