
Missing comments and abstracts for English for multiple articles #714

Open
pkleef opened this issue Sep 13, 2021 · 15 comments
Labels
priority (issues to be discussed by the dev-team) · status: accepted · status: fix-required (PR related to issue is needed) · status: test-method-required · type: data

Comments

pkleef commented Sep 13, 2021

Issue validity

Some explanation: the DBpedia Snapshot is produced every three months (see Release Frequency & Schedule) and is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden, here: http://dief.tools.dbpedia.org/server/extraction/en/
If the issue persists, please post the link from your browser here:

https://dbpedia.org/resource/Eating_your_own_dog_food?lang=*
https://dbpedia.org/resource/Paul_Erd%C5%91s?lang=*

NOTE: http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food returns an error at this time

Error Description

Please state the nature of your technical emergency:

I received several reports of articles with missing English (and possibly other language) triples for dbo:abstract and dbo:comment.

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

The error occurs on the current 2021-06 snapshot of the Databus dump that is loaded on http://dbpedia.org/sparql

jlareck (Collaborator) commented Sep 14, 2021

It is very interesting that this error also occurs with http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food , because when I run the server locally on my machine, it extracts abstracts for English:
(screenshot of the local extraction result)

So I guess the error on the server may be related to the configuration used by dief.tools.dbpedia.org/server/ (maybe it uses an old version of the extraction framework).

@JJ-Author (Contributor):

@jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub? I think sometimes the cronjob fails or the redeploy script is not mature yet. So yes, the service was out of date, but that is hard to recognize. Displaying this simple piece of information at http://dief.tools.dbpedia.org/server/extraction/en/ could really help.

jlareck (Collaborator) commented Sep 15, 2021

@JJ-Author it is not completely clear to me what I need to do. So I need to find the commit where this error occurs, right?

@JJ-Author (Contributor):

No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage. You could use something like https://github.com/git-commit-id/git-commit-id-maven-plugin.
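
For illustration, a minimal Scala sketch of how the webpage could pick up that information, assuming the plugin is configured to generate a git.properties file on the classpath (the property keys are the plugin's defaults; the object name, link target, and HTML rendering are made up):

```scala
import java.util.Properties

// Reads the git.properties file that git-commit-id-maven-plugin can generate at
// build time and turns it into a small HTML snippet for the DIEF server page.
// The resource name and property keys are the plugin's defaults; everything else
// (object name, link target, markup) is illustrative.
object BuildInfo {
  private val props = new Properties()
  Option(getClass.getResourceAsStream("/git.properties")).foreach(in => props.load(in))

  val commit: String = props.getProperty("git.commit.id.abbrev", "unknown")
  val branch: String = props.getProperty("git.branch", "unknown")

  def asHtml: String =
    s"""<p>Build: branch <code>$branch</code>, commit
       |<a href="https://github.com/dbpedia/extraction-framework/commit/$commit">$commit</a></p>""".stripMargin
}
```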

@JJ-Author (Contributor):

By the way, I updated the webservice manually now. But we don't know for sure, because we don't see which commit it is using.

jlareck (Collaborator) commented Sep 15, 2021

Oh, well, I also noticed another thing. Maybe the problem was an incorrect usage of the API server call, because this URL works fine: http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Eating+your+own+dog+food&revid=&format=trix&extractors=custom . @pkleef could you please check it and say whether this is the expected result? Or maybe I misunderstand what the result on the server should be.
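
For reference, a small Scala sketch of how that endpoint could be checked from a script; only the URL and its query parameters come from the link above, while the object name and the counting heuristic are illustrative:

```scala
import java.net.URLEncoder
import scala.io.Source

// Calls the DIEF extraction endpoint for one article and reports how many
// dbo:abstract triples the returned TriX document appears to contain.
// Only the endpoint URL and its query parameters come from the link above;
// the counting heuristic (searching for the predicate IRI per line) is a rough check.
object CheckExtraction {
  def main(args: Array[String]): Unit = {
    val title = if (args.nonEmpty) args(0) else "Eating your own dog food"
    val url = "http://dief.tools.dbpedia.org/server/extraction/en/extract" +
      "?title=" + URLEncoder.encode(title, "UTF-8") +
      "&revid=&format=trix&extractors=custom"

    val trix = Source.fromURL(url, "UTF-8").mkString
    val abstracts = trix.linesIterator.count(_.contains("http://dbpedia.org/ontology/abstract"))
    println(s"Fetched ${trix.length} characters; lines mentioning dbo:abstract: $abstracts")
  }
}
```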

jlareck (Collaborator) commented Sep 16, 2021

> @jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub?

> No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage.

@JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page and link to the commit on GitHub. Should I add it somewhere at the top of the page (for example near the ontology) or in the footer?
(screenshot of the server page)

pkleef (Author) commented Sep 16, 2021

@jlareck I can confirm that your DIEF tools link for the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint.

See this result:

https://dbpedia.org/sparql?query=select+lang%28%3Fcomment%29++%3Fcomment+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FEating_your_own_dog_food%3E+rdfs%3Acomment+%3Fcomment%7D
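
For readability, the query in that link decodes to a simple rdfs:comment language check; here is a minimal Scala sketch for repeating it from the command line (the query text is decoded from the URL above, while the CSV format parameter and the object name are illustrative):

```scala
import java.net.URLEncoder
import scala.io.Source

// Re-runs the language check from the SPARQL link above and prints the raw result.
// The query text is decoded from that link; the CSV format parameter and the
// object name are illustrative.
object CheckCommentLanguages {
  val query: String =
    """select lang(?comment) ?comment where {
      |  <http://dbpedia.org/resource/Eating_your_own_dog_food> rdfs:comment ?comment
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val url = "https://dbpedia.org/sparql?format=text%2Fcsv&query=" +
      URLEncoder.encode(query, "UTF-8")
    println(Source.fromURL(url, "UTF-8").mkString)
  }
}
```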

My main concern is that the Databus dump apparently was reported as successful, yet for a number of articles it did not contain the English abstracts and comments.

The DBpedia team needs to figure out why these comments were not dumped, as this could be an indication that extraction errors are not properly caught and reported.


As a side note for the DIEF tool, I see I used the wrong URL form:

http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food

but I was not expecting a Java exception (com.sun.jersey.api.NotFoundException: null for uri).

Would it be possible to add some argument checking and produce a slightly more informative error page?

@JJ-Author (Contributor):

> @jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub?

> No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage.

> @JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page and link to the commit on GitHub. Should I add it somewhere at the top of the page (for example near the ontology) or in the footer?

Yes, I think this would make sense, right? So that we always know whether we are using the latest code.

@JJ-Author (Contributor):

> @jlareck I can confirm that your DIEF tools link for the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint.
>
> See this result:
>
> https://dbpedia.org/sparql?query=select+lang%28%3Fcomment%29++%3Fcomment+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FEating_your_own_dog_food%3E+rdfs%3Acomment+%3Fcomment%7D
>
> My main concern is that the Databus dump apparently was reported as successful, yet for a number of articles it did not contain the English abstracts and comments.

@Vehnem @kurzum maybe it makes sense to have some metrics here, like the number of abstracts in total and compared to the total number of entities, so that we can track whether abstracts are increasing or decreasing from release to release. And maybe track this for other artifacts as well, perhaps using the VoID mods?
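
A rough sketch of what such a metric could look like, here as two counts against the loaded endpoint (the properties, query shape, and helper are only an example; in the release pipeline the same counts could equally be computed directly on the dump files):

```scala
import java.net.URLEncoder
import scala.io.Source

// Illustrative release metric: count English abstracts and English labels on the
// loaded endpoint, so the ratio can be tracked from release to release. The
// properties and query shape are only an example; in practice the counts could be
// computed on the dump files instead of the (timeout-limited) public endpoint.
object AbstractCoverage {
  private def count(query: String): String = {
    val url = "https://dbpedia.org/sparql?format=text%2Fcsv&query=" +
      URLEncoder.encode(query, "UTF-8")
    Source.fromURL(url, "UTF-8").mkString.trim
  }

  def main(args: Array[String]): Unit = {
    val abstracts = count(
      "select (count(*) as ?n) where { ?s <http://dbpedia.org/ontology/abstract> ?a . filter(lang(?a) = 'en') }")
    val labels = count(
      "select (count(*) as ?n) where { ?s rdfs:label ?l . filter(lang(?l) = 'en') }")
    println(s"English abstracts:\n$abstracts\nEnglish labels:\n$labels")
  }
}
```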

> The DBpedia team needs to figure out why these comments were not dumped, as this could be an indication that extraction errors are not properly caught and reported.

@jlareck do you know whether exception statistics / a summary for the extraction are written in general? I know there is logging of exceptions.

@pkleef my best guess is that the commit used for the 2021-06 extraction did not have the fix yet. @Vehnem @jlareck is there a way to determine the commit hash for a MARVIN extraction now?
But in general I assume there is a gap in terminology. "Successful" so far means that nothing crashed or aborted (so, ideally, no missing files). But indeed, missing triples or the number of exceptions per extractor could be used as quality indicators to judge a "successful" release in the future.

jlareck (Collaborator) commented Sep 20, 2021

@JJ-Author I think exception statistics and a summary are written for each language wikidump separately. So, as I understand it, we can see how many pages were successfully extracted and how many failed, for example after the extraction of the English wikidump.

jlareck (Collaborator) commented Sep 20, 2021

Well, I found the reason why Eating_your_own_dog_food was not extracted. During the English extraction there were too many requests to the Wikimedia API, and that is why the extraction of this page failed. Here is the error for this page:

Exception; en; Main Extraction at 46:38.508s for 4 datasets; Main Extraction failed for instance http://dbpedia.org/resource/Eating_your_own_dog_food: Server returned HTTP response code: 429 for URL: https://en.wikipedia.org/w/api.php 

This error occurred very often during the June extraction: for the English dump, there were 910000 exceptions of the form "Server returned HTTP response code: 429" during the extraction.
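
For what it's worth, a small sketch of how such a number can be pulled out of an extraction log (the log file path is a placeholder; the matched substring is taken from the exception message quoted above):

```scala
import scala.io.Source

// Counts how often the Wikimedia rate-limit error shows up in an extraction log.
// The log file path is a placeholder; the matched substring is taken from the
// exception message quoted above.
object Count429 {
  def main(args: Array[String]): Unit = {
    val logFile = if (args.nonEmpty) args(0) else "enwiki-extraction.log" // placeholder path
    val source = Source.fromFile(logFile, "UTF-8")
    try {
      val hits = source.getLines().count(_.contains("HTTP response code: 429"))
      println(s"$logFile: $hits occurrences of HTTP 429")
    } finally source.close()
  }
}
```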

kurzum added the priority label (issues to be discussed by the dev-team) on Sep 21, 2021
jlareck (Collaborator) commented Sep 21, 2021

Marvin and I checked the logs one more time today, and this exception occurred not 910000 but 455000 times during the June extraction (I didn't calculate it correctly the first time). But that is still a huge number. Another interesting point is that during the August extraction this error occurred only 723 times for English (nothing related to requests in the Extraction Framework was changed during this period). Also, we compared the number of triples in each dataset (June and August): in June the number of triples was 5460872, and in August there were 5952058 extracted abstracts. This is still very confusing and we are trying to investigate it further.

@JJ-Author (Contributor):

The Wikipedia API is heavily used, so maybe there needs to be some kind of request control so that we do not fire too many requests per second. I can imagine that they have a load balancer, so when there is not much load on the system they are gracious. It may also be worth having a look at https://www.mediawiki.org/wiki/API:Etiquette
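
One possible shape of such request control, sketched in Scala: keep a minimum delay between requests and back off exponentially on HTTP 429. The delays, retry limit, and user agent are invented for this sketch and are not what DIEF currently does:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch of client-side request control for the Wikimedia API: keep a minimum
// delay between requests and back off exponentially on HTTP 429. The delays,
// retry limit, and user agent are illustrative values, not DIEF's current behaviour.
object PoliteFetcher {
  private val minDelayMs = 100L // at most roughly 10 requests per second
  private var lastRequest = 0L

  def fetch(url: String, maxRetries: Int = 5): String = {
    var attempt = 0
    while (true) {
      val wait = lastRequest + minDelayMs - System.currentTimeMillis()
      if (wait > 0) Thread.sleep(wait)
      lastRequest = System.currentTimeMillis()

      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestProperty("User-Agent", "DBpedia-DIEF/issue-714 (example contact)")
      val code = conn.getResponseCode
      if (code == 200) {
        return Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
      }
      conn.disconnect()
      if (code == 429 && attempt < maxRetries) {
        Thread.sleep(math.min(60000L, 1000L * (1L << attempt))) // exponential backoff
        attempt += 1
      } else {
        throw new RuntimeException(s"HTTP $code for $url after $attempt retries")
      }
    }
    sys.error("unreachable")
  }
}
```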

@JJ-Author (Contributor):

As I said, the number of triples is more expressive when compared to the number of articles extracted. But to me this seems reasonable: fewer failed requests -> more triples.
