
Missing comments and abstracts for English for multiple articles #714

Open
pkleef opened this issue Sep 13, 2021 · 15 comments
Labels
priority (issues to be discussed by the dev-team) · status: accepted · status: fix-required (PR related to issue is needed) · status: test-method-required · type: data

Comments

pkleef commented Sep 13, 2021

Issue validity

Some explanation: the DBpedia Snapshot is produced every three months (see Release Frequency & Schedule) and is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden, here: http://dief.tools.dbpedia.org/server/extraction/en/
If the issue persists, please post the link from your browser here:

https://dbpedia.org/resource/Eating_your_own_dog_food?lang=*
https://dbpedia.org/resource/Paul_Erd%C5%91s?lang=*

NOTE: http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food returns an error at this time

Error Description

Please state the nature of your technical emergency:

I received several reports of articles with missing English (and possibly other language) triples for dbo:abstract and dbo:comment.

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

The error occurs on the current 2021-06 snapshot of the Databus dump that is loaded on http://dbpedia.org/sparql

jlareck (Collaborator) commented Sep 14, 2021

It is very interesting that this error also occurs with http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food , because when I run the server locally on my machine, it extracts abstracts for English:
(screenshot of the local extraction result)

So I guess the error on the server may be related to the configuration used by dief.tools.dbpedia.org/server/ (maybe it uses an old version of the extraction framework).

@JJ-Author (Contributor):

@jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub? I think sometimes the cronjob fails or the redeploy script is not mature yet. So yes, the service was out of date, but that is hard to recognize. Displaying this simple piece of information at http://dief.tools.dbpedia.org/server/extraction/en/ could really help.

jlareck (Collaborator) commented Sep 15, 2021

@JJ-Author it is not completely clear to me what I need to do. So I need to find the commit where this error occurs, right?

@JJ-Author (Contributor):

No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage. You could use something like https://github.com/git-commit-id/git-commit-id-maven-plugin.
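
For illustration, a minimal Scala sketch of how the webpage could pick up that information, assuming the plugin is configured to generate a git.properties file on the classpath (the property keys are the plugin's defaults; the object name, link target, and HTML rendering are made up):

```scala
import java.util.Properties

// Reads the git.properties file that git-commit-id-maven-plugin can generate at
// build time and turns it into a small HTML snippet for the DIEF server page.
// The resource name and property keys are the plugin's defaults; everything else
// (object name, link target, markup) is illustrative.
object BuildInfo {
  private val props = new Properties()
  Option(getClass.getResourceAsStream("/git.properties")).foreach(in => props.load(in))

  val commit: String = props.getProperty("git.commit.id.abbrev", "unknown")
  val branch: String = props.getProperty("git.branch", "unknown")

  def asHtml: String =
    s"""<p>Build: branch <code>$branch</code>, commit
       |<a href="https://github.com/dbpedia/extraction-framework/commit/$commit">$commit</a></p>""".stripMargin
}
```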

@JJ-Author (Contributor):

By the way, I updated the webservice manually now. But we don't know for sure, because we don't see which commit it is using.

jlareck (Collaborator) commented Sep 15, 2021

Oh, well, I also noticed another thing. Maybe the problem was an incorrect usage of the API server call, because this URL works fine: http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Eating+your+own+dog+food&revid=&format=trix&extractors=custom . @pkleef could you please check it and say whether this is the expected result? Or maybe I misunderstand what the result on the server should be.
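
For reference, a small Scala sketch of how that endpoint could be checked from a script; only the URL and its query parameters come from the link above, while the object name and the counting heuristic are illustrative:

```scala
import java.net.URLEncoder
import scala.io.Source

// Calls the DIEF extraction endpoint for one article and reports how many
// dbo:abstract triples the returned TriX document appears to contain.
// Only the endpoint URL and its query parameters come from the link above;
// the counting heuristic (searching for the predicate IRI per line) is a rough check.
object CheckExtraction {
  def main(args: Array[String]): Unit = {
    val title = if (args.nonEmpty) args(0) else "Eating your own dog food"
    val url = "http://dief.tools.dbpedia.org/server/extraction/en/extract" +
      "?title=" + URLEncoder.encode(title, "UTF-8") +
      "&revid=&format=trix&extractors=custom"

    val trix = Source.fromURL(url, "UTF-8").mkString
    val abstracts = trix.linesIterator.count(_.contains("http://dbpedia.org/ontology/abstract"))
    println(s"Fetched ${trix.length} characters; lines mentioning dbo:abstract: $abstracts")
  }
}
```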

jlareck (Collaborator) commented Sep 16, 2021

> @jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub?

> No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage.

@JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page and link to the commit on GitHub. Should I add it somewhere at the top of the page (for example near the ontology) or in the footer?
(screenshot of the server page)

pkleef (Author) commented Sep 16, 2021

@jlareck I can confirm that your DIEF tools link for the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint.

See this result:

https://dbpedia.org/sparql?query=select+lang%28%3Fcomment%29++%3Fcomment+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FEating_your_own_dog_food%3E+rdfs%3Acomment+%3Fcomment%7D
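
For readability, the query in that link decodes to a simple rdfs:comment language check; here is a minimal Scala sketch for repeating it from the command line (the query text is decoded from the URL above, while the CSV format parameter and the object name are illustrative):

```scala
import java.net.URLEncoder
import scala.io.Source

// Re-runs the language check from the SPARQL link above and prints the raw result.
// The query text is decoded from that link; the CSV format parameter and the
// object name are illustrative.
object CheckCommentLanguages {
  val query: String =
    """select lang(?comment) ?comment where {
      |  <http://dbpedia.org/resource/Eating_your_own_dog_food> rdfs:comment ?comment
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val url = "https://dbpedia.org/sparql?format=text%2Fcsv&query=" +
      URLEncoder.encode(query, "UTF-8")
    println(Source.fromURL(url, "UTF-8").mkString)
  }
}
```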

My main concern is that the Databus dump apparently was reported as successful, yet for a number of articles it did not contain the English abstracts and comments.

The DBpedia team needs to figure out why these comments were not dumped, as this could be an indication that extraction errors are not properly caught and reported.


As a side note for the DIEF tool, I see I used the wrong URL form:

http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food

but I was not expecting a Java exception (com.sun.jersey.api.NotFoundException: null for uri).

Would it be possible to add some argument checking and produce a slightly more informative error page?

@JJ-Author (Contributor):

> @jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub?

> No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage.

> @JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page and link to the commit on GitHub. Should I add it somewhere at the top of the page (for example near the ontology) or in the footer?

Yes, I think this would make sense, right? So that we always know whether we are using the latest code.

@JJ-Author (Contributor):

> @jlareck I can confirm that your DIEF tools link for the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint.
>
> See this result:
>
> https://dbpedia.org/sparql?query=select+lang%28%3Fcomment%29++%3Fcomment+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FEating_your_own_dog_food%3E+rdfs%3Acomment+%3Fcomment%7D
>
> My main concern is that the Databus dump apparently was reported as successful, yet for a number of articles it did not contain the English abstracts and comments.

@Vehnem @kurzum maybe it makes sense to have some metrics here, like the number of abstracts in total and compared to the total number of entities, so that we can track whether abstracts are increasing or decreasing from release to release. And maybe track this for other artifacts as well, perhaps using the VoID mods?
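
A rough sketch of what such a metric could look like, here as two counts against the loaded endpoint (the properties, query shape, and helper are only an example; in the release pipeline the same counts could equally be computed directly on the dump files):

```scala
import java.net.URLEncoder
import scala.io.Source

// Illustrative release metric: count English abstracts and English labels on the
// loaded endpoint, so the ratio can be tracked from release to release. The
// properties and query shape are only an example; in practice the counts could be
// computed on the dump files instead of the (timeout-limited) public endpoint.
object AbstractCoverage {
  private def count(query: String): String = {
    val url = "https://dbpedia.org/sparql?format=text%2Fcsv&query=" +
      URLEncoder.encode(query, "UTF-8")
    Source.fromURL(url, "UTF-8").mkString.trim
  }

  def main(args: Array[String]): Unit = {
    val abstracts = count(
      "select (count(*) as ?n) where { ?s <http://dbpedia.org/ontology/abstract> ?a . filter(lang(?a) = 'en') }")
    val labels = count(
      "select (count(*) as ?n) where { ?s rdfs:label ?l . filter(lang(?l) = 'en') }")
    println(s"English abstracts:\n$abstracts\nEnglish labels:\n$labels")
  }
}
```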

> The DBpedia team needs to figure out why these comments were not dumped, as this could be an indication that extraction errors are not properly caught and reported.

@jlareck do you know whether exception statistics / a summary for the extraction are written in general? I know there is logging of exceptions.

@pkleef my best guess is that the commit used for the 2021-06 extraction did not have the fix yet. @Vehnem @jlareck is there a way to determine the commit hash for a MARVIN extraction now?
But in general I assume there is a gap in terminology. "Successful" so far means that nothing crashed or aborted (so, ideally, no missing files). But indeed, missing triples or the number of exceptions per extractor could be used as quality indicators to judge a "successful" release in the future.

jlareck (Collaborator) commented Sep 20, 2021

@JJ-Author I think exception statistics and a summary are written for each language wikidump separately. So, as I understand it, we can see how many pages were successfully extracted and how many failed, for example after the extraction of the English wikidump.

jlareck (Collaborator) commented Sep 20, 2021

Well, I found the reason why Eating_your_own_dog_food was not extracted. During the English extraction there were too many requests to the Wikimedia API, and that is why the extraction of this page failed. Here is the error for this page:

Exception; en; Main Extraction at 46:38.508s for 4 datasets; Main Extraction failed for instance http://dbpedia.org/resource/Eating_your_own_dog_food: Server returned HTTP response code: 429 for URL: https://en.wikipedia.org/w/api.php 

This error occurred very often during the June extraction: for the English dump, there were 910000 exceptions of the form "Server returned HTTP response code: 429" during the extraction.
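
For what it's worth, a small sketch of how such a number can be pulled out of an extraction log (the log file path is a placeholder; the matched substring is taken from the exception message quoted above):

```scala
import scala.io.Source

// Counts how often the Wikimedia rate-limit error shows up in an extraction log.
// The log file path is a placeholder; the matched substring is taken from the
// exception message quoted above.
object Count429 {
  def main(args: Array[String]): Unit = {
    val logFile = if (args.nonEmpty) args(0) else "enwiki-extraction.log" // placeholder path
    val source = Source.fromFile(logFile, "UTF-8")
    try {
      val hits = source.getLines().count(_.contains("HTTP response code: 429"))
      println(s"$logFile: $hits occurrences of HTTP 429")
    } finally source.close()
  }
}
```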

kurzum added the priority label (issues to be discussed by the dev-team) on Sep 21, 2021
jlareck (Collaborator) commented Sep 21, 2021

Marvin and I checked the logs one more time today, and this exception occurred not 910000 but 455000 times during the June extraction (I didn't calculate it correctly the first time). But that is still a huge number. Another interesting point is that during the August extraction this error occurred only 723 times for English (nothing related to requests in the Extraction Framework was changed during this period). Also, we compared the number of triples in each dataset (June and August): in June the number of triples was 5460872, and in August there were 5952058 extracted abstracts. This is still very confusing and we are trying to investigate it further.

@JJ-Author (Contributor):

The Wikipedia API is heavily used, so maybe there needs to be some kind of request control so that we do not fire too many requests per second. I can imagine that they have a load balancer, so when there is not much load on the system they are gracious. It may also be worth having a look at https://www.mediawiki.org/wiki/API:Etiquette
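
One possible shape of such request control, sketched in Scala: keep a minimum delay between requests and back off exponentially on HTTP 429. The delays, retry limit, and user agent are invented for this sketch and are not what DIEF currently does:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch of client-side request control for the Wikimedia API: keep a minimum
// delay between requests and back off exponentially on HTTP 429. The delays,
// retry limit, and user agent are illustrative values, not DIEF's current behaviour.
object PoliteFetcher {
  private val minDelayMs = 100L // at most roughly 10 requests per second
  private var lastRequest = 0L

  def fetch(url: String, maxRetries: Int = 5): String = {
    var attempt = 0
    while (true) {
      val wait = lastRequest + minDelayMs - System.currentTimeMillis()
      if (wait > 0) Thread.sleep(wait)
      lastRequest = System.currentTimeMillis()

      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestProperty("User-Agent", "DBpedia-DIEF/issue-714 (example contact)")
      val code = conn.getResponseCode
      if (code == 200) {
        return Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
      }
      conn.disconnect()
      if (code == 429 && attempt < maxRetries) {
        Thread.sleep(math.min(60000L, 1000L * (1L << attempt))) // exponential backoff
        attempt += 1
      } else {
        throw new RuntimeException(s"HTTP $code for $url after $attempt retries")
      }
    }
    sys.error("unreachable")
  }
}
```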

@JJ-Author (Contributor):

As I said, the number of triples is more expressive when compared to the number of articles extracted. But to me this seems reasonable: fewer failed requests -> more triples.
