429 Too many requests #733

Open
uleodolter opened this issue Jun 8, 2022 · 3 comments
Labels
status: fix-required PR related to issue is needed

Comments

@uleodolter

Hi,

I have configured https://github.com/dbpedia/marvin-config to extract the German Wikipedia. A first run worked for the 20220401 dump.

Today I ran it again to extract the 20220601 dump, but the extraction framework only partly worked: after some time, https://de.wikipedia.org/w/api.php returned only HTTP 429.

Exception; de; Main Extraction at 00:00.957s for 62 datasets; Main Extraction failed for instance http://de.dbpedia.org/resource/Liste_von_Autoren/J: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
java.io.IOException: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1902)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at org.dbpedia.extraction.util.MediaWikiConnector$$anonfun$retrievePage$1.apply$mcVI$sp(MediaWikiConnector.scala:97)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
    ...

I used the following settings in extractionConfiguration/extraction.de.properties:

mwc-apiUrl=https://{{LANG}}.wikipedia.org/w/api.php
mwc-maxRetries=5
mwc-connectMs=4000
mwc-readMs=30000
mwc-sleepFactor=2000

It seems the extraction-framework does not handle this HTTP error properly. It would be great if the Retry-After HTTP header were used to handle such errors. Any suggestions on which properties to adjust for this problem?
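To illustrate the kind of handling meant here, below is a minimal sketch in Java of choosing the wait time before a retry. It honors Retry-After when the server sends it (either as a number of seconds or as an HTTP date) and otherwise falls back to exponential backoff from a base delay such as mwc-sleepFactor. The class RetryDelay, its method, and the 120 s cap are hypothetical and not part of the extraction framework; how the framework actually interprets mwc-sleepFactor may differ.

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration; not part of the extraction framework.
final class RetryDelay {

    // Assumed upper bound so a bad header cannot stall the extraction indefinitely.
    static final long MAX_DELAY_MS = 120_000L;

    /**
     * Returns the number of milliseconds to wait before retry number `attempt` (0-based).
     *
     * @param retryAfter    raw Retry-After header value, or null if the header is absent
     * @param attempt       how many retries have already been made
     * @param sleepFactorMs base delay for the fallback (assumed to play the role of mwc-sleepFactor)
     * @param now           current time, passed in so the function stays deterministic
     */
    static long delayMs(String retryAfter, int attempt, long sleepFactorMs, ZonedDateTime now) {
        // Fallback: exponential backoff, doubling the base delay on each attempt.
        int shift = Math.max(0, Math.min(attempt, 10));
        long fallback = sleepFactorMs * (1L << shift);
        long delay = fallback;

        if (retryAfter != null) {
            String value = retryAfter.trim();
            try {
                // Form 1: delay in seconds, e.g. "Retry-After: 5"
                long seconds = Long.parseLong(value);
                delay = Math.min(Math.max(0L, seconds), MAX_DELAY_MS / 1000L) * 1000L;
            } catch (NumberFormatException notSeconds) {
                try {
                    // Form 2: HTTP date, e.g. "Retry-After: Wed, 08 Jun 2022 10:00:30 GMT"
                    ZonedDateTime when =
                        ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
                    delay = Math.max(0L, Duration.between(now, when).toMillis());
                } catch (DateTimeParseException notDate) {
                    // Unparseable header: ignore it and use the backoff value.
                    delay = fallback;
                }
            }
        }
        return Math.min(delay, MAX_DELAY_MS);
    }
}
```

A caller could read the header with HttpURLConnection.getHeaderField("Retry-After") after seeing response code 429, sleep for the returned number of milliseconds, and then retry, up to mwc-maxRetries times.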

jlareck added the "status: fix-required" (PR related to issue is needed) label on Jun 21, 2022
@jlareck
Collaborator

jlareck commented Jun 22, 2022

Hi, we are currently reworking the abstract extraction.

@uleodolter
Author

Any updates on this, or a workaround? The extraction of the German Wikipedia worked only once, in April 2022.

@jlareck
Collaborator

jlareck commented Oct 29, 2022

Hi, yes, we have some updates on text extraction. This summer we had a Google Summer of Code project in which a student upgraded the text extraction, and it improved: the number of 429 errors went down, although the text extraction process still sometimes freezes at some point. All related work is in this branch: https://github.com/dbpedia/extraction-framework/tree/celian-gsoc .

During this GSoC project, two new MediaWiki connectors were implemented, both based on the previous one:

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala - this connector uses the same MediaWiki API that we have always used, but adds some new configuration options, which reduced the number of HTTP 429 errors. However, the extraction sometimes does not complete: after roughly 70-95% of the pages in the dump have been extracted, the process freezes. (I am not completely sure about these numbers, but when we tested it and compared the results with the datasets from previous releases, the number of extracted pages looked almost the same.) I recommend running the extraction for only one language per process (list just one language in the extraction.text.properties file).

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorRest.scala - this connector uses the new MediaWiki REST API. It still has the same problem: the process freezes during extraction.
