WMT Download broken #132

Open
kailukowiak opened this issue Mar 2, 2023 · 6 comments

Comments

@kailukowiak

I'm trying to run the tutorial in Transformers.jl/example/AttentionIsAllYouNeed/wmt14/train.jl, and when I run the following, the dataset download fails with a 404:

julia> using Transformers.Datasets

julia> using Transformers.Datasets: WMT

julia> wmt14 = WMT.GoogleWMT()
Transformers.Datasets.WMT.GoogleWMT()

julia> word_counts = get_vocab(wmt14)
This program has requested access to the data dependency Google-WMT en-de.
which is not currently installed. It can be installed automatically, and you will not see this message again.

"""shows in wmt14 of torchtext
The WMT 2014 English-German dataset, as preprocessed by Google Brain.

Though this download contains test sets from 2015 and 2016, the train set
differs slightly from WMT 2015 and 2016 and significantly from WMT 2017.
"""

contain bpe training set and news testset from 2009~2016 (include origin text,
tokenized, and bpe versions), and also a bpe.32000 and vocab.32000 (merged vocab)



Do you want to download the dataset from https://docs.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8 to "/home/<name>/.julia/datadeps/Google-WMT en-de"?
[y/n]
y
ERROR: HTTP.Exceptions.StatusError(404, "HEAD", "/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8&confirm=pbef", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Thu, 02 Mar 2023 06:53:49 GMT
Content-Length: 1642
Strict-Transport-Security: max-age=31536000
Cross-Origin-Opener-Policy: same-origin; report-to="DriveUntrustedContentHttp"
Content-Security-Policy: script-src 'report-sample' 'nonce-WeEppd-iCOhoEnLwLNRXaA' 'unsafe-inline';object-src 'none';base-uri 'self';report-uri /_/DriveUntrustedContentHttp/cspreport;worker-src 'self', require-trusted-types-for 'script';report-uri /_/DriveUntrustedContentHttp/cspreport
Accept-CH: Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model, Sec-CH-UA-WoW64, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version
Permissions-Policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-platform=*, ch-ua-platform-version=*
Report-To: {"group":"DriveUntrustedContentHttp","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/DriveUntrustedContentHttp/external"}]}
Server: ESF
X-XSS-Protection: 0
X-Content-Type-Options: nosniff
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000

""")
Stacktrace:
  [1] download_gdrive(url::String, localdir::String)
    @ Fetch ~/.julia/packages/Fetch/6DlaY/src/gdrive.jl:63
  [2] gdownload(url::String, localdir::String)
    @ Fetch ~/.julia/packages/Fetch/6DlaY/src/gdrive.jl:113
  [3] run_fetch
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:99 [inlined]
  [4] download(datadep::DataDeps.DataDep{String, String, typeof(Fetch.gdownload), typeof(DataDeps.unpack)}, localdir::String; remotepath::String, i_accept_the_terms_of_use::Nothing, skip_checksum::Bool)
    @ DataDeps ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:78
  [5] download
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:63 [inlined]
  [6] handle_missing
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:10 [inlined]
  [7] _resolve
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:83 [inlined]
  [8] resolve(datadep::DataDeps.DataDep{String, String, typeof(Fetch.gdownload), typeof(DataDeps.unpack)}, inner_filepath::String, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:29
  [9] resolve(datadep_name::String, inner_filepath::String, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:54
 [10] resolve
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:73 [inlined]
 [11] #get_vocab#2
    @ ~/.julia/packages/Transformers/nIgPX/src/datasets/translate/google_wmt.jl:49 [inlined]
 [12] get_vocab(::Transformers.Datasets.WMT.GoogleWMT)
    @ Transformers.Datasets.WMT ~/.julia/packages/Transformers/nIgPX/src/datasets/translate/google_wmt.jl:47
 [13] top-level scope
    @ REPL[9]:1

I presume this is due to the download location moving but I'm not sure.

@chengchingwen
Owner

Yeah, the link seems to be down. We'll need to find a different source for the dataset.
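
One way to confirm that the link itself is gone, independent of Fetch.jl, is to send it a HEAD request, the same method that returned the 404 in the trace above. The following is a minimal sketch using Julia's Downloads standard library; `head_status` and `is_available` are hypothetical helper names for illustration and are not part of Transformers.jl, Fetch.jl, or DataDeps.jl.

```julia
using Downloads

# Send a HEAD request and return the HTTP status code, or `nothing` when
# no usable response came back (DNS failure, timeout, refused connection).
function head_status(url::AbstractString; timeout::Real = 30)
    result = Downloads.request(url; method = "HEAD", throw = false, timeout = timeout)
    # With `throw = false`, transport failures come back as a
    # Downloads.RequestError instead of being thrown.
    return result isa Downloads.Response ? result.status : nothing
end

# Treat 2xx and 3xx as "the resource is still there" (an assumption of this sketch).
is_available(status) = status !== nothing && 200 <= status < 400

# Usage (requires network access):
# status = head_status("https://docs.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8")
# println(is_available(status) ? "link is reachable" : "link is down (status: $status)")
```

Running the same check against a candidate replacement URL before proposing it would show whether it at least resolves, though it says nothing about whether the contents match the original dataset.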

@kailukowiak
Author

Would this link work? https://www.statmt.org/europarl/v7/de-en.tgz
I got it from https://www.statmt.org/europarl/

If that would work I'd be happy to throw in a trivial PR if you'd like.

@chengchingwen
Owner

That seems to be a different corpus. It would be better to find a new official source for the WMT dataset. Personally, I don't have a strong intention to add a new dataset, but if you want, you could add new example code that handles the download and uses the Europarl corpus.
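
For anyone who wants to try the Europarl route in a separate example, a download step could be sketched with DataDeps.jl roughly as follows. This is only an illustration under stated assumptions: the dependency name `Europarl-v7 de-en` is made up for this sketch, no checksum is pinned, and it does not replace the `Google-WMT en-de` dependency that `get_vocab` resolves.

```julia
using DataDeps

# URL proposed earlier in this thread.
const EUROPARL_URL = "https://www.statmt.org/europarl/v7/de-en.tgz"

# Hypothetical dependency name, chosen for this sketch only.
const EUROPARL_NAME = "Europarl-v7 de-en"

# Build the DataDep. No checksum is passed, so DataDeps warns and prints the
# computed hash on first download; that hash can then be pinned here.
europarl_datadep() = DataDep(
    EUROPARL_NAME,
    """
    Europarl v7 German-English parallel corpus from statmt.org.
    This is a different corpus from the Google Brain WMT 2014 preprocessing.
    """,
    EUROPARL_URL;
    post_fetch_method = unpack,  # extract the archive after fetching
)

register(europarl_datadep())

# The archive is fetched and unpacked the first time the path is resolved:
# corpus_dir = datadep"Europarl-v7 de-en"
```

Because the default fetch method is used here rather than `Fetch.gdownload`, the Google Drive code path from the stack trace above is not involved.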

@tobefreeman

I am unable to pre-compile Transformers.jl as of this week. Does anyone know why?

@chengchingwen
Owner

@tobefreeman Please open a new issue and provide the error message.

@tobefreeman

> @tobefreeman Please open a new issue and provide the error message.

My issue is not repeatable. Sorry, @chengchingwen.


3 participants