[Python][Packaging] S3FileSystem curl error when using localstack-created S3 bucket or custom ca-certificate #37001

Open
thvasilo opened this issue Aug 2, 2023 · 10 comments

Comments

@thvasilo

thvasilo commented Aug 2, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to use a localstack-created S3 bucket as a way to test my application without interacting with S3.

To do that I launch an S3 endpoint using localstack start -d and create my PyArrow S3FS using:

s3_fs = fs.S3FileSystem(endpoint_override="localhost:4566")

When I try to interact with files on the simulated bucket, however, I get the following:

In [223]: nrows = pq.read_metadata(f"{file_bucket}/{file_key}", filesystem=s3_fs).num_rows
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-223-a51dff0bbcaa> in <module>
----> 1 nrows = pq.read_metadata(f"{file_bucket}/{file_key}", filesystem=s3_fs).num_rows

/[...]/lib/python3.7/site-packages/pyarrow/parquet/core.py in read_metadata(where, memory_map, decryption_properties, filesystem)
   3479     file_ctx = nullcontext()
   3480     if filesystem is not None:
-> 3481         file_ctx = where = filesystem.open_input_file(where)
   3482
   3483     with file_ctx:

[...]/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.open_input_file()
[...]/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
[...]/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: When reading information for key 'redacted/path/to/file' in bucket 'example-bucket': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 60, SSL peer certificate or SSH remote key was not OK

Another user seems to have the same problem when using on-prem S3, and had to use s3fs along with PyFileSystem, FSSpecHandler to resolve it: https://discuss.ray.io/t/ssl-peer-certificate-or-ssh-remote-key-was-not-ok/11091/2

Fully reproducible example:

pip install localstack awscli-local pyarrow
localstack start -d
awslocal s3 mb s3://example-bucket
python <<HEREDOC
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                   index=list('abc'))
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')
HEREDOC
awslocal s3 cp example.parquet s3://example-bucket/
python <<HEREDOC
from pyarrow import fs
import pyarrow.parquet as pq
s3_fs = fs.S3FileSystem(endpoint_override="localhost:4566")
pq.read_metadata("example-bucket/example.parquet", filesystem=s3_fs)
HEREDOC

This results in:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "[.../]lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3481, in read_metadata
    file_ctx = where = filesystem.open_input_file(where)
  File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When reading information for key 'example.parquet' in bucket 'example-bucket': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 60, SSL peer certificate or SSH remote key was not OK

Possibly related: https://issues.apache.org/jira/browse/ARROW-9261

Tested on Pyarrow 10.0.1, 12.0.1.
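
One thing that may be worth ruling out with the reproduction above: S3FileSystem has a scheme parameter that defaults to "https", so the call above attempts a TLS handshake with the local endpoint. The sketch below assumes that LocalStack also accepts plain HTTP on port 4566 and that it accepts arbitrary credentials; the helper names and the "test" credentials are hypothetical and not part of the original report.

```python
def localstack_s3_kwargs(host="localhost", port=4566):
    """Build keyword arguments for pyarrow.fs.S3FileSystem against LocalStack.

    Assumptions (not from the original report): the endpoint serves plain
    HTTP on this port, and dummy credentials are accepted.
    """
    return {
        "endpoint_override": f"{host}:{port}",
        "scheme": "http",      # skip TLS entirely instead of the "https" default
        "access_key": "test",  # hypothetical placeholder credentials
        "secret_key": "test",
    }


def open_localstack_fs():
    """Create the filesystem; pyarrow is imported lazily so the helper above
    can be inspected without pyarrow installed."""
    from pyarrow import fs

    return fs.S3FileSystem(**localstack_s3_kwargs())
```

If the endpoint has to be reached over HTTPS (for example an on-prem S3 with a custom CA), this does not apply and the certificate has to be trusted instead.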

Component(s)

Python

@pitrou
Member

pitrou commented Aug 22, 2023

@jorisvandenbossche @danepitkin

@sheldonrong

Any progress on this one, please?

@pitrou
Member

pitrou commented Oct 30, 2023

@sheldonrong Can you suggest a solution for this?

@sheldonrong

I encountered this issue today. I haven't dived into any pyarrow code, but I guess the best way would be to allow the user to pass in the location of the CA certificates via the API?

For example, in the constructor of the S3FileSystem, allow additional parameters like
filesystem = fs.S3FileSystem(endpoint_override="localhost:4566", ca_certificate="/etc/ssl/certs/ca.crt")

@pitrou
Member

pitrou commented Oct 30, 2023

For example, in the constructor of the S3FileSystem, allow additional parameters like filesystem = fs.S3FileSystem(endpoint_override="localhost:4566", ca_certificate="/etc/ssl/certs/ca.crt")

That could be a good idea indeed. We don't have a way of doing that currently, but we should probably add one.

For now, you can perhaps work around this by using the SSL_CERT_DIR or SSL_CERT_FILE environment variables as described in https://www.openssl.org/docs/man3.0/man7/openssl-env.html , but this will affect the entire Python process.
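
A minimal sketch of that workaround from inside Python, assuming the variable is set before the S3FileSystem is created and that the TLS backend in use honors SSL_CERT_FILE. The helper name use_ca_file is hypothetical, and the path in the usage comment is only the example path from the suggestion above.

```python
import os


def use_ca_file(ca_path):
    """Point OpenSSL-based clients in this process at a specific CA file.

    This changes process-wide state: every library that reads
    SSL_CERT_FILE afterwards is affected, not only pyarrow.
    """
    if not os.path.isfile(ca_path):
        raise FileNotFoundError(ca_path)
    os.environ["SSL_CERT_FILE"] = os.path.abspath(ca_path)
    return os.environ["SSL_CERT_FILE"]


# Usage (requires pyarrow and a real CA file, so it is left commented out):
# use_ca_file("/etc/ssl/certs/ca.crt")
# from pyarrow import fs
# s3_fs = fs.S3FileSystem(endpoint_override="localhost:4566")
```

Exporting the variable in the shell before starting Python has the same effect and avoids any question of ordering.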

cc @danepitkin

@shohamyamin

Are there any updates on this? I'm also stuck at that point.

@alonahmias

I'm also looking for a workaround for this error.
Currently I'm storing the output of write_table in a separate variable and then writing it to S3 using boto3.
That approach is ugly and takes a lot of extra memory when the table is big.

@pitrou
Member

pitrou commented Jan 8, 2024

Well, have you tried the workaround I suggested?

For now, you can perhaps work around this by using the SSL_CERT_DIR or SSL_CERT_FILE environment variables as described in https://www.openssl.org/docs/man3.0/man7/openssl-env.html , but this will affect the entire Python process.

@pitrou
Member

pitrou commented Mar 27, 2024

I've tried it locally and it works using SSL_CERT_FILE and Minio:

>>> import os
>>> from pyarrow.fs import S3FileSystem, FileSelector
>>> os.environ['SSL_CERT_FILE']
'/home/antoine/t/miniocert/public.crt'
>>> fs = S3FileSystem(endpoint_override="localhost:9000", scheme="https", access_key="minioadmin", secret_key="minioadmin")
>>> fs.get_file_info(FileSelector('', recursive=True))
[]

You have to make sure that your endpoint_override matches the certificate's subject name (i.e. the host name it is allowed to authenticate). For example, if the certificate's subject name is "localhost", you should use "localhost" in your endpoint_override (not "127.0.0.1" or anything else).

Unfortunately, the error message returned by the AWS SDK is not terribly informative if you're not giving the right hostname:

>>> fs = S3FileSystem(endpoint_override="127.0.0.1:9000", scheme="https", access_key="minioadmin", secret_key="minioadmin")
>>> fs.get_file_info(FileSelector('', recursive=True))
  [...]
OSError: When listing buckets: AWS Error NETWORK_CONNECTION during ListBuckets operation: curlCode: 60, SSL peer certificate or SSH remote key was not OK

You can try using the curl command line to get a more meaningful error message, for example here:

$ curl --cacert ./t/miniocert/public.crt  https://127.0.0.1:9000
curl: (60) SSL: no alternative certificate subject name matches target host name '127.0.0.1'
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
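
A similar check can be sketched in Python with only the standard library, which may be handy when the curl command line is not available. This is not what pyarrow or the AWS SDK do internally; the function name diagnose_tls is hypothetical, and the host, port and CA file are whatever your own setup uses.

```python
import socket
import ssl


def diagnose_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a verified TLS handshake and describe the outcome as text.

    `host` is used both for the TCP connection and for hostname
    verification, mirroring what endpoint_override would be checked against.
    """
    # With cafile set, only that file is trusted; otherwise system defaults.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # Covers both an untrusted CA and a hostname mismatch.
        return f"certificate verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        return f"TLS error: {exc}"
    except OSError as exc:
        return f"connection failed: {exc}"


# Usage (needs a running HTTPS endpoint, so it is left commented out):
# print(diagnose_tls("127.0.0.1", 9000, cafile="./t/miniocert/public.crt"))
# print(diagnose_tls("localhost", 9000, cafile="./t/miniocert/public.crt"))
```

Comparing the result for the host name and for the IP address should show whether the failure comes from the CA not being trusted or from the name not matching the certificate.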

@pitrou
Member

pitrou commented Mar 27, 2024

So, I opened aws/aws-sdk-cpp#2908 so as to get a more informative error message from the AWS SDK.
