In Airflow, you can create a connection to S3 in order to, for instance, store logs in an S3 bucket.
To do so, open the Airflow web interface, go to the "Admin" menu, then the "Connections" submenu, and click the blue + sign.
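The same connection can also be created from the command line instead of the UI — a sketch assuming Airflow 2.x, where the connection ID, keys, and region are placeholders to replace with your own values:

```shell
# Hypothetical example: create an AWS connection from the CLI.
# 'your_connection_id', the key values and the region are placeholders.
airflow connections add 'your_connection_id' \
    --conn-type 'aws' \
    --conn-login 'YOUR_AWS_ACCESS_KEY_ID' \
    --conn-password 'YOUR_AWS_SECRET_ACCESS_KEY' \
    --conn-extra '{"region_name": "eu-central-1"}'
```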
However, once you’ve created your connection, there is no easy way to check that it is working. Here is a small procedure to test a newly created S3 connection, provided you have SSH access to the server where Airflow is deployed:
- Connect to the machine where Airflow is deployed:
ssh your_login@your_airflow_server
- Create a test.py file with the following content, replacing your_connection_id with the connection ID you’ve just created and your_s3_bucket with the name of the bucket you want to connect to:
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

remote_conn_id = 'your_connection_id'  # the Airflow connection ID to test
remote_location = 'your_s3_bucket'     # the S3 bucket to list

hook = S3Hook(remote_conn_id, transfer_config_args={'use_threads': False})
print(hook.list_keys(remote_location)[0:10])
- Execute this test.py script:
python3 test.py
If your connection is working, you should get the list of the first 10 keys in your S3 bucket:
[2021-10-11 15:33:12,934] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=your_connection_id
[2021-10-11 15:33:12,972] {base_aws.py:179} INFO - No credentials retrieved from Connection
[2021-10-11 15:33:12,973] {base_aws.py:82} INFO - Retrieving region_name from Connection.extra_config['region_name']
[2021-10-11 15:33:12,973] {base_aws.py:84} INFO - Creating session with aws_access_key_id=None region_name=eu-central-1
[2021-10-11 15:33:12,980] {base_aws.py:157} INFO - role_arn is None
['directory1/', 'directory1/file1.txt', 'directory1/file2.txt', 'directory1/file3.txt', 'directory1/file4.txt',
'directory1/file5.txt', 'directory1/file6.txt', 'directory1/file7.txt', 'directory1/file8.txt', 'directory1/file9.txt']
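Note that S3 has no real directories: a key like 'directory1/file1.txt' is a single flat key, and 'directory1/' is just a zero-byte placeholder object that some tools create. If you want to eyeball a listing per top-level prefix, a small stdlib-only sketch (group_keys_by_prefix is a hypothetical helper, not part of Airflow):

```python
def group_keys_by_prefix(keys):
    """Group flat S3 keys by their top-level 'directory' prefix."""
    groups = {}
    for key in keys:
        # keys without a '/' are top-level objects and form their own group
        prefix = key.split('/', 1)[0] + '/' if '/' in key else key
        groups.setdefault(prefix, []).append(key)
    return groups
```

Feeding it the output of hook.list_keys(...) gives one entry per top-level prefix, which is handy when a bucket mixes many "directories" at the root.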
If you can’t connect to your S3 bucket, you will get a Python stack trace instead. For instance, if the bucket you are trying to connect to does not exist:
[2021-10-11 15:39:45,558] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=your_connection_id
[2021-10-11 15:39:45,588] {base_aws.py:179} INFO - No credentials retrieved from Connection
[2021-10-11 15:39:45,588] {base_aws.py:82} INFO - Retrieving region_name from Connection.extra_config['region_name']
[2021-10-11 15:39:45,588] {base_aws.py:84} INFO - Creating session with aws_access_key_id=None region_name=eu-central-1
[2021-10-11 15:39:45,596] {base_aws.py:157} INFO - role_arn is None
Traceback (most recent call last):
File "test.py", line 7, in <module>
print(hook.list_keys(remote_location)[0:10])
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 62, in wrapper
return func(*bound_args.args, **bound_args.kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 302, in list_keys
for page in response:
File "/home/ubuntu/.local/lib/python3.8/site-packages/botocore/paginate.py", line 255, in __iter__
response = self._make_request(current_kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/botocore/paginate.py", line 332, in _make_request
return self._method(**current_kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/botocore/client.py", line 676, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
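The raw stack trace is not very friendly. As a sketch, the same check can be wrapped in a small function that returns a readable verdict instead of raising — check_s3_connection is a hypothetical helper, while the S3Hook methods it relies on (check_for_bucket, list_keys) come from the Amazon provider package:

```python
def check_s3_connection(hook, bucket):
    """Return (ok, message) for an S3Hook-like object and a bucket name."""
    try:
        # check_for_bucket() returns a boolean instead of raising NoSuchBucket
        if not hook.check_for_bucket(bucket):
            return False, "bucket %r does not exist or is not accessible" % bucket
        keys = hook.list_keys(bucket) or []
        return True, "connection OK, %d keys visible" % len(keys)
    except Exception as exc:  # wrong credentials, region, network issues, ...
        return False, "connection failed: %s" % exc

# Usage with a real hook (requires apache-airflow-providers-amazon):
#   from airflow.providers.amazon.aws.hooks.s3 import S3Hook
#   hook = S3Hook('your_connection_id', transfer_config_args={'use_threads': False})
#   print(check_s3_connection(hook, 'your_s3_bucket'))
```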