Query Amazon S3 Data with Amazon Athena and AWS Glue

Nipulpatel
6 min readApr 23, 2024

--

harnessing the power of AWS for data processing and visualization will have a solid understanding of how to upload a CSV file into Amazon S3, utilize AWS Glue to crawl and create a database, leverage Amazon Athena for flexible querying.

  • Simple Storage Service (S3) is an online storage service where you can store and retrieve any type of data on the web regardless of the time and place.
  • AWS Glue is a powerful service that allows you to discover, catalog, and transform data from various sources, making it easier to prepare data for analytics.
  • Athena can connect to data stored in Amazon S3 using the AWS Glue Data to store metadata such as table and column names. And after the connection is made, databases and tables appear in Athena’s query editor
  • It is an interactive query service that makes it simple to analyze data directly in Amazon S3 using standard SQL.

let’s see really quick how we can set this pipeline up.

  • Uploading Data to Amazon S3:
  • Amazon Simple Storage Service (S3) is a highly scalable and durable object storage service provided by AWS. It offers a secure and reliable solution for storing and retrieving data from anywhere on the web. Let’s explore how you can upload a CSV file to S3 and ensure efficient data organization.
  • Uploading a CSV file to S3: You can upload a CSV file to S3 using various methods but for this project we are going to be working with AWS Management Console.
  • We begin with Logging in to the AWS Management Console and navigate to the buckets section under s3. lets proceed to uploading our CSV file, but before that lets create a new bucket or we can use an existing bucket.
  • Click on CREATE BUCKET, in this articule we are using the bucket name called boon123
  • navigate to the created bucket and Click on the “Upload” button and select your CSV file from your local machine.
  • Follow the prompts to configure options such as access permissions, encryption, and metadata.
  • Click on the “Upload” button to initiate the upload process.

Head to Amazon Athena console and select Data sources under Administration in the left side panel.

Select S3 — AWS Glue Data Catalog and choose Next.

Choose Create in AWS Glue.

This will automatically open the Add crawler page under AWS Glue > Crawlers.

Enter a name for your crawler and choose Next.

Since we are creating a brand new one from S3, choose Not yet for Is your data alreday mapped, and choose Add a data source.

Choose S3 for Data source, leave the Network connection empty and enter the S3 path for your data. Note that All folders and files contained in the S3 path are crawled. In my example below, all files within data/ folder will be crawled.

Choose Next and you should see the S3 data source you just configured showing up under Data sources section.

Choose Next.

Choose Create new IAM role. and enter a name for it.

This will automatically create a Glue Service Role with the following policies attached.

Within the Custom managed s3Policy, here is what you have. You will have Read and Write permission to the S3 path you specified as data source.

Choose Next.

Let’s create a new Data Catalog database for our S3 data. Choose Add database.

This will take you to the Add database page under AWS Glue > Databases like following. Enter a name for your database and choose Create database.

Head back to Crawler creation page and choose the database we have just create above. If you are not seeing it showing up, click on the refresh button next to the name entry.

For Crawler schedule, choose on Demand. You can also choose other frequencies like Daily or Weekly, or enter your own cron expressions.

Choose Next. Review the information you have entered and choose Create crawler.

To start crawling process, choose the crawler under AWS Glue > Crawlers and choose Run crawler.

On success, you should see the following.

By doing the above process, table(s) will automatically be created for you in the database. You DO NOT need to create Table and enter your column information manually like what this AWS official guide suggests.

You can confirm that the table is indeed created successfully by going to AWS Glue > Tables.

Now, let’s check to see if we can indeed query our S3 data correctly using Amazon Athena. Head back to Athena console and choose query editor.

Select the AwsDataCatalog as Data source and choose the Database we have just created.

Simply enter and run the following query

select * from boon123

and you should be able to confirm the result under Query results.

We can now connect to our S3 from QuickSight using Athena easily!

Conclusion

The seamless integration of AWS Athena and AWS Glue enables organizations to leverage schema and crawled data for efficient and optimized database queries. By combining the power of metadata-driven analysis and query execution, businesses can uncover valuable insights, make informed decisions, and drive success in the data-driven era.

Thank you for reading!

--

--

No responses yet