What is Athena?
- Interactive query service for analyzing data stored in Amazon S3
- Serverless, so there is no infrastructure to set up or manage
- Scales automatically with the volume of data a query scans
- Leverages column-based tables for parallel processing
- Cloud-based, in-memory query system
Business Role for Athena
- User-friendly query system for data stored in S3
- Central metadata store architecture, similar to Hive
- Focuses on unstructured and semi-structured data stored in S3
- Common examples of queried data include JSON, CSV, Apache Parquet, and Apache ORC large data files
- Emphasis is on large captured data files such as weblogs, IoT data, and other external data
Creating Tables in Athena
- Athena creates tables using the Apache Hive Data Definition Language
- Hive is an open-source Big Data toolset for analytics
- Uses SQL-compliant statements for table creation
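As a minimal sketch of what such a statement can look like, the Python helper below assembles a Hive-style `CREATE EXTERNAL TABLE` statement for Parquet data. The helper name `build_create_table_ddl`, the table `weblogs`, its columns, and the bucket `s3://example-bucket/weblogs/` are all hypothetical and chosen only for illustration.

```python
def build_create_table_ddl(table, columns, location):
    """Return a Hive-style CREATE EXTERNAL TABLE statement for Parquet data."""
    # One "`name` type" line per column, separated by commas.
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket (assumptions, not from the article).
ddl = build_create_table_ddl(
    table="weblogs",
    columns=[("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    location="s3://example-bucket/weblogs/",
)
print(ddl)
```

The resulting text could be pasted into the Athena query editor. The table is external, so the statement only registers a definition over files that already sit in S3.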
Schema on Read
- Verifies data organization when a query is issued, not when the data is loaded
- Provides much faster loading because structure is not validated at load time
- Multiple schemas can serve different needs for the same data
- Better option when the schema is not known at loading time
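The idea can be illustrated outside Athena with a small Python sketch: the raw text is stored untouched, and a schema is applied only when the data is read. The sample rows, both schemas, and the function `read_with_schema` are assumptions made up for this example. They are not part of Athena.

```python
import csv
import io

# Raw data is stored as-is; nothing validates it at load time.
RAW_CSV = "2024-01-01,home,200,512\n2024-01-01,cart,500,0\n"

# Two hypothetical schemas over the same raw rows: (column name, converter).
TRAFFIC_SCHEMA = [("day", str), ("page", str), ("status", int), ("bytes_sent", int)]
ERRORS_SCHEMA = [("day", str), ("page", str), ("status", int)]  # skips the last field


def read_with_schema(raw_text, schema):
    """Apply a schema while reading; type errors surface here, not at load time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so extra fields are ignored.
        rows.append(
            {name: convert(value) for (name, convert), value in zip(schema, record)}
        )
    return rows


traffic = read_with_schema(RAW_CSV, TRAFFIC_SCHEMA)
errors = [r for r in read_with_schema(RAW_CSV, ERRORS_SCHEMA) if r["status"] >= 500]
total_bytes = sum(r["bytes_sent"] for r in traffic)
print(total_bytes)
print(errors)
```

Both views read the same stored bytes. Only the schema chosen at query time differs.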
Parallel Processing of Queries
- Parallel operations within a single SQL query
- Concurrent users can access columns at the same time
- Horizontal and vertical parallelization of a single query operation using multiple nodes
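A toy Python sketch of the two kinds of split follows. It is an analogy only and does not reflect how Athena schedules work internally; the in-memory `TABLE`, the worker counts, and the function names are assumptions for illustration. The horizontal split cuts one column into row ranges, and the vertical split handles each column independently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical column-oriented table: each column is stored as its own list.
TABLE = {
    "status": [200, 200, 500, 404, 200, 500],
    "bytes_sent": [10, 20, 0, 5, 30, 0],
}


def partial_sum(values):
    """Work done by one worker on its slice of a column."""
    return sum(values)


def parallel_column_sum(column, workers=3):
    """Horizontal split: cut one column into row ranges and sum them concurrently."""
    values = TABLE[column]
    chunk = -(-len(values) // workers)  # ceiling division
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, slices))


def parallel_all_columns():
    """Vertical split: process each column independently of the others."""
    with ThreadPoolExecutor(max_workers=len(TABLE)) as pool:
        totals = pool.map(parallel_column_sum, TABLE.keys())
        return dict(zip(TABLE.keys(), totals))


totals = parallel_all_columns()
print(totals)
```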
Governed Tables
- Governed Tables are tables formed within a data lake created by AWS Lake Formation
- Similar to Managed Tables in Hive
- When a governed table is dropped, both the table definition in the metastore and the data files are deleted
Iceberg Table
- An Iceberg table uses Apache Iceberg, an open table format designed for large analytics datasets
- Manages a large collection of files as a table
- Iceberg tables must be associated with an AWS Glue catalog
- Must be created using the Parquet format in AWS
- DROP TABLE deletes both the metastore entry and the data files
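For illustration, the sketch below builds the kind of statements involved: a `CREATE TABLE` that marks the table as Iceberg through the `table_type` table property, and the matching `DROP TABLE`. The helper `build_iceberg_ddl`, the table name `events`, its columns, and the S3 location are hypothetical.

```python
def build_iceberg_ddl(table, columns, location):
    """Return a CREATE TABLE statement that marks the table as Iceberg."""
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list})\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


# Hypothetical table, columns, and bucket (assumptions, not from the article).
create_stmt = build_iceberg_ddl(
    table="events",
    columns=[("id", "bigint"), ("payload", "string")],
    location="s3://example-bucket/events/",
)
# Per the notes above, dropping removes the data files as well as the definition.
drop_stmt = "DROP TABLE events"
print(create_stmt)
print(drop_stmt)
```

Unlike an external table definition, there is no `EXTERNAL` keyword here, which fits the point that dropping the table also removes its data.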
Summary
- Athena is a serverless, cloud-based, in-memory query service
- Offers the Athena Federated Query service
- Uses a common metadata store architecture for table definitions
- Queries data stored in common formats such as JSON, ORC, and Parquet
- Uses the standard SQL query language
- Supports external, governed, and Iceberg tables