DEV Community

Yen Trinh
Yen Trinh

Posted on

Glue – Athena custom output fixed number Of Files

Situation:

When I only use partition clause, there are so many files in S3 bucket which is <1MB, this affect to the query speed and I want to make those become a bigger file.

Solution:

Solution 1: Use Athena "bucketing" method to custom the number of output file.

You can see this AWS blog for more information:
How can I set the number or size of files when I run a CTAS query in Athena?

However, there is one drawback if you use bucketing: Bucketed table do not support INSERT INTO query. Here comes the solution 2.

Solution 2: Use Glue repartition

The context is the same but now I want to use INSERT INTO query.

You can refer to this AWS blog for the procedure:
Build a Data Lake Foundation with AWS Glue and Amazon S3

Note that in the step numbered "13. View the job", we add the following code into the job:

datasource_df = dropnullfields3.repartition(<number of output file you want here>)
Enter fullscreen mode Exit fullscreen mode

right after the line:

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
Enter fullscreen mode Exit fullscreen mode

and edit the code:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
Enter fullscreen mode Exit fullscreen mode

to:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
Enter fullscreen mode Exit fullscreen mode

If you want to know more about Glue repartition:

Try querying with Athena
Create table:

CREATE EXTERNAL TABLE IF NOT EXISTS demo_query (
  dispatching_base_num string,
  pickup_date string,
  locationid bigint)
STORED AS PARQUET
LOCATION 's3://athena-examples/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
Enter fullscreen mode Exit fullscreen mode

Try to insert:

insert into demo_query ("dispatching_base_num", "pickup_date", "locationid") values ('aa23dtgt', '2020-12-03', 1234);
Enter fullscreen mode Exit fullscreen mode

The insert query now should work. Success!

Top comments (1)

Collapse
 
sasuke002a profile image
Sasuke Shelby

Industrial underfill epoxy adhesive glue
, circuit board-level adhesives, and adhesives for electronic goods have all been created by DeepMaterial.

Image of Docusign

🛠️ Bring your solution into Docusign. Reach over 1.6M customers.

Docusign is now extensible. Overcome challenges with disconnected products and inaccessible data by bringing your solutions into Docusign and publishing to 1.6M customers in the App Center.

Learn more