AWS Athena for GitHub - Eliminating the default Security blind spots

Did you know GitHub’s audit logs are retained for only six months, while Git events are retained for just seven days ([1], [2])? This mismatch can leave significant gaps in your security visibility. If an attacker manages to infiltrate your system, their tracks can simply age out of the logs unless you have a system in place to retain them longer.

Many organizations stream these logs to external storage like Amazon S3. While storing these logs is essential, being able to query them is equally important. Failing to do so can result in missing information during security incidents, lack of compliance with regulatory requirements, and an overall inability to perform thorough forensic analysis during an investigation.

In this post, I will walk you through how to set up and query Amazon Athena for audit logs stored in S3.

Why Query Audit Logs with Amazon Athena?

Audit logs provide a list of events that affect your enterprise. Having these logs queryable in Amazon Athena offers several benefits:

  1. Immediate Access: query logs for specific events without the need to manually sift through raw data.
  2. Detailed Insights: drill down into specific details, such as who performed a particular action and when. For instance, you can see who deleted a repository, what time they did it, and their IP address. Here's a mock example of what the detailed results might look like:
{
  "action": "repo.destroy",
  "actor": "johndoe",
  "actor_id": 12345,
  "actor_ip": "192.168.1.1",
  "user_agent": "Mozilla/5.0",
  "visibility": "private",
  "repo": "example-repo",
  "repo_id": 54321,
  "public_repo": false,
  "org": "example-org",
  "org_id": 67890,
  "_document_id": "abcdef123456",
  "@timestamp": "2024-01-15T06:30:00Z",
  "created_at": "2024-01-15T06:29:50Z",
  "operation_type": "delete",
  "business": "example-business",
  "business_id": 111213,
  "actor_location": {"country_code": "US"}
}
  3. Cost Efficiency: by leveraging partitioning and other cost-saving features in Athena, you can optimize your query costs.

Setting Up Athena to Query GitHub Audit Logs

To get started, you'll need to create a table and a view in Athena that reference your GitHub audit logs stored in S3.

Architecture diagram - Athena for GitHub

Step 1: Create an Athena Table

First, let's define a table in Athena that can read the JSON-formatted audit logs stored in S3.

CREATE EXTERNAL TABLE IF NOT EXISTS `<TABLE_NAME>` (
  `json_objects` string
)
PARTITIONED BY (
  `date` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION 's3://<YOUR_S3_PATH>/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.date.type' = 'date',
  'projection.date.format' = 'yyyy/MM/dd/HH',
  'projection.date.range' = '2023/01/01/01,NOW',
  'projection.date.interval' = '1',
  'projection.date.interval.unit' = 'HOURS',
  'storage.location.template' = 's3://<YOUR_S3_PATH>/${date}/'
);

This table setup ensures that Athena can read and partition your audit logs based on the date.
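
Once the table exists, it's worth a quick smoke test before building anything on top of it. Here is a minimal check, assuming your bucket actually has logs for the example hour (the table name and date are placeholders):

-- Fetch a few raw log lines from a single hourly partition to
-- verify that Athena can read the bucket and that the partition
-- template matches your S3 layout.
SELECT json_objects
FROM <TABLE_NAME>
WHERE date = '2024/01/15/06'
LIMIT 10;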

Step 2: Create an Athena View

Next, create a view to make it easier to query specific fields from the JSON data.

CREATE VIEW <VIEW_NAME> AS
SELECT
  date,
  json_extract_scalar(json_objects, '$.action') AS action,
  json_extract_scalar(json_objects, '$["@timestamp"]') AS "timestamp",
  json_objects
FROM <TABLE_NAME>;

With this view, you can easily extract specific fields from the JSON logs.
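
For instance, before you know exactly what you are hunting for, a simple aggregation over the view gives you an inventory of the event types present in a given window (the view name and date range are placeholders):

-- Count audit events per action type for January 2024. A useful
-- first pass to see which events your logs actually contain.
SELECT action, count(*) AS events
FROM <VIEW_NAME>
WHERE date BETWEEN '2024/01/01' AND '2024/02/01'
GROUP BY action
ORDER BY events DESC;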

Example Use Case: Querying for Repository Deletion Events

To demonstrate the setup, let's assume you want to find out who deleted a specific repository and when.

Step 1: Identify the JSON Schema

Run the following query to reveal the JSON schema of the event log for repository deletions.

SELECT * FROM <VIEW_NAME>
WHERE
  action = 'repo.destroy'
  AND date BETWEEN '2023/01/01' AND '2023/02/01';

This query helps you identify the schema of the logs related to the repo.destroy action. Here's an example of what the JSON schema might look like:

{
    "action": "repo.destroy",
    "actor": "<actor name>",
    "actor_id": <actor id>,
    "actor_ip": "<actor ip>",
    "user_agent": "<user agent>",
    "visibility": "<repo visibility>",
    "repo": "<repo name>",
    "repo_id": <repo id>,
    "public_repo": <is public>,
    "org": "<org name>",
    "org_id": <org id>,
    "_document_id": "<document id>",
    "@timestamp": <timestamp>,
    "created_at": <timestamp>,
    "operation_type": "<operation_type>",
    "business": "<business name>",
    "business_id": <business id>,
    "actor_location": {"country_code": "JP"}
}

Step 2: Create a Table for the Specific Event

Using the identified JSON schema, create a table tailored to the repo.destroy event. Note that Athena column names cannot contain special characters such as @, so the @timestamp key is mapped to an event_timestamp column via the SerDe's mapping property.

CREATE EXTERNAL TABLE IF NOT EXISTS `repo_destroy_events` (
  `action` string,
  `actor` string,
  `repo` string,
  `event_timestamp` string
)
PARTITIONED BY (
  `date` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'FALSE',
  'dots.in.keys' = 'FALSE',
  'case.insensitive' = 'TRUE',
  'mapping.event_timestamp' = '@timestamp'
)
LOCATION 's3://<YOUR_S3_PATH>/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.date.type' = 'date',
  'projection.date.format' = 'yyyy/MM/dd/HH',
  'projection.date.range' = '2023/01/01/01,NOW',
  'projection.date.interval' = '1',
  'projection.date.interval.unit' = 'HOURS',
  'storage.location.template' = 's3://<YOUR_S3_PATH>/${date}/'
);

Step 3: Query the Specific Table

Now, you can run a query on the newly created table to find out when and who deleted the repository.

SELECT * FROM repo_destroy_events
WHERE
  action = 'repo.destroy'
  AND repo = 'example-repo'
  AND date BETWEEN '2024/01/01' AND '2024/02/01';

This query will return the specific logs indicating when the repository was deleted and by whom.
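
If all you need is the who and the when, you can project just those columns and sort by time; this sketch reuses the event_timestamp column mapped in the DDL above:

-- Return only the actor and event time for each deletion,
-- newest first, so the responsible account is immediately visible.
SELECT actor, event_timestamp
FROM repo_destroy_events
WHERE
  action = 'repo.destroy'
  AND repo = 'example-repo'
  AND date BETWEEN '2024/01/01' AND '2024/02/01'
ORDER BY event_timestamp DESC;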

Why we need this and optimization tips

Why follow these steps?

In a nutshell, these steps are essential for achieving a general-purpose log search.

Before you can use Amazon Athena to query data in an S3 object, you must define a table. For each dataset that you query, Athena requires an underlying table that it uses to retrieve and return query results. Therefore, you must register the table with Athena before you can query the data. More about Athena tables.

However, defining the table in advance is challenging due to two main reasons:

  1. Different JSON Schemas for Different Events: GitHub logs have varied JSON schemas for different events. For example, the JSON schema for repo.destroy is different from public_key.create. Here are examples:

repo.destroy JSON Schema:

   {
       "action": "repo.destroy",
       "actor": "<actor name>",
       "actor_id": <actor id>,
       "actor_ip": "<actor ip>",
       "user_agent": "<user agent>",
       "visibility": "<repo visibility>",
       "repo": "<repo name>",
       "repo_id": <repo id>,
       "public_repo": <is public>,
       "org": "<org name>",
       "org_id": <org id>,
       "_document_id": "<document id>",
       "@timestamp": <timestamp>,
       "created_at": <timestamp>,
       "operation_type": "<operation_type>",
       "business": "<business name>",
       "business_id": <business id>,
       "actor_location": {"country_code": "JP"}
   }

public_key.create JSON Schema:

   {
       "action": "public_key.create",
       "actor": "<actor>",
       "actor_id": <actor_id>,
       "actor_ip": "<actor_ip>",
       "user_agent": "<user_agent>",
       "external_identity_nameid": "<external_identity_nameid>",
       "external_identity_username": "<external_identity_username>",
       "title": "<title>",
       "key": "<key>",
       "fingerprint": "<fingerprint>",
       "read_only": "<true>",
       "org": "<org>",
       "org_id": <org_id>,
       "repo": "<repo>",
       "repo_id": <repo_id>,
       "public_repo": <public_repo>,
       "_document_id": "<_document_id>",
       "@timestamp": <@timestamp>,
       "created_at": <timestamp>,
       "operation_type": "<operation_type>",
       "business": "<business>",
       "business_id": <business_id>,
       "actor_location": {"country_code": "JP"}
   }
  2. Undefined Search Use Case: The specific use case for log search is often not clear in advance. This uncertainty makes it difficult to create a one-size-fits-all Athena table.

To address these challenges, follow these steps:

  1. Examine the JSON Schema: Identify the structure of the JSON logs for each event type.
  2. Create an Athena Table: Define a table in Athena that matches the identified schema.
  3. Query the Data: Use SQL queries to retrieve and analyze the logs.

By following these steps, you can create a flexible and generic log search solution adaptable to various use cases, ensuring you don't miss critical details or fail to detect suspicious activities.

The Necessity of Tables and Views

Why do tables and views exist in this setup? The answer lies in the need to examine the JSON schema of the event.

To search logs effectively, you first need to know the JSON schema of the event. Unfortunately, the GitHub documentation does not provide a page that defines the JSON schema for each event. However, you can find the event names in this GitHub document, and every event log stored in S3 includes its event name in the action field.

For instance, consider the following example:

{
  "action": "repo.destroy",
  "actor": "<actor name>",
  // ...other fields
}

To make this schema discovery possible, we create both a table and a view:

  1. Create an Athena table with the event log JSON as-is:
   CREATE EXTERNAL TABLE IF NOT EXISTS `<TABLE_NAME>` (
     `json_objects` string
   )
   PARTITIONED BY (
     `date` string
   )
   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
   LOCATION 's3://<YOUR_S3_PATH>/'
   TBLPROPERTIES (
     'projection.enabled' = 'true',
     'projection.date.type' = 'date',
     'projection.date.format' = 'yyyy/MM/dd/HH',
     'projection.date.range' = '2023/01/01/01,NOW',
     'projection.date.interval' = '1',
     'projection.date.interval.unit' = 'HOURS',
     'storage.location.template' = 's3://<YOUR_S3_PATH>/${date}/'
   );
  2. Create an Athena view to extract fields:
   CREATE VIEW <VIEW_NAME> AS
   SELECT
     date,
     json_extract_scalar(json_objects, '$.action') AS action,
     json_extract_scalar(json_objects, '$["@timestamp"]') AS "timestamp",
     json_objects
   FROM <TABLE_NAME>;

This is why both tables and views are necessary. The table stores the raw JSON logs, and the view helps extract and query specific fields from these logs.
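
The view is also easy to extend as investigations evolve. If you find yourself repeatedly filtering on the actor, for example, you could recreate the view with that field extracted as well; here is a sketch with the same placeholder names:

-- Recreate the view with an extra extracted column.
-- json_extract_scalar returns NULL when an event lacks the key,
-- so adding fields is safe across heterogeneous event schemas.
CREATE OR REPLACE VIEW <VIEW_NAME> AS
SELECT
  date,
  json_extract_scalar(json_objects, '$.action') AS action,
  json_extract_scalar(json_objects, '$.actor') AS actor,
  json_extract_scalar(json_objects, '$["@timestamp"]') AS "timestamp",
  json_objects
FROM <TABLE_NAME>;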

How to Use ROW FORMAT SERDE

In this article, we used two types of ROW FORMAT SERDE:

  1. LazySimpleSerDe:
   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  2. JsonSerDe:
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

ROW FORMAT SERDE specifies how Athena deserializes data in S3.

  • LazySimpleSerDe: Reads the data in S3 one line at a time as a string. This is useful for examining the entire JSON schema.
  • JsonSerDe: Reads the data in S3 as JSON, which allows you to handle individual keys in the JSON. This is more flexible and precise for detailed queries.

Example of a GitHub audit log file:

{"key":"value", ...}
{"key":"value", ...}

For the table where we want to examine the entire JSON schema, we use LazySimpleSerDe. For the table where we want to handle individual keys in JSON, we use JsonSerDe.
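
To make the difference concrete, here is the same question asked against both table styles (all names are the placeholders used earlier):

-- LazySimpleSerDe table: each line is a single string column, so
-- JSON fields must be pulled out at query time.
SELECT json_extract_scalar(json_objects, '$.actor') AS actor
FROM <TABLE_NAME>
WHERE json_extract_scalar(json_objects, '$.action') = 'repo.destroy'
  AND date BETWEEN '2024/01/01' AND '2024/02/01';

-- JsonSerDe table: fields were parsed into columns at read time,
-- so they can be referenced directly.
SELECT actor
FROM repo_destroy_events
WHERE action = 'repo.destroy'
  AND date BETWEEN '2024/01/01' AND '2024/02/01';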

Please refer to the official documentation for more details: JSON SerDe.

Lower Costs with Partitions

When you run a query in Athena, it performs a full scan against S3 by default, which can be costly in terms of both time and money. To mitigate this, we use partitions to narrow the search target. Athena partitions can be configured for S3 object paths.

GitHub Enterprise Cloud's audit logs are output in the format yyyy/MM/dd/HH/mm/<uuid>.json.gz, so it makes sense to partition the data by date. This allows us to target specific time frames rather than scanning all logs.

However, a conventionally partitioned table only knows about the partitions that have been registered in the catalog. If you load partition information at one point in time and later query a date range whose data arrived afterward, Athena simply will not see the new data. To avoid this, you would need to load partition information regularly.

Before you can query partitioned tables, you must update the AWS Glue Data Catalog with partition information.
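
With the non-Hive-style yyyy/MM/dd/HH layout that GitHub uses (there is no date= prefix in the path), MSCK REPAIR TABLE cannot auto-discover partitions, so each new hourly prefix would need an explicit registration along these lines (placeholder names again):

-- Register a single hourly partition by hand. Without partition
-- projection, a statement like this is needed for every new hour.
ALTER TABLE <TABLE_NAME> ADD IF NOT EXISTS
  PARTITION (`date` = '2024/01/15/06')
  LOCATION 's3://<YOUR_S3_PATH>/2024/01/15/06/';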

Instead of regular loading, we use a feature called partition projection. Partition projection is a mechanism by which Athena determines the partition at query execution based on predetermined information (such as the position, type, and range of the S3 path).

For instance, we know that the audit log is output in the format s3://<S3_BUCKET_NAME>/yyyy/MM/dd/HH/mm/<uuid>.json.gz, so we can have Athena treat the year-through-hour portion of the path as a partition. Here is the relevant part of the DDL (Data Definition Language) statement that sets up partition projection:

TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.date.type' = 'date',
  'projection.date.format' = 'yyyy/MM/dd/HH',
  'projection.date.range' = '2023/01/01/01,NOW',
  'projection.date.interval' = '1',
  'projection.date.interval.unit' = 'HOURS',
  'storage.location.template' = 's3://<YOUR_S3_PATH>/${date}/'
  -- ${date} is a literal template variable; use it as-is, without replacement
)

Safeguard Your Query Costs

While partition projection can narrow the scope of a query, it only works effectively if you write a query that correctly determines the partitions. For example, specifying the date in the WHERE clause is crucial to avoid a full scan of the data.
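
For example, because the projected date key is hour-granular (yyyy/MM/dd/HH), a single day can be targeted precisely (placeholder names again):

-- The hour-granular date predicate lets partition projection prune
-- the scan to 24 hourly prefixes instead of the whole bucket.
SELECT action, "timestamp"
FROM <VIEW_NAME>
WHERE date BETWEEN '2024/01/15/00' AND '2024/01/15/23';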

To prevent unexpected costs, it's a good idea to set query limits. Athena workgroups support data usage controls that cap the amount of data a single query (or the whole workgroup) may scan, keeping your query costs manageable. Learn how to set these limits.

By following these practices, you can effectively manage your query costs while ensuring efficient and accurate data retrieval.

Conclusion

GitHub’s audit logs are retained for only six months, and Git events for just seven days. This mismatch can create significant security gaps, leaving your organization vulnerable to undetected malicious activities. Streaming these logs to Amazon S3 for long-term retention is essential, but it’s equally important to be able to query them effectively. Without proper querying capabilities, you risk missing critical information during security incidents, failing to meet compliance requirements, and being unable to perform thorough forensic analysis.

Using Amazon Athena to query GitHub audit logs stored in S3 addresses these challenges by offering immediate access to detailed insights and cost-efficient querying capabilities. By setting up tables and views in Athena, you can transform raw JSON logs into actionable data, helping you monitor changes, investigate incidents, and ensure compliance.

Key benefits include:

  1. Immediate Access: Quickly find specific events without sifting through raw data.
  2. Detailed Insights: Get precise information on who performed actions, when they occurred, and from where.
  3. Cost Efficiency: Use partitioning and partition projection to optimize query costs and avoid unnecessary expenses.

To set up Athena for querying GitHub audit logs, follow these steps:

  1. Create an Athena Table: Define a table that reads the JSON-formatted logs from S3.
  2. Create an Athena View: Make querying specific fields easier by creating a view that extracts key data from the JSON logs.
  3. Query Specific Events: Use SQL queries to retrieve and analyze logs, focusing on relevant details such as repository deletions.

Additionally, managing costs with partitioning and partition projection ensures efficient and affordable queries, while setting query limits helps prevent unexpected expenses.

By implementing this setup, you can eliminate the default security blind spots in GitHub Enterprise, ensuring your logs are always accessible, insightful, and cost-effective. For more tips and insights on security and log analysis, follow me on Twitter @Siddhant_K_code and stay up to date with detailed tech content like this.
