Hi! This article will help you take baby steps toward maintaining mid-to-large-scale production logs, which are necessary for debugging and auditing, and recommends some off-beat ETL services to process and query them.
For any application, when you log something, it needs to be easily searchable and rich with relevant information. Hence it's important to:
- get rid of irrelevant data inside the logs
- maintain a logging ID for every unique request that stays consistent throughout your web app, from UI to backend
- have a system that can search the logs to track requests
The diagram below shows a basic logging setup where a trace_id (a UUID) is generated by the UI for every request a user makes and sent as a header to the backend services. This ID can later be used to trace the entire request from frontend to backend when debugging any issue.
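As a minimal sketch of the UI side (assuming a modern browser with crypto.randomUUID() available, and an X-Trace-Id header name that is only an example; use whatever header your gateway and backend agree on):

// wrap fetch so every request carries a unique trace_id header
async function fetchWithTrace(url: string, init: RequestInit = {}): Promise<Response> {
  const traceId = crypto.randomUUID(); // one UUID v4 per request
  const headers = new Headers(init.headers);
  headers.set("X-Trace-Id", traceId); // assumed header name; not prescribed by this article
  return fetch(url, { ...init, headers });
}

// usage: every call made through this wrapper can be traced end to end
// await fetchWithTrace("/api/orders");

On the Nginx side, such a header becomes available as the variable $http_x_trace_id and could be added to the JSON log_format shown in the next section, so the gateway logs carry the same ID.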
Logging Service
The best place to log requests is the website gateway. Here you get the entire request sent from the UI in its raw state. Since Nginx is one of the most popular and easiest-to-configure gateway/reverse-proxy servers, here's an example of configuring the Nginx logging format to output JSON:
log_format mycombined escape=json '{'
    '"status": "$status", '
    '"upstream_addr": "$upstream_addr", '
    '"time_local": "$time_local", '
    '"method": "$request_method", '
    '"request_uri": "$request_uri", '
    '"http_retry": "$http_retry", '
    '"http_authorization": "$http_authorization", '
    '"http_x_forwarded_for": "$http_x_forwarded_for", '
    '"body_bytes_sent": "$body_bytes_sent", '
    '"request_time": "$request_time", '
    '"upstream_response_time": "$upstream_response_time"'
  '}';
The escape=json parameter tells Nginx how to handle special characters while writing the log: any character that is not valid inside a JSON string will be escaped.
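To actually emit logs in this format, reference it from an access_log directive (the path below is the common default and only an example):

# write gateway access logs using the JSON format defined above
access_log /var/log/nginx/access.log mycombined;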
Now that Nginx logs are being produced as JSON, we can process them in our log shipper. A sample Vector configuration for this is walked through in the next section.
Log Shipping ETL Service
Vector is a low-memory, standalone ETL daemon that can tail your logs, process them, and send them to your data destinations using a high-level pipeline. It may not consume even 1 GB of RAM for mid-sized services handling around 2000 RPM (requests per minute), which makes it ideal for services that don't need dedicated machines for ETL, unlike a typical Logstash/ELK setup.
Please refer to Vector's documentation to understand everything Vector can do, and to dig deeper into the configuration below if it piques your interest.
Our sample Vector configuration:
- picks up Nginx access logs (sources block),
- identifies the log message as JSON and parses the JSON data into fields (transforms block),
- parses nested JSON fields and extracts variables from them (transforms block),
- adds new fields,
- filters out unwanted fields,
- has two sinks: one writing to standard output in the terminal for debugging, and the other sending the processed logs to AWS S3.
# THIS IS THE VECTOR CONF PIPELINE
# source file to pick up logs
[sources.nginx_access_log]
type = 'file'
include = [ '/var/log/nginx/access.log' ]
start_at_beginning = false
# parse the log 'message' as json
[transforms.nginx_as_json]
type = "json_parser" # required
inputs = ["nginx_access_log"] # required
drop_invalid = true # required
# you can extract sub json fields too and add it as normal fields.
# Here we have a custom nginx header var called jwt, might be different for you
[transforms.jwt_extract]
type = "json_parser" # required
inputs = ["nginx_as_json"]
field = "jwt"
drop_invalid = true
# extract nested fields in jwt, into parent fields( original message)
[transforms.jwt_flatten]
type = "add_fields"
inputs = ["jwt_extract"]
fields.user_id = "{{user.userid}}"
fields.app = "{{user.app}}"
# discard unnecessary fields
[transforms.filter_fields]
type = "remove_fields" # required
inputs = ["jwt_flatten"] # required
fields = ["user","timestamp", "source_type", "http_retry", "request_length", "file", "body_bytes_sent"] # required
# this console sink is just for debugging; remove it if not required
[sinks.console]
# General
type = "console" # required
inputs = ["filter_fields"] # required
target = "stdout" # optional, default
# Encoding
encoding.codec = "json" # required
# example to upload each processed log into amazon s3 using IAM role
# demo folder : date/subfolder/TheLogFile.log
[sinks.aws_s3]
# General
type = "aws_s3" # required
inputs = ["filter_fields"] # required
bucket = "gateway-logs" # required
compression = "none" # you can set this to gzip for almost 3x compression
healthcheck = true # optional, default
region = "us-east-1" # required when endpoint is not set
# Batch
batch.max_events = 100 # optional, no default, events
batch.timeout_secs = 60 # optional, seconds
# Buffer
# max events inside buffer after which its flushed to s3
buffer.max_events = 100 # relevant when type = "memory"
# where the buffer is kept in
buffer.type = "memory" # optional
# Encoding [ here JSON ]
encoding.codec = "ndjson"
# Naming [The full prefix to your s3 bucket object ,here date/SOME_APP_NAME/*.log]
key_prefix = "%F/{{app}}/" # optional
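Assuming the pipeline above is saved as /etc/vector/vector.toml (the path is just an example), Vector can be started against it with:

vector --config /etc/vector/vector.toml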
Log Storage Service with Query Support
Logs generated from a gateway can pile up into a huge amount of data (even 1.5 GB per day). Since we plan to save this data to be queried later, we would normally need a Big Data stack like HDFS with Hive/HBase to store and query it. Maintaining Hadoop clusters can be tricky and will cost you a lot of $$$. But accessing logs is an infrequent task, and keeping a Big Data cluster up 24x7 just for that is overkill. The good thing is that AWS has us covered with its serverless offerings, where we can:
- use AWS S3 as an HDFS-like storage system
- query the S3 logs (even JSON) using Amazon Athena, without further processing.
We can even create multiple external tables in Amazon Athena on the same S3 data, each table catering to specific query types and hence partitioned optimally for those queries, at no extra storage cost!
An external table is just a schema definition for your data that helps you query it. Note that an external table is NOT your data.
Learn more about external tables here
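As a rough sketch of what such a table could look like over the ndjson objects our Vector pipeline ships to S3 (the table name, column subset, and bucket are assumptions based on the fields kept above; adjust them to your own data, and cast numeric fields in your queries as needed):

-- one external table over the JSON logs written by the aws_s3 sink
CREATE EXTERNAL TABLE IF NOT EXISTS gateway_logs (
  status string,
  upstream_addr string,
  time_local string,
  method string,
  request_uri string,
  request_time string,
  upstream_response_time string,
  user_id string,
  app string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://gateway-logs/';
-- example query: SELECT status, count(*) FROM gateway_logs GROUP BY status;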
Thanks for reading. This is a WIP; I will add more linked notes on partitioning and optimizing Big Data stores soon! :)
References:
Learn NGINX ReverseProxy |LB| Gateway
Getting Started with Vector
Vector vs ELK
AWS S3 as Big Data Storage
Athena