Arka Bhowmik

API Log Management | ETL with Vector | Searching and Debugging in Production

Hi! This article walks through the first steps for maintaining mid- to large-scale production logs, which are needed for debugging and auditing, and recommends some off-beat ETL services for processing and querying them.

For any application, when you log something, it needs to be easily searchable and rich with relevant information. Hence it's important to:

  • get rid of irrelevant data inside the logs
  • maintain a logging ID for every unique request that stays consistent throughout your web app, from the UI to the backend
  • have a system that can search the logs to track requests

The diagram below shows a basic logging system in which the UI generates a trace_id (a UUID) for every request a user makes and sends it in a header to the backend services. Whoever debugs an issue later can use this ID to follow the entire request from frontend to backend.

[Diagram: a trace_id travelling with a request from the UI to the backend services]
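
To make the trace_id idea concrete, here is a minimal Python sketch. The header name `X-Trace-Id` and the helper `with_trace_id` are assumptions made up for illustration, not something the setup above prescribes. Use whatever header your services agree on.

```python
import uuid

# Hypothetical header name; any name works as long as every service uses the same one.
TRACE_HEADER = "X-Trace-Id"


def with_trace_id(headers):
    """Return a copy of headers that carries a trace id, reusing an existing one if present."""
    out = dict(headers)
    if not out.get(TRACE_HEADER):
        out[TRACE_HEADER] = str(uuid.uuid4())
    return out


# The UI (or the first hop) creates the id once...
ui_headers = with_trace_id({"Accept": "application/json"})
# ...and every downstream call forwards the same headers, so the id never changes.
backend_headers = with_trace_id(ui_headers)
```

Because the helper only generates an id when none is present, calling it at every hop is safe: the first value wins and ends up in every log line for that request.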

Logging Service

The best place to log requests is the website gateway, where you see the entire request from the UI in its raw state. Since nginx is one of the most popular and easy-to-configure gateway / reverse-proxy servers, here is an example that configures the nginx log format to output JSON:

    log_format mycombined escape=json
        '{"status": "$status", '
        '"upstream_addr": "$upstream_addr", '
        '"time_local": "$time_local", '
        '"method": "$request_method", '
        '"request_uri": "$request_uri", '
        '"http_retry": "$http_retry", '
        '"http_authorization": "$http_authorization", '
        '"http_x_forwarded_for": "$http_x_forwarded_for", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time"}';

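The escape parameter controls how special characters in variable values are escaped when the log line is written. With escape=json, every character that is not allowed inside a JSON string (such as double quotes, backslashes, and control characters) is escaped, so each line stays valid JSON.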
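
As a quick sanity check, a line in this format can be read with any JSON parser. The sketch below uses a made-up sample line (not real output) and a hypothetical helper, `parse_access_line`, to show two things: escaped quotes survive the round trip, and every value arrives as a string because the format quotes all variables.

```python
import json

# A made-up line in the shape produced by the log_format above (only some fields shown).
sample_line = (
    '{"status": "200", "method": "GET", '
    '"request_uri": "/search?q=\\"vector\\"", '
    '"request_time": "0.012", "upstream_response_time": "0.010"}'
)


def parse_access_line(line):
    """Parse one JSON access-log line and convert the simple numeric strings to numbers."""
    event = json.loads(line)
    event["status"] = int(event["status"])
    event["request_time"] = float(event["request_time"])
    # upstream_response_time is left as a string: nginx may write "-" or several
    # comma-separated values when more than one upstream was contacted.
    return event


event = parse_access_line(sample_line)
```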

Now that Nginx logs are being produced as JSON, we can process them in our log shipper.
A sample Vector configuration for this is shown further below.

Log Shipping ETL service

Vector is a stand-alone ETL daemon with low memory consumption. It can tail your logs, process them, and send them to your storage destinations through a high-level pipeline. For a mid-sized service handling around 2000 RPM, Vector may not even need 1 GB of RAM, so it suits services that cannot justify dedicated machines for ETL, unlike a Logstash/ELK setup.

Please refer to Vector's documentation to see everything Vector can do and to understand the configuration below, if it piques your interest.

Our sample Vector configuration:

  • takes nginx access logs as input (source block)
  • identifies the log message as JSON and parses the JSON data into fields (transforms block)
  • parses nested JSON fields and extracts fields from them (transforms block)
  • adds new fields
  • filters out unwanted fields
  • has two sinks: one writes to standard output in the terminal for debugging, the other sends the processed logs to AWS S3

    # source file to pick up logs
    [sources.nginx_access_log]
      type = "file"
      include = ["/var/log/nginx/access.log"]
      start_at_beginning = false

    # parse the log 'message' as json
    [transforms.nginx_as_json]
      type = "json_parser" # required
      inputs = ["nginx_access_log"] # required
      drop_invalid = true # required

    # you can extract sub json fields too and add them as normal fields.
    # Here we have a custom nginx header var called jwt, might be different for you
    [transforms.jwt_extract]
      type = "json_parser" # required
      inputs = ["nginx_as_json"]
      field = "jwt"
      drop_invalid = true

    # extract nested fields in jwt into parent fields (original message)
    [transforms.jwt_flatten]
      type = "add_fields"
      inputs = ["jwt_extract"]
      fields.user_id = "{{user.userid}}"

    # discard unnecessary fields
    [transforms.filter_fields]
      type = "remove_fields" # required
      inputs = ["jwt_flatten"] # required
      fields = ["user", "timestamp", "source_type", "http_retry", "request_length", "file", "body_bytes_sent"] # required

    # this console output is just for debug, delete it if not required
    # (sink names are arbitrary)
    [sinks.console_debug]
      # General
      type = "console" # required
      inputs = ["filter_fields"] # required
      target = "stdout" # optional, default

      # Encoding
      encoding.codec = "json" # required

    # example to upload each processed log into amazon s3 using IAM role
    # demo folder : date/subfolder/TheLogFile.log
    [sinks.s3_archive]
      # General
      type = "aws_s3" # required
      inputs = ["filter_fields"] # required
      bucket = "gateway-logs" # required
      compression = "none" # you can set it to gzip too for almost 3x compression
      healthcheck = true # optional, default
      region = "us-east-1" # required when endpoint = ""

      # Batch
      batch.max_events = 100 # optional, no default, events
      batch.timeout_secs = 60 # optional, seconds

      # Buffer
      # maximum number of events held in the buffer
      buffer.max_events = 100 # relevant when type = "memory"
      # where the buffer is kept
      buffer.type = "memory" # optional

      # Encoding [ here newline-delimited JSON ]
      encoding.codec = "ndjson"

      # Naming [the full prefix of your s3 bucket object, here date/SOME_APP_NAME/*.log]
      key_prefix = "%F/{{app}}/" # optional
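
If you want to reason about what an event looks like after these transforms without running Vector, the chain can be mimicked roughly in plain Python. This is only a sketch under assumptions: the functions `process` and `render_key_prefix` are made up, the sample event is invented, and it assumes the logged line carries a `jwt` field holding JSON with a `user.userid` value plus an `app` field for the key prefix. It is not how Vector is implemented internally.

```python
import json
from datetime import datetime, timezone

# Same list as the remove_fields transform in the configuration above.
DROP_FIELDS = ["user", "timestamp", "source_type", "http_retry",
               "request_length", "file", "body_bytes_sent"]


def process(raw_event):
    """Mimic the transform chain: parse message, parse jwt, lift user_id, drop fields."""
    event = dict(raw_event)
    try:
        event.update(json.loads(event.pop("message")))  # nginx_as_json
        event.update(json.loads(event.pop("jwt")))      # jwt_extract
    except (KeyError, TypeError, ValueError):
        return None                                     # roughly what drop_invalid = true does
    user = event.get("user")
    event["user_id"] = user.get("userid") if isinstance(user, dict) else None  # jwt_flatten
    for name in DROP_FIELDS:                            # filter_fields
        event.pop(name, None)
    return event


def render_key_prefix(event, when):
    """Render "%F/{{app}}/": %F is the date as YYYY-MM-DD, {{app}} is an event field."""
    return when.strftime("%Y-%m-%d") + "/" + str(event["app"]) + "/"


# Invented sample event, shaped like what the file source would hand to the transforms.
raw = {
    "message": json.dumps({
        "status": "200", "method": "GET", "request_uri": "/orders",
        "http_retry": "", "body_bytes_sent": "512",
        "jwt": json.dumps({"user": {"userid": "u-42"}}),
        "app": "orders-api",
    }),
    "file": "/var/log/nginx/access.log",
    "source_type": "file",
    "timestamp": "2021-01-15T10:00:00Z",
}
processed = process(raw)
prefix = render_key_prefix(processed, datetime(2021, 1, 15, tzinfo=timezone.utc))
```

With this sample, `processed` keeps only `status`, `method`, `request_uri`, `app`, and the new `user_id`, and `prefix` becomes `2021-01-15/orders-api/`, which matches the date/SOME_APP_NAME layout described in the sink comments.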

Log Storage Service with Query Support

Logs generated at a gateway can pile up into a huge amount of data (even 1.5 GB per day). Since we plan to keep this data and query it later, we would normally need a big-data stack such as HDFS with Hive/HBase to store and query it. Maintaining Hadoop clusters can be tricky and will cost you a lot of $$$. But accessing logs is an infrequent task, and keeping a big-data cluster up 24x7 for that is overkill. Luckily, AWS has us covered with its serverless services, where we can:

  • Use AWS S3 much like an HDFS storage system
  • Query the S3 logs (even JSON) with Amazon Athena, without further processing. We can even create multiple external tables in Amazon Athena over the same S3 data, each catering to specific query types and partitioned for those queries, at no extra storage cost! An external table is just a schema definition that tells Athena how to read and query your data. Note that an external table is NOT your data. Learn more in the Athena documentation on external tables.
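
To give a feel for what such an external table could look like, the sketch below builds the Athena DDL as plain strings in Python. Everything specific here is an assumption for illustration: the helper names, the table name `gateway_logs`, the column list (a subset of the logged fields, all typed as strings), and the `dt` partition column mapped onto the date prefix from the S3 sink. The statements still have to be run in Athena (console or API) to take effect.

```python
# Subset of the fields that survive the pipeline; all are strings in the JSON output.
COLUMNS = ["status", "upstream_addr", "time_local", "method", "request_uri",
           "http_x_forwarded_for", "request_time", "upstream_response_time", "user_id"]


def create_table_sql(table, bucket):
    """Build a CREATE EXTERNAL TABLE statement for JSON logs stored under s3://bucket/."""
    cols = ",\n".join(f"  `{name}` string" for name in COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "PARTITIONED BY (dt string)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION 's3://{bucket}/'"
    )


def add_partition_sql(table, bucket, day):
    """Register one day's prefix (YYYY-MM-DD/) as a partition of the table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{day}') LOCATION 's3://{bucket}/{day}/'"
    )


ddl = create_table_sql("gateway_logs", "gateway-logs")
partition = add_partition_sql("gateway_logs", "gateway-logs", "2021-01-15")
```

Since the table only describes the data, you can define a second table with different columns or partitions over the same bucket and drop either one without touching the log files themselves.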

Thanks for reading. This is a WIP; I will add more linked notes on partitioning and optimizing big data stores soon! :)


Learn NGINX ReverseProxy |LB| Gateway
Getting Started with Vector
Vector vs ELK
AWS S3 as Big Data Storage