JSON is one of the most common log data formats thanks to its flexibility, extensibility, and readability. However, massive volumes of JSON logs also pose challenges to efficient analysis. This article describes the best practices for processing and analyzing JSON logs in Alibaba Cloud Simple Log Service (SLS), helping you accurately and efficiently extract the core value from vast volumes of seemingly unordered data.
1. Data Preprocessing: Laying the Foundation for Efficient Analysis from the Source
For JSON logs with a relatively fixed structure, the optimal strategy is to "flatten" them before they enter storage, that is, to expand nested JSON fields into independent top-level fields.
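For example, a raw log might arrive with a nested JSON field like the following (the values here are illustrative):
Payload: {"Method": "GET", "Status": 200, "Latency": 35}
After flattening, it is stored as independent, directly queryable fields (whether the original field name is kept as a prefix depends on your processing configuration):
Method: GET
Status: 200
Latency: 35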
The benefits are obvious:
● Better query performance: Queries no longer need to parse JSON in real time and can analyze each field directly, which makes them faster and more efficient.
● Lower storage costs: Flattening the logs eliminates redundancy caused by the JSON format itself, such as curly braces, quotation marks, and commas. This effectively reduces storage overhead.
SLS provides multiple ways to preprocess data before it is written to storage:
Method 1: Processing during collection (recommended for Logtail users)
If you use Logtail for log collection, you can directly leverage its built-in JSON plug-in. The plug-in parses and expands JSON objects into multiple independent fields during the collection phase, normalizing the data at the source. If only some fields in the log are JSON strings, you can combine the plug-in with SPL statements to process those specific fields.
Method 2: Processing during writing (Ingestion Processor)
When logs come from diverse sources (for example, APIs and SDKs in addition to Logtail), or when you cannot control the configuration of the collection client, you can configure an ingestion processor on the Logstore in SLS. All data written to the Logstore is processed by the ingestion processor before it is persisted, and JSON expansion can be performed at this step as well. This provides a centralized, unified entry point for data governance.
Method 3: Processing after writing (Data Processing)
There's no need to worry if the JSON logs have already been stored in the Logstore. You can use a data processing task to read data from the source Logstore, process it with SPL, and then write the processed structured data to a new Logstore. This is useful for the cleaning and archiving of historical data.
No matter which method you choose, SPL (SLS Processing Language) is the core tool for your JSON data processing. It can easily perform operations such as expansion, extraction, and conversion, paving the way for efficient analysis later on.
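As a minimal sketch, assuming the raw log arrives with a field named content that holds a JSON string, an SPL statement along the following lines expands it into independent fields and then drops the original raw field (check the SPL reference for the exact instruction parameters):
* | parse-json content | project-away content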
2. Configuring Indexes: Striking the Optimal Balance Between Structure Preservation and Query Performance
Although "flattened" storage is ideal, sometimes we want to retain the original JSON structure of a field to better reflect the hierarchical relationships in the logs, while still being able to query and analyze the data flexibly.
You can create a JSON-type index for the JSON field itself (for example, Payload) and separately create sub-node indexes for the most commonly used leaf node paths (such as Payload.Status). In this way, the complete JSON structure of the Payload field is retained, and high-frequency sub-nodes can be queried and analyzed as quickly as ordinary fields.
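For example, assuming the Payload.Status sub-node index is created with analysis (statistics) enabled, an illustrative aggregation looks like the following; note that a field name containing a dot must be enclosed in double quotation marks in the analytic statement:
* | select "Payload.Status" as status, count(*) as cnt group by status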
What if there are too many sub-nodes to configure one by one? You can select "Automatically index all text fields in JSON" so that all text-type sub-nodes in the JSON are indexed automatically and can be queried directly. For example, even though no index is added for the Method sub-node in the index configuration, you can still query the Payload.Method keyword directly.
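For example, the following search statement (the value POST is illustrative) matches logs whose Payload.Method sub-node contains POST, even though no dedicated index was configured for that path:
Payload.Method: POST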
3. JSON Functions: The Versatile Tool for Delving Into JSON
As mentioned earlier, you can add indexes to JSON sub-nodes for direct analysis. However, there are situations where adding sub-node indexes is not possible or convenient:
(1) The fields contained in the JSON object are uncertain and cannot be enumerated in advance.
(2) Separate indexes cannot be created for JSON arrays or for intermediate (non-leaf) nodes of a JSON object.
This is where the powerful JSON analysis functions in SLS come in handy. They allow you to perform real-time, flexible, and in-depth analysis of JSON data in SQL queries.
json_extract or json_extract_scalar: Choose the right tool
Both functions serve as the foundation for JSON extraction, but differ in their purposes:
● json_extract(json, json_path): Returns a JSON object or JSON array. Use this function when you need to operate on the JSON structure itself (for example, calculating the length of an array).
● json_extract_scalar(json, json_path): Returns a scalar value (string, number, or Boolean), but its return type is always varchar (string). This is the most commonly used function for extracting field values for analysis.
Note: When using json_extract_scalar to extract numeric values for calculation, you need to convert them to a numeric type using the CAST function first.
* | select json_extract_scalar(Payload, '$.Method') as method, avg(cast(json_extract_scalar(Payload, '$.Latency') as bigint)) as latency group by method
In what scenarios do I use the json_extract function?
Use it when you need to perform analytical operations on the JSON structure itself. For example, consider logs in which Payload is a JSON field that contains a sub-array named Params, whose elements are JSON objects.
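Such a log might look like this (the values are illustrative; the field and key names match those used in the queries in this article):
Payload: {"Method": "GET", "Latency": 35, "Params": [{"key": "id", "value": 12}, {"key": "size", "value": 3}]}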
Suppose we now want to analyze the average length of the Params array in each log. This is precisely analyzing a JSON array, so we need to first extract the array using the json_extract function, then calculate its length with json_array_length, and finally compute the average.
* | select avg(json_array_length(json_extract(Payload, '$.Params')))
json_extract_long/double/bool: Say goodbye to cumbersome type conversions
The return value of json_extract_scalar is always of the varchar type, which means that before performing numeric calculations, you must use the CAST function to convert it. This not only increases the complexity of SQL, but also incurs additional performance overhead.
To simplify queries and improve performance, SLS provides preset extraction functions. They can directly extract values of the specified type, eliminating the need for CAST operations.
● json_extract_long(json, json_path): Extracts the value as a 64-bit integer.
● json_extract_double(json, json_path): Extracts the value as a double-precision floating-point number.
● json_extract_bool(json, json_path): Extracts the value as a Boolean.
As seen in the previous example, the expression
cast(json_extract_scalar(Payload, '$.Latency') as bigint) as latency
can be simplified into
json_extract_long(Payload, '$.Latency') as latency
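Applied to the earlier latency query, the full statement becomes:
* | select json_extract_scalar(Payload, '$.Method') as method, avg(json_extract_long(Payload, '$.Latency')) as latency group by method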
json_path: Master JSON paths for precise positioning
When using functions like json_extract or json_extract_scalar to extract data from JSON, you need to specify a json_path to indicate which part of the JSON you want to extract.
The basic format of a json_path is "$.a.b", where the $ symbol represents the root node of the current JSON object and the "." symbol references the child node to extract.
● What if a JSON key itself contains a dot, such as a.b?
For example, if Payload has a sub-node named "user.agent", you can use square brackets ([]) instead of the dot notation, enclosing the node name in double quotes inside the brackets.
* | select json_extract_scalar(Payload, '$.["user.agent"]')
● What if you need to extract a specific element from a JSON array?
Square brackets [] are also used here, with the number inside the brackets serving as the array index, starting from 0.
* | select json_extract_long(Payload, '$.Params[0].value')
JSON array analysis: the unnest function turns complexity into simplicity
When a single log entry contains a list of items (that is, a JSON array), a common analysis requirement is to expand the array and perform aggregate analysis on each of its elements. The unnest function is designed precisely for this purpose. It can extract each element from an array as a separate row.
Using the same log structure as above (a Params array whose elements contain key and value fields), the following query computes the average value for each key:
* | select json_extract_scalar(kv, '$.key') as key, avg(json_extract_long(kv, '$.value')) as value FROM log, unnest(cast(json_extract(Payload, '$.Params') as array(json))) as t(kv) group by key
Execution process analysis:
● json_extract extracts the Params array.
● CAST(... AS ARRAY (JSON)) converts it to a JSON array type that is recognized by SLS.
● UNNEST(...) AS t(kv) expands the array, and the kv column in each row is an element (a JSON object) of the original array.
● Finally, json_extract_scalar and json_extract_long are applied to each kv element, and the results are grouped and aggregated.
4. SQL Copilot: Intelligent Generation of SQL Statements
When dealing with complex analysis requirements, writing SQL manually is not only time-consuming but also error-prone. The built-in SQL Copilot feature of Alibaba Cloud SLS changes this: you only need to describe your analysis goal in natural language (for example, "Expand the Params array in Payload and calculate the average 'value' grouped by 'key'"), and Copilot automatically generates an accurate SQL query for you.
This means you can focus more on "what to analyze" rather than "how to query".
Practice suggestion: At the beginning of the analysis, you may first use SQL Copilot to generate basic queries, and then fine-tune and optimize them according to specific requirements. This approach will help you work more efficiently and get better results with less effort.
Summary and Outlook
Efficient analysis of JSON logs is a systematic undertaking. To support it, SLS provides an end-to-end solution that covers everything from data ingestion to query and analysis:
● Prioritize data normalization: If conditions permit, expanding JSON before it is written, through collection plug-ins, ingestion processors, or data processing tasks, is the optimal approach for high-performance, low-cost analysis.
● Use indexes well: For logs stored in JSON format, creating sub-node indexes for frequently accessed paths or automatically indexing all text fields in JSON is key to accelerating queries.
● Master core functions: When you need real-time, flexible analysis, skillful use of JSON analysis functions such as the json_extract family and unnest gives you powerful tools for delving into the data.
● Embrace AI as an enabler: With SQL Copilot, let natural language serve as a bridge for your conversations with data, greatly simplifying the analysis process.
Mastering these methods and techniques will enable you to confidently handle various complex JSON log analysis scenarios, truly transforming massive log data into valuable assets that drive business decisions.