Benchmarking AWS Nova on Log Data: How It Compares to ChatGPT-3.5

#ai #logging #machinelearning #devops

Authored by Benoit Gaudin

This post explores the use of large language models (LLMs) for analyzing log data. To do so, we reproduced part of the benchmark from "An Assessment of ChatGPT on Log Data", originally conducted in 2023 by Intel researchers Priyanka Mudgal and Rita Wouhaybi.

While that initial benchmark used GPT-3.5-turbo, our study evaluates the AWS Nova Micro model. Our goal is to assess whether more recent, smaller, and cheaper models can match or exceed the performance of GPT-3.5-turbo from a few years ago. The economics are particularly interesting: Nova Micro's cost per input token is 14 times lower than GPT-3.5-turbo's was two years ago.
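To make the 14x figure concrete, here is a minimal sketch of the arithmetic. The baseline price and the daily token volume are hypothetical placeholders chosen for illustration, not figures from this post or from any price list; only the ratio of 14 comes from the text above.

```python
# Illustrative cost comparison. BASELINE_PRICE_PER_MILLION and DAILY_INPUT_TOKENS
# are hypothetical values; only PRICE_RATIO (14) is taken from the post.
PRICE_RATIO = 14
BASELINE_PRICE_PER_MILLION = 0.50  # USD per 1M input tokens (assumed)
CHEAPER_PRICE_PER_MILLION = BASELINE_PRICE_PER_MILLION / PRICE_RATIO
DAILY_INPUT_TOKENS = 2_000_000_000  # assumed daily log volume in tokens


def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Return the input-token cost in USD for a given token count."""
    return tokens / 1_000_000 * price_per_million_usd


baseline_cost = input_cost_usd(DAILY_INPUT_TOKENS, BASELINE_PRICE_PER_MILLION)
cheaper_cost = input_cost_usd(DAILY_INPUT_TOKENS, CHEAPER_PRICE_PER_MILLION)

print(f"baseline: ${baseline_cost:,.2f} per day")
print(f"cheaper:  ${cheaper_cost:,.2f} per day")
```

Whatever the actual prices are, the saving scales linearly with volume, which is why the ratio matters more for logs than for low-volume workloads.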

Benchmark Setup

The original benchmark evaluated GPT-3.5-turbo across ten research questions grouped into four categories:

Log Parsing & Analytics — Can the model parse logs and identify errors, root causes, security events, and anomalies? Can it identify frequently used APIs?
Prediction — Can it predict future log events based on past logs?
Summarization — Can it summarize single and multiple log messages?
General Capabilities — Can it handle bulk log data, and what message lengths can it process?

The experiments used datasets from the Loghub collection, each containing 2,000 labeled log messages from a different system (Windows, Linux, mobile, distributed, and so on).

Our experiment reused the same methodology and the same 19 Loghub datasets, with these differences:

We evaluated AWS Nova Micro rather than GPT-3.5-turbo
We focused on the first three categories (7 questions) — the fourth category covers context window size, which is no longer a meaningful differentiator (GPT-3.5-turbo: 16,385 tokens; Nova Micro: 128,000 tokens)
Where the original benchmark tested multiple input sizes (e.g. 5, 10, 50 log entries), we used only the maximum (50), to give the model the most context
We used the same prompts as the original benchmark, and a human evaluated each result manually

Category	Question	Prompt	Description
Log Parsing	Q1	Extract the log template and variables from this log message.	How does the model perform on log parsing?
Log Analytics	Q2	Summarize the errors and warnings and identify the root cause.	Can it extract errors and root causes from raw logs?
Log Analytics	Q3	Show the APIs called most with count.	Can it perform advanced analytics tasks?
Log Analytics	Q4	Are there any malicious users, URLs, IPs, and connection status?	Can it extract security information?
Log Analytics	Q5	Detect the anomalies from the following log messages.	Can it detect anomalies?
Prediction	Q6	Predict the next 10 log events based on these log messages.	Can it predict future events?
Log Summarization	Q7	Summarize the log message.	Can it summarize a single log message?
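The setup above can be sketched as a small prompt builder. This is an assumption about how such a harness might look, not the code used for the benchmark: `PROMPTS`, `MAX_ENTRIES`, and `build_prompt` are hypothetical names, and the post does not describe how the instruction and the log lines were actually joined.

```python
# Hypothetical prompt builder. The prompt strings come from the table above;
# the layout (instruction, blank line, log lines) is an assumption.
PROMPTS = {
    "Q1": "Extract the log template and variables from this log message.",
    "Q2": "Summarize the errors and warnings and identify the root cause.",
    "Q3": "Show the APIs called most with count.",
    "Q4": "Are there any malicious users, URLs, IPs, and connection status?",
    "Q5": "Detect the anomalies from the following log messages.",
    "Q6": "Predict the next 10 log events based on these log messages.",
    "Q7": "Summarize the log message.",
}

MAX_ENTRIES = 50  # the post uses only the largest input size from the original


def build_prompt(question_id: str, log_lines: list[str],
                 max_entries: int = MAX_ENTRIES) -> str:
    """Combine one benchmark instruction with at most max_entries log lines."""
    instruction = PROMPTS[question_id]
    sample = [line.rstrip("\n") for line in log_lines[:max_entries]]
    return instruction + "\n\n" + "\n".join(sample)


example_prompt = build_prompt("Q5", [f"line {i}" for i in range(80)])
```

The resulting string would then be sent to the model as a single user message through whichever client is in use.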

Results: AWS Nova Micro's Performance

Prompt	Correct Answers	Remarks
Extract log template and variables	17/19 (89%)	Failed on HDFS logs; IDs not always categorized accurately
Summarize errors and identify root cause	10/19 (53%)	Erroneously reports warnings in Hadoop logs; confuses timestamps and error codes in HPC; over-reports issues in HealthApp and Mac logs
Show most-called APIs with count	4/19 (21%)	Counting is very challenging; many datasets lack API-related entries; model over-reports results that don't make sense
Detect malicious users, URLs, IPs	18/19 (95%)	High accuracy, but hard to conclude on the general case as no obvious security issues were present in the sampled logs
Detect anomalies	9/19 (47%)	Reports anomalies based on irrelevant criteria (e.g. entries that "occur towards the end of the sample" or are "repetitive")
Predict next 10 log events	0/19 (0%)	Even for extremely repetitive logs, IDs and timestamps are not predicted correctly
Summarize a single log message	16/19 (84%)	Good results overall; challenging for unfamiliar log formats without named fields
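The percentages in the table follow directly from the counts out of 19 datasets. The sketch below recomputes them; the `Q1` to `Q7` keys assume the rows appear in the same order as the prompt table, and the function name is ours.

```python
# Recompute the accuracy column from the correct-answer counts in the table.
TOTAL_DATASETS = 19

correct_answers = {
    "Q1": 17,  # extract log template and variables
    "Q2": 10,  # summarize errors and identify root cause
    "Q3": 4,   # show most-called APIs with count
    "Q4": 18,  # detect malicious users, URLs, IPs
    "Q5": 9,   # detect anomalies
    "Q6": 0,   # predict next 10 log events
    "Q7": 16,  # summarize a single log message
}


def accuracy_percent(correct: int, total: int = TOTAL_DATASETS) -> int:
    """Return the share of correct answers as a whole-number percentage."""
    return round(100 * correct / total)


percentages = {q: accuracy_percent(n) for q, n in correct_answers.items()}
for question, pct in percentages.items():
    print(f"{question}: {correct_answers[question]}/{TOTAL_DATASETS} ({pct}%)")
```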

In summary, our evaluation confirms the findings of the original benchmark: like GPT-3.5-turbo, Nova Micro performs well at parsing and summarizing log data. Other types of analysis, namely counting, anomaly detection, and prediction, remain challenging for LLMs.
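For readers unfamiliar with log parsing, the Q1 task asks for a template (the constant part of a message) and its variables (the parts that change). A minimal sketch of how a returned template could be checked against a message is shown below. It assumes the `<*>` wildcard convention for templates; the log line, the template, and the function names are invented for illustration and do not come from the benchmark data.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template with <*> wildcards into an anchored regular expression."""
    literal_parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + "(.+?)".join(literal_parts) + "$")


def extract_variables(template: str, message: str):
    """Return the variable values if the template matches, otherwise None."""
    match = template_to_regex(template).match(message)
    return list(match.groups()) if match else None


# Hypothetical log message and template, for illustration only.
message = "Connection from 10.0.0.7 closed after 152 ms"
template = "Connection from <*> closed after <*> ms"

variables = extract_variables(template, message)
print(variables)
```

A check like this only confirms that a template is consistent with a message. Judging whether each variable was categorized sensibly still needs a human reviewer, as in the evaluation above.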

The malicious content detection result (95%) looks strong, but needs a caveat: the sampled datasets didn't contain clearly malicious entries. The model didn't produce false positives here, which is valuable in itself — especially compared to anomaly detection, where false positives were common.

This benchmark shows that log data can now be parsed and summarized at a much lower cost, with results comparable to those of the original study.

Reflection on Datasets

The Loghub collection is invaluable for reproducible benchmarking — without it, meaningful cross-benchmark comparisons would be impossible. That said, the datasets have some limitations worth noting.

At Bronto, we work frequently with log types common in real-world production environments: CDN logs, web access logs, AWS CloudTrail audit logs, application logs. LLMs tend to have a strong understanding of these formats because they're widely documented and structured.

Structured logs change the picture significantly. When we ran Q2 and Q3 prompts against synthetic structured CDN log data (based on real examples), the model performed substantially better:

For Q2 (error identification), the model perfectly identified HTTP errors by associating status codes ≥ 400 with errors — even though the field name never used the word "error". It correctly categorized 400 (Client-Side), 404 (Not Found), 500 (Internal Server Error), and 503 (Service Unavailable).
For Q3 (most-called APIs), the model correctly identified the reqPath field as representing API endpoints and extracted the top results accurately.

Counting remains a consistent weakness across all dataset types. When Q3 requires providing a count of the most common API calls, the model's counts are frequently inaccurate regardless of dataset.
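Because the model's counts are unreliable, a deterministic tally is a useful reference when checking Q3 answers on structured logs. The sketch below assumes JSON-formatted CDN records. The `reqPath` field is the one mentioned above, while `statusCode`, the sample records, and the function names are hypothetical.

```python
import json
from collections import Counter


def top_paths(json_lines: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Count requests per reqPath and return the n most frequent paths."""
    counts = Counter(json.loads(line)["reqPath"] for line in json_lines)
    return counts.most_common(n)


def error_records(json_lines: list[str]) -> list[dict]:
    """Return records whose HTTP status code is 400 or higher."""
    records = (json.loads(line) for line in json_lines)
    return [r for r in records if int(r["statusCode"]) >= 400]


# Synthetic sample records; the statusCode field name is an assumption.
sample_logs = [
    '{"reqPath": "/api/orders", "statusCode": 200}',
    '{"reqPath": "/api/orders", "statusCode": 500}',
    '{"reqPath": "/api/users", "statusCode": 404}',
    '{"reqPath": "/api/orders", "statusCode": 200}',
    '{"reqPath": "/api/users", "statusCode": 200}',
    '{"reqPath": "/api/health", "statusCode": 503}',
]

top = top_paths(sample_logs)
errors = error_records(sample_logs)
print(top)
print(len(errors), "error records")
```

One possible design is to let the model identify which field holds the endpoint and leave the counting itself to code like this.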

One additional observation: several Loghub datasets (HPC, HealthApp, BGL, Proxifier) appear to be uncommon enough that Nova Micro doesn't have a solid prior understanding of them. When asked to generate sample logs for these systems, the output doesn't resemble the actual Loghub data — suggesting the model is less reliable when operating outside familiar territory.

Conclusion

This benchmark reproduces part of the 2023 ChatGPT log analysis study using AWS Nova Micro. The results are strikingly similar, with one major difference: the cost per input token is 14 times lower.

Given that log data is notoriously voluminous, this cost difference matters enormously for any production use of LLMs in log analysis pipelines.

The Loghub datasets are also not fully representative of what most production logging systems generate. Real-world logs — web access, CDN, application, audit — tend to be more structured and more familiar to LLMs, which leads to better performance than the benchmark scores suggest.

We believe LLMs have genuine potential to improve production logging systems, particularly for analyzing the common, structured log formats that make up the majority of real-world observability data.
