DEV Community

Eliana Lam for AWS Community On Air


Utilize another telemetry data for faster improvement with AI agent

Speaker: Yoshi Yamaguchi @ AWS Community Day Hong Kong 2025

Summary by Amazon Nova

https://www.youtube.com/watch?v=4jZ5A5lJHHQ



Introduction to Profiling

  • Definition of Profiler: A profiler is a type of telemetry that provides information about how well a system is performing and how many system resources it is consuming.

  • Purpose of Profiling: Profiling helps investigate resource consumption in a program and provides statistical information on specific resource usage over a period.

  • Difference from Tracing: Unlike tracing, which records resource consumption as a time series of events, profiling offers statistical information aggregated over a specific period.

  • Famous Profiling Tools:

  • Java: Java Flight Recorder (JFR)

  • Python: cProfile (standard library) and line_profiler (third-party tool)

  • Linux: perf command for arbitrary executable files
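Of the tools above, cProfile is the easiest to try because it ships with Python. The sketch below (the `slow_sum` function is a hypothetical workload, not from the talk) shows the kind of statistical, per-function output a profiler produces:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Hypothetical workload: sum of squares via a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# collect a profile of the workload
pr = cProfile.Profile()
pr.enable()
slow_sum(100_000)
pr.disable()

# render the statistics: call counts and time per function, not a timeline
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```

Note that the output is aggregated per function over the whole run — exactly the statistical view that distinguishes profiling from a trace's time series.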

Importance of Profiling in Performance Investigation

  • Exercise Scenario: Imagine a latency issue in a web application where client requests to the web server experience slow response times.

  • Goal: Identify the cause of the performance issue and fix it.

  • Observability: While observability is often associated with logs, metrics, and traces, profiling provides additional insights that may be necessary to resolve performance problems.



Limitations of Logs, Metrics, and Distributed Traces

  • [ 1 ] Metrics:

  • Definition: Metrics are the most common type of telemetry, providing information like CPU usage and memory usage.

  • Limitation: Metrics track the entire resource consumption of an instance or service (e.g., a container running a web server). They do not provide information about which specific function or implementation within the application is consuming the resources. For example, metrics can show that a web server is consuming a lot of CPU, but they cannot identify which function is causing the high CPU usage.

  • [ 2 ] Distributed Traces:

  • Definition: Distributed traces provide important information about latency issues within a system.

  • Limitation: Common solutions for distributed tracing require instrumenting the system or program. If a function is not instrumented, no latency information is available for that function. Instrumenting every function is time-consuming and can introduce performance issues.

  • [ 3 ] Logs:

  • Definition: Logs provide information about specific functions performing specific actions at specific times.

  • Limitation: Logs only show successful process completion or error messages. If a function handles a request properly, logs will show successful messages without indicating any latency issues. Logs do not provide information about the cause of latency problems unless an error occurs.

Conclusion

  • Problem: Logs, metrics, and distributed traces, while useful, do not provide complete information about the cause of latency issues. Profiling is necessary to gain deeper insights into resource consumption and identify performance bottlenecks.


Summary of the Current Status and Ideal Status

  • [ 1 ] Objective: 

  • The goal of using observability is to identify the cause of performance issues and fix them.

  • [ 2 ] Current Status:

  • Gap in Telemetry: Despite seeing latency problems, surges in CPU or memory usage, latency changes visible in distributed traces, and logs showing functions handling requests normally, the exact cause of the latency problem remains unidentified.

  • [ 3 ] Ideal Status:

  • Exact Identification: The ideal status is to be able to identify the exact line of code causing the performance issue, allowing for precise fixes.

Definition of Observability

  • Observability in Control Theory: Observability is defined as a measure of how well the internal state of a system can be inferred from its external outputs (telemetry).

  • Ideal Observability: The ideal state of observability is to identify the exact cause of issues solely through telemetry.

Role of Profiling

  • Filling the Gap: Profiling is necessary to bridge the gap that logs, metrics, and traces cannot fill.

  • Last Mile Information: Profiling provides the final piece of information needed to go from the data provided by logs, metrics, and traces to the exact cause of the performance problem.

Traditional Use of Profiling

  • Instrumentation: To use profiling, you need to instrument your program to collect telemetry data.

  • Analysis: Analyze the collected telemetry data to identify and fix performance issues.



Instrumenting Your Program for Profiling

Go:

  • Standard Package: Use the runtime/pprof package, which is included with Go.

  • Steps:

  • Import the runtime/pprof package.

  • Create a file (or other writer) to receive the profile output.

  • Start the measurement with pprof.StartCPUProfile.

  • Stop the measurement with pprof.StopCPUProfile (commonly deferred).

Python:

  • Line Profiler: Use the line_profiler package.

  • Steps:

  • Import the line_profiler package.

  • Enable the profiler.

  • Write the specific part of the process to be profiled.

  • Disable the profiler.
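Both recipes follow the same shape: enable the profiler, run only the suspect section, then disable it. A minimal sketch of that scoped pattern using stdlib cProfile (the `parse_rows` function is a hypothetical hot spot, assuming Python 3.8+ for the context-manager form):

```python
import cProfile
import io
import pstats

def parse_rows(lines):
    """Hypothetical hot spot: split CSV-ish lines into fields."""
    return [line.split(",") for line in lines]

lines = ["a,b,c"] * 10_000

# enable on entry, disable on exit — only this section is measured
with cProfile.Profile() as pr:
    rows = parse_rows(lines)

# restrict the report to the function we care about
s = io.StringIO()
pstats.Stats(pr, stream=s).print_stats("parse_rows")
print(s.getvalue())
```

Scoping the measurement this way keeps profiler overhead out of the parts of the program you are not investigating.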

Continuous Profiling for Always-On Systems

  • [ 1 ] Difference:

  • One-Off Profiler: Measures the time of a single run (start and stop the timer).

  • Continuous Profiler: Takes periodical profile data at set intervals (e.g., every 10 seconds for 5 minutes) and stores it externally (e.g., Amazon S3).

  • [ 2 ] Use Case: 

  • Ideal for always-on systems like daemons or web servers, especially during spikes in traffic.

  • [ 3 ] Solution:

  • AWS: Use Amazon CodeGuru Profiler, or Amazon Managed Grafana with Pyroscope.

  • Open Source: Grafana Pyroscope, now part of the Grafana ecosystem.
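The continuous approach above can be sketched as a loop that repeatedly captures short profiles. This toy version (the `busy` workload and snapshot interval are illustrative, not from the talk) keeps snapshots in memory, whereas a real continuous profiler would ship each one to external storage such as Amazon S3:

```python
import cProfile
import io
import pstats
import time

def take_snapshot(workload, duration):
    """Profile repeated calls to `workload` for roughly `duration` seconds
    and return the rendered statistics as text."""
    pr = cProfile.Profile()
    end = time.monotonic() + duration
    pr.enable()
    while time.monotonic() < end:
        workload()
    pr.disable()
    s = io.StringIO()
    pstats.Stats(pr, stream=s).sort_stats("tottime").print_stats()
    return s.getvalue()

def busy():
    """Hypothetical always-on work: a small CPU-bound computation."""
    sum(i * i for i in range(1000))

# take a few back-to-back snapshots; a production setup would run this
# on an interval and upload each snapshot instead of accumulating it
snapshots = [take_snapshot(busy, 0.05) for _ in range(3)]
print(len(snapshots))
```

Each snapshot is an independent statistical window, which is what lets you compare resource usage before and during a traffic spike.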

Analysis of Profile Data

  • Screenshot Example: Amazon Managed Grafana with Pyroscope plugin showing statistical resource usage per stack.

  • Next Step: After collecting data, proceed to the analysis phase to identify and fix performance issues.



Visualizing Profile Data with Flame Graphs

  • Flame Graph: A famous way to visualize profile data.

  • Challenge: Understanding how to read a flame graph requires knowledge of the source code, visualization meaning, and fundamental computer science concepts.

Reading a Flame Graph

  • [ 1 ] Example Graph: 

  • Visualizes CPU usage of the entire program stack.

  • [ 2 ] Interpretation:

  • Long Bars: Indicate high resource consumption.

  • syscall.read: A system call for reading, consuming a lot of CPU.

  • os.(*File).Read: The standard Go library function for reading physical files, likely well implemented and performant.

  • [ 3 ] Conclusion: The issue may lie in how the custom cut function (imitating the GNU cut command) is calling os.(*File).Read inefficiently.

Example Function Implementation

  • cut Function: A function that imitates the behavior of the GNU cut command.

  • Next Step: Review the implementation of the cut function to identify potential inefficiencies in how it calls os.(*File).Read.



Demo:

Inefficiency in File Reading

  • Issue Identified: The program was reading the file one byte at a time, which is inefficient because each byte read triggers a system call.

  • Knowledge Requirement: Understanding why this is inefficient requires knowledge of user space and kernel space, as well as fundamental computer science principles.
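The talk's demo is in Go, but the same anti-pattern is easy to reproduce in any language. The sketch below (file contents and chunk size are illustrative) contrasts an unbuffered one-byte-at-a-time read, where every `read(1)` crosses into the kernel as a separate system call, with an ordinary buffered read:

```python
import os
import tempfile

def read_one_byte_at_a_time(path):
    """Inefficient: buffering=0 means every read(1) is its own system call."""
    data = bytearray()
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(1):
            data += chunk
    return bytes(data)

def read_buffered(path, chunk_size=64 * 1024):
    """Efficient: large reads amortize the system-call cost over many bytes."""
    data = bytearray()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            data += chunk
    return bytes(data)

# create a small demo file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a,b,c\n" * 10_000)
    path = tmp.name

slow = read_one_byte_at_a_time(path)
fast = read_buffered(path)
os.unlink(path)
```

Both functions return identical bytes; a CPU profile of the first would show the read system call dominating, which is the signature the flame graph exposed in the demo.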

Role of AI Assistants in Profiling

  • AI Assistance: AI assistants, particularly Large Language Models (LLMs), can help interpret profile data.

  • Fit for Operations: LLMs are more suited for operations than development because operations involve smaller, more deterministic outputs (like issue identification) and have rich context (system specifications, telemetry data, logs).

  • Development vs. Operations: Development requires generating large amounts of source code with indeterminate outputs, making it less suited for LLMs.

Utilizing AI Assistants with Profile Data

  • Demo: An AWS employee uses an AI assistant to analyze profile data and source code to identify performance issues.

  • Process:

  • [ 1 ] Ask the AI assistant to analyze the profile data and source code.

  • [ 2 ] The assistant identifies the need for tools to visualize the profile data.

  • [ 3 ] Upon approval, the assistant visualizes CPU usage by function and identifies the exact function causing the issue.

  • [ 4 ] The assistant provides line-by-line CPU usage information and suggests improvements.

Summary

  • Profile Data: Key for linking performance issues to actual code improvements.

  • Traditional Analysis: Requires deep system and computer science knowledge.

  • AI Assistance: Interprets profile data, educates users on how to use it, and accelerates improvement.



Team:

AWS FSI Customer Acceleration Hong Kong

AWS Amarathon Fan Club

AWS Community Builder Hong Kong
