Beyond the Program: The Three Layers of Computer Abstraction and the Underlying Logic of Performance Optimization

Have you ever wondered how a computer understands the code you write? How does it transform high-level languages into instructions that the hardware can execute? Most of us only call functions without truly understanding the underlying principles. This article reveals what lies "beyond the program" and how we define computer performance.

Beyond the Program

The Three Layers of Abstraction

Have you ever considered what your computer actually is? Essentially, it's a collection of hardware components, a system that processes 0s and 1s. We can abstract your laptop into three layers: application software, system software, and hardware.

[Figure: the three layers of abstraction: application software, system software, and hardware]

Application software is like a complex instruction manual written by programmers, but it's in a language that the computer can't directly understand.
System software acts as a bridge between the application software and the hardware, translating the complex manual into simple, easy-to-understand strings of 0s and 1s.
Hardware is responsible for executing tasks such as displaying content on the screen, playing music, and performing calculations. It's like a strong worker who doesn't make mistakes and is incredibly efficient, but you need to communicate in its language to tell it what to do and how to use its resources.

Which of these three layers is the most important? The answer varies depending on your perspective, but from a technological development standpoint, the outermost layer is the most likely to be impacted by AI. Interestingly, one of the reasons is that it's the easiest for us to get started with.

Application software is the most closely related to our daily lives: games, social media apps, and search engines. System software is less familiar, and many people find it confusing at first. For example, I still remember my first time using Git to upload files to GitHub; the operating-system-level command line was hard to get used to.

Hardware, on the other hand, is like a "familiar stranger." We all use it (mice, keyboards, screens) and have heard of components like CPUs and GPUs, but do we understand how they work? The answer is likely no. This is why people working in hardware are in high demand: most of us either dismiss hardware as trivial (mice and keyboards) or treat it as an impenetrable black box (CPUs and GPUs), and end up focusing solely on software development.

What about system software? Unlike application software, where established frameworks have pushed the field toward saturation, it still rewards a closer look. Let's delve deeper.

System software encompasses many components, the two most critical being the operating system and the compiler. Their primary tasks include handling basic input and output, allocating memory and storage, and sharing computing resources among multiple applications. In other words, system software is the interface between pure software and pure hardware. The operating systems we use today, such as Linux, iOS, and Windows, are unlikely to be replaced, given the immense engineering effort and difficulty involved; apart from Huawei, hardly any company dares, wants, or has the capability to attempt such an endeavor.

[Figure: system software (operating system and compiler) sits between applications and hardware]

The other key component is the compiler, which translates programs written in high-level languages (such as C, C++, Java, and Python) into instructions that the hardware can execute. This process is quite complex, and we'll provide a brief overview here, with a more detailed explanation in a future article.

From High-Level Language to Hardware Language

Computers can only understand binary data: based on the bit fields of an instruction, they extract the corresponding operands and perform the corresponding operation (calculation, shift, store, or load). For example, 1001010100101110 might instruct the computer to add two numbers. Using numbers to represent both instructions and data is the foundation of computing. I'll be writing a detailed article about instructions later on, so if you're interested, please like and follow me to stay updated.

Obviously, a string of digits is tedious and hard to read, which led to the development of assemblers: programs that translate mnemonic instructions into the corresponding binary code. For example, to instruct the computer to add two numbers A and B:

add A, B --> 1001010100101110
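
To make the assembler's job concrete, here is a minimal sketch in Python. The opcode value, register numbers, and 16-bit layout are all invented for this illustration (chosen so the output matches the example above); real encodings are defined by the instruction set architecture.

```python
# A toy assembler: turn a mnemonic plus two register names into a 16-bit word.
# The opcode and register numbers below are made up for illustration;
# real encodings are defined by the ISA (e.g., RISC-V, x86).
OPCODES = {"add": 0b100101}                # hypothetical 6-bit opcode
REGISTERS = {"A": 0b01001, "B": 0b01110}   # hypothetical 5-bit register numbers

def assemble(line: str) -> str:
    mnemonic, operands = line.split(maxsplit=1)
    dst, src = (r.strip() for r in operands.split(","))
    word = (OPCODES[mnemonic] << 10) | (REGISTERS[dst] << 5) | REGISTERS[src]
    return format(word, "016b")

print(assemble("add A, B"))  # -> 1001010100101110
```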

However, this still wasn't intuitive enough, so people created high-level programming languages like C, C++, and Java: portable notations composed of words and algebraic symbols. Compilers translate these into assembly language.

A + B --> add A, B

At this point, you might have a general idea of what happens. When we run code, our compiler checks the syntax and then translates the high-level language we've written into assembly language. With the help of the assembler, the assembly language is then translated into instructions that the computer can understand. These instructions are stored in memory, and during each clock cycle, the CPU fetches these instructions and performs the corresponding operations. The process is illustrated below:

[Figure: the path from high-level language through compiler and assembler to machine instructions]
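
As a rough mental model of that last step, here is a toy fetch-decode-execute loop in Python. The three-instruction "machine" and its memory layout are invented purely for illustration; a real CPU decodes binary words, not Python tuples.

```python
# A toy fetch-decode-execute loop. Instructions and data share one memory,
# exactly as the article describes; the instruction format is made up.
memory = {
    0: ("LOAD", "r0", 100),    # r0 <- memory[100]
    1: ("LOAD", "r1", 101),    # r1 <- memory[101]
    2: ("ADD",  "r0", "r1"),   # r0 <- r0 + r1
    3: ("STORE", "r0", 102),   # memory[102] <- r0
    4: ("HALT",),
    100: 7, 101: 35,           # data
}
registers, pc = {"r0": 0, "r1": 0}, 0

while True:                    # one iteration stands in for one clock cycle
    op, *args = memory[pc]     # fetch and decode
    pc += 1
    if op == "LOAD":
        registers[args[0]] = memory[args[1]]
    elif op == "ADD":
        registers[args[0]] += registers[args[1]]
    elif op == "STORE":
        memory[args[1]] = registers[args[0]]
    elif op == "HALT":
        break

print(memory[102])  # 42
```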

Performance

When it comes to computers, everyone loves to talk about performance. So, how do we define performance? One way is to compare the execution time of the same program, similar to a race. This is the speed perspective, which is commonly used for CPUs. Another perspective is throughput, such as how many instructions can be executed in a clock cycle, which is the core idea behind GPUs. Still having trouble understanding? Imagine transporting 450 tourists from Shanghai to Beijing. There are several ways to do it. Using a plane or high-speed train might be sufficient for a single trip, while using small cars would require multiple trips back and forth.

Let's consider performance from the speed perspective:

Performance = 1 / Execution Time

Let's look at an example:

If computer A takes 10 seconds to run a program and computer B takes 15 seconds, how much faster is computer A than computer B?

Performance_A / Performance_B = Execution_Time_B / Execution_Time_A = 15 / 10 = 1.5
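
The same comparison, written as a tiny Python helper (the 10 s and 15 s are the numbers from the example above):

```python
def speedup(time_a: float, time_b: float) -> float:
    """Performance_A / Performance_B = Execution_Time_B / Execution_Time_A."""
    return time_b / time_a

print(speedup(10, 15))  # 1.5 -> computer A is 1.5x faster than computer B
```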

To be more precise, the execution time we consider is the CPU execution time, and we use clock cycles as the unit of measurement. So we have:

CPU Execution Time of a Program = CPU Clock Cycles of a Program * Clock Cycle Length

or CPU Execution Time of a Program = CPU Clock Cycles of a Program / Clock Frequency

Now, let's consider instructions:

Number of CPU Clock Cycles = Number of Program Instructions * Average Clock Cycles per Instruction (CPI)

CPI stands for "cycles per instruction," the average number of clock cycles each instruction takes to execute.

Let's look at another example:

Suppose we have two different implementations of the same instruction set. Computer A has a clock cycle length of 250 ps and a CPI of 2.0 for a certain program. Computer B has a clock cycle length of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program?

Number of clock cycles:

CPU Clock Cycles_A = I * 2.0

CPU Clock Cycles_B = I * 1.2

CPU time:

CPU Time_A = CPU Clock Cycles_A * Clock Cycle Length = I * 2.0 * 250 ps = 500 * I ps

CPU Time_B = CPU Clock Cycles_B * Clock Cycle Length = I * 1.2 * 500 ps = 600 * I ps

Performance:

CPU Performance_A / CPU Performance_B = CPU Time_B / CPU Time_A = 600 / 500 = 1.2, so computer A is 1.2 times faster than computer B for this program.
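
Putting the formulas together as a small Python sketch (values from the example above; the instruction count I cancels out of the ratio, so any positive value works):

```python
def cpu_time_ps(instructions: float, cpi: float, cycle_ps: float) -> float:
    """CPU time = instruction count * CPI * clock cycle length."""
    return instructions * cpi * cycle_ps

I = 1.0  # instruction count; it cancels in the ratio, so any value works
time_a = cpu_time_ps(I, cpi=2.0, cycle_ps=250)  # 500 * I ps
time_b = cpu_time_ps(I, cpi=1.2, cycle_ps=500)  # 600 * I ps
print(time_b / time_a)  # 1.2 -> computer A is 1.2x faster
```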

There's another way to express CPU time:

CPU Time = (Number of Instructions * CPI) / Clock Frequency

This method allows us to analyze the performance impact of each type of instruction. However, we won't delve into that here.

So, how do we improve computer performance? From the formula, we can see that the shorter the CPU time, the better the performance. Therefore, we want a faster clock frequency and fewer clock cycles for program execution. Clock frequency is a property of the hardware, while the number of clock cycles required depends on the algorithm. When we can't change the hardware, optimizing code around the computer architecture is what sets computer experts apart from ordinary programmers. Here's an intuitive principle: make the common case fast, that is, accelerate frequently occurring events. A real-world example is express elevators in office buildings, dedicated to the floors people travel to most.

Finally, I want to share an important law: Amdahl's Law. It's quite simple: you can't expect a proportional improvement in overall performance from a local improvement in one aspect of the computer. Let's take a simple example:

Suppose a program takes 100 seconds to run on a computer, with 80 seconds spent on multiplication operations. If we want the program to run 5 times faster, by what factor must the multiplication operations be sped up?

Time after improvement = (Execution time affected by the improvement) / (Improvement factor) + (Execution time not affected)

Time after improvement = 80 / n + (100 - 80)

Since we want to improve the execution time by 5 times, the new execution time should be 20 seconds. Therefore:

20 = 80 / n + 20

0 = 80 / n

This means that the multiplication operations need to be infinitely faster than before, which is obviously impossible.
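Here is the same computation as a small Python sketch of Amdahl's Law; the function is a direct translation of the formula above.

```python
def time_after_improvement(total_s: float, affected_s: float, factor: float) -> float:
    """Amdahl's Law: improved time = affected / factor + (total - affected)."""
    return affected_s / factor + (total_s - affected_s)

# 100 s program, 80 s of it in multiplication. Try ever-larger speedup factors:
for n in (2, 5, 10, 100, 1_000_000):
    print(n, time_after_improvement(100, 80, n))
# The result approaches but never reaches 20 s, so a 5x overall speedup
# (a 20 s target) is unattainable no matter how large n gets.
```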

You must have heard of multi-core processors. Similarly, simply stacking more cores won't yield a proportional performance improvement, because the serial portion of the work and the communication between cores limit the gains. In other words, using multiple cores to improve performance is a game of diminishing returns: each additional core buys a smaller speedup than the last, as the sketch below shows.
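
To see the diminishing-returns pattern numerically, here is a small sketch assuming, purely for illustration, that 95% of a program parallelizes perfectly and the rest stays serial:

```python
def multicore_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law for n cores: only the parallel fraction of the work scales."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / cores)

for cores in (1, 2, 4, 8, 16, 64, 1024):
    print(cores, round(multicore_speedup(0.95, cores), 2))
# 2 cores ~1.9x, 8 cores ~5.93x, 1024 cores only ~19.64x:
# each doubling of cores buys a smaller and smaller gain.
```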

[Figure: speedup versus number of processors, showing diminishing returns]

I believe you're also familiar with the famous scaling law. In simple terms, it says that as long as you keep adding more hardware, performance will keep climbing. According to Amdahl's Law, however, the end of the scaling law is a "squeezing-the-toothpaste" kind of improvement: each gain becomes smaller and harder to extract.

The recent DeepSeek models are perhaps the most vivid example. They didn't need a huge fleet of Nvidia H100 GPUs to achieve performance comparable in some respects to OpenAI's o1 model, because instead of blindly stacking hardware, they optimized their algorithms around the underlying architecture.
