Siddharth Bakshi

Embedded Systems Optimization Techniques

Introduction

“Things should be made as simple as possible, but not any simpler.” - Albert Einstein. Most embedded systems courses in universities start by teaching students the basics of hardware (resistors, diodes, bits and bytes, etc.), how to represent and interact with those components in C/C++, and how to develop embedded applications in labs, which can often be a “hit or miss” learning experience depending on your professor; an impression based on feedback from my peers studying in the UWaterloo, Western, and Queen's engineering programmes. I myself am an engineering student at the Lassonde School of Engineering at York University in Toronto, where I was taught embedded systems by Professor James Smith. To ensure all readers are on equal footing: the embedded software development process, specifically the application build process, comprises three sub-processes: compiling, linking, and locating (as seen in the figure).

Figure 1 - Image from (Barr & Massa, 2009). These processes run on a host computer, for example inside an IDE with built-in tools such as a compiler, linker, assembler, and debugger. The purpose of using these tools is to construct an executable binary image that will execute only on the target embedded system.

A whole article could be written about each of these sub-processes; one just needs enough reference material and, most importantly in my opinion, interest. Most embedded systems courses cover memory management, hardware and software debugging techniques, peripheral interfacing methodologies, and interrupt handling. What most embedded systems courses in universities across Ontario don't cover, and what this article is about, are strategies for increasing code efficiency, reducing memory usage, and lowering power requirements.

A note regarding C++:

Expensive features such as templates, exceptions, and runtime type identification can negatively impact execution time and code size. Thus it is advised to experiment with them before deploying the product/application to production.

Motivation

The need for low-cost versions of a product drives hardware designers to provide just barely enough memory and power to get the job done (Barr & Massa, 2009). Thus, implementing any one of these optimization strategies during the software development phase can be a valuable skill for any engineer or programmer. Furthermore, it can decrease the operational expenses of your product, and in turn your organization's, by tens or even thousands of dollars. Keep in mind, however, that most of these code optimization techniques involve a trade-off between code size and execution speed. An engineer/programmer may make their application faster or smaller, but not both (Barr & Massa, 2009); in fact, an improvement in one area can be detrimental to the other. Therefore, allocating adequate time for requirement analysis and system specification is highly encouraged. It is up to the engineer/programmer to decide which of these improvements is most important to them.

The First Step...

The first step in solving an optimization problem for an embedded application is to determine which problem one wants to solve: a size problem, a speed problem, or both. For a single type of problem, the engineer/programmer may let the compiler focus on that particular area; the compiler is in a much better position than the engineer/programmer to make changes across all software modules (Barr & Massa, 2009). However, if the issue is both the speed and the size of the application, then it is strongly recommended that the engineer/programmer first make code size optimizations with the compiler's help, and then manually identify bottlenecks and time-critical sections, making the parts of the code with short deadlines and the most frequent execution more efficient. (In battery-powered devices, every unnecessary processor cycle results in reduced runtime, so the engineer/programmer should optimize for speed across the entire application.)

Increasing Code Efficiency

When execution time (i.e., speed) is identified as the single primary issue over code size, the engineer/programmer can apply the following techniques:

  1. Inline functions,
  2. Table look-ups,
  3. Hand-coded Assembly,
  4. Register variables,
  5. Global variables,
  6. Polling,
  7. Fixed-point arithmetic, and
  8. Loop unrolling.

Decreasing Code Size

As stated before, for code size optimizations it is highly recommended that the engineer/programmer make use of the compiler. After these automatic optimizations, the engineer/programmer could:

  1. Avoid standard library routines,
  2. Use goto statements, and
  3. Apply some of the techniques mentioned above: table look-ups, register variables, global variables, and hand-coded Assembly (the latter yields the largest decrease in code size) (Barr & Massa, 2009).

Reducing Memory Usage

In those cases where RAM rather than ROM is the limiting factor, the engineer/programmer is strongly encouraged to reduce dependence on global data, the stack, and the heap. These are manual optimizations better made by the engineer/programmer than by the compiler.

Power-Saving Techniques

To conserve power in an embedded system, the engineer/programmer can control the clock frequency and use power-sensitive processors; these are the power-saving strategies under software control, in contrast to techniques such as low-voltage ICs and circuit shutdown used by hardware designers. Also, instead of focusing solely on speed or code size, the engineer/programmer is highly encouraged to analyze the code to determine how to reduce external bus transactions.
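
As a minimal sketch of the clock-frequency idea (the register address and divider value below are hypothetical, purely for illustration), the core clock can be lowered under software control while the system has little work to do:

#define CLOCK_DIVIDER_REG  (*(volatile unsigned int *)0x40003000u)

void enterLowSpeedMode(void)
{
    CLOCK_DIVIDER_REG = 8u;   // run at one-eighth of the full clock rate while idle
}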

Specifically...

Inline Functions

The keyword inline can be added to any function declaration (starting with C99). It asks the compiler to replace all calls to the indicated function with copies of the code that is inside it (Barr & Massa, 2009). Thus, the runtime overhead associated with the function call is eliminated. Furthermore, it is most effective when the function is used frequently but contains only a few lines of code.
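
As a rough illustration, here is a minimal sketch (the function and its contents are my own example, not from the referenced text):

static inline int scaleSample(int sample)
{
    // Small, frequently called helper: a good candidate for inlining,
    // since the call overhead would otherwise dominate the work done.
    return (sample * 3) / 2;
}

int processBuffer(const int *buffer, int length)
{
    int total = 0;
    for (int i = 0; i < length; i++)
    {
        // With inlining, each call below is replaced by the body of
        // scaleSample(), removing call/return overhead on every iteration.
        total += scaleSample(buffer[i]);
    }
    return total;
}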

Table lookups

A switch statement is typically compiled into a sequence of tests and jumps, which consumes valuable processor time. Its execution time can be decreased by putting the individual cases in order of their relative frequency of occurrence. This reduces the average execution time, though it will not improve at all upon the worst-case time (Barr & Massa, 2009). Thus, if there is a lot of work to be done within each case, it might be more efficient to replace the entire switch statement with a table of pointers to functions:

enum NodeType {NODE_A, NODE_B, NODE_C};

switch (getNodeType())
{
  case NODE_A:
    // ... handle node A ...
    break;
  case NODE_B:
    // ... handle node B ...
    break;
  case NODE_C:
    // ... handle node C ...
    break;
}

Instead, replace this switch statement with:

int processNodeA(void);
int processNodeB(void);
int processNodeC(void);

// The first part of this is the setup: the creation of an array of function pointers.
int (*nodeFunctions[])(void) = {processNodeA, processNodeB, processNodeC};

// The second part is a one-line replacement for the switch statement that executes more efficiently.
status = nodeFunctions[getNodeType()]();

Hand-coded assembly

Most C compilers produce better machine code than the average engineer/programmer (Barr & Massa, 2009). However, a skilled and experienced assembly engineer/programmer may do better work than the compiler for a given function.
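
As a hedged sketch only (assuming a GCC-style toolchain and an ARM Cortex-M target; the routine and instruction choice are my own illustration), a small function can be hand-coded with inline assembly when the compiler's output for it is not good enough:

#include <stdint.h>

// Reverse the byte order of a 32-bit word using the ARM REV instruction.
static inline uint32_t swapBytes(uint32_t value)
{
    uint32_t result;
    __asm__ volatile ("rev %0, %1" : "=r" (result) : "r" (value));
    return result;
}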

Register variables

The register keyword can be used when declaring local variables. Doing so asks the compiler to place the variable in a general-purpose register rather than on the stack. Used correctly, this technique provides hints to the compiler about the most frequently accessed variables and enhances the performance of the function. The more frequently the function is called, the more likely it is that such a change will improve the code's performance.
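
A minimal sketch (the routine is hypothetical) of hinting that the loop counter and accumulator are the most frequently accessed variables; modern compilers usually make this decision on their own, so register is only a hint:

int sumSamples(const int *samples, int count)
{
    register int i;
    register int total = 0;

    for (i = 0; i < count; i++)
    {
        total += samples[i];
    }
    return total;
}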

Global variables

Efficiency can be obtained by using global variables instead of passing parameters to functions. Doing so eliminates the need to push the parameter onto the stack before the function call and pop it back off once the function completes. In fact, the most efficient implementation of any subroutine would have no parameters at all (Barr & Massa, 2009).
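
A minimal sketch (the names are hypothetical): the parameterized version must pass newState on every call, while the global version reads the shared variable directly with no parameter-passing overhead:

int g_motorState;                       // shared global

static int appliedState;                // where the "work" lands, for illustration

void setMotorStateParam(int newState)   // parameter must be passed on each call
{
    appliedState = newState;
}

void setMotorStateGlobal(void)          // no parameter passing at all
{
    appliedState = g_motorState;
}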

Polling

Interrupt Service Routines (ISRs) improve the responsiveness of a program; however, in rare cases the overhead associated with interrupts actually causes inefficiency. These are cases in which the average time between interrupts is of the same order of magnitude as the interrupt latency, and in such situations it may be better to poll the device instead.
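
A minimal sketch of polling (the register addresses and bit mask are hypothetical): busy-waiting on a memory-mapped status register instead of using an ISR, worthwhile only when data arrives so often that interrupt overhead dominates:

#define UART_STATUS  (*(volatile unsigned int *)0x40001000u)
#define UART_DATA    (*(volatile unsigned int *)0x40001004u)
#define RX_READY     (1u << 0)

unsigned char readBytePolled(void)
{
    while ((UART_STATUS & RX_READY) == 0)
    {
        ;   // poll until a byte has been received
    }
    return (unsigned char)UART_DATA;
}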

Fixed-point arithmetic

If the application contains only a few floating-point operations, they can be implemented using fixed-point arithmetic instead. For example, with two fractional bits, a value of 0.00, 0.25, 0.50, or 0.75 is easily stored in any integer by multiplying the real value by 4 (i.e., shifting left by 2). Doing so avoids the very large penalty a processor pays when manipulating the float data type compared to its integer counterparts.
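
A minimal sketch of the two-fractional-bit format described above (the type and macro names are my own): the real value is scaled by 4 and stored in an ordinary integer, and multiplication drops one scale factor afterwards:

#include <stdint.h>

typedef int32_t fixed2;                       // integer carrying 2 fractional bits

#define TO_FIXED2(x)   ((fixed2)((x) * 4))    // e.g. 0.75 becomes 3
#define TO_DOUBLE(f)   ((f) / 4.0)

static fixed2 fixedAdd(fixed2 a, fixed2 b)
{
    return a + b;                             // addition needs no rescaling
}

static fixed2 fixedMul(fixed2 a, fixed2 b)
{
    return (a * b) >> 2;                      // drop one scale factor after multiplying
}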

Variable size

Using the processor's native register width for variable sizes (8, 16, or 32 bits) allows the compiler to produce code that takes full advantage of the fast registers built into the processor (Barr & Massa, 2009). Tailoring variable sizes to the processor's registers also limits the number of external memory accesses.
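
A minimal sketch: the "fast" types from <stdint.h> (available since C99) let the compiler choose whatever width maps most naturally onto the target's registers, while the exact-width types force a specific size even if that costs extra masking:

#include <stdint.h>

uint_fast8_t  loopCounter;    // at least 8 bits, widened to the native size if faster
uint_fast16_t sampleValue;    // at least 16 bits, likewise
uint8_t       packedFlags;    // exactly 8 bits, regardless of the processor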

Loop unrolling

Repetitive loop code can be optimized by performing loop unrolling. Doing so eliminates the loop overhead at the start and end of the loop. For example, consider the following for-loop:

for (i = 0; i < 5; i++)
{
  value[i] = incomingData[i];
}

This is what it would look like when unrolled:

value[0] = incomingData[0];
value[1] = incomingData[1];
value[2] = incomingData[2];
value[3] = incomingData[3];
value[4] = incomingData[4];

An engineer/programmer may find it helpful to check the assembly output from the compiler to see whether efficiency has actually been improved. In this case, execution time is decreased by eliminating the overhead at the beginning and end of the for-loop. However, notice the trade-off here between code size and speed.

Avoid standard library routines

By avoiding the use of large standard library routines, the cost in terms of code size may be greatly reduced. The largest routines try to handle all possible cases, which inflates the size of the application. For example, the strupr function might be small, but a call to it might drag other functions such as strlower, strcmp, strcpy, and others into your program whether they are used or not. A recommended approach is to implement just the subset of the functionality the application actually needs yourself (for example in your own source or header file), with significantly less code.
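
A minimal sketch of the idea (the routine name is my own): instead of pulling in the library's general-purpose uppercase conversion, implement only the ASCII-subset behavior the application actually needs:

static void asciiToUpper(char *s)
{
    while (*s != '\0')
    {
        if (*s >= 'a' && *s <= 'z')
        {
            *s = (char)(*s - ('a' - 'A'));
        }
        s++;
    }
}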

Using goto statements

To remove complicated control structures or to share a block of often-repeated code, an engineer/programmer can use goto statements. For example, a goto statement can be used to jump out of a routine when an error is detected, so that the cleanup code appears only once instead of being duplicated at every exit point.

int functionWork(void)
{
    // Do some work here.
    ...

    // If there was an error doing the work, exit.
    if (error)
        goto CLEANUP;

    // Do some more work here.
    ...

    // If there was an error doing the work, exit.
    if (error)
        goto CLEANUP;
    ...

    // Otherwise, everything succeeded.
    return SUCCESS;

CLEANUP:
    // Clean up code here.
    return FAILURE;
}

Concluding Remarks

"Dead code elimination” is a notorious feature of using the automatic optimization capabilities of the compiler. The compiler eliminates code that it believes to be redundant or irrelevant. Dictated by Murphy's law, using this feature on a well-working program can result in unexpected failures. Often this failure occurs when the compiler removes those instructions that the compiler does know about but are implemented by the engineer/programmer for a particular purpose.

For example, given the following block of code:

// Most optimizing compilers would remove the first statement below,
// because the value written to *pCtrl is never read before it is overwritten.
*pCtrl = DISABLE;
*pData = 'x';
*pCtrl = ENABLE;
// But what if pCtrl and pData are actually pointers to memory-mapped device registers?

In that case, the peripheral device would not receive the DISABLE command before the byte of data was written. Thus, all future interactions between the processor and this peripheral could potentially fail (Barr & Massa, 2009).
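
One common remedy (a sketch, assuming these really are device registers; the addresses below are hypothetical) is to qualify such pointers as volatile, which tells the compiler that every read and write has a visible side effect and must not be optimized away:

#include <stdint.h>

// Hypothetical memory-mapped register addresses, shown only for illustration.
volatile uint8_t *pCtrl = (volatile uint8_t *)0x40002000u;
volatile uint8_t *pData = (volatile uint8_t *)0x40002004u;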

A Reminder...

Never make the mistake of assuming that the optimized program will behave the same way as the un-optimized one (Barr & Massa, 2009). It is advised that the engineer/programmer perform regression testing at each new optimization level to be sure the system/program's behavior hasn't changed.

In essence...

This article listed and described strategies for increasing code efficiency, reducing memory usage, and lowering power requirements.

References

1) Barr, Michael, and Massa, Anthony. (2009, February). Programming Embedded Systems, Second Edition: with C and GNU Development Tools. PDF format. https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=2ahUKEwjG7N3j8OfoAhUJmeAKHaBJBlQQFjACegQIAhAB&url=http%3A%2F%2Fstepsmail.com%2Fdownload%2FCareer-In-Embedded-System.PDF&usg=AOvVaw1fzI61R_P5ssSEjVQW63II
2) Smith, James. (2020, January). PDF. Class Slides. https://moodle.yorku.ca

Top comments (3)

Ravi Mali

I have read in many places about not using goto statements (to improve code readability).

Ravi Mali

Can you give me some good references with regards to this?

Siddharth Bakshi

Greetings Ravi - some reasons to use goto statements in C are: 1) cleanly exiting a function, 2) exiting nested loops, and 3) low-level performance improvements. As for goto statements improving code readability - I don't have any references at the moment that support that claim.

Feel free to share your data source with me, and I can try to help. My source + some references:
1) stackoverflow.com/questions/24451/...
2) stackoverflow.com/questions/310574...
3) stackoverflow.com/questions/271898...