DEV Community

AttractivePenguin

How a Shell Script Can Compile C Code: Inside C89cc

You read that right. There exists a C89 compiler that emits ELF64 executables, written entirely in portable shell script. No C bootstrap. No Python helper. No cheating. Just pure, unadulterated /bin/sh.

It's called c89cc, and when it hit 117 points on Hacker News, the developer community's collective jaw hit the floor. Because if you can write a C compiler in shell script, what can't you do with shell?

In this article, we'll explore how c89cc works, why it matters, and what you can learn from it — whether you're a shell scripting enthusiast, a compiler nerd, or just someone who appreciates a beautiful engineering feat.


Why This Matters

Shell scripting is the duct tape of the Unix world. We use it for CI pipelines, deployment scripts, system administration, and quick automations. But a compiler? That's supposed to be the domain of lex, yacc, LLVM, and carefully crafted C/C++ codebases spanning hundreds of thousands of lines.

c89cc challenges that assumption. It proves that shell scripting is far more capable than most developers give it credit for. More importantly, it's an incredible educational tool: by reading a compiler written in shell, you can understand every layer of the compilation pipeline without getting lost in macro magic or template metaprogramming.

Three reasons you should care:

  1. It demystifies compilers. Shell is readable. When the lexer, parser, and code generator are all written in shell, you can trace exactly how source code becomes a binary.
  2. It pushes shell's limits. Understanding how the author handled memory, string processing, and binary output in pure shell teaches you techniques you'll use in everyday scripting.
  3. It's just cool. Sometimes engineering should be about the joy of "because I can."

How c89cc Works: The Compilation Pipeline

Like any compiler, c89cc follows the standard pipeline: preprocessing, lexical analysis, parsing, semantic analysis, and code generation. The difference is that every stage is implemented as shell functions operating on text files and shell variables.

Step 1: Preprocessing

The preprocessor handles #include, #define, and conditional compilation. In c89cc, this is done with sed, awk, and shell functions that expand macros line by line.

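# Simplified view of how c89cc might handle #define expansion.
# Note: this sketch uses bash extensions (printf -v, ${!prefix@}, ${!name},
# ${var//pat/repl}) for brevity, so it is not POSIX sh. It handles object-like
# macros only and replaces plain substrings without checking token boundaries.
preprocess() {
    while IFS= read -r line; do
        case "$line" in
            '#define '*)
                rest=${line#'#define '}
                name=${rest%% *}
                # Everything after the name is the replacement text (may be empty)
                case "$rest" in
                    *' '*) value=${rest#* } ;;
                    *)     value= ;;
                esac
                printf -v "define_${name}" '%s' "$value"
                ;;
            *)
                # Expand known macros
                for macro in ${!define_@}; do
                    mname="${macro#define_}"
                    line="${line//${mname}/${!macro}}"
                done
                printf '%s\n' "$line"
                ;;
        esac
    done < "$1"
}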

Step 2: Lexical Analysis (Tokenization)

The lexer breaks C source into tokens — keywords, identifiers, operators, literals, and punctuation. In c89cc, this is implemented as a state machine using shell case statements and pattern matching.

# Simplified token categorization: classifies a single character
tokenize() {
    local char="$1"
    case "$char" in
        [a-zA-Z_])  echo "IDENTIFIER" ;;
        [0-9])       echo "NUMBER" ;;
        '+')         echo "PLUS" ;;
        '-')         echo "MINUS" ;;
        '*')         echo "STAR" ;;
        '/')         echo "SLASH" ;;
        ';')         echo "SEMICOLON" ;;
        '(')         echo "LPAREN" ;;
        ')')         echo "RPAREN" ;;
        '{')         echo "LBRACE" ;;
        '}')         echo "RBRACE" ;;
        *)           echo "UNKNOWN:$char" ;;
    esac
}

The real lexer is far more sophisticated — handling multi-character operators (==, !=, &&), string literals with escape sequences, and comments. But the core idea remains: a shell case statement driving a state machine.
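To make the multi-character case concrete, here is a minimal sketch of longest-match tokenizing in plain sh. The function names (next_token, tokenize_line) and the token names are invented for this example and are not taken from c89cc; the only idea it shows is that two-character operators must be tested before their one-character prefixes.

```shell
# Hypothetical sketch: longest-match tokenizing with case patterns.
# next_token reads from $input, sets $tok_type and $tok_text, and consumes the text.
next_token() {
    # Skip leading blanks.
    while :; do
        case "$input" in
            ' '*) input=${input#?} ;;
            *)    break ;;
        esac
    done
    case "$input" in
        '')         tok_type=EOF; tok_text=''; return ;;
        '=='*)      tok_type=EQ;     tok_text='==' ;;   # two-char operators first
        '!='*)      tok_type=NE;     tok_text='!=' ;;
        '&&'*)      tok_type=AND;    tok_text='&&' ;;
        '='*)       tok_type=ASSIGN; tok_text='='  ;;
        '!'*)       tok_type=NOT;    tok_text='!'  ;;
        '&'*)       tok_type=AMP;    tok_text='&'  ;;
        [0-9]*)     tok_type=NUMBER;     tok_text=${input%%[!0-9]*} ;;
        [A-Za-z_]*) tok_type=IDENTIFIER; tok_text=${input%%[!A-Za-z0-9_]*} ;;
        *)          tok_type=UNKNOWN;    tok_text=${input%"${input#?}"} ;;
    esac
    # Drop the matched text from the front of the input.
    input=${input#"$tok_text"}
}

# Print one TYPE:text line per token.
tokenize_line() {
    input=$1
    while :; do
        next_token
        [ "$tok_type" = EOF ] && break
        printf '%s:%s\n' "$tok_type" "$tok_text"
    done
}

tokenize_line 'a == b && x1 != 42'
```

The order of the case branches does the work here: if '='* came before '=='*, the input == would be split into two ASSIGN tokens. String literals, comments, and the rest of the C operator set are left out to keep the sketch short.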

Step 3: Parsing (Building the AST)

The parser consumes tokens and builds an Abstract Syntax Tree. c89cc uses a recursive descent parser — each grammar rule is a shell function that calls other shell functions.

# Recursive descent: parse an expression.
# Each function leaves its sub-tree in the global $node. Capturing output with
# $(...) would run the callee in a subshell, and advance_token's updates to
# current_token would be lost.
parse_expression() {
    local left op
    parse_term
    left="$node"
    while [ "$current_token" = "PLUS" ] || [ "$current_token" = "MINUS" ]; do
        op="$current_token"
        advance_token
        parse_term
        left="{$op, $left, $node}"
    done
    node="$left"
}

parse_term() {
    local left op
    parse_factor
    left="$node"
    while [ "$current_token" = "STAR" ] || [ "$current_token" = "SLASH" ]; do
        op="$current_token"
        advance_token
        parse_factor
        left="{$op, $left, $node}"
    done
    node="$left"
}

parse_factor() {
    case "$current_token" in
        "NUMBER"|"IDENTIFIER") node="$current_value"; advance_token ;;
        # The second advance_token skips the closing parenthesis
        "LPAREN")  advance_token; parse_expression; advance_token ;;
        *)  echo "parse error: unexpected $current_token" >&2; return 1 ;;
    esac
}

Each function hands its sub-tree back as a string (shell's only real data structure), and the parent function combines them. The result travels through a variable rather than $(...) because command substitution runs in a subshell, where advance_token could not move the parent's position in the token stream. It's plain recursion over strings, and it works.
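The grammar functions call advance_token and read current_token, neither of which is shown in the article. Below is a self-contained sketch that can actually be run. It assumes a token stream stored as space-separated TYPE:value words (a format invented here for illustration) and passes results back through a global node variable so that token-stream updates are not lost in a subshell. It relies on local, which dash and bash support but POSIX does not require.

```shell
# Hypothetical driver; the token format, advance_token, and $node are assumptions.

# Pop the next "TYPE:value" word from $tokens into current_token/current_value.
advance_token() {
    if [ -z "$tokens" ]; then
        current_token=EOF; current_value=''
        return
    fi
    first=${tokens%% *}
    case "$tokens" in
        *' '*) tokens=${tokens#* } ;;
        *)     tokens='' ;;
    esac
    current_token=${first%%:*}
    current_value=${first#*:}
}

parse_factor() {
    case "$current_token" in
        NUMBER|IDENTIFIER) node=$current_value; advance_token ;;
        LPAREN) advance_token; parse_expression || return 1; advance_token ;;
        *) echo "unexpected token: $current_token" >&2; return 1 ;;
    esac
}

parse_term() {
    local left op
    parse_factor || return 1
    left=$node
    while [ "$current_token" = STAR ] || [ "$current_token" = SLASH ]; do
        op=$current_token
        advance_token
        parse_factor || return 1
        left="{$op, $left, $node}"
    done
    node=$left
}

parse_expression() {
    local left op
    parse_term || return 1
    left=$node
    while [ "$current_token" = PLUS ] || [ "$current_token" = MINUS ]; do
        op=$current_token
        advance_token
        parse_term || return 1
        left="{$op, $left, $node}"
    done
    node=$left
}

# Parse a whole token string and print the resulting tree.
parse() {
    tokens=$1
    advance_token
    parse_expression && printf '%s\n' "$node"
}

parse 'NUMBER:1 PLUS:+ NUMBER:2 STAR:* IDENTIFIER:x'   # prints {PLUS, 1, {STAR, 2, x}}
```

Multiplication binds tighter than addition only because parse_expression calls parse_term, which consumes the whole 2 * x before control returns to the PLUS loop.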

Step 4: Code Generation (ELF64 Output)

This is where it gets wild. c89cc generates a valid ELF64 binary — headers, section tables, program headers, and x86-64 machine code — all by writing raw bytes using shell's printf with octal escapes.

# Writing the ELF header (simplified)
write_elf_header() {
    # Magic number: 0x7f 'E' 'L' 'F'
    printf '\177ELF' > "$output"
    # ELFCLASS64
    printf '\002' >> "$output"
    # Little endian
    printf '\001' >> "$output"
    # ELF version
    printf '\001' >> "$output"
    # OS/ABI (0 = System V)
    printf '\000' >> "$output"
    # ABI version plus 7 padding bytes, completing the 16-byte e_ident
    printf '\000\000\000\000\000\000\000\000' >> "$output"
    # ET_EXEC (executable file), 16-bit little endian
    printf '\002\000' >> "$output"
    # EM_X86_64 = 62 decimal = 0x3e = octal 076
    printf '\076\000' >> "$output"
    # ...
}

The code generator traverses the AST and emits x86-64 instructions using the same printf technique. Each CPU opcode is encoded as its byte sequence.
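As an illustration of emitting instructions this way, the sketch below writes the machine code for a Linux exit(status) call. The helper names (emit_byte, emit_u32le, gen_exit) are made up for this example and are not c89cc's; the opcodes are the standard x86-64 encodings for mov r32, imm32 and syscall.

```shell
# Hypothetical sketch: emitting x86-64 machine code with printf.

# Write one byte given its value (0..255) as a decimal number.
# The inner printf builds a three-digit octal escape, the outer one emits it.
emit_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# Write a 32-bit integer in little-endian byte order.
emit_u32le() {
    emit_byte $(( $1 & 255 ))
    emit_byte $(( ($1 >> 8) & 255 ))
    emit_byte $(( ($1 >> 16) & 255 ))
    emit_byte $(( ($1 >> 24) & 255 ))
}

# Machine code for: mov edi, <status>; mov eax, 60 (exit); syscall
gen_exit() {
    printf '\277'; emit_u32le "$1"   # BF imm32 = mov edi, imm32
    printf '\270'; emit_u32le 60     # B8 imm32 = mov eax, imm32
    printf '\017\005'                # 0F 05    = syscall
}

# Inspect the generated bytes as hex.
gen_exit 42 | od -An -tx1
```

The hex dump should show the twelve bytes bf 2a 00 00 00 b8 3c 00 00 00 0f 05. Piping through od is a convenient way to check generated code, since NUL bytes cannot be stored in a shell variable.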


Real-World Lessons for Shell Scripters

You're probably not going to write your next compiler in shell. But the techniques c89cc uses are directly applicable to everyday scripting:

1. Shell Can Handle Binary Data

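Most people think shell can only process text. printf with octal escapes (\177) lets you write arbitrary bytes to a file or pipe. Bash also accepts hex escapes and the $'\xNN' syntax, but hex escapes in printf are an extension that some shells (dash, for example) do not support, so octal is the portable choice. This is useful for generating binary files, protocol messages, or even small images.

# Write the 8-byte PNG signature (the start of a PNG file, not yet a valid image)
printf '\211PNG\r\n\032\n' > image.png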

2. Functional Patterns in Shell

c89cc uses functions that echo their results, and callers capture them with $(). This is essentially functional programming: pure functions composing via pipe-like semantics. The pattern suits functions without side effects, because each $() runs in a subshell and any variable the callee changes is discarded when it returns.

# Compose functions instead of mutating globals (quote each $() to avoid word splitting)
result=$(validate "$(sanitize "$(read_input)")")
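Command substitution runs its command in a subshell, which matters once a function also needs to update state. A small sketch with made-up names (counter, next_id_echo, next_id_var) compares returning a value through stdout with returning it through a variable:

```shell
# Illustrative only: the same function written in both return styles.
counter=0

next_id_echo() {          # returns via stdout; caller uses $(...)
    counter=$((counter + 1))
    echo "id$counter"
}

next_id_var() {           # returns via the global $result
    counter=$((counter + 1))
    result="id$counter"
}

a=$(next_id_echo)         # runs in a subshell: the increment is lost
b=$(next_id_echo)
echo "$a $b $counter"     # prints: id1 id1 0

next_id_var; c=$result
next_id_var; d=$result
echo "$c $d $counter"     # prints: id1 id2 2
```

The echo style is the cleaner choice for pure transformations; the variable style is the one to reach for when the function advances a cursor, allocates an ID, or otherwise changes state the caller must see.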

3. State Machines with case

The lexer's state machine pattern is useful anywhere you need to process input character by character — log parsing, protocol handling, simple parsers.
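As a small worked example of the pattern, the sketch below strips /* ... */ comments from a string with a four-state machine. The function name and the state names are invented for illustration, and string literals that happen to contain /* are deliberately not handled.

```shell
# Hypothetical example: remove C block comments one character at a time.
# States: code, slash (just saw "/"), comment, star (saw "*" inside a comment).
strip_comments() {
    s=$1; state=code; out=''
    while [ -n "$s" ]; do
        # Split off the first character of $s.
        rest=${s#?}; c=${s%"$rest"}; s=$rest
        case "$state:$c" in
            code:/)      state=slash ;;
            code:*)      out=$out$c ;;
            slash:'*')   state=comment ;;
            slash:/)     out=$out/ ;;               # "//": emit one, keep waiting
            slash:*)     out=$out/$c; state=code ;; # a lone "/" was real code
            comment:'*') state=star ;;
            comment:*)   ;;                         # swallow comment text
            star:/)      state=code ;;              # "*/" closes the comment
            star:'*')    ;;                         # "**": still maybe closing
            star:*)      state=comment ;;
        esac
    done
    # A trailing "/" that never became a comment is still code.
    [ "$state" = slash ] && out=$out/
    printf '%s\n' "$out"
}

strip_comments 'a = 1; /* set a */ b = a / 2;'
```

Joining the state and the character into one string ("$state:$c") lets a single case statement act as the whole transition table, with the specific patterns listed before the catch-all for each state.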

4. Don't Underestimate Shell Performance

Yes, c89cc is slow. Compiling a simple program can take minutes instead of milliseconds. But for one-off build scripts, prototyping, and CI checks, shell is fast enough. In those scripts the time usually goes to the external commands and the I/O they perform, not to the shell's own interpretation.


FAQ

Q: Can c89cc compile itself (bootstrap)?
A: Not yet. The C89 feature set it supports is impressive but incomplete. Self-compilation is the holy grail of compiler projects, and shell adds its own constraints. But the subset it handles — functions, pointers, structs, control flow — is substantial.

Q: Is this production-ready?
A: Absolutely not, and the author doesn't claim it is. This is an educational and artistic project. Use GCC, Clang, or TCC for real work.

Q: Why shell specifically? Why not Python or Go?
A: The constraint is the point. Shell is available everywhere — every Unix system, every Docker container, every CI runner. Writing a compiler in the most portable language possible is a statement about accessibility and minimalism.

Q: How does it handle memory?
A: Shell doesn't have pointers or manual memory management. c89cc uses temporary files and shell variables to simulate memory. Symbol tables are associative arrays (in bash/ksh) or files keyed by name.

Q: What C89 features are supported?
A: The core set: integer types, functions, pointers, arrays, structs, control flow (if/else, while, for, switch), and basic preprocessor directives. Notably missing: floating point, complex macro expansion, and full standard library support.
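For the memory question above, one way a symbol table keyed by name could look in plain sh, which has no arrays, is a family of dynamically named variables set and read through eval. This is a sketch under that assumption; sym_set and sym_get are invented names, and c89cc may do this differently.

```shell
# Hypothetical symbol table for POSIX sh using variables named sym_<NAME>.

# sym_set NAME VALUE: store VALUE under NAME.
sym_set() {
    # Reject names that are not plain identifiers, so eval stays safe.
    case "$1" in
        ''|*[!A-Za-z0-9_]*) echo "bad symbol name: $1" >&2; return 1 ;;
    esac
    eval "sym_$1=\$2"
}

# sym_get NAME: print the stored value, or fail if NAME is undefined.
sym_get() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) return 1 ;;
    esac
    eval "[ -n \"\${sym_$1+set}\" ]" || return 1
    eval "printf '%s\n' \"\$sym_$1\""
}

sym_set main 'function int() offset=0x78'
sym_set counter 'global int offset=0x1000'
sym_get main
sym_get nosuch || echo "nosuch: undefined"
```

Validating the name before calling eval is the important design choice: C identifiers fit the allowed character set anyway, and the check keeps arbitrary input from being executed as shell code.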


Running c89cc Yourself

Want to see it in action? Here's how:

# Clone the repo
git clone https://github.com/yiyus/c89cc.git
cd c89cc

# Try compiling a simple C program
cat > hello.c << 'CEOF'
int main() {
    write(1, "Hello from shell-compiled C!\n", 29);
    return 0;
}
CEOF

# Compile with c89cc (this will take a while!)
./c89cc.sh hello.c -o hello

# Run the result
./hello
# Output: Hello from shell-compiled C!

A word of warning: Compilation is slow. A program that GCC compiles in milliseconds might take c89cc several minutes. This is expected — shell is interpreting each line, doing string processing for every token, every parse step, every byte of output. Grab a coffee.


Conclusion

c89cc isn't about replacing your toolchain. It's about expanding what you believe is possible.

When someone says "shell scripting can't do X," c89cc is the counterargument. A C compiler — with a lexer, recursive descent parser, and ELF64 code generator — all in portable shell. It works. It produces real executables. And you can read every line of its source code in an afternoon.

The next time you're writing a shell script and thinking "this is getting too complex, I should switch to Python," remember c89cc. Maybe shell can handle it. And maybe the constraint of staying in shell will lead you to a simpler, more elegant solution.

The repo is at github.com/yiyus/c89cc. Go read the source. It's a masterclass in what shell can do.


Have you built something surprising in shell? I'd love to hear about it in the comments.
