lostghost

Posted on Jun 29

Linux from the developer's perspective. Part 1 - C language introduction

#programming #linux

This blog is part of a series.

Unix is the programmer's OS. Let's see that for ourselves. A further exploration of the material shown here can be found in this book.

Linux and C share a lot of their DNA - the syscall API is a C API. Understanding Linux helps you understand C, and vice versa. To get an overview of how the whole mechanism functions, let's compile and run a simple C program, and carefully examine the process.

We will develop in the terminal, on a minimal Linux installation. To get such an installation, refer to an earlier blog.

The minimal program will look like this:

[lostghost1@archlinux c]$ cat main.c 
#include <stdio.h>
int main(int argc, char** argv){
    if (argc<2) return 1;
    printf("%s\n",argv[1]);
    return 0;
}

I suggest using the nano editor. Let's go over it line-by-line.

On line 1 there is a preprocessor directive - those start with #. The preprocessor is a separate, small language that runs at compile time, before the compiler proper sees the code: it works on the program text, pasting in files, expanding macros and cutting out conditional sections. It's used for metaprogramming - changing what code actually gets compiled. Modern languages try to avoid creating a second language for metaprogramming, but that's not the case with C.
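
To make the "second language" idea concrete, here is a minimal sketch of the two other things the preprocessor is commonly used for besides #include: macros and conditional compilation. The names BUFFER_SIZE, SQUARE, LOG and VERBOSE are made up for this illustration, they are not part of the program above.

```c
#include <stdio.h>

/* Object-like macro: every later BUFFER_SIZE is replaced by the text 16. */
#define BUFFER_SIZE 16

/* Function-like macro: pure text substitution, so the parentheses matter.
   Without them, SQUARE(1 + 2) would expand to 1 + 2 * 1 + 2. */
#define SQUARE(x) ((x) * (x))

/* Conditional compilation: only one of the two LOG definitions survives.
   VERBOSE is a hypothetical flag, e.g. passed as gcc -DVERBOSE main.c */
#ifdef VERBOSE
#define LOG(msg) fprintf(stderr, "log: %s\n", msg)
#else
#define LOG(msg) ((void)0)
#endif

int square_of_buffer_size(void) {
    LOG("computing square");
    return SQUARE(BUFFER_SIZE); /* the compiler sees ((16) * (16)) */
}
```

Running cpp on a file like this shows the expanded text: the macro names are gone and only plain C is left for the compiler.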

In C, the compilation unit is a file - the compiler considers one file at a time. So if the file needs function signatures, to be able to call into libraries - the signatures need to be declared in every file that uses them. This is what the #include directive does - it takes the file that you pass to it, and copies its entire contents to the place where the #include was written. We can see that for ourselves, by running the C PreProcessor, cpp (that requires installing the package gcc).

[lostghost1@archlinux c]$ cpp main.c
# 0 "main.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "main.c"
...
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1)));
# 949 "/usr/include/stdio.h" 3 4
extern int __uflow (FILE *);
extern int __overflow (FILE *, int);
# 973 "/usr/include/stdio.h" 3 4

# 2 "main.c" 2

# 2 "main.c"
int main(int argc, char **argv){
    if (argc<2) return 1;
    printf("%s\n",argv[1]);
    return 0;
}

We can see a lot of text, and in the end - our program. The mass of text that precedes the program is everything that was pulled in by our #include directive - plus metadata, such as the files and line numbers the included text came from. Of course, #include is a pretty crude way of being able to call libraries - modern languages use module systems, instead of header files.

What does a header file look like? You could find the stdio.h header file - it should be at /usr/include/stdio.h. Or we can write our own header file, for our program:

[lostghost1@archlinux c]$ cat main.h
#ifndef MAIN_H
#define MAIN_H
int main(int argc, char **argv);
#endif

And include it into our program:

#include "main.h"
#include <stdio.h>
int main(int argc, char **argv){
    if (argc<2) return 1;
    printf("%s\n",argv[1]);
    return 0;
}

Include with <these> angle brackets means searching the system include directories (such as /usr/include/), while "these" quotes mean searching the directory of the including file first, and only then falling back to the system directories.

In the header file, there are weird preprocessor directives, above and below the actual signature. That is called an include guard - it makes sure the contents of the header are only processed once, even if the header ends up being included twice. This is another argument in favor of a proper module system - you don't need hacks like this one.
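
To see what the guard buys us, here is a small sketch that simulates including the same header twice, by pasting its contents two times into one file - which is all #include does anyway. The header name twice.h, the guard name TWICE_H and both functions are hypothetical.

```c
/* --- first copy of the hypothetical header twice.h --- */
#ifndef TWICE_H
#define TWICE_H
static inline int twice(int x) { return 2 * x; }
#endif

/* --- second copy of the same header ---
   TWICE_H is already defined, so the preprocessor drops everything up to
   the #endif. Without the guard, the compiler would see twice() defined
   two times and reject the file with a redefinition error. */
#ifndef TWICE_H
#define TWICE_H
static inline int twice(int x) { return 2 * x; }
#endif

int four_times(int x) {
    return twice(twice(x));
}
```

Deleting the #ifndef / #define / #endif lines from both copies and compiling again is a quick way to see the error the guard protects against.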

Next comes the function header - it's our main function, from which the program starts running (more on that later). It returns an int, and takes two arguments - argc, and an array of strings argv.

Let's first discuss the return value. Which values mean what is only a loose convention. If the function performs an operation - typically a return value of zero means success, and a non-zero number is an error code. If it returns an address - then the value 0, which is NULL, means failure, and any other value is a valid address. Sometimes numbers from 0 and up are all valid - like with file descriptors, returned by the open syscall, in which case the value -1 means an error. When it comes to program exit status, which is what the main function returns - despite the data type being a signed int, only the lowest 8 bits reach the parent process, so the usable values are 0-255 - where 0 means success, and any other number is an error code. Error codes aren't standard across programs either - you need to look at the documentation for a particular program, to find out what they mean. Of course, a better system would be a standard set of exception attributes - whether the exception is fatal, whether it's transient, which module it relates to, what the cause is. Some exceptions would be more specific, others - less. But for now, it's just numbers.
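
Two of these conventions can be checked directly. The sketch below assumes a Linux system; the helper names try_open and observed_exit_status are made up for the illustration. The first one shows the -1 convention of open, the second one shows what a parent process actually receives when a child exits with a number larger than 255.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* open() returns -1 on failure and stores the reason in errno.
   Returns 0 if the path could be opened, otherwise the errno value. */
int try_open(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Starts a child process that exits with the given code and returns the
   exit status the parent observes, or -1 if something went wrong. */
int observed_exit_status(int code) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code); /* child: exit immediately with the requested code */

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling observed_exit_status(263) gives back 7 on Linux, because 263 does not fit into 8 bits and only the low byte survives. The same number can be seen in the shell with echo $? after running a program.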

Next let's take a look at datatypes. They in large part correspond to the machine data types that the processor operates on - C is not far removed from assembly. There are signed and unsigned integers of several widths, a character, which is actually a single byte, bool, float and double, struct, union and pointer. Pointer is special - on Linux it is, in practice, just a number wide enough to store an address for the current architecture. You can convert a pointer to an integer and back (the uintptr_t type exists for this). A pointer with the value 0, a NULL pointer, is a special kind of pointer, which means "points at nothing". The C standard leaves dereferencing it undefined, and on Linux address 0 is normally not mapped, so doing that typically ends in a segmentation fault - it is a programming mistake if you do.
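
A short sketch of the pointer-as-number idea, assuming a typical Linux platform where uintptr_t is available. Both helper functions are hypothetical and only there for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts a pointer to an integer and back, then checks that it still
   points at the same object. uintptr_t is an unsigned integer type wide
   enough to hold an address. Returns 1 if the round trip worked. */
int pointer_round_trip_ok(void) {
    int value = 42;
    int *p = &value;

    uintptr_t as_number = (uintptr_t)(void *)p; /* pointer -> number */
    int *back = (int *)(void *)as_number;       /* number -> pointer */

    return back == p && *back == 42;
}

/* A NULL pointer must be checked, not dereferenced.
   Returns the pointed-to value, or fallback if p is NULL. */
int read_or_default(const int *p, int fallback) {
    return p != NULL ? *p : fallback;
}
```

Checking for NULL before the dereference is the usual defensive pattern; the check costs one comparison, while the missing check costs a crash.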

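Next is another datatype, that barely exists - an array. In almost every expression an array turns ("decays") into a pointer to its first element, and indexing into it is just pointer arithmetic with an offset (the exceptions are operators like sizeof and &, which still see the whole array). So these are equivalent: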

*(argv+1)
argv[1]
1[argv]
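
The three spellings can be compared in code. This is a minimal sketch with made-up helper names; sum takes the element count as a separate parameter, because a bare pointer carries no length.

```c
#include <stddef.h>

/* All three spellings read the same element, because a[i] is defined
   as *(a + i), and addition does not care about operand order.
   Returns 1 if the three results are the same pointer. */
int all_forms_agree(char **argv, int i) {
    return *(argv + i) == argv[i] && argv[i] == i[argv];
}

/* Walks an int array using explicit pointer arithmetic. */
int sum(const int *values, size_t count) {
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += *(values + i); /* same as values[i] */
    return total;
}
```

The 1[argv] form is legal but only a curiosity; real code sticks to argv[1].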

How then do you determine where the array ends? You use a separate variable. Or, you use a decades-old dirty trick.

Another datatype, that doesn't really exist - a character string. A character string is just an array of characters, accessed through a pointer. But there is no second variable to determine the length of the string. Instead, there is an implicit contract - the last character of a string will be a zero byte, called NUL and written '\0' (not to be confused with NULL, the pointer). So, if you print out a string - the printing function will start at the start address, and continue reading byte-by-byte - until it finds the NUL character. If for some reason it's not there - the function will continue reading either up to an unmapped page boundary, in which case there will be a segmentation fault, or up to some unrelated zero byte - and if between the start and that zero byte, there was stored, for example, a password - it will be printed out as well. As you can imagine, historically, this has been a major source of vulnerabilities.
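
Here is a sketch of what "read until the zero byte" looks like in code. my_strlen mirrors what the standard strlen does; bounded_strlen is a hypothetical safer variant that also takes the buffer size, so it can never run off the end.

```c
#include <stddef.h>

/* Counts bytes until the terminating '\0'. If the terminator is
   missing, this keeps reading past the end of the buffer - exactly
   the bug described above. */
size_t my_strlen(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Same idea, but never looks at more than max bytes. */
size_t bounded_strlen(const char *s, size_t max) {
    size_t n = 0;
    while (n < max && s[n] != '\0')
        n++;
    return n;
}
```

The standard library offers the same pairing for many string functions: an unbounded original and a variant that takes a size limit.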

When writing a program, you are doing computer maths. And you would prefer to operate with actual mathematical concepts, and not concrete computer ones - you shouldn't care about how many bits are in a number, or if an array is a pointer and an offset. That's not the role C plays - it's close to hardware. For a higher-level system, look into how integers are handled in Python - they are arbitrary-precision, so a number that gets too large simply grows to as many digits as it needs, and doesn't wrap around.
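
For contrast, this is what fixed-width arithmetic looks like in C. A minimal sketch with hypothetical helper names: unsigned numbers wrap around silently, while overflowing a signed int is undefined behavior, so it has to be ruled out before the addition happens.

```c
#include <limits.h>

/* Unsigned arithmetic wraps around: the result is taken modulo
   UINT_MAX + 1, so UINT_MAX + 1 becomes 0. */
unsigned int add_wrapping(unsigned int a, unsigned int b) {
    return a + b;
}

/* Signed overflow is undefined behavior in C, so the check is done
   before adding. Returns 1 and stores the sum in *out if it fits
   into an int, otherwise returns 0 and leaves *out untouched. */
int add_checked(int a, int b, int *out) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *out = a + b;
    return 1;
}
```

In a language with arbitrary-precision integers neither function would be needed - in C, the programmer is the one who has to think about the width.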

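So then, what is char**? It's a pointer to a pointer to char - here, the start of an array of pointers to NUL-terminated character strings. The number of strings is passed in the argc variable, and the array itself ends with a NULL pointer (argv[argc] is NULL). The array contains the command-line arguments with which the program was launched - by convention, the first element, argv[0], holds the name or path the program was invoked with.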
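
A sketch of both ways to find the end of argv - by counting up to the NULL pointer, or by trusting argc. The helper names are made up; in a real program you would just use argc and argv inside main.

```c
#include <stddef.h>

/* Counts the entries of an argv-style array by walking it until the
   terminating NULL pointer. For a real argv the result equals argc. */
int count_args(char **argv) {
    int n = 0;
    while (argv[n] != NULL)
        n++;
    return n;
}

/* Returns the first argument after the program name,
   or NULL if the program was started without arguments. */
const char *first_real_arg(int argc, char **argv) {
    return argc >= 2 ? argv[1] : NULL;
}
```

This is the same check as the if (argc<2) line in our program: argv[1] may only be touched after making sure it is there.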

So then, what does the program do? It prints out its first actual command-line argument. Compile it and run it:

[lostghost1@archlinux c]$ gcc main.c
[lostghost1@archlinux c]$ ./a.out hello
hello

Success! But wait, what did this gcc program actually do? And how exactly was the resulting executable, well, executed? That's for a later blog. See ya then!

Top comments (2)

Roi Kadmon • Jul 5

Great work!
Since you're mentioning libraries and explaining how C works, I think it'd help to mention that printf() in the program you showed is part of the standard C library. I think an easy-to-digest way of explaining it is that it provides many functionalities that are likely to be useful for some programs, including:

Input-output functions: The printf() you mentioned, along with other variations that write to a specific file (not necessarily the standard output), including functions that deal with files - opening, closing files, reading from / writing to files [although it may be helpful to mention that the C notion of a file can be different from Linux's file-descriptor mechanism];
Numeric functions: Including basic functions like absolute value, minimum, maximum, etc., and also including exponential, power, trigonometric, hyperbolic and format-conversion functions;
String functions: Converting strings to numeric values, concatenating strings, searching for substrings, copying strings, etc. I'd point the reader to the C page in cppreference so they can get an idea of what functions are available, if they're interested. Moreover, you have different revisions of the C language that incrementally add more core functionality and library functions. (Much of what I've said could be out-of-scope, at least to expand on, but I think it could be helpful to mention.)

Bee Buzzin • Jul 5

Just finished reading your article. Pretty easy to follow along, but I'm definitely missing some background info on C and programming concepts. Will be reading your other articles to try to catch up!