This blog is part of a series.
Unix is the programmer's OS. Let's see that for ourselves. A further exploration of the material shown here can be found in this book.
Linux and C share a lot of their DNA - the syscall API is a C API. Understanding Linux helps you understand C, and vice versa. To get an overview of how the whole mechanism functions, let's compile and run a simple C program, and carefully examine the process.
We will develop in the terminal, on a minimal Linux installation. For how to get such an installation, refer to an earlier blog.
The minimal program will look like this:
[lostghost1@archlinux c]$ cat main.c
#include <stdio.h>
int main(int argc, char** argv){
if (argc<2) return 1;
printf("%s\n",argv[1]);
return 0;
}
I suggest using the nano editor. Let's go over it line-by-line.
On line 1 there is a preprocessor directive - those start with #. The preprocessor is a separate little language, whose programs run at compile time and rewrite the source text before the compiler proper sees it. It's used for metaprogramming - generating or modifying the code that gets compiled. Modern languages try to avoid creating a second language for metaprogramming, but that's not the case with C.
In C, the compilation unit is a file - the compiler considers one file at a time. So if the file needs function signatures, to be able to call into libraries - the signatures need to be included into every file that uses them. This is what the #include directive does - it takes the file that you pass to it, and copies its entire contents to the place where the #include was written. We can see that for ourselves, by running the C preprocessor, cpp (that requires installing the package gcc).
[lostghost1@archlinux c]$ cpp main.c
# 0 "main.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "main.c"
...
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1)));
# 949 "/usr/include/stdio.h" 3 4
extern int __uflow (FILE *);
extern int __overflow (FILE *, int);
# 973 "/usr/include/stdio.h" 3 4
# 2 "main.c" 2
# 2 "main.c"
int main(int argc, char **argv){
if (argc<2) return 1;
printf("%s\n",argv[1]);
return 0;
}
We can see a lot of text, and at the end - our program. The mass of text that precedes the program is everything that was pulled in by our #include directive - plus metadata, such as the files the included text came from. Of course, #include is a pretty crude way of being able to call libraries - modern languages use module systems, instead of header files.
What does a header file look like? You could find the stdio.h header file - it should be at /usr/include/stdio.h. Or we can write our own header file, for our program:
[lostghost1@archlinux c]$ cat main.h
#ifndef MAIN_H
#define MAIN_H
int main(int argc, char **argv);
#endif
And include it into our program:
#include "main.h"
#include <stdio.h>
int main(int argc, char **argv){
if (argc<2) return 1;
printf("%s\n",argv[1]);
return 0;
}
Including with <these> angle brackets means searching the system include directories (such as /usr/include/), while "these" quotes mean searching the directory of the including file first, and then falling back to the system directories.
In the header file, there are weird preprocessor directives, above and below the actual signature. That is called an include guard - it prevents the contents of the same header from being processed twice within one compilation unit. Which is another argument in favor of a proper module system - you don't need hacks like this one.
Next comes the function header - it's our main function, from which the program starts running (more on that later). It returns an int, and takes two arguments - an integer argc, and an array of strings argv.
Let's first discuss the return value. Which values mean what is only a loose convention. If the function performs an operation - typically the return value of zero means success, and a non-zero number means an error code. If it returns an address - then the value of 0, which is NULL, means failure, and any other value is a valid address. Sometimes numbers from 0 and up are all valid - like with file descriptors, returned by the open syscall, in which case the value -1 means an error. When it comes to program exit status, which is what the main function returns - despite the data type being a signed int, only the lowest 8 bits reach the parent process, so the usable values are 0-255 - where 0 means success, and any other number is an error code. Error codes aren't standard across programs either - you need to look at the documentation for a particular program, to find out what a code means. Of course, a better system would be a standard set of exception attributes - whether the exception is fatal, whether it's transient, which module it relates to, what the cause is. Some exceptions would be more specific, others - less. But for now, it's just numbers.
Next let's take a look at datatypes. They in large part correspond to the machine data types that the processor operates on - C is not far removed from assembly. There's signed and unsigned numbers, a character, which is actually a single byte, bool, float, struct, union and pointer. Pointer is special - it is not much of a datatype of its own - on Linux it is in practice just an unsigned number, wide enough to store an address for the current architecture. You can convert a pointer to a number and back. A pointer with the value 0, a NULL pointer, is a special kind of pointer, which means "points to nothing". The C standard makes dereferencing it undefined behavior, and on Linux address 0 is normally left unmapped, so the attempt typically ends in a SegFault - it is a programming mistake if you do.
Next is another datatype that barely exists at runtime - an array. In almost every expression an array turns ("decays") into a pointer to its first element, and indexing is just adding an offset to that pointer. So these are equivalent:
*(argv+1)
argv[1]
1[argv]
How then do you determine where the array ends? You use a separate variable. Or, you use a decades-old dirty trick.
Another datatype that doesn't really exist - a character string. A character string is just an array of bytes, accessed through a pointer. But there is no second variable to determine the length of the string. Instead, there is an implicit contract - the last character of a string will be a zero byte, the NUL character '\0' (not to be confused with the NULL pointer). So, if you print out a string - the printing function will start at the start address, and continue reading byte-by-byte - until it finds the NUL character. If for some reason it's not there - the function will continue reading either up to an unmapped page boundary, in which case there will be a SegFault, or up to some unrelated zero byte - and if between the start and that zero byte there was stored, for example, a password - it will be printed out as well. As you can imagine, historically, this has been a major source of vulnerabilities.
When writing a program, you are doing computer maths. And you would prefer to operate with actual mathematical concepts, and not concrete computer ones - you shouldn't care about how many bits are in a number, or if an array is a pointer and an offset. That's not the role C plays - it's close to hardware. For a higher-level system, look into how integers are handled in Python - they are arbitrary-precision: when a value outgrows the machine word, it grows into a multi-word "big integer" representation, and doesn't just wrap around.
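So then, what is char**? It's an array of NUL-terminated character strings, the length of which is passed in the argc variable. The array contains the command-line arguments with which the program was launched. By convention the first element, argv[0], holds the name the program was invoked with (usually the path to the executable), so the actual arguments start at argv[1]. As a bonus, the element right after the last argument, argv[argc], is guaranteed to be a NULL pointer.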
So then, what does the program do? It prints out its first actual command-line argument. Compile it and run it:
[lostghost1@archlinux c]$ gcc main.c
[lostghost1@archlinux c]$ ./a.out hello
hello
Success! But wait, what did this gcc program actually do? And how exactly was the resulting executable, well, executed? That's for a later blog. See ya then!