DEV Community

lostghost


Linux from the developer's perspective. Part 1 - C language introduction

This blog is part of a series.

Unix is the programmer's OS. Let's see that for ourselves. A further exploration of the material shown here can be found in this book.

Linux and C share a lot of their DNA - the syscall API is a C API. Understanding Linux helps you understand C, and vice versa. To get an overview of how the whole mechanism functions, let's compile and run a simple C program, and carefully examine the process.

We will develop in the terminal, on a minimal Linux installation. For how to get such an installation, refer to an earlier blog.

The minimal program will look like this:

[lostghost1@archlinux c]$ cat main.c 
#include <stdio.h>
int main(int argc, char** argv){
    if (argc<2) return 1;
    printf("%s\n",argv[1]);
    return 0;
}

I suggest using the nano editor. Let's go over the program line by line.

On line 1 there is a preprocessor directive - those start with #. The preprocessor implements a small language of its own, whose programs run at compile time, before the actual compiler sees the code. It's used for metaprogramming: pasting in other files, defining macros, and including or excluding pieces of code conditionally. Modern languages try to avoid creating a second language for metaprogramming, but that's not the case with C.

In C, the compilation unit is a file - the compiler considers one file at a time. So if a file needs function signatures to be able to call into libraries, those signatures have to be included into every file that uses them. This is what the #include directive does - it takes the file that you pass to it, and copies its entire contents to the place where the #include was written. We can see that for ourselves by running the C preprocessor, cpp (it comes with the gcc package).

[lostghost1@archlinux c]$ cpp main.c
# 0 "main.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "main.c"
...
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1)));
# 949 "/usr/include/stdio.h" 3 4
extern int __uflow (FILE *);
extern int __overflow (FILE *, int);
# 973 "/usr/include/stdio.h" 3 4

# 2 "main.c" 2

# 2 "main.c"
int main(int argc, char **argv){
    if (argc<2) return 1;
    printf("%s\n",argv[1]);
    return 0;
}

We can see a lot of text, and at the end - our program. The mass of text that precedes the program is everything that was pulled in by our #include directive, plus metadata: the lines starting with # record which file and line the text that follows came from. Of course, #include is a pretty crude way of being able to call libraries - modern languages use module systems instead of header files.

What does a header file look like? You could find the stdio.h header file - it should be at /usr/include/stdio.h. Or we can write our own header file, for our program:

[lostghost1@archlinux c]$ cat main.h
#ifndef MAIN_H
#define MAIN_H
int main(int argc, char **argv);
#endif

And include it into our program:

#include "main.h"
#include <stdio.h>
int main(int argc, char **argv){
    if (argc<2) return 1;
    printf("%s\n",argv[1]);
    return 0;
}

Including with <angle brackets> searches the system include directories (such as /usr/include/), while including with "quotes" first searches the directory of the file doing the including, and then falls back to the system directories.

In the header file, there are weird preprocessor directives above and below the actual signature. That is called an include guard - it prevents the contents of the same header from being processed twice in one compilation unit. This is another argument in favor of a proper module system - you don't need hacks like this one.

Next comes the function signature - it's our main function, from which the program starts running (more on that later). It returns an int, and takes two arguments - a count argc, and an array of strings argv.

Let's first discuss the return value. Which values mean what is only a loose convention. If the function performs an operation, a return value of zero typically means success, and a non-zero number is an error code. If it returns an address, then NULL (address 0) means failure, and any other value is a valid address. Sometimes numbers from 0 and up are all valid - like with file descriptors, returned by the open syscall - in which case the value -1 means an error. When it comes to program exit status, which is what the main function returns - despite the data type being a signed int, only the lowest 8 bits reach the parent process, so the usable values are 0-255, where 0 means success and any other number is an error code. Error codes aren't standard across programs either - you need to look at the documentation for a particular program to find out what a code means. Of course, a better system would be a standard set of exception attributes - whether the exception is fatal, whether it's transient, which module it relates to, what the cause is. Some exceptions would be more specific, others - less. But for now, it's just numbers.

Next let's take a look at datatypes. In large part they correspond to the machine data types that the processor operates on - C is not far removed from assembly. There are signed and unsigned integers, a char, which is actually a single byte, bool, float, struct, union and pointer. The pointer is special - on Linux it is represented as just a number, wide enough to store an address for the current architecture. You can convert a pointer to an integer and back (the uintptr_t type exists for that). A pointer to address 0, a NULL pointer, is a special kind of pointer, used to mean "points to nothing". Dereferencing it is undefined behavior in C, and on Linux, where nothing is normally mapped at address 0, it usually ends in a SegFault - so it is a programming mistake if you do.

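Next is another datatype, that doesn't really exist as far as function calls are concerned - an array. When used in an expression or passed to a function, an array turns into a pointer to its first element, and to index into it you add an offset. So these are equivalent: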

*(argv+1)
argv[1]
1[argv]

How then do you determine where the array ends? You use a separate variable. Or, you use a decades-old dirty trick.

Another datatype that doesn't really exist - a character string. A character string is just an array of chars, which you access through a pointer. But there is no second variable to determine the length of the string. Instead, there is an implicit contract - the last character of a string will be a zero byte, the NUL character '\0' (not to be confused with the NULL pointer). So, if you print out a string, the printing function will start at the start address and continue reading byte by byte until it finds the NUL character. If for some reason it's not there, the function will continue reading either up to an unmapped page, in which case there will be a SegFault, or up to some other zero byte that happens to be in memory - and if between the start and that zero byte there was stored, for example, a password, it will be printed out as well. As you can imagine, historically, this has been a major source of vulnerabilities.

When writing a program, you are doing computer maths. And you would prefer to operate with actual mathematical concepts, and not concrete computer ones - you shouldn't have to care about how many bits are in a number, or whether an array is a pointer and an offset. That's not the role C plays - it's close to hardware. For a higher-level system, look into how integers are handled in Python - they are arbitrary-precision, so when a value gets too large for a machine word, it simply takes up more memory instead of wrapping around.

So then, what is char**? It's a pointer to an array of pointers to NUL-terminated character strings, the length of which is passed in the argc variable. The array contains the command-line arguments with which the program was launched - by convention, the first element, argv[0], is the name or path under which the executable was invoked, and the array itself is terminated by a NULL pointer at argv[argc].

So then, what does the program do? It prints out its first actual command-line argument. Compile it and run it:

[lostghost1@archlinux c]$ gcc main.c
[lostghost1@archlinux c]$ ./a.out hello
hello

Success! But wait, what did this gcc program actually do? And how exactly was the resulting executable, well, executed? That's for a later blog. See ya then!

Top comments (2)

Roi Kadmon

Great work!
Since you're mentioning libraries and explaining how C works, I think it'd help to mention that printf() in the program you showed is part of the standard C library. I think an easy-to-digest way of explaining it is that it provides many functionalities that are likely to be useful for some programs, including:

  • Input-output functions: The printf() you mentioned, along with other variations that write to a specific file (not necessarily the standard output), including functions that deal with files - opening, closing files, reading from / writing to files [although it may be helpful to mention that the C notion of a file can be different from Linux's file-descriptor mechanism];
  • Numeric functions: Including basic functions like absolute value, minimum, maximum, etc., and also including exponential, power, trigonometric, hyperbolic and format-conversion functions;
  • String functions: Converting strings to numeric values, concatenating strings, searching for substrings, copying strings, etc.

I'd point the reader to the C page in cppreference so they can get an idea of what functions are available, if they're interested. Moreover, you have different revisions of the C language that incrementally add more core functionality and library functions. (Much of what I've said could be out-of-scope, at least to expand on, but I think it could be helpful to mention.)
 
Bee Buzzin

Just finished reading your article. Pretty easy to follow along, but I'm definitely missing some background info on C and programming concepts. Will be reading your other articles to try to catch up!