Why does array indexing start at ZERO in C?

#c #programming #linux #cpp

Most programming languages today, including C, use zero-based array indexing, and there are several compelling reasons for it.

Better addressing

If a C array starts at index 0, the address of array[i] is exactly:

(array + i)

which is very consistent. But if it starts from 1, the address of array[i] would be:

(array + i - 1)

The extra subtraction adds work to every access unless the compiler can fold it away, and it gets worse when extended to a two-dimensional array.

For ZERO-based indexing, the address of array[i][j] in an array with N columns is

(array + i * N + j)

which is very neat. But for 1-based indexing, the address of array[i][j] will become:

(array + (i - 1) * N + (j - 1))

That is more cumbersome. Additionally, if we start with 1, "pointer + offset" and "array + index" no longer refer to the same address for the same number, so a conversion is needed every time one form is translated into the other. So, why bother?

Computer hardware is designed to use 0 as the starting index

Physical memory addressing and port addressing both start from 0. For example, the address range of a 32-bit computer's memory is:

[0, 2 ^ 32 - 1]

which can be represented by a 32-bit integer. However, if the memory is addressed starting from 1, then the address range of a 32-bit computer becomes:

[1, 2 ^ 32]

In that case, the highest address 2 ^ 32 would require a 33-bit integer, which would be a waste of resources.

Port addresses and DMA channels follow the same start-from-0 principle. If we use 3 bits for the DMA channel number, counting from 0 gives 8 channels (0-7), while counting from 1 gives only 7 channels (1-7) in the same 3 bits, which is also a waste of resources.

Therefore, a language that is close to the system naturally follows the hardware convention. In addition to the simpler address calculation mentioned in the first point, this keeps the language consistent with the computer system and makes pointer addressing and array addressing behave the same way.

Dijkstra explained that the reason programming languages do this is simply to follow hardware design decisions:

The decision taken by the language specification & compiler-designers is based on the decision made by computer system-designers to start count at 0.

Therefore, C language arrays start at zero for the following reasons: 1) Better performance; 2) Unified array and pointer addressing; 3) Following hardware addressing conventions.

In addition to these practical reasons, there are also some theoretical reasons.

Theoretical reasons

Apart from array indexing, Dijkstra advocated that all counting should start from 0, and he wrote an article to explain this viewpoint:

https://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

He explicitly criticized earlier languages for not paying enough attention to this detail, noting that Fortran subscripts always start at 1 and that ALGOL 60 and Pascal use bounds that are inclusive at both ends.

His argument starts from the observation that there are several ways to denote the integer sequence 2, 3, 4, ..., 12:

a) 2 <= i < 13
b) 1 < i <= 12
c) 2 <= i <= 12
d) 1 < i < 13

Then he explained:

For the left side, the expression "a <= x" is better than "a < x" because if "a < x" is used to express a sequence, you always need to provide a number that is smaller than the first element, which is not only annoying but sometimes impossible within the natural numbers: there is a smallest natural number, 0, so a sequence starting at 0 would need the lower bound -1, which is not a natural number. Therefore, "a <= x" is a better expression.

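For the right side, the expression "x < b" is better than "x <= b" because when a = b, "a <= x < b" represents the empty sequence naturally, while "a <= x <= b" can only represent it with an upper bound smaller than the lower bound (b = a - 1). For an empty sequence starting at 0, that upper bound would again be -1.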

In both scheme (a) and scheme (b), the difference between the bounds equals the length of the sequence (13 - 2 = 12 - 1 = 11).

Scheme (a) and scheme (b) also make adjacent sequences easy to express: the upper bound of one sequence is the lower bound of the next.

Thus, he concluded that the left-closed and right-open scheme (a) "a <= x < b" is the most suitable for expressing a sequence.

After Dijkstra argued that "a <= x < b" is a better choice, he concluded that an array of length N should start from 0, because the expression "0 <= x < N" is clearer than "1 <= x < N+1".