Anton Ukhanev

PHP Array: A Gross Mistake

Any developer who has spent a little time working with PHP knows one of the most used compound types - the array. Its uses reach from de-serialization results, to the backbone of sets, collections and containers, stacks and queues, indexes, and much more. It is so ubiquitous that it's possible to make an object appear to be an array with the ArrayAccess, Iterator, and Countable interfaces.

However, countless examples suggest that even experienced PHP developers frequently make the same kind of mistake, which often has a hidden cost months or even years later. I contend that

  1. An array is not an array.
  2. What you really want is either a map or a list.
  3. You're doing it wrong.

PHP Array

From Java to C++ to Pascal to Basic, an array is an implementation of a list. According to the official documentation, however, a PHP array is really an ordered map, much like a linked hash map. This is very versatile and flexible; yet it goes against the very principles that keep our code sane. Specifically, it violates the Interface Segregation Principle, which leads to much lower separation of concerns, puts unnecessary burden on implementations, confuses and complicates consumers, and makes implementations less flexible. All this leads to a cascade of further negative side effects, which could easily be avoided by applying engineering to the problem at hand. What does this mean for us?

ISP

The Interface Segregation Principle states:

no client should be forced to depend on methods it does not use

Let's consider a typical scenario:

function sumNumbers(array $numbers, int $limit): float
{
    $sum = 0;
    $i = 0;
    foreach ($numbers as $number) {
        if ($i === $limit) {
            break;
        }

        $sum += $number;
        $i++;
    }

    return (float) $sum;
}

This is typically what developers write when they need to perform some operation on the elements of a list. The list is consumed by iterating over its elements with a simple foreach loop, starting from the first one and stopping when either the end or the limit is reached. Nothing else is done to this list of numbers besides iterating over it in the given order. And yet we are asking for an array, as if we needed to access or modify its elements directly, or in random order. What if the consumer of sumNumbers() has an infinite series of numbers and just wants to know the sum of the first 1000? The signature does not permit that, because nothing but an array may be passed.

A type defines the ways in which values of that type may be consumed. Looking at ArrayAccess, Iterator, and Countable, which together make up the true interface of an array, we see that this type is in reality far more complex than its ease of use lets on. But simplicity is not about ease, and a much simpler version of the function affords us incredible flexibility compared to its former state, without changing anything about the algorithm:

function sumNumbers(iterable $numbers, int $limit): float
{
    // ...
}

echo sumNumbers((function () {
    $i = 0;
    $k = 1;
    yield $k;
    while (true) {
        $k = $i + $k;
        $i = $k - $i;
        yield $k;
    }
})(), 1000);

The new consumer of sumNumbers() can now use any series of numbers, finite or infinite, generated, hard-coded, or loaded from an external source.
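
To make this concrete, here is a minimal sketch that fills in the elided body with the same loop as the first version and feeds it three different sources. The naturals() generator is a hypothetical helper introduced only for this illustration.

```php
function sumNumbers(iterable $numbers, int $limit): float
{
    $sum = 0;
    $i = 0;
    foreach ($numbers as $number) {
        if ($i === $limit) {
            break;
        }

        $sum += $number;
        $i++;
    }

    return (float) $sum;
}

// Hypothetical helper: an endless series of natural numbers, starting at 1.
function naturals(): Generator
{
    $n = 1;
    while (true) {
        yield $n++;
    }
}

// A plain array still passes the iterable type check.
$fromArray = sumNumbers([1, 1, 2, 3, 5], 3);

// So does any Traversable object, such as the built-in ArrayIterator.
$fromIterator = sumNumbers(new ArrayIterator([1, 1, 2, 3, 5]), 3);

// And so does a generator that never ends; the limit stops the loop.
$fromGenerator = sumNumbers(naturals(), 4);
```

The function body is unchanged between the three calls; only the narrower parameter type makes the last two possible.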

Here is another example of array usage:

function getGreeting(array $user): string
{
    $fullName = [];
    $fullNameSegments = ['first_name', 'last_name'];

    foreach ($fullNameSegments as $segment) {
        if (isset($user[$segment])) {
            $fullName[] = $user[$segment];
        }
    }

    return implode(' ', $fullName);
}

The only way the $user argument is consumed is by accessing specific, discrete indices. If the consumer wanted to pass something that is only capable of exposing discrete members, which would actually be enough for the algorithm to work, they could not! Let's consider a simplified version, where we depend only on what we actually use.

function getGreeting(MapInterface $user): string
{
    $fullName = [];
    $fullNameSegments = ['first_name', 'last_name'];

    foreach ($fullNameSegments as $segment) {
        if ($user->has($segment)) {
            $fullName[] = $user->get($segment);
        }
    }

    return implode(' ', $fullName);
}

interface MapInterface
{
    public function get(string $key);

    public function has(string $key): bool;
}

class Map implements MapInterface
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }

        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

$user = new Map([
    'first_name' => 'Xedin',
    'last_name' => 'Unknown',
    'id' => '12345',
]);
assert($user instanceof MapInterface);
echo getGreeting($user);

Because the new getGreeting() only consumes and requires the methods of a map, any compatible map can be used, whether hard-coded, loaded from a database, de-serialized, or retrieved from a remote API. Cases such as a remote API or some key-value storages are especially curious here: they may not allow the listing of all entries while still supporting retrieval and checking by key, and so cannot be represented by an array, because their "members" are not enumerable.
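
As a rough sketch of such a non-enumerable source, the map below resolves each key on demand through a callback and has no way of listing its keys. The CallbackMap class and its resolver convention (return [$value] for a known key, [] for an unknown one) are assumptions made for this illustration; a real resolver might issue one remote request per key.

```php
interface MapInterface
{
    public function get(string $key);

    public function has(string $key): bool;
}

// Hypothetical map that looks keys up one at a time and cannot enumerate them.
class CallbackMap implements MapInterface
{
    /** @var callable */
    protected $resolver;

    public function __construct(callable $resolver)
    {
        $this->resolver = $resolver;
    }

    public function get(string $key)
    {
        $result = ($this->resolver)($key);
        if ($result === []) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }

        return $result[0];
    }

    public function has(string $key): bool
    {
        return ($this->resolver)($key) !== [];
    }
}

function getGreeting(MapInterface $user): string
{
    $fullName = [];
    $fullNameSegments = ['first_name', 'last_name'];

    foreach ($fullNameSegments as $segment) {
        if ($user->has($segment)) {
            $fullName[] = $user->get($segment);
        }
    }

    return implode(' ', $fullName);
}

// Stand-in for a remote lookup: answers for two keys, knows nothing else.
$remoteUser = new CallbackMap(function (string $key): array {
    switch ($key) {
        case 'first_name':
            return ['Xedin'];
        case 'last_name':
            return ['Unknown'];
        default:
            return [];
    }
});

$greeting = getGreeting($remoteUser);
```

getGreeting() is identical to the version above; it never needs to know that the keys cannot be listed.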

Data Representation

Often, data needs to be encoded in text form in order to be saved or transferred. In these cases, some kinds of DTOs are used in order to represent that data in the program. Let's look at a typical example of a remote API response:

{
    "users": [
        {
            "id": 1,
            "username": "xedin",
            "first_name": "Xedin",
            "last_name": "Unknown"
        },
        {
            "id": 2,
            "username": "jsmith",
            "first_name": "John",
            "last_name": "Smith"
        }
    ]
}

The response data contains a map with a single member users, which corresponds to a list, where every member is a map with members id, username, first_name, and last_name. This is because JSON is a very simple interchange format that supports maps and lists. Note that there is no such thing as an "ordered map" here. Looking at such a response, we understand quite intuitively that each "user" representation has a schema, which dictates certain mandatory (and perhaps some optional) fields. In an application, this data will be retrieved by a key that is known in advance, because the application is written in accordance with the schema. There is never really a need to get all fields of a user. Let's look at a solution for a typical problem, where entries support arbitrary fields.

{
  "users": [
    {
      "id": 1,
      "username": "xedin",
      "first_name": "Xedin",
      "last_name": "Unknown",
      "meta": [
        {
          "name": "date_of_birth",
          "value": "1970-01-01"
        },
        {
          "name": "hair_colour",
          "value": null
        }  
      ]
    }
  ]
}

This adds support for an arbitrary number of arbitrary members through metadata in the meta member of each user. It is a list of maps, each holding a name-value pair, and each map can at any time receive additional members if necessary, such as a type that determines the data or field type of the member. This is structurally very similar to how data is stored in various engines, be it EAV (Magento 1), WordPress (meta tables), or some other relational or key-value storage, and allows a simple and seamless flow between the HTTP, the application, and the data layers.

So then, why should the DTO type structure be any different from the schema? If we wanted to represent the entities from the above data in a PHP API, this is what it could look like.

interface UsersResponseInterface
{
    /**
     * @return iterable<UserInterface>
     */
    public function getUsers(): iterable;
}

interface UserInterface
{
    public function getId(): int;

    public function getUsername(): string;

    public function getFirstName(): string;

    public function getLastName(): string;

    /**
     * @return iterable<MetaInterface>
     */
    public function getMeta(): iterable;
}

interface MetaInterface
{
    public function getName(): string;

    public function getValue(): int|float|string|bool|null;
}

Here, each user is represented by a UserInterface instance, which is a data object that via its methods exposes the members of each "users" entry. Conceptually, it is consumed as a simple map, by knowing its exact getter method names (keys), and in no other way. The design of arbitrary metadata support follows a similar approach. For convenience, the metadata can also be represented as a map by simply augmenting the UserInterface:

interface MetaMapAwareInterface
{
    public function getMetaMap(): MapInterface;
}

interface UserInterface extends MetaMapAwareInterface
{
    public function getId(): int;

    public function getUsername(): string;

    public function getFirstName(): string;

    public function getLastName(): string;

    /**
     * @return iterable<MetaInterface>
     */
    public function getMeta(): iterable;

    // Inherits `getMetaMap()`
}

/** @var UserInterface $user */
$meta = $user->getMetaMap();
if ($meta->has('date_of_birth')) {
    echo $meta->get('date_of_birth');
}

Note: In the above example, generic syntax such as iterable<MetaInterface> may not be supported by PHPDoc natively, but it is probably supported by your IDE, and is definitely supported by Psalm.
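
One way getMetaMap() could be backed is sketched below: the metadata list is walked once and turned into a map keyed by name. The Meta class and the createMetaMap() function are hypothetical names used only here, and the value type includes null on the assumption that entries like hair_colour in the sample response must be representable. Union types require PHP 8.0 or later.

```php
interface MapInterface
{
    public function get(string $key);

    public function has(string $key): bool;
}

class Map implements MapInterface
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }

        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

interface MetaInterface
{
    public function getName(): string;

    // null is allowed so that a null value in the response can be represented.
    public function getValue(): int|float|string|bool|null;
}

// Hypothetical value object used only for this sketch.
class Meta implements MetaInterface
{
    protected $name;
    protected $value;

    public function __construct(string $name, int|float|string|bool|null $value)
    {
        $this->name = $name;
        $this->value = $value;
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function getValue(): int|float|string|bool|null
    {
        return $this->value;
    }
}

/**
 * Builds a map of metadata values keyed by metadata name.
 *
 * @param iterable<MetaInterface> $meta
 */
function createMetaMap(iterable $meta): MapInterface
{
    $values = [];
    foreach ($meta as $entry) {
        $values[$entry->getName()] = $entry->getValue();
    }

    return new Map($values);
}

$metaMap = createMetaMap([
    new Meta('date_of_birth', '1970-01-01'),
    new Meta('hair_colour', null),
]);
```

Because Map::has() relies on array_key_exists(), a key whose value is null still counts as present.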

In fact, a more generic DTO type structure could be achieved by representing every map as, for example, a MapInterface, and every list as an iterable. Since these are the only two compound types necessary, any data structure can be represented as lists of maps of lists, and so on. Following the ISP allows great flexibility: any such structure can be parsed by a uniform algorithm and preserves more type information, and any part of it can be replaced by one that comes from another source, retrieves data in a different way, or generates mock data on the fly. The meaning of the program and the logic of your DTO's consumers need not change.
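
A uniform conversion of this kind might look like the sketch below. It assumes the data comes from json_decode() in its default mode, where JSON objects become stdClass instances and JSON lists become PHP arrays, so the two shapes stay distinguishable. The toStructure() function name is hypothetical.

```php
interface MapInterface
{
    public function get(string $key);

    public function has(string $key): bool;
}

class Map implements MapInterface
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }

        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

/**
 * Recursively converts decoded JSON: objects become maps, lists become
 * iterables, and scalars or null are returned unchanged.
 */
function toStructure($value)
{
    if ($value instanceof stdClass) {
        // array_map() preserves string keys when given a single array.
        return new Map(array_map('toStructure', get_object_vars($value)));
    }

    if (is_array($value)) {
        return new ArrayIterator(array_map('toStructure', $value));
    }

    return $value;
}

$json = '{"users":[{"id":1,"username":"xedin","meta":[{"name":"date_of_birth","value":"1970-01-01"}]}]}';
$response = toStructure(json_decode($json));

$usernames = [];
foreach ($response->get('users') as $user) {
    $usernames[] = $user->get('username');
}
```

ArrayIterator is used here only as a convenient iterable; a generator or any other Traversable would satisfy consumers equally well.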

Indexing

Another very common thing PHP developers do, and which can be found in the code of most frameworks, is something like this:

/**
 * @return array<UserInterface> A list of users, by ID.
 */
function get_users(): array
{
    // Retrieves users from DB...
}

Here, the value returned by get_users() betrays the principles described in this article. While it is perfectly reasonable and valuable to have an index, an index does not imply any order, but simply the ability to reference a whole record directly, often by a combination of only some of its members. If consuming code needs an ordered set of users (for example, sorted by first_name), then it is consuming the interface of a list of users, and every user has the same significance to the consuming logic. If consuming code needs an index of users, where each user can be retrieved by their id, then it is consuming the interface of a map of users: every user has a potentially different significance, and the order is irrelevant.

Naturally, it is possible to convert a list of users to an index of users at any time by simply iterating over it. Because with this separation the index is a separate "collection" from the list, the index can even be cached separately, as databases do, but also in memory, in a file such as JSON, etc. With some additional logic, such an index can easily be used as an entity repository, which is usually unable to reliably enumerate its members.
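
Converting a list into an index could be as small as the following sketch. The User class and the indexById() function are hypothetical stand-ins; only getId() is needed to build the index.

```php
interface MapInterface
{
    public function get(string $key);

    public function has(string $key): bool;
}

class Map implements MapInterface
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }

        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

// Hypothetical minimal user, standing in for a UserInterface implementation.
class User
{
    protected $id;
    protected $firstName;

    public function __construct(int $id, string $firstName)
    {
        $this->id = $id;
        $this->firstName = $firstName;
    }

    public function getId(): int
    {
        return $this->id;
    }

    public function getFirstName(): string
    {
        return $this->firstName;
    }
}

// Consumes a list (any iterable) and produces a separate map keyed by ID.
function indexById(iterable $users): MapInterface
{
    $index = [];
    foreach ($users as $user) {
        $index[$user->getId()] = $user;
    }

    return new Map($index);
}

// The list keeps its order, here sorted by first name.
$users = [new User(2, 'John'), new User(1, 'Xedin')];

// The index offers lookup by ID and says nothing about order.
$index = indexById($users);
$john = $index->get('2');
```

The list and the index are two values with two different interfaces, so a consumer can ask for exactly the one it needs.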

Summary

Here are some practical take-aways that I would like to suggest.

  1. Achieve parity across your application layers by representing all data in a documented format with a single source of truth.
  2. It's either a map or a list. It's not both. If you think you need both, it's a good sign that your design could be simplified.
  3. Do not restrict the consumers of your APIs to using native types. Strings, lists, and maps can all be generated on the fly, in different ways, and there's no reason to limit how your consumers acquire the data they pass to your code.
  4. Observe ISP on one hand, and on the other, always depend on the narrowest type that provides the necessary interface. The array type is far too wide for most cases.

Discussion (25)

wetndusty

I love XQuery where everything is a list

Anton Ukhanev Author

I actually didn't know you could generate documents with it. Yeah, looks pretty awesome!

wetndusty

Try eXist-db.org

wetndusty

I am going to write language independent framework (based on something like html but for server side) - i hope you can help with it, from my side i can help you with XQuery & XSLT

Anton Ukhanev Author

Maybe I will. I'd be happy to read more about its goals!

wetndusty

Can't find pm here you can find my telegram by same name

Anton Ukhanev Author

That's right, there's no PM here. But there's PM in Twitter; link in my profile.

wetndusty

No twitter here in russia 🗿

Anton Ukhanev Author

Oh man, I'm really sorry to hear that. Stay strong, and don't lose sight of what is right.

wetndusty

Yep, maybe as part of aggressor country i should kill myself but i have two dogs and one cat so everything not so simple.

Anton Ukhanev Author

Don't do that, please. Dying is easy. And I'm not talking about "немножко потерпеть". But fighting and going forward is the only way. I really hope for the best in Russia, and for the Russian people. Perhaps, it is up to them to bring down this abomination.

Stay strong, buddy! And same to the people of Ukraine!

Anton Ukhanev Author

Heya, @suckup_de! Could you expand on that a little? I'm not saying anywhere that I don't use collections, but also I am not sure exactly how you mean their usage here. You can specify the type for the index and the value, yes. But I am once again unsure how this relates to what I write here.

Lars Moelleken

Heyho, if you want to be sure the "thing" needs to be of a certain type, use typed collections.

They mostly already have or you can simply add e.g. the Iterator interface, and voila. 🙂 No need for falsy phpdocs in the application logic anymore.

Anton Ukhanev Author

Oh, yea, totally. But I thought that you are either agreeing or disagreeing with what I wrote here, while it seems that you are adding a point about collections.

Abhinav Kulshreshtha

The biggest issue that I have seen among freshers and students is that they are hardwired to imagine arrays as sequential list. Most coaching institutes in India doesn't explain students about the under the hood concepts. All they are taught is that all PHP arrays are associative.

Also, PHP as a language itself, has evolved a lot from 4 -> 5 -> 7. PHP that I code today is not the same that I learned back in college. When I was in college, PHP wasn't an Object-Oriented language. The OO features added in PHP-5 were mostly a syntactic sugar over functional php. This is the reason why all the memes about PHP being dead originate from, people still imagine it as its older incarnation.

PHP Arrays are ordered hashmaps under the hood, thats why it is more powerful than conventional arrays found in Java or C, at the same time it is more frustrating for those who still visualize it as common list type array. But that doesn't mean they are slow. array type might feel like too wide, but it isn't.

from the docs

An array in PHP is actually an ordered map. A map is a type that associates values to keys. This type is optimized for several different uses; it can be treated as an array, list (vector), hash table (an implementation of a map), dictionary, collection, stack, queue, and probably more. As array values can be other arrays, trees and multidimensional arrays are also possible.

Anton Ukhanev Author

Hi!

Fair points about the evolution, and about outdated thinking. However, I never claimed that arrays are slow; just that expecting or passing an array is probably much more than what one wants, which creates the kind of problems I describe here, and hence this type is too wide.

You are stating that it is not too wide. What would you like to back that up with, given the detailed explanation of my reasoning given in this article? Which, by the way, has nothing to do with what they teach you at school, or with how PHP arrays are implemented under the hood.

Abhinav Kulshreshtha

When you say array type is wide, I am assuming you are talking about all the features, functions, etc it have on it. If that is what you were going for, then php arrays are not too wide, because it has all the functions that is expected from a map collections. Infact, look at the functions related to maps and collection in any language, Java, C, D-lang, Go-lang etc, they are all similar to whats in php. The only thing that is different, is that those languages have have both sequential list (traditional simple array) and collections (maps) whereas php doesn't have any traditional sequential array, only maps that fulfills both roles. Only reason why it feels too wide, is because of the perception of traditional arrays in other languages.

Long ago, when I read the following blogpost, I tried to be cautious around php arrays, But looking back, in past 10 years, there may have been maybe 2 niche scenarios where I even had to think about it. Infact, I have worked on some java projects where people have made complex array-utils library to implement some handy features which are basically available in php for free.

Your example in the Data representation section, is basically a proper, secure, OO way to do that, irrespective of language. That approach is very similar to design pattern used in JavaBeans or kotlin data class, albeit a bit more smaller pieces ( it may be generalization pattern? or specialization pattern? ) .

From what I read, this post covers design pattern, not php arrays.

Just re-read your summary section, you have made amazing points on design patterns, but it had nothing to do with merits and demerits of php arrays. Specifically in your last point about only providing minimal interface under ISP, That is true for all languages. For example, using JavaBean, we don't expose entire java array of a data, we create interface which implements minimum functions needed downstream, abstracting the array ( or collection ) itself. This technique is basically enforced on most OO languages I have worked with.

Anton Ukhanev Author

Oh wow, thank you for such an extensive reply!

I think we're just looking at it differently. I am referring to the array type as an interface, a collection of known common attributes of all arrays. I feel that you may be referring to the implementation.

The point I am making is that the interface of the array is comprised of multiple different smaller interfaces, which as you have correctly pointed out would be hidden behind an abstraction in an OO scenario. Depending on array - i.e. read, write, and enumerate interfaces of it, as they are inseparable from the array type - is making too many assumptions that are not useful anywhere.

For example, I have nothing against sumNumbers([1, 1, 2, 3, 5]), because an array is an easy way to create something that passes the iterable typecheck.
What I believe is wrong is requiring an array there. If array values would pass the ArrayAccess check, I would have nothing against declaring function getGreeting(ArrayAccess $user): string, and then invoking getGreeting(['first_name' => 'Santa', 'last_name' => 'Claus']). But unfortunately, it does not work, which IMHO goes a long way to reinforce my point.

What do you think?

Anton Ukhanev Author • Edited on

Another perhaps more pragmatic way of explaining this.

By depending on array in my signature, it's the same ISP issue as with depending on ArrayAccess&Countable&iterable - why would you do that? Except that a real concrete array won't pass that typehint in PHP, which is even worse, because it necessitates a userland interface for something so extremely simple.

And I'm not saying that there's really never a reason to depend on ArrayAccess&Countable&iterable. But if you are depending on 3 interfaces by depending on array, you should know well what your reason is. And that I'd be curious to know it.

Abhinav Kulshreshtha

This point of view makes sense. You are right, while arrays are one of the way to work on data, in practical real world usage, it is safer to limit exposure.

Abhinav Kulshreshtha

Now this do make sense. PHP arrays does expose the three interfaces. Although like we know, this is by design, a side-effect of emulating an array using ordered maps. Maybe this is why some frameworks prefer to abstract these. Maybe they will rewrite the internals someday, but it would be tough to do that without breaking the internet.

In a talk by Rasmus Lerdorf, he mentioned how he never intended PHP as a language that we know today, that is why all these inconsistencies in internal apis, the language quirks, etc, are all there because he created PHP to be something else. It was after the fact of PHP's popularity, that he decided to rewrite the language internals from the scratch, in a way that it doesn't break internet, yet become a proper backend language.

Anton Ukhanev Author

Maybe this is why some frameworks prefer to abstract these

Exactly. Because it is just more predictable this way.

I'm glad we have come to an agreement.
Thanks for the amusing conversation :)

Lars Moelleken

Why do you not use a Collection with TKey and TValue? 🤔 I think modern tools will support Generics, so that you can specify your index and your values. And it's a easy replacement for "array".

Abhinav Kulshreshtha

In PHP, Arrays "are" key-value collection under the hood. It's not an array in traditional sense. Which means that it exposes a lot larger functionality on data set. Therefore it is common practice to use design patterns similar to what the author has demonstrated, to minimize the exposed surface of data to users. This is the reason why Laravel has so many functions, they basically abstract the data behind, providing you individual APIs to manipulate/access the data, making sure its integrity is maintained. A similar approach was used by codeigniter, but they did allow direct access to data, so companies would make internal rules to use only framework functionality, to make sure integrity was maintained.
