DEV Community

Hexydec

A Better User-Agent Parser for PHP

I know, for years the real nerds have been telling us not to rely on user agents to detect capabilities, but the ubiquitous user agent string still has its uses.

For one thing, it is still present in most web server logs, one of only nine fields stored there (in the Combined Log Format), and it can provide a treasure trove of information about the devices that are accessing your website.

Even as the user agent string is sort of being phased out and replaced with (in my opinion) a not too well thought out replacement, it is still going to hang around for a while. Certainly bots and crawlers will still use it to announce themselves.

Server Logs

I am currently working on some software that provides analytics based on web server logs. It will be different to the sort of analytics you get from front-end JavaScript trackers, in that it will analyse all your website traffic: crawlers, assets and all.

Here the user agent string is critical in segmenting the data into users and robots, desktop and mobile, search engine or commercial crawler.

The software is written in PHP, and has a scheduler to process the server logs at regular intervals into a MySQL database. For each row in the logs, the user agent must be processed to determine information about the session.
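As a rough illustration of that per-row step, here is a minimal sketch that pulls the user agent out of a single log line. It assumes the logs use Apache's Combined Log Format (the layout that includes the user agent); the function name `parseLogLine()` and the sample line are made up for this example, and escaped quotes inside fields are not handled.

```php
<?php
declare(strict_types=1);

/**
 * Split one Combined Log Format line into named fields.
 * Returns null when the line does not match the expected layout.
 * Hypothetical helper: a sketch, not the log software's real code.
 */
function parseLogLine(string $line): ?array
{
    // host ident user [time] "request" status bytes "referer" "user-agent"
    $pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }
    return [
        'host'      => $m[1],
        'time'      => $m[4],
        'request'   => $m[5],
        'status'    => (int) $m[6],
        'bytes'     => $m[7] === '-' ? 0 : (int) $m[7], // "-" means no body was sent
        'referer'   => $m[8],
        'userAgent' => $m[9],
    ];
}

// Invented sample line using a documentation-range IP address
$line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" '
    . '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"';

$row = parseLogLine($line);
echo $row['userAgent'], PHP_EOL;
```

The `userAgent` field is the value that then has to be handed to a user agent parser for every row.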

Initially I was using the built-in PHP function get_browser(), which it turns out is fraught with problems.

The Problem

The first issue is that it requires a file called browscap.ini, which doesn't come with PHP and must be regularly updated.

The second is that the browscap project has slowed its updates in recent years, and with vendors now knocking out a new version of their browsers every month, it often fails to detect current browsers.

The third is speed. Due to the liberal use of regular expressions and the tens of thousands of patterns stored in the browscap.ini file, the performance of the get_browser() function isn't great: fine for one or two lookups, but when you need to do hundreds, you start to notice.

Indeed, when reading the server logs in my software project, processing can chug along at as little as 50 rows per second, whereas when this function is commented out, it powers along at well over 1,000 rows per second.
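If you want to check this kind of slowdown on your own setup, a small timing harness is enough. The sketch below is an assumption about how one might measure it, not the code used for the figures above: `lookupsPerSecond()` and the sample strings are hypothetical, and `get_browser()` is only called when the `browscap` ini directive is actually configured.

```php
<?php
declare(strict_types=1);

/**
 * Time a lookup callable over a list of UA strings and return lookups per second.
 * Hypothetical helper for illustration.
 */
function lookupsPerSecond(callable $lookup, array $userAgents, int $repeat = 100): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $repeat; $i++) {
        foreach ($userAgents as $ua) {
            $lookup($ua);
        }
    }
    $seconds = (hrtime(true) - $start) / 1e9;
    return (count($userAgents) * $repeat) / max($seconds, 1e-9);
}

$samples = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
];

// get_browser() only works when the browscap directive points at a browscap.ini file
if (ini_get('browscap')) {
    $rate = lookupsPerSecond(fn (string $ua) => get_browser($ua, true), $samples, 10);
    printf("get_browser(): %.0f lookups/s\n", $rate);
} else {
    echo "browscap is not configured, skipping get_browser()\n";
}
```

Passing a different callable to the same function gives a like-for-like comparison between parsers on the same sample strings.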

Building a Solution

As a software engineer, my brain kicked into gear on how to solve these problems in a simple yet future-proof way.

The core problem, in my mind, is trying to match the whole string. It would be simpler and more flexible to extract smaller patterns of data out of the string and use them to build up the information it contains, such as browser, platform, and device name.

In this way, if the parser happened upon a new combination of features within the UA string, it could still make a best effort to interpret it. This gives it a level of resilience, and means the config doesn't have to be updated every time a new device or browser version is released.
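To make the idea concrete, here is a minimal sketch of matching small features independently instead of whole strings. The lookup tables, the function names and the sample UA string are all invented for illustration and are not AgentZero's real config or API.

```php
<?php
declare(strict_types=1);

// Hypothetical feature tables (substring => value), not AgentZero's real config
const PLATFORMS = ['Windows NT' => 'Windows', 'Android' => 'Android', 'Mac OS X' => 'macOS'];
const BROWSERS  = ['Firefox/' => 'Firefox', 'Chrome/' => 'Chrome'];
const DEVICES   = ['Pixel' => 'Google Pixel', 'SM-' => 'Samsung Galaxy'];

/** Return the value for the first substring in $table found in $ua, or null. */
function firstMatch(string $ua, array $table): ?string
{
    foreach ($table as $needle => $value) {
        if (str_contains($ua, $needle)) {
            return $value;
        }
    }
    return null;
}

/** Build up the result one feature category at a time. */
function describe(string $ua): array
{
    return [
        'platform' => firstMatch($ua, PLATFORMS),
        'browser'  => firstMatch($ua, BROWSERS),
        'device'   => firstMatch($ua, DEVICES),
    ];
}

// A made-up combination that no table lists as a whole still resolves piece by piece
print_r(describe('Mozilla/5.0 (Android 14; Mobile; Pixel 9; rv:130.0) Gecko/130.0 Firefox/130.0'));
```

Because each category is looked up on its own, adding a new device only means adding one entry to one table, whatever browser or platform it turns up with.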

Introducing AgentZero

It was actually quite quick to get a really basic version of my UA parser working. By looking at some common UA strings and matching the patterns within them, the software was able to extract quite detailed information.

Then the sprawl happened: it turns out there are quite a lot of browsers and platforms and devices and architectures! But this did not deter me. I got more organised with the config files that contain the patterns, and started writing some unit tests to make sure the output was consistent.

I devised a first come, first served system, where the order of the captured patterns is significant. For example, the Edge browser uses Chrome's rendering engine, and always presents the Chrome string with its version number. If it were the Chrome browser, Chrome would be listed in the returned features, but because the Edge pattern appears first, the browser fields are filled with the Edge information, and when the Chrome string is later matched, it does not overwrite them.

This allows each matched pattern to fill in the blanks as features are matched, even if there is crossover, providing the most significant information first and falling back to more common patterns.
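The ordering rule can be sketched in a few lines. This is a simplified illustration under my own assumptions, not AgentZero's actual code: the `PATTERNS` table and `fillInBlanks()` are hypothetical, and the UA string is split on spaces only. PHP's `??=` operator does the "fill in the blanks" part, because it only assigns when the field is still null.

```php
<?php
declare(strict_types=1);

// Hypothetical ordered pattern list, not AgentZero's real config.
// Earlier entries take priority over later ones.
const PATTERNS = [
    'Edg/'    => ['browser' => 'Edge',   'engine' => 'Blink'],
    'Chrome/' => ['browser' => 'Chrome', 'engine' => 'Blink'],
];

function fillInBlanks(string $ua): array
{
    $result = ['browser' => null, 'browserversion' => null, 'engine' => null];
    $tokens = explode(' ', $ua);

    foreach (PATTERNS as $prefix => $fields) {
        foreach ($tokens as $token) {
            if (!str_starts_with($token, $prefix)) {
                continue;
            }
            // The version is whatever follows the prefix in the matched token
            $fields['browserversion'] = substr($token, strlen($prefix));
            foreach ($fields as $key => $value) {
                $result[$key] ??= $value; // only fills fields that are still null
            }
            break;
        }
    }
    return $result;
}

$edge = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
    . 'Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.46';
print_r(fillInBlanks($edge));
```

For the Edge string the `Chrome/` pattern still matches, but by then every field it could provide has already been filled by the `Edg/` entry, so the Edge values win. Without the `Edg/` token, the same string falls back to Chrome.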

AgentZero Performance

There are currently around 360 strings that can be matched in order to extract the information from a UA string, and this will likely increase in the future as more devices, apps, and browsers are supported.

Currently on my laptop the information can be extracted in around 0.03s on average, which is a big leap in performance compared to the get_browser() function, whilst also providing more granular information.

This difference in performance comes from the design of the program architecture. By looking for simple string features, regular expressions are avoided, and fewer strings need to be compared, because each feature is matched individually rather than every possible full UA string.

Indeed, the only regular expression used is the one that tokenises the UA string; everything else is matched with string functions.
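As an illustration of that split between one tokenising regex and plain string functions, here is a minimal sketch. The regular expression is my own guess at a workable tokeniser, not the one AgentZero uses, and `tokenise()` is a hypothetical name. It is deliberately simple, so a token such as `Mobile` before `Safari/` would stay attached to the preceding product.

```php
<?php
declare(strict_types=1);

/**
 * Split a UA string into tokens with one regular expression.
 * Splits on brackets, semicolons and commas, and on whitespace that
 * precedes a "Product/version" token, so "Windows NT 10.0" stays whole.
 * Hypothetical sketch, not AgentZero's tokeniser.
 */
function tokenise(string $ua): array
{
    $tokens = preg_split('/\s*[();,]\s*|\s+(?=[\w.-]+\/)/', $ua, -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

$ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36';

$info = [];
foreach (tokenise($ua) as $token) {
    // From here on only plain string functions are needed
    if (str_starts_with($token, 'Windows NT ')) {
        $info['platform'] = 'Windows';
        $info['platformversion'] = substr($token, strlen('Windows NT '));
    } elseif (str_starts_with($token, 'Chrome/')) {
        $info['browser'] = 'Chrome';
        $info['browserversion'] = substr($token, strlen('Chrome/'));
    }
}
print_r($info);
```

Prefix checks with `str_starts_with()` and `substr()` are cheap compared with running thousands of regular expressions against the full string, which is where the speed-up described above would come from.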

Ensuring Consistency and Correctness

Firstly, UA strings are tricky: some give lots of information, and others hardly any. My library is designed to extract a core set of information, not every possible detail that is presented in the string. Even so, it is quite comprehensive.

In order to prove it works consistently over a broad range of UA strings, I created a test suite for it with PHPUnit. The test suite is broken down into feature categories and then there are test methods for each feature with a couple of example UA strings to test against.

Each test checks all of the result data, not just the feature under test, which provides a lot of crossover between the suites. All in all, there are nearly 400 user agent strings in the test suite, providing good coverage across all the target platforms, browsers, devices, crawlers etc.

Next Steps

There is still a bit more work to do to test the performance of AgentZero in the wild. I have already started integrating it into my server logs software, and have upgraded the user agent database to take the extra information, whilst also using it to identify UA strings that didn't get matched fully.

The results are already very promising with the server log software able to import at over 1,000 rows per second, and there are more metrics to segment the data with.

AgentZero will also enable me to develop features in the software that would not have been possible with get_browser(), such as showing device vendors and models, rendering engines and applications, or segmenting the robots into categories.

Availability

I have made AgentZero a free and open-source user-agent parser for PHP, available for you to download and try in your own projects. It is MIT licensed, so you can pretty much do what you want with it.

I built the software with my server logs project in mind, but I wanted to make it comprehensive, robust, and well documented so that others could use it too.

If you download AgentZero and give it a try, or you just found this article interesting, please leave some feedback in the comments below 👇.
