DEV Community: goaty92

In response to "Yes, PHP Is Faster Than C#"

goaty92 — Fri, 18 Mar 2022 09:01:25 +0000

A recent blog post titled "Yes, PHP Is Faster Than C#" has sparked quite a conversation. I decided to run the tests mentioned in the post and found some interesting results, which I think are worth sharing.

The benchmark used here reads a file from the file system in 4 KiB chunks and counts the number of bytes in the file that hold the character '1'. First off, I would start by saying that I don't find this "benchmark" to be very meaningful, especially since reading files from disk is involved. There are a lot of things that can impact file-system performance (caches, the state of the disk drive, how busy the kernel is at that time), none of which is addressed in the test itself.
Nonetheless, the results do indicate some interesting performance characteristics that we can talk about.

Source code for the test can be found here: https://github.com/dhhoang/csharp-php-file-read

Small files

I generated the test file like this:

# for this test, we will use file_size of 4 MiB as specified in the original post
base64 /dev/urandom | head -c [file_size] > test.txt

The code for the PHP (8.0) program looks something like this:

function test()
{
    $file = fopen("/path/to/test.txt", 'r');
    $counter = 0;
    $timer = microtime(true);
    while ( ! feof($file)) {
        $buffer = fgets($file, 4096);
        $counter += substr_count($buffer, '1');
    }
    $timer = microtime(true) - $timer;
    fclose($file);
    printf("counted %s 1s in %s milliseconds\n", number_format($counter), number_format($timer * 1000, 4));
}
test();

And for C#:

private static void Test()
{
    using var file = File.OpenRead("/path/to/test.txt");
    var counter = 0;
    var buffer = new byte[4096];
    var numRead = 0;
    var sw = Stopwatch.StartNew();
    while ((numRead = file.Read(buffer, 0, buffer.Length)) != 0)
    {
        counter += buffer.Take(numRead).Count((x) => x == '1');
    }
    sw.Stop();
    Console.WriteLine($"Counted {counter} 1s in {sw.ElapsedMilliseconds} milliseconds");
}
Test();

The result when running on a t3.xlarge EC2 instance is as follows (note: the code is run 10 times and the runtime is averaged after removing anomalies due to a cold file cache):

Test-C#      53.2ms
Test-PHP     11.1ms

So the PHP code is about 5 times faster than the C# code!!! So it looks like PHP really is faster than C#?

Something is definitely off here. Is .NET that slow when reading a file? Probably not. I did a simple test where I removed the "counting" part in both programs, and their performance became very similar. The blog's author claimed that the test has "very little user-land code" and mainly tests the file-reading performance. I found this to be incorrect.

Now if you look closer at the 2 programs, they are very similar, except for the part where the '1' bytes are counted. PHP uses the substr_count built-in function, which is very optimized, while the C# code uses LINQ. LINQ is a very convenient way to work with collections in C#, but it is also quite slow. What if we try to just count the bytes the old-fashioned way?

private static void Test_FileStream_NoLinq()
{
...
    while ((numRead = file.Read(buffer, 0, buffer.Length)) != 0)
    {
        for (var c = 0; c < numRead; c++)
        {
            if (buffer[c] == '1')
            {
                counter++;
            }
        }
    }
...
}

Our result now is (see Test-C#-NoLinq):

Test-C#             53.2ms
Test-PHP            11.1ms
Test-C#-NoLinq      6.5ms

So at this point C# is already running much faster than before, and about 1.7 times as fast as the PHP program. This shows that the byte-counting process contributes significantly to the total run time.

So the next question is, can we do even better? When working with byte buffers, iterating through individual bytes is a pretty naive implementation. A more optimized one would utilize vectorization techniques such as SIMD. In fact, I would be very surprised if the substr_count function is not using vectorization. In order to test this, I created another PHP test function that iterates through the string instead of using substr_count, which would be comparable to our C# Test_FileStream_NoLinq function:

function test_manual_count()
{
    ...
    while ( ! feof($file)) {
        $buffer = fgets($file, 4096);
        $length = strlen($buffer);
        for ($i = 0; $i < $length; $i++) {
            if($buffer[$i]=='1'){
                $counter += 1;
            }
        }
    }
    ...
}

And the result (see Test-PHP-Manual-Count):

Test-C#-NoLinq          6.5ms
Test-PHP                11.1ms
Test-PHP-Manual-Count   135ms

That is painfully slow, which is why it's always a good idea to use substr_count when you need to count occurrences in a string. Unfortunately, C# doesn't provide a built-in method with the same functionality; however, it does offer a lot of primitives for implementing vectorization. I found a SIMD implementation of an equivalent function on StackOverflow: VectorExtensions.OccurrencesOf(ReadOnlySpan<byte>, byte). With this we can rewrite our counter:

private static void Test_FileStream_Vectorized()
{
...
    while ((numRead = file.Read(buffer, 0, buffer.Length)) != 0)
    {
        counter += buffer.AsSpan().Slice(0, numRead).OccurrencesOf((byte)'1');
    }
...
}

And the result (see Test-C#-Vectorization):

Test-C#-NoLinq          6.5ms
Test-C#-Vectorization   1.0ms

That is about 6 times faster than the manual loop and about 10x faster than PHP 😊.
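For readers curious what such a vectorized counter might look like, here is a minimal sketch built on System.Numerics.Vector<byte>. To be clear, this is not the StackOverflow implementation referenced above and not the code that produced the numbers in this post; ByteCounter and CountOccurrences are hypothetical names, and the code is untuned.

```csharp
using System;
using System.Numerics;

public static class ByteCounter
{
    // Hypothetical helper: counts how many bytes in 'data' equal 'value'.
    public static int CountOccurrences(ReadOnlySpan<byte> data, byte value)
    {
        int count = 0;
        int i = 0;
        int width = Vector<byte>.Count; // bytes processed per vector operation

        if (Vector.IsHardwareAccelerated && data.Length >= width)
        {
            var target = new Vector<byte>(value);
            int lastStart = data.Length - width; // last index where a full vector still fits

            while (i <= lastStart)
            {
                // Each byte lane can hold at most 255 matches before wrapping,
                // so process at most 255 vectors before flushing the lanes.
                var lanes = Vector<byte>.Zero;
                int blockEnd = Math.Min(lastStart, i + 254 * width);
                for (; i <= blockEnd; i += width)
                {
                    var chunk = new Vector<byte>(data.Slice(i, width));
                    // Equals yields 0xFF in matching lanes; masking with 1 gives a 0/1 increment.
                    lanes += Vector.BitwiseAnd(Vector.Equals(chunk, target), Vector<byte>.One);
                }

                for (int lane = 0; lane < width; lane++)
                {
                    count += lanes[lane];
                }
            }
        }

        // Scalar loop for the tail (or for everything when SIMD is unavailable).
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                count++;
            }
        }

        return count;
    }
}
```

Inside the read loop it would be called as counter += ByteCounter.CountOccurrences(buffer.AsSpan(0, numRead), (byte)'1'). The per-lane byte counters are the reason for the 255-vector block: flushing less often would let a lane overflow and silently undercount.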

Large file

For this test, I'm using a 3.2 GB Ubuntu ISO image. The result looks like this:

Test-PHP                3228.4ms
Test-PHP-Manual-Count   103966.7ms
Test-C#-NoLinq          5175.3ms
Test-C#-Vectorization   1104.7ms

Here we can clearly see how using vectorization makes things a lot faster for both languages.

Designing TinyURL: it's more complicated than you think

goaty92 — Mon, 10 Aug 2020 10:21:05 +0000

Recently I came across a YouTube video called "System Design : Design a service like TinyUrl", from the channel Tushar Roy - Coding Made Simple. This video discusses a common developer interview question, namely, how do you design a service like TinyURL, which allows users to turn long URLs into short ones that are just several characters long.

Basically, a TinyURL-like service would have 2 main APIs: createShort(longUrl) and getLong(shortUrl). The second one is easy: you simply need to do a lookup and return the long URL (or a 404 if none exists). The main problem is the createShort() API: how do you generate a short sequence of characters that is unique among URLs? (Note that uniqueness is an important property: we don't want different URLs to have the same shortcut.)

Tushar's proposed solutions are quite good and I think most interviewers would be satisfied with them (please watch the video before continuing to read this post). That being said, they are sort of unsatisfying. To summarize, the most sophisticated solution proposed in the video is to partition all possible short sequences into ranges, and use a set of servers that each return a monotonically increasing sequence falling within a range. Each server would be assigned only one particular range to work with, and Apache ZooKeeper is used to coordinate the sequence range assignments.
If each server has a unique range, then the servers are guaranteed to generate unique sequences.
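The range idea summarized above can be sketched in a few lines. This is only an illustration under my own assumptions: RangeIdGenerator and InMemoryRangeCoordinator are hypothetical names, and the in-memory coordinator is a toy stand-in for whatever really hands out ranges (ZooKeeper in the video). It has none of the fault tolerance that the real thing needs.

```csharp
using System;

// Hands out IDs from a leased range and leases a new range when the current one runs out.
public sealed class RangeIdGenerator
{
    // Hypothetical callback that asks the coordinator for a fresh [Start, End) range.
    private readonly Func<(long Start, long End)> _leaseRange;
    private readonly object _lock = new object();
    private long _next;
    private long _end; // exclusive upper bound of the current range

    public RangeIdGenerator(Func<(long Start, long End)> leaseRange)
    {
        _leaseRange = leaseRange;
    }

    public long NextId()
    {
        lock (_lock)
        {
            if (_next >= _end)
            {
                // No range yet, or the current one is exhausted: lease another.
                (_next, _end) = _leaseRange();
            }
            return _next++;
        }
    }
}

// Toy coordinator living in a single process. It only shows the contract:
// no two callers ever receive overlapping ranges.
public sealed class InMemoryRangeCoordinator
{
    private readonly object _lock = new object();
    private readonly long _rangeSize;
    private long _nextStart;

    public InMemoryRangeCoordinator(long rangeSize)
    {
        _rangeSize = rangeSize;
    }

    public (long Start, long End) Lease()
    {
        lock (_lock)
        {
            long start = _nextStart;
            _nextStart += _rangeSize;
            return (start, start + _rangeSize);
        }
    }
}
```

Two generators created with new RangeIdGenerator(coordinator.Lease) never collide, because uniqueness is decided entirely by the coordinator. That is exactly why the hard part of the problem lives there.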

The reason I think this answer is unsatisfying is that, while it works, it simply shifts the responsibility of generating the "unique" part of the sequence, which is the hardest part of the problem, to ZooKeeper. Instead of answering the question "how to generate a unique sequence?" (or sequence range, in this case), this solution simply says "I'll just ask ZooKeeper to give me one". But how does ZooKeeper do that?

First of all, why is it so hard to generate a unique sequence? After all, I can use a single computer to keep increasing a counter, and that would be unique, right? In fact, that solution is mentioned by Tushar in the video, but later rejected, because the counter-generating server might fail (either the machine itself crashes, or the network goes down, etc.), whereas ZooKeeper, somehow, magically provides "high availability" (i.e. it is resilient to failures).
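To make the single-counter idea concrete, here is a minimal sketch of one process turning an increasing counter into short codes. The Base62 alphabet and the names ShortCodes, Encode and NextShortCode are my own assumptions for illustration; the counter lives in memory, so everything is lost when the process dies, which is precisely the weakness discussed above.

```csharp
using System;
using System.Text;
using System.Threading;

public static class ShortCodes
{
    // Assumed alphabet: digits, then upper-case, then lower-case letters (62 symbols).
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    private static long _counter; // in-memory only, so it does not survive a crash

    // Converts a non-negative number to its Base62 representation.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException(nameof(id));
        if (id == 0) return "0";

        var sb = new StringBuilder();
        while (id > 0)
        {
            // Prepend the least significant Base62 digit.
            sb.Insert(0, Alphabet[(int)(id % 62)]);
            id /= 62;
        }
        return sb.ToString();
    }

    // Thread-safe within one process: every call gets a distinct counter value.
    public static string NextShortCode()
    {
        return Encode(Interlocked.Increment(ref _counter));
    }
}
```

With 62 symbols, 7 characters already cover 62^7 (about 3.5 trillion) distinct codes, so the encoding itself is never the bottleneck; keeping the counter unique and available is.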

And that's the gist of the problem. If I had the guarantee that my servers never fail, then I wouldn't need ZooKeeper. I probably wouldn't need multiple servers either; one beefy machine might be enough to do the job. Unfortunately, in practice machines do fail, and in fact, they fail all the time. That is why when we design systems, we design for failure. In this case, when one server in the ZooKeeper cluster fails, somehow the system needs to make sure that the others don't return a duplicate range. The only way to do that is to make all servers agree on which ranges have been given out, and which have not.

So let's try to simplify and generalize the problem: given a set of servers, how do we ensure that all servers agree on a value, even if servers might fail randomly? (The value in this case would be the range assignment.) This is known as the distributed consensus problem, which is actually one of the hardest problems in distributed systems. In fact, it has been mathematically proven that, in an asynchronous system (meaning a system where we don't know how long it takes for messages to travel between servers), no deterministic algorithm can guarantee consensus if even one server may fail. This is known as the FLP impossibility result.

Fortunately, for most systems in practice, we can work around this issue by modelling them as "partially synchronous systems", that is, we can assume a bound on how long it takes to send messages between servers. In this model, consensus is possible. There are several algorithms that can be used to reach consensus, like Paxos or Raft. ZooKeeper itself uses a consensus protocol called Zab (which stands for ZooKeeper Atomic Broadcast).

I won't get into details on how these algorithms work. After all, they are quite complicated and sometimes difficult to understand. However, if you ever need to work with them directly, an important thing to pay attention to is that they are not perfect. Raft and Paxos, for example, only work as long as fewer than half of the nodes in the system have failed. Failures also take different forms, and while Paxos and Raft work well with fail-stop and fail-safe types of failure, Byzantine failures are a lot harder to deal with.
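The "less than half" condition is easy to turn into arithmetic. The helper below is only a sketch of that rule for crash-type failures (Quorum, MajoritySize and MaxTolerableFailures are hypothetical names); it says nothing about Byzantine failures, which need different bounds.

```csharp
using System;

public static class Quorum
{
    // Smallest number of nodes that forms a strict majority of the cluster.
    public static int MajoritySize(int nodes)
    {
        if (nodes < 1) throw new ArgumentOutOfRangeException(nameof(nodes));
        return nodes / 2 + 1;
    }

    // Largest number of crashed nodes that still leaves a majority alive.
    public static int MaxTolerableFailures(int nodes)
    {
        if (nodes < 1) throw new ArgumentOutOfRangeException(nameof(nodes));
        return (nodes - 1) / 2;
    }
}
```

A 3-node cluster needs 2 nodes to agree and survives 1 failure; a 5-node cluster needs 3 and survives 2. Going from 3 to 4 nodes does not help: the majority becomes 3 and the cluster still survives only 1 failure, which is why such clusters are usually sized with an odd number of nodes.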

[P2] Writing a serialization library in C#: Performance is a feature

goaty92 — Thu, 02 Aug 2018 02:57:34 +0000

In my first blog post in the series, I talked briefly about Ion and why it offers a better serialization alternative to popular formats like JSON and XML. In the next few posts, I'll discuss some of the performance features that IonDotnet is implementing.

The reason why I say performance 'feature' is because developers often don't consider it one. And sometimes for good reason: shipping the product in a timely manner is (and should be) usually considered the most important thing. After all, people say that the first rule of optimization is: don't do it. However, when it comes to writing a library, especially one that deals with data, I believe performance should be a feature. If you're using my library and your software runs slow, I want it to be your fault, not mine 🤫.

Generally, optimization goes from architecture -> algorithm -> caching -> micro-optimization. In this case, the algorithm is pretty straightforward: most serializers operate as a state machine, writing data and updating their state depending on the input. That being said, even if you don't have to come up with some novel, brilliant algorithm, paying attention to details in the code that have nothing to do with the algorithm can still benefit you a great deal.

We'll finish this blog with a simple tip:

If you have a C# struct, ALWAYS override `Equals()`, `==` , `!=` and `GetHashCode()`.

One of the great features of C# is of course the ability to define struct value types. This has a caveat, however: by default, comparing 2 structs goes through ValueType.Equals(), which in the general case (for example, when the struct contains reference-type fields) forces the runtime to use reflection to compare all the fields within the structs.

Given how slow reflection is, if you have a struct that you intend to compare a lot, overriding Equals(), == , != and GetHashCode() will give you a huge performance benefit.
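As a minimal sketch of what "overriding everything" means, here is a small struct with all four members plus IEquatable<T>. ExampleToken and its Text and Sid fields are made up for illustration; this is not the actual SymbolToken from IonDotnet.

```csharp
using System;

// Hypothetical struct used only to illustrate the pattern.
public readonly struct ExampleToken : IEquatable<ExampleToken>
{
    public string Text { get; }
    public int Sid { get; }

    public ExampleToken(string text, int sid)
    {
        Text = text;
        Sid = sid;
    }

    // Strongly typed comparison: no boxing and no reflection.
    public bool Equals(ExampleToken other)
    {
        return Sid == other.Sid && string.Equals(Text, other.Text, StringComparison.Ordinal);
    }

    // Replaces the default ValueType.Equals implementation.
    public override bool Equals(object obj)
    {
        return obj is ExampleToken other && Equals(other);
    }

    // Must be consistent with Equals: equal values produce equal hash codes.
    public override int GetHashCode()
    {
        return HashCode.Combine(Text, Sid);
    }

    public static bool operator ==(ExampleToken left, ExampleToken right)
    {
        return left.Equals(right);
    }

    public static bool operator !=(ExampleToken left, ExampleToken right)
    {
        return !left.Equals(right);
    }
}
```

Implementing IEquatable<ExampleToken> matters as much as the overrides: generic collections such as Dictionary<TKey, TValue> and HashSet<T> pick up the typed Equals through the default equality comparer, so the struct is not boxed on every comparison.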

In the IonDotnet codebase there's a structure called SymbolToken whose instances get compared with each other a lot. I was careful enough to follow the guidelines and override all the necessary methods (as well as implement IEquatable). I was curious to see how the performance would be if I had missed that. I'm just gonna leave the benchmark result here:

// benchmark, serializing 1000 records
      Method |     Mean |     Error |    StdDev |    Gen 0 |   Gen 1 | Allocated |
------------ |---------:|----------:|----------:|---------:|--------:|----------:|
 No-override | 8.385 ms | 0.1034 ms | 0.0916 ms | 828.1250 | 93.7500 |   3.67 MB |
   Overrided | 2.319 ms | 0.0216 ms | 0.0202 ms |  31.2500 | 15.6250 | 383.37 KB |

There is also a great blog post that talks about this in detail, if you're interested.

That's it, see you next post.

[P1] Writing a serialization library in C#: The case for Ion

goaty92 — Mon, 30 Jul 2018 00:27:33 +0000

During my summer internship, I've had the chance to work with an open-source data serialization format from Amazon called Ion. Amazon describes Ion as "a richly-typed, self-describing, hierarchical format offering interchangeable binary and text representations", and currently provides libraries in Java, Python, C and JavaScript. This post is the first in a series of blogs where I ~~advertise~~ talk about the implementation of the format in C#, which is (hopefully) usable on all .NET platforms.

So why would I use this Ion thing?

TL;DR: Ion offers a type-rich, compact binary format that's efficient for parsing and also supplies a readable text format to support prototyping/development.

Try out IonDotnet on GitHub. Please don't scream at my code 😂.

Amazon has a whole page that talks about the advantages of Ion compared to other similar formats, which you can of course read if interested. In this post I will mainly discuss it from a developer's point of view.

If you have worked with software that involves more than one computer, you have probably worked with serialization before. It's the process of converting your data object into a byte sequence that can then be stored or sent to other processes/machines. The 2 most well-known serialization formats today are (of course) JSON and XML, which you have most certainly heard of.

Generally speaking, there are 2 kinds of serialization software nowadays. The first kind is what I'd call static serialization: you declare your model, then generate the code that serializes that model ahead of time. Google's Protocol Buffers and FlatBuffers, for example, do this. This method leads to extremely compact and efficient output: the serialized object contains basically zero metadata, and parsing the bytes is simply 'casting' the memory layout into the runtime object. It comes at a cost, however: since the format is static, updating your object models means re-generating the serializing code, which might result in breaking changes for existing consumers. This is an undesirable effect for systems that expect to change and evolve quickly.

On the other hand, we have formats like JSON that are more dynamic: the layout of the data is generated at runtime depending on what kind of data you put in. Being a text-based format, JSON is very readable, which is one reason why it has become popular (besides the fact that it's native to JavaScript). That being said, even as a text format, JSON has many shortcomings.

The type system

Let's say you're writing laboratory software that manages experiments and deals with an object model like this:

enum ExperimentResult
{
    Success,
    Failure,
    Unknown
}

class Experiment
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTimeOffset StartDate { get; set; }
    public TimeSpan Duration { get; set; }
    public bool IsActive { get; set; }
    public byte[] SampleData { get; set; }
    public decimal Budget { get; set; }
    public ExperimentResult Result { get; set; }
}

var experiment = new Experiment
{
    Id = 233,
    Name = "Measure performance impact of boxing",
    Duration = TimeSpan.FromSeconds(90),
    StartDate = new DateTimeOffset(2018, 07, 21, 11, 11, 11, TimeSpan.Zero),
    IsActive = true,
    Result = ExperimentResult.Failure,
    SampleData = new byte[100],
    Budget = decimal.Parse("12345.01234567890123456789")
};

Using JSON.NET, if we do JsonConvert.SerializeObject(experiment), we get:

{
  "Id": 233,
  "Name": "Measure performance impact of boxing",
  "StartDate": "2018-07-21T11:11:11+00:00",
  "Duration": "00:01:30",
  "IsActive": true,
  "SampleData": "2e36MMwesekp5vKCjNEZKyEi+mro6HfE6Q1UcxCwzguscpMX0PLV+qAvU7zlXth4+DyKrKUHjfB1Nka/yj7ZeBfm1ho9AlouTQDJuJW73os03HrTJiFlpOSjoZqsFTBiVtuk/g==",
  "Budget": 12345.01234567890123456789,
  "Result": 1
}

We can see right away that there are several data types that JSON serialization does not properly represent. For example, the ExperimentResult enum gives us "Result": 1, but this is problematic, because the consumers of this data will have difficulty understanding what 1 means as an ExperimentResult. Even worse, if you update the ExperimentResult enum and add a new member before Failure, then 1 no longer means Failure. Of course JSON.NET allows us to serialize the enum as a string:

[JsonConverter(typeof(StringEnumConverter))]
public ExperimentResult Result { get; set; }

Which will give us {"Result": "Failure"}. But even then there's still a problem (apart from the ugliness of that attribute): Result is now a string which is typically interpreted as text instead of a specifier.
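The renumbering problem with numeric enum values can be reproduced in a few lines. The sketch below uses two versions of the enum; ResultV1, ResultV2, the Cancelled member and the EnumDrift helper are all invented for illustration.

```csharp
using System;

// The enum as the producer knew it when the data was written.
public enum ResultV1 { Success, Failure, Unknown }

// A hypothetical later version: 'Cancelled' was inserted before Failure,
// which silently shifts the numeric value of every member after it.
public enum ResultV2 { Success, Cancelled, Failure, Unknown }

public static class EnumDrift
{
    // What a consumer compiled against the new enum does with a number from the wire.
    public static ResultV2 ReadByNumber(int wireValue)
    {
        return (ResultV2)wireValue;
    }

    // Reading by name keeps its meaning across versions, as long as the name still exists.
    public static ResultV2 ReadByName(string wireValue)
    {
        return (ResultV2)Enum.Parse(typeof(ResultV2), wireValue);
    }
}
```

Here (int)ResultV1.Failure is 1, but EnumDrift.ReadByNumber(1) yields ResultV2.Cancelled, while EnumDrift.ReadByName("Failure") still yields ResultV2.Failure. The name survives the change; the number does not.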

Another example is the TimeSpan Duration property. Here JSON.NET gives us the string representation in the format hh:mm:ss, but it's still a string. The intention to represent a time duration is lost.

The same goes for the DateTimeOffset, decimal and byte[] properties, for which JSON.NET has to find a workaround, most often by formatting them as a string (such as Base64-encoding the byte array). These methods often lead to a loss of meaning of the value or increase the size of the output (as with byte[]).
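The size cost of Base64 is easy to quantify: every 3 input bytes become 4 output characters, with padding at the end. The helper below is a small sketch of that arithmetic (Base64Overhead and EncodedLength are hypothetical names), checked against the framework's own Convert.ToBase64String.

```csharp
using System;

public static class Base64Overhead
{
    // Number of characters in the padded Base64 text for a payload of 'byteCount' bytes.
    public static int EncodedLength(int byteCount)
    {
        if (byteCount < 0) throw new ArgumentOutOfRangeException(nameof(byteCount));
        return ((byteCount + 2) / 3) * 4;
    }

    // Same number, obtained by actually encoding a zero-filled payload.
    public static int MeasuredLength(int byteCount)
    {
        return Convert.ToBase64String(new byte[byteCount]).Length;
    }
}
```

For the 100-byte SampleData array used in this post, both methods give 136 characters, i.e. roughly a third more than the raw bytes, before even counting the surrounding quotes in the JSON text.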

Ion offers a solution for that problem. First of all, it has more native types, including decimal (fit for monetary calculation), blob for byte sequences, and symbol for encoding enums. The full list of supported data types can be found here. The type system is also extensible with the use of annotations, which I'll talk about in a future post.

The (proper) Ion text format for the above object will look something like this:

{
  Id: 233,
  Name: "Measure performance impact of boxing",
  StartDate: 2018-07-21T11:11:11+00:00,
  Duration: seconds::90, //a time duration in seconds
  IsActive: true,
  SampleData: {{ 2e36MMwesekp5vKCjNEZKyEi+mro6HfE6Q1UcxCwzguscpMX0PLV+qAvU7zlXth4+DyKrKUHjfB1Nka/yj7ZeBfm1ho9AlouTQDJuJW73os03HrTJiFlpOSjoZqsFTBiVtuk/g== }},
  Budget: 12345.01234567890123456789,
  Result: 'Failure'
}

Let's look at the above format: the byte sequence SampleData is represented as Base64 in the text format, but will be copied as-is in the binary format. No extra encoding or decoding is required when parsing binary data. Moreover, the double braces {{ }} let us know that it is a byte[], so no meaning of the value is lost. Similarly, the enum Result is represented by a special kind of text called a symbol, which is put in single quotes '' as opposed to normal text (string), which is put within double quotes "". And yes, Ion text supports comments.

Using Ion in C#

IonDotnet is built with the goal of supporting reading and writing standard Ion data, while providing a set of APIs that's friendly to .NET developers. Therefore, using IonDotnet is much easier than using the Java counterpart. The following code serializes the Experiment object to the binary format as a byte[], and back:

byte[] ionBytes = IonSerialization.Serialize(experiment);
Experiment deserialized = IonSerialization.Deserialize<Experiment>(ionBytes);

At the time of writing, the implementation of text serialization in IonDotnet has not been completed yet. Production systems should use the binary format for its compactness; the text format is readable and can be used to support development and prototyping.

The compactness

A piece of data can be considered as containing 2 components: the actual information and the metadata (how the data should be read/parsed). JSON, being a text-based format, spends a lot of space on metadata such as field names, quotation marks ("") and brackets and braces ([], {}). The numeric representation in JSON is also not optimal: a 4-byte binary integer can take up to 10 bytes when written as decimal text (11 with a minus sign).
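The point about numbers can be checked directly. The snippet below is just a sketch of the comparison (NumberSize and TextBytes are hypothetical names): it measures how many UTF-8 bytes the decimal text of a value takes, to compare against the fixed 4 bytes of a 32-bit integer.

```csharp
using System.Globalization;
using System.Text;

public static class NumberSize
{
    // Size of a 32-bit integer in a fixed-width binary encoding.
    public const int BinaryBytes = sizeof(uint);

    // Bytes needed to write the value as decimal text, the way JSON does.
    public static int TextBytes(long value)
    {
        string text = value.ToString(CultureInfo.InvariantCulture);
        return Encoding.UTF8.GetByteCount(text);
    }
}
```

NumberSize.TextBytes(uint.MaxValue) is 10 (the text is 4294967295) and NumberSize.TextBytes(int.MinValue) is 11 because of the minus sign, versus 4 bytes in binary. Small values go the other way, of course: the number 7 takes a single byte as text.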

Static serialization formats like Protocol Buffers essentially remove all the metadata bits from the data, which makes it really compact, but also rigid and difficult to change/update. Ion seeks a balance between the two: it's more compact than JSON, but still dynamic in nature. Changing or updating your Ion format should be no harder than doing so with JSON.

Let's look at the following code, which compares the serialization size of JSON and Ion for a typical Web API (from Foursquare):


private static string GetJson(string api)
{
    using (var httpClient = new HttpClient())
    {
        var str = httpClient.GetStringAsync(api);
        str.Wait();
        return str.Result;
    }
}

// The URL is built by concatenation so that no line break or indentation ends up inside it.
var jsonString = GetJson("https://api.foursquare.com/v2/venues/explore?near=NYC"
                + "&oauth_token=IRLTRG22CDJ3K2IQLQVR1EP4DP5DLHP343SQFQZJOVILQVKV&v=20180728");

var obj = JsonConvert.DeserializeObject<RootObject>(jsonString);
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonString);
byte[] ionBytes = IonSerialization.Serialize(obj);

Console.WriteLine($"JSON size: {jsonBytes.Length} bytes");
Console.WriteLine($"ION size: {ionBytes.Length} bytes");

And the output is:

JSON size: 70920 bytes
ION size: 40675 bytes

Which is a ~43% saving in size! And it's not even the best-case scenario. Because of the way that Ion encodes data, you can save even more space by getting the two ends of the transmission to agree on the set of encoding symbols, in which case you can get rid of all the "field names" and bring the size very close to that of static serialization formats like Protobuf. This is great for high-performance scenarios such as gaming and real-time communication.