Discussion on: Performance measurement of JavaScript solutions to common algorithmic questions (part 1)

Replies for: Not sure. I didn't enforce any encoding. The content of the string was "a a a a...".

The default is 'utf8', which uses a variable number of bytes per character, so naturally it's much slower.
'latin1' is 1-1 char for byte. (Obviously high Unicode will get corrupted, but if you start with a Buffer, you will always be able to get it back.)
'ascii' encodes the same as 'latin1', but the decoder drops the high bit of each byte.

> Buffer.from(Buffer.from([0xff]).toString('latin1'),'latin1')
<Buffer ff>
> Buffer.from(Buffer.from([0xff]).toString('latin1'),'ascii')
<Buffer ff>
> Buffer.from(Buffer.from([0xff]).toString('ascii'),'latin1')
<Buffer 7f>

UTF8, lol:

> Buffer.from(Buffer.from([0xff]).toString())
<Buffer ef bf bd>

Notably, V8 doesn't use UTF8 to represent strings/string chunks in memory. It uses something like UTF16 (2 or 4 bytes per character, never 1 or 3) and sometimes opportunistically optimizes to using something like 'latin1' when high characters aren't used.
So, for weird strings 'utf16le' is fastest, and for basic strings, 'latin1'/'ascii'. The default of 'utf8' is the slowest in all cases (but the most compatible and compact without compression)
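
To make the size differences concrete, here is a minimal sketch that only counts bytes per character with `Buffer.byteLength`. The sample characters and the `byteLengths` name are my own picks for illustration, and it says nothing about speed.

```javascript
// Sample characters (chosen for illustration): ASCII, Latin-1 range,
// a BMP character above 0xFF, and an astral character (surrogate pair).
const samples = ["a", "\u00e9", "\u20ac", "\u{1F4A9}"]

const byteLengths = samples.map(ch => ({
  ch,
  // utf8 is variable-length: 1 to 4 bytes per code point
  utf8: Buffer.byteLength(ch, "utf8"),
  // utf16le is 2 bytes per code unit, so 2 or 4 bytes per code point
  utf16le: Buffer.byteLength(ch, "utf16le"),
  // latin1 is 1 byte per UTF-16 code unit; anything above 0xFF is truncated
  latin1: Buffer.byteLength(ch, "latin1"),
}))

console.table(byteLengths)
```

Note that the 'latin1' column counts UTF-16 code units, not characters, which is why it round-trips only for strings that stay at or below 0xFF.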

const { log, time, timeEnd, error } = console
const announce = msg => {
  const bar = "=".repeat(msg.length)
  log(bar); log(msg); log(bar)
}

const [a, z, _] = Buffer.from("az ", "latin1")
const len = 1 + z - a
const strings = new Array(len)
for (let i = 0; i < len; i++)
  strings[i] = Buffer.alloc(0x1000000, Buffer.from([a + i, _]))
    .toString("latin1")

const badStuff = ["💩 💩 💩", "ਸਮਾਜਵਾਦ", "สังคมนิยม"]
  .map(x => x.repeat(0x200000))

const tainted = strings.map(x => "💩" + x)

const benchmarkEncoding = strings => enc => {
  let failed
  time(enc)
  for (const x of strings) {
    const out = Buffer.from(x, enc).toString(enc)
    if (out !== x && !failed) failed = { enc, x, out }
  }
  timeEnd(enc)
  if (failed)
    error(failed.enc, "failed:", failed.x.slice(0, 6), failed.out.slice(0, 6))
}
const encodings = ["utf16le", "utf8", "latin1", "ascii"]

announce("Normal")
encodings.forEach(benchmarkEncoding(strings))
announce("And now, the bad stuff!")
encodings.forEach(benchmarkEncoding(badStuff))
announce("What about tainted ASCII?")
encodings.forEach(benchmarkEncoding(tainted))

Edit: So, I was wrong. 'utf16le' is slower than 'utf8' on pure ASCII (But 3+ times faster otherwise)

Dmitry Yakimenko • Feb 27 '19

Some thorough research. Well done! One small problem: none of this really helps with the original problem =) We wanted to count letters or reverse them or whatever. It's not correct to assume any trivial encoding, like latin1 or ascii. Nor does it help to put everything in a byte buffer, even a correctly encoded sequence of bytes. But I'm sure there are use cases where it's ok to assume a 7 or 8 bit encoding, like massive amounts of base64, for example.
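
A rough sketch of why the encoding assumption matters for reversal. Both function names are hypothetical and not from the post, and the code-point version is still not grapheme-aware (combining marks and ZWJ sequences would break), so treat it as an illustration rather than a correct general reverse.

```javascript
// Reverse by pushing the string through a one-byte encoding.
// Only safe when every character fits in a single byte.
function reverseViaLatin1(str) {
  const bytes = Array.from(Buffer.from(str, "latin1"))
  return Buffer.from(bytes.reverse()).toString("latin1")
}

// Reverse by Unicode code point: Array.from iterates the string by
// code point, so surrogate pairs stay together.
function reverseByCodePoint(str) {
  return Array.from(str).reverse().join("")
}

const ascii = "a b c"
const mixed = "ab\u{1F4A9}"

// Same result on plain ASCII
console.log(reverseViaLatin1(ascii) === reverseByCodePoint(ascii))
// Different result once a character needs more than one byte
console.log(reverseViaLatin1(mixed) === reverseByCodePoint(mixed))
```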

Mihail Malo • Feb 27 '19

There's Buffer.from('ffff','base64') for that :D
I was just responding to you saying that, in your quest to count letters or reverse them or whatever, you rejected Buffer because of the conversion time.
So I was wondering if it would be more competitive, on your admittedly ASCII 'a a a a a ...' data, without the overhead of the default encoding.

Dmitry Yakimenko • Feb 27 '19 • Edited

You are right, it's much faster on Node using Buffer to reverse a string if it's known it's ASCII. I assume the same goes for any fixed character size encoding. Code:

function reverseStringBuffer(str) {
    let b = Buffer.from(str, "ascii");
    for (let i = 0, n = str.length; i < n / 2; i++) {
        let t = b[i];
        b[i] = b[n - 1 - i];
        b[n - 1 - i] = t;
    }
    return b.toString("ascii");
}

Look at the green line:

Mihail Malo • Feb 28 '19

Heeeey, that's pretty fast :v

I don't know what you're using to benchmark and draw these nice graphs, so I'm forced to bother you again:
Can you try it with Buffer.reverse()? :D
It's in place, like the Array one, and inherited from Uint8Array.
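
A quick sketch of what "in place" and "inherited" mean here, assuming a reasonably recent Node.js. The variable names are just for illustration.

```javascript
const buf = Buffer.from("abc", "ascii")
const returned = buf.reverse()

// reverse() mutates the buffer and returns the same object
console.log(returned === buf)
console.log(buf.toString("ascii"))

// Buffer instances are Uint8Arrays, which is where reverse() comes from
console.log(buf instanceof Uint8Array)
```

Because it mutates, copy the buffer first if the original byte order is still needed.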

Dmitry Yakimenko • Feb 28 '19

The code is linked at the bottom of the post: gist.github.com/detunized/17f524d7...
I draw the graphs using Google Sheets. I can add this code to my set later. I actually looked for reverse on Buffer but didn't see it. Weird.

Mihail Malo • Feb 28 '19 • Edited

Yeah, it's not listed on the Node docs, because it's inherited.
It's on MDN though.
This is over twice as fast :D

function reverseStringBuffer2(str) {
  return Buffer.from(str, "ascii").reverse().toString("ascii");
}

This was 💩 though:

function findLongestWordLengthBuffer(str) {
  const buf = Buffer.from(str, 'ascii')
  const length = buf.length
  let maxLength = 0;
  let start = 0; // index where the current word begins
  while (start <= length) {
    let found = buf.indexOf(32, start) // next space, or -1 if none
    if (found === -1) found = length   // the last word runs to the end
    const len = found - start
    if (len > maxLength) maxLength = len
    start = found + 1 // skip past the space
  }
  return maxLength
}

Just plugging a buffer into your Fast version is twice as fast on long strings, and the same or maybe a little worse on short ones:

function findLongestWordLengthBufferFast(str) {
  const buf = Buffer.from(str,'ascii')
  let l = buf.length;

  let maxLength = 0;
  let currentLength = 0;

  for (let i = 0; i < l; ++i) {
      if (buf[i] === 32) {
          if (currentLength > maxLength) {
              maxLength = currentLength;
          }
          currentLength = 0;
      } else {
          ++currentLength;
      }
  }

  // Account for the last word
  return currentLength > maxLength ? currentLength : maxLength;
}