Grew up in Russia, lived in the States, moved to Germany, sometimes live in Spain. I program since I was 13. I used to program games, maps and now I reverse engineer password managers and other stuff
Location
Berlin and Málaga
Education
MS in CS from State Polytechnic University of St. Petersburg
The default is 'utf8', which uses variable-length characters, so naturally it's much slower. 'latin1' is a 1:1 char-for-byte mapping (obviously high Unicode will get corrupted, but if you start with a Buffer, you will always be able to get it back). 'ascii' encodes the same as 'latin1', but the decoder drops the highest bit of each byte.
Notably, V8 doesn't use UTF8 to represent strings/string chunks in memory. It uses something like UTF16 (2 or 4 bytes per character, never 1 or 3) and sometimes opportunistically optimizes to using something like 'latin1' when high characters aren't used.
So, for weird strings 'utf16le' is fastest, and for basic strings, 'latin1'/'ascii'. The default of 'utf8' is the slowest in all cases (but the most compatible and compact without compression)
UTF8, lol:

```javascript
const { log, time, timeEnd, error } = console

const announce = msg => {
  const bar = "=".repeat(msg.length)
  log(bar); log(msg); log(bar)
}

const [a, z, _] = Buffer.from("az ", "latin1")
const len = 1 + z - a
const strings = new Array(len)
for (let i = 0; i < len; i++)
  strings[i] = Buffer.alloc(0x1000000, Buffer.from([a + i, _])).toString("latin1")

const badStuff = ["💩 💩 💩", "ਸਮਾਜਵਾਦ", "สังคมนิยม"].map(x => x.repeat(0x200000))
const tainted = strings.map(x => "💩" + x)

const benchmarkEncoding = strings => enc => {
  let failed
  time(enc)
  for (const x of strings) {
    const out = Buffer.from(x, enc).toString(enc)
    if (out !== x && !failed) failed = { enc, x, out }
  }
  timeEnd(enc)
  if (failed) error(failed.enc, "failed:", failed.x.slice(0, 6), failed.out.slice(0, 6))
}

const encodings = ["utf16le", "utf8", "latin1", "ascii"]
announce("Normal")
encodings.forEach(benchmarkEncoding(strings))
announce("And now, the bad stuff!")
encodings.forEach(benchmarkEncoding(badStuff))
announce("What about tainted ASCII?")
encodings.forEach(benchmarkEncoding(tainted))
```
Edit: So, I was wrong. 'utf16le' is slower than 'utf8' on pure ASCII (but 3+ times faster otherwise).
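For what it's worth, the byte-level difference between 'latin1' and 'ascii' can be sketched like this (a quick illustration, assuming Node's standard Buffer API):

```javascript
// 'latin1' maps every byte 1:1 to a character, so arbitrary bytes always round-trip.
const bytes = Buffer.from([0x61, 0xe9]); // 'a' plus a high byte
const roundTripped = Buffer.from(bytes.toString('latin1'), 'latin1');
console.log(roundTripped.equals(bytes)); // true

// 'ascii' masks off the highest bit when decoding, so high bytes don't survive:
console.log(Buffer.from([0xe9]).toString('latin1')); // 'é'
console.log(Buffer.from([0xe9]).toString('ascii'));  // 'i' (0xe9 & 0x7f === 0x69)
```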
Some thorough research. Well done! One small problem with this: none of it really helps with the original problem =) We wanted to count letters or reverse them or whatever. It's not correct to assume any trivial encoding, like latin1 or ascii. Nor does it help to put everything into a byte buffer, even as a correctly encoded sequence of bytes. But I'm sure there are use cases where it's OK to assume a 7- or 8-bit encoding, like massive amounts of base64, for example.
There's Buffer.from('ffff','base64') for that :D
I was just responding to you saying that in your quest to count letters or reverse them or whatever, you rejected Buffer because of the conversion time.
So I was wondering if it would be more competitive, on your admittedly ASCII 'a a a a a ...' data, without the overhead of the default encoding.
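The base64 shortcut mentioned above skips character decoding entirely; a minimal sketch:

```javascript
// Decode base64 text straight into raw bytes, no string decoding involved.
const buf = Buffer.from('aGVsbG8=', 'base64');
console.log(buf.length);           // 5 bytes
console.log(buf.toString('utf8')); // 'hello'

// And back again:
console.log(buf.toString('base64')); // 'aGVsbG8='
```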
You are right, it's much faster on Node to reverse a string using Buffer if it's known to be ASCII. I assume the same goes for any fixed-character-size encoding. Code:
I don't know what you're using to benchmark and draw these nice graphs, so I'm forced to bother you again:
Can you try it with Buffer.reverse()? :D
It's in place, like the Array one, and inherited from Uint8Array.
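A minimal sketch of that in-place behavior, assuming Node's standard Buffer:

```javascript
// reverse() comes from Uint8Array.prototype and mutates the buffer in place.
const buf = Buffer.from('abc', 'ascii');
const returned = buf.reverse();
console.log(buf.toString('ascii')); // 'cba'
console.log(returned === buf);      // true: same object, reversed in place
```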
The code is mentioned in the bottom of the post: gist.github.com/detunized/17f524d7...
I draw the graphs using Google Sheets. I can add this code to my set later. I actually looked for reverse on Buffer but didn't see it. Weird.
Just plugging a buffer into your Fast is twice as fast on long strings, and about the same or maybe a little worse on short ones:
```javascript
function findLongestWordLengthBufferFast(str) {
  const buf = Buffer.from(str, 'ascii')
  let l = buf.length
  let maxLength = 0
  let currentLength = 0
  for (let i = 0; i < l; ++i) {
    if (buf[i] === 32) { // space
      if (currentLength > maxLength) {
        maxLength = currentLength
      }
      currentLength = 0
    } else {
      ++currentLength
    }
  }
  // Account for the last word
  return currentLength > maxLength ? currentLength : maxLength
}
```
You can preallocate memory using the Buffer class if you're in a node environment.
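For context, a sketch of what preallocation looks like with Buffer (the sizes here are arbitrary):

```javascript
// Buffer.alloc zero-fills the preallocated memory; Buffer.allocUnsafe skips
// the fill, which is faster but hands back uninitialized bytes.
const zeroed = Buffer.alloc(1024);
console.log(zeroed[0]); // 0, guaranteed

const raw = Buffer.allocUnsafe(1024);
raw.fill(0); // must initialize before trusting the contents
console.log(raw.equals(zeroed)); // true after filling
```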
I measured it, the conversion to string afterwards makes it quite slow. Slower than the alternatives.
Did you try latin1 or ascii conversion (both ways)?
Not sure. I didn't enforce any encoding. The content of the string was "a a a a...".
Look at the green line:
Heeeey, that's pretty fast :v
Yeah, it's not listed on the Node docs, because it's inherited.
It's on MDN though.
This is over twice as fast :D