DEV Community

Should a modern programming language assume a byte is 8-bits in size?

edA‑qa mort‑ora‑y on March 22, 2018

I'm in the process of creating a byte_buffer fundamental type for Leaf. This is an unsafe buffer needed for compatibility with C APIs. I have a kin...
edA‑qa mort‑ora‑y

Yes, for something like integer Leaf defines a default size based on the natural word type. You can request specific sizes if you'd like as well, such as integer 16bit, or even integer 23bit if you want.

float also has a default (64-bit usually), but allows the high or low modifiers.

There's also an octet for an 8-bit binary -- note that binary and integer are handled differently, unlike, say, signed vs. unsigned in other languages.

So the question for me is whether byte is an alias for octet, or an alias for binary Nbit where N is platform-dependent.

Erebos Manannán

Random thought along these lines, I really prefer my numbers to not have a fixed size as you're almost guaranteed to run into issues with them eventually.

Whenever possible I'd like the default common number type to be limitless, automatically growing to contain whatever you want to store in it. Then on top of that you can have optimized types for those special cases where you really do want an int8.

The vast majority of uses don't have performance constraints tight enough to justify micro-optimizing such things, but there are a large number of cases where, even after careful thinking, you can end up with bugs due to a fixed bit length, especially given a decade or two of progress.

Vinay Pai

I have exactly the opposite preference. At least for the kind of code I normally write it's exceptionally rare for most numbers to grow without bound and come remotely close to overflowing... especially 64 bit numbers. It would be a huge waste for every counter, every database ID, every number everywhere to be handled with arbitrary precision rather than native types.

Erebos Manannán

"Exceptionally rare" sounds like you'd practically never take it into consideration and thus can end up with related bugs -> an unlimited type would be the better default to avoid accidents.

When you KNOW you're fine with an int64 like for auto-increment database IDs, loop counters, etc., then you can still use that. How much is the typical application going to be slowed down by using an arbitrary precision integer over native types? Zilch, you will never even notice a few extra CPU cycles going on.

For those people who do performance-critical work and performance optimization it's fine to offer the optimized types, but they should be opt-in when appropriate, not a source of the occasional "gotcha, you didn't think this would ever wrap now did you?"

Erebos Manannán

In short I guess you could summarize my stance as: "programming languages should by default empower conveniently getting things done with minimal chance for nasty surprises"

Computers are incredibly powerful nowadays compared to the 80286 days, and it would be better to have programming languages make it as easy as possible for you to write programs that do what you want, rather than programs that save a few CPU cycles in return for micro-optimizations that nobody will notice until they hit a rare bug with them.

Again, no need to remove the option for optimized types for those who need them, but the vast majority of programming nowadays tends to be about trying to make the correct thing happen without bugs rather than optimizing the number of CPU cycles that it takes.

Vinay Pai

I can't think of the last time I had a bug caused by integer overflow. Does that happen to you a lot?

The cost of using arbitrary precision everywhere is way more than "a few cycles". You're adding overhead all over the place. Incrementing a number goes from a simple machine instruction that can be pipelined to a loop which will likely result in a pipeline stall.

You can't just allocate an array of integers because you don't know how much memory that will need. Increment the nth element of an array? That is potentially an O(n) operation now because it might increase in size. Or your array of integers could actually be an array of pointers to the structs holding your arbitrary precision integers. That DRASTICALLY slows things down on modern processors because your memory references have worse locality so your all-important L1 cache hit ratio goes down the tubes.

It's like making airliners fly at 10,000 feet instead of 30,000 feet to avoid the risk of cabin depressurization.

Dustin King

It sounds like you already think it's "incorrect" to make that assumption.

On the other hand, languages sometimes have to make compromises based on the existing or expected user base and developer community. I don't know what those are like for Leaf.

edA‑qa mort‑ora‑y

I do lean towards thinking byte == 8-bits is wrong. That's the latent C learning in my background, plus the thinking that hardware could ultimately change.

For the most part users of Leaf will never see this byte type, only those doing integration with non-Leaf libraries or direct memory access.

jakebman

If your byte type is supposed to be for integration with non-Leaf libraries, I believe you should base it on the specifications for those libraries.

If you expect the integration to be via C libraries, then I believe you should base your specification of byte on exactly what C says it is.

It will be better for your interoperability with C if you can say

byte - The specification of a byte according to all laws and regulations of C's implementation on whatever platform you're running on.

rather than

byte - Mostly what C says it is. Except when it isn't, in which case you will have massive headaches and have to implement a lot of workarounds both in your code and any interop layers you might have.

edA‑qa mort‑ora‑y

Yes, I already have a series of abi_ types, so abi_byte makes sense. But C-integration is only part of the story, there are still some low-level cases that require the same concept as a byte. Those are quite low level though, so still ABI relevant. Perhaps just an abi_byte isn't a bad compromise.

Nested Software

I don't do systems programming, so I'm curious about what a typical use case looks like here. Would this be used by people who want to use existing C libraries with the Leaf language?

If there isn't much cost or some other problem with making things parametric, that would seem to be the natural way to go... Instead of bytebuffer, can you have a membuffer where you specify the size of each item?

PS: I am not totally sure, but it seems that even in C, a byte is not defined as 8 bits. It's at least 8 bits.

edA‑qa mort‑ora‑y

The immediate case I'm trying to solve is a C-union used in the SDL library. To solve it I'm going to create a byte-buffer large enough to hold all the types and allow casting it to the correct type.

Hopefully byte doesn't come up often in Leaf, only for integration and low-level work. There's no real cost associated with supporting it, other than having to explain that byte != octet. There's also no real way of enforcing this either, since on all current OS's they will be the same type. Making them strongly different types might be quite inconvenient, but it's something I'm considering.

That is, even if the same size, don't allow byte <-> octet conversions without explicit casting.

Nested Software

Hmm. Since C defines the result of sizeof as a number of 'bytes' (but not necessarily octets, as you pointed out), I guess it makes sense to do the same for code whose job is to help interface with C code. You could try to call it something else, but I'm finding it hard to think of something better. 'sizeof_type'?

edA‑qa mort‑ora‑y

byte is the term that means this in C/C++, and historically is the correct term as well. It's only recently that it's become an alias for 8-bits. I don't think I'd like to introduce a new term.

Leaf will also have to have a sizeof that returns a number of bytes. I guess it won't be so unnatural to have mismatched values though, since you can use arbitrary bit sizes anyway:

  • sizeof(integer 7bit) = 1
  • sizeof(integer 8bit) = 1
  • sizeof(integer 9bit) = 2
tanjent

"1 byte = 8 bits" is something that's been included in ISO standard docs for two decades. Historically, sure, it was a hardware-specific size, but those days are long gone. So yes, your "byte_buffer" really needs to be addressable in 8-bit chunks.

I also think there's very little application for data types tied that closely to the hardware definition for most development nowadays. As such I'd avoid using the word "byte" unless it's a hardware-driven implementation.

edA‑qa mort‑ora‑y

There are apparently still some DSPs that don't use 8-bit bytes. It's in the area of embedded and special-purpose hardware where I'm most concerned about odd bit sizes.

The ISO docs only apply to languages that state they follow them. I'm not sure there's any actual mandated computing standard saying 8-bit bytes are required for all technology.

Other than direct memory access, OS functions, and foreign libraries, there will be no use for this byte type in Leaf. There is an appropriate octet type when dealing with files and networks.

tanjent

I agree "byte = 8 bits" isn't absolutely mandated by any standards organization. I probably wasn't clear enough in my previous comment, apologies.

And I also recognize that the small percentage of people who understand the true definition of "byte" also are the folks who are most likely to actually need a 6-bit or 9-bit data type to match their non-commodity hardware.

Seems it comes down to a trade-off... (a) Use "byte" as the name for your flexible address size, which will make the experts happy and maybe confuse the newer developers.... or (b) Use some other term for the flexible address size, which will avoid confusing the newer devs, but will make the experts ask "why didn't you call it a byte."

Miff

Technically, while most numeric types in C# do that (int = System.Int32, ulong = System.UInt64), the eight-bit types are System.Byte and System.SByte. (If they followed the convention they'd be UInt8 and Int8 respectively.)

Then again C# also assumes a char is 16 bits...