DEV Community

Should a modern programming language assume a byte is 8-bits in size?

edA‑qa mort‑ora‑y on March 22, 2018

I'm in the process of creating a byte_buffer fundamental type for Leaf. This is an unsafe buffer needed for compatibility with C APIs. I have a kin...
edA‑qa mort‑ora‑y

Yes, for something like integer Leaf defines a default size based on the natural word type. You can request specific sizes if you'd like as well, such as integer 16bit, or even integer 23bit if you want.

float also has a default (64-bit usually), but allows the high or low modifiers.

There's also an octet for an 8-bit binary -- note that binary and integer are handled differently, unlike, say, signed vs. unsigned in other languages.

So the question for me is whether byte is an alias for octet, or an alias for binary Nbit where N is platform-dependent.

Erebos Manannán

Random thought along these lines, I really prefer my numbers to not have a fixed size as you're almost guaranteed to run into issues with them eventually.

Whenever possible I'd like the default common number type to be limitless, automatically growing to contain whatever you want to store in it. Then on top of that you can have optimized types for those special cases where you really do want an int8.

The vast majority of uses don't have performance constraints tight enough to justify micro-optimizing such things, but there are a large number of cases where, even after careful thinking, you can end up with bugs due to a fixed bit length, especially given a decade or two of progress.

Vinay Pai

I have exactly the opposite preference. At least for the kind of code I normally write it's exceptionally rare for most numbers to grow without bound and come remotely close to overflowing... especially 64 bit numbers. It would be a huge waste for every counter, every database ID, every number everywhere to be handled with arbitrary precision rather than native types.

Erebos Manannán

"Exceptionally rare" sounds like you'd practically never take it into consideration and thus can end up with related bugs -> an unlimited type would be the better default to avoid accidents.

When you KNOW you're fine with an int64 like for auto-increment database IDs, loop counters, etc., then you can still use that. How much is the typical application going to be slowed down by using an arbitrary precision integer over native types? Zilch, you will never even notice a few extra CPU cycles going on.

For those people who do performance-critical work and performance optimization it's fine to offer the optimized types, but they should be opt-in when appropriate, not a source of the occasional "gotcha, you didn't think this would ever wrap now did you?"

Erebos Manannán

In short I guess you could summarize my stance as: "programming languages should by default empower conveniently getting things done with minimal chance for nasty surprises"

Computers are incredibly powerful nowadays compared to the 80286 days, and it would be better to have programming languages make it as easy as possible for you to write programs that do what you want, rather than programs that save a few CPU cycles in return for micro-optimizations that nobody will notice until they hit a rare bug with them.

Again, no need to remove the option for optimized types for those who need them, but the vast majority of programming nowadays tends to be about trying to make the correct thing happen without bugs rather than optimizing the number of CPU cycles that it takes.

Vinay Pai

I can't think of the last time I had a bug caused by integer overflow. Does that happen to you a lot?

The cost of using arbitrary precision everywhere is way more than "a few cycles". You're adding overhead all over the place. Incrementing a number goes from a simple machine instruction that can be pipelined to a loop which will likely result in a pipeline stall.

You can't just allocate an array of integers because you don't know how much memory that will need. Increment the nth element of an array? That is potentially an O(n) operation now because it might increase in size. Or your array of integers could actually be an array of pointers to the structs holding your arbitrary precision integers. That DRASTICALLY slows things down on modern processors because your memory references have worse locality so your all-important L1 cache hit ratio goes down the tubes.

It's like making airliners fly at 10,000 feet instead of 30,000 feet to avoid the risk of cabin depressurization.

Dustin King

It sounds like you already think it's "incorrect" to make that assumption.

On the other hand, languages sometimes have to make compromises based on the existing or expected user base and developer community. I don't know what those are like for Leaf.

edA‑qa mort‑ora‑y

I do lean towards thinking byte == 8-bits is wrong. That's the latent C learning in my background, plus the thinking that hardware could ultimately change.

For the most part users of Leaf will never see this byte type, only those doing integration with non-Leaf libraries or direct memory access.

jakebman

If your byte type is supposed to be for integration with non-Leaf libraries, I believe you should base it on the specifications for those libraries.

If you expect the integration to be via C libraries, then I believe you should base your specification of byte on exactly what C says it is.

It will be better for your interoperability with C if you can say

byte - The specification of a byte according to all laws and regulations of C's implementation on whatever platform you're running on.

rather than

byte - Mostly what C says it is. Except when it isn't, in which case you will have massive headaches and have to implement a lot of workarounds both in your code and any interop layers you might have.

edA‑qa mort‑ora‑y

Yes, I already have a series of abi_ types, so abi_byte makes sense. But C-integration is only part of the story, there are still some low-level cases that require the same concept as a byte. Those are quite low level though, so still ABI relevant. Perhaps just an abi_byte isn't a bad compromise.

Nested Software

I don't do systems programming, so I'm curious about what a typical use case looks like here. Would this be used by people who want to use existing C libraries with the Leaf language?

If there isn't much cost or some other problem with making things parametric, that would seem to be the natural way to go... Instead of bytebuffer, can you have a membuffer where you specify the size of each item?

PS: I am not totally sure, but it seems that even in C, a byte is not defined as 8 bits. It's at least 8 bits.

edA‑qa mort‑ora‑y

The immediate case I'm trying to solve is a C-union used in the SDL library. To solve it I'm going to create a byte-buffer large enough to hold all the types and allow casting it to the correct type.

Hopefully byte doesn't come up often in Leaf, only for integration and low-level work. There's no real cost associated with supporting it, other than having to explain that byte != octet. There's also no real way of enforcing this either, since on all current OS's they will be the same type. Making them strongly different types might be quite inconvenient, but it's something I'm considering.

That is, even if the same size, don't allow byte <-> octet conversions without explicit casting.

Nested Software

Hmm. Since C defines the result of sizeof as a number of 'bytes' (but not necessarily octets, as you pointed out), I guess it makes sense to do the same for code whose job is to help interface with C code. You could try to call it something else, but I'm finding it hard to think of something better. 'sizeof_type'?

edA‑qa mort‑ora‑y

byte is the term that means this in C/C++, and historically is the correct term as well. It's only recently that it's become an alias for 8-bits. I don't think I'd like to introduce a new term.

Leaf will also have to have a sizeof that returns a number of bytes. I guess it won't be so unnatural to have mismatched values though, since you can use arbitrary bit sizes anyway:

  • sizeof(integer 7bit) = 1
  • sizeof(integer 8bit) = 1
  • sizeof(integer 9bit) = 2
tanjent

"1 byte = 8 bits" is something that's been included in ISO standard docs for two decades. Historically, sure, it was a hardware-specific size, but those days are long gone. So yes, your "byte_buffer" really needs to be addressable in 8-bit chunks.

I also think there's very little application for data types tied that closely to the hardware definition for most development nowadays. As such I'd avoid using the word "byte" unless it's a hardware-driven implementation.

edA‑qa mort‑ora‑y

There are apparently still some DSPs that don't use 8-bit bytes. It's in the area of embedded and special-purpose hardware where I'm most concerned about odd bit sizes.

The ISO docs only apply to languages that state they follow them. I'm not sure there's any actual mandated computing standard saying 8-bit bytes are required for all technology.

Other than direct memory access, OS functions, and foreign libraries, there will be no use for this byte type in Leaf. There is an appropriate octet type when dealing with files and networks.

tanjent

I agree "byte = 8 bits" isn't absolutely mandated by any standards organization. I probably wasn't clear enough in my previous comment, apologies.

And I also recognize that the small percentage of people who understand the true definition of "byte" also are the folks who are most likely to actually need a 6-bit or 9-bit data type to match their non-commodity hardware.

Seems it comes down to a trade-off... (a) Use "byte" as the name for your flexible address size, which will make the experts happy and maybe confuse the newer developers.... or (b) Use some other term for the flexible address size, which will avoid confusing the newer devs, but will make the experts ask "why didn't you call it a byte."

Miff

Technically, while most numeric types in C# do that (int = System.Int32, ulong = System.UInt64), the eight-bit types are System.Byte and System.SByte. (If they followed the convention they'd be UInt8 and Int8 respectively.)

Then again C# also assumes a char is 16 bits...