I'm in the process of creating a `byte_buffer` fundamental type for Leaf. This is an unsafe buffer needed for compatibility with C APIs. I have a bit of a dilemma about what `byte` should actually mean.
I have another type called `octet`, which is guaranteed to be 8 bits in size.
Traditionally, C and C++ have defined a byte as simply an addressable unit of memory, at least 8 bits in size. A sequence of bytes must also form a contiguous memory area. This definition works on hardware that chooses not to use 8-bit bytes.
At some point, though, hardware unified on 8-bit bytes, so the distinction feels a bit odd. A lot of code is written, incorrectly, assuming a byte is always 8 bits, and some languages, like Java and C#, define a byte to be exactly 8 bits.
There are still some chips around, like DSPs, where a byte is not 8 bits. In the future somebody may experiment again -- which would make it hard for any language that assumes an 8-bit byte to work on those platforms.
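To make the C/C++ definition concrete, here is a minimal C sketch (my illustration, not Leaf code): `CHAR_BIT` reports how many bits a byte has on the current platform, and the standard only guarantees that it is at least 8.

```c
#include <limits.h>   /* CHAR_BIT: bits per byte on this platform */
#include <stdio.h>

int main(void)
{
    /* The C standard guarantees CHAR_BIT >= 8, not CHAR_BIT == 8.
     * On commodity hardware this prints 8; on some DSPs it can be 16 or 32. */
    printf("bits per byte: %d\n", CHAR_BIT);

    /* sizeof counts bytes (addressable units), not octets. */
    printf("sizeof(long) = %zu bytes = %zu bits\n",
           sizeof(long), sizeof(long) * CHAR_BIT);
    return 0;
}
```

On a DSP where `CHAR_BIT` is 16 or 32, an octet-sized type has to be emulated, which is exactly the byte-vs-octet distinction in question.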
Top comments (27)
Yes, for something like `integer`, Leaf defines a default size based on the natural word type. You can request specific sizes if you'd like as well, such as `integer 16bit`, or even `integer 23bit` if you want. `float` also has a default (64-bit usually), but allows the `high` or `low` modifiers.

There's also an `octet` for a `binary 8bit` -- note that `binary` and `integer` are handled differently, unlike say signed vs. unsigned in other languages.

So the question for me is whether `byte` is an alias for `octet` or whether it's an alias for `binary Nbit` where N is platform dependent.

Random thought along these lines: I really prefer my numbers not to have a fixed size, as you're almost guaranteed to run into issues with them eventually.
Whenever possible I'd like the default common number type to be limitless, automatically growing to contain whatever you want to store in it. Then on top of that you can have optimized types for those special cases where you really do want an `int8`.

The vast majority of uses don't have such performance constraints that they need to micro-optimize such things, but there are a large number of cases where even after careful thinking you can end up with bugs due to fixed bit length, especially given a decade or two of progress.
I have exactly the opposite preference. At least for the kind of code I normally write, it's exceptionally rare for most numbers to grow without bound and come remotely close to overflowing... especially 64-bit numbers. It would be a huge waste for every counter, every database ID, every number everywhere to be handled with arbitrary precision rather than native types.
"exceptionally rare" sounds like you'd practically never take it into consideration and thus can end up with related bugs -> unlimited would be better default to avoid accidents.
When you KNOW you're fine with an `int64`, like for auto-increment database IDs, loop counters, etc., then you can still use that. How much is the typical application going to be slowed down by using an arbitrary precision integer over native types? Zilch, you will never even notice a few extra CPU cycles going on.

For those people who do performance critical things and performance optimization, it's fine to offer the optimized types, but they should be opt-in when appropriate, not a source for the occasional "gotcha, you didn't think this would ever wrap now did you?"
In short I guess you could summarize my stance as: "programming languages should by default empower conveniently getting things done with minimal chance for nasty surprises"
Computers are incredibly powerful nowadays compared to the 80286 days, and it would be better to have the programming languages make it as easy as possible for you to make programs that do what you wanted to, rather than programs that save a few CPU cycles in return for micro-optimizations that nobody will notice until they hit a rare bug with them.
Again, no need to remove the option for optimized types for those who need them, but the vast majority of programming nowadays tends to be about trying to make the correct thing happen without bugs rather than optimizing the number of CPU cycles that it takes.
I can't think of the last time I had a bug caused by integer overflow. Does that happen to you a lot?
The cost of using arbitrary precision everywhere is way more than "a few cycles". You're adding overhead all over the place. Incrementing a number goes from a simple machine instruction that can be pipelined to a loop which will likely result in a pipeline stall.
You can't just allocate an array of integers because you don't know how much memory that will need. Increment the nth element of an array? That is potentially an O(n) operation now because it might increase in size. Or your array of integers could actually be an array of pointers to the structs holding your arbitrary precision integers. That DRASTICALLY slows things down on modern processors because your memory references have worse locality so your all-important L1 cache hit ratio goes down the tubes.
It's like flying airliners at 10,000 feet instead of 30,000 feet to avoid the risk of cabin depressurization.
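To put the layout point above in code: a sketch in C (my example, using the GMP library as a stand-in for "arbitrary precision everywhere", not something from the thread). A native array keeps every value in one contiguous block, while each GMP integer keeps its digits in a separately allocated buffer, so walking the array means chasing pointers.

```c
#include <stdint.h>
#include <gmp.h>     /* GNU MP, here just to illustrate the indirection */

#define N 1024

void native_counts(void)
{
    /* One contiguous block of 64-bit integers; incrementing an element is a
     * single add instruction on adjacent memory. */
    int64_t counts[N] = {0};
    for (int i = 0; i < N; ++i)
        counts[i] += 1;
}

void bignum_counts(void)
{
    /* Each mpz_t is a small header whose digit ("limb") storage is generally a
     * separate heap allocation, so the actual values end up scattered. */
    mpz_t counts[N];
    for (int i = 0; i < N; ++i)
        mpz_init(counts[i]);              /* value starts at 0 */
    for (int i = 0; i < N; ++i)
        mpz_add_ui(counts[i], counts[i], 1);
    for (int i = 0; i < N; ++i)
        mpz_clear(counts[i]);
}
```

Whether that extra indirection is ever measurable in a given application is exactly what the rest of the thread argues about.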
When you say "drastically" you mean "has literally no perceivable impact at all in most cases". The `O(n)` timing etc. means literally nothing if you're talking about a general scope. There are places where speed matters, and those places are getting increasingly rare.

I follow the security scene a fair bit, and especially there I keep reading about random pieces of software constantly running into integer overflow/underflow issues. They ARE a cause of bugs when e.g. a developer thinks "well, I'm just asking the users to input an age, and a normal human lives to be at most 100 years old, so I'll just use `int8`", and then the user doesn't know or care about what constraints the programmer had in mind and tries to use the same application to catalog the ages of antique items, species, or planets.

"Premature optimization is the root of all evil" is a fitting quote for this discussion. Optimize where you need it; don't micro-optimize your CPU cycles everywhere because some school teacher taught you about `O(...)` notation. YOUR time is often much more valuable (i.e. you getting things done, without nasty surprises that can lead to unhappy users, security issues, or anything else) than the CPU cycles.

How often do you care or even check what your L1 cache hit ratio is when you write a typical desktop or mobile app, or any web frontend/backend? Much less often than you care about having code that just works regardless of what size of number the user (malicious or not) decided to give you.
And AGAIN, when you DO need to care, the option can be there to be explicit.
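As a concrete version of the age example above (my own sketch, not code from the thread): converting an out-of-range value into a fixed 8-bit type in C silently produces a nonsense result rather than an error.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* "A normal human lives to be at most 100 years old, so 8 bits is enough"... */
    int input_age = 200;             /* an antique's age, or just unexpected input */
    int8_t age = (int8_t)input_age;  /* out of range for int8_t: the result is
                                        implementation-defined, typically -56 */
    printf("stored age: %d\n", age);
    return 0;
}
```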
People mindlessly repeating mantras like "premature optimization is the root of all evil" is the root of all evil.
I think my comment had quite a bit more content to it than that.
cvedetails.com/google-search-resul...
About 16,500 results
cvedetails.com/google-search-resul...
About 3,150 results
And these are just reported security issues, not bugs caused by choosing the wrong integer size.
Here's a new quote; it's quoting me saying it right here: "People quoting O(...) notation and talking about L1 cache as if any of it mattered at all for most cases are the root of all evil" ;)
Okay let's say you replaced them with arbitrary precision arithmetic. How many new bugs would be caused there by malicious input causing huge memory allocations and blowing up the server?
Quick estimate: probably fewer. For one, it'd be easier to do an `if (length > MAX_LENGTH)`-type check.

Also, if you use user input to determine how much memory you allocate, you're probably doing something wrong anyway, regardless of what kind of arithmetic you're doing. Take a file upload: do you trust the client to tell you "I'm sending you a file that is 200kB in size, and here it comes", or do you just take in an arbitrary file stream and then, if it's too big, say "ok, enough" at some point and disconnect?
Anyway I tire of this mindless banter. I've made my point.
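A minimal C sketch of the "just say ok, enough" approach from the comment above (my illustration; the `MAX_UPLOAD` limit and the use of stdin as the input stream are assumptions): read in fixed-size chunks and disconnect once a server-side limit is exceeded, so no client-supplied length ever drives an allocation.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_UPLOAD (10 * 1024 * 1024)   /* 10 MiB server-side limit */

int main(void)
{
    char chunk[4096];
    size_t total = 0, n;

    /* stdin stands in for a network socket here. */
    while ((n = fread(chunk, 1, sizeof chunk, stdin)) > 0) {
        total += n;
        if (total > MAX_UPLOAD) {
            fprintf(stderr, "upload too large, disconnecting\n");
            return EXIT_FAILURE;        /* "ok, enough" */
        }
        /* ... process or store the chunk here ... */
    }
    printf("accepted %zu bytes\n", total);
    return 0;
}
```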
A few notes, related to Leaf, for this discussion:

- There is `integer range(0,1000)`, so you can give real-world limits to numbers and let an appropriate type be picked.
- You can have an `integer 1024bit` in Leaf if you want.

So you pointed to a bunch of bugs caused by a lack of range checks. Your solution to avoid creating another bug is to... add a range check. Brilliant! You have indeed made your point.
It's an alias for `octet`. A `binary Nbit` where N is platform dependent is called a `WORD`.
No, a word typically refers to a natural integer or instruction size, so 32-bit or 64-bit.
The term `byte` I'm using comes from the C/C++ standards, which define it as platform dependent, and that is its traditional definition. It's only after hardware standardized on 8 bits that the term came to mean 8 bits.

It sounds like you already think it's "incorrect" to make that assumption.
On the other hand, languages sometimes have to make compromises based on the existing or expected user base and developer community. I don't know what those are like for Leaf.
I do lean towards thinking `byte == 8-bits` is wrong. That's the latent learning of C in my background, and the thinking that hardware could ultimately change.

For the most part users of Leaf will never see this `byte` type, only those doing integration with non-Leaf libraries or direct memory access.

If your `byte` type is supposed to be for integration with non-Leaf libraries, I believe you should base it on the specifications for those libraries. If you expect the integration to be via C libraries, then I believe you should base your specification of byte on exactly what C says it is.

It will be better for your interoperability with C if you can say:

`byte` - the specification of a byte according to all laws and regulations of C's implementation on whatever platform you're running on.

rather than:

`byte` - mostly what C says it is, except when it isn't, in which case you will have massive headaches and have to implement a lot of workarounds both in your code and any interop layers you might have.

Yes, I already have a series of `abi_` types, so `abi_byte` makes sense. But C-integration is only part of the story; there are still some low-level cases that require the same concept as a `byte`. Those are quite low level though, so still ABI relevant. Perhaps just an `abi_byte` isn't a bad compromise.

I don't do systems programming, so I'm curious about what a typical use case looks like here. Would this be used by people who want to use existing C libraries with the Leaf language?
If there isn't much cost or some other problem with making things parametric, that would seem to be the natural way to go... Instead of bytebuffer, can you have a membuffer where you specify the size of each item?
PS: I am not totally sure, but it seems that even in C, a byte is not defined as 8 bits. It's at least 8 bits.
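A rough C sketch of what such a parametric buffer could look like (entirely my own illustration; the names `membuffer` and `membuffer_at` are made up): the element size becomes a runtime parameter instead of being hard-wired to "one byte".

```c
#include <stdlib.h>

/* A buffer of `count` items, each `elem_size` addressable units wide. */
typedef struct {
    size_t elem_size;
    size_t count;
    unsigned char *data;   /* contiguous storage: count * elem_size units */
} membuffer;

int membuffer_init(membuffer *b, size_t elem_size, size_t count)
{
    b->elem_size = elem_size;
    b->count = count;
    b->data = calloc(count, elem_size);
    return b->data != NULL;
}

/* Pointer to the i-th element; the caller knows what type actually lives there. */
void *membuffer_at(membuffer *b, size_t i)
{
    return b->data + i * b->elem_size;
}

void membuffer_free(membuffer *b)
{
    free(b->data);
    b->data = NULL;
}
```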
The immediate case I'm trying to solve is a C-union used in the SDL library. To solve it I'm going to create a byte-buffer large enough to hold all the types and allow casting it to the correct type.
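For reference, a hedged C sketch of the general pattern being described (my illustration against SDL2, not Leaf's actual binding code; the header path may differ by build setup): `SDL_Event` is a union whose variants all begin with a `Uint32` type tag, so a buffer only has to be large enough, and aligned enough, for the whole union.

```c
#include <SDL2/SDL.h>
#include <string.h>

void pump_events(void)
{
    /* Storage big enough and aligned for every variant of the union; a Leaf
     * byte-buffer would play this role, with C treating it as raw bytes. */
    SDL_Event storage;
    unsigned char *bytes = (unsigned char *)&storage;   /* byte-wise view */

    while (SDL_PollEvent(&storage)) {
        /* Every SDL_Event variant starts with a Uint32 type tag. */
        Uint32 type;
        memcpy(&type, bytes, sizeof type);
        if (type == SDL_QUIT) {
            /* handle quit */
        }
    }
}
```

The relevant detail is that `sizeof(SDL_Event)` is measured in C bytes, whatever width those happen to have, which is where the byte-vs-octet question shows up.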
Hopefully `byte` doesn't come up often in Leaf, only for integration and low-level work. There's no real cost associated with supporting it, other than having to explain that `byte != octet`. There's also no real way of enforcing this either, since on all current OSes they will be the same type. Making them strongly different types might be quite inconvenient, but it's something I'm considering. That is, even if the same size, don't allow `byte <-> octet` conversions without explicit casting.
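To illustrate that last idea, a minimal C sketch (my own, assuming wrapper types are acceptable): making the two concepts distinct single-member struct types means the compiler rejects implicit mixing even when both happen to be 8 bits wide, and conversion has to go through an explicit function.

```c
#include <stdint.h>

typedef struct { unsigned char v; } byte_t;   /* platform byte: CHAR_BIT bits */
typedef struct { uint8_t v; } octet_t;        /* exactly 8 bits */

octet_t octet_from_byte(byte_t b)
{
    /* The one sanctioned crossing point between the two types. */
    octet_t o = { (uint8_t)b.v };
    return o;
}

int main(void)
{
    byte_t b = { 0x41 };
    /* octet_t o = b;               -- compile error: incompatible types */
    octet_t o = octet_from_byte(b);
    (void)o;
    return 0;
}
```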
Hmm. Since C defines the result of sizeof as a number of 'bytes' (but not necessarily octets, as you pointed out), I guess it makes sense to do the same for code whose job is to help interface with C code. You could try to call it something else, but I'm finding it hard to think of something better. 'sizeof_type'?
`byte` is the term that means this in C/C++, and historically is the correct term as well. It's only recently that it's become an alias for 8 bits. I don't think I'd like to introduce a new term.

Leaf will also have to have a `sizeof` that returns a number of bits. I guess it won't be so unnatural to have mismatched values though, since you can use arbitrary bit sizes anyway.

"1 byte = 8 bits" is something that's been included in ISO Standard docs for two decades. Historically, sure, it was a hardware-specific size, but those days are long gone. So yes, your `byte_buffer` really needs to be addressable in 8-bit sized chunks.
I also think there's very little application for data types tied that closely to hardware definition for most development nowadays. As such I'd avoid using the word "byte" unless it's a hardware driven implementation.
There are apparently some DSPs that don't use 8-bit byte sizes still. It's in the area of embedded and special purpose hardware where I'm most concerned about odd bit sizes.
The ISO docs only apply to languages that state they follow them. I'm not sure there's any actual mandated computing standard saying 8-bit bytes are required for all technology.
Other than direct memory access, OS functions, and foreign libraries, there will be no use for this `byte` type in Leaf. There is an appropriate `octet` type for dealing with files and the network.

I agree "byte = 8 bit" is not anything absolutely mandated by a standards organization. I probably wasn't clear enough in my previous comment, apologies.
And I also recognize that the small percentage of people who understand the true definition of "byte" are also the folks who are most likely to actually need a 6-bit or 9-bit data type to match their non-commodity hardware.
Seems it comes down to a trade-off... (a) Use "byte" as the name for your flexible address size, which will make the experts happy and maybe confuse the newer developers... or (b) Use some other term for the flexible address size, which will avoid confusing the newer devs, but will make the experts ask "why didn't you call it a byte?"
Technically, while most numeric types in C# do that (int = System.Int32, ulong = System.UInt64), the eight-bit types are System.Byte and System.SByte. (If they followed the convention they'd be UInt8 and Int8 respectively.)
Then again C# also assumes a char is 16 bits...