DEV Community

Wesley

.NET: The ASCII Problem

This might look like needless binary-size optimization, but you have to at least ask yourself: why does .NET behave in a way that leads to these workarounds?

For historical reasons, System.String uses the UCS-2 character encoding, that is, UTF-16 without surrogate pairs.
However, most strings in typical .NET applications consist solely of ASCII characters, leading to wasted space: half of the bytes in a string are likely to be null bytes!
Since strings are immutable, we can scan the character data when the string is constructed, then dynamically select an encoding, thereby saving 50% of string memory in most cases.

ASCII Mono | Mono
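To make the idea in that quote concrete, here is a minimal sketch of the kind of scan it describes. This is not Mono's implementation: the class name AsciiScan and the sample string are made up for illustration. It checks whether every character fits in 7 bits, then compares the UTF-16 byte count with the ASCII byte count.

```csharp
using System;
using System.Text;

static class AsciiScan
{
    // True when every UTF-16 code unit of s is in the 7-bit ASCII range.
    public static bool IsAscii(string s)
    {
        foreach (char c in s)
        {
            if (c > 0x7F) return false;
        }
        return true;
    }

    static void Main()
    {
        // Sample text chosen for illustration only.
        string s = "Lorem ipsum dolor sit amet";

        // Encoding.Unicode is UTF-16 LE: two bytes per code unit.
        int utf16Bytes = Encoding.Unicode.GetByteCount(s);
        // Encoding.ASCII uses one byte per character.
        int asciiBytes = Encoding.ASCII.GetByteCount(s);

        Console.WriteLine("ASCII only: {0}", IsAscii(s));
        Console.WriteLine("UTF-16: {0} bytes, ASCII: {1} bytes", utf16Bytes, asciiBytes);
    }
}
```

For an all-ASCII string the second count is always half of the first, which is the 50% the quote is talking about.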

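For some reason, this UTF-16 storage applies to every string literal in a compiled assembly, but not to metadata names such as enum member names and field names, which are stored one byte per ASCII character.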

The fork linked on that page hasn't been active since 2016, and I know before I even try that cloning it and running make will produce a million errors. So now it's time to pull out my best programming gimmick: hacks.

Tests

Here is a base executable with no code to execute:
3.0KB:

0000: 4D5A90000300000004000000FFFF0000 MZ..............
0010: B8000000000000004000000000000000 ........@.......
0020: 00000000000000000000000000000000 ................
0030: 00000000000000000000000080000000 ................
0040: 0E1FBA0E00B409CD21B8014CCD215468 ........!..L.!Th
0050: 69732070726F6772616D2063616E6E6F is program canno
0060: 742062652072756E20696E20444F5320 t be run in DOS
0070: 6D6F64652E0D0D0A2400000000000000 mode....$.......
0080: 504500004C0103000000000000000000 PE..L...........
0090: 00000000E00002010B01080000040000 ................
00a0: 0006000000000000BE22000000200000 ........."... ..
00b0: 00400000000040000020000000020000 .@....@.. ......
00c0: 04000000000000000400000000000000 ................
00d0: 00800000000200000000000003004085 ..............@.
00e0: 00001000001000000000100000100000 ................
00f0: 00000000100000000000000000000000 ................
0100: 702200004B00000000400000D0020000 p"..K....@......
0110: 00000000000000000000000000000000 ................
0120: 006000000C0000000000000000000000 .`..............
0130: 00000000000000000000000000000000 ................
0140: 00000000000000000000000000000000 ................
0150: 00000000000000000020000008000000 ......... ......
0160: 00000000000000000820000048000000 ......... ..H...
0170: 00000000000000002E74657874000000 .........text...
0180: C4020000002000000004000000020000 ..... ..........
0190: 00000000000000000000000020000060 ............ ..`
01a0: 2E72737263000000D002000000400000 .rsrc........@..
01b0: 00040000000600000000000000000000 ................
01c0: 00000000400000402E72656C6F630000 ....@..@.reloc..
01d0: 0C0000000060000000020000000A0000 .....`..........
01e0: 00000000000000000000000040000042 ............@..B
01f0: 00000000000000000000000000000000 ................
0200: A0220000000000004800000002000500 ."......H.......
0210: 5C2000000C0200000100000002000006 \ ..............
0220: 00000000000000000000000000000000 ................
0230: 00000000000000000000000000000000 ................
0240: 00000000000000000000000000000000 ................
0250: 1E02280100000A2A062A000042534A42 ..(....*.*..BSJB
0260: 01000100000000000C00000076322E30 ............v2.0
0270: 2E35303732370000000005006C000000 .50727......l...
0280: D0000000237E00003C01000084000000 ....#~..<.......
0290: 23537472696E677300000000C0010000 #Strings........
02a0: 0800000023555300C801000010000000 ....#US.........
02b0: 2347554944000000D801000034000000 #GUID.......4...
02c0: 23426C6F620000000000000002000010 #Blob...........
02d0: 471500000900000000FA013300160000 G..........3....
02e0: 01000000020000000200000002000000 ................
02f0: 01000000020000000100000001000000 ................
0300: 0100000000007B000100000000000600 ......{.........
0310: 17001E00060034005200000000000100 ......4.R.......
0320: 0000000001000100000010000A000000 ................
0330: 05000100010050200000000086182500 ......P ......%.
0340: 0100010058200000000096002B000500 ....X ......+...
0350: 01000000010012000900250001001100 ..........%.....
0360: 250001002E0013000B00048000000000 %...............
0370: 00000000000000000000000030000000 ............0...
0380: 0200000000000000000000002A007200 ............*.r.
0390: 0000000000000000003C4D6F64756C65 .........<Module
03a0: 3E0050726F6772616D0061726773004F >.Program.args.O
03b0: 626A6563740053797374656D002E6374 bject.System..ct
03c0: 6F72004D61696E007768790052756E74 or.Main.why.Runt
03d0: 696D65436F6D7061746962696C697479 imeCompatibility
03e0: 4174747269627574650053797374656D Attribute.System
03f0: 2E52756E74696D652E436F6D70696C65 .Runtime.Compile
0400: 725365727669636573006D73636F726C rServices.mscorl
0410: 6962007768792E657865000000032000 ib.why.exe.... .
0420: 000000001B8EF60B53E43C4E97C8F435 ........S.<N...5
0430: 7DB1B9050003200001050001011D0E1E }..... .........
0440: 01000100540216577261704E6F6E4578 ....T..WrapNonEx
0450: 63657074696F6E5468726F77730108B7 ceptionThrows...
0460: 7A5C561934E089000000000000000000 z\V.4...........
0470: 982200000000000000000000AE220000 ."..........."..
0480: 00200000000000000000000000000000 . ..............
0490: 0000000000000000A022000000000000 ........."......
04a0: 00005F436F724578654D61696E006D73 .._CorExeMain.ms
04b0: 636F7265652E646C6C0000000000FF25 coree.dll......%
04c0: 00204000000000000000000000000000 . @.............
04d0: 00000000000000000000000000000000 ................

At the bare minimum, executables carry an extra 1.5KB of data for the .reloc and .rsrc sections, the latter of which holds the version info.
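If you want to check those section sizes without squinting at a hex dump, a rough sketch like the one below should do. It walks the PE section table and prints each section's size on disk. The class name PeSections and the default path test.exe are my own placeholders, and the code assumes a well-formed PE file read on a little-endian machine, with no validation at all.

```csharp
using System;
using System.IO;
using System.Text;

static class PeSections
{
    // Writes the name and on-disk (raw) size of every section in a PE image.
    // Assumes a valid image and a little-endian host (BitConverter follows the host).
    public static void Dump(byte[] pe, TextWriter output)
    {
        int peOffset = BitConverter.ToInt32(pe, 0x3C);              // e_lfanew
        int sectionCount = BitConverter.ToUInt16(pe, peOffset + 6); // NumberOfSections
        int optionalSize = BitConverter.ToUInt16(pe, peOffset + 20); // SizeOfOptionalHeader
        int table = peOffset + 24 + optionalSize;                   // first section header

        for (int i = 0; i < sectionCount; i++)
        {
            int entry = table + i * 40;                             // 40 bytes per header
            string name = Encoding.ASCII.GetString(pe, entry, 8).TrimEnd('\0');
            uint rawSize = BitConverter.ToUInt32(pe, entry + 16);   // SizeOfRawData
            output.WriteLine("{0,-8} 0x{1:X} bytes", name, rawSize);
        }
    }

    static void Main(string[] args)
    {
        // Hypothetical default path; pass the real one as the first argument.
        string path = args.Length > 0 ? args[0] : "test.exe";
        if (!File.Exists(path))
        {
            Console.WriteLine("File not found: {0}", path);
            return;
        }
        Dump(File.ReadAllBytes(path), Console.Out);
    }
}
```

In the base executable's dump, the section headers give raw sizes of 0x400 for .text, 0x400 for .rsrc and 0x200 for .reloc, which is where the 1.5KB comes from.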
DEV is not helping me here with the little width I have to work with.

Plain String

Here's a plain call to WriteLine in Program.Main.

Console.WriteLine("Lorem ipsum dolor sit amet, consectetur...");

Compile command: mcs -debug- -nostdlib- -o+ -sdk:2 test.cs
Binary view: (xxd -u -g 1 -s +0x200 -l 0x400 test.exe)
4.0KB:

03a0: 2F0084000000000000000000003C4D6F /............<Mo
03b0: 64756C653E0050726F6772616D006172 dule>.Program.ar
03c0: 677300436F6E736F6C65005379737465 gs.Console.Syste
03d0: 6D0057726974654C696E65004F626A65 m.WriteLine.Obje
03e0: 6374002E63746F72004D61696E007768 ct..ctor.Main.wh
03f0: 790052756E74696D65436F6D70617469 y.RuntimeCompati
0400: 62696C69747941747472696275746500 bilityAttribute.
0410: 53797374656D2E52756E74696D652E43 System.Runtime.C
0420: 6F6D70696C6572536572766963657300 ompilerServices.
0430: 6D73636F726C6962007768792E657865 mscorlib.why.exe
0440: 000000000084054C006F00720065006D .......L.o.r.e.m
0450: 00200069007000730075006D00200064 . .i.p.s.u.m. .d
0460: 006F006C006F00720020007300690074 .o.l.o.r. .s.i.t
0470: 00200061006D00650074002C00200063 . .a.m.e.t.,. .c
0480: 006F006E007300650063007400650074 .o.n.s.e.c.t.e.t
0490: 00750072002000610064006900700069 .u.r. .a.d.i.p.i
04a0: 007300630069006E006700200065006C .s.c.i.n.g. .e.l
04b0: 00690074002E0020004E0075006E0063 .i.t... .N.u.n.c
04c0: 00200065006700650074002000610075 . .e.g.e.t. .a.u
04d0: 00630074006F0072002000740065006C .c.t.o.r. .t.e.l
04e0: 006C00750073002C0020007500740020 .l.u.s.,. .u.t.
04f0: 0063006F006E00640069006D0065006E .c.o.n.d.i.m.e.n
0500: 00740075006D0020006500730074002E .t.u.m. .e.s.t..
0510: 0020004E0075006C006C006100200066 . .N.u.l.l.a. .f
0520: 00650072006D0065006E00740075006D .e.r.m.e.n.t.u.m
0530: 002C0020007400750072007000690073 .,. .t.u.r.p.i.s
0540: 002000730069007400200061006D0065 . .s.i.t. .a.m.e
0550: 0074002000680065006E006400720065 .t. .h.e.n.d.r.e
0560: 00720069007400200072007500740072 .r.i.t. .r.u.t.r
0570: 0075006D002C00200061006E00740065 .u.m.,. .a.n.t.e
0580: 00200064007500690020006C006F0062 . .d.u.i. .l.o.b
0590: 006F0072007400690073002000610075 .o.r.t.i.s. .a.u
05a0: 006700750065002C0020007300650064 .g.u.e.,. .s.e.d
05b0: 002000700075006C00760069006E0061 . .p.u.l.v.i.n.a
05c0: 007200200065007800200064006F006C .r. .e.x. .d.o.l
05d0: 006F00720020006E006F006E0020006C .o.r. .n.o.n. .l
05e0: 0061006300750073002E0020004E0061 .a.c.u.s... .N.a
05f0: 006D0020007300650064002000650073 .m. .s.e.d. .e.s

In the binary, this one string takes up as much space as a whole EXE section (0x400 bytes / 1KB).
This is doomsday for large console apps.
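The arithmetic behind that size can be sketched as follows. As I understand the metadata format (ECMA-335), a string literal is stored as a compressed length prefix, then the UTF-16 data, then one trailing flag byte. The class and method names here are hypothetical.

```csharp
using System;

static class UserStringSize
{
    // Bytes taken by one string literal entry for a literal of
    // charCount UTF-16 code units, following the layout described above.
    public static int EntrySize(int charCount)
    {
        int payload = charCount * 2 + 1; // UTF-16 data plus one trailing flag byte
        // The compressed length prefix takes 1, 2 or 4 bytes depending on the payload size.
        int prefix = payload < 0x80 ? 1 : (payload < 0x4000 ? 2 : 4);
        return prefix + payload;
    }

    static void Main()
    {
        int chars = 514; // length decoded from the "84 05" prefix in the dump
        Console.WriteLine("{0} chars -> 0x{1:X} bytes", chars, EntrySize(chars));
    }
}
```

The `84 05` right before the `4C 00 6F 00` in the dump is that prefix. It decodes to 0x405 = 1029 bytes, which is 514 characters at two bytes each plus the flag byte.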

String Resource

Loading a string from a resource:

Console.WriteLine(Resources.ResourceManager.GetString("test"));

Resource file:

<?xml version="1.0" encoding="utf-8"?>
<root>
    ...
    <data name="test" xml:space="preserve">
        <value>Lorem ipsum dolor sit amet, consectetur...</value>
    </data>
</root>

Command: resgen /useSourcePath /compile main.resx
Added to MCS args: -resource:main.resources
4.5KB:

0200: D0280000000000004800000002000500 .(......H.......
0210: 80230000200500000100000004000006 .#.. ...........
0220: AC200000D20200000000000000000000 . ..............
0230: 00000000000000000000000000000000 ................
0240: 00000000000000000000000000000000 ................
0250: 1E02280900000A2AD27E010000041428 ..(....*.~.....(
0260: 0500000A391E0000007201000070D002 ....9....r...p..
0270: 000002280600000A6F0700000A730800 ...(....o....s..
0280: 000A80010000047E010000042A1E0228 .......~....*..(
0290: 0900000A2A5628020000067209000070 ....*V(....r...p
02a0: 6F0A00000A280B00000A2A00CE020000 o....(....*.....
02b0: CECAEFBE01000000910000006C537973 ............lSys
02c0: 74656D2E5265736F75726365732E5265 tem.Resources.Re
02d0: 736F757263655265616465722C206D73 sourceReader, ms
02e0: 636F726C69622C2056657273696F6E3D corlib, Version=
02f0: 342E302E302E302C2043756C74757265 4.0.0.0, Culture
0300: 3D6E65757472616C2C205075626C6963 =neutral, Public
0310: 4B6579546F6B656E3D62373761356335 KeyToken=b77a5c5
0320: 3631393334653038392353797374656D 61934e089#System
0330: 2E5265736F75726365732E52756E7469 .Resources.Runti
0340: 6D655265736F75726365536574020000 meResourceSet...
0350: 00010000000000000050414450414450 .........PADPADP
0360: 33AF737C00000000C900000008740065 3.s|.........t.e
0370: 0073007400000000000182044C6F7265 .s.t........Lore
0380: 6D20697073756D20646F6C6F72207369 m ipsum dolor si
0390: 7420616D65742C20636F6E7365637465 t amet, consecte
03a0: 7475722061646970697363696E672065 tur adipiscing e
03b0: 6C69742E204E756E6320656765742061 lit. Nunc eget a
03c0: 7563746F722074656C6C75732C207574 uctor tellus, ut
03d0: 20636F6E64696D656E74756D20657374  condimentum est
07a0: 0000000000000000B700230100000000 ..........#.....
07b0: 0000000001000000E401000000000000 ................
07c0: 003C4D6F64756C653E00520050726F67 .<Module>.R.Prog
07d0: 72616D00724D005265736F757263654D ram.rM.ResourceM
07e0: 616E616765720053797374656D2E5265 anager.System.Re
07f0: 736F75726365730047656E6572617465 sources.Generate
0800: 64436F64654174747269627574650053 dCodeAttribute.S
0810: 797374656D2E436F6465446F6D2E436F ystem.CodeDom.Co
0820: 6D70696C6572002E63746F7200446562 mpiler..ctor.Deb
0830: 75676765724E6F6E55736572436F6465 uggerNonUserCode
0840: 4174747269627574650053797374656D Attribute.System
0850: 2E446961676E6F737469637300436F6D .Diagnostics.Com
0860: 70696C657247656E6572617465644174 pilerGeneratedAt
0870: 747269627574650053797374656D2E52 tribute.System.R
0880: 756E74696D652E436F6D70696C657253 untime.CompilerS
0890: 6572766963657300456469746F724272 ervices.EditorBr
08a0: 6F777361626C65417474726962757465 owsableAttribute
08b0: 0053797374656D2E436F6D706F6E656E .System.Componen
08c0: 744D6F64656C00456469746F7242726F tModel.EditorBro
08d0: 777361626C655374617465004F626A65 wsableState.Obje
08e0: 63740053797374656D00526566657265 ct.System.Refere
08f0: 6E6365457175616C7300547970650047 nceEquals.Type.G
0900: 65745479706546726F6D48616E646C65 etTypeFromHandle
0910: 0052756E74696D655479706548616E64 .RuntimeTypeHand
0920: 6C65006765745F417373656D626C7900 le.get_Assembly.
0930: 417373656D626C790053797374656D2E Assembly.System.
0940: 5265666C656374696F6E006172677300 Reflection.args.
0950: 476574537472696E6700436F6E736F6C GetString.Consol
0960: 650057726974654C696E65006765745F e.WriteLine.get_
0970: 4D004D61696E004D007768790052756E M.Main.M.why.Run
0980: 74696D65436F6D7061746962696C6974 timeCompatibilit
0990: 79417474726962757465006D73636F72 yAttribute.mscor
09a0: 6C6962007768792E7265736F75726365 lib.why.resource
09b0: 73007768792E65786500000000077700 s.why.exe.....w.
09c0: 68007900000974006500730074000000 h.y...t.e.s.t...
09d0: 7855E192F3A8DF4ABD3B8148F7C608B2 xU.....J.;.H....
09e0: 0003061205052002010E0E4101003353 ...... ....A..3S
09f0: 797374656D2E5265736F75726365732E ystem.Resources.
0a00: 546F6F6C732E5374726F6E676C795479 Tools.StronglyTy
0a10: 7065645265736F757263654275696C64 pedResourceBuild
0a20: 65720831372E302E302E300000032000 er.17.0.0.0... .

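The string itself is now stored one byte per character, but the resource header and the generated resource class metadata (the data at the bottom of the dump) add nearly as much bloat.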

String as Byte[]

(Real world example)
(sigh):

Console.WriteLine(Encoding.ASCII.GetString(
    new byte[] {(byte)'L',(byte)'o',(byte)'r',(byte)'e',(byte)'m',(byte)' ',(byte)'i'...}
));

5.0KB (...):

0100: F02300004B00000000600000D0020000 .#..K....`......
0110: 00000000000000000000000000000000 ................
0120: 008000000C0000000000000000000000 ................
0130: 00000000000000000000000000000000 ................
0140: 00000000000000000000000000000000 ................
0150: 00000000000000000020000008000000 ......... ......
0160: 00000000000000000820000048000000 ......... ..H...
0170: 00000000000000002E74657874000000 .........text...
0180: 44040000002000000006000000040000 D.... ..........
0190: 00000000000000000000000020000060 ............ ..`
01a0: 2E736461746100000402000000400000 .sdata.......@..
01b0: 00040000000A00000000000000000000 ................
01c0: 00000000400000C02E72737263000000 ....@....rsrc...
01d0: D00200000060000000040000000E0000 .....`..........
01e0: 00000000000000000000000040000040 ............@..@
01f0: 2E72656C6F6300000C00000000800000 .reloc..........
0200: 00020000001200000000000000000000 ................
0630: 03000000003C4D6F64756C653E005072 .....<Module>.Pr
0640: 6F6772616D0061726773004279746500 ogram.args.Byte.
0650: 53797374656D003C5072697661746549 System.<PrivateI
0660: 6D706C656D656E746174696F6E446574 mplementationDet
0670: 61696C733E0024417272617954797065 ails>.$ArrayType
0680: 3D35313600246669656C642D31323333 =516.$field-1233
0690: 38454237413930433932343238313343 8EB7A90C9242813C
06a0: 41463436453833353941303343304242 AF46E8359A03C0BB
06b0: 444133440052756E74696D6548656C70 DA3D.RuntimeHelp
06c0: 6572730053797374656D2E52756E7469 ers.System.Runti
06d0: 6D652E436F6D70696C65725365727669 me.CompilerServi
06e0: 63657300496E697469616C697A654172 ces.InitializeAr
06f0: 7261790041727261790052756E74696D ray.Array.Runtim
0700: 654669656C6448616E646C6500436F6E eFieldHandle.Con
0710: 736F6C650057726974654C696E65004F sole.WriteLine.O
0a00: 4C6F72656D20697073756D20646F6C6F Lorem ipsum dolo
0a10: 722073697420616D65742C20636F6E73 r sit amet, cons
0a20: 65637465747572206164697069736369 ectetur adipisci
0a30: 6E6720656C69742E204E756E63206567 ng elit. Nunc eg
0a40: 657420617563746F722074656C6C7573 et auctor tellus

This adds an extra section named ".sdata" where the byte array is stored, and a stupidly long field name for the array.

Enum

Probably the stupidest way to work around this, and it only works in certain circumstances, like storing or using config keys.
(Real world example)
Rough version that doesn't use a proper array or support punctuation:

public enum _ {
    Lorem,
    ipsum,
    dolor,
    sit,
    amet,
    consectetur,
    ...
    // Duplicate words get an underscore appended :(
    // The loop below stops at _._, so the last member is presumably named _.
}
// Walk the members in declaration order and join their names with spaces.
string test = "";
for (int i = 0; i < (int)(_._); i++)
    test += ((_)i).ToString() + ' ';
Console.WriteLine(test);

5.0KB:

03a0: 09000100010002010000120000001500 ................
03b0: 0100030006061400010056801C000400 ..........V.....
03c0: 56802200040056802800040056802E00 V."...V.(...V...
03d0: 04005680320004005680370004005680 ..V.2...V.7...V.
03e0: 4300040056804E000400568053000400 C...V.N...V.S...
03f0: 56805800040056805D00040056806400 V.X...V.]...V.d.
0400: 040056806B00040056806E0004005680 ..V.k...V.n...V.
0410: 7A00040056807E000400568084000400 z...V.~...V.....
0420: 56808E00040056809500040056809A00 V.....V.....V...
0430: 04005680A00004005680AA0004005680 ..V.....V.....V.
0440: B10004005680B60004005680BA000400 ....V.....V.....
0450: 5680C30004005680C90004005680CD00 V.....V.....V...
0770: 5201080014015701080018015C010800 R.....W.....\...
0780: 1C016101080020016601080024016B01 ..a... .f...$.k.
0790: 08002801700108002C01750108003001 ..(.p...,.u...0.
07a0: 7A01080034017F010800380184010800 z...4.....8.....
07b0: 3C018901080040018E01080044019301 <.....@.....D...
07c0: 2E003300BC01B5010480000000000000 ..3.............
07d0: 00000000000000000000780200000200 ..........x.....
07e0: 00000000000000000000DB01BA020000 ................
07f0: 0000030002000000003C4D6F64756C65 .........<Module
0800: 3E0050726F6772616D005F0076616C75 >.Program._.valu
0810: 655F5F004C6F72656D00697073756D00 e__.Lorem.ipsum.
0820: 646F6C6F720073697400616D65740063 dolor.sit.amet.c
0830: 6F6E7365637465747572006164697069 onsectetur.adipi
0840: 7363696E6700656C6974004E756E6300 scing.elit.Nunc.
0850: 6567657400617563746F720074656C6C eget.auctor.tell
0860: 757300757400636F6E64696D656E7475 us.ut.condimentu
0870: 6D00657374004E756C6C61006665726D m.est.Nulla.ferm
0880: 656E74756D0074757270697300736974 entum.turpis.sit

This takes up 0x440 bytes.
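For comparison, here is a slightly less rough sketch of the same trick that uses Enum.GetNames instead of counting up to a sentinel member. The enum Words, its members and the class EnumText are made up for illustration, and the trailing underscore on duplicates follows the convention mentioned above. It still doesn't support punctuation.

```csharp
using System;

// Hypothetical word list; "sit_" stands in for a repeated "sit".
enum Words { Lorem, ipsum, dolor, sit, amet, consectetur, sit_ }

static class EnumText
{
    // Joins the member names of an enum into one space-separated string.
    public static string Join(Type enumType)
    {
        // GetNames returns the names ordered by their underlying values.
        string[] names = Enum.GetNames(enumType);
        for (int i = 0; i < names.Length; i++)
        {
            names[i] = names[i].TrimEnd('_'); // strip the duplicate marker
        }
        return string.Join(" ", names);
    }

    static void Main()
    {
        Console.WriteLine(Join(typeof(Words)));
    }
}
```

The names still live in the metadata as single-byte text, so the storage cost is the same as in the rough version. Only the lookup code changes.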

Raw ASCII string squished into UTF-16 LE C# String

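Two ASCII characters are packed into each UTF-16 LE character of the literal. At run time the literal is converted to a byte array with Encoding.Unicode.GetBytes, and that array is converted back to a string with Encoding.ASCII.GetString.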

Console.WriteLine(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(
    "潌敲灩畳潤潬⁲楳⁴浡瑥‬潣獮捥整畴⁲摡灩獩楣杮攠..."
)));

3.5KB(!!):

03a0: 750017002100750017002E003B002100 u...!.u.....;.!.
03b0: 04800000000000000000000000000000 ................
03c0: 00008000000002000000000000000000 ................
03d0: 00004000C200000000000000003C4D6F ..@..........<Mo
03e0: 64756C653E0050726F6772616D006172 dule>.Program.ar
03f0: 677300456E636F64696E670053797374 gs.Encoding.Syst
0400: 656D2E54657874006765745F41534349 em.Text.get_ASCI
0410: 49006765745F556E69636F6465004765 I.get_Unicode.Ge
0420: 74427974657300476574537472696E67 tBytes.GetString
0430: 00436F6E736F6C650053797374656D00 .Console.System.
0440: 57726974654C696E65004F626A656374 WriteLine.Object
0450: 002E63746F72004D61696E0077687900 ..ctor.Main.why.
0460: 52756E74696D65436F6D706174696269 RuntimeCompatibi
0470: 6C697479417474726962757465005379 lityAttribute.Sy
0480: 7374656D2E52756E74696D652E436F6D stem.Runtime.Com
0490: 70696C65725365727669636573006D73 pilerServices.ms
04a0: 636F726C6962007768792E6578650000 corlib.why.exe..
04b0: 0082034C6F72656D20697073756D2064 ...Lorem ipsum d
04c0: 6F6C6F722073697420616D65742C2063 olor sit amet, c
04d0: 6F6E7365637465747572206164697069 onsectetur adipi
04e0: 7363696E6720656C69742E204E756E63 scing elit. Nunc
04f0: 206567657420617563746F722074656C  eget auctor tel
0500: 6C75732C20757420636F6E64696D656E lus, ut condimen
0510: 74756D206573742E204E756C6C612066 tum est. Nulla f
0520: 65726D656E74756D2C20747572706973 ermentum, turpis
0690: 612C20657420626962656E64756D2066 a, et bibendum f
06a0: 656C69732074656D7075732061636375 elis tempus accu
06b0: 6D73616E2E0100003EF604CCBDC0D640 msan....>......@
06c0: B4394801D5D5B2C90004000012050520 .9H............
06d0: 011D050E0520010E1D05040001010E03 ..... ..........
06e0: 200001050001011D0E1E010001005402  .............T.
06f0: 16577261704E6F6E457863657074696F .WrapNonExceptio
0700: 6E5468726F77730108B77A5C561934E0 nThrows...z\V.4.
0710: 89000000000000000000000000000000 ................
0720: 4825000000000000000000005E250000 H%..........^%..
0730: 00200000000000000000000000000000 . ..............
0740: 00000000000000005025000000000000 ........P%......
0750: 00005F436F724578654D61696E006D73 .._CorExeMain.ms
0760: 636F7265652E646C6C0000000000FF25 coree.dll......%
0770: 00204000000000000000000000000000 . @.............
0780: 00000000000000000000000000000000 ................

The string takes 0x200 bytes / 0.5 KB.
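The packed literal has to come from somewhere, so here is a minimal sketch of how it could be generated and read back. The class AsciiPacker and its method names are hypothetical, and padding odd-length text with a trailing space is my own assumption, since each UTF-16 character needs exactly two bytes.

```csharp
using System;
using System.Text;

static class AsciiPacker
{
    // Packs two ASCII characters into each UTF-16 LE code unit.
    public static string Pack(string ascii)
    {
        // Assumption: pad odd-length text with a space so the bytes pair up.
        if (ascii.Length % 2 != 0) ascii += " ";
        return Encoding.Unicode.GetString(Encoding.ASCII.GetBytes(ascii));
    }

    // Reverses Pack; this is the same pair of calls used in the article's snippet.
    public static string Unpack(string packed)
    {
        return Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(packed));
    }

    static void Main()
    {
        string original = "Lorem ipsum dolor sit amet";
        string packed = Pack(original);

        Console.WriteLine("Original: {0} chars, packed: {1} chars", original.Length, packed.Length);
        Console.WriteLine(Unpack(packed));
    }
}
```

One thing to watch when pasting the packed text into source: some pairs turn into awkward characters. For example, "( " packs to U+2028, a line separator that C# does not accept unescaped inside a regular string literal, so writing such characters as \u escapes is the safer choice.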
