Efficient, standards-compliant binary encoding
August 26, 2009 10:51 PM
Looking for a more efficient method than Base64 to encode binary data for a web application, I did not find a general-purpose solution. So I specified the design of a custom script in Unicode's Private Use Area in planes 15 – 16.
I also made a reference implementation of the encoder/decoder in JavaScript and tested it with HTML, Twitter, and a WordPress plugin.
Base16b is intended to supplement existing base encoding methods. Measured by bits per character, Base16b is nearly three times more efficient than Base64.
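The "nearly three times" figure can be checked with a little arithmetic. This is only a sketch: it assumes the 17-bit variant of Base16b mentioned later in the thread, ignores Base64 padding, and uses a hypothetical helper name (`charsNeeded`). The 273-byte payload is the GIF size quoted in the comments.

```javascript
// Characters needed to carry `byteCount` bytes when each character
// holds `bitsPerChar` bits of payload (hypothetical helper).
function charsNeeded(byteCount, bitsPerChar) {
  return Math.ceil((byteCount * 8) / bitsPerChar);
}

const payloadBytes = 273; // size of the GIF quoted in the comments

const base64Chars = charsNeeded(payloadBytes, 6);   // Base64: 6 bits per character
const base16bChars = charsNeeded(payloadBytes, 17); // Base16b, assumed 17 bits per character

console.log(base64Chars);         // 364
console.log(base16bChars);        // 129
console.log((17 / 6).toFixed(2)); // 2.83
```

The ratio of 17 to 6 bits per character is about 2.83, which is where "nearly three times" comes from. It says nothing about bytes on the wire.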
General support for Unicode's higher planes is still patchy though, so this encoding should probably not be deployed unless the environment is controlled.
Constrained character environments such as micro-blogging could be a strong use case. Can you think of an application which could benefit from this?
jwells: You're missing the difference between bytes and characters. A character in UTF-8 can take anywhere from 1 to 4 bytes. Only the first 128 characters (ASCII) fit in a single byte, and the private-use characters in planes 15 – 16 take 4 bytes each. If you switch your browser's encoding to something like ISO-8859-1, the "base16b" stuff will look much longer.
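The byte counts are easy to check directly. A minimal sketch, assuming an environment that provides `TextEncoder` (modern browsers and Node.js); the helper name `utf8ByteLength` and the sample code points are chosen here for illustration only.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Number of UTF-8 bytes used by a single code point (hypothetical helper).
function utf8ByteLength(codePoint) {
  return encoder.encode(String.fromCodePoint(codePoint)).length;
}

// ASCII, Latin-1 supplement, a BMP arrow, and two private-use
// code points from planes 15 and 16.
const samples = [0x41, 0xe9, 0x27a1, 0xf0000, 0x10fffd];

for (const cp of samples) {
  const label = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  console.log(label, utf8ByteLength(cp));
}
// U+0041 1
// U+00E9 2
// U+27A1 3
// U+F0000 4
// U+10FFFD 4
```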
But Twitter's limit counts Unicode characters, not bytes, so if you use high code points you can fit more data in your strings and do stuff like this: http://➡.ws/뀼 to make really short links.
posted by delmoi at 1:37 AM on August 30, 2009
"Base17b" is a special case of the Base16b family which runs from 7b to 17b. I see this style of naming is confusing so I'll change it.
Indeed, the project is specifically aimed at reduction of characters, not necessarily of bytes. Byte-wise efficiency depends on the Unicode encoding form (UTF-8, UTF-16, UTF-32). While not the objective, byte-wise efficiency does not come out too badly in Base16b. There is a byte and character efficiency comparison chart in the Appendix of the specification.
Though the project is mainly aimed at encoding binary data rather than text, the logo on the main page shows four different encodings of the text "http://base16b.org".
My favourite primer on bytes/characters is by Tim Bray.
posted by henchan at 4:18 PM on August 30, 2009
Outside of the Twitter use case, isn't this pretty much covered by yEnc? It uses 8-bit Extended ASCII rather than Unicode, but it appears to have similar gains over Base64. If your Unicode ends up represented as UTF-8, it seems like you'd have very close to a wash in terms of efficiency.
Not many things that I can think of are as sensitive to "characters" (as opposed to bytes) as Twitter. In many circumstances I'd consider counting characters rather than bytes, where you allow Unicode, to be a bug rather than a feature. Of course, Twitter is so big right now that it might be all you need to gain traction and interest from users.
posted by Kadin2048 at 9:46 AM on August 31, 2009
Interesting to think of characters becoming the new bytes at higher levels of the network / application stack. I expect the comment box I am now typing into counts characters (or tries to; there is a bug in JavaScript).
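The counting problem is easy to reproduce. JavaScript's `length` property counts UTF-16 code units, so every character from planes 15 – 16 counts as two. A minimal sketch follows; the helper name `countCodePoints` and the sample string are made up for illustration.

```javascript
// Count Unicode code points rather than UTF-16 code units.
// The string iterator walks the string one code point at a time.
function countCodePoints(text) {
  return Array.from(text).length;
}

// "A", one private-use character from plane 15, then "B".
const sample = "A" + String.fromCodePoint(0xf0000) + "B";

console.log(sample.length);           // 4 (the plane-15 character is a surrogate pair)
console.log(countCodePoints(sample)); // 3
```

A comment box that relies on `length` alone would therefore charge two characters for each Base16b character.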
Assume that we are in transition from a world of no Unicode to ubiquitous Unicode. At the older end of the scale, yEnc certainly removes much of Base64's overhead. As we get closer to the assumed end point, Base16b looks to be in a better position.
Character-wise, efficiency is equivalent to number of bits. So yenc (8) is less than half as efficient as Base16b (17). I'll go ahead and add yenc to my comparison chart.
posted by henchan at 3:17 PM on August 31, 2009
"If your Unicode ends up represented as UTF-8, it seems like you'd have very close to a wash in terms of efficiency."
UTF-8 requires 32 bits to encode each Unicode character in the ranges that this proposed encoding uses (planes 15 – 16 are supplementary planes, so every code point there takes 4 bytes), and each such Unicode character carries at most 17 bits of input. So using this encoding bloats the input data by at least 88%.
Base64 only bloats it by 33% (24 bits in encodes as 32 bits out) and allows for transmission via communication channels that are about as far from 8-bit clean as can be.
This whole exercise strikes me as an attempt to work around Twitter's 140 character limitation. The point of that limitation is to let Twitter interoperate with SMS. Simple SMS messages are limited to 160 characters from a 7-bit alphabet. All of the Base64 characters are members of that alphabet. No UTF-8-encoded Unicode characters over U+007F are. So to transmit Base16b-encoded messages via SMS you'd need to double-encode them, probably as Base64, and this would pack in roughly 47% less binary data per SMS than you'd get by directly encoding your binary data as Base64.
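The overhead comparison can be checked with a short calculation. A sketch only: it assumes `TextEncoder` is available, takes U+F0000 as a representative plane-15 code point, and assumes 17 bits of payload per Base16b character (the widest variant mentioned in the thread). `overheadPercent` is a name made up here.

```javascript
// Percentage by which an encoding inflates its input.
function overheadPercent(bitsIn, bitsOut) {
  return ((bitsOut - bitsIn) / bitsIn) * 100;
}

// Measure how many bits UTF-8 spends on one plane-15 code point.
const utf8Bytes = new TextEncoder().encode(String.fromCodePoint(0xf0000)).length;
const utf8Bits = utf8Bytes * 8;

console.log(utf8Bits);                                  // 32
console.log(overheadPercent(24, 32).toFixed(0));        // 33 (Base64: 3 bytes in, 4 characters out)
console.log(overheadPercent(17, utf8Bits).toFixed(0));  // 88 (17 payload bits per 4-byte character)
```

With fewer payload bits per character, the second figure only gets worse.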
Transmitting UTF-8 requires an 8-bit channel. If you have that, you can just transmit arbitrary binary byte data over it anyway. If you have an "almost 8 bit" email-like channel where certain codes are reserved for non-printing characters, the yEnc escaping convention works better than this UTF-8 based scheme.
So I'm a thumbs-down.
posted by flabdablet at 8:28 PM on September 7, 2009
http://www.base16b.org/lib/version/0.1/js/larry.html
The hex is 546 characters, URL encoded. Base64 is 380, and Base16b (labeled Base17b on the page, BTW) is 1545. So it would add overhead, not decrease it. As a result, I'm not sure transmission over the web is the place to look for applications. Having said that, I've never seen a case of data changing while being transmitted, so maybe URL encoding isn't needed. I'd just like to see some serious tests done to prove it, I guess.
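The blow-up under URL encoding can be reproduced with the standard `encodeURIComponent` function. This is a rough sketch rather than a rerun of the test page: the sample code point U+F0000 is an assumption, not a character taken from the Base16b specification.

```javascript
// One private-use character from plane 15 (assumed sample).
const puaChar = String.fromCodePoint(0xf0000);

// Each of its 4 UTF-8 bytes becomes a 3-character %XX escape.
const escapedPua = encodeURIComponent(puaChar);
console.log(escapedPua, escapedPua.length); // %F3%B0%80%80 12

// Base64 letters and digits pass through untouched;
// only "+", "/" and "=" are escaped.
console.log(encodeURIComponent("QUJD")); // QUJD
console.log(encodeURIComponent("+/="));  // %2B%2F%3D
```

At 12 characters per escaped code point, 129 Base16b characters would come to about 1548, which is close to the 1545 reported above.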
Lossless storage, on the other hand, might be a good place to look. That GIF image is 273 bytes and Base16b has it down to less than half that size, 129 bytes.
posted by jwells at 10:25 AM on August 28, 2009