Sunday, September 29, 2013

You're gonna need a bigger field (not) ... Radix 64 Revisited

Summary: It is easy to fit a long number in a short string field by transcoding it to use more (printable) characters; the question is what encoding to use; there are more alternatives than you might think, but Base64 is the pragmatic choice.

Long Version.

Every now and then the subject of how to fit numeric SNOMED Concept IDs (defined by the SNOMED SCTID Data Type) into a DICOM representation comes up. These can be up to 18 decimal digits (and fit into a signed or unsigned 64 bit binary integer), whereas in DICOM, the Code Value has an SH (Short String) Value Representation (VR), hence is limited to 16 characters.

Harry Solomon suggested "Base64" encoding it, either always, or on those few occasions when the Concept ID really was too long (and then using a "prefix" to the value to recognize it).

The need arises because DICOM has always used the "old fashioned" SNOMED-RT style SnomedID values (like "T-A0100" for "Brain") rather than the SNOMED-CT style SNOMED Concept ID values (like "12738006"). DICOM was a relatively "early adopter" of SNOMED, and the numeric form did not exist in the early days (prior to the incorporation of the UK Read Codes that resulted in SNOMED-CT). Fortunately, SNOMED continues to issue the older style codes; unfortunately, folks outside the DICOM realm may need to use the newer style, and so converting at the boundary is irritating (and needs a dictionary, unless we transmit both). The negative impact on the installed base that depends on recognizing the old-style codes, were we to "change", is a subject for another day; herein I want to address only how it could be done.

Stuffing long numbers into short strings is a generic problem, not confined to using SNOMED ConceptIDs in DICOM. Indeed, this post was triggered as a result of pondering another use case, stuffing long numbers into Accession Number (also SH VR). So I thought I would implement this to see how well it worked. It turns out that there are a few choices to be made.

My first pass at this was to see if there was something already in the standard Java class library that supported conversion of arbitrary length base10 encoded integers into some other radix; I did not want to be constrained to only handling 64 bit integers.

It seemed logical to look at the arbitrary length numeric java.math.BigInteger class, and indeed it has a radix argument to its String constructor and toString() methods. It also has constructors based on two's-complement binary representations in byte[] arrays. Sounded like a no brainer.

Aargh! It turns out that BigInteger has an implementation limit on the size of the radix that it will handle. The maximum radix is 36 (the 10 digits plus 26 lowercase alphabetic characters that is the limit for java.lang.Character.MAX_RADIX). Bummer.

OK, I thought, I will hand write it, by doing successive divisions by the radix in BigInteger, and character encoding the modulus, accumulating the resulting characters in the correct order. Turned out to be pretty trivial.

Then I realized that I now had to choose which characters to select beyond the 36 that Java uses. At which point I noticed that BigInteger uses completely different characters than the traditional "Base64" encoding. "Base64" is the encoding used by folks who do anything that depends on MIME content encoding (email attachments or XML files with embedded binary payloads), as is defined in RFC 2045. Indeed, there are variants on "Base64" that handle situations where the two characters for 62 and 63 (normally '+' and '/' respectively) are problematic, e.g., in URLs (RFC 4648). Indeed RFC 4648 seems to be the most current definition of not only "Base64" and variants, but also "Base32" and "Base16" and so-called "extended hex" variants of them.

If you think about it, based on the long-standing hexadecimal representation convention that uses characters '0' to '9' for numeric values [0,9], then characters 'a' to 'f' for numeric values [10,15], it is pretty peculiar that "Base64" uses capital letters 'A' to 'J' for numeric values [0,9], and uses the characters '0' to '9' to represent numeric values [52,61]. Positively unnatural, one might say.

This is what triggered my dilemma with the built-in methods of the Java BigInteger. BigInteger returns strings that are a natural progression from the traditional hexadecimal representation, and indeed for a radix of 16 or a radix of 32, the values match those from the RFC 4648 "base16" and "base32hex" (as distinct from "base32") representations. Notably, RFC 4648 does NOT define a "base64hex" alternative to "base64", which is a bit disappointing.

It turns out that a long time ago (1992) in a galaxy far, far away, this was the subject of a discussion between Phil Zimmerman (of PGP fame), and Marshall Rose and Ned Freed on the MIME working group mailing list, in which Phil noticed this discrepancy and proposed it be changed. His suggestion was rejected on the grounds that it would not improve functionality and would threaten the installed base, and was made at a relatively late stage in development of the "standard". The choice of the encoding apparently traces back to the Privacy Enhanced Mail (PEM) RFC 989 from 1987. I dare say there was no love lost between Phil and the PEM/S-MIME folks, given that they were developers of competing methods for secure email, but you can read the exchange yourself and make up your own mind.

So I dug a little deeper, and it turns out that The Open Group Base (IEEE Std 1003.1) (POSIX, Single Unix Specification) has a definition for how to encode radix 64 numbers as ASCII characters too, in the specification of the a64l() and l64a() functions, which uses '.' (dot) for 0, '/' for 1, '0' through '9' for [2,11], 'A' through 'Z' for [12,37], and 'a' through 'z' for [38,63]. Note that is this is not part of the C standard library.

An early attempt at stuffing binary stuff into printable characters was used by the "uuencode" utility used in Unix-to-Unix copy (UUCP) implementations, such as was once used for mail transfer. It used the expedient of adding the 32 (the US-ASCII space character) to the 6 bit (base 64) numeric value and came up with a range of printable characters.

Of course, from the perspective of stuffing a long decimal value into a short string and making it fit, it doesn't matter which character representation is chosen, as long as it is valid. E.g., a 64 bit unsigned integer that has a maximum value of 18,446,744,073,709,551,615, which is 20 digits, is only 11 characters long when encoded with a radix of 65, regardless of the character choices.

For your interest, here is what each of the choices described above looks like, for single numeric values [0,63], and for the maximum unsigned 64 bit integer value:

Extension of Java and base16hex to hypothetical "base64hex":
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z : _
f__________


Unix a64l:
 . / 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
Dzzzzzzzzzz

Base64 (RFC 2045):
 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 + /
P//////////


uuencode (note that space is the first character):
   ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
/__________


Returning to DICOM then, the choice of what to use for a Short String (SH) VR is constrained to be any US-ASCII (ISO IR 6) character that is not a backslash (used as a value delimiter in DICOM) and not a control character. This would exclude the uuencode representation, since it contains a backslash, but any of the other choices would produce valid strings. The SH VR is case-preserving, which is a prerequisite for all of the choices other than uuencode. Were that not to be the case, we would need to define yet another encoding that was both case-insensitive and did not contain the backslash character. I can't thank of a use for packing numeric values into the Code String (CS) VR, the only upper-case only DICOM VR.

The more elegant choice in my opinion would be the hypothetical "base64hex", for the reasons Phil Z eloquently expressed, but ...

Pragmatically speaking, since RFC 989/1113/2045/4648-style "Base64" coding is so ubiquitous these days for bulk binary payloads, it would make no sense at all to buck that trend.

Just to push the limits though, if one uses all 94 printable US-ASCII characters except backslash, one can squeeze the largest unsigned 64 bit integer into 10 rather than 11 characters. However, for the 18 decimal digit longest SNOMED Concept ID, the length of the result is the same whether one uses a radix of 64 or 94, still 10 characters.

David


3 comments:

Anonymous said...

Interesting discussion. You're missing the ASCII85 (Base 85) encoding that is widely used within the PostScript language and uses 85 different characters to represent 6.5 bits per character (4 bytes are encoded as 5 characters).

David Clunie said...
This comment has been removed by the author.
David Clunie said...

Fixed bad link in previous comment response ...

Thanks for pointing that out ... there is a summary on this wiki page on Ascii85, which makes reference to another helpful wiki page on Binary-to-text encoding.

Ascii85 uses the backslash character in its representation, which is a problem. The RFC1924 variant does not use a backslash though, and could work.

Using a radix of 85 doesn't seem to offer that much of a length advantage over using a radix of 64 (i.e., a 64 bit integer does fit in 10 rather than 11 characters, but an 18 digit decimal number uses 10 either way), and the choice of 85 to make 4 input bytes map to exactly 5 output bytes is not really a factor in our use case.

Anyway, good to know that there are yet more choices :)