HL7 TSC review of EHR Language Display

From HL7 TSC


Many character encodings can be used for electronic communication, among them UTF-8 and ISO 8859-15 (the Wikipedia summaries of both are reproduced below). Briefly, UTF-8 uses one to four 8-bit bytes to represent every character in the Unicode character set (also summarized below), and so provides mappings for "handling ... text expressed in most of the world's writing systems." ISO 8859-15, by contrast, encodes each character in a single byte and is generally intended for “Western European” languages. (Note that for characters in the "traditional" 7-bit ASCII range, the UTF-8 and ISO 8859 encodings are identical.)
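The difference in byte width can be checked directly. The following is a minimal Python sketch, not part of the TSC discussion; the sample characters and the helper name `byte_lengths` are chosen here purely for illustration.

```python
# Compare how many bytes one character needs in UTF-8 and in ISO 8859-15.
# The sample characters are illustrative; all three exist in both encodings.
samples = ["A", "é", "€"]


def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 length, ISO 8859-15 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("iso-8859-15"))


for ch in samples:
    utf8_len, latin9_len = byte_lengths(ch)
    print(f"{ch!r}: UTF-8 = {utf8_len} byte(s), ISO 8859-15 = {latin9_len} byte(s)")

# Text limited to the 7-bit ASCII range yields identical bytes in both encodings.
ascii_text = "HL7"
assert ascii_text.encode("utf-8") == ascii_text.encode("iso-8859-15")
```

In this sketch the ASCII letter takes one byte in both encodings, while the accented letter and the euro sign take more than one byte in UTF-8 but still a single byte in ISO 8859-15.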

Recently, it was noted at the Technical Steering Committee (2/11/2013 Minutes) that the "HHS HIT Standards Committee was asked about certification for EHR language display; recommendation was ISO 8859-15 aka 'Latin 9' which has character support for all the required ISO 639 languages including direct support for the Eastern European languages and transliteration to Latin characters for e.g. cyrillic and mandarin." (The reference to "mandarin" must mean the romanization of Mandarin, since none of the primary Asian languages is listed as being "covered" by ISO 8859. Remember that a single-byte encoding offers at most 256 code points.) The further question was then posed: "What if any impact this may have on HL7 standards?"
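The coverage limit can be illustrated with a short Python sketch. The helper `fits_latin9` and the sample words are assumptions introduced here, and the romanized forms are only one possible transliteration of each word.

```python
def fits_latin9(text: str) -> bool:
    """Return True if every character of text exists in ISO 8859-15."""
    try:
        text.encode("iso-8859-15")
        return True
    except UnicodeEncodeError:
        return False


# Native spelling mapped to one possible Latin transliteration (illustrative).
examples = {
    "Москва": "Moskva",   # Cyrillic
    "北京": "Beijing",     # Chinese characters, toneless romanization
}

for native, romanized in examples.items():
    print(f"{native}: {fits_latin9(native)}  |  {romanized}: {fits_latin9(romanized)}")
```

Under these assumptions the native spellings cannot be encoded at all, and only the transliterated forms fit, which is consistent with reading "mandarin" in the minutes as romanized Mandarin.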

HL7 Specifications and Encoding

A review of HL7 specifications shows that none of them constrains the character encoding used in communication. The CDA standards recommend UTF-8. For CDA and most Version 3 specifications, HL7 produces an XML schema. Although these schemas are "published" in UTF-8, a schema cannot constrain the encoding of instances.
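The point that encoding belongs to the instance rather than to the schema can be sketched in Python with the standard library XML parser. The `<note>` element and the `serialize` helper are hypothetical and are not taken from CDA or any Version 3 schema; the sketch also assumes the parser accepts the single-byte encoding named in the declaration.

```python
import xml.etree.ElementTree as ET

text = "Café 5 €"


def serialize(value: str, encoding: str) -> bytes:
    """Build a tiny XML instance whose declaration names the given encoding."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return (declaration + f"<note>{value}</note>").encode(encoding)


utf8_doc = serialize(text, "UTF-8")
latin9_doc = serialize(text, "ISO-8859-15")

# The byte streams differ, but both instances carry the same content.
assert utf8_doc != latin9_doc
assert ET.fromstring(utf8_doc).text == ET.fromstring(latin9_doc).text == text
```

Both documents would be structurally identical to a schema validator; only the declaration and the byte representation differ.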

In early discussions about the Version 3 standards, HL7 took the position that it needed to enable all of the non-Western languages, and it has relied upon Unicode (and thereby UTF-8) to that end.

Thus, HL7 is agnostic about the encoding used when communicating with its standards.

Advisability of "certification of EHR language display" being constrained to ISO-8859

To state what may be obvious, the "display of language" has nothing to do with the encoding used to communicate it. It is reasonable to assert that communication to and from systems might be constrained to a particular encoding. Applying that constraint "inside" the systems, however, would appear to be inadvisable.
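One way to picture this separation is to treat the encoding as a property of the wire format only. The sketch below is in Python, and the function names `from_wire` and `to_wire` and the sample record are assumptions made for illustration, not part of any HL7 specification.

```python
def from_wire(payload: bytes, encoding: str) -> str:
    """Decode incoming bytes at the system boundary into Unicode text."""
    return payload.decode(encoding)


def to_wire(text: str, encoding: str) -> bytes:
    """Encode Unicode text for transmission in the agreed encoding."""
    return text.encode(encoding)


# Internally the system holds Unicode text and is not tied to any wire encoding.
record = {"name": "Œuvre Zoë", "note": "Ελληνικά"}

# The same field can be exchanged in either encoding and decodes to equal text.
latin9_payload = to_wire(record["name"], "iso-8859-15")
utf8_payload = to_wire(record["name"], "utf-8")
assert from_wire(latin9_payload, "iso-8859-15") == record["name"]
assert from_wire(utf8_payload, "utf-8") == record["name"]

# Display and storage work on the decoded text, whatever encoding it arrived in.
print(record["name"], "|", record["note"])
```

In this arrangement a constraint such as ISO 8859-15 would apply only to `to_wire` and `from_wire`, while the internal record remains free to hold text, such as the Greek note, that the wire encoding cannot represent.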

Reference Notations

Wikipedia entry for UTF-8 begins:
UTF-8 (UCS Transformation Format—8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.

Wikipedia entry for ISO/IEC 8859-15 begins:

ISO/IEC 8859-15:1999, Information technology — 8-bit single-byte coded graphic character sets — Part 15: Latin alphabet No. 9, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1999. It is informally referred to as Latin-9 (and was for a while called Latin-0). It is similar to ISO 8859-1, and thus generally intended for “Western European” languages, but replaces some less common symbols with the euro sign and some letters that were now deemed missing in part 1 for the target use.

Wikipedia entry for Unicode begins:

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.