Unicode Polytonic Greek for the Web

Unicode Polytonic Greek
for the World Wide Web

Version 0.9.7

D R A F T

Unicode Normalization Forms

How the Accents Work

Nearly all the combinations of character and diacritical mark encountered in languages using the Latin script were included in the first version of the Unicode standard, Unicode 1.0 - for example, the e with acute accent, the c with hacek, and the c with cedilla. In that first version, however, Greek characters were encoded only with the diacriticals needed for modern, monotonic Greek. Specifically, while an alpha with an acute accent was provided as a single character, an alpha with a smooth aspirate, a circumflex accent, and an iota subscript was not. However, the Unicode standard does provide special characters called combining diacriticals, which represent single diacriticals and combined diacriticals; when one of these is typed after the character it should modify, it is supposed to be rendered above or below that character rather than after it.

Keep in mind that the current typography of ancient Greek is merely traditional, and is not an accurate representation of how Greek was written in the ancient world. For instance, Homer is typeset not only in lower-case characters that didn't exist in the 8th century BCE (in letterforms which would have been all but unrecognizable to Homer), but with diacriticals that were first used in Alexandria, new characters that were first used in late 5th century Athens, word breaks that were first used in the Middle Ages, and without the digamma, which the meter clearly indicates was present when the text was composed.

Beginning with Version 2.0 of the Unicode standard, and continuing through all later versions, provision has been made for polytonic Greek using additional precomposed characters in a new "Greek Extended" character block, in which a single character code represents not only the letter itself but all the diacriticals that modify it (including iota subscripts). Some Unicode fonts for reading Greek support combining diacriticals but not the precomposed characters (most notably, Lucida Sans Unicode and the ClearlyU Unicode font in XFree86 4.0); these display "unrecognized" characters (which usually appear as empty boxes or question marks) when they encounter precomposed characters. In Unicode fonts which do support precomposed characters, there are often flaws in the execution of the combining diacriticals that can lead to surprising results when a reader tries to view a page with combining diacriticals (for instance, Palatino Linotype displays artifact characters where zero-width characters are supposed to be; older versions of Athena Unicode would crash - a problem which has recently been fixed - and other fonts display "unrecognized" characters instead of the combination of glyphs). Of the fonts I have tested, only Code 2000, Arial Unicode MS, and the new version of Athena Unicode consistently handle both methods of typesetting Greek correctly.

As described above, the Unicode standard provides special characters called combining diacriticals, which represent single diacriticals and which, when placed after the character they modify in the logical order of the file (which is not necessarily the order in which the characters are typed on a keyboard), are supposed to be rendered above or below that character rather than after it.

Unfortunately, even the best implementation of combining diacriticals cannot place the accents where they are supposed to go without a powerful font rendering engine. The font rendering engines used in current operating systems do not yet have good support for the placement of accents in Greek text. Designers of standard TrueType fonts with combining diacriticals must therefore choose a single position in which each combining diacritical will appear relative to the base character and any other diacriticals, regardless of what the character is or how many diacriticals are combined. For instance, one must place the iota subscript beneath the center of the vowel, even though this is the wrong position when the iota subscript is used with the eta. This limitation also results in the overlapping accents one sees when using Arial Unicode MS to read Greek texts on the Perseus website (with combining diacriticals).

OpenType technology provides a mechanism to overcome this limitation: in theory, an OpenType font used with an OpenType-capable font rendering engine can place, say, a combining iota subscript under the middle of an alpha but under the left descender of an eta. Unfortunately, OpenType is as yet only imperfectly implemented, and the first serious OpenType font, Palatino Linotype, does not work with Greek combining diacriticals in a web browser.

The current effect of these complexities on the publication of simple web pages is that there are two methods of using Unicode to encode ancient Greek which are barely compatible with one another. For display purposes, one must choose between combining diacriticals, which will work in three of the tested fonts (Arial Unicode MS, Lucida Sans Unicode, and Code 2000), and precomposed characters, which will work in seven of the tested fonts (Athena, Arial Unicode MS, Palatino Linotype, Code 2000, Georgia Greek, Vusillus Old Style Italic, and Titus Cyberbit). And because there are issues with the implementation of combining diacriticals on Linux, one must choose which audience to lose: those who use Linux and have serious difficulties reading the combining diacriticals, or those who choose not to download one of the free fonts that can display precomposed characters.

The most obvious, and potentially hazardous, consequence of this double encoding is that the same character grouping can be encoded in a number of ways: for instance, alpha with a smooth aspirate, a circumflex accent, and an iota subscript can be encoded as:

  1. 1 character: alpha with smooth aspirate, circumflex, and iota subscript
  2. 4 characters: 1.) alpha, 2.) combining smooth aspirate, 3.) combining circumflex, 4.) combining iota subscript.
  3. 3 characters: 1.) alpha with smooth aspirate, 2.) combining circumflex, 3.) combining iota subscript
  4. 3 characters: 1.) alpha with circumflex, 2.) combining smooth aspirate, 3.) combining iota subscript
  5. 3 characters: 1.) alpha with iota subscript, 2.) combining smooth aspirate, 3.) combining circumflex
and so forth. While this might not sound like a problem, most computer programs used for searching, frequency analysis, etc., compare text byte by byte, not by character equivalence. In other words, most programs are not aware that the five alternatives represented above are equivalent, and will not match the fifth alternative if a user searches for the first.
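The byte-versus-equivalence problem is easy to demonstrate. The sketch below uses Python's unicodedata module (my choice of tool here, not anything prescribed above) to compare three of the encodings of the alpha with smooth aspirate, circumflex, and iota subscript: the naive string comparison fails, while comparison after normalization succeeds.

```python
import unicodedata

# Alternative 1: one precomposed character (U+1F86)
precomposed = "\u1f86"
# Alternative 2: four characters, alpha plus three combining marks
decomposed = "\u03b1\u0313\u0342\u0345"
# Alternative 3: alpha-with-smooth-aspirate plus two combining marks
partial = "\u1f00\u0342\u0345"

# Byte-for-byte comparison does not see the equivalence:
print(precomposed == decomposed)  # False
print(precomposed == partial)     # False

# After normalizing all three to the same form, they match:
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(precomposed) == nfc(decomposed) == nfc(partial))  # True
```

One caveat worth noting: Unicode's canonical reordering only reorders marks of different combining classes, and the aspirate and the accent share a class, so a sequence in which the accent is typed before the aspirate (as in the fourth alternative above) will not normalize to the same string as the others.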

To resolve these issues, Unicode-aware programs need to include decomposition mechanisms that will decompose precomposed characters into their "canonical decompositions," so that, for instance, an alpha with a smooth aspirate, circumflex, and iota subscript is always represented as an alpha character followed by a combining smooth aspirate, a combining circumflex accent, and a combining iota subscript. This process of making sure that the same character combinations are always represented in the same way is called normalization, and the formats that result are called normalization forms. The Unicode Standard defines four standardized Normalization Forms: Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KC (NFKC), and Normalization Form KD (NFKD). More detailed information on these Normalization Forms can be found on the Unicode website in the normalization FAQ and Unicode Technical Report #15: Unicode Normalization Forms.

Of the four normalization forms, NFKC and NFKD are not appropriate for our purposes; they decompose characters too far and lose important information. The remaining two, NFC and NFD, are already widely in use on the World Wide Web.
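The information loss in the compatibility forms can be seen with a small example (again using Python's unicodedata module, my choice of tool; the example character is drawn from the scientific-symbol range rather than polytonic Greek, but the principle is the same). The micro sign and the Greek small letter mu are distinct characters; NFC preserves the distinction, while NFKC folds it away irreversibly:

```python
import unicodedata

micro = "\u00b5"  # MICRO SIGN
mu = "\u03bc"     # GREEK SMALL LETTER MU

# NFC leaves the micro sign alone; the distinction survives:
print(unicodedata.normalize("NFC", micro) == mu)   # False

# NFKC replaces the micro sign with the ordinary mu; the distinction is lost:
print(unicodedata.normalize("NFKC", micro) == mu)  # True
```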

Normalization Form C

Henceforth, when I refer to the use of "precomposed characters" I mean the character codes defined in Normalization Form C of the Unicode standard; this includes all the characters in the Greek character set, plus the unique characters in the "Greek Extended" set (though I find this name at best undescriptive). Note too that there are "canonical decompositions" of these precomposed characters: each precomposed character is defined as equivalent to one sequence of alphabetic character plus combining diacriticals, in a specific order. For more complex tasks than mere display, the canonical decompositions are preferable; thus far, however, the usual course has been to use beta code for data storage rather than Unicode. Perhaps in an ideal world the canonically decomposed Unicode form would be used for storage and in the HTTP response sent to the browser, and the browser would then normalize that response for display as a page, but so far as I know no one has yet managed this for ancient Greek.
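The canonical decomposition of any precomposed character can be inspected directly; the sketch below (using Python's unicodedata module, my choice of tool) lists the decomposition of the alpha with smooth aspirate, circumflex, and iota subscript:

```python
import unicodedata

# Decompose U+1F86 and list each resulting character with its Unicode name.
for ch in unicodedata.normalize("NFD", "\u1f86"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Prints:
# U+03B1  GREEK SMALL LETTER ALPHA
# U+0313  COMBINING COMMA ABOVE
# U+0342  COMBINING GREEK PERISPOMENI
# U+0345  COMBINING GREEK YPOGEGRAMMENI
```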

Normalization Form D

In NFD, Greek characters are stored using the basic Greek and combining diacriticals ranges of the Unicode standard, with the combining marks in canonical order.
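The "canonical order" is determined by each mark's combining class: normalization sorts adjacent marks of different classes into a fixed order. A small sketch (Python's unicodedata module again, my choice of tool) shows an out-of-order sequence being repaired by NFD:

```python
import unicodedata

# The smooth aspirate (U+0313) has combining class 230; the iota
# subscript (U+0345) has combining class 240.
print(unicodedata.combining("\u0313"))  # 230
print(unicodedata.combining("\u0345"))  # 240

# Typed with the iota subscript first, the sequence is out of canonical order;
# NFD reorders the marks so the lower class comes first.
out_of_order = "\u03b1\u0345\u0313"
canonical = "\u03b1\u0313\u0345"
print(unicodedata.normalize("NFD", out_of_order) == canonical)  # True
```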

There are good arguments in favor of combining diacriticals.

Note that the definition of Normalization Form C also specifies the canonical decompositions used for comparing text strings; for applications, rather than mere published pages, this decomposed form would be preferable.

What To Do For the Web

The World Wide Web Consortium recommends that XML and HTML documents for publication on the World Wide Web use Normalization Form C (which uses precomposed characters for Greek) rather than Normalization Form D (which uses combining diacriticals for Greek).
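Following that recommendation in practice means running text through an NFC normalization step before publishing it. A minimal sketch in Python (my choice of language; the function name is mine, not anything defined by the W3C):

```python
import unicodedata

def normalize_for_web(text: str) -> str:
    """Return text in Normalization Form C, as recommended for web documents."""
    return unicodedata.normalize("NFC", text)

# A decomposed sequence is converted to the single precomposed character:
decomposed = "\u03b1\u0313\u0342\u0345"  # alpha + three combining marks
print(normalize_for_web(decomposed) == "\u1f86")  # True
```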

Current user issues with precomposed characters can be resolved more easily - the faults they expose in fonts are limited to display problems, while those exposed by combining diacriticals are often harder to recover from.

In an ideal world in which all Unicode tools were capable of handling combining diacriticals, I would agree that combining diacriticals would be the best solution; in an ideal world, I could mark a document up in TEI/XML with Unicode 3, add an XSL stylesheet, and readers could view it without having to download a font or a new web browser. Alas, this is not an ideal world, and compromises must be made. The purpose of the web is to provide a single cross-platform resource for the sharing of documents; until all current platforms are capable of displaying polytonic Greek using Unicode combining diacriticals, combining diacriticals cannot be described as a cross-platform solution, but precomposed characters can (insofar as a method which excludes older operating systems, like Mac OS 8.0 and Windows 3.11, can be called "cross-platform"). Those who are able should consider this a challenge: to resolve the Unicode display issues in Linux and to provide open-source Greek fonts that have workable implementations of combining diacriticals while also preserving the Greek Extended precomposed forms for compatibility purposes.

Linux and Normalization Form D

For issues with Unicode functionality in Linux (which should more properly be discussed in the Linux appendix, but since that remains to be written . . .), see Markus Kuhn's discussion of Unicode Normalization forms and precomposed characters and their implementations in Linux and X, respectively, in his FAQ for Unicode and UTF-8 in Linux.

Full Unicode functionality with all bells and whistles can only be expected from sophisticated multi-lingual word-processing packages. What Linux will use on a broad base to replace ASCII and the other 8-bit character sets is far simpler. Linux terminal emulators and command line tools will in the first step only switch to UTF-8. This means that only a Level 1 implementation of ISO 10646-1 is used (no combining characters), and only scripts such as Latin, Greek, Cyrillic, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we have now thousands of different characters available, and that characters can be represented by multibyte sequences.

Combining characters will also be supported under Linux eventually, but even then the precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.

[. . . .] Combining characters: The X11 specification does not support combining characters in any way. The font information lacks the data necessary to perform high-quality automatic accent placement (as it is found for example in all TeX fonts). Various people have experimented with implementing simplest overstriking combining characters using zero-width characters with ink on the left side of the origin, but details of how to do this exactly are unspecified (e.g., are zero-width characters allowed in CharCell and Monospaced fonts?) and this is therefore not yet widely established practice.

See the Technical Report cited by Kuhn and the accompanying Normalization Chart for Greek (thanks to Peter Constable for pointing these out), which provide a thorough discussion of normalization and canonical decomposition and a mapping of the complexity of the preferred and compatibility forms for each Greek glyph.

At this time, the Suda On Line, BMCR, and the Perseus Digital Library all use a display script (written by the Perseus team) which allows readers to view either precomposed characters or combining diacriticals.


 Unicode Polytonic Greek for the World Wide Web Version 0.9.7
 Copyright © 1998-2002 Patrick Rourke. All rights reserved.
D R A F T - Under Development
 Please do not treat this as a published work until it is finished!