The Unmoroniser: using Unicode in the Demoroniser

This page explains why and how the Demoroniser, a popular script for extracting HTML from a subversive character encoding, should produce character entities instead of ASCII. It introduces a replacement, called the Unmoroniser.

If you are angered by the idea of a text utility not producing plaintext, please at least skim the argument below. But if you’re not looking for trouble and just want the Unmoroniser already, go ahead and snaffle it.

Charlie Loyd, 2003–2005

The save as web page function of several Microsoft products produces a singularly repugnant dialect of pseudo-HTML, which John Walker wrote demoroniser.pl to fix. However, where the Demoroniser translates Microsoft’s invalid character representations (lines 96 through 122 in the January 1998 version), it flattens them to ASCII instead of to valid representations of the mis-addressed characters. That is the wrong behavior.

We will consider three representative characters: the capital o-e ligature (Œ), the function symbol (ƒ), and the em dash (—). The Demoroniser’s mappings are:

\x8C → Oe
\x83 → <em>f</em>
\x97 → --

Someone taking this as the best possible translation may come to think that the characters that Microsoft is trying to represent are pathological. In fact, people have spelled with them for hundreds of years, and they are still in good standing among printers. Nothing is wrong with the characters. It’s Microsoft’s representations that must be replaced with something better.

This philosophical problem leads to three problems with the Demoroniser: First, by making the lossy translation to ASCII, it mistakes bad references (bytes) for bad referents (abstract characters). Second, it maps pairs of distinct inputs (Œ, Oe) to single outputs (Oe), thus losing information. Third, it absurdly implies that good typography is something Microsoft invented to be annoying.
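
The second problem is easy to see in a few lines of Perl. This is only a sketch: flatten_to_ascii is a name made up for illustration, and it contains just the three mappings quoted above rather than the Demoroniser’s full list.

```perl
use strict;
use warnings;

# Hypothetical helper: the three ASCII-flattening substitutions quoted above,
# applied to a string of Windows-1252 bytes.
sub flatten_to_ascii {
    my ($s) = @_;
    $s =~ s/\x8C/Oe/g;           # capital o-e ligature
    $s =~ s/\x83/<em>f<\/em>/g;  # function symbol
    $s =~ s/\x97/--/g;           # em dash
    return $s;
}

my $with_ligature    = "\x8Cuvres";  # "Œuvres" as Windows-1252 bytes
my $without_ligature = "Oeuvres";    # plain ASCII spelling

# Two distinct inputs produce one output, so the original cannot be recovered.
print flatten_to_ascii($with_ligature) eq flatten_to_ascii($without_ligature)
    ? "collision\n"
    : "distinct\n";
```

Once both spellings have become Oe, nothing downstream can tell which one the writer meant.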

We could avoid these altogether and address the cause if we had our own standard, non–Microsoft-owned character set to work with. And we do: Unicode!

Yes, I’m talking to you, smug 7-bit Unix fans. The belief that ASCII is God’s best invention is a senseless extension of the far more defensible idea that text interfaces are useful. It shows a troubling lack of imagination to talk as though Unix is tied to a rather crude old character set. The most popular Unicode encoding, UTF-8, was invented by Ken Thompson and Rob Pike and is supported by all modern distros. You can use it in the shell and everything. No, really. Please stick around for a few minutes and consider that it might be both useful and practical, especially in something as high-level as HTML.

Step back and look at ASCII. 95 printable characters can’t represent much. Of the major natural languages, it can only be used to properly print Latin, English, and Swahili. If you can think of another language at home in ASCII, look up the history of its orthography and you may well find that its inventors openly had to account for the limits of typewriters and early computers. Take our programming languages, which resort to hacks like && for conjunction and * for multiplication – a regress from the compact formal notations of logic and math. We are in a strange inversion when our orthography constrains the invention of new ways of writing.

Character sets need to provide a little more than portability: they need to provide characters. Perhaps two billion computers support ASCII, but if ASCII doesn’t support the character I need, it’s disqualified; its utility to me is two billion multiplied by zero.

It’s not like computers are stuck with ASCII. It’s a social thing. A lot of smart programmers, intent on writing things that handle strings, haven’t recalibrated their assumptions since Unicode started making sense a decade ago. But the good word is seeping in, slowly, mostly through the influence of the web.

HTML is absolutely and uncompromisably internationalized, and it has been for most of its life. Every Unicode character can be represented in legitimate, portable, validating HTML – even stored or transmitted as ASCII. Thus HTML has no problems transmitting any text in any living written language, and Œ, ƒ, and — are trivial. Each of those characters (Œ, for example) is expressible in at least four ways: as a semi-readable entity (&OElig;), as a decimal Unicode entity (&#338;), as its hexadecimal equivalent (&#x152;), and as a literal in the UTF-8 character encoding. Unicode is not just some outside goop you can persuade HTML to carry – it’s part of the standard.
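
As a rough illustration, the four forms can be generated from a single code point. The helper name representations is invented for this sketch, and the entity name is passed in by hand rather than looked up; Encode is a core Perl module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build the four representations listed above for one
# code point. The entity name is supplied by the caller, not looked up.
sub representations {
    my ($code_point, $entity_name) = @_;
    return (
        named   => "&$entity_name;",
        decimal => sprintf('&#%d;',  $code_point),
        hex     => sprintf('&#x%X;', $code_point),
        utf8    => encode('UTF-8', chr($code_point)),  # raw UTF-8 bytes
    );
}

my %r = representations(0x152, 'OElig');  # capital o-e ligature
printf "%-8s %s\n", $_, $r{$_} for qw(named decimal hex);

# Show the UTF-8 form as hex bytes so it survives any terminal.
printf "%-8s %s\n", 'utf8',
    join ' ', map { sprintf '%02X', ord($_) } split //, $r{utf8};
```

The first three forms are pure ASCII, which is why a page using them can be stored or transmitted over a 7-bit channel without losing the character.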

Therefore, any browser which cannot render Œ, ƒ, and — is a broken browser. The confusing part to technically-minded people is that we’re the ones most likely to compose HTML in a broken browser: I’m writing this in Emacs, which fails to display &fnof; as ƒ. Of course it doesn’t claim to be a browser (in this mode), but that can be hard to remember. One has to accept that for a relatively abstract markup language, the most literal view is not the intended one. It’s for the same reasons that Perl scripts shouldn’t include object code inline, even though that’s what Perl really is. HTML is for reading, not writing. By writing it (and thus reading it) as textfiles, I can use it more expressively, but I have to remember that in a sense this 80×24 white-on-black rendering is the illusion, the abstraction, the convenience.

It’s frustrating, to say the least. It seems like telnet example.com 80 should be the definitive HTML experience, and browsers should be for people with too much time and not enough attention. After all, HTML is defined as text, and browsers vary widely. But they vary because HTML is defined as text-to-meaning translations, not text-to-display translations. HTML is not PDF, SVG, or TIFF. HTML is semantic, and differing renderings of one page may be equally valid, because the markup is meant to be expressed within limitations and preferences. A screen-reader for the blind has no way to tell whether the <i> tag means a whisper or a shout, VGA displays can’t do much with <font face=fantasy>, and how is an ordinary laser-printer supposed to handle bgcolor=#ff0000?

We have to let renderers do the rendering, because to send HTML at all is to presume that between the server and the reader is a browser. Every browser is there to do no less than render HTML, and sending it a half-digested mishmash of ASCII, HTML, and <font> tags is extra work on your end for an uglier page on the reader’s end. Good HTML makes good pages.

If you have a high opinion of your graphic design skills and think you can do better than an unaided browser, don’t stop using good HTML; start using good CSS as well. With CSS, you can add meaning-to-display translations that specify parameters down to single pixels and 24-bit color, with conditional statements to adjust rendering for screen display, aural systems, printers, and what-have-you. But whether or not you tune it with CSS, please let HTML handle elemental structures like references, sequences, and characters. Yes, it will be redundant. Hypertext doesn’t make crisp and linear text files for the same reason that sculpture makes boring photographs. There’s no shame in HTML not being simple like ASCII, because HTML is much richer than ASCII.

HTML should carry more than any one browser should render!

Graphical browsers are not supposed to display the alt attribute of images, for instance, because it would be silly and redundant. However, it’s to everyone’s advantage that there is redundancy in the data. When the input can be assumed to have certain redundancies, and it’s finally time to give it to the user, it is positively responsible to do exactly what the Demoroniser shouldn’t: to leave out some information and to map some distinct things to one final form. This is the fundamental difference in philosophy between plain text and HTML – even more important than hyper-reference. Plain text can cite far-off resources, but it cannot provide more context and intelligence than the user sees. HTML can, and in well-written HTML it’s often a symptom of strength that not all the tags and attributes will render in any one browser – just as a program doesn’t have to call all its functions every time it runs to be considered good.

Browsers have to be able to handle things in their own ways, because an HTML file is not a page: it is a construct from which to make a page.

HTML is more like a recipe than like a cake. A recipe should call for the proper ingredients and, at most, suggest cheaper and more convenient substitutes. If friends ask how to make your cake, you don’t give them a simplified recipe that calls for flour instead of confectioner’s flour, peanuts instead of walnuts, and salt instead of cinnamon on the assumption that those ingredients are easier or more compatible. No, you give them the real recipe, because it’s what they asked for. They’ll probably make some changes in the kitchen, but you can assume that they know best. The HTML writer shouldn’t have to second-guess the browser. When browsers say Accept: text/html, you believe them. Even if they can’t follow every detail of the recipe, they’ll do their best, and you shouldn’t presume to dumb it down for them.

Sometimes you mean Unicode characters. If I want to use a real apostrophe, or to spell façade, or to quote Plato and Lao Tzu directly, or to say I ♥ NY, or to write any but the simplest mathematical expressions, I mean Unicode characters. More importantly, when Tibetans want to write to other Tibetans, they mean Unicode characters. When someone named José writes his own name, he means Unicode characters. Asking everyone to use ASCII, or to mode-switch between a plethora of 7- and 8-bit character sets, is culturally recondite, typographically repugnant, and technically unnecessary. Natural language text should never have been OS-dependent.

Unicode is the way things should have been for years. By and large, it is the way things have been for years. If support for it is sketchy here and there, it’s still better than perfect support for ASCII, which cannot show the vast majority of things humans write.

Is Unicode support really that sketchy? Visit this page in Lynx (I used version 2.8.4rel1). Ah, well, Lynx is a no-nonsense browser, and we could hardly expect it to support something as frivolous and idealistic as Unicode: but lo, the page looks fine. Lynx renders Œ, ƒ, and — as OE, f, and --, and that’s that. It even recognizes &#x2014;, the scary-looking hexadecimal entity for —.

(If you’re laughing at my talk of wide support because your browser shows nothing or nonsense for the weird characters, I’m telling you that your browser or fonts are broken. You have no more right to complain in the year 2005 than people with 16-bit CPUs do that no one provides binary software packages for them. There’s been a de-facto, de-jure standard of great utility for years now, and if your software doesn’t support it, no blame is shed on Unicode or the people who use it — they have technical correctness and utility and majority.)

In summary: to flatten everything to ASCII in HTML is not cautious, it’s careless. There’s a Unicode point for every character you need, and browsers are perfectly capable of doing their own flattening if they need to. If they don’t need to, you may be leaving out integral information. Oe, f, --, and all similar kludges are lossy, misleading, and bad HTML.
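
To make the division of labor concrete, here is a toy renderer-side fallback. It is not how Lynx or any real browser is implemented: render_for_ascii_terminal and its three-entry table are assumptions made up for this sketch, chosen to mirror the Lynx output described above. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Illustrative fallback table for an ASCII-only display. The three entries
# mirror the Lynx output described above and nothing more.
my %ascii_fallback = (
    0x0152 => 'OE',  # capital o-e ligature
    0x0192 => 'f',   # function symbol
    0x2014 => '--',  # em dash
);

# Hypothetical renderer step: resolve decimal and hexadecimal character
# references, substituting an ASCII fallback where the table has one and
# a question mark where it does not.
sub render_for_ascii_terminal {
    my ($html) = @_;
    $html =~ s{&#(x[0-9A-Fa-f]+|[0-9]+);}{
        my $ref = $1;
        my $cp  = $ref =~ /^x(.+)/ ? hex($1) : $ref;
        $cp < 128 ? chr($cp) : ($ascii_fallback{$cp} // '?');
    }ge;
    return $html;
}

print render_for_ascii_terminal('&#338;uvres &#x2014; &#402;(x)'), "\n";
# prints: OEuvres -- f(x)
```

The flattening happens at the last possible moment, on the reader’s side, while the HTML itself still carries the real characters for any browser that can show them.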

Use the character you mean.

Improved HTML to HTML

These lines can be painted right over the Demoroniser’s character translations to make them Unicode-ish (and to provide a few extra mappings). They are extracted at this page’s runtime from the Unmoroniser, my Unicode-hep fork of the Demoroniser, so they’re unescaped and can be copied and pasted directly from the page source. (But you may as well just download it complete.)

$s =~ s/\x80/€/g; # Euro currency symbol (looks like e)
$s =~ s/\x82/‚/g; # single low open quote (looks like ,)
$s =~ s/\x83/ƒ/g; # function, folder, and florin symbol (looks like f)
$s =~ s/\x84/„/g; # double low open quote (looks like ,,)
$s =~ s/\x85/…/g; # horizontal ellipsis (looks like ...)
$s =~ s/\x86/†/g; # dagger symbol (death or second footnote)
$s =~ s/\x87/‡/g; # double dagger symbol (third footnote)
$s =~ s/\x88/ˆ/g; # empty circumflex accent (looks like ^)
$s =~ s/\x89/‰/g; # per-thousand symbol (looks like %0)
$s =~ s/\x8a/Š/g; # capital s with caron (looks like S + v)
$s =~ s/\x8b/‹/g; # left single angle quote (looks like less-than)
$s =~ s/\x8c/Œ/g; # capital o-e ligature (looks like Oe)
$s =~ s/\x8e/Ž/g; # capital z with caron (looks like Z + v)
$s =~ s/\x91/‘/g; # left single quote (looks like `)
$s =~ s/\x92/’/g; # right single quote (looks like ')
$s =~ s/\x93/“/g; # left double quote (looks like ``)
$s =~ s/\x94/”/g; # right double quote (looks like ")
$s =~ s/\x95/•/g; # bullet (dot for lists)
$s =~ s/\x96/–/g; # en dash (looks like -)
$s =~ s/\x97/—/g; # em dash (looks like --)
$s =~ s/\x98/˜/g; # small tilde (looks like ~)
$s =~ s/\x99/™/g; # trademark symbol (looks like TM)
$s =~ s/\x9a/š/g; # lowercase s with caron (looks like s + v)
$s =~ s/\x9b/›/g; # right single angle quote (looks like greater-than)
$s =~ s/\x9c/œ/g; # lowercase o-e ligature (looks like oe)
$s =~ s/\x9e/ž/g; # lowercase z with caron (looks like z + v)
$s =~ s/\x9f/Ÿ/g; # capital y with diaeresis or umlaut (looks like Y + ")
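
If you would rather keep your script pure ASCII, the same 27 mappings can be written as a lookup table that emits hexadecimal character references. This is a variation of my own for illustration, not the Unmoroniser’s actual code; the names %cp1252 and bytes_to_references are invented here. The code points are the standard Windows-1252 assignments for those bytes.

```perl
use strict;
use warnings;

# Windows-1252 byte => Unicode code point, for the same 27 bytes as above.
my %cp1252 = (
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
);

# Hypothetical helper: replace each mapped byte with a hexadecimal character
# reference in a single pass. Bytes with no entry are left untouched.
sub bytes_to_references {
    my ($s) = @_;
    $s =~ s{([\x80-\x9F])}{
        my $byte = ord($1);
        exists $cp1252{$byte} ? sprintf('&#x%X;', $cp1252{$byte}) : $1;
    }ge;
    return $s;
}

print bytes_to_references("\x93Hi\x94 \x97 \x8Cuvres"), "\n";
# prints: &#x201C;Hi&#x201D; &#x2014; &#x152;uvres
```

The output is plain ASCII on the wire and still names the right character, at the cost of being less readable in the page source than named entities.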

If you have additions, subtractions, changes, or quarrels, go ahead and write me. These modifications are in the public domain, so you can pick and choose and mix your own.

Notes and apologies

I don’t mean to personally fault John Walker (or anyone else) for not thinking about Unicode when working on pragmatic things like the Demoroniser. Ideally, no one would ever have to; we’d take Unicode for granted and cheerfully accept things like perl variables called $¶, the URI http://alþing.is/þjóð/, and people named Zoë. That’s several years away; until then, digital writing is half-mapped and unsettled, and scripts like these will be subject to taste, because they’re on the frontier. The Unmoroniser is an interpretation of the standards filtered through my esthetics, period. All I promise is something better than ASCII.

My use of the word Unicode is sloppy here. Sometimes I mean the character database proper, but more often I mean encoding schemes like XML character entities or UTF-8, which simply address that database. Likewise, but more arguably, you should probably read XHTML for most instances of HTML.

Transcription methods disagree about whether a word like Œstrus, when capitalized in the original, should be flattened to Oestrus (lossy) or OEstrus (ugly). I prefer the former, with Walker, against Lynx and the entity name.

Walker says ASCII already contains apostrophe and single open quote characters. This is not false; the ASCII-67 standard even says “Some users of the code may design special type fonts which will facilitate use of these graphics for either purpose” about a set of characters including ` (0/6, often called backtick). However, as far as I can tell (and I invite correction), people have always preferred to use pairs of '. This may be because:

  1. Very few fonts give the ASCII backtick and apostrophe characters the 6 and 9 shapes they need to look right as open and close single quotation marks.
  2. Most respectable programming languages don’t consider ASCII backtick and apostrophe as paired. Perl, for instance, has strings with '' and with ``, but not with `'.
  3. ECMA-6, which can be regarded as the non-US-localized version of ASCII, only refers to the character as grave accent, although it permits ad-hoc assignment.
  4. Users assume that if ASCII wanted us to have curly quotes, it would give us curly double quotes too, because they are much more common in standard American English and in most programming languages.

A final nitpick: <em>f</em> is poor form, because the symbol is certainly not an emphasized f, even if it might be regarded as an italic one. In most cases where <i> was used in early versions of HTML, <em> is indeed to be preferred, but not here; <i> would be much better because it would not imply a meaning (emphasis) where there is only variant form (italic).

This page by Charlie Loyd, 2003-11-25. Last edited 2005-04-14.