Apostrophes are hard

Preface

This page shows why it’s impossible to correctly translate instances of the ASCII apostrophe character ' (U+0027) to ‘ (U+2018) or ’ (U+2019) in typical English text without comprehending it. It should help, or at least enlighten, anyone converting ordinary keystrokes into typographically sophisticated text.

A quote, on this page, is a quotation mark. A quotation is still a quotation.

The problem

We want an algorithm to correctly convert ' to ‘ or ’ in arbitrary written English without actually understanding the text, given that:

Some quotations begin at an opening single quote and end at a closing single quote.
An apostrophe may be adjacent to any alphabetic character.

This leaves the meaning of some instances of ' in the text ambiguous, as in the context:

Well, a 'possum is eating the win' again.

That might be two elisions, from opossum and wing, or it might be a reference to an odd but outwardly valid phrase. Even a human reader may delay a final parsing of this fragment, but more context cannot help an algorithm without a semantic grasp of the text. More examples show other forms of ambiguity:

A. I don't do 'meek assistant' very well.
B. He was the '50s' greatest rock 'n' roll singer.
C. No, I said: 'oy, are you kin' of these people?
D. I 'it 'im on the bleedin' noggin'!
E. It was the 'splainin' voice perfected by Mr Ricardo.

A and B are heuristically tractable (B requiring special cases). C would be easy with comprehension of context and basically impossible without; D might trick an algorithm which could find that nogging is a word but not that it’s much more obscure than noggin; and without a good knowledge of the writer’s style, E is ambiguous to any reader including expert humans.

Heuristic approaches

These heuristics, properly ordered and tuned, can give useful though entirely approximate mappings from typical text:

pattern	mapping	intent
[non-whitespace]'[letter]	’	contractions and possessives
'[digit][digit][non-digit]	’	two-digit years
[whitespace][digits]'	′	prime (U+2032) for feet
[newline]'	‘	multi-paragraph non-block quotation
[whitespace]'[non-newlines]'	‘’	single quotes
'	’	default to apostrophe

I’ve used a variant algorithm as part of a plaintext-to-XHTML translator, and it’s proven quite reliable. Only a few cases, such as a quotation including only the opening or closing mark of an inner single quotation, have to be accepted as pathological.
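As a rough illustration, the table above can be sketched as an ordered list of regular-expression substitutions. This is not the translator mentioned here, only a minimal Python sketch under some assumptions: the names RULES and smarten are hypothetical, the start of the text is treated like whitespace and like a newline, and the paired-quote rule is limited to a single line.

```python
import re

APOSTROPHE = "\u2019"   # right single quote, also used as the apostrophe
OPEN_QUOTE = "\u2018"   # left single quote
CLOSE_QUOTE = "\u2019"  # right single quote
PRIME = "\u2032"        # prime, for feet

# Ordered (pattern, replacement) pairs, one per row of the table.
# Earlier rules consume the apostrophes that later rules must not see.
RULES = [
    # [non-whitespace]'[letter]: contractions and possessives
    (re.compile(r"(?<=\S)'(?=[^\W\d_])"), APOSTROPHE),
    # '[digit][digit][non-digit]: two-digit years
    (re.compile(r"'(?=\d\d(?:\D|$))"), APOSTROPHE),
    # [whitespace][digits]': prime for feet
    (re.compile(r"(?<!\S)(\d+)'"), r"\1" + PRIME),
    # [newline]': opening quote of a continued paragraph
    # (assumption: the start of the text counts as a new line)
    (re.compile(r"^'", re.MULTILINE), OPEN_QUOTE),
    # [whitespace]'[non-newlines]': paired single quotes on one line
    (re.compile(r"(?<!\S)'([^\n']*)'"), OPEN_QUOTE + r"\1" + CLOSE_QUOTE),
    # anything left over defaults to an apostrophe
    (re.compile(r"'"), APOSTROPHE),
]


def smarten(text: str) -> str:
    """Apply each heuristic in order; the result is only approximate."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(smarten("I don't do 'meek assistant' very well."))
print(smarten("He was the '50s' greatest rock 'n' roll singer."))
```

Without a special case, this sketch treats the 'n' in the second sentence as a quoted word and gives it an opening quote, which is the kind of exception example B calls for.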

Presumably, a carefully trained Bayesian system with some special cases would give superior results. This is left as an exercise to the reader.

Unicode’s limits

The Unicode Standard 4.1 states that U+2019, right single quotation mark … is the preferred character to use for apostrophe. This is often said to abide by the Consortium’s doctrine that visually and behaviorally identical characters are totally identical, but in fact many fonts distinguish the glyphs, and the behavior (sort order, for example) may differ. In any case, the overloading of U+2019 leaves semantic ambiguity even in correctly Unicode-encoded plaintext: ’ may be either a closing single quote or an apostrophe.
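The overloading can be seen in the character database itself. The following is a small sketch using Python's standard unicodedata module; the helper name describe is hypothetical. It shows that U+2019 carries one name and one general category (final punctuation) no matter how a writer uses it.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, general category, character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


# The ASCII apostrophe, the two curly single quotes, and the prime.
for ch in ("'", "\u2018", "\u2019", "\u2032"):
    print(*describe(ch))
```

Nothing in these properties says whether a given ’ closes a quotation or marks an elision; that information exists only in the reader's understanding of the text.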

XHTML

XHTML’s quotation element (q) allows a comprehending writer to compose text which is unambiguous at the toplevel and, with CSS, can properly render the quote styles of fragmentary quotations. A computational translator could use the quotation element where most confident and produce different encodings of U+2019 (literal ’, &rsquo;, &#8217;, or &#x2019;) to signal which pattern was matched or degree of confidence. Such a scheme would not survive rendering or normalization and is well outside the spirit of Unicode, but might find some ad hoc use.
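A minimal sketch of that signalling idea follows, assuming a hypothetical mapping from confidence labels to the four spellings of U+2019 (the literal character, the named entity, and the decimal and hexadecimal character references). The labels and the names ENCODINGS and encode_apostrophe are invented for illustration. It also shows why the signal is fragile: resolving the references collapses all four to the same character.

```python
import html

# Hypothetical confidence labels, each mapped to one spelling of U+2019.
ENCODINGS = {
    "certain": "\u2019",    # literal character
    "likely": "&rsquo;",    # named entity
    "guess": "&#8217;",     # decimal character reference
    "default": "&#x2019;",  # hexadecimal character reference
}


def encode_apostrophe(confidence: str) -> str:
    """Pick the spelling of U+2019 that signals the given confidence."""
    return ENCODINGS[confidence]


marked = "don" + encode_apostrophe("likely") + "t"
print(marked)

# Any tool that resolves references erases the distinction.
print(html.unescape(marked) == "don\u2019t")
```

The four spellings differ only in the source text, so the signal is useful only to tools that read the markup before it is parsed.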

Notes

Single quotes are rare in American English, where they’re usually used only for inner marked quotations (and inner inner inner, etc.), but they must not be ignored: British English uses them for outermost (and inner inner …) quotations, and some dialectal usage has them signal ironic or non-verbatim quotation.

The Language Log’s “Final periods and quotation marks: harder than you thought” is a good description of an equivalent problem. One of Hal Fulton’s posts to ruby-talk helps explain why using ` (U+0060 = grave accent) for ‘ isn’t as clever and well-established as people who suggest it hope.

Bob the Angry Flower’s “Quick Guide to Destroying the Apostrophe” presents an alternate solution. Also see his “Quick Guide to the Apostrophe, You Idiots”.

For a fuller defense of Unicode against ASCII – if you’re wondering whether it’s worth the trouble – please read about the Unmoroniser.

Besides my co-workers, Chris Black, Aredridel Stewart, and Hal Fulton have had interesting things to say about this.