furialog

¶ "Syntax" · 24 March 2011 tech

You could do math and numbers without periods and commas.

In the US we use commas as thousands separators, like "1,456,012". There's no functional significance of this. "1456012" is the same number, it's just slightly harder for humans to read quickly. A computer can read it fine. You could add 47 to it easily enough. As a syntax element, then, this is both pure and optional.

(To be really strict about this, the comma is not quite pure syntax, because although you can always take commas out with functional impunity, you can't put them in with the same freedom. Most humans will not recognize "1,4,5,6,01,2" as an alternate rendition of "1,456,012", and it's debatable whether a computer should or shouldn't.)

In the US we use the period as the decimal point, like "1.25". This is a functional element of the language of math, obviously, as it's critically different from "125" or "12.5". But it's still optional. We could write 1.25 as 1¼ or 1 25/100. The decimal point is, in a sense, a shortcut notation for expressing fractions whose denominators are negative powers of 10. This is a elegantly sensible idea, since the digits of a decimal-point-less number are already expressing powers of 10, and it allows a whole repertoire of arithmetic techniques to be applied to fractions numbers the same way they're used on whole numbers.

But adding syntax is never free. Giving the period special meaning in numbers introduces some amount of potential confusion when numbers are used in English sentences, where the period has a different syntactic function. Now when I write "I set the intensity to 1.5.", the first period belongs to the number and the second to the sentence. If people started using numbers in names, we might find ourselves saying "I have an appointment to see Dr. 1.5, Jr..", in which there are four periods serving three different functions.

But as a tradeoff, this seems OK. We don't commonly use numbers in names, and sentence-ending periods are always followed by spaces while decimal points always aren't, so there's no appreciable ambiguity from a human perspective. The overloading of functions makes a computer's task harder if it's necessary to parse English sentences, but there the use of periods for abbreviation is already a much trickier problem, so the incremental effect of adding the decimal-point use doesn't affect the overall complexity very much.

But if this were a language-design decision we were actually trying to make, we would also consider the alternatives. If we expect to want to embed snippets of our number language in our sentence language, which is what we're doing in language terms, then we have basically four choices: pick symbols that don't conflict, subsume the number language into the sentence language, quote whole snippets of the number language where we embed them into the sentence language, and/or mark the special characters from the sentence language, where they appear in the embedded bits of number language, to indicate that they are not performing their sentence functions in that context (this last is often called "escaping", in the sense that the characters so-marked are escaping from their normal responsibilities).

Picking symbols that don't conflict is easy, of course, if you can just make up your own symbols. But if we pretend that we're making this decimal-point decision today, in a world filled with already-manufactured keyboards with only certain characters on them, and we agree that we want to use one of these characters, then there aren't a lot of good choices. Almost all the characters on a keyboard already have meaning in sentences, so you're mostly left with ^ ~ and `. Two of these already have math uses, though, and the backwards quote doesn't seem very appealing here.

The other three solutions, subsumption, quoting and escaping, move the embedding problem out of the embedded language (numbers) into the embedding language (sentences).

For subsumption we provide constructs in the sentence language for everything in the number language. This, in fact, has exactly been done. You can write numbers and math without using symbols or numerals: "One point two five divided by the sum of nine point four and eleven." That's the sentence version of 1.25/(9.4+11). The cool insight, from the sentence language's perspective, is that anything you can describe in words, you can say in a sentence by writing out those words, and if you do that, you can reserve punctuation for its special sentence functions. The tradeoff is particularly obvious in this case: the number language is very symbol-heavy, so the sentence-language translation is a lot different from the number-language equivalent, and in particular the number-language's careful balance of content (numbers) and function (numeric operations on those numbers) has been destroyed by turning both numerals and punctuation into words. So the cost of eliminating the ambiguity of symbols, by this method, is that translated snippets of number-language have to be mentally translated back by the reader. Given that the symbol confusion that motivated us not very severe, it's probably not worth incurring this cost, at least most of the time, but it's good to know it's an option.

For quoting we need a way to mark the point where we switch from the sentence language into the number language, and the point where we switch back. The meta-language of paragraphs already has a block-quotation method for separating languages, which we can use for this, so we're just exploring the idea of an additional method for inline quotation, putting number-snippets inside of sentences without block-quotes or paragraph breaks.

The sentence language already has quotation marks, of course, so we might be tempted to use those. But actually, the quotation marks in the sentence language are doing a complicated mixture of language functions: some quoting, some escaping, some nesting, and often a specific language-internal punctuation role for denoting speech. Parentheses and square brackets also have established sentence-language roles. But braces (the squiggly brackets: { } ) are available on keyboards, and not used (for much) in the sentence language. So we could adopt those as our quoting symbols. This will work well or poorly depending on whether the embedded languages use curly brackets. The number language we're talking about doesn't, but remember we're now talking about a sentence-language feature, so it has to be considered in the context of all the other languages we might want to embed in sentences. I can easily think of a few that use braces (several programming languages, the expanded math language of sets...), so that's not a clear win.

So, in fact, we adopt a kind of hybrid solution. Snippets of the number language look different enough from the sentence language that we usually just jam them in and hope for the best, and in the rare cases where there are numbery bits of sentence-language in proximity to number-language snippets, we can use the sentence-language's function for rewriting the numbery bits without numerals. So if I'm talking about a numeric-expression collection, I can write "I have two 3+5s and one 9/13", instead of "I have 2 3+5s and 1 9/13". This is pretty complicated to explain in formal language-design terms, but it's easy to do.

The last approach to consider is escaping. Instead of marking whole blocks of the number language, we could mark just the specific characters in the number-language that would normally have functions in the sentence-language. Two common tactics for this, from machine-oriented languages, are prefacing the characters with backslashes or doubling them. So instead of 1.25, we'd do 1\.25 or 1..25. The former approach has the advantage that we can use it for lots of characters, so we can use it for dashes and slashes and asterisks and other things which also appear in both the number language and the sentence language. But the readability cost of this, for the embedded snippet, can be high. How do you like "1\.25\/\(9\.4\+11\)"? And backslash escaping gets really ugly if you need levels of nested languages, because then you have to start backslashing the blackslashes. So here again, the cost of the solution is high compared to the benefit in ambiguity-reduction.

So, all things considered, sticking with unadorned decimal points in numbers in sentences seems, pretty clearly, like the best approach, given all the factors we could think to consider.

Language-design decisions are always provisional like this. If factors change, we might reconsider. Note, for example, how we've mostly changed from writing fractions by stacking tiny numbers vertically to writing normal numerals with slashes like 1/3, because numbers are increasingly expressed by people using keyboards, typing into software oriented around lines of text, and 1/3 is much easier than finding the special ⅓ character or relying on the software you happen to be using having the proper input/typography support for expressing arbitrary fractions the old way.

So, if you've read this far, hopefully you've learned at least one thing. And probably that one thing is that you do not want to be a language designer.