I’m no linguist – nor do I play one on TV, but it can be fun to have a dig around in their world. I’ve been doing a bit of research.
Ethnologue is an interesting database of human languages and worth a wander through.
However, it doesn’t contain what I was looking for – a list of human languages (well, more strictly, the scripts) that have both majuscules and minuscules – better known as “upper-case” and “lower-case” to the likes of me. Certainly the Latin script used by the Germanic languages does.
What I learnt from the Unicode FAQ was that:
- Most scripts don’t have cases.
- The mappings from upper-to-lower-case are not one-to-one.
  - “For example, both a sigma and a final sigma upper-case to a capital sigma.” – Unicode FAQ
- Some mappings are locale-dependent. UPPER("i") != "I" in Turkish!
- Some lower-case Unicode glyphs can’t map to a single upper-case glyph. For example, the fl ligature (U+FB02), when converted to upper-case, should end up taking two characters.
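To make those points concrete, here is a minimal sketch in Python 3 (my choice of language; the FAQ doesn't prescribe one). It assumes that Python's built-in str.upper() applies Unicode's default, locale-independent case mappings, which means it can show the sigma and ligature behaviour directly but can only show that the Turkish rule is *not* applied by default. The variable names are mine, purely for illustration.

```python
# Many-to-one: both lower-case sigmas upper-case to the same capital sigma.
sigma = "\u03c3"          # GREEK SMALL LETTER SIGMA
final_sigma = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
capital_sigma = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

sigma_upper = sigma.upper()
final_sigma_upper = final_sigma.upper()
print(sigma_upper == final_sigma_upper == capital_sigma)  # True

# One-to-many: the fl ligature (U+FB02) has no single upper-case character,
# so upper-casing it yields the two characters "F" and "L".
ligature = "\ufb02"
ligature_upper = ligature.upper()
print(len(ligature), len(ligature_upper))  # 1 2

# Locale-dependence: str.upper() knows nothing about the current locale,
# so "i" becomes plain "I" here. Turkish text would expect U+0130
# (capital I with dot above) instead.
i_upper = "i".upper()
print(i_upper == "I")       # True
print(i_upper == "\u0130")  # False
```

Getting the Turkish result needs a locale-aware library rather than the built-in string methods, which is rather the point of the FAQ entry.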
There are a couple of conclusions:
- When dealing with non-English scripts, avoid assumptions like UPPER(x) == UPPER(LOWER(x)).
- Relying on a computer to change the case of a character is going to be a non-trivial operation.
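As a rough illustration of the first conclusion, the sketch below tests that assumption in Python 3. The helper names assumption_holds and caseless_equal are hypothetical (mine, not from any library), and the counterexample I picked, the capital sharp s (U+1E9E), is just one character where I know the default mappings disagree; there are likely others.

```python
def assumption_holds(x: str) -> bool:
    """Check whether UPPER(x) == UPPER(LOWER(x)) for this string."""
    return x.upper() == x.lower().upper()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using Unicode case folding."""
    return a.casefold() == b.casefold()


capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

# Plain ASCII behaves the way an English speaker expects.
print(assumption_holds("a"))  # True

# U+1E9E is already upper-case, so UPPER(x) leaves it alone.
# LOWER(x) gives U+00DF (sharp s), whose upper-case form is "SS".
print(assumption_holds(capital_sharp_s))  # False
print(capital_sharp_s.lower().upper())    # SS

# For comparisons, case folding avoids the round trip altogether.
print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True
```

If the goal is only to compare strings without regard to case, str.casefold() seems a safer bet than chaining upper() and lower(), though it is still locale-independent and so won't help with the Turkish case.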
(Why am I going on about typography and linguistics? Bear with me. I’m building up a framework for an argument. I’ll come back to this later on.)
Comment by alastair on October 25, 2005
I find Unicode fascinating just for the window it provides onto the range of human written expression. As you have seen, even the most basic English assumptions (like the characteristics of upper- and lower-case letters) do not hold for the rest of the non English-speaking world.
Comment by Alan Green on October 26, 2005
Still bearing with you, Julian. But if you don’t come to the point soon, my unsatiated curiosity will develop into intellectual vacuum, and my head will implode.