Medieval Unicode

And Voynich, while you’re at it

3 Likes

Next I want to see ISO Steampunk-1

Isn’t that called a “font”?

1 Like

Finally!

1 Like

I believe a “font” is the specific combination of a typeface (the overarching “design”), weight (bold/normal/light, etc.), style (italic/oblique/normal, condensed/normal/wide), and size.

Of course, it can also refer to the file or files that contain the information needed to draw those. A classic font file really was a font; it had a picture defining each character. A modern vector font typically defines a specific weight+style, but can be scaled to any size (so there are files for arial/arial bold/arial italic/arial bold italic, and maybe light versions of the previous - and maybe condensed versions of all those). Calling the overarching thing a “font” is wrong (arial is a typeface), though common and I do it myself. The files are “font files” by tradition and history - and apart from being scalable, they more or less are.

Anyway, this is more about codepoints - defining that Unicode codepoint U+1E53 actually means “latin small letter o with macron and acute”. Any given font file might not include all the glyphs (though I guess the ones that can easily be composed from existing parts should be easy), so there is also an initiative to make font files that include glyphs for all their suggested/accepted codepoints.
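
For instance, Python’s standard unicodedata module can tell you what a codepoint is called, and show the composed-from-parts case (a rough sketch, nothing font-specific):

```python
import unicodedata

# U+1E53, the codepoint mentioned above
ch = "\u1E53"
print(unicodedata.name(ch))   # LATIN SMALL LETTER O WITH MACRON AND ACUTE

# The same character can be built from parts and then normalized together:
# 'o' + combining macron (U+0304) + combining acute (U+0301)
composed = unicodedata.normalize("NFC", "o\u0304\u0301")
print(composed == ch)         # True
```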

1 Like

No. Unicode is effectively a list of (almost) all the characters anyone could need, anywhere in the world. Individual fonts will contain representations of a selected subset of Unicode*.

There’s space in Unicode for just over a million different characters, though only about 100,000 are used. The mediaevalists are asking for their favourite characters to be added to the list.

*Some of which may look nothing like normal versions of those characters- e.g. the ‘Wingdings’ font.
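
To put numbers on that, a quick Python check of the code space and a few sample codepoints (rough sketch):

```python
import sys
import unicodedata

# 17 planes of 65,536 codepoints = 1,114,112 possible values
print(0x110000)         # 1114112
print(sys.maxunicode)   # 1114111, i.e. 0x10FFFF, the highest valid codepoint

# The codepoints themselves are font-independent; a font just draws some of them
for cp in (0x41, 0x263A, 0x1F600):
    print(hex(cp), unicodedata.name(chr(cp)))
```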

2 Likes

To expand: Historically, most fonts only had at most 256 glyphs, since each character in text was encoded as a single byte (7 bits’ worth in the olden days, a full 8-bit byte on anything remotely recent).

Most of the world agreed to use ASCII, which defines more or less the characters you can type on a plain US keyboard within the lower 128 values (0–127). That’s convenient, since most programming languages and OSes were in English - and everyone agrees on the numbers. When the world started using 8-bit bytes, the top 128 were kind of free, and were used for a number of different character sets (like my local ISO-8859-15, or the DOS codepage 850 that had the same characters but in different positions).

In those, pressing a key produced a number, and if your keyboard layout and font agreed it displayed as the expected glyph. If you then saved those numbers to a file, and viewed them with a different font (say a cyrillic one, or even just one where the non-English letters were different) it would look wrong.
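
You can still see that today by decoding the same byte with different legacy codecs (Python sketch; the exact characters depend on which codepages you pick):

```python
# One byte, three different meanings in three legacy 8-bit encodings
raw = bytes([0xA4])
print(raw.decode("iso-8859-15"))   # '€' - the euro sign in -15
print(raw.decode("iso-8859-1"))    # '¤' - the generic currency sign in latin-1
print(raw.decode("cp850"))         # 'ñ' in the old DOS codepage
```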

Unicode is different, in that it has a single unique code for every character. The lowest 128 codepoints are the same as ASCII (and the first 256 match ISO-8859-1), so a plain ASCII text file is already valid UTF-8 and will render with at least the basic characters correct.
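
A quick way to convince yourself (Python sketch):

```python
# Plain ASCII bytes are already valid UTF-8, byte for byte
print("Hello, world".encode("ascii").decode("utf-8"))   # Hello, world

# But the upper half of a legacy 8-bit encoding is NOT valid UTF-8 on its own
try:
    bytes([0xE9]).decode("utf-8")   # 0xE9 is 'é' in ISO-8859-1
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err)
```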

As for UTF-8/UTF-16/UCS-2/etc: A Unicode codepoint is a number that can go up to 0x10FFFF - naively stored, that takes four bytes. You don’t really need all four bytes for all text, though: UTF-8 will use a single byte if the codepoint is in the lowest 128 (plain ASCII), otherwise it stores an “escape” byte (one of the values above 127) that indicates that the next few bytes together encode a single codepoint.
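
For example (Python, just to show the byte counts):

```python
# UTF-8 spends more bytes the higher the codepoint gets
for ch in ("a", "é", "€", "字", "😀"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```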

That’s efficient if the text in question is latinic with a smattering of other characters. However, imagine Chinese, where every character’s codepoint needs at least two bytes’ worth of information: in UTF-8 each one ends up taking three bytes (a lead byte plus two continuation bytes), which is a waste. For a language like that, it’s better to use two bytes per character by default - so UTF-16 does exactly that. It still has an escape (a “surrogate pair”, four bytes in total) for characters outside the first 64K (there are a bunch of those - dead writing systems, all sorts of symbols, unusual or historical Chinese/Japanese/Korean signs).
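
The trade-off is easy to see by encoding the same strings both ways (Python sketch; the example strings are my own):

```python
latin = "Hello, world"
chinese = "汉字真有意思"

for s in (latin, chinese):
    print(len(s), "chars:",
          len(s.encode("utf-8")), "bytes as UTF-8,",
          len(s.encode("utf-16-le")), "bytes as UTF-16")

# Characters outside the first 65,536 need a surrogate pair (4 bytes) in UTF-16
print(len("😀".encode("utf-16-le")))   # 4
```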

The downside to UTF-8 and -16 is that a given number of bytes can contain a varying number of characters, so you have to parse text before you can work with it - you can’t even know if it’s safe to cut and paste at a given byte position without parsing back a few bytes, and you have to be very pessimistic when allocating memory for a given number of characters. The compromise solution is UCS-2, which is like UTF-16 in using two bytes per character, but does not support the escapes: If you want a character that would need four bytes (a surrogate pair), that’s just too bad. UCS-2 is easy to work with - 100 characters take 200 bytes, and the lower 64K of Unicode (the Basic Multilingual Plane) contains enough characters to write decent text in most (or all?) modern languages. It’s popular as a low-level replacement for ASCII, so you’ll find it in everything from Windows internals to EFI boot code.
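
Here’s the parsing problem in miniature (Python sketch):

```python
data = "naïve 字".encode("utf-8")

# Byte length and character length no longer match
print(len(data), len(data.decode("utf-8")))   # 10 bytes, 7 characters

# Cutting at an arbitrary byte position can land in the middle of a character
try:
    data[:3].decode("utf-8")   # splits the 2-byte 'ï' in half
except UnicodeDecodeError as err:
    print("cut mid-character:", err)
```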

(Technically speaking, UCS-2 has been deprecated since 1996, but valid UCS-2 is valid UTF-16, and valid UTF-16 that doesn’t use 3- or 4-byte codepoints is valid UCS-2. That and the amount of code that expects UCS-2 means it’ll be around for a while.)

There are also formats that use four bytes per character as standard (UTF-32, a.k.a. UCS-4); convenient enough if you’re actually going to wring every possible use out of Unicode.
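
For completeness (Python sketch):

```python
# UTF-32 spends four bytes on every character, no matter how plain
for ch in ("a", "€", "😀"):
    print(ch, len(ch.encode("utf-32-le")), "bytes")   # 4 bytes each
```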

4 Likes

And of course the Unicode convention is that it does NOT encode “glyphs” but characters. Different typefaces can represent the same character with different glyphs, e.g. the 7 sometimes has a line through the middle, and sometimes a lower case “a” has a line at the top and sometimes it does not. OTOH, sometimes the same glyph can represent different characters, e.g. old typewriters usually didn’t have a key for the digit one; the lower case “L” was used instead.
So presentation forms are not part of the Unicode scheme. The idea was that those would be implemented by the software rather than within the character encoding.

1 Like

Thanks for asking that question. You triggered several informative answers.

2 Likes

Another illustration of the glyph vs. letter divide, and how that is related to fonts, is blackletter (a.k.a. gothic lettering). Now, a set of blackletter characters IS included in Unicode, but it is in the section for mathematical symbols, because blackletter-style letters are used as a distinct set of symbols in some mathematical equations. You are supposed to encode blackletter text within the normal Latin character range and use a blackletter font.
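
You can see where they ended up by asking the character database (Python sketch):

```python
import unicodedata

# The blackletter letters live among the mathematical alphanumeric symbols
print(unicodedata.name("\U0001D504"))   # MATHEMATICAL FRAKTUR CAPITAL A (𝔄)

# Ordinary blackletter text is just normal Latin letters;
# the blackletter look is supposed to come from the font, not the codepoints
print(unicodedata.name("A"))            # LATIN CAPITAL LETTER A
```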

This topic was automatically closed after 5 days. New replies are no longer allowed.