Unicode Suggestions Requested.

October 5, 2015 by languagehat 87 Comments

I just got the following e-mail:

We’re drafting a proposal to add as many remaining unsupported phonetic and orthographic symbols to Unicode as we can justify. I thought you might have come across things you’d like to have encoded. You seem like the kind of person who might have stashed away notes on things like that.

We’re not interested in idiosyncratic inventions that never spread beyond their authors, or obsolete systems that scholars don’t bother to use even when citing sources that do use them, but sometimes Unicode doesn’t support things in fairly widespread use, such as superscript variants of IPA characters, subscripts made superscript to avoid descenders, letters with a swash for velarization, and informal IPA letters or substitutions. Or if you know of a really neat symbol that should be available but isn’t, and can send published documentation, we should be able to include it.

So this is your chance: if you’ve got ideas on the subject, put ’em in the thread and they will be seen by someone who can do something about it.

Comments

Kirk says

October 5, 2015 at 10:23 pm

Just to clarify, to keep things manageable we’re restricting submissions to the Latin script (broadly speaking, including punctuation) and to phonetic alphabets such as the IPA that are based on the Latin script. Requests should include a citation, ideally of an instance of actual use rather than just from a list of symbols. The more authors and publications we can cite, the more likely a request is to be accepted. A link to a PDF or screenshot is needed, but if you don’t have a place to post we can work something out. The source doesn’t have to be for language or phonetics. E.g., I’m requesting a superscript comma for chemical notation.

Note that Unicode is no longer accepting requests for precomposed glyphs (letter + diacritic, such as the r-tilde used for an alveolar trill in Americanist notation). Exceptions are made when the diacritic intersects the base letter, such as the swash used for velarization, the tail for retroflection, and the obsolete tail for palatalization in the IPA, as fonts still sometimes have difficulty composing them.

I don’t have any published evidence for a retroflex lateral flap (turned r–l ligature + retroflex tail: see the Wikipedia article) despite over 100M people speaking languages with that sound and linguists saying they could really use it. So that’s a priority. I have minor attestation of small-cap Q, but more would be helpful. TIPA has an l–r ligature that I can’t find.

Superscript IPA letters need to be assigned their own codes so that they can be entered into databases and used for file names without the superscripting being lost. I plan to request all vowels, pulmonic consonants, and extIPA letters, but in case that fails, I’m trying to document as many individual superscript letters as I can. I have yet to document the vowels ø ɤ æ ɶ ɘ ɞ or rhotic ɚ ɝ (not sure the latter are worth including, since the rhotic diacritic works with superscript vowels) and the pulmonic IPA consonants ɖ ʡ ɮ ħ ʙ ⱱ ɺ (though I have linguists requesting several of them). I don’t have any implosive, click, or extIPA letters, though I’ve received a couple requests now to include the extIPA. But some of the letters I do have citations for, such as the uvulars, are rather minimal, so additional citations would be useful. Superscript sinological letters would also be nice. (I have found superscript ɿ ʅ.)

*Subscript* IPA letters are sometimes used, but not sure the convention is established enough to include. Maybe someone here knows of something that would make it worthwhile? Note that in order to argue for a superscript or subscript form of a letter, it needs to be semantically distinct from the base letter. That’s the case when IPA letters are used as modifying diacritics, but not when e.g. subscript b, c, d are just used to identify a variable.
Luke says

October 6, 2015 at 2:27 am

I’ve got nothing to offer, but…that is awesome.
Y says

October 6, 2015 at 3:10 am

“We’re not interested in idiosyncratic inventions that never spread beyond their authors, or obsolete systems that scholars don’t bother to use even when citing sources that do use them.”
If Unicode has seen fit to encode the Deseret alphabet and Tolkien’s scripts, I don’t see why they shouldn’t encode oddball academic phonetic systems.

Anyway, I nominate the slashed b, d and g, used by the University of Chicago Press to represent the corresponding fricatives in Spanish.
Carey Evans says

October 6, 2015 at 3:30 am

You can represent their slashed b, like in https://books.google.co.nz/books?id=G8ZY3dVGF7cC&lpg=PP1&pg=PA3#v=onepage&q&f=false, with U+0062 U+0338 to make b̸, as long as the font works. It doesn’t seem like they need separate codepoints assigned for them.
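
As a rough illustration of the sequence described above, here is a minimal Python sketch. The helper name `slash` is invented for this example; the only facts it relies on are the two code points given in the comment.

```python
import unicodedata

SLASH = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

def slash(letter: str) -> str:
    """Hypothetical helper: append the combining overlay to a base letter."""
    return letter + SLASH

slashed_b = slash("b")
print(slashed_b, [f"U+{ord(c):04X}" for c in slashed_b])
print(unicodedata.name(SLASH))

# NFC has no precomposed slashed b to compose to, so the
# sequence stays as two code points after normalization.
print(len(unicodedata.normalize("NFC", slashed_b)))
```

The same helper applied to d and g gives the other two letters; whether they look right still depends on the font.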

Tolkien’s scripts are very common, but I wouldn’t expect to see Marain or the dragons’ runes from Skyrim encoded.
David Marjanović says

October 6, 2015 at 6:28 am

the slashed b, d and g

Do you mean ƀ, đ and ǥ, or do you mean letters with / through them?
Ian Press says

October 6, 2015 at 6:42 am

I would be so happy if we could have cyrillic vowels with acute and grave accents (grave for stress traditionally in Bulgarian). Acute-accented ‘e’ is immediately justifiable, as it quite simply does exist in Russian; and ‘ë’ is already there. ABBYY on a Mac doesn’t recognise such characters, which is such a pity. I know they’re only really found in textbooks and grammars and the like, but still, please! I have begged ABBYY to extend recognition to their Mac software. One can mark stress in other ways, e.g. emboldening the stressed vowels, but that acute would be so useful – I know one could paste in from non-cyrillic, but…
Lazar says

October 6, 2015 at 6:48 am

A ‘y’ with breve would be nice for metrical notation; I’m not sure why it hasn’t been included along with the other breve characters.
Ken Miner says

October 6, 2015 at 8:19 am

I’m not technically sophisticated about the difference between Unicode and special fonts, but the last time I tried to word-process in Classical Greek it was pretty horrendous. Can I assume that even by now you can’t get all the combinations of diacritics you need for that in Unicode? It seems it would be faster than special fonts.
Rodger C says

October 6, 2015 at 8:19 am

I use only the symbol sets in Word and IPA sites, so maybe this is off base, but I continually miss consonants with dots under them that are common in academic transcriptions of Arabic, Sanskrit, etc., but are rare in actual orthographies.
Brett says

October 6, 2015 at 8:44 am

It would, in fact, be useful to have versions of all the italic letters with / through them.
Dan Jones says

October 6, 2015 at 8:48 am

Ken Miner,

Unicode is not a font. Unicode is a method of encoding text for use by computers. All text files have a specific encoding, and Unicode is becoming the most popular encoding because, unlike many others, it has broad support for nearly every language in the world, as well as a large number of symbols (and even emojis). Prior to the creation of Unicode, most encodings were language-specific.

I believe Unicode already has support for all Classical Greek characters. But, as I said, it’s not a font. And you need a font that includes the characters supported by Unicode. If you’re having trouble finding one, I suggest Noto (http://www.google.com/get/noto/). It’s a family of fonts by Google that includes support for nearly all Unicode characters, and unlike some other fonts with such broad support (like Unifont), it’s actually pleasing to look at.
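
To make the encoding-versus-font distinction concrete, here is a small sketch using Python's standard `unicodedata` module. The sample word is my own choice, not something from the thread. It shows that a polytonic Greek word is just a sequence of code points, whatever font ends up drawing it.

```python
import unicodedata

# "anthropos" in polytonic Greek, written with escapes so the
# code points are unambiguous regardless of editor or font.
word = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

utf8 = word.encode("utf-8")
print(len(word), "code points,", len(utf8), "bytes in UTF-8")

# The first letter is a precomposed character; NFD splits it into
# alpha + smooth breathing + acute, all of which are also in Unicode.
decomposed = unicodedata.normalize("NFD", word[0])
print([f"U+{ord(c):04X}" for c in decomposed])
```

Nothing here says anything about how the letters will look on screen; that part is the font's job.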
Alex says

October 6, 2015 at 9:09 am

Another vote for accented cyrillic vowels. As they are standard in many textbooks it is surprising they are not already available. I know you can create them with a combining acute accent (U+0301) but the result is often not correctly rendered.

Another useful addition would be alternatives for the old DOS reserved characters. A few exist already: Modifier Letter Colon (U+A789), which is used as a tone mark in certain languages, is a good replacement for the normal colon (U+003A), and fullwidth question mark (U+FF1F) can take the place of U+003F, but existing alternative forms for many others are less successful.
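
A minimal sketch of that substitution, assuming the goal is a filename that avoids the reserved characters. The function name `safe_filename` and the sample filename are made up, and only the two replacements named above are included.

```python
import unicodedata

# Only these two substitutions come from the comment above;
# other reserved characters are deliberately left untouched.
SUBSTITUTES = {
    ":": "\ua789",  # MODIFIER LETTER COLON
    "?": "\uff1f",  # FULLWIDTH QUESTION MARK
}

def safe_filename(name: str) -> str:
    """Hypothetical helper: swap reserved characters for look-alikes."""
    return name.translate(str.maketrans(SUBSTITUTES))

for original, replacement in SUBSTITUTES.items():
    print(repr(original), "->", unicodedata.name(replacement))

print(safe_filename("Why Unicode?: notes.txt"))
```

Note that the result only looks like the original: searching for a plain colon or question mark will no longer find these characters.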
languagehat says

October 6, 2015 at 10:03 am

Another vote for accented cyrillic vowels.

I agree.

I would remind people that you will have a better chance of getting your suggestions accepted if you can provide a citation of actual use in a publication.
Rodger C says

October 6, 2015 at 10:03 am

There are Greek letters with Classical accents in the Word symbol set, but you have to enter them one at a time. They’re useful for word citations and short quotations but would be a terrible chore for extended text.
Jongseong Park says

October 6, 2015 at 10:29 am

Unicode doesn’t accept submissions for characters that can already be encoded by using base letters and combining diacritics already in Unicode. Accented Cyrillic vowels for example can be encoded with the normal Cyrillic vowel letters followed by combining acute accents, which is why we can write Кры́мское ха́нство without a problem. If this doesn’t display correctly, that is a problem with fonts or rendering environments, not with Unicode.

The precomposed characters that are already in Unicode are there for reasons of backward compatibility, because previous encoding standards assigned characters like é, ü, and ç their own code points instead of expressing them as combinations of base letters and diacritics.
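
The difference between the two cases can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch: it compares Latin e plus a combining acute, which has a legacy precomposed form, with Cyrillic ы plus the same accent, which does not.

```python
import unicodedata

latin = "e\u0301"          # e + COMBINING ACUTE ACCENT
cyrillic = "\u044b\u0301"  # Cyrillic ы + COMBINING ACUTE ACCENT

nfc_latin = unicodedata.normalize("NFC", latin)
nfc_cyrillic = unicodedata.normalize("NFC", cyrillic)

# Latin composes to the single legacy code point U+00E9 ...
print([f"U+{ord(c):04X}" for c in nfc_latin])
# ... while the Cyrillic sequence stays as base letter + combining mark.
print([f"U+{ord(c):04X}" for c in nfc_cyrillic])
```

Both strings are equally valid Unicode text; only the first happens to have a one-code-point spelling as well.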

Please see this FAQ for further information.

The drawback of course is that font developers may naïvely support only the precomposed characters that are part of existing character sets, but more and more fonts being developed nowadays (including system fonts for the new OSes) support the base letter and diacritic model natively, allowing even obscure combinations like ʣ̢̥ to be rendered more or less correctly. Instead of creating each combination separately, as font developers used to do, nowadays you can use the “mark to base positioning” feature of OpenType to define where to put diacritics on base letters so that any combination of base letters and diacritics in your font can be supported and displayed correctly.
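
The idea behind mark-to-base positioning can be sketched in a few lines of Python. This is a toy model, not real font code: the glyph names and anchor coordinates are invented, and it ignores advance widths and stacked marks, which a real shaping engine also has to handle.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    x: int
    y: int

# Hypothetical anchor data in font units. A real font keeps
# comparable points in its OpenType positioning tables.
BASE_TOP = {"a": Anchor(250, 500), "dz": Anchor(480, 700)}
MARK_TOP = {"acute": Anchor(0, 480), "ring": Anchor(10, 470)}

def place_mark(base: str, mark: str) -> Anchor:
    """Offset that makes the mark's anchor land on the base's anchor."""
    b, m = BASE_TOP[base], MARK_TOP[mark]
    return Anchor(b.x - m.x, b.y - m.y)

# Two bases and two marks give four combinations from four anchors,
# with no per-combination glyph drawn by the designer.
for base in BASE_TOP:
    for mark in MARK_TOP:
        print(base, mark, place_mark(base, mark))
```

The point of the design is that the number of anchors grows with bases plus marks, while the number of supported combinations grows with bases times marks.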
Jongseong Park says

October 6, 2015 at 10:56 am

We discussed this issue back in 2010, by the way:
UNICODE, NORMALIZATION, AND GREEK

There, I brought up the lack of Unicode support for many Arabic letters used in Ajami orthographies. There have been plenty of developments since then, with some of those missing letters added to Unicode (notably with the addition of the Arabic Extended-A block in 2012’s version 6.1) and several more in the pipeline.
Greg Pandatshang says

October 6, 2015 at 11:19 am

OT: I was watching the Satyajit Ray movie Mahanagar (aka The Big City or The Great City), about a not-at-all-affluent white collar family in mid-century Calcutta, the other day. There’s a scene in which the wife of the family, not at all comfortable in English, has to sign her name in Roman letters. I was a little surprised that she uses a macron over the -i at the end of her given name (the macron is clearly visible in addition to the dot of the i). I wonder what norms prevailed in British India as to individuals transcribing their own names for use in English.
Yuval says

October 6, 2015 at 12:12 pm

Is ancient Hebrew in there?
MMcM says

October 6, 2015 at 12:50 pm

Greg Pandatshang, I think you misinterpreted that scene. It’s not a macron; it’s a misplaced crossing for the t.

Her name is আরতি, not আরতী.

here.
Greg Pandatshang says

October 6, 2015 at 2:37 pm

Rats. I thought it was that rare moment: a macron sighting in the wild!
Rodger C says

October 6, 2015 at 3:08 pm

Well, there was the British literary scholar of Sri Lankan origin, Gāmini Salgādo, whose surname wasn’t even of Asian origin.
Y says

October 6, 2015 at 6:13 pm

David M., It’s the characters with a slash through them, exactly as Carey Evans said. I hadn’t realized there was a slash-through diacritic. b̸, d̸, g̸ work fine.
Greg Pandatshang says

October 6, 2015 at 6:26 pm

For Unicode suggestions, superscript y is the main thing that comes to mind, because it seems to pop up from time to time as an alternative to ʲ. According to Wikipedia, the Americanist notation chart of 1916 also made extensive use of subscript y and subscript w, along with some obscure small caps characters, such as smallcap eng. However, I have no reason to think those have been in use much in the last century.

It would be nice if there could be new precombined characters added, such as macron+acute or grave for transcribing Vedic Sanskrit, since combining characters still often fail to display correctly in my experience. Hopefully the next few years will see more progress on that, so eventually the combined characters are consistently indistinguishable from the precombined.
Y says

October 6, 2015 at 6:50 pm

Yuval, there’s Phoenician, intended for use with Old Hebrew too, and there’s Samaritan.
Y says

October 6, 2015 at 7:16 pm

Who at Unicode should one write to directly, with images etc.?
Carey Evans says

October 6, 2015 at 8:28 pm

If you want to see a lot of macrons, Greg, you can have a look at New Zealand government websites like http://www.tetaurawhiri.govt.nz/.
zyxt says

October 6, 2015 at 10:20 pm

For the Croatian and Serbian language(s), the then Yugoslav Academy (now the Croatian Academy) used in its publications a set of symbols designed to have a one-to-one mapping between the Latin and Cyrillic alphabets. In the current Croatian alphabet they are represented by digraphs lj, nj, dž and dz. The signs look like: ļ, ń (ń is already recorded in Unicode), ģ, and ȥ. A link to one volume of the Academy’s dictionary in PDF form is at https://archive.org/details/rjecnikhrvatskog06jugouoft.

Croatian dialectology also uses symbols for
(a) an affricate sound that is intermediate between č and ć. It is a č with a point or a stroke inside the haček.
(b) open and closed e and o sounds by placing dots and diacritics under & over those letters.
The letters under (a) and (b) can be seen in PDF form in Croatian academic journals, eg. the dialectological “Kaj” at http://hrcak.srce.hr/kaj

Incidentally, has anyone at Unicode thought about improving the Glagolitic block by:
(1) Introducing the third case used in glagolitic publications. See for examples the 3 different forms/cases of the letter A used in the 1629 Azbukividnjak: https://books.google.com.au/books?id=ti5UAAAAcAAJ&printsec=frontcover&dq=inauthor:%22Rafail+Levakovi%C4%87%22&hl=en&sa=X&ved=0CBwQ6AEwAGoVChMIjJ74wKKvyAIV4x2mCh2OJQCg#v=onepage&q&f=false
(2) Introducing the special letters designed to correspond to certain Cyrillic letters. These were introduced in the 17th century in the “East Slavonicised” editions of the Roman Congregation for Propagation of the Faith. See eg. the table on pages 4, 6 and 8, and explanation at p 69-78 of the 1753 “Bukvar” (primer-book) https://books.google.com.au/books?id=8lpdAAAAcAAJ&pg=PA9&dq=Bukvar&hl=en&sa=X&ved=0CEIQ6AEwBmoVChMI2-W59aKvyAIVRNqmCh1PogG2#v=onepage&q=Bukvar&f=false
(3) Introduce the “i pročaja” (etc.) sign used in glagolitic. This is essentially a cursive č with a titlo on top of it. See p 62 of the 1753 Bukvar.
(4) Introduce the letter šć, which is essentially the letter šta with a three-dot diacritic. See p 62 of the 1753 Bukvar.

Solarić’s 1812 Bukvar is another source for suggestions (1)-(4) and it contains a nice illustration of the three cases used in Glagolitic. See the last (glagolitic) section of Solarić’s Bukvar at:
https://books.google.com.au/books?id=hZZbAAAAcAAJ&pg=PT6&dq=Bukvar&hl=en&sa=X&ved=0CE8Q6AEwCGoVChMI2-W59aKvyAIVRNqmCh1PogG2#v=onepage&q=Bukvar&f=false
George Gibbard says

October 7, 2015 at 1:58 am

Combining diacritics are now well supported by SIL fonts (free from their website, so I haven’t even bothered to see how the competitors compare). I don’t know what font is used to display on this site, but ā́ looks good to me now. See how it looks when I post. For that matter sádēvā̃́ḗhávakṣatu
George Gibbard says

October 7, 2015 at 2:04 am

ī̃́
George Gibbard says

October 7, 2015 at 2:05 am

I’ll admit the last post doesn’t look entirely elegant in my browser.
Ian Press says

October 7, 2015 at 4:43 am

Thanks for all those. I’ll see what I can do. My problems started when Word/Mellel/Nisus files of stuff I had written turned into gobbledygook. Scanning with ABBYY on a Mac was fine except for those accented characters in cyrillic, e.g. ‘é’ became ‘б’. I know after years of drudgery that such things can be dealt with relatively straightforwardly, but still. And I so hanker after the days of Word 7.4 or 7.5, when one could assign different fonts to a cmd + whatever you liked sequence.
David Marjanović says

October 7, 2015 at 6:28 am

superscript y

ʸ is already in (U+02B8), right behind good old ʷ (U+02B7).
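
For anyone who wants to check these two, a short sketch with Python's standard `unicodedata` module prints their names and decompositions. It also shows one caveat relevant to databases: the characters survive ordinary (NFC) normalization, but compatibility normalization (NFKC) folds them back to plain letters.

```python
import unicodedata

SUPERSCRIPT_W = "\u02b7"
SUPERSCRIPT_Y = "\u02b8"

for ch in (SUPERSCRIPT_W, SUPERSCRIPT_Y):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "|", unicodedata.decomposition(ch))

# NFC keeps the superscript letter as its own code point ...
kept = unicodedata.normalize("NFC", SUPERSCRIPT_Y)
# ... but NFKC treats it as a styled variant of plain y.
folded = unicodedata.normalize("NFKC", SUPERSCRIPT_Y)
print(repr(kept), repr(folded))
```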

I’ll admit the last post doesn’t look entirely elegant in my browser.

It does in Firefox.
languagehat says

October 7, 2015 at 9:12 am

It does in Firefox.

Yup, looks good to me.
Doug Barton says

October 7, 2015 at 10:18 am

Aside from Roman characters, it certainly would be helpful to include epsilon and omicron with circumflex/perispomeni (ε̃ ο̃) for non-Byzantine accent placement systems like United Greek. United Greek also marks long alpha/iota/upsilon with underdots and these characters should be included too, both majuscule & minuscule here (α̣ι̣υ̣ẠỊΥ̣).
David McCann says

October 7, 2015 at 10:36 am

We need the AR ligature for numismatics: any textbook or catalogue will show the use of the standard symbols Æ for bronze, Ꜹ (AV, if it’s not in your font) for gold, and AR for silver; only the last is missing.

The only superscript letter missing is “q” (look in Phonetic Extensions), but we could use a full set of subscript letters for labeling variables.

Also, we have symbols for French playing card suits (e.g. diamonds, spades, etc), but not for the Latin suits (cups, coins, swords, and clubs). The latter are used throughout the Spanish-speaking world, and also in Italy. That’s hundreds of millions of users.

[this comment posted by Alex Fink on David’s behalf]
Jan van Steenbergen says

October 7, 2015 at 12:27 pm

One thing I have always found surprising and inconvenient is that Unicode does not have T-acute and D-acute. I know from experience that the combining diacritics often look bad in these cases. They are often used in Slavistics and in romanizations of Old Church Slavonic. They are needed in Interslavic as well, and using T-haček and D-haček as a workaround is far from ideal.
Jan van Steenbergen says

October 7, 2015 at 12:29 pm

Aha, and although the combining diacritics are not a problem in this case, adding V-acute wouldn’t be a bad idea either.
Unicode requester (Kirk) says

October 7, 2015 at 4:59 pm

@ Y: Send documentation (screenshots, PDFs, etc.) to me via languagehat, or I can provide you with a DropBox location. I’m gathering the documentation for SIL, and they will make the actual proposal to Unicode. (Proposing new code points is not straightforward, and SIL has experience making successful proposals.)

@ Y: SIL has provisionally decided to request slashed p, b, d, g. I found all four at the link you provided. Do you have any other sources, so we can demonstrate this isn’t nonce usage?
TIPA decided not to go with precomposed letters for these because they’re typewriter substitutions and the combining diacritic Carey Evans mentioned replicates a typewriter quite well. But Unicode did accept a slashed letter as recently as Latin Extended-E, which is promising. So we’ll see how it plays out.

@ Greg Pandatshang, we already have superscript y. But the *subscripts* in the 1916 Americanist chart are promising. There’s also a small-cap Δ and maybe a few other unsupported letters. (Hard to verify with an html table.) Do you have sources of any of these symbols in actual use? I have the booklet with the original chart on order through ILL, but plenty of transcription proposals like this never went anywhere so citations of use, especially if they span a significant chunk of time, would make a stronger argument.

For the rest of you, as Jongseong Park noted, Unicode is not accepting letter+diacritic combinations unless the diacritic joins the letter (like a retroflex [!]) or intersects it (like b̸). So accented Cyrillic vowels ain’t gonna happen. Such things should be generated by the font, and SIL won’t even consider making requests for them. If you aren’t getting good results, then you need a better font. If an SIL font isn’t producing good results, then write to SIL and they should be able to fix it. At least, I’ve had success doing that. You could try other font producers too.

Also, proposing a new script would be a very time-consuming project of its own, one which I do not have the expertise for. Many ancient scripts are in the works anyway. Sorry!
Unicode requester (Kirk) says

October 7, 2015 at 5:25 pm

@ David McCann: Citation please. I’m not finding anything. I’ve gone back to the 19th century, and all I’m getting is “AV” and “AR” written as two letters.
Y says

October 7, 2015 at 7:42 pm

Kirk, thanks, I’m personally pleased with the overstrike slash. The other source, besides Canfield’s book which you mentioned, is The University of Chicago Spanish-English English-Spanish Dictionary, at least some older editions.
Unicode requester (Kirk) says

October 7, 2015 at 8:06 pm

@ zyxt: Glagolitic is beyond my ken. Sorry.

For the Latin Slavic stuff, could you tell me where to look for examples? I don’t see the Z anywhere. The G just has an acute accent, so that’s not going to happen. The L has a cedilla, which would be worth encoding, but Unicode Ļ is called “L with cedilla” even though it’s a subscript apostrophe in the font I’m using, so I’m guessing that they do not consider the two forms to be semantically distinct.
Doug Barton says

October 7, 2015 at 8:24 pm

Odd criteria, touching or intersecting. Makes you wonder why Unicode bothered with all the non-touching diacritic+letter combos they’ve already made. The fact is that it’s not a question of finding the “right font”; some combinations do not display properly in any font depending on the application, and some do not display consistently with similar single-character symbols in Unicode.

The Unicode folks should also remember that you can’t put a deadkey to use in existing or custom keyboards unless the letter+diacritic result is already a single unicode character.

So on either count using combining diacritics is NOT an acceptable alternative. The idea of some tiny anonymous body in charge of deciding what the computer world needs, it’s kind of exasperating. More so since they turn down new letter+diacritic characters because of – what? – they don’t feel like completing what they’ve already started?
Unicode requester (Kirk) says

October 8, 2015 at 12:56 am

@ Doug: Actually, you can create dead keys. I’ve done it myself. But your font needs to support the output. The consortium is basically saying they’re not going to do the work of font designers any more. Unicode is a mess: we could probably remove a thousand Latin code points, eleven thousand Korean ones, and several tens of thousands of Chinese ones without reducing what Unicode covers. But at some point it’s gotta stop: just because there are a thousand redundant Latin code points doesn’t mean we should add another thousand.
Unicode requester (Kirk) says

October 8, 2015 at 1:01 am

@ Doug: I mean, you can create dead keys that generate characters that are not in Unicode as long as their components are.
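
The principle can be sketched as a lookup table. This is not the format of any real keyboard-layout tool, and the table and function names are invented; it only shows that a dead-key combination can emit a multi-code-point sequence as easily as a single precomposed character.

```python
# Hypothetical layout table: (dead key, base letter) -> output string.
DEAD_KEYS = {
    ("acute", "e"): "\u00e9",   # precomposed e-acute exists, emit one code point
    ("acute", "t"): "t\u0301",  # no precomposed t-acute, emit base + combining acute
    ("slash", "b"): "b\u0338",  # base + combining long solidus overlay
}

def press(dead: str, base: str) -> str:
    """Return the output for a dead key followed by a base letter."""
    # Unknown combinations fall back to the bare base letter.
    return DEAD_KEYS.get((dead, base), base)

for combo in DEAD_KEYS:
    out = press(*combo)
    print(combo, [f"U+{ord(c):04X}" for c in out])
```

Whether the two-code-point outputs display well is, again, a matter for the font.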
zyxt says

October 8, 2015 at 1:19 am

Kirk: The ȥ symbol can be found at page 686 of this volume of the Academy’s dictionary:
https://archive.org/details/rjecnikhrvatskog06jugouoft for the word meȥan. You can see there that ȥ and z are interfiled, i.e. the alphabetical order of the words is “mezalin, meȥan, mezana”, not “mezalin, mezana,… meȥan”. However, by the time they got to the volume of the dictionary for the words beginning with ȥ, I believe the editorial policy changed and they grouped all the ȥ words separately from the z words.
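
The two orderings can be reproduced with a small Python sketch. It assumes the hooked letter is entered as U+0225 (LATIN SMALL LETTER Z WITH HOOK), and the sort key is a simplification for illustration, not a full collation scheme.

```python
Z_HOOK = "\u0225"  # assumed code point for the hooked z

words = ["mezana", "me" + Z_HOOK + "an", "mezalin"]

# Plain code-point order puts the hooked z after every ordinary z.
by_code_point = sorted(words)

# Interfiling: treat the hooked z as an ordinary z when comparing.
interfiled = sorted(words, key=lambda w: w.replace(Z_HOOK, "z"))

print(by_code_point)
print(interfiled)
```

The first list matches the separated order and the second matches the dictionary's interfiled order described above.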

As a side note, the first editor of the Academy’s dictionary used an ordinary c to represent the /dz/ sound, so it is possible that the Croatian user community might want to link the two uses of c in the encoding of ȥ so that in the first few volumes of the dictionary c /ts/ maps to c and c /dz/ maps to ȥ.

A second side note is that late 18th and early 19th century Dubrovnik printers differentiated the two sounds in their version of the Latin alphabet. Jakov Mikalja (Croatian writer of the 1600’s) also differentiated the sounds in the edition of his works – z = /dz/, ç = /ts/, ç = /tʃ/, 3-like symbol = /z/. The two ç cedillas are different, in that for one sound, the cedilla swings to the left, while it swings to the right for the other sound. Refer to page 3 of his Dictionary for the two different cedillas in the word çlançieh /tʃlantsieh/ = člancijeh (modern orthography). (https://books.google.com.au/books?id=VTJRAAAAcAAJ&printsec=frontcover&dq=micalia&hl=en&sa=X&ved=0CBsQ6AEwAGoVChMI4ZfFjI2yyAIVyhOUCh1SOgfx#v=onepage&q=micalia&f=false).

For dialectological symbols, this article in the Kaj magazine gives examples: http://hrcak.srce.hr/index.php?show=clanak&id_clanak_jezik=210846, including examples of č with a modified haček, as well as the open and closed vowels

Finally, Albanian and Maltese books have interesting symbols to represent the sounds unique to their language. These letters are not represented in Unicode:

For Albanian, the symbols for dh, th, y, z and zh. Refer to the section “Older versions of the alphabet in Latin characters” at https://en.wikipedia.org/wiki/Albanian_alphabet.

For Maltese, the symbols for g, għ, ħ, w, x. Refer to the section “Older versions of the alphabet” at https://en.wikipedia.org/wiki/Maltese_alphabet.
zyxt says

October 8, 2015 at 1:29 am

There are also interesting new letter combinations for Croatian used by Đuro Augustinović in the middle of the 19th century. His attempt to continue Ljudevit Gaj’s language reforms was not well received but he published a number of books using his version of the Latin alphabet. The details and pictures of the letters can be found in this article: hrcak.srce.hr/file/116318.
zyxt says

October 8, 2015 at 1:34 am

PS
Just to clarify: Augustinović’s reforms to the Latin script were only used by him – and would not qualify for encoding in Unicode.
Mikalja’s use of left and right ç cedillas was more widespread, I believe. However off the top of my head, I cannot think of other authors who published books using those cedillas. Having said that, Mikalja’s dictionary was a widely used resource in the Catholic education system of the time.
Jongseong Park says

October 8, 2015 at 3:07 am

Kirk is right to point out that the request for better coverage of (non-touching) letter-diacritic combinations should be directed at font designers, not the Unicode Consortium. Of course, the problem is that while the Unicode Consortium is a single body, font designers are not. But it is important to raise the issue anyway for at least developers of operating systems, so that the default fonts we use on our computers and mobiles are able to support the more obscure letter-diacritic combinations we need.

By the way, one of my favourite fonts with coverage for Latin, IPA, Greek, and Cyrillic with comprehensive support for lots of rare characters is the new custom typeface designed for the academic publishing house Brill, the Brill Typeface.
John Cowan says

October 8, 2015 at 3:43 am

Makes you wonder why Unicode bothered with all the non-touching diacritic+letter combos they’ve already made.

No Unicode without them would ever have flown at the time. People thought that every kind of computer would keep its own character set and just translate to and from Unicode. If Unicode needed two characters where (almost) any other character set could use just one (never mind the size of the character in bits), there would be no Unicode, that’s all.

The fact is that it’s not a question of finding the “right font”

It is a question of having the right font engines. Some kinds of fonts are dumb and the engine is wrong (either obsolete, or treating something as trivial that is not trivial, especially in edge cases). Other fonts have built-in engines, each their own. Some engines reside in specific applications, others with the operating system. And there are intermediate positions.

I of course support this effort.

(Kirk, are you meteg-Kirk? I can’t remember your/his last name.)
David McCann says

October 8, 2015 at 12:56 pm

Citations:

Numismatics / Philip Grierson. Oxford, 1975.
[Captions to plates]

Roman history from coins / by Michael Grant. Cambridge, 1958.
[Key to plates p. 91-2]
Kirk says

October 8, 2015 at 4:23 pm

@ zyxt: Looks like the z-hook is covered by Ⱬ ⱬ. Any reason to think they are distinct letters? Similarly c-reversed cedilla vs c-ogonek.

I’m afraid I’m completely ignorant of medieval orthographies and the Unicode requirements for them (the discussions I see online get quite esoteric), so I can’t judge whether the conventions you mention were widespread enough, or semantically distinct enough, to encode. I don’t see any new letters in Albanian apart from digraphs. For Maltese, there’s the question of whether Vassalli’s alphabet is important enough to codify. The Wikipedia article doesn’t give any indication of its notability, and the image isn’t clear enough for me to make out all the letters with any certainty (is the tsade a turned y?). I’d be happy to request them if they were more than just a nonce creation for a grammar; there are lots of those, and lots of requests for them that haven’t gone anywhere.

I’m on surer ground with KAJ magazine. The ‘crown’ diacritic (hacek with a third line) is probably new; I don’t see it in the charts. J-tilde will be handled by any decent font. Is there anything else I should see?
Y says

October 8, 2015 at 4:52 pm

There’s John P. Harrington’s phonetic symbols, with an extensive summary here (with worse penmanship than the originals). Although most of these symbols are idiosyncratic, they are worth encoding, because a) Harrington was famously a very exact and perceptive phonetician, and the symbols are meaningful; b) the meaning of these symbols is not certain in all cases, some having to do with segmental phonetics, others having to do with prosody, and so can’t be reliably substituted for with commoner equivalents; and c) for quite a few languages, Harrington’s materials are the most extensive and reliable documentation available, and are worthy of being fully transcribed. For these reasons, the Harrington Database Project has been transcribing everything, but has to use a makeshift encoding system, as have some linguists.

Some of the symbols with no clear Unicode equivalent are reverse B; superscript r, ʀ, t; above-combining #, T; below-combining iota, reverse iota, double comma, ring-and-dot; and some others.
Kirk says

October 8, 2015 at 5:09 pm

@ David McCann: Got it. Is a lower-case a-r ligature ever used, or should I just request the capital form?
Kirk says

October 8, 2015 at 5:14 pm

@ Y: I’ve already received a request for Harrington’s #. Superscript r and R I’m already requesting for English dialectology, but this will be good to add. If you think your name or position would be of value in the request, please ask Steve to pass it along so I can quote you on the need to transcribe Harrington precisely.
Y says

October 8, 2015 at 5:25 pm

And one more. I’ve seen a lower tilde used as a partially closed allophone of the glottal stop in Polynesian languages. Phonetically it’s like an under-tilde (creaky voice) below the following vowel, but phonologically it’s just a consonant. There is 02F7 which looks right, but a spacing modifier doesn’t have the right semantics. I’ve only seen it in one publication (Margaret Mutu and Ben Teikitutoua, Ùa Pou : aspects of a Marquesan dialect. Canberra: Pacific Linguistics, 2002).
Kirk says

October 8, 2015 at 6:20 pm

@Y, re. Harrington: Some of those look like standard cursive IPA.

I’ve written a couple people on the project to ask for advice.
Kirk says

October 8, 2015 at 6:26 pm

@Y, re. Marquesan: Got a response that they wouldn’t want to propose anything that’s already in the spacing modifier range because those code points should be able to be used as letters. But, if U+02F7 doesn’t have the right properties we could request that Unicode change the properties.

Is there something specific about that character that’s problematic, or is it just that it’s in the wrong code block?
Y says

October 8, 2015 at 7:09 pm

Marquesan: The appearance is fine, just the semantics may be off. I don’t understand the “spacing modifier” semantics exactly, but I suppose that word-breaking before ˷ (02F7) may not be the same as with its allophone ʔ.
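
For anyone who wants to check the property question concretely, here is a minimal sketch using Python's standard `unicodedata` module. It prints the general category of U+02F7 next to that of the glottal stop letter U+0294, and then shows how Python's regex `\w` treats each one. The helper names `describe` and `words` and the test strings are hypothetical placeholders, not real Marquesan. Note also that `\w` is only Python's own notion of a word character, not the Unicode word-break algorithm, which may well treat the low tilde differently.

```python
import re
import unicodedata

LOW_TILDE = "\u02F7"      # the spacing modifier mentioned above
GLOTTAL_STOP = "\u0294"   # the ordinary glottal stop letter


def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def words(text):
    """Split text into runs of what Python's regex engine counts as word characters."""
    return re.findall(r"\w+", text)


for ch in (LOW_TILDE, GLOTTAL_STOP):
    print(describe(ch), "isalpha:", ch.isalpha())

# Hypothetical placeholder strings, not real Marquesan words.
print(words("a" + LOW_TILDE + "a"))
print(words("a" + GLOTTAL_STOP + "a"))
```

If the low tilde comes out as a modifier symbol rather than a letter, that would be one concrete property that software can trip over, and the sort of thing a property-change request would address.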

Harrington: I’m no expert. The folks at the Harrington project have worked with his materials for years.

Many of his symbols are indeed cursive versions of what would evolve into the IPA, but many have no equivalent or even anything which appears similar.
Kirk says

October 8, 2015 at 8:14 pm

@ David McCann, re. suits: Since the Unicode range for the 52 cards covers both playing cards and the minor arcana of tarot, which typically use something closer to the Latin suits, and since there’s a note that the exact realization will depend on the font, I suspect that the code points for suits themselves may be intended to be just four generic suits in black and white, and whether you get wands, coins, acorns or bells is up to the font designer: the suits do all have a one-to-one correspondence, after all.

The Unicode chart says of the cards, “These characters are used to represent the 52-card and 56-card variants of modern playing cards, as well as the 56-card Minor Arcana of the Western Tarot. The glyphs shown in the charts have only a symbolic and schematic equivalence to particular varieties of actual playing cards.”

Since this is completely unrelated to my project, I’ll give it a pass.
zyxt says

October 8, 2015 at 9:24 pm

Kirk:

I believe the Albanian special letters were in widespread use in the pre-standardisation era (before 1908) in the publications of the Catholic church. For an example of the special letters refer to the 1716 work “Osservazioni grammaticali nella lingua albanese” at p 1-2: (https://books.google.com.au/books?id=E9FEAAAAcAAJ&printsec=frontcover&dq=Osservazioni+grammaticali+nella+lingua&hl=en&sa=X&ved=0CB4Q6AEwAGoVChMIhJ_3qpm0yAIVw9imCh12oQc7#v=onepage&q=Osservazioni%20grammaticali%20nella%20lingua&f=false).

My familiarity with the Maltese language is superficial and I don’t know how widespread these symbols were. At first glance it does appear that the various diacritics and specially-invented letters were not used with much consistency – it is likely that their use differed from author to author. However, more research on this is required, especially by the Maltese language user community. (Are there any Maltese people reading this? – it would be great to hear from you)

As to the Croatian letters, thanks for your advice on ⱬ and ç. For other Croatian dialectological symbols, I am happy to email you additional scans of dialectological works which contain some more symbols. Please email me on my newly-created “unicode” email address: zyxt-uc “AT” net.hr.
Kirk says

October 8, 2015 at 10:32 pm

@ David McCann: the A-R ligature already exists, at U+1F707, as an alchemical abbreviation for aqua regia.
Kirk says

October 8, 2015 at 10:50 pm

@ zyxt, re. Albanian: Yes, it appears that some sort of augmented Latin alphabet for Albanian was widespread for quite some time. I would assume that it’s the one in that grammar. The additional letters appear to be Greek ξ ξξ ȣ ζ λ. The question then is whether they are distinct enough to require separate encoding for Albanian, or if they were simply local variants of Greek letters that would be typeset with standard Greek today, much as we use Greek code points in the IPA. As part of a basically Latin alphabet, you’d *think* we’d need Latin Xi, Latin Zeta, and Latin Lambda. (We already have Latin Ou.) This is unfamiliar territory for me: the Kajkavian stuff is simply dialectal notation in a modern alphabet, so it’s not a problem, but I don’t know about this. I’ll ask around and see what I can find.
George Gibbard says

October 9, 2015 at 2:33 am

I don’t know how to find out what this is called, or if it’s in Unicode, but I know my SIL IPA keyboard for Mac doesn’t support it: namely the diacritic for variable vowel length, as used in Armbruster’s (1960) grammar of Dongolese Nubian. This is a macron combined with a following breve. I can retranscribe this as e.g. a(a) or a(ː), but it would be useful to have the original symbol as an option. Does anyone know about this? It would also be useful for people who talk about quantitative poetic meter, to mean a syllable that is allowed by the meter to be short or long.
George Gibbard says

October 9, 2015 at 2:59 am

ā̆ is also an option for me, but this reminds me: how does one write a symbol for “either one heavy or two light syllables” (macron with two breves above it)?
Kirk says

October 9, 2015 at 3:00 am

@ George: Do you mean this: a᷌ ? Those are used for Lithuanian, but the macron hangs between the ‘a’ and the previous letter, while the breve is centered over the ‘a’: aa᷌. There’s also the opposite ordering, aa᷋. Those are at U+1DCB and U+1DCC.

If you mean something else, could you post a link or a screenshot? I’m trying to see them in his posthumous dictionary on GBooks, but the image is too blurry to make them out.
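
To make the difference between the two approaches concrete, here is a small sketch with Python's standard `unicodedata` module. It compares the single combining mark at U+1DCC with a stacked sequence of an ordinary combining macron (U+0304) plus combining breve (U+0306), which is presumably what a stacked "ā̆" is made of. The `dump` helper is a hypothetical name for illustration only.

```python
import unicodedata


def dump(text):
    """List (code point, character name) for every character in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


single_mark = "a\u1DCC"       # one combined macron-breve mark
stacked = "a\u0304\u0306"     # separate macron, then breve stacked above it

print(dump(single_mark))
print(dump(stacked))

# NFC folds a + macron into the precomposed letter and leaves the breve separate.
print(dump(unicodedata.normalize("NFC", stacked)))

# The two spellings are distinct characters and stay distinct under normalization.
print(unicodedata.normalize("NFC", single_mark) == unicodedata.normalize("NFC", stacked))
```

So a search or comparison will not treat the Lithuanian-style mark and the stacked pair as the same text, even when a font draws them similarly.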
Kirk says

October 9, 2015 at 3:03 am

@ George: if you mean two breves side-by-side, AFAIK you can’t do it. That would require a separate code point. Send me a link or image and unless it’s vanishingly rare I’ll request it.
George Gibbard says

October 9, 2015 at 3:08 am

The Lithuanian usage is indeed what I’m looking for (can you get it in SIL keyboards?). As for the two breves side-by-side, I’ll look for evidence. I have the Latin Grammar I learned it from in storage so I can’t use that for now.
George Gibbard says

October 9, 2015 at 3:20 am

I can’t exactly find what I’m looking for in Google books, but I have so far three examples of the opposite, two breves with a macron above them:
https://books.google.com/books?id=mM9A9S9KDFoC&pg=PA218&dq=%22dactyls+or+spondees%22&hl=en&sa=X&ved=0CDcQ6AEwBWoVChMIvuvejOu0yAIViRg-Ch1T5A98#v=onepage&q=%22dactyls%20or%20spondees%22&f=false
https://books.google.com/books?id=I2IwAQAAMAAJ&pg=PA201&dq=%22dactyls+or+spondees%22&hl=en&sa=X&ved=0CDAQ6AEwBDgKahUKEwi1pNzP67TIAhVJax4KHdeVBPY#v=onepage&q=%22dactyls%20or%20spondees%22&f=false
https://books.google.com/books?id=g49fAAAAMAAJ&pg=PA345&dq=%22dactyls+or+spondees%22&hl=en&sa=X&ved=0CDYQ6AEwBTgKahUKEwi1pNzP67TIAhVJax4KHdeVBPY#v=onepage&q=%22dactyls%20or%20spondees%22&f=false

I hope this is enough.
George Gibbard says

October 9, 2015 at 3:33 am

So maybe too obscure to qualify. I have been looking for texts on Persian Poetic meter that would use this but I can’t find them.
Kirk says

October 9, 2015 at 3:39 am

I don’t think your Latin grammar would be too obscure. I’d assumed you meant a macron and two breves over a letter. That I don’t think is possible. But if you just mean the macron and breves by themselves, in a line of spaced breves and macrons for meter, then we can probably do it.
George Gibbard says

October 9, 2015 at 4:24 am

Right, I mean a macron by itself with two breves above it, but I can’t find examples in google books for it. In my comment awaiting moderation, I have instead some examples of pairs of breves with macrons above them.
Kirk says

October 9, 2015 at 4:05 pm

@ George: How’s this for the second link? : ˉ́˘͞˘|ˉ́˘͞˘|ˉ́˘͞˘|ˉ́˘͞˘|ˉ́˘˘|ˉ́ˉ̆.
I realize the alignment’s not ideal, but the pipes could be made superscript.

The problem isn’t the combination, but getting a baseline breve to begin with. If you don’t object to the plain breves being superscripted, then this should work.
Kirk says

October 9, 2015 at 4:11 pm

@ zyxt, re. Albanian: This is too far afield. SIL is looking to improve support for current work, especially linguistic. They recommend contacting Michael Everson to see if he’s interested, as he’s made lots of proposals like this. Evidently he got a capital Latin xi accepted (don’t know where), so the l.c. Latin xi here might fit in.
George Gibbard says

October 9, 2015 at 5:33 pm

Thanks, looks like a tolerable workaround.
Kirk says

October 10, 2015 at 7:38 pm

@ zyxt, re. emails:

Some of my responses are getting bounced back, so I’ll summarize here.

Most of your scans and links seem to be of letter-diacritic combinations, which Unicode won’t accept. The three-point “crown” appears to be semantically hacek + acute, which often replaces it, so I assume that they’re equivalent and the exact form will depend on the font, so again, no need for a Unicode point. If the (z) copyright sign is a normal way of indicating that in Croatian, and is superscripted, please send me a scan. (If it’s just from that one dictionary it would be inappropriate to request.) The jer & jeri should probably be encoded in Latin, not just Cyrillic, so those are good. I really like the historical forms and wish I had them available but, like the Albanian letters, they’re beyond the scope of this project. Maybe try Everson.
Kirk says

October 10, 2015 at 9:23 pm

@ zyxt: my response has failed twice. Maybe you’ve hit your account limit?

Short answer: sorting is handled by your software, which should now be sophisticated enough that Unicode needn’t assign code points to every alphabetic character. But they’ve got regardless.
Kirk says

October 10, 2015 at 9:26 pm

I posted a precomposed g+hacek, but it doesn’t appear in the text. Go to “g” on Wikipedia and scroll to the bottom of the page. It’s listed there.
Kirk says

October 11, 2015 at 12:35 pm

@ zyxt, re. capital long-s: Email failed again. Here’s what I wrote:

I forwarded your texts (19) to someone who had requested a capital long-s, but who didn’t have any direct evidence for it. Here’s what he had to say:

“Interesting, in particular since one of the texts (Scivot) clearly distinguishes between ∫ and ſ. Thus ∫ clearly needs to be encoded separately from ſ and does not provide a “backdoor” for getting a capital ſ encoded.

“Note that the currently encoded long s (ſ) also has a descender in all blackletter fonts, all italic fonts (analogously to f) and some roman fonts (e.g., the Ehmcke Antiqua, a specimen of which I sent you earlier). As the ∫ used in the example text seems to be an ſ from some italic typeface (probably the italic version of the same typeface), it is not surprising that the distinction between ∫ and ſ vanishes in italic type (see the introductory paragraphs of chapters in the Scivot example).

“Finally note that U+0283 (ʃ, latin small letter esh) bears a strong similarity to the proposed character. However, its capital form U+01A9 (Ʃ) looks pretty different.”
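
The case relationships mentioned in that reply can be checked with a short Python sketch. It assumes the ∫ in the quoted text is U+222B (the mathematical integral sign), which is a guess about how the character was typed; `show` is a hypothetical helper name.

```python
import unicodedata

ESH = "\u0283"        # latin small letter esh
LONG_S = "\u017F"     # latin small letter long s
INTEGRAL = "\u222B"   # assumed stand-in for the character in the scans


def show(ch):
    """Return (code point, name, category, code points of the uppercase mapping)."""
    upper = [f"U+{ord(c):04X}" for c in ch.upper()]
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch), upper)


for ch in (ESH, LONG_S, INTEGRAL):
    print(show(ch))
```

Esh uppercases to its own capital, long s uppercases to an ordinary S, and the integral sign, being a math symbol, has no case mapping at all. That gives a rough picture of why none of the existing code points doubles as a capital long s.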
Kirk says

October 11, 2015 at 12:48 pm

Steve, since no new threads have appeared in a couple days, and the existing discussions appear to have petered out, I’m going to unpin this page. If there’s a new thread, could you let me know?

Thanks for your help with this. We’ve added at least Latin ъ ь to our proposal, with a couple other letters pending.

George, can you do anything with this: ᴗ͟ᴗ ? or this: ᴗ͞ᴗ ?
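
For anyone wanting to reproduce those two, here is a sketch of how such sequences can be assembled in Python. It assumes they are built from U+1D17 (a breve-shaped letter that sits on the baseline) joined by the combining double macron above (U+035E) or below (U+035F); that is a reading of the characters as posted, not something confirmed in the thread, and `dump` is a hypothetical helper.

```python
import unicodedata

BREVE_LETTER = "\u1D17"          # baseline letter shaped like a breve
DOUBLE_MACRON = "\u035E"         # spans two base characters, drawn above
DOUBLE_MACRON_BELOW = "\u035F"   # spans two base characters, drawn below


def dump(text):
    """List (code point, name, combining class) for every character in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c)) for c in text]


# A double diacritic is stored after the FIRST of the two letters it spans.
two_shorts_or_long = BREVE_LETTER + DOUBLE_MACRON + BREVE_LETTER
long_or_two_shorts = BREVE_LETTER + DOUBLE_MACRON_BELOW + BREVE_LETTER

print(two_shorts_or_long, dump(two_shorts_or_long))
print(long_or_two_shorts, dump(long_or_two_shorts))
```

Because the base is a letter rather than a spacing diacritic, it already sits on the baseline, which sidesteps the superscripting problem with the plain breve.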
George Gibbard says

October 11, 2015 at 8:19 pm

Those look great actually, thanks!
Patrick says

October 15, 2015 at 5:21 pm

Maybe the Tahitian ʻeta and the Wallisian fakamoga, whose shape is distinct from the ʻokina used in Hawaiian and other Polynesian languages?

(See the Wikipedia page for the ʻokina.)
Y says

October 15, 2015 at 6:11 pm

The ʻokina (opening single quote lookalike) is unique to Hawaiian. Other Polynesian languages mark the glottal stop with a ʼ (a closing single quote lookalike), except when they go for idiosyncratic marks like the ‘turned apostrophe’ in the Wikipedia article, or more complex systems (now obsolete or obsolescent) fusing vowel length and the presence or absence of a preceding glottal stop into a variety of diacritics.
All the same, the turned apostrophe deserves a Unicode point of its own, much as s and ſ have theirs.
Annette Pickles says

October 17, 2015 at 8:13 pm

As a curiosity found in documents from a sad era of strife in Lebanon, there is the alphabet devised by the Lebanese nationalist Said Akl and used by him in his publications and literary productions, which you can see here. Several of the letters cannot be formed using the current set of combining diacritics, I think—specifically, the c with a small diagonal stroke through it around 8 o’clock for the glottal stop /ʔ/, and the modified y for the ‘ayn /ʕ/.
Yuval says

November 10, 2015 at 5:32 am

Tap tap, is this thing still on?
The Hebrew Latinization movement might be interested in a Unicode batch (there’s an ISO standard), can you please contact me at the mail I submitted here?
languagehat says

November 10, 2015 at 9:32 am

Just adding a comment to bump the thread up so that more people are likely to see Yuval’s request.
John_Mayor says

December 14, 2015 at 7:55 pm

Hi!:

Will there be a Latin superscript “q” placed in Unicode?… and, who’s pushing for its inclusion? Lastly, it would also be helpful to know whether ALL of the Latin SUBscript characters will eventually be incorporated into Unicode!

Please!… no emails?
John Cowan says

December 14, 2015 at 9:22 pm

John Mayor: When someone can produce documentary evidence of their use either in print or in manuscripts (ones not written by the submitter or his friends). Nowadays, Unicode doesn’t encode characters just to be encoding them.
