Hedvig Skirgård and about six million other authors have a paper in Science Advances (Vol. 9, Issue 16, Apr. 2023) with the title “Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss”; its abstract reads:
While global patterns of human genetic diversity are increasingly well characterized, the diversity of human languages remains less systematically described. Here, we outline the Grambank database. With over 400,000 data points and 2400 languages, Grambank is the largest comparative grammatical database available. The comprehensiveness of Grambank allows us to quantify the relative effects of genealogical inheritance and geographic proximity on the structural diversity of the world’s languages, evaluate constraints on linguistic diversity, and identify the world’s most unusual languages. An analysis of the consequences of language loss reveals that the reduction in diversity will be strikingly uneven across the major linguistic regions of the world. Without sustained efforts to document and revitalize endangered languages, our linguistic window into human history, cognition, and culture will be seriously fragmented.
I can’t really understand much of the article, which is full of terms like “traditional nonweighted PCA” and “the function prcomp,” but it seems like it might be of interest, and I hope those who can make sense of it will have things to say. Thanks, BB!
Clarity does not seem to have been high on the list of objectives for this paper, but it seems to be saying that genetically related languages often tend to be typologically more similar than if they were quite unrelated, so if you’re interested in the whole potential range of possibilities for human language it’s a particular pity when isolates and small language families go extinct.
I can only say that I am glad that they have proved these counterintuitive ideas by the magic of Bayesian methods …
No paper with as many putative “authors” as this is going to tell you anything worthwhile (a rough-and-ready heuristic, admittedly, but a remarkably reliable one …)
I think this is really more of a plug for this “Grambank” database than anything else.
On a more positive note, I’m glad to see that the expected “Niger-Congo” has been replaced by “Atlantic-Congo.” Word is getting round, at last … next stop: Volta-Congo! (but by then, I might believe in Atlantic-Congo myself.)
Talking of multiple listed authors, I was looking at Thomas Ennever’s grammar of Ngardi, which I think lists every single speaker individually on the title page. Impressive. I think I’ve only ever seen that before when the grammar in question is based on the speech of one solitary speaker.
I think this is really more of a plug for this “Grambank” database than anything else.
That was my uninformed guess.
Also plus too, besides! Languages which are found near each other often share more features than languages farther apart—even if they are not related!
Those Beige Anne methods sure are clever.
Far as I can see, the only Oti-Volta language listed is Dagbani, and all the data are taken from Olawsky’s short grammar sketch in the LINCOM series, which is largely not about syntax anyway. Several of their yes/no options based on that are plain wrong, to my knowledge: mostly not Olawsky’s fault, as his work isn’t actually meant to be used in that way in the first place.
Same old problems as with their other databases.
Perhaps I could send them a nice bibliography of Oti-Volta grammars. I’m not sure if it’s significant that many are in French, but I have a sinking feeling that might account for some of the more egregious gaps …
They do have the late Stefan Elders’ grammar for Kulango, though. (And quite right too.)
No, I’ve maligned them: they’ve got Gurenne (under “Farefare”), though they’ve missed the best grammatical description, which is by the late Prof Kropp Dakubu. (In French, though; to be fair, so are most of the sources they do cite. I think I was wrong about that …)
Most of their answers for Gurenne actually are correct, but a good many aren’t.
Interesting looking at the actual questions, and thinking how I would answer them for Kusaal. An awful lot of them, I would have to ask for clarification as to what they meant. Like “Is there a logophoric pronoun?” Kusaal uses its contrastive pronoun series logophorically. Does that count? Quite a number like that …
Interesting that for Gurenne, they say it has “conjugation classes.” In reality, there is just one conjugation of verbs inflecting for aspect in all the Western Oti-Volta languages: the differences between surface forms are entirely due to predictable stem-suffix sandhi. There is another conjugation of verbs, but it’s vestigial in Gurenne, containing only about a dozen verbs, which are invariant for aspect. (Kusaal still has sixty-odd.) I’m pretty sure that they’ve misinterpreted the entirely predictable surface variation in the major conjugation for “conjugation classes.”
This may seem like nitpicking (OK, it is nitpicking) but if this is the case with languages I actually know about, I have no confidence in the quality of the data in general.
This sort of lumping of lots of heterogeneous material mined from very disparate sources together and hoping that it will become reliable in some way just from sheer volume is frankly just a kind of syntactic Mass Comparison.
Jest if y’all must, but Grambank is a Big Deal for people working on language technologies, such as myself, since it’s the largest repository of cross-lingual properties (a distinction so far held by WALS). I have several working projects that will benefit from this resource, and any number of NLP applications can over the course of the next few years be improved and fit to underrepresented languages thanks to it.
PCA = principal-components analysis.
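For anyone puzzled by “the function prcomp” (that’s R’s PCA routine): here is a minimal pure-Python sketch of the idea, namely finding the direction along which a feature matrix varies most. The language rows and features below are invented for illustration, and real implementations use a proper eigendecomposition rather than this toy power iteration.

```python
import random

def pca_first_component(rows, iters=200):
    """Leading principal component of a list of equal-length feature rows,
    found by power iteration on the covariance matrix (stdlib only)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    random.seed(0)  # a random start vector avoids getting stuck on a minor axis
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Hypothetical 0/1 Grambank-style answers: four languages, three features.
# Features 1 and 2 vary together (in opposite directions), feature 3 is noise.
data = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
]
pc1 = pca_first_component(data)
print([round(x, 2) for x in pc1])  # PC1 loads on the two correlated features
```

The “most informative axes” in the paper are just such components, ranked by how much of the total variance each one captures.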
@Yuval:
Fair point. How useful something like Grambank is, is certainly going to be partly dependent on what you are proposing to use it for.
But doesn’t widespread unreliability of the actual data impinge on any potential use?
However, it occurs to me that my own expertise (such as it is) is in languages which are probably of very little significance for the sort of projects you have in hand; the unreliability is almost certainly substantially less with more familiar languages.
But the paper (as opposed to the database itself) really asks for the sort of criticism I’ve been making, by playing up exactly the “exotic” aspects of the database and its supposed comprehensiveness as a picture of human language. I very much doubt if the documentation of any but a tiny proportion of these 2400 languages is in a better state than Kusaal, or Mbelime or Dagara (which do not figure at all) or the Gurenne or Dagbani which I mentioned, neither of which is as well documented overall as those three. It seems to me that this is building on sand.
But again, I may well be looking at this from too narrow a viewpoint. What’s your take on all that? How do you actually make use of this kind of data in your own work?
So, I am not qualified to review this paper (though I know what PCA is), but there are a few conclusions that do not seem completely trivial. But first, I must say that Grambank sounds like a 1990s Russian laundromat.
#1. Phylogeny > geography: While the effect of phylogeny varies markedly […] overall it is consistently greater than that of space […]
#2. There are no easily summarizable features that explain language diversity. The first 3 most informative axes explain 21% of diversity, and 19 axes explain 49% (the authors think that this is too small; why?). At any rate, DE should be pleased that they conclude “there is a high degree of flexibility in grammatical structures, rather than tight constraints determined by a small number of underlying factors”
#3. If traditional linguistic categories are used, the most diverse feature is “fusion” (degree to which a language encodes meanings and functions with bound morphology as opposed to phonologically free-standing markers) followed by presence or absence of noun class/gender and then a mess.
I am not sure that that would be my interpretation of their Table S3 (obviously, I am in no position to insist on my interpretation). I would say that fusion is a clear winner, then noun class/gender and flexivity, but they are all strongly interacting.
There is a headscratcher in their concept of “unusualness”, which is what you would expect: the more a language is different from others, the more it is unusual. And then [drumroll] it happens that isolates are the most unusual. C’mon, 93 scientists from 64 institutions cannot be that simple.
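A back-of-the-envelope version of that “unusualness” idea can be sketched in a few lines: score each language by its mean distance to every other language over shared binary features. This is not Grambank’s actual metric (the languages and features below are made up), but it shows why isolates come out on top almost by construction: a language with no close relatives in the sample has no near neighbours to pull its mean distance down.

```python
def hamming(a, b):
    """Fraction of features on which two 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def unusualness(features):
    """Mean normalized Hamming distance from each language to all the others."""
    return {
        lang: sum(hamming(v, w) for other, w in features.items() if other != lang)
              / (len(features) - 1)
        for lang, v in features.items()
    }

# Invented feature vectors: three similar "relatives" and one outlier
toy = {
    "lang_A":  [1, 1, 0, 0, 1],
    "lang_B":  [1, 1, 0, 0, 0],
    "lang_C":  [1, 1, 1, 0, 1],
    "isolate": [0, 0, 1, 1, 0],
}
scores = unusualness(toy)
print(max(scores, key=scores.get))  # prints "isolate": the outlier scores highest
```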
And then “Language families such as Austronesian, Nuclear Trans-New Guinea, and Dravidian are tightly packed together, suggesting strong phylogenetic inertia in this part of the design space. However, other families like Afro-Asiatic or Indo-European are more spread out in the Grambank design space, demonstrating high within-family diversity in these dimensions.” This would be consistent with a historical contingency that most of the linguistic apparatus was developed by observing European languages, enlarging the set to Indo-European and adding some Afro-Asiatic before studying everything else. No wonder that differences within this group will be more richly represented among grammatical features under study.
I can only say that I am glad that they have proved these counterintuitive ideas by the magic of Bayesian methods …
The conclusions are indeed as obvious as anyone could want, but in fairness it actually is valuable to test quantitatively whether what we all already thought was true really is true. (It’s the usual problem of experimental psychology: any experiment will either give the answer we expected all along and be boring, or give an unexpected answer and be wrong, but without the experiments our knowledge hardly counts as science.)
I think this is really more of a plug for this “Grambank” database than anything else.
More precisely, it serves the purpose of having a paper to cite when one wants to cite Grambank, so that the team’s work on this can contribute to their h-index.
This may seem like nitpicking (OK, it is nitpicking) but if this is the case with languages I actually know about, I have no confidence in the quality of the data in general.
I went through their Siwi entry in detail the past couple of days; I estimate that about 85% of the features are correct. Of the remaining 15%, half are unambiguously miscoded, half are correctly encoded relative to the source and their procedures, but incorrect in light of fuller data. I would love to know whether this is representative or not.
(I also posted a comprehensive correction proposal to their GitHub; we’ll see how the response to that goes. It’s good that in principle there’s a way to fix these things, though it would have been nice to build this sort of error correction more systematically into the project structure from the start, rather than waiting for passing linguists to find a bit of spare time.)
I would say that fusion is a clear winner, then noun class/gender and flexivity, but they are all strongly interacting.
So is the reason North Africa+Europe comes out as the most “unusual” part of the world simply the region’s extensive use of fusion and gender? It wasn’t at all clear to me what their unusualness metric actually boils down to in terms of features.
So is the reason North Africa+Europe comes out as the most “unusual” part of the world simply the region’s extensive use of fusion and gender?
I’m not sure they are most “unusual”; it is a completely separate feature. The idea is that whether a language uses fusion and/or gender (and/or flexion) is the most determinative of what sort of language it is, to the (relatively small) degree that anything is. Other than (basically) these 3, you can pair up any feature with any other feature in any way you want (statistically speaking).
The conclusions are indeed as obvious as anyone could want, but in fairness it actually is valuable to test quantitatively whether what we all already thought was true really is true.
True. And indeed, quite a few obvious things are false.
@David Eddyshaw:
OK, I read this comment thread more closely now. Obviously, wrong data is bad; but as Lameen already set out to do, it is much easier to correct data known to be false than add in new data or replace the whole damn thing (WALS was heavily criticized as well; it took 18 years for it to be made partially redundant by Grambank). It’s regrettable that there’s false data, and I wish they’d taken more care in curation, but for NLP purposes I’d fairly confidently say that it’s better than nothing.
How do we use it? One line of work has been to try and condition “real-world” tasks such as speech recognition and syntactic parsing according to languages’ typological properties, when all you have to start with is a “multilingual” model trained on a bunch of language data and some rudimentary knowledge of how to ingest data from the low-resource language in question. Check out papers referencing the lang2vec repository for some examples. I have some other ideas up my sleeve, and will be sure to advertise them to Hat when they’re presentable, as I believe the audience here would be interested.
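The lang2vec-style idea described here can be sketched without the library itself: represent each language as a vector of typological features, then pick the most similar well-resourced language as a transfer source for a low-resource one. The feature vectors and language choices below are invented for illustration; the real lang2vec package draws its vectors from databases like WALS (and now, presumably, could draw on Grambank).

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def best_transfer_language(target_vec, candidates):
    """Name of the candidate (name, vector) pair most similar to the target."""
    return max(candidates, key=lambda nv: cosine(target_vec, nv[1]))[0]

# Hypothetical binary typological vectors (e.g. word order, case, gender ...)
high_resource = [
    ("english", [1, 0, 0, 1]),
    ("turkish", [0, 1, 1, 0]),
    ("spanish", [1, 0, 1, 1]),
]
low_resource_vec = [0, 1, 1, 1]  # our imaginary under-resourced language
print(best_transfer_language(low_resource_vec, high_resource))  # prints "turkish"
```

In a real pipeline the winner’s data (or model) would then be used to initialize or condition the model for the low-resource language.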
[Also, it’s fairly clear that the huge author list is due to the volume of data curated; the contributions statement tells us who was involved in the analyses and writing. I agree that the main value of the paper is in introducing the dataset and making it easily citable, but the analysis to me is interesting and sheds some light on what I can expect from the data.]
Thanks, Yuval. Very interesting.
I find their section on ‘internally-headed relative clauses’ surreal. It lists several languages as having this type, one of which is Korean:
Korean (ISO 639-3: kor, Glottolog: kore1280)
In Korean, there is an internally headed relative clause construction where the relativized noun takes case marking according to its function within the relative clause (here: -ka, NOM), not according to its function within the main clause. The case marker indicating the function of the noun in the main clause comes at the end of the full relative clause (here: -ul, ACC):
Tom-un sakwa-ka cayngpan-wi-ey iss-nun kes-ul mekessta.
Tom-TOP apple-NOM tray-TOP-LOC exist-PLN REL-ACC ate
‘Tom ate an apple, which was on the tray.’ (Chung & Kim 2002: 43)
(Abbreviations: PLN plain speech level)
Korean is coded 1 in Grambank.
The Korean sentence cited is a translation of the Japanese sentence Taro wa ringo ga sara no ue ni atta no o tabeta “Taro ate the apple that was on the plate”. This sentence was used by Kuroda in his 1976 paper on “headless relative clauses” (later “internally-headed relative clauses”) in Japanese, proving that Japanese had this kind of relative clause. This kind of lacuna makes the whole thing look completely laughable. Not that I support Kuroda’s analysis, but to cite the Korean without any reference to the Japanese structure it is modelled on is simply extraordinary. I can set little store by this kind of database.
Thanks @Bathrobe, I’m not sure I’m following.
Are you saying the given Korean is not idiomatic/is a calque of the Japanese?
Then what would be a more idiomatic way to express that meaning?
Does Japanese have ‘internally-headed relative clauses’? (I think you’re saying it does. And does Grambank say it does?) Is the Japanese you give an example?
Kuroda’s 1976 paper seems to be talking about Japanese only. There’s no translation into Korean (why would there be?). Then where/how does the Korean come in?
@D.O.:
Interesting point about whether the selection of questions is itself of a kind that would make Indo-European look relatively diverse. Looking at the questions, they certainly seem to have tried to avoid this, with questions about ideophones and serial verbs, for example. But you may well have a point that, to some degree, this is inevitable.
Looking again at the Dagbani, it’s worse than I thought: Dagbani, for example, does have internally-headed relative clauses, and it doesn’t have serial verb constructions, though there, to be fair, they are just picking up on Olawsky’s own misanalysis, and to be even fairer, whether you call what Dagbani has “serial verbs” is to a great extent a matter of how you define the idea.
They are also hopelessly confused between grammatical agreement gender and morphological noun classes, to the point that the answers given really don’t mean anything: the questions are too vaguely framed to be answerable “yes” or “no.” The approach they take would actually lead to saying that English has “noun classes” because “ox” forms its plural differently from “cow.” The explanation about Dagbani “true” adjectives is factually incorrect; as in Kusaal, there are fossilised remnants of the old agreement system in set phrases, but the system is synchronically quite defunct. Moreover, as in the rest of Oti-Volta, adjectives used attributively form compounds with preceding noun stems: comparing this to “agreement” with separate adjective words is comparing apples and pears: historically, it derives from infixation of a state-verb stem between a noun stem and its class suffix.
None of this is Olawsky’s fault: his work is just not suitable for this particular kind of data mining, nor intended to be. I’ve found it very useful myself. It’s the best thing in print on Dagbani grammar (there was a much fuller account by André Wilson, but like quite a lot of Oti-Volta stuff it’s a mite difficult to get hold of. I’ve only ever seen the copy kept by GILLBT in Tamale, and I manfully resisted the urge to steal it.)
I am saying that Kuroda discovered ‘internally-headed relative clauses’ in Japanese and used that very sentence as a key example.
Korean scholars, based on the work of Kuroda and others in Japanese, and translating Kuroda’s key sentence into Korean, concluded that Korean also had internally-headed relative clauses. Japanese and Korean are typologically similar, which is pretty much why they share this kind of structure.
Giving Korean as a language that has ‘internally-headed relative clauses’, while giving a question mark for Japanese (as it does), makes it look like these people have no idea of the history of linguistic research. One could say they’ve read papers about one particular language in a total vacuum without any idea of the background it is based on. As I said, it looks surreal to include Korean while excluding Japanese. If their knowledge is so skewed and sketchy on this, how can you trust them on anything else?
Taro wa ringo ga sara no ue ni atta no o tabeta “Taro ate the apple that was on the plate”.
In the hippie epoch, I learned that Japanese “ringo” means apple. That explained the Beatles drummer, the apple logo and everything. Now I find that other explanations are available, based on internally-headed relative clauses. Progress !
@Bathrobe, perhaps Mary did Japanese, Sue did Korean, and Jack who coordinated their work indeed knows nothing about the history of linguistic research in Korean….
You said to cite the Korean without any reference to the Japanese structure it is modelled on …
Which gave me the impression Korean borrows the structure from Japanese. Is that a Sprachbund effect, or some distant genealogy?
I think you’re claiming neither (nor disclaiming either), but that Korean linguists modelled their _analysis_ on the analysis of Japanese(?)
Well, if Korean exhibits the structure, then it’s correct for Grambank to mark Korean as such, irrespective of whether other languages exhibit that structure. (Indeed one linguist having described a structure in one language, it might turn out some other language actually exhibits the structure more ubiquitously/’thoroughly’, as it were.)
while giving a question mark for Japanese
You yourself said Not that I support Kuroda’s analysis. So perhaps Grambank is hearing you. I presume you mean Kuroda’s analysis of Japanese(?)
Then perhaps Korean exhibits the structure so much more ‘thoroughly’ than Japanese, Grambank can confidently mark Korean with the feature, whilst leaving it in some doubt wrt Japanese?
What does the history of the _research_ have to do with it? As opposed to the history of the language?
Which gave me the impression Korean borrows the structure from Japanese. Is that a Sprachbund effect, or some distant genealogy? No, as I said, Japanese and Korean are typologically similar, which is pretty much why they share this kind of structure.
Yes, “Korean linguists modelled their _analysis_ on the analysis of Japanese”. Using sentences translated directly from the Japanese.
It’s hard to find a good analogy for this. It’s a bit like if Grambank maintained that English is a language that has “gerunds” based on conventional grammatical analyses, while putting a question mark over whether Latin has them — simply because they’ve never actually read any Latin grammars. (This analogy might not work because English and Latin gerunds are probably not quite the same thing, but the point I’m making is that including Korean while excluding Japanese simply shows they know surprisingly little about the topic.)
This is, of course, one of the problems of linguistic typology. Typology is dependent on the whole skein of linguistic work — works by many different linguists — for its judgements, but is surprisingly ignorant of the role that the history of linguistic research plays in creating those judgements.
Incidentally, it is surprising that most linguists who declare that Japanese has “internally-headed relative clauses” are not actually qualified to demonstrate that IHRC in Japanese are really the same thing as IHRC in North American languages. People just assume they are because, well, the same name is now conventionally used for structures in Japanese/Korean and, say, Navaho and Quechua.
Skirgård’s dissertation was similarly a product of impressively massive encoding of features characterizing Pacific-area languages. One of her central arguments was that “political complexity” played a significant role in keeping Central Pacific languages from splitting into the multitude of smaller communities that characterize most languages in Melanesia and Micronesia.
“Political complexity” is a vaguer term for what was encoded based on categories in Sheehan, Oliver, Joseph Watts, Russell D Gray, & Quentin D Atkinson. 2018. Coevolution of Landesque Capital Intensive Agriculture and Sociopolitical Hierarchy. Proceedings of the National Academy of Sciences 115(14). 3628–3633.
Their 0–5 scale of political complexity seems to me to be too narrow and squishy for statistical purposes. At one end of the scale 0 and 1 were not distinguished, and 4 and 5 are unattested in the sample (except for Hawaiian, which after Western contact became a centralized state, earning a 4), so perhaps 2 and 3 should also have been coded the same and not distinguished statistically.
In my opinion, categories 2 and 3 were not correctly coded for some of the languages in the database. Pohnpeian, which has very complex hierarchies of titles, including equivalents of “paramount chiefs” and “talking chiefs” with a special high-language reserved for addressing high titles, ranks a 2, the same as Chuukese, whose varied island polities are notoriously fractious, and whose dialectal varieties cover a much wider range. Why is Pohnpeian not ranked the same as Samoan? Yapese, which was at the political apex of a “Yapese Empire” that included most of the Chuukic languages, is also ranked the same, 2, as their primary subaltern (Ulithi) and the whole chain of subordinate Chuukic societies. (Whole villages in Yap also have outcaste status.)
Moreover, Fijian and Samoan share the same ranking, 3, but Fiji is far more linguistically diverse than Samoa. Because of their similarity, Central Pacific languages like Fijian seem to exhibit dialect-leveling to an extent not possible in much more multilingually diverse Melanesia. As modern European (or Japanese) history illustrates well, political hierarchy is perhaps the primary factor in reducing linguistic diversity after it has already come into existence, not preventing diversification in the first place.
Again, that seems reminiscent of Greenberg-style mass comparison, with its covert (in fact, perhaps overt) assumption that if you can only collect enough data, even very frequent errors in individual items of data will somehow all cancel out and you will achieve rigour. Nowadays it is de rigueur (so to speak) to take the further step of proving that you have achieved rigour, unlikely as it may seem, by the judicious deployment of statistics.
As I understand from Yuval’s comments, it is like WALS, just larger, at least for some purposes.
And well, “like WALS, just better” is a good thing, and larger is better.
Then if it also (unlike WALS, WOLD, etc.) can grow, then some day it will be useful to DE.
The point, made by both Lameen and Yuval, that the database is potentially open to amendments offered by third parties, is certainly very important.
Ideologically (or methodologically) I’m more bothered by the vagueness of many of the questions. As one specific case, I doubt whether it’s very useful simply to ask whether a language has serial verb constructions without some sort of guidance as to what counts as a serial verb construction, exactly. Quite a lot of the questions suffer from similar imprecision, and that is not fixable just by soliciting better data, good though that is as a policy. It could only be fixed by starting over with new questions.
On the other hand, to some extent that may well be inevitable in any project of this kind, and the choice may in reality be between doing it much like this or not doing it at all.
Just how serious this is when it comes to drawing generalisations about human language would depend, I suppose, on what purpose the generalisations were intended for, as Yuval rightly implies. There may very well be many applications for which that degree of theoretical purity is quite superfluous.
Still, I’m dubious about an enterprise which seems to take as given that all the relevant very difficult theoretical questions (like “what is a serial verb construction?”, “what is an ideophone?”) have effectively been settled, and that there is an essential consensus about them among all competent grammarians already. They haven’t, and there isn’t. So if you’re not careful, you’ll actually end up comparing apples and oranges not once, but systematically.
So if you’re not careful, you’ll actually end up comparing apples and oranges not once, but systematically.
But they are in fact similar to each other in many respects. They are round, citric, seasonal and good for you, to name but these. It seems as if you desire a different kind of systematic comparison, but from what you say I can’t make out what kind of system that might be. Except perhaps that it should raise difficult theoretical questions. It should rise above the simplicities of consumers and tradesmen, making it harder to strike deals (reach consensus).
it is like WALS, just larger, at least for some purposes.
My point about IHRC, which I initially put quite badly (I’ve been plagued by lack of time to hammer out longer, more developed comments) was that Grambank is not in any way an advance on WALS. WALS put Japanese and Korean in the same class — “IHRC not the dominant means of forming relative clauses” — which, if you go back to the beginning, is actually based on Kuroda’s initial work on this topic. This was then taken up by Hinds (the WALS reference for this phenomenon in Japanese) and Korean scholars. In Japanese it has since been extensively but not conclusively covered and is controversial. Grambank hasn’t even reached anything like this level of coverage of Japanese and Korean linguistics. So what credence should we give to it?
@Stu:
Sorry: a poor analogy on my part. I should have said “apples and anagrams.”
(Though your point is good: there are indeed many contexts in which counting apples and oranges together is perfectly sensible. It all depends on why you are counting them, and what you plan on using the total for.)
@DE: foiled again ! I can think of no pert way of twisting apples and anagrams to serve any purpose.
@David Eddyshaw:
I appreciate your comments, and hope they will lead me to use this resource in a more responsible way. So thanks for the discussion!
@Bathrobe:
If your claim that Grambank is “not in any way an advance on WALS” is based on the singular evidence of its treatment of two languages, you are completely missing the point, which is that Grambank offers a great improvement in terms of “lateral” coverage – many more languages have many more features (there’s a figure in the appendix showing this), albeit they focus on grammatical features, whereas WALS has phonetics and more.
Additionally, if you’ve been using WALS to investigate individual languages, and ones that are well-studied at that, I find that to have been an odd use case; there are, and have always been, better places to do that.
I don’t use WALS to investigate individual languages; I use it to get a big picture view, as well as a view on how it assigns particular languages. I’m not very impressed with it. It seems to me to depend completely on “what is reported”, without any editor able to take a broader, more critical view. Of course, that is the problem with such enterprises; there is no one really qualified to give broader judgements.
I don’t have any problem with grammatical features — I’m personally not terribly interested in phonetics, etc. What I do find unacceptable is the fact that Grambank drops the ball on one particular feature I’ve been following for a while. That’s what I’m judging it on. I’m sceptical of WALS on its treatment of the IHRC, but I’m simply astounded at Grambank’s selective superficiality.
Pushing my analogy between Grambank and Greenbergian Mass Comparison:
Used sensibly, there’s nothing really wrong with Mass Comparison in itself: the problems arise from misusing the technique, imagining you can use it to draw firm conclusions, instead of using it as a source of interesting potential relationships that look worth investigating rigorously. For example, I would think that Mass Comparison would quite rightly suggest that there was an Oti-Volta group of genetically related languages, and that it was worth looking at the relevant languages in detail to see if the idea really panned out; and it would also quite rightly suggest that time spent investigating a relationship between Kusaal and Estonian (say) would probably be better spent on filling in grant applications.
In the same way, if you are interested in serial verb constructions, it would surely be worth checking out the languages that Grambank says have such things as a first step, and then following up the primary sources. That would be entirely sensible and a good reason to be very grateful to all the people who have put so much work into this.
What would not be legitimate is to stop the investigation at the level of Grambank itself, for example by doing some nice statistics to demonstrate that e.g. “languages with serial verb constructions almost all have SVO word order.” That is methodologically parallel to making claims about language families without bothering with all that tedious stuff about regular correspondences and loanwords and so on.
Does the grammaticality of Whomever he saw was clearly Chinese mean that English has IHRCs? (This construction is called a fused relative clause in English-internal grammars like CGEL.) Of course the distinction is irrelevant in the majority variety of English that lacks whomever.
Fiji is far more linguistically diverse than Samoa
As a nation, yes. But Samoan is only one language, whereas there are seven indigenous languages of Fiji: the West Fijian linkage[*], composed of Western Fijian and Namosi-Naitasiri-Serua; the East Fijian linkage, composed of Eastern or Standard Fijian, Lomaiviti, Lauan, and Gone Dau; and Rotuman[**]. The West Fijian linkage is most closely related to Rotuman and the East Fijian linkage to the Polynesian family, but there has been substantial convergence between East and West Fijian.
[*] In the sense of the descendants of a dialect continuum where components have been lost in such a way that they form a network rather than a hierarchy.
[**] Assuming you call Rotuman “indigenous”: on the one hand Rotuma is geographically and culturally remote from Fiji proper despite being politically unified with it; on the other hand internal migration means there are now more Rotuman-speakers in Fiji proper than in Rotuma.
time spent investigating a relationship between Kusaal and Estonian (say) would probably be better spent on filling in grant applications
Your Chomskyan would agree, but the applications would request a grant to investigate just how Kusaal and Estonian are both instantiations of UG(H).
Does the grammaticality of Whomever he saw was clearly Chinese mean that English has IHRCs?
Presumably this depends on how you choose to define “internally-headed”, and the fact that the term does not actually have a universally agreed definition is a good example of what I’m moaning about with Grambank.
Kusaal (inevitably) provides a clear example of how the answer to the question can’t be read off mechanically from things like word order, and how the analysis depends critically on other analytic decisions you’ve made.
The language certainly does have relative clauses that pretty much anyone would agree are internally-headed, e.g.
Fʋn bɔɔd ye fʋ kʋ dau sɔ’ la ya’a kpi
You.NOMINALISER want that you kill man certain the if die
“If the man you are trying to kill dies”
However, there is also another construction, as in
pu’a kanɛ biigi vʋe la
woman that.NOMINALISER child.NOMINALISER be.alive the
“the woman whose child was alive”
Now this looks like kanɛ is some sort of relative pronoun introducing a relative clause qualifying pu’a “woman.” However, in the example, pu’a is not the singular form of the noun at all: the singular has mid tone, and here the form has low tone. That is because it is in fact the bound combining form of the noun, not the singular; the orthographic convention is that combining forms are written as separate words if and only if they happen to be segmentally identical with the singular (in other words, the standard orthography routinely misrepresents the real structure). On the other hand, nid “person” has the combining form nin, so you write
niŋkanɛ biigi vʋe la
person.that.NOMINALISER child.NOMINALISER be.alive the
“the person whose child was alive”
and all of a sudden it looks like we’re dealing with an internally-headed relative clause.
Right?
Not necessarily. It depends on how you define “word.” (Obviously the orthography won’t help us there.)
Sure, nin is a bound form that can only appear as the first part of a “compound.” But who says that a bound form cannot be a “word”? The way Oti-Volta construes adjectives by compounding them with preceding noun stems is cross-linguistically very unusual, to say the least, and it would be theoretically much less awkward if such noun stems could just be regarded as “words” rather than word fragments. Anyway, can’t we just call kanɛ biigi vʋe la an “adjectival clause” and say that the whole thing is behaving as an adjective, just like giŋ “short” in niŋgiŋ “short person”?
Kusaal “compounds” can contain free words, after all:
anzurifa nɛ salima la’amaan
silver with gold goods.maker
“worker in silver and gold”
where la’a is the combining form of lauk “item of goods.”
So if Kusaal lacked the first type of relative clause (which actually is relatively uncommon), whether you described it as having internally-headed relative clauses would depend crucially on how you decide to answer “what is a word?” in Kusaal: a question which in my view does not even have a single uniquely “correct” answer anyway.
I recently had another look at Grambank.
The issue with Internally-headed relative clauses in Japanese (lacking) and Korean (present) is easily explained.
Grambank essentially bases its analysis of each language on a few sources. None of their Japanese sources mention Internally-headed relative clauses. One of their Korean sources did. Simple as that.
JAPANESE: Hinds 1986 Japanese, Kaiser et al. 2013 Japanese: A Comprehensive Grammar, Martin 1988 A reference grammar of Japanese.
KOREAN: Lee and Ramsey 2000 The Korean Language, Sohn 1994 Korean, Sohn 1999 The Korean language.
Internally-headed relative clauses in Japanese were first described by Kuroda in 1976. Korean scholars followed this with work on the phenomenon in Korean, often using sentences translated directly from Kuroda’s Japanese examples. Internally-headed relative clauses didn’t make it into any of Grambank’s Japanese sources, but they did figure in one of their Korean sources (Sohn 1999).
They do appear to have a mechanism to modify or fix features assigned to different languages, but it hasn’t worked in this case.
A flimsy basis on which to compare features across languages.
That’s really kind of depressing.
What is even more curious is that Stefan Kaiser, who co-wrote the book Japanese: A Comprehensive Grammar, actually published a book on Circumnominal relative clauses in classical Japanese: an historical study in 1991. “Circumnominal relative clause” is Lehmann’s term for Internally-headed relative clauses.
Perhaps Japanese: A Comprehensive Grammar used “Circumnominal relative clause” instead of “Internally-headed relative clause”, and the persons doing the in-depth analysis missed it? (I don’t have the book in order to check it.)
(It gets worse. WALS mentions that Japanese has Internally-headed relative clauses as a “nondominant type”. It bases this on two sources: Shibatani 1990 and Hinds 1986. If Grambank was using Hinds 1986 as a source, how did it miss it?)
At any rate, this seems to me a good illustration of the pitfalls of the Grambank approach.
Actually, the difference in approach was succinctly described as follows by All Things Linguistic:
In WALS a single author would collate information on a sample of languages for a feature they were interested in, while in Grambank a single coder would add information on all 195 features for a single grammar they were entering data for.
THAT’s the problem. A single author collating information on a sample of languages for a feature they were interested in would not have missed Internally-headed relative clauses in Japanese. A single coder working on one language would be highly likely to miss it if they did not have a wider perspective.