Via Ionuț Zamfir’s Facebook post, I present “A comparative wordlist for investigating distant relations among languages in Lowland South America,” by Frederic Blum, Carlos Barrientos, Roberto Zariquiey, and Johann-Mattis List (Scientific Data 11:92 [2024], open access):
Abstract
The history of the language families in Lowland South America remains an understudied area of historical linguistics. Panoan and Tacanan, two language families from this area, have frequently been proposed to descend from the same ancestor. Despite ample evidence in favor of this hypothesis, not all scholars accept it as proven beyond doubt. We compiled a new lexical questionnaire with 501 basic concepts to investigate the genetic relation between Panoan and Tacanan languages. The dataset includes data from twelve Panoan, five Tacanan, and four other languages which have previously been suggested to be related to Pano-Tacanan. Through the transparent annotation of grammatical morphemes and partial cognates, our dataset provides the basis for testing language relationships both qualitatively and quantitatively. The data is not only relevant for the investigation of the ancestry of Panoan and Tacanan languages. Reflecting the state of the art in computer-assisted approaches for historical language comparison, it can serve as a role model for linguistic studies in other areas of the world.
Background & Summary
Much of the human history in South America is unknown, and linguistics can be one of many tools to investigate the human past. Yet, the linguistic history in South America is poorly understood, and despite the comparably recent human settlement, many genetic relationships between language families remain hypotheses without too much evidence. […]
One such case is the hypothesized Pano-Tacanan language family. Panoan and Tacanan are two language families currently spoken in Lowland South America, which have long been hypothesized to be genetically related. Both language families have also been claimed to be related to other languages in the area, such as Mosetén, Chipaya, and Movima. Even though there is a considerable amount of evidence in favor of the ‘Pano-Tacanan hypothesis’, no fully accepted large-scale reconstruction has yet been carried out. The Panoan language family was first proposed by de la Grasserie in 1889. A preliminary reconstruction of the common ancestor was carried out by Shell, which, however, lacked data from the Northern branch of the family and of Kaxararí. Recently, a new reconstruction has been proposed by Oliveira, which still needs further revisions. The Tacanan languages on the other hand were proposed by Brinton in 1891 and reconstructed by Key and later Girard. Based on this reconstruction and the ‘Reconstructed Pano’ from Shell, Girard also proposed a reconstruction for the ancestral language, Proto-Pano-Tacanan. Given the problems of the sampled languages for Shell’s Panoan reconstruction, however, this reconstruction is not generally accepted as a proof for the Pano-Tacanan family, and some doubts remain. More recently, Valenzuela & Zariquiey provide a new reconstruction of Proto-Pano-Tacanan, but this work is limited with respect to the amount of lexical coverage. It does, however, provide a first detailed account of grammatical morphemes that appear to be cognates between the Panoan and the Tacanan language family. Cognates are lexical roots and morphemes from two genetically related languages that descend from the same ancestral form etymologically.
This dataset aims to present lexical data that can be used as a new starting point for investigating the past of Panoan, Tacanan, and other languages.
I know nothing about the languages involved, but these people sound like they know what they’re doing, and I thought the paper and its wordlist might be of some interest to the Hattery at large.
The words “computer-assisted” caused me immediate anxiety in the light of all the “computer-assisted comparative linguistics” nonsense out there, but this is something quite different: lots of lovely data in a format which is conveniently accessible for comparison. What’s not to like?
(I have decided to refer henceforward to my own study of comparative Oti-Volta as “computer-assisted.” I mean, I really did use a computer to make it … spreadsheets and everything. Advanced!)
FWIW, the last author is a published comparative linguist; he used to have an interesting blog.
Through the transparent annotation of grammatical morphemes and partial cognates, our dataset provides the basis for testing language relationships both qualitatively and quantitatively.
What is the meaning of “transparent annotation”? That you don’t see the annotations? Or that you do, and they are easy to understand without further explanation?
“Transparent” is one of those words that are used in a metaphorical way to suggest something or other, although it’s not clear what they mean or what the suggestion is. Other examples are “vibrant” and “fuck”. “Vibrant” always makes me think of a motor-driven sex toy, which usually doesn’t fit the context in which I encounter the word.
“the transparent annotation of grammatical morphemes and partial cognates”
Yes! So many false cognates become patently absurd the moment you break them down morphologically (and conversely, plenty of good cognates can be buried beneath derivational build-up).
I have decided to refer henceforward to my own study of comparative Oti-Volta as “computer-assisted.”
Like Lucky Strike: “It’s Toasted.”
LSMFT, a non-computer assisted memory from the days when physicians flogged cigarettes.
Wouldn’t merely comparing word lists be a good way to ‘show’ Basque is related to Spanish (or even Welsh to Latin, or Kusaal to everything, of course) — because borrowings, ‘init.
a first detailed account of grammatical morphemes that appear to be cognates between the Panoan and the Tacanan language family.
sounds more reliable. Do these language families have comparable large-scale syntax? (SVO order, gender marked on nouns and noun-qualifiers, case, agreement between what, …)
I don’t think syntax is a particularly good guide to genetic affinity, certainly when it comes to things like constituent order. (Johanna Nichols seems to disagree, but she picks her examples very carefully.)
For example, VSO Welsh is undoubtedly related to SVO English and SOV Persian.
SVO Kusaal is related to SVO Swahili, but is much more closely related to SOV Baatonum and Miyobe.
Similar things are easy to find with the relative ordering of elements within noun phrases, whether a language uses prepositions or postpositions … you name it, really.
“Niger-Congo” languages are justly famous for elaborate “gender” systems, and most Oti-Volta languages fit the pattern admirably; but most Western Oti-Volta languages lack grammatical gender completely.
Kusaal (no grammatical gender) is, however, much more closely related to Mbelime (eleven genders) than to Yoruba (no grammatical gender.)
That’s not to say that morphosyntax isn’t highly valuable in proving genetic relatedness, of course. Indeed, much of the best evidence for “Niger-Congo” comes from class affixes and verb derivational suffixes that are clearly related in both form and function (which is where syntax certainly does come in.) A lot of work hitherto on wider relationships has, however, unfortunately treated lookalikes among affixes as good enough to “prove” relationship, without the vital work of showing that the affixes involved can be related through regular sound changes; and verb derivational suffixes often have meanings in individual languages which are difficult to pin down precisely, which has all too often led to a lot of semantic latitude in comparisons.
But quite apart from that, as Nelson Goering says, a huge number of lookalikes can be immediately outed as not truly cognate by looking at the internal word structure in the language they come from. Greenberg’s Mass Comparison lists are notorious for misanalysing flexional and derivational affixes as parts of roots and vice versa.
A new paper demonstrates, horribly, why syntax is indeed a poor guide to genetic (or other historical) affinity. Fig. 1 (p. 13, based on the properties summarized on p. 7) says it all.
To illustrate the effects of misanalysis of compared forms (in a case which actually results in a false negative rather than false positive):
I just happened recently to be looking at the underlying data for a comparison of Agolle and Toende Kusaal basic vocabulary which came up with a figure of only 84% cognates.
This always struck me as surprisingly low, and when I looked at the actual data, they are just a tad unreliable.
My favourite (but sadly far from isolated) example is “être debout,”
The Agolle Kusaal (actually zi’e) is given as paʔzijɛneagol, which is clearly pa’a zi’eni agɔl “was standing up earlier today”, where the ni is an enclitic particle marking “discontinuous past”, pa’a is the “earlier today” tense marker, and agol just means “up(ward.)”
The Toende form is given as zɛʔɛmɛ, where the mɛ is a particle meaning “at the time under consideration”; this verb phrase does actually mean “is standing”, equivalent to Agolle zi’e nɛ.
In fact, “be standing” is zi’e in Agolle and zɛ’ɛ in Toende, and the vowel correspondence is absolutely regular. The difference (such as it is) is wholly phonological.
(With the selfsame keywords and with modern lexical materials, I get a 96% match in basic vocabulary between the Kusaal dialects.)
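The arithmetic behind such percentages is simple enough to sketch; the concepts and cognacy judgements below are invented stand-ins for the real survey data, meant only to show how a handful of misanalyses moves the figure:

```python
# Hypothetical cognacy judgements across the two Kusaal dialects;
# True = the forms for this concept are judged cognate.

def cognate_percentage(judgements):
    """Percentage of concepts judged cognate."""
    return 100.0 * sum(judgements.values()) / len(judgements)

# Misanalysed list: 'be standing' wrongly scored non-cognate because
# whole inflected phrases were compared instead of the bare roots.
misanalysed = {"be standing": False, "eat": True, "dog": True, "water": True}

# Corrected list: the roots zi'e / ze'e correspond regularly.
corrected = dict(misanalysed, **{"be standing": True})

print(cognate_percentage(misanalysed))  # 75.0
print(cognate_percentage(corrected))   # 100.0
```

On a 100-item list each such misjudgement moves the score by about one percentage point, so a dozen of them are enough to turn 96% cognacy into 84%.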
Y, interesting.
Some connections can be arealisms, and it is interesting how Moroccan Arabic forms one subgroup with Xhosa and they both form a family with Hausa. I mean, it is not the TOTALLY different regions of the world, eh?
Similarly Slavo-Finnic is a neighbour of both Japanese-Korean and Chinese (and was steppe adjacent earlier). Bangla, Nepali and Quichua together with Japanese-Korean are unexpected.
Clearly we need better awareness of its technical linguistic sense: it’s the cover term for “trills + flaps”, whenever you feel “rhotic” is either a bit too broad in including various approximants and fricatives, or perhaps not sufficiently inclusive of things like bilabial trills.
https://www.smbc-comics.com/comic/phonemes
@Y, also an explanation of Ethio-Caucasian and Italo-Portuguese can be possible.
Like “other Semitic languages typologically represent a (derived/affected by contact) intrusion in a former Semito-Caucasian area”.
And I don’t know, someone who knows European history better may invent why the Italians and Portuguese are close:)
The fact seems to be that, in reality, “fundamental” syntactic features are very easily borrowed, quite labile over time, and often strongly correlated statistically with one another in far from obvious ways for reasons which are not well understood at all (Mark Baker’s attempt to “explain” it all with “parameters” never worked as advertised.)
All this is very inconvenient for people who feel that orthodox comparative methods are insufficiently magical, and many very clever people have abused their talents in trying to deny these realities.
Syntax is very interesting from a typological standpoint, though getting beyond the stamp-collecting stage and trying to explain the phenomena is very hard and there have been a lot of false starts. (Without stamp-collecting, though, there can be no science.)
The role of typology in historical comparative linguistics is as a guard against wrongly attributing real similarities to common inheritance.
That’s what the concept of “basic vocabulary” is for. Modern English may have more words from French than from Old English, but among the generally most stable words the loans from French and Norse number in the single digits. Everything can be borrowed, but the probabilities do differ appreciably.
Make an affricate out of it, and you get a sound you’ll have trouble finding outside the Yukon Territory.
@Y, by the way, it is not clear why do you say “to genetic (or other historical)”. Why this parenthetical note?
I think syntax is in part “genetical” in part “areal” in part it just “changes with time for whatever reason”.
Now, excuse(-)moi, but historical linguistics of forms does not assume that a Russian word is identical to a related French word identical to the reconstruction – or even that they are recognisably similar. No, they track and reconstruct changes. Same must be true for syntax and this third part is not an obstacle (unless we decide that tracking changes is impossible).
So it must be just as “historical” as any other historical linguistics.
There is a fourth component: unrelated languages can be similar due to simple coincidence. But is not this just a matter of weighing and number of variables? Someone just tuned her measuring instrument too sensitive and is getting some noise. Same true for a DSLR at high ISOs.
@DE, same response to your objections. Note that arealisms are historical.
Stu:
What is the meaning of “transparent annotation”? That you don’t see the annotations? Or that you do, and they are easy to understand without further explanation?
Those finding themselves on too easy a trajectory through intellectual life, devoid of mystery and salutary obstacles, are hereby advised to research opacity and transparency in Husserl, Sartre, and their kind. I gave them up many years ago, but noted such terms along the way as especially recalcitrant. Sometimes what was visible or “conscionable” had to be opaque to be visible, other times the opaque was whatever could not be seen or cognitively scanned. No one ever seemed to notice this multivalency, let alone explain it. So glad I gave up on all that!
If the bones were not opaque they’d be invisible to X‑rays. Opaque materials are introduced to make the intestines (or arterioles, etc.) visible in radiography. Astronomy would be a dismal science indeed, without the opacity of objects along with the luminosity of objects. Composites of any sort elude exploration if all their parts are transparent or all their parts are opaque, to the available tools.
“Transparent annotations” of course glides over those inconvenient facts. They are annotations free of obscurity, whose sense is readily apprehended. But I appreciate the finickiness of your objection.
The Wikipedia article Color has declined tragically over the years since I had anything to do with it. Look at this, and wonder along with me what hope there is:
As with punctuation so with thinking about colour: everyone’s an expert.
why the Italians and Portuguese are close
Some Italians. Others are close to K’iche and then Nahuatl, others to French. K’iche is not close to Q’anjob’al, though. The latter goes with Tagalog.
Just like some Greeks are close to Romanians, others to Kusunda (isolate of Nepal).
Why this parenthetical note?
The paper concludes (rightly) that their method does not detect genetic subgroupings well. I note that it likewise doesn’t detect contact-related similarities well.
Their method is too crude, and we need… better methods for history of syntax.
Note that arealisms are historical.
Sure. That’s why I specifically contrasted them with genetic features, i.e. those where the resemblances are due to shared inheritance from a common protolanguage.
Areal phenomena are very interesting in themselves and well worthy of study in their own right.
But in the specific context of reconstructing protolanguages, such phenomena are not building materials but confounding factors to be excluded. If you’re going to do that right, you will, of course, need to have investigated the areal phenomena properly too, as part of your reconstructive work. You need to be able to identify the weeds correctly if your weeding is going to be effective.
I agree that typological features are very often areal. Examples abound. But that is simply another aspect of their being inappropriate considerations when it is a matter of reconstructing protolanguages and demonstrating genetic relationships between languages. Insofar as they are typological, they have no place at all in proving genetic relationships. This is pretty much a matter of definition.
Historical reconstruction cannot possibly explain everything of significance about any given language, or even anything approaching it. But that is not the issue here: we’re just talking about the right methodology for reconstructing protolanguages, i.e. about genetic relationships between languages.
I certainly agree with your implication that reconstructive work on protolanguages is not, by itself, a good guide to the prehistory of the supposed speakers of those protolanguages. (In fact, areal studies are quite likely to be more helpful – though then you are talking about the obverse phenomenon: you’ll need to have good evidence about the inherited features of the language if you’re going to try to identify which areal features might be attributable to contact.)
I suspect that the curiously persistent delusions of many scholars on this point owe a lot to nineteenth century racial and ethnonationalist concepts: the intellectual shadow of such pseudoscience is long, and still affects the thinking even of scholars who vigorously repudiate its racial basis. (You see it, for example, in the assumption that West African languages which do not share the flexional exuberance of Bantu verbs must have lost a Bantu-like system, or the idea that “Kwa” languages must derive from creoles.)
demonstrates, horribly, why syntax is indeed a poor guide to genetic (or other historical) affinity
Thanks @ Y, but hmm. Perhaps it just demonstrates @DE’s point about the “computer-assisted comparative linguistics” nonsense out there. When I hear “methods of Bayesian inference” I reach for my revolver. (If it’s Bayesian it is ipso facto not inference IMO.)
The (indeed) horrible Figure 1 is qualified:
I would have thought a researcher at that point would say: we’re producing nonsense, let’s question the methodology.
I would also expect a more robust methodology would deliberately exclude (more than) half the data when first training the model; if the model is validated, then run across the whole dataset.
how I understand it:
(a) inheritance (b) borrowing (c) change – history
(d) convergence – typology.
It does not matter if we are speaking of syntax or phonology or semantics or what, those are just aspects of language.
I oppose equating “areal” with “typological” or contrasting “syntax” to “historical”. I agree that we are not as good at modelling and tracking history of syntax (as we are at reconstructing morphophonology). But this is weakness, not strength of the traditional comparative methods.
I would have thought a researcher at that point would say: we’re producing nonsense, let’s question the methodology.
To a large extent this paper seems to be saying something to the effect of “we’re clearly producing nonsense, but there might be interesting patterns in the nonsense that could be useful for guiding future researchers trying to work with this data”.
(With a side of “we’re clearly producing nonsense, this is a strong sign that the methodology we’re currently testing is not a good fit for this use case”.)
Apart from the problem of syntactic data being likely arealisms, I believe the excessively long and rakey tree structure means most variables measured there are likely low-variance (i.e. they are recording things best analogical to something like “is the language’s word for ‘two’ /tu/”), and the fragment of Standard Average European languages grouping together suggests that so far the choice of what variables to record at all has been rather Eurocentric; to tie back to the OP, we really don’t have a “syntactic Swadesh list” yet and are still stuck in the analogue of the 17th–18th century field linguistics habit of just making up some wordlist on the fly.
though I am not that impressed by the Swadesh list itself either, it performs decently but has too much of a one-tool-suits-all attitude. Having the Leipzig-Jakarta list beside it is one good start & maybe longer lists like this one from Blum et al. will also contribute to figuring out what works where for what. Recently I even found myself wondering if a standard wordlist would be possible that would demonstrate a language’s different loanword strata as well or as easily as possible?
@J Pystynen, to quote the paper:
“Each property can take one of three values: yes, no, or not applicable (NA).”
wondering if a standard wordlist would be possible that would demonstrate a language’s different loanword strata as well or as easily as possible?
Words for foodstuffs and technology that are known to be spread around the world at specific time depths? Tea, Coffee, horse equipment, sailing technology, the influx from S.America, … The word having entered the lexicon at that time, then subject to regular sound changes. (Or does every language pronounce ‘coffee’ recognisably?)
@drasvi:
There’s actually been a good bit of work done on reconstructing the syntax of protolanguages.
The difficulties are much greater than with reconstructing phonology and morphology, though. I think there are several reasons for this.
The easy case is where every single daughter language does things in the same way: it seems reasonable to reconstruct that to the protolanguage.
But even that can be a problem. For example, all Oti-Volta languages are SVO. However, all five Eastern Oti-Volta languages put pronoun objects before the verb, and Miyobe, which is probably the closest relative to proto-Oti-Volta, is SOV.
But Eastern Oti-Volta and Miyobe are also part of a phonological Sprachbund, and border on SOV Baatonum and Dendi. However, there is no evidence of Baatonum or Dendi influence elsewhere in these languages. So, what to make of all this?
All the Romance languages are SVO but Latin was SOV. With Latin, we actually do have independent evidence of a change to SVO prior to the breakup into the modern languages, but in the usual case there would be no such evidence and arguing that such a change “must” have happened would just be circular.
This interacts with another problem: there are only three common orders of phrase constituents in Africa, SVO, SOV and VSO, and VSO basically never happens in Volta-Congo, so there are only two actual possibilities; moreover, change of SOV to SVO is cross-linguistically common. So you are in the position of someone trying to reconstruct a protolanguage rigorously with only two consonants, one of which is known to change spontaneously into the other. The very nature of the data means that, no matter how sophisticated your methodology, you will never be able to draw firm conclusions.
You might try to get round this by comparing more complex bits of syntax, but then you promptly run into the problem that such features are not very stable in time. The Western Oti-Volta languages are at least as close to each other as the Romance languages, and all have well-developed relative clauses, but it is impossible to reconstruct the proto-WOV system. There are differences in relative clause structure even between the two mutually-comprehensible Kusaal dialects. Again, it’s the data that are intractable, and no more sophisticated theorising will make that go away.
Comparative linguistics works by reconstructing protolanguages, and it is never going to be possible to reconstruct anything like a whole historical ancestral language. Syntax belongs to a very great extent to the parts we will never be able to reconstruct with any certainty.
Another example from Romance: all Romance languages have a definite article, but we know that Latin hadn’t. One could reconstruct a definite article for Proto-Romance / Vulgar Latin, but the fact that some Romance varieties use a different pronoun than ille for the article shows that this is not an inherited element, but a parallel development, and the fact that many neighboring languages have a definite article, too, makes it likely that this is an areal development.
Western Oti-Volta is the same, but more so: all the languages except Nootre (away in Benin) have a definite article, but the articles are of four different origins.
The loss of grammatical gender in Western Oti-Volta is also clearly areal: it’s seen also in the neighbouring non-Western languages Konni and (mostly) Moba, it’s not seen at all in the geographically-separated Nootre, which preserves the whole grammatical gender system, as does the main Farefare dialect, and the many morphological traces of agreement in the languages which have lost it show that the system must have been simplified differently in the various languages before it was lost completely.
Yes; characters have states in phylogenetics*, and they are what the datasets are built of.
* A caricature of an example would be: “Flexional exuberance as seen in Bantu verbs: present (0); absent (1).” 0 and 1 would be the two states of this character.
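For concreteness, a character matrix of that caricatured kind might be coded like this (languages and state assignments invented for illustration):

```python
# Toy phylogenetic character matrix: one row per language, one cell per
# character, each cell a state (0 = present, 1 = absent, following the
# caricature above; None = missing data, the '?' of phylogenetics tools).
characters = [
    "Bantu-style flexional exuberance",
    "grammatical gender",
    "definite article",
]
matrix = {
    "Language A": [0, 0, 1],
    "Language B": [1, 1, 1],
    "Language C": [1, None, 0],  # gender state unrecorded for C
}

# Serialise rows the way such datasets are usually shipped.
for lang, states in matrix.items():
    row = "".join("?" if s is None else str(s) for s in states)
    print(f"{lang:12} {row}")
```

Real datasets (e.g. in NEXUS format) are just much larger versions of this table.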
Why? Is it not inference if computers do it?
…However, Bayesian phylogenetics happens to have a huge problem with missing data: it downweights characters that have missing data for any of the languages/species/whatever. In biology, this can lead to strong support for completely spurious clades. In this example, I suppose it should be fine if the characters with missing data are the least reliable characters…
I expect “we’re producing nonsense where the assumptions of our method aren’t met; they aren’t met in this case because…” – but I haven’t read the paper.
This is merely a reminder that Proto-Romance wasn’t Classical Latin but Rather Late Vulgar Latin; the most parsimonious assumption in the absence of historical records is of course that Rather Late Vulgar Latin already had SVO (if only as a weak default).
To my mathematical intuition, Bayesian inference is wrong for exactly the same reason that absence of evidence is not evidence of absence. Your series of events just happens to support only one of the hypotheses you can think of, so you assign a high prior probability to that hypothesis. It works in practice, but it’s not a source of truth. I don’t make guesses in Sudoku either. (I call them assumptions and try to find a contradiction). The only reason computers come into the picture is because they can do it quicker.
As we were told in Stats 101: If you think likelihood is the same as probability, better get out of science now.
(Now if you’ve proved there are only N possible configurations, and the probability of getting the observed results is zero for N-1 of them, it looks like Bayesian inference but it’s also proper under classical inference).
Well, there are only N mathematically possible trees for any number of taxa… and the posterior probabilities Bayesian phylogenetics spits out aren’t necessarily 0.00 or 1.00.
The latter part is exactly why it gives us the creepy-crawlies. Hiss. We don’t likessss it, my precioussss.
It’s a perpetual-motion machine for logic. Embrace the awesome power!
“If you think likelihood is the same as probability, better get out of science now.”
Sounds silly.
Elsewhere in English, likelihood and probability are exact synonyms, but as technical terms in the Bayes Equation they refer to two different variables.
@DM, I don’t think it is a technical term.
Rather some English speakers tried to take advantage of having two “exact synonyms” and began to apply L. to the chronologically preceding [thing] and P. to chronologically following [thing].
Cf. Wolfram:
“Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes.”
Wolfram is wrong. There is nothing about future or past in the concept of probability or likelihood in their technical senses.
Sounds silly.
Is silly. Those people don’t own the terms, and many of them are utterly inept at explaining their idiosyncratic high-priestly arrogation of them. Take a look at the Wikipedia article Likelihood function. Is that the most lucid and most useful account that can be given? Sheesh, it’s the product of years of wrangling on the talkpage for the article.
“utterly inept at explaining their idiosyncratic high-priestly arrogation of them”
Well, yes. That’s what I meant.
I think in the narrow sense it is this: you have a hypothesis (or a parameter in a hypothesis) and an outcome.
You say that the probability of [obtaining] this outcome given this hypothesis is [a number].
You also say that the likelihood of [???] this hypothesis given this outcome is [very same number].
To confuse everyone, you denote [the number] as P in one case and L in the other even though the proposal is exactly to apply “the probability of the outcome” and “the likelihood of the hypothesis” to the same number.
Further confusion because we actually compare hypotheses and I tend to think of the value described above as “probability of the outcome” rather than “[any synonym to attractiveness] of the hypothesis”.
E.g. in Bayesian reasoning – and for one thing, Bayesian reasoning is what I arrive at when describing my actual thought process – you use this value (conditional probability or likelihood) to modify your original idea of attractiveness.
You don’t call it “[any synonym of attractiveness] of the hypothesis”. You call it P(outcome|hypothesis) and you call the bayesian measure of attractiveness P(hypothesis|outcome).
Which is not the same as L(hypothesis|outcome) (which is same as P(outcome|hypothesis)).
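A numeric toy example (an invented coin, nothing from the thread) makes the point concrete: the same number is read as P(outcome | hypothesis) with the hypothesis fixed, and as L(hypothesis | outcome) with the outcome fixed:

```python
from math import comb

def binom(k, n, p):
    """P(k heads in n tosses | coin bias p) -- the binomial probability."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

k, n = 7, 10  # the fixed outcome: 7 heads in 10 tosses

# Read as a function of the bias with the outcome held fixed, these same
# numbers form the likelihood function L(bias | 7 heads in 10 tosses).
for bias in (0.3, 0.5, 0.7):
    print(bias, round(binom(k, n, bias), 4))
```

Note that the values across the different biases do not sum to 1: a likelihood function is not a probability distribution over hypotheses, which is precisely why the two words are kept apart.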
Well, yes. That’s what I meant.
Yes, as I thought.
It was quite lucidly explained to us in Phys 101 what a likelihood function is, what a maximum likelihood estimator is, and why it’s not a source of truth unless you already know the a priori probabilities of your hypotheses. My point being that the separation of likelihood and probability as technical terms is not the Bayesians arrogating them, it’s useful in general reasoning about experimental setups. On that background it’s very clear what the Bayesians are trying to do — they are begging the question but with more mathematics.
Matthew 7:26
So here’s a worked example. You have two hypotheses:
1. The sun is up.
2. The Mothership from Close Encounters of the Third Kind is over your house. (And by plot logic, the sun is not up).
Your observed outcome: It’s light outside. The likelihood under each hypothesis is 1. How many times do you have to go outside and not see a Mothership before your Bayesian reasoning converges on a probability of 1 for daylight? (For some people, it clearly never converges. “But, but, maybe it will come tomorrow night!”)
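The updating itself takes a few lines to sketch. One invented assumption is needed to keep the game going: suppose a present Mothership somehow escapes notice half the time, since otherwise a single shipless night refutes it outright:

```python
# Nightly Bayes update on the observation "went outside, saw no ship".
# P(no ship seen | Mothership there) = 0.5 is a made-up hiding rate;
# P(no ship seen | just the sun) = 1.0.

def update(prior_ship, p_miss=0.5):
    num = p_miss * prior_ship
    den = num + 1.0 * (1.0 - prior_ship)
    return num / den

p = 0.5  # a generous prior for the Mothership
for night in range(1, 6):
    p = update(p)
    print(night, round(p, 4))
```

The posterior shrinks every night but never reaches zero; and the believer who resets the prior every morning never converges at all.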
“The likelihood under each hypothesis is 1.”
The likelihood of WHAT? Of each hypothesis or of the outcome?
(well, of course you mean the former)
It’s asymptotic.
Lars, you seem to have some ontological objection to Bayesian statistics, but I cannot get what it is.
I suspected as much. Whereas my preferred approach refuses to assign probabilities at all. There might be a non-zero probability that the Mothership (exists and) has decided to play DEE-DEE-DEE-DAH-DAH at me all night, and nothing I can observe will prove there isn’t!
Actually much of what passes for scientific results is formulated as likelihoods. If the Higgs boson doesn’t exist, there’s less than, say, a 0.0001% chance (one in a million, cf T.P. passim) that we’d have observed what we have — and that’s good enough to build further research on, but it doesn’t give us the right to say there’s a 99.9999% chance that it exists. We call it confidence instead. And given enough results published at 99.5% “confidence,” some of them are probably wrong. (In a colloquial sense of probably, I’m not doing the math now.) And stuff like least-squares regression or bell curve fitting is straight up maximum likelihood estimators, with all outcomes given equal prior probability (at least in the naive versions taught in 101).
But pragmatically, Bayesian methods work in the sense that tech and pharma companies basing their investment decisions on such results usually don’t lose their money.
but it doesn’t give us the right to say there’s a 99.9999% chance that it exists
Why not?
That’s why corrections for multiple testing of the same hypothesis are a thing. In the example, 0.0025 should have been used instead of 0.05.
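A sketch of the correction being invoked (Bonferroni; the assumption that the example involved 20 tests is mine, inferred from the numbers, since 0.05 / 20 = 0.0025):

```python
# Bonferroni correction: to keep the family-wise error rate at alpha
# across m tests of the same hypothesis, require each individual test
# to pass at alpha / m. Here m = 20 is an assumed count.

def bonferroni_threshold(alpha, m):
    return alpha / m

per_test_threshold = bonferroni_threshold(0.05, 20)
print(per_test_threshold)  # 0.0025, the corrected per-test level
```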
Because we haven’t excluded all alternatives. We probably haven’t even imagined all possible alternatives.
D.O.: Because all the measurements could just have happened to be wrong in the same way. And either the Higgs exists or it does not*, there’s no probability about it. Either we’re sure or we aren’t. (Spoiler: we aren’t, but we pretend we are).
D.M.: So maybe only one in four thousand** published results is wrong instead of one in two hundred? (400? 20? Your decimal places are not my decimal places). With ten billion people being used as controls, there might be hundreds dying every day from errors committed at 99.975% confidence. But tens of thousands getting well from correct results. It’s risk management, not divine truth.
______
(*) Pace arguments about our observable physics being a quantum overlap of one where the Higgs exists and one where it doesn’t. You can have this idea for your next SF novel.
______
(**) I still didn’t do the math; “one in four thousand” is probably off by a small constant factor, but the seat of my pants is not on fire from that guesstimate.
DM, there are potentially many objections to “a 99.9999% chance”. I want to know what Lars’s is. As for alternatives, that was Lars’s statement that “If the Higgs boson doesn’t exist, there’s less than, say, a 0.0001% chance […] that we’d have observed what we have”, which presumably accounts for imagined and unimagined alternatives. But you do not object to Bayesian reasoning, so presumably you have found a way around imagined and unimagined alternatives to some degree.
Lars, thank you. I understand that if a phenomenon has no probability in it, you don’t want to introduce probability as an epiphenomenon. Fine. How do you propose to talk about events where there is a definite answer but our knowledge is incomplete? If someone tosses a coin and covers it with their hand so that we don’t know which way it landed, do you propose to use a different word for 0.5 that expresses our knowledge of the fact? Which word?
D.O.: I think our posts crossed, I did try to state why. I’m not sure if it’s ontological, since I don’t see clearly how ontology would apply to me not liking Bayesian inference.
I still don’t understand, but it’s ok, I hope we are just having a good time here. When I figure out the collapse of the wave function, I will be sure to write a novel about it where truly important things like who loves whom or how childhood experiences shape personality are in the balance (or maybe how we can use it to survive AI).
As a doctrine about the nature of probability (as opposed to a particular set of statistical techniques), Bayesianism is about belief, not reality as such. It belongs to epistemology, not ontology (unless you believe that there actually is no reality independent of our beliefs about it, in which case the whole distinction is moot, really.)
https://en.wikipedia.org/wiki/Probability_interpretations
That’s not to imply that reasoning about beliefs is useless or unimportant. But it has no implications for how things actually are. Some say that the “how things are” is intrinsically inaccessible, so reasoning about beliefs is all we can sensibly do; some go further, and declare that “inaccessible” (in this sense) is the same as “nonexistent.”
@Lars you reminded me of this:
Thus Savage said (in the Discussion of Birnbaum (1962a))
link,
the principle
____
I’m not sure I understand your objections, but my best idea is that you might mean:
“if we treat results obtained by Bayesian methods as ‘science’ then we let an entirely unknown function of reality into our science”
I in turn know little about “Bayesians”. I know that I frequently apply Bayesian reasoning but for me it is a way to admit that I don’t know the exact nature of some of my ideas. It’s a way to modify my expectations.
That is, you say “Whereas my preferred approach refuses to assign probabilities at all.” and this is also my approach.
But as I understand it, there are situations where Bayesian reasoning leads to different results, so there is also a practical question: “which approach works better?”
That is, you say “Whereas my preferred approach refuses to assign probabilities at all.” and this is also my approach.
…But I also know that as a matter of fact I do have some expectations, and it’s useful to know how I should modify them when I obtain data.
Yes, I guess when you embed this reasoning into calculations of practical significance the question of how well it works arises.
Information and belief
“It’s asymptotic.”
No one said that the shares of sightings of the sun and the ship approach a limit….
@drasvi: My first point of curiosity is what happens when the share is constant at many to zero; we don’t even have to talk about limits on that side. Working it out for something non-convergent like 1 sun, 1 ship, 2 sun, 2 ship, 4 sun, 4 ship, … should make for a good exam question. (Limsup = 2/3, liminf = 1/2 if I didn’t count any thumbs as fingers; the early peak of 3/4 keeps drifting down.)
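Checking the thumb-count with a quick sketch (the schedule exactly as given in the example, block sizes doubling):

```python
# Running share of sun-sightings for the schedule
# 1 sun, 1 ship, 2 sun, 2 ship, 4 sun, 4 ship, ...

def running_shares(n_blocks):
    """Yield the sun-share after every single sighting."""
    sun = ship = 0
    size = 1
    for _ in range(n_blocks):
        for _ in range(size):   # a block of sun sightings
            sun += 1
            yield sun / (sun + ship)
        for _ in range(size):   # a block of Mothership sightings
            ship += 1
            yield sun / (sun + ship)
        size *= 2

vals = list(running_shares(15))
tail = vals[len(vals) // 2:]    # look only at the tail of the sequence
print(max(tail), min(tail))
```

The peaks are 1, 3/4, 7/10, 15/22, … and tend to 2/3, so the limsup is 2/3 rather than the early peak of 3/4; the share returns to exactly 1/2 at the end of every ship-block, giving a liminf of 1/2.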
Other than that, DE expressed it much better than I can. I feel that the adherents of Bayesian inference fail to make it clear that while it can give us very useful hints about which way to bet, it cannot tell us how things actually are.
@Lars, FOR ME it is a way to make it explicit that we still are relying on our gut feeling, divine inspiration or something like that. Some belief of unknown nature that we know how to modify based on observations (of course if its nature is known, it can be and should be analysed).
Having said this, I would not insist that it is not divine inspiration and the source of the TRUTH:)
In other words, it is my scepticism that makes me like the approach.
Again, this might not be so clear when it is a paper on Khoisan phylogeny that intimidates professional linguists with words like “Bayesian”, “Markov chains” etc.:((((
Anyway, it must converge slowly.
I mean, if we haven’t started a nuclear war in several decades, that does NOT mean we won’t start it some day within the next ** years. Perhaps some people think that the system of several crazy generals and evil presidents is either extremely unstable (would start it soon) or very stable (start it once in a dozen millennia). But it can be moderately stable…
For this reason I think it is a very practical idea to move to Ghana or some place like that.
I wanted to find out if any regional words in Peruvian and Bolivian Spanish are of Panoan or Tacanan origin.
On the way, I found this book (in Spanish) for the general readership. It’s rather fluffy—genre ‘they have a word for it in their language’, ‘untranslatable foreign word’, etc.—but it offers four words in Panoan languages: Matsés nodo (‘(for birds) to flush up from the ground and then roost in a tree’), Hantxa Kuin (Kashinawa) hinin iki (‘libido, sex hunger, considered as a normal everyday desire for something necessary for wellbeing’), Yaminawa (Sharanahua) itsa (‘foul odor of the type of human armpits, onions, or peccaries’), Kashibo (Kakataibo) bëpa (‘elflock, bedhead that provokes the derision of others’). Itsa is nice—in chemical terms, ‘the smell of thiols’?
Interesting on the meaning of pano from the article on Panoan languages in the Spanish Wikipedia:
It reminded me of another discussion on LH.
As for the origin of the name Tacana, there is the following from ‘Arte y vocabulario de la lengua tacana. Manuscrito del R. P. Fray Nicolas Armentia’, edited and annotated by Samuel A. Lafone Quevedo, Revista del Museo de La Plata, vol. 10 (1902), (available here):
And also this:
The paper linked by Y references two other papers that likewise build phylogenies based on syntax:
Ceolin et al. 2020; 2021.
Personally, I prefer to build phylogeny on the alphabetical order of the language names.
Absolutely all that endeavours of this kind show is that syntax does tend to correlate, albeit mostly at shallow depths, with genetic relatedness. (Surprise!) Genetic relatedness must be established entirely independently of such studies, which are valueless for determining genetic affinity.
Oti-Volta languages are all SVO. The most closely related language to Oti-Volta is probably Miyobe, which is SOV. More distantly related are the Grusi languages, which are SVO. Examples like this could be multiplied almost indefinitely.
These results are markedly different from other recent studies on phylogenetic research using syntactic data
Of course they are. In the same way, my proposal to establish genetic grouping by alphabetic order would give differing results if the study was conducted in French rather than English. The technique simply cannot do what is being asked of it. This sort of study is just comparing the results of English alphabetical ordering against French alphabetical ordering to see which fits the known genetic grouping better, so that one or the other can be chosen to “determine” grouping in cases where real evidence is lacking.
Our model is a complex model with flexible parameters
This is a bug, not a feature. The more “flexible” the model is, the more it can be tweaked to fit the data. This is the opposite of proper scientific methodology.
Anyone who thinks that the complexity of their model is a positive feature in itself needs to find another line of work. Complexity in a theory may often be an unfortunate necessity, but it can never be desirable in itself. Excessive complexity may well be a sign that something is seriously wrong with the theory, even if it remains the best available for the present.
Essentially, the authors are boasting that their theory allows for the creation of epicycles wherever needed.
@DE, but what they are saying is that their complex model with flexible parameters resulted in a funny tree (which nevertheless corresponds well to geographical regions) while the other guys’ model grouped English with Icelandic, Russian with Fylosc etc.
I still think you should like the complex model with flexible parameters: it is flexible enough to include Kusaal (in Supplement).
it is flexible enough to include Kusaal
Astonishingly, given the very high-quality work available on Kusaal syntax, they omit this major world language.
Actually, Kusaal has, besides numerous unsurprising syntactic resemblances to other Oti-Volta languages, one or two definite syntactic resemblances to the completely unrelated Hausa. They’ve really missed a trick in not including it.
Abstract of the first paper:
This must set some kind of record for the most fundamental failures of understanding of scientific method and of the principles of linguistics in a single paragraph. Nonsense piled on nonsense. “Generative biolinguistic framework.” Faugh!
Perhaps the whole thing is really a spoof.
“Abstract of the first paper:”
The second. Yes, that’s one of the reasons why I linked it:)
The second
The effect of reading the papers was so stupefying that I temporarily lost the ability to count.
@David Eddyshaw: This is the usefulness of calculating the χ² per degree of freedom as a rough estimate of the merit of a fit. If it is significantly larger than 1, the model probably doesn’t fit the data. However, if it is significantly smaller than 1, either your model could probably fit anything, or you have probably misunderstood your errors.
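For readers who haven’t met the heuristic: a sketch of χ² per degree of freedom for a weighted straight-line fit. All data points and error bars below are invented for illustration.

```python
# Reduced chi-squared (chi² per degree of freedom) for a weighted
# straight-line least-squares fit. Data and error bars are made up.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])
sigma = np.full_like(y, 0.2)               # assumed measurement errors

coeffs = np.polyfit(x, y, 1, w=1 / sigma)  # weighted linear fit
model = np.polyval(coeffs, x)
chi2 = np.sum(((y - model) / sigma) ** 2)
dof = len(x) - 2                           # 6 points minus 2 fitted parameters
print(chi2 / dof)  # ≈ 0.7 for these made-up numbers
```

A value well above 1 means the model (or the claimed errors) is probably wrong; a value well below 1 means the model could probably fit anything, or the errors were overestimated — the two failure modes named in the comment.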
The ones that are also tightly knit geographical clusters are significantly supported, the ones that aren’t are not. One of the more spectacular failures of peer review.
@DM, I don’t think Uralo-Altaic is any more tightly knit than Indo-Uralic: IE covered much of the present Turkic area before the migration period, and has covered much of it again since the expansion of the Muscovites.
As for Basque-NE Caucasian, they have significant Dravidian-NE Caucasian cluster.
Their heatmap, and Romance hm: preview, png.
Total number of variables is unclear.
E.g. they have “grammaticalized person” which implies “grammaticalized morphology”.
So they say g.m. is absent in Mandarin and Cantonese, and then write zero for g.p. in Mandarin and Cantonese (because absence of g.m. implies absence of g.p.) and say g.p. is absent in Japanese and Korean. For grammaticalized genitive they write 0 everywhere.
Ural-Altaic is already known to be a typological cluster, and numerous Uralic branches have lots of Turkic loanwords, many of them from historical times. The Iranian loanwords in many Uralic branches are, mostly, much older and less obvious.
Dravidian and East Caucasian is interesting, however, because that’s never been proposed as a genetic or areal grouping.
My method predicts that too: Daghestani, Dravidian (obviously you have to get the parameters right, though: it is not obvious that Telugu is closely related to Avar.)
@DM, the period of Turkic influence is sandwiched between the two periods of IE influence, so of course Iranian influence is older than Turkic, which is older than Russian.
I don’t think it is proper to decide that all similarities must be areal and say “geography” instead of “typology”:) What we can hope to learn is what information can be extracted from such isoglosses.
@DE, Telugu-Avar is a matter of principles, not parameters.
No, my method is unprincipled.
Parameters without Principles are like Cassowaries without Cossacks.
A cassowary needs a cossack like a fish needs a bivalve!
Cassowaries have few interests beyond West African missionaries.
There are many variants of the cassowary/missionary doggerel here.
Awesome, thanks! The last panel of the first cartoon is terrifying. Those birds are as voracious and ill-mannered as they are heathenish.
“Rhyming with “Timbuctoo” is a challenge, …” – Reminded me of the verse from Trollkrittet where the rhyme is Timbuctoo – Marabout. I wonder if it should be counted as a rhyme or rather a play on North African vocalism.
“…but the phonological sequence |’ɪmbʌktu:| suggested “hymn-book, too”” – Now I understand the commenter from the song “Fartatou” on youtube – which I occasionally listen to for a weird reason (…I like the picture) – better:((
I think it’s a worse problem still than that: in material / etymological comparison, we have the principle of regular correspondences to tell what even corresponds with what, so that we can tell some reconstruction should be sought for cat ~ Katze, but not for dog ~ Hund. Syntactic constructions are however one-off or, at least, not independent datapoints. If we find something like a SVO ~ SOV correspondence in main clauses, how can we know if this involves mutation, i.e. some sort of a diachronic syntactic movement — or simply replacement, i.e. something like SVO picked up as-is from some other proto-construction (for the sake of the example, say in dependent clauses), or from a contact language? Without a backing of regularity, reconstruction procedure cannot distinguish divergent data that should be accounted for by gradual change from the proto-state, vs. divergent data that should be treated as intrusive in its attested function, i.e. basically ignored.
The Volga–Kama Sprachbund might be, however, and their “Volgaic” (ugh) and Permic representatives are indeed Mari and Udmurt, its two Uralic core members. It would be at minimum instructive to see how this form of analysis fares if 1. more languages are included in the sample, e.g. Chuvash, Tatar, Moksha, Komi, any form of Sami; and 2. Uralic, Altaic or for that matter even e.g. Turkic nodes are not constrained a priori.
(I notice no appearance of the terms language area or Sprachbund in the paper.)
Yes, of course for morphology it is
[thousands of meanings] [sequences], where the sequences consist of sounds, each of which can take dozens of possible values.
For borrowings, a whole sequence is replaced with another sequence.
For changes, we expect the same sounds to change in the same way in all sequences.
That’s A LOT of data.
I think it’s a worse problem still than that: in material / etymological comparison, we have the principle of regular correspondences to tell what even corresponds with what, so that we can tell some reconstruction should be sought for cat ~ Katze, but not for dog ~ Hund. Syntactic constructions are however one-off or, at least, not independent datapoints.
Yes, that was intuitively obvious to me the first time I heard about attempts to use syntax in that way and I’ve never taken them seriously.
So, there are two works:
α – Cypriot Greek is an isolate (“Greek” and Romanian form a clade), Y’s link.
β – Cypriot Greek is at the distance 0 from “Greek” (unlike other Greeks).
I find it remarkable.
Despite this sensitivity to variation within mutually intelligible languages – I guess some will think that Serbian without cases and Serbian with cases are very different, but speakers don’t feel so – α is also sensitive to arealisms, small (Finnish and Slavic) and large (languages around the steppe, Africa).
In terms of coverage α and β overlap in IE (and in Italian dialects).
They are more comparable in analysis:
α has two Bayesian trees (RevBayes) and distance-based UPGMA and neighbor joining trees (with node support)
β has a Bayesian tree (BEAST), distance-based UPGMA trees (Jaccard and Hamming) and heatmaps.
So we can compare their UPGMA trees and also see how much bayes-schmayes changes the result.
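For the record, the two distances named for β are simple to state. A toy sketch on invented binary feature vectors (a whole matrix of such distances could then be clustered by average linkage, which is what UPGMA does):

```python
# Jaccard vs. Hamming distance on toy binary syntactic-feature vectors.
# The feature values are invented for illustration.

lang_a = [1, 0, 1, 1, 0, 0, 1, 0]
lang_b = [1, 1, 1, 0, 0, 0, 1, 0]

def hamming(u, v):
    """Share of positions where the two vectors disagree."""
    return sum(x != y for x, y in zip(u, v)) / len(u)

def jaccard(u, v):
    """Disagreements as a share of positions where either vector has a 1
    (joint absences are ignored, unlike in Hamming)."""
    union = sum(1 for x, y in zip(u, v) if x or y)
    diff = sum(x != y for x, y in zip(u, v))
    return diff / union if union else 0.0

print(hamming(lang_a, lang_b))  # 2/8 = 0.25
print(jaccard(lang_a, lang_b))  # 2/5 = 0.4
```

The difference matters for sparse feature sets: Jaccard ignores features that both languages lack, so two languages don’t look similar merely for jointly missing lots of rare features.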
β uses classes of variables rather than elementary questions that can be addressed to a grammarian.
For example “grammaticalised person” is said to be present in a language IFF
(a) There is agreement in speech-role-designating morphology between a verb and some of its
arguments, or an argument is doubled on the verb by a speech-role-sensitive clitic
or
(b) There are overt expletive items in subject function
or
(c) There are overt resumptive items in (direct or indirect) object function
or (d) or (e) or (f) or (g) or (h) or (i) or (j).
Where answers to questions d, e, f, g, h, i, j are said to be negative for all languages that answer negatively to a, b and c. If that means that not all questions (to languages that don’t have “grammaticalised person” or some other class) have been asked and need to be asked, they, sadly, don’t specify which questions those are. They only say that in this one example their theory predicts that a negative answer to a, b and c entails a negative answer to all the other questions.
Moreover, they say that presence of “grammaticalised person” implies presence of “grammaticalised morphology”, so for all languages that don’t have “grammaticalised morphology” they mark “grammaticalised person” as inapplicable.
“Grammaticalised morphology” is
(a) The language has affixes or regular phonological alternations that change the grammatical
category of the base
or
(b) There are roots which take different affixes/phonological alternations encoding different closed-class interpretable/grammatical properties (tense, aspect, number, gender, gradation, case, etc.)
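The coding scheme described above (a feature is present iff any of its sub-criteria holds, and a feature whose implying feature is absent is marked inapplicable rather than merely absent) can be sketched as follows; all feature values are invented:

```python
# Toy encoding of the implicational coding scheme described above:
# "grammaticalised person" is the OR of sub-criteria (a)-(j), and its
# presence presupposes "grammaticalised morphology".

def grammaticalised_person(sub_criteria, has_gram_morphology):
    """True/False if applicable; None when the implying feature is absent."""
    if not has_gram_morphology:
        return None               # marked "inapplicable", not merely absent
    return any(sub_criteria)

# A language with morphology and verb agreement (sub-criterion a):
print(grammaticalised_person([True, False, False], True))    # True
# An isolating language without grammaticalised morphology:
print(grammaticalised_person([False, False, False], False))  # None
```

The three-valued coding (present / absent / inapplicable) is exactly what makes the total number of effective variables hard to pin down, as noted above.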
Real comparative work spent quite some time in its infancy working out that typological resemblances were of no value in demonstrating genetic relationships.
Nowadays, with the aid of “Bayesian methods”, we can advance confidently into the past.
(This gives a whole new slant to the statistical term “regression.”)
Should not these brave pioneers be revisiting the question of phlogiston?
The heatmap which I linked above represents more or less raw data.
What it looks like depends on the actual history AND our choice of variables (the arbitrary thing here).
“The Volga–Kama Sprachbund” – Yes. Though a well-defined Sprachbund does not mean we should expect lesser “distance” between its members than between either of them and some other language.
By the way: the link to their variables.
Though a well-defined Sprachbund does not mean we should expect lesser “distance” between its members than between either of them and some other language
How not?
“Well-defined” is “well-defined”.
– either correlation: languages that have a feature A also have B and C, without a reason in the language as a system itself.
– or languages in a geographical region have A, B, C even though they are not related (A can be shared with one language family and B with another).
Uralic may share some features with IE because of IE dominance in the steppe and because of contact with Russian, but those don’t form such a small area.
You seem to be denying that there actually is such a thing as a Sprachbund, along the lines that the real explanation of such phenomena is either influence from a single prestige/dominant language or pure coincidence.
You are not altogether alone in this, I gather, but I can easily come up with plenty of cases where such explanations seem pretty improbable, to say the least.
The convergences are often of a kind where coincidence is a very farfetched explanation. For example, in my own bailiwick, Ditammari and the neighbouring non-Oti-Volta Miyobe language share (a) devoicing of voiced stops (b) secondary development of class prefixes from proclitic “articles” (c) complete loss of many inherited final syllables in polysyllabic words. It does not seem that Ditammari-Miyobe bilingualism is at all common, and Mooré, which is much more closely related to Ditammari than Miyobe is, shows none of these features. On the other hand, Nootre, which is a Western Oti-Volta language closely related to Mooré, is spoken in the same Atakora region of Benin, and it does have the stop-devoicing thing.
The “dominant single language” objection seems the most cogent, but it seems hard to identify any particular single language as the key in the case of e.g. the famous Balkan example. As multilingualism of one kind or another is presumably at the back of the linguistic convergence seen in all presumed language areas, this is in any case a question of degree. There seems to be no reason why one single language would need to be “dominant” to cause such things.
Many West African languages share a great deal of their general modus operandi with one another despite belonging to at least three families which not even Greenberg thought were genetically related: this turns up in things like closely matching semantic fields and in the kinds of categories marked in flexion (e.g. perfective/imperfective in verbs, not tense, as the primary distinction, with closely analogous usages of the individual aspects. Kusaal and Hausa both use formally subordinate perfective-aspect clauses specifically to carry on the sequence of events in narrative, as does Fulfulde.) There are widespread West African lingua francas, sure, like Hausa and Dyula, but there is no single “dominant” language over this whole area.
Be that as it may, any theory that simply ignores the fact that neighbouring languages influence each other more than geographically separated languages do is just pseudoscience.
@DE, why????
I’m not denying it.
I just mean “well-defined” does not mean “closer to each other than each of them to any other language”.
@drasvi
Do you mean, e.g., that South Slavic (Balkan) languages are closer to West or East Slavic languages than they are to Greek or Romanian (Balkan)? The Sprachbund effects are more obvious when they are in the “delta” between two related languages, one within and one outside the Sprachbund.
Yes, that’s the relatively easy case.
The Eastern Oti-Volta languages, on the other hand, are certainly all fairly closely related to one another anyway, but not as closely as they appear: some of the shared features are also seen in Nootre, which is definitely less closely related genetically, and some of the shared features (like the stop devoicing) differ between the languages in a way that makes it impossible to reconstruct them to a putative proto-Eastern language: they can’t be due to common inheritance. In fact, “Eastern Oti-Volta” is probably not even a valid genetic subgroup of Oti-Volta at all.
It’s the same with the loss of grammatical gender in Western Oti-Volta: proto-Western must have had (at least) eight grammatical genders, and relics in the WOV languages that have lost it show that the system was simplified differently in the different languages prior to its complete abandonment. The loss has also spread to a couple of neighbouring non-Western languages, too. Although loss of grammatical gender seems like a “natural” development to an English speaker (or a Persian), hardly needing to be “explained” at all, it is actually very exceptional in Oti-Volta, confined to a single geographical area (apart from Dagaare/Dagara, which seems to have migrated to its current somewhat separated position just a couple of centuries ago.)
PP, this too.
But I just think a group of languages don’t have to be exceptionally similar to be a Sprachbund…
The period of Turkic expansion in northern Eurasia is right between two periods of IE expansion across the same territory. Uralic languages could have accumulated any number of commonalities in grammar with unknown old IE (if not PIE) first, then Iranian, and now Russian.
Although loss of grammatical gender seems like a “natural” development to an English speaker (or a Persian)
Yes, Anglophones are weird (some are also WEIRD). But Persians, they look like normal people, like you and me. One boob and half a beard (on average). And some chador.
Speaking seriously, IE gender exists in the context of agreement in case and number. English does not mark adjectives even for number. One could expect gender to disappear when the whole system falls; nevertheless, there are plenty of caseless IE languages that retain gender…
@drasvi:
Not for the first time, I misinterpreted you: it looks like you were actually affirming, rather than denying, a fairly standard conception of a Sprachbund.
But then I don’t understand why you don’t feel that a failure to take such phenomena into account seriously undermines the papers we’re discussing. (Not that that is the only problem with them, by any means. Their entire methodology is unsound.) Or was I also wrong in rushing to the conclusion that you were defending the papers?
One could expect gender to disappear when the whole system falls
In fact, languages seem to really lurve grammatical gender, even if they strip it down to just two genders: such systems seem to be amazingly persistent across time, for all the weird cases like English and Persian and Kusaal and Lingala. Once you’ve got grammatical gender in the first place, it seems to be pretty hard to get rid of it.
Welsh, which seems to have been on a mission for two millennia to see how far it can get from Indo-European typologically, still has grammatical gender, even in the most colloquial modern language. And Hausa shares it with Maltese, despite their relationship being at (probably) the very limit of demonstrability by proper comparative methods.
@DE, I disagreed with DM simply because I disagree with the argument. I don’t think that “Ural-Altaic and NE-Caucasian-Dravidian” are more obvious geographically than “Ural-IE and NE-Caucasian-Basque”.
But all that can be done with their Caucaso-Dravidian is: someone who has time for this can consider them to see if it’s a chance similarity or vestiges of a language area or a genetic link or what. For Uralo-Altaic these possibilities have been explored by many.
I think the two works that I designated as α and β (did not want to repeat names or labels like “former” “latter” all the time) combined portray a rather interesting picture. Enough to get me curious.
β’s selection of variables is motivated by their theory of language acquisition (which pretends to be Chomskyitic and maybe is). I, of course, did not even try to evaluate the theory.
But I think visualising “typological” data and playing with different sets of variables is a meaningful exercise. (On the one hand syntax is just syntax, not “typology”; but on the other hand, variables with values “present” and “absent” rather than “equus” and “durchschnittsgeschwindigkeit” are of some interest to typology, and often it is typologists who select and collect such data.)
We don’t even have a database of 1001 variables for 700 languages and software that allows you to press a button and generate something like this heatmap or a tree or whatever – to say nothing of dependencies between variables, or grouping them so that a “logical” expression with names of features a1, a2, a3 and operators like “or”, “xor”, “and” becomes a new variable, and whatever else can be invented.
Interpretation is going to be difficult of course.
As for blah-blah-blah, I skipped it.