Jason Koebler of 404 Media tells a tale of our times (archived):
The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility.
Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project’s GitHub, creator Robyn Speer wrote that the project “will not be updated anymore.”
“Generative AI has polluted the data,” she wrote. “I don’t think anyone has reliable information about post-2021 language usage by humans.”
See the link for details and a complaint about “the terrible behavior of generative AI companies”; the piece ends:
“Information that used to be free became expensive,” Speer wrote. She closed the note by saying that she wants no part of the industry anymore. “I don’t want to work on anything that could be confused with generative AI, or that could benefit generative AI,” she wrote. “OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they’re constantly cursing the mess that they made themselves.”
Was this post just so damn depressing that no one had the heart to comment?
Well, I wasn’t gonna register for something in order to be able to read the entirety of the 404 Media story. That’s how they getcha.
There are already lots of potential signal-to-noise problems in “traditional” corpus linguistics, even with corpora full of texts we can assume with some confidence were produced by primates. But I can’t say it’s implausible that we’ve recently entered uncharted waters, and at a minimum any future work will require more carefully (and, at least at present, expensively) curated corpora than a let’s-go-scrape-the-web approach uses as input.
You don’t have to register, you can click on “archived” — that’s why I take the trouble to provide the link!
Ah, I often do click on those “archived” links, and I do appreciate the trouble it takes to provide them. I somehow overlooked that option here. Perhaps I was distracted by processing the election results from Brandenburg & Sri Lanka.
I once asked one of the AIs about a topic I’d written about. It didn’t answer the question but scraped a paragraph or two from my website.
When you ask them a serious question, they come back with meaningless generalities. If you point out mistakes in their generalities, they say “You’re right” and spew out more generalities.
I can’t see how any teacher would be misled into thinking a student actually wrote their crap.
APEs are just the latest move by the plutocrats towards Enclosure of the internet Commons.
https://en.wikipedia.org/wiki/Enclosure
Even some of the specious arguments deployed by the rich thieves are the same: essentially, that we peasants will all benefit from being robbed eventually. Once the process is complete.
@Bathrobe: I recently did the same with ChatGPT, and the answers would have been usable as an introduction to the topic after a bit of editing. So I guess it’s a question of which AI, which generation, and how specialized the topic is — if there is a lot of material on the net, as with the topic I tested, the LLM can put together a reasonably useful answer.