The Weird World of LLMs.

Part of Simon Willison’s Catching up on the weird world of LLMs (Large Language Models) is about language, which makes it Hattic material; a great deal of it is about coding, which is a mystery to me but of interest to a lot of Hatters, so it’s worth posting for that as well. Consider it also as a public service message: I draw your attention in particular to the “Prompt injection” section at the end. It’s written so clearly and conversationally that even I was able to get a lot out of it. Here’s a passage with some good stuff:

I’ll talk about how I use them myself—I use them dozens of times a day. About 60% of my usage is for writing code. 30% is helping me understand things about the world, and 10% is brainstorming and helping with idea generation and thought processes.

They’re surprisingly good at code. Why is that? Think about how complex the grammar of the English language is compared to the grammar used by Python or JavaScript. Code is much, much easier.

I’m no longer intimidated by jargon. I read academic papers by pasting pieces of them into GPT-4 and asking it to explain every jargon term in the extract. Then I ask it a second time to explain the jargon it just used for those explanations. I find after those two rounds it’s broken things down to the point where I can understand what the paper is talking about.

I no longer dread naming things. I can ask it for 20 ideas for names, and maybe option number 15 is the one I go with. […]

Always ask for “twenty ideas for”: you’ll find that the first ten are super-obvious, but once you get past those, things start getting interesting. Often it won’t give you the idea that you’ll use, but one of those ideas will be the spark that will set you in the right direction.

It’s the best thesaurus ever. You can say “a word that kind of means…” and it will get it for you every time.

An important bit that he mentions in passing: “they don’t guess next words, they guess next tokens.” These models don’t know anything about words or meaning, they just predict token use. Which brings me to what is to me a very basic and important point. I got this via MetaFilter, where one user commented:

But, is that different than me? My words aren’t numbers, but they are squeaks and hoots and grunts that, when strung together, have meaning. As I read this section, I swung between “it’s fake” and “I’m fake”.

And another said “that applies to a lot of people as well.” No! Stop thinking like this, people! I know it feels edgy and cool, but it reinforces an already too common tendency to degrade people’s humanity. Saying “how do I know I’m not a Markov chain?” is like saying “How do I know I’m conscious?”: it’s stupid and self-defeating. The world is hard enough to decipher without pulling the wool over our own eyes.

Comments

  1. PlasticPaddy says

    Judging from that extract, the author might be more comfortable dealing with (“intelligent”) machines than with people. Faced recently with a question from a colleague in an area I was not so expert in, I turned not to ChatGPT, but to a former colleague who has an in-depth knowledge of the topic (which he denied; to many humans, at least, what they don’t know seems bigger and more significant than what they do know).

  2. Stu Clayton says

    Saying “how do I know I’m not a Markov chain?” is like saying “How do I know I’m conscious?”

    Based only on my most recent state of consciousness, I calculate a reasonable expectation that I will still be conscious the next time I

  3. David Marjanović says

    I’m no longer intimidated by jargon. I read academic papers by pasting pieces of them into GPT-4 and asking it to explain every jargon term in the extract. Then I ask it a second time to explain the jargon it just used for those explanations. I find after those two rounds it’s broken things down to the point where I can understand what the paper is talking about.

    That’s bound to fail hard on a few occasions.

    It’s the best thesaurus ever. You can say “a word that kind of means…” and it will get it for you every time.

    That, too.

    I turned not to ChatGPT, but to a former colleague

    How easy was that person to reach?

  4. Judging from that extract

    You could click through and read more, you know. The author clearly has no problem dealing with people, but unlike people, LLMs don’t get bored or irritated with endless “stupid” questions.

  5. Stu Clayton says

    LLMs don’t get bored or irritated with endless “stupid” questions.

    Nor with stupid claims. The world would be a happier place if people could be weaned off “social media” so as to spend their time jawing with their own personal LLM at home.

  6. Stu Clayton says

    They’re surprisingly good at code. Why is that? Think about how complex the grammar of the English language is compared to the grammar used by Python or JavaScript. Code is much, much easier.

    I suspect a hidden bias here. These LLMs were developed by coders. It’s easy to imagine they spent a lot of time making sure their kind would benefit from LLMs. After all, code is easier, as the man claims.

    The grammar of mathematics is also much, much easier than that of English. Does anyone know how good LLMs are at *helping* to solve partial differential equations? Not bad, I would expect, because there are a lot of standard techniques printed up in books.

    It’s beginning to look like LLM searches are basically Google internet searches with a nose job. At each go you get only one confident answer instead of a million hits of uncertain relevance. Trump is functionally equivalent to an LLM as far as confidence goes. A Large Leader Model.

  7. The corpus of computer code which is on the internet covers the range of its existing idioms much more completely than the corresponding corpus of any natural language covers its.

  8. Stu Clayton says

    The corpus of computer code which is on the internet covers the range of its existing idioms much more completely

    I hope they filter out the Stackoverflow corpus. If not, then bad existing idioms will certainly drive out the good ones, as usual.

    In hundreds of bank applications I see Java code with methods declaring Optional as return type, that end with “return null;”. I am confident I could demonstrate that 95% of the catch-clauses in Stackoverflow examples with them are of the type “catch, log and keep going”.

    Relatively few programmers have any sense of how to deal with problems other than static compiler errors. They have no feeling for flow-logic problems. Exceptions are swatted aside like flies (catch, log and keep going).

  9. Dmitry Pruss says

    Letting your code designer make too many assumptions about data is … you know. Just recovered from an epic crash where a too-smart program turned out to decide what kind of data is in the table by looking at the rows on top. So after seeing a bunch of zeroes and ones, it was like, cool, it’s a binary. But it was a genotype (number of genetic variants at the specified position) which can also be two, oops.
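
The failure mode described above can be sketched in a few lines. This is a hypothetical reconstruction, not the program that crashed: the function names and the sample genotype column are assumptions made purely for illustration.

```python
def infer_is_binary_from_head(column, head=5):
    """Guess that a column is binary by inspecting only its first rows."""
    return all(value in (0, 1) for value in column[:head])


def infer_is_binary_from_all(column):
    """Decide that a column is binary only after checking every value."""
    return all(value in (0, 1) for value in column)


# Hypothetical genotype column: variant counts of 0, 1, or 2 per position.
genotypes = [0, 1, 0, 0, 1, 1, 0, 2, 1, 0]

print(infer_is_binary_from_head(genotypes))  # True: the first five rows mislead
print(infer_is_binary_from_all(genotypes))   # False: the 2 only appears later
```

Scanning the whole column costs more time, which is presumably why a program would be tempted to peek at the top rows only.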

  10. Well, sure, it’s silly to have an LLM write a program with stability, good error handling, data checking, all that. For casual scripts though, why not.

  11. @Stu quoting O.P. They’re surprisingly good at code.

    Just not true.

    @Y For casual scripts though, why not.

    Because LLM-generated code is terrible even for short/“simple” ad-hoc jobs. There’s nobody curating the samples of code they go looking for. It might be syntactically correct, but there’s no guarantee it’s fit for purpose, even supposing what it claims is its purpose accurately describes it.

    The way programmers use LLMs for code is to generate the ‘skeleton’ (or “recipe” as Stu puts it) of an application; copying from an existing application that does something similar. (Most commercial applications or games are much of a muchness after all.) So finding “something similar” is what LLMs help with.

    Only a non-programmer would think LLMs are “good at code”. Only a non-programmer would think what an LLM produces is a surprise: you (the human) first have to make many attempts narrowing down the question, to get hold of a piece of code sufficiently similar to the needed purpose.

  12. TR wrote here not long ago about how he used ChatGPT for exactly that purpose, and got exactly the results he’d meant the scripts to produce. I don’t know if they were ugly code, but it doesn’t matter.

  13. Right, and I’ve rarely had to spend time narrowing down the question — ChatGPT is actually quite good at making sense of even badly phrased requests.

    LLMs don’t “go looking for” bits of code to “get hold of” — they’re generative models, not search engines.

  14. If I had to write a bash script, something I very much don’t want to learn how to do, I would not mind at all using ChatGPT. I, for one, etc.

  15. Jen in Edinburgh says

    Comment on ChatGPT maths borrowed from a mathematician elsewhere:

    ChatGPT has been trained on a large corpus of text, some of which consists of “good” solutions to mathematics questions, and if it is presented with a problem it has seen before, it will regurgitate something very close to the original solution. If you throw it something new, it will attempt to create a correct solution according to some probabilistic model, but it may well not succeed and will often cheerfully hallucinate.

    As an example, ask it “In how many ways can 5 objects be chosen from 7?” and you will get an absolutely textbook answer, with explanation. It’s doing this from ‘memory’. Change the numbers, and ask it “In how many ways can 163 objects be chosen from 2500?” and it’s away with the fairies, having never seen these numbers worked through. It told me the answer was approximately 3.40549 x 10^(-160) (not a typo) before ‘helpfully’ adding that this is close to zero and ‘deducing’ that (and I quote) “it is not possible to choose exactly 163 objects from 2500 objects”.
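
For comparison, both questions have exact integer answers that a few lines of code can produce. A minimal sketch, assuming Python 3.8 or later (where `math.comb` is available); it is offered as a check on the numbers, not as anything the mathematician or ChatGPT ran.

```python
import math

# Exact integer arithmetic: Python integers do not overflow.
small = math.comb(7, 5)
large = math.comb(2500, 163)

print(small)            # 21
print(len(str(large)))  # 261 digits, i.e. on the order of 10^260
```

An answer near 10^(-160) is therefore off by more than four hundred orders of magnitude, and a count of ways to choose can never be a fraction in the first place.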

  16. LLMs don’t “go looking for” bits of code to “get hold of”

    Pretty sure this one just has. I asked chatGPT for a program in my (currently) favourite language to calculate the combinatorial per Jen’s quote.

    I then asked it for another program to calculate how many ways 2500 objects can be chosen from 163. I got an identical program, except for the two starting values being switched round. Running that gave me exactly zero rather than some weird approximation. Which is not wrong, but not exactly an answer.

    So I asked “What do you expect as the answer to that second program?”: it quoth “… the expected answer is a very large number. …”, followed by a long screed/standard disclaimer apologising that it wasn’t very clever with large numbers, and I perhaps should “consider using a specialized library”.

    Well the language I chose comes with all sorts of specialized libraries. chatGPT isn’t using any of them. Indeed the code it’s purloined is just about the most inefficient way to work with combinatorics. If a junior programmer came to me with that ‘solution’, I’d ask if they’d slept through their classes.

    It further warned me — but only because I asked — “it’s likely that the program will encounter integer overflow issues when calculating “. I’m glad to know it knows. Except that it’s wrong: the language I chose copes very well with astronomical numbers if you code the problem the right way. Furthermore chatGPT’s program does use ‘the right way’. So if Jen’s mathematician wants an answer, I have it. (Probably it’s correct; I haven’t checked.)

  17. Often, it takes a human a good deal more time and effort to generate a solution to a problem than to validate somebody else’s putative solution. AI can be a labour-saving device for generating such solutions, provided a human is still involved in validation.

  18. John Cowan says

    Often, it takes a human a good deal more time and effort to generate a solution to a problem than to validate somebody else’s putative solution. AI can be a labour-saving device for generating such solutions, provided a human is still involved in validation.

    Quoted for truth.

  19. AntC, my point was that ChatGPT didn’t produce the code you requested by looking for it — where would it look? — but by generating it. But yes, it’s often really dumb, especially where math is concerned (GPT 4 is supposedly better), and the results always need review as mollymooly says. Still, it looks like its code gave the right answer for your second question, which is sometimes all you need. The task of figuring out when that “sometimes” includes “now” still requires human intelligence, of course. How did it do on the first question?

  20. Not coincidentally, the greatest outstanding problem in computer science is whether all solutions that can be checked efficiently can also be found efficiently. And almost everyone thinks the answer is no.

  21. I tried the combinatorics questions on ChatGPT:

    TR: “In how many ways can 163 objects be chosen from 2500?”

    ChatGPT (some formatting changed):

    The number of ways to choose k objects from a set of n objects is given by the binomial coefficient, often denoted as C(n, k) or [vertical notation for “n choose k”]. The formula for the binomial coefficient is:

    C(n, k) = n! / (k! (n-k)!)

    Where n! represents the factorial of n, which is the product of all positive integers from 1 to n.

    In your case, you want to choose 163 objects from a set of 2500, so n=2500 and k=163. Plugging these values into the formula, we get:

    C(2500, 163) = 2500! / (163! (2500-163)!)

    Now, calculating factorials for such large numbers can be computationally intensive. If you have access to a calculator or a computer software that can handle large numbers, you can directly calculate this value. However, if you’re looking for an approximate answer, you might consider using an approximation technique like Stirling’s approximation for factorials.

    Seems legit as far as I can tell. But when I asked it for an actual answer using Stirling’s approximation or any other method of its choice it came up with 4.869×10^116, while Wolfram Alpha thinks it’s 1.65×10^260.

    TR: “In how many ways can 2500 objects be chosen from 163?”

    ChatGPT:

    It seems like you want to calculate the number of ways to choose 2500 objects from a set of 163 objects. However, choosing more objects from a smaller set than the set size itself is not possible without repetition, and even then, the number of possible combinations is limited by the size of the original set.

    In this case, choosing 2500 objects from a set of only 163 objects is not feasible because there are not enough objects in the set to meet this requirement. The maximum number of objects you can choose from a set of 163 objects is 163, which is the size of the set itself.

    If you intended to calculate something else, please provide more context or clarify your question so I can assist you accurately.

    I asked for a script (language unspecified) to “calculate the number of ways that 2500 objects can be chosen out of 163”; it gave a Python script that implements the binomial formula (and runs successfully without integer overflow), and included this error check:

    if chosen_objects > total_objects:
      print("Error: Cannot choose more objects than available.")

    So it still can’t do math, but it did pretty well otherwise (and seems to be improving — my sense is that it would have failed abysmally at all these tasks when it was first released).
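
The order of magnitude can be checked without multiplying out any factorial. The sketch below uses the log-gamma function from Python's standard library in place of Stirling's formula itself (a substitution made here for convenience, since log-gamma is what Stirling's series approximates); the helper name `log10_choose` is hypothetical.

```python
import math


def log10_choose(n, k):
    """Base-10 logarithm of C(n, k), computed via the log-gamma function."""
    log_e = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_e / math.log(10)


estimate = log10_choose(2500, 163)
exact = math.log10(math.comb(2500, 163))  # math.log10 accepts huge integers

print(round(estimate, 2), round(exact, 2))
```

Both values fall between 260 and 261, which agrees with the Wolfram Alpha figure quoted above and not with 4.869×10^116.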

  22. John Cowan says

    And almost everyone thinks the answer is no.

    The mensheviks point out that most arguments for P ≠ NP are of the form “It’s absurd to suppose that if there is an efficient method of exhaustive search for solutions we haven’t found it yet”, and that this argument schema is weak, because “yet” represents only about half a century of trying. Whatever Fermat thought he had done certainly didn’t amount to pre-creating the underpinnings of the Wiles proof, and that took 358 years to find. In addition, even if P turns out to equal NP the proof may be non-constructive or (thus Knuth) involve finite but truly enormous constant factors.

    One of the problems I give on programming-job interviews involves finding an O(1)-space solution that depends on the input being 32-bit non-negative integers and uses an array of size 2^32, which is not only independent of the size of the input (hence O(1) for space) but feasible (16 GB of memory). People rarely figure this out unprompted.

    (At the same job, I asked my boss what to do if someone wrote down a correct closed-form rather than recursive or iterative solution for factorial. “Hire them on the spot,” he said. I was probably thinking of Fibonacci rather than factorial, where the closed form is round(((1 + sqrt(5))/2)^n / sqrt(5)).)
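
The Fibonacci closed form quoted above is easy to try out. A minimal sketch, assuming the convention F(0) = 0 and F(1) = 1; the function names are hypothetical, and the closed form relies on floating-point arithmetic, so it is only trustworthy for moderately sized n.

```python
import math


def fib_closed_form(n):
    """Fibonacci via rounding phi**n / sqrt(5); relies on float precision."""
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))


def fib_iterative(n):
    """Fibonacci by iteration, exact for any n (F(0) = 0, F(1) = 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print([fib_closed_form(n) for n in range(10)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

For sufficiently large n the rounding error in `phi ** n` exceeds one half and the two functions disagree, so the iterative version remains the safe choice when exactness matters.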

  23. Fermat almost certainly thought that Fermat’s Last Theorem could be proved by infinite descent. He wrote down explicit infinite descent proofs for exponents n = 3 and 4. However, he presumably never actually tried to do the modular arithmetic step by step for the general case and so never noticed where the method would fail for most other odd primes.

  24. TR: “In how many ways can 2500 objects be chosen from 163?”

    Whereas I asked it to go ahead and write a program to calculate that. And I’d expect even a newbie programmer to first ‘sense check’ if that’s really the question I was asking. So (to try to stick to a language-y topic) competent speakers have a notion of there being the same subject of discussion expressible in many ways/embeddable in some wider discussion.

    Of course to say ‘the same’ is begging the question: it’s part of a speaker’s competence to know whether two discussions using much the same words are on the same topic.

    when I asked it for an actual answer .. it came up with 4.869×10^116,

    The guesstimate I got — asking the question a different way round — was “2500 choose 163 ≈ 7.10471574e+414”. Despite giving the ≈, it preceded with “The exact value will be:”.

    That’s three wildly varying answers (including Jen’s quoted “close to zero”). None of them even remotely close to the true answer (I agree with Wolfram); nor with even the first significant digits right. Again it seems to depend on which way round you ask the question. Maths is a place where we can give a watertight stipulation for ‘the same’.

    … and included this error check

    Whereas the example code I got had no error checks. Which makes me think it’s just copying the program text from somewhere.

    Plugging these values into the formula, we get: C(2500, 163) = 2500! / (163! (2500-163)!)

    Yes, this is the ‘textbook’ answer. But this is the absurdly inefficient algorithm I’d expect even a junior to avoid. Unless calculating factorials is implemented in hardware, the code is going to iterate a multiplication. Since 163! is a factor of 2500!, there’s no need to calculate 163! at all (so neither multiply it out nor divide the product by it). Just start iterating from 164 to 2500 for that top line. Furthermore, since (2500-163) is greater than 163, there’ll be factors in common with the top line, so don’t bother iterating multiplication over them either. (That’s a more subtle consideration that would separate an intermediate from a junior programmer.)
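
One common way to avoid computing any large factorial is the multiplicative form of the binomial coefficient, sketched below. This is a standard technique chosen for illustration; it is not the code ChatGPT produced, and it is not necessarily the exact cancellation described above. The name `choose` is hypothetical.

```python
import math


def choose(n, k):
    """C(n, k) by the multiplicative formula, using only integer arithmetic."""
    if k < 0 or k > n:
        return 0  # no way to pick more objects than are available
    k = min(k, n - k)  # C(n, k) == C(n, n - k); loop over the smaller side
    result = 1
    for i in range(1, k + 1):
        # After this step result equals C(n - k + i, i), so the floor
        # division is always exact.
        result = result * (n - k + i) // i
    return result


print(choose(7, 5))                                # 21
print(choose(2500, 163) == math.comb(2500, 163))   # True
print(choose(163, 2500))                           # 0
```

The loop runs 163 times for the thread's example, against roughly 5,000 multiplications for the three full factorials.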

  25. But even the absurdly inefficient algorithm runs pretty much instantaneously on a Jupyter notebook, so it comes down to whether you care about theoretical elegance or just practical utility.

    Whereas the example code I got had no error checks. Which makes me think it’s just copying the program text from somewhere.

    But since we both gave it the same task and got different results, it should make you think the opposite: if it was somehow “copying” (from where, given it lacks a searchable memory?) it should produce the same code every time.

  26. it should produce the same code every time.

    No, you got Python, as its choice of ‘script’. I asked specifically for Haskell. Those languages are very different; I wouldn’t expect it’s smart enough to start from some abstract algorithm (expressed how?) and ‘translate’ into two distinct languages.

    Its Haskell model didn’t include ‘sense’ checking. Though (knowing how it tackled the Haskell), the answer would come out zero anyway (calculated via entirely unnecessary iterations).

  27. Supermarket AI meal planner app suggests recipe that would create chlorine gas

    More in the vein of garbage in, garbage out. Or rather mischief in, dumbness out.

    … recommending customers recipes for deadly chlorine gas, “poison bread sandwiches” and mosquito-repellent roast potatoes.

    A spokesperson for the supermarket said they were disappointed to see “a small minority have tried to use the tool inappropriately …”

    They expect with a name like ‘Savey Meal-bot’, customers will take it seriously? Have you met any humans?

  28. I’ve seen some discussion on LinkedIn of code one can use to restrict ChatGPT from scraping your site. Although we’d like the LLMs to be well-informed, I also am loath for the knowledge collected on this site to be free fodder for our AI overlords. https://geekflare.com/stop-chatgpt-plugins-from-scraping-website/

  29. What do other people think about this? Should I disallow ChatGPT-User?

  30. David Eddyshaw says

    I vote yes. Ban it. (In full accordance with Kant’s Categorical Imperative.)

    The thing’s a parasite.

  31. No, you got Python, as its choice of ‘script’. I asked specifically for Haskell.

    Fair enough. But I just gave it the exact same prompt for a Python script as I did yesterday, and got different code. As I’ve said, LLMs are not capable of search and don’t have a database they could search in, unless they’re hooked up to one (as Bing chat is but ChatGPT 3.5 isn’t).
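
The point that the same prompt can yield different code follows from how generation works: the model draws each next token at random from a probability distribution. A toy sketch, in which the three tokens and their probabilities are invented for illustration and the function name is hypothetical:

```python
import random

# Hypothetical next-token probabilities; a real model computes these
# from the prompt and has a vocabulary of many thousands of tokens.
NEXT_TOKEN_PROBS = {"for": 0.5, "while": 0.3, "return": 0.2}


def sample_next_token(rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random()  # unseeded, so repeated runs can differ
print([sample_next_token(rng) for _ in range(5)])
```

Nothing is looked up anywhere in this process, which is why two runs of the same prompt need not agree.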

  32. Keith Ivey says

    Is there much point in blocking ChatGPT when all the others still have free access? Will Google actually allow blocking it from using your site for AI without also blocking it from indexing for the search engine? Of course search engines may be in the process of destroying their usefulness anyway.

  33. Apparently all I need to do is add this to robots.txt:

    User-agent: ChatGPT-User
    Disallow: /

    If anyone who actually understands this stuff thinks it would have bad effects, now would be the time to tell me.
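
The effect of such a rule can be checked locally before deploying it. A minimal sketch using Python's standard `urllib.robotparser`; the URL is a placeholder, and the result only shows how this one parser reads the rule, which real crawlers are not obliged to match.

```python
from urllib.robotparser import RobotFileParser

# The rule from the comment, with the two directives on adjacent lines.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder URL, not the blog's real address.
url = "https://example.com/2023/08/post.html"
print(parser.can_fetch("ChatGPT-User", url))  # False: blocked site-wide
print(parser.can_fetch("Googlebot", url))     # True: other agents unaffected
```

Because the rule names a single user agent, search-engine crawlers that identify themselves differently are left alone.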

  34. As far as I can tell the effect would be that people asking linguistics questions on Bing Chat will no longer be able to be enlightened by the collective wisdom of the Hattery. I don’t really see the point, but it’s your call.

  35. Oh, OK, that’s definitely a bad effect. I guess I won’t do it.

  36. @AntC: With that “Savey Meal-bot” story, we are now literally living in a SMBC comic.

  37. Keith Ivey says

    Really the fact that it can produce dangerous recipes when asked to use nonfood ingredients seem like less of a problem that that it produces bad recipes when given normal ingredients.

  38. My vote is ban for now, reevaluate in a year. We have yet to see what sort of pig’s breakfast this will yield, or not. I worry that people who seek answers will get horrible chimeras, with bits of this blog attached in all the wrong places.

    Let the seekers of answers use grampa’s search engine to get them here directly.

  39. less of a problem that[n?] that it produces bad recipes when given normal ingredients.

    The need was real: during the lockdowns (no takeaways) and global disruption to supply chains, supermarkets were frequently running out of all sorts of ingredients you’d expect to find in your larder.

    So ‘creative’ meal planning with what was to hand/what was actually in stock was definitely a bright idea. Just not an idea for which LLMs are at all suited.

    Then calling it Savey McMeal Face was only asking to take the P.

  40. Last week, I asked Google Bard to recommend me a beach resort in the Islets of Langerhans. It answered something to the effect that they are islets in the pancreas, and are not accessible to the public. And would I like to consider beach resorts somewhere else?

    I asked again today, and was told that the IoL are “not a real place”, but a cluster of cells in the pancreas that produce insulin, and again, how about elsewhere?

    It turns out that all Google Bard needed was another coat of paint.

  41. I prefer to take my breaks in Smegma or Spasmodic.
