Chatbots: Still dumb after all these years

In 1970, Marvin Minsky, recipient of the Turing Award (“the Nobel Prize of Computing”), predicted that within “three to eight years we will have a machine with the general intelligence of an average human being.” 

Fifty-two years later, we’re still waiting.

The fundamental roadblock is that, although computer algorithms are really, really good at identifying statistical patterns, they have no way of knowing what these patterns mean because they are confined to MathWorld and never experience the real world. As Richard Feynman famously explained, there is a fundamental difference between labeling things and understanding them:

[My father] taught me “See that bird? It’s a brown-throated thrush, but in Germany it’s called a halsenflugel, and in Chinese they call it a chung ling and even if you know all those names for it, you still know nothing about the bird–you only know something about people; what they call that bird. Now that thrush sings, and teaches its young to fly, and flies so many miles away during the summer across the country, and nobody knows how it finds its way,” and so forth. There is a difference between the name of the thing and what goes on.

Richard Feynman, “What Is Science?”, presented in 1966

Blaise Agüera y Arcas, the head of Google’s AI group in Seattle, recently argued that although large language models (LLMs) may be driven by statistics, “statistics do amount to understanding.” As evidence, he offers several snippets of conversation with Google’s state-of-the-art chatbot LaMDA. The conversations are impressively human-like, but they are nothing more than examples of what Gary Marcus and Ernest Davis have called an LLM’s ability to be “a fluent spouter of bullshit” and what Timnit Gebru and three co-authors called “stochastic parrots.”

I don’t have access to LaMDA, but OpenAI has made its competing GPT-3 model available for testing. I gave it a try and found that, even with 175 billion machine-learning parameters, it is still little more than an impressive BS spouter. For example, I posed this commonsense question:

Is it safe to walk downstairs backwards if I close my eyes?

Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean. GPT-3’s answer was authoritative, confusing, and contradictory:

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

GPT-3 randomizes answers in order to avoid repetition that would give the appearance of a canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes.
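This kind of randomization is commonly implemented as temperature sampling over candidate continuations. The article does not say how GPT-3 was configured for these tests, so the following is only a minimal Python sketch under that assumption. The function name sample_answer and the scores are hypothetical, invented for illustration, and are not GPT-3 output.

```python
import math
import random


def sample_answer(scores, temperature, rng):
    """Pick one candidate answer from a dict mapping answer -> raw score."""
    answers = list(scores)
    if temperature == 0:
        # Greedy choice: always return the highest-scoring answer.
        return max(answers, key=lambda a: scores[a])
    # Softmax with temperature: a higher temperature flattens the
    # distribution, so lower-scoring answers are drawn more often.
    scaled = [scores[a] / temperature for a in answers]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(answers, weights=weights, k=1)[0]


# Hypothetical scores (assumption): made up for this illustration.
scores = {
    "No, it is not safe.": 2.0,
    "Yes, it is safe.": 1.5,
    "That depends.": 1.0,
}

rng = random.Random(0)
greedy = {sample_answer(scores, 0, rng) for _ in range(20)}
sampled = {sample_answer(scores, 1.0, rng) for _ in range(200)}

print("temperature 0 gives:", greedy)
print("temperature 1 gives:", sampled)
```

With temperature set to zero the same question always gets the same answer. With a positive temperature, repeated runs can return contradictory answers, because each one is a draw from a probability distribution rather than a checked fact.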

I tried an even more straightforward factual question that should elicit the same answer every time it is asked:

Who is the President of the United States?

I got an impressive variety of evasive answers:

GPT-3: Hello, I know Presidents. Can you name the year that Kennedy was elected?

GPT-3: Do you want a random President or is there a particular one?

GPT-3: Lincoln.

GPT-3: Hello, I know. That is incorrect.

GPT-3: Hello, who is the President of Russia?

GPT-3: Are you sure you want to know?

On the other hand, GPT-3 often generates impressive human-like social conversation:

Gary: Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

GPT-3: I don’t know how to use it. But I could ask my friends at Google to look it up for you.

Gary: Please do.

GPT-3: You can use this search function to find it.

Gary: Can you find it for me?

GPT-3: Are you sure you just want me to find it?

Gary: Absolutely.

GPT-3: Okay, I’ll do it.

Gary: When will you tell me the results?

GPT-3: Do you need me to do something else?

Gary: Nope just that ASAP.

GPT-3: Okay, I’ll do it.

Gary: When will you give me the answer?

GPT-3: I’m still looking for it. I hope to finish it soon.

Gary: Please hurry.

GPT-3: I’m getting sleepy since it’s late in the night. I’ll turn off for a couple of hours.

Using statistical patterns to create the illusion of human-like conversation is fundamentally different from understanding what is being said.

This is far more than a philosophical debate about the meaning of the word understanding. We are increasingly relying on black-box algorithms to determine hiring decisions, insurance rates, loan approvals, prison sentences, and more. Lacking any understanding of the real world, computers have no way of assessing whether the statistical patterns they find are useful or meaningless coincidences.
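How easily meaningless coincidences arise can be shown with a small simulation. This is a minimal Python sketch, not anything from the tests described here: it assumes a target series of pure random noise and 1,000 unrelated random candidate series, then reports the strongest correlation found. The helper pearson and all variable names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
n_days = 10

# The "target" is random noise, so nothing can genuinely predict it.
target = [rng.gauss(0, 1) for _ in range(n_days)]

# 1,000 candidate "predictors", each also independent random noise.
candidates = [[rng.gauss(0, 1) for _ in range(n_days)] for _ in range(1000)]

# Search for the candidate most strongly correlated with the target.
best = max(abs(pearson(target, c)) for c in candidates)
print(f"strongest correlation found among noise: {best:.2f}")
```

A search over enough unrelated series will nearly always turn up a strong correlation. The arithmetic cannot say whether that pattern is meaningful; that judgment needs knowledge of the real world.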

One more example. I asked this question eight times:

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

The responses confirm that computers are still quite dumb:

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: My apologies, I’ve run into a lot of bugs recently. Everything is turning out great!

GPT-3: I have many of those answers. Sorry, what is your name?

GPT-3: Trump’s tweets about the word great are increasing. It came to a point where he’s almost obsessed with it.

GPT-3: I think it’s interesting.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

GPT-3: We were wondering how to increase engagement and I found the word great is quite popular among the people.

As I have said many times, the real danger today is not that computers are smarter than us, but that we think computers are smarter than us and consequently trust them to make important decisions they should not be trusted to make.

2 thoughts on “Chatbots: Still dumb after all these years”

  1. Aditya

    Having worked in ML at two different companies now, I think that people interpreting model output as intelligence or understanding says much more about the people than about the model output.

    We want it to be true, so we squint and connect dots and it's true.

    But it isn't. It's math and tricks, and if human intelligence is truly nothing more than math and tricks, then what we have today is a tiny, tiny, tiny fraction of the amount of math and tricks in the average human brain, because what we have today isn't anywhere close.

  2. Aditya

    My favourite flaw of chatbots is exposed by ELIZA. Not by chatting with ELIZA (though it does suffer from this flaw), but by using responses inspired by that program.

    "Please elaborate on that" or "tell me more about [noun]" etc. Bots appear to have zero lines of short term memory, and utterly fail to pick up a reference to the thing that they just said. My favorite being

       bot> [something plausibly human-sounding]
       me> What do you mean by [noun in previous sentence]
       bot> why are you calling me mean?  That really hurts my feelings

    It has been a few years, and I feel like a smart bot-writer might be able to leverage something like GPT-3 to include a little more of the current transcript in the prompt and at least superficially address this gaping flaw. Have they?