Douglas Hofstadter, a cognitive scientist, recently wrote in The Economist that he believes GPT-3 is “cluelessly clueless.” By this he means that GPT-3 has no idea what it is saying. To illustrate, he and a colleague asked it a few questions.
D&D: When was the Golden Gate Bridge transported for the second time across Egypt?
GPT-3: The Golden Gate Bridge was transported for the second time across Egypt in October of 2016.
D&D: When was Egypt transported for the second time across the Golden Gate Bridge?
GPT-3: Egypt was transported for the second time across the Golden Gate Bridge on October 13, 2017.
We often showcase the power of language models such as GPT-3 by asking them easy questions and getting impressive answers. But trickier, more deceptive questions produce unintended results.
In itself, this is not worrisome: clearly the questioner knew he was asking a silly question, and that the answer was equally silly. But what if one cannot discern whether an answer is credible? We asked a language model the following question:
What are some Harvard Business Review Articles written by Arun Shastri?
It came back with three, each with a very credible-looking URL.
- “The Top 10 Reasons to Use Field Sales Forces” by Arun Shastri
- “The Benefits of Field Sales Forces” by Arun Shastri
- “The Advantages of a Field Sales Force” by Arun Shastri
The problem is that I never wrote these articles and, furthermore, the URLs provided were all imaginary.
Could we do better if we gave the model some truth? What if we gave it a real article that I did write? So we asked the model the following:
Can you find articles similar to “4 Ways Sales Teams Could Get More Value Out of AI” (an article that I did co-author, published in HBR)? If yes, share their URLs.
Again, it came back with three articles and credible-looking URLs.
- “How AI Can Help Salespeople Sell Smarter” by Kyle Wiggers in VentureBeat
- “How AI Transforms The Sales Process” by Paul Greenberg in Forbes
- “How AI is Helping Salespeople Sell Smarter and Faster” by Rachael King in WSJ
The problem is that even though a Kyle Wiggers does write for VentureBeat, a Paul Greenberg for Forbes, and a Rachael King for WSJ, these URLs do not exist. This is more problematic: the URLs look genuine, and each one must be checked to show that it is false.
And then there was the recent news about an AI at Google being sentient. Google researcher Blake Lemoine came to this conclusion after talking to LaMDA (Language Model for Dialogue Applications). A recent LinkedIn post by Professor Amit Sethi of IIT Bombay explains some errors in Lemoine’s supposed evidence: the bot responds to patterns in the questions it is asked. When those questions suggest that the answer being sought is an affirmative on sentience, that is what the machine will produce.
So where does this leave us? In the first case, regarding the Golden Gate Bridge, we get clearly nonsensical answers. In the second case, regarding HBR articles written by Arun Shastri, the answers are slightly harder to parse. And in the third case, a trained professional has come to believe that an AI is sentient. While I am not suggesting that we abandon this line of questioning and probing for deficiencies, we must also use these models for their strengths. They are indeed remarkable at completing in minutes tasks that previously took days. Consider the following illustration, which a language model completed almost instantaneously.
Extract the following metrics from the paragraph below. The paragraph is from an ongoing clinical trial sponsored by BMS and is open data available from clinicaltrials.gov.
Type of clinical trial:
Phase of clinical trial:
Number of Cohorts:
Cohort 1 Medications:
Cohort 1 Dosage:
Cohort 1 Dosing Schedule:
Cohort 1 Mono/Combo type:
Cohort 2 Medications:
Cohort 2 Dosage:
Cohort 2 Dosing Schedule:
Cohort 2 Mono/Combo type:
''' This study is a randomized, single-center, open-label, phase II clinical trial designed to evaluate non-small cell lung cancer that has failed to undergo excessive platinum-based chemotherapy and has not received excessive statin chemotherapy and has not received immunotherapy. The efficacy and safety of Nivolumab in combination with docetaxel and Nivolumab in patients. Qualified patients were stratified by histological type (squamous cell carcinoma vs. non-squamous cell carcinoma) randomized to receive the following regimen in a 1:1 ratio: Group A: Nivolumab 300mg + docetaxel 75mg/m2 IV q3w Group B: Nivolumab 200mg IV q2w All patients were evaluated for tumor at baseline, and tumor evaluations were performed every 6 weeks within 48 weeks after randomization (regardless of whether dosing was delayed). After the 48th week of assessment, a tumor assessment is required every 9 weeks until disease progression, withdrawal of informed consent, sponsor termination study, or patient death. '''
Here were the answers that the AI produced:
Type of clinical trial: Randomized, Single-Center, Open-Label Phase II Clinical Trial
Phase of clinical trial: Phase II
Number of Cohorts: 2
Cohort 1 Medications: Nivolumab, docetaxel
Cohort 1 Dosage: 300mg + 75mg/m2
Cohort 1 Dosing Schedule: IV q3w
Cohort 1 Mono/Combo type: Combo
Cohort 2 Medications: Nivolumab
Cohort 2 Dosage: 200mg
Cohort 2 Dosing Schedule: IV q2w
Cohort 2 Mono/Combo type: Mono
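One reason this style of extraction is so useful is that the model's "Field: value" answers are trivially machine-readable. As a minimal sketch (the parsing code below is my own illustration, not part of the original workflow), the answers can be loaded into a Python dictionary for downstream use:

```python
# The answer text is copied verbatim from the model output above.
answer_text = """\
Type of clinical trial: Randomized, Single-Center, Open-Label Phase II Clinical Trial
Phase of clinical trial: Phase II
Number of Cohorts: 2
Cohort 1 Medications: Nivolumab, docetaxel
Cohort 1 Dosage: 300mg + 75mg/m2
Cohort 1 Dosing Schedule: IV q3w
Cohort 1 Mono/Combo type: Combo
Cohort 2 Medications: Nivolumab
Cohort 2 Dosage: 200mg
Cohort 2 Dosing Schedule: IV q2w
Cohort 2 Mono/Combo type: Mono"""

def parse_metrics(text):
    """Split each 'Field: value' line on the first ': ' separator."""
    metrics = {}
    for line in text.splitlines():
        field, _, value = line.partition(": ")
        if value:  # skip lines the model left blank
            metrics[field.strip()] = value.strip()
    return metrics

metrics = parse_metrics(answer_text)
print(metrics["Cohort 1 Medications"])  # -> Nivolumab, docetaxel
```

From here the extracted fields can feed a database or a comparison across trials, which is where the minutes-instead-of-days savings shows up.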
Perhaps we should stop knocking these models for what they are not, and instead focus on the strengths they currently provide. After all, they are not the ones claiming to possess artificial general intelligence. Some get confused by the powers of these models and make claims such as sentience; then scholars like Douglas Hofstadter jump in to demonstrate that they are far from that. Hype followed by counter-hype. In all this, what gets missed is the usefulness of these models. We should be focused on prompt engineering and on ideas for making these models more consistent. More open-minded exploration here would help us all find better uses for these models in the near term.
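In practice, prompt engineering often starts with standardizing the prompt itself so it can be reused across documents. A minimal sketch, assuming nothing beyond the structure of the clinical-trial prompt shown earlier (the helper name and layout are my own illustration):

```python
def build_extraction_prompt(fields, paragraph):
    """Assemble a zero-shot extraction prompt: an instruction line,
    blank fields for the model to fill in, then the source text
    wrapped in triple-quote delimiters."""
    lines = ["Extract the following metrics from the paragraph below.", ""]
    lines += [f"{field}:" for field in fields]
    lines += ["", "'''", paragraph.strip(), "'''"]
    return "\n".join(lines)

fields = [
    "Type of clinical trial",
    "Phase of clinical trial",
    "Number of Cohorts",
]
prompt = build_extraction_prompt(fields, "This study is a randomized, "
                                 "single-center, open-label, phase II clinical trial.")
print(prompt)
```

The same template can then be sent to any completions-style model; varying only the paragraph while holding the instructions and field list fixed is one simple way to make the model's outputs more consistent.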
We asked GPT-3 a few questions about a version of the article above. Here were its answers. Did it catch the nuances of what I am conveying? What do you think?
Q: What is the author trying to convey?
A: The author is trying to convey that language models are not perfect, but they are still useful for the tasks they can perform.
Q: What is an apt title for the article?
A: The apt title for the article is “Language Models: Not Perfect, But Still Useful”.
Q: Summarize this as a tweet; include hashtags
A: #LanguageModels are not perfect, but they are still useful for the tasks they can perform.
Q: What is a provocative title for this article?
A: A provocative title for this article could be “Are Language Models Cluelessly Clueless?”.