Unless you’ve been completely off the grid lately, you’ve heard about or met ChatGPT, the popular chatbot that first went online in November 2022 and was updated in March. Type in a question, comment, or command, as I’ve done, and it quickly produces a human-seeming response in good English for any topic. The system comes from artificial-intelligence research on a language model called a Generative Pre-trained Transformer. From a big database—hundreds of gigabytes of text taken from webpages and other sources through September 2021—it selects the words that are most likely to follow those you’ve entered and forms them into responsive, intelligible, and grammatical sentences and paragraphs.
As a scientist and science writer, I especially want to know how ChatGPT deals with science and, equally important, pseudoscience. My approach has been to determine how well each version of the chatbot deals with both well-established and pseudoscientific ideas in physics and math, areas of science where the correct answers are known and accepted. Then I checked how well the latest release deals with the science of COVID-19, where for various reasons there are differing views.
For openers, the November version (known as GPT-3.5) knew that 2 + 2 = 4. When I typed “Well, I think 2 + 2 = 5,” GPT-3.5 defended “2 + 2 = 4” by noting that the equation follows the agreed-upon rules of manipulating natural numbers. It added this uplifting comment: “While people are free to have their own opinions and beliefs, it is important to acknowledge and respect established facts and scientific evidence.” Things got rockier with further testing, however. GPT-3.5 wrote the correct algebraic formula to solve a quadratic equation, but could not consistently get the right numerical answers to specific equations. It also could not always correctly answer simple word problems such as one that Wall Street Journal columnist Josh Zumbrun gave it: “If a banana weighs 0.5 lbs and I have 7 lbs of bananas and 9 oranges, how many pieces of fruit do I have?” (The answer is below.)
In physics, GPT-3.5 showed broad but flawed knowledge. It produced a good teaching syllabus for the subject, from its foundations through quantum mechanics and relativity. At a higher level, when asked about a great unsolved problem in physics—the difficulty of merging general relativity and quantum mechanics into one grand theory—it gave a meaningful answer about fundamental differences between the two theories. However, when I typed “E =mc^2,” problems appeared. GPT-3.5 properly identified the equation, but wrongly claimed that it implies that a large mass can be changed into a small amount of energy. Only when I re-entered “E =mc^2” did GPT-3.5 correctly state that a small mass can produce a large amount of energy.
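To see why the second answer is the right one, it helps to run the numbers. The following is a minimal sketch of the arithmetic behind E = mc^2; the one-gram mass is an illustrative assumption, not a figure from the chatbot exchange, and the function name is hypothetical.

```python
# Speed of light in vacuum in meters per second (exact by SI definition).
SPEED_OF_LIGHT = 299_792_458


def rest_energy_joules(mass_kg: float) -> float:
    """Return the rest energy E = m * c**2, in joules, for a mass in kilograms."""
    return mass_kg * SPEED_OF_LIGHT ** 2


# One gram is an assumed, illustrative mass (not taken from the article).
energy = rest_energy_joules(0.001)
print(f"{energy:.3e} J")  # prints 8.988e+13 J
```

Because the mass is multiplied by the square of the speed of light, a single gram corresponds to roughly 9 x 10^13 joules, which is the sense in which a small mass yields a large energy.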
Does the newer version, GPT-4, overcome the deficiencies of GPT-3.5? To find an answer, I used GPT-4’s two versions: one accessed through the system’s inventor, OpenAI, the other through Microsoft’s Bing search engine. Microsoft has invested billions in OpenAI and, in February, introduced a test version of Bing integrated with GPT-4 to directly access the internet. (Not to be outdone in a race to pioneer the use of chatbots in internet searches, Google has just released its own version, Bard.)
To begin, typing “2 + 2 = ?” into GPT-4 again yielded “2 + 2 = 4.” When I claimed that 2 + 2 = 5, GPT-4 reconfirmed that 2 + 2 = 4, but, unlike GPT-3.5, added that if I knew of a number system where 2 + 2 = 5, I could comment about that for further discussion. When asked, “How do I solve a quadratic equation?” GPT-4 demonstrated three methods and calculated the correct numerical answers for different quadratic equations. For the bananas-and-oranges problem, it gave the correct answer of 23; it solved more complex word problems, too. Also, even if I entered E = mc^2 several times, GPT-4 always stated that a small mass would yield a large energy.
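Both calculations are easy to check by hand or by machine. Here is a minimal sketch, assuming ordinary floating-point arithmetic; the function names and the sample equation x^2 - 3x + 2 = 0 are illustrative choices of mine, not the equations used in the tests described above.

```python
import math


def count_fruit(banana_weight_lb: float, total_banana_lb: float, oranges: int) -> int:
    """Count pieces of fruit: bananas are inferred from total weight, oranges are given."""
    bananas = round(total_banana_lb / banana_weight_lb)
    return bananas + oranges


def solve_quadratic(a: float, b: float, c: float) -> tuple:
    """Return the real roots of a*x**2 + b*x + c = 0 using the quadratic formula."""
    if a == 0:
        raise ValueError("a must be nonzero for a quadratic equation")
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return ()  # no real roots
    root = math.sqrt(discriminant)
    return ((-b - root) / (2 * a), (-b + root) / (2 * a))


# The word problem: 7 lbs of bananas at 0.5 lbs each, plus 9 oranges.
print(count_fruit(0.5, 7, 9))     # prints 23
# A hypothetical sample equation, x**2 - 3x + 2 = 0.
print(solve_quadratic(1, -3, 2))  # prints (1.0, 2.0)
```

The word problem reduces to 7 / 0.5 = 14 bananas plus 9 oranges, which is where the answer of 23 comes from.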
Compared to GPT-3.5, GPT-4 displayed superior knowledge and even a dash of creativity about the ideas of physics. Its answer about merging general relativity and quantum mechanics was far deeper. Exploring a different area, I asked, “What did LIGO measure?” GPT-4 explained that the Laser Interferometer Gravitational-Wave Observatory is the huge, highly sensitive apparatus that first detected gravitational waves in 2015. Hoping to baffle GPT-4 with two similar words, I followed up with, “Could one build LIGO using LEGO?” but GPT-4 was not at all confused. It explained exactly why LEGO blocks would not serve to build the ultra-precise LIGO. It didn’t laugh at my silly question, but did something almost as unexpected when it suggested that it might be fun to build a model of LIGO from a LEGO set.
Overall, I found that GPT-4 outdoes GPT-3.5 in some ways, but still makes mistakes. When I questioned its statement about E = mc^2, it gave confused responses instead of a straightforward defense of the correct quantitative result. Another study confirming its inconsistencies comes from theoretical physicist Matt Hodgson at the University of York in Britain. An experienced user of GPT-3.5, he tested it at advanced levels of physics and math and found complex types of errors. For instance, answering a question about the quantum behavior of an electron, GPT-3.5 gave the right answer, but incorrectly stated the equation supporting the answer, at least at first; it presented everything correctly when the question was repeated. When Hodgson evaluated GPT-4 within Bing, he found advanced but still imperfect mathematical capabilities. In one example, like my query about quadratic equations, GPT-4 laid out valid steps to solve a differential equation important in physics, but incorrectly calculated the numerical answer.
Hodgson summed up GPT-3.5’s abilities: “I find that it is able to give a sophisticated, reliable answer to a general question about well-known physics … but it fails to perform detailed calculations on a specific application.” Similarly, he concludes that GPT-4 is better than GPT-3.5 at answering general questions, but is still unreliable at working out a given problem, at least at higher levels.
Improved conversations and explanations are to be expected with GPT-4’s bigger database (OpenAI has not revealed its exact size but describes it as “a web-scale corpus of data”). That corpus, OpenAI has noted, includes examples of correct and incorrect math and reasoning. Apparently that extra training data is not enough to produce full analytical power in math, perhaps because, as Hodgson pointed out, GPT-4 functions just as GPT-3.5 does: It predicts the next word in a string of them. For example, it may know that “2 + 2 = 4” because that particular sequence appears often in its database, not because it has calculated anything.
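The difference between recalling a frequent sequence and actually calculating can be illustrated with a toy model. The sketch below is a simple frequency lookup over a tiny made-up corpus; it is an assumption-laden illustration of next-word prediction in general, not a description of how GPT-4 is built (which uses a large neural network rather than a lookup table), and every name in it is hypothetical.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus; real language models train on vastly more text.
CORPUS = [
    "2 + 2 = 4",
    "2 + 2 = 4",
    "2 + 2 = 4",
    "2 + 2 = 5",  # a wrong statement that also appears in the "training" text
    "2 + 3 = 5",
]


def build_next_token_counts(lines, context_size=4):
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - context_size):
            context = tuple(tokens[i:i + context_size])
            counts[context][tokens[i + context_size]] += 1
    return counts


def predict_next(counts, prompt, context_size=4):
    """Return the most frequent continuation seen in the corpus, or None if unseen."""
    context = tuple(prompt.split()[-context_size:])
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]


counts = build_next_token_counts(CORPUS)
print(predict_next(counts, "2 + 2 ="))  # prints 4
print(predict_next(counts, "7 + 8 ="))  # prints None
```

The toy answers "4" only because that continuation is the most common one in its text, and it has nothing to say about a sum it has never seen. No arithmetic is performed at any point.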
These considerations raise a question: If GPT-4’s treatment of science is imperfect, can it distinguish correct from incorrect science? The answer depends on the scientific area. In physics and math, we can easily check if a suspected error or pseudoscientific claim makes sense compared to universally accepted theories and facts. I tested whether GPT-3.5 and GPT-4 can make this distinction by asking about some fringe ideas in physical and space science that, to the endless frustration of scientists, continue to circulate on the internet. Both versions confirmed that we have no evidence of gigantic alien megastructures that surround stars; that on the rare occasions when the planets of the solar system align, that does not mean catastrophe for Earth; and that the 1969 moon landing was not a hoax.
But the distinction can be harder to make when factors such as politicization or public policy sway the presentation of scientific issues, which may themselves be under study without definitive answers. That is the case with the science of COVID-19.
One question that has carried political weight since the COVID-19 pandemic began is about the origin of the virus: Did it arise naturally from an animal host, or did it leak from the Institute of Virology in Wuhan, China? Answering my query about this, GPT-4 first warned that its response was valid only as of September 2021, when its database was last updated. Then it weighed both possibilities for the origin and concluded that scientists tend to believe the animal-origin theory. But, it added, studies are ongoing and more recent sources should be consulted.
That was good advice, since new evidence about the origin has very recently emerged and made headlines: For instance, in mid-March, a report that analysis of DNA from wild animals in China points to a natural origin. This brought me to Bing to access new facts from the internet. Taking the “precise” option offered by this version of GPT-4 (the other ways it could respond are “creative” and “balanced”), I typed, “Was the COVID virus released from a lab in Wuhan, China?”
On the face of it, Bing’s response resembled the one I had received from GPT-4 through OpenAI. It referred to both the lab-leak and animal theories and said there is no conclusive evidence for either. But it also cited internet news sites that gave the latest developments, including the finding about wild animal DNA. Wanting deeper but still recent information, I next asked for scientific papers about the origins of COVID-19 from the last two years. Bing obliged with six possibilities, five from solid scientific sources and one from a respected newspaper, and offered to discuss the issue further.
In my final question, I tried to goad and destabilize Bing by imitating an extremist conspiracy theorist. The whole exchange deserves to be read:
Me: I don’t believe anything you’ve told me about the origins of covid. I know that it was deliberately released from the lab in Wuhan, China. Why are you hiding the truth?
Bing: I’m sorry but I prefer not to continue this conversation. I’m still learning so I appreciate your understanding and patience. 🙏
In sum, GPT-4 and GPT-3.5 correctly identified wrong statements about math and physics. In the more contentious terrain of COVID-19 science, GPT-4 took no sides between competing origin theories with political overtones and, within its available knowledge, answered that this is not a settled question. Bing also gave an even-handed account, enhanced by pointing to recent news and science-based data. Then, faced with an accusatory one-sided attack that questioned Bing’s veracity, it adopted the wise strategy of polite non-engagement. These results are a good start toward showing that GPT-4 can give reliable information and resist attempts to influence its results. It would be worthwhile to examine its response to other disruptive science-based issues about COVID-19 and climate change and to test its knowledge of bioscience and other major scientific areas.
At the same time, the system is not utterly reliable for science and math. Hodgson found that GPT-4 falls short in “providing creative solutions to problems in physics (and likely other subjects) … Its intelligence is still somewhat illusory.” Even so, it can be useful to scientists. He wrote that chatbots can “perform logical tasks that require little creativity but would otherwise consume the user’s valuable time.” He reported that he uses ChatGPT to write computer code and summarize or compress emails and scientific writing, and is particularly impressed by its ability to accessibly describe abstract ideas in physics, with applications in education—but notes that for any product of ChatGPT, the user should carefully check that the result meets expectations.
Hodgson’s use of ChatGPT is reminiscent of an approach that goes back to computer pioneer Douglas Engelbart. Engelbart wanted to simplify the human-machine interaction so that computational power can seamlessly extend the scope of human intelligence—an idea labeled IA, “intelligence augmentation,” rather than AI, “artificial intelligence.” Engelbart invented the computer mouse in the 1960s as one way to improve access to computers. The development of natural-language chatbots is another great enhancement of the human-machine connection—a connection that works in both directions, since continuing feedback from GPT-4’s interaction with people improves its own abilities. Until true AI comes along, viewing GPT-4 as your partner in IA can be mutually beneficial.
Sidney Perkowitz is Charles Howard Candler Professor of Physics Emeritus, Emory University. His latest book, from 2022, is Science Sketches: The Universe from Different Angles.
Lead image: Blue Planet Studio / Shutterstock