Many industries, from entertainment to medicine, are wrestling with the emergence of sophisticated artificial intelligence (AI). Scientific research is no exception. Funding agencies are already cracking down on the use of generative text AIs like ChatGPT for peer review, citing the inconsistency of analysis produced by these algorithms, the opacity of their training models and other concerns.
But for two professors at the George Washington University, the advancing capabilities of large language models (LLMs), including OpenAI’s ChatGPT, Meta’s Llama 2 and Google’s Bard, are worth careful study—and cautious optimism.
John Paul Helveston, an assistant professor of engineering management and systems engineering in the School of Engineering and Applied Science, and Ryan Watkins, a professor and director of the Educational Technology Leadership program at the Graduate School of Education and Human Development, believe LLMs have the potential to streamline and enhance aspects of the scientific method and thereby enable a greater volume of useful, interesting research. To use them that way, though, people need better education about what these algorithms can and can’t do, how to utilize AI tools most effectively and what standards and norms for using AI exist in their discipline.
To that end, Helveston and Watkins co-run an online repository, LLMs in Science, that provides a record of, and resources for, scientists and educators who are exploring the technology’s potential as an investigative tool.
“Last spring John said something insightful about how people might be able to use LLMs as part of their scientific workflow, and I said, ‘well, we should start collecting information on how that happens and figuring out what are going to be the norms and standards for how we use them,’” Watkins remembered.
LLMs train on colossal datasets to recognize patterns: the probable relationships of certain words to one another, and the likelihood that particular clusters of words relate to particular subjects. They then use those learned probabilities to predict answers to users’ queries. The algorithms predict not only which words and phrases to use but also the order in which they should appear, resulting in “humanlike” responses. ChatGPT, which debuted to a huge response in 2022, builds on OpenAI’s GPT-3 family of models, which was trained on more than 570 GB of data (some 300 billion words, the equivalent of 1.6 million copies of “Great Expectations”) scoured from online text databases.
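To make that idea concrete, the toy sketch below counts which words follow which in a tiny made-up corpus and picks the most frequent continuation. It is only an illustration of probability-based next-word prediction, not how production LLMs are built; real systems use neural networks trained on vastly larger datasets, and the corpus and function names here are invented for the example.

```python
# Toy illustration of next-word prediction from observed word patterns.
# Real LLMs use neural networks over vast datasets; this bigram counter
# only mimics the core idea of choosing the most probable continuation.
from collections import Counter, defaultdict

corpus = (
    "scientists analyze data and scientists write code "
    "and scientists write papers and reviewers read papers"
).split()

# Count how often each word follows another in the corpus.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    candidates = following.get(word)
    if not candidates:
        return "<unknown>"
    return candidates.most_common(1)[0][0]

print(predict_next("scientists"))  # prints "write", the most frequent continuation
```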
Helveston and Watkins, who began collaborating as faculty advisers of GW Coders, knew early on that LLMs’ promise went beyond mere data regurgitation. In fact, the programs might save enormous amounts of time and energy on the parts of a scientist’s work that don’t require human creativity or collaboration: drafting boilerplate language, outlining a grant proposal, producing “virtual data” on which an analytical tool can be trained and more.
Accordingly, the resources available at LLMs in Science include an evolving list of potential scientific uses for LLMs, tutorials on using LLMs, guidance for peer reviewers evaluating research that utilizes LLMs, a database of studies about LLMs’ use by researchers in different fields, and more.
Such knowledge is essential, not only for the replicability of scientific studies that use LLMs but for correctly gauging when these tools should and shouldn’t be used. They still have major limitations, as Helveston demonstrated on the first day of his coding class last year by opening ChatGPT and letting it operate live for his students. Helveston first told the algorithm to translate a sentence into Russian, something like “Welcome to class and let’s have a good semester.” He asked his students: Do any of you speak Russian? No? Well, would you read this phrase aloud to a Russian speaker? If you did, would you have absolute confidence in its accuracy? The answer was obvious to anyone familiar with the foibles of automatic translation: Probably not.
Helveston then asked ChatGPT to write a computer function with certain attributes that would produce specific results. He asked his students again: Would they trust this code to do what it was supposed to? They would not. Well, that’s the point of this class, Helveston said—just as a Russian class would teach them how to spot errors in a bad translation, this coding class would teach them how to read the language in which programs are written and spot mistakes.
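The habit Helveston describes, reading generated code and checking it against results you can verify yourself, can be as simple as pairing any suggested function with a quick test. The function and test values below are hypothetical stand-ins rather than code from his class, but they sketch the kind of check a student might run before trusting an AI-written routine.

```python
# Hypothetical example: a function an LLM might suggest, paired with quick
# checks before it is trusted. The function name and test values are invented.

def mean_of_positives(values):
    """Return the mean of the strictly positive numbers in `values`."""
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("no positive values to average")
    return sum(positives) / len(positives)

# Simple checks against results worked out by hand.
assert mean_of_positives([2, -1, 4]) == 3.0
assert mean_of_positives([0.5, 0.5, -3]) == 0.5
print("checks passed")
```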
“The tool is not useful if you don't know the language, and [students are] here to learn the language,” Helveston said.
Educators have understandably raised concerns about ChatGPT, which some believe may increase academic dishonesty. But Helveston and Watkins say that in their experience, if students are taught how and when to use LLMs as tools for learning, they may actually learn better. Quantitatively, Helveston said, his students did better on a handwritten programming exam this spring, the semester he introduced ChatGPT as a teaching tool, than students had in previous years.
“I think at a minimum, every professor needs a section in their syllabus on AI in this course and how it's going to be used or not,” Watkins said. “That will fluctuate across different classes and the preference of the professors, but as faculty we have to address it with our students and I can’t think of a single course where this isn't relevant.”
“What we don't know yet is whether this will change how students learn,” Helveston said. “We want to teach students to use these tools, because [AI is] the future of how certain tasks will get done. But we also want them to know how to question the results.”
Read the latest update and existing official guidance from the Office of the Provost regarding the use of generative AI at GW.