Researchers at the Heart of AI Look to the Future

At TRAILSCon 2026, an interdisciplinary cadre of AI experts came together to assess challenges and opportunities in the field.

March 23, 2026

Zoe Szajnfarber addressed the audience at TRAILSCon. (GW Today)

In the United States, implementation of artificial intelligence (AI) across nearly all fields, disciplines and use cases has skyrocketed in the last few years. But with increased exposure, trust in these models’ accuracy and usefulness may be declining, while apprehension grows about their implications and potential misuse. According to a recent Pew Research study, half of Americans say they are more concerned than excited about the increased use of AI in daily life, up from 38% the previous year.

So, what would building trustworthy AI mean, and what practical steps would it require? Experts from the worlds of policy, industry and research convened at the George Washington University March 4 and 5 to explore this and other vital questions during TRAILSCon 2026, the second annual conference hosted by the Institute for Trustworthy AI in Law and Society (TRAILS).

To define “trustworthiness” as a concept in his introductory remarks, TRAILS Director Hal Daumé III drew on the expertise of an unusual consultant: a 5-year-old. Asked what “trusted” means, Daumé said, the child responded, “is there to help me.”

“You can quibble with that definition—and a lot of us are academics, so we will quibble with any definition—but there's something there, right?” said Daumé, who is the Volpi-Cupal Family Endowed Professor of Computer Science at the University of Maryland. “It’s a sense that the system is there to benefit the people who are using it or are impacted by it.”

The event was a manifestation of TRAILS’ cross-disciplinary mission to transform the practice of AI from one driven primarily by technological innovation to one driven by ethics, human rights and input and feedback from communities—from “tech first” to “people first,” per its vision statement.

Trustworthy AI at GW (TAI) Faculty Director Zoe Szajnfarber, professor of engineering management and systems engineering and GW’s senior advisor to the president on AI policy, pointed out how TRAILS has grown since it launched in 2023 with a $20 million grant from the National Institute of Standards and Technology (NIST) and the National Science Foundation (NSF). A partnership between GW, the University of Maryland, Morgan State University and Cornell University, the institute now sees participation from more than 100 researchers, with those at GW representing all 10 of the university’s schools.

By working “within systems and for society,” Szajnfarber said, TRAILS can help researchers make meaningful impact, connecting technical depth with GW’s historical strength in policy, law and politics. That puts TRAILS in a powerful position to make change at a major inflection point for AI. 

Over the course of the two-day summit, attendees listened to conversations between thought leaders, held roundtable discussions and joined interactive breakout sessions, always with a focus on interdisciplinary communication. 

A fireside chat with Google senior research scientist Michael Madaio gave insight into how the tech giant evaluates trustworthy AI and how the general approach to evaluation has evolved and should continue to evolve as models move out of the lab and into the real world. 

Madaio’s background is in education and learning sciences, a field where the efficacy of AI-based interventions has been studied for decades and where evaluating those interventions has always been a central concern. “If you're developing some curriculum or new tool, you want to understand: To what extent will it improve learning?” he said.

Before the AI boom, Madaio said, researchers evaluated such interventions in rigorously controlled lab studies. A particular training or technology—interaction with a chatbot before taking a quiz, say—might show an x percent improvement in learning outcomes in that sterile scientific context.

As soon as these technologies are deployed in the “wild and woolly chaos of the classroom setting,” however, results change. The strict contextual restrictions of a scientific study disappear, and users inevitably find applications and approaches never envisioned by researchers. Any meaningful evaluation of these technologies, therefore, has to take the real world into account.

This is also true in other types of AI evaluation, Madaio said, including the popular generative models that produce images or text. Many such models are, or at least ideally should be, shaped and improved by evaluation—by human input determining whether generated content is or is not accurate. 

But Madaio said this binary evaluative framework is often insufficient and may even be based on a definition of “accuracy” that is not meaningful. He pointed to generated images of cultures outside the Global North, which may be deemed “accurate” only because they align with generally available information like whether a particular physical landmark is situated in a given country. Evaluators may not know enough to detect inaccuracies, to spot incongruities like mismatched cultural markers or to recognize an image’s harmful connotations. Such content, spreading unchecked, can then shape perceptions of the cultures it depicts. 

For a recent study, Madaio and his colleagues held community workshops in three South Asian countries to design new benchmarks and metrics for evaluation, hoping “to develop a richer understanding of what ‘good enough’ means beyond this singular concept of ‘accuracy.’” These workshops eventually resulted in five different evaluation metrics, a framework that could serve as a model for a gold-standard evaluative process. 

“For AI, we have many of these benchmarks for overall performance that the field has landed on that may not actually reflect what good performance looks like,” Madaio said. For him, therefore, the future of the field is participatory, “where you leverage the expertise of a community to design what that metric is, what that benchmark is.” Through that collaborative work, researchers and industry players can design AI that comes closer to a meaningful, shared “ground truth.”