
Artificial intelligence is now mainstream, and as its use cases, capabilities, and reach grow, the amount of data it generates and consumes will grow exponentially. What does this mean for evidence when all of this is created in environments where even its creators might not know how it's generated? In this second article in a series on the state of evidence in 2024, I look at the example of artificial intelligence.
Artificial Intelligence is everywhere — or at least it feels that way.
Generative AI tools are growing in popularity and availability, allowing people to create content, review existing content, analyze data and generate suggestions. AI is now being deployed in such diverse use cases as supply chain management, customer service, cybersecurity and agriculture. All of these applications use and generate data to inform algorithmic decisions (predictions) that guide human and machine action. This generation and use of data is at the heart of evidence. Evidence allows us to make more informed decisions about our world by connecting what we experience and observe with outputs and outcomes.
While AI has the potential to be an asset to evidence generation, it also may be its greatest threat. At the heart of this concern is that AI is programmed to learn: to refine and calibrate its algorithms and calculations based on feedback. What goes into each model is information gathered from many sources and then combined with feedback mechanisms, which allow the model to “learn” over time. But like any model, what goes into it influences what comes out. As we know, many of the sources used to train these models are datasets with human beliefs (and values) baked into them or overlooked entirely, but always carrying some kind of bias.
What's more problematic is that many of the developers of these models aren't clear on what was used to feed the models in the first place. Even worse, they don't know how these models generate their output.
If we don’t know what goes into the model, how can we judge the reliability and validity of what comes from it? We can’t.
AI is not just a black box; what’s inside are models stacked on top of one another.
Trust Issues in AI-Generated Evidence
If we can't judge the output of an AI model, how can we make sense of what people choose to do with that information? That becomes enormously complicated as AI-generated content continues to build upon itself as it 'learns'. Decisions are built upon decisions, and that carries risks.
As time goes on, these decisions get embedded within other decisions. Before long, we’ll have little sense of where things came from. Lies, distortions, half-truths, and facts will be combined to inform a set of actions that will lead to something.
My concern is that “something” will, at best, be far from what we need and, at worst, will do much harm. On the day this article was published, Microsoft launched its Copilot+ PC, which embeds AI features into the machine itself to make them easier to access. Embedded features are those that users have to work to avoid, which is a way to increase the likelihood and speed of adoption. Microsoft wants you to use Copilot in your work because it makes money off of that. Just as it did with browsers, operating systems, and software features, Microsoft is seeking to embed features that are very difficult to avoid using.
Just ask anyone with a Microsoft 365 account.
The risks of AI are well-discussed all over the Internet. Whether it’s some of AI’s founders fearing what they’ve created or those who stand to capitalize on its spread asking for regulation of AI, many voices are concerned about what AI brings with it.
The TED website, for example, already has dozens of talks focused on AI, ethics, and the issues associated with it. It feels like AI is everywhere. The commentary is interesting, but it feels like the Blacksmithing Guilds of the early 1900s speaking on how automobiles and horses will work together. The technology is already in broad distribution, and, unlike automobiles, this one has the power to self-perpetuate.
The speed, intensity, and scope of AI deployment are relentless. As we saw with search, social media, and mobile, the technology companies behind it are far ahead, seeking first to establish these systems in the fabric of what we do; only afterward are we left to debate what to do about them.
Read Jonathan Haidt's work on what phones are doing to children's mental health or Johann Hari's work on attention and you'll see what networked, scalable digital technology can bring. Now imagine this powered by a tool that self-perpetuates and amplifies the very things that contribute to these issues. The cat is out of the (iPhone) box. We are left to repair and remediate the damage, not create something better.
These are also embedded and facilitative technologies. That means they get added to a technology, integrated into the 'DNA' of the toolset, and then used to facilitate new technologies and tools. AI is like this, and the problem we have, and will continue to have, is that assumptions are baked into the models we use. Once embedded, they are nearly impossible to take out. What does this mean for evidence?
AI and Evidence

Evidence requires a few things to be useful. Each of them is central to our understanding of what AI brings:
Transparency: We need to be able to see all aspects of the evidence chain. We need to understand how data is collected and treated and how we generate claims from it (e.g., establish themes, apply statistical models, use and treat data, etc.). With AI models, we don't know what's inside many of them (even their creators are unsure), so transparency becomes impossible. It's a black box.
Reliability: Evidence has to be reliable; we can count on it. We also need to understand where we can't (e.g., gaps in knowledge, methods, or tools). This aspect of the work is imperfect and always being honed, but if we gather and use evidence in a transparent way, it's possible to see over time what we can come to rely upon and to what degree. We start to see patterns and reproducible findings and increase our confidence in the strength of predictions. While large language models increase their fidelity over time, the issues tied to their inputs at the start make reliability claims difficult to assess.
For example, if we take some of the most harmful qualities tied to colonialism, such as judgments about people's worth, roles, and the use of power, we can see that the values underpinning those models 'refine themselves' over time as they become embedded within a system. When we speak of systemic racism, it's the embedding of these models into institutions so that we continually reproduce similar outcomes, with certain groups treated differently according to metrics designed to facilitate outcomes that advantage other groups. This can happen deliberately or unconsciously, but that doesn't change the fact that it exists. While these models can be changed (society is always evolving and changing), it's still very difficult to disentangle entrenched, embedded beliefs within a system. We see the same thing with sexism, ableism, and many other embedded social biases, practices, and values tied to 'models' installed through exertions of power generations ago. AI is doing the same thing with different values and evidence, under the exertion of power from those involved in its development.
Whether well-intentioned or not, these models have certain values, data sources, and decision rules embedded in them that shape what comes next.
Just as we now have to responsibly confront the assumptions we use in our evidence systems, AI will challenge us to do the same. What's different this time is that the speed, volume, and intensity of AI are at a level far beyond our human capacity to process and understand. It's taken years of ongoing reflection, critical interrogation, and cultural understanding to begin to dismantle and rework our colonialist models (work that is still being done), based, perhaps ironically, on the evidence that shows how destructive those systems can be to many populations and to society as a whole. While this is difficult work, the advantage is that we can relate to these models as humans by looking at human values, beliefs, and actions. What do we use with AI?
This question is what has me concerned.
AI-generated Evidence: Is it Valid?
Last comes validity.
Validity: Is what we're measuring or capturing true? This is more difficult to assess and evolves more over time, but it's still possible, as it gets us into questions of accuracy. Validity is at the heart of evidence. It is here that AI models hold much promise but, as we see from the previous two points, matters of transparency and reproducibility will hamper our ability to assess it.
Even with the software we use to support our evidence work now, we have the option (not one we often use) to handle our data by hand. We can use transparent, reliable and verifiable methods (e.g., mathematics or linguistics) to generate that evidence. It’s unclear whether or to what degree that’s possible with AI.
Solutions and Pitfalls

There is a partial solution to the problems with AI: create new models. Nothing prevents us from establishing entirely new generative AI models with rules and processes that allow for transparent, reliable, and valid outputs. Unlike search (which is often protected), many of the technical aspects of creating an AI model are open and available for people to access. We can create models from scratch and learn how they behave, monitor their inputs and outputs, and track their evolution over time. This will allow us to calibrate and align them with the evidence systems we create.
We can train models on trustworthy data sources, using culturally relevant, inclusive processes to engage the people these AI models are intended to serve. We can verify, validate, and affirm the assumptions, the data sources, and the interpretations of those sources used to build large language models. There are ways we can create useful and responsible AI systems to generate quality evidence, as the sketch below suggests. Companies like Microsoft and Google speak about responsibility, but how is that manifested in practice? We can build responsible AI, but will we? And will this way of working scale beyond niche communities? We have tried to create social network platforms designed around trust, data protection, and ethics; most failed to catch on.
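To make the idea of monitoring inputs and tracking their evolution a little more concrete, here is a minimal sketch of what recording the provenance of training sources might look like. This is an illustration only, not an existing tool or anyone's actual pipeline: the record fields, the example source name, and the URL are all hypothetical.

```python
# A minimal, hypothetical sketch of provenance tracking for training data:
# record where each source came from, who gathered and reviewed it, and a
# fingerprint of its contents so the inputs to a model can be audited later.
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass
class SourceRecord:
    name: str          # human-readable label for the data source
    origin: str        # where the data came from (URL, archive, institution)
    collected_by: str  # who gathered or curated it
    reviewed_by: str   # who checked it for consent, bias, and relevance
    content_hash: str  # fingerprint of the data as it was used


def fingerprint(data: bytes) -> str:
    """Return a stable fingerprint so later audits can confirm the data is unchanged."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(records: list) -> str:
    """Serialize provenance records into a manifest published alongside the model."""
    return json.dumps([asdict(r) for r in records], indent=2)


if __name__ == "__main__":
    corpus = b"example training text gathered with community consent"
    record = SourceRecord(
        name="community-health-interviews",        # hypothetical source
        origin="https://example.org/archive",      # hypothetical location
        collected_by="local research team",
        reviewed_by="community advisory board",
        content_hash=fingerprint(corpus),
    )
    print(build_manifest([record]))
```

Publishing something like this alongside a model would not solve the black-box problem, but it would give the transparency and reliability questions above something concrete to check against.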
This will fail if we network these systems together without thinking about the consequences. This is among the big traps. If we have ten systems, nine of which we've checked and verified and one we haven't, that one may not spoil the bunch, but it introduces questions about the verifiability of everything that follows. When the stakes are high, that could spell trouble.
The future of evidence is on shaky ground when it comes to AI. That doesn't mean hope is lost, but it may be more difficult to determine where, exactly, hope lies.
Image Credits: Bernard Hermant on Unsplash
