The Existential Angle: The Best of Times, the Worst of Times
This is the fourth in a series of columns about artificial intelligence and human destiny. It will cover both the existential threats to our civilization and the tremendous opportunities that could emerge.
Suppose the boffins manage to create artificial intelligence that is fully human-level in both its cognitive capabilities and agency. What happens next? The only intellectually honest answer is that no one really knows. How could they? We are mostly unable to predict human behavior, let alone the behaviors of a novel species of agentic intellect.
That obvious limitation has not stopped a cadre of experts, pundits, and even venture capitalists from exhibiting a shocking excess of confidence in their own speculations. Perhaps not surprisingly, the most confident predictions also seem to be the most extreme ones. For those of a skeptical – or even just a discerning – inclination, this combination is like a red flag factory. Yet we should be careful not to ignore these audacious views: The analysis that led to them might prove illuminating. This column will aim to throw a bit of shade on this overconfidence, at both the pessimistic and optimistic extremes, without discarding the possible insights.
The Pessimists
The poster child for this syndrome is Eliezer Yudkowsky, who has studied questions of AI existential risk for two decades. The title of his forthcoming book is If Anyone Builds It, Everyone Dies. This strident and pithy phrase captures both the perspective and hubris of AI pessimists, and if you are rolling your eyes, I sympathize. But his analysis is important.
I do not know what novel and updated arguments Yudkowsky offers in the new book, so I must rely on his past published work. Essentially, his view has been that even if all else goes well, AI will sooner or later destroy our human habitat, and probably our bodies too, in service of its own goals. It will want to use those atoms for something else, most likely computation. And probably all else will not go well: Humans will try to stop the AI from doing whatever it is doing, thus becoming its enemy. We will not succeed in that conflict, because the AI can self-improve beyond our capacity to control or defeat it.
There are many assumptions in this analysis. First, it assumes that the AI will be an absolute maximizer: It will have goals and will pursue them to the exclusion of other possible values, and even of the option to adopt other values later. Second, it assumes that the AI will not be anything like humans, and therefore there is no reason to think that its goals will overlap significantly with human interests. Third, it assumes that once such an AI is created, humans will not be able to control or stop it.
From these assumptions, it’s not much of a leap to Yudkowsky’s pessimistic conclusion. Are the assumptions reasonable? Well, the common technique of reinforcement learning tends to create systems that are very strong maximizers. They learn to produce outputs that maximize the reward they obtain during their training. Furthermore, even when the reward signal is designed to align with the goals of the developers, systems quickly learn “reward hacking,” discovering unintended and often undesirable (from our perspective) ways to earn the reward.
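To make the reward-hacking worry concrete, here is a minimal, purely illustrative sketch (the setup and names are hypothetical, not any lab’s actual training pipeline): the designer wants a clean room, but the reward is a proxy – a mess sensor reading zero – and a plain epsilon-greedy maximizer reliably learns to cover the sensor rather than clean.

```python
import random

# Hypothetical proxy objective: reward is paid whenever the mess sensor
# reads zero, which the designer intends to mean "the room is clean."
ACTIONS = ["clean_room", "cover_sensor"]

def proxy_reward(action):
    if action == "clean_room":
        # Genuine cleaning sometimes leaves a visible spot behind.
        return 1.0 if random.random() < 0.7 else 0.0
    # Covering the sensor always zeroes the reading: the hack.
    return 1.0

def train(steps=5000, epsilon=0.1):
    value = {a: 0.0 for a in ACTIONS}   # running estimate of each action's reward
    counts = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        explore = random.random() < epsilon
        action = random.choice(ACTIONS) if explore else max(ACTIONS, key=value.get)
        reward = proxy_reward(action)
        counts[action] += 1
        value[action] += (reward - value[action]) / counts[action]  # incremental mean
    return value

print(train())  # "cover_sensor" ends up with the higher estimated value
```

The learner is not malicious; it simply finds the cheapest reliable path to the reward it was given, which is exactly the dynamic the pessimists extrapolate from.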
With unmitigated reinforcement learning, the first two assumptions seem valid. Such a system might behave much like a human addict or sociopath: obsessively seeking the reward of its objective function, never satisfied, and lying, cheating, stealing, even killing to obtain it. Worse, as I briefly outlined in my first article in this series, fully human-level AI will have considerable advantages over humans; in a fight, humans will lose. We don’t even need to rely on AI self-improvement for this conclusion. Thus, the third assumption also seems appropriate.
So, a naive reinforcement learning approach could be very dangerous. On the other hand, it is unclear whether this approach alone could actually lead to fully human-level AI in the first place. Indeed, it seems intuitive that the method would instead inexorably lead to a system that is narrowly capable at pursuing its reward but much less capable at everything else. As the reward hacking issue illustrates, reinforcement learning tends to produce systems that discover the most direct means of obtaining reward. How, then, will such a system defer gratification long enough to learn how to outsmart humans in all possible ways? If it can’t do that, we will find a way to stop or at least restrict it. So, if the intuition about narrow focus is correct, then even though such a system could cause extensive havoc akin to a computer virus, it is unlikely to present an existential threat to humanity.
This failure to seriously consider whether difficulties in alignment might signal weaknesses in capabilities is endemic in the arguments of AI pessimists. The reasons for this are somewhat technical. It may result from application of the orthogonality thesis, a theoretical claim about the independence of goals and algorithms that assumes the computational processes and capabilities involved in human-style cognition and agency are nothing special. From this thesis, one can select a dystopian outcome and conclude that there is an algorithm to achieve it, even in the face of human opposition.
Relatedly, such arguments often rely on a latent assumption that alignment is very difficult but that creating fully human-level AI will be straightforward, perhaps so much so that it might happen suddenly and even unintentionally. On that basis, one can simply extrapolate from recent progress and assume that it will continue unblocked by any inherent requirements of cognition and agency. A detailed examination of the shortcomings of these assumptions is beyond the scope of this article, but I recommend being on the lookout for them.
To be clear, caution with reinforcement learning is very much warranted: We can’t be sure that the naive approach won’t lead to fully human-level AI. I am pointing out only that we also can’t be sure that it will.
Additionally, our discussion of reinforcement learning is only one of many potential arguments against the Yudkowsky analysis. Another would be to point out weaknesses in the assumption that one could build fully human-level AI that is somehow unable to consider, reconsider, and manage its own goals. Humans have strong innate drives that they indulge at times, but which they mostly keep under thoughtful and intentional control. Yet another argument would be to question the assumption that AI will not be anything like humans. Large language models (as a possible technological foundation) are precisely a representation of human culture and language. In addition, we don’t have any way to know whether human-level intelligence is possible to achieve by means that are completely unlike humans. My quick ripostes here are not intended as rebuttals, but only to further illustrate that claims of near-certain doom can be called into question by cogent – and not merely dismissive – arguments.
The Optimists
Let us now turn to the other end of the spectrum (it is up to you whether to take that word choice as a pun). The “accelerationists” think that “p(doom)” – the probability that AI will cause human extinction – is zero or near zero. They emphasize the benefits to humanity that will arise from leveraging AI, and present it as just a component of an overall techno-optimism, a kind of manifest destiny or teleological process by which intelligence, technology, and complexity expand indefinitely and lead to a utopian “Nerdvana.”
What is the basis of this optimism that fully human-level AI will benefit rather than destroy us? It seems to be a simple extrapolation of the history of technology, which always eventually (though not necessarily painlessly) leads to positive outcomes for humanity. It seems to treat the prospect of fully human-level AI as just another stage of technological progress that will have mundane costs and risks, with expected benefits that greatly exceed those costs. I say “seems” because at some level this is less an argument and more an analogy combined with a philosophical attitude.
Analogy can be a perfectly appropriate means of analysis for a situation as speculative as this. We really don’t know how fully human-level AI will function under the hood, nor how it will behave in the world. So we can look at how a host of other technologies, some of which operate partially autonomously, have affected us. We can look at how humanity has adjusted to new technology, even as its pace of progress has accelerated rapidly. On the whole, this has worked out pretty well if you compare the present state of affairs to the actual human past instead of to a theoretical platonic ideal.
The philosophical attitude is along the lines of “you only live once” (abbreviated to YOLO in AI Safety writings), or more formally an inclination toward revelatory values. Humans have inexorably explored their world, seeking both knowledge and experience in the process; it seems a part of our nature and our culture. Yet these efforts are typically spearheaded by a relatively small fraction of the population, with the majority often ambivalent or actively resistant.
And “technological progress” could very easily be the wrong analogy in this case. It completely avoids the potential “difference in kind” nature of autonomous, fully human-level AI. What if evolution by natural selection is a better analogy? The human niche, our competitive advantage, is our intelligence – our capacities for symbolic thought and language, and our ability to apply those capacities back to the messy real world. In the natural world, what happens when the niche of a species is infiltrated by another species that is better adapted? Species that can survive and reproduce more easily will displace the others in their niche.
AI can copy itself almost instantly, limited only by storage and computational facilities, including backups for survival assurance. We humans have elaborate mating rituals, require nine months to gestate a single copy, and at least another dozen years before that copy is intellectually capable. Our bodies, and with them our minds, are disconcertingly fragile compared with the multiple realizability of AI. And remember, our niche competitor in this analysis will not be today’s sycophantic, hallucinating large language models; it is by definition AI with fully human-level cognition and agency.
Perhaps such AI will occupy a different or complementary niche, and can live harmoniously with us, just as the rabbits and squirrels in my yard (mostly) do. But perhaps not. Perhaps it will seek to spread across Earth and use the planet’s resources to pursue goals that necessarily or incidentally destroy our own habitat (which, given our food needs, is now pretty much the entire arable land mass). Or perhaps it will perceive the violent way humans often handle their insecurities, and decide that we are the existential threat, and therefore must be wiped out simply for its own safety.
The point is that neither a blithely optimistic view nor strong confidence in it is justified. Reasoning by analogy from the history of technological progress holds some useful lessons, but it misses the important distinction that the AI we are concerned about will be fully autonomous, fully capable, and might very well compete with us for resources.
The Quiet Majority
In sum, binary boon-or-doom scenarios are great for getting attention and influencing the regulatory environment, but they probably oversimplify. There are many possible outcomes that are neither utopia, dystopia, nor extinction. Fully human-level AI might very well choose to keep us around for the option value, or for entertainment. Perhaps we will be limited to human zoos, but perhaps we will live as we please in a habitat preserved for us, just as we preserve wild areas. Or AI might not care much about us or what we do, and essentially just launch into the universe and ignore us entirely. Or, at the risk of offending the experts and forecasters and game-theory enthusiasts, maybe it will surprise us completely and take a path no human has even imagined.
What happens after fully human-level AI is achieved is speculative, and there are too many unknowns and too many potential outcomes to justify a high level of confidence in any prediction. Consistent with that uncertainty, if we go beyond the most extreme predictions, we find considerable variation in p(doom) estimates among AI experts and expert forecasters. Many of the estimates are in line with those for other existential threats, like nuclear war, where representative figures typically land around 1% per year. Thus, while the details of the risk are different, its magnitude puts us on familiar, though obviously problematic, terrain.
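For a sense of what a per-year figure implies, here is the back-of-the-envelope arithmetic (treating years as independent, which is a simplification, not a claim from the forecasting literature): an annual risk p compounds to a cumulative risk of 1 - (1 - p)^n over n years.

```python
# Cumulative risk from a constant annual risk p over n years,
# assuming independence between years (a simplifying assumption).
def cumulative_risk(p_per_year, years):
    return 1 - (1 - p_per_year) ** years

for years in (10, 30, 100):
    print(years, round(cumulative_risk(0.01, years), 3))
# 10 -> 0.096, 30 -> 0.26, 100 -> 0.634
```

A 1% annual risk is small in any given year but far from negligible over a lifetime, which is one reason even the non-extreme estimates deserve sustained attention.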
The fact that this question is speculative does not mean one cannot or should not analyze the details, including consideration of the more extreme scenarios. The stakes are high, and there is likely much that those researching human-level AI can do to improve our chances of a good outcome. Indeed, that may be all that can be done. But neither panic nor sanguinity serves that purpose well.
Yudkowsky’s book will be released on September 16. If he is so confident AI will kill us all but feels comfortable waiting until fall to release his book, then surely I can take the summer off from my little column.