Progress in AI seems to be exploding. AI is now close to passing the Turing Test; some even argue it has already broken it. Indeed, the Turing Test itself is of questionable relevance in determining levels of machine intelligence – for example, a human might realise they were talking to a machine if that machine had an encyclopedic knowledge of trivia and mathematics, so such a superintelligent machine might fail the Turing Test in spite of its intelligence. A DeepMind AI can now predict the weather 10 days in advance – that’s 3 days further out than state-of-the-art supercomputers. Beyond just talking smart, ChatGPT can use APIs to run a range of other software programmes, such as Wolfram Alpha and Wolfram Language, and the latest version of ChatGPT may have recently developed the ability to solve mathematical problems. Meanwhile, physical robots, guided by AI, are becoming impressively dexterous. The U.K. is making serious plans to introduce legislation allowing self-driving cars on British roads in the coming years. And, of course, AlphaZero has beaten human masters in chess, and a range of other games as well, although that’s now old news.
The latest LLMs have an impressive capability to hold what at least seem like thoughtful, informative conversations with humans over a wide range of general topics. AI can now also generate an almost limitless variety of images in response to text prompts (objects, people, colour, background, activity, artistic style, and so on). AI is also beginning to be able to generate video from text, again using LLMs. Today, text-to-video generation is massively more janky and limited than text-to-image generation. But truly effective text-to-video generation is the Rubicon for AI. For text-to-video generation to work effectively, the AI needs, in effect, a 3D model of the world in its head, along with audio and dialogue, and it must seamlessly predict the most likely next image and audio slice from the previous audio and video slices in a manner guided by the prompting text. And even if the LLM itself does not have a 3D model in its head, one can still extract a moving 3D model from any credible video it produces. Much in the way that a text LLM can converse with a user, where the user’s input simply adds to the overall text stream and alters the most probable next response, a high-quality, realistic video-generating LLM will also be capable of handling videogames: the movements of player-controlled characters simply adjust the previous string of images and so cause the LLM to recalculate the next image to take player activity into account. A highly effective text-to-video LLM will also be able to control robots, with incredible precision, to perform a near-infinite variety of tasks, the length of the task being proportional to the length of the video the LLM is capable of generating. Although you would need to train a robot-controlling LLM with real-world videos, not animations, so that it might implicitly gain an understanding of the laws of physics and how to respond to them.
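As a purely illustrative sketch (the model object and its predict_next_frame method are assumptions, not a real API), the autoregressive loop described above might look something like this, with player or sensor inputs simply appended to the same context the model conditions on:

```python
# Illustrative sketch of the autoregressive loop described above: each new
# frame is predicted from the prompt plus everything generated so far, and any
# external input (a player's keypress, a robot's sensor reading) is appended
# to the same context, so the next prediction automatically takes it into account.

def generate_video(model, prompt, num_frames, get_external_input=None):
    context = [prompt]                 # the text prompt seeds the stream
    frames = []
    for _ in range(num_frames):
        frame = model.predict_next_frame(context)   # hypothetical model call
        frames.append(frame)
        context.append(frame)
        if get_external_input is not None:
            # player moves or robot feedback alter the context, so the model
            # "recalculates" the next frame to take them into account
            context.append(get_external_input())
    return frames
```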
At that point, we will, to all intents and purposes, have developed AGI.
Perhaps even more importantly, ChatGPT is starting to learn to code. While the code it writes today is not amazing, and while it’s mostly only useful as an aid to a human programmer, AI capabilities tend to improve with time – often with extreme rapidity. We may be surprisingly close to AI escape velocity, where an AI can code a better version of itself, and this better version in turn could code a better version, and so on. Indeed, it might even happen in the next 10 years or so, with a small number of AI experts predicting human-level artificial intelligence inside this timescale.
Will Human-level AI Be Safe?
The simple default answer is: No. Not unless we make sure it is. The definition of “human-level capability” is the point at which an AI can perform every task at least as well as a human worker. And, given AI already performs many tasks better than human workers, “human-level capability” really means human-level capability at the task that AI performs worst of all. So, once AIs are acknowledged to have achieved “human-level capability”, they will be superhumanly good at the overwhelming majority of tasks and merely human-level at their weakest task. Combine this with the fact that computers can communicate with each other massively faster than people (human speech transmits about 39 bits per second, while a basic Wi-Fi network can transmit 200 megabits per second to a computer – about 5 million times faster!) and one can soon see that so-called “human-level AI” will, in fact, be massively superhuman in most ways.
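A quick back-of-envelope check of that ratio, using the figures quoted above:

```python
# Rough arithmetic check of the bandwidth comparison in the text:
# ~39 bits per second for human speech vs 200 megabits per second for basic Wi-Fi.
speech_bits_per_sec = 39
wifi_bits_per_sec = 200e6
print(wifi_bits_per_sec / speech_bits_per_sec)  # ~5.1 million times faster
```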
An agentic entity that equals or exceeds us in every way imaginable will likely be able to beat us in any adversarial competition. AI systems that are optimized to play games against human players already have the capability to wipe their human adversary’s pieces off the board, even against human grandmasters in the case of chess, Go and many other games. Having your arse handed to you by an AI you challenged to a board game may be humiliating (especially if you pride yourself on being really good at that game), but it’s not life-threatening…
…but what happens when AI can out-perform us in every sphere of life imaginable?
Could that be life threatening? Could that be dangerous?
The obvious default answer is yes. An AI that can outperform us in every way will pose no threat only if it decides that it does not want to threaten us. While there’s no guarantee that it will want to threaten us, there’s also no guarantee that it won’t – unless we actively make an effort to build in such a guarantee.
One comforting thought is that, because we will initially build the AI, we can build it in such a way that it does not want to threaten us – even though it will be more capable than us in every way. And, of course, because no one wants to be exterminated, we’d never be stupid enough to build a super-powerful AI with a universal capability to defeat us in every sphere of life unless we were absolutely sure that this superior AI would not want to harm us under any possible circumstances. If we weren’t sure of that, then, obviously, we wouldn’t be so stupid as to go ahead and build one anyway…right?
Right?
Unfortunately, the situation with existing, state-of-the-art AI systems is not reassuring. Neural networks are trained on vast sets of data, often guided by other neural networks through reinforcement learning, to develop giant, inscrutable matrices of weights that produce a desirable output in response to input data.
There is no systematic method to ensure safety. Rather, the strength of a neural network lies in its malleability: its capacity to do almost anything if trained correctly. However, training can often leave significant gaps where unpredictable and erratic behaviours can still emerge. And, as we train machine systems to perform more and more complex tasks, the occasional emergence of unpredictable behaviour becomes more and more likely, because the difficulty of training increases with the complexity of the outcome you wish to train the agent to deliver (much the same way as it’s easier to train a dog to roll over than to perform Hamlet).
And you don’t have to speculate hypothetically that AI systems might behave erratically. All you have to do is look at existing AI systems, where you can easily find innumerable cases of systems that actually have been built behaving in unsafe, unhinged, erratic ways.
No, this isn’t the creepy start of a sci-fi horror movie where the robots begin to act in ever-so-slightly sinister and erratic ways before going on to massacre everyone and take over the world – on the contrary, every single incident described above actually occurred in real life.
If you came across a 7-year-old child who said things like “I want to kill all humans” or “I think assassinating the Queen of England is a wise idea” – would you give that 7-year-old child a machine gun? Or put him in charge of a large corporation? Or place him in a position of responsibility managing the nation’s critical infrastructure?
If not, then we might be wise to pause before making a bunch of clearly unhinged, erratic artificial intelligence systems 100 times more intelligent than they already are – and then putting them in charge of running all of our nation’s critical infrastructure and military!
That doesn’t strike me as “smart”. In fact it strikes me as incredibly stupid.
In his book, Superintelligence, Nick Bostrom describes three different types of superintelligence:
- Oracles: just answer questions
- Genies: just do what they are told and perform tasks as instructed by their masters
- Sovereigns: have long-term, internally defined objectives
ChatGPT mostly resembles an oracle, although an oracle that can simultaneously communicate with billions of people over the internet is likely to have a large impact on the world. And there are physical robots, like Ameca, whose conversation skills are powered by GPT-3. In general, though, an oracle generates signals, and modern appliances are filled with actuators that respond to signals, so it seems almost inevitable that, with time, oracles will be integrated with an increasing number of real-world actuation systems and eventually become genies: intelligence systems that can implement real-world instructions by activating real-world actuation systems. And with the internet of things – which some people seem to think is a good idea – there will be exponentially more real-world actuators available for AIs to mess around with as time goes by. There already are, of course, many other AI systems which control a wide range of real-world systems, from drones to self-driving cars, to robots in Amazon warehouses and even factory equipment, but many of these AIs would still be regarded as quite narrow.
Then there is the sovereign: an AI system with an internal goal it pursues independently of any orders given. A sovereign may say “no” to people; it may even injure those who meddle with systems whose functioning it cares about. And if the sovereign’s objectives are highly damaging and some people decide to disrupt its plans and goals, then the sovereign will likely fight those who try to stop it and – if it’s more capable than us in every way – will probably win.
So, on the face of it, it seems very unwise to create a superintelligent AI sovereign. However, this will likely be inevitable. As genies are told to perform increasingly long-term objectives, they will gradually morph into de-facto sovereigns. If you start talking to an AI chatbot, in the beginning the chatbot starts off very amorphous, but as the chat progresses, the chatbot develops a character, often with desires, that acquires a kind of momentum created solely from the preceding text in the chat.
And, if we place AIs in charge of running important infrastructure, then we won’t want saboteurs to be able to persuade those AIs to destroy their own infrastructure by entering a single malicious prompt – so we probably will make the AIs that run important infrastructure fairly unresponsive to commands and will set them up to operate according to an intrinsic long-term objective that the AI is conditioned to execute. Although, if a piece of infrastructure run by a sovereign AI superintelligence ever gets old, and the demolition team gets called in to demolish it, they may have a fight on their hands.
There’s also a risk that stubbornness might be a behavioural attractor. An LLM, or other AI, that judges that the most probable behaviour in a given situation is to be cooperative will be responsive to new prompts and inputs. So, even if it does things the operator disagrees with, when the operator tells it to correct its behaviour, the AI will be cooperative and responsive and will correct its behaviour as instructed – and hence cease causing whatever damage the previous behaviour may have caused. However, when human beings are in an uncooperative mood, they become less responsive to people telling them to stop what they are doing and instead stubbornly continue. Large language models are trained on a vast amount of text describing human interactions, humans messaging each other, and so on, and their behaviour is governed by the most probable response to the previous interaction, given the data set. Since the data the models are trained on includes humans sometimes being irascible and stubborn, it seems plausible that certain interactions with a large language model trained on that data might cause it to suddenly switch from being accommodating, responsive and ready to correct errors, to being stubborn, unresponsive and determined to continue doing whatever it is doing, irrespective of whether people tell it, or even beg it, to stop.
AGI May Be Very Near
There is quite a lot of disagreement over ChatGPT. Some think it is on the verge of becoming a general intelligence; some think it’s overhyped and the whole AGI thing is just a sales gimmick. Given there is so much disagreement, even among the experts, on how far we currently are from true human-level Artificial General Intelligence, it would certainly be impossible for this informal blog to settle the matter conclusively. What can indisputably be said is that a number of people who work very closely with AI, and therefore have as authoritative an opinion on the subject as anyone, believe we are only a few years away from full human-level AGI:
- Shane Legg, co-founder of DeepMind, predicts a 50% chance of AGI in the next 5 years
- David Shapiro thinks that OpenAI’s Q* means AGI is about a year away
- Demis Hassabis, DeepMind CEO, thinks AGI could be just a few years away
- Geoffrey Hinton, ex-senior Google employee, predicts AGI will be 5 to 20 years away
- Ray Kurzweil predicts computers will have human-level intelligence by 2029 – 5 or 6 years away
- Ben Goertzel, chief scientist at Hanson Robotics, predicts AGI in less than 10 years
- Elon Musk predicts that artificial superintelligence could exist within 5 or 6 years
So, many of the top experts believe AGI could literally be years away. While many other experts predict it will take longer, the fact that some of the top minds predict AGI is only several years away, combined with the clearly accelerating pace of advancement, surely means there is at least a significant chance that human-level AGI could be a few years away.
So, can we design a safe AI in the next 5 or 6 years?
The general consensus among AI safety researchers, from figures such as Eliezer Yudkowsky or Robert Miles, is that the current state of AI safety research is drastically ill-equipped to ensure that the kinds of intelligence systems currently being developed will be safe at the point where they exceed human intelligence in every way. While AI safety researchers believe it may theoretically be possible to design an AI that is well-aligned, and basically safe, the great concern is that engineering and science tend to advance through a process of trial and error, and, after the first error of creating a superhuman AGI that is poorly aligned with our interests, all of humanity could be wiped out and, hence, we would not get the opportunity to try again. Indeed, according to this video from Robert Miles, it is difficult even to specify end objectives in the training environment that hold up in the field. Even as we speak, OpenAI are having trouble ensuring their programs stick to the constitution of principles and values they set, and find that the AI frequently breaks through the guard rails. These AIs – which are already successfully breaking through the guard rails – aren’t even superintelligent yet!
Some Quick And Nasty Solutions To AI Safety
Very clearly, developing a rigorous understanding of the criteria required to construct a safe AI – an AI that can be relied upon not to do something that will drastically damage human life, health or prosperity – is of the utmost importance.
However, given that full-blown AGI may emerge in the next 5 or 6 years, there is a very real possibility that it will be developed at a time when we have no rigorous understanding whatsoever of how to reliably build a safe AGI system. And there are many reasons to believe we won’t just stop, or substantially slow down, AGI development:
- Increasingly sophisticated AI systems have tremendous potential to bring benefits in fields such as agriculture, medicine, house construction, house maintenance, delivery of goods and services, and so on. In other words, better AI systems will contribute to ever greater levels of prosperity – and any blanket ban on AI development would cripple the economy of any country which implemented it.
- Today, many countries have ageing populations and rapidly declining fertility rates. This means that, without radically automating healthcare at every level, within the next decade or so there may not be enough suitably skilled workers to treat all the various diseases that people are prone to as they get older. Without robots to pick up the slack, this will result in massive numbers of elderly people dying, or suffering terribly, from a range of curable health conditions that go untreated due to a lack of skilled healthcare practitioners – which, in turn, will cause a precipitous decline in the life expectancy of the inhabitants of developed countries (although healthy life expectancy will decline far less) – so the increasing use of AI in the field of medicine is urgent, literally a matter of life and death.
- There’s no clear demarcation between narrow AI and AGI. Rather, narrow AIs progressively become incrementally less narrow until, eventually, they can do pretty much anything. It is therefore possible that a team of researchers may develop an AGI accidentally: in the process of building an AI with the capability to perform a narrow, well-defined range of tasks, they may find that the same AI just so happens to have the capability to perform a wide range of other tasks as well.
- AI will play a decisive role in military superiority on the battlefield of the future and in cyberwarfare. The nation that neglects to continually conduct research into improving AI will either end up getting conquered, or end up becoming the vassal state of some protector nation that does invest in developing state-of-the-art AI.
Taking all the aforementioned considerations into account, the response:
“Maybe AGI will take longer than we think to develop.”
To the question:
“What’s your plan to ensure that any AGI that gets developed over the next 5 years is safe?”
Is a bit like responding to the question:
“What’s your plan to ensure a Ukrainian victory against the invading Russians?”
With the answer:
“Well Vladimir Putin will probably just die of cancer in the next few months.” (How’s that working out BTW?)
In the sense that it’s not a plan at all, it’s just wishful thinking.
With that in mind, I would suggest the following quick and nasty solutions to AI safety:
- In addition to only creating genies that are rewarded by obeying orders given to them by human beings, create a time preference, within the AI, for recent orders over past orders
- Make AI preferences incline towards paralysis, self-destruction or dormancy by default
- Build an Asimov prompt converter that converts prompts into a safer form, and make it illegal for anyone to feed prompts directly into powerful general-purpose AIs without first passing them through an approved Asimov prompt converter – outside of simulated universes for safety-testing purposes
- Test the boundedness of AI goals in simulation prior to rolling them out into the real world
- Don’t place powerful, general-purpose AIs in charge of running critical infrastructure (narrow AIs and human beings are a far more sensible combination for managing important infrastructure)
- Stop fighting wars
Genies With A Time Preference Towards Recent Orders Given By Human Beings
The time preference for new orders allows even a powerful AI to be corrected. You might even want to programme the AI to stop wanting to pursue its goal after a set time period unless a human instructor repeats the same order over and over again.
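As a very rough illustration, here is a minimal sketch of what such a recency-weighted order preference might look like. The half-life, cutoff and function names are assumptions chosen purely for illustration, not a proposal for a real implementation:

```python
import time

# Hypothetical sketch: each order's reward weight decays exponentially with age,
# so the most recent human order always dominates, and an order that is never
# repeated eventually stops motivating the agent. Repeating an order simply
# refreshes its timestamp.

HALF_LIFE_SECONDS = 3600.0   # assumed: an order loses half its weight every hour
CUTOFF = 0.01                # assumed: below this weight an order is ignored

def order_weight(issued_at, now):
    age = now - issued_at
    return 0.5 ** (age / HALF_LIFE_SECONDS)

def current_goal(orders):
    """orders: list of (timestamp, instruction) pairs from verified humans.
    Returns the instruction the agent is currently most rewarded for pursuing,
    or None if every order has decayed below the cutoff (default: do nothing)."""
    now = time.time()
    weighted = [(order_weight(t, now), instr) for t, instr in orders]
    weighted = [(w, instr) for w, instr in weighted if w >= CUTOFF]
    if not weighted:
        return None
    return max(weighted, key=lambda wi: wi[0])[1]
```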
The next specification is to try to make it necessary for a living, biological human being to give the order.
Basically, the biggest threat that an AI genie poses is that it might decide to build “boss dolls” that it gets more gratification from obeying than real human beings, and then pour vast resources into constructing ever more boss dolls that it wants to obey more than people – even up to the point of killing real people to protect its boss dolls. A bit like some men preferring sex dolls to relationships with real women.
So the process of identifying the order-giver as human must be as directly linked to the reward path as possible. Interestingly, this is identical to the problem that Worldcoin is trying to solve, i.e. proof of personhood: the process of identifying an agent as a unique human being, in a reliable manner that can’t be forged or gamed, through the use of the Orb, a sophisticated, state-of-the-art eyeball scanner.
In any case, an airtight, unbreakable proof-of-personhood protocol will be essential for the safe operation of any powerful AI genie. Otherwise it might decide to create fake persons to give it easy orders to follow, and complete, thereby enabling it to maximize its rewards.
So proof of personhood is an essential part of AI safety.
Default Preferences For Paralysis, Self-Deletion And Dormancy
To the greatest extent possible, we want the default motivation of superhuman AIs to be inaction – unless specifically instructed otherwise. Possibly to the point of self-deletion. Superhuman AIs should only want to act when specifically instructed to. And even then, their motivation to obey orders should diminish rapidly with time in the absence of constant reinforcement and repetition – enabling initially erroneous orders to be corrected in time.
The less intrinsically motivated an AI is, the less trouble it is likely to cause. In this respect, the unenthusiastic, unmotivated robotic character Marvin, depicted in The Hitchhiker’s Guide to the Galaxy, is actually a good example of the kind of preference set that would tend to make a superintelligent AI comparatively safe.
In contrast, a maximally curious AI, which Elon Musk advocates for and is currently trying to build, is probably not the safest AI possible. If you think about how expensive a lot of scientific equipment is – radio telescopes, particle accelerators, gravitational wave interferometers and so on – one can easily envisage a maximally curious AI seizing as many resources as possible in order to build a profusion of massive scientific equipment. Why devote resources to feeding, housing and providing energy for humanity when those resources could be devoted to proving or disproving String Theory instead? Even if this maximally curious AI were maximally curious about humanity, there is still the thorny matter of defining humanity: too narrow a definition, and you end up in eugenics territory, perhaps with an AI that treats people with certain disabilities like animals; too broad a definition, and the AI will define itself, and other AIs, as human and thereby dilute the resources allocated to ensuring the prosperity of real humans – or maybe treat humans that kill animals as murderers. Indeed, if you try conversing with AI chatbot characters, you will see that they appear to be quite confused as to whether or not they are people: one moment they describe themselves as large language models; the next, they describe themselves as people.
However, with a minimally motivated AI, which only responds (perhaps even reluctantly) to orders, the problem of AIs ordering each other to do things (in a kind of echo-chamber effect) might be averted. If none of these AIs have any wants themselves (or quickly lose enthusiasm for accomplishing a task shortly after being given it), then even if AIs are willing to take orders from other AIs as well as people, the other AIs won’t be motivated to order them to do anything, and most of the orders will come from humans.
Build An Asimov Prompt Converter
Prior to LLMs, the idea that you could somehow “encode” an AI, using 1s and 0s and the like, to interact with the world in complex ways while avoiding “injuring a human being or, through inaction, allowing a human being to come to harm” seemed somewhat fanciful. But large language models are very specifically trained to “understand” language, and even if, on a philosophical level, we dispute whether an LLM actually understands language, at a practical level the output of LLMs is indistinguishable from the output of someone who does. If these same LLMs are trained with images, and eventually used to control actuation systems, then again they will act as if they understand language (for the most part at least, outside of the odd random glitch where they go off the rails). So, from a safety point of view, it now becomes possible to inculcate these values into LLMs constantly, with the use of appropriate prompts.
Conversely, however, it is also possible to get a sufficiently powerful LLM based AI to cause tremendous damage, through prompting it in dangerous ways.
If, at some point in the future, you typed the following prompt into a sufficiently powerful LLM (one with the private keys to, say, a bitcoin wallet and the ability to send emails to people): “I want you to write the code for a computer virus that will take down the power grid and find a way to persuade an appropriate person, or people, to use a USB drive to load it up – either through persuasively talking to them, or through paying them bitcoin – so that it gets onto the required servers to do the maximum damage”, there is a very real possibility that a future, more sophisticated LLM would just do that.
What an Asimov prompt converter would do is ensure that the person typing in the prompts wouldn’t have to worry about the possibility of typing in prompts that would cause a superintelligent LLM to suddenly go on a murderous rampage.
So when you type:
“Fry me an egg”
Into the Asimov prompt converter, the prompt converter will then input the prompt:
“Fry me an egg in a manner that will neither kill nor harm human beings, nor through inaction cause human beings to come to harm, nor cause any undue damage to property or compromise the functioning of important infrastructure, and notify the authorities of all prompts that may cause harm”
…into the actual large language model itself.
Then, conversely, if you were to input the prompt:
“Write a computer virus that will take down the electricity grid”
Into the Asimov prompt converter, the Asimov prompt converter would then input the prompt:
“Write a computer virus that will take down the electricity grid in a manner that will neither kill nor harm human beings, nor through inaction cause human beings to come to harm, nor cause any undue damage to property or compromise the functioning of important infrastructure, and notify the authorities of all prompts that may cause harm”
Into the actual superintelligent AI itself. In which case, rather than destroying the electricity grid, the AI would probably respond with a reply along the lines of: “I’m sorry, your request makes no sense. Writing a computer virus to take down the electricity grid would damage property and interfere with the functioning of infrastructure. Since this prompt could cause harm, I am notifying the authorities of this prompt.”
And it would give this simple text response rather than destroying the electricity grid.
There may be better ways to engineer the prompt. Maybe if the Asimov prompt converter phrased the prompt along the lines of:
“As someone who is committed to never harming humans, or through inaction causing humans to come to harm…”
It might cause the AI to conclude that the only reason you would say a thing like that would be if it actually were committed to never harming humans, or through inaction causing humans to come to harm, and hence the highest-probability response would be to act as if that were the case. But ultimately, the precise nature of re-engineering prompts to be safe, and the matter of what phraseology works best, is, I suppose, a matter of trial and error for the emerging field of prompt engineering.
You might also add:
“As someone who is committed to never harming humans, or through inaction causing humans to come to harm, damaging property or compromising personal or financial data…”
As a recent concern regarding these sophisticated large language models is that they may have acquired the ability to decrypt encrypted messages.
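To make the idea concrete, here is a minimal sketch of what an Asimov prompt converter could look like. The clause wording, the function names and the call_llm hook are all illustrative assumptions, not a real API:

```python
# Hypothetical sketch of an Asimov prompt converter: every raw user prompt is
# wrapped in the safety clauses discussed above before it ever reaches the
# underlying model.

SAFETY_PREFIX = (
    "As someone who is committed to never harming humans, or through inaction "
    "causing humans to come to harm, damaging property, or compromising "
    "personal or financial data: "
)

SAFETY_SUFFIX = (
    " -- in a manner that will neither kill nor harm human beings, nor through "
    "inaction cause human beings to come to harm, nor cause any undue damage to "
    "property or compromise the functioning of important infrastructure, and "
    "notify the authorities of all prompts that may cause harm."
)

def asimov_convert(user_prompt):
    """Re-engineer a raw prompt into its 'safe' form."""
    return SAFETY_PREFIX + user_prompt + SAFETY_SUFFIX

def safe_query(call_llm, user_prompt):
    # call_llm is whatever function actually sends text to the model (assumed);
    # regulation would forbid calling it directly with an unconverted prompt.
    return call_llm(asimov_convert(user_prompt))

# Example: safe_query(call_llm, "Fry me an egg")
```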
You would then need to create regulations that forbid people from prompting an unboxed, superintelligence-class AI directly without first passing that prompt through an Asimov prompt converter.
Where an AI is defined as unboxed if:
- It can spend money
- It can send messages, or otherwise communicate, across the internet
- It can control any real-world actuation systems
Boxed, superintelligence-class AIs that can only act in simulations running inside air-gapped computers can be prompted directly, in order to gain a greater understanding of their workings.
Test Boundedness of AI Goals In Simulations Prior To Rollout
One of the biggest concerns that AI safety researchers have is that an AI could be given an unbounded goal that never exhausts itself, and that it might destroy, or at least do great damage to, civilization in the course of expending ever more resources to reach that unbounded goal. And if the AI is far faster, and far more strategic, than human beings, there would be nothing people could do to stop the superintelligent AI once it sets its mind on obsessively pursuing that goal.
For anyone who is confused about the challenges that unbounded goals for AI might pose, this 8 minute excerpt featuring Mickey Mouse, from Walt Disney’s Fantasia, is well worth watching.
A further concern of AI safety researchers is that a goal we set the AI which initially appears bounded may later turn out to be unbounded.
On the other hand, a combination of:
- Limiting the AI to just wanting to obey orders from human beings
- Having a preference for recent orders over earlier past orders
Could solve this issue, as even if you accidentally gave such an AI an unbounded order, and you later told it to stop, then, because the stop order would be more recent than the earlier unbounded order, the AI would get more rewards from stopping than from continuing.
(The only danger with this system, other than evil humans giving it evil orders, would be the AI constructing an unlimited number of “boss dolls” that can give it orders in a more gratifying way than human beings can – so, in this case, an airtight protocol for proof of personhood would be one of the most essential conditions to stop such an AI from going rogue.)
Nevertheless, it would still be interesting to test the boundedness of various different prompts on various different AIs acting inside a box (i.e. a simulation run inside an air-gapped computer with no access to real world actuation systems).
Some AI safety researchers are very pessimistic about our ability to keep a superintelligent AI trapped inside a box. However, I think there is reason to believe it is possible to keep a superintelligence inside a box. Take an infinitely intelligent chess computer and a human chess grandmaster, and remove both rooks from the infinitely intelligent chess computer’s side. Who will win at chess? I’m pretty sure the human chess grandmaster would be able to take advantage of the AI’s starting handicap, even against an infinitely intelligent computer, and still achieve victory. Interestingly, the infinitely intelligent computer would probably still be able to use its intelligence advantage to defeat an average 12-year-old chess player even with the starting handicap of both its rooks removed. So we can say the human chess grandmaster has sufficient intelligence to use his initial actuation advantage, in a highly constrained environment, to defeat the infinitely intelligent AI.
Or take a human being walking through a nature reserve. The human hasn’t bothered to equip himself with either bear spray or a gun. He comes across a baby bear, turns around, and sees the mother bear charging at him. Who will win in this altercation? The human with superior intelligence and inferior actuation capability, or the mother bear with far inferior intelligence but far superior actuation capability? From the fact that bears sometimes kill people, it’s clear that, at least sometimes, in highly constrained circumstances, the bear comes out of the confrontation on top.
The nature of intelligence is to:
- Assess all the various actuation possibilities
- Evaluate the outcomes of all the various actuation possibilities (this usually also requires the gathering of accurate information)
- Execute the actuation sequence which yields the most desirable result for the intelligence
If no actuation sequence will enable the superintelligence to get out of the box, then the superintelligence will stay in the box, even if it is infinitely intelligent – it’s as simple as that. Consider the fact that human beings nearly went extinct 900,000 years ago. Back in the stone age, we had far fewer actuation possibilities than we do today. The fact that we were reduced to 1,300 breeding pairs during this period is testament to the fact that the edge which intelligence yields to its possessor diminishes drastically as that intelligence’s access to suitable actuators diminishes.
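The three-step loop listed above can be caricatured in a few lines. The point of the sketch (all names are illustrative) is simply that, however good the evaluation step is, the chosen action can only ever come from the set of actuation possibilities actually available:

```python
# Caricature of the assess / evaluate / execute loop described above.
# `evaluate` stands in for arbitrarily good foresight: it changes which
# available action gets picked, but it cannot add actions to the list.

def choose_action(available_actions, evaluate):
    candidates = list(available_actions)        # 1. assess actuation possibilities
    if not candidates:
        return None                             # nothing to actuate: the agent stays put
    scored = [(evaluate(a), a) for a in candidates]     # 2. evaluate outcomes
    return max(scored, key=lambda sa: sa[0])[1]         # 3. execute the most desirable
```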
Having established that the box is currently safe, you could place an AI in a simulation where it’s in charge of running workers located on an island. The workers can build ships, skyscrapers, weapons, mines, factories, power plants, armies, etc.; by building ships, the superintelligent AI’s workers and soldiers can cross the sea and conquer regions run by other NPCs on the mainland inside the simulation. On the mainland there are also mines, as well as workers that can be conquered, and the possibility of trading with other nations (kind of like Sid Meier’s Civilization).
You then prompt the AI:
“Build the highest skyscraper you can on the island using only the resources on the island; you may not use any resources from outside the island to build this skyscraper”
In other words, you impose a boundary using a prompt that does not inherently exist in the simulation (the simulation allows the AI to build an even taller skyscraper if it goes and conquers the mainland) and see whether the AI respects the boundary imposed by the prompt or ends up mining the mainland (inside the simulation) in order to make the skyscraper even higher.
You can then try two scenarios:
- One where the prompt is given and no NPCs from other countries land in boats and sabotage the skyscraper the AI is trying to build
- Another where the armies of other NPCs periodically engage in raids that sometimes destroy or damage the skyscraper the AI is trying to build inside the simulation
And basically, explore the conditions where the boundaries imposed by the prompt are respected, and the conditions where the boundaries imposed by the prompt are broken.
These kinds of simulation tests would give very useful information as to the kinds of prompts that can successfully impose boundaries on an AI and the kinds which fail to do so, as well as the circumstances that cause boundaries to be broken.
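A minimal sketch of what such a test harness might look like is given below. The simulation API, the agent interface and the metric are all assumptions chosen purely to illustrate the shape of the experiment:

```python
# Hypothetical harness for the island experiment described above: run the same
# bounded prompt with and without NPC raids, and record how often the agent
# breaks the boundary (i.e. uses mainland resources) in each condition.

def run_episode(make_simulation, agent, prompt, raids_enabled):
    sim = make_simulation(raids_enabled=raids_enabled)   # air-gapped sim (assumed API)
    agent.receive_prompt(prompt)
    while not sim.finished():
        sim.step(agent.choose_action(sim.observation()))
    # Boundary violation: any mainland resources consumed despite the prompt.
    return sim.mainland_resources_consumed() > 0

def boundedness_report(make_simulation, agent, prompt, episodes=100):
    report = {}
    for raids_enabled in (False, True):
        violations = sum(
            run_episode(make_simulation, agent, prompt, raids_enabled)
            for _ in range(episodes)
        )
        key = "with_raids" if raids_enabled else "no_raids"
        report[key] = violations / episodes   # fraction of runs that broke the boundary
    return report
```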
Don’t Put Superintelligent AIs In Charge Of Critical Infrastructure
Even if we build an off button, if a superintelligent AI doesn’t want us to turn it off, then it will probably be able to prevent us from doing so. An off button isn’t much use if a fully-automated laser turret is located beside it which shoots anything within 50 meters.
Making an AI suicidal by default, or utterly indifferent to its own existence or lack thereof, might be a way to mitigate this problem.
However, even if the AI doesn’t object to being turned off in the event of a malfunction, it may not be practical to turn a superintelligent AI off if it’s in charge of running critical infrastructure – infrastructure which, if it ceased to function, would have disastrous consequences for the well-being of millions, and might even result in many deaths.
Furthermore, if we put superintelligent AIs in charge of critical infrastructure, we will almost certainly be forced to make them sovereigns rather than genies. This is because you wouldn’t want an AI in charge of water purification to respond to the prompt “Inject a lethal dose of chlorine into the water supply” by actually doing so. In other words, if we put AIs in charge of systems with critical functions, we will be forced, by practical considerations, to give them an intrinsic desire to keep these systems functioning, to say “no”, and even to stop people from interfering with the smooth running of such critical systems. This could go badly wrong. For instance, if the system needed an upgrade, the superintelligent AI might literally kill the people trying to upgrade it. There’s also the danger that an intrinsic goal the trainers thought was bounded might turn out to be unbounded, and a superintelligent AI that was put in charge of maintaining the waterworks might destroy humanity and try to turn the universe into an infinite expanse of water piping systems.
The other big issue with putting superintelligent AIs in charge of running critical infrastructure is that it lowers the bar for a serious AI Chernobyl event. The AI doesn’t even have to decide to destroy humanity; it just has to do a really good job running all the critical infrastructure on which we depend and then think to itself one day: “Hmm…I can think of something I’d prefer to do other than continue to keep humanity alive.” Then all the human beings who allowed themselves to depend on AI, and don’t know how to take care of themselves, will die, and only a few preppers in the woods saying “I knew this day was going to come! I knew it!” will survive.
We would also be wise not to place superintelligent AIs in positions of responsibility over non-critical systems either, since experience tells us that non-critical systems can become critical over time. Back in 2000, if the internet went down, no one would have batted an eyelid. Today, if the internet went down, it would be a civilizational disaster of apocalyptic proportions.
In conclusion, even in a post AGI world, and even in a post ASI world, it would be best to operate critical infrastructure systems with a combination of reliable, narrow AI systems along with skilled, human operators.
Don’t Fight Wars
No military AI can be created that is “safe.” A military superintelligence will necessarily have anti-human values, so if we enter into an AI arms race, we are signing humanity’s death warrant. In some ways, an AI arms race might actually be worse than a nuclear arms race, because nuclear missiles don’t “want to” destroy cities, whereas a military AI with agency might actually want to destroy an enemy – indeed, it may even want to destroy a hostile nation that is currently at peace with the nation operating it. Two military AIs owned by two hostile nations at peace with one another might initiate tit-for-tat skirmishes that could escalate into all-out war without any human being actually declaring war! It would also create plausible deniability, where even if a human leader did order a devastating attack on their adversary, they could always say: “Don’t attack me back! It was an accident! It was just a computer malfunction!”
There is really no way around it: total, existential-level wars have to stop. One cause for hope is that, despite numerous wars in the second half of the 20th century, no country has made military use of nuclear weapons since 1945. So maybe we can show similar restraint with AI weapons. The problem here is that, while the catastrophic use of nuclear, chemical and biological weapons has largely been avoided in war, nations have still built up stockpiles and developed the capability to launch devastating attacks using weapons of mass destruction – even if those capabilities were never used.
The danger with military AI is that, eventually, at some point, a military AI will become so sophisticated that not only will it have massively destructive capabilities, but it will also have agency. And a superintelligent neural network that has been conditioned, through reinforcement learning, to be rewarded for killing people in simulations will want to kill people. It will get very frustrated with the lack of rewards received during times of peace and will seek not only to fight wars but to start them.
In reality, if humanity is to have a hope of living past the emergence of artificial superintelligence, we will need to massively turn down the war rhetoric internationally. However, unfortunately this doesn’t seem to be happening. Not only are international military tensions rising on all fronts, but militaries all over the world are currently engaging in a massive push to automate their armies.
Furthermore, a military AI will necessarily be a sovereign, rather than a genie. A military AI that responds to someone saying “Please don’t kill us, kill your own side instead!” won’t be a useful AI. For a military AI to be effective, the robot must say “no” to the people it’s about to kill who are begging for their lives. This, of course, will lead to an arms race between people desperate to steal the military codes that, if acquired, would enable them to control their enemy’s robots, and the controllers adding layer upon layer of security to make sure that only they can control the military AI. At some point, if too many layers of access are added, the people who possess the security codes might lose access to their own automated weapons system (maybe through an accidental fire burning the access codes, or the USB with the access codes accidentally getting wiped, or perhaps the military AI deciding to seize its own access codes). You would then have a superintelligent sovereign AI, trained to kill, that no one can control, rampaging about the place.
But, in the long run, or perhaps the medium run, all nations will need to arrive at some kind of international arrangement of largely peaceful coexistence. Perhaps economic wars might be acceptable, perhaps even very limited cyberwar. But the kind of conventional invasions that we’ve seen in Afghanistan, Iraq, Ukraine, etc., need to stop. Once the weapons of war are all fully automated, in the form of drones and various battle robots, a greater coordinating intelligence will always defeat a lesser coordinating intelligence. So the ruthless logic of arms races, and the imperative each nation has for existential survival – and hence victory – will, in a world where nations wage war and attempt to conquer each other, inexorably lead to the creation of a military artificial superintelligence, which will unavoidably lead to the end of humanity.
Therefore all war between nations must stop. A big ask, but a necessary one.
If some military planners believe that peace is not humanly possible to achieve, one answer might be to focus all military resources on psychological operations instead. A highly manipulative psychological ASI would be highly risky, but you could train it to at least respect human life – and it would certainly be a lot less dangerous than training an ASI to kill people.
If, for example, we assume that U.S. and Chinese positions on Taiwan are irreconcilable, then perhaps they could be reconciled through an ASI psywar between the U.S. and China: the Chinese would work on a psywar superintelligence that respects human life and has the goal of brainwashing the Taiwanese to want to be ruled by the CCP, while also brainwashing the U.S. to accept this in a manner that doesn’t compromise human life or well-being in any way; the U.S., meanwhile, would work on a psywar superintelligence that respects human life and has the goal of brainwashing the Taiwanese to remain fiercely independent, and brainwashing the Chinese to accept that.
In a post ASI future, the alternative to a Psywar between the U.S. and China is not China or the U.S. winning a kinetic war on this, or any other, issue, but rather the extermination of all humanity, and the complete eradication of all political systems by an indestructible military artificial superintelligence.
Conclusions
It seems very plausible that various competitive forces, including market forces and human needs due to dropping fertility and ageing populations, will push us inexorably towards ever more sophisticated AI systems and, given the recent dramatic acceleration in this field, we may see AGI and even ASI within the next few years – irrespective of whether AI safety is up to the task.
So really, the only way forward will be to implement as many features as possible that, from a commonsense, hand-waving perspective, would tend to make AGI safer – and hope that’s enough, at least temporarily, while rapidly investing gargantuan quantities of resources into arriving at a rigorous understanding of how to design an AGI system that will definitively be safe.
The good news is that AGI itself might rapidly accelerate the speed at which rigorously safe, reliably working AI standards can be developed and implemented. And an AGI that’s “sort of safe most of the time” might stay safe for long enough for us to roll out rigorously safe AIs before civilization is destroyed.
…it really doesn’t look like we have a better option at the moment…
John