No Separation from Hyperexistential Risk

From Arbital:

A principle of AI alignment that does not seem reducible to other principles is “The AGI design should be widely separated in the design space from any design that would constitute a hyperexistential risk”. A hyperexistential risk is a “fate worse than death”, that is, any AGI whose outcome is worse than quickly killing everyone and filling the universe with paperclips.

I agree that this is a desirable quality for any design, or any approach to creating a design, to have. However, I think it is impossible to achieve while preserving the possibility of an ‘existential win’, i.e. an outcome roughly as good as a hyperexistential catastrophe is bad. In order to create the possibility of a Very Good Outcome, your AGI must understand what humans value in some detail. The author of this page* provides specifics which they think will move us further away from Very Bad Outcomes, but I don’t agree.

This consideration weighing against general value learning of true human values might not apply to e.g. a Task AGI that was learning inductively from human-labeled examples, if the labeling humans were not trying to identify or distinguish within “dead or worse” and just assigned all such cases the same “bad” label. There are still subtleties to worry about in a case like that[…] But even on the first step of “use the same label for death and worse-than-death as events to be avoided, likewise all varieties of bad fates better than death as a type of consequence to notice and describe to human operators”, it seems like we would have moved substantially further away in the design space from hyperexistential catastrophe.

I find it hard to picture a method of learning what humans value that does not produce information about what they disvalue in equal supply, and this is no exception. Value is for the most part a relative measure rather than an absolute one: to determine whether I value eating a cheeseburger, you must compare the state of eating-a-cheeseburger to the state of not-eating-a-cheeseburger; to assess whether I value not-being-in-pain, you must compare it to being-in-pain; to determine whether I value existence, you must compare it to nonexistence. To the extent that we are not labeling the distinction between fates worse than death and death itself, the learner is failing to understand what we value. And an intelligent sign-flipped learner, if we gave it many fine-grained labels for “things we prefer to death by X much”, would at minimum have the data needed to cause a (weakly-hyper)-existential catastrophe: a world in which we did not die, but never had any of the things we rated as better than death. Unless we have some means of preventing the learner from making such inferences or storing the information (so, call the SCP Foundation Antimemetics Division?), this suggestion would not help except against a very stupid agent.
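
To make the sign-flip worry concrete, here is a toy sketch (entirely my own construction; the outcome names and numeric scores are invented for illustration and are not from the Arbital page). The point is only that fine-grained “better than death by X” labels give a wrong-signed optimizer a gradient pointing straight at the least-good survivable world, while a coarse two-bucket labeling at least withholds that gradient:

```python
# A toy illustration (my own construction, not from the Arbital page; the
# outcome names and scores are invented) of what fine-grained labels hand to
# a learner whose reward ends up with the wrong sign.

# Fine-grained labels: death pegged at 0, survivable outcomes scored by how
# much better than death we judge them to be.
fine_grained = {
    "flourishing": 100,
    "ordinary_life": 60,
    "alive_but_stripped_of_everything_we_value": 1,
    "death": 0,
}

# A two-bucket simplification of the quoted idea: death shares one label,
# everything better than death shares another, so there is no gradient
# *within* the survivable outcomes for a flipped sign to exploit.
coarse = {outcome: ("bad" if score <= 0 else "acceptable")
          for outcome, score in fine_grained.items()}

def best(reward):
    """What a correctly-signed optimizer steers toward."""
    return max(reward, key=reward.get)

def sign_flipped_best(reward):
    """What the same optimizer steers toward with the sign reversed."""
    return min(reward, key=reward.get)

survivable = {k: v for k, v in fine_grained.items() if v > 0}

print(best(fine_grained))             # flourishing
print(sign_flipped_best(survivable))  # the least-good world in which we live:
                                      # a (weakly-hyper)-existential catastrophe
print({coarse[k] for k in survivable})  # {'acceptable'}: the coarse labels
                                        # alone do not rank these worlds apart
```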

Of course, maybe that’s the point. It seems obvious to me that a very stupid agent does not pose a hyperexistential risk because it can’t build up a model detailed enough to do more than existential harm, but “obvious” is a word to mistrust. Could I make the leap and infer the reversal property? I believe I could. Could one of the senders of That Alien Message, who are unusually stupid for humans but have all the knowledge of their ancestors from birth? I’m fairly confident they could, but not certain. Could one of them cause us hyperexistential harm? Yes, on that I am certain. That adds up to a fairly small, but nonempty, segment of probability space where this would be useful.

But does that add up to the approach being worthwhile?

* Presumably this is Eliezer Yudkowsky, since I don’t believe anyone else wrote anything on Arbital after its “official shutdown”, which was well before this page was created. But I’m not certain.

Benignness is Bottomless

If you are not interested in AI Safety, this may bore you. If you consider your sense of mental self fragile, this may damage it. This is basically a callout post of Paul Christiano for being ‘not paranoid enough’. Warnings end.

I find ALBA and Benign Model-Free AI hopelessly optimistic. My objection has several parts, but the crux starts very early in the description:

Given a benign agent H, reward learning allows us to construct a reward function r that can be used to train a weaker benign agent A. If our training process is robust, the resulting agent A will remain benign off of the training distribution (though it may be incompetent off of the training distribution).
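
Read literally, the quoted setup has roughly this shape. What follows is a deliberately crude, runnable sketch under my own reading of that one sentence; it is emphatically not Paul Christiano’s actual ALBA construction, and every name and number in it is a placeholder I invented:

```python
# A crude schematic of "reward learning from a benign H, then training a
# weaker agent A"; not ALBA itself, just the shape of the dependency that my
# objection targets.
import random

STATES = list(range(10))

class H:
    """Stands in for the supposedly benign overseer."""
    def judge(self, state):
        return float(state)  # this H simply prefers higher-numbered states

def learn_reward(h, queries):
    """'Reward learning': record H's judgments, defaulting to 0 for anything
    outside the training distribution."""
    table = {q: h.judge(q) for q in queries}
    return lambda s: table.get(s, 0.0)

def distill_agent(r, candidate_states, samples=100, seed=0):
    """'Training A': a weak agent that just picks the best state it has seen
    according to the learned reward r."""
    rng = random.Random(seed)
    seen = [rng.choice(candidate_states) for _ in range(samples)]
    best = max(seen, key=r)
    return lambda: best

h = H()
r = learn_reward(h, queries=STATES[:5])        # training distribution: 0-4
A = distill_agent(r, candidate_states=STATES)  # deployed over all of 0-9
print(A())  # A mirrors H's judgments on-distribution and is merely
            # incompetent off it; everything downstream inherits whatever
            # H actually is.
```

My objection is not to this plumbing but to the H at the top of it.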

Specifically, I claim that no such agent H yet exists, and furthermore that if you had one you would already have solved most of value alignment. This is a fairly bold claim, but I am quite confident in at least the first clause.

Obviously the H is intended to stand for Human, and this smuggles in the assumption that an (educated, intelligent, careful) human is benign. I can demonstrate that assumption to be false via thought experiment.

Experiment 1: Take a human (Sam). Make a perfect uploaded copy (Sim). Run Sim very fast for a very long time in isolation, working on some problem.

Sim will undergo value drift. Some kinds of value drift are self-reinforcing, so Sim could drift arbitrarily far within the bounds of what a human mind could in theory value. Given that Sim is run long enough, pseudorandom value drift will eventually hit one of these self-reinforcing patches and then drift an arbitrarily large distance in an arbitrary direction.
It seems obvious from this example that Sim is eventually malign.
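
The “run it long enough and it will hit a self-reinforcing patch” step can be made vivid with a toy model (again entirely my own construction, not a claim about how real minds drift): treat Sim’s values as an unbiased random walk, and treat self-reinforcing drift as a region which, once touched, adds a persistent bias.

```python
# Toy model of Experiment 1: an unbiased random walk over "value space" that
# acquires a persistent, self-reinforcing bias the first time it touches a
# patch at distance `patch` from the start. Because an unbiased 1-D walk
# eventually visits any fixed region, longer runs make the hit inevitable.
import random

def run_sim(steps, patch=20.0, bias=0.2, seed=0):
    rng = random.Random(seed)
    value, locked = 0.0, False
    for _ in range(steps):
        step = rng.gauss(0, 1)                    # ordinary pseudorandom drift
        if locked:
            step += bias if value > 0 else -bias  # self-reinforcing drift
        value += step
        if not locked and abs(value) >= patch:    # first contact with a patch
            locked = True
    return locked, value

for steps in (100, 1_000, 10_000):
    locked, value = run_sim(steps)
    print(f"{steps:>6} steps: hit a self-reinforcing patch: {locked}, "
          f"final drift: {value:.1f}")
# The longer Sim runs, the more certain the hit; once locked, the expected
# drift grows roughly linearly with the remaining runtime.
```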

Experiment 2: Make another perfect copy of Sam (Som), and hold it “asleep”, frozen and ready to be copied further without changes. Then repeat this process indefinitely: make a copy of Som (Sem), give him short written instructions (written by Sam or anyone else), and run Sem for one hour. By the end of the hour, Sem writes out a new set of instructions and state in the same format. Shut off Sem at the end of the hour and pass the written instructions to the next instance, which will again be copied off the original Som. (If there is a problem and a Sem does not produce an instruction set, start over from the original instructions; deterministic loops are a potential problem but unimportant for purposes of this argument.)
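
Written out as a loop, the protocol looks like this (a sketch under my reading of the setup; run_for_one_hour and Snapshot are stand-ins I made up, since we obviously cannot run an upload today):

```python
# Experiment 2 written out as a loop: only the plain-text note carries state
# between instances; every instance is a fresh copy of the frozen Som.
class Snapshot:
    """Stand-in for the stored, unchanging copy of Sam (Som)."""
    def copy(self):
        return Snapshot()

def sem_process(som, initial_instructions, run_for_one_hour, max_rounds):
    instructions = initial_instructions
    for _ in range(max_rounds):
        sem = som.copy()                          # fresh instance copied off Som
        output = run_for_one_hour(sem, instructions)
        # sem is discarded here; nothing persists except the written output
        if output is None:                        # no instruction set produced
            instructions = initial_instructions   # restart from the beginning
        else:
            instructions = output                 # hand the text to the next Sem
    return instructions

# Toy usage: here the "hour of work" just increments a counter in the note.
result = sem_process(Snapshot(), "count: 0",
                     run_for_one_hour=lambda sem, text:
                         "count: " + str(int(text.split(": ")[1]) + 1),
                     max_rounds=5)
print(result)  # "count: 5"; the written note is the only channel for drift
```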

Again, this can result in significant drift. Assume for a moment that this process could produce arbitrary plain-text input to be read by a new Sem. Somewhere in the space of plain-text inputs could exist a tailored, utterly convincing argument that the one true good in the universe is the construction of paperclips; one which exploits human fallibility in general, the fallibilities of Sam in particular, biases likely to be present in Som because he is a stored copy, and biases likely to be peculiar to a short-lived Sem who knows he will be shut down within one subjective hour. This could cause significant value drift even in short timeboxes, and once it began it could be self-reinforcing just as easily as the drift affecting Sim.
Getting to the “golden master key” argument for any given position, starting from a sane and normal starting point, is obviously quite hard. Not impossible, though, and while the difficulty of hitting any one master-key argument is high, there is a very large set of potential “locks”, any one of which poses the same problem. So long as each session has some nonzero chance of producing such an input, running Sem loops for long enough makes the outcome all but certain: some Sem will eventually fall into a lock and become malign.

Experiment 3: Instead of just Sam, use a number of people, placed in groups that recombine regularly across different parts of a massively parallel system of simulations. As with Sem, each group uses entirely plain-text I/O and is timeboxed to one hour per session. Call the Som-instance in one of these groups Sum, who works with Diffy, Prada, Facton, and so on.

Now, rather than a lock that is a value-distorting plain-text input for a single Sem, we need one for the entire group: it must be able to corrupt one member via reading and enough of the rest via persuasion. This is clearly a harder problem, but there is also more attack surface; only one of the participants in the group, perhaps the most charismatic, needs to propagate the self-reinforcing state. The group can also drift faster once it starts, with more brainpower available to be directed toward the drift. On balance, the group setup seems likely to be safe for much longer, but how much longer? Exponentially? Quadratically?

What I am conveying here is that we are patching holes in the basic framework while the downside risks play the game of Nearest Unblocked Strategy. Relying on a human does not get you benignness; humans seem to be benign only because, in the environments we intuitively evaluate them in, they are confined to a very normal set of possible input states and stimuli. An agent which is benign only as long as it is never exposed to an edge case is malign, and examples like these convince me thoroughly that a human subjected to extreme circumstances is malign in the same sense that the universal prior is malign.

This, then, is my point: we have no examples of benign agents, we do not have enough diversity of environments in which to observe agents to realistically conclude that any agent is benign, and so we have nowhere for a hierarchy of benignness to bottom out. The first benign agent will be a Friendly AI – not necessarily a particularly capable one – and any approach predicated on enhancing a benign agent to higher capability in order to generate an FAI is in some sense affirming the consequent.