Benignness is Bottomless

If you are not interested in AI Safety, this may bore you. If you consider your sense of mental self fragile, this may damage it. This is basically a callout post of Paul Christiano for being ‘not paranoid enough’. Warnings end.

I find ALBA and Benign Model-Free AI hopelessly optimistic. My objection has several parts, but the crux starts very early in the description:

Given a benign agent H, reward learning allows us to construct a reward function r that can be used to train a weaker benign agent A. If our training process is robust, the resulting agent A will remain benign off of the training distribution (though it may be incompetent off of the training distribution).

Specifically, I claim that no agent H yet exists, and furthermore that if you had such an agent H you would already have solved most of value alignment. This is fairly bold, but I am quite confident in at least the first clause.

Obviously the H is intended to stand for Human, and it smuggles in the assumption that an (educated, intelligent, careful) human is benign. I can demonstrate this to be false by thought experiment.

Experiment 1: Take a human (Sam). Make a perfect uploaded copy (Sim). Run Sim very fast for a very long time in isolation, working on some problem.

Sim will undergo value drift. Some kinds of value drift are self-reinforcing, so Sim could drift arbitrarily far within the bounds of what a human mind could in theory value. Run long enough, pseudorandom value drift will eventually hit one of these self-reinforcing patches and carry Sim an arbitrarily large distance in an arbitrary direction.
It seems obvious from this example that Sim is eventually malign.

Experiment 2: Make another perfect copy of Sam (Som) and hold it “asleep”, unchanging and ready to be copied further without alteration. Then repeat the following process indefinitely: make a copy of Som (Sem), give him short written instructions (written by Sam or anyone else), and run Sem for one hour. By the end of the hour, Sem must produce a new set of instructions and state in the same written format. Shut off Sem at the end of the hour and pass the written instructions to the next instance, which is copied afresh from the original Som. (If something goes wrong and a Sem does not produce an instruction set, restart from the original instructions; deterministic loops are a potential problem but unimportant for the purposes of this argument.)
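The loop above can be sketched in code. Here `run_sem` is a hypothetical stand-in for executing a fresh copy of Som for one subjective hour; every name is illustrative rather than from the post, and the stand-in merely appends a marker so the evolving plain-text state is visible.

```python
# Sketch of the Sem loop: each episode starts from the same frozen
# snapshot (Som) and receives only the plain-text state left behind
# by the previous episode. All names here are hypothetical.

INITIAL_INSTRUCTIONS = "Work on the problem; leave notes for your successor."

def run_sem(snapshot, instructions):
    """Stand-in for running a fresh copy of Som for one hour.

    Returns the plain-text instructions/state for the next episode,
    or None if the episode failed to produce an instruction set.
    """
    # A real implementation would execute an upload; here we just
    # append a marker so the drift of the text state is visible.
    return instructions + " | hour elapsed"

def sem_loop(snapshot, episodes):
    """Run `episodes` one-hour sessions, threading text state between them."""
    state = INITIAL_INSTRUCTIONS
    for _ in range(episodes):
        next_state = run_sem(snapshot, state)
        if next_state is None:
            # No instruction set produced: restart from the original.
            next_state = INITIAL_INSTRUCTIONS
        state = next_state
    return state
```

The point of the sketch is that the only channel between episodes is the text state, which is exactly the channel through which a self-reinforcing drift would have to propagate.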

Again, this can result in significant drift. Assume for a moment that this process could produce arbitrary plain text input to be read by a new Sem. Among the space of plain text inputs could exist a tailored, utterly convincing argument why the one true good in the universe is the construction of paperclips; one which exploits human fallibility, the fallibilities of Sam in particular, biases likely to be present in Som because he is a stored copy, and biases likely to be peculiar to a short-lived Sem that knows it will be shut down within one hour subjective. This could cause significant value drift even in short timeboxes, and once it began could be self-reinforcing just as easily as the problems with Sim.
Getting to the “golden master key” argument for any position, starting from a sane and normal starting point, is obviously quite hard. It is not impossible, though, and while the difficulty of hitting any one master-key argument is high, there is a very large set of potential “locks”, any of which poses the same problem. If we ran Sem loops for an arbitrary amount of time, Sem would eventually fall into a lock and become malign.

Experiment 3: Instead of just Sam, use a number of people, placed in groups and recombined regularly across different parts of a massively parallel system of simulations. As with Sem, I/O is entirely plain text and each session is timeboxed to one hour. Call the Som-instance in one of these groups Sum, who works with Diffy, Prada, Facton, and so on.

Now rather than drifting to a lock which is a value-distorting plain-text input for a single Sem, we need one for the entire group, which must be able to capture one member via reading and enough of the rest via persuasion. This is clearly a harder problem, but there is also more attack surface; only one of the participants in the group, perhaps the most charismatic, needs to propagate the self-reinforcing state. The group can also drift faster, once motivated, with more brainpower that can be directed toward it. On balance it seems likely to be safer for much longer, but how much longer? Exponentially? Quadratically?

What I am conveying here is that we are patching holes in the basic framework, and the downside risks are playing the game of Nearest Unblocked Strategy. Relying on a human is not benign; humans seem to be benign only because they are, in the environment we intuitively evaluate them in, confined to a very normal set of possible input states and stimuli. An agent which is benign only as long as it is never exposed to an edge case is malign, and examples like these convince me thoroughly that a human subjected to extreme circumstances is malign in the same sense that the universal prior is malign.

This, then, is my point: we have no examples of benign agents, we do not have enough diversity of environments in which to observe agents to realistically conclude that an agent is benign, and so there is nowhere for a hierarchy of benignness to bottom out. The first benign agent will be a Friendly AI – not necessarily a particularly capable one – and any approach predicated on enhancing a benign agent to higher capability in order to generate an FAI is in some sense affirming the consequent.

4 thoughts on “Benignness is Bottomless”

  1. For the black-and-white definitions, I agree that there are no benign agents, and it may be significantly harder to produce a perfectly benign human-level agent than to solve alignment.

    I’m hoping for H to be something like “(epsilon, C)-benign,” meaning: if an adversary in class C samples an input, then H’s output is malign with probability at most epsilon (i.e. is epsilon-close in distribution to a benign policy, over the randomness of the adversary and H). Think of something roughly like C=”algorithms you can run in a trillion steps,” though that won’t literally work (since e.g. you can bake an attack into a weak adversary).
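    One way to write this definition down (notation mine, not from the comment, using total variation distance as the notion of closeness in distribution):

```latex
% H is (\epsilon, C)-benign if, for every adversary A in class C
% sampling inputs x, H's output distribution is \epsilon-close in
% total variation to that of some benign policy \pi, over the
% randomness of both A and H.
\[
\forall A \in C:\quad
d_{\mathrm{TV}}\bigl(\, H(x) \mid x \sim A \,,\;\; \pi(x) \mid x \sim A \,\bigr) \;\le\; \epsilon
\]
```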

    I discuss this a bit in reliability amplification, and especially in security amplification. Security amplification is the problem: can we combine a bunch of processes that can’t be attacked in time T to build a new process that can’t be attacked in time 2*T?

    In the comments, Wei Dai points out that my proposed approach probably can’t do security amplification, because vulnerabilities can be embedded inside human faculties that are opaque to introspection. I consider it open whether either reliability amplification or security amplification is possible, with security amplification looking much dicier. I’d still give maybe 50-50 odds on security amplification being possible as originally conceived, with another 25% on some other way to work around this difficulty.

    I assume that you expect security amplification to be either very hard or fundamentally confused/impossible.

    My hoped-for first step is either (a) a procedure for security amplification that plausibly works, or (b) a concrete model of an attack that seems hard to eliminate by any plausible security amplification scheme.

    I don’t really feel like this is an issue of not being paranoid enough. I do think the “not paranoid enough” accusation would feel more plausible if I hadn’t written the security amplification post; given that I wrote that post it seems like the remaining charge is mostly that I have unfounded technical optimism. I see myself as trying to build up an algorithm, starting with intuitive sketches and then gradually either refining those sketches or identifying fatal problems. I feel like many people are very quick to give up in a way that would make it impossible to design algorithms even in mainstream theoretical CS. I would not be at all surprised if benign model-free RL were roughly as difficult as the PCP theorem.

    I don’t quite understand the analogy to nearest-unblocked-strategy. There isn’t really an adversary here; there is a long-running process and we are searching for invariants that would allow us to control the behavior of that process.

    I agree the benign model-free RL post brushes this issue under the rug without acknowledging it.


    • I don’t consider any solution to value alignment sufficiently paranoid unless it works if the environment is adversarial. If we’re going to rely on the alignment of a system, it will eventually be subject to the equivalent of a dictionary attack, and simple patches will make this slower to arrive but ultimately produce effects very similar in character to the nearest unblocked strategy.

      Security amplification as you describe it here also seems implausibly optimistic; I don’t think there is a class C big enough to do useful work with a reasonably small epsilon, for the case of a human H trying to be shown (epsilon, C)-benign. Possibly your definition of malign is a much stronger claim than I’d intuitively use? Simple political slogans and truthful-but-incomplete statistics are entirely capable of eliciting behavior from humans that I’d consider malign.

      Another possible crux is that, as I mentioned offhand in my conclusion, I don’t believe we have enough diversity of environments to observe agents in to realistically conclude that one is benign. Even were a human to be (epsilon, C)-benign for epsilon < 1/2 and C a class large enough to allow useful computation on the part of H, I don't think we could formally prove this or even construct an unrigorous argument which reasonably implied high confidence. We're specifically reasoning about agents whose internals are not inspectable, and I am skeptical that we can evaluate behavior in edge cases reliably enough to draw conclusions. (And for high-dimensional input spaces, most of the space is edge case.)


      • Yes, solutions should work in adversarial environments, though I’d be happy to make some assumptions if they looked legit.

        I definitely don’t think that truthful-but-incomplete statistics can generate malign behavior, though on that point the ball is probably in my court to give more complete definitions.


  2. BTW, I don’t know who you are IRL — feel free to email me if you want to chat 🙂

