Arguing Machines: Human Supervision of Black Box AI Systems That Make Life-Critical Decisions

We consider the paradigm of a black box AI system that makes life-critical decisions. We propose an “arguing machines” framework that pairs the primary AI system with a secondary one that is independently trained to perform the same task. We show that disagreement between the two systems, without any knowledge of underlying system design or operation, is sufficient to arbitrarily improve the accuracy of the overall decision pipeline given human supervision over disagreements.

We demonstrate this system in two applications: (1) an illustrative image classification example and (2) large-scale, real-world semi-autonomous driving data. In the first application, applying the framework to image classification reduces top-5 error on ImageNet from 8.0% to 2.8%. In the second, we apply the framework to Tesla Autopilot and demonstrate the ability to predict 90.4% of system disengagements that human annotators labeled as challenging and needing human supervision. The following is a video on the concept of “arguing machines” applied to Tesla Autopilot “arguing” with an end-to-end neural network on-road in real time:

See the arXiv paper for details. If you find this work useful in your own research, please cite:

  author    = {Lex Fridman and Li Ding and Benedikt Jenik and Bryan Reimer},
  title     = {Arguing Machines: Human Supervision of Black Box AI Systems
               That Make Life-Critical Decisions},
  journal   = {CoRR},
  volume    = {abs/1710.04459},
  year      = {2019},
  url       = {},
  archivePrefix = {arXiv},
  eprint    = {1710.04459}

The basic idea is that the “arguing machines” framework adds a secondary system to a primary “black box” AI system that makes life-critical decisions and uses disagreement between the two as a signal to seek human supervision. We demonstrate that this can be a powerful way to reduce overall system error. The following is a diagram of the concept:

We first illustrate the idea of arguing machines with a toy experiment on ImageNet. The framework works as follows. Suppose there exists a state-of-the-art black-box AI system (the primary system) whose accuracy is high but not perfect. To safely use or test this system, we propose adding a secondary system that can argue with the primary one. When the two systems disagree, we regard the case as difficult and mark it as needing human supervision. The goal of arguing machines is to improve system performance with minimal human effort, especially when the primary system is a black box that provides no information beyond its final output.
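The arbitration step described above needs nothing from either system except its final output. A minimal Python sketch, assuming both systems are exposed as plain callables; `arbitrate`, `Decision`, and the default equality-based `disagree` test are hypothetical names chosen for illustration, not the authors' code:

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

X = TypeVar("X")  # input type (e.g., an image)
Y = TypeVar("Y")  # output type (e.g., a class label)


@dataclass
class Decision(Generic[Y]):
    output: Y          # the primary system's output, passed through unchanged
    needs_human: bool  # True when the two systems disagree


def arbitrate(
    primary: Callable[[X], Y],
    secondary: Callable[[X], Y],
    x: X,
    disagree: Callable[[Y, Y], bool] = lambda a, b: a != b,
) -> Decision[Y]:
    """Query both systems on the same input and flag disagreements.

    Only the final outputs are compared, so `primary` can be a black box.
    (Hypothetical helper for illustration.)
    """
    y_primary = primary(x)
    y_secondary = secondary(x)
    return Decision(output=y_primary, needs_human=disagree(y_primary, y_secondary))


if __name__ == "__main__":
    primary = lambda x: "cat" if x > 0 else "dog"
    secondary = lambda x: "cat" if x > 1 else "dog"
    print(arbitrate(primary, secondary, 0.5))
    # Decision(output='cat', needs_human=True)
```

The `disagree` argument lets the same wrapper handle continuous outputs such as steering angles, where a tolerance replaces exact equality.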

In the case of ImageNet, the task is image classification. We take two popular image recognition architectures, VGGNet and ResNet: a single ResNet-50 model serves as the black-box primary system and a VGG-16 model as the secondary system. Both models are pre-trained, and we obtain their predictions from single center-cropped images in the ImageNet validation set. The arbitrator detects a disagreement when the top-1 predictions of the two systems differ. In this experiment, ResNet and VGG disagree on 11,645 images, which is 23.3% of the 50,000-image validation set. Here are some examples:
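The effect on error can be reasoned about directly. If every flagged image is resolved correctly by a human (an idealizing assumption of this sketch, not necessarily the paper's evaluation protocol), the pipeline can only be wrong where the two systems agree and the primary is also wrong, so its error can never exceed the primary's own. The hypothetical helper below uses NumPy to compute the disagreement rate, the primary's top-5 error, and the pipeline's top-5 error under that assumption from precomputed predictions; it does not load any models.

```python
import numpy as np


def arguing_machines_eval(primary_top5, secondary_top1, labels):
    """Evaluate an arguing-machines pipeline on a labeled validation set.

    primary_top5:   (N, 5) int array, primary model's top-5 class indices,
                    ordered from most to least confident.
    secondary_top1: (N,) int array, secondary model's top-1 class index.
    labels:         (N,) int array, ground-truth class indices.

    Assumption (illustration only): every image flagged for human
    supervision is then classified correctly by the human.
    """
    primary_top5 = np.asarray(primary_top5)
    secondary_top1 = np.asarray(secondary_top1)
    labels = np.asarray(labels)

    # Disagreement: the two systems' top-1 predictions differ.
    flagged = primary_top5[:, 0] != secondary_top1

    # Top-5 correctness of the primary system on its own.
    primary_correct = (primary_top5 == labels[:, None]).any(axis=1)

    # The pipeline errs only where the systems agree AND the primary is wrong.
    pipeline_wrong = ~flagged & ~primary_correct

    return {
        "disagreement_rate": float(flagged.mean()),
        "primary_top5_error": float(1.0 - primary_correct.mean()),
        "pipeline_top5_error": float(pipeline_wrong.mean()),
    }
```

The disagreement rate is the human workload; with the numbers reported above it is 11,645 / 50,000 ≈ 23.3% of the validation set.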

Next, we evaluate our application of arguing machines to semi-autonomous driving. The following is an illustration of the concept. When the steering disagreement between Tesla Autopilot and an end-to-end neural network exceeds a threshold, the case is considered challenging and human supervision is sought:
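One way to realize such a steering disagreement check is to average the absolute difference between the two steering signals over a short trailing window and compare it to a threshold. The sketch below assumes time-aligned steering angles in the same units and a mean-absolute-difference window; `steering_disagreement`, `needs_supervision`, and the `window` default are hypothetical, and the paper's actual disagreement function may be defined differently.

```python
import numpy as np


def steering_disagreement(primary_angles, secondary_angles, window=10):
    """Trailing-window steering disagreement between two systems.

    primary_angles, secondary_angles: 1-D sequences of steering angles
    sampled at the same timestamps (e.g., Autopilot vs. an end-to-end network).
    window: number of samples in the trailing window (hypothetical choice).

    Entry t of the result is the mean absolute difference over samples
    max(0, t - window + 1) .. t.
    """
    diff = np.abs(np.asarray(primary_angles, float) - np.asarray(secondary_angles, float))
    # Prefix sums give each window's sum in O(1).
    csum = np.concatenate(([0.0], np.cumsum(diff)))
    t = np.arange(len(diff))
    start = np.maximum(0, t - window + 1)
    return (csum[t + 1] - csum[start]) / (t + 1 - start)


def needs_supervision(disagreement, threshold):
    """Flag time steps whose disagreement exceeds the threshold."""
    return np.asarray(disagreement) > threshold
```

For example, with `window=2`, angles `[0, 0, 0, 0]` and `[0, 0, 4, 4]` give disagreement `[0, 0, 2, 4]`, which a threshold of 1 flags at the last two time steps. The trailing window keeps the check causal, so it can run in real time.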

For this application, we evaluate the ability of an argument arbitrator (termed the “disagreement function”) to estimate, from a short time window, the likelihood that a transfer of control is initiated, whether by the human driver (termed “human-initiated”) or by the Autopilot system itself (termed “machine-initiated”). Our dataset contains 6,500 disengagements in total, a subset of those in the full MIT-AVT study. All disengagements, whether human-initiated or machine-initiated, are considered representative of cases where the visual characteristics of the scene (e.g., poor lane markings, complex lane merges, lighting variations) were better handled by a human operator. We therefore evaluate the disagreement function by its ability to predict these disengagements, which it does with 90.4% accuracy. The left plot shows the false reject and false accept rates achieved by varying the threshold. The right plot shows an example of a driver-initiated Autopilot disengagement, with a time series of steering decisions and the disagreement magnitude compared to the threshold:
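Varying the threshold trades one error type against the other, which is what the left plot summarizes. The sketch below assumes each time window has a disagreement score and a binary label marking whether a disengagement follows; it reports a miss rate (disengagement windows not flagged) and a false-alarm rate (other windows flagged). How these map onto the paper's false reject and false accept rates, and how windows are labeled, are assumptions here; `sweep_thresholds` is a hypothetical helper.

```python
import numpy as np


def sweep_thresholds(scores, is_disengagement, thresholds):
    """Error trade-off of a disagreement threshold (hypothetical helper).

    scores:           (N,) disagreement score per time window.
    is_disengagement: (N,) bool, True if that window precedes a disengagement.
    thresholds:       iterable of candidate thresholds.

    Returns a list of (threshold, miss_rate, false_alarm_rate) tuples, where
    miss_rate is the fraction of disengagement windows NOT flagged and
    false_alarm_rate is the fraction of other windows that ARE flagged.
    """
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(is_disengagement, dtype=bool)
    results = []
    for th in thresholds:
        flagged = scores > th
        # Guard against empty classes to avoid taking the mean of nothing.
        miss_rate = float(np.mean(~flagged[pos])) if pos.any() else 0.0
        false_alarm_rate = float(np.mean(flagged[~pos])) if (~pos).any() else 0.0
        results.append((float(th), miss_rate, false_alarm_rate))
    return results
```

A low threshold flags nearly everything (few misses, many false alarms) and a high one flags almost nothing; an operating point is chosen between these extremes.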

This is a surprising result: the disagreement carries a strong signal even when it is evaluated with a simple threshold. As the video at the top of this post explains and shows, we have instrumented a Tesla Model S to demonstrate real-time operation of this framework. Here are the various components involved:

To get in touch about this research, contact Lex Fridman via email or connect on Twitter, LinkedIn, Instagram, Facebook, or YouTube.