In my undergraduate programming languages class, our professor used an auto grader and let us use it to test our programs before we turned them in. We could upload our assignment to the auto grader, where it would run our program through a number of tests and tell us our many out of the total passed. Importantly, the auto grader did not tell us what the test cases were.

Our professor offered extra credit to any of us who could extract the test cases from the auto grader. The course had been running for years, and each time a student successfully did this they closed that particular hole. By the time I took the class, the auto grader was pretty robust. No one in my class got the extra credit.

At the time, most of us thought of approaches like directly hacking the grading server, or trying to make our programs write the test cases out so some other place on the server. The auto grader ran on a shared server, and we could access some files stored on that server. It turns out, these approaches are pretty easy to protect against. Presumably you already have an interest in keeping your server secured, so let’s assume directly hacking it is out. Preventing your program from writing rogue files all over the system is easy enough too. Just make sure your programming environment doesn’t have any access to the outside world.

It turns out there’s a more subtle approach: side channels. Side channels, generally, are a way of using something you control to send a message with information your aren’t supposed to be able to reveal. One common side channel is time. Let’s say you were trying to guess someone’s 4 digit PIN and you had a system that let you try as many times as you wanted. Furthermore, let’s assume that it the password checker were really slow. It would check your PIN digit by digit and each digit takes one second to check. Let’s say you try 4567 and after two seconds the system says your PIN was incorrect. You now know the first two digits are 4 and 5. So next try 4577. This time it still takes two seconds. So try 4587. This time it takes three seconds! This means we’ve figured out the first three digits. With at most 9 more tries, you’ll have figured out the password.

There’s a reason login systems aren’t designed like that.

So let’s go back to the auto grader. Can we use side channels to recover the tests? Ideally, we’d want to build a channel using functionality that is critical to the overall system. If we succeed at this, there will be no way for the instructor to patch the hole without turning off the auto grader entirely.

For the auto grader, the key functionality is:

  • Run code that you upload. This code is heavily sandboxed, but it must allow you to answer the questions.
  • Tell you how many tests passed.
  • Allow multiple attempts.

Using just this functionality, how could we recover the tests that were run?

Twenty Questions

It turns out there’s at least one thing we have control over: whether we get a question right or not. Once you know how to correctly answer the question on all inputs, you can basically use this to play twenty questions with the test cases. For example, let’s say you were supposed to write a function that takes a number n and computes n^2. This is easy enough to get:

function square(n) {
  return n * n;
}

Let’s say I upload this to the auto grader and it tells me:

square: 1 out of 1 test passed

That hasn’t told me what number they tested on, but now I can start asking questions. The general approach is to ask a question about the input and then return the right answer if the question is yes and a wrong answer if the question is no. Let’s upload the following program to our imaginary auto grader:

function square(n) {
  if (n == 0) {
    return n * n;
  } else {
    return n + 1;
  }
}

The auto grader responds:

square: 0 out of 1 test passed

So now we’ve learned that the test case is not 0. We could keep trying, trying testing if n == 1, then n == 2 and so on. Eventually this will work, but if the test case were 10,000,000 then it will take awhile. Fortunately, we can ask other questions of our input, such as whether it is less than some given number. If we do this, we can use a binary search to find the test case in almost no time at all.

Things are looking good so far, but in reality an auto grader is not going to give you just one test case.

Finding Multiple Test Cases

What if we had five test cases instead of just one? We could continue asking number by number and we’d eventually find all five. Chances are we’d even find all five in about as much time as it took us to find the first one. Could we find all five faster? Let’s try the binary search approach. We start with this program:

function square(n) {
  if (n > 1) {
    return n * n;
  } else {
    return n + 1;
  }
}

The auto grader tells us:

square: 2 out of 5 tests passed

We’ll need to do this a few more times, so let’s make a table.

n > 1 square: 2 out of 5 tests passed
n > 2 square: 1 out of 5 tests passed

Already we can figure out that one of the tests is 2, since that is the only one number that is greater than 1 but not greater than 2. Let’s continue and see if we can figure out the rest.

n > 4 square: 1 out of 5 tests passed
n > 8 square: 1 out of 5 tests passed
n > 16 square: 0 out of 5 tests passed

So now we know that all of the test cases are less than or equal to 16, and that one of these lies between 9 and 16 inclusive. We can do a binary search in this range to find the exact number and then continue the process until we’ve found all of the test cases.

That’s the gist of the idea, and a good place to wrap up for now. Of course, programs that do simple operations on numbers aren’t too exciting. What about more advanced problems, and more intricate test cases? How can you learn about linked lists or trees using similar techniques? I’ll leave these as an exercise for the reader, or until a later post.