Award-winning Tandon researchers are exposing the flaws underlying AI-generated code

A team of researchers from the NYU Center for Cybersecurity explored GitHub’s newest tool for coders, and found worrying results


Last year, GitHub, a Microsoft subsidiary that provides tools for coders, released an early version of its newest tool, Copilot. The program generates code automatically, using the user's own code as a kind of guiding light, with the promise of drastically reducing the time programmers spend on the laborious work of writing code by hand.

Hammond Pearce, research assistant professor at the NYU Center for Cybersecurity, was immediately intrigued by the news. “Within one hour of the release, I had already added my name to the waitlist.” It turned out to be a good idea: the waitlist soon grew so large that people who signed up just weeks later are still waiting for their chance to test it out.

Pearce was pretty blown away by the technology. It’s capable of generating a great deal of code quickly, mimicking the coder’s style to a highly individualized level. Pearce left it turned on and started tinkering with some personal projects. It was only after a few days that he began to notice something — the automated code was introducing bugs and potential security flaws.

Pearce reached out to some colleagues at NYU, including Ramesh Karri — professor of Electrical and Computer Engineering and the co-founder and co-chair of the NYU Center for Cybersecurity — and Brendan Dolan-Gavitt — assistant professor of Computer Science and Engineering and member of the NYU Center for Cybersecurity. They agreed with his assessment, and came up with the idea to test out exactly how buggy Copilot’s code could be.

What they found was a bit shocking. The researchers, including Baleegh Ahmad, a Ph.D. student in ECE and a member of the center, created 89 scenarios for Copilot to craft code for, resulting in 1,692 programs. When they analyzed the results, they found that roughly 40 percent were compromised in some way. These vulnerabilities arose in a variety of ways: sequence-related errors that could corrupt memory, or code trying to talk to databases that don't exist. Sometimes it would spit out code from the '90s or early 2000s, ancient by the standards of a young field, and no longer used because of known security threats.
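
To picture the kind of database flaw described above, here is a purely hypothetical Python sketch, not drawn from the study's generated programs: a query stitched together from raw user input, the sort of suggestion an experienced reviewer would immediately flag as an injection risk, alongside the safer, parameterized version a careful programmer would write instead.

```python
import sqlite3

def get_user_unsafe(conn: sqlite3.Connection, username: str):
    # The kind of suggestion a code-completion tool might produce:
    # the query is built by pasting user input directly into SQL,
    # so a crafted username can rewrite the query (SQL injection).
    query = "SELECT * FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()

def get_user_safe(conn: sqlite3.Connection, username: str):
    # The reviewed fix: a parameterized query keeps the input as data,
    # never as executable SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()
```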

These errors are all potential infiltration points for hackers and bad actors, and could expose things like passwords and other vital data. The researchers note that for an experienced programmer, this might not be a huge deal: errors like these are easy enough to catch when reviewing the code. But if the tool is placed in the hands of naive or inexperienced coders, they may not be able to pick up on the errors being introduced.

“If you have someone who is not security-conscious writing code, this tool may reinforce that tendency and help them write code that introduces even more errors,” says Dolan-Gavitt. “It can have a multiplication effect.”

They published their research in August of last year. Their paper, “Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions,” was highly cited, and was selected to receive a Distinguished Paper Award at this year's IEEE Symposium on Security and Privacy, one of the most prestigious cybersecurity conferences in the world.

To understand how Copilot works, think about using autocomplete on your email or phone. The device “sees” what you're typing and suggests text to fill in the rest of the sentence or phrase. The more you use the service, the more it can analyze what you're writing, and theoretically spit out more accurate suggestions. Similarly, Copilot suggests blocks of code based on the code the user has already written, personalizing its output to match the style and sophistication of the human coder's syntax. The more information it has on the way you write a program, the more its suggestions mirror your code. If you aren't experienced, if your code tends to be makeshift, or if it's generally problematic, the operation is reduced, to put it bluntly, to a “garbage in, garbage out” exercise.
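
As a concrete, and entirely hypothetical, illustration of that autocomplete analogy: the programmer types only a function name, signature, and docstring, and a tool like Copilot proposes a plausible body by pattern-matching against code it has seen. The snippet below is a sketch of that workflow, not actual Copilot output.

```python
# What the programmer types: just a signature and a docstring.
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    # What a completion tool might propose for the body, inferred
    # from the function name, the docstring, and surrounding style.
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    # The suggestion is only a draft; it is still up to the human
    # to read it and confirm it does what was intended.
    print(celsius_to_fahrenheit(100.0))  # prints 212.0
```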

This was a key issue first identified by the investigators. If you asked Copilot to produce code using the name of a famous Python developer, the results weren't bad. But when they used their own names, they noticed that more flaws were introduced. Because the system's AI didn't have as much access to the code they had previously written (compared to a well-known coder), it was prone to sloppier suggestions.

The researchers raised other concerns with the program as well. It's possible that one could easily plagiarize from coders whose code was part of Copilot's training data. And there are potential ways that private information could be leaked through the program.

The researchers believe their paper was chosen for the award because of the excitement around the program. “A lot of people — even cybersecurity experts — are going to turn this program on and be amazed at how it streamlines their work, and maybe not think about the implications,” says Pearce. “We worked hard to make sure this paper was as accessible as possible, so that even security novices can see where they might go wrong.”

While the researchers feel that Copilot isn’t ready to be left alone in the cockpit just yet, they’re ultimately excited about the potential of the program: As the AI is fed more training data, there is a good chance that the number of errors and vulnerabilities will fall, and more and more users can be educated on how to comb through the generated code looking for mistakes. Even with the potential for bugs, the researchers continue to use the program on their own projects.

“I was on an airplane recently doing some coding, and I didn't have the internet access required for Copilot,” says Dolan-Gavitt. “I was actually surprised at how quickly I'd become used to it. I forgot how painful it was to write everything by hand.”