Stylistic analysis can de-anonymize code, even compiled code


#1

Originally published at: https://boingboing.net/2018/08/10/greenstadt-and-caliskan.html


#2

Any relation to Niko Matsakis? That guy’s on a whole other level when it comes to language design and implementation.


#3

Nope, not buying it. This analysis might be able to distinguish between two coders with high confidence, but there is no international sample-bank to compare against. There isn’t even an international fingerprint registry yet.


#4

I hope this also has worrying implications for the writers of ransomware and other nasties.


#5

That’s why I use Google translate to convert all of my code to Visual Basic and then to Python and finally Fortran.


#7

From their 2017 paper. The suspect pool here is typical, to within an order of magnitude, of the restricted sets of programmers they consider:

Assuming a known set of suspect programmers, such as the employees of a company, and some form of segmentation and grouping by authorship, such as accounts on a version control system, we present a technique which performs stylistic authorship attribution of a collection of partial source code samples written by the same programmer with up to 99% accuracy for a set of 106 suspect programmers.
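To make the "known set of suspects" setup in that quote concrete, here is a toy sketch. It is not the paper's method: the character-trigram profiles, the cosine similarity, the made-up snippets, and the names `profile`, `cosine`, and `attribute` are all assumptions chosen for illustration. The only point is that attribution here means picking the closest match from a small, fixed pool.

```python
from collections import Counter
from math import sqrt


def profile(samples, n=3):
    """Count character n-grams across all of one author's samples."""
    counts = Counter()
    for text in samples:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute(unknown_samples, suspects):
    """Return the suspect whose profile is most similar to the unknown samples.

    `suspects` maps a name to a list of known source snippets (hypothetical data).
    """
    target = profile(unknown_samples)
    return max(suspects, key=lambda name: cosine(target, profile(suspects[name])))


# Made-up snippets for two hypothetical suspects with different spacing habits.
suspects = {
    "alice": ["for (int i = 0; i < n; ++i) { total += v[i]; }",
              "if (ptr == nullptr) { return; }"],
    "bob": ["for(auto&x:v)total+=x;",
            "if(!ptr)return;"],
}
guess = attribute(["for(auto&y:w)sum+=y;"], suspects)
print(guess)  # bob
```

Note that `attribute` always returns someone from the pool, even if the real author is not in it, which is why the size and completeness of the suspect set matters so much.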

Also, they seem to focus on code written for code jams. Personally, I write code entirely differently when I’m in a rush versus when I have time to really plan.


#8

or, do you? :)

personally, this research makes a bit of sense to me.

i feel many people’s code has telltale signals. in c++: the amount of template usage, references vs pointers, return values vs out parameters, heavy inheritance vs composition, ifs, switches, flags, bools.
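a rough sketch of how a few of those tells could be turned into numbers, assuming nothing fancier than regexes over the source text. the feature names, the patterns in `PATTERNS`, and the function `style_features` are all made up for illustration, and the patterns are deliberately crude (the pointer and reference ones will also catch some multiplication and bitwise-and). real stylometry work would parse the code rather than pattern-match it.

```python
import re

# Hypothetical feature names mapped to deliberately crude regexes.
PATTERNS = {
    "templates": r"\btemplate\s*<",
    "references": r"\w\s*&\s*\w+\s*[,)=;]",
    "pointers": r"\w\s*\*\s*\w+\s*[,)=;]",
    "inheritance": r"\b(?:class|struct)\s+\w+\s*:\s*(?:public|protected|private)\b",
    "ifs": r"\bif\s*\(",
    "switches": r"\bswitch\s*\(",
    "bools": r"\bbool\b",
}


def style_features(source):
    """Count each pattern per 100 non-blank lines so files of different sizes compare."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    scale = 100.0 / max(len(lines), 1)
    return {name: len(re.findall(rx, source)) * scale for name, rx in PATTERNS.items()}


# A small made-up C++ sample to run the counter on.
sample = """
template <typename T>
void scale(std::vector<T>& values, const T& factor) {
    for (auto& v : values) { v *= factor; }
}
bool find(const Node* head, int key, Node** out) {
    if (head == nullptr) { return false; }
    switch (key) { default: break; }
    return true;
}
"""
features = style_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.1f}")
```

two people solving the same problem would likely produce different vectors here, and that difference is the kind of signal a classifier could pick up on.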

it’d make sense if some of that leaked out into the binary. there really is very little one right way in programming. or, we’d just have the programs program the programs.

( except my way. my way is the right way. )


#9

I think it also depends heavily on the culture of the company you work for. My current and previous workplaces do heavy code reviews, and over time everyone’s code starts to resemble everyone else’s. I’ve seen it happen: newbies code in an unacceptable way at first, then slowly learn the proper style, patterns, naming conventions, tests, etc.

It’s definitely interesting research, but it seems to me that hacking things out overnight produces a vastly different style than production code.

