A bot has been finding bugs and submitting patches for them, successfully masquerading as a human


#1

Originally published at: https://boingboing.net/2018/10/19/repairnator-is-everywhere.html


#2

Bot generated code… all the way down.


#3

This is genuinely epic, thanks for sharing it


#4

Good thing it’s not finding bugs and exploiting them.

Oops.


#5

Utterly bogus. Wilfully fraudulent.

The fix which their bot applies is to trivally wrap any statements that generate null pointer exceptions with if statement guards.

They provide willfully misleading data to suggest that their results are significant (they managed to commit only five fixes to (probably, but willfully not stated) thousands of build failures They don’t report how many fixes were rejected.

And of the four fixes they did manage to get accepted, two are definitely incorrect, one is probably incorrect, and the fifth would require actual debugging and analysis of related commits to determine whether it is correct, which was not performed by the bot.

In all five cases, the five accepted patches should have been rejected. The code is not fixing exceptions; it’s ignoring them, so in every case, the fixes are doing significant damage in that they may be masking significant underlying problems. It would have been simpler (and more obvious) to just write a bot to disable the tests in which errors occurred.

The researcher reports two flights of tests. “Expedition #1” which analyzed 3,551 replicable build failures, and “failed” (failed presumably to get a single patch accepted, although that’s not explicitly stated); and “Expedition #2” which managed to get five automatically generated fixes accepted. That they do not report how many builds were analyzed for Expedition #2" seems like a wilful omission of data. That they don’t report how many patches were submitted in either expedition is more than suspicious. If, for example, they submitted 200 patches in Expedition #2, and only managed to get five accepted, then the problem with this paper becomes much more obvious.

The researchers claim that there is a bias against machine generated fixes. They don’t provide supporting data, but one assumes that they make this claim based on the fact that all of their “Expedition #1” fixes were identified as bot fixes, and were rejected. When a developer makes a source commit, there is an implicit understanding that they have done due diligence to analyze the root cause. When one knows that the fixes in question have been applied by a bot, it becomes blindingly obvious that this due diligence has not been performed. The problem isn’t that fixes in Expedition #1 were rejected; the problem is that none of the fixes in Expedition #2 should have been accepted.

(I am a professional software developer).


#6

Good analysis, though you might want to tone down this kind of stuff:

The tone there kind of undercuts your otherwise useful, constructive comments.

Would be helpful to know how you determined this. All five patches were accepted by human beings who are presumably also professional software developers and more familiar with the projects in question than you are, so it looks like there’s some difference of opinion here. Given this conflict of opinions, how could I verify your determinations for myself?


#7

I’ll be impressed when it can make light banter with the receptionist.


#8

Absolutely spot on. This comment is the reason I came to the comments section.

I became suspicious as soon I realized that they were being coy about linking to the actual pull requests. Speaking of which, where are they?


#9

Wuh? The pull requests are in a big html list in the middle of the Medium article that the OP is citing. Nothing “coy” about that in the least.

All the fixes were:

  1. verified by humans working as part of the Repairnator project
  2. accepted by humans working for the project to which it was submitted

So “Utterly bogus. Willfully fraudulent.” seems a little strong to me. It seems like an honest attempt at the first versions of what could some day be a useful tool (though I agree these fixes are probably masking problems instead of really fixing them).

I’d like to see how many patches got rejected though.


#10

Thank you, I see the patches now, and they are HORRIBLE.

This is nothing but vandalism. Defending this is equivalent to suggesting that 419 scammers, maybe they think that they can get you some money?

Tests exist to catch bugs. All these patches do is defeat tests so that bugs can make their way into software.


#11

Did you miss the part where the project maintainers actually accepted them?

I mean, yes, they’re not great patches and they’re probably masking other issues, but it’s not “vandalism” if I say “Hey, can I paint a mural here?” and the building’s owner says “yeah, sure!”


#12

Absolutely vile. Pull request spam. They sent thousands of these pull requests, and they learned that if you pretend to be a human a few maintainers will be tricked into accepting terrible patches.

So, now I know, if a pull request comes in it may be a bot attempting to defeat my testing. Good to know.

Since you’re probably not a software developer, let’s explain this a bit. Coal miners used to take canaries into the mines with them, because toxic gasses would affect the canary before the humans, and a dead canary would serve as a warning. This bot “solves” the problem of your canary dying by giving it a gas mask.


#13

I am, actually, but thanks for the condescension.

I understand the problem with these fixes and have already mentioned it twice. I hope you’re better at parsing and analyzing code than statements in the English language.

I agree that these maintainers should probably have rejected the patches. What I disagree with is accusations like “bogus”, “fraudulent”, and “vandalism”.

If you bother to read the material, you will see that:

  1. there were not thousands of pull requests sent (they don’t give a specific number, but the first pass only submitted 15 pull requests, so I’m betting the second pass was in the double digits as well) so not really “spam”
  2. each pull request was analyzed and submitted by a human being working as part of the Repairnator project
  3. the Repairnator guys informed each maintainer that the patch was produced algorithmically

If you bother to think about it for a second, you’ll realize:

  1. human beings are every bit as capable of writing terrible code as this bot
  2. therefore, the onus is always on the project maintainer to decide whether or not a patch is a worthwhile contribution and should be merged

You seem really hung up on what is clearly an early version of what could – with a lot of work! – become a very interesting and useful tool. You’re so focused on what’s wrong that you can’t see the possibilities. This is really a failure of imagination on your part.


#14

To overcome this problem, we have decided early in the project that all Repairnator patches would be proposed under a fake human identity.

They only got pull requests accepted when they started lying.

And these are terrible patches. These are a young developer who does not understand TDD automating his terrible practices, mixing in a dash of fraud.


#15

@Scott_Lindsey’s false claims so far:

  1. “coy about linking to the actual pull requests” – the authors very clearly linked the pull requests
  2. “they sent thousands of these pull requests” – no evidence that there were thousands sent, evidence suggests there were probably fewer than 50 sent
  3. “they only got pull requests accepted when they started lying” – what exactly was the lie in question? patches were submitted with a comment such as “this patch is a proposal fix for a NullPointerException.” Not “this patch is a proposal fix for a NullPointerException and it was definitely written by a human”; if the maintainers want to accept and merge terrible patches, that seems like their prerogative to me
  4. “dash of fraud” – each merge was followed by the following statement: "Full-disclosure, this pull-request was part of a scientific experiment, see https://medium.com/@martin.monperrus/human-competitive-patches-in-automatic-program-repair-with-repairnator-359042e00f6a "
  5. “young developer who does not understand TDD” – https://scholar.google.com/citations?user=dJQf4SYAAAAJ&hl=en may we please see your CV @Scott_Lindsey?

There’s plenty to criticize in this research , but sputtering outrage based on incorrect assumptions and utter falsehoods doesn’t help anything (unless you’re into self-righteous indignation I guess).


#16

Let’s not forget to mention the relevance of these five software projects.

The first patch looks okayish to me.

The second one is for a repo that looks like basic exercises for interview code. The implementation is horribly messy, and the patch is obviously wrong in that guards against a condition in one branch of the block while happily ignoring it in the other. Furthermore, this is something that can never happen if the tree is sound, so there must be another bug somewhere creating the inconsistent situation. The patch is clearly incorrect.

The third one has apparently been pulled from github.

The fourth one is in Eclipse Ditto, and I don’t have enough time to assess it.

The fifth patch looks bogus as it doesn’t store the output of the protected call in the same scope, until you realize that said output is never used, and the method in question is side-effect-free anyway. Again, this project is exercise code.

I glanced at a few other patches that the same account has submitted, and I realized something: If this bot is ever allowed to run on code that handles authentication or authorization, it could perfectly well identify a crash location and replace it with an authentication bypass vulnerability. Not good.

EDIT: also, this only seems to fix NullPointerExceptions, which arguably are a consequence of a design flaw in the language itself. Modern languages try to stay away from nulls for that reason. That kind of things should thus be fixed in the language itself and/or the compiler, not by hammering random projects with pull requests.


#17

I urge you to read the original report carefully, paying attention to the data that was NOT included. They excluded exactly the data you’d want to see for “Expedition #2”, but for some reason found it important to include exactly that data for the failed “Expedition #1”. I can’t help but think that was wilful.

You assume that the undisclosed number of fixes that were submitted in “Expedition 2” was the same as for “Expedition 1” based on an assumption that the number of candidate build breaks were the same between “Expedition 1” and "Expedition. But that number was not disclosed either. How can anyone write that report and not think that both of those figures were important pieces of data that needed to be disclosed? Seriously.

Huge portions of the report deal with the “framework” used to detect candidate defects, and submit automatically generated fixes. Build automation is pretty routine stuff. What doesn’t get discussed at all is the nature of the fixes they try to apply. It’s only when you review the commits that you realize the full extent of the awfulness of the automatic fixes they apply. I can’t help thinking that wasn’t an accident…

The magnitude of the awfulness of what they actually did should not be understated.

No. Really. This is the result of having the imagination to be interested enough in their claims to find out what they were actually doing, and being disappointed by what I discovered.

The problem is that it’s a dreadful hack masquerading as an interesting and useful tool. And that there is absolutely no indication that anything interesting and useful actually did happen. And it’s not like there aren’t great tools out there that already to perform automated analysis of null pointer exceptions (e,.g @Nullable/@NotNull annotations in Java and Code Contracts in C#). Or that writing build automation tools is rocket science so difficult that the framework needs more discussion than the discussion of what types of fixes were automatically applied.

The four fixes that were disclosed are applications of the following production rule:

    int value = someVariable.member; <-- throws a null pointer exception in testing

is replaced with:

if (someVariable !=  null)
 {
      int value = someVaraible.member;
 }

You might as well write:

   public void testsomething() {
        try {
            ....
        }
        catch (Throwable ignored)
        {
        }
    }

which would certainly be a fireable offense if it happened again after having had the “very very serious chat” (although I don’t truly believe that any junior programmer would be capable of making that kind of commit twice).

The fixes they applied were not promising prototypes of a real fix in some future version of the software. They were staggeringly awful fixes.

How did the get human beings to accept these patches? When they told reviewers that the fixes were machine generated, not one of the fixes was accepted (Expedition #1). When they didn’t tell reviewers that the fixes were machine generated (Expedition #2), some percentage of fixes (said percentage actually being an interesting data point, but sadly not disclosed) were accepted. Why? One has to assume that the reviewers accepted the fix granting some measure of good faith belief that the human who was assumed to have generated the fix had actually performed the supporting analysis.

Is that a hypothesis? Absolutely. But it’s an awfully reasonable hypothesis. And surely deserved some discussion, along side the hypothesis that humans show bias when reviewing commits that were generated by actually stunningly awful bots when you disclose that they were written by potentially stunningly-awful bots.


#18

#19

…which sounds to me like merely a different way of saying that machine generated fixes are treated more harshly than ‘human’ generated ones.

Except they do - to the standard usually expected of any academic work. They reference:

A. Murgia, D. Janssens, S. Demeyer and B. Vasilescu (2016) Among the machines: Human-bot interaction on social q&a websites. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 1272–1279. Cited by: Human-competitiveness.

If you want the data, you have to go read that. I have no idea whether it backs up the claim because I haven’t read it either.

We all know the stupid old saw about what happens when you assume… :wink:

In this case reading on to the next paragraph tells us that both Expeditions were ‘disguised’.

To overcome this problem, we have decided early in the project that all Repairnator patches would be proposed under a fake human identity. We have created a GitHub user, called Luc Esape, who is presented as software engineer in our research lab. Luc has a profile picture and looks like a junior developer, eager to make open-source contributions on GitHub. Now imagine Repairnator, disguised as Luc Esape proposing a patch: the developer reviewing it thinks that she is reviewing a human contribution. This camouflage is required to test our scientific hypothesis of human-competitiveness. Now, for sake of ethics, the real identity of Luc has been disclosed on each of his pull-requests.

Not that I disagree with you in general but it does seem as though the project was stating that the patches were written by a human from the outset.

That statement was added later, not at the time the patches were submitted. see the last sentence in the quoted section above: “Now, for sake of ethics, the real identity of Luc has been disclosed on each of his pull-requests.

If you look at the github commit conversation list, the disclosure was made 5 days ago.


#20

I’ve decided to read this as a story about a robot sewing tiny patches for insect clothing.