I wonder. Right now most seem unbelievably poor at even recognizing what they’re talking about, which you hardly need an in-depth conversation to notice.
I guess this comes down to what you say next, about different juries having different standards:
When nobody is interested in probing whether their partner has memory, let alone understanding, no conversation is going to test for them, and so the heuristics we have now might be enough. Was the idea really to fool one non-expert jury, though? I sort of assumed it was about being able to fool juries in general, on a more or less consistent basis.
As for me, it’s not like I would use the same questions every time; for instance, I would ask about completely different stories. And if your program can listen to any random story I tell it about any random subject, give a plausible answer about the meaning or motivations, and support that with evidence from the story, I’m ok with saying it must have some form of understanding in it.
Now it’s easy to say that’s just because nobody has made such a program yet, and that when we do I’d want something else, but I’m skeptical. Chess was picked not because it reflected how people think but simply because it was hard, and nobody considered the possibility of tuning something up to be good at chess and incapable of anything else. Go turns out to be harder still, but I would hope by now everyone understands that a Go program would end up the same way.
But being able to interpret real or hypothetical situations, as far as coming up with meanings and motivations goes, and to support those answers? What exactly would anyone expect from an intelligent program that doesn’t fall under that umbrella? Maybe someone could wave such a thing off as merely a world-interpreting program, but as far as I can tell that’s essentially what human understanding is, too. I’m genuinely at a loss to imagine how I would draw a distinction between the two.
I guess maybe the test is silly, but I still don’t think the choice of task is.