Adversarial Examples Are Not Bugs, They Are Features
by Andrew Ilyas, Shibani Santurkar, et al.
Main Idea
The paper casts adversarial vulnerability as a fundamental consequence of the dominant supervised learning paradigm: models latch onto any predictive signal in the training data, whether or not that signal is robust to perturbation or meaningful to humans.
Central Premise
Standard datasets contain both robust and non-robust features, and both constitute useful signals for standard classification; only robust features remain predictive under adversarial perturbation (formalized below).
It is possible to disentangle robust from non-robust features in standard image classification datasets.
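To make "useful" and "robust" precise, the paper defines a feature as any function f : X → R for a binary task with labels y ∈ {±1}, graded by how well it correlates with the label. A sketch of the definitions in LaTeX, following my reading of the paper's Section 2 (Δ(x) denotes the allowed perturbation set):

```latex
% A feature is any function f : \mathcal{X} \to \mathbb{R}; labels y \in \{\pm 1\}.

% rho-useful: the feature correlates with the label in expectation.
\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[ y \cdot f(x) \right] \ge \rho

% gamma-robustly useful: the correlation survives a worst-case
% perturbation \delta from the allowed set \Delta(x) (e.g. a small l2 ball).
\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[ \inf_{\delta \in \Delta(x)} y \cdot f(x + \delta) \right] \ge \gamma

% useful, non-robust: rho-useful for some rho > 0, but not
% gamma-robustly useful for any gamma >= 0.
```

Non-robust features are exactly the last category: genuinely predictive, yet flippable by imperceptible perturbations.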
Experimental Setup for Demonstration
Disentangling robust and non-robust features
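A minimal sketch of how the robust dataset can be constructed, assuming `robust_encoder` is the penultimate layer of an adversarially trained model (function names and hyperparameters here are illustrative, not the paper's released code): each training image is replaced by an image, optimized from an unrelated seed, whose robust features match the original.

```python
import torch

def robustify(x, x0, robust_encoder, steps=1000, lr=0.1):
    # Invert the penultimate-layer features of an adversarially trained
    # model: starting from an unrelated seed image x0, gradient-descend
    # until its robust features match those of x. Only xr is updated;
    # the encoder's weights stay fixed.
    with torch.no_grad():
        target = robust_encoder(x)              # robust features to match
    xr = x0.clone().requires_grad_(True)
    opt = torch.optim.SGD([xr], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(robust_encoder(xr) - target)
        loss.backward()
        opt.step()
        with torch.no_grad():
            xr.clamp_(0, 1)                     # stay a valid image
    return xr.detach()

# Robust dataset: D_R = {(robustify(x, x0, g), y) for (x, y) in D}.
```

Per the paper, standard (non-adversarial) training on this dataset yields a model with nontrivial adversarial robustness, evidence that the construction retained chiefly the robust features.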
Non-robust features alone suffice for standard classification (construction sketched below).
The fact that non-robust features support classification is not an artifact of finite-sample overfitting: models trained on the relabeled dataset generalize to the original, unmodified test set, so they must be exploiting genuinely predictive (if brittle) signals.
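A sketch of the non-robust dataset construction under the same caveats (illustrative names and hyperparameters): each image x with true label y is nudged toward a target class t by a small l2-bounded PGD perturbation against a standard pretrained classifier, then relabeled as t. To a human the image still looks like class y; the only consistent signal linking it to t is non-robust.

```python
import torch
import torch.nn.functional as F

def perturb_toward(model, x, t, eps=0.5, steps=7, step_size=0.1):
    # l2-bounded PGD that nudges a batch x toward target labels t.
    # `model` is any standard pretrained classifier.
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), t)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
            delta = delta - step_size * g       # descend: move toward t
            norms = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta = delta * (eps / norms).clamp(max=1.0)  # project to the eps-ball
    return (x + delta).clamp(0, 1).detach()

# Non-robust dataset: relabel each perturbed image with its TARGET class,
# D_NR = {(perturb_toward(model, x, t), t)}. A model trained on D_NR
# generalizes to the original test set even though every training image
# "looks" mislabeled to a human.
```

The small perturbation budget eps is what rules out overfitting as the explanation: the pixels barely change, so the generalization must come from the injected non-robust features.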
Transferability can arise from non-robust features
Adversarial examples crafted for one model transfer to independently trained models with different architectures, consistent with different models learning similar non-robust features from the same data.
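A sketch of the transfer experiment, assuming `attack` is any untargeted perturbation routine (for instance a PGD loop like the one above); again, all names are illustrative:

```python
import torch

@torch.no_grad()
def accuracy(model, x, y):
    # Top-1 accuracy of a classifier on a batch.
    return (model(x).argmax(dim=1) == y).float().mean().item()

def transfer_test(source, target, x, y, attack):
    # Craft adversarial examples against `source`, then check whether
    # they also fool an independently trained `target` model.
    x_adv = attack(source, x, y)
    print("target accuracy, clean inputs:      ", accuracy(target, x, y))
    print("target accuracy, transferred inputs:", accuracy(target, x_adv))
```

The paper further reports that architectures which learn more readily from the non-robust dataset above also tend to be more susceptible to transferred examples, supporting the shared-non-robust-features explanation.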