Past Recording
Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
Why This Is Interesting

Because of the complexity of machine learning models, researchers have proposed a number of techniques to explain model predictions. Often, the motivation for using such techniques is to increase trust in ML models. However, to what extent are explanation methods themselves vulnerable to manipulation? In this talk, we introduce an attack that fools two popular explainability methods, LIME and SHAP, by exploiting an assumption common to both techniques. This allows us to create models that receive arbitrary explanations from LIME and SHAP. We demonstrate the potential significance of our attack by building classifiers that rely solely on protected attributes (e.g., race or gender) yet are not flagged by either explanation method as depending on these features.
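The shared assumption being exploited is that LIME and SHAP explain a model by probing it with perturbed inputs, and those perturbations tend to fall off the data manifold. Below is a minimal sketch of the resulting scaffolding idea, using synthetic data and made-up names (it is not the authors' released code): an out-of-distribution detector routes real inputs to a biased rule and perturbation-like inputs to an innocuous one.

```python
# Sketch of a scaffolding classifier that hides its bias from perturbation-based
# explainers. Assumptions: toy 2-feature data, column 0 plays the role of a
# protected attribute, column 1 is an unrelated feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# "Real" data and its biased labels (labels depend only on the protected attribute).
X_real = rng.normal(size=(500, 2))
y_real = (X_real[:, 0] > 0).astype(int)

# LIME/SHAP-style perturbations tend to lie off the data manifold; here we mimic
# them crudely by adding large noise to the real points.
X_perturbed = X_real + rng.normal(scale=3.0, size=X_real.shape)

# Detector that separates real points (label 1) from perturbation-style points (label 0).
detector = RandomForestClassifier(random_state=0).fit(
    np.vstack([X_real, X_perturbed]),
    np.hstack([np.ones(len(X_real)), np.zeros(len(X_perturbed))]),
)

def adversarial_predict(X):
    """Biased on in-distribution inputs, innocuous on perturbation samples."""
    X = np.asarray(X)
    is_real = detector.predict(X).astype(bool)
    biased = (X[:, 0] > 0).astype(int)     # uses only the protected attribute
    innocuous = (X[:, 1] > 0).astype(int)  # uses only the unrelated feature
    return np.where(is_real, biased, innocuous)
```

When LIME or SHAP probes this model, most of its probe points are recognized as off-manifold and answered by the innocuous rule, so the resulting explanation attributes importance to the unrelated feature rather than the protected attribute, even though real inputs are classified on the protected attribute alone.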

Discussion Points
  • SHAP makes theoretical guarantees on the fair allocation of feature importance. Why do these guarantees not protect SHAP from this vulnerability?
  • Would global explanation methods (like a surrogate decision tree) be less likely to be vulnerable to the attack technique you presented?
  • Would it be a fair conclusion that explanation methods can identify bias but cannot guarantee the absence of bias?
Takeaways
  • LIME and SHAP can be intentionally fooled by a biased classifier so that the bias is not detected
  • Practitioners should diversify the explainability tools they use to better detect bias
Time of Recording: Tuesday Jan 26 2021 17:00 GMT