AI in a Minefield: Learning from Poisoned Data
2021-06-12, 10:00–10:45, Track 1

Many security technologies use anomaly detection mechanisms on top of a normality model constructed from previously seen traffic data. However, when the traffic originates from unreliable sources, the learning process needs to mitigate potential reliability issues to avoid including malicious traffic patterns in this normality model. In this talk, we will present the challenges of learning from dirty data, with a focus on web traffic (probably the dirtiest data in the world), and explain


Data poisoning is one of the main threats to AI systems. When malicious actors have even limited control over the data used for training a model, they can try to make the training process fail, prevent it from converging, skew the model, or install so-called ML backdoors: areas where the model makes incorrect decisions, usually areas of interest to the attacker. This threat is especially relevant when security technologies use anomaly detection mechanisms on top of a normality model constructed from previously seen traffic data. When the traffic originates from unreliable sources, which may be partially controlled by malicious actors, the learning process needs to be designed under the assumption that data poisoning attempts are very likely to occur.
In this talk, we will present the challenges of learning from dirty data, survey data poisoning attacks on systems such as spam detection, image classification and rating systems, discuss the problem of learning from web traffic (probably the dirtiest data in the world), and explain different approaches to learning from dirty and poisoned data. We will focus on threshold learning as a mitigation for data poisoning, which aims to reduce the impact of any single data source, and discuss a mundane but crucial aspect of threshold learning: memory complexity. We will present a robust learning scheme optimized to work efficiently on streamed data with bounded memory consumption. We will give examples from the web security arena, with robust learning of URLs, parameters, character sets, cookies and more.
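
To make the idea of threshold learning concrete, here is a minimal sketch. It is not the scheme presented in the talk. It assumes that an item (for example a URL) is admitted to the normality model only after it has been seen from a fixed number of distinct sources, and that pending candidates are kept in a capped table with least-recently-seen eviction. The class name `ThresholdLearner`, its parameters, and the eviction policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class ThresholdLearner:
    """Admit an item only after `threshold` distinct sources have sent it.

    Illustrative sketch: the names and the eviction policy are assumptions,
    not the scheme described in the talk.
    """

    def __init__(self, threshold=3, max_candidates=1000):
        self.threshold = threshold
        self.max_candidates = max_candidates
        self.learned = set()             # items accepted into the normality model
        self.candidates = OrderedDict()  # pending item -> distinct sources seen so far

    def observe(self, item, source):
        """Process one (item, source) event; return True if the item is learned."""
        if item in self.learned:
            return True
        # Remove the entry so that re-inserting it marks it as most recently seen.
        sources = self.candidates.pop(item, set())
        sources.add(source)  # a repeated source does not grow the set
        if len(sources) >= self.threshold:
            self.learned.add(item)
            return True
        self.candidates[item] = sources
        if len(self.candidates) > self.max_candidates:
            # Bound the pending state: drop the least recently seen candidate.
            self.candidates.popitem(last=False)
        return False


learner = ThresholdLearner(threshold=3, max_candidates=2)

# A single source repeating a URL many times never gets it learned.
for _ in range(100):
    learner.observe("/admin/backdoor", "attacker-ip")
print("/admin/backdoor" in learner.learned)  # False

# Three distinct sources are enough to learn a URL.
for src in ("ip-a", "ip-b", "ip-c"):
    learner.observe("/index.html", src)
print("/index.html" in learner.learned)  # True
```

In this sketch the pending state is bounded by `max_candidates` entries of fewer than `threshold` sources each, while the set of learned items still grows with the model itself. The eviction step is a trade-off: an attacker who floods the stream with junk items can push legitimate candidates out of the table, so a real design would need to account for that.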

Over the last 20 years I have researched and innovated in a variety of security domains, including web application security, advanced persistent threats, DRM systems, automotive systems, data security and more. While thinking as an attacker is my second nature, my first nature is problem solving and algorithm development: in the past in cryptography and watermarking, and today mostly around harnessing ML/AI technology to solve security-related problems. While I am fascinated by bleeding-edge technologies like AI and federated learning and the opportunities these technologies unlock, as a security veteran I am also continuously asking what can go wrong, and the answer is never NULL.
I am the inventor of 20 patents in the security, cryptography, data science and privacy-preserving computation arenas. I hold an M.Sc. in Applied Math and Computer Science from the Weizmann Institute.