This Dagstuhl Seminar on "Malware analysis: from large-scale data triage to targeted attack recognition" addresses the specificities of the analysis of malware samples. It brings together people from multiple backgrounds, such as code verification, forensics, and machine learning, with both industrial and academic points of view. The seminar is motivated by the earlier Dagstuhl Seminar 14241, where three major challenges specific to malware analysis (compared to other executable analysis) were highlighted: the ability to handle obfuscated code, the scalability of analyses both in size of executables and volume of data, and the capacity of analyses to retrieve information both on the behavior of the malware sample and on its origin. For the second challenge, machine learning was suggested as a promising technique.
Malware authors are becoming increasingly sophisticated at making analysis difficult by implementing layered static and dynamic anti-analysis techniques. At the same time, the volume of malware received daily precludes deep analysis of every sample.
Hence, the challenges posed by code obfuscation and anti-analysis defenses come from the tension between two factors: on the one hand, the desire for quick and cheap triage mechanisms (see the machine learning and scale challenge) that can handle a large number and variety of malware samples; on the other hand, the high cost of penetrating multiple layers of obfuscation. Techniques for detecting and reasoning about obfuscated code, including symbolic and concolic analysis techniques, detection and neutralization of obfuscation and other anti-analysis measures, and efficiency and scalability considerations, are therefore especially relevant to this seminar.
Malware continues to grow exponentially, with the number of new malware samples reaching hundreds of thousands per month. It is no longer feasible to analyze new malware manually or to perform point analysis of individual samples. More scalable methods are needed to analyze malware individually or collectively. Machine learning (ML) offers a viable, and as yet underexplored, method to manage the scale. ML may be used across the entire spectrum of tasks involved in handling malware: triage, signature generation, detection, mitigation, and incident response. ML may also introduce new methods of countering the malware threat, such as the creation of automated, self-learning classifiers that can predictively detect unseen malware.
Machine learning techniques need a corpus of well-labelled data to estimate the model they implement. As such a corpus is not available for malware samples, advances are needed in the interplay of feature space, model construction, performance estimation, and related issues, taking into account the specific adversarial context created by malware.
Unlike for other executables, the expected behavior of malware samples is not known a priori. Hence, the first question one has to deal with is "What does this malware do?". Secondly, one asks "Who is the enemy?", a question about the origin of the malware. Part of the answer may be found in the executable itself. The other part may be deduced from previous malware attacks and hence has a tight link with the machine learning challenge: (i) triage would make it possible to determine whether, and how, a malware campaign is related to a previous one; (ii) industrial participants of the seminar should be asked what kinds of features are relevant for malware recognition and which could serve as a malware footprint for triage. This feature extraction has to be done in the context of obfuscation and hence can serve as a guideline for de-obfuscation (see the obfuscation challenge).
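As a purely illustrative sketch of what feature-based triage could look like, the following Python snippet compares a new sample against previously seen samples using byte n-gram sets and Jaccard similarity. The choice of features, the function names, the family labels, and the threshold are all assumptions made for this example; they are not taken from the seminar, and real samples would typically need unpacking or de-obfuscation before such features become meaningful.

```python
def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct n-byte substrings of data (a hypothetical feature choice)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_known(sample: bytes, known: dict, threshold: float = 0.5):
    """Return (label, score) of the most similar known sample.

    The label is None when no known sample reaches the (assumed) threshold.
    """
    features = byte_ngrams(sample)
    best_label, best_score = None, 0.0
    for label, blob in known.items():
        score = jaccard(features, byte_ngrams(blob))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Toy data standing in for previously analyzed samples (hypothetical labels).
known_samples = {
    "family_a": b"ABCDEFGHIJKLMNOP",
    "family_b": b"0123456789abcdef",
}

label, score = nearest_known(b"ABCDEFGHIJKLMNOQ", known_samples)
unknown_label, unknown_score = nearest_known(b"zzzzzzzzzzzz", known_samples)
```

In this toy setting, a sample that shares most of its n-grams with a known one is attributed to that family, while an unrelated sample falls below the threshold and would be queued for deeper analysis. A single changed byte already perturbs several n-grams, which hints at why obfuscation makes this kind of footprint fragile.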
Following up on the previous Dagstuhl Seminar 14241 on the analysis of binaries, this new seminar attracted very high interest. Attendance was diverse, split almost evenly between academics and practitioners.
Talks were arranged by topics and each day ended with an open discussion on one of the three topics: machine learning, obfuscation and practitioners' needs.
Judging from the talks given, the challenges in the realm of general binary analysis have not changed considerably since the last gathering. However, the balance between the topics shows that academic interest is now focused more on machine learning than on obfuscation. By contrast, practitioners presented examples showing that the sophistication of obfuscation has increased tremendously in recent years.
The open discussions were the most fruitful part of the seminar. The discussions enabled the academics to ask practitioners about the hypotheses that are relevant to build models for their analyses and the problems they face in their daily work. The practitioners gained awareness of the automated tools and techniques that they can expect to see emerge from research labs.
These informal exchanges will be gathered into a separate document and circulated to the academic community.
Finally, please note that not everyone who presented has submitted an abstract, owing to the sensitive nature of the content and/or of the organization that the participant works for.
- Radoniaina Andriatsimandefitra (Rennes, FR) [dblp]
- Sebastian Banescu (BMW Group ITZ, DE) [dblp]
- Thomas Barabosch (Fraunhofer FKIE - Bonn, DE) [dblp]
- Sébastien Bardin (CEA LIST, FR) [dblp]
- Konstantin Berlin (SOPHOS - Fairfax, US) [dblp]
- Paul Black (Federation University Australia - Mount Helen, AU) [dblp]
- Cory Cohen (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Christian Collberg (University of Arizona - Tucson, US) [dblp]
- Sophia D'Antoine (Trail of Bits Inc. - New York, US)
- Mila Dalla Preda (University of Verona, IT) [dblp]
- Robin David (Quarkslab, FR) [dblp]
- Bjorn De Sutter (Ghent University, BE) [dblp]
- Saumya K. Debray (University of Arizona - Tucson, US) [dblp]
- Thomas Dullien (Google Switzerland - Zürich, CH) [dblp]
- Roberto Giacobazzi (University of Verona, IT) [dblp]
- Yuan Xiang Gu (Irdeto - Ottawa, CA) [dblp]
- Tim Kornau-von Bock und Polach (Google Switzerland - Zürich, CH) [dblp]
- Arun Lakhotia (University of Louisiana - Lafayette, US) [dblp]
- Colas Le Guernic (Direction Generale de l'Armement - Rennes, FR) [dblp]
- Jean-Yves Marion (LORIA & INRIA - Nancy, FR) [dblp]
- Marion Marschalek (G DATA Advanced Analytics GmbH - Bochum, DE) [dblp]
- J. Todd McDonald (University of South Alabama - Mobile, US) [dblp]
- Xavier Mehrenberger (Airbus Group - Suresnes, FR)
- Michael Meier (Universität Bonn, DE) [dblp]
- Bogdan Mihaila (Synopsys Finland OY - Helsinki, FI) [dblp]
- Craig Miles (Assured Information Security - Portland, US) [dblp]
- Asuka Nakajima (NTT - Tokyo, JP)
- Daniel Plohmann (Fraunhofer FKIE - Bonn, DE) [dblp]
- Mario Polino (Polytechnic University of Milan, IT) [dblp]
- Pablo Rauzy (University of Paris VIII, FR) [dblp]
- Raphael Rigo (Airbus Group - Suresnes, FR)
- Radwan Shushane (Columbus State University, US) [dblp]
- Natalia Stakhanova (University of New Brunswick at Fredericton, CA) [dblp]
- Ryan Stortz (Trail of Bits Inc. - New York, US)
- Dinghao Wu (Pennsylvania State University - State College, US) [dblp]
- Yves Younan (Cisco Systems Canada Co. - Toronto, CA) [dblp]
- Stefano Zanero (Polytechnic University of Milan, IT) [dblp]
- Sarah Zennou (Airbus Group - Suresnes, FR) [dblp]
- Ed Zulkoski (University of Waterloo, CA) [dblp]
- security / cryptology
- semantics / formal methods
- verification / logic
- reverse engineering
- executable analysis
- machine learning
- big data