Hunting Fileless Malware with Treesitter
2024-06-08, 15:30–15:55, Track 2 (Moody Rm 101)

Obfuscated, fileless malware poses a significant challenge to automated detection systems and wastes valuable time during manual analysis, because many layers of obfuscation must be unraveled before the true malicious payload is revealed. This talk presents research demonstrating how the tree-sitter parser generator library can be used to write scalable, accurate, and attributable detections for malicious PowerShell and Bash payloads.


Existing work on detection of obfuscated PowerShell scripts falls into two categories: dynamic and static analysis. Dynamic analysis involves executing portions of the script, while neutering some of the harmful APIs that a payload would invoke, to reconstruct the portions of the script that have been obfuscated. Static analysis involves rule-based scanning (e.g., YARA) or syntax tree reconstruction, such as through PowerShell’s AST module. YARA, although powerful, is inherently limited because the variations produced by token-based and string-based obfuscation layers cannot all be predicted reliably. PowerShell’s AST module is limited in that it requires a syntactically correct script. That requirement is rarely a problem when dealing with live systems or even disk collections.

Memory forensics introduces additional challenges when dealing with this type of data: event log records, such as a single record from a series of Event ID 4104 script block logging records, may be missing, or pages of data may be missing or smeared, leading to an incomplete or corrupt representation of the original script. Tools that fail outright on such input will be unable to detect and alert on maliciously obfuscated payloads.
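To make the missing-record case concrete, here is a minimal Python sketch of reassembling a script from a series of 4104 fragments when one fragment was never recovered. The dictionary keys mirror the MessageNumber, MessageTotal, and ScriptBlockText fields of the event. The function name and the gap marker are hypothetical, chosen for illustration only, and are not part of the research described in this talk.

```python
from typing import Iterable, Mapping

# Hypothetical placeholder inserted where a fragment could not be recovered.
GAP_MARKER = "\n# <missing script block fragment>\n"


def reassemble_script_block(
    records: Iterable[Mapping[str, object]],
) -> tuple[str, list[int]]:
    """Join 4104 fragments in order and report which fragment numbers are absent."""
    records = list(records)
    if not records:
        return "", []

    # Every fragment in a series carries the total fragment count.
    total = max(int(r["MessageTotal"]) for r in records)
    by_number = {int(r["MessageNumber"]): str(r["ScriptBlockText"]) for r in records}

    parts: list[str] = []
    missing: list[int] = []
    for number in range(1, total + 1):
        if number in by_number:
            parts.append(by_number[number])
        else:
            # Keep a visible hole instead of silently joining unrelated text.
            parts.append(GAP_MARKER)
            missing.append(number)
    return "".join(parts), missing
```

A script rebuilt this way is syntactically broken at every gap, which is exactly the kind of input a strict parser rejects.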

Originally written for the Atom text editor and now an integral component of Neovim, Tree-sitter is a library that provides a single, easy interface for writing language parsers. The grammar is defined in a JavaScript file, and a C parser is generated from that grammar. It excels in its ability to recover from errors: testing has shown that even after an entire page (a 4096-byte block) of data is wiped from a large PowerShell script, the tree-sitter parser still renders a correct partial syntax tree for the sections before and after the nulled page, allowing post-processing via tree-sitter query analysis to continue. The simplicity of the library, and its integration with languages like Rust and Go through bindings, allows for the creation of static binaries that can be built and deployed across all major OSes.
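The input-corruption step of that test can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the harness used in the research: the helper name wipe_page is hypothetical, and the parsing step itself is left out because the choice of bindings and grammar package is not specified here.

```python
PAGE_SIZE = 4096  # size of the block wiped in the test described above


def wipe_page(data: bytes, page_index: int, page_size: int = PAGE_SIZE) -> bytes:
    """Return a copy of data with one page-sized block replaced by null bytes."""
    start = page_index * page_size
    if page_index < 0 or start >= len(data):
        raise ValueError("page_index is outside the buffer")
    # The final page of a buffer may be shorter than a full page.
    end = min(start + page_size, len(data))
    return data[:start] + b"\x00" * (end - start) + data[end:]


# Example: simulate a lost page in the middle of a large script.
script = b"Write-Output 'x'\n" * 1000
damaged = wipe_page(script, page_index=1)
```

The damaged buffer keeps its original length and offsets, so nodes recovered on either side of the hole still map back to their positions in the original script.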

Tree-sitter also provides a query language with a clean and easy-to-use interface for interacting with the syntax tree of a document. This allows for the creation of rules that precisely target the specific characteristics of a document that are of interest to us. In our research, a relatively small ruleset has proven effective at correctly identifying token-based and string-based obfuscation techniques, with no false positives across a significant set of vanilla PowerShell scripts. We have also been able to develop techniques for detecting syntax tree characteristics of PE droppers, such as large array initializations and calls to 'Reverse' on large strings. The ability to construct and query the syntax tree is essential to these heuristics.

David McDonald is a researcher and software engineer with 3 years of digital forensics R&D experience. His passion for this field began with his involvement in the University of New Orleans CTF team, as well as through his time as a Systems Programming teaching assistant. After over two years of digital forensics research and development on Cellebrite's computer forensics team, he joined Volexity's Volcano team, where he now works to develop next-generation memory analysis solutions.

He believes deeply in sharing knowledge and helping others discover their abilities and interests through their own journeys in cybersecurity, and strives to pay forward the benefits of the mentorship that has opened so many doors for him.