Thanks to package managers like Maven, pip or npm, the consumption of open source components is a very natural part of software development. By now, the majority of modern software applications – commercial or not – depend on at least one open source component. These dependencies may themselves require other dependencies, forming the so-called dependency tree.
Numerous studies confirm the omnipresence of open source: According to a blog post from December 2018, written by npm Inc., “the average modern web application has over 1000 modules, and (dependency) trees of over 2000 modules are not uncommon” and “an individual developer is responsible only for the final 3% that makes their application unique and useful”.
However, the flip side of open source consumption is the exposure to potentially severe security risks.
One such risk stems from the use of components with known vulnerabilities. This problem got a lot of attention in the past few years, especially after the infamous Equifax data breach. By now, this problem is tackled by commercial and open source tools, e.g., OWASP Dependency Check or Eclipse Steady, and it has been studied from different angles and for different ecosystems.
Another risk stems from malicious components, i.e., components into which attackers injected malicious code. Once an infected component becomes part of the dependency tree of – potentially many – downstream components, its payload can be executed in their respective development and/or production environments. A single dependency may be included in several thousand downstream components.
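To make the reach of a single infected component more tangible, here is a minimal sketch that computes all direct and transitive dependents of one package. The dependency data and all package names are hypothetical and only serve as illustration; real ecosystems would need data from the respective registry.

```python
from collections import defaultdict, deque


def downstream_dependents(dependencies, infected):
    """Return every package that directly or transitively depends on `infected`.

    `dependencies` maps a package name to the list of packages it depends on.
    """
    # Invert the edges: for each package, who depends on it directly?
    dependents = defaultdict(set)
    for package, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(package)

    # Breadth-first search over the inverted graph.
    affected = set()
    queue = deque([infected])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical dependency data, for illustration only.
example = {
    "web-app": ["http-lib", "logger"],
    "http-lib": ["stream-utils"],
    "cli-tool": ["stream-utils"],
    "logger": [],
    "stream-utils": ["tiny-helper"],
    "tiny-helper": [],
}

print(sorted(downstream_dependents(example, "tiny-helper")))
```

In this toy graph, infecting the small leaf package reaches every package that uses it, whether directly or through intermediate dependencies.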
A prominent example of such a supply chain attack is the infection of event-stream. Here, the original maintainer handed over the repository ownership to the attacker, which allowed the latter to place malware in one of its dependencies. Alarmingly, at the time of the attack, event-stream was used by another 1,600 packages and was downloaded 1.5 million times a week on average.
Backstabber’s Knife Collection
To support researchers and the open source community in understanding the problem and, ultimately, in developing countermeasures, SAP Security Research, in collaboration with the University of Bonn, analyzed the code of 174 malicious components that were used in past supply chain attacks, and described their characteristics along a number of dimensions, e.g., the primary objective or the use of obfuscation techniques.
The results of this study are presented in the paper Backstabber’s Knife Collection, which has been accepted at DIMVA 2020, the 17th Conference on Detection of Intrusions and Malware & Vulnerability Assessment.
The Attack Tree
To enumerate the potential attack vectors in a more structured manner, an attack tree was developed and used to reference actual attacks and related works.
The top-level goal of the attack tree is to inject malicious code into the dependency tree of downstream components. In other words, the goal is satisfied as soon as a package with malicious code can be downloaded from a public distribution platform, e.g., PyPI or Maven Central, and has become part of other components’ or applications’ dependency trees.
To reach that high-level goal, an attacker may follow two possible strategies: either infect an existing package or submit a new package.
Obviously, developing and publishing a new rogue package using a name that is not used by anybody else avoids interference with other legitimate project maintainers. However, such a package has to be discovered and referenced by downstream consumers in order to end up in the dependency trees of victim packages. This may be achieved using a name similar to existing package names (“typosquatting” or “combosquatting”), or by developing and promoting a “trojan horse”.
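As an illustration of how look-alike names could be flagged, the following sketch compares a package name against a short list of well-known names. It is only one possible heuristic, not the approach of any particular repository: the list `POPULAR`, the threshold of 0.85 and the helper names are assumptions made for this example. Legitimate extension packages will match the combosquatting rule as well, so hits are candidates for review rather than verdicts.

```python
from difflib import SequenceMatcher

# Hypothetical list of well-known package names (illustration only).
POPULAR = ["requests", "numpy", "django", "urllib3", "python-dateutil"]


def normalize(name):
    """Lower-case the name and treat '_' and '.' like '-'."""
    return name.lower().replace("_", "-").replace(".", "-")


def squatting_candidates(name, popular=POPULAR, threshold=0.85):
    """Return the popular names that `name` resembles without being identical."""
    candidate = normalize(name)
    known = [normalize(p) for p in popular]
    if candidate in known:
        return []  # exact match: this is the well-known name itself

    hits = []
    for target in known:
        # Typosquatting: the name differs from a known one by a small edit.
        is_typo = SequenceMatcher(None, candidate, target).ratio() >= threshold
        # Combosquatting: a known name appears as a hyphen-separated part
        # of a longer name.
        is_combo = f"-{target}-" in f"-{candidate}-"
        if is_typo or is_combo:
            hits.append(target)
    return hits


# Hypothetical candidate names, for illustration only.
for name in ["reqests", "requests-extra", "requests", "flask"]:
    print(name, "->", squatting_candidates(name))
```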
An attacker might also use the opportunity to reuse the identifier of an existing project, package, or user account withdrawn by its original and legitimate maintainer (use after free).
The second strategy is to infect an existing package that already has consumers, contributors and maintainers. The attacker might choose packages for different reasons, e.g., a significant number or specific group of downstream consumers. Once the attacker chooses a package to infect, the malicious code may be injected into the sources, during the build, or into the package repository. To that end, weak or leaked account passwords – a research topic by itself – of legitimate project maintainers are frequently used.
The paper Backstabber’s Knife Collection contains a more comprehensive description, and references to real-world attacks or related research works.
The Malicious Packages Dataset
We have created the first manually curated dataset of malicious open source packages that have been used in real-world attacks.
The dataset was compiled between July 2nd and August 2nd, 2019, and updated on January 27th, 2020. During that time, the vulnerability database Snyk, security advisories and research blogs were reviewed to identify malicious packages and possible attack vectors. Only packages that were explicitly labeled as malicious were considered, i.e., vulnerable ones were left out.
In total, 469 malicious packages were identified. Additionally, 59 packages turned out to be proofs of concept (published by researchers) and were therefore excluded from further examination. Eventually, we were able to obtain at least one affected version for 174 packages, which were manually analyzed regarding the following characteristics:
– Temporal aspects: According to the dates of publication and disclosure, we found that a malicious package is available (downloadable) for 209 days on average.
– Trigger of malicious behavior: The majority of packages (56%) trigger the malicious code during the installation process, e.g., during “npm install”.
– Conditional execution: 41% of malicious packages execute malicious code only conditionally, e.g., depending on whether a domain name can be resolved, or depending on the balance of a crypto wallet.
– Injection (technique): The two most prominent techniques were typosquatting or combosquatting (61%), followed by the infection of existing packages using compromised credentials (22%).
– Primary objective: The majority of malicious packages aim at the exfiltration of data, e.g., SSH keys (“~/.ssh/*”) or npm configuration settings (“~/.npmrc”).
– Target operating system: Not considering second stage malware (downloaded by droppers), we found that most packages are OS-agnostic.
– Use of obfuscation techniques: Roughly half of the malicious packages (49%) use obfuscation techniques such as encodings or string sampling.
– Clusters: Based on a manual comparison of the similarity of malicious code, we were able to identify 21 clusters, whereby one even spans across multiple ecosystems (and therefore languages).
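Since the majority of the analyzed packages trigger their payload during installation, one cheap triage step for npm packages is to list the lifecycle scripts that npm runs automatically on install. The following is a minimal sketch, assuming the manifest is available as text; the helper name `install_time_scripts` and the sample manifest are made up for illustration. Many legitimate packages use such scripts too, for example to compile native code, so a hit is merely a reason to look closer.

```python
import json

# npm lifecycle scripts that run automatically during "npm install".
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_time_scripts(package_json_text):
    """Return the install-time scripts declared in a package.json document."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts", {})
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


# Hypothetical manifest, for illustration only.
example_manifest = """
{
  "name": "example-package",
  "version": "1.0.0",
  "scripts": {
    "test": "node test.js",
    "postinstall": "node setup.js"
  }
}
"""

print(install_time_scripts(example_manifest))
```

npm can also be told to skip these scripts altogether with `npm install --ignore-scripts`, at the cost of breaking packages that legitimately need them.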
You can have a look (and contribute!): the Dataset is open
We believe that an open dataset accelerates the development of detective and preventive countermeasures. Thus, the complete dataset is available for free on GitHub.
For ethical reasons, however, access is granted only on justified request. Just drop an email to the authors of the paper, briefly explaining your motivation and background.
Of course, we also welcome contributions to this dataset, especially for programming languages currently underrepresented, e.g., Java.
What Can We Do about It?
Our analysis shows that it is important to make use of the security measures that already exist. Plenty of countermeasures are readily available for open source project maintainers, repository owners and downstream consumers alike.
Example recommendations are as follows: For open source project maintainers, multi-factor authentication and strong passwords should be mandatory. Developers should use version pinning. However, pinning only helps if the version is specified exactly, which also means that security patches and bug fixes (minor updates) are no longer picked up automatically, and this may be counterproductive when it comes to vulnerabilities. Common package repositories already purge typosquatting packages frequently, but many still make it through. General awareness of developers and more stringent rules from the package repositories may help against that type of attack.
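To illustrate what exact pinning means in practice, here is a minimal sketch that lists the entries of a pip requirements file that are not pinned to one exact version. It deliberately handles only simple `name==version` lines; extras, environment markers, hashes and URLs are out of scope. The function name and the sample file content are assumptions for this example.

```python
import re

# Matches "name==1.2.3": an exact version without wildcards or ranges.
PINNED = re.compile(r"^[A-Za-z0-9._-]+\s*==\s*[0-9][A-Za-z0-9.!+_-]*$")


def unpinned_requirements(text):
    """Return the requirement lines that do not pin one exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements file, for illustration only.
example_requirements = """
# dependencies of a hypothetical project
requests==2.31.0
flask>=2.0
numpy
django==4.2.*
"""

print(unpinned_requirements(example_requirements))
```

Every entry reported here could resolve to a newer release at the next installation, including a release published by an attacker. The exactly pinned entries avoid that, but have to be updated by hand when a fix is released.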
The compilation of a more comprehensive overview of countermeasures, mapped to the nodes of the above-described attack tree, will be one of our future activities.
Furthermore, now that a dataset exists, it is possible to use proven malicious packages as seeds in order to find more related cases. Our manually curated and labeled dataset allows supervised learning approaches that can support automated and repository-wide search for malicious packages.
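The idea of using known samples as seeds can be sketched as follows. Note that this is not the method used in the paper, where code was compared manually, and it is not a supervised learning approach either: it is a much cruder stand-in that compares the sets of tokens of two code snippets using the Jaccard similarity. The seed names and the code snippets are hypothetical placeholders.

```python
import re


def tokens(source):
    """Split source code into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", source))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def closest_seed(candidate_source, seeds):
    """Return (seed_name, similarity) for the most similar known sample.

    `seeds` maps a sample name to its source code and must not be empty.
    """
    cand = tokens(candidate_source)
    return max(
        ((name, jaccard(cand, tokens(src))) for name, src in seeds.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical, harmless placeholder snippets, for illustration only.
seeds = {
    "seed-a": "const data = read(home + '/.npmrc'); send(host, data);",
    "seed-b": "if (resolve(domain)) { download(url); run(file); }",
}
candidate = "const payload = read(home + '/.npmrc'); send(server, payload);"

name, score = closest_seed(candidate, seeds)
print(name, round(score, 2))
```

Renaming a few variables lowers the score only slightly here, which is why even such a simple measure can surface near-copies of a known sample. Obfuscated code would require more robust features.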
Given the importance of a comprehensive and up-to-date dataset, its curation needs to continue, and contributions are welcome!
Contact: Henrik Plate
Reference of the scientific publication:
Marc Ohm, Henrik Plate, Arnold Sykosch and Michael Meier: Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks. Proceedings of the 17th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), 24-26 June 2020.
The preprint on arXiv
The presentation at DIMVA 2020.