Credential Digger: Using Machine Learning to Identify Hardcoded Credentials in Github
Github.com is a hosting platform for software development, version control, and management. With more than 100 million repositories (at least 28 million of them public), it is the largest host of source code in the world. Users can use Github to publish their code, collaborate on open-source projects, or simply use publicly available projects. In such an environment, one of the most critical threats is hardcoded (or plaintext) credentials in open-source projects.
When developers integrate authentication into their source code (e.g., access to a database or an email server), they commonly rely on credentials or authentication tokens (passwords, API keys, private keys, etc.). Although secure development standards already define safe ways to handle these credentials, some developers unintentionally publish them in clear text in their open-source projects. For example, in 2016 Uber suffered a massive data leak that affected 57 million customers and exposed sensitive personal information such as names, email addresses, and phone numbers. The attack originated from a password found in a Github repository.
Several scanning tools already exist (including Github’s official scanner), but they all suffer from a very high false positive rate, especially when scanning for passwords. In their default configurations, most of these tools only search for well-structured API keys. The first reason for this poor precision is the diversity of credentials, which vary with factors such as the programming language, code conventions, and developers’ personal habits. The second reason is a perfectly valid development convention: open-source projects often include code dedicated to tests and examples. This code, which helps developers and users test their installation and learn how to use the software, systematically contains fake credentials that every scanning tool reports as hits.
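To make the false positive problem concrete, here is a minimal sketch of a pattern-based scan, written with a regular expression. The pattern, the file names, and the credential values are all made up for illustration and do not come from any real scanner. The point is only that the test fixture is reported exactly like the production secret.

```python
import re

# Hypothetical pattern (an assumption, not taken from any real tool):
# an identifier containing "password", "passwd", or "pwd" that is
# assigned a quoted string literal.
PASSWORD_ASSIGNMENT = re.compile(
    r"""\b\w*(?:password|passwd|pwd)\w*\s*[:=]\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def scan_text(path, text):
    """Return (path, line_number, captured_value) for every matching line."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        match = PASSWORD_ASSIGNMENT.search(line)
        if match:
            hits.append((path, line_number, match.group(1)))
    return hits


# Two made-up files: a production config and a test fixture.
files = {
    "app/settings.py": 'DB_PASSWORD = "s3cr3t-Pr0d!"\n',
    "tests/test_login.py": 'password = "dummy123"\n',
}

all_hits = [hit for path, text in files.items() for hit in scan_text(path, text)]
for hit in all_hits:
    print(hit)
```

Both files produce a hit, even though only the first one would matter to a reviewer. The pattern alone carries no information about whether a value is real or fake.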
Credential Digger’s Unique Approach
The main problem with traditional scanning tools lies in their detection technology. Most of these scanners rely on regular expressions (static text-matching queries) to identify hardcoded credentials. This approach is ineffective when the structure of the credential is not predictable and when its nature (fake or real) cannot be inferred from the pattern alone. For this reason, with Credential Digger we decided to address the source of false positives directly by training two Machine Learning models:
File Path Model
The “File Path Model” is in charge of identifying all the test and example portions of code in a project. This machine learning model analyzes the file paths and file names in a project and filters out the dummy credentials that traditional scanners would report as hits. In our tests on more than 1000 Github repos, applying the Path Model alone reduced false positive hits by 50%.
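The idea of filtering hits by file path can be sketched with a simple keyword heuristic. This is only a stand-in for illustration: the actual Path Model is a trained machine learning model, and the keyword list, function name, and sample hits below are assumptions, not part of Credential Digger.

```python
import re
from pathlib import PurePosixPath

# Hypothetical keyword list; a trained model would learn such signals from data.
TEST_HINTS = {
    "test", "tests", "example", "examples", "sample", "samples",
    "demo", "mock", "mocks", "fixture", "fixtures",
}


def looks_like_test_path(path):
    """Return True if any path component contains a test/example keyword."""
    for part in PurePosixPath(path.lower()).parts:
        stem = part.rsplit(".", 1)[0]           # drop the file extension
        tokens = re.split(r"[^a-z0-9]+", stem)  # split on "_", "-", etc.
        if any(token in TEST_HINTS for token in tokens):
            return True
    return False


# Made-up scanner hits as (file path, line number) pairs.
hits = [
    ("app/settings.py", 12),
    ("tests/test_login.py", 3),
    ("docs/examples/config.yml", 7),
    ("src/contest/db.py", 40),  # "contest" is not the token "test"
]

kept = [hit for hit in hits if not looks_like_test_path(hit[0])]
print(kept)
```

Matching whole tokens rather than substrings avoids discarding paths such as "contest" or "latest". A fixed keyword list still misses project-specific naming, which is one motivation for learning the filter instead.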
Code Snippet Model
The “Code Snippet Model” is in charge of identifying the portion of code where a developer is authenticating, and of distinguishing between real and fake passwords and credentials. This Machine Learning model can recognize when a clear-text credential is used in source code to perform a real authentication action. In our tests on more than 1000 Github repos, applying the Path Model reduced false positive hits by 40 to 50%.
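As a rough illustration of what distinguishing real from fake values means, the sketch below applies a few hand-written rules to a candidate value. It is not the Snippet Model, which is trained on code snippets; the word list, the template patterns, and the function name are assumptions chosen for the example.

```python
import re

# Hypothetical placeholder vocabulary (an assumption for this sketch).
PLACEHOLDER_WORDS = ("changeme", "dummy", "example", "placeholder", "todo", "xxx", "your")

# Template markers such as <PASSWORD>, ${DB_PASS}, or {{ password }}.
TEMPLATE_MARKER = re.compile(r"<[^>]+>|\$\{[^}]+\}|\{\{[^}]+\}\}")


def looks_like_placeholder(value):
    """Return True if the value resembles a dummy or templated credential."""
    lowered = value.lower()
    if any(word in lowered for word in PLACEHOLDER_WORDS):
        return True
    if TEMPLATE_MARKER.fullmatch(value):
        return True
    # Empty strings and single repeated characters, e.g. "******".
    return len(set(value)) <= 1


for candidate in ("dummy123", "${DB_PASS}", "******", "s3cr3t-Pr0d!"):
    print(candidate, looks_like_placeholder(candidate))
```

Rules like these only look at the value itself. They ignore the surrounding code, so they cannot tell whether the value is actually used to authenticate, which is the context the Code Snippet Model is described as taking into account.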
Credential Digger is now Open Source on Github
You can scan your own Github repos by installing the tool: credential-digger
The PyPI project used for installation is available at https://pypi.org/project/credentialdigger/
If you are interested in joining the project community, please drop me an e-mail or leave a comment on the blog.