Code-based Vulnerability Detection in Open-source ...

serena_ponta · ‎02-20-2019

The use of open-source software (OSS) is ever-increasing, and so is the number of open-source vulnerabilities being discovered and publicly disclosed. The risks that come from reusing community-developed libraries were mercilessly demonstrated by the (in)famous Equifax data breach, in which personal and financial data of millions of US citizens were stolen. The root cause of the breach was a web server application that depended on an old, vulnerable OSS library.

Nowadays several tools exist to detect whether vulnerable libraries are among an application's dependencies. However, most of them rely on metadata to map libraries to vulnerabilities.

A different approach, based on the detection of vulnerable code rather than vulnerable libraries, stands out from the crowd.

Why code-based vulnerability detection?

Vulnerabilities are not attributes or labels that get attached to a software library. This is just the way metadata-based approaches treat them! Whatever the software does, good or bad, comes from the code, and thus it seems only natural to shift the focus from the abstract idea of a vulnerability to the concrete concept of vulnerable code, that is, the set of instructions whose execution results in a vulnerability.

What do we gain?

The benefit is straightforward: it does not matter where the vulnerable code is, it will always be detected. There is no longer any need to rely on existing metadata that somebody must have provided (or may have hidden!).
Let's remember that the vulnerable code is just a small fraction of an OSS library, and thus the code-based approach allows for much higher precision in the detection, as it does not simply associate the vulnerability with an entire software library.
On top of this, think about every time a library (or a part of it) gets re-bundled into a new one, or some code (e.g., classes) gets copied to obtain a self-contained component. There is little to no chance that the metadata of the original library is still available in the re-bundled one, and thus no detection based on metadata is possible. With a code-based approach, on the contrary, the vulnerable code can still be detected.
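
To make the re-bundling case concrete, here is a minimal Python sketch that contrasts the two lookups. It is an illustration only and not how any particular tool works: the library names, the identifier VULN-0001, the method bodies and the whitespace-stripping fingerprint are all assumptions made up for this example.

```python
import hashlib


def fingerprint(source: str) -> str:
    """Hash a code construct with all whitespace removed (a deliberately crude
    normalization, assumed here so that reformatting does not change the hash)."""
    normalized = "".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical vulnerable method body and hypothetical vulnerability identifier.
VULNERABLE_BODY = "String parse(String in) { return eval(in); }"

# Metadata-based knowledge: (library name, version) -> vulnerability.
KNOWN_VULNERABLE_LIBRARIES = {("org.example:parser", "1.0"): "VULN-0001"}

# Code-based knowledge: fingerprint of the vulnerable code -> vulnerability.
KNOWN_VULNERABLE_CODE = {fingerprint(VULNERABLE_BODY): "VULN-0001"}


def detect_by_metadata(archive: dict) -> set:
    """Look the archive up by its declared name and version only."""
    vuln = KNOWN_VULNERABLE_LIBRARIES.get((archive["name"], archive["version"]))
    return {vuln} if vuln else set()


def detect_by_code(archive: dict) -> set:
    """Look up every construct contained in the archive, ignoring its metadata."""
    found = set()
    for body in archive["constructs"].values():
        vuln = KNOWN_VULNERABLE_CODE.get(fingerprint(body))
        if vuln:
            found.add(vuln)
    return found


original = {
    "name": "org.example:parser",
    "version": "1.0",
    "constructs": {
        "Parser.parse": VULNERABLE_BODY,
        "Parser.close": "void close() {}",
    },
}

# The same class copied into another artifact: new name, new version, reformatted code.
rebundled = {
    "name": "com.acme:all-in-one",
    "version": "3.2",
    "constructs": {
        "Parser.parse": "String parse(String in) {\n    return eval(in);\n}",
        "App.main": "void main() {}",
    },
}

print(detect_by_metadata(original), detect_by_code(original))
print(detect_by_metadata(rebundled), detect_by_code(rebundled))
```

The metadata lookup finds nothing in the re-bundled archive because its name and version are unknown, while the code lookup still matches the copied method. A real implementation would need a far more robust comparison than a hash over whitespace-stripped text.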

At which cost?

"Nothing in life comes for free" and that's the case here as well. Going from a vulnerability to its vulnerable code requires knowing how the vulnerability was fixed in the library's code repository. Such information may be available within the vulnerability disclosure or may be well hidden in the repository history. In the current state of the art, a human is still involved in the (semi-automated) process of identifying the vulnerable code, but machine learning approaches are catching up!
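
As a rough illustration of why the fix matters, the Python sketch below compares the constructs of a library before and after a fix commit and reports those that changed; these are the natural candidates for the vulnerable code. The snapshots, method names and bodies are hypothetical, and representing a commit as two dictionaries is a simplification assumed for this example.

```python
def changed_constructs(before: dict, after: dict) -> dict:
    """Map each construct name to 'added', 'deleted' or 'modified',
    given the construct bodies before and after the fix commit."""
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        if name not in after:
            changes[name] = "deleted"
        elif name not in before:
            changes[name] = "added"
        elif before[name] != after[name]:
            changes[name] = "modified"
    return changes


# Hypothetical snapshots of one class, taken before and after the fix commit.
before_fix = {
    "Parser.parse": "String parse(String in) { return eval(in); }",
    "Parser.close": "void close() {}",
}
after_fix = {
    "Parser.parse": "String parse(String in) { check(in); return eval(in); }",
    "Parser.check": "void check(String in) { if (in == null) throw new IllegalArgumentException(); }",
    "Parser.close": "void close() {}",
}

for name, kind in changed_constructs(before_fix, after_fix).items():
    print(name, kind)
```

Here only the parse method is reported as modified and the new check method as added, while the untouched close method is left out. In practice a reviewer would still have to confirm that the commit is really the fix and that it contains no unrelated changes, which is where the human involvement comes in.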

Code-based vulnerability detection is the approach underlying the vulnerability assessment tool open-sourced by SAP (documentation available here). The tool also comes with an existing knowledge base containing the vulnerable code for hundreds of Java vulnerabilities.

Stay tuned for more blog posts!