Usage Analytics in the SAP Cloud SDK

sander_wozniak · ‎10-23-2018

To improve your experience with the SAP Cloud SDK, we recently started to include anonymized usage analytics in the SDK’s Maven archetypes and its Continuous Delivery Toolkit, so that we can better understand how to develop our offering further. Questions that we often ask ourselves include:

How many projects are using the Java libraries and the Continuous Delivery Toolkit?

Which versions of the SDK are used?

Which modules of the SDK are used? Which are not?

Which operating systems and which Java and Maven versions are used?

However, given that we do not know all of our users, we are not able to answer these questions. By integrating usage analytics into our offering, we hope to gain more insight here.

Since we know that collecting usage data is a very sensitive topic, this blog post aims to provide full transparency on which data we collect and how we collect it.

At SAP, we respect your privacy and intellectual property. Therefore, we only collect non-sensitive data about the use of the SAP Cloud SDK. We do not collect any personal information or data about the inner workings of your project. Any remotely sensitive values like your project’s Maven group or artifact IDs are obfuscated so that no one except you and your team knows or can infer details about your project.

By default, usage data is collected by the SDK. Of course, you can always decide to opt out as described below.

In the following, let us first look at some fundamentals of privacy protection and then at how we apply these mechanisms to protect both your privacy as an individual and sensitive details about the business case or technical realization of your project.

Fundamentals of Privacy Protection

Privacy is often discussed in terms of anonymity. To be a bit more formal, anonymity is the inability of a malicious party (an attacker) to identify an entity (e.g., a developer or project) within a set of entities, the so-called anonymity set.

In order to protect the anonymity of entities, several established techniques can be applied:

Pseudonymization: The identity of an entity can be obfuscated by using a pseudonym, which is an identifier that is different from the actual identity of an entity. For example, given a person named Alice, a pseudonym could be simply another name like Bob. Using a pseudonym to achieve anonymity is also referred to as pseudonymity. Note that while the term anonymity corresponds to the inability of an attacker to identify an entity within the anonymity set, pseudonymity only refers to the use of a pseudonym. Therefore, pseudonymity is no guarantee for real anonymity. For example, if someone learns that Alice is in fact Alice and not Bob, the pseudonym is no longer of use. Inferring pseudonyms is usually achieved by consulting or combining several sources of information. In our example, an attacker might simply observe that “Bob” lives in a flat with the nameplate “Alice” to render the pseudonym obsolete. A way to mitigate this is to use the same pseudonym for multiple entities, thereby forming a group pseudonym which in itself again forms an anonymity set.

Data perturbation: Group pseudonyms bring us to the concept of k-anonymity, where the idea is to make each entity indistinguishable from at least k - 1 other entities. Beyond that, it is possible to introduce random noise into the data to protect the privacy of entities. While this keeps the granularity of the data, such noise comes at the cost of decreased data accuracy.

Data generalization: If the granularity of data is not essential, it is also possible to generalize data, for example by aggregating individual data points into an average.

Data suppression: In order to protect the privacy of users, it is possible to suppress such data, for example, by allowing users to opt out of the data collection.
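To make k-anonymity and generalization a bit more tangible, here is a minimal sketch in Java. It is not taken from the SDK: the class name KAnonymitySketch, both method names, and the idea of treating a Java version string as the identifying attribute are assumptions chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical and not part of the SAP Cloud SDK.
final class KAnonymitySketch {

    /** Returns true if every distinct value occurs at least k times in the data set. */
    static boolean isKAnonymous(List<String> values, int k) {
        Map<String, Integer> groupSizes = new HashMap<>();
        for (String value : values) {
            groupSizes.merge(value, 1, Integer::sum);
        }
        return groupSizes.values().stream().allMatch(size -> size >= k);
    }

    /** Generalizes a version string by keeping only its first two dot-separated components. */
    static String generalizeVersion(String version) {
        String[] parts = version.split("\\.");
        return parts.length < 2 ? version : parts[0] + "." + parts[1];
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("1.8.0_181", "1.8.0_171", "1.8.0_181");

        // Generalize every value to reduce its granularity.
        List<String> generalized = new ArrayList<>();
        for (String version : raw) {
            generalized.add(generalizeVersion(version));
        }

        System.out.println("raw data 2-anonymous:         " + isKAnonymous(raw, 2));
        System.out.println("generalized data 2-anonymous: " + isKAnonymous(generalized, 2));
    }
}
```

In this toy data set, the exact version "1.8.0_171" occurs only once and would single out one entity. After generalization, all three entries share the same value, at the price of losing the patch-level detail.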

Privacy-Aware Usage Data Collection

To correlate usage data by project across the Java libraries and the build pipeline of the Continuous Delivery Toolkit, we need a unique project identifier. Naturally, the Maven group and artifact IDs represent an appropriate identifier. However, developers may not want the Maven group and artifact IDs to be disclosed. While this may not be highly sensitive information, disclosing it is undesirable nonetheless.

Therefore, we use pseudonyms that are based on a project’s group and artifact IDs to represent a project. We only include the group and artifact ID, not the version, so that there is no way to learn about the development progress or release cycle of a project. Furthermore, assuming that there is usually a team of developers working on a project, this offers an additional layer of k-anonymity.

Finally, we include a project-specific, securely generated random secret in the pseudonym. This secret is not transmitted when collecting usage data and is kept private within your project. The generated secret has the same 256-bit length as the pseudonym, which makes it practically infeasible to guess the underlying Maven group or artifact identifiers from a given pseudonym.

The pseudonym for a project is generated as follows:

projectId = h(groupId + artifactId + salt)

where h(x) is a cryptographic hash function (at the time of this writing, SHA-256), + the concatenation of Strings, and salt a random value with the same bit length as h(x). The salt value is generated with a cryptographic random number generator (Java's SecureRandom).

The random salt value is generated per project by the s4sdk-maven-plugin. This salt is stored within the project’s POM file and only used for computing the hash value. It should never be shared with anyone who is not considered trustworthy, since knowing the salt reduces the effort of brute-force guessing a project’s group and artifact ID.

For example, given the group ID com.company, the artifact ID app, and a salt value e9a94b0ee5c8b75a3834ed6264dfda51bff4642f94e53e22d1cad8b340d1584c, the resulting hash value for the project is 0335b62e0bb82b11888302be2cb8160e0c0976cac7014b7866c3526ae3f7b0ab.
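The formula above can be sketched in a few lines of Java. This is not the plugin's actual implementation: the class name ProjectPseudonymSketch and its method names are hypothetical, and the sketch assumes that the salt is a hex-encoded string that is concatenated as text and hashed as UTF-8. Since the exact encoding used by the s4sdk-maven-plugin is not described here, the sketch may not reproduce the example hash value given above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Illustrative sketch only; names and encoding choices are assumptions, not the plugin's code.
final class ProjectPseudonymSketch {

    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Encodes bytes as a lowercase hexadecimal string. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    /** Generates a 256-bit salt with SecureRandom and returns it as 64 hex characters. */
    static String generateSalt() {
        byte[] salt = new byte[32]; // 32 bytes = 256 bits, the output length of SHA-256
        new SecureRandom().nextBytes(salt);
        return toHex(salt);
    }

    /** Computes projectId = SHA-256(groupId + artifactId + salt), assuming UTF-8 text. */
    static String projectId(String groupId, String artifactId, String salt) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            String input = groupId + artifactId + salt;
            return toHex(sha256.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String salt = generateSalt();
        System.out.println("salt      = " + salt);
        System.out.println("projectId = " + projectId("com.company", "app", salt));
    }
}
```

Because the salt is drawn at random, each run prints a different salt and project ID. With a fixed salt, the same group and artifact IDs always yield the same pseudonym, which is what allows usage data to be correlated per project.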

If you do not want to use a salt value, you can disable its automatic generation by setting the configuration flag generateSalt to false. However, please be aware that this will make brute-force attacks aiming to infer the Maven group or artifact ID easier.
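To illustrate why an unsalted pseudonym is easier to attack, the following Java sketch tries a small list of candidate group and artifact IDs against an observed hash value. Everything in it is hypothetical: the class UnsaltedGuessSketch, the candidate lists, and the assumption that an unsalted pseudonym is simply SHA-256 over the concatenated IDs as UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; it assumes unsalted pseudonyms are SHA-256(groupId + artifactId).
final class UnsaltedGuessSketch {

    /** Returns the SHA-256 digest of the UTF-8 encoded input as lowercase hex. */
    static String sha256Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Tries all candidate pairs; returns "groupId:artifactId" on a match, otherwise null. */
    static String guess(String observedId, List<String> groupIds, List<String> artifactIds) {
        for (String groupId : groupIds) {
            for (String artifactId : artifactIds) {
                if (sha256Hex(groupId + artifactId).equals(observedId)) {
                    return groupId + ":" + artifactId;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> groupIds = Arrays.asList("org.example", "com.company");
        List<String> artifactIds = Arrays.asList("service", "app");

        String unsalted = sha256Hex("com.company" + "app");
        String salted = sha256Hex("com.company" + "app" + "some-secret-salt");

        System.out.println("unsalted guess: " + guess(unsalted, groupIds, artifactIds));
        System.out.println("salted guess:   " + guess(salted, groupIds, artifactIds));
    }
}
```

Without a salt, an attacker only has to enumerate plausible group and artifact IDs. With a secret 256-bit salt, the same enumeration finds no match unless the salt is also known.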

In addition to the project identifier above, we collect generic information such as the current type of operating system, the current Java and Maven versions, and which modules of the SDK are being used. For a detailed, up-to-date overview of the data that we collect, you can either refer to this page or look at the logger output that is written by both the s4sdk-maven-plugin and the build pipeline of the Continuous Delivery Toolkit.

Opt-Out

Collection of usage data is enabled by default.

If you wish to disable it, please perform the following steps:

For both the SAP Cloud SDK Pipeline and SAP/jenkins-library, set collectTelemetryData to false in your pipeline_config.yml in the general section as in this example:
```
general:
  collectTelemetryData: false
```

Set the skip (deprecated) or skipUsageAnalytics flag in the configuration of the s4sdk-maven-plugin to true:

```
<plugin>
    <groupId>com.sap.cloud.s4hana.plugins</groupId>
    <artifactId>s4sdk-maven-plugin</artifactId>
    ...
    <configuration>
        <skipUsageAnalytics>true</skipUsageAnalytics>
    </configuration>
    ...
</plugin>
```