Introducing the Artifact of the Month

mariusobert · 08-24-2021

Over the past few years, there has been a clear trend: SAP embraces open-source technologies more and more. SAP leverages open-source technologies like Cloud Foundry and Kubernetes in its offerings, and SAP technologies are becoming embeddable on open-source platforms. Node.js applications that store data in an SAP HANA database and Jupyter Notebooks that process data from SAP Data Intelligence are just two examples. This development brings a need for a growing number of utility packages that support developers, such as database clients, authentication libraries, scaffolding tools, and many more. All these packages make developers' lives easier, but staying ahead of the curve and being aware of all available packages can also be an additional burden. Therefore, I want to introduce you to a new open-source project that ranks the popularity of all kinds of development artifacts monthly: The Artifact of the Month.

Update December 2021: Changed the described behavior as the web app evolved (thanks to community contributions).

A new website that gives you an overview of all tracked development artifacts.

What is this all about

The idea for this project was born a few months ago when I accidentally noticed that one Docker image that I used for a blog post got over 10k pulls in a few days. It felt strange, as the other images used in the post didn’t get nearly as many downloads. In this particular case, the image was probably just pulled that often by a Kubernetes cluster that didn’t work properly. Still, it could just as well have been an image that is super useful to the SAP community. It made me wonder how many other images might be trending while being totally off my radar, and how I could bring all relevant Docker images onto my radar. And should this be limited to Docker images, or should I keep an eye on other things as well? How about NPM packages, GitHub Repositories, PyPI packages, etc.?

Knowing which artifacts are currently trending is probably not just important to me. I guess many community members would be interested in this as well. When I saw Holger Schäfer's tweet a few months back, it reassured me that this would be helpful to many developers:

https://twitter.com/hschaefer123/status/1412648384842944514

I hope the Artifact of the Month can help many of you to find all sorts of hidden gems more easily.

Where is the data coming from

I already mentioned a few data sources (aka providers) above. What was important in their selection was that each data point has exactly one key indicator, such as monthly downloads or gained forks. It was also an essential requirement that this data is publicly verifiable, which rules out indicators that only the package author can see, such as monthly views on a GitHub repository or Maven packages. Another requirement was that it shouldn’t matter if the artifact is provided by SAP, a partner, or any other third party. The only technological condition that the artifact needs to meet is that it is related to SAP technology.

With all this in mind, I chose the following providers for the first version of this project:

DockerHub Images
- As mentioned before, my initial idea was to collect the “pull count” of Docker images to find out which ones are the most popular. Luckily, the DockerHub API provides all the data we need. At the time of this writing, it was sufficient to retrieve all images of a DockerHub user. Additional support for individual images shouldn’t be too hard to implement when needed.

NPM Packages
- Developers download NPM packages that they need for their projects. Higher usage of a package indicates a higher value to the overall community. And the npm registry kindly provides this usage data so that the provider can pull information about individual packages and all packages that belong to a particular scope.

PyPI Packages
- It’s the same thing as npm, just for the Python ecosystem. The only difference here is that the data needs to be pulled from a public Google BigQuery dataset. (Btw: thank you, vitaliy.rudnytskiy.) This provider could also be extended to individual projects.

GitHub Repositories
- GitHub Repositories serve multiple purposes. One is that developers use them to share source code, fork, improve, and develop projects. Another purpose is simply the sharing of knowledge by providing easy-to-understand samples. To consider both scenarios, we calculate the sum of each repository's forks and its star count. GitHub’s Octokit API delivers the data, and it makes sense to authenticate to get a higher quota. This provider collects data from individual repos as well as from entire organizations and users.

I would have also liked to include GitHub Packages as a separate provider, but the Octokit API, unfortunately, doesn't provide a download count (yet).
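To make the "exactly one key indicator" idea more concrete, here is a minimal sketch of how the responses of the different registries could be mapped to one common record. The `Artifact` shape and the function names are hypothetical and not taken from the project's source code. The response fields (`pull_count`, `downloads`, `forks_count`, `stargazers_count`) are, to my knowledge, the ones the public APIs return, but treat them as assumptions and check the respective API documentation.

```typescript
// Hypothetical common shape: every artifact carries exactly one key indicator.
type ArtifactType = "docker-image" | "npm-package" | "github-repository";

interface Artifact {
  type: ArtifactType;
  name: string;
  link: string;
  indicator: number; // pulls, monthly downloads, or forks + stars
}

// Assumed subset of one entry in the `results` array of
// GET https://hub.docker.com/v2/repositories/<user>/
interface DockerHubRepo {
  namespace: string;
  name: string;
  pull_count: number;
}

// Assumed subset of
// GET https://api.npmjs.org/downloads/point/last-month/<package>
interface NpmDownloadPoint {
  package: string;
  downloads: number;
}

// Assumed subset of GET https://api.github.com/repos/<owner>/<repo>
interface GitHubRepo {
  full_name: string;
  html_url: string;
  forks_count: number;
  stargazers_count: number;
}

function fromDockerHub(repo: DockerHubRepo): Artifact {
  const name = `${repo.namespace}/${repo.name}`;
  return {
    type: "docker-image",
    name,
    link: `https://hub.docker.com/r/${name}`,
    indicator: repo.pull_count,
  };
}

function fromNpm(point: NpmDownloadPoint): Artifact {
  return {
    type: "npm-package",
    name: point.package,
    link: `https://www.npmjs.com/package/${point.package}`,
    indicator: point.downloads,
  };
}

function fromGitHub(repo: GitHubRepo): Artifact {
  return {
    type: "github-repository",
    name: repo.full_name,
    link: repo.html_url,
    // Forks and stars are summed to cover both "develop" and "learn" usage.
    indicator: repo.forks_count + repo.stargazers_count,
  };
}
```

Keeping the mapping functions free of network calls makes them easy to test. The actual HTTP requests, including authentication for GitHub, would live in a thin layer around them.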

How is it implemented

This project itself should, naturally, be an open-source project. And it should make use of one of SAP's very own open-source frameworks for the web UI. In the last few months, I did a lot with UI5 Web Components, so I felt that it was time to go back to my roots and use OpenUI5. For the "fetch logic," I used TypeScript as it's (currently) my favorite programming language.

The persistence and platform question was a little bit harder. I was considering running the entire app with a DB on the free tier model for SAP BTP, but I thought that would be overkill, at least for now. Instead, I decided to store the data directly in the JSON model of the OpenUI5 web app and host everything on GitHub Pages.

Now that the data persistence and the hosting problems are solved, there is only one backend task left: data processing. As the code is already hosted on GitHub and GitHub Pages exposes the web app, it makes sense to use GitHub Actions for this task. The first job builds the web app on every push based on the latest data. This ensures that changes to the web app are deployed immediately. The second job is invoked automatically by the platform on the first of every month (or manually). It fetches the latest download stats for all artifacts, calculates the rankings, and rebuilds the web app. With all these pieces in place, this is essentially a serverless web app that updates its data on its own.

If you are interested, you could also run the application locally. For this, you need to clone the repo and install its dependencies.

git clone https://github.com/SAP-samples/artifact-of-the-month

cd artifact-of-the-month

npm install

Some usage data can be downloaded anonymously, while other data can only be consumed with an API key. The GitHub API and the Google BigQuery table are examples of the latter. Once you have provisioned the keys for these providers, create an analyzer/.env file with the following content:

GITHUB_TOKEN=ghp_<token>

GOOGLE_CLOUD_PROJECT=<project id>

GOOGLE_APPLICATION_CREDENTIALS=<path to the key file>

And make sure that the service key is stored in the referenced file. Now, you can run the following commands to fetch the latest data, calculate the ranking, and start the development server locally:

npm run fetch

npm run analyze

npm start
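If the fetch step fails, a missing or empty variable in analyzer/.env is a likely cause. The following is a minimal sketch of a check you could run first. The variable names come from the .env example above, while the function `findMissingVariables` is hypothetical and not part of the project.

```typescript
// Variable names taken from the analyzer/.env example.
const REQUIRED_VARIABLES = [
  "GITHUB_TOKEN",
  "GOOGLE_CLOUD_PROJECT",
  "GOOGLE_APPLICATION_CREDENTIALS",
] as const;

// Returns the names of all required variables that are unset or blank.
// Pass in whatever object holds the loaded environment,
// e.g. `process.env` in Node.js.
function findMissingVariables(
  env: Record<string, string | undefined>
): string[] {
  return REQUIRED_VARIABLES.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

An empty result means all three variables are present. It does not verify that the token is valid or that the key file actually exists at the referenced path.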

The rankings of the application

I deliberately kept the application as simple as possible (or at least, I think I did). It's a typical Single-Page-Application (SPA) with a collapsible menu on the left that can be used to navigate between the four main pages. I was also inspired to add a special widget to the ShellBar that allows users to switch between dark and light mode. To be honest, this was mainly for fun and it's not very functional, but I still like it a lot.

The first three pages display the various trends, and each artifact/item uses a number indicator to show the rank (and a comparison to the previous month's rank). Besides, the list items also provide some basic information about the artifact and a link that will redirect the user to the respective registry to find more information.

Overall Trends
- This ranking considers all artifacts, and each artifact's score is calculated based on the Z-Score. For the calculation, we use the current indicator and the list of the indicators of the past few months. Artifacts that don't have data points for the past two months won't be considered. For npm packages, for example, the parameters are the current monthly downloads and the download counts of the previous months. The advantage is that utility packages with a consistently high count, such as xsenv, only appear in the ranking when their downloads are increasing, not when they are constant or declining.

New Artifacts
- The "New Artifacts" ranking only considers items added to the catalog in the past two months so that you can see which artifacts were just released and might be of interest for your next project. New items naturally don't have previous data points. Therefore, we calculate the Z-Score here against their peer artifacts. E.g., DockerHub images with an above-average pull count will receive a high rank here in the first few months.

Recently Updated Artifacts
- The (for now) last ranking only considers artifacts updated in the previous month. Some of the tags (stars, forks, last updated) are refreshed on a daily basis, which might impact the ranking here. Therefore, we no longer display a rank on this page.
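The two Z-Score variants described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, I use the population standard deviation, I require at least two historical data points, and I treat a standard deviation of zero as a neutral score of 0.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (an assumption; a sample-based
// variant would divide by n - 1 instead of n).
function standardDeviation(values: number[]): number {
  const m = mean(values);
  const variance = mean(values.map((v) => (v - m) * (v - m)));
  return Math.sqrt(variance);
}

// "Overall Trends": compare the current indicator with the artifact's
// own history. Returns undefined if there is not enough history.
function trendScore(current: number, history: number[]): number | undefined {
  if (history.length < 2) return undefined;
  const sd = standardDeviation(history);
  if (sd === 0) return 0; // flat history: avoid dividing by zero
  return (current - mean(history)) / sd;
}

// "New Artifacts": compare each indicator with its peers,
// e.g. all DockerHub images of the current month.
function peerScores(indicators: number[]): number[] {
  const m = mean(indicators);
  const sd = standardDeviation(indicators);
  return indicators.map((v) => (sd === 0 ? 0 : (v - m) / sd));
}

interface Scored {
  name: string;
  score: number;
}

// Highest score gets rank 1.
function rankByScore(items: Scored[]): Array<Scored & { rank: number }> {
  return items
    .slice()
    .sort((a, b) => b.score - a.score)
    .map((item, index) => ({ ...item, rank: index + 1 }));
}
```

With a history of `[2, 4, 4, 4, 5, 5, 7, 9]` (mean 5, standard deviation 2), a current value of 9 yields a trend score of 2. A package with constantly high downloads scores around 0, no matter how large the absolute numbers are, which is exactly why such packages drop out of the trend ranking.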

And the last page lists all items that were considered for the current month. These list items look slightly different from the previous ones as they don't have a rank indicator. The list includes a few hundred items, so filtering by type alone isn't enough. This is why there is an additional search box to filter list items based on their name, description, or type:

The all-items view
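The search behavior described above boils down to a case-insensitive substring match across three fields. Here is a minimal sketch in plain TypeScript. The `ListItem` shape and the function `filterItems` are hypothetical, and the real app presumably relies on the filter mechanisms of OpenUI5 rather than on a hand-written function like this one.

```typescript
// Hypothetical shape of one entry in the all-items list.
interface ListItem {
  name: string;
  description: string;
  type: string;
}

// Keeps every item whose name, description, or type contains the query.
// The comparison ignores case; an empty query returns all items.
function filterItems(items: ListItem[], query: string): ListItem[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return items;
  return items.filter((item) =>
    [item.name, item.description, item.type].some(
      (field) => field.toLowerCase().indexOf(needle) !== -1
    )
  );
}
```

Searching for "npm", for example, would return all items of that type plus any item that mentions npm in its name or description.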

Improvements are very much appreciated

In the current version, the catalog contains more than 1750 artifacts. I'm sure that the actual number of relevant artifacts is way larger. So I want to invite you to open a pull request if you can think of one or multiple of the following improvements:

Adding more artifacts to the existing providers (e.g., more GitHub Repos or NPM packages you use frequently)

Improving the implementation logic of existing providers

Adding additional providers that I didn't think of

Making general improvements to the application

Conclusion

I think this project is a nice open-source showcase of open-source packages built with open-source technologies, running on an open-source platform. But I guess I'm biased. So I'd be interested to hear what you think of it. Let me know in the comments. 🙂
