Introducing the Artifact of the Month
What is this all about?
The idea for this project was born a few months ago when I noticed by accident that a Docker image I had used for a blog post got over 10k pulls in a few days. That felt strange, as the other images used in the post didn't get nearly as many downloads. In this particular case, the image had probably just been pulled that often by a misbehaving Kubernetes cluster. But it could just as well have been an image that is super useful to the SAP community. It made me wonder: How many other artifacts might be trending while being totally off my radar, and how could I bring all relevant Docker images onto my radar? And should this be limited to Docker images, or should I keep an eye on other things as well, such as NPM packages, GitHub repositories, and PyPI packages?
Knowing which artifacts are currently trending is probably not just important to me; I guess many community members would be interested in this as well. When I saw Holger Schäfer's tweet a few months back, it reassured me that this is helpful to many developers:
😀 there are so many hidden gems, you just have to find it. It feels like they are using gamification for each monthly update, because also docs are overhauled and i am somehow forced to review docs 😂. This repeatment burns everything into my brain.
— Holger Schäfer (@hschaefer123) July 7, 2021
I hope the Artifact of the Month can help many of you to find all sorts of hidden gems more easily.
Where is the data coming from?
I already mentioned a few data sources (aka providers) above. What was important in their selection was that each data point has exactly one key indicator, such as monthly downloads or gained forks. It was also an essential requirement that this data is publicly verifiable, which rules out indicators that only the package author can see, such as monthly views of a GitHub repository or Maven packages. Another requirement was that it shouldn't matter whether the artifact is provided by SAP, a partner, or any other third party. The only condition an artifact needs to meet is that it is related to SAP technology.
With all this in mind, I chose the following providers for the first version of this project:
- DockerHub Images
- As mentioned before, my initial idea was to collect the "pull count" of Docker images to find out which ones are the most popular. Luckily, the DockerHub API provides all the data we need. At the time of this writing, it was sufficient to retrieve all images of a given DockerHub user; additional support for individual images shouldn't be too hard to implement when needed.
- NPM Packages
- Developers download NPM packages that they need for their projects. Higher usage of a package indicates a higher value to the overall community. And the npm registry kindly provides this usage data so that the provider can pull information about individual packages and all packages that belong to a particular scope.
- PyPI Packages
- It’s the same thing as npm, just for the Python ecosystem. The only difference here is that the data needs to be pulled from a public Google BigQuery dataset. (Btw: thank you, Witalij Rudnicki, for helping me out with the query to fetch the data). We currently only pull data from selected users, but this could be extended to individual projects.
- GitHub Repositories
- GitHub repositories serve multiple purposes. One is that developers use them to share source code, fork, improve, and develop projects. Another is simply sharing knowledge by providing easy-to-understand samples. To cover both scenarios, we calculate the sum of a repository's forks and its star count. GitHub's Octokit API delivers the data, though it makes sense to authenticate to get a higher rate limit. This provider can collect data from individual repos or from entire organizations and users.
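The fork-plus-star indicator can be sketched in a few lines. This is an illustrative sketch, not the project's actual code: it uses a plain fetch against the public GitHub REST API instead of Octokit to stay dependency-free, and `repoScore` / `fetchRepoScore` are hypothetical names:

```typescript
// Sketch (assumptions: Node 18+ with global fetch, optional GITHUB_TOKEN
// in the environment for a higher rate limit).

// The key indicator for a repository: stars + forks, as described above.
export function repoScore(stars: number, forks: number): number {
  return stars + forks;
}

// Fetch the indicator for a single repository from the GitHub REST API.
async function fetchRepoScore(owner: string, repo: string): Promise<number> {
  const headers: Record<string, string> = process.env.GITHUB_TOKEN
    ? { Authorization: `token ${process.env.GITHUB_TOKEN}` }
    : {};
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}`, {
    headers,
  });
  const data = await res.json();
  return repoScore(data.stargazers_count, data.forks_count);
}
```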
I would have also liked to include GitHub Packages as a separate provider, but the Octokit API, unfortunately, doesn’t provide a download count (yet).
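To make the pull-based providers more concrete, here is a minimal TypeScript sketch (again, not the project's actual code) for the two simplest indicators. The DockerHub repository-list endpoint and the npm downloads endpoint are the public APIs mentioned above; the function and type names are illustrative:

```typescript
// Sketch of two providers, assuming Node 18+ (global fetch). The endpoint
// URLs are the public DockerHub and npm registry APIs; everything else
// (names, shapes) is illustrative.

export interface Artifact {
  name: string;
  indicator: number; // the provider's single key indicator
}

// Map a DockerHub repository list to artifacts keyed by pull count.
export function dockerHubToArtifacts(
  results: { name: string; pull_count: number }[]
): Artifact[] {
  return results.map((r) => ({ name: r.name, indicator: r.pull_count }));
}

// All images of a DockerHub user, each with its total pull count.
async function fetchDockerHubUser(user: string): Promise<Artifact[]> {
  const res = await fetch(
    `https://hub.docker.com/v2/repositories/${user}/?page_size=100`
  );
  const body = await res.json();
  return dockerHubToArtifacts(body.results);
}

// Monthly downloads of a single npm package.
async function fetchNpmDownloads(pkg: string): Promise<Artifact> {
  const res = await fetch(
    `https://api.npmjs.org/downloads/point/last-month/${pkg}`
  );
  const body = await res.json();
  return { name: pkg, indicator: body.downloads };
}
```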
How is it implemented?
This project itself should, naturally, be an open-source project, and it should make use of one of SAP's very own open-source frameworks for the web UI. In the last few months, I did a lot with UI5 Web Components, so I felt it was time to go back to my roots and use OpenUI5. For the "fetch logic," I used TypeScript, as it's (currently) my favorite programming language.

The persistence and platform questions were a little harder. I considered running the entire app with a database on the free tier model for SAP BTP, but I thought that would be overkill, at least for now. Instead, I decided to store the data directly in the JSON model of the OpenUI5 web app and host everything on GitHub Pages.

With data persistence and hosting solved, there is only one backend task left: data processing. As the code is already hosted on GitHub and GitHub Pages exposes the web app, it makes sense to use GitHub Actions for this task. The first job builds the web app on every push based on the latest data, which ensures that changes to the web app are deployed immediately. The second job is invoked automatically by the platform on the first of every month (or manually); it fetches the latest download stats for all artifacts, calculates the rankings, and re-builds the web app. With all these pieces in place, this is essentially a serverless web app that updates its data on its own.
If you are interested, you could also run the application locally. For this, you need to clone the repo and install its dependencies.
```shell
git clone https://github.com/SAP-samples/artifact-of-the-month
cd artifact-of-the-month
npm install
```
Some usage data can be downloaded anonymously, while other data can only be consumed with an API key; the GitHub API and the Google BigQuery table are examples of the latter. Once you have provisioned the keys for these providers, create an analyzer/.env file with the following content:
```shell
GITHUB_TOKEN=ghp_<token>
GOOGLE_CLOUD_PROJECT=<project id>
GOOGLE_APPLICATION_CREDENTIALS=<path to the key file>
```
And make sure that the service key is stored in the referenced file. Now you can run the following commands to fetch the latest data, calculate the rankings, and start the development server locally:
```shell
npm run fetch
npm run analyze
npm start
```
The rankings of the application
I deliberately kept the application as simple as possible (or at least, I think I did). It's a typical single-page application (SPA) with a collapsible menu on the left that can be used to navigate between the four main pages. I was also inspired to add a special widget to the ShellBar that allows users to switch between dark and light mode. To be honest, this was mainly for fun and less about functionality, but I still like it a lot.
The first three pages display the various trends, and each artifact uses a number indicator to show its rank (and a comparison to the previous month's rank). In addition, the list items provide some basic information about the artifact and a link that redirects the user to the respective registry for more details.
- Overall Trends
- This ranking considers all artifacts, and each artifact's score is calculated as a Z-score. For the calculation, we use the current indicator value and the list of indicator values from the past few months; artifacts that don't have data points for the past two months aren't considered. For npm packages, for example, the parameters are the current monthly downloads and the download counts of the previous months. The advantage of this approach is that utility packages with a constantly high count, such as xsenv, only appear in the ranking when their downloads are increasing, not when they are constant or declining.
- New Artifacts
- The "New Artifacts" ranking only considers items added to the catalog in the past two months, so you can see which artifacts were just released and might be of interest for your next project. New items naturally don't have previous data points; therefore, we calculate the Z-score against their peer artifacts instead. E.g., DockerHub images with an above-average pull count will rank highly here in their first few months.
- Recently Updated Artifacts
- The (for now) last ranking only considers artifacts updated in the previous month. Calculation-wise, it applies the same principles as the “Overall Trends”.
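The Z-score idea behind these rankings boils down to very little code. Here is a simplified reconstruction (my own sketch, not the project's actual implementation) covering both variants: scoring against an artifact's own history, and scoring a new artifact against its peers:

```typescript
// Simplified sketch: a value's z-score relative to a reference series
// (either the artifact's own indicator history, or the indicators of
// its peer artifacts for new items).

export function zScore(value: number, reference: number[]): number {
  const mean = reference.reduce((sum, v) => sum + v, 0) / reference.length;
  const variance =
    reference.reduce((sum, v) => sum + (v - mean) ** 2, 0) / reference.length;
  const stdDev = Math.sqrt(variance);
  // A flat reference series means there is no trend to reward: score 0.
  return stdDev === 0 ? 0 : (value - mean) / stdDev;
}

// "Overall Trends": current downloads against the artifact's own history.
// A constantly popular package scores 0; only growth yields a high score.
const steady = zScore(1000, [1000, 1000, 1000]); // 0
const rising = zScore(1500, [800, 1000, 1200]); // > 0

// "New Artifacts": a new image's pull count against its peers' pull counts.
const newcomer = zScore(5000, [1000, 2000, 3000]); // above average, so > 0
```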
And the last page lists all items that were considered for the current month. These list items look slightly different from the previous ones, as they don't have a rank indicator. The list includes a few hundred items, so it doesn't make sense to filter them based on their type alone. Instead, there's a search box to filter list items based on their name, description, or type:
Improvements are very much appreciated
In the current version, the catalog contains more than 1500 artifacts. I’m sure that the actual number of relevant artifacts is way larger. So I want to invite you to open a pull request if you can think of one or multiple of the following improvements:
- Adding more artifacts to the existing providers (e.g., more GitHub Repos or NPM packages you use frequently).
- Improving the implementation logic of existing providers
- Adding providers that I didn't think of
- General improvements to the application
I think this project is a nice open-source showcase of open-source packages, built with open-source technologies and running on an open-source platform. But I guess I'm biased. So I'd be interested in what you think of it; let me know in the comments. 🙂