Data pipelines and APIs – Consider this when building your next data pipeline
In this blog post I will cover some of the challenges we can face when building a data pipeline that needs to interact with an API provided by a cloud application/service. I will also include some examples of how these challenges can be addressed and some common practices when building data pipelines.
Long gone are the days when we had access to all the databases of all the software we used within our company. With companies embracing the cloud, we’ve seen a move away from purchasing software licences, procuring infrastructure and running the software on-prem, which generally also meant deploying a server to host the database that the purchased software required. Nowadays, instead of installing everything ourselves, we often end up adopting a cloud service, where we might not have access to the underlying database, as it can be a software-as-a-service (SaaS) solution, but we can access the data through an application programming interface (API).
Now we face different challenges when extracting the data: instead of having direct access to a database, we retrieve the data via an API. What hasn’t changed is the need to access the underlying data for further processing/analysis in our data warehouse(s) and to build the reports the business requires to operate. We still need to build our extract-transform-load (ETL) pipelines, or perhaps follow an extract-load-transform (ELT) approach, to get the data from the source system to our data warehouse/data lake.
Fortunately, there is no need to build these data pipelines from scratch. SAP already provides a number of products, e.g. SAP Cloud Integration, SAP Data Intelligence, that can simplify building data pipelines and can help you manage the challenges discussed in this blog post. Now, let’s find out some of the challenges we can face.
- Authentication: Different services provide different forms of authentication for their APIs. Some examples would be: basic authentication, certificate-based authentication, JWT or OAuth 2.0. In the case of a service that uses JWT or OAuth 2.0 authentication we normally retrieve a token which will be used as part of the authentication of our requests to the API server. This token is valid for a period of time, which varies across services, e.g. 20 minutes. Depending on how long an access token is valid for, it is possible that you will need to deal with a refresh token mechanism as part of your data pipeline, e.g. Authentication & refresh tokens with SAP Ariba APIs. This will be required if the running time of a data pipeline is greater than the time a token is valid for. Depending on the tool used for the data pipeline, this can be managed by the connector/adapter configured to retrieve data from the service.
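As a rough sketch of how a pipeline might keep such a token fresh (the class and parameter names below are illustrative, not part of any SAP SDK):

```python
import time

# A minimal token-cache sketch: refresh the access token slightly before it
# expires, so long-running pipeline steps never send a stale token.
class TokenCache:
    def __init__(self, fetch_token, validity_seconds, safety_margin=60, clock=time.time):
        self._fetch_token = fetch_token    # callable that hits the OAuth 2.0 token endpoint
        self._validity = validity_seconds  # e.g. a 20-minute token -> 1200
        self._margin = safety_margin       # refresh this many seconds early
        self._clock = clock                # injectable clock, handy for testing
        self._token = None
        self._obtained_at = 0.0

    def get(self):
        now = self._clock()
        expiring = now - self._obtained_at > self._validity - self._margin
        if self._token is None or expiring:
            self._token = self._fetch_token()  # re-authenticate / use refresh token
            self._obtained_at = now
        return self._token
```

Every request in the pipeline would then call `cache.get()` instead of holding on to the token it obtained at startup.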
- Network: Although a database can now be hosted in the cloud, previously we were dealing with a database within our network (locally). Local network speeds are faster and generally more reliable than communications going through the internet. We need to take this factor into consideration when building our data pipelines. The internet can sometimes be less reliable than we would like, and lots of things can go wrong when our data needs to travel through a network of networks. See recovery mechanism below.
- API Contracts: Although some APIs provide an API specification, these normally aren’t as simple to navigate/interpret as a traditional database schema, nor as simple to turn into an entity relationship diagram. It is important to get very familiar with the structure of the API response and also with the functionality that the API offers. The structure of the API response tells us what data is available to us. The functionality will help us understand how we can interact with it, e.g. can we sort the elements in the response, can we filter data by providing parameters in the request, do we need to implement pagination? In essence, studying the API contract will best tell us how we can interact with the API, and being deeply familiar with it will save us time and effort down the line.
With some scripting it is possible to generate an Entity-Relationship diagram from data available in an API, which will ease understanding the structure of the data. Check out this blog post – Generating Entity-Relationship diagrams from the SAP Ariba Analytical Reporting API metadata.
- Pagination: We might need a large amount of data from a particular API, and sometimes this data might exceed what the API will return per call. In this case, APIs will normally implement some form of pagination mechanism.
The API response is batched and paginated: the results are returned one page at a time. Unfortunately, there is no standard way for APIs to implement pagination, but the process they follow is very similar. The first request to the API only specifies the data that the client is interested in. The API returns a subset of the data available and informs the client that it has only returned part of the data. It does this by means of a “pagination token”. This pagination token is returned in the response of the request; some APIs include it in the HTTP response headers, others in the body of the response. If the client is interested in retrieving more data that matches its filtering criteria, it needs to include the pagination token in the next request, to indicate to the API which page/batch of data it wants to retrieve. Source: GitHub SAP Samples – Ariba Extensibility Samples repository
- If dealing with large volumes of data, make sure to filter the data where possible and to implement a process that makes additional requests to the source system API while there is more data available for our request.
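That loop can be sketched as follows. The `Records`/`PageToken` field names are hypothetical, since each API places its pagination token differently (headers vs. body); check your API’s contract for the real names.

```python
def fetch_all(fetch_page):
    """Follow a hypothetical 'PageToken' field until the API reports no more pages.

    fetch_page(token) is expected to return a dict shaped like
    {"Records": [...], "PageToken": "..."} with no "PageToken" on the last page.
    """
    records, token = [], None
    while True:
        page = fetch_page(token)              # first call: no token -> first page
        records.extend(page.get("Records", []))
        token = page.get("PageToken")
        if not token:                         # no token returned -> last page reached
            break
    return records
```

In a real pipeline `fetch_page` would wrap the actual HTTP call, passing the token as whatever query parameter or header the API expects.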
- Rate limits: Given that a SaaS solution is shared by multiple companies/customers, the APIs exposed by the SaaS generally implement rate limits. A rate limit restricts the amount of data we can extract per API call or the number of times we can call the API, which is not something we strictly faced when dealing with a database. This can impact how many times we can call an API, and we might need to plan our extraction process accordingly. Get familiar with the API rate limits, as these will help you plan how to schedule your automated pipelines. Also, APIs normally provide headers in the response that indicate the API usage, e.g. X-RateLimit-Limit-hour, X-RateLimit-Remaining-hour. You can track the current API usage by keeping an eye on these headers, e.g. How to track your API usage to avoid exceeding rate limits.
In computer networks, rate limiting is used to control the rate of requests sent or received by a network interface controller. It can be used to prevent DoS attacks and limit web scraping. Source: Wikipedia.
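A small helper like the one below could watch those headers between calls. The header names follow the `X-RateLimit-*-hour` examples mentioned above, but they vary per API gateway, so adjust to what your API actually returns.

```python
def should_pause(headers, min_remaining=5):
    """Decide whether the pipeline should pause before the next API call.

    headers is the dict of HTTP response headers from the previous call.
    If the rate-limit header is absent we assume it is safe to continue.
    """
    remaining = headers.get("X-RateLimit-Remaining-hour")
    if remaining is None:
        return False                  # header not provided: nothing to act on
    return int(remaining) <= min_remaining
```

A scheduler could use this to sleep until the next rate-limit window instead of burning the remaining quota and receiving 429 responses.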
- Source system: Generally, we have no control over the functionality or data exposed by the API providers. APIs exposed by services normally follow a deprecation cycle/policy, e.g. SAP API Deprecation Policy, and it is important to be aware of the release cycles of the product you are interacting with so that you know if there are any changes in the APIs/the data exposed. It is possible that some functionality we rely on is being deprecated and some work might be needed on our end. That said, not everything is bad news: in a new release a new API might become available, providing us with additional functionality/data, and it might make sense to rework our integration to take advantage of it.
- Data processing in the source system: If some data processing is possible in the source system, do it there, e.g. filtering, sorting. Less data will go down the wire, and fewer requests will be needed to gather the data you need. Some services might expose asynchronous data processing, e.g. SAP Ariba Analytical Reporting API – asynchronous jobs, which can end up doing some of the heavy lifting on your behalf. In the case of this SAP Ariba API, you submit a job; the service retrieves large amounts of data on your behalf and packages it in a simple file to download once the job is complete.
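The submit-then-poll pattern behind such asynchronous jobs might look roughly like this. Here `submit`, `get_status` and the `status`/`files` fields are placeholders for whatever the real API defines, not the actual SAP Ariba payload.

```python
import time

def run_async_job(submit, get_status, poll_interval=30, max_polls=120):
    """Sketch of submitting an asynchronous job and polling until it completes.

    submit() returns a job id; get_status(job_id) returns a dict with a
    hypothetical 'status' field ('completed' also carries a 'files' list
    of download links for the packaged results).
    """
    job_id = submit()
    for _ in range(max_polls):
        status = get_status(job_id)
        if status.get("status") == "completed":
            return status.get("files", [])   # download links for the results
        time.sleep(poll_interval)            # give the service time to work
    raise TimeoutError(f"job {job_id} did not complete in time")
```

Keeping the poll interval generous also plays nicely with the rate limits discussed above, since each status check is itself an API call.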
Below are some common practices followed when building data pipelines.
- Monitoring: We want to know how our pipeline executed. It is important to have visibility on the execution and have any data related to the execution available to us, e.g. execution time, parameters.
- Logging: If things go wrong, logging will help us troubleshoot our pipeline and it should be possible to know exactly where it failed and why. Good logging of our pipeline will help us recover faster from an error.
- Recovery mechanism: As explained in the network challenge above, lots of things can go wrong when we are pulling data from services through a network of networks. It is important that we are able to handle scenarios where things go wrong, e.g. timeouts, HTTP errors (500 – Internal Server Error, 429 – API rate limit exceeded), exceptions raised when processing data. In these cases, we should be able either to recover automatically, e.g. in case of a network failure, and schedule a future execution that continues from where the process failed, or to raise an alert if an automated recovery is not possible.
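An automatic-recovery sketch using exponential backoff (a simplified illustration; a production pipeline would also persist its progress so a rerun can continue from where it failed, and inspect HTTP status codes to retry 429/5xx but fail fast on other client errors):

```python
import time

def with_retries(call, max_attempts=4, base_delay=1.0, retryable=(TimeoutError,)):
    """Retry a flaky call with exponential backoff; re-raise once attempts run out."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise                             # out of attempts: let alerting take over
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...
```

Logging each attempt (see logging above) makes it obvious afterwards whether a run succeeded on the first try or limped through several retries.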
- Alerting: When building a data pipeline it is possible that we didn’t take a scenario into consideration, e.g. missing data/mapping or an unhandled exception, and we need to go and fix/enhance our pipeline. This is why alerting is important: if something goes wrong, we need to fix the error as soon as possible.
- Data checks: We live in a world of imperfect data and very likely we will have to deal with it in our transformations. Make sure to add the relevant checks in the transformation/mapping for critical data that is required by the pipeline. If there is a problem, raise an alert.
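A minimal data check might look like this; the field names are purely illustrative, and a real pipeline would validate types and ranges as well as presence.

```python
def check_required_fields(record, required=("id", "amount", "currency")):
    """Return the list of critical fields that are missing or empty in a record.

    A non-empty result means the record should not be loaded as-is; the
    pipeline can then raise an alert or route the record to a quarantine area.
    """
    return [f for f in required if record.get(f) in (None, "")]
```

Running such checks during the transformation step means bad records are caught before they reach the data warehouse, rather than being discovered in a broken report.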
Thanks for reading this far… we’ve covered a lot. I hope this blog post helps you the next time that you are looking to build an extraction mechanism that involves interacting with the APIs provided by a cloud application/service.