Extract System Logs from your Cloud Integration into Splunk/Loggly
Why/What: To accelerate the learning cycle of Build, Observe and Learn, you need robust and reliable observation capabilities.
The technique below describes how to extract system logs from your Cloud Integration (CI) tenants into Splunk, and how to turn that mass of data into a first step of observation. System logs are worth extracting for several reasons:
- Retention: CI system log data is retained in the database for 7 days (Neo) or 30 days (CF), subject to a storage limit*. Extract the logs if you need to look back at history or triage an issue past xx days. Reference: 2582913, [SAP Cloud Integration] How long are logs persisted.
- HTTP: all inbound requests, regardless of iFlow/endpoint availability. For example, you may want to see why a sender is getting a 404 or 401.
- Trace: ALL events, including polling adapter (timer/SFTP/Kafka) activities that are NOT captured by MPL, e.g. missing credentials. iFlows that fail to deploy due to missing dependencies are also captured. If a message fails, you get the error along with the elapsed time (ms) of each step (below).
- Power of search tools (Splunk/SolarWinds), showcased at the end, to help find anomalies that can be shared between teams: learning and self-discovery.
The building blocks are accessing the LogFiles API and pushing to Splunk, which the sections below cover in detail.
It is important to note that system logs are just one extra dataset. From practice we have learned that MPL is structured and contains a richer set of dimensions on how an iFlow actually performs; if thought through thoroughly and combined with sender/receiver context, it can deliver end-to-end telemetry.
Just two words on the above: Santos Kumarv has a nice blog on how to build an iFlow to extract MPL and send it to Splunk, and I have written one for SolarWinds. SAP has a native connector to Splunk and others on its roadmap for an upcoming quarter to push MPL. Please look forward to that, as it is much more robust and offers higher throughput.
According to my sources, extracting system logs to Splunk is not on the current roadmap, hence I want to share this here.
Let’s quickly review how you can access system logs (CF) today through the WebUI below:
Under the LogFile attributes:
A. LogFileType: HTTP vs. TRACE
B. Size: number of entries, with a maximum of 150K before rolling to the next file
C. Application: tenant short name
LastModified: the field we will later use to track which files we download and upload
Name: file name, made of http/trace plus the timestamp of the first event entry
To access the above OData API, you need a service instance and the necessary permissions, as described in the MPL documentation. This requires BTP cockpit access by an administrator.
Now that you have a basic understanding of where the logs are and what the payloads look like, we will go into how to build the iFlows that extract and forward logs into Splunk.
How: Deploy the iFlows as described below and work with your in-house Splunk experts to incorporate the unique HTTP/Trace formatting. The logs arrive as compressed zip files, and ideally Splunk indexes these files for faster search and quicker interpretation.
The extraction design is decoupled: a first iFlow tracks the list of LogFiles, and a second iFlow actually downloads each LogFile (zip) and pushes it to the target (Splunk). There are two reasons for splitting the iFlows. First, as of this writing (Sept 7, 22), the API call to download LogFiles can take upward of 80 seconds on CF, which may cause a timeout.
SAP is aware of this, and performance improvements are coming in future releases to drastically reduce the access time. Second, in case your tenant generates a large number of logs, decoupling allows concurrent and robust download and pushing of these files to Splunk, since each file contains up to 150K lines and is ~3-5 MB zipped and 40 MB+ uncompressed.
To facilitate the decoupling I have chosen asynchronous JMS, so you can also monitor the queue on top of the system files. If you have other alternatives, please share your experience.
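The decoupling idea can be illustrated outside of Cloud Integration with a small in-process analogue. This is only a sketch: it uses a Python queue and threads in place of JMS, and the names `run_pipeline`, `forward` and `workers` are hypothetical, not part of the iFlows described here. One side enqueues the LogFile names, and a bounded number of workers picks them up and forwards them, so a slow download never blocks the listing step.

```python
import queue
import threading


def run_pipeline(file_names, forward, workers=1):
    """Queue every LogFile name, then let `workers` threads forward them."""
    pending = queue.Queue()
    for name in file_names:
        pending.put(name)          # "iFlow 1": track the list of LogFiles

    done = []
    lock = threading.Lock()

    def worker():
        # "iFlow 2": take one file at a time until the queue is empty.
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return
            result = forward(name)  # download + push would happen here
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


# Stand-in for the real download-and-push step.
processed = run_pipeline(["a", "b", "c"], lambda n: n.upper(), workers=2)
```

Keeping `workers` small is the throttle: it caps how many concurrent calls hit the API, which is the same role the JMS consumer setting plays in the design above.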
Timer: set to recurring; ideally run once per day just after 00:00 GMT so that all files from the previous day are extracted. Running too frequently is taxing and increases the chance of duplicate logs.
ServerURL of the tenant and timestamps to track previous runs; 10 mins: optional offset so active logs are not picked up
Credentials for the OData API, as described earlier (MPL)
C: this section of the filter removes the most current LogFile so that a partial log is ignored
/LogFiles/LogFile[ LogFileType = 'TRACE' ]
Filter out current file:
/LogFiles/LogFile[ ( (preceding-sibling::LogFile | following-sibling::LogFile)/LastModified > LastModified ) ]
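To make the effect of the two filters easier to see, here is a minimal Python sketch of the same selection logic. It is an illustration, not the iFlow implementation: the sample XML, the file names and the function name `completed_log_files` are made up, and it assumes LastModified values share one sortable format (such as ISO 8601) so that string comparison matches chronological order.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload shaped like /LogFiles/LogFile from the filters above.
SAMPLE = """<LogFiles>
  <LogFile><Name>http_1</Name><LogFileType>HTTP</LogFileType><LastModified>2022-09-06T08:00:00</LastModified></LogFile>
  <LogFile><Name>trace_1</Name><LogFileType>TRACE</LogFileType><LastModified>2022-09-05T23:59:10</LastModified></LogFile>
  <LogFile><Name>trace_2</Name><LogFileType>TRACE</LogFileType><LastModified>2022-09-06T12:00:00</LastModified></LogFile>
  <LogFile><Name>trace_3</Name><LogFileType>TRACE</LogFileType><LastModified>2022-09-07T00:05:00</LastModified></LogFile>
</LogFiles>"""


def completed_log_files(xml_text, log_type=None):
    """Return names of LogFiles that are not the most recently modified one."""
    files = ET.fromstring(xml_text).findall("LogFile")
    if log_type is not None:
        # First filter: keep one LogFileType only.
        files = [f for f in files if f.findtext("LogFileType") == log_type]
    stamps = [f.findtext("LastModified") for f in files]
    # Second filter: keep a file only if some sibling was modified later,
    # which drops the newest (possibly still growing) file.
    return [
        f.findtext("Name")
        for f in files
        if any(s > f.findtext("LastModified") for s in stamps)
    ]


trace_files = completed_log_files(SAMPLE, "TRACE")
all_files = completed_log_files(SAMPLE)
```

As with the XPath, the comparison is strict, so if two files share the newest LastModified value both are skipped until a later run.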
Step 2: Forward individual LogFiles to Splunk (LogFile_Forwarder)
Config A: define the Splunk HTTP forwarding host and extend the timeout
Splunk headers for the token and additional attributes. The Splunk source(s) are determined dynamically if you keep the names cpi_sys_http and cpi_sys_trace, where the name comes from part of the LogFile XML node Application.
JMS Adapter: by default defined with 1 concurrent extraction against the API; depending on the number of worker nodes in your assigned tenant, this will likely run in multiple threads.
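For readers who want to test the Splunk side before building the iFlow, the forwarding request can be sketched in Python. Several things here are assumptions rather than details from this setup: that the target is Splunk's HTTP Event Collector raw endpoint (`/services/collector/raw`), that the source name is derived from the log type, that the body is already uncompressed log text, and the helper name `build_hec_request`, host and token, which are placeholders. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_hec_request(host, token, log_file_type, body, index="xyz"):
    """Build (but do not send) a POST to the HEC raw endpoint."""
    # Assumption: HTTP -> cpi_sys_http, TRACE -> cpi_sys_trace.
    source = "cpi_sys_" + log_file_type.lower()
    query = urlencode({"source": source, "sourcetype": source, "index": index})
    url = f"https://{host}/services/collector/raw?{query}"
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Splunk {token}"},  # HEC token header
    )


req = build_hec_request("splunk.example.com:8088", "TOKEN", "TRACE", b"line\n")
```

Passing `req` to `urllib.request.urlopen` with a generous `timeout` would send it. Depending on the Splunk version and acknowledgment settings, a channel identifier may also be required, so check with your Splunk administrators.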
Prep Steps for Splunk
Below are the props.conf stanzas needed for HTTP and TRACE. The two are formatted slightly differently: a trace/stack should be represented as a single event block, while HTTP events are represented one per line. Contributed by Splunk expert Jeff Edquid.
[cpi_sys_http]
LINE_BREAKER = ([\r\n]+)
MAX_TIMESTAMP_LOOKAHEAD = 31
TIME_FORMAT = %d/%m/%Y:%H:%M:%S %z
TIME_PREFIX = - - \[
SHOULD_LINEMERGE = false

[cpi_sys_trace]
LINE_BREAKER = ([\r\n]+)
MAX_TIMESTAMP_LOOKAHEAD = 25
TIME_FORMAT = %Y-%m-%d %H:%M:%S#%z
TIME_PREFIX = ^
SHOULD_LINEMERGE = true
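A quick way to sanity-check the two TIME_FORMAT strings before handing them to Splunk is to try them against sample timestamps. The sketch below does that with Python's `strptime`, which understands the directives used here, although Splunk's own parser is not identical. The sample timestamps are invented to fit the formats and are not taken from real log files.

```python
from datetime import datetime, timezone

# TIME_FORMAT values copied from the two props.conf stanzas.
FORMATS = {
    "cpi_sys_http": "%d/%m/%Y:%H:%M:%S %z",
    "cpi_sys_trace": "%Y-%m-%d %H:%M:%S#%z",
}


def parse_timestamp(sourcetype, text):
    """Parse a timestamp string using the format of the given sourcetype."""
    return datetime.strptime(text, FORMATS[sourcetype])


# Hypothetical timestamps shaped to match each format.
http_ts = parse_timestamp("cpi_sys_http", "07/09/2022:10:15:30 +0000")
trace_ts = parse_timestamp("cpi_sys_trace", "2022-09-07 10:15:30#+0000")
```

Note that a trace timestamp in this shape is exactly 25 characters long, which lines up with MAX_TIMESTAMP_LOOKAHEAD = 25 in the TRACE stanza.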
Below is an example of a TRACE file, in which you can search for where a message failed and for the elapsed time of each step. The WebUI debug mode is more user-friendly.
Information > Insights
Now you have the logs pulled into Splunk. Use a simple wildcard search in Splunk to find exceptions and patterns by URL context path:
index=xyz source=cpi_sys_http "1.1\" 40*"
Find message runtime errors by looking for a Camel exception, for example:
index=xyz source=cpi_sys_trace "#ERROR#org.apache.camel.processor.DefaultErrorHandler#" NOT "Kafka"
You can create Splunk alerts on errors like the above to help eliminate them and improve MTTR.
Thanks to the individuals below for their assistance!
- Felix K. from SAP Product Engineering for LogFiles API guidance and performance analysis
- Jeff Edquid, Splunk expert, for tips on handling and validating the TRACE files
- Sriprasad Bhat for XSL/Groovy tips and feedback
As I understand it, there is a limit of two concurrent requests to the public API. That could be the reason you are seeing the 85-second delay.
Hi Daniel, thanks for the feedback. The performance really depends on the number of LogFiles your tenant has; I had over 1800 in my "prod" system. If you deploy on a lighter tenant, your time may be half that, as in my "dev".
Large concurrency is not encouraged, as you hinted, hence the idea of throttling it in your own deployment via JMS. I have seen /api/v1/ calls at very high concurrency rates and it was fine. If this is scheduled 3-4 times a day, each log file extraction should process with ease.