Skip to Content
Technical Articles
Author's profile photo Jan Fetzer

Proudly launched: Intelligent Lookup for SAP Data Warehouse Cloud

What it is

SAP Data Warehouse Cloud (DWC) is a data warehouse that specifically targets business users. DWC does so by focusing on superior ease of use, super-simple data collaboration in DWC spaces, seamless integration of data from both SAP and non-SAP sources and, of course, tight integration with SAP Analytics Cloud. With upcoming SAP Data Marketplace, it will be even simpler to find relevant, accurate and detailed data that can be blended with the inhouse data from customers’ SAP systems.

Intelligent Lookup – Harmonizing Data with Ease where Standard Techniques Fail

DWC thus easily enables entirely new use cases like comparing a company’s own salaries with those typical in the job market for comparable roles, comparing own business performance with external economic indicators, enriching one’s business partner data with external ratings or economic performance and many others.

It is fully clear though, that the challenge of data integration does not end once the data is brought from SAP Data Marketplace or custom source systems into a user’s DWC space. The act of physically joining one’s own and these newly acquired data sources is a challenge in its own right: unless all involved data is superbly clean, any straightforward join between two datasets that belong together “in principle” regularly suffers of some or even many exceptions – be it special characters, typos, duplicates, fuzziness in the data, leading zeros, spelling differences or many more. For these exceptions, decisions need to made by subject matter experts in the business (since they know the data) without the overhead of complex SQL code or long transformation in ETL logic. In our experience, dealing with these exceptions often causes 90% of the effort and data integration costs.

This is exactly where the Intelligent Lookup operator in DWC’s Data Builder comes in and provides users with a dedicated, own data matching environment

  • Interactive Environment for User Mappings
    Users efficiently create mappings & mapping rules in an intuitive no-code UI.
    Results are shown immediately. Data is taken as it is.
  • Mapping Automation through Rules
    Rules create mappings automatically.
    Users can add or rework mappings where required – down to record level
  • Mappings need to be created only once
    All automated & manual mappings are persisted in a “pairing table”
    No further user intervention is required ever again on rows where these mappings are applicable
  • Full integration with rest of SAP Data Warehouse Cloud
    Harmonization result is a standard entity within DWC’s Data Builder that all modelling entities and consuming tools can leverage

Why Joins Fail to Solve the Data Integration Challenge

Let’s walk through a simple example end-to-end. Imagine you are a marketeer analyzing sales coming from purchases in the company’s webshop. Your sales department shared the raw data that might look like this

Raw%20data%20on%20webshop%20sales

Raw data on webshop sales

As a first goal, you’d like to analyze the data by continent & also plot it on a map. To that end, you surfed the web and found city information for 100k cities in a Google BigQuery dataset that contains all those columns that you need, esp. continent information and latitude/longitude

Details%20on%20%7E100.000%20global%20cities

Details on ~100.000 global cities

So far, so good. Now what is your next step in bringing the two datasets together? You put on your data modelling hat, build a simple view that joins the two datasets on city name, region name (=subdivision in city dataset) and country name, mark the view as Analytical Dataset and expose it for consumption to SAP Analytics Cloud (SAC).

Graphical%20View%20w%20join%20over%20city%2C%20region%20and%20country

Graphical View w join over city, region and country

You then proceed to build you story in SAC. But, gosh, when looking at your data, you realize you are in unforseen trouble! The join essentially fails in 80% of your cases to identify the correct record in the cities table: continent = NULL for almost all records!!

SAC%20Story%20on%20Webshop%20Sales%20w%20continent%20equals%20NULL%20for%2080%25%20of%20records

SAC Story on Webshop Sales w continent equals NULL for 80% of records

 

Solving the problem with Intelligent Lookup

Intelligent Lookup solves the problem with an entirely new approach by iteratively building up a match table in the background that keeps track of what entry in the one table matches to what record in the other table.

Adding input & lookup entities

As you see in the video, Intelligent Lookup is a standard operator in DWC that creates an artefact that can subsequently be used anywhere else in DWC, just like a normal Graphical or SQL View. After creating a new Intelligent Lookup, users see a template canvas that they need to fill with life.

Firstly, they identify which is their input dataset. One decision they need to take is the granularity at which matches are made or the foreign key in the input dataset that uniquely identifies a record in the lookup dataset. We also call this column the “pairing column”. I’ll be writing an own blog about this soon (but this FAQ question already addresses it). Don’t worry if this feels opaque for now – it’s a rather straightforward concept when explained well.

Next you define the lookup dataset, i.e. the dataset from which you want to blend in data to each and every record of your input dataset. Here, we use the above-mentioned Google BigQuery dataset. We subsequently choose which of its columns shall be added to the input dataset. In the case at hand, we are interested in looking up the continent, latitude and longitude of each and every city and thus pick those as “return columns”

Creating%20a%20new%20Intelligent%20Lookup%20and%20defining%20Input%20and%20Lookup%20Entities

Creating a new Intelligent Lookup and defining Input and Lookup Entities

Define your first rule – then deploy & run

Next step of the process is now to define the first matching rule. This looks like a join definition, but you’ll see that it actually is a lot more than this and quite a bit more flexible: Once the two entities are defined, you jump to the rule properties, choose the match strategy (exact or fuzzy) and choose the participating columns.

In our case, we want to exactly compare city to city_name, region to subdivision and finally country to country_name. This is actually exactly what our join operator did above, but you’ll soon see where the difference is.

We now need to deploy & run the operator. In the background, Intelligent Lookup deploys the match table (aka “pairing table”), the respective code snippets that do the match logic as well as the final output artefact that effectively joins over input table, pairing table and lookup table.

After deployment, we run the operator and inspect its results. Not fully unexpectedly, the result is exactly what the join did: ~80% “unmatched” and ~20% “matched”. With matched, this means that the rule could identify which LocationId of the input dataset belongs to which geo_id of the lookup dataset by exactly comparing those 3 fields. As with the join, that comparison proved to be unsuccessful for ~80% of the records

Deploy%20and%20run%20the%20first%20rule%20and%20inspect%20its%20results

Deploy and run the first rule and inspect its results

Adding rule #2 to work through all records that did not pass rule #1

The trick with Intelligent Lookup is though that you can take those 80% unmatched records and pipe them into yet another rule of type exact match. This time we only compare city to city_name & country to country_name, thus leaving aside region / subdivision.

This rule is less strict, but is able to find a matching city record for 46% of the input records!! This means that for all those 7646 records, comparing just city & country names was good enough to uniquely make out the city (and thus its continent, latitude & longitude) in the lookup dataset.

Deploy%20and%20run%20rule%202%20and%20inspect%20results

Deploy and run rule 2 and inspect results

The price for a less strict rule rule though is that it sometimes cannot finally make a decision because there is indeed more than one match. For example, those 102 sales transactions that belong to Walldorf/Germany cannot uniquely be assigned by this rule to any given record of the lookup datasets, because there are actually two cities of name Walldorf in Germany: one in Baden-Württemberg and one in Thüringen.

Inspecting%20and%20matching%20multiple%20matches%20of%20Rule%202

Inspecting and matching multiple matches of Rule 2

When you check the details in the screenshot, you will realize why the join (and the first rule) failed on this record: input region is Baden-Württemberg while lookup subdivision is Baden-Wurttemberg Region. These are obviously semantically the same, but dumb computers cannot see this – only smart humans can. This is why Intelligent Lookup puts back the control over match decisions back into the hands of the subject matter experts in the LoB by giving them the possibility to resolve these simple albeit crucial situations by themselves without the need for IT involvement.

As you see in the GIF, with just 2 decisions & clicks, the user matches 102 sales transactions for Walldorf/Germany as well as 1614 sales transactions for Berlin/Germany!

Taking%20a%20Match%20Decision%20on%20Multiple%20Matches%20of%20Rule%202

Taking a Match Decision on Multiple Matches of Rule 2

Add a third rule that does fuzzy matching on everything still unmatched

But the story doesn’t end there. Intelligent Lookup plays out its core strength by also supporting fuzzy comparisons in match rules. We now pipe those remaining 17% of unmatched cases straight into a fuzzy rule that jus checks on the city name. What this rule does is that it fuzzily compares the city name of the input records with each every city of the lookup records.

Due to HANA, even with 109.000 cities and ~3000 open input records, this is done in a couple of seconds and the result is actually pretty stunning: For 1222 records, we simply need to review and confirm the results – but note that we do this in groups of transactions that share the same LocationId (= your pairing column value).

Examples you see below are e.g. cases like Zurich ~ Zürich (damned be those German umlauts!), Furtwangen ~ Furtwangen im Schwarzwald or Leibnitz ~ Leibnitz, Styria. Users might need to reject certain records where fuzzy yielded unexpected results, but for most cases, this is just what we were looking for

Designing%2C%20deploying%20and%20running%20a%20fuzzy%20rule

Designing, deploying and running a fuzzy rule

As with rule #3, we again have multiple matches, i.e. match candidates that the rule proposes to us. For the example in the picture, Frankfurt/Germany, the fuzzy rule found 11 match candidates to work through. The same is true for many other places (e.g. 55 sales transactions for Saint Moritz/Switzerland fuzzily matching to Mori/Japan, Mori/Italy, Mor/Hungary as well as St. Moritz/Switzerland).

Multiple%20Matches%20of%20Rule%203

Multiple Matches of Rule 3

Reducing match candidates by adding rule #4

Users could now work through those match candidates manually as shown above, or they could let the system do this by letting a rule sieve through these lists of candidates. By chaining a rule to the multiple matches, users can effectively limit those match candidates even more and thus only work through the most relevant ones. In our case, for example by enforcing that the country on both sides actually need to be the same, we get rid of all those Frankforts in various US states.

Rule%204%20helped%20massively%20reduced%20match%20candidates

Rule #4 massively reduced match candidates

As you see in the screen shot, rule #4 helped reduce the number of match candidates for Frankfurt/Germany down from 11 to 2, thus making it simple & efficient for business users to take the final call between those proposals.

Inspecting the output

In less than 15′ we now went from 13.000 unmatched records down to just a few dozen. We still need to work through some match candidates if we so choose, or we leave it at that. In the output view configuration, we can now decide how we continue from here:

  • We can either include or exclude unmatched elements, i.e. any record that is neither matched or in review (you will be able to exclude items in review very soon). If you choose to include unmatched records, obviously the value of their lookup columns (that we chose to be continent, latitude & longitude) is set to NULL.
  • If you so please, you can assign a default value to them, allowing for a better reporting. In the example in the screenshot we set continent to Unknown and left latitude/longitude at their default NULL,
  • You can optionally add own boolean indicator column that contains True for all unmatched records and False otherwise. This then allows users in upper views or in reporting to distinguish between default values and “real” values – something that e.g. could otherwise be hard for default dates or default numbers. For strings this should be less of an issue normally.

Output%20View%20of%20the%20Intelligent%20Lookup

Output View of the Intelligent Lookup

Building a view based on an Intelligent Lookup

The Intelligent Lookup is a first class citizen among DWC’s Graphical Views, SQL Views, Business Entities, Data Flows etc. This means that it can be used as data source for any of those just as if it was a normal view.

This is really neat, because it thus replaces the join that yielded insufficient results by a cleaner, fully reusable entity.

Using%20an%20Intelligent%20Lookup%20in%20another%20View

Using an Intelligent Lookup in another View

Consuming results from Intelligent Lookup in SAP Analytics Cloud

Now let’s rebuild the story from above in SAP Analytics Cloud. As you see in the screenshot, we see that the picture has entirely changed.We went from ~14000 unmatched records in category NULL down to 111 records in category Unknown within not much more than 15′ of work!

I let you make up your own minds on this, but we feel that with Intelligent Lookup, DWC is offering a transparent, iterative, efficient process which is a MASSIVE step forward for giving LoB users the tools in hand that they need to solve the data interaction challenges that cross their ways.

Re-building%20the%20Story%20in%20SAP%20Analytics%20Cloud

Re-building the Story in SAP Analytics Cloud

Summary

Intelligent Lookup is an entirely new and ground-breaking way of solving one of the most painful problems in data integration: bringing data together that semantically fits, but doesn’t quite fit on a technical (i.e. join key) level.

By allowing users to design their data knowledge into data matching rules that either automatically match the records or allow users to take a final match decision, Intelligent Lookup strikes a very clever balance between automation and end-user-centric decision making.

And don’t forget: if new data is loaded into the transaction table, all these matches will automatically be reused without any further user intervention!! For any record that has a LocationId for which a match to the lookup table already has been made (e.g. like we did above for the case of Walldorf/Germany above, essentially matching its foreign key LocationId 74 in the input entity to the table key geo_id GEO00072463 of the lookup entity), this match will now be used automatically. No further running of the rule required!

This means that we just created a re-usable entity whose value grows over time, that benefits from each and every decision by subject matter experts and don’t ever asks them again for the same match making.

We’ll be publishing a lot more information about Intelligent Lookup, because to be honest, we are really proud about it. Stay tuned!

PS: Check & bookmark the one-stop shop on Intelligent Lookup. It contains links to all available blogs, videos and other resources like SAP Help Documentation and its tutorials

Assigned Tags

      8 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Gerrit Posthumus
      Gerrit Posthumus

      Can the input paring column also be two columns. One column to do the match on and the other to separate the source data. We have stores that send there own product and location codes with additional information that we want to auto / manually map to our own product / location codes. All our customers use there own coding so we give all our customers a data provider code to separate the data and in some customer can use the same codes for different products / locations. So we need to do the mapping per data provider that why we need two columns for pairing.

      Will also time dependent mapping be support in a future release ?

      Will there be a option in the future to assign data stewards to be responsabele for reviewing and mapping a subset of the data in our case separated based on the data provider and rout them to a personal / group queue to be processed ?

      Author's profile photo Jan Fetzer
      Jan Fetzer
      Blog Post Author

      Hi Gerrit, all very valid and interesting questions. So let's dig right in:

      • Today we are limited to a single pairing column only, but we plan to release this short-coming in the near future. In the meantime, you'd need to concatenate the two into a single column in a predecessor view.
      • We also have time-dependent mappings on the roadmap, but this will certainly take a bit more time since it involves some changes in persistency.
      • With regard to data stewards: we have no own role for match decisions today, but plan to introduce this also mid-term since, of course, designing match rules & acting on their results are often separated into different roles. Part of that plan is also to be able to schedule Intelligent Lookups so that the rules run on newly loaded datasets, automatically find matches where rules are good enough and raising notifications to data stewards for those where manual match decisions need to be taken. Due to the inherent re-use of matches, this is hoped to happen less and less over time though.

      Pls reach out to me directly, if you have interesting datasets or cases to work through: jan.fetzer@sap.com. Cheers, Jan

      Author's profile photo Gerrit Posthumus
      Gerrit Posthumus

      Hi Jan,

      Thanks for your reply.

      Intelligente lookup are a real game changer keep up the good work.

      Author's profile photo Sebastian Gesiarz
      Sebastian Gesiarz

      Cool feature; I remember doing this with countless and HANA decision tables.

      Author's profile photo Evgenii Shark
      Evgenii Shark

      Seems to be good!

      Author's profile photo Murali Balreddy
      Murali Balreddy

      Cool feature!!!

      When will this be available on DWC trial landscape?

      Author's profile photo Jan Fetzer
      Jan Fetzer
      Blog Post Author

      Hi Murali, I was under the impression that it should of course be part of that landscape. It definitely is part of the standard delivery regardless of license. Is it really not part of the trial landscape?

      Author's profile photo Murali Balreddy
      Murali Balreddy

      Yes, trial landscape only has the following

      Data%20builder

      Data builder

      Business%20builder

      Business builder