I was at the IEEE Database Conference at the Houston Doubletree in 1994 when Gio Wiederhold (who was then still deciding which IT projects (D)ARPA should and should not fund) made the following request of the audience at a plenary session.  He asked them to stop submitting proposals to “extend SQL” – he said that although SQL was the worst of all possible outcomes – not machine-friendly and not user-friendly – the community had decided to make it THE standard for data retrieval, so what was really needed was bright proposals for new data-retrieval paradigms, not extensions to a bad paradigm.  Well, of course, no one listened – for a very simple reason.  As someone once said – if you tell folks they’ve made a mistake, they will not be angry with you; but if you tell them they’re fundamentally misconceived, they will never forgive you.  Nonetheless, I’m here to say that:  1) SQL is fundamentally misconceived;  2) it is fundamentally misconceived because the relational paradigm on which it is based is fundamentally misconceived;  3) the relational paradigm is itself fundamentally misconceived because it is data-centric rather than index-centric;  4) SAP can and SHOULD use ADABAS to develop a new index-centric data-retrieval paradigm that would better support its Enterprise SOA aspirations.  To understand what I mean by index-centric, one need only recognize the obvious fact that you can’t have an index on a field in a relational table unless the field is physically filled in the rows of the table.  But why should this be the case?  (Please! – in this day of RAID arrays and mirroring and parallelism – don’t talk to me about back-up and recovery issues as if this were 1986 instead of 2006.)  And in fact, one key developer of a highly-esteemed pre-relational database asked this very question and decided that there was no good reason to place this limitation on indices.
So he (Bill Mann) decided to make the Model 204 database really scream by permitting the addition of “invisible fields” to records in any Model 204 file.  (Bill’s resume currently reads like a history of IT from 1960 to 2010 – he was recruited out of MIT to work for Merrill at Lincoln Labs, then worked at BBN on projects for which he still holds patents, then for CCA on Model 204, then for Lotus … and is now one of Michael Stonebraker’s key staff members working on the new Vertica database.)  As Bill now recalls somewhat ruefully, “invisible field” was exactly the wrong name to call his construct – which was nothing more than an index without a physical base.  It scared customers away from a very simple and powerful idea which can be used to develop an entirely new set of tricks in the repertoire of data magicians.  Suppose you need to know all of the records in a file (rows in a table) that have a certain field-name:value pair.  Basically, all you need (for any given field-name:value pair) is a set containing:  a) record numbers, or  b) 1’s in a bit-map (at least in a DB like ADABAS, which, like Model 204, is all the better for the fact that it still implements IFAM indices).  What you DON’T need (unless you hold stock in Oracle or IBM and are therefore a true believer in Codd’s legacy) is for the value of the field to be physically present in a field of a record of a file (“column” of a “row” of a “table”).  And in fact, it is a distinct (though cynical) possibility that the relational paradigm originally forbade this notion for the same reason that it originally forbade multiply occurring fields in the same row – it was developed by a company that was very, very interested in selling strings of 33×0’s and their controllers, so the more space required, the merrier.
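To make the core idea concrete, here is a minimal Python sketch of an index without a physical base: a (field-name, value) pair maps straight to a set of record numbers or, equivalently, to a bit-map – with no requirement that the value be stored in any record.  The class and method names are purely illustrative, not Model 204 or ADABAS syntax.

```python
# Sketch: an "invisible field" as a pure inverted index.
# A (field_name, value) pair maps to the set of record numbers that
# match -- the value need not be physically present in any record.

class InvisibleIndex:
    def __init__(self):
        self.postings = {}  # (field_name, value) -> set of record numbers

    def add(self, field_name, value, record_number):
        self.postings.setdefault((field_name, value), set()).add(record_number)

    def find(self, field_name, value):
        # "Find all records where field = value" touches only the index.
        return self.postings.get((field_name, value), set())

    def as_bitmap(self, field_name, value, n_records):
        # Equivalent bit-map form: 1 at position i if record i matches.
        s = self.find(field_name, value)
        return [1 if i in s else 0 for i in range(n_records)]

idx = InvisibleIndex()
idx.add("COLOR", "red", 0)
idx.add("COLOR", "red", 3)
idx.add("COLOR", "blue", 1)
print(idx.find("COLOR", "red"))          # {0, 3}
print(idx.as_bitmap("COLOR", "red", 4))  # [1, 0, 0, 1]
```

Either representation – the record-number set or the bit-map – answers the field-name:value query without the field ever occupying space in the records themselves.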
Now, back in the days before David Patterson’s predictions about cheap disk space came true beyond his wildest imagination, one could argue that the last thing a CIO wanted was a construct which encouraged folks to add even MORE indices to a file (table) – particularly CIOs who were already ticked at the fact that a non-relational database engineered like Model 204 made it possible to add as many regular (physically-based) b-tree or inverted indices to a file as one liked, with no degradation in performance.  But David Patterson’s wildest dreams about cheap disk space have come true, and as noted earlier, RAID arrays, mirroring, and parallelism now provide the infrastructure needed to:  a) update as many indices as you want in real-time, or damn close to it;  b) successfully recover non-physically-realized indices in case of failure (an admittedly real issue back in 1986).  So, to sum up, what SAP should seriously consider doing is re-implementing Bill Mann’s old notion of “invisible field” inside ADABAS.  If anyone can’t immediately see how the availability of such “invisible” indices would completely change the current OLAP paradigm for the better – just let me know and I’ll be happy to devote more than one blogpost here to the topic – particularly to how and why an index-centric data-retrieval paradigm could help support SAP’s Enterprise SOA aspirations in a number of ways.  And BTW – if you think you see a common thread running through my blogposts here (except for the first two on metadata repositories), you’re entirely correct.  It’s simply embarrassing to see a company as intelligent and intelligently-run as SAP having to rely on fundamentally misconceived relational products such as DB2 and Oracle, and now that SAP has its own database, it no longer really has to.

6 Comments


  1. Stephan Heinberg
    Hello David

    I am personally interested in reading more from you about that topic, particularly some use cases where the “invisible fields” approach would do much better than Scott’s relational approach.

    Thanks,
    Stephan

    1. David Halitsky
      I will prepare a blog which summarizes what I actually did with invisible fields for the US Army in 1994.  But while I’m doing so, I’d like you to think about:

      a) why BSEG is really the cluster table RFBLG at most SAP customers running SAP financials, i.e. why SAP has to violate the relational paradigm in staging BSEG (violating it in spirit if not in letter);

      b) why SAP’s need to make BSEG a cluster table reveals the relational paradigm to be the misconceived paradigm it really is;

      c) what kind of DB, OS, and hardware architecture would have to exist to permit BSEG to be indexed on as many alternative singleton and composite indices as a customer wanted.

      The reason I’m suggesting (a)-(c) as background for what I’ll post on the actual “invisible fields” implementation I did in 1994 is that this implementation was designed to solve a problem very, very similar to that posed by BSEG.

    2. David Halitsky
      Stephan –

      You asked how the relational paradigm could be outperformed by a database paradigm which permitted invisible fields.

      Here is one way in which I actually used Model 204’s “invisible fields” back in the early 1990’s.  In the exposition below, please note that I use the generic terms “file”, “record”, and “field” the way the relational terms “table”, “row”, and “column” are used today.

      Suppose you have a large file (“SourceFile”) and the customer wants to query it in real-time in order to obtain output sorted in a number of different ways.  For the sake of discussion, suppose this file “SourceFile” has four key fields F1, F2, F3, F4, and the customer wants to receive sorted output with control breaks on any of the following “major-to-minor” sort orders:

      (01) F1 F2
      (02) F1 F3
      (03) F1 F4

      (04) F2 F1
      (05) F2 F3
      (06) F2 F4

      (07) F3 F1
      (08) F3 F2
      (09) F3 F4

      (10) F4 F1
      (11) F4 F2
      (12) F4 F3

      (13) F1 F2 F3
      (14) F1 F2 F4
      (15) F1 F3 F2
      (16) F1 F3 F4
      (17) F1 F4 F2
      (18) F1 F4 F3

      (19) F2 F1 F3
      (20) F2 F1 F4
      (21) F2 F3 F1
      (22) F2 F3 F4
      (23) F2 F4 F1
      (24) F2 F4 F3

      (25) F3 F1 F2
      (26) F3 F1 F4
      (27) F3 F2 F1
      (28) F3 F2 F4
      (29) F3 F4 F1
      (30) F3 F4 F2

      (31) F4 F1 F2
      (32) F4 F1 F3
      (33) F4 F2 F1
      (34) F4 F2 F3
      (35) F4 F3 F1
      (36) F4 F3 F2

      (37) F1 F2 F3 F4
      (38) F1 F2 F4 F3
      (39) F1 F3 F2 F4
      (40) F1 F3 F4 F2
      (41) F1 F4 F2 F3
      (42) F1 F4 F3 F2

      (43) F2 F1 F3 F4
      (44) F2 F1 F4 F3
      (45) F2 F3 F1 F4
      (46) F2 F3 F4 F1
      (47) F2 F4 F1 F3
      (48) F2 F4 F3 F1

      (49) F3 F1 F2 F4
      (50) F3 F1 F4 F2
      (51) F3 F2 F1 F4
      (52) F3 F2 F4 F1
      (53) F3 F4 F1 F2
      (54) F3 F4 F2 F1

      (55) F4 F1 F2 F3
      (56) F4 F1 F3 F2
      (57) F4 F2 F1 F3
      (58) F4 F2 F3 F1
      (59) F4 F3 F1 F2
      (60) F4 F3 F2 F1
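      (The count of 60 is no accident: these are exactly the ordered selections of 2, 3, or 4 fields out of the four, i.e. P(4,2) + P(4,3) + P(4,4) = 12 + 24 + 24 = 60.  A quick sanity check in Python, with the field names as plain strings:)

```python
from itertools import permutations

fields = ["F1", "F2", "F3", "F4"]
# Ordered selections of length 2, 3, and 4 -- one per sort order.
orders = [p for r in (2, 3, 4) for p in permutations(fields, r)]
print(len(orders))   # 60
print(orders[0])     # ('F1', 'F2')   -- sort order (01)
print(orders[-1])    # ('F4', 'F3', 'F2', 'F1')   -- sort order (60)
```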

      Suppose next that sorting the retrieved data is out of the question – the customer wants to pull back a lot of data, sort it in any one of the ways above, summarize it up to the relevant “control breaks”, and then just bring the “control breaks” back to the screen (with subsequent drill-down possible).

      Finally, suppose that you have to do what the client wants – i.e. possible technical responses to the client do NOT include: a) “That’s an unreasonable spec”; or b) “Sure, buy me a Cognos or SAP BW and I’ll be happy to pre-load it for you.”

      Well, here’s what you can do if you’re working in a non-relational database that permits:

      a) creation of “invisible fields”, i.e. indices on fields not physically present in a file;

      b) multiply occurring fields within one record of a file;

      c) multiple record types in a file with no degradation in file performance

      d) direct access to the b-tree indices you’ve created as “invisible fields”, i.e.

      – a syntactic construct “FDV” (Find All Values), which gets all the values of a b-tree index and places them in a list, and which permits retrieval of values matching a pattern, e.g. “FDV OF index-1 in ControlFile WHERE index-1 LIKE PATTERN”;

      – a syntactic loop construct “FRV” (For Each Value), which lets you step through a list of index values you’ve obtained with an FDV statement;

      e) multiple record types in a single file with no overhead penalty for carrying empty fields, i.e. records do not have to have the same field-structure.

      I. First, you set up a control file “ControlFile” with the following INVISIBLE and multiply-occurring fields:

      (01) F1F2
      (02) F1F3
      (03) F1F4

      (04) F2F1
      (05) F2F3
      (06) F2F4

      (07) F3F1
      (08) F3F2
      (09) F3F4

      (10) F4F1
      (11) F4F2
      (12) F4F3

      (13) F1F2F3
      (14) F1F2F4
      (15) F1F3F2
      (16) F1F3F4
      (17) F1F4F2
      (18) F1F4F3

      (19) F2F1F3
      (20) F2F1F4
      (21) F2F3F1
      (22) F2F3F4
      (23) F2F4F1
      (24) F2F4F3

      (25) F3F1F2
      (26) F3F1F4
      (27) F3F2F1
      (28) F3F2F4
      (29) F3F4F1
      (30) F3F4F2

      (31) F4F1F2
      (32) F4F1F3
      (33) F4F2F1
      (34) F4F2F3
      (35) F4F3F1
      (36) F4F3F2

      (37) F1F2F3F4
      (38) F1F2F4F3
      (39) F1F3F2F4
      (40) F1F3F4F2
      (41) F1F4F2F3
      (42) F1F4F3F2

      (43) F2F1F3F4
      (44) F2F1F4F3
      (45) F2F3F1F4
      (46) F2F3F4F1
      (47) F2F4F1F3
      (48) F2F4F3F1

      (49) F3F1F2F4
      (50) F3F1F4F2
      (51) F3F2F1F4
      (52) F3F2F4F1
      (53) F3F4F1F2
      (54) F3F4F2F1

      (55) F4F1F2F3
      (56) F4F1F3F2
      (57) F4F2F1F3
      (58) F4F2F3F1
      (59) F4F3F1F2
      (60) F4F3F2F1

      (Each of these fields is a concatenation of the values of two or more of the fields F1, F2, F3, and F4 that actually occur in the file “SourceFile” – obviously, these concatenations must honor spaces if F1, F2, F3, and F4 can have values that do not have uniform lengths.)

      II. Second, you run a program on SourceFile that finds all the values of the fields (1-60) in this file and loads the values of these 60 fields into ControlFile.

      Note:

      – at the end of this load, the file ControlFile will have only 60 records in it, one for each of the 60 fields defined in Step I;

      – every one of these 60 records will be physically empty.

      If fields (1-60) were defined as multiply-occurring but NOT invisible, then each of the records in ControlFile would have some number of occurrences of one of the fields (1-60), e.g. one of the records in ControlFile would have 24,312 occurrences of the field F4F3F2 if there were 24,312 different combinations of the values of F4, F3, and F2 in the records of SourceFile.

      But since fields (1-60) have been defined as invisible as well as multiply-occurring, what you’ve done is to build 60 indices in ControlFile, one for each of the sort orders the customer can possibly request.  There is no “data” in ControlFile, in the relational sense of the term “data”.
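      (A Python sketch of Steps I-II, under the illustrative assumptions that SourceFile is a list of records and that the key fields have a fixed width we can pad to – the function and variable names here are mine, not Model 204’s:)

```python
# Sketch of Steps I-II: for each of the 60 requested sort orders, build
# one "invisible" concatenated-key index over SourceFile.  Values are
# padded to a uniform width so the concatenated keys sort correctly,
# honoring spaces as the text above requires.
from itertools import permutations

FIELDS = ["F1", "F2", "F3", "F4"]
WIDTH = 4  # assumed fixed field width (illustrative)

def pad(v):
    return str(v).ljust(WIDTH)

def build_control_file(source_file):
    """source_file: list of dicts keyed F1..F4; record number = position."""
    orders = [p for r in (2, 3, 4) for p in permutations(FIELDS, r)]
    control = {}  # index name, e.g. "F3F4F1F2" -> sorted distinct keys
    for order in orders:
        name = "".join(order)
        keys = {"".join(pad(rec[f]) for f in order) for rec in source_file}
        control[name] = sorted(keys)
    return control

source = [
    {"F1": "a", "F2": "x", "F3": "p", "F4": "1"},
    {"F1": "a", "F2": "y", "F3": "p", "F4": "2"},
    {"F1": "b", "F2": "x", "F3": "q", "F4": "1"},
]
control = build_control_file(source)
print(len(control))     # 60 indices, one per sort order
print(control["F1F2"])  # ['a   x   ', 'a   y   ', 'b   x   ']
```

      As in the text: the only thing “loaded” is distinct key values per index – no record in this sketch carries any of these concatenations as stored data.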

      III. The rest of the story is pretty obvious.  Suppose the customer wants the sort order F3, F4, F1, F2. Then the structure of your retrieval program is:

      FDV F3 in ControlFile.
      FRV F3

        FDV F3F4 in ControlFile WHERE F3F4 LIKE F3
        FRV F3F4

          FDV F3F4F1 in ControlFile WHERE F3F4F1 LIKE F3F4
          FRV F3F4F1

            FDV F3F4F1F2 in ControlFile WHERE F3F4F1F2 LIKE F3F4F1
            FRV F3F4F1F2

              Find all records in SourceFile WHERE f1 = F1, f2 = F2, f3 = F3, f4 = F4.

            ENDLOOP.

          ENDLOOP.

        ENDLOOP.

      ENDLOOP.
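      (The same nested FDV/FRV structure can be emulated in Python with prefix scans over the concatenated-key indices – again, the names, the fixed field width, and the tiny sample data are all illustrative assumptions, not Model 204 syntax:)

```python
# Sketch of Step III: emulate the nested FDV/FRV loops for sort order
# F3 F4 F1 F2.  Each loop level scans only the index keys that extend
# the prefix found one level up.

WIDTH = 4

def pad(v):
    return str(v).ljust(WIDTH)

def build_indices(source, orders):
    idx = {}
    for order in orders:
        name = "".join(order)
        idx[name] = sorted({"".join(pad(r[f]) for f in order) for r in source})
    return idx

def fdv(idx, name, prefix=""):
    # FDV ... WHERE name LIKE prefix: all index values starting with prefix.
    return [k for k in idx[name] if k.startswith(prefix)]

source = [
    {"F1": "a", "F2": "x", "F3": "p", "F4": "1"},
    {"F1": "b", "F2": "y", "F3": "p", "F4": "2"},
    {"F1": "a", "F2": "z", "F3": "q", "F4": "1"},
]
idx = build_indices(source, [("F3",), ("F3", "F4"), ("F3", "F4", "F1"),
                             ("F3", "F4", "F1", "F2")])

breaks = []
for k3 in fdv(idx, "F3"):                           # FRV F3
    for k34 in fdv(idx, "F3F4", k3):                # FRV F3F4
        for k341 in fdv(idx, "F3F4F1", k34):        # FRV F3F4F1
            for key in fdv(idx, "F3F4F1F2", k341):  # FRV F3F4F1F2
                breaks.append(key)  # here: find matching SourceFile records
print(breaks)  # control-break keys, already in F3 F4 F1 F2 order
```

      The output arrives already sorted major-to-minor on F3, F4, F1, F2 – no sort step is ever run against the retrieved data, which was the whole point of the spec.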

      Note: if you like, you can also make F1, F2, F3, and F4 invisible in SourceFile (as you might want to do in SAP’s BSEG for fields such as account, cost center, profit center, etc. … more on this later in Section V below…)

      IV.  So of course the first question comes down to: can you update the indices in ControlFile fast enough?  And the answer is yes – for two reasons.  First, if you buffer ControlFile, all you’re doing is reading b-tree indices in core.  Second, if I could update 60 indices sufficiently fast in the early 1990’s without parallelism, there is no question I can do so now.  In the early 1990’s, “sufficiently fast” was defined as a six-hour period: updates to SourceFile came in four times a day, and the requirement was therefore to complete the SourceFile and ControlFile updates before the next update cycle began.  Usually we completed updates to SourceFile and ControlFile within an hour or less, and to let users know whether an index update was still in progress, we simply checked whether the update job was still running.  But as I said earlier, in these days of parallelism, mirroring, etc., there is a good chance that updating the 60 indices in ControlFile could be immediate, with no degradation in performance noticeable to the user.

      V.  The second interesting question is the one raised at the end of Section III – should F1, F2, F3, and F4 be invisible fields in SourceFile?  (This is unrelated to the fact that the 60 indices in ControlFile are invisible.)

      Well, suppose you had four additional control-indices as follows:

      a) ControlFile1 has all the fieldname:value pairs for F1 in SourceFile and, for each such pair P1, the record numbers of the records in SourceFile that contain P1;

      b) ControlFile2 has all the fieldname:value pairs for F2 in SourceFile and, for each such pair P2, the record numbers of the records in SourceFile that contain P2;

      c) ControlFile3 has all the fieldname:value pairs for F3 in SourceFile and, for each such pair P3, the record numbers of the records in SourceFile that contain P3;

      d) ControlFile4 has all the fieldname:value pairs for F4 in SourceFile and, for each such pair P4, the record numbers of the records in SourceFile that contain P4.

      Then, if ControlFile1–ControlFile4 were each organized by SourceFile record number (implicitly or explicitly), you could index into these files from a record number in SourceFile and find out the values of F1–F4 for that record.
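      (A sketch of the Section V idea, under the same illustrative assumptions: each per-field control index carries the forward view, from value to record numbers, and the reverse view, from record number to value, so a record number recovers its F1-F4 values without touching SourceFile at all:)

```python
# Sketch of Section V: per-field control indices over SourceFile, plus
# the reverse organization keyed by record number (names illustrative).
from collections import defaultdict

def build_control(source, field):
    fwd = defaultdict(set)   # value -> record numbers having that value
    rev = {}                 # record number -> value (the "organized by
                             # SourceFile record number" view)
    for recno, rec in enumerate(source):
        fwd[rec[field]].add(recno)
        rev[recno] = rec[field]
    return fwd, rev

source = [
    {"F1": "a", "F2": "x", "F3": "p", "F4": "1"},
    {"F1": "b", "F2": "x", "F3": "q", "F4": "2"},
]
controls = {f: build_control(source, f) for f in ("F1", "F2", "F3", "F4")}

fwd_f2, rev_f2 = controls["F2"]
print(fwd_f2["x"])  # {0, 1}: the records whose F2 = "x"
print(rev_f2[1])    # "x": value of F2 for record 1, no SourceFile access
```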

      ******
      Am I serious about V above?  Serious enough to want to see a trial run on a highly-parallelized system with a reasonable amount of memory.  I’d be very surprised if the approach weren’t entirely feasible.  Remember, all we’re doing is reversing the pre-fetch process used now: instead of driving off one or more control tables (like COEP and COBK, to find belnrs for a given kostl in BSEG), we drive off a control table to find the kostl for a given belnr in BSEG.

      But regardless of whether V is a feasible approach to solving SAP’s problem with BSEG (and many other customers’ problems with similar detail files that need to be sliced and diced), I-IV is certainly a feasible approach to multiple arbitrary rotations, and one which can’t be implemented within the relational paradigm.

  2. Dirk Herzog
    Let me see if I got it correct.

    Let’s say you have a table that has
    Company
    Product
    Sales

    and you have a master data table
    Company
    Country

    Now you need a query
    Product  Sales
    for all companies in the USA. You would need to create a view and first select all companies from the USA, then all records for the selected companies. With the invisible fields, you could create an index on the first table that gives you all records from companies in the USA. Is that what you mean? This could be a revolution in BI.

    Best regards
    Dirk

    1. David Halitsky
      Dirk – Thanks for taking the time to reply.

      I myself have grown so suspicious and weary of IT “revolutions” that I am not sure whether to take your comment about a “revolution in BI” as straightforward or sarcastic (in a friendly way, of course.)

      But assuming you were not being sarcastic, I feel compelled to say that although I don’t know the internals of COGNOS or the SAP BW, I would be surprised if these engines (and any other OLAP engine) WEREN’T using “index-centric” techniques of the type I describe in this thread.

      Why do I think that OLAP engines are already using “index-centric” techniques “under the covers”?

      Well, first, OLAP engines have the luxury of operating on precomputed data that is reloaded periodically and never updated in any way while the OLAP engine is operative.  So right off the bat, OLAP designers can use “index-centric” techniques such as “invisible fields” (or their equivalent) without worrying about how to update the damn things in real time.

      Second, OLAP designers don’t have to fight the relational paradigm and those who are vested in ensuring that it remains the dominant database paradigm.  In fact, it could be argued that the rise of OLAP engines was a natural response to the plain and simple fact that “data-centric” relational databases simply can’t handle a certain class of real-world problems that are elegantly handled by “index-centric” engines.  That is, the relational gurus were more than happy to let OLAP engines flourish, because they could say that the success of OLAP engines is no reflection on the inadequacies of the relational paradigm, i.e. that the relational paradigm was NEVER MEANT to handle the class of problems addressed by OLAP engines.  (Of course, by saying this, relational gurus implicitly confess that their paradigm is essentially trivial, but this is something that seems to escape them.)

      Suppose I’m correct in the assumption that OLAP engines ARE today using “index-centric” techniques such as “invisible fields” or their equivalent.  Then the real question is whether “index-centric” techniques should stay solely within the OLAP domain or whether they can be safely imported into everyday database engines.  (As I said before, this question can only seriously arise in a time such as ours – when RAID/mirrors, parallelism, cheap disk space, and fairly cheap large memories all permit “index-centric” engines to solve the real-time update problem in ways that could not be dreamt of when Bill Mann first made “invisible fields” a part of Model 204’s retrieval repertoire.)

      I personally believe that “index-centric” techniques should move out of the OLAP domain and into “everyday” databases, for the simple reason that “everyday” databases are ALREADY using versions of “index-centric” techniques without realizing it.

      For example, the first thing every SAP FI(/CO) programmer learns is never to go into BSEG unless you have a company, year, and one or more specific belnrs (doc nrs).  But how do you get a set of specific belnrs?  Well, as I have noted before, if you want belnrs per kostl (cost center), you go from CSKS to COEP to COBK, and voila, you have belnrs per kostl to your heart’s content, and you’re all set to take these belnrs into BSEG.

      But let’s face it – although CSKS, COEP, and COBK carry useful data, these tables are in fact FUNCTIONING as “invisible indices” when they are being used solely for the purpose of pre-fetching a set of belnrs per kostl to take into BSEG.

      So, since one is forced to use “implicit” index-centric techniques inside the SAP engine to solve a certain class of retrieval problems, why not see whether “explicit” use of index-centric techniques might not be a better way of doing things?

      As far as your particular example goes – yes – I think you have it right, although your example is kind of a “mixed” case where certain data interrelationships are obtained by standard “data-centric” techniques (views created by relational joins) and others by “index-centric” techniques involving the creation of indices that are not rooted in fields that are actually present in tables.

      Anyway, I hope you find this an appropriate response to your reply – again, I’m not sure whether your reply was intended sarcastically or not, so I’ve tried to cover all bases without making a complete fool of myself in the process.

  3. David Halitsky
    I sent Bill the link to this SDN thread and he was kind enough to respond in his usual gracious and patient manner.  He observed first that when he put “invisible fields” in M204, he anticipated that they would be used only when their values could be algorithmically computed via a user-supplied function from “visible fields”.  (This would actually be the case in my example of using invisible indices that are concatenations of visible fields to replace sort keys by nested loops on “nested” indices.)

    He also observed that little new can be said these days re the question of how to goose database performance with new kinds of index tricks, and that the new database he’s now helping to develop (Vertica) may well render the issue of index performance irrelevant.  He wouldn’t/couldn’t say more on this topic, so naturally I’m very curious to see what he and his co-developers have up their sleeves this time.

