Extended Set Processing Technology

Machine-Independent Data Management Systems
Managing Data by Mathematical Identity not by Physical Location
RESEARCH OBJECTIVE:   Give data a mathematical identity.

Machine-independent data management systems access data by its identity. Machine-dependent data management systems access data by its location. Today all commercially available data management systems are machine-dependent. Under a 1965 ARPA research grant Dr. F. H. Westervelt initiated research to develop a mathematical foundation for implementing machine-independent data management systems. Dr. Westervelt's vision was to develop data access operations to replace data access structures. In 1968 a set-theoretic data structure, STDS, demonstrated accessing data using set operations at the I/O level. The original STDS was limited to record structures. It took another 20 years to develop and validate the extensions to ZFC required to support all physical data representations as extended sets.

The set operations necessary for supporting multi-platform, real time data access have not been published. The extended set operations, essential for implementing machine-independent data access systems, are presented following the 1968 research paper on the development of STDS.

Development of a Set-Theoretic Data Structure (1968)
Basis for Machine-Independent Information Management Systems

A search for commonality of purpose in existing data structures showed the basic concern of all data structures to be: information versus machine representation. Since all information is machine independent before being forced into a machine representation, an effort was made to separate those properties of data structures which were machine independent, the information environment, from those properties which were machine dependent, the machine environment, and then develop a data structure which allows isolation and separate control of these properties. This approach yielded the following definition: A data structure is an isomorphism between a machine environment and an information environment preserving the properties of each. The particular type of data structure depends on the structure of the information environment. When the information environment is represented by set theory, the isomorphism is a set-theoretic data structure.
KEY WORDS AND PHRASES: information environment, machine environment, data structure, set theory, Set-Theoretic Data Structure, transformation operations, set operations, machine-independence.

Introduction At the beginning of the CONCOMP project, Dr. Franklin H. Westervelt initiated an investigation into data structures. It was his view that present efforts in data structure development were too machine-oriented and would, inherently, be unable to handle future needs. It was his contention that users of huge data bases of the future should not be burdened with the intricacies of complicated data representation -- rather, data representations should be transparent to the user. In fact, the user should not even be aware of the type of storage media or even the type of machine being used. (Machine-independence means data representation independence. No machine-independent systems currently exist. Even today, accessing data requires knowledge of how data is physically represented.)

Dr. Westervelt had often lectured on the need for a general data structure that was not machine oriented but one that was information oriented, where design emphasis was placed on information content instead of machine representation. For example, in a graphics problem concerning diagrams constructed from lines and points, the positions of lines and points relative to one another are invariant no matter what machine is used or how the information is stored .

The idea behind this investigation was to try to separate those properties of data structures which were machine independent from those properties which were machine dependent, and then develop a data structure which allowed isolation and separate control of these respective properties. Historically (as viewed in 1970) it seems that most data structures have been developed for a particular problem on a particular machine and then generalized. It seemed that the best approach would be to start with a general means for expressing any problem and then worry about expressing the general representation on a particular machine. This was the approach taken which eventually resulted in the development of the concept of a Set-Theoretic Data Structure.

Purpose: A Set-Theoretic Data Structure -- STDS
            To provide a storage structure representation of arbitrarily related data allowing
            the data to be stored as sets and then retrieved using set operations.

            a. Flexibility, generality and reliability provided by set theory.
            b. Machine-independent representation of data.
            c. Minimal storage requirements.
            d. Real-time access of data.
            e. Easy modification of data.
            f. Ideal for distributed data environments.
            g. No restrictions on questions as long as they are formulated using sets and set operations.

Information vs. Representation
A search for commonality of purpose in existing data structures showed the basic concern of all data structures was: information versus representation. Other complexities involving the amount of data, kinds of questions to be answered, speed with which they were answered, updating requirements, all depend on the information contained in the data and the representation of this information in the machine . These two areas can be characterized separately and independently, and therefore should be approached separately and independently instead of being blended together. However, this does not seem to be a widely accepted practice in developing data structures.

Machine Environment & Information Environment
In order to pursue this approach a distinction must be drawn between a machine environment and an information environment. The first may partially be characterized by: addresses, codes, memories, machine hardware, computer software, programming costs, data locations, sorts, storage medium, storage size, retrieval speed, cpu-seconds, pointers . . . in short: the empirical world of the computer. This seems to be the most comfortable area to work in for those involved with the development of data structures. However, it only covers half the problem. The second area is not so easily characterized, which may partially account for its not being isolated. This is the area of pure information, data relationships, questions, answers, information extraction . . . in short: the abstract world of information.

Here rests the heart of the data structure problem, for in order to treat the information environment and the machine environment as functionally separate entities, their respective structures must be precisely stated. The structure of the machine environment is generally accepted, but a separate structure for the information environment is not. Therefore, the information environment is usually couched in terms of the machine environment, thus abolishing any hope of functional separability.

The problem then is to devise or find a suitable structure for the information environment, such that the resulting information environment operations will insure the same result, independent of any particular machine-representation of such information. If the information environment operations themselves are defined in terms of a particular machine-representation, then they are intrinsically dependent on that representation. (This, unfortunately, characterizes many current data structures.) The means for isolating information environment operations from the machine-representations of data is not necessarily obvious. The difficulty arises in trying to insure generality and consistency. Generality in that the operations must not be restricted to a particular information configuration, but must be applicable for any possible information configuration; and consistency in that, independent of the information configuration, the operations must yield unique and well-defined results.

This amounts to the adoption or development of an abstract mathematical model for the information environment. (Set theory was eventually chosen as such a model.) Assuming that such a model can be found and that the machine environment and the information environment are functionally separate and completely independent structures, the function of a general data structure would be to connect them in such a way as to preserve their integrity. Therefore, the view is proposed that: any data structure is actually an isomorphism between a machine environment and an information environment preserving the functional aspects of each, The particular type of data structure would depend on the structure of the information environment.


FIGURE 1           

The schema in Figure 1 represents the relationship of the user, the information environment, the data structure, and the machine environment . It will be argued subsequently that this definition allows a data structure to enjoy a degree of machine independence. However, it may not be immediately obvious that such a definition implies any operational improvement in the use of the resulting data structure; but since any particular machine representation has inherent properties associated with it (initial programming cost, storage allocation, updating costs, retrieval speed, etc.), some machine representations are better for some purposes than others.

Given a selection of representations, which one, or ones, should be used with a particular data base; since, in general, no one representation is best or even adequate for all retrieval needs? The answer depends on which storage and retrieval requirements best meet the information environment conditions (which may even change during interrogation of the data base). If the different machine representations could be transformed from one to another (at a 'minimal' cost) without affecting the form of the retrieval program, then any retrieval requirement could be facilitated by the selection from machine representations available. The control of these operations should be accessible to the user, but since these operations do not affect the information environment, only the machine environment, they should be distinct and separate from the operations of the information environment. Figure 2 represents this schematically.


FIGURE 2           

The box labeled ME-Controls allows for any operation which affects only the machine environment, which in turn can not affect the information environment since they are functionally separate. Therefore, these operations do not belong as part of a data structure, (though the previous definition of a data structure does not preclude the use of these operations in some other context). The transformation operations, or mode operations, are in this category and every transformation must have some cost a associated with it in the form of storage and cpu-time used. From the user's view the mode operations are independent of the information environment being used, they are to allow transformations with a cost savings compared to the resulting retrieval and storage costs. The most expensive transformation cost would result if each transformation had to be programmed when the need arose, which is the current situation with many data structures. With MODE operations, however, any transformation can be made between existing machine representations subject only to the gained or lost retrieval speed versus the increased or decreased storage requirements. It is important to remember that the information is an invariant under these transformations, only the machine-representation of the information is changed. With MODE operations any user has the following economic flexibility: During extended periods of non-use the data can be compacted and stored at the lowest possible storage rate. When the need arose for interrogation, modification, updating, or analysis the data could be put into an expanded storage form, thus acquiring the most economical retrieval costs and fastest response time.

Set-Theoretic Data Structure (STDS)
The concept of a Set-Theoretic Data Structure (STDS) evolved over a period of three years. The initial direction leading to its development was inspired by the efforts of Timothy E. Johnson and his work, and by the work of Jerome A. Feldman, on Associative Data Structures. A most significant feature of the ADS was the introduction of the functional notation A(o) = v. This notation is strictly a part of the information environment and can have an arbitrary implementation in the machine environment. The principle drawback, however, with any functional notation is the prerequisite for the arguments to be single valued. This lack of generality at the outset may be an undue restriction. Since a more comprehensive notation does exist, allowing a collection of values for the arguments, it seemed worth exploring .

Image Operation
The notation A[O] = V may not seem to connote any more information or generality than did A(o) = v. However, this notation represents the Image operation in set theory, where O and V are sets of values instead of being restricted to single values. Ambiguities possible with a functional notation are no longer a problem with the image operation: if S represents 'square root' then what is the value of v in S(4) = v? Here two values for V are correct: +2 and -2. The ambiguity disappears using the image notation S[{4}] = {+ 2, -2}. This may further be demonstrated: by the 719-th root of 13, by the children of x, or by the points reachable from y. Besides increasing the generality of A(o) = v to A[O] = V, the introduction of sets allows the potential of utilizing the powerful operations of set theory (which are not defined algorithmically, thus allowing easy adaption to machine independence).

Set Theory?
Crispin Gray and Charles Lang considered set theory in their 1967 paper on ASP (Associative Set Processors). They were also quick to point out the inherent difficulty of attempting to implement set theory on a digital computer: a set by definition is an unordered collection of objects while any machine representation forces an order on the elements. For every set with n elements there exist n! different orderings. Which of these should be used for a storage representation, can a canonical ordering always be found, or does it make any difference? Much of the investigation into data structures was directed toward resolving these questions. It does make a difference! One particular ordering is preferred and it can always be found, (thus allowing a canonical machine representation for any arbitrary set).
During the investigation it became ever increasingly evident that set theory -- if it could be applied to a computer -- would be an ideal structure for the information environment. The feasibility of a set-theoretic approach was described in CONCOMP Technical Report 6 and was also presented in Edinburgh, Scotland, at the IFIP Congress 68, under the rather cumbersome title: Feasibility of a Set-Theoretic Data Structure: A general structure based on a reconstituted definition of relation.[IFIP] The result allows the separation of the information environment from the machine environment, by letting set theory represent the information environment. It is now the function of the data structure to tie the two back together again, using the earlier definition of data structure as an isomorphism.

DEFINITION: A Set-Theoretic Data Structure -- STDS
            A STDS is an isomorphism between a machine environment and set theory preserving
            a given universe U under a collection of set operations S.

This conforms to the schema of Figure 1 with the information environment represented by set theory. One immediate consequence of this definition is that from the user's view (the set theory side) it appears to be a machine independent data structure. Some people may argue that a machine independent data structure is a contradiction in terms since any data structure must be concerned with addresses, searches, pointers, word length, and other machine dependent characteristics. By such a definition, of course, a machine independent data structure is a logical, and hence technical, impossibility. Whether or not there is any agreement to calling an isomorphism a data structure in some classical sense is irrelevant. The relevant issue is the separation and definition of those aspects of "data structure usage" that do or do not depend on the particulars of a given machine.

It would seem that any data structure could properly be called machine independent if a user could access any type of data represented in any way desired without needing to know what machine was being used. The only clue to the machine might be the storage sizes available and the retrieval times experienced. To justify that STDS is a machine independent data structure in this sense, it may help to examine why other data structures are not machine independent. Most fail immediately since they require a fixed data representation and seldom allow spontaneous construction of questions. The data is forced into these fixed structures which are already geared to specific types of questions. All data is machine independent before it is forced into a particular machine . What happens to this information independence, why is it lost? It is only lost in the sense that it can no longer be accessed directly, it has to be embraced with "handles" or "hooks" or pointers or labels or whatever, which in turn are able to be accessed. The result is that all data is accessed indirectly through arbitrary appendages. Unfortunately these appendages always seem to possess machine properties which now render the data accessibility machine dependent. The reason for all this indirection seems to stem from the view that all information processing must be algorithmic in nature. A command like "FIND " must invariably reduce to an algorithmic language or procedural representation, which on the surface seems quite reasonable since any machine implementation of anything results in an algorithm. However, there is the flaw! What is necessary for the information environment is a structure embodying operations that are not defined procedurally, but whose implementation, of course, will be. Since the operations do not dictate a procedure, any procedure giving the correct result is legitimate. Set theory is such a structure.
Given two sets A and B, and any set operation *, then A*B = C is completely defined for all x if it can be determined if x is an element of C just by determining the truth or falsity of a statement concerning the membership of x in A and in B. In other words, in set theory only the result is defined, not how to obtain the result. Therefore, if this aspect of set theory can be preserved and implemented on a computer, then all operations would be independent of how the sets were represented in the machine and only dependent on the information content of data represented. Any procedure for executing the set operation would work as long as the information content of C was correct. C could even be represented in the machine a different way every time. Clearly there would be representations that would be more desirable than others for certain operations. Since the information content of any set is the same no matter what the machine representation is, a set can have its representation changed without effecting the result of any set operation, (the only effect would be on execution time and storage allocation).

A STDS, therefore, allows a user to see only set theory while operating on a machine environment. However, a user may wish to exercise control over the machine representation. The MODE operations, mentioned earlier, allow changing from one machine representation to some other. Only 'time' and 'storage' characteristics need be known about the different MODES or machine representations, since those are the only affects that can be detected by the user. Figure 3 represents the schema for a STDS.


FIGURE 3           

Any structure that allows referencing the data directly without dependence on artificial devices would give this kind of machine independence. The viability of such a structure depends on the expressive power allowed by the interaction of the operations. Here is the real strength of set theory. No argument can do it justice. Only first hand experience can demonstrate the inherent power of the set operations . For this reason, an interactive demonstration calling program for set operations was written for use in MTS. People who were introduced to it very quickly were doing retrieval queries they had previously thought to be impossible. (Examples appear in the appendices).[*] With set theory as a model the information environment acquires all the precision, generality, and formalism inherent in set theory. The general concepts proposed earlier may now be particularized starting with the separation and delineation of the information environment and the machine environment. The result is two separate and functionally independent spheres of activity, characterized in part by the following table:


Any collection of information can be represented by a set in the information environment, while the mode of that set is the particular machine-representation of that set in the machine environment. Changing the mode of a set clearly does not change the information content of the set but only how that set is represented in the machine environment. A legitimate mode can be modeled after any of the currently popular data organization techniques: trees, lists, inverted lists, ring structures, cellular multilist files, index sequential files, hash codes or content addressable organizations. In fact, any future machine-organization or hardware memory device can be utilized for a mode, or machine-representation of data. In all cases the information retrieval speed characteristics of the data organization will be preserved. However, due to the intrinsic advantages of set theory, the storage requirements will generally be far less when the data is organized with a STDS and the representation will always be machine relocatable. The latter property is possible by the consistent use of relative pointers and the complete avoidance of absolute pointers for representing information. For an example:

Let E be a set of employed persons and let F be a set of fathers. Then, to find the set A of all fathers who are also employed, A would equal the INTERSECTION of E and F. This operation takes place in the information environment. To actually accomplish this retrieval requires assigning modes to both E and F in a machine environment. Let E have, say, MODE(6) which may be assumed to have slow retrieval and small storage properties. Assign, say, MODE(3) to F and assume that it has fast retrieval characteristics but requires large storage. Let IN be a FORTRAN callable subroutine which performs the operation of INTERSECTION. Then CALL IN(E, F, A) gives A as the set of fathers who are also employed, and the mode of A will be the default mode.

The point to be made here is that if the mode of E were changed from 6 to 3 or if the mode of F were changed from 3 to 6, or if both modes were changed, the resulting set A of IN (E, F, A) would be exactly the same, only the time to execute IN(E, F, A) and the resulting default mode for A might vary .
An important concept in set theory is that of partitioning. This concept allows STDS to provide the facility to take maximum advantage of any storage medium when handling large data bases. Partitioning subdivides a data base into several distinct and independent parts as a function of the information content of the parts.

For example, suppose that one must deal with a data base of, say, two million automobiles where each automobile has, say, 150 recorded data items. This data base might be contained on 20 magnetic tapes and require one and one-half hours per day to update. Suppose further, however, that it is the case that only about 30 of the 150 recorded items account for nearly all of the updating required.
If the data base were organized on tape as an STDS, the first pass over the set of twenty tapes could partition the data base into two parts:
            4 tapes containing the partition subject to frequent change,
            16 tapes of relatively static data.
With the data so partitioned, the daily update would now deal with only 1/5 of the tapes previously required. Furthermore, no loss of interrogation generality is incurred by this process . If wild card changes occur (i.e. changes that are not among the most frequent 30 items), these changes may be used to obtain dynamic partitioning, if desired. Or they may be maintained as a small set of "changed " individuals until it becomes economically or procedurally attractive to merge (in terms of Set Theory, Union) them with the master data base.
[Note: This paper was written at a time when high speed, flexible data access was a baggage cart hustled through Pentagon corridors, loaded with magnetic tapes. After 1967, STDS focused on random access disk drives.]

Other facilities also follow from the basic nature of a STDS. Among the more important features are:
          1) Any data base that is in machine readable form requires no redesign for use with STDS.
         2) Most data bases in their raw input form (i.e. no structure associated) have enough redundancy and unused portions of data fields to permit a reduction in storage requirements by a factor of about four (based on current experience) when placed into the most compact STDS. Importantly, the data base in maximally compacted form in STDS would still permit total interrogation. In one specific case, a data base of 300,000 records of 120 characters/record was reduced to most compact form. Later when expanded for fastest retrieval, the expanded sets occupied only 1/4 of the space originally required by just the raw data (i.e. 9 x 106 characters for fastest retrieval and total interrogation versus 3 x 105 x 120 = 36 x 106 characters in raw form).
         3) Test runs of huge data bases may be expedited by using STDS to extract random samplings (i.e. subsets). These subsets may be used for experiments with partitioning and other schemes to obtain the fastest interrogation of the complete data base. Importantly, all STDS operations are completely compatible in both the subset and total data base cases.
         4) Set Theory lends itself naturally to an English language superstructure. Many set operations already have English equivalents. For example:

"or"→"union" "The set of blue or red automobiles " is the same as
"The union of the set of blue automobiles with the set of red automobiles."

"and"→"intersection" "The set of people who are married and unemployed " is the
same as "The intersection of the set of married people with the set of unemployed people.

"but not"→"Relative Complement" "The set of employed but not collage graduates" is the same
as "The relative complement of the set of employed with respect to the set of collage graduates."

"not both"→"Symmetric Difference" "The set of people employed or in school, but not both"
is the same as "The Symmetric Difference of the set of people employed and in school.

          5) The modularity of set theory may be carried into the design of a STDS. All of the set-theoretic operations may be implemented as subroutines. Each subroutine for either operations or modes (storage-representations) and independent of one another. Further, new operations ( or modes) may be added at any time with no disturbance of previously implemented operations, modes or data already in existence.

Finally, the universality of set theory allows for the complete range of inquiry and complete management over any type of data. This fundamental fact together with the constructive demonstration of the existence of a canonical machine-representation for an arbitrary set provides the foundation for a Set-Theoretic Data Structure and its properties. However, the viability of any such implementation will depend on the transformation algorithms, the set operation algorithms, the storage media available, and most importantly the cleverness in the design of the different machine-representations.

          Appendix D.1    Expanded Definition of STDS Facilities
          Appendix D.2    Some Familiar Examples Presented in Set-Theoretic Form
          Appendix D.3    Description of the Demonstration Files for STDS Used in D.4
          Appendix D.4    Example Use of STDS

The ARPA report concluded that machine-independent data access operations, STDS, had been mathematically well-defined and implemented successfully.[ARPA] Though government funding ended in 1970, commercial interests in processing data as extended sets continued for another fifty years.

---------------------- ADDENDUM ----------------------

Machine-Independent Data Access Operations
Accessing Data by Identity, not by Location

In 1965 the term relation was universally understood to be a well-defined set of homogeneous n-tuples. Not a user friendly labeled-table obfuscating a Gordian knot of indexed data. STDS operations are on relations in the strictest set-theoretic definition of the term, sets of homogeneous n-tuples. The only problem was that Classical set theory could not support operations on n-tuples. Since the objective was to model operations on records with set operations on n-tuples and since we had no idea what set operations were needed, we assumed that records were sets, wrote the programs that were needed, and made up set operations to match the programs, The following provides the set-theoretic justification of the these operations. (The axioms for supporting these operations were not available until the mid 80's.)

Data, Information & Machine-Independence
When developing any formal model it is generally a good idea to know what it is that is intended to be modeled. Though Data is a familiar term, it does not necessarily have a shared understanding. The terms information and machine-independence are also used as if everyone had the same understanding of their meaning. This usually results with seeming agreement on basic misunderstandings. To circumvent this precedent explicit use of the terms are presented.

                    DATA:   Relationships of interest.
                    INFORMATION:   Data derived from data.
                    MACHINE-INDEPENDENCE:   Data accessed by identity.

These simple definitions are rich and precise enough to support requirements for implementing machine-independent data access systems. Information is defined as derived data. Data defines relationships. Relationships are abstract associations, mathematical objects. While relationship representations on a computer are byte sequences, physical objects. The 1965 research challenge of implementing machine-independent information access systems was to define data relationships mathematically, store them physically, and define operations on them to access information.

Set operations provide applications a portal to information embedded in data.

Even prior to the ARPA project, Dr. Westervelt had been advocating the need for a mathematical foundation for representing and accessing data. As a mathematician and engineer Dr. Westervelt recognized that all engineering disciplines, sciences and technologies had formal foundations. Without a formal architectural foundation, data processing is just an art of computer programming.

Dr. Westervelt proposed a data processing architecture consisting of an information environment and a machine environment each representing data as sets. The information environment representing abstract objects. While the machine environment represented concrete objects. The key feature of this architecture was that the environments shared set definitions, but not set representations. Operations defined abstractly in the information environment were executed faithfully in the machine environment.

Attempting to formally model data processing systems with set theory was not a novel idea. The first step of our research was to analyze previous attempts with using set theory to model data relationships abstractly and concretely with content preserving mappings between them. None were successful nor even promising to pursue. Since relying on an existing set theory was not a prerequisite, the decision was made to determine what operations needed to be formally modeled by a set theory, then formulate what an acceptable set theory might look like.

The term machine-independence is not well undestood and is most important for future systems development. It is essential for real-time integarted derivation of information from globally accessable pools of arbitrarily preserved data. The single most impoetant property of a machine-independent system is the ability to determine if two data representations on seperate platforms are informationally equivalent. No system today can support that requirement without knowing the physical properties of both data representations. That is the essence of machine-dependence

Tuples as Records
In set theory all relations are defined as sets of ordered pairs.[Suppes p.57] Thus, all relations are binary relationships between just two items. Most people (excluding mathematicians) believe that relations can be sets of n-tuples. Which they can, but all n-tuples are just ordered-pairs of nested ordered-pairs. Though modeling records as legitimate n-tuples was the research objective, it was not clear how to model operations on n-tuples, defined by convention as nested sets of ordered pairs.

Since it was not clear what operations needed to be defined even if tuples were well-defined, the objective of research was reversed. Assume n-tuples are defined as records. Develop the necessary operations on records and worry about a mathematical justification later. STDS was the result.

Extended Set Operations
A unique property of STDS is that all operations on data are required to be set-theoretically defined. This not only ensures mathematical consistency and operational reliability, but leverages machine-independence by allowing better performing products to be substituted for inferior performing products.

Since all operations need to be set-theoretically well-defined, developers need to be able to express new operations in terms of extended set notation, XSN. All this means is how to incorporate element superscripts in set-theoretic definitions. This is accomplished quite simply by incorporating the element superscript as a membership subscript.
                   CST SET:   A is a CST set: x∈A ↔ {x} ⊆ A.
                   XST SET:   A is a XST set: x∈iA ↔ {xi} ⊆ A.
The above are read, A is a CST set asserts that x is an element of A iff the set containing x is a singleton subset of A. While A is a XST set asserts that x is an i-th element of A iff the set containing x, as an i-th element, is a singleton subset of A. This simple extended notation can be used to convert any CST set into an XST set. All familiar CST operations can be extended to XST operations.
                                                        UNION:   A ⋃ B = { zi: x∈iA or x∈iB }
                                       INTERSECTION:   A ⋂ B = { zi: x∈iA & x∈iB }
                   RELATIVE COMPLEMENT:   A ~ B = { zi: x∈iA & x∉iB }
                      SYMMETRIC DIFFERENCE:   A Δ B = (A ⋃ B) ~ (A ⋂ B)

All of the above apply to all XST sets, but like with CST some additional operations are specific to sets of n-tuples. Additional CST definitions, however, only apply to sets of ordered pairs. XST definitions are extended to sets of n-tuples.

Extended Relation Operations
To appreciate what the term extended means when applied to extended relations it might help to compare XST definitions with CST definitions. Two points to notice. CST treats all relations as sets of ordered pairs and none of the definitions involving relations require knowledge of how ordered pairs are defined as sets. The major difficulty in moving from binary relations to n-ary relations was the lack of an ability to compare different tuple-values from different tuples. Which is a programming capability quite common for comparing different field values from different records.

n-ary Relations
Applications process records. There are no set operations that process n-tuples. Unfortunately, n-tuples are defined, by conventions, as nested ordered pairs. In order to model programming procedures on records with set-theoretic operations on tuples, tuples need to be set-theoretically defined.
                                Definition: Tuple:  x = < x1, x2, .. , xn > = { x11, x22, .. , xnn }
The definition of tuple introduces the notation of extended sets. The extension applies to the addition of a superscript on each element of a set. This superscript indicates a formal extension to the definition of set membership. For defining tuples the extension imposes a ordering of 1 to n on the elements of the set representing a specific n-tuple. For example:
    < a, b > = { a1, b2 }    < a, b, c, d > = { a1, b2, c3, d4 }    < < a, b >, < c, d > > = { { a1, b2 }1, { c1, d2 }2 }    < a > = { a1 }
Without the operations of tuple item and tuple extraction STDS could not have been implemented. Both operations are absent in set theory. Both are required for a machine-independent declaration of a machine-dependent executable. Procedurally the contents of two given records can be compared. This is not the case with two given tuples. Without a formal set-theoretic definition membership can not be determined. For example given r = < a, b, x, y > and s = < a, x, b, y >, CST provides no operations for comparing r and s. Nor does CST provide any mechanism for mapping one to the other. Extended set tuple operations provide the following:   ρ1 (r) = ρ1 (s)   ρ2 (r) = ρ3 (s)   ρ3 (r) = ρ2 (s)   ρ4 (r) = ρ4 (s) and for   σ = < 1, 3, 2, 4 >,   ρσ (s) = r .

Relations as Processes
Dr. Westervelt distinguished between Data Management as a record retrieval system and Information Management as an information derivation facility. A distinguishing feature between the two is the existence of an Image operation. The Image operation turns relations into processes. For example given the relation F, a set of ordered pairs < x, y >, where "y" is the father of "x" the function F(x) = y is formally defined in set theory by the Image operation, F[{x}] = {y}. The Image operation was the workhorse of STDS.

The Image operation was programmed using FORTRAN. Ordered pairs were defined as binary-records. In STDS the Image operation was represented by the FORTRAN subroutine IM( A, F, C ) for C = F[A] . Defining n-tuples as records and only allowing unary and binary records allows modeling sets with 1-tuples and 2-tuples. For example let F = { < a, b >, < c, d >, < r, s > } and A = { < a > } then F[A] = { < b > } .

The STDS Image operation has a simple set definition, F[A] = { < y >: (∃ x)( < x > ∈ A & < x, y > ∈ F )}. Though set theory does not support a useful definition of a 1-tuple, < x >, the notation provided an unambiguous description of what had to be programmed.

Derived Relationships (Information)
For an information management system the single most important feature is the ability to extract meaningful relationships (information) from virgin source data. Set operations coupled with pools of related binary relationships support such a capability.

Consider a source data to be just the two sets F and M such that given a set of people, P, F[P] is the set of all fathers of people in P and M[P] is the set of all mothers of people in P. Then the following represent some derived relationships obtainable from the source data:
          F[F[P]] is the set of Grand-Fathers on the father's side of people in P,
          F[M[P]] is the set of Grand-Fathers on the mother's side of people in P,
          M[F[P]] is the set of Grand-Mothers on the father's side of people in P,
          M[M[P]] is the set of Grand-Mothers on the mother's side of people in P,
          (F ⋃ M)[P] is the set of Parents of people in P,
          (F ⋃ M)[(F ⋃ M)[P]] is the set of Grand-Parents of people in P.

Using the Image operation in conjunction with standard set operations allowed STDS to derive new relationships (information) from previously stored or derived relations.

A most important property of machine-independence is the ability to determine if two data representations, on separate platforms, are information equivalent. The simplest case is record equality. Using the n-tuple notation, < x1, x2, .. , xn >, to represent a record of n items, and given an i-th item value operation defied by ρi ( < x1, x2, .. , xn > ) =  xi, let r = < a,b,c,d> > and s = < a,b,c,d> >, each on a different platform. It should now be simple to determine that r equals s. Does ρi ( r) = ρi ( s) for all i? It should be if the records were well-defined sets, but with r = <<< a, b >, c>, d>>> and s = < a,< b,< c, d>>> the records are well-defined (using different conventions) and membership comparison determines that they are not equal. Without a convention-independent definition of n-tuple membership classical set theory is not suitable for modeling machine-independent information management systems.

Since any n-tuple has n! information equivalent representations, in a truly machine-independent implementation any two information equivalent records could be verified. For example, the following   < a, b, c, d >, < c, b, d, a >, and < d, a, b, c > could be verified as information equivalent. Only the notation orderings are different. To support machine-independence, operations are required that can verify information-equality between any two qualifying records, no matter what platform they might be on.

If the notation had been order independent, more set-theoretic in nature, like  { a, b, c, d },  { c, b, d, a }, and { d, a, b, c }, then simple set membership could establish information equivalent. Though order is irrelevant for set-theoretic equivalence classes of n-tuples, it is information bearing for records. They specify which field or domain the respective items belong to. A simple solution is to incorporate the record field information as part of the set membership:  { a1, b2, c3, d4 },  { c3, b2, d4, a1 }, and { d4, a1, b2, c3 }. Classic set membership, x { x}, can be extended to incorporate field information by   x s{ xs }.

The notation < x1, x2, .. , xn > is intended to represent both a record with a physical representation and an n-tuple with a unique set membership condition. Using an extended set membership condition with superscripts, referred to as scope values, n-tuples can be defined as a set of cardinality n having n consecutive scope vales from 1 through n. Let  < x1, x2, . . , xn > = { x11, x22, . . , xnn }.
             Examples:  < a > = {  a1  },   < a, b > = {  a1, b2  },   < a, b, c, d > = {  a1, b2, c3, d4  } .
An operation for extracting an i-th item value can now be defined using an extended set tuple membership condition. Let ρi ( < x1, x2, .. , xn > ) =  xi   iff   xi i { x11, x22, . . , xnn }.
                          Example:   ρ(< a, b, c >)  =  ρ(< x, y, z, c >)  =  c
Now that n-tuple membership is un-ambiguously defined and since sets are mathematical objects (i.e. machine-independent), tuple equality can now be determined across platforms.

Informationally Equivalent
Being able to determine whether two records on separate platforms are equal is nice, but rather useless. For any given n-tuple there are n! informationally equivalent representations. Though establishing a tuple representation operation for equality was a good start, it accomplishes very little. Needed was a tuple operation that validated information equality independent of representation ordering. Referred to in set theory as elements of the same equivalence class, a collection of all possible permutations of a finite sequence of items. This implies that for every element of the equivalence class there is a mapping to any other element of the equivalence class. Simply put, given any two tuples, if there exist unique mappings that take each to the other, the tuples are informationally equivalent.

Tuple Re-mapping
Similar in construction to the operation tuple value is an operation for tuple re-mapping. The concept is quite simple "pluck an item from one tuple and plop it into another". In order to obfuscate this simplicity, a formal definition is provided.
             Remapρσ (A)  =  { xs: (∃ i )(  x ∈i A    &    i ∈ σ) }.
                         Example:    ρ< 3, 2, 4, 1 > (< a, b, c, d >)  =  < c, b, d, a >

             Set Mapρ(A,B)  =  { si: (∃ x ) (  x ∈i A   &   x ∈ B ) }.
                         Example:    ρ ( < a, b, c, d >, < c, b, d, a > )  =  < 3, 2, 4, 1>.

Information Access vs. Record Retrieval

(to be continued)


  1. [Skolem] Two Remarks on Set Theory, Skolem, T.: 2. The ordered n-tuples as sets, MATH. SCAND, 5 1957, "It is still a problem how the ordered n-tuple can be defined in the most suitable way."
  2. [Suppes] Axiomatic Set Theory, Suppes, P.:Original publication Van Nostrand, 1960, "All relations are binary."[p. 57]
  3. [TR-3] CONCOMP Project Technical Report 3 - 1968
  4. [IFIP] Feasibility of a Set-theoretic Data Structure: A Reconstituted Definition of Relation, IFIPS - 1968
  5. [AFIP] Description of a Set-theoretic Data Structure: AFIPS - 1968
  6. [ARPA] CONCOMP: Research in Conversational use of Computers: final report, December 1970
  7. [STDS-IMS] UofM Information Management System, STDS supported Information Management System. 1971-1998
  8. [Berztiss] DATA STRUCTURES: Theory and Practice, Berztiss, A.T., Academic Press, New York, 1971, "We have to develop a completely new theory of ordered objects."[p. 27]
  9. [SETP75] , Set Processing in a Network Environment Hardgrave, W. T.: 1975.
  10. [SETP76] A technique for implementing a set processor, Hardgrave, W. T. 1976
  11. [Cratchit] History of Computing at the University of Michigan, 1976
  12. [1976] History of Computing at the University of Michigan, 1976
  13. [1977a] Set Theoretic Data Structures (STDS), Lawrence Livermore Lab. ) Technical Report, 1977.
  14. [LLL-1] Set Theoretic Data Structures (STDS) Tutorial, LLL 1977.
  15. [LLL-2] Comparison of the extended set theory and relational approaches to data base management, LLL, 1977.
  16. [VLDB77] 1977 VLDB Extended Set Theory: A General Model For Very Large, Distributed, Backend Information Systems.
  17. [VLDB84] 1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize Future Systems.
  18. [NATO] A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986
  19. [XSP-XML] XSP: An Integration Technology for Systems Development and Evolution, Champion, M. - 2001
  20. [IBI] Information Access Accelerator iXSP Performance evaluation, Information Builders Inc., Stout, R. - 2005
  21. [MSLab] Managing Data Mathematically:   Data As A Mathematical Object: ♦ "Using Extended Set Theory for High Performance Database Management" Presentation given at Microsoft Research Labs. with an introduction by Phil Bernstein. (video: duration 1:10:52) - 2006
  22. [North] Sets, Data Models and Data Independence Ken North a Dr. Dobbs''s Blogger, March 10, 2010
  23. [2010] Sets, Data Models and Data Independence by Ken North a Dr. Dobbs''s Blogger, March 10, 2010
  24. [SETS?] Why Not Sets? All computer processing is set-theoretic in nature. - 2013
  25. [T&P] XSP TECHNOLOGY: Theory & Practice Formal Modeling & Practical Implementation of XML & RDM Systems ♦ The RDM works in spite of set theory, not because of it. XML-Structures and operations on these structures are set-theoretically sound under XST. In this paper the formal foundations of RDM and XML systems are examined in the light of XST to provide practical relevance of XSP Technology to the field of software systems development.
  26. [C&C] Content-Container Data Access Strategies Content for Functionality - Containers for Performance - 2016 ♦ Since data represented for processing is not always ideal for preservation, and since data represented for preservation is not always ideal for processing, accessing application-friendly data from storage-friendly data poses a genuine challenge.
  27. [DDRvsPDS] Set Processing vs. Record Processing Performance: Dynamic Data Restructuring vs. Prestructured Data Storage (1 page summary) ♦ Record processing systems and set processing systems are very different. This paper attempts to clarify the major differences and demonstrate the performance advantages of set processing.
  28. [XSP] XSP: Extended Set Processing: Mathematically Managing Data Representations (1 page summary) ♦ This paper presents a high level overview of why a mathematical model of data representations is necessary, and how an extended set processing model accomplishes the task.

---------- NOTES --------

  Copyright 2022   INTEGRATED INFORMATION SYSTEMS   Last modified on 07/25/2022