Introduction
At the beginning of the
CONCOMP
project,
Dr. Franklin H. Westervelt
initiated an investigation into data structures.
It was his view that present efforts in data structure development
were too machine-oriented and would, inherently,
be unable to handle future needs.
It was his contention that users of huge data bases of the
future should not be burdened with the intricacies of
complicated data representation --
rather, data representations should be transparent to the user.
In fact,
the user should not even be aware of the type of storage media
or even the type of machine being used.
(Machine-independence means
data representation independence.
No machine-independent systems
currently exist.
Even today, accessing data
requires knowledge of how data is physically represented.)
Dr. Westervelt had often lectured on the need for a general data structure
that was not machine oriented but one that was information oriented,
where design emphasis was placed on information content instead of
machine representation.
For example, in a graphics problem concerning diagrams
constructed from lines and points,
the positions of lines and points relative to one another are
invariant no matter what machine is used or how the information is stored .
The idea behind this investigation was to try to
separate those properties
of data structures which were machine independent from those properties
which were machine dependent,
and then develop a data structure which allowed
isolation and separate control of
these respective properties.
Historically (as viewed in 1970) it seems that most
data structures have been developed for a particular problem on
a particular machine and then generalized.
It seemed that the best approach would be to start with a
general means for expressing
any problem and then worry about expressing the general representation
on a particular machine. This was the approach taken
which eventually resulted in the development of the
concept of a Set-Theoretic Data Structure.
Purpose: A Set-Theoretic Data Structure -- STDS
To provide a storage structure representation of arbitrarily related data allowing
the data to be stored as
sets and then retrieved using set operations.
Consequences:
a. Flexibility, generality and reliability provided by set theory.
b. Machine-independent representation of data.
c. Minimal storage requirements.
d. Real-time access of data.
e. Easy modification of data.
f. Ideal for distributed data environments.
g. No restrictions on questions as long as they are formulated
using sets and set operations.
Information vs. Representation
A search for commonality of purpose in existing data structures showed
the basic concern of all data structures was:
information versus representation.
Other complexities involving the amount of data,
kinds of questions to be answered,
speed with which they were answered,
updating requirements,
all depend on the information contained in the data
and the
representation of this information in the machine .
These two areas can be characterized separately
and independently, and therefore should be approached separately and
independently instead of being blended together.
However, this does not seem to be a widely accepted practice
in developing data structures.
Machine Environment & Information Environment
In order to pursue this approach a distinction must be drawn
between a
machine environment and an information environment.
The first may partially be characterized by:
addresses, codes, memories, machine hardware, computer software,
programming costs, data locations, sorts, storage medium,
storage size, retrieval speed, cpu-seconds,
pointers . . . in short: the empirical world of the computer.
This seems to be the most comfortable area to work in for
those involved with the development of data structures.
However, it only covers half the problem.
The second area is not so easily characterized,
which may partially account for its not being isolated.
This is the area of pure information,
data relationships, questions, answers,
information extraction . . . in short:
the abstract world of information.
Here rests the heart of the data structure problem,
for in order to treat the information environment and the machine
environment as functionally separate entities,
their respective structures must be precisely stated.
The structure of the machine environment is generally accepted,
but a separate structure for the information environment is not.
Therefore, the information environment is usually couched
in terms of the machine environment,
thus abolishing any hope of functional separability.
The problem then is to devise or find a suitable structure for
the information environment, such that the resulting
information environment operations will insure the same result,
independent of any particular machine-representation of such
information.
If the information environment operations themselves are defined
in terms of a particular machine-representation,
then they are intrinsically dependent on that representation.
(This, unfortunately, characterizes many current data structures.)
The means for isolating information environment operations from
the machine-representations of data is not necessarily obvious.
The difficulty arises in trying to insure generality
and consistency.
Generality in that the operations must not
be restricted to a particular information configuration,
but must be applicable for any possible information configuration;
and consistency in that, independent of the information
configuration, the operations must yield unique and
well-defined results.
This amounts to the adoption or development of an abstract mathematical
model for the information environment.
(Set theory was eventually chosen as such a model.)
Assuming that such a model can be found and that the
machine environment and the information environment are
functionally separate and completely independent structures,
the function of a general data structure would be to connect
them in such a way as to preserve their integrity.
Therefore, the view is proposed that:
any data structure is actually an isomorphism between a
machine environment and an information environment
preserving the functional aspects of each,
The particular type of data structure would depend on the structure
of the information environment.
FIGURE 1
The schema in Figure 1 represents the relationship of the
user, the information environment,
the data structure, and the machine environment .
It will be argued subsequently
that this definition allows a data structure to enjoy a degree of machine
independence.
However, it may not be immediately obvious that such a definition implies
any operational improvement in the use of the resulting data structure;
but since any particular machine representation has inherent properties
associated with it
(initial programming cost, storage allocation,
updating costs, retrieval speed, etc.),
some machine representations
are better for some purposes than others.
Given a selection of representations, which one, or ones,
should be used with a particular data base;
since, in general, no one representation is best or even
adequate for all retrieval needs?
The answer depends on which storage and retrieval requirements best meet
the information environment conditions
(which may even change during interrogation of the data base).
If the different machine representations could be transformed from one
to another (at a 'minimal' cost)
without affecting the form of the retrieval program,
then any retrieval requirement could be facilitated by the selection
from machine representations available.
The control of these operations should be accessible to the user,
but since these operations do not affect the information environment,
only the machine environment, they should be distinct and separate
from the operations of the information environment.
Figure 2 represents this schematically.
FIGURE 2
The box labeled ME-Controls allows for any operation which affects
only the machine environment, which in turn can not affect the
information environment since they are functionally separate.
Therefore, these operations do not belong as part of a data structure,
(though the previous definition of a data structure does not preclude
the use of these operations in some other context).
The transformation operations, or mode operations,
are in this category and every transformation must have some cost
a associated with it in the form of storage and cpu-time used.
From the user's view the mode operations are independent of the
information environment being used, they are to allow transformations
with a cost savings compared to the resulting retrieval and storage costs.
The most expensive transformation cost would result if each transformation
had to be programmed when the need arose, which is the current situation
with many data structures.
With MODE operations, however, any transformation can be made between
existing machine representations subject only to the gained or lost
retrieval speed versus the increased or decreased storage requirements.
It is important to remember that the information is an invariant under
these transformations, only the machine-representation of the information
is changed.
With MODE operations any user has the following economic flexibility:
During extended periods of non-use the data can be compacted and stored
at the lowest possible storage rate.
When the need arose for interrogation, modification, updating,
or analysis the data could be put into an expanded storage form,
thus acquiring the most economical retrieval costs and fastest response time.
Set-Theoretic Data Structure (STDS)
The concept of a Set-Theoretic Data Structure (STDS) evolved over a period of three years.
The initial direction leading to its development was
inspired by the efforts of Timothy E. Johnson and his work,
and by the work of Jerome A. Feldman, on Associative Data Structures.
A most significant feature of the ADS was the introduction of the
functional notation A(o) = v.
This notation is strictly a part of the information environment
and can have an arbitrary implementation in the machine environment.
The principle drawback, however, with any functional notation is the
prerequisite for the arguments to be single valued.
This lack of generality at the outset may be an undue restriction.
Since a more comprehensive notation does exist,
allowing a collection of values for the arguments,
it seemed worth exploring .
Image Operation
The notation A[O] = V may not seem to connote any more information
or generality than did A(o) = v.
However, this notation represents the Image operation in set theory,
where O and V are sets of values
instead of being restricted to single values. Ambiguities possible with
a functional notation are no longer a problem with the image operation:
if S represents 'square root' then what is the value
of v in S(4) = v?
Here two values for V
are correct: +2 and -2.
The ambiguity disappears using the image notation
S[{4}] = {+ 2, -2}.
This may further be demonstrated: by the
719-th root of 13,
by the children of x,
or by the points reachable from y.
Besides increasing the generality of A(o) = v to
A[O] = V, the introduction of sets
allows the potential of utilizing the powerful operations
of set theory (which are not defined algorithmically,
thus allowing easy adaption to machine independence).
Set Theory?
Crispin Gray and Charles Lang considered set theory in
their 1967 paper on ASP (Associative Set Processors).
They were also quick to point out the inherent difficulty of
attempting to implement set theory on a digital computer:
a set by definition is an unordered collection of objects
while any machine representation forces an order on the elements.
For every set with n elements there exist n!
different orderings.
Which of these should be used for a storage representation,
can a canonical ordering always be found,
or does it make any difference?
Much of the investigation into data structures was directed toward
resolving these questions.
It does make a difference!
One particular ordering is preferred and
it can always be found,
(thus allowing a canonical machine representation for any arbitrary set).
During the investigation it became ever increasingly evident
that set theory -- if it could be applied to a computer -- would
be an ideal structure for the information environment.
The feasibility of a set-theoretic approach was described in
CONCOMP Technical Report 6 and
was also presented in Edinburgh, Scotland, at the IFIP Congress 68,
under the rather cumbersome title:
Feasibility of a Set-Theoretic Data Structure:
A general structure based on a reconstituted definition of
relation.[IFIP]
The result allows the separation of the information environment from
the machine environment,
by letting set theory represent the information environment.
It is now the function of the data structure to tie the two back together again,
using the earlier definition of data structure as an isomorphism.
DEFINITION: A Set-Theoretic Data Structure -- STDS
A STDS is an isomorphism between a machine environment and
set theory preserving
a given universe U under a collection
of set operations S.
This conforms to the schema of Figure 1 with the information environment
represented by set theory.
One immediate consequence of this definition is that from the user's
view (the set theory side) it appears to be a machine independent data structure.
Some people may argue that a machine independent data structure is
a contradiction in terms since any data structure must be concerned with
addresses, searches, pointers, word length, and other machine dependent
characteristics.
By such a definition, of course, a machine independent
data structure is a logical, and hence technical, impossibility.
Whether or not there is any agreement to calling an isomorphism
a data structure in some classical sense is irrelevant.
The relevant issue is the separation and definition of those
aspects of "data structure usage" that do or do not depend on
the particulars of a given machine.
It would seem that any data structure could properly be called
machine independent if a user could access any type of data
represented in any way desired without needing to know what
machine was being used.
The only clue to the machine might be the storage sizes available
and the retrieval times experienced.
To justify that STDS is a machine independent data structure in this sense,
it may help to examine why other data structures are not machine independent.
Most fail immediately since they require a fixed data representation
and seldom allow spontaneous construction of questions.
The data is forced into these fixed structures which are already geared
to specific types of questions.
All data is machine independent before
it is forced into a particular machine .
What happens to this information independence,
why is it lost?
It is only lost in the sense that it can no longer be accessed directly,
it has to be embraced with "handles" or "hooks" or pointers or labels or
whatever, which in turn are able to be accessed.
The result is that all data is accessed indirectly through arbitrary appendages.
Unfortunately these appendages always seem to possess
machine properties which now render the data accessibility machine dependent.
The reason for all this indirection seems to stem from the view that
all information processing must be algorithmic in nature.
A command like "FIND " must invariably reduce to an algorithmic
language or procedural representation, which on the surface seems
quite reasonable since any machine implementation of anything
results in an algorithm.
However, there is the flaw!
What is necessary for the information environment is a structure embodying
operations that are not defined procedurally, but whose implementation,
of course, will be.
Since the operations do not dictate a procedure, any procedure giving
the correct result is legitimate.
Set theory is such a structure.
Given two sets A and B, and any set operation *,
then A*B = C is completely defined for all x
if it can be determined if x is an element of C just
by determining the truth or falsity of a statement concerning the
membership of x
in A and in B.
In other words, in set theory only the result is defined,
not how to obtain the result.
Therefore, if this aspect of set theory can be preserved and
implemented on a computer, then all operations would be independent
of how the sets were represented in the machine and only dependent
on the information content of data represented.
Any procedure for executing the set operation would work as long as
the information content of C was correct.
C could even be represented in the machine a different way every time.
Clearly there would be representations that would be more desirable
than others for certain operations.
Since the information content of any set is the same no matter what
the machine representation is,
a set can have its representation changed without effecting the result
of any set operation, (the only effect would be on execution time and storage allocation).
A STDS, therefore, allows a user to see only set theory while operating on
a machine environment.
However, a user may wish to exercise control over the machine representation.
The MODE operations, mentioned earlier, allow changing from one machine
representation to some other.
Only 'time' and 'storage' characteristics need be known about the
different MODES or machine representations,
since those are the only affects that can be detected by the user.
Figure 3 represents the schema for a STDS.

FIGURE 3
Any structure that allows referencing the data directly without dependence
on artificial devices would give this kind of machine independence.
The viability of such a structure depends on the expressive power allowed
by the interaction of the operations.
Here is the real strength of set theory.
No argument can do it justice.
Only first hand experience can demonstrate the inherent power of
the set operations . For this reason, an interactive demonstration
calling program for set operations was written for use in MTS.
People who were introduced to it very quickly were doing retrieval
queries they had previously thought to be impossible.
(Examples appear in the
appendices).[*]
With set theory as a model the information environment acquires
all the precision, generality, and formalism inherent in set theory.
The general concepts proposed earlier may now be particularized
starting with the separation and delineation of the information
environment and the machine environment.
The result is two separate and functionally independent spheres of activity,
characterized in part by the following table:

Any collection of information can be represented by a set in the information environment,
while the mode of that set is the particular machine-representation of that set in the
machine environment.
Changing the mode of a set clearly does not change the information content
of the set but only how that set is represented in the machine environment.
A legitimate mode can be modeled after any of the currently popular
data organization techniques: trees, lists, inverted lists, ring structures,
cellular multilist files, index sequential files,
hash codes or content addressable organizations.
In fact, any future machine-organization or hardware memory device
can be utilized for a mode, or machine-representation of data.
In all cases the information retrieval speed characteristics of the
data organization will be preserved.
However, due to the intrinsic advantages of set theory,
the storage requirements will generally be far less when the data
is organized with a STDS and the representation will always be machine
relocatable.
The latter property is possible by the consistent use of relative pointers
and the complete avoidance of absolute pointers for representing information.
For an example:
Let E be a set of employed persons and let
F be a set of fathers.
Then, to find the set A of all fathers who are also employed,
A would equal the INTERSECTION of E and F.
This operation takes place in the information environment.
To actually accomplish this retrieval requires assigning modes
to both E and F in a machine environment.
Let E have, say, MODE(6) which may be assumed to have
slow retrieval and small storage properties.
Assign, say, MODE(3) to F and assume that it has
fast retrieval characteristics but requires large storage.
Let IN be a FORTRAN callable subroutine which performs the operation
of INTERSECTION. Then CALL IN(E, F, A) gives A as
the set of fathers who are also employed, and the mode of A will
be the default mode.
The point to be made here is that if the mode of E were changed
from 6 to 3 or if the mode of F were changed from 3 to 6,
or if both modes were changed, the resulting set A of IN (E, F, A)
would be exactly the same, only the time to execute
IN(E, F, A) and the resulting default mode for A might vary .
An important concept in set theory is that of partitioning.
This concept allows STDS to provide the facility to take maximum advantage of any storage
medium when handling large data bases.
Partitioning subdivides a data base into several
distinct and independent parts as a function of the information content of the parts.
For example, suppose that one must deal with a data base of, say,
two million automobiles where each automobile has, say, 150 recorded data items.
This data base might be contained on 20 magnetic tapes and require
one and one-half hours per day to update.
Suppose further, however, that it is the case that only about
30 of the 150 recorded items account for nearly all of the
updating required.
If the data base were organized on tape as an STDS,
the first pass over the set of twenty tapes could partition the
data base into two parts:
4 tapes containing the partition subject to frequent change,
16 tapes of relatively static data.
With the data so partitioned, the daily update would now deal with
only 1/5 of the tapes previously required.
Furthermore, no loss of interrogation generality is incurred
by this process .
If wild card changes occur
(i.e. changes that are not among the most frequent 30 items),
these changes may be used to obtain dynamic partitioning, if desired.
Or they may be maintained as a small set
of "changed " individuals until it becomes economically or procedurally attractive to merge
(in terms of Set Theory, Union) them with the master data base.
[Note: This paper was written at a time
when high speed, flexible data access
was a baggage cart hustled through Pentagon corridors,
loaded with magnetic tapes.
After 1967, STDS focused on random access disk drives.]
Other facilities also follow from the basic nature of a STDS.
Among the more important features are:
1) Any data base that is in machine readable form requires no redesign for use with STDS.
2) Most data bases in their raw input form (i.e. no structure associated)
have enough redundancy and unused portions of data fields to permit
a reduction in storage requirements by a factor of about four
(based on current experience) when placed into the most compact STDS.
Importantly, the data base in maximally compacted form in STDS would
still permit total interrogation.
In one specific case, a data base of 300,000 records of
120 characters/record was reduced to most compact form.
Later when expanded for fastest retrieval, the expanded sets
occupied only 1/4 of the space originally required by just the raw data
(i.e. 9 x 106 characters for fastest retrieval and total interrogation versus
3 x 105 x 120 = 36 x 106 characters in raw form).
3) Test runs of huge data bases may be expedited by using STDS
to extract random samplings (i.e. subsets).
These subsets may be used for experiments with partitioning
and other schemes to obtain the fastest interrogation of the
complete data base.
Importantly, all STDS operations are completely compatible in both
the subset and total data base cases.
4) Set Theory lends itself naturally to an English language superstructure.
Many set operations already have English equivalents.
For example:
"or"→"union" "The set of blue or red automobiles " is the same as
"The union of the set of blue automobiles with the set of red automobiles."
"and"→"intersection" "The set of people who are married and unemployed
" is the
same as "The intersection of the set of married people with the set of unemployed people.
"but not"→"Relative Complement" "The set of employed but not
collage graduates" is the same
as "The relative complement of the
set of employed with respect to the set of collage graduates."
"not both"→"Symmetric Difference"
"The set of people employed or in school, but not both"
is the same as "The Symmetric Difference
of the set of people employed and in school.
5) The modularity of set theory may be carried into the design of a STDS.
All of the set-theoretic operations may be implemented as subroutines.
Each subroutine for either operations or modes (storage-representations)
and independent of one another.
Further, new operations ( or modes)
may be added at any time with no disturbance of previously implemented
operations, modes or data already in existence.
Finally, the universality of set theory allows for the complete
range of inquiry and complete management over any type of data.
This fundamental fact together with the constructive demonstration
of the existence of a canonical machine-representation for an
arbitrary set provides the foundation for a Set-Theoretic Data Structure
and its properties.
However, the viability of any such implementation will depend on the
transformation algorithms, the set operation algorithms,
the storage media available, and most importantly the cleverness
in the design of the different
machine-representations.
Appendix D.1
Expanded Definition of STDS Facilities
Appendix D.2
Some Familiar Examples Presented in Set-Theoretic Form
Appendix D.3
Description of the Demonstration Files for STDS Used in D.4
Appendix D.4
Example Use of STDS
The ARPA report concluded
that machine-independent data access operations, STDS,
had been mathematically well-defined and
implemented successfully.[ARPA]
Though government funding ended in 1970,
commercial interests in
processing data as extended sets
continued for another fifty years.
----------------------
ADDENDUM
----------------------
Machine-Independent Data Access Operations
Accessing Data by Identity, not by Location
In 1965 the term relation was universally understood to be
a well-defined set of homogeneous n-tuples.
Not a user friendly labeled-table obfuscating
a Gordian knot of indexed data.
STDS operations are on relations in the strictest set-theoretic
definition of the term, sets of homogeneous n-tuples.
The only problem was that Classical set theory could not
support operations on n-tuples.
Since the objective was to model operations on records
with set operations on n-tuples
and since we had no idea what set operations were needed,
we assumed that records were sets, wrote the programs that were needed,
and made up set operations to match the programs,
The following provides the set-theoretic justification of the these operations.
(The axioms for supporting these operations were not available until the mid 80's.)
Data, Information & Machine-Independence
When developing any formal model
it is generally a good idea to know what it is
that is intended to be modeled.
Though Data is a familiar term,
it does not necessarily have a shared understanding.
The terms
information and
machine-independence
are also used as if everyone
had the same understanding of their meaning.
This usually results with seeming agreement on
basic misunderstandings.
To circumvent this precedent explicit use of the terms
are presented.
DATA: Relationships of interest.
INFORMATION: Data derived from data.
MACHINE-INDEPENDENCE: Data accessed by identity.
These simple definitions
are rich and precise enough to support
requirements for implementing
machine-independent data access systems.
Information is defined as derived data.
Data defines relationships.
Relationships are abstract associations, mathematical objects.
While relationship representations on a computer are byte sequences, physical objects.
The 1965 research challenge
of implementing machine-independent
information access systems
was to define data relationships mathematically,
store them physically,
and define operations on them to access information.
Set operations provide applications
a portal to information embedded in data.
Even prior to the ARPA project,
Dr. Westervelt had been advocating the need for a mathematical
foundation for
representing and accessing data.
As a mathematician and engineer Dr. Westervelt
recognized that all
engineering disciplines, sciences and technologies had
formal foundations.
Without a formal architectural foundation,
data processing is just an art of computer programming.
Dr. Westervelt proposed a data processing
architecture consisting
of an information environment
and a machine environment
each representing data as sets.
The information environment representing abstract objects.
While the machine environment represented concrete objects.
The key feature of this architecture was that the
environments shared set definitions, but not set representations.
Operations defined abstractly in the information environment
were executed faithfully in the machine environment.
Attempting to formally model data processing systems
with set theory was not a novel idea.
The first step of our research was to analyze previous attempts
with using set theory to model data relationships
abstractly and concretely with
content preserving mappings between them.
None were successful nor even promising to pursue.
Since relying on an existing set theory was not a prerequisite,
the decision was made to determine what operations needed to be
formally modeled
by a set theory, then formulate what an acceptable set theory might look like.
The term machine-independence
is not well undestood and is most important for
future systems development.
It is essential for real-time integarted derivation
of information from globally accessable pools of
arbitrarily preserved data.
The single most impoetant property of a
machine-independent system is the ability to determine
if two data representations on seperate platforms
are informationally equivalent.
No system today can support that requirement without
knowing the physical properties of both data representations.
That is the essence of machine-dependence
Tuples as Records
In set theory all relations are defined
as sets of ordered pairs.[Suppes p.57]
Thus, all relations are binary relationships between just two items.
Most people (excluding mathematicians)
believe that relations can be sets of n-tuples.
Which they can, but all n-tuples
are just ordered-pairs of nested ordered-pairs.
Though modeling records as legitimate n-tuples
was the research objective, it was not clear
how to model operations on n-tuples,
defined by convention
as nested sets of ordered pairs.
Since it was not clear what operations needed to be defined
even if tuples were well-defined,
the objective of research was reversed.
Assume n-tuples are defined as records.
Develop the necessary operations on records
and worry about a mathematical justification later.
STDS was the result.
Extended Set Operations
A unique property of STDS is that
all operations on data are required to be set-theoretically defined.
This not only ensures mathematical consistency and
operational reliability, but leverages machine-independence
by allowing better performing products to be substituted for
inferior performing products.
Since all operations need to be set-theoretically well-defined,
developers need to be able to express new operations
in terms of extended set notation, XSN.
All this means is how to incorporate element superscripts
in set-theoretic definitions.
This is accomplished quite simply by incorporating the element superscript
as a membership subscript.
CST SET:
A is a CST set: x∈A ↔ {x} ⊆ A.
XST SET:
A is a XST set: x∈iA ↔ {xi} ⊆ A.
The above are read,
A is a CST set asserts that
x is an element of A iff the set containing x
is a singleton subset of A.
While
A is a XST set asserts that
x is an i-th element of A iff the set containing x,
as an i-th element,
is a singleton subset of A.
This simple extended notation can be used to convert
any CST set into an XST set.
All familiar CST operations can be extended to XST operations.
UNION:
A ⋃ B = { zi: x∈iA or x∈iB }
INTERSECTION:
A ⋂ B = { zi: x∈iA & x∈iB }
RELATIVE COMPLEMENT:
A ~ B = { zi: x∈iA & x∉iB }
SYMMETRIC DIFFERENCE:
A Δ B = (A ⋃ B) ~ (A ⋂ B)
All of the above apply to all XST sets, but like with CST
some additional operations are specific to sets of n-tuples.
Additional CST definitions, however, only apply to
sets of ordered pairs.
XST definitions are extended to sets of n-tuples.
Extended Relation Operations
To appreciate what the term extended means when
applied to extended relations
it might help to compare XST definitions with CST definitions.
Two points to notice.
CST treats all relations as sets of ordered pairs
and none of the definitions involving relations
require knowledge of how ordered pairs
are defined as sets.
The major difficulty in moving from binary relations to n-ary relations
was the lack of an ability to compare different tuple-values
from different tuples.
Which is a programming capability quite common
for comparing different field values from different records.
n-ary Relations
Applications process records.
There are no set operations that process n-tuples.
Unfortunately, n-tuples are defined,
by conventions,
as nested ordered pairs.
In order to model programming procedures on records with
set-theoretic operations on tuples, tuples
need to be set-theoretically defined.
Definition: Tuple: x = < x1, x2, .. , xn >
= { x11, x22, .. , xnn }
The definition of tuple introduces the notation of extended sets.
The extension applies to the addition of a superscript on
each element of a set.
This superscript indicates a formal extension to the definition of
set membership.
For defining tuples the extension imposes a ordering of
1 to n on the elements of the set representing
a specific n-tuple.
For example:
< a, b > = { a1, b2 }
< a, b, c, d > = { a1, b2, c3, d4 }
< < a, b >, < c, d > > =
{ { a1, b2 }1, { c1, d2 }2 }
< a > = { a1 }
Without the operations of tuple item and
tuple extraction
STDS could not have been implemented.
Both operations are absent in set theory.
Both are required for a machine-independent declaration
of a machine-dependent executable.
Procedurally the contents of two given records can be compared.
This is not the case with two given tuples.
Without a formal set-theoretic definition membership can not be determined.
For example given
r = < a, b, x, y > and
s = < a, x, b, y >,
CST provides no operations for comparing
r and s.
Nor does CST provide any mechanism for mapping one to the other.
Extended set tuple operations provide the following:
ρ1 (r) = ρ1 (s)
ρ2 (r) = ρ3 (s)
ρ3 (r) = ρ2 (s)
ρ4 (r) = ρ4 (s)
and
for σ = < 1, 3, 2, 4 >,
ρσ (s) = r
.
Relations as Processes
Dr. Westervelt distinguished between
Data Management as a record retrieval system
and Information Management as an information derivation facility.
A distinguishing feature between the two is the existence of
an Image operation.
The Image operation turns relations into processes.
For example given the relation F,
a set of ordered pairs < x, y >,
where "y" is the father of "x"
the function F(x) = y
is formally defined in set theory by the Image operation,
F[{x}] = {y}.
The Image operation was the workhorse of STDS.
The Image operation was
programmed using FORTRAN.
Ordered pairs were defined as binary-records.
In STDS the Image operation was represented by the FORTRAN subroutine
IM( A, F, C ) for
C = F[A] .
Defining n-tuples as records and only allowing unary and binary records
allows modeling sets with 1-tuples and 2-tuples.
For example let
F = { < a, b >, < c, d >, < r, s > }
and
A = { < a > }
then
F[A] = { < b > } .
The STDS Image operation has a simple set definition,
F[A] = { < y >: (∃ x)( < x > ∈ A & < x, y > ∈ F )}.
Though set theory does not support a useful definition of
a 1-tuple,
< x >,
the notation provided an unambiguous description of what
had to be programmed.
Derived Relationships (Information)
For an information management system
the single most important feature is the ability to extract
meaningful relationships (information)
from virgin source data.
Set operations coupled with pools of related binary relationships
support such a capability.
Consider a source data to be just the two sets F and M
such that given a set of people, P,
F[P] is the set of all fathers of people in P
and
M[P] is the set of all mothers of people in P.
Then the following represent some derived relationships
obtainable from the source data:
F[F[P]] is the set of Grand-Fathers on the father's side of people in P,
F[M[P]] is the set of Grand-Fathers on the mother's side of people in P,
M[F[P]] is the set of Grand-Mothers on the father's side of people in P,
M[M[P]] is the set of Grand-Mothers on the mother's side of people in P,
(F ⋃ M)[P] is the set of Parents of people in P,
(F ⋃ M)[(F ⋃ M)[P]] is the set of Grand-Parents of people in P.
Using the Image operation in conjunction with standard set operations
allowed STDS to derive new relationships (information) from
previously stored or derived relations.
Machine-Independence
A most important property of
machine-independence is the ability to determine
if two data representations, on separate platforms,
are information equivalent.
The simplest case is record equality.
Using the n-tuple notation,
< x1, x2, .. , xn >,
to represent a record of n items,
and given an i-th item value operation
defied by
ρi
( < x1, x2, .. , xn > ) = xi,
let
r = < a,b,c,d> >
and
s = < a,b,c,d> >,
each on a different platform.
It should now be simple to determine that r equals s.
Does
ρi
( r)
=
ρi
( s) for all i?
It should be if the records were well-defined sets,
but with
r = <<< a, b >, c>, d>>>
and
s = < a,< b,< c, d>>>
the records are well-defined
(using different conventions)
and membership
comparison determines that they are not equal.
Without a convention-independent
definition of n-tuple membership
classical set theory is not suitable for modeling
machine-independent
information management systems.
Since any n-tuple has n!
information equivalent
representations, in
a truly machine-independent implementation
any two information equivalent records could be
verified.
For example, the following
< a, b, c, d >,
< c, b, d, a >,
and
< d, a, b, c >
could be verified as information equivalent.
Only the notation orderings are different.
To support machine-independence,
operations are required that can verify
information-equality between any two qualifying records,
no matter what platform they might be on.
If the notation had been order independent,
more set-theoretic in nature,
like
{ a, b, c, d },
{ c, b, d, a },
and
{ d, a, b, c },
then simple set membership
could establish information equivalent.
Though order is irrelevant for set-theoretic equivalence classes of
n-tuples, it is information bearing for records.
They specify which field or domain
the respective items belong to.
A simple solution is to incorporate the record field information as part
of the set membership:
{ a1, b2, c3, d4 },
{ c3, b2, d4, a1 },
and
{ d4, a1, b2, c3 }.
Classic set membership,
x ∈ { x},
can be extended
to incorporate field information
by
x ∈s{ xs }.
The notation
< x1, x2, .. , xn >
is intended to represent both
a record with a physical representation
and
an n-tuple with a unique
set membership condition.
Using an extended set membership condition with
superscripts, referred to as scope values,
n-tuples can be defined as a set of cardinality n
having n consecutive scope vales from 1 through n.
Let
< x1, x2, . . , xn >
= { x11, x22, . . , xnn }.
Examples:
< a > = { a1 },
< a, b >
= { a1, b2 },
< a, b, c, d > =
{ a1, b2, c3, d4 }
.
An operation
for extracting an i-th item value
can now be defined using an extended set tuple membership condition.
Let
ρi
( < x1, x2, .. , xn > ) = xi
iff
xi ∈i
{ x11, x22, . . , xnn }.
Example:
ρ3 (< a, b, c >) =
ρ4 (< x, y, z, c >) = c
Now that n-tuple membership
is un-ambiguously defined
and since sets are mathematical objects
(i.e. machine-independent),
tuple equality can now be
determined across platforms.
Informationally Equivalent
Being able to determine whether two records on separate platforms
are equal is nice, but rather useless.
For any given n-tuple there are n! informationally equivalent
representations.
Though establishing a tuple representation operation
for equality was a good start, it accomplishes very little.
Needed was a tuple operation that validated
information equality independent of representation
ordering. Referred to in set theory as elements of the same
equivalence class,
a collection of all possible permutations of a finite
sequence of items.
This implies that for every element of the equivalence class there is
a mapping to any other element of the equivalence class.
Simply put, given any two tuples, if there exist
unique mappings that take each to the other,
the tuples are informationally equivalent.
Tuple Re-mapping
Similar in construction to the operation tuple value
is an operation for tuple re-mapping.
The concept is quite simple
"pluck an item from one tuple and plop it into another".
In order to obfuscate this simplicity,
a formal definition is provided.
Remap:
ρσ (A)
=
{ xs: (∃ i )( x ∈i A
&
i ∈s σ) }.
Example:
ρ< 3, 2, 4, 1 > (< a, b, c, d >)
= < c, b, d, a >
Set Map:
ρ(A,B)
=
{ si:
(∃ x )
( x ∈i A
&
x ∈s B ) }.
Example:
ρ
( < a, b, c, d >,
< c, b, d, a > )
= < 3, 2, 4, 1>.
Information Access vs. Record Retrieval
(to be continued)
References
-
[Skolem]
Two Remarks on Set Theory,
Skolem, T.: 2. The ordered n-tuples as sets, MATH. SCAND, 5 1957,
"It is still a problem how the ordered n-tuple can be defined in the most suitable way."
-
[Suppes]
Axiomatic Set Theory,
Suppes, P.:Original publication Van Nostrand, 1960,
"All relations are binary."[p. 57]
-
[TR-3]
CONCOMP Project Technical Report 3 - 1968
-
[IFIP]
Feasibility of a Set-theoretic Data Structure:
A Reconstituted Definition of Relation, IFIPS - 1968
-
[AFIP]
Description of a Set-theoretic Data Structure:
AFIPS - 1968
-
[ARPA]
CONCOMP: Research in Conversational use of Computers: final report, December 1970
-
[STDS-IMS]
UofM Information Management System,
STDS supported Information Management System. 1971-1998
-
[Berztiss]
DATA STRUCTURES: Theory and Practice, Berztiss, A.T., Academic Press, New York, 1971,
"We have to develop a completely new theory of ordered objects."[p. 27]
-
[SETP75]
,
Set Processing in a Network Environment Hardgrave, W. T.: 1975.
-
[SETP76]
A technique for implementing a set processor,
Hardgrave, W. T. 1976
-
[Cratchit]
History of Computing at the University of Michigan, 1976
-
[1976]
History of Computing at the University of Michigan, 1976
-
[1977a]
Set Theoretic Data Structures (STDS), Lawrence Livermore Lab. )
Technical Report, 1977.
-
[LLL-1]
Set Theoretic Data Structures (STDS) Tutorial, LLL 1977.
-
[LLL-2]
Comparison of the extended set theory and relational approaches to data base management, LLL, 1977.
-
[VLDB77]
1977 VLDB Extended Set Theory:
A General Model For Very Large, Distributed, Backend Information Systems.
-
[VLDB84]
1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize
Future Systems.
-
[NATO]
A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986
-
[XSP-XML]
XSP: An Integration Technology for Systems Development and Evolution, Champion, M. - 2001
-
[IBI]
Information Access Accelerator
iXSP Performance evaluation,
Information Builders Inc., Stout, R. - 2005
-
[MSLab]
Managing Data Mathematically:
Data As A Mathematical Object:
♦
"Using Extended Set Theory for High Performance Database Management"
Presentation given at Microsoft Research Labs. with an introduction by Phil Bernstein.
(video: duration 1:10:52) - 2006
-
[North]
Sets, Data Models and Data Independence
Ken North a Dr. Dobbs''s Blogger, March 10, 2010
-
[2010]
Sets, Data Models and Data Independence
by Ken North a Dr. Dobbs''s Blogger, March 10, 2010
-
[SETS?]
Why Not Sets? All computer processing is set-theoretic
in nature. - 2013
-
[T&P]
XSP TECHNOLOGY: Theory & Practice Formal Modeling &
Practical Implementation of XML & RDM Systems
♦
The RDM works in spite of set theory, not because of it.
XML-Structures and operations on these structures are set-theoretically
sound under XST.
In this paper the formal foundations of RDM and XML systems are examined
in the light of XST to provide practical relevance of XSP Technology to
the field of software systems development.
-
[C&C]
Content-Container Data Access Strategies Content for Functionality - Containers for Performance - 2016
♦
Since data represented for processing is not always ideal
for preservation, and since data represented for
preservation is not always ideal for processing,
accessing application-friendly data
from storage-friendly data
poses a genuine challenge.
-
[DDRvsPDS]
Set Processing vs. Record Processing Performance:
Dynamic Data Restructuring vs. Prestructured Data Storage (1 page summary)
♦
Record processing systems and set processing systems are very different.
This paper attempts to clarify the major differences and demonstrate
the performance advantages of set processing.
-
[XSP]
XSP: Extended Set Processing:
Mathematically Managing Data Representations (1 page summary)
♦
This paper presents a high level overview of why a mathematical model of
data representations is necessary, and how an extended set processing
model accomplishes the task.
---------- NOTES --------
«
Copyright © 2022
INTEGRATED INFORMATION SYSTEMS
« Last modified on 07/25/2022 »
|