This page roughly describes the algorithm that the PyWrapper uses to create XML responses for a requested view.
Introduction and Term Definitions
The ViewObject consists of nested node objects that represent the space of all valid XML instance documents based on the view response structure defined in an XML schema. A view stands somewhere between the schema and an XML instance document because it has already resolved all global element and type references into local nodes. A view node represents either an instance element or attribute, or an "abstract" schema sequence or choice.
A repeatable node, that is an element, sequence or choice with maxOccurs > 1, is called a repnode for short. A repnode is not necessarily repeated in the instance document, but once the underlying database structure that is mapped to a view has been analyzed, the algorithm knows where nodes can potentially be repeated in the output. Those nodes are called active repnodes. Every active repnode has a list of locked tables associated with it. The locked tables are used to loop through a recordset and generate the repeating elements. They are called locked tables because, on the path from a repnode up to the root node of the view, every table may be used only once to generate a loop and is therefore locked for all subnodes that follow.
Two kinds of mappings are involved in this view algorithm. The database mapping maps real database fields (table + attribute + data type) to abstract concepts defined in a conceptual schema such as ABCD or DarwinCore. The view mapping maps a view concept from the response structure to a standardized conceptual schema concept. In both kinds of mappings, several concepts can be mapped to one other concept.
The datasource contains a database structure graph configuration, db graph for short, in which all required database tables are represented including their primary and foreign keys. The foreign key relations are used to generate a graph that represents the database table structure. This graph must not contain any cycles, because at some point it has to be transformed into a rooted tree from which dynamic SQL is generated. For graphs with cycles this transformation is ambiguous.
Every view needs at least one index element. This element defines what is considered a "record" in the XML structure, which is needed for paging the results. To identify a certain page in the database, the PyWrapper needs to know the primary keys of the tables that are relevant for a page definition. The algorithm therefore has to transform the indexing element, which is defined as a concept in the view, into local database fields. This is done by identifying a set of index tables.
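The requirement that the db graph be free of cycles can be illustrated with a minimal Python sketch. It is not the PyWrapper implementation: the function name, the `(table_a, table_b)` relation format and the foreign key links between the example tables are assumptions made for illustration. The sketch walks the undirected graph breadth first from a chosen root table, records the parent of every table, and gives up as soon as a table can be reached by two different routes.

```python
from collections import deque


def build_rooted_tree(relations, root):
    """Return a {table: parent_table} map rooted at `root`.

    `relations` is a list of (table_a, table_b) foreign key links
    (hypothetical format). Raises ValueError if a cycle is reachable
    from the root table.
    """
    neighbours = {}
    for a, b in relations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        table = queue.popleft()
        for other in sorted(neighbours.get(table, ())):
            if other == parent[table]:
                continue  # this is the edge we arrived by
            if other in parent:
                # reached a second time: the graph is circular
                raise ValueError(f"circular relation between {table!r} and {other!r}")
            parent[other] = table
            queue.append(other)
    return parent


# Assumed foreign key links between the tables used later on this page.
relations = [
    ("metadata", "specimen"),
    ("specimen", "name"),
    ("specimen", "collecting_event"),
]
tree = build_rooted_tree(relations, "specimen")
```

With these assumed relations every table ends up as a direct child of `specimen`. Adding a link such as `("name", "collecting_event")` would make the structure circular, and the function would raise an error instead of picking one of the possible trees.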
Overview
To generate a response from a view object parsed from a request, the following steps are taken:
- a view object is created on the fly via SAX from the view request, which follows the TAPIR protocol. To speed things up, the view may instead be loaded from a cached, serialized form of the view object if this view has been parsed before.
- if a partial view is requested, the loaded view is reduced to a partial view.
- the view is "prepared" for the XML generation in the following steps:
  - the datasource preferences, including the database-to-concept mappings, are incorporated into the view mappings, so that every previously mapped view node is now mapped directly to a list of database fields.
  - the view nodes are analyzed, and the active repnodes and their lists of locked tables, including their primary keys, are determined.
  - the view's indexing element(s) are analyzed, the list of index tables is calculated, and a single root table is selected. The root table is needed to transform the cycle-free database structure graph (taken from the datasource configuration) into a simple tree with a single root node (= table).
- a full recordset containing all data as lists (rows) of lists (columns) is passed to the view to generate an XML representation of this data.
Partial View Generation
To create a partial view, first the desired nodes are flagged to be kept. Then all other nodes are removed. The following nodes are flagged to remain in the partial view:
- The requested nodes, all their descendants including attributes, and their parent nodes up to the root node
- All index elements and their parent nodes up to the root node
- All mandatory (minOccurs > 0) nodes attached to any of the already flagged nodes.
The view is then reanalyzed (see below) to recalculate repnodes, etc.
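The flag-and-remove procedure can be sketched as follows. This is an illustration under assumptions, not the PyWrapper code: the `Node` class, addressing requested nodes by name, treating attributes as child nodes, and applying the mandatory rule transitively are all choices made for the sketch. `Specimen` is assumed to be the index element of the example view.

```python
class Node:
    """Hypothetical view node; attributes are modelled as children named '@...'."""

    def __init__(self, name, min_occurs=0, children=(), is_index=False):
        self.name = name
        self.min_occurs = min_occurs
        self.is_index = is_index
        self.children = list(children)
        self.parent = None
        self.keep = False
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield the node and all its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def flag_with_ancestors(node):
    """Flag the node and every parent node up to the root."""
    while node is not None:
        node.keep = True
        node = node.parent


def make_partial(root, requested_names):
    for node in list(walk(root)):
        if node.name in requested_names:
            for sub in walk(node):  # the requested node and all descendants
                sub.keep = True
            flag_with_ancestors(node)
        if node.is_index:
            flag_with_ancestors(node)
    # Pre-order guarantees a parent is handled before its children, so one
    # pass is enough to flag mandatory nodes below already flagged nodes.
    for node in list(walk(root)):
        if node.keep:
            for child in node.children:
                if child.min_occurs > 0:
                    child.keep = True
    # Remove everything that was not flagged.
    for node in list(walk(root)):
        node.children = [c for c in node.children if c.keep]
    return root


view = Node("DataSet", 0, [
    Node("CollectionName", 1),
    Node("Specimen", 0, is_index=True, children=[
        Node("SpecimenID", 1),
        Node("ScientificName", 1),
        Node("HigherTaxa", 0, [Node("HigherTaxon", 0, [Node("@Rank", 1)])]),
        Node("Coordinates", 0, [Node("@Lat", 1), Node("@Long", 1)]),
    ]),
])
partial = make_partial(view, {"Coordinates"})
kept = [node.name for node in walk(partial)]
```

Requesting only `Coordinates` keeps the mandatory `CollectionName`, `SpecimenID` and `ScientificName` nodes, while the optional `HigherTaxa` branch is removed.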
Repnode Analysis
The analysis of the view nodes is based on nothing more than each node's mappings. The view node tree is traversed (breadth first or depth first, the order is not important) and for every node the unique list of tables used in the node's mappings is retrieved. Tables that have not already been locked by a repnode higher up in the tree are then "locked" by adding them to the list of locked tables of the closest parent repnode. If the node itself is a repnode, they are added to the node's own list of locked tables.
Example view with node names, their min/max cardinality and mappings (table.field or literal values "XXX"):
DataSet 0/∞
  CollectionName 1/1 ---> metadata.collection_name
  Specimen 0/∞
    SpecimenID 1/1 ---> specimen.catalogue_number
    ScientificName 1/1 ---> name.fullname
    HigherTaxa 0/∞
      HigherTaxon 0/1 ---> name.family
        @Rank 1/1 ---> "family"
    Coordinates 0/1
      @Lat 1/1 ---> collecting_event.latitude
      @Long 1/1 ---> collecting_event.longitude
This view will create 2 active repnodes like this:
DataSet ∞ [metadata]
  CollectionName
  Specimen ∞ [specimen,name,collecting_event]
    SpecimenID
    ScientificName
    HigherTaxa
      HigherTaxon
        @Rank
    Coordinates
      @Lat
      @Long
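The locking rule can be reproduced for this example with a short Python sketch. The `ViewNode` class and the function names are hypothetical and attributes are modelled as child nodes. The sketch uses a pre-order depth-first traversal and treats a repnode as active when its list of locked tables is not empty, which is an interpretation of the description above rather than the actual implementation.

```python
INF = float("inf")


class ViewNode:
    """Hypothetical view node for illustrating the repnode analysis."""

    def __init__(self, name, max_occurs=1, tables=(), children=()):
        self.name = name
        self.is_repnode = max_occurs > 1
        self.tables = list(tables)  # tables used by this node's mappings
        self.children = list(children)
        self.locked = []  # locked tables, only filled for repnodes


def analyse(node, rep_path=()):
    """Lock tables; rep_path holds the repnodes from the root down to this node."""
    if node.is_repnode:
        rep_path = rep_path + (node,)
    already_locked = {t for rep in rep_path for t in rep.locked}
    for table in node.tables:
        if rep_path and table not in already_locked:
            rep_path[-1].locked.append(table)  # closest repnode, possibly the node itself
            already_locked.add(table)
    for child in node.children:
        analyse(child, rep_path)


def active_repnodes(node):
    found = [node] if node.is_repnode and node.locked else []
    for child in node.children:
        found.extend(active_repnodes(child))
    return found


view = ViewNode("DataSet", INF, children=[
    ViewNode("CollectionName", tables=["metadata"]),
    ViewNode("Specimen", INF, children=[
        ViewNode("SpecimenID", tables=["specimen"]),
        ViewNode("ScientificName", tables=["name"]),
        ViewNode("HigherTaxa", INF, children=[
            ViewNode("HigherTaxon", tables=["name"], children=[ViewNode("@Rank")]),
        ]),
        ViewNode("Coordinates", children=[
            ViewNode("@Lat", tables=["collecting_event"]),
            ViewNode("@Long", tables=["collecting_event"]),
        ]),
    ]),
])
analyse(view)
locked_by_repnode = {node.name: node.locked for node in active_repnodes(view)}
```

`HigherTaxa` is a repnode but does not become active here: its only table, `name`, has already been locked by `Specimen` through the `ScientificName` mapping.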
So in this example the Specimen element is used to loop over 3 different tables independently. When looping over an "outer join" of these tables, all combinations of values are therefore created in the output.
Let's say there is 1 specimen record that has 2 name records and 1 collecting_event record. The Specimen loop will then be executed twice, with a different name but the same data for all other mappings (in this example the higher taxa and coordinates):
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Bellis primulifolia Lam.</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="13.2" Long="31.9" />
</Specimen>
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Erigeron primulifolius (Lam.) Greuter</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="13.2" Long="31.9" />
</Specimen>
What has happened is that the higher cardinality of the name table mapping has migrated into the Specimen element. If several tables contain repeated records, the Specimen element can easily be repeated many more times. Consider 1 specimen record with 2 names and 3 collecting_events (quite strange though):
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Bellis primulifolia Lam.</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="13.2" Long="31.9" />
</Specimen>
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Erigeron primulifolius (Lam.) Greuter</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="13.2" Long="31.9" />
</Specimen>
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Bellis primulifolia Lam.</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="21.4" Long="30.7" />
</Specimen>
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Erigeron primulifolius (Lam.) Greuter</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="21.4" Long="30.7" />
</Specimen>
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Bellis primulifolia Lam.</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="7.8" Long="19.9" />
</Specimen>
<Specimen>
  <SpecimenID>13754</SpecimenID>
  <ScientificName>Erigeron primulifolius (Lam.) Greuter</ScientificName>
  <HigherTaxa>
    <HigherTaxon Rank="Family">Asteraceae</HigherTaxon>
  </HigherTaxa>
  <Coordinates Lat="7.8" Long="19.9" />
</Specimen>
We will get 6 specimen elements now! This might not be desired and leads to the IndexingElementExplosion problem.
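The multiplication behind this number can be made explicit with a few lines of Python. The sketch assumes that looping over the locked tables behaves like a plain cross product of the related rows, which stands in here for the "outer join" mentioned above; the variable names are made up for the example.

```python
from itertools import product

# Rows related to the single specimen record, taken from the example above.
specimen_ids = ["13754"]
events = [("13.2", "31.9"), ("21.4", "30.7"), ("7.8", "19.9")]
names = ["Bellis primulifolia Lam.", "Erigeron primulifolius (Lam.) Greuter"]

# One Specimen element is produced per combination: 1 * 3 * 2.
combinations = list(product(specimen_ids, events, names))
for specimen_id, (lat, long_), name in combinations:
    print(specimen_id, name, lat, long_)
```

Every additional locked table with more than one related row multiplies the number of Specimen elements again, which is why the count grows so quickly.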
Root Table Detection
To determine the root table for a view, the index element is selected. This node must be a repnode; otherwise the view is considered invalid and an error is raised!
If the index element node is an active repnode, any of its locked tables can be used as the root table (the implementation uses the first one).
If it is a non-active repnode, the closest active parent repnode is used to retrieve the first locked table.
If no root table can be found, an error is raised.
Index Table Determination
The procedure to find the index tables is nearly the same as for the root table, but instead of selecting just one table, all locked tables of the closest active repnode (the indexing node itself or one of its parents) are used.
Example of a denormalized database with a single specimen table that holds data about the taxon name, the place where the specimen was collected and the name of its current collection:
dataset
collection 0/∞ [locked: specimen]
taxonname 0/∞ --> indexing element
country 0/∞
So here the specimen table is used as the index table for this view, although the taxonname element is considered to define a new page. Because every specimen record also produces a new collection element (although several of them might have the same values, because the database is denormalized), the specimen table proves useful as an index table.
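The lookups described under Root Table Detection and Index Table Determination can be sketched together in Python. The `ViewNode` and `ViewError` classes and the function names are hypothetical, and the example assumes that taxonname is nested below collection, which the flattened listing above does not show explicitly.

```python
class ViewError(Exception):
    """Hypothetical error type for invalid views."""


class ViewNode:
    def __init__(self, name, is_repnode=False, locked=(), parent=None):
        self.name = name
        self.is_repnode = is_repnode
        self.locked = list(locked)  # non-empty only for active repnodes
        self.parent = parent


def closest_active_repnode(index_node):
    """Return the index node itself or its closest active parent repnode."""
    if not index_node.is_repnode:
        raise ViewError("the index element must be a repnode")
    node = index_node
    while node is not None:
        if node.is_repnode and node.locked:
            return node
        node = node.parent
    raise ViewError("no active repnode found for the index element")


def index_tables(index_node):
    """All locked tables of the closest active repnode."""
    return list(closest_active_repnode(index_node).locked)


def root_table(index_node):
    """The first locked table of the closest active repnode."""
    return index_tables(index_node)[0]


dataset = ViewNode("dataset")
collection = ViewNode("collection", is_repnode=True, locked=["specimen"], parent=dataset)
taxonname = ViewNode("taxonname", is_repnode=True, parent=collection)

tables = index_tables(taxonname)
root = root_table(taxonname)
```

In this example taxonname has no locked tables of its own, so both lookups fall back to collection and yield the specimen table.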
XML Generation
Synchronizing attributes
Often attributes correlate strongly to their parent element. ...TBD
XML Element Creation
Note: in general, only XML elements are created that have at least one of the following:
- a mapping that produces a value that is neither NULL nor only whitespace, to be used as the element's content
- at least one XML attribute containing mapped data
- some child element that meets any of the above requirements
All other nodes of the view structure will not show up in the final XML output. For example, an element whose mapping resolves to just a string of whitespace will not show up in the final output. All of its parent elements are "dropped" as well if they contain no other elements with data.
If the view structure does not specify an XML data type for an element, no data will show up in the element's content, but the empty element stays in the output as a valid element.
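The three creation rules can be illustrated by pruning an already built tree with Python's standard `xml.etree.ElementTree` module. This is only a sketch of the rules, not the way the PyWrapper generates its output: it treats every non-whitespace attribute value as mapped data, and it does not model the special case of elements without an XML data type.

```python
import xml.etree.ElementTree as ET


def prune(elem):
    """Remove descendants without data in place; return True if elem has data."""
    for child in list(elem):
        if not prune(child):
            elem.remove(child)
    has_text = bool(elem.text and elem.text.strip())
    has_attribute = any(value.strip() for value in elem.attrib.values())
    has_child = len(elem) > 0  # only children with data are left at this point
    return has_text or has_attribute or has_child


doc = ET.fromstring(
    "<Specimen>"
    "<SpecimenID>13754</SpecimenID>"
    "<ScientificName>   </ScientificName>"
    "<HigherTaxa><HigherTaxon> </HigherTaxon></HigherTaxa>"
    '<Coordinates Lat="13.2" Long="31.9" />'
    "</Specimen>"
)
prune(doc)
remaining = [child.tag for child in doc]
```

After pruning, only `SpecimenID` and `Coordinates` remain below `Specimen`. `HigherTaxa` disappears together with its whitespace-only child, which corresponds to the "dropped" parent elements described above.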
Repnode Behaviour Example
For a single record with data within a repnode it is possible that more than one instance of the repnode is being created. Assume the following simple structure using a sequence (with * representing the repnode):
A*
  Seq
    B
    C
If B returns 3 instances of itself (because of multiple mappings or other repnodes below) but C only 1, 3 instances of A are returned because C is cloned for every instance of B to create a sequence. This is possible because all B and C nodes stem from the same record and therefore really belong together:
A
  B1
  C1
A
  B2
  C1
A
  B3
  C1
If there were a choice instead of the sequence, B and C would not be correlated, and 4 instances of A would be returned:
A*
  Choice
    B
    C
-->
A
  B1
A
  B2
A
  B3
A
  C1
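The difference between the two cases can be sketched in Python. The function names are hypothetical. For a sequence the sketch combines the child instances with a cross product, which reproduces the cloning of C described above; that this is also what happens when several children have more than one instance is an assumption.

```python
from itertools import product


def expand_sequence(*child_instances):
    """Sequence children are correlated: one parent instance per combination."""
    return [list(combo) for combo in product(*child_instances)]


def expand_choice(*child_instances):
    """Choice children are alternatives: one parent instance per child instance."""
    return [[instance] for instances in child_instances for instance in instances]


b = ["B1", "B2", "B3"]
c = ["C1"]

as_sequence = expand_sequence(b, c)  # 3 instances of A, each with a B and the cloned C1
as_choice = expand_choice(b, c)      # 4 instances of A, each with a single child
```

Each inner list stands for the children of one instance of A.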
Empty Repnodes
There is the notion of an EmptyRepeatableRegionProblem. The PyWrapper currently ignores the problem: it treats the lowest repnode as the correct one and issues a warning.
