Tuesday, November 17, 2009

Temporal Data: Considerations, Conventions and Cheats

I’m not an expert of temporal database, its concepts and principles. The techniques I discuss here are the ones I employed to an application during my early days of programming. For exhaustive information on this subject, you may consult the works of Richard T. Snodgrass or C.J. Date. It’s so easy, just Google them. ;)
Temporal Granularity
The first thing to establish in a temporal database is the shortest length of time by which a change in the data is considered significant. This is commonly referred to as the "temporal granularity”.  In my application, I assumed it to be by “day” but SQL Server didn’t have any DATE-only data type until version 10 so this limitation was taken into consideration during the development. A function was created to remove the time portion of a DATETIME.
image
Figure 1 – A function to remove the time portion of a DATETIME
Index
If you isolate the columns to be tracked in another table, the new table should have a unique clustered index on the combination of the foreign key and the start date time. An example of this is the operation history table which tracks changes of the rate of an operation. In that table, the foreign key and start date combination is part of a unique clustered index to ensure that there is always one valid rate per given date. It also wouldn’t hurt if a check is employed to make sure that the end date is always after the start date.
image
Figure 2 – Table containing temporal data should have unique clustered index on the FK and start date/time
Querying Strategy
Querying a version of data in a temporal table always involves a date parameter indicating the version to retrieve. This date should fall between the start and end dates of at most one row for every id. The id here refers to the foreign key pointing to the non-temporal parent table. To make this clear, consider the following transactions:
  1. Operation “Button Attachment” with rate of 0.35 is inserted on 10/28/2009
  2. Rate for operation “Button Attachment” is changed to 0.37 on 11/01/2009
These insert 3 rows: 1 for the operation table and 2 to the operation history. The former is not pertinent to the discussion because it does not contain temporal data. In the latter, the rows inserted are highlighted below:
image
Figure 3 – Sample rows showing the current and previous version of a temporal data
If the user asks for the rate on 10/30/2009, obviously she would retrieve 0.35. The query is straightforward as long as the date fall between the transaction dates. What would she get then if she asks for the rate on 11/01/2009? Again, this is where business should come to the rescue. The business should establish the convention employed during the closing and opening of two contiguous periods, which happens when the data is updated. The convention can be determined by simply asking two questions:
Given an update on date n
  • What should be the end date of the previous period?
  • What should be the start date of the current period?
The answers determine the insert and update strategy to employ. If for the first question, Business says n-1, then it also implies that the answer for the second question is n. It can also be an n for the first and n+1 for the second. Important thing is that they should not overlap. After the convention is established, the BETWEEN operator can be safely used in the query as shown here:

SELECT o.name_vc, h.labor_sm FROM t_operation o
JOIN t_ophist h ON o.operation_id = h.operation_id
WHERE o.operation_id = @operation_id
AND (@versionDate BETWEEN h.start_dt AND h.end_dt)
Figure 4 – The most correct way of querying temporal data
In Figure 3 however, one can see that in my application, the end of the previous and the start of the current periods are the same. This is because back the, I didn’t know about BETWEEN operator. To achieve the same effect, I had to use inequalities operator and lines like these were all over the place:

start_dt <= @versionDate AND @versionDate < end_dt
Figure 5 – A condition used in the application in lieu of BETWEEN

Thru that filter statement, I effectively established the convention of n on end of current and n+1 on the start of the new version
Data Modification Logic
One interesting aspect of temporal data is that it changes the conventional semantics of data modification.  Update can be any of these 3 cases: update plus insert, update only, delete and update. A delete in the other hand is merely an update on one column. Finally, an insert may only occur in the temporal table and not in the parent (non-temporal) table. Now it should be clear why dealing with temporal data is definitely not a walk in the park.

When a temporal data in my application is updated, the start date of the current version is compared to the date of the transaction. The first case in update happens if the transaction date is not the same as the start date of the current version. The current version is closed or invalidated and a new version is inserted. Invalidating a version is just providing a valid value to the end date depending on the convention used. The “Button Attachment” transactions is an example of this case. The update is later than the start date of the current version therefore a new record is inserted and the current version is invalidated with the date of the transaction.

The second case happens when the transaction occurs on the same day as the start date of the current version. A delete and insert in this case is an overwork because all that is needed is an update on the rate (labor_sm). If we use the “Button Attachment” as an example, this is the case when the rate is raised again to 0.38 on 11/1/2009.

The third case is a special case of the second. This happens when the new value is the same as the value of the previous version. The previous version is the one preceding the current. In the “Button Attachment” example, this happens when the rate is changed back to 0.35 on 11/01/2009. This is like not updating the 10/28/2009-11/01/2009 version at all. So the course of action is to delete the current and open the 10/28/2009-11/01/2009 version.

image
Figure 6 – An stored procedure illustrating the different cases involved in updating a temporal data
As mentioned before, there is no actual delete in a temporal database. Delete is achieved by simply closing the current version on the temporal table. In the example, if “Button Attachment” is deleted on 11/01/2009, the only action taken is closing the current version. The record “Button Attachment” in the operation table is left intact. This means that during insert, it’s possible that the data is already in the operation table. In such a case, an insert to the operation history for the new valid version of the said data does the job.
Cheats
Later during the development of the application, it became clear that most of the queries in the frontend involved only the current data. I was fully aware of the performance implication innate to temporal queries like that in Figure 4 so I decided to employ a cheat. Instead of BETWEEN, I used = operator to get the current version since a convention is already established that end date for current should always be 12/30/5000. This also means that end date was no longer null-able and was included in the unique clustered index together with the FK and start date.

Another cheat that I employed later in my other projects is what I call the “reverse relationship”. It still uses = but instead of end date, the parent table contains a foreign key of the temporal child table. This also necessitates having a surrogate primary key in the temporal table. Usually, like in Figure 2, the temporal table can use the combination of the FK and the start date as a composite PK but in the case of reverse relationship, this composite key becomes a bulky FK in the parent table. A lean and faster approach is to have a surrogate key as FK like the one illustrated in Figure 7. Of course, it doesn’t have to be a constraint; a column pointer would suffice but careful check should be done to avoid pointing to a non-existent temporal data. The “fast” factor comes into play in there is a very high volume of data in the temporal table. A slight performance benefit can be squeezed out from querying integer instead of a date.
Reverse relationship
Figure 7 – A sample parent-temporal child table with “reverse relationship” to speed up temporal query
I’ve also seen temporal table that has flag column to indicate if the row is current. Like the reverse relationship, any technique that involves additional column to convey a semantics also open the database to potential synchronization problems. For example, it’s possible that a flaw in the update logic might insert the wrong FK in the parent table or flag the wrong row as current.

I’m sure there are still more “cheats” or techniques out there. A modeler can employ even the most unconventional but it shouldn’t matter as long it’s justified. These kind of practices deviates from the standard so proper documentation of these “deviation” is also a must.

What’s Next

Next post will be about the preliminary efforts I had when translating the VB6 version of the application to .NET version. I’ll discuss the overall architecture and motivation of the approach I chose.

No comments:

Post a Comment