Tuesday, November 17, 2009

Temporal Data: Considerations, Conventions and Cheats

I’m not an expert of temporal database, its concepts and principles. The techniques I discuss here are the ones I employed to an application during my early days of programming. For exhaustive information on this subject, you may consult the works of Richard T. Snodgrass or C.J. Date. It’s so easy, just Google them. ;)
Temporal Granularity
The first thing to establish in a temporal database is the shortest length of time by which a change in the data is considered significant. This is commonly referred to as the "temporal granularity”.  In my application, I assumed it to be by “day” but SQL Server didn’t have any DATE-only data type until version 10 so this limitation was taken into consideration during the development. A function was created to remove the time portion of a DATETIME.
image
Figure 1 – A function to remove the time portion of a DATETIME
Index
If you isolate the columns to be tracked in another table, the new table should have a unique clustered index on the combination of the foreign key and the start date time. An example of this is the operation history table which tracks changes of the rate of an operation. In that table, the foreign key and start date combination is part of a unique clustered index to ensure that there is always one valid rate per given date. It also wouldn’t hurt if a check is employed to make sure that the end date is always after the start date.
image
Figure 2 – Table containing temporal data should have unique clustered index on the FK and start date/time
Querying Strategy
Querying a version of data in a temporal table always involves a date parameter indicating the version to retrieve. This date should fall between the start and end dates of at most one row for every id. The id here refers to the foreign key pointing to the non-temporal parent table. To make this clear, consider the following transactions:
  1. Operation “Button Attachment” with rate of 0.35 is inserted on 10/28/2009
  2. Rate for operation “Button Attachment” is changed to 0.37 on 11/01/2009
These insert 3 rows: 1 for the operation table and 2 to the operation history. The former is not pertinent to the discussion because it does not contain temporal data. In the latter, the rows inserted are highlighted below:
image
Figure 3 – Sample rows showing the current and previous version of a temporal data
If the user asks for the rate on 10/30/2009, obviously she would retrieve 0.35. The query is straightforward as long as the date fall between the transaction dates. What would she get then if she asks for the rate on 11/01/2009? Again, this is where business should come to the rescue. The business should establish the convention employed during the closing and opening of two contiguous periods, which happens when the data is updated. The convention can be determined by simply asking two questions:
Given an update on date n
  • What should be the end date of the previous period?
  • What should be the start date of the current period?
The answers determine the insert and update strategy to employ. If for the first question, Business says n-1, then it also implies that the answer for the second question is n. It can also be an n for the first and n+1 for the second. Important thing is that they should not overlap. After the convention is established, the BETWEEN operator can be safely used in the query as shown here:

SELECT o.name_vc, h.labor_sm FROM t_operation o
JOIN t_ophist h ON o.operation_id = h.operation_id
WHERE o.operation_id = @operation_id
AND (@versionDate BETWEEN h.start_dt AND h.end_dt)
Figure 4 – The most correct way of querying temporal data
In Figure 3 however, one can see that in my application, the end of the previous and the start of the current periods are the same. This is because back the, I didn’t know about BETWEEN operator. To achieve the same effect, I had to use inequalities operator and lines like these were all over the place:

start_dt <= @versionDate AND @versionDate < end_dt
Figure 5 – A condition used in the application in lieu of BETWEEN

Thru that filter statement, I effectively established the convention of n on end of current and n+1 on the start of the new version
Data Modification Logic
One interesting aspect of temporal data is that it changes the conventional semantics of data modification.  Update can be any of these 3 cases: update plus insert, update only, delete and update. A delete in the other hand is merely an update on one column. Finally, an insert may only occur in the temporal table and not in the parent (non-temporal) table. Now it should be clear why dealing with temporal data is definitely not a walk in the park.

When a temporal data in my application is updated, the start date of the current version is compared to the date of the transaction. The first case in update happens if the transaction date is not the same as the start date of the current version. The current version is closed or invalidated and a new version is inserted. Invalidating a version is just providing a valid value to the end date depending on the convention used. The “Button Attachment” transactions is an example of this case. The update is later than the start date of the current version therefore a new record is inserted and the current version is invalidated with the date of the transaction.

The second case happens when the transaction occurs on the same day as the start date of the current version. A delete and insert in this case is an overwork because all that is needed is an update on the rate (labor_sm). If we use the “Button Attachment” as an example, this is the case when the rate is raised again to 0.38 on 11/1/2009.

The third case is a special case of the second. This happens when the new value is the same as the value of the previous version. The previous version is the one preceding the current. In the “Button Attachment” example, this happens when the rate is changed back to 0.35 on 11/01/2009. This is like not updating the 10/28/2009-11/01/2009 version at all. So the course of action is to delete the current and open the 10/28/2009-11/01/2009 version.

image
Figure 6 – An stored procedure illustrating the different cases involved in updating a temporal data
As mentioned before, there is no actual delete in a temporal database. Delete is achieved by simply closing the current version on the temporal table. In the example, if “Button Attachment” is deleted on 11/01/2009, the only action taken is closing the current version. The record “Button Attachment” in the operation table is left intact. This means that during insert, it’s possible that the data is already in the operation table. In such a case, an insert to the operation history for the new valid version of the said data does the job.
Cheats
Later during the development of the application, it became clear that most of the queries in the frontend involved only the current data. I was fully aware of the performance implication innate to temporal queries like that in Figure 4 so I decided to employ a cheat. Instead of BETWEEN, I used = operator to get the current version since a convention is already established that end date for current should always be 12/30/5000. This also means that end date was no longer null-able and was included in the unique clustered index together with the FK and start date.

Another cheat that I employed later in my other projects is what I call the “reverse relationship”. It still uses = but instead of end date, the parent table contains a foreign key of the temporal child table. This also necessitates having a surrogate primary key in the temporal table. Usually, like in Figure 2, the temporal table can use the combination of the FK and the start date as a composite PK but in the case of reverse relationship, this composite key becomes a bulky FK in the parent table. A lean and faster approach is to have a surrogate key as FK like the one illustrated in Figure 7. Of course, it doesn’t have to be a constraint; a column pointer would suffice but careful check should be done to avoid pointing to a non-existent temporal data. The “fast” factor comes into play in there is a very high volume of data in the temporal table. A slight performance benefit can be squeezed out from querying integer instead of a date.
Reverse relationship
Figure 7 – A sample parent-temporal child table with “reverse relationship” to speed up temporal query
I’ve also seen temporal table that has flag column to indicate if the row is current. Like the reverse relationship, any technique that involves additional column to convey a semantics also open the database to potential synchronization problems. For example, it’s possible that a flaw in the update logic might insert the wrong FK in the parent table or flag the wrong row as current.

I’m sure there are still more “cheats” or techniques out there. A modeler can employ even the most unconventional but it shouldn’t matter as long it’s justified. These kind of practices deviates from the standard so proper documentation of these “deviation” is also a must.

What’s Next

Next post will be about the preliminary efforts I had when translating the VB6 version of the application to .NET version. I’ll discuss the overall architecture and motivation of the approach I chose.

Sunday, November 1, 2009

Temporal Data

In the first installment of this series, I mentioned that I unnecessarily complicated the ApparelFactory app by introducing temporal data, thus delving myself into the world of temporal database. A temporal database differs from an ordinary database in that it stores different version of data based on their validity throughout time. Other people simply refers to these versions as “history” of the information. A piece of data is said to be “valid” on a certain time if it is consider to be true at that time.  
“Querying a particular version of a temporal data is not straightforward because you need to consider the time component.”
It’s easy to spot a database which contains temporal data. You usually see two time columns in a table indicating the start and the end during which the data is considered valid. The granularity dictates the data type used. For example, if you don’t consider changes within a day to be relevant, then you can use DATETIME ignoring the time portion. SQL Server 2008 has a DATE data type well-suited for this need. If changes within the year is not relevant, you can use INT for both the start and end years. The granularity is usually dictated by the business but in the case of the ApparelFactory, I simply assumed that changes within the day is not relevant, thus I have a tables like the one below where start_dt and end_dt are the start date and end date respectively.
image
Figure 1 – A table for storing temporal data
Most of the time, users are only concerned with the current valid version of the data. Past versions are of interest only for analysis and evaluation of trends, say the price fluctuation of certain commodity. They may be viewed only through reports and some other data warehousing tools. Querying a particular version of a temporal data is not straightforward because you need to consider the time component. For example, in the t_allowance table, if I were to write a flexible query to retrieve any version an employee’s allowance by date, the WHERE statement would have looked like this
WHERE employee_id = @employee_id AND (@referenceDate BETWEEN start_dt AND end_dt)
Figure 2 – Flexible WHERE statement for a temporal data
A query involving BETWEEN operator is not the most efficient so coders usually cheat when writing one for retrieving the current version. They use = instead, but this presupposes that there is already an established convention in determining the current version. The most common is to make the end date nullable and any row with a null value is considered current. In the application, I used the another variation which sets the end date to a very unrealistically high value. This has the advantage of not requiring you to convert the null end date to a real value when using BETWEEN. This is because using BETWEEN on a null date always evaluates to false.

image
Figure 3 – Sample query utilizing equality operator instead of BETWEEN for temporal data
In a true temporal database, every change in the data is recorded. This is not practical since most of the time, only a subset of the columns are worth tracking. Only the columns  which trends are worth analyzing are tracked. These are then separated in a table together with  the start and end time columns. In the apparel app, I made sure that user can always see the history of the rates of every operation as well as their usage in a style (what operation is involved in a style at a certain point). The queries would have been complicated had I not employed some “acceptable cheats” in my table design. I’ll discuss more about these so-called cheats in the future installment of the series.

image
Figure 4 – Column which need to be tracked are usually taken out and made into a separate table
Temporal data are useless if the coder doesn’t employ an intuitive way of presenting them to the user. All these data should be available to reports of course, but users also appreciate it if they could see them instantly without resorting to reports. This way, they don’t have to be inundated with so many information when all they want is, let’s say the rate history for an operation which rate the company has just decided to change. In this connection, the ability for a user to select a date as a parameter during viewing of the data is very integral when designing the UI. A coder can simply throw in a DateTimePicker or Calendar web control. If the user prefers the actual change dates, then a list is a good choice. This approach involves another query but it always ensures new results during every query. It also presents the user with the exact number of changes; something not available with the date time picker / calendar approach.

In the application, the list approach is used as seen in Figure 5. The number of current operation for the style is 3, one is added sometime on 11/01/2009. The rate of one operation also changed on the same date, thus changing the current rate of the style.

Style history 1
Figure 5 – A UI displaying historical changes on data membership

The addition of another operation and the change in rate of another one can be seen in Figure 6. Since only the rate is tracked for the operation, it makes much more sense to simply display the history right away.

Operation view
igure 6 – A UI displaying historical changes of a single column
Being able to work on temporal database is really a fulfilling experience for a coder, let alone for a novice. A coder feels that his application adds more value to the business because it ensures all the data are there and easy for the viewing. The flipside is, it might not be needed at all and it’s just another unnecessary complexity to an already delayed project. It wouldn’t hurt if the client is asked first because the last thing a coder needs is a flawless feature that no one even dares to use.

Next, I’ll discuss the adjustments, conventions and the cheats I employed.