{"id":285,"date":"2012-10-02T13:54:48","date_gmt":"2012-10-02T13:54:48","guid":{"rendered":"http:\/\/62.131.51.129\/wordpress\/?p=285"},"modified":"2012-10-02T13:54:48","modified_gmt":"2012-10-02T13:54:48","slug":"a-script-to-create-a-sample-data-warehouse","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=285","title":{"rendered":"A script to create a sample data warehouse &#8211; part 1 the dimension"},"content":{"rendered":"<p>In the last week, I have written a sample script to create a skeleton data warehouse. I will use this as a head start for future work where the skeleton will be used to implement an actual data warehouse.<\/p>\n<p>The skeleton is written for a Teradata DBMS. This is the platform that I currently use.<\/p>\n<p>The example data warehouse has two tables: a dimension table and a fact table. Hence we have a small example that can be expanded in the future.<\/p>\n<p>Let us start with the dimension table.<br \/>\nSee also the script with <a href=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2012\/10\/ExampleDW.zip\">ExampleDW<\/a><br \/>\nThe dimension table is created with:<\/p>\n<pre>drop table devADW_BAN_MI.dimTARGET;\n\ncreate table devADW_BAN_MI.dimTARGET\n(SK DECIMAL(15,0) GENERATED  BY DEFAULT AS IDENTITY\n           (START WITH 1\n            INCREMENT BY 1\n            MINVALUE 1\n           NO CYCLE\n            ),\n BK varchar(3),\n extra varchar(256),\n VERSION_SDATE timestamp(0)  ,\n version_edate  timestamp(0)   );<\/pre>\n<p>This dimension has a surrogate key (SK) that is loaded with the DBMS facility. This leads to the situation that every record is given an attribute that can be used as a primary key. This SK is the primary key. It uniquely identifies a record within a table.<br \/>\nThe table also has a business key (BK). This is something by which the business identifies an object (say a product, a contract or a customer). If we have a business key, we are able to know whether we have an earlier version of the object. If we get a new record on an object that we already have in the table, we may act accordingly.<br \/>\nIn our case, we assume that we want to act in the following way whenever a new record is found on an object that we already have in the data warehouse.<\/p>\n<ul>\n<li>The record is added to the table.<\/li>\n<li>The record that describes the old state of the object is labelled as &#8216;old version&#8217; .<\/li>\n<\/ul>\n<p>In the table we have an attribute &#8220;extra&#8221;, that acts as an example of a descriptive attribute. It can be a colour of the object, the size, the owner etc.<\/p>\n<p>We also encounter an attribute &#8220;version_sdate&#8221;\u00a0 which indicates the starting point of validity of the record. We assume that an object generates over time a series of records. Once the record is sent as the newest version of the object, it is stored in the data warehouse. We use the attribute version_sdate as the starting point in time after which the record is seen as the valid version of the object.<\/p>\n<p>\u00a0Likewise, we have a &#8220;version_edate&#8221;\u00a0 that indicates the finishing point in time. If a record is stored in the data warehouse and if a new version is included in the data warehouse, the old record is updated with version_edate equals finishing point in time.<\/p>\n<p>A source table is created with:<\/p>\n<pre>create table devADW_BAN_MI.dimSOURCE\n(BK varchar(3),\n extra varchar(256));<\/pre>\n<p>This represents the source of data. This source contains two attributes: a business key (BK) that identifies an object and an additional attribute (extra) that contains a description of the object.<\/p>\n<p>Let us then load the source:<\/p>\n<pre>insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('BKO','BKO extra');\ninsert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');\ninsert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM extra');\ninsert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('XYZ','XYZ extra');<\/pre>\n<p>Ok. we now have a source that is loaded and a dimension table that is still empty. This can be initially loaded with:<\/p>\n<pre>insert into devADW_BAN_MI.dimTARGET(BK,extra,version_sdate,version_edate)\nselect BK,extra, current_time, cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss')  \nfrom devADW_BAN_MI.dimSOURCE<\/pre>\n<p>This gives:<\/p>\n<pre>\t\n          SK  \tBK\textra\t VERSION_SDATE\t version_edate\n1\t100,001\tXYZ\tXYZ extra 10\/2\/2012 16:29:15 12\/31\/9999 00:00:00\n2\t300,001\tBKO\tBKO extra 10\/2\/2012 16:29:15 12\/31\/9999 00:00:00\n3\t1\tSRM\tSRM extra 10\/2\/2012 16:29:15 12\/31\/9999 00:00:00\n4\t200,001\tTBO\tTBO extra 10\/2\/2012 16:29:15 12\/31\/9999 00:00:00<\/pre>\n<p>Let us continue with a new dataload for the dimesnion table:<\/p>\n<pre>delete  from devADW_BAN_MI.dimSOURCE;\ninsert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM nog meer');\ninsert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('ABC','ABC extra');\ninsert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');<\/pre>\n<p>We first deleted all records in the source. This is followed\u00a0by loading three records.<\/p>\n<p>In an incremental load step, we may have three situations:<\/p>\n<ul>\n<li>the record represents an object that is not yet included in the dimensional table. Whether or not this is the case can be assessed with the business key. If a new business key is encountered, we have this situation. The algorithm is quite simple: do we have a record is the source for which the business key\u00a0is not yet included in the target dimension table. The sql reads as:<\/li>\n<\/ul>\n<pre>--NEW records in source\ninsert into devADW_BAN_MI.dimTARGET\n (BK,extra,version_sdate,version_edate)\n select \n  BK,\n  extra, \n  CURRENT_time, \n  cast ('99991231000000' as TIMESTAMP(0) \n     FORMAT 'yyyymmddhhmiss') \n from devADW_BAN_MI.dimSOURCE \n  where BK not in \n  (select BK \n   from devADW_BAN_MI.dimTARGET \n   where version_edate > CURRENT_DATE\n  );<\/pre>\n<li>\u00a0the record in the source is included exactly in the target table. In our case, we have a combination (BK, extra) that may already be present in the target table. In that case we have no reason to apply any change in the target table: the data are already present in the target. We may go on looking for records in the source that may need action.<\/li>\n<li>the record in the source table represents an update on an object that is already present in the data warehouse. In that case we have a business key (BK) that is already present in the dimensional table. We also have a situation whereby the content in additional attributes are different from what we have in the dimensional table. In this case &#8220;extra&#8221; in the source\u00a0is not equal to &#8220;extra&#8221;\u00a0 in the target table. We identify these records as:\n<pre>create table devADW_BAN_MI.temp\n(BK varchar(3),\n extra varchar(256));\ninsert into  devADW_BAN_MI.temp \nselect B.BK, A.extra\nfrom \n(select BK, extra from devADW_BAN_MI.dimSOURCE) A,\n(select BK, extra from devADW_BAN_MI.dimTARGET \n  where version_edate > CURRENT_DATE) B\nwhere A.BK=B.BK and A.extra != B.extra;<\/pre>\n<\/li>\n<\/ul>\n<p>Once the records that contain an update are identified, we must take 2 actions:<\/p>\n<ol>\n<li>Update records in the data warehouse that represent an obsolete version of the object. This update consists of setting version_edate to a date that indicates the end of a validity period.\n<pre>update devADW_BAN_MI.dimTARGET\nset version_edate = CURRENT_TIME \nwhere  \n(BK  in (select BK from devADW_BAN_MI.temp))  \nand \nversion_edate > CURRENT_DATE; <\/pre>\n<\/li>\n<li>Insert records that represent the newest update on the objects.\n<pre>insert into devADW_BAN_MI.dimTARGET\n(BK,extra,version_sdate,version_edate)\nselect BK,extra, CURRENT_time,  \ncast ('99991231000000' as TIMESTAMP(0) \n   FORMAT 'yyyymmddhhmiss')  \nfrom devADW_BAN_MI.temp\nwhere (BK  in (select BK from devADW_BAN_MI.Temp)) ;\n<\/pre>\n<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the last week, I have written a sample script to create a skeleton data warehouse. I will use this as a head start for future work where the skeleton will be used to implement an actual data warehouse. The skeleton is written for a Teradata DBMS. This is the platform that I currently use. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-285","post","type-post","status-publish","format-standard","hentry","category-data-warehousing"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/285","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=285"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/285\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=285"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}