Wednesday, June 27, 2018

Designing Hierarchies - Path Enumeration Technique

     Whenever a hierarchical structure needs to be stored in a database, the default design that pops up in a developer's mind is a self-referencing table. However, this design may suffer from performance issues because querying it requires recursion (typically a recursive common table expression [CTE]), and recursion leads to additional logical reads. Another drawback is that, in the presence of a foreign key constraint, rows must be inserted into a self-referencing table in a specific logical order (parents before children).

    Among the many alternatives, one technique that can be used to implement a hierarchy is the Path Enumeration technique.

     As the name suggests, in this technique each node in the tree has an attribute that stores the complete path from the root down to that specific node, similar to a Linux file path. The technique is quite flexible and can represent natural hierarchies, ragged hierarchies, network patterns, etc. In data warehouses / data marts such tables are called bridge tables; they store the parent-child relationships along with the hierarchy path code separately and connect directly to the fact tables.

Hierarchy Use Case

We need to design a hierarchy for the below classification of financial instruments (Figure 1). How do we design this?

Figure 1

Table Design 

Below is the table structure that can be implemented.
Figure 2


  • Instrument Type stores a unique list of nodes. Sample data is shown in Figure 3 below.
  • Instrument Type Hierarchy stores the relationships between nodes. Sample data is shown in Figure 4 below.
    • HierarchyPathCd stores the enumerated path from the root down to the node.
    • NodeTypeInd is R (Root), I (Intermediate) or L (Leaf).
    • HierarchyLevelNbr represents the level at which the node sits in the hierarchy, with the top node at level 0.

Instrument Type Sample Data 


Figure 3


Instrument Type Hierarchy Sample Data


Figure 4


Understanding Hierarchy Path Code Creation

Let's take a sample piece of hierarchy that needs to be stored in our table design. 
  • BOND
    • GOVERNMENT
      • BILL 
      • NOTES
      • GOVERNMENT BONDS
Each of these items is a distinct node that is stored in the InstrumentType table with its own unique identifier.


InstrumentTypeId    InstrumentTypeNm
7                   Bond
8                   Government
9                   Bill
10                  Notes
11                  Government Bonds

Now, in order to construct the HierarchyPathCd for Bill, we build a "/"-separated string of the InstrumentTypeId values from the root parent down to the child. In this case it is /7/8/9/, which represents /Bond/Government/Bill.

Similarly,
  • Notes would be /7/8/10/
  • Government Bonds would be /7/8/11/
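
As a rough sketch of how a new node could be added, the child's path is simply the parent's path with the child's id appended. The script below is illustrative and not part of the original scripts; it assumes the parent's row already exists and that the new node is a leaf.

-- Illustrative: add a child node under an existing parent by concatenating
-- the parent's HierarchyPathCd with the new child's id.
DECLARE @ParentInstrumentTypeId SMALLINT = 8   -- Government
      , @ChildInstrumentTypeId  SMALLINT = 9;  -- Bill

INSERT INTO InstrumentTypeHierarchy
( ChildInstrumentTypeId, ParentInstrumentTypeId, HierarchyPathCd, NodeTypeInd, HierarchyLevelNbr )
SELECT @ChildInstrumentTypeId
     , @ParentInstrumentTypeId
     , P.HierarchyPathCd + CAST(@ChildInstrumentTypeId AS VARCHAR(10)) + '/'
     , 'L'                         -- assumed leaf for this example
     , P.HierarchyLevelNbr + 1
FROM InstrumentTypeHierarchy AS P
WHERE P.ChildInstrumentTypeId = @ParentInstrumentTypeId;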

Implementation for a few scenarios is shown below

a. Get entire hierarchy under "Bond" (top down)
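
Below is a condensed version of the "Get entire hierarchy bond and all its types (top down)" query from the Source Code section at the end of this post: it first looks up the HierarchyPathCd for Bond and then returns every node whose path starts with that prefix.

DECLARE @InstrumentType VARCHAR(500) = 'Bond'; -- search string here
DECLARE @HierarchyMapCd VARCHAR(5000);

-- Find the path code of the starting node.
SELECT @HierarchyMapCd = ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy AS ITH
     INNER JOIN InstrumentType AS ITC
        ON ITH.ChildInstrumentTypeId = ITC.InstrumentTypeId
WHERE ITC.InstrumentTypeNm = @InstrumentType;

-- Every descendant's path starts with the parent's path.
SELECT ITC.InstrumentTypeNm, ITH.HierarchyLevelNbr, ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy AS ITH
     INNER JOIN InstrumentType AS ITC
        ON ITH.ChildInstrumentTypeId = ITC.InstrumentTypeId
WHERE ITH.HierarchyPathCd LIKE @HierarchyMapCd + '%'
ORDER BY ITH.HierarchyPathCd, ITH.HierarchyLevelNbr;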



b. Get all nodes above "Variable Rate" Bonds (bottom up)
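
Below is a condensed version of the bottom-up query from the Source Code section: the path code of "Variable Rate" is split into its individual ids, which are then joined back to the hierarchy table to list all the ancestors in order.

DECLARE @InstrumentType VARCHAR(500) = 'Variable Rate'; -- search string here
DECLARE @HierarchyMapCd VARCHAR(5000);

SELECT @HierarchyMapCd = ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy AS ITH
     INNER JOIN InstrumentType AS ITC
        ON ITH.ChildInstrumentTypeId = ITC.InstrumentTypeId
WHERE ITC.InstrumentTypeNm = @InstrumentType;

-- Split '/7/14/16/' into the ids 7, 14, 16 and resolve each one.
SELECT IT.InstrumentTypeNm, ITH.HierarchyLevelNbr, ITH.HierarchyPathCd
FROM dbo.ConvertStringToTable(@HierarchyMapCd, '/') AS P
     INNER JOIN InstrumentTypeHierarchy AS ITH
        ON ITH.ChildInstrumentTypeId = P.Value
     INNER JOIN InstrumentType AS IT
        ON IT.InstrumentTypeId = ITH.ChildInstrumentTypeId
ORDER BY P.ID ASC;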

c. Get "/" separated hierarchy description for application code

The hierarchy can also be exposed with "/"-separated descriptions that the application middle tier can process and present in any manner suitable for the UI. Below is an example of how this can be done.

Alternatively, the HierarchyMapDesc column can be persisted as a part of the table itself for better performance.

In the example below, the encoded hierarchy /1/4/5/ is converted into the readable description Equity/Depository Receipts/ADR in a separate column. This description can then be processed by the middle tier for an appropriate display in UI-based applications.
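
This is essentially the "Get '/' separated hierarchy description for application" query from the Source Code section; it relies on the dbo.GetHierarchyMapDescription function defined there.

SELECT ITH.*
     , dbo.GetHierarchyMapDescription(ITH.HierarchyPathCd) AS HierarchyMapDesc
FROM InstrumentTypeHierarchy AS ITH;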



d. Other scenarios

  • Get only the root nodes. Hint: filter on NodeTypeInd = 'R'.
  • Get only the leaf nodes. Hint: filter on NodeTypeInd = 'L'.
  • Get all nodes of a subtree at a certain level. Hint: after filtering on the subtree path, apply a filter on HierarchyLevelNbr. (A short sketch of these three filters follows this list.)
  • In case of a network pattern, find the various paths in which a node participates in the network.
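
A minimal sketch of the first three hints, using the sample data above (the Bond subtree path '/7/' and level 1 are illustrative choices):

-- Root nodes only.
SELECT * FROM InstrumentTypeHierarchy WHERE NodeTypeInd = 'R';

-- Leaf nodes only.
SELECT * FROM InstrumentTypeHierarchy WHERE NodeTypeInd = 'L';

-- Nodes of the Bond subtree ('/7/') that sit at level 1.
SELECT *
FROM InstrumentTypeHierarchy
WHERE HierarchyPathCd LIKE '/7/%'
  AND HierarchyLevelNbr = 1;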

Benefits
  1. Unlike the self-referencing table design, hierarchy nodes can be inserted into the InstrumentTypeHierarchy table in any order, because all the node references already exist in the InstrumentType table.
  2. Non-recursive queries may help in countering performance issues.
  3. This table design can take care of natural hierarchies, ragged hierarchies and network-type relationships between nodes. One needs to ensure that the right constraints are applied for data integrity.
Pain points
  1. Populating the hierarchy path codes needs to be automated via scripts, or a UI needs to be developed to maintain the hierarchy.
  2. The hierarchy path codes may not correspond to the actual hierarchy, since no constraint enforces them. The application team has to ensure that the right codes are maintained.

Conclusion
           Path-enumerated hierarchies can be seen as another viable technique for storing hierarchical data, with good performance while retrieving, inserting or deleting hierarchy nodes. The design can store hierarchies of various types - natural (a single parent per child), unnatural (more than one parent for a single child), ragged (missing parents at certain levels), etc.



Source Code 
/***************************************************************************/
/*                          TABLE CREATION                                 */
/***************************************************************************/

USE testing;
GO

IF OBJECT_ID( 'InstrumentTypeHierarchy' ) IS NOT NULL
BEGIN
    DROP TABLE InstrumentTypeHierarchy;
END;
GO

IF OBJECT_ID( 'InstrumentType' ) IS NOT NULL
BEGIN
    DROP TABLE InstrumentType;
END;
GO

CREATE TABLE InstrumentType
(
    InstrumentTypeId SMALLINT IDENTITY(1, 1)
,   InstrumentTypeNm VARCHAR(100) NOT NULL
,   CONSTRAINT PK_InstrumentType PRIMARY KEY(InstrumentTypeId)
);
GO

-- business key unique index
CREATE UNIQUE INDEX IX_NC_InstrumentType_InstrumentTypeNm
    ON InstrumentType (InstrumentTypeNm);

CREATE TABLE InstrumentTypeHierarchy
(
    ChildInstrumentTypeId  SMALLINT NOT NULL
  , ParentInstrumentTypeId SMALLINT NOT NULL
  , HierarchyPathCd        VARCHAR(5000)
  , NodeTypeInd            CHAR(1) NOT NULL
  , HierarchyLevelNbr      TINYINT NOT NULL
  , CONSTRAINT PK_InstrumentTypeHierarchy PRIMARY KEY(ChildInstrumentTypeId)
);
GO


CREATE INDEX IX_NC_InstrumentTypeHierarchy_1
ON InstrumentTypeHierarchy (ChildInstrumentTypeId);

CREATE INDEX IX_NC_InstrumentTypeHierarchy_2
ON InstrumentTypeHierarchy (ParentInstrumentTypeId);

CREATE UNIQUE INDEX IX_NC_InstrumentTypeHierarchy_3
ON InstrumentTypeHierarchy (HierarchyPathCd);

ALTER TABLE InstrumentTypeHierarchy
ADD CONSTRAINT FK_InstrumentType_InstrumentTypeHierarchy_ChildInstrumentTypeId 
    FOREIGN KEY( ChildInstrumentTypeId ) 
        REFERENCES InstrumentType( InstrumentTypeId );

ALTER TABLE InstrumentTypeHierarchy
ADD CONSTRAINT FK_InstrumentType_InstrumentTypeHierarchy_ParentInstrumentTypeId 
    FOREIGN KEY( ParentInstrumentTypeId ) 
        REFERENCES InstrumentType( InstrumentTypeId );


-- Seed a 'Not Applicable' row with an explicit id.
SET IDENTITY_INSERT InstrumentType ON;
INSERT INTO InstrumentType( InstrumentTypeId, InstrumentTypeNm )
VALUES( -1, 'Not Applicable' );
SET IDENTITY_INSERT InstrumentType OFF;


/***************************************************************************/
/*                          DATA CREATION                                  */
/***************************************************************************/


INSERT INTO InstrumentType( InstrumentTypeNm )
VALUES( 'Equity' ), ( 'Common Stock' ), ( 'Preferred Stock' ), 
( 'Depository Receipts' ), ( 'ADR' ), ( 'GDR' ), 
( 'Bond' ), ( 'Government' ), ( 'Bill' ), ( 'Note' ), 
( 'Government Bond' ), ( 'Municipal' ), ( 'Agency' ), 
( 'Corporate' ), ( 'Fixed Rate' ), ( 'Variable Rate' ), 
( 'Asset Backed Security' ), ( 'Mortgage Backed Security' ), 
( 'Collateralized Debit Obligation' ), ( 'Money Market Security' ), 
( 'Derivative' ), ( 'Futures' ), 
( 'Forwards' ), ( 'Options' ), ( 'Swaps' );

INSERT INTO InstrumentTypeHierarchy
( ChildInstrumentTypeId, ParentInstrumentTypeId, HierarchyPathCd, NodeTypeInd, HierarchyLevelNbr )
VALUES( 1, 1, '/1/', 'R', 0 ), ( 2, 1, '/1/2/', 'L', 1 ), ( 3, 1, '/1/3/', 'L', 1 ), 
( 4, 1, '/1/4/', 'I', 1 ), ( 5, 4, '/1/4/5/', 'L', 2 )
, ( 6, 4, '/1/4/6/', 'L', 2 ), ( 7, 7, '/7/', 'R', 0 ), 
( 8, 7, '/7/8/', 'I', 1 ), ( 9, 8, '/7/8/9/', 'L', 2 ), 
( 10, 8, '/7/8/10/', 'L', 2 ) , ( 11, 8, '/7/8/11/', 'L', 2 ), 
( 12, 7, '/7/12/', 'L', 1 ), ( 13, 7, '/7/13/', 'L', 1 ), 
( 14, 7, '/7/14/', 'I', 1 ), ( 15, 14, '/7/14/15/', 'L', 2 ), 
( 16, 14, '/7/14/16/', 'L', 2 ), ( 17, 7, '/7/17/', 'L', 1 ), 
( 18, 7, '/7/18/', 'L', 1 ), ( 19, 7, '/7/19/', 'L', 1 ), 
( 20, 7, '/7/20/', 'L', 1 ), ( 21, 21, '/21/', 'R', 0 ), 
( 22, 21, '/21/22/', 'L', 1 ), ( 23, 21, '/21/23/', 'L', 1 ), 
( 24, 21, '/21/24/', 'L', 1 ), ( 25, 21, '/21/25/', 'L', 1 );



/***************************************************************************/
/*                          UTILITY FUNCTIONS                              */
/***************************************************************************/

IF OBJECT_ID(N'ConvertStringToTable', N'TF') IS NOT NULL
    DROP FUNCTION ConvertStringToTable
GO
-- Splits a delimited string into a table of (ID, Value) rows using XML.
CREATE FUNCTION dbo.ConvertStringToTable
(
    @InputString VARCHAR(MAX)
,   @Separator   CHAR(1)
)
RETURNS @Result TABLE(ID INT identity(1,1), Value VARCHAR(MAX))
AS
BEGIN
         DECLARE @XML XML;
         SET @XML = CAST(('<row>'+REPLACE(@InputString, @Separator, '</row><row>')+'</row>') AS XML);

        INSERT INTO @Result
        SELECT t.row.value('.', 'VARCHAR(MAX)')
        FROM @XML.nodes('row') AS t(row)
            WHERE t.row.value('.', 'VARCHAR(MAX)') <> '';

        RETURN;
END;
GO

IF OBJECT_ID(N'GetHierarchyMapDescription', N'FN') IS NOT NULL
    DROP FUNCTION GetHierarchyMapDescription
GO
-- Converts an encoded path such as '/1/4/5/' into a readable description
-- such as 'Equity/Depository Receipts/ADR'.
CREATE FUNCTION dbo.GetHierarchyMapDescription
(
    @InputString VARCHAR(MAX) 
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @Result VARCHAR(MAX)
    select @Result = COALESCE(@Result  + '/', '') + InstrumentTypeNm 
    from  dbo.ConvertStringToTable(@InputString,'/') F
    inner join InstrumentType IT
        ON IT.InstrumentTypeId = F.Value

    RETURN @Result 
END;
go


/***************************************************************************/
/*                          SAMPLE QUERIES                                 */
/***************************************************************************/
-- Get entire hierarchy bond and all its types (top down) 
DECLARE @InstrumentType VARCHAR (500) = 'Bond' -- search string here
DECLARE @HierarchyMapCd VARCHAR(5000)

SELECT @HierarchyMapCd = ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy AS ITH
     INNER JOIN InstrumentType AS ITC
        ON ITH.ChildInstrumentTypeId = ITC.InstrumentTypeId
WHERE
    ITC.InstrumentTypeNm = @InstrumentType

SELECT ITC.InstrumentTypeNm, ITH.HierarchyLevelNbr, ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy ITH
INNER JOIN InstrumentType ITC
    ON ITH.ChildInstrumentTypeId = ITC.InstrumentTypeId
INNER JOIN InstrumentType ITP
    ON ITH.ParentInstrumentTypeId = ITP.InstrumentTypeId
WHERE
    ITH.HierarchyPathCd LIKE @HierarchyMapCd + '%'
order by HierarchyPathCd, HierarchyLevelNbr
GO


-- Find hierarchy nodes above "Variable Rate" bond type (Bottom up)
DECLARE @InstrumentType VARCHAR (500) = 'Variable Rate' -- search string here
DECLARE @HierarchyMapCd VARCHAR(5000)

SELECT @HierarchyMapCd = ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy AS ITH
     INNER JOIN InstrumentType AS ITC
        ON ITH.ChildInstrumentTypeId = ITC.InstrumentTypeId
     INNER JOIN InstrumentType AS ITP
        ON ITH.ParentInstrumentTypeId = ITP.InstrumentTypeId
WHERE
    ITC.InstrumentTypeNm = @InstrumentType

IF OBJECT_ID('tempdb..#TmpTableFormatCode') IS NOT NULL
    DROP TABLE #TmpTableFormatCode;
select * INTO #TmpTableFormatCode from dbo.ConvertStringToTable(@HierarchyMapCd,'/')

SELECT IT.InstrumentTypeNm, ITH.HierarchyLevelNbr, ITH.HierarchyPathCd
FROM InstrumentTypeHierarchy ITH
INNER JOIN #TmpTableFormatCode P
    ON P.Value = ITH.ChildInstrumentTypeId
INNER JOIN InstrumentType IT
    ON IT.InstrumentTypeId = ITH.ChildInstrumentTypeId
ORDER BY ID ASC


-- Get "/" separated  hierarchy description for application 

select 
ITH.*,
dbo.GetHierarchyMapDescription(HierarchyPathCd) AS HierarchyMapDesc
from InstrumentTypeHierarchy ITH







Sunday, June 17, 2018

The power of Abstraction in data models

A good data model should stand the test of time. Extensive changes in a data model have a ripple effect on other parts of an application, which may turn out to be an expensive affair - one the business may NOT be happy with!

The good thing is, a data modeler has a tool at his/her disposal that may shield an application from such expensive changes. The tool is - ABSTRACTION!

A word of caution here, though: this technique needs to be used with the right balance between usability and flexibility.

Let's understand this with an example. Say we are building an application for a company that displays soccer-related statistics and other information for each player. Let's convert the below requirement into a data model.

Initial requirement 

"In an international soccer competition, certain number of qualified countries participate. Each country selects a group of players, that would represent it in the competition." 

Level - 0 Logical data model (without abstraction)

Our data modeler has modeled as below -

Figure 1

Please note : This is a very basic, specific and inflexible model.

The website has been live for a year. Now the company feels it needs to extend the website to other US leagues too. You get the below new requirement.


New requirement after going live

"WE also need to include competitions for prominent USA soccer leagues in our web site. Each major league in USA has certain teams or clubs. Each club has certain players that compete in a league. Important point to be noted here is, a player may be able to change his/her team. However, in the international format, changing the country is not possible."

Level-0 Logical data model (without Abstraction)

With the above requirement, our original model starts breaking. We now have new entities called League and Club. Players play as part of a club in a league; however, in the international format, players represent a country.

Figure 2


With the change in requirements there are some changes to the data model, which have an impact on the application. Let's say the developers change the code to accommodate the above requirements with some difficulty, and the website goes live again. After a year, looking at the response to the website, the business provides the below requirement.


Yet another requirement after going live

"The website is getting many hits, with International and US leagues related information. We would like to extend this to European Leagues.  European leagues are however hierarchical in nature. There is a concept of promotion and relegation. As the teams perform better, they move up in the pyramid of leagues. If they under perform, certain number of bottom teams would be relegated to lower leagues"

This issue is never-ending. The website can practically be enhanced to include any soccer competition happening under the sun. This is where ABSTRACTION can help. One needs to carefully apply abstraction to make the data model immune to such drastic changes, which cost a lot!

Level-1 Logical data model (with Abstraction)
Figure 3


Level-2 Logical data model (abstraction + hierarchical flexibility)
The issue with the above structure is that it allows only a natural hierarchy. A natural hierarchy means a child can have only one parent, while a parent may have zero or many children. That does not support many-to-many relationships between organizations. In order to support those, the structure below can be used.

If one specifies the right unique constraints, the same model can be used for either one-to-many or many-to-many relationships.

Figure 4
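
As a rough physical sketch of this kind of abstracted structure (the table and column names here are illustrative and not taken from the figures), a single Organization table holds countries, leagues and clubs alike, while a separate relationship table allows an organization to be linked to any number of parent organizations:

-- Generic Organization entity: Country, League, Club, etc. all become rows.
CREATE TABLE Organization
(
    OrganizationId     INT IDENTITY(1, 1) NOT NULL
,   OrganizationNm     VARCHAR(100)       NOT NULL
,   OrganizationTypeNm VARCHAR(30)        NOT NULL  -- e.g. 'Country', 'League', 'Club'
,   CONSTRAINT PK_Organization PRIMARY KEY( OrganizationId )
);

-- Many-to-many relationships between organizations (promotion/relegation,
-- clubs participating in several leagues, and so on).
CREATE TABLE OrganizationRelationship
(
    ParentOrganizationId INT NOT NULL
,   ChildOrganizationId  INT NOT NULL
,   CONSTRAINT PK_OrganizationRelationship PRIMARY KEY( ParentOrganizationId, ChildOrganizationId )
,   CONSTRAINT FK_OrganizationRelationship_Parent FOREIGN KEY( ParentOrganizationId ) REFERENCES Organization( OrganizationId )
,   CONSTRAINT FK_OrganizationRelationship_Child FOREIGN KEY( ChildOrganizationId ) REFERENCES Organization( OrganizationId )
);

Adding a unique constraint on ChildOrganizationId alone would restrict this to a one-to-many (natural) hierarchy, which is the point about unique constraints made above.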

What is Abstraction?
        Abstraction is a generalization technique where commonality in data elements and entities is extracted into more generic structures. This is done with the intention of broadening applicability to a wider range of situations. As in the above examples, various concepts like Country, League, State and Club are collapsed into one generic concept called Organization.

Abstraction Levels
   There are multiple levels at which abstraction may be implemented. As concepts are refined towards more generalized ones, we move towards a higher level of abstraction that can represent broader concepts, leading to more flexibility.

       The modeler needs to know where to stop in order to bring a balance between flexibility and usability.

For example -
Figure 5



Benefits of Abstraction
  • Abstraction brings flexibility for wider scenario coverage. Data model would be able to scale to other business concepts that are abstracted in the data model.
  • Helps in those rare situations when, for certain parts of the application, the business is not sure of specific requirements and only asks for flexibility around high-level scenarios. Here the abstraction is forced, which is not a good thing; in such cases it is the responsibility of the SMEs to provide clarity to the tech team.
  • Generalizing the design needs a better understanding of a logical data model as one needs to find commonalities. The abstraction exercise helps in promoting better understanding of business concepts.


Evils of Abstraction

Like every approach has its own PROS and CONS, Abstraction too has some issues 
  • Relaxed database constraints. This responsibility moves from the data modeler to application development team. 
  • Application Complexity may increase due to generic structures. The amount of complexity may depend on the level of abstraction implemented in a model.
  • More development time due to added complexity.
  • Possible performance issues due to complex structure depending on level of abstraction.
  • Application team may misuse the data model. This may happen when a data modeler hands over the data model to the team and moves out. Due to lack of constraints and more flexibility, development team may end up using the data model in a way not intended by the data modeler. This may lead to issues around application maintainability and stability.
  • Documentation is a must, as it helps the development team understand the generic concepts, their applicability and their usage. Developers who have not worked on such models before may take time to understand the model and implement it.


Decision making process (To abstract or NOT-to abstract)


Figure 6


Normalization v/s Abstraction
       
        People at times get confused between normalization and abstraction. However these two are very distinct activities.

        Normalization is done with the intention of putting together elements into a more appropriate entity to reduce redundancy and avoid insert, update and delete anomalies. This does not lead to creation of newer concepts like Organization, Party, etc.

        However, Abstraction is the process of refining the structures to represent more generic concepts without violating normalization.



Conclusion
     Abstraction, when used with care, is beneficial to an application. However, if misused, it may backfire extensively. Use abstraction where and when necessary, at the right level. Too much could be dangerous and too little could bring fragility to the application. The data modeler needs to know where to stop.


Sunday, September 7, 2014

Find Statistics information using T-SQL Scripts

-- Lists the statistics on user tables along with their columns, the date
-- they were last updated and how they were created.
SELECT
    OBJECT_NAME(s.object_id)              AS TableNm
  , COL_NAME(s.object_id, sc.column_id)   AS ColumnNm
  , s.name                                AS StatisticsNm
  , STATS_DATE(s.object_id, s.stats_id)   AS LastUpdated
  , s.auto_created
  , s.user_created
  , s.no_recompute
  , s.has_filter
FROM   sys.stats s
       INNER JOIN sys.stats_columns sc
               ON s.stats_id = sc.stats_id
                  AND s.object_id = sc.object_id
       INNER JOIN sys.objects o
               ON o.object_id = s.object_id
WHERE
  o.type = 'U'
ORDER  BY
  OBJECT_NAME(s.object_id)
  , COL_NAME(s.object_id, sc.column_id)
  , s.name;

Saturday, December 22, 2012

SSAS - Designing approaches for dimensional model

In SSAS, there are two ways to design dimensional models: top-down or bottom-up.

Top Down Approach
  • This option is generally not used much.
  • The relational data source does not exist in this case; the tables are generated from the dimensional model designed in SSAS.
  • The Cube and Dimension wizards have the option to implement this approach.
  • The Schema Generation Wizard is used to generate and deploy the schema to the database.
  • DBO permission is required in order to create the database objects.

Bottom Up Approach
  • Most of the time, this is the approach that is taken.
  • The data mart already exists.
  • The dimensional model and cubes are designed from the relational tables in the data mart.
  • The Cube and Dimension wizards have the option to implement this approach.
 

Monday, December 10, 2012

SQL Server - Cost estimation issues even with up-to-date statistics

The only way to know whether statistics are up-to-date or outdated is to look at the actual record count and the estimated record count in an execution plan. A mismatch indicates that statistics need to be updated. But there are exceptions to this rule: there are situations where, even if statistics are up-to-date, there is still a difference between the estimated and actual counts.


First let's check the result with a fixed-parameter query. In my example I have bloated the data in AdventureWorksDW2008R2.FactInternetSales by duplicating the rows multiple times (of course, ensuring that the primary key stays unique).

T-SQL with fixed query (OK)
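
A representative sketch of this pattern (the filter on OrderDateKey and the key value are illustrative, not necessarily the exact query from the screenshots):

-- With a literal value the optimizer can read the statistics histogram
-- directly, so the estimated row count closely matches the actual count.
SELECT ProductKey, SalesAmount
FROM dbo.FactInternetSales
WHERE OrderDateKey = 20080101;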



T-SQL with local variables (Issue)
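
The same query, but with the value passed through a local variable (again illustrative):

-- The optimizer cannot see the runtime value of a local variable, so it
-- falls back to an average-density estimate instead of the histogram,
-- and the estimated count can differ widely from the actual count.
DECLARE @OrderDateKey INT = 20080101;

SELECT ProductKey, SalesAmount
FROM dbo.FactInternetSales
WHERE OrderDateKey = @OrderDateKey;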


Stored procedure with local variables (Issue)
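
A sketch of the stored procedure variant, where the parameter is copied into a local variable before being used (the procedure name is illustrative):

CREATE PROCEDURE dbo.GetSalesByOrderDate_LocalVar
    @OrderDateKey INT
AS
BEGIN
    -- Copying the parameter into a local variable defeats parameter
    -- sniffing, so the estimate is again based on average density.
    DECLARE @LocalOrderDateKey INT = @OrderDateKey;

    SELECT ProductKey, SalesAmount
    FROM dbo.FactInternetSales
    WHERE OrderDateKey = @LocalOrderDateKey;
END;
GO

EXEC dbo.GetSalesByOrderDate_LocalVar @OrderDateKey = 20080101;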

Here are the scenarios in which the same query, written in different styles, gives matching estimated and actual counts.

Possible Solutions...

Here are the possible solutions to correct the issue.
    Stored procedure without local variables (OK)
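
A sketch of the same procedure using the parameter directly (names are again illustrative):

CREATE PROCEDURE dbo.GetSalesByOrderDate
    @OrderDateKey INT
AS
BEGIN
    -- Using the parameter directly lets the optimizer sniff the value at
    -- compile time and use the histogram, giving an accurate estimate.
    SELECT ProductKey, SalesAmount
    FROM dbo.FactInternetSales
    WHERE OrderDateKey = @OrderDateKey;
END;
GO

EXEC dbo.GetSalesByOrderDate @OrderDateKey = 20080101;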
 


 
    Dynamic sql with sp_executesql (OK)
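
A parameterized dynamic SQL sketch (illustrative query):

DECLARE @OrderDateKey INT = 20080101;

-- sp_executesql compiles the statement with the sniffed parameter value,
-- so the histogram is used and the estimate matches the actual count.
EXEC sys.sp_executesql
    N'SELECT ProductKey, SalesAmount
      FROM dbo.FactInternetSales
      WHERE OrderDateKey = @OrderDateKey;'
  , N'@OrderDateKey INT'
  , @OrderDateKey = @OrderDateKey;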



Dynamic sql with EXEC (OK)
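
And a sketch with EXEC, where the value is concatenated into the statement as a literal (illustrative query):

DECLARE @OrderDateKey INT = 20080101;
DECLARE @sql NVARCHAR(MAX);

-- The generated statement contains a literal, so the optimizer sees the
-- value at compile time (at the cost of one cached plan per value).
SET @sql = N'SELECT ProductKey, SalesAmount
             FROM dbo.FactInternetSales
             WHERE OrderDateKey = ' + CAST(@OrderDateKey AS NVARCHAR(10)) + N';';

EXEC (@sql);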





Saturday, December 1, 2012

Big Data - 1: A big picture


Every day we get bombarded with terminology connected with big data, like NoSQL, RavenDB, Hadoop, MongoDB and so on. Since this is an upcoming area in IT, we struggle to understand what those terms mean and how these technologies can help us. I am not an expert in big data, but I will share what I have learned so far.



The diagram above gives a high level view of various terminologies in the big data arena and their relationship with each other.
  • Big Data : 
    • Big data is the data that is big in terms of
      • Volume (terabytes, petabytes and above)
      • Velocity (rate of transmission and reception) 
      • Variety (complexity, text, pictures, videos etc).
    • Big data by itself is messy and does not make sense on its own, as it is unstructured or semi-structured, unlike traditional data. The only way it can be useful is to pull out just the relevant information, and pulling relevant information out of such big data is the real problem that traditional systems cannot help with. This is where big data technologies come into the picture.
There are three technology classes that support big data: Hadoop, NoSQL and Massively Parallel Processing (MPP) databases. I will start with MPP databases as they have been around for decades.
  • MPP Databases: These databases spread the data across independent storage and CPU, thus achieving parallelism and great processing capability. A special hardware and software combination helps achieve such parallelism. The big products in the market are IBM's Netezza, HP's Vertica and EMC's Greenplum. Microsoft's SQL Server Parallel Data Warehouse (PDW) is an upcoming MPP database.
  • Hadoop: Hadoop is one of the technology classes that support big data. It is an open-source implementation of MapReduce created by the Apache Software Foundation. In general, MapReduce is a parallel programming framework.
  • NoSQL: NoSQL stands for "not only SQL". It represents a class of databases that do not follow RDBMS principles (like transactional integrity) and mainly deal with big data. There are four categories of NoSQL databases; please refer to the diagram above for the database products that fall under the categories below. A few of the NoSQL databases work in conjunction with Hadoop components.
    • Document store
    • Graph Database
    • Key-Value store
    • Wide Column store
I hope this post was useful in understanding the basic terminologies of big data and related technologies.

 In the next big data post, I would provide more details on NoSQL database categories.

Monday, November 26, 2012

SQL Server - Partitioning Design : 4 - Partitioned Tables + Partitioned View

This post is an extension of the previous post - Partitioning Design Patterns - which you would need to go through in order to understand this one. In this post I will explain in depth why this design is an optimal solution, covering only those points that make it a preferred implementation style compared to the other designs.


 

How does this work?

  • Please refer to the diagram above.
  • Data for each year is stored in a separate physical table. In the above example: FactTransaction_2010, FactTransaction_2011, FactTransaction_2012, FactTransaction_2013.
  • Each physical table is partitioned into one partition per month.
  • Each table is stored on a different filegroup, with 4 files per filegroup.
  • Each filegroup is stored on a different storage drive based on workload. More recent, very frequently accessed data is stored on a high-performance storage option; older, less frequently accessed data is placed on a lower-performance, less expensive storage option.
  • A view (FactTransaction_vw, a UNION ALL over all the tables) exposes the data in all the tables. Application queries use this view for their reporting needs.
  • A synonym (FactTransaction_syn) points to the underlying table for the current year (FactTransaction_2012). All queries that insert or update current-year data use the synonym. Next year, the synonym is simply repointed to the relevant table without any impact on existing code.
  • Most importantly, the table for each year has a check constraint on the date column (the partition column) for the valid date range. For example, a check constraint on the date column in FactTransaction_2012 ensures that only data from 1 Jan 2012 to 31 Dec 2012 can be stored. With this, SQL Server does table elimination when searching for a particular date, and since each table is further partitioned per month, SQL Server also does partition elimination. For instance, if a user queries data for 10 Nov 2012, SQL Server jumps directly to FactTransaction_2012 (table elimination, enabled by the check constraint) without touching any other table, and within FactTransaction_2012 it jumps to the partition for November (partition elimination). This greatly helps partition-column-based search operations. A minimal sketch of these skeleton objects follows this list.
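
A minimal sketch of the skeleton objects (filegroup placement, the monthly partition scheme and most columns are omitted; the fact table's column names are illustrative):

-- One physical table per year, each with a CHECK constraint on the
-- partition column so SQL Server can eliminate whole tables.
CREATE TABLE dbo.FactTransaction_2011
(
    TransactionDt DATE          NOT NULL
,   Amount        DECIMAL(18,2) NOT NULL
,   CONSTRAINT CK_FactTransaction_2011_Dt
        CHECK (TransactionDt >= '20110101' AND TransactionDt < '20120101')
);

CREATE TABLE dbo.FactTransaction_2012
(
    TransactionDt DATE          NOT NULL
,   Amount        DECIMAL(18,2) NOT NULL
,   CONSTRAINT CK_FactTransaction_2012_Dt
        CHECK (TransactionDt >= '20120101' AND TransactionDt < '20130101')
);
GO

-- Partitioned view over all the yearly tables; reporting queries use this.
CREATE VIEW dbo.FactTransaction_vw
AS
SELECT TransactionDt, Amount FROM dbo.FactTransaction_2011
UNION ALL
SELECT TransactionDt, Amount FROM dbo.FactTransaction_2012;
GO

-- Synonym for the current-year table; inserts and updates use the synonym,
-- and it is simply repointed at the start of each year.
CREATE SYNONYM dbo.FactTransaction_syn FOR dbo.FactTransaction_2012;
GO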

Variations of this design depending on data volume


       With some more creativity, this design can be tweaked further if daily/monthly data volumes are too heavy. But do thorough testing to confirm that it really works for you. Below is one such variation.
    • One table for each year (FactTransaction_2010, FactTransaction_20xx, and so on).
    • The table for each year can be partitioned per month (or even per day).
    • One separate table (FactTransaction_CurrentMonth) stores data for the current month (or quarter, depending on data volume).
    • Daily rebuilds are done only on the FactTransaction_CurrentMonth table, which is a very small fraction of a huge historic table. Maintenance activities like index rebuilds or statistics updates therefore do not put undue pressure on system resources and definitely benefit system availability.
    • Once the current month is done (e.g. Nov 2012), its data is merged from FactTransaction_CurrentMonth into the FactTransaction_2012 table; FactTransaction_CurrentMonth is then used to store Dec 2012 data, and so on. A rough sketch of this merge step follows this list.
    • Monthly rebuilds are done on the yearly table after merging the current month's data from FactTransaction_CurrentMonth into FactTransaction_2012.
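
One simple way the month-end merge step could look (using the illustrative columns from the sketch above; a partition SWITCH is an alternative if the structures are aligned):

BEGIN TRANSACTION;

-- Move the completed month from the current-month table into the yearly table.
INSERT INTO dbo.FactTransaction_2012 ( TransactionDt, Amount )
SELECT TransactionDt, Amount
FROM dbo.FactTransaction_CurrentMonth;

-- Empty the current-month table so it can receive the next month's data.
TRUNCATE TABLE dbo.FactTransaction_CurrentMonth;

COMMIT TRANSACTION;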

Pros

The benefits below give this design an edge over other implementations. Please remember, I have not mentioned all the other benefits of this design that overlap with other designs.
  • Relieves SQL Server from unwanted CPU and IO pressure from online/offline index rebuilds.
    • Index rebuilds are something that cannot be avoided in any database implementation. Based on index design and fragmentation levels, indexes might need to be rebuilt as part of the daily batch. Index rebuilds are generally a problem for huge databases due to space requirements and resource consumption.
    • Let's understand this with an example. Say there is a table with 1 terabyte (TB) of data that stores 4 years of history. If any partitioning design pattern except Design-4 is implemented, an index rebuild leads to a very big hit on CPU and IO. In the case of a clustered index, because it includes the actual data pages plus the other index pages, the clustered index size is 1 TB (data size) plus the size of the other index pages; rebuilding such a clustered index causes a huge spike in IO and CPU consumption on the server. Non-clustered indexes, even though they do not contain the data pages, still hit CPU and IO as well, because each non-clustered index structure stores data for the entire historic table. Say there are 5 non-clustered indexes and each index is (hypothetically, based on its design) 100 GB; a rebuild then reworks 500 GB of data, which is a big hit on server resources.
    • It's worth a thought: why should the index data for previous years be rebuilt over and over again when all that really needs to be rebuilt is the current year's data? With the recommended multiple-table design, the data is already split into multiple separate physical structures, and rebuilding only the current year's indexes is obviously very light on SQL Server compared to rebuilding the indexes on a single table holding all the historic data.
 
  • Better availability with online rebuilds.
    • A particular partition can be rebuilt only in offline mode; the online option is available only for the entire table. This can be an issue for applications that need to be online 24x7, because in offline mode the partition being rebuilt is not available for querying.
    • With a table for each year, the data is divided into separate structures, so rebuilding only a single year's table becomes practical in ONLINE mode.
    • If a single year's data is huge, the solution can be tweaked to suit your needs; refer to the section "Variations of this design depending on data volume". As mentioned in that example, an online rebuild takes a very small fraction of the time, as only a very small fraction of the data is rebuilt. In FactTransaction_CurrentMonth, on day 1 only one day of data needs to be rebuilt, on day 2 two days, and so on until day 31. So at most 31 days of data are rebuilt while the rest of the data for all the years is untouched. That cuts down a lot of maintenance time and resources and enhances application availability.

  • The space issue associated with online rebuilds is eradicated.
    • Rebuilding indexes in online mode on huge tables needs a lot of space, and many a time the rebuilds fail due to lack of space. Let's understand this with an example.
    • For instance, if a clustered index on a table with 500 GB of data needs to be rebuilt, it would need about 1.1 TB of space:
      • A copy of the data is created, which needs 500 GB.
      • Around 20% of sorting space is needed, which is 100 GB.
      • About 500 GB of log is generated.
      • If mirroring is implemented, it also generates a lot of network traffic.
    • Now let's see the space requirement with this design or its variation (FactTransaction_CurrentMonth). The FactTransaction_CurrentMonth table, which stores one month of data, would hold, say, 10 GB at the end of the month. At the most it would need around 22 GB of space:
      • A copy of the data is created, which needs 10 GB.
      • Around 20% of sorting space is needed, which is 2 GB.
      • About 10 GB of log is generated.

  • Relieves SQL Server from unwanted CPU and IO pressure caused by statistics recomputation over a huge base data set.
    • A huge table will have huge regular indexes. There can be situations where performance is bad due to skewed statistics because the sampling size is not enough. Scanning such huge data with a higher sampling size to rebuild statistics overloads CPU and IO for a long time, as SQL Server has more data pages to scan through to create more accurate statistics.
    • With one huge table storing data for all the previous years, a statistics update leads to re-computation of statistics for all the previous years. This is unnecessary work. Would it not be great if only the current year's statistics were recomputed? Previous years' statistics can be computed once and used multiple times by all the queries.

  • Helps SQL Server maintain accurate statistics, which is a major contributor to better query performance.
    • Apart from the undue pressure on CPU and IO, huge tables also lead to skewed statistics. The rule is: small tables have accurate statistics, and huge tables by default have skewed statistics. In this particular design, because each table is a separate physical structure, the statistics too are maintained separately for each yearly table. If it were one big table with all the historic data, one huge set of statistics would be maintained for that table.
    • So why are accurate statistics beneficial for better execution plans? Because accurate statistics have enough data to help SQL Server come up with a good execution plan; if the statistics are skewed, SQL Server cannot. This follows the standard rule - "garbage in, garbage out".
    • Since this blog is not dedicated to statistics, I cannot include more details on why and how statistics get skewed on huge tables; I will do that in a separate post. But in general, stats get skewed in huge tables because SQL Server computes statistics only on the occurrence of certain events and, when it does, it uses a sampling algorithm. The samples might not be enough to compute accurate statistics.

Cons    

  • Any change in the table schema needs to be reflected in all the underlying tables. If you have 10 tables for 10 years, all of them need to be updated with the new schema. There is another way to avoid this - one can create a dummy-value column in the view - but I would not prefer that as a personal choice.
  • A view can have a maximum of 255 tables. Practically this should not be an issue, as it would mean 255+ years of historic data, which is next to impossible for all practical purposes.

Please leave your comments if you liked/have suggestions for  this blog post.