Home arrow Articles arrow Paradox Solutions arrow A Primer for Data Warehousing
02 September 2014
 
 
A Primer for Data Warehousing Print E-mail
Written by Larry DiGiovanni   
26 February 2002
Learn the basics of data warehousing from what solutions it offers to strategies, data modeling, populating and using the warehouse, and making a business case for or against using one in your environment.A Primer for Data Warehousing
© 2002 Lawrence G. DiGiovanni
Digico, Inc.
www.digicoinc.com

Acknowledgement

The author would like to recognize Jean Friedberg, Denn Santoro, and Liz Woodhouse for their contributions in reviewing this article.


Preface

Data Warehousing is an industry term that gets a lot of press and probably more than its fair share of hoopla. This article will present common sense definitions, describe a straightforward approach, and attempt to dispel some common misconceptions. We’ll present data warehousing as a solution to a problem, rather than an end unto itself.

This article is intended for database developers with relational database experience who wish to begin to gain an understanding of the process and goals of data warehousing.


Introduction

A data warehouse is an increasingly common solution to the challenge of data access that is sometimes present in transactional database applications. In a properly structured relational database, tables are highly normalized. Normalization is the process of organizing business data into a structure that eliminates duplication of entities and properties, avoids modification anomalies, and prevents inconsistency within a database. For a more thorough review of database normalization, refer to the paper Data Normalization: A Primer, by Dennis Santoro, which is available on the Paradox Resources Page at http://www.rdaworldwide.com.

A highly normalized database structure provides a number of key benefits to development, data integrity, and the flexibility of applications, which makes the job of the application developer easier. The downside is that normalization tends to cause certain types of reporting to be slower than it would be on a flatter data model. This makes the job of reporting and analysis on this data harder, particularly ad hoc reporting and analysis.

Another obstacle to reporting is the fact that business data are usually spread out amongst a number of business applications, which means that your data are stored in multiple locations, potentially using multiple database engines.

These are the seeds from which data warehousing grows. A data warehouse is a database that contains data from one or more databases in use by a given organization where the data are structured for rapid access.


The Data Warehouse Solution

Data Warehousing is simply the process of bring data together in a model that is organized so as to optimize reporting and aggregation operations. A data warehouse is the product of that process, and becomes the data source against which certain reporting, decision support, and data mining can occur. A data warehouse is not specifically defined by the data modeling strategy, technology platforms, or even the size of the database. In fact, it’s probable that you have done this, or something akin to this, before.

There are various data modeling strategies for data warehousing that improve data access performance in a way that is also flexible enough to meet evolving reporting and analysis needs without introducing reporting anomalies. There are also technology platforms which have specific features intended to support certain special needs of data warehousing strategies. And since the size of the whole is influenced by the sum of it’s parts, data warehouses tend to be among the largest databases in use at a given organization.

There is no rule that forbids warehousing your Paradox applications’ data and no reason not to consider Paradox as the database engine for your warehouse.


Data Warehouse Strategies

A detailed methodology for data warehousing is well outside the scope of this document. We can present a generalized approach.

There are three fundamental components to a data warehouse:
  • The alternate data model that is optimized for reporting and analysis rather than for data entry and data integrity.


  • The tools and code used for extracting and translating data from the source business system(s) and for loading that data into the data warehouse.


  • The reporting and visualization tools used for accessing the data within the data warehouse.
Building a data warehouse is therefore a process of identifying the sources of data to warehouse and modeling this data to meet the reporting requirements, developing processes to periodically populate this model, and selecting and integrating the reporting and analysis tools.


Data Modeling for Warehousing

The data model is arguably the most critical aspect of the data warehouse. Just as a poorly normalized transactional database creates the risk of data modification anomalies, a poorly organized data warehouse creates the risk of reporting inconsistencies. There are several generally accepted modeling strategies and principles that can be shown to mitigate this risk.

Most data warehouse modeling strategies involve a fundamental shift in the way you are probably accustomed to representing data. Instead of organizing by atomicity or functional dependency, data warehouses commonly group data which are numeric and additive (facts) separate from data which are textual or non-additive (dimensions). Facts (counts, sums, averages, etc.) are easily constrained in queries by associated dimensions.

As a simple example, consider the orders table in the DBDEMOS database. ItemsTotal, AmountPaid, Freight, are examples of facts in that they are both numeric and additive. SaleDate, ShipToAddr, and Terms are examples of dimensions in that they are non-numeric (hence non-additive). CustNo, EmpNo, and TaxRate are also dimensions because even though they are numeric, they are non-additive. We could model this table in a data warehouse as two tables:
OrdFct:
OrderNum*
ItemsTotal
AmountPaid
Freight

OrdDim:
OrderNum*
CustNum (FK)
SaleDate
ShipToAddr
Terms
EmpNo (FK)
TaxRate
And so forth. Certain columns from the original orders table were left out of our model for the sake of simplicity. Where there was one table, now there are two. So far, no real value has been added.

Now look at the OrdDim table. Except for the distinguishing OrderNum, there is likely to be significant duplication within this table, because there are a limited number of Customers, Employees, Payment Terms and so on. It may appear at first that the number of permutations of these attributes is close to the original number of rows in the orders table, but this is rarely the case. Eliminating duplicate dimensions commonly reduces storage requirements by half or more. This obviously requires the introduction of a surrogate key for OrdDim, and a corresponding foreign key in OrdFct:
OrdFct:
OrderNum*
OrderID (FK)
ItemsTotal
AmountPaid
Freight

OrdDim:
OrderID*
CustNum (FK)
SaleDate
ShipToAddr
Terms
EmpNo (FK)
TaxRate
This new model has a potentially significant storage benefit. Note that a report or query on OrdDim that needs to resolve the foreign key values has to include those tables in the query or report. Inclusion of the foreign key columns will not, by definition, increase the cardinality, and it will improve query and reporting performance:
OrdFct:
OrderNum*
OrderID (FK)
ItemsTotal
AmountPaid
Freight

OrdDim:
OrderID*
CustNum
CustomerName
SaleDate
ShipToAddr
Terms
EmpNo
EmployeeName
TaxRate
Again, most of the foreign key referenced columns have been left out for simplicity sake, but pretend they’re there. This model is a poor example of Normal Forms, but it is easier and faster to query and report against. This is safe so long as it is only used for reporting and querying (i.e., not for transactional data entry).

Now consider the SaleDate column. Note that the date on which the sale occurred is part of a week, which is also part of a month, which is also part of a fiscal quarter, and so forth. This is an example of a hierarchy. It is very useful to predefine these hierarchies in separate dimension tables for drill-up and drill-down capability.

Finally, data warehouses can also be made to contain pre-summarized views of table data where appropriate. Pre-aggregating facts by commonly used dimensions and dimensional hierarchies can significantly improve performance of common queries. This comes at the expense of storage, though storage is considerably cheaper than I/O and CPU, so this is usually an acceptable trade-off. Tables in data warehouses will have a fairly large number of indexes, which are sometimes an obstacle in transactional data.

One consequence of the common batch mode of updating a data warehouse is that the overhead of index maintenance in a data warehouse is less a cause for concern than in a transactional environment. That gives significant freedom in creating as many maintained indexes as would be useful to query performance.


Populating the Data Warehouse

In data warehouse parlance, the processes that populate the data warehouse are referred to as ETL, for Extraction, Translation, and Load. This merely refers to the notion that data are pulled from source application data stores, translated to meet the storage requirements of the data warehouse, and loaded to the data warehouse model.

Extraction will obviously require the cooperation of the stewards of the source data, either to render source data in a predefined format or to simply provide access to source data in its native format. In either event, the key is to minimize the performance impact on the source data application users, and to obtain a read consistent set of the data (no partial complex transactions).

Translation involves the reorganization of the data from the original format in preparation for loading. This step joins foreign keys, eliminates the duplicates, creates the new surrogate keys, and creates the summary tables, as needed.

Loading is simply the addition of the translated data into the data warehouse model. This step potentially involves removing or replacing certain data in the data warehouse, and can require a great deal of I/O. As with extraction, the key is to minimize the impact on the users of the data warehouse.

The processing overhead and contention for source and target resources constrains the frequency of updates to the data warehouse. Most production data warehouses are rarely updated more than once per day, usually at a time when system usage is low.


Using the Data Warehouse

Data Warehouses are typically accessed for read-only use (except for ETL processing). From a user’s perspective, it is common to make a clear distinction between applications that use the warehoused data and applications that use the source transactional data. Clearly differentiating between the data warehouse and the source application data goes a long way to avoiding confusion. For instance, if ETL is executed nightly, a sales report for the current fiscal period will not change during the business day to reflect daily sales activity.

There are a growing number of tools (particularly for reporting and data visualization) that are specialized for use with data warehouses. Most application development environments, like Paradox, are more than capable of accessing and rendering this data as well.


The Business Case For (Against) a Data Warehouse

Data warehousing is not for every business and certainly not for every application. There are no hard and fast rules to follow in making the above determination. It really comes down to:
  • Would your organization benefit from the addition of a data warehouse? A data warehouse is a solution to some common problems, such as:


    • You are considering denormalizing transactional application data to streamline reporting, or are just unhappy with the amount of time spent waiting on report execution.


    • You need to report across multiple business systems, but cannot because of the differences in physical storage locations, storage formatting, or proprietary storage format of off-the-shelf or vertical market business systems.


    • Your users’ reporting requirements frequently change even though the actual data requirements are static.

  • Do you have the resources available to develop the data model and code? A data warehouse has certain acquisition and maintenance costs in development, operations, and infrastructure. The development can generally be outsourced, as we have shown that there are some tried and true warehousing strategies that are common regardless of the intricacies of the application or the business domain. Most of the maintenance is the simple execution and monitoring of the ETL processing. Obviously the data warehouse and ETL can be developed to minimize the maintenance overhead.


  • Do you have buy-in from the stewards of the source data? Some departments may be unwilling to share their data for a variety of reasons.


  • Will your reporting needs be met with non-transactional data? The up-to-the-minute data will remain in your application databases, can you live with the lag? Data warehouse data is only as fresh as the most recent update. You can choose a more aggressive update cycle to try to mitigate this, but data extraction from highly transactional data presents its own set of challenges.

Summary

Data Warehousing is an approach to the common problem of reporting on transactional data, or data from a variety of source applications. No special tools are required for building, populating, or using a data warehouse, other than tools which support data access, manipulation, and reporting.
< Prev   Next >
 
Top! Top!