Download Data Warehouse Design: Understanding Dimensional Modeling and Indexing and more Slides Data Warehousing in PDF only on Docsity!
Data Warehousing Design
Warehouse Design
What to materialize in the warehouse
– Which source data?
– Which summary tables?
– Which indices?
Influenced by both querying and
maintenance
Trade storage space and update time for
query speed
Why is DW Design Different?
DSS: few transactions, each accessing a large
number of records
Typical ER designs tend to be complex and
difficult to navigate
Topic/Function Operational Decision Support
Data Content Current values Archival data
Data Organization Application by application Subject areas across enterprises
Nature of Data Dynamic Static until refreshed
Data Structure, Format Complex: suitable for operational
computation
Simple: suitable for business
analysis
Random Access Probability High Moderate to low
Data Update Updated on a field-by-field basis Accessed and read: no direct
update
Usage Highly structured repeptive
processing
Highly unstructured analytical
processing
Response Time Sub-second to 2-3 seconds Seconds to minutes, hours
Multi-Dimensional Data
Measures - numerical data being tracked
Dimensions - business parameters that define
a transaction
Example: Analyst may want to view sales
data (measure) by geography , by time , and by
product (dimensions)
Dimensional modeling is a technique for
structuring data around the business concepts
ER models describe “entities” and
“relationships”
Dimensional models describe “measures”
and “dimensions”
Dimensional Modeling Using
Special-Purpose Model (MDDB)
Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
– Pilot, Arbor Essbase, Gentia
Example
―Sales by product line over the past six months‖
―Sales by account between 1990 and 1995‖
Prod Code Time Code Acct Code Sales Qty
Account Info
Product Info
Time Info
Numerical Measures
Key columns joining fact table
to dimension tables
Fact table for
measures
Dimension tables
Example Cont’d
Sales
Time
Time Code
Quarter Code
Quarter Name
Week Code
Day Code
Day name
Account
Account Code
Key Account Code
Account Name
Account Type
Account Market
Geography
Geography Code
Region Code
Region Mgr
City Code
City Name
Product
Product Code
Product Name
Brand Mgr
Brand Code
Prod. Line Code
Prod. Line Name
Prod. Name
...
Geography Code
Time Code
Account Code
Product Code
Dollar Amount
Units
Dimensional Modeling Cont’d
Fact tables are fully normalized
Dimension tables are denormalized
– Repetitively stored for sake of simplicity and
performance
Product_Code Product_Name Product_Color Brand_Code 101 Widget Blue XYZ 102 Gadget Blue XYZ 103 Snicket Orange ABC 104 Graplit Orange ABC
Brand_Code Brand_Mgr XYZ J. Smith ABC T. Jones
Product_Code Product_Name Product_Color Brand_Code Brand_Mgr 101 Widget Blue XYZ J. Smith 102 Gadget Blue XYZ J. Smith 103 Gadget Green XYZ J. Smith 104 Snicket Orange ABC T. Jones 105 Graplit Orange ABC T. Jones
Advantages of Dimensional Modeling
Define complex, multi-dimensional data with
simple model
Reduces the number of phycial joins a query
has to process
Allows the data warehouse to evolve with rel.
low maintenance
HOWEVER! Star schema and rel. DBMS are
not the magic solution
– Query optimization is still problematic
Index Structures
Traditional access methods
– B-trees, hash tables, grid files, etc.
Popular in warehouses
– Inverted indexes (lists)
– Bit map indexes
– Join indexes
Bit Map Index
Developed for Model 204 DBMS in 1987
Id Name Age 1 Joe 20 2 Fred 20 3 Sally 21 4 Nancy 20 5 Tom 20 6 Pat 25 7 Dave 21
Using Bit Maps
Query:
– “Get people with age=20 and name =„Fred‟”
(1) Bit map for age =20: 1101100
(2) Bit map for name=„Fred‟: 0100000
(3) Answer is intersection: 0100000
Good if domain cardinality is small
Bit vectors can be compressed
Aggregation
Process by which low-level data is summarized in
advanced and placed into intermediate tables
Speeds up query processing, less ad-hoc
– “Show me total US sales for 1990”
How much to aggregate?
Data cube data model
– All possible aggregations along all dimensions
– Cells contain aggregated values
– How much of the cells in cube should be pre-computed?
Aggregation Cont’d
Special operators to navigate the hierarchies
– Roll-up: remove a dimension element
– e.g., Roll-up products to brands
– Drill-down (opposite of roll-up),
– Slice (defines a subcube)
– Various visualization ops (e.g., pivot)