Chapter 3 Extracting
Part 1: The Logical Data Map
Design Logical Before Physical
1. Have a plan – Document the data lineage (the plan can also be kept in a database; when the number of tables is large, that makes the requirements easy to search)
2. Identify data source candidates – Decide which data source should be used
3. Analyze source systems with a data-profiling tool – Data quality, completeness, fitness for purpose
4. Receive a walk-through of data lineage and business rules – Cleaning/Conforming
5. Receive a walk-through of the data warehouse data model – Understand the physical data model
6. Validate calculations and formulas
Inside the Logical Data Map (Excel or Database)
· Target table name
· Target column name
· Table type (Dim or Fact)
· SCD type
· Source database
· Source table name
· Source column name
· Transformation
· Join Clause
· Filter for source
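As a rough illustration, the fields above can be held as one record per target column. This is only a sketch: the class name `MappingRow`, the helper `sources_for`, and every table and column name in the sample rows are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappingRow:
    """One line of the logical data map (field names mirror the list above)."""
    target_table: str
    target_column: str
    table_type: str           # "Dim" or "Fact"
    scd_type: Optional[int]   # slowly changing dimension type; None for facts
    source_database: str
    source_table: str
    source_column: str
    transformation: str
    join_clause: str = ""
    source_filter: str = ""


def sources_for(rows, target_table):
    """Return the distinct (database, table) pairs that feed one target table."""
    return sorted({(r.source_database, r.source_table)
                   for r in rows if r.target_table == target_table})


# Hypothetical sample rows, for illustration only.
logical_map = [
    MappingRow("dim_customer", "customer_name", "Dim", 2,
               "crm", "customers", "name", "trim whitespace"),
    MappingRow("dim_customer", "status_code", "Dim", 1,
               "crm", "customers", "status", "pad to four digits",
               source_filter="deleted = 0"),
    MappingRow("fact_sales", "amount", "Fact", None,
               "erp", "order_lines", "line_amount", "none",
               join_clause="order_lines.order_id = orders.id"),
]

print(sources_for(logical_map, "dim_customer"))
```

Keeping the map as structured rows (in Excel, a database table, or code) is what makes lineage questions such as "which sources feed this table?" easy to answer.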
Build the Logical Data Map
· Data Discovery Phase – Detail to the records of source systems
· Collecting and Documenting Source Systems – Get source systems
· Keeping Track of the Source Systems – Detail Information for every source system.
What is Reverse Engineering?
· Analyze the source system: Using findings from data profiling
· ER Model:
· Data Content Analysis:
o Null Values
o Dates in nondate fields
· Collecting Business Rules in the ETL Process
The business rules the ETL team works with differ from those of the data modeling team.
e.g. Status Code: a four-digit code for the data modeling team, but the source may contain three-digit values. The ETL team needs to convert them to four-digit codes.
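The content checks and the status-code rule above might be sketched as follows. The function names are hypothetical, the date formats tested are just two common examples, and left-padding with zeros is an assumption: the notes only say the code must become four digits, not how.

```python
from datetime import datetime


def count_nulls(values):
    """Count entries that are None or blank strings."""
    return sum(1 for v in values if v is None or str(v).strip() == "")


def looks_like_date(value, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Return True if a free-text value parses as a date in one of the formats."""
    for fmt in formats:
        try:
            datetime.strptime(str(value).strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def normalize_status_code(raw, width=4):
    """Convert a numeric status code to a fixed width.

    Assumption: shorter codes are left-padded with zeros.
    """
    code = str(raw).strip()
    if not code.isdigit() or len(code) > width:
        raise ValueError(f"unexpected status code: {raw!r}")
    return code.zfill(width)


# Hypothetical free-text column that should not contain dates.
comments = ["call back", None, "2024-01-15", "", "03/02/2023"]
null_count = count_nulls(comments)
date_like = [v for v in comments if v is not None and looks_like_date(v)]

print(null_count, date_like, normalize_status_code("123"))
```

Raising on unexpected codes rather than silently passing them through keeps the business rule visible when the source data drifts.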
Integrating Heterogeneous Data Sources
· Identify the source systems
· Understand the source systems (data profiling)
Uncover unexpected data anomalies and data-quality issues
· Create record-matching logic
Attribute -> Entity -> System
Column -> Table -> Database
· Establish survivorship rules
Similar to what MDM does: match and merge source data.
· Establish non-key attribute business rules
Keep data-lineage
· Load conformed dimension
CDC
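The matching and survivorship steps above can be sketched in a few lines. Everything here is an assumption made for illustration: the matching rule (same e-mail address, ignoring case and whitespace), the source priority order, and the sample records.

```python
# Hypothetical survivorship rule: a lower number means a more trusted source.
SOURCE_PRIORITY = {"crm": 0, "billing": 1}


def match_key(record):
    """Hypothetical matching rule: same e-mail, ignoring case and whitespace."""
    return record["email"].strip().lower()


def survive(records):
    """Merge matched records attribute by attribute.

    For each attribute, the first non-empty value from the most trusted
    source wins. The lineage dict records which source supplied each value.
    """
    ordered = sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]])
    merged, lineage = {}, {}
    for rec in ordered:
        for attr, value in rec.items():
            if attr == "source":
                continue
            if attr not in merged and value not in (None, ""):
                merged[attr] = value
                lineage[attr] = rec["source"]
    return merged, lineage


def conform(records):
    """Group records by match key, then apply the survivorship rule per group."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return {key: survive(group) for key, group in groups.items()}


records = [
    {"source": "billing", "email": "ann@example.com", "name": "A. Lee", "phone": "555-0100"},
    {"source": "crm", "email": "Ann@Example.com", "name": "Ann Lee", "phone": ""},
    {"source": "crm", "email": "bob@example.com", "name": "Bob Ray", "phone": "555-0199"},
]

conformed = conform(records)
merged_ann, lineage_ann = conformed["ann@example.com"]
print(merged_ann)
print(lineage_ann)
```

Returning the lineage alongside the merged record is one simple way to keep data lineage for non-key attributes.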
Part 2: The Challenge of Extracting from Disparate Platforms
If the tool can't access the source, the data can be loaded to a flat file first.
Connect to Diverse Sources through ODBC
ODBC is flexible but comparatively slow, since every call passes through the ODBC manager and driver layers
Mainframe Sources
There are too many scenarios to summarize here; refer to the source material for details later.
Flat File
· Delivery of source data
· Working/Staging tables
· Preparation for bulk load
Needed when the ETL tool does not support in-stream bulk loading
For safekeeping or archiving
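A minimal sketch of landing an extract in a delimited flat file, using only the Python standard library. SQLite stands in for the real source system, and the function name, table, file name, and pipe delimiter are all assumptions chosen for the example.

```python
import csv
import os
import sqlite3
import tempfile


def extract_to_flat_file(conn, query, path, delimiter="|"):
    """Stream the rows of a query into a delimited file with a header line.

    Returns the number of data rows written.
    """
    cur = conn.execute(query)
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter=delimiter)
        writer.writerow([col[0] for col in cur.description])
        for row in cur:
            writer.writerow(row)
            count += 1
    return count


# In-memory SQLite database standing in for the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

path = os.path.join(tempfile.mkdtemp(), "customers.dat")
rows_written = extract_to_flat_file(conn, "SELECT id, name FROM customers ORDER BY id", path)

with open(path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(rows_written, lines)
```

The resulting file can be handed to a database bulk loader or simply kept as an archive of what was extracted.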
XML Sources
Web Log Sources
ERP System Sources
Part 3: Extracting Changed Data
Detect Changes
Using Audit Columns
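For the audit-column approach, a minimal sketch follows. It assumes the source table carries a `last_modified` column stored as ISO-8601 text (so that string comparison matches chronological order) and that the time of the last successful extract is saved somewhere between runs. The table, column, and function names are hypothetical, and SQLite stands in for the source database.

```python
import sqlite3


def extract_changed_rows(conn, last_extract_ts):
    """Return rows modified after the last successful extract."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY id",
        (last_extract_ts,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-03-01T08:00:00"),
        (2, 25.5, "2024-03-02T09:30:00"),
        (3, 40.0, "2024-03-03T11:15:00"),
    ],
)

# Hypothetical timestamp of the previous successful run.
changed = extract_changed_rows(conn, "2024-03-01T23:59:59")
print([row[0] for row in changed])
```

This only works if the audit column is reliably maintained by the source system; rows updated without touching `last_modified` would be missed.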
Database Log Scraping or Sniffing
Timed Extracts
Not based on the date the records were inserted.
Process of Elimination
Compare the stage table and the source table to get the delta data
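The process of elimination can be sketched as a comparison between the previous extract kept in staging and the current source. Here both are modelled as dictionaries keyed by primary key; the function name and the sample data are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns three dicts: rows that are new, rows whose values changed,
    and rows that disappeared from the source.
    """
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


# Hypothetical snapshots: key -> (name, status code).
stage = {1: ("Ann", "0123"), 2: ("Bob", "0200"), 3: ("Cy", "0300")}
source = {1: ("Ann", "0123"), 2: ("Bob", "0250"), 4: ("Dee", "0400")}

inserted, updated, deleted = diff_snapshots(stage, source)
print(inserted, updated, deleted)
```

Unlike the audit-column approach, this comparison also detects deleted rows, at the cost of reading the full source each time.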
Extract Tips
Use DISTINCT sparingly
The DISTINCT clause is slow
Use SET operators sparingly
UNION, MINUS and INTERSECT are SET operators. They are slow
Use HINT as necessary
E.g. force the query to use a particular index
Avoid NOT and ‘<>’
They scan the full table rather than utilize indexes
Avoid functions in the WHERE clause
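To illustrate the last tip, the sketch below rewrites a predicate that wraps the column in a function as a plain range predicate on the bare column, which gives the optimizer a chance to use an index. SQLite is used only because it ships with Python; the table and index names are hypothetical, and the plan text differs between databases and versions, so the plans are printed rather than asserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-12-31", 10.0),
        (2, "2024-01-01", 20.0),
        (3, "2024-06-15", 30.0),
        (4, "2025-01-01", 40.0),
    ],
)

# Function applied to the column: the index on order_date cannot be used
# to narrow the search.
wrapped = (
    "SELECT id FROM orders "
    "WHERE substr(order_date, 1, 4) = '2024' ORDER BY id"
)

# Same filter expressed as a range on the bare column.
ranged = (
    "SELECT id FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01' ORDER BY id"
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


wrapped_rows = conn.execute(wrapped).fetchall()
ranged_rows = conn.execute(ranged).fetchall()

print(wrapped_rows == ranged_rows)
print(plan(wrapped))
print(plan(ranged))
```

Both queries return the same rows; the difference is only in how the database is able to find them.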