Self Learning Note - Chapter 3 Extracting

Chapter 3 Extracting

Part 1: The Logical Data Map

Design Logical Before Physical

1.      Have a plan – Document the data lineage (the map can also be kept in a database; when the number of tables is large, that makes the requirements easier to search)

2.      Identify data source candidates – Decide which data source should be used

3.      Analyze source systems with a data-profiling tool – Data quality, completeness, fitness for purpose

4.      Receive walk-through of data lineage and business rules – Cleaning/Conforming

5.      Receive walk-through of data warehouse data model – Understand the physical data model

6.      Validate calculations and formulas

Inside the Logical Data Map (Excel or Database)

·        Target table name

·        Target column name

·        Table type (Dim or Fact)

·        SCD type

·        Source database

·        Source table name

·        Source column name

·        Transformation

·        Join Clause

·        Filter for source
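One row of the logical data map can be sketched as a small record type. The fields mirror the list above; the class name `MappingRow`, the helper `columns_for_target`, and all sample table and column names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappingRow:
    """One row of a logical data map (hypothetical structure)."""
    target_table: str
    target_column: str
    table_type: str            # "Dim" or "Fact"
    scd_type: Optional[int]    # slowly changing dimension type; None for facts
    source_database: str
    source_table: str
    source_column: str
    transformation: str = ""
    join_clause: str = ""
    source_filter: str = ""


def columns_for_target(rows, target_table):
    """Return the mapping rows that feed one target table."""
    return [r for r in rows if r.target_table == target_table]


# Sample rows; every name here is made up for illustration.
rows = [
    MappingRow("dim_customer", "customer_name", "Dim", 2,
               "crm", "customer", "cust_nm", transformation="TRIM, title case"),
    MappingRow("fact_sales", "amount", "Fact", None,
               "erp", "order_line", "line_amt"),
]

customer_rows = columns_for_target(rows, "dim_customer")
```

Whether the map lives in Excel or in a database table, keeping one row per target column makes it straightforward to search by target or by source.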

Build the Logical Data Map

·        Data Discovery Phase – Examine the source systems in detail, down to the record level

·        Collecting and Documenting Source Systems – Identify and list the source systems

·        Keeping Track of the Source Systems – Detailed information for every source system.
What is Reverse Engineering?

·        Analyze the source system: Using findings from data profiling

·        ER Model:

·        Data Content Analysis:

o  Null Values

o  Dates in nondate fields

·        Collecting Business Rules in the ETL Process
The business rules the ETL team works with differ from those of the data modeling team.
e.g. Status Code: the data modeling team defines a four-digit code, but the source may contain three-digit values. The ETL team needs to convert them to four-digit codes.
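The status code example could look like the sketch below. The notes do not say how a three-digit value becomes a four-digit code, so left-padding with zeros is an assumption here, and the function name `to_four_digit_status` is hypothetical.

```python
def to_four_digit_status(code):
    """Normalize a source status code to a four-digit string.

    Assumption: short codes are left-padded with zeros. The real rule
    must come from the business, not from this sketch.
    """
    text = str(code).strip()
    if not text.isdigit() or len(text) > 4:
        # Surface bad data instead of silently loading it.
        raise ValueError(f"unexpected status code: {code!r}")
    return text.zfill(4)
```

For example, `to_four_digit_status("123")` returns `"0123"`, while a value such as `"12A4"` raises an error so it can be routed to a reject file.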

Integrating Heterogeneous Data Sources

·        Identify the source systems

·        Understand the source systems (data profiling)
Uncover unexpected data anomalies and data-quality issues

·        Create record matching logic
Attribute -> Entity -> System
Column -> Table -> Database

·        Establish survivorship rules
Similar to what MDM does: match and merge source data.

·        Establish non-key attribute business rules
Keep data-lineage

·        Load conformed dimension
CDC
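The matching and survivorship steps above can be sketched as follows. This is a minimal illustration, not the method from the book: matching on a normalized email address, the source names `crm` and `billing`, and the priority order are all assumptions.

```python
from collections import defaultdict

# Assumed trust order: the lower number wins when sources disagree.
SOURCE_PRIORITY = {"crm": 0, "billing": 1}


def match_key(record):
    """Record-matching logic: a normalized email address (an assumption)."""
    return record["email"].strip().lower()


def merge_records(records):
    """Group records by match key, then apply survivorship per attribute."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    merged = {}
    for key, group in groups.items():
        # Most trusted source first.
        group.sort(key=lambda r: SOURCE_PRIORITY[r["source"]])
        survivor = {}
        lineage = {}
        for rec in group:
            for attr, value in rec.items():
                if attr == "source":
                    continue
                # Keep the first non-empty value, i.e. the one from the
                # most trusted source that actually has it.
                if value not in (None, "") and attr not in survivor:
                    survivor[attr] = value
                    lineage[attr] = rec["source"]
        survivor["_lineage"] = lineage  # which source each attribute came from
        merged[key] = survivor
    return merged


records = [
    {"source": "billing", "email": "Ann@Example.com ", "name": "Ann Lee", "phone": "555-0100"},
    {"source": "crm", "email": "ann@example.com", "name": "Ann B. Lee", "phone": None},
]
result = merge_records(records)
```

Recording the winning source per attribute in `_lineage` is one simple way to keep data lineage for the non-key attributes.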

Part 2: The Challenge of Extracting from Disparate Platforms

If the ETL tool can’t access the source directly, the data can be extracted to a flat file first.

Connect to Diverse Sources through ODBC

ODBC adds extra layers between the tool and the database, so it tends to perform worse than a native driver

Mainframe Sources

Too many scenarios to cover here; refer to the chapter for details later.

Flat File

·        Delivery of source data

·        Working/Staging tables

·        Preparation for bulk load
Needed when the ETL tool does not support in-stream bulk loading
For safekeeping or archiving

XML Sources

Web Log Sources

ERP System Sources

Part 3: Extracting Changed Data

Detect Changes

Using Audit Columns
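A minimal sketch of extracting by audit column, using an in-memory SQLite database so it runs anywhere. The table `customer`, the audit column `last_modified`, and the function `extract_changed` are hypothetical names; the approach also assumes the source system maintains the audit column reliably, which should be verified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Ann", "2024-01-01 08:00:00"),
        (2, "Bob", "2024-01-02 09:30:00"),
        (3, "Cy", "2024-01-03 10:15:00"),
    ],
)


def extract_changed(conn, last_extract_time):
    """Return rows modified after the previous successful extract."""
    cur = conn.execute(
        "SELECT id, name, last_modified FROM customer "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_extract_time,),
    )
    return cur.fetchall()


previous_watermark = "2024-01-01 23:59:59"
changed = extract_changed(conn, previous_watermark)
# The new high-water mark is the latest timestamp actually extracted.
new_watermark = changed[-1][2] if changed else previous_watermark
```

The timestamps are stored as ISO-formatted text, which compares correctly as strings; with a different format the comparison would need a real date type.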

Database Log Scraping or Sniffing

Timed Extracts

Not based on the date of the records inserted.

Process of Elimination

Compare the staging table with the source table to get the delta data
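The comparison can be sketched with two snapshots held as dictionaries keyed by primary key. The function name `diff_snapshots` and the sample rows are hypothetical; in practice the previous snapshot would be the staged copy of the last extract.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and return the delta."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


# Previous extract (staging) versus the current state of the source.
previous = {1: ("Ann", "NY"), 2: ("Bob", "LA"), 3: ("Cy", "SF")}
current = {1: ("Ann", "NY"), 2: ("Bob", "TX"), 4: ("Di", "DC")}

inserted, updated, deleted = diff_snapshots(previous, current)
```

Unlike the audit-column approach, this full comparison also detects deleted rows, at the cost of reading the whole source each time.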

Extract Tips

Use DISTINCT sparingly

The DISTINCT clause is slow

Use SET operators sparingly

UNION, MINUS and INTERSECT are SET operators. They are slow

Use HINT as necessary

E.g. force the query to use a particular index

Avoid NOT and ‘<>’

These conditions tend to cause a full table scan rather than index use

Avoid functions in the WHERE clause
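The last tip can be illustrated with SQLite, assuming a hypothetical `orders` table indexed on `order_date`. Wrapping the column in a function generally prevents a plain index on that column from being used for a lookup; rewriting the filter as a range on the bare column returns the same rows and leaves the index usable. The helper `query_plan` is also a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2023-12-31", 10.0), ("2024-01-15", 20.0),
     ("2024-06-30", 30.0), ("2025-01-01", 40.0)],
)

# Function applied to the indexed column.
with_function = (
    "SELECT id, amount FROM orders WHERE substr(order_date, 1, 4) = '2024'"
)

# Same filter rewritten as a range on the bare column.
with_range = (
    "SELECT id, amount FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'"
)


def query_plan(conn, sql):
    """Return the planner's description of how it would run the query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


rows_function = sorted(conn.execute(with_function).fetchall())
rows_range = sorted(conn.execute(with_range).fetchall())
plan_function = query_plan(conn, with_function)
plan_range = query_plan(conn, with_range)
```

Printing `plan_function` and `plan_range` shows how the planner treats each query; the exact wording of the plan varies by SQLite version, and other databases have their own equivalents of `EXPLAIN`.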

