《精通數據倉庫設計》中英對照_第2章


2 關係的基本概念

每一種數據建模技術都有一套自己的術語、定義與技術。這些習語允許我們理解複雜與困難的概念,並用它們設計複雜的數據庫。這本書提供關係數據建模技術,用於開發數據倉庫數據模型。爲了這個目的,這一章介紹關係數據建模的術語和專有名詞。然後,接着介紹規範化技術和不同的規範化等級的規則(例如,第一範式、第二範式、第三範式)及每個範式的目標。我們會給出示例數據模型,演示規範化的過程。本章最後,討論規範化數據模型和它的好處。

在我們進入各種數據倉庫數據模型之前,我們必須理解數據模型爲什麼重要,及你在開發BI環境時會創建的各種模型。

Chapter 2 Fundamental Relational Concepts

Every data-modeling technique has its own set of terms, definitions, and techniques. This vernacular permits us to understand complex and difficult concepts and to use them to design complex databases. This book applies relational data-modeling techniques for developing the data warehouse data model. To that end, this chapter introduces the terms and terminology of relational data modeling. It then continues with an overview of normalization techniques and the rules for the different normalization levels (for example, first, second, and third normal form) and the purpose for each. Sample data models will be given, showing the progression of normalization. The chapter ends with a discussion of normalization of the data model and the associated benefits.

Before we get into the various types of data models we use in creating a data warehouse, it is necessary to first understand why a data model is important and the various types of data models you will create in developing your BI environment.

 

爲什麼需要數據模型

模型是一個事物的抽象與表現,它的外觀和行爲與原物整體或部分相象,例如概念車、建築模型等等。所有模型都有一套共同的目標。設計模型用於幫助人們想象各部件如何連接起來,幫助人們理解怎麼使用或應用最終的產品,減少開發的風險,保證建立產品的人與需求產品的人有同樣的期望。讓我們更詳細的介紹這些好處:

■■模型減少整體風險,保證最終產品的需求得到滿足。檢查最終產品的實體模型,有意的用戶可以做出可靠的決定,即產品是否能真正的滿足他們的需要與目標。

■■模型幫助開發者想象最終產品與其他系統或功能的接口。如果創建了一個詳細的模型,可以可靠估計創建這個接口難度及可行性。(在數據倉庫環境下,這個接口包括數據獲取與數據交付程序,何時、何地執行清洗、審計、數據維護過程等等)。

■■模型幫助所有有關人員理解怎樣與最終的產品關聯起來,及如何與它們的工作聯繫起來。模型同時幫助開發者理解最終觀衆需要的技能,及需要進行哪些培訓,以保證對產品正確使用。

■■最後,模型保證創建產品的人和需求產品的人對最終的成果有同樣的期望。通過檢查模型,錯失機會的情況會大大減少,各部分人信任和信心增加,大大提高最終產品的滿意度。模型是如此重要,尤其是在承擔一個複雜的項目(如BI)的建設時,我們建議在模型沒有完成之前,所有的項目都要擱置或者延遲。

Why Do You Need a Data Model?

A model is an abstraction or representation of a subject that looks or behaves like all or part of the original. Examples include a concept car and a model of a building. All models have a common set of objectives. They are designed to help people envision how the parts fit together, help people understand how to use or apply the final product, reduce the development risk, and ensure that the people building the product and those requesting it have the same expectations. Let’s look more closely at these benefits:

■■ A model reduces overall risk by ensuring that the requirements of the final product will be satisfactorily met. By examining a “mock-up” of the ultimate product, the intended users can make a reasonable determination of whether the product will indeed fulfill their needs and objectives.

■■ A model helps the developers envision how the final product will interface with other systems or functions. The level of effort needed to create the interfaces and their feasibility can be reasonably estimated if a detailed model is created. (In the case of a data warehouse, these interfaces include the data acquisition and the data delivery programs, where and when to perform data cleansing, audits, data maintenance processes, and so on.)

■■ A model helps all the people involved understand how to relate to the ultimate product and how it will pertain to their work function. The model also helps the developers understand the skills needed by the ultimate audience and what training needs to occur to ensure proper usage of the product.

■■ Finally a model ensures that the people building the product and those requesting it have the same expectations about the ultimate outcome of the effort. By examining the model, the potential for a missed opportunity is greatly reduced, and the belief and trust by all parties that the ultimate product will be satisfactory is greatly enhanced. We feel that a model is so important, especially when undertaking a set of projects as complex as building a business intelligence (BI) environment, that we recommend a project be halted or delayed until the justification for a solid set of models is made, signed off on, and funded.

 

關係數據建模對象

在理解了爲何需要模型之後,讓我們回到這個特殊的模型——數據模型。在描述各種層次的模型之前,我們先提出一套普遍的術語。

注意

本書無意替代很多有名的、權威的關於通用數據建模的書,這一節僅僅給新手提供一些我們貫通全書的術語。如果需要更多細節,有大量的數據建模書籍參考,也可以參考本書“推薦閱讀”介紹的書籍。

 

Relational Data-Modeling Objects

Now that we understand the need for a model, let’s turn our attention to a specific type of model—the data model. Before describing the various levels of models, we need to come up with a common set of terms for use in describing these models.

 

NOTE

This book is not intended to replace the many significant and authoritative books written on generic data modeling; rather this section should only serve as a refresher on some of the more significant terms we will use throughout the book. If more detail is needed, please refer to the wealth of data-modeling books at your disposal and listed in the “Recommended Reading” section in this book.

 

 

主題

第一個描述的術語是主題。你會看到我們使用面向主題的數據倉庫和主題域模型。這兩種情況下,術語“主題”指企業的一個數據對象,或者是一個主要的數據分類。“主題域”指企業的一個數據子集,包含相關的實體和關係。顧客、銷售、產品等都是主題域。

 

Subject

The first term to describe is a subject. You will see us refer to a subject-oriented data warehouse and a subject area model. In both cases, the term subject refers to a data subject or a major category of data relevant to the business. A subject area is the subset of the enterprise’s data and consists of related entities and relationships. Customers, Sales, and Products are examples of subject areas.

實體

實體一般定義爲一個人、地方、事情、概念、事件,企業對它感興趣並且能夠抓取及存貯信息。在一個模型內,實體是唯一的。對第三範式模型,有一個,也只有一個詞目表示一個實體。在傳統的 Codd 與 Date 意義上的實體關係圖(ERD)或邏輯數據建模中,有四種類型的實體,用於創建邏輯或業務數據模型與數據倉庫模型(見圖2.1)。

■■首要的或基礎實體定義爲不依賴於任何其他實體存在的實體。一般來說,每一個主題域由一個同名的首要實體表示(除非主題域名是複數,而實體名是單數),例如顧客、銷售、產品。這些實體是一些獨立的數據的分組,這些數據是各異的。

■■子類實體是一個父實體(超實體)的邏輯部分或分類。例如,客戶的子類實體有零售客戶、批發客戶。子類實體往往繼承父實體的特徵(屬性)和關係,也就是說,零售客戶會繼承父實體的所有屬性,客戶(例如客戶號、客戶姓名),還有關係,例如“客戶獲得產品”。

■■表示屬性或特徵的實體,是一類依賴於其他實體的實體。對於它的父實體,會有一組發生多次的數據。顧客地址是一個表示屬性的實體,因爲每一個顧客可能有多個地址。

■■聯合或交叉實體是一個依賴於兩個或多個實體的實體,記錄交叉點上的數據。如,訂單是一個聯合實體,它的主鍵由兩個父實體(客戶與產品)的鍵加一個限定詞(如日期)組成,其他屬性有產品數量、購買日期等。

使用這四類實體,我們有了所有創建業務或數據倉庫數據模型的部件。我們會在本章的下一節描述這些模型,並在第3章與第4章描述創建的步驟。

Entity

An entity is generally defined as a person, place, thing, concept, or event in which the enterprise has both the interest and the capability to capture and store information. An entity is unique within the data model. For the third normal form data model, there is one and only one entry representing that entity. In entity-relationship diagrams (ERD) or logical data modeling in the classical Codd and Date sense, there are four types of entities from which to build logical or business data models and data warehouse models (see Figure 2.1).

■■ A Primary or Fundamental Entity is defined as an entity that does not depend on any other entity for its existence. Generally each subject area is represented by a primary entity that has the same name (except that the subject area name is pluralized and the entity name is singular), such as Customer, Sale, and Product. These entities are a grouping of dependent data occurring singularly.

■■ A Subtype Entity is a logical division or category of a parent (supertype) entity. Examples of subtypes for the Customer entity are Retail Customer and Wholesale Customer. The subtypes always inherit the characteristics, or attributes and relationships, of the parent entity; that is, the Retail Customer will inherit any attributes that describe the more generic parent entity, Customer (for example, Customer ID, Customer Name), as well as relationships such as “Customer acquires Product.”

■■ An Attributive or Characteristic Entity is an entity whose existence depends on another entity. It is created to handle a group of data that could occur multiple times for each instance of its parent entity. Customer Address is an attributive entity of Customer since each customer may have multiple addresses.

 

■■ An Associative or Intersection Entity is an entity that is dependent upon two or more entities for its existence, and that records data at the point of intersection. Order is an associative entity. Its key is composed of the keys of the two parent entities—Customer and Item—and a qualifier such as Date. Attributes that could be retained include the Quantity of the Item and Purchase Date.

With these four types of entities, we have all we will need in terms of components to create the business and data warehouse data models. We describe these models in the next section of this chapter and go through the steps to create them in Chapters 3 and 4.
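The four entity types can be made concrete with a small relational sketch. The example below is only an illustration: it uses Python's built-in `sqlite3` module, and every table and column name (`customer`, `retail_customer`, `customer_address`, `customer_order`, `loyalty_tier`, and so on) is a hypothetical choice loosely following the Customer, Item, and Order examples in the text, not a design taken from the book.

```python
import sqlite3

# Hypothetical tables, one per entity type described above.
DDL = """
CREATE TABLE customer (                 -- primary (fundamental) entity
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE item (                     -- primary (fundamental) entity
    item_id    INTEGER PRIMARY KEY,
    item_color TEXT
);
CREATE TABLE retail_customer (          -- subtype entity: shares the key of its parent
    customer_id  INTEGER PRIMARY KEY REFERENCES customer(customer_id),
    loyalty_tier TEXT
);
CREATE TABLE customer_address (         -- attributive entity: many rows per customer
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    address_type TEXT NOT NULL,         -- e.g. bill-to, ship-to
    street       TEXT,
    PRIMARY KEY (customer_id, address_type)
);
CREATE TABLE customer_order (           -- associative entity: both parent keys plus a qualifier
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    item_id     INTEGER NOT NULL REFERENCES item(item_id),
    order_date  TEXT NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (customer_id, item_id, order_date)
);
"""


def build_schema():
    """Create the example schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def primary_key_columns(conn, table):
    """Return the primary key columns of a table, in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk);
    # pk is the 1-based position in the key, or 0 for nonkey columns.
    keyed = sorted((row[5], row[1]) for row in rows if row[5] > 0)
    return [name for _, name in keyed]
```

Calling `primary_key_columns(build_schema(), "customer_order")` shows how the associative entity's key is assembled from the keys of its two parents plus the date qualifier.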

 

元素或屬性

元素或屬性是關於實體最低層次的信息。它表示一個具體的信息片或一個具體實體的特性。元素或屬性在一個實體內實現幾個目標:

■■一個主鍵唯一標示一個實體,用於物理數據庫在存貯和訪問時定位一條記錄,如客戶號是客戶實體的主鍵,產品號是產品實體的主鍵。

Element or Attribute

An element or attribute is the lowest level of information relating to any entity. It models a specific piece of information or a property of a specific entity. Elements or attributes serve several purposes within an entity.

■■ A primary key serves to uniquely identify the entity and is used in the physical database to locate a record for storage or access. Examples include Customer ID for the Customer entity and Item ID for the Item entity.


(個人說明,圖2.1 Primary Entity Item 應該畫錯了,主鍵應該是ItemId,其他屬性也錯了)。

 

注意

鍵可能是單個元素,也可能是多個元素的組合,這種情況,叫複合鍵。主鍵可能有,也可能沒有意義(也即智能)。必須注意智能主鍵,例如一個表示地理區域或部門的帳目編碼常常讓人迷惑及不正確使用,關於一個好的鍵的規則討論請參見補充材料。

■■外鍵存在於一對實體父子關係中,子實體的外鍵是父實體的主鍵,這樣把兩個實體連接起來。例如,客戶實體的客戶號,也存在訂單實體內,連接這兩個實體。

■■非鍵元素或屬性是一個實體不用於唯一定義實體的部分,用於進一步描述實體的特徵。例如,客戶實體的非鍵屬性包括客戶姓名、客戶類型等;產品實體的非鍵屬性有產品顏色、產品數量等。

 

NOTE

The key may be a single element or it may consist of multiple elements that are combined, in which case it is called a concatenated key. Finally, primary keys may or may not have meaning or intelligence. Care must be taken with intelligent primary keys. For example, an Account Code that also depicts geographic area or department is both confusing and erroneous in this data model. See the sidebar for further rules for good keys.

■■ A foreign key is a key that exists because of a parent-child relationship between a pair of entities. The foreign key in the child entity is the primary key in the parent entity and links the two entities together. For example, the Customer ID of the Customer entity is also found in the Order entity, relating the two.

■■ A nonkey element or attribute is not needed to uniquely identify the entity but is used to further describe or characterize information about the entity. Examples of nonkey elements or attributes are Customer Name, Customer Type, Item Color, and Item Quantity.
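As a rough illustration of how the three kinds of elements behave in a physical database, the sketch below declares a primary key, a foreign key, and nonkey columns in SQLite through Python's `sqlite3` module. The table names, column names, and the helper `try_insert_order` are assumptions made for this example only.

```python
import sqlite3


def make_db():
    """Build a two-table example: customer (parent) and customer_order (child)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE customer ("
        " customer_id INTEGER PRIMARY KEY,"   # primary key
        " customer_name TEXT)"                # nonkey attribute
    )
    conn.execute(
        "CREATE TABLE customer_order ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"  # foreign key
        " order_date TEXT)"                   # nonkey attribute
    )
    return conn


def try_insert_order(conn, order_id, customer_id):
    """Return True if the order row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO customer_order VALUES (?, ?, '2024-01-01')",
            (order_id, customer_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With one customer row whose `customer_id` is 1, an order for customer 1 is accepted while an order for a nonexistent customer 99 is rejected, which is the parent-child link the foreign key is meant to protect.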

 

一個好的鍵的特徵

下面列出一個“行爲良好”鍵的特徵,在操作型系統裏,這些鍵在整個生命週期是可維護的、可持續的,數據倉庫也一樣:

鍵在整個集成範圍內不能爲空。在任何情況下,不能有空鍵值。

鍵在整個集成範圍內要唯一。在任何情況下,不能產生重複的鍵值。

鍵是由設計來決定唯一性,而不是由環境來決定唯一性。鍵產生器通過了仔細的思考和整個環境的測試。

鍵是持久穩定的。這在數據倉庫內是強制的,因爲數據有一個很長的生命期。

鍵使用可管理的格式,也就是說,在創建或維護鍵結構時沒有不適當的花銷,它由直接的整數或字符串組成,沒有嵌入符號或特殊字符。

鍵不應該隱含智能,而只是一個普通的串。(它可能基於某些智能創建,但是,一旦創建,內置在鍵內的智能永不再使用)。

Characteristics of a Good Key

The following are characteristics of “well-behaved” keys—those keys that are maintainable and sustainable over the lifetime of the operational system and therefore, the data warehouse:

The key is not null over the scope of integration. It is imperative that there can never be a situation or event that could cause a null key.

The key is unique over the scope of integration. It is also imperative that there can never be a situation where duplicate keys could be generated.

The key is unique by design not by circumstance. Key generation has been carefully thought out and tested under all circumstances.

The key is persistent over time. This is mandatory in the data warehouse environment where data has a very long lifetime.

The key is in a manageable format, that is, there is no undue overhead produced in the creation or maintenance of the key structures. It consists of straightforward integers or character strings, no embedded symbols or odd characters.

The key should not contain embedded intelligence but rather is a generic string. (It may be created based on some intelligence but, once created, the intelligence embedded in the key is never used.)
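Some of these characteristics can be checked mechanically against a sample of key values. The following is a minimal sketch under stated assumptions: `key_problems` is a hypothetical helper, and "manageable format" is interpreted here as plain letters and digits only. Persistence over time and uniqueness by design cannot be verified from a single snapshot of data, so they are not covered.

```python
import re

# Assumption: a "manageable" key consists only of letters and digits.
SIMPLE_KEY = re.compile(r"[A-Za-z0-9]+")


def key_problems(keys):
    """Return human-readable problems found in a sequence of candidate key values."""
    problems = []
    seen = set()
    for position, key in enumerate(keys):
        if key is None or key == "":
            problems.append(f"row {position}: null key")
            continue
        text = str(key)
        if text in seen:
            problems.append(f"row {position}: duplicate key {text!r}")
        seen.add(text)
        if not SIMPLE_KEY.fullmatch(text):
            problems.append(f"row {position}: key {text!r} has embedded symbols")
    return problems
```

A check like this is only a screening aid; an empty result does not prove that the key generator is sound under all circumstances.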

關係

關係表示兩個實體間的業務規則。關係用於描述兩個實體是如何自然地關聯起來。圖2.1中,客戶下訂單,訂單包含產品,這些都是關係。在描述企業業務規則時,關係有不同的特徵:

■■基數表示一個實體關聯到另一個實體的最大對象數。一般地,表示爲“1”或者“多”。圖2.1中,一個用戶有多個地址(帳單地址、送貨地址等),而每一個地址屬於一個用戶。

■■可選性或必須性指示一個實體對象是否必須參與到另一個關係,這種特徵告訴你關係中的最小對象數。

有幾種不同的關係類型:

■■確定的關係是指父實體的主鍵成爲子實體主鍵的一部分。

■■非確定的關係是指父實體的主鍵成爲子實體的非鍵屬性。這種關係類型的一個示例是遞歸關係,即一個實體與自身關聯。一個客戶關聯到其他客戶(例如,公司的子公司,家庭或者家族)就是遞歸關係。這用於表示一個實體對象關聯到同一個實體的另一個對象,如圖2.2。

在數據模型中,一個關係包含一個表示業務規則的動詞(安排、擁有、包含等)、一個基數、可選性或必要性。

 

Relationships

A relationship documents the business rule associating two entities together. The relationship is used to describe how the two entities are naturally linked to each other. Customer places Order and Order is for Items are examples of relationships in Figure 2.1.

There are different characteristics of relationships used in documenting the business rules of the enterprise:

■■ Cardinality denotes the maximum number of occurrences of one entity that can be related to another entity. Usually these are expressed as “one” or “many.” In Figure 2.1, a Customer has many addresses (Bill-to, Ship-to) and every address belongs to one customer.

■■ Optionality or modality indicates whether an entity occurrence must participate in a relationship. This characteristic tells you the minimum number (zero or optional) of occurrences in the relationship.

There are also different types of relationships:

■■ An identifying relationship is one in which the primary key of the parent entity becomes a part of the primary key of the child entity.

■■ A nonidentifying relationship is one in which the primary key of the parent entity becomes a nonkey attribute of the child entity. An example of this type of relationship is a recursive relationship, that is, a situation in which an entity is related to itself. Customers who are related to other customers (for example, subsidiaries of corporations and families or households) are examples of recursive relationships. These are used to denote an entity occurrence that is related to another entity occurrence of the same entity. See Figure 2.2 for more on these types of relationships.

 The components of a relationship in a data model consist of a verb phrase denoting the business rule (places, has, contains), the cardinality, and the modality or optionality of the relationship.
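The relationship types above map fairly directly onto key declarations. The sketch below, again using Python's `sqlite3` module, shows one identifying relationship (Customer has Customer Address) and one recursive, nonidentifying, optional relationship (Customer is a subsidiary of Customer). The schema, the sample rows, and the helper functions are all hypothetical and exist only to illustrate the definitions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer (
    customer_id        INTEGER PRIMARY KEY,
    customer_name      TEXT NOT NULL,
    -- Recursive, nonidentifying, optional: a nullable nonkey column pointing back at customer.
    parent_customer_id INTEGER REFERENCES customer(customer_id)
);
CREATE TABLE customer_address (
    -- Identifying, mandatory: the parent key is part of the child primary key.
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    address_type TEXT NOT NULL,
    PRIMARY KEY (customer_id, address_type)
);
"""


def load_example():
    """Create the schema and load a parent company with two subsidiaries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [(1, "Acme Corp", None), (2, "Acme East", 1), (3, "Acme West", 1)],
    )
    conn.executemany(
        "INSERT INTO customer_address VALUES (?, ?)",
        [(1, "bill-to"), (1, "ship-to"), (2, "ship-to")],
    )
    conn.commit()
    return conn


def subsidiaries(conn, parent_id):
    """Follow the recursive relationship one level down."""
    rows = conn.execute(
        "SELECT customer_name FROM customer"
        " WHERE parent_customer_id = ? ORDER BY customer_id",
        (parent_id,),
    )
    return [name for (name,) in rows]


def address_count(conn, customer_id):
    """Count the addresses on the 'many' side of Customer has Customer Address."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM customer_address WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return count
```

In this sketch the cardinality is one customer to many addresses, while the nullable `parent_customer_id` expresses optionality: a customer may, but need not, have a parent. Allowing a customer with zero addresses is a choice made for the example, not a rule stated in the text.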

 

數據模型的類型

數據模型是某個環境下數據的抽象與表示,它是數據需求的集合及驗證,也是溝通的方法,幫助創建準確、有效、高效的物理數據庫。數據模型由實體、屬性、與關係組成。在一個完整的數據模型內,要定義所有這些項目的元數據,如定義與物理特徵等。

如我們早已申明的那樣,數據模型對BI 的成功起關鍵作用,從一開始到長時間的維護與持續過程。既然數據模型如此重要,爲什麼沒有總是被開發呢?有很多的原因:

■■不容易。創建數據模型需要IT技術組與企業團體付出很大的努力,企業必須有經過數據建模訓練的員工或者外部資源。

■■需要規章和工具。一旦獲得了數據建模技術,必須保持一致和依從,企業必須創建一系列詳細的標準用於創建數據模型,給出指定的標準的示例,衝突解決過程、數據管家角色和責任(更詳細的討論見第3章)、元數據獲取和維護過程。

■■需要相當的業務參與。一個公司的數據模型必須,再說一遍,必須有業務團體參與。畢竟,我們是在設計重要的部件,它會成爲業務團體最終的競爭武器,我們是在爲業務人員創建大量的信息財富。

■■它延遲了看得見的工作。數據建模沒有爲企業產生切實的產品,模型提供給技術隊伍關於業務與需求的信息。老笑話這樣說“開始編寫代碼——我要繼續發現它們需要什麼”

■■它需要廣闊的視野。數據倉庫的數據建模必須圍繞整個企業,它會用於創建最終的決策系統——數據集市——用於所有戰略分析。因此,必須要有跨部門、跨流程的長遠眼光。

■■數據模型的好處往往不會在第一個項目中體現。真正的產出在於重用和企業遠景。

 

Types of Data Models

A data model is an abstraction or representation of the data in a given environment. It is a collection and a subsequent verification and communication method to fully document the data requirements used in the creation of accurate, effective, and efficient physical databases. The data model consists of entities, attributes, and relationships. Within the complete data model, appropriate meta data, such as definitions and physical characteristics, is defined for each of these.

As we stated earlier, we feel that the data models you create for your BI environment are critical to the overall success of your initiative as well as the long term maintenance and sustainability of the environment. If the data model is so important, why isn’t it always developed? There are a number of reasons for this:

■■ It’s not easy. Creating the data model takes significant effort from the IT technical staff and business community. Data modelers must be either hired or internal resources trained in the disciplines of data modeling.

■■ It requires discipline and tools. Once the techniques for data modeling are learned, they must be applied with conformity and compliance. The enterprise must create a set of documents detailing the standards it will use in the creation of its data models. Examples of these are naming standards, conflict resolution procedures, data steward roles and responsibilities (see Chapter 3 for more on this topic), and meta data capture and maintenance procedures.

■■ It requires significant business involvement. A company’s data model must—repeat—must have business community involvement. We are, after all, designing the critical component of the business community’s ultimate competitive weapon. It is for them that we are creating this vast wealth of information.

■■ It postpones the visible work. Data modeling does not create tangible products that can be used by the business community. The models provide the technical staff creating the environment with information about the business environment and some requirements. The old joke goes something like this: “Start coding—I’ll go find out what they want.”

■■ It requires a broad view. The data model for the BI environment must encompass the entire enterprise. It will be used to create the ultimate decision-making components, the data marts, for all strategic analysis. Therefore, it must have a multidepartment and multiprocess perspective.

■■ The benefits of a data model are often not realized with the first project. The real productivity comes in its reuse and its enterprise perspective.

說了這些,沒有開發數據模型的影響是什麼?

■■抽取想要的數據非常困難。很容易會導致不能滿足用戶的期望,或者只能部分滿足。

■■要花費相當的努力在接口上,而通常只能提供很小,甚至沒有商業價值。

■■環境複雜性增長迅速。當缺乏數據模型作爲地圖指南,瞭解數據倉庫裏已經有什麼,還需要什麼,會變得非常困難,或者說不可能。

■■幾乎必然導致缺乏數據集成,因爲你無法直觀的看到事情是如何結合在一起。數據倉庫的開發會沒有效率,甚至可能不可行。

■■最重要的障礙之一是:沒有數據模型,作爲資產的數據不能有效地管理。

Having said all this, what is the impact of not developing a data model?

■■ It becomes very difficult to extract desired data. It is easy to implement something that either misses the users’ expectations or only partially satisfies them.

■■ Significant effort is spent on interfaces that generally provide little or no business value.

■■ The environment’s complexity increases significantly. When there is no data model to serve as a roadmap, it becomes difficult, if not impossible, to know what you already have in your data warehouse and what needs to be added.

■■ It virtually guarantees lack of data integration because you cannot visualize how things fit together. Data warehouse development will not be effective and efficient, and may not even be feasible.

■■ One of the most significant drawbacks is that, without a data model, data will not be effectively managed as an asset.

 

現在,已經說明了需要數據模型,數據倉庫的數據模型有哪些類型呢?圖2.3 顯示了我們推薦的數據模型類型,及模型之間的交互作用。以下章節描述一個完全的、成功的、可維護的BI環境下的不同的數據模型。重要的一點是要注意雙向箭頭,指向低級模型的箭頭,暗示集成上一級模型的特徵(基本實體、屬性與關係),這保證我們使用同樣的曲譜表示格式、定義與業務規則;向上指的箭頭暗示我們在把模型變成現實時常常發生變化,這些變化必須反映並結合到後續的模型之中,使模型保持生命力。

Now, having explained the need for data models, what types of data models will you need for your data warehouse implementation? Figure 2.3 shows the types of data models we recommend and the interaction between the models. The following sections describe the different data models necessary for a complete, successful, and maintainable BI environment. It is important to note the two-way arrows. The arrows pointing to the next lower level of models indicate that the characteristics (basic entities, attributes, and relationships) are inherited from the upper model. This ensures that we are all singing from the same sheet of music in terms of format, definition, and business rules. The upward-pointing arrows indicate that changes constantly occur as we implement these models into reality and that the changes must be reflected or incorporated into the preceding models for them to remain viable.

 

主題域模型

主題域是企業感興趣的主要事情的分組,這些感興趣的事情最終會描述成實體。一般企業有15到20個主題域。主題域模型一個好處就是能夠快速開發(一般在幾天之內)。這個初始模型作爲業務數據模型的藍圖,以此爲基礎進行提煉。主題域模型能夠快速開發的原因之一是一些主題在很多公司都有,公司可以從這些現成的主題域開始。

Subject Area Model

Subject areas are major groupings of things of interest to the enterprise. These things of interest are eventually depicted in entities. The typical enterprise has between 15 and 20 subject areas. One of the beauties of a subject area model is that it can be developed very quickly (typically within a few days). The initial model serves as a blueprint for the business data model, and refinements in the subject area model should be expected. One of the reasons that the subject area model can be developed quickly is that there are some subjects that are common to many organizations, and a company embarking on the development of a subject area model can begin with these.


這些主題在主題域模型裏保持一致的標準:

■■主題域名稱使用複數名詞。

■■主題域的定義指明過去、現在、未來。

■■主題域大致相當於同樣層次的抽象。

■■結構化的定義,使主題域互相排斥。

These subject areas conform to standards governing the subject area model:

■■ Subject area names are plural nouns.

■■ Definitions apply implicitly to the past, present, and future.

■■ Subject areas are at approximately the same level of abstraction.

■■ Definitions are structured so that the subject areas are mutually exclusive.

主題域模型的好處

不管主題域模型創建如何快速,也只有在有好處的時候才值得努力。下面列出主題域模型的主要好處:

指導業務數據模型的開發

業務數據模型是更詳細的模型,用於指導操作型系統與數據倉庫模型的開發。這樣做,有利於數據倉庫實現它的目標之一:數據一致性。

通常有幾個人同時開發業務數據模型。主題域模型的一個應用是按主題域拆分工作量。在這種方式下,每個人成爲某個領域的專家,如客戶、產品、銷售。建模員有時按業務功能開展工作,這樣每一個人的工作會涉及到多個主題域。確定每個主題域的首要人員,可以最小化重複的勞動,提高協調性。即使工作沒有分拆到人,主題域模型也有助於確保一致性和避免冗餘。當建模員發現需要一個新的實體時,他會根據定義確定合適的主題域。在真正創建這個新實體前,建模員僅僅需要瀏覽那個主題域的實體(一般少於30個),而不需要瀏覽完整模型中可能存在的上百個實體。有了這些信息,建模員或者創建新的實體,或者確認已存在的實體能滿足需要。

 

Subject Area Model Benefits

Regardless of how quickly the subject area model can be developed, the effort should only be undertaken if there are benefits to be gained. Following are some of the major benefits provided by the subject area model.

Guide the Business Data Model Development

The business data model is the detailed model used to guide the development of the operational systems and the data warehouse. By doing so, it helps the data warehouse accomplish one of its major generic objectives—data consistency.

Often, there are several people working on the business data model. One application of the subject area model is to divide the workload by subject area. In this manner, each person becomes an expert for a particular area such as Customers, Products, and Sales. The modelers sometimes address business functions, and hence each person’s work could involve multiple subject areas. By establishing a primary person for each subject area, duplication of effort is minimized and coordination is improved. Even if the workload is not divided by person, the subject area model helps ensure consistency and avoid redundancy. When a modeler identifies the need for a new entity, the modeler determines the appropriate subject area based on the definition. Before actually creating the new entity, the modeler need only review the entities in that subject area (typically less than 30) rather than reviewing the hundreds of entities that may exist in the full model. Armed with that information, the modeler can either create the new entity or ensure that the existing entity addresses the needs.

 

指導數據倉庫項目選擇

公司常常考慮推動多個數據倉庫項目,卻苦於如何把需求組合成項目以及如何建立優先級。主題域模型提供一個高層次的方法,依據每個項目包含的數據把他們組合起來。這些信息可以和業務優先級、技術難度、人員可用性等一起考慮,用於建立最終的項目順序。第3章會更詳細的討論這個問題。

Guide Data Warehouse Project Selection

Companies often contemplate multiple data warehouse initiatives and struggle with both grouping the requirements into projects and with establishing the priorities. The subject area model provides a high-level approach for grouping projects based on the data they encompass. This information should be considered along with the business priority, technical difficulty, availability of people, and so on in establishing the final project sequence. Chapter 3 will cover this in more detail.

 

指導數據倉庫開發項目

主題問題專家常常基於已存在的數據,例如,金融主管是公司內的“金融”專家,人力資源部的某個人是“人力資源”專家,從事銷售、市場與客戶服務的人員會成爲“客戶”專家。精通所從事的領域幫助項目隊伍定義需要包含的業務模塊。同時,數據主文件(如客戶主文件、產品主文件)趨於包含具體領域的數據。

Guide Data Warehouse Development Projects

Subject matter experts often exist based on the data that is being addressed. For example, someone in the chief financial officer’s organization would be the expert for “Financials”; someone in the Human Resources Department would be the expert for “Human Resources”; people from Sales, Marketing, and Customer Service would provide the expertise for “Customers.” Understanding the subject areas being addressed helps the project team identify the business representatives that need to be involved. Also, data master files (for example, Customer Master File, Product Master File) tend to contain data related to specific subjects.

 

業務數據模型

業務數據模型是另一種類型的模型,是數據在給定業務環境的抽象和表示,它對任何模型提供好處。它幫助人們直觀地理解一個業務信息如何和另一個業務信息關聯(“各部分如何互相連接起來”)。使用業務數據模型的產品包括操作型系統、數據倉庫、數據集市數據庫,模型提供這些數據庫的元數據(或者叫數據的信息),幫助人們理解怎麼使用或應用最終的產品。業務數據模型減少開發風險,因爲它確保實現的系統正確反映業務環境。最後,它用於指導開發工作,它給開發者提供一個基礎,用於解釋業務信息關係,確保項目有關人員共享一個共同的遠景。

Business Data Model

The business data model is another type of model. It is an abstraction or representation of the data in a given business environment, and it provides the benefits cited for any model. It helps people envision how the information in the business relates to other information in the business (“how the parts fit together”). Products that apply the business data model include operational systems, data warehouse, and data mart databases, and the model provides the meta data (or information about the data) for these databases to help people understand how to use or apply the final product. The business data model reduces the development risk by ensuring that all the systems implemented correctly reflect the business environment. Finally, when it is used to guide development efforts, it provides a basis to confirm the developers’ interpretation of the business information relationships to ensure that the key stakeholders share a common set of expectations.

業務數據模型的好處

業務數據模型提供一個關於業務信息及其關係一致的、穩定的視圖。它用於認識、評估、反映業務變化。下面列出數據倉庫業務數據模型的特殊益處:

Business Data Model Benefits

The business data model provides a consistent and stable view of the business information and business information relationships. It can be used as a basis for recognizing, evaluating, and responding to business changes. Specific benefits of the data model for data warehousing efforts follow.

 

範圍定義

任何項目都應該包括範圍定義作爲第一步,數據倉庫項目也不例外。若一個業務數據模型已經存在,它能用於傳達隨後的數據倉庫的信息。範圍定義文檔應該有專門的章節用於列出數據倉庫要包含的實體,另一章節專門列出那些有人期望包含在數據倉庫但是沒有包含在內的實體。顯式的聲明包含哪些實體,不包含哪些實體,確保對數據倉庫的關注沒有意外。這個實體列表對確定需要的主題問題專家與潛在的源系統非常重要。而且,這個清單可以用於幫助項目建立。很多活動(如數據倉庫模型開發,數據轉換邏輯)都依賴於數據元素的數量。使用數據實體(如果有的話,包括屬性)作爲項目的開始,給管理人員對項目工作量的估算提供基本依據。例如,開發數據倉庫模型的公式,可能由實體與屬性的數量乘以某一個數值,結果可能會根據複雜性、可用的文檔等做出調整。雖然第一個數據倉庫工作量的計算公式可能很粗糙,如果在實際工作中取得了數據,可以提煉這個公式,提供以後估算工作量的可靠性。

Scope Definition

Every project should include a scope definition as one of its first steps, and data warehouse projects are no exception. If a business data model already exists, it can be used to convey the information that will be addressed by the resultant data warehouse. A section of the scope document should be devoted to listing the entities that will be included within the data warehouse; another section should be devoted to listing the entities that someone could reasonably expect to be included in the data warehouse but which have been excluded. The explicit statement of the entities that are included and excluded ensures that there are no surprises with respect to the content of the data warehouse. The list of entities is useful for identifying the needed subject matter experts and for identifying the potential source systems that will be needed. Additionally, this list can be used to help in estimating the project. A number of activities (for example, data warehouse model development, data transformation logic) are dependent on the number of data elements. Using the data entities (and attributes if available) as a starting point provides the project manager with a basis for estimating the effort. For example, the formula for developing the data warehouse model may consist of the number of entities and attributes [2] multiplied by the number of hours for each. The result can then be adjusted based on anticipated complexity, available documentation, and so on. While the formula for the first data warehouse effort may be very rough, if data is maintained on the actual effort, the formula can be refined, and the reliability of the estimates can be improved in future implementations.
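The estimating formula described above can be written down in a few lines. The sketch below is one possible reading of it: the function name, the default rates, and the single `adjustment` multiplier are placeholders chosen for illustration and would need to be calibrated from an organization's own project history.

```python
def estimate_modeling_hours(entity_count, attribute_count=None, hours_per_item=2.0,
                            avg_attributes_per_entity=10, adjustment=1.0):
    """Rough effort estimate: (entities + attributes) * hours each * adjustment.

    All default values are illustrative assumptions, not figures from the text.
    When the attribute count is unknown, an anticipated average per entity is used.
    """
    if attribute_count is None:
        attribute_count = entity_count * avg_attributes_per_entity
    return (entity_count + attribute_count) * hours_per_item * adjustment
```

For instance, 20 entities with 180 known attributes at half an hour each gives (20 + 180) * 0.5 = 100 hours before any adjustment for complexity or documentation quality. Recording actual effort after each project is what allows `hours_per_item` and `adjustment` to be refined over time.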

 

集成的基礎

在設計一個企業級的數據模型時,設計者很快會碰到同名異物(同樣名稱的實體或屬性表示的不同事物)與同物異名(不同的實體或屬性名稱表示同一個事物)的問題。如圖2.4,設計者會看到在“普通分類帳”和“訂單明細”實體中都有一個屬性“帳目編號”,這兩個屬性是不是相同呢?也許不是,一個用於表示各種財務賬目,另一個用於表示客戶在組織內的賬目。

類似的,在圖2.5 中,訂單明細和帳單實體都分別有叫賬目編碼和客戶ID 的屬性,是否是同一個意思呢?答案是也許是的。在創建數據模型的時候,設計者必須確定這些屬性是同名異物的,並保證給他們不同的名字。(如果使用本章推薦的命名協定,在新的模型裏不會產生同名異物的現象)。同樣的,在模型裏,一個屬性必須由一個且只有一個名稱來表示,所以設計者必須協調同物異名問題,給每個屬性一個唯一的名字。這樣,數據模型用於管理冗餘的實體和屬性,給每一個實例“全局”的名字,減少冗餘。同時,數據模型對於同名異物情形下,清洗容易混淆及誤用的實體、屬性名稱也非常有用。實體和屬性有唯一的名稱確保作爲整體的企業不會做出錯誤的假設,這些假設將導致關於數據壞的決策。

Integration Foundation

In designing any enterprise’s data model, the designer will immediately run into situations where homonyms (entities or attributes that have the same name but mean very different things) and synonyms (entities or attributes that have different names but mean exactly the same thing) are encountered. In Figure 2.4, the designer may see that the General Ledger and the Order Entry systems both have an attribute called “Account Number.” Are these the same? Probably not! One is used to denote the field used for various financial accounts, and the other is used to denote the customer’s account with the organization. Similarly, in Figure 2.5, the Order Entry and Billing systems have attributes called Account Number and Customer ID, respectively. Are these the same? The answer is probably yes. In the data model being created, the designer must identify those attributes that are homonyms and ensure that they have distinctly different names. (If the naming convention for attributes recommended in this chapter is used, there will be no homonyms in the new models.) By the same token, an attribute must be represented once and only once in the model so the designer must reconcile the synonyms as well and represent each attribute by a single name. Thus, the data model is used to manage redundant entities and attributes rendering the “universal” name for each instance, reducing the redundancy in the environment. The data model is also very useful for clearing up confusing and misleading names for entities and attributes in the homonym situation as well. Ensuring that all entities and attributes have unique names guarantees that the enterprise as a whole will not make erroneous assumptions, which lead to bad decisions, about the data.

[2] If the number of attributes is not known, an anticipated average number of attributes per entity can be used.
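Once each source attribute has been given an agreed business meaning, homonyms and synonyms can be listed automatically. The sketch below assumes such an inventory exists; the dictionary `ATTRIBUTE_MEANINGS` and its meanings are hypothetical and simply restate the Account Number and Customer ID example from the text. Deciding what each attribute actually means remains a human task.

```python
from collections import defaultdict

# Hypothetical inventory: (source system, attribute name) -> agreed business meaning.
ATTRIBUTE_MEANINGS = {
    ("General Ledger", "Account Number"): "financial account",
    ("Order Entry", "Account Number"): "customer account",
    ("Billing", "Customer ID"): "customer account",
}


def find_homonyms(meanings):
    """Attribute names that carry more than one meaning."""
    by_name = defaultdict(set)
    for (_system, name), meaning in meanings.items():
        by_name[name].add(meaning)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}


def find_synonyms(meanings):
    """Meanings that are known under more than one attribute name."""
    by_meaning = defaultdict(set)
    for (_system, name), meaning in meanings.items():
        by_meaning[meaning].add(name)
    return {meaning: sorted(names) for meaning, names in by_meaning.items() if len(names) > 1}
```

Here "Account Number" is reported as a homonym (two meanings), and "customer account" is reported as having two synonymous names, which are the two cases the designer has to reconcile.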

 

 

多項目協調

一個數據倉庫計劃由多個數據倉庫實施項目組成,有時其中幾個項目同時管理。當多個團隊在數據倉庫上工作時,主題域模型可以用於初步確定項目的交疊部分,以及項目完成後仍會留下的裂縫。業務數據模型隨後用於確定項目之間的交疊部分,細化每個項目要使用的數據。當一個實體被不只一個項目使用時,它的設計、定義與實現應該只分配給一個團隊。由其他項目發現的這部分數據的變更由那個團隊協調一致。數據模型還能幫助指出系統的裂縫,即實體和屬性沒有被任何系統處理。是否所有的實體、屬性、關係都在某處被創建?如果沒有,你的系統真正有問題。它們是否在系統內的其他地方被更新或使用?如果是這樣,系統之間是否有正確的接口處理所創建數據的流動?最後,它們在系統內的某處會被刪除或丟棄嗎?創建一個數據模型與系統流程之間的交叉矩陣能幫助你回答這些問題。

Multiple Project Coordination

A data warehouse program consists of multiple data warehouse implementation projects, and sometimes several of these are managed simultaneously. When multiple teams are working on the data warehouse, the subject area model can be used to initially identify where the projects overlap and gaps that will remain following completion of the projects. The business data model is then used to establish where the projects overlap to fine-tune what data each project will use. Where the same entity is used by more than one project, its design, definition, and implementation should be assigned to only one team. Changes to that piece of data discovered by other projects can be coordinated by that team. The data model can also help to identify gaps in your systems where entities and attributes are not addressed at all. Are all entities, attributes, and relationships created somewhere? If not, you have a real problem in your systems. Are they updated or used somewhere else within the systems? If so, do you have the right interfaces between systems to handle the flow of created data? Finally, are they deleted or disposed of somewhere in your systems? The creation of a matrix based upon the crossing of your data model with your systems’ processes will give you a sound basis from which to answer these questions.
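The matrix mentioned above is often kept as a create/read/update/delete (CRUD) grid of processes against entities; that particular form is an assumption here, since the text only speaks of crossing the data model with the systems' processes. The sketch below uses a hypothetical `USAGE` inventory to find entities that no process creates or deletes, and entities shared by several processes.

```python
# Hypothetical inventory: (process, entity) -> letters from "CRUD".
USAGE = {
    ("Order Entry", "Customer"): "CR",
    ("Order Entry", "Order"): "CRU",
    ("Billing", "Customer"): "RU",
    ("Billing", "Order"): "R",
    ("Billing", "Invoice"): "RU",
    ("Archiving", "Order"): "D",
}


def operations_by_entity(usage):
    """Collect every operation that any process performs on each entity."""
    ops = {}
    for (_process, entity), letters in usage.items():
        ops.setdefault(entity, set()).update(letters)
    return ops


def find_gaps(usage, required="CD"):
    """Entities for which no process performs one of the required operations."""
    gaps = {}
    for entity, letters in operations_by_entity(usage).items():
        missing = [op for op in required if op not in letters]
        if missing:
            gaps[entity] = missing
    return gaps


def shared_entities(usage):
    """Entities touched by more than one process, i.e. where projects overlap."""
    processes = {}
    for (process, entity) in usage:
        processes.setdefault(entity, set()).add(process)
    return {entity: sorted(names) for entity, names in processes.items() if len(names) > 1}
```

In this sample, Invoice is never created or deleted by any process and Customer is never deleted, which are the kinds of gaps the questions in the paragraph are meant to expose; Customer and Order are shared and would each be assigned to a single owning team.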

 

依賴識別

數據模型幫助識別不同的實體與屬性之間的依賴關係。這種方式,用於幫助評估變更的影響。當你改變或創建一個過程,你必須能回答這個問題:它是否會影響到其他過程的數據。數據模型能幫助你確保依賴的實體和屬性在設計和實現一個新系統或者系統變更時被考慮,

Dependency Identification

The data model helps to identify dependencies between various entities and attributes. In this fashion, it can be used to help assess the impact of change. When you change or create a process, you must be able to answer the question of whether it will have any impact on sets of data used by other processes. The data model can help ensure that dependent entities and attributes are considered in the design or implementation of new or changed systems.
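One way to picture this kind of impact assessment is as a walk over the dependency links recorded in the model. The sketch below assumes a hand-written map of which entities reference which; the entity names and the `impacted_by` function are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each entity lists the entities that reference it.
DEPENDENTS = {
    "Customer": ["Order"],
    "Product": ["Order Line"],
    "Order": ["Order Line", "Invoice"],
    "Order Line": [],
    "Invoice": [],
}


def impacted_by(entity, dependents):
    """Return every entity reachable from `entity` through dependency links."""
    seen = set()
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = impacted_by("Customer", DEPENDENTS)
print(impacted)
```

In this toy model, a change to Customer has to be reviewed against Order and, through Order, against Order Line and Invoice.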

 

冗餘管理

數據模型盡力去處所有的冗餘。每個實體、屬性、關係在模型裏只出現一次,除非他們在另一個實體裏用作外鍵。在創建這個模型時,你能快速看到必須解決的疊交與衝突問題,同時還有在繼續之前必須去除的冗餘。在“關係建模指導方針”裏提出的範式規則用於確保設計一個非冗餘的模型。有非常多的理由在系統模型和技術模型裏引入冗餘,最常見的理由是查詢的性能。明白何地及爲和引入冗餘非常重要,通過數據模型,冗餘得到控制、提前關注、在整個設計階段檢驗它的影響。

Redundancy Management

The business data model strives to remove all redundancies. Entities, attributes, and relationships appear only once in this model unless they are used as foreign keys into other entities. By creating this model, you can immediately see overlaps and conflicts that must be resolved, as well as redundancies that must be removed, before going forward. The normalization rules specified in the “Relational Modeling Guidelines” section are designed to ensure a nonredundant data model. There are many reasons to introduce redundancy back into system and technology data models; the most common one is to improve the performance of queries or requests for data. It is important to understand where and why any redundancy is introduced, and it is through the data model that redundancy can be controlled, thought out ahead of time, and examined for its impact on the overall design.

 

變更管理

數據模型同時作爲記錄實體、屬性、關係變更的最好方式。系統建立之後,我們可能發現新的業務規則,需要增加實體和屬性。當這些變更記錄在技術模型和系統模型(如2.3 )時,必須強迫返回到數據模型鏈——到業務模型,甚至可能要到主題域模型本身。很明顯,如果沒有各個層次數據模型的變更控制,很快就會發生混亂,數據模型所有的好處都會丟失。

Change Management

Data models also serve as your best way to document changes to entities, attributes, and relationships. As systems are created, we may discover new business rules in effect and the need for additional entities and attributes. As these changes are documented in the technology and system data models (see Figure 2.3), these changes must be enforced all the way back up the data model chain—to the business data model and maybe even to the subject area diagram itself. Without solid change control over all levels of the data models, it should be clear that chaos will quickly take over and all the benefits of the data models will be lost.

 

系統模型

2.3中下一層次的數據模型由一系列系統模型組成。一個系統模型是一個信息的集合,這些信息與具體的系統或功能有關,如賬單系統、數據倉庫、或者數據集市,獨立於任何具體的技術或者DBMS 環境。例如,賬單系統和數據倉庫系統模型很可能不會發現有每個企業感興趣的數據片,因爲系統模型是從業務模型開發而來的,它必須與業務模型一致。第4章更詳細介紹數據倉庫系統模型。我們構建的每個系統或數據庫都有它自己唯一的系統模型,描述所支持的那個系統或功能模塊具體的數據需求。一般每個系統只有一個系統模型,也就是說,數據倉庫只有一個系統模型,賬單系統只有一個系統模型,等等。我們可以選擇物理上實現這個模型的很多版本(請見下一節關於技術模型的介紹),但是系統模型仍然只有一個,它用於實現真正的系統。

System Model

The next level of data models in Figure 2.3 consists of the set of system models. A system model is a collection of the information being addressed by a specific system or function such as a billing system, data warehouse, or data mart. The system model is an electronic representation of the information needed by that system. It is independent of any specific technology or DBMS environment. For example, the billing system and data warehouse system models will most likely not contain every scrap of data of interest to the enterprise. Because the system model is developed from the business data model, it must, by default, be consistent with that model. See Chapter 4 for more detail on the construction of the data warehouse system model. It is also important to note that there will be more than one system model. Each system or database that we construct will have its own unique system model denoting the specific data requirements for that system or the function it supports. Conversely, there is typically only one system model per system. That is, there is only one system model for the data warehouse, one for the billing system, and so on. We may choose to physically implement many versions of the system model (see the next section on the technology model) but still have only one system model from which to implement the actual system(s).

技術模型

最後要開發的模型是技術模型。這個模型是一個具體信息的集合,這些信息被存貯在一個特殊的系統裏,用一個具體的平臺實現。現在,我們必須考慮所有有關數據庫的技術,包括:

硬件。你對平臺的選擇意味着依照你的平臺技術,你必須考慮單個數據文件的尺寸,並且在技術模型裏標記這些規格

數據庫管理系統(DBMS。爲數據倉庫選擇的數據庫管理系統對你最終設計的數據庫有很大影響,你必須做一些決定:

Technology Model

The last model to be developed is a technology model. This model is a collection of the specific information being addressed by a particular system and implemented on a specific platform. Now, we must consider all of the technology that is brought to bear on this database including:

Hardware. Your choice of platform means that you must consider the sizes of the individual data files according to your platform technology and notate these specifications in the technology model.

Database management system (DBMS). The DBMS chosen for your data warehouse will have a great impact upon the ultimate design of your database. You must make the following determinations:

■■反規範化的數量。一些DBMS環境在很少或沒有反規範化時性能較好,另一些需要顯著的反規範化以得到好的性能。

■■物化視圖。依賴於你使用的DBMS技術,你可能創建物化視圖或虛擬數據集市,用於加速查詢性能。

■■分區策略。你應該使用分區加快數據裝入數據倉庫及交付到數據集市的速度,或者選擇水平分區,或者選擇垂直分區。第5章更詳細的討論這個話題。

■■索引策略。依賴於你的DBMS,索引策略有很多選擇。位圖、編碼矢量、稀疏矩陣、哈西算法、簇、連接索引等,都是可選的。

■■參照完整性。綁定(DBMS綁定參照完整性——在父記錄裝入前不能裝入子記錄)與不綁定(裝入數據到後備區,用程序檢查完整性,然後再裝入數據倉庫)是兩種選擇,時間是一個限定因素。

■■數據交付技術。怎樣把數據從數據倉庫交付到各種數據集市對數據庫的設計有影響,要考慮是通過門戶交付還是管理的查詢過程。

■■安全性。數據倉庫常常包含高度敏感的數據。你可以選擇在DBMS層,通過物理分離這部分數據來調用安全性,或者使用視圖、存貯過程確保安全。如果數據是極端敏感的,你可能使用加密技術確保安全。

技術模型必須與控制系統模型一致,也就是說,從它的系統模型繼承基本需求。同樣地,技術模型裏基礎實體、屬性、關係的任何變更,都必須返回影響模型鏈,如2.3所示(向上的箭頭)。正象有很多系統模型一樣——每個系統一個——每一個系統模型有很多技術模型。例如,你可以選擇在分開的實例中實現企業數據倉庫的子集;你可以選擇按主題域實現數據——例如,分別爲客戶、產品、訂單使用物理上不同的實例;或者你可以選擇按地理區域分開數據子集——一個數據倉庫用於北美,另一個用於歐洲,第三個用於亞洲。每一個物理實例有它自己的技術模型,這些技術模型都基於系統模型,而根據你選擇的實現技術進行修改。

 

■■ Amount of denormalization. Some DBMS environments will perform better with minimal or no denormalization; others will require significant denormalization to achieve good performance.

■■ Materialized views. Depending on the DBMS technology you use, you may create materialized views or virtual data marts to speed up query performance.

■■ Partitioning strategy. You should use partitioning to speed up the loading of data into the data warehouse and delivery to the data marts. You have two choices—either horizontal or vertical partitioning. Chapter 5 discusses this topic in more detail.

■■ Indexing strategy. There are many choices, depending on the DBMS you use. Bitmap, encoded vector, sparse, hashing, clustered, and join indexes are some of the possibilities.

■■ Referential integrity. Bounded (the DBMS binds the referential integrity for you—you can’t load a child until the parent is loaded) and unbounded (you load the data in a staging area to programmatically check for integrity and then load it into the data warehouse) are two possibilities. You must make sure that time is one of the qualifiers.

■■ Data delivery technology. How you deliver the data from the data warehouse into the various data marts will have an impact on the design of the database. Considerations include whether the data is delivered via a portal or through a managed query process.

■■ Security. Many times the data warehouse contains highly sensitive data. You may choose to invoke security at the DBMS level by physically separating this data from the rest, or you can use views or stored procedures to ensure security. If the data is extremely sensitive, you may choose to use encryption techniques to secure the data.

The technology model must be consistent with the governing system model. That is, it inherits its basic requirements from its system model. Likewise, any changes in the fundamental entities, attributes, and relationships discovered as the technology model is implemented must be reflected back up the chain of models as shown in Figure 2.3 (upward arrows). Just as there are many system models (one per system), there may be multiple technology models for a single system model. For example, you may choose to implement subsets of the enterprise data warehouse in physically separate instances. You may choose to implement data by subject area, for example, using a physically different instance for customer, product, and order. Or you may choose to separate subsets of data by geographic area: one warehouse for North America, another for Europe, and a third for Asia. Each of these physical instances will have its own technology model that is based upon the system model and modified according to the technology upon which you implement.
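The "unbounded" referential integrity option in the list above (stage the data, check integrity programmatically, then load) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names (`stg_customer`, `stg_order`, `dw_order`) and the sample rows are assumptions made for the example, not structures from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_id INTEGER, name TEXT);
    CREATE TABLE stg_order (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE dw_order (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO stg_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO stg_order VALUES (100, 1), (101, 2), (102, 9);
""")

# Child rows whose parent is missing from the staged parent table.
orphans = conn.execute("""
    SELECT o.order_id
    FROM stg_order AS o
    LEFT JOIN stg_customer AS c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY o.order_id
""").fetchall()

# Load only the child rows that have a matching parent.
conn.execute("""
    INSERT INTO dw_order (order_id, customer_id)
    SELECT o.order_id, o.customer_id
    FROM stg_order AS o
    JOIN stg_customer AS c ON c.customer_id = o.customer_id
""")
loaded = conn.execute("SELECT COUNT(*) FROM dw_order").fetchone()[0]

print("Orphaned orders:", [row[0] for row in orphans])
print("Rows loaded:", loaded)
```

With the bounded option, the DBMS itself would reject order 102 at load time through a foreign key constraint. Here the check is a query over the staging area, which leaves the decision about what to do with the orphaned rows to the load program.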

 

關係數據建模方針

數據建模是一個非常抽象的過程,不是所有的IT專家都擁有創建一個堅實的模型的資格。數據建模員需要有概念化的能力,把那些關於企業需要用於處理業務及包含在其中的規則的無形思想概念化。而且,數據建模是非確定性的——只有一個正確的方式建立一個數據模型,而錯誤的方式有很多。關於數據建模一個普遍關心的問題是發生變更的數量。當我們越來越瞭解企業,這些知識會反過來導致已有的數據模型的變更。數據建模員不要把這個看作威脅,而必須準備好變更,並欣然接受它——事實上,模型是更有洞察力地、更接近地象徵一個企業整體。

數據建模員在創建各種數據模型時,必須堅持一系列原則和規則。我們已經建議,在你開始建模實踐前,建立這些“基礎規則”,以免後面的混淆和情緒上的爭論。對這些規則的任何背離都必須記錄文檔及列出例外的原因,之後採取的用於減少或消除這些例外的任何行動也都要記錄文檔。最後,數據建模需要判斷力,即使判斷力的理由不清晰或不能文檔化。當出現這種情況,數據建模員應該重新檢查後面列出的三個指導方針。如從一個模型增加或者刪除某些東西,能提供模型的溝通效用與能力,那麼就應該那樣做。

本書的目標是確保讀者在開始數據倉庫設計前有一個堅實的基礎並一步一步指導讀者處理這些問題。下面這些指導意見是我們從事多年建模經驗中得來的,讓我們開始吧。

 

Relational Data-Modeling Guidelines

Data modeling is a very abstract process, and not all IT professionals have the qualifications to create a solid model. Data modelers require the ability to conceptualize intangible notions about what the business requires to perform its business and what its rules are in doing business. Also, data modeling is nondeterministic: there is no one right way to create a data model, but there are many wrong ways.

A common concern in data modeling is the amount of change that occurs. As we learn more and more about the enterprise, this knowledge will be reflected in changes to the existing data models. Data modelers must not see this aspect as a threat but rather be prepared for change and embrace it as a good sign—a sign that the model is, in fact, more insightful and that it more closely resembles the enterprise as a whole.

Data modelers must adhere to a set of principles or rules in creating the various data models. It is recommended that you establish these “ground rules” before you start your modeling exercise to avoid confusion and emotional arguments later on. Any deviation from these rules should be documented and the reasons for the exception noted. Any mitigating or future actions that reduce or eliminate the exception later on should be documented as well. Finally, data modeling also requires judgment calls even when the reasons for the judgment are not clear or cannot be documented. When faced with this situation, the data modeler should revisit the three guidelines described in the next section. If adding or deleting something from the model improves its utility or ability to be communicated, then it should be done.

It is the goal of this book to ensure that you have the strong foundation and footing you need to deal with these issues before you begin your data warehouse design. Let’s start with a set of guidelines garnered from the many years of data modeling we have performed.

 

方針與最佳實踐

數據模型的目標是完全地、準確地反應數據需求與業務規則,使業務高效處理。爲了這個目標,我們相信在數據模型設計時應該遵守三個方針:

溝通工具。數據模型應該用作業務團體和IT成員之間的溝通工具。數據需求必須很好的形成文檔,並且被所有有關人員理解,必須面向業務,必須包含適當的細節層次。數據模型應當用於傳達業務團體的關於企業數據的觀點,給實現系統的技術人員。當開發這些模型時,目標必須始終清晰準確。當往一個數據模型增加信息時,建模員必須問自己,是會提高清晰性還是降低清晰性。

粒度的層次。數據模型應該反應企業使用的信息的“最小公分母”。聚合、導出、彙總數據元素應該從這些基本部分分解出來,無必要的冗餘和重複數據元素應該去除。當我們爲了使用和性能要求,通過增加向後的聚合、導出項、彙總項等進行“反規範化”時,我們清楚地知道每個組件里加入了什麼數據元素。換句話說,數據必須根據需要細化,以支持本性和最終的使用。而最終的技術模型可能有一些明顯的聚合、彙總、導出項,這些都要數據建模文檔指向最初的明細數據。

面向業務。模型本身就表示了企業視圖,而不需要物理限制。我們總是努力按業務的需要建模,而不是根據現有的系統、技術、數據庫等限制業務來建模。不以業務團體的要求爲基礎的項目往往必定要失敗。我們常常沒有和業務團體一致,因爲我們簡單地相信我們已經知道分析的結果是什麼(類似於“如果我們建好了,他們就會來”的信念)。這些方針應該永遠銘記在建模員心中,從他/她從事建模開始。只要一出現問題或意見,建模員應當回想這些方針,確定模型的能力是增加了還是減少了。

把這些記在心中,讓我們看看數據建模的最佳實踐有哪些:

 

Guidelines and Best Practices

The goal of any data model is to completely and accurately reflect the data requirements and business rules for handling that data so that the business can perform its functions effectively. To that end, we believe that there are three guidelines that should be followed when designing your data models:

Communication tool. The data models should be used as a communication tool between the business community and the IT staff and within the IT staff. Data requirements must be well documented and understood by all involved, must be business-oriented, and must consist of the appropriate level of detail. The data model should be used to communicate the business community’s view of the enterprise’s data to the technical people implementing their systems. When developing these models, the objectives must always be clarity and precision. When adding information to a data model, the modeler should ask whether the addition adds to clarity or subtracts from it.

Level of granularity. The data models should reflect the “lowest common denominator” of information that the enterprise uses. Aggregated, derived, or summarized data elements should be decomposed to their basic parts, and unnecessary redundancy or duplication of data elements should be removed. When we “denormalize” the model by adding back aggregations, derivations, or summarization according to usage and performance objectives, we know precisely what elements went into each of these components. In other words, the data should be as detailed as necessary to understand its nature and ultimate usage. While the ultimate technology model may have significant aggregations, summarizations, and derivations in it, these will be connected back to the ultimate details through the data modeling documentation.

Business orientation. It is paramount that the models represent the enterprise’s view of itself without physical constraints. We strive always to model what the business wants to be rather than model what the business is forced to be because of its existing systems, technologies, or databases. Projects that are not grounded in what the business community wants are usually doomed to fail. Generally, we miss the boat with our business community because we cut corners in the belief that we already know what the results of analysis will be (the “if we build it, they will come” belief). These guidelines should always be at the forefront of the modeler’s mind when he or she commences the modeling process. Whenever questions or judgment calls come into play, the modeler should fall back on these guidelines to determine whether the resolution adds to or detracts from the overall usability of the models.

With these in mind, let’s look at some of the best practices in data modeling:

 

業務用戶參與。首先,必須瞭解,業務團體必須花時間與資源,幫助建立各種模型,數據建模不只是IT人員的技術活。如果業務團體不能找到時間,拒絕參與,或者宣稱IT人員應當對他們需要的數據“負責”,那麼需要一個明智的項目經理把他們拉進來。缺席業務團體的數據建模是浪費時間、資源和精力,而且極有可能失敗。而且,業務團體加入得越快越好。第一步,你必須決定業務團體的哪些人應該加入進來。這些人可能願意也可能不願意參加,如果他們公開抵制,你可能需要舉辦一些培訓,消除他們的恐懼,或者尋找其他資源。典型的參與者包括髮起人高管、具有領域專業的經理、業務分析師。

面談與小型會議。在短時間內取得大量信息的一個最普通的方式是面談及小型會議。面談一般從一個多兩個人那裏得到信息,會議可能得到更深的信息。小型會議一般有5——10人蔘加,一般用於決定方向或取得一致意見,或者用於培訓。這些會議的記錄文檔被確認,並添加到信息庫,作爲數據模型的依據。

驗證。提交的模型然後通過面談或小型會議直接反饋,或者通過正式的通道確認。請一些業務團體成員確認時,你可能更關注於業務規則與限制,而不是真正的數據模型本身;而對另一些人,你應當確認真正的數據模型結構和關係是否正確。

數據模型維護。在任何建模工作中,變更都是常有的,你應當準備應對這種情況。變更管理應當制定正式的規章,包含簽入、簽出過程,正式的變更請求及解決衝突的過程。

知道什麼時候“適可而止”。也許數據建模員應該學會的最重要的實踐是什麼時候說模型已經足夠好。因爲我們正在設計一個抽象的、有爭議的結構,數據建模員很容易發現自己處於“分析麻痹”狀態。數據模型什麼時候算完成了?永遠沒有!因此,建模員必須做一個艱難的決定:模型是否足夠支持已經實現的功能需求。知道變更會發生,並且準備下次處理變更。

 

Business users’ involvement. It must be understood up front that the business community must set aside time and resources to help create the various data models; data modeling is not just a technical exercise for IT people. If the business community cannot find the time, refuses to participate, or basically declares that IT should “divine” what data they need, it is the wise project manager who pulls the plug on the project. Data modeling in a business community vacuum is a waste of time, resources, and effort, and is highly likely to fail. Furthermore, the sooner the business community gets involved, the better. As a first step, you must identify who within the business community should be involved. These people may or may not be willing to participate. If they are openly resistant, you may need to perform some education, carry out actions to mitigate their fears, or seek another resource. Typical participants are sponsoring executives, managers with subject matter expertise, and business analysts.

Interviews and facilitated sessions. One of the most common ways to get a lot of information in a short amount of time is to perform interviews and use facilitated sessions. The interviews typically obtain information from one or two people at a time, which makes it possible to gather more in-depth information. The facilitated sessions are usually for 5 to 10 attendees and are used to get general direction and consensus, or even for educational purposes. The documentation from these sessions is verified and added to the bank of information that contributes to the data models.

Validation. The proposed data model is then verified by either immediate feedback from the interviews or facilitated sessions, or by formal walkthroughs. It may be that you focus on just the verification of the business rules and constraints rather than the actual data model itself with some of the business community members. With others though, you should verify that the actual data model structures and relationships are appropriate.

Data model maintenance. Because change becomes a common feature in any modeling effort, you should be prepared to handle these occurrences. Change management should be formalized by documented procedures that have check-in and check-out processes, formal requests for changes, and processes to resolve conflicts.

Know when “enough is enough.” Perhaps the most important practice any data modeler should learn is when to say the model is good enough. Because we are designing an abstract, debatable structure, it is very easy for the data modeler to find him- or herself in “analysis paralysis.” When is the data model finished? Never! Therefore it is mandatory that the modeler make the difficult determination that the model is sufficient to support the needs of the function being implemented, knowing that changes will happen and that he or she is prepared to handle them at a later date.

 

規範化

規範化是一個確保數據模型滿足準確性、一致性、簡單性、非冗餘性、穩定性等目標的方法,它是一個物理數據庫設計技術,在關係技術中應用數學規則,定義和減少插入、修改、刪除異常。

我們常用的第三範式簡單的定義是:所有屬性必須依賴於鍵,所有鍵,不能依賴除了鍵之外的其他屬性。規範化是一種從根本上保證屬性在合適的實體裏,使設計對關係型DBMS有用及有效。我們會在本章的下一節討論數據模型設計中規範化的過程。規範化有如下特徵:

■■驗證數據模型結構的正確性與一致性。

■■獨立於任意物理限制。

■■最小化存貯空間,通過減少數據在多個地方存貯而實現。

最後,規範化帶來:

■■移除數據的不一致性,因爲數據制存貯一次,因此消除了數據衝突的可能性。

■■減少插入、修改、刪除異常,因爲數據只存貯一次。

■■提高數據結構穩定性,因爲屬性放入實體位置是基於內部的特性而不是具體的應用需求。

 

Normalization

Normalization is a method for ensuring that the data model meets the objectives of accuracy, consistency, simplicity, nonredundancy, and stability. It is a physical database design technique that applies mathematical rules to the relational technology to identify and reduce insertion, update, or deletion anomalies.

The mantra we use to get to third normal form is that all attributes must depend on the key, the whole key, and nothing but the key—to put it simply. Fundamentally this means that normalization is a way of ensuring that the attributes are in the proper entity and that the design is efficient and effective for a relational DBMS. We will walk through the steps to get to this data model design in the next sections of this chapter. Normalization has these characteristics as well:

■■ Verification of the structural correctness and consistency of the data model

■■ Independence from any physical constraints

■■ Minimization of storage space requirement by eliminating the storage of data in multiple places

Finally, normalization:

■■ Removes data inconsistencies since data is stored only once, thus eliminating the possibility of conflicting data

■■ Diminishes insertion, updating, and deletion anomalies because data is stored only once

■■ Increases the data structure stability because attributes are positioned in entities based on their intrinsic properties rather than on specific application requirements
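The inconsistency that normalization removes can be made concrete with a small check. The sketch below looks for determinant values that map to more than one dependent value in a set of rows, which is what happens when the same fact is stored in several places and the copies drift apart. The `violations` helper and the sample rows are hypothetical.

```python
def violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        seen.setdefault(key, set()).add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Invented rows in which the professor name is repeated on every offering.
offerings = [
    {"course_id": "C101", "offering_no": 1, "professor_id": "P1", "professor_name": "Lee"},
    {"course_id": "C101", "offering_no": 2, "professor_id": "P1", "professor_name": "Li"},
    {"course_id": "C102", "offering_no": 1, "professor_id": "P2", "professor_name": "Ortiz"},
]

conflicts = violations(offerings, ("professor_id",), "professor_name")
print(conflicts)
```

Professor P1 appears under two spellings because the name is stored once per offering. If the name were stored only once, in an entity keyed by the professor identifier, this conflict could not arise.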

 

關係數據模型的規範化

在業務數據模型裏規範化非常有用:

■■它不指示任何物理的處理方向,因此使業務模型有一個好的開始,它面向所有的應用和數據庫。

■■它減少聚合、彙總或導出元素,保證在數據模型裏沒有隱含的過程。

■■它禁止所有的屬性和實體的複製或冗餘的發生。

系統模型與技術模型從業務數據模型繼承了這些屬性,這樣開始一個完全規範化的數據模型。然而,因爲各種原因,有時需要加入反規範化的屬性,這會在第3章與第4章介紹,最要的是必須認識到何時、何地進行反規範化,並記錄這樣做的原因。失去控制的冗餘和反規範化會導致混亂及不可執行的數據庫設計。

規範化應當在業務數據模型設計階段採用,然而,有一點非常重要,你不應該修改業務規則以適應規範化的嚴格限制,就是說,不要僅僅爲了滿足規範化而創建對象。

Normalization of the Relational Data Model

Normalization is very useful for the business data model because:

■■ It does not instruct any physical processing direction, thus making the business model a good starting place for all applications and databases.

■■ It reduces aggregated, summarized, or derived elements to their basic components, ensuring that no hidden processes are contained in the data model.

■■ It prevents all duplicated or redundant occurrences of attributes and entities.

The system and technology models inherit their characteristics from the business data model and so start out as a fully normalized data model. However, denormalized attributes will be designed into these data models for a variety of reasons, as described in Chapters 3 and 4, and it is important to recognize where and when the denormalization occurs and to document the reasons for that denormalization. Uncontrolled redundancy or denormalization will result in a chaotic and nonperforming database design.

Normalization should be undertaken during the business data model design. However, it is important to note that you should not alter the business rules just to follow strict normalization rules. That is, do not create objects just to satisfy normalization.

第一範式

第一範式(1NF)是建立數據模型的第一步——屬性依賴於鍵。這需要兩個條件:所有實體都有一個主鍵,唯一標示一個對象;實體不包含會反覆出現的或者多值的組。每一個屬性應該在它的最低層次的細節,並且有唯一的意義和名字。1NF是所有其他規範化技術的基礎。圖2.5顯示模型如何轉換爲1NF

First Normal Form

First normal form (1NF) takes the data model to the first step described in our mantra: the attribute is dependent on the key. This requires two conditions: that every entity have a primary key that uniquely identifies it, and that the entity contain no repeating or multivalued groups. Each attribute should be at its lowest level of detail and have a unique meaning and name. 1NF is the basis for all other normalization techniques. Figure 2.6 shows the conversion of our model to 1NF.

 

在圖2.5中,我們發現在課程實體裏包含一些課程安排的屬性,而不是課程本身的屬性(課程安排,階段、教授標識、教授名稱)。這些屬性不依賴於課程實體的鍵而存在,因此,應該把他們放到自己的實體內(課程安排)。

In Figure 2.6, we see that the Course entity contains the attributes that deal with a specific offering of the course rather than the generic course itself (Course Offering, Period, Professor Identifier, and Professor Name). These attributes are not dependent on the Course entity key for their existence, and therefore should be put into their own entity (Course Offering).
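As a rough illustration of this step, the sketch below splits a repeating group of offerings out of a course record. The sample values, the field names, and the assumption that an offering is identified by the course key plus an offering number are all hypothetical; the figure itself is not reproduced here.

```python
# Unnormalized rows: each course carries a repeating group of offerings.
# All values are invented for illustration.
courses_unnormalized = [
    {
        "course_id": "C101",
        "course_name": "Data Modeling",
        "offerings": [
            {"offering_no": 1, "period": "Fall",
             "professor_id": "P1", "professor_name": "Lee"},
            {"offering_no": 2, "period": "Spring",
             "professor_id": "P2", "professor_name": "Ortiz"},
        ],
    },
]


def to_first_normal_form(rows):
    """Move the repeating group into its own Course Offering entity."""
    courses, offerings = [], []
    for row in rows:
        courses.append({"course_id": row["course_id"],
                        "course_name": row["course_name"]})
        for offering in row["offerings"]:
            # Assumed key: the course identifier plus the offering number.
            offerings.append({"course_id": row["course_id"], **offering})
    return courses, offerings


courses, course_offerings = to_first_normal_form(courses_unnormalized)
print(courses)
print(course_offerings)
```

After the split, each Course row holds a single value per attribute, and each Course Offering row carries the course key so that it can be related back to its course.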

 

第二範式

第二範式(2NF)對模型進行進一步提煉,根據我們的宗旨:屬性必須依賴於所有鍵。爲達到2NF,實體必須滿足1NF,而且每一個非主鍵屬性必須依賴整個主鍵而存在。2NF 進一步減少可能的冗餘,通過移除依賴於部分主鍵的屬性,把他們放到自己單獨的實體裏。注意學科名稱僅僅依賴於學科標識,如果保留在模型裏,學科標識和學科名稱必須爲每一個課程重複一次。把他們放在自己的實體裏,他們只需要存貯一次。圖2.7 顯示了把模型轉換爲2NF

Second Normal Form

Second normal form (2NF) takes the model to the next level of refinement according to our mantra—the attributes must be dependent on the whole key. To attain 2NF, the entity must be in 1NF and every nonprimary attribute must be dependent on the entire primary key for its existence. 2NF further reduces possible redundancy in the data model by removing attributes that are dependent on part of the key and placing them in their own entity. Notice that Discipline Name was only dependent on the Discipline Identifier. If this remains in the model, then Discipline Identifier and Name must be repeated for every course. By placing these in their own entity, they are stored only once. Figure 2.7 shows the conversion of our model to 2NF.
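The Discipline example can be sketched in the same way. The code assumes, for illustration only, that Course has a composite key made of a discipline identifier and a course identifier; the sample rows and the `to_second_normal_form` helper are hypothetical.

```python
# Course rows with an assumed composite key (discipline_id, course_id).
# discipline_name depends only on discipline_id, which is part of the key.
course_rows = [
    {"discipline_id": "CS", "course_id": "101",
     "discipline_name": "Computer Science", "course_name": "Data Modeling"},
    {"discipline_id": "CS", "course_id": "102",
     "discipline_name": "Computer Science", "course_name": "Databases"},
    {"discipline_id": "MA", "course_id": "101",
     "discipline_name": "Mathematics", "course_name": "Calculus"},
]


def to_second_normal_form(rows):
    """Move attributes that depend on part of the key into their own entity."""
    disciplines = {}
    courses = []
    for row in rows:
        # Each discipline name is kept once, keyed by its identifier.
        disciplines[row["discipline_id"]] = row["discipline_name"]
        courses.append({name: row[name]
                        for name in ("discipline_id", "course_id", "course_name")})
    return disciplines, courses


disciplines, courses_2nf = to_second_normal_form(course_rows)
print(disciplines)
print(courses_2nf)
```

The discipline name "Computer Science" appeared twice in the input and appears once in the result, which is the reduction in redundancy that 2NF is meant to deliver.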

 

第三範式

第三範式(3NF)使模型達到最後層次的提高,根據我們的宗旨——屬性不能依賴於任何主鍵外的屬性。爲達到3NF ,實體必須滿足2NF,而且非鍵屬性必須僅僅依賴於主鍵,而不依賴於實體裏的任何其他屬性而存在。這樣移除了傳遞依賴,也就是非主鍵屬性不只依賴於主鍵,同時也依賴於其他非主鍵屬性的情況。圖2.8顯示模型轉換爲3NF的情形,注意課程安排教授(標識)與課程安排教授名稱是重複屬性,教授名稱和教授標識都不依賴於課程安排,因此,我們把這些屬性從課程安排實體移出,把他們放入自己獨立的實體,實體名爲教授。在這一點,爲任何技術實現做好了準備——帳務、訂單、普通分類帳等操作型系統,數據倉庫、數據集市等商業智能,或者操作型存貯等其它環境。

Third Normal Form

Third normal form (3NF) takes the data model to the last level of improvement referred to in our mantra: the attribute must be dependent on nothing but the key. To attain 3NF, the entity must be in 2NF, and the nonkey fields must be dependent on only the primary key, and not on any other attribute in the entity, for their existence. This removes any transitive dependencies in which the nonkey attributes depend on not only the primary key but also on other nonkey attributes. Figure 2.8 shows the conversion of our model to 3NF. In Figure 2.8, notice that Course Offering Professor and Course Offering Professor Name are recurring attributes. Neither the Professor Name nor the Professor Identifier depends on the Course Offering. Therefore, we remove these attributes from the Course Offering entity and place them in their own entity, titled Professor. At this point, the data model is in 3NF, in which all attributes are dependent on the key, the whole key, and nothing but the key. Your business data model should be presented in 3NF at a minimum. At this point, it is ready for use in any of your technological implementations: operational systems such as billing, order entry, or general ledger (G/L); business intelligence such as the data warehouse and data marts; or any other environment such as the operational data store.
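A possible relational rendering of the Professor split is sketched below with Python's built-in sqlite3 module. The table and column names are assumptions chosen for the example; the point is only that, once the professor name lives in its own entity, a name change touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE professor (
        professor_id   TEXT PRIMARY KEY,
        professor_name TEXT NOT NULL
    );
    CREATE TABLE course_offering (
        course_id    TEXT    NOT NULL,
        offering_no  INTEGER NOT NULL,
        period       TEXT    NOT NULL,
        professor_id TEXT    NOT NULL REFERENCES professor (professor_id),
        PRIMARY KEY (course_id, offering_no)
    );
    INSERT INTO professor VALUES ('P1', 'Lee');
    INSERT INTO course_offering VALUES
        ('C101', 1, 'Fall', 'P1'),
        ('C101', 2, 'Spring', 'P1');
""")

# Renaming the professor updates one row, however many offerings exist.
cursor = conn.execute(
    "UPDATE professor SET professor_name = 'Lee-Ortiz' WHERE professor_id = 'P1'"
)
rows_updated = cursor.rowcount

# Every offering picks up the new name through the join.
names = conn.execute("""
    SELECT o.offering_no, p.professor_name
    FROM course_offering AS o
    JOIN professor AS p ON p.professor_id = o.professor_id
    ORDER BY o.offering_no
""").fetchall()

print(rows_updated, names)
```

Had the name stayed on Course Offering, the same change would have required one update per offering, with the risk of missing some of them.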

 

 

其他範式

我們爲一個組織設計業務模型時,常常到3NF爲止。然而,你應該注意到,其它層次的規範化及他們好處決定了3NF是否能滿足你的組織。有幾本好的數據建模書討論BC範式、第四範式和第五範式。我們不會在本書的後面章節討論這些。

 

注意

我們再次警告你不要過度熱心的使用這些規範化技術。換句話說,不要過度規範化你的模型。你應該使用結構化的一致性平衡考慮業務意義,你應該永遠把模型基於業務概念放在首位,然後使用規範化技術去驗證結構的完整性和一致性。

 

Other Normalization Levels

We usually stop with 3NF when we design the business model for organizations. However, you should be aware of other levels of normalization and their benefits to determine if 3NF will be sufficient for your organization. There are several good data-modeling books that discuss the merits of Boyce/Codd and fourth and fifth normal forms. We will not discuss these further in this book.

 

WARNING

We caution you against overzealous usage of these normalization techniques. In other words, don’t overnormalize your models. You should balance the consideration of business meanings with structural consistency. You should always base your model on business concepts first, then apply the normalization techniques to verify the structural integrity and consistency.

小結

我們在這一章討論了數據模型作爲公司資產是非常必要的。沒有這一套數據模型,業務用戶和創建系統的技術人員不能開發一個複雜的、準確的東西來表示信息結構、業務規則、數據之間的關係。只有在設計時記住重用、一致性、完整性等概念,並嚴格遵守本章提出的建模技術才能完成。我們創建各種數據模型來滿足這個需要:從主題域圖形開始,到業務數據模型、到系統模型、到技術模型——每一類模型定義一個不同的抽象和轉換層次——最後引導致一個一致的、完全集成的數據庫模式。我們看到主題域圖形可以很快開發出來,一般在幾天內就可以完成;全部業務模型需要稍長一些時間,包含整個企業的業務規則,能被數據倉庫的所有應用使用。有一點非常重要,如果你的組織沒有這些模型,我們建議你創建一個企業範圍內的主題域圖,而把焦點放在數據倉庫需要的那些主題上,用來建立業務模型。當新的主題域加入到數據倉庫時,你可以繼續把他們加入數據模型,而不會牽制到整個數據模型。第4章會介紹更多這方面的信息。

Summary

We have discussed in this chapter the fact that data models are essential for managing data as a corporate asset. Without the set of data models described here, the business users and technical staff creating systems cannot develop a comprehensive and precise representation of information structures, business rules, and relationships among the data. This can only be accomplished when the databases are designed with the concepts of reusability, consistency, and integration in mind and with rigorous compliance with the modeling techniques contained in this chapter. We covered the need for various data models, starting with a subject area diagram and migrating to a business data model, the system models, and the technology models. Each one defines a different level of abstraction and transformation, ultimately leading to a coordinated and fully integrated database schema. We see that the subject area diagram can be developed very quickly, usually within a few days. The full business data model will take a bit longer and, when fully developed, contains the business rules for the entire enterprise and can be used by all applications, including the data warehouse. It is important to note here that if your organization does not have either of these models, we recommend that you create an enterprise-wide subject area diagram but focus on only the subject area(s) needed for your data warehouse for the business data model. You will continue to fill out the business and data warehouse data models as new areas are needed in the data warehouse but should not get sidetracked into trying to create the entire business data model before you generate your first data warehouse subject area. See Chapter 4 for more information on this.

 

開發任何模型時,我們必須記住並遵循三個方針:數據模型是溝通的工具,包含最低明細級的數據,必須面向企業。當面對不同的決策時,應該考慮這三個方針。我們還在本章學到規範化及其給數據庫設計帶來的好處,我們建議你的業務模型滿足3NF。在3NF,屬性的依賴關係如下:

1NF。屬性依賴於鍵,通過移出重複的組得到。

2NF。屬性依賴於全部主鍵,通過移出依賴於部分主鍵的屬性得到。

3NF。屬性只依賴於鍵,通過移出依賴於非主鍵的屬性而得到。

我們對過度規範化及分析麻痹提出警告。最後,鍵保持模型的完整性、一致性、可重用性。這些模型快速產生穩定的、可維護的數據庫(或者數據庫的子集),這樣,纔可以繼續數據倉庫的分析和交付工作。

There are three important guidelines to follow in developing any of the models we discuss. These are to remember that the data model is a communication tool, that it contains the lowest common denominator of detail, and that it reflects a solid business orientation. When confronted with difficult decisions, these three guidelines should rule the day. We also learned in this chapter about normalization and its benefits for database design. Our recommendation is to develop your business data model in 3NF. In 3NF, attributes are dependent:

1NF. On the key, accomplished by removing repeating groups

2NF. The whole key, accomplished by removing attributes dependent on part of the key

3NF. Nothing but the key, accomplished by removing attributes dependent on nonkey attributes

We warn against overzealous normalization and analysis paralysis. At the end of the day, the key is to get a set of models that is fully integrated, consistent, and reusable. These models produce stable and maintainable databases (or a subset of them) quickly so that work can proceed on providing timely business deliverables from the data warehouse.

 

 
