Chapter 2 Data Exploration

Contents

1. What is Data:

    A. Data Types

    B. Record Data

    C. Types of Attributes

2. Data Exploration:

    A. About Data Quality

    B. Preprocessing

        ① Quality

        ② Sampling

        ③ Attribute Selection

        ④ Dimensionality Reduction

        ⑤ Discretization: Binning

        ⑥ Statistics

        ⑦ Visualization

        

1. What is Data:

 A.  Data Types: Document Data, Transaction Data, Graph Data, Sequence Data, Spatial-Temporal Data, Record Data, Data Matrix

Spatial  [ˈspeɪʃl]   relating to space
Temporal [ˈtempərəl] relating to time

 B.  Record Data:

  A collection of data objects and their attributes.

  An attribute is a property or characteristic of an object.

  A collection of attributes describes an object.

 property        [ˈprɑːpərti]       a quality or trait
 characteristic  [ˌkærəktəˈrɪstɪk]  a distinguishing feature

 C.  Types of Attributes:

      ① Discrete Attribute and Continuous Attribute

      ② Nominal Attribute and Ordinal Attribute

      ③ Interval Attribute and Ratio Attribute

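Nominal  [ˈnɒmɪnl]   in name only (labels without order)
Ordinal  [ˈɔːrdənl]  indicating order or rank
Interval [ˈɪntəvl]   the distance between two values
Ratio    [ˈreɪʃioʊ]  the quotient of two values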

2. Data Exploration:

 A. About Data Quality: Data in the real world is dirty. 

 ① incomplete: lacking attribute values

 ② noisy: data errors, outliers

 ③ inconsistent: discrepancy between duplicate records

outlier      [ˈaʊtlaɪər]    lying apart from the group, anomalous
discrepancy  [dɪsˈkrepənsi] difference, inconsistency
duplicate    [ˈduːplɪkeɪt]  exactly the same, copied

 B. Preprocessing:

 ① Quality: Handle missing values (ignore or estimate), remove outliers, resolve conflicts (merge or identify)
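The two ways of handling missing values, and outlier removal, can be sketched as follows. This is only an illustration under stated assumptions: the `ages` values are made up, the estimate uses the median of the observed values, and outliers are detected with the 1.5 × IQR rule of thumb, none of which is prescribed by these notes.

```python
from statistics import median

# Hypothetical attribute values; None marks a missing value.
ages = [23, 25, None, 31, 29, None, 27, 120]

# Option 1 (ignore): drop the objects whose value is missing.
observed = [a for a in ages if a is not None]

# Option 2 (estimate): fill each gap with the median of the observed values.
# The median is an assumption here; the mean is another common choice.
fill_value = median(observed)
estimated = [a if a is not None else fill_value for a in ages]


def remove_outliers(values, k=1.5):
    """Keep only values within k * IQR of the lower and upper quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # median of the lower half
    q3 = median(ordered[(n + 1) // 2:])   # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


cleaned = remove_outliers(observed)
print(estimated)
print(cleaned)
```

In this toy data the value 120 falls outside the quartile fences and is dropped, while the remaining ages are kept.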

 ② Sampling:

      Key principle: using a sample works almost as well as using the entire data set, if the sample is representative.

      A sample is representative if it has approximately the same properties as the original data set.

      Types of Sampling: Simple Random Sampling, Sampling without Replacement, Sampling with Replacement, Stratified Sampling
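A minimal sketch of the sampling types listed above, using only the Python standard library. The population, the labelled `records`, the 20% fraction, and the helper `stratified_sample` are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))

# Sampling without replacement: each object can be selected at most once.
without_replacement = rng.sample(population, k=10)

# Sampling with replacement: the same object may be selected more than once.
with_replacement = rng.choices(population, k=10)


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction from every group (stratum) defined by key."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical labelled records: 90 of class "a" and 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.2, rng=rng)
print(len(sample))
```

Drawing the same fraction from each stratum keeps the rare class "b" in the sample in its original proportion, which a small simple random sample might miss.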

      Sampling Rate:

 ③ Attribute Selection: Redundant Attributes and Irrelevant Attributes

stratified  [ˈstrætɪfaɪd] arranged in layers
redundant   [rɪˈdʌndənt]  superfluous, duplicating other information
irrelevant  [ɪˈreləvənt]  unrelated

 ④ Dimensionality Reduction:

      Reduce the number of attributes by creating a new set of attributes.

 ⑤ Discretization: Binning

      Converts numerical data into categorical data

      Divides the range into N intervals
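Binning can be sketched as below, assuming equal-width intervals (the notes do not say how the range is divided, so equal width is an assumption). The helper `equal_width_bins`, the `temperatures` values, and the category names are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index of one of n_bins equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    labels = []
    for v in values:
        index = int((v - low) / width)
        # The maximum value would land in bin n_bins; fold it into the last bin.
        labels.append(min(index, n_bins - 1))
    return labels


temperatures = [0, 5, 12, 18, 24, 30]  # hypothetical numerical attribute
bins = equal_width_bins(temperatures, n_bins=3)

# Map bin indices to category names to obtain a categorical attribute.
names = ["low", "medium", "high"]
categories = [names[b] for b in bins]
print(categories)
```

Here the range 0 to 30 is divided into N = 3 intervals of width 10, and each numerical value is replaced by the name of its interval.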

 ⑥ Statistics:

      Measures of Center: Mean, Median

      Frequency Distribution: Mode

      Measures of Variability: Variance, Standard Deviation
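These summary statistics are all available in Python's standard `statistics` module. The sketch below uses made-up `scores` and the population forms `pvariance` and `pstdev`; whether the population or the sample form is intended is not stated in these notes, so that choice is an assumption.

```python
from statistics import mean, median, mode, pvariance, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical attribute values

center_mean = mean(scores)        # arithmetic average
center_median = median(scores)    # middle value of the sorted data
most_frequent = mode(scores)      # value that occurs most often
variance = pvariance(scores)      # mean squared distance from the mean
std_dev = pstdev(scores)          # square root of the variance

print(center_mean, center_median, most_frequent, variance, std_dev)
```

For sample statistics, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.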

 ⑦ Visualization:

      Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

      Visualization of data is one of the most powerful and appealing techniques for Data Exploration.

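dimensionality [dɪˌmɛnʃəˈnæləti] the number of dimensions
discretization   conversion into discrete values
binning   [ˈbɪnɪŋ]  grouping into bins
categorical  [ˌkætəˈɡɔːrɪkl] relating to categories
mode  the most frequent value
deviation  departure from a reference value
tabular [ˈtæbjələr] arranged as a table
appealing  attractive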

      Examples of Visualization:

          Sea Surface Temperature

          Histogram [ˈhɪstəɡræm]

          Box Plot

          Scatter Plot

          Correlation Matrix
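A correlation matrix holds the pairwise correlation between attributes and is usually drawn as a colored grid. The sketch below only computes the numbers, assuming the Pearson correlation coefficient (the notes do not name a coefficient). The helpers `pearson` and `correlation_matrix` and the three columns are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has zero spread and would divide by zero here.
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # increases linearly with x
    "z": [5, 4, 3, 2, 1],   # decreases linearly with x
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Values close to 1 indicate attributes that rise together, and values close to -1 indicate attributes that move in opposite directions.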

 

 
