Big Data in Practice [Hundred-Billion-Scale Data Warehouse]: Stage Two

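A note up front: I'm a sophomore majoring in big data application development in a software engineering department. My pen name comes from Alice in Alice in Wonderland combined with my own nickname. As a newcomer to the internet industry, I blog partly to record my own learning and partly in the hope of helping other beginners who are just getting started like me. My skills are limited, so there are bound to be mistakes in these posts; if you spot any, please don't hesitate to point them out! Personal site: http://alices.ibilibili.xyz/ , blog home page: https://alice.blog.csdn.net/
My current level may not match that of the experts, but I still hope to keep doing better, because a single day is a miniature of a whole life. I hope to be the best version of myself in my best years!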

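In this post I bring you Stage Two of Big Data in Practice [Hundred-Billion-Scale Data Warehouse].

Going by the earlier preview, let's first review the skills we need to master.

Learn and master Kettle, and use it to sync the data that the project requirements call for from MySQL to Hive.
Use Sqoop to sync the remaining data from MySQL to Hive.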

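As for how to use Kettle itself, I won't go over it again here; I already put a lot of effort into the earlier posts that explain it.

If you want the details on Kettle, head over to the 👉 Kettle column.

What follows is how to use Kettle to sync the data the project needs from MySQL into Hive.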

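First, copy the SQL file that creates the source tables in MySQL into a new folder in DataGrip.

Then select it, right-click, and run it.

Once it finishes, our cluster's MySQL has a new database, itcast_shop, which contains a large number of ready-made tables.

These are the eighty-odd tables mentioned in Stage One.

However, this project actually uses only 10 of them.

Now that all the tables are in MySQL, our job is to sync those 10 tables to Hive with Kettle, and then import the rest into Hive with Sqoop.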

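Some of you are bound to ask: why not sync everything with Sqoop? Isn't splitting the sync across two tools just asking for trouble?

Fair point, but we use Kettle here to get more practice with it. Anyone who has read my earlier Kettle posts knows how powerful it is.

Because Kettle will write the data of these 10 tables into Hive, we first need to create the tables in Hive.

Run the following CREATE TABLE statements (they require the itcast_ods database to already exist in Hive).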

-- Create the ODS-layer orders table
drop table if exists `itcast_ods`.`itcast_orders`;
create EXTERNAL table `itcast_ods`.`itcast_orders`(
    orderId            bigint,
    orderNo            string,
    shopId             bigint,
    userId             bigint,
    orderStatus        bigint,
    goodsMoney         double,
    deliverType        bigint,
    deliverMoney       double,
    totalMoney         double,
    realTotalMoney     double,
    payType            bigint,
    isPay              bigint,
    areaId             bigint,
    userAddressId      bigint,
    areaIdPath         string,
    userName           string,
    userAddress        string,
    userPhone          string,
    orderScore         bigint,
    isInvoice          bigint,
    invoiceClient      string,
    orderRemarks       string,
    orderSrc           bigint,
    needPay            double,
    payRand            bigint,
    orderType          bigint,
    isRefund           bigint,
    isAppraise         bigint,
    cancelReason       bigint,
    rejectReason       bigint,
    rejectOtherReason  string,
    isClosed           bigint,
    goodsSearchKeys    string,
    orderunique        string,
    receiveTime        string,
    deliveryTime       string,
    tradeNo            string,
    dataFlag           bigint,
    createTime         string,
    settlementId       bigint,
    commissionFee      double,
    scoreMoney         double,
    useScore           bigint,
    orderCode          string,
    extraJson          string,
    orderCodeTargetId  bigint,
    noticeDeliver      bigint,
    invoiceJson        string,
    lockCashMoney      double,
    payTime            string,
    isBatch            bigint,
    totalPayFee        bigint
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer order details table
drop table if exists `itcast_ods`.`itcast_order_goods`;
create EXTERNAL table `itcast_ods`.`itcast_order_goods`(
    ogId            bigint,
    orderId         bigint,
    goodsId         bigint,
    goodsNum        bigint,
    goodsPrice      double,
    payPrice        double,
    goodsSpecId     bigint,
    goodsSpecNames  string,
    goodsName       string,
    goodsImg        string,
    extraJson       string,
    goodsType       bigint,
    commissionRate  double,
    goodsCode       string,
    promotionJson   string,
    createtime      string
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer shops table
drop table if exists `itcast_ods`.`itcast_shops`;
create EXTERNAL table `itcast_ods`.`itcast_shops`(
    shopId             bigint,
    shopSn             string,
    userId             bigint,
    areaIdPath         string,
    areaId             bigint,
    isSelf             bigint,
    shopName           string,
    shopkeeper         string,
    telephone          string,
    shopCompany        string,
    shopImg            string,
    shopTel            string,
    shopQQ             string,
    shopWangWang       string,
    shopAddress        string,
    bankId             bigint,
    bankNo             string,
    bankUserName       string,
    isInvoice          bigint,
    invoiceRemarks     string,
    serviceStartTime   bigint,
    serviceEndTime     bigint,
    freight            bigint,
    shopAtive          bigint,
    shopStatus         bigint,
    statusDesc         string,
    dataFlag           bigint,
    createTime         string,
    shopMoney          double,
    lockMoney          double,
    noSettledOrderNum  bigint,
    noSettledOrderFee  double,
    paymentMoney       double,
    bankAreaId         bigint,
    bankAreaIdPath     string,
    applyStatus        bigint,
    applyDesc          string,
    applyTime          string,
    applyStep          bigint,
    shopNotice         string,
    rechargeMoney      double,
    longitude          double,
    latitude           double,
    mapLevel           bigint,
    BDcode             string,
    modifyTime         string
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer goods table
drop table if exists `itcast_ods`.`itcast_goods`;
create EXTERNAL table `itcast_ods`.`itcast_goods`(
    goodsId              bigint,
    goodsSn              string,
    productNo            string,
    goodsName            string,
    goodsImg             string,
    shopId               bigint,
    goodsType            bigint,
    marketPrice          double,
    shopPrice            double,
    warnStock            bigint,
    goodsStock           bigint,
    goodsUnit            string,
    goodsTips            string,
    isSale               bigint,
    isBest               bigint,
    isHot                bigint,
    isNew                bigint,
    isRecom              bigint,
    goodsCatIdPath       string,
    goodsCatId           bigint,
    shopCatId1           bigint,
    shopCatId2           bigint,
    brandId              bigint,
    goodsDesc            string,
    goodsStatus          bigint,
    saleNum              bigint,
    saleTime             string,
    visitNum             bigint,
    appraiseNum          bigint,
    isSpec               bigint,
    gallery              string,
    goodsSeoKeywords     string,
    illegalRemarks       string,
    dataFlag             bigint,
    createTime           string,
    isFreeShipping       bigint,
    goodsSerachKeywords  string,
    modifyTime           string
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer organization table
drop table if exists `itcast_ods`.`itcast_org`;
create EXTERNAL table `itcast_ods`.`itcast_org`(
    orgId        bigint,
    parentId     bigint,
    orgName      string,
    orgLevel     bigint,
    managerCode  string,
    isdelete     bigint,
    createTime   string,
    updateTime   string,
    isShow       bigint,
    orgType      bigint
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer goods category table
drop table if exists `itcast_ods`.`itcast_goods_cats`;
create EXTERNAL table `itcast_ods`.`itcast_goods_cats`(
    catId               bigint,
    parentId            bigint,
    catName             string,
    isShow              bigint,
    isFloor             bigint,
    catSort             bigint,
    dataFlag            bigint,
    createTime          string,
    commissionRate      double,
    catImg              string,
    subTitle            string,
    simpleName          string,
    seoTitle            string,
    seoKeywords         string,
    seoDes              string,
    catListTheme        string,
    detailTheme         string,
    mobileCatListTheme  string,
    mobileDetailTheme   string,
    wechatCatListTheme  string,
    wechatDetailTheme   string,
    cat_level           bigint
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer users table
drop table if exists `itcast_ods`.`itcast_users`;
create EXTERNAL table `itcast_ods`.`itcast_users`(
    userId          bigint,
    loginName       string,
    loginSecret     bigint,
    loginPwd        string,
    userType        bigint,
    userSex         bigint,
    userName        string,
    trueName        string,
    brithday        string,
    userPhoto       string,
    userQQ          string,
    userPhone       string,
    userEmail       string,
    userScore       bigint,
    userTotalScore  bigint,
    lastIP          string,
    lastTime        string,
    userFrom        bigint,
    userMoney       double,
    lockMoney       double,
    userStatus      bigint,
    dataFlag        bigint,
    createTime      string,
    payPwd          string,
    rechargeMoney   double,
    isInform        bigint
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer order refunds table
drop table if exists `itcast_ods`.`itcast_order_refunds`;
create EXTERNAL table `itcast_ods`.`itcast_order_refunds`(
    id                bigint,
    orderId           bigint,
    goodsId           bigint,
    refundTo          bigint,
    refundReson       bigint,
    refundOtherReson  string,
    backMoney         double,
    refundTradeNo     string,
    refundRemark      string,
    refundTime        string,
    shopRejectReason  string,
    refundStatus      bigint,
    createTime        string
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer user address table
drop table if exists `itcast_ods`.`itcast_user_address`;
create EXTERNAL table `itcast_ods`.`itcast_user_address`(
    addressId    bigint,
    userId       bigint,
    userName     string,
    otherName    string,
    userPhone    string,
    areaIdPath   string,
    areaId       bigint,
    userAddress  string,
    isDefault    bigint,
    dataFlag     bigint,
    createTime   string
)
partitioned by (dt string)
STORED AS PARQUET;

-- Create the ODS-layer payment methods table
drop table if exists `itcast_ods`.`itcast_payments`;
create EXTERNAL table `itcast_ods`.`itcast_payments`(
    id         bigint,
    payCode    string,
    payName    string,
    payDesc    string,
    payOrder   bigint,
    payConfig  string,
    enabled    bigint,
    isOnline   bigint,
    payFor     string
)
partitioned by (dt string)
STORED AS PARQUET;

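When this finishes, the database contains 10 empty tables.

Next, we use Kettle to read the data from MySQL and write it as Parquet files under the HDFS path where each Hive table stores its data.

If you've read this far you should already be comfortable with Kettle, so I won't go into as much detail as in the first Kettle tutorial.

For this requirement we need the Table input step, the Select values step (added where the business logic calls for it), and the Parquet output step.

We connect the required steps. Because 10 tables have to be synced at the same time, we build 10 such "pipelines".

With the steps connected, let's look at how to configure each one.

First, double-click an empty area of the canvas. We need to define a Kettle parameter there that we can reference later; it is used for the data partition.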

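Then we can configure the table input. There are four things to watch out for here.

If you want to be sure, you can also preview the data.

In Select values, unless there is a special case, we simply use Get fields to pull in the fields.

Then we can configure the Parquet file output.

Note that the location must be set to HDFS; then browse to the HDFS path where the target Hive table stores its data.
I also recommend ticking the option to overwrite existing files, so that we can rerun the transformation repeatedly without having to pick a new output path each time.
By default all fields are fetched here too. Then set the compression to Snappy and click OK.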
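
The output path decides which partition the data lands in. Hive lays out a partitioned table as one `dt=<value>` subdirectory per partition under the table's directory, so the Parquet output should point there. Here is a minimal sketch of how such a path is put together. It assumes the default warehouse root /user/hive/warehouse and a made-up partition value; neither is given in this post, and the function name is hypothetical too.

```shell
#!/bin/sh
# Sketch: build the HDFS directory for one partition of a Hive table.
# WAREHOUSE_ROOT is an assumption (Hive's default warehouse location);
# an EXTERNAL table may live elsewhere, so check the table's location
# with `describe formatted <table>` in Hive before relying on it.
WAREHOUSE_ROOT="/user/hive/warehouse"

partition_dir() {
    db="$1"; table="$2"; dt="$3"
    # Hive stores database "x" under "x.db" and partitions as "col=value"
    printf '%s/%s.db/%s/dt=%s\n' "$WAREHOUSE_ROOT" "$db" "$table" "$dt"
}

# Example with a made-up partition value
partition_dir itcast_ods itcast_orders 20190909
```

In Kettle the same idea is expressed by putting the parameter into the output path with the `${...}` variable syntax, for example a path ending in `itcast_orders/dt=${dt}`, where `dt` stands for whatever name you gave the parameter.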

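The above shows the path of one table from the MySQL read to the Parquet output. Since ten tables are involved here, the same setup has to be done ten times.

Once the flows for all 10 tables are done, you could simply run the transformation and then repair the partitions from the command line; that works just as well.

But since we've come this far, let's do it in a more elegant way.

First, create a new job.

In the job view, add the entries we need (Start, Transformation, and SQL) and connect them.

Start needs no configuration; the small lock on the hop after it means the next entry runs unconditionally.
The configuration of the Transformation entry is the key point.
Its path must be set to the local path of the transformation file we created earlier.

Next, in the SQL entry, connect to Hive and write the SQL script to run.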
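
The script itself is not shown in this post. Since the transformation writes files straight into HDFS, Hive does not yet know about the new partition directories, so the SQL entry most likely needs to repair the partitions. As a sketch under that assumption, the following generates one `msck repair table` statement for each of the ten ODS tables created above; the printed statements can be pasted into the SQL entry.

```shell
#!/bin/sh
# The ten ODS tables created by the DDL earlier in this post
ODS_TABLES="itcast_orders itcast_order_goods itcast_shops itcast_goods itcast_org itcast_goods_cats itcast_users itcast_order_refunds itcast_user_address itcast_payments"

gen_repair_sql() {
    for t in $ODS_TABLES; do
        # msck repair table registers partition directories found on HDFS
        printf 'msck repair table itcast_ods.%s;\n' "$t"
    done
}

gen_repair_sql
```

This only works if the directories follow the `dt=<value>` naming that Hive expects for the partition column.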

Once everything is configured, we can run the job.
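
The job can also be run without the Spoon GUI, using Kettle's Kitchen command-line tool, which is handy for scheduling. The sketch below only assembles and prints such a command. The install path, the job file, and the parameter name `dt` are all placeholders rather than values from this post, and it assumes the job declares the same parameter and passes it on to the transformation.

```shell
#!/bin/sh
# Both paths below are hypothetical placeholders
KETTLE_HOME="/export/servers/data-integration"
JOB_FILE="/export/jobs/mysql_to_hive.kjb"

build_kitchen_cmd() {
    dt="$1"
    # -file selects the .kjb job, -param passes a named parameter to it
    printf '%s/kitchen.sh -file=%s -param:dt=%s -level=Basic\n' \
        "$KETTLE_HOME" "$JOB_FILE" "$dt"
}

# Print the command for today's date instead of running it
build_kitchen_cmd "$(date +%Y%m%d)"
```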

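If all goes well, once the job finishes we can query the Hive tables created earlier and see that all 10 of them now contain data.

That covers how Kettle syncs MySQL to Hive. Now for something a bit more "exciting"!

There are two ways to do a full import of MySQL tables into Hive with Sqoop:

First, change into the Sqoop installation directory: cd /export/servers/sqoop-1.4.6.bin__hadoop-2.0.4-alpha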

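Method 1: copy the table structure to Hive first, then import the data

Copy the relational table's structure into Hive:

bin/sqoop create-hive-table \
--connect jdbc:mysql://NODE_IP:3306/MYSQL_DATABASE \
--table MYSQL_TABLE \
--username MYSQL_USER \
--password MYSQL_PASSWORD \
--hive-table HIVE_DATABASE.HIVE_TABLE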

Import the data from the relational database into Hive:

bin/sqoop import \
--connect jdbc:mysql://NODE_IP:3306/MYSQL_DATABASE \
--username MYSQL_USER \
--password MYSQL_PASSWORD \
--table MYSQL_TABLE \
--hive-table HIVE_DATABASE.HIVE_TABLE \
--hive-import \
-m 1

Method 2: copy the table structure and the data into Hive in one step

bin/sqoop import \
--connect jdbc:mysql://NODE_IP:3306/MYSQL_DATABASE \
--username MYSQL_USER \
--password MYSQL_PASSWORD \
--table MYSQL_TABLE \
--hive-import \
-m 1 \
--hive-database HIVE_DATABASE
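
A side note on `--password`: passing it on the command line leaves the password in the shell history and the process list. Sqoop also accepts `-P` (prompt on the console) and `--password-file`. Below is a sketch of preparing such a file. The file location and the password are placeholders, and the key detail is that the file must not end with a newline, because Sqoop uses the file's entire content as the password.

```shell
#!/bin/sh
# Sketch: write a MySQL password to a file that only the owner can read.
write_password_file() {
    file="$1"; password="$2"
    # printf '%s' writes no trailing newline (echo would add one)
    ( umask 077 && printf '%s' "$password" > "$file" )
    chmod 400 "$file"
}

# Hypothetical location and password, for illustration only
PW_FILE="${TMPDIR:-/tmp}/sqoop_mysql.pwd"
rm -f "$PW_FILE"
write_password_file "$PW_FILE" "MYSQL_PASSWORD"
ls -l "$PW_FILE"
```

To use it, replace `--password MYSQL_PASSWORD` in the commands above with `--password-file` followed by the path to this file (with a `file://` prefix if the file is on the local disk rather than on HDFS).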

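If you want to get data into a partitioned table with method 2, you can do it as follows, writing the selected rows into the partition's directory:

bin/sqoop import \
--connect jdbc:mysql://NODE_IP:3306/MYSQL_DATABASE \
--username MYSQL_USER \
--password MYSQL_PASSWORD \
--where "QUERY_CONDITION" \
--target-dir /user/hive/warehouse/OUTPUT_PATH/ \
--table MYSQL_TABLE \
-m 1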
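
This command only writes files into the target directory; it does not tell Hive about them. If the directory is meant to be a partition, it still has to be registered. Here is a sketch that prints the HiveQL for that, assuming a table partitioned by dt as in the DDL above; the table name, date, and path are placeholders.

```shell
#!/bin/sh
# Sketch: print the HiveQL that registers an imported directory as a partition.
gen_add_partition_sql() {
    table="$1"; dt="$2"; dir="$3"
    printf "alter table %s add if not exists partition (dt='%s') location '%s';\n" \
        "$table" "$dt" "$dir"
}

# Placeholder values for illustration
gen_add_partition_sql itcast_ods.some_table 20190909 \
    /user/hive/warehouse/itcast_ods.db/some_table/dt=20190909
```

Keep in mind that the file format has to match the table's storage format. Sqoop writes delimited text files unless told otherwise, for example with `--as-parquetfile`.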

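Note:

Both methods sync data from MySQL to Hive. The difference is that with method 2 as written, the Hive table gets the same name as the MySQL source table, whereas method 1 lets you choose the name of the Hive table.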
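
Since the remaining tables are numerous, typing one command per table gets tedious. The following is a sketch of a dry-run loop that prints a method 2 command for each table. The table names, the connection details, and the choice of itcast_ods as the target Hive database are placeholders or assumptions, not values from this post.

```shell
#!/bin/sh
# Placeholders: fill in the real node, credentials, and table names
JDBC_URL="jdbc:mysql://NODE_IP:3306/itcast_shop"
HIVE_DB="itcast_ods"
REMAINING_TABLES="table_a table_b table_c"

build_import_cmd() {
    table="$1"
    printf 'bin/sqoop import --connect %s --username MYSQL_USER --password MYSQL_PASSWORD --table %s --hive-import --hive-database %s -m 1\n' \
        "$JDBC_URL" "$table" "$HIVE_DB"
}

for t in $REMAINING_TABLES; do
    # Dry run: only print each command so it can be reviewed first
    build_import_cmd "$t"
done
```

Once the placeholders are filled in and the printed commands look right, they can be run from the Sqoop installation directory.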

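Summary

Stage Two of Big Data in Practice [Hundred-Billion-Scale Data Warehouse] requires you to be comfortable with the basics of Kettle, with syncing the data the project needs from MySQL to Hive, and with using Sqoop to sync the remaining data.

If there are any slips or mistakes in the steps above, please do point them out 😅

If this helped you, or you're interested in big data technology, please like and follow to show your support 🙏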
