Big Data ETL in Practice (9) ---- Loading data into PostgreSQL with pandas, SQLAlchemy, and multiprocessing


I recently needed to load roughly 2 million rows of Excel data into a PostgreSQL database. I considered several approaches:

  1. Use the raw psycopg2 API
  2. Create the table in pgAdmin and import the CSV directly
  3. Use the pandas to_sql method
  4. Use SQLAlchemy's bulk insert
  5. Use Python multiprocessing: clean the data with pandas, then bulk insert with SQLAlchemy

Let me walk through them one by one.


Groundwork

Connection class

Its main job is to manage the database connection string.

# data_to_database.py

class connet_databases:
    def __init__(self):
        '''
        Initialize the database connection string (psycopg2 driver).
        '''
        _host = ''
        _port = 5432
        _databases = ''  # e.g. 'produce'
        _username = ''
        _password = ''

        # Note: modern SQLAlchemy expects 'postgresql+psycopg2', not 'postgres+psycopg2'
        self._connect = r'postgresql+psycopg2://{username}:{password}@{host}:{port}/{databases}'.format(
            username=_username,
            password=_password,
            host=_host,
            port=_port,
            databases=_databases)



SQLAlchemy basic helpers



import traceback

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker


def init_sqlalchemy(dbname='',
                    Echo=True,
                    Base=declarative_base(),
                    DBSession=scoped_session(sessionmaker())):
    # Mainly used to (re)create the tables
    print(dbname)
    engine = create_engine(dbname,
                           max_overflow=0,  # max connections allowed beyond pool_size
                           pool_size=2,  # connection pool size
                           pool_timeout=30,  # seconds to wait for a pooled connection before erroring
                           pool_recycle=-1,  # seconds before pooled connections are recycled (reset)
                           echo=Echo)
    try:
        DBSession.remove()
        DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)

        Base.metadata.drop_all(engine)
        Base.metadata.create_all(engine)

        DBSession.flush()
        DBSession.commit()

    except Exception:
        error = traceback.format_exc()
        Multiprocess_loaddata_toDatabase.log.logger.error(error)  # project-specific logger

    finally:
        DBSession.remove()
        engine.dispose()


def insert_list(list_obj, DBSession):
    # Bulk-add a list of ORM objects and commit them in one transaction
    try:
        DBSession.add_all(list_obj)
        DBSession.flush()
        DBSession.commit()

    except Exception:
        DBSession.rollback()
        raise



def get_conn(dbname, Echo=True):
    # Build an engine and return a thread-local session factory bound to it
    engine = create_engine(dbname, echo=Echo)
    DBSession = scoped_session(sessionmaker())
    # scoped_session is already thread-local, so no remove() is needed here
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)

    return DBSession



SQLAlchemy table schema example


from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker

from sqlalchemy import Column, TEXT, String, Integer, DateTime, Float


Base = declarative_base()

class DetailsOfDrugsItems(Base):
    '''
    Detail records of treatment / drug items.
    '''
    __tablename__ = 'details_of_drugs_items'

    # Table structure (the Chinese column names are kept as-is):
    id = Column(String(64), primary_key=True)
    結算編號 = Column(String(64), index=True)  # settlement number
    單價 = Column(Float)      # unit price
    數量 = Column(Float)      # quantity
    總金額 = Column(Float)    # total amount
    結算日期 = Column(DateTime)  # settlement date

    # No custom __init__: the declarative base already provides a keyword
    # constructor, and an empty override would disable it
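
A minimal sketch of how these pieces fit together (the connection string must be filled in; the field values below are made up):

conn_str = connet_databases()._connect

init_sqlalchemy(conn_str, Base=Base)        # drop and recreate the tables
DBSession = get_conn(conn_str, Echo=False)  # thread-local session factory

row = DetailsOfDrugsItems(id='0001', 結算編號='JS20200001',
                          單價=3.5, 數量=2.0, 總金額=7.0)
insert_list([row], DBSession)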


Raw psycopg2 API

Documentation: https://www.psycopg.org/docs/module.html
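
For reference, a minimal sketch of what a bulk insert looks like through the raw API, using psycopg2.extras.execute_values (the connection parameters and the sample rows are placeholders):

import psycopg2
from psycopg2.extras import execute_values

# Placeholder DSN -- fill in real credentials
conn = psycopg2.connect(host='', port=5432, dbname='', user='', password='')
cur = conn.cursor()

rows = [('0001', 3.5, 2.0, 7.0), ('0002', 12.0, 1.0, 12.0)]  # made-up sample rows
execute_values(
    cur,
    'INSERT INTO details_of_drugs_items (id, 單價, 數量, 總金額) VALUES %s',
    rows)

conn.commit()
cur.close()
conn.close()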


Importing with pgAdmin

Documentation: https://www.pgadmin.org/docs/pgadmin4/development/import_export_data.html

Three import file formats are supported:

binary for a .bin file.
csv for a .csv file.
text for a .txt file.

[Screenshot: the pgAdmin Import/Export Data dialog]

Actual import speed is still to be benchmarked.
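
pgAdmin's import is essentially a COPY under the hood; the same fast path can be scripted with psycopg2's copy_expert, which would be worth including in that benchmark. A sketch, assuming the table already exists (the file path and connection parameters are placeholders):

import psycopg2

conn = psycopg2.connect(host='', port=5432, dbname='', user='', password='')
cur = conn.cursor()

with open('details.csv', 'r', encoding='utf-8') as f:  # placeholder path
    cur.copy_expert(
        'COPY details_of_drugs_items FROM STDIN WITH (FORMAT csv, HEADER true)',
        f)

conn.commit()
conn.close()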


Cleaning data with pandas and loading it with to_sql

Data cleaning

For the details of data cleaning with pandas, see my earlier article:

Big Data ETL in Practice (5) ---- pandas as an ETL workhorse

# pandas_to_postgresql.py

import pandas as pd


def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Downcast dtypes to save memory

    INPUT  -> list of int column names, list of float column names, df

    OUTPUT -> df updated in place with smaller dtypes
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')


def convert_str_datetime(df):
    '''
    AIM    -> Convert a datetime string column to a proper datetime column

    INPUT  -> df with a 'transdate' string column

    OUTPUT -> df updated in place with a new 'timestamp' column
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

from sqlalchemy import Column, TEXT, String, Integer, DateTime, Float


# Build the dtype dict that to_sql expects from a DataFrame's dtypes
# (note: the values must be SQLAlchemy types)
def mapping_df_types(df):
    dtypedict = {}
    for i, j in zip(df.columns, df.dtypes):
        if "object" in str(j):
            dtypedict.update({i: String(64)})
        if "float" in str(j):
            dtypedict.update({i: Float})
        if "int" in str(j):
            dtypedict.update({i: Float})
    return dtypedict
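
A quick sketch of applying these helpers to the Excel source (the file name and column lists are made up for illustration):

import pandas as pd
import pandas_to_postgresql

df = pd.read_excel('details_of_drugs_items.xlsx')  # hypothetical source file
pandas_to_postgresql.change_dtypes(col_int=[], col_float=['單價', '數量', '總金額'], df=df)
dtype_dict = pandas_to_postgresql.mapping_df_types(df)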

A few data-masking examples:

Masking names


def desensitization_name(name):
    # Keep the first character and mask the rest
    new_name = str(name)[0] + '**'
    return new_name

Masking employers and addresses

import random

def desensitization_location(location):
    # Mask a random-length prefix of the string with asterisks
    length = random.randint(2, len(location))
    return '*' * length + location[length:]

# Mask the basic sensitive fields
明細['姓名'] = 明細['姓名'].apply(pandas_to_postgresql.desensitization_name)
明細['單位名稱'] = 明細['單位名稱'].apply(pandas_to_postgresql.desensitization_location)

Loading data with to_sql

Reference: the pandas DataFrame.to_sql documentation


from sqlalchemy import create_engine
from sqlalchemy.types import Integer

import data_to_database

engine = create_engine(data_to_database.connet_databases()._connect, echo=False)
df.to_sql('integers', con=engine, index=False,
          dtype={"A": Integer()})



Bulk loading with SQLAlchemy

I have to say, SQLAlchemy's documentation is genuinely hard to read.

Performance tuning

In fact, it seems to come down to just adding a parameter:

https://www.psycopg.org/docs/extras.html#fast-execution-helpers

Modern versions of psycopg2 include a feature known as Fast Execution Helpers , which have been shown in benchmarking to improve psycopg2’s executemany() performance, primarily with INSERT statements, by multiple orders of magnitude. SQLAlchemy allows this extension to be used for all executemany() style calls invoked by an Engine when used with multiple parameter sets, which includes the use of this feature both by the Core as well as by the ORM for inserts of objects with non-autogenerated primary key values, by adding the executemany_mode flag to create_engine():

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='batch')

Possible options for executemany_mode include:

None - By default, psycopg2’s extensions are not used, and the usual cursor.executemany() method is used when invoking batches of statements.

‘batch’ - Uses psycopg2.extras.execute_batch so that multiple copies of a SQL query, each one corresponding to a parameter set passed to executemany(), are joined into a single SQL string separated by a semicolon. This is the same behavior as was provided by the use_batch_mode=True flag.

‘values’- For Core insert() constructs only (including those emitted by the ORM automatically), the psycopg2.extras.execute_values extension is used so that multiple parameter sets are grouped into a single INSERT statement and joined together with multiple VALUES expressions. This method requires that the string text of the VALUES clause inside the INSERT statement is manipulated, so is only supported with a compiled insert() construct where the format is predictable. For all other constructs, including plain textual INSERT statements not rendered by the SQLAlchemy expression language compiler, the psycopg2.extras.execute_batch method is used. It is therefore important to note that “values” mode implies that “batch” mode is also used for all statements for which “values” mode does not apply.

For both strategies, the executemany_batch_page_size and executemany_values_page_size arguments control how many parameter sets should be represented in each execution. Because “values” mode implies a fallback down to “batch” mode for non-INSERT statements, there are two independent page size arguments. For each, the default value of None means to use psycopg2’s defaults, which at the time of this writing are quite low at 100. For the execute_values method, a number as high as 10000 may prove to be performant, whereas for execute_batch, as the number represents full statements repeated, a number closer to the default of 100 is likely more appropriate:

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values',
    executemany_values_page_size=10000, executemany_batch_page_size=500)
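
Putting that flag together with the helpers above, here is a hedged sketch of what the bulk path might look like for this data set (the page sizes follow the doc's suggestions and are not benchmarked here):

from sqlalchemy import create_engine

import data_to_database

engine = create_engine(
    data_to_database.connet_databases()._connect,
    executemany_mode='values',
    executemany_values_page_size=10000,
    executemany_batch_page_size=500)

# A Core insert() with a list of parameter dicts takes the execute_values path
records = df.to_dict(orient='records')  # df cleaned with pandas as above
with engine.connect() as conn:
    conn.execute(DetailsOfDrugsItems.__table__.insert(), records)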