Big Data ETL in Practice (9) ---- Loading data into PostgreSQL with pandas, SQLAlchemy, and multiprocessing


I recently needed to load roughly 2 million rows of Excel data into a PostgreSQL database. I considered several approaches:

  1. Use the raw psycopg2 API
  2. Create the table in pgAdmin and import the CSV directly
  3. Use the pandas to_sql method
  4. Use SQLAlchemy's bulk insert
  5. Use Python multiprocessing: clean the data with pandas, then bulk insert with SQLAlchemy

Let me walk through them one by one.


Groundwork

Connection class

Its main job is to manage the database connection string.

# data_to_database.py

class connet_databases:
    def __init__(self):
        '''
        Initialize the database connection string (psycopg2 driver).
        '''
        _host = ''
        _port = 5432
        _databases = ''  # e.g. 'produce'
        _username = ''
        _password = ''

        # Note: modern SQLAlchemy expects 'postgresql+psycopg2', not 'postgres+psycopg2'
        self._connect = r'postgresql+psycopg2://{username}:{password}@{host}:{port}/{databases}'.format(
            username=_username,
            password=_password,
            host=_host,
            port=_port,
            databases=_databases)



SQLAlchemy basic helpers



import traceback

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker


def init_sqlalchemy(dbname='',
                    Echo=True,
                    Base=declarative_base(),
                    DBSession=scoped_session(sessionmaker())):
    # Mainly used to (re)create the tables
    print(dbname)
    engine = create_engine(dbname,
                           max_overflow=0,  # max connections allowed beyond pool_size
                           pool_size=2,  # connection pool size
                           pool_timeout=30,  # seconds to wait for a pooled connection before erroring
                           pool_recycle=-1,  # seconds before pooled connections are recycled (reset)
                           echo=Echo)
    try:
        DBSession.remove()
        DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)

        Base.metadata.drop_all(engine)
        Base.metadata.create_all(engine)

        DBSession.flush()
        DBSession.commit()

    except Exception:
        error = traceback.format_exc()
        Multiprocess_loaddata_toDatabase.log.logger.error(error)  # project-specific logger

    finally:
        DBSession.remove()
        engine.dispose()


def insert_list(list_obj, DBSession):
    # Bulk-add a list of ORM objects and commit them in one transaction
    try:
        DBSession.add_all(list_obj)
        DBSession.flush()
        DBSession.commit()

    except Exception:
        DBSession.rollback()
        raise



def get_conn(dbname, Echo=True):
    # Build an engine and return a thread-local session factory bound to it
    engine = create_engine(dbname, echo=Echo)
    DBSession = scoped_session(sessionmaker())
    # scoped_session is already thread-local, so no remove() is needed here
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)

    return DBSession



SQLAlchemy table schema example


from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker

from sqlalchemy import Column, TEXT, String, Integer, DateTime, Float


Base = declarative_base()

class DetailsOfDrugsItems(Base):
    '''
    Detail records of treatment / drug items.
    '''
    __tablename__ = 'details_of_drugs_items'

    # Table structure (the Chinese column names are kept as-is):
    id = Column(String(64), primary_key=True)
    結算編號 = Column(String(64), index=True)  # settlement number
    單價 = Column(Float)      # unit price
    數量 = Column(Float)      # quantity
    總金額 = Column(Float)    # total amount
    結算日期 = Column(DateTime)  # settlement date

    # No custom __init__: the declarative base already provides a keyword
    # constructor, and an empty override would disable it
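
A minimal sketch of how these pieces fit together (the connection string must be filled in; the field values below are made up):

conn_str = connet_databases()._connect

init_sqlalchemy(conn_str, Base=Base)        # drop and recreate the tables
DBSession = get_conn(conn_str, Echo=False)  # thread-local session factory

row = DetailsOfDrugsItems(id='0001', 結算編號='JS20200001',
                          單價=3.5, 數量=2.0, 總金額=7.0)
insert_list([row], DBSession)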


Raw psycopg2 API

Documentation: https://www.psycopg.org/docs/module.html
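
For reference, a minimal sketch of what a bulk insert looks like through the raw API, using psycopg2.extras.execute_values (the connection parameters and the sample rows are placeholders):

import psycopg2
from psycopg2.extras import execute_values

# Placeholder DSN -- fill in real credentials
conn = psycopg2.connect(host='', port=5432, dbname='', user='', password='')
cur = conn.cursor()

rows = [('0001', 3.5, 2.0, 7.0), ('0002', 12.0, 1.0, 12.0)]  # made-up sample rows
execute_values(
    cur,
    'INSERT INTO details_of_drugs_items (id, 單價, 數量, 總金額) VALUES %s',
    rows)

conn.commit()
cur.close()
conn.close()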


Importing with pgAdmin

Documentation: https://www.pgadmin.org/docs/pgadmin4/development/import_export_data.html

Three import file formats are supported:

binary for a .bin file.
csv for a .csv file.
text for a .txt file.

[Screenshot: the pgAdmin Import/Export Data dialog]

Actual import speed is still to be benchmarked.
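
pgAdmin's import is essentially a COPY under the hood; the same fast path can be scripted with psycopg2's copy_expert, which would be worth including in that benchmark. A sketch, assuming the table already exists (the file path and connection parameters are placeholders):

import psycopg2

conn = psycopg2.connect(host='', port=5432, dbname='', user='', password='')
cur = conn.cursor()

with open('details.csv', 'r', encoding='utf-8') as f:  # placeholder path
    cur.copy_expert(
        'COPY details_of_drugs_items FROM STDIN WITH (FORMAT csv, HEADER true)',
        f)

conn.commit()
conn.close()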


Cleaning data with pandas and loading it with to_sql

Data cleaning

For the details of data cleaning with pandas, see my earlier article:

Big Data ETL in Practice (5) ---- pandas as an ETL workhorse

# pandas_to_postgresql.py

import pandas as pd


def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Downcast dtypes to save memory

    INPUT  -> list of int column names, list of float column names, df

    OUTPUT -> df updated in place with smaller dtypes
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')


def convert_str_datetime(df):
    '''
    AIM    -> Convert a datetime string column to a proper datetime column

    INPUT  -> df with a 'transdate' string column

    OUTPUT -> df updated in place with a new 'timestamp' column
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

from sqlalchemy import Column, TEXT, String, Integer, DateTime, Float


# Build the dtype dict that to_sql expects from a DataFrame's dtypes
# (note: the values must be SQLAlchemy types)
def mapping_df_types(df):
    dtypedict = {}
    for i, j in zip(df.columns, df.dtypes):
        if "object" in str(j):
            dtypedict.update({i: String(64)})
        if "float" in str(j):
            dtypedict.update({i: Float})
        if "int" in str(j):
            dtypedict.update({i: Float})
    return dtypedict
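
A quick sketch of applying these helpers to the Excel source (the file name and column lists are made up for illustration):

import pandas as pd
import pandas_to_postgresql

df = pd.read_excel('details_of_drugs_items.xlsx')  # hypothetical source file
pandas_to_postgresql.change_dtypes(col_int=[], col_float=['單價', '數量', '總金額'], df=df)
dtype_dict = pandas_to_postgresql.mapping_df_types(df)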

A few data-masking examples:

Masking names


def desensitization_name(name):
    # Keep the first character and mask the rest
    new_name = str(name)[0] + '**'
    return new_name

Masking employers and addresses

import random

def desensitization_location(location):
    # Mask a random-length prefix of the string with asterisks
    length = random.randint(2, len(location))
    return '*' * length + location[length:]

# Mask the basic sensitive fields
明細['姓名'] = 明細['姓名'].apply(pandas_to_postgresql.desensitization_name)
明細['單位名稱'] = 明細['單位名稱'].apply(pandas_to_postgresql.desensitization_location)

Loading data with to_sql

Reference: the pandas DataFrame.to_sql documentation


from sqlalchemy import create_engine
from sqlalchemy.types import Integer

import data_to_database

engine = create_engine(data_to_database.connet_databases()._connect, echo=False)
df.to_sql('integers', con=engine, index=False,
          dtype={"A": Integer()})



Bulk loading with SQLAlchemy

I have to say, SQLAlchemy's documentation is genuinely hard to read.

Performance tuning

In fact, it seems to come down to just adding a parameter:

https://www.psycopg.org/docs/extras.html#fast-execution-helpers

Modern versions of psycopg2 include a feature known as Fast Execution Helpers , which have been shown in benchmarking to improve psycopg2’s executemany() performance, primarily with INSERT statements, by multiple orders of magnitude. SQLAlchemy allows this extension to be used for all executemany() style calls invoked by an Engine when used with multiple parameter sets, which includes the use of this feature both by the Core as well as by the ORM for inserts of objects with non-autogenerated primary key values, by adding the executemany_mode flag to create_engine():

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='batch')

Possible options for executemany_mode include:

None - By default, psycopg2’s extensions are not used, and the usual cursor.executemany() method is used when invoking batches of statements.

‘batch’ - Uses psycopg2.extras.execute_batch so that multiple copies of a SQL query, each one corresponding to a parameter set passed to executemany(), are joined into a single SQL string separated by a semicolon. This is the same behavior as was provided by the use_batch_mode=True flag.

‘values’- For Core insert() constructs only (including those emitted by the ORM automatically), the psycopg2.extras.execute_values extension is used so that multiple parameter sets are grouped into a single INSERT statement and joined together with multiple VALUES expressions. This method requires that the string text of the VALUES clause inside the INSERT statement is manipulated, so is only supported with a compiled insert() construct where the format is predictable. For all other constructs, including plain textual INSERT statements not rendered by the SQLAlchemy expression language compiler, the psycopg2.extras.execute_batch method is used. It is therefore important to note that “values” mode implies that “batch” mode is also used for all statements for which “values” mode does not apply.

For both strategies, the executemany_batch_page_size and executemany_values_page_size arguments control how many parameter sets should be represented in each execution. Because “values” mode implies a fallback down to “batch” mode for non-INSERT statements, there are two independent page size arguments. For each, the default value of None means to use psycopg2’s defaults, which at the time of this writing are quite low at 100. For the execute_values method, a number as high as 10000 may prove to be performant, whereas for execute_batch, as the number represents full statements repeated, a number closer to the default of 100 is likely more appropriate:

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values',
    executemany_values_page_size=10000, executemany_batch_page_size=500)
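
Putting that flag together with the helpers above, here is a hedged sketch of what the bulk path might look like for this data set (the page sizes follow the doc's suggestions and are not benchmarked here):

from sqlalchemy import create_engine

import data_to_database

engine = create_engine(
    data_to_database.connet_databases()._connect,
    executemany_mode='values',
    executemany_values_page_size=10000,
    executemany_batch_page_size=500)

# A Core insert() with a list of parameter dicts takes the execute_values path
records = df.to_dict(orient='records')  # df cleaned with pandas as above
with engine.connect() as conn:
    conn.execute(DetailsOfDrugsItems.__table__.insert(), records)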