Outline
I recently needed to load roughly 2 million rows of Excel-format data into a PostgreSQL database. I considered several approaches:
- the native psycopg2 API
- creating the table in pgAdmin and importing the CSV directly
- the pandas to_sql method
- SQLAlchemy bulk inserts
- Python multiprocessing: clean the data with pandas, then bulk-insert with SQLAlchemy
Let me walk you through them.
Groundwork
Connection class
Its job is to manage the connection string used when connecting to the database.
# data_to_database.py
class connet_databases:
    def __init__(self):
        '''
        Build the PostgreSQL connection string (psycopg2 driver).
        '''
        _host = ''
        _port = 5432
        _databases = ''  # e.g. 'produce'
        _username = ''
        _password = ''
        self._connect = r'postgresql+psycopg2://{username}:{password}@{host}:{port}/{databases}'.format(
            username=_username,
            password=_password,
            host=_host,
            port=_port,
            databases=_databases)
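As a quick sanity check, the formatting above yields a standard SQLAlchemy URL (the modern dialect name is postgresql+psycopg2); the credentials below are made up:

```python
# Reproduce the connection-string formatting with placeholder credentials.
url = r'postgresql+psycopg2://{username}:{password}@{host}:{port}/{databases}'.format(
    username='user',
    password='secret',
    host='localhost',
    port=5432,
    databases='produce')
print(url)  # postgresql+psycopg2://user:secret@localhost:5432/produce
```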
Basic SQLAlchemy helpers
import logging
import traceback

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker

def init_sqlalchemy(dbname='',
                    Echo=True,
                    Base=declarative_base(),
                    DBSession=scoped_session(sessionmaker())):
    # Drop and recreate all tables defined on Base
    print(dbname)
    engine = create_engine(dbname,
                           max_overflow=0,    # connections allowed beyond pool_size
                           pool_size=2,       # connection-pool size
                           pool_timeout=30,   # seconds to wait for a free connection before erroring
                           pool_recycle=-1,   # seconds before a pooled connection is recycled (reset)
                           echo=Echo)
    try:
        DBSession.remove()
        DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
        Base.metadata.drop_all(engine)
        Base.metadata.create_all(engine)
        DBSession.flush()
        DBSession.commit()
    except Exception:
        logging.error(traceback.format_exc())
    finally:
        DBSession.remove()
        engine.dispose()
def insert_list(list_obj, DBSession):
    # Add a list of ORM objects and commit them in one transaction
    try:
        DBSession.add_all(list_obj)
        DBSession.flush()
        DBSession.commit()
    except Exception:
        DBSession.rollback()
        raise
def get_conn(dbname, Echo=True):
    # Return a thread-local session factory bound to the given database URL
    engine = create_engine(dbname, echo=Echo)
    DBSession = scoped_session(sessionmaker())
    # scoped_session is itself thread-local, so no remove() is needed here
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    return DBSession
Sample SQLAlchemy table schema
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy import Column, TEXT, String, Integer, DateTime, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()

class DetailsOfDrugsItems(Base):
    '''
    Detail records of treatment/drug items.
    '''
    __tablename__ = 'details_of_drugs_items'
    # Table structure (column names kept in Chinese, matching the source data):
    id = Column(String(64), primary_key=True)
    結算編號 = Column(String(64), index=True)  # settlement number
    單價 = Column(Float)                       # unit price
    數量 = Column(Float)                       # quantity
    總金額 = Column(Float)                     # total amount
    結算日期 = Column(DateTime)                # settlement date
The native psycopg2 API
Documentation: https://www.psycopg.org/docs/module.html
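The docs boil down to cursor.execute/executemany plus the fast helpers in psycopg2.extras. Below is a minimal sketch; the DSN and column list are hypothetical placeholders, and psycopg2 is imported lazily so the pure chunking helper runs without a database:

```python
def chunks(rows, size):
    # Split a list of rows into batches of at most `size` items
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def bulk_insert(rows, dsn='dbname=produce user=postgres'):
    # Batched INSERT using psycopg2's execute_values fast helper.
    import psycopg2                      # third-party driver, imported lazily
    from psycopg2.extras import execute_values
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            for batch in chunks(rows, 10000):
                execute_values(
                    cur,
                    'INSERT INTO details_of_drugs_items (id, 單價, 數量) VALUES %s',
                    batch)
    finally:
        conn.close()
```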
pgAdmin import
Documentation: https://www.pgadmin.org/docs/pgadmin4/development/import_export_data.html
The import dialog supports three file formats:
- binary for a .bin file
- csv for a .csv file
- text for a .txt file
Import speed remains to be benchmarked.
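Under the hood, pgAdmin's import dialog issues a COPY statement, so the same load can be scripted. A sketch, with hypothetical table and DSN; psycopg2 is imported lazily so the SQL-building part runs anywhere:

```python
def copy_sql(table, columns):
    # The COPY ... FROM STDIN statement that a CSV import boils down to
    return 'COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)'.format(
        table, ', '.join(columns))

def load_csv(path, table, columns, dsn):
    # Stream a CSV file straight into the table via COPY
    import psycopg2                      # third-party driver, imported lazily
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur, open(path, encoding='utf-8') as f:
            cur.copy_expert(copy_sql(table, columns), f)
    finally:
        conn.close()
```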
Cleaning data with pandas and loading it with to_sql
Data cleaning
For the details of cleaning data with pandas, see my earlier article:
大數據ETL實踐探索(5)---- 大數據ETL利器之 pandas
# pandas_to_postgresql.py
import pandas as pd

def change_dtypes(col_int, col_float, df):
    '''
    AIM -> Change dtypes to save memory
    INPUT -> list of int column names, list of float column names, df
    OUTPUT -> df updated in place with a smaller memory footprint
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')

def convert_str_datetime(df):
    '''
    AIM -> Convert a datetime string column to real datetimes
    INPUT -> df with a 'transdate' string column
    OUTPUT -> df updated in place with a new 'timestamp' column
    '''
    df.insert(loc=2, column='timestamp',
              value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
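A quick check of the two helpers on a toy frame (inlined here so the snippet is self-contained; assumes pandas is installed):

```python
import pandas as pd

df = pd.DataFrame({'qty': [1, 2],
                   'price': [9.5, 12.0],
                   'transdate': ['2020-01-01 08:30:00.0', '2020-01-02 09:15:00.0']})

# What change_dtypes(['qty'], ['price'], df) amounts to:
df['qty'] = df['qty'].astype('int32')
df['price'] = df['price'].astype('float32')

# What convert_str_datetime(df) amounts to:
df.insert(loc=2, column='timestamp',
          value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

print(df.dtypes)
```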
from sqlalchemy import Column, TEXT, String, Integer, DateTime, Float

# Build a dict mapping each DataFrame column to a SQLAlchemy type, for use as
# the dtype argument of to_sql (note: the values must be SQLAlchemy types).
def mapping_df_types(df):
    dtypedict = {}
    for i, j in zip(df.columns, df.dtypes):
        if "object" in str(j):
            dtypedict.update({i: String(64)})
        elif "float" in str(j):
            dtypedict.update({i: Float})
        elif "int" in str(j):
            dtypedict.update({i: Float})
    return dtypedict
A few data-masking examples:
Name masking
def desensitization_name(name):
    # Keep the first character of the name and mask the rest
    new_name = str(name)[0] + '**'
    return new_name
Masking an employer name or address
import random

def desensitization_location(location):
    # Replace a random-length prefix (at least 2 characters) with '*'
    length = random.randint(2, len(location))
    return '*' * length + location[length:]
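These are pure functions, so they are easy to spot-check; the address helper below is a simplified equivalent that masks a random-length prefix:

```python
import random

def desensitization_name(name):
    # Keep only the first character of the name
    return str(name)[0] + '**'

def desensitization_location(location):
    # Mask a random-length prefix (at least 2 characters) with '*'
    length = random.randint(2, len(location))
    return '*' * length + location[length:]

print(desensitization_name('張三'))            # 張**
masked = desensitization_location('北京市朝陽區')
print(masked)                                  # e.g. ****朝陽區
```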
# Mask the basic sensitive fields
明細['姓名'] = 明細['姓名'].apply(pandas_to_postgresql.desensitization_name)
明細['單位名稱'] = 明細['單位名稱'].apply(pandas_to_postgresql.desensitization_location)
Loading data with to_sql
Reference: the pandas to_sql method documentation.
from sqlalchemy.types import Integer

engine = create_engine(data_to_database.connet_databases()._connect, echo=False)
df.to_sql('integers', con=engine, index=False, dtype={"A": Integer()})
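Putting mapping_df_types and to_sql together, a sketch of a batched load; an in-memory SQLite engine stands in for the PostgreSQL one so the snippet runs anywhere, and chunksize/method='multi' are the main batching knobs:

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.types import String, Float

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'price': [1.0, 2.0, 3.0]})
dtypes = {'name': String(64), 'price': Float}  # what mapping_df_types(df) would return

engine = create_engine('sqlite://')  # stand-in for the PostgreSQL engine
df.to_sql('items', con=engine, index=False, dtype=dtypes,
          chunksize=10000,  # rows written per batch
          method='multi')   # one multi-row INSERT per batch
with engine.connect() as con:
    n = con.execute(text('SELECT COUNT(*) FROM items')).scalar()
print(n)  # 3
```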
Bulk loading with SQLAlchemy
I have to say, the readability of SQLAlchemy's documentation is really poor.
- SQLAlchemy ORM 1.3 reference: https://docs.sqlalchemy.org/en/13/orm/index.html
- PostgreSQL dialect reference (Support for the PostgreSQL database): https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#module-sqlalchemy.dialects.postgresql.psycopg2
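To make the ORM path concrete, here is a self-contained sketch of chunked add_all inserts in the spirit of insert_list above; the Item model is hypothetical and SQLite stands in for PostgreSQL so the snippet runs anywhere:

```python
from sqlalchemy import create_engine, Column, String, Float
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base       # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # SQLAlchemy 1.3

Base = declarative_base()

class Item(Base):
    # Hypothetical demo table
    __tablename__ = 'items'
    id = Column(String(64), primary_key=True)
    price = Column(Float)

engine = create_engine('sqlite://')   # stand-in for the PostgreSQL engine
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

rows = [Item(id=str(i), price=float(i)) for i in range(1000)]
for start in range(0, len(rows), 200):   # insert in chunks of 200
    session.add_all(rows[start:start + 200])
    session.flush()
session.commit()
n = session.query(Item).count()
print(n)  # 1000
```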
Performance tuning
Apparently it comes down to adding a single parameter:
https://www.psycopg.org/docs/extras.html#fast-execution-helpers
Modern versions of psycopg2 include a feature known as Fast Execution Helpers , which have been shown in benchmarking to improve psycopg2’s executemany() performance, primarily with INSERT statements, by multiple orders of magnitude. SQLAlchemy allows this extension to be used for all executemany() style calls invoked by an Engine when used with multiple parameter sets, which includes the use of this feature both by the Core as well as by the ORM for inserts of objects with non-autogenerated primary key values, by adding the executemany_mode flag to create_engine():
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='batch')
Possible options for executemany_mode include:
- None - By default, psycopg2's extensions are not used, and the usual cursor.executemany() method is used when invoking batches of statements.
- 'batch' - Uses psycopg2.extras.execute_batch so that multiple copies of a SQL query, each one corresponding to a parameter set passed to executemany(), are joined into a single SQL string separated by a semicolon. This is the same behavior as was provided by the use_batch_mode=True flag.
- 'values' - For Core insert() constructs only (including those emitted by the ORM automatically), the psycopg2.extras.execute_values extension is used so that multiple parameter sets are grouped into a single INSERT statement and joined together with multiple VALUES expressions. This method requires that the string text of the VALUES clause inside the INSERT statement is manipulated, so is only supported with a compiled insert() construct where the format is predictable. For all other constructs, including plain textual INSERT statements not rendered by the SQLAlchemy expression language compiler, the psycopg2.extras.execute_batch method is used. It is therefore important to note that 'values' mode implies that 'batch' mode is also used for all statements for which 'values' mode does not apply.
For both strategies, the executemany_batch_page_size and executemany_values_page_size arguments control how many parameter sets should be represented in each execution. Because “values” mode implies a fallback down to “batch” mode for non-INSERT statements, there are two independent page size arguments. For each, the default value of None means to use psycopg2’s defaults, which at the time of this writing are quite low at 100. For the execute_values method, a number as high as 10000 may prove to be performant, whereas for execute_batch, as the number represents full statements repeated, a number closer to the default of 100 is likely more appropriate:
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values',
    executemany_values_page_size=10000,
    executemany_batch_page_size=500)