1. Detecting misaligned rows and columns
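1.1 Using the per-line delimiter count to detect shifted columns
If a line contains a different number of delimiters than the first (header) line, its columns have shifted or the row has been broken across lines.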
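# Detect misaligned lines
file = 'tmp.csv'  # path to the '|'-delimited file
with open(file, 'r', encoding='utf-8') as f:
    rows = f.readlines()
sep_cnt = rows[0].count('|')  # expected delimiter count, taken from the first line
num = 0  # number of misaligned lines found so far
for line_no, row in enumerate(rows, start=1):
    cnt = row.count('|')
    if cnt != sep_cnt:
        num = num + 1
        # running total | 1-based line number | delimiter count : line text
        print(num, '|', line_no, '|', cnt, ':', row)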
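For files too large to load with readlines(), the same check can be done one line at a time. The following is a minimal sketch, assuming a '|'-delimited file whose first line has the correct delimiter count; the function name find_misaligned_lines is hypothetical and not part of the original script.

```python
def find_misaligned_lines(path, sep='|', encoding='utf-8'):
    """Return (line_number, delimiter_count, text) for every line whose
    delimiter count differs from that of the first line.

    Hypothetical helper; assumes the first line has the correct column count.
    """
    bad = []
    with open(path, 'r', encoding=encoding) as f:
        expected = None
        for line_no, line in enumerate(f, start=1):
            cnt = line.count(sep)
            if expected is None:
                expected = cnt  # the first line defines the expected count
            elif cnt != expected:
                bad.append((line_no, cnt, line.rstrip('\n')))
    return bad
```

Returning the 1-based line number with each hit makes it easy to open the same line later in an editor or with linecache.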
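1.2 Finding which line is shifted
pandas raises an error when it reads such a file. The error message reports the line that has one column too many, so you can use it to inspect the data on that specific line.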
pandas.errors.ParserError: Error tokenizing data. C error: Expected 35 fields in line 191072, saw 36
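The numbers in that message can be pulled out programmatically instead of being copied by hand. This is a small sketch that assumes the message has the exact wording quoted above; parse_tokenizing_error is a hypothetical helper name, and other pandas versions or parser engines may word the message differently.

```python
import re

# Pattern for the C-parser message quoted above (assumed wording).
_ERR_RE = re.compile(r'Expected (\d+) fields in line (\d+), saw (\d+)')

def parse_tokenizing_error(message):
    """Return (expected_fields, line_number, seen_fields), or None if no match."""
    m = _ERR_RE.search(message)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())
```

In practice the message would come from str(exc) inside an `except pd.errors.ParserError as exc:` block wrapped around the read call.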
1.2.1 Reading the shifted row with pandas
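import pandas as pd
file = 'tmp.csv'
# Read the offending rows.
# skiprows=191072 skips the first 191072 lines, so reading starts at line 191073 (1-based).
dat = pd.read_table(file, sep='|', encoding="utf8",
                    dtype=str, skiprows=191072, nrows=2, low_memory=False,
                    header=None
                    )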
1.2.2 Reading the shifted row with linecache
import linecache
# linecache.getline uses 1-based line numbers
error_line = linecache.getline(file, 191073)
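It is often useful to look at the neighbouring lines as well, since a row broken across lines only makes sense together with the line before or after it. A minimal sketch built on linecache.getline follows; the helper name lines_around and its before/after parameters are hypothetical.

```python
import linecache

def lines_around(path, line_no, before=1, after=1):
    """Return [(number, text), ...] for the lines surrounding line_no (1-based)."""
    start = max(1, line_no - before)
    out = []
    for n in range(start, line_no + after + 1):
        text = linecache.getline(path, n)  # returns '' past the end of the file
        if text:
            out.append((n, text.rstrip('\n')))
    return out
```

Note that linecache reads and caches the whole file on first access, so for very large files a single pass with itertools.islice may be lighter on memory.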
2. Repairing shifted rows
2.1 Replacing the n-th occurrence of a character
def nth_repl(s, sub, repl, n):
    """
    Replace the n-th occurrence of a substring.
    :s: the input string
    :sub: the substring to be replaced
    :repl: the new string to put in its place
    :n: which occurrence of sub to replace (1-based)
    -------------------------
    Replace the 7th occurrence:
    nth_repl(z, '|', '_', 7)
    """
    find = s.find(sub)
    # If find is not -1 we have found at least one match for the substring
    i = int(find != -1)
    # Loop until we find the nth match or run out of matches
    while find != -1 and i != n:
        # find + 1 means we start searching from after the last match
        find = s.find(sub, find + 1)
        i += 1
    # Replace only if the nth match really exists; without the find check,
    # a string with exactly n - 1 matches would be sliced at index -1
    if i == n and find != -1:
        return s[:find] + repl + s[find+len(sub):]
    return s
Reference: https://stackoverflow.com/questions/35091557/replace-nth-occurrence-of-substring-in-string
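The same replacement can also be written with str.split and its maxsplit argument. This is an alternative sketch, not the version from the referenced answer; the name nth_repl_split is hypothetical. It counts non-overlapping occurrences and returns the string unchanged when there are fewer than n of them.

```python
def nth_repl_split(s, sub, repl, n):
    """Replace the n-th (1-based) non-overlapping occurrence of sub in s."""
    if n < 1 or not sub:
        return s
    parts = s.split(sub, n)  # at most n splits, so at most n + 1 pieces
    if len(parts) <= n:      # fewer than n occurrences: nothing to replace
        return s
    # Rejoin the first n pieces with the original separator, then put repl
    # where the n-th separator used to be.
    return sub.join(parts[:n]) + repl + parts[n]
```

For a single-character delimiter such as '|', overlapping and non-overlapping matches coincide, so both approaches should pick the same position.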
2.2 Removing null and whitespace characters
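import re
# Remove \x00 and whitespace characters (\r, \n, \t); note that \s also matches ordinary spaces
# line is the raw text of one row
z = re.sub('\x00', '', re.sub(r'\s', '', line))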
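Because the \s pattern also deletes ordinary spaces inside field values, a narrower cleanup may be preferable when fields contain free text. The sketch below uses str.translate to drop only NUL, \r, \n and \t; the helper name strip_control_chars is hypothetical, and whether spaces should be kept depends on the data.

```python
# Characters to delete: NUL, carriage return, newline, tab.
_DROP = str.maketrans('', '', '\x00\r\n\t')

def strip_control_chars(text):
    """Remove NUL, CR, LF and TAB characters but keep ordinary spaces."""
    return text.translate(_DROP)
```

A single translate call also avoids running two regular-expression passes over every line.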
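2.3 Line-by-line replacement
Appending each cleaned line to a list and writing the list out at the end is more efficient than repeatedly concatenating strings.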
# Clean the data
import re
from tqdm import tqdm

def file_sub(old_file, new_file):
    file_data = []  # cleaned lines
    with open(old_file, "r", encoding="utf-8") as f:
        print('Starting replacement...')
        rows = f.readlines()
    sep_cnt = rows[0].count('|')  # expected delimiter count, from the first line
    for line in tqdm(rows):  # clean the text one line at a time
        # Strip NUL bytes and all whitespace (including the trailing newline)
        a = re.sub('\x00', '', re.sub(r'\s', '', line))
        # While the line still has too many delimiters, replace the 30th one
        # (the position 30 is hard-coded for this file)
        while a.count('|') > sep_cnt:
            a = nth_repl(a, '|', '_', 30)
        file_data.append(a)
    with open(new_file, "w", encoding="utf-8") as f:  # write the cleaned text
        print('Writing cleaned text...')
        for line in tqdm(file_data):
            f.write(line + '\n')
    print('Batch replacement finished')

def main():
    file_sub('hdx_finance_20211130_f.csv', 'hdx_finance_20211130_new.csv')

if __name__ == '__main__':
    main()
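After cleaning, it is worth confirming that every line in the new file now has the same number of delimiters. A minimal sketch of such a check follows, assuming the same '|' delimiter; delimiter_count_histogram is a hypothetical helper name.

```python
from collections import Counter

def delimiter_count_histogram(path, sep='|', encoding='utf-8'):
    """Map each delimiter count to the number of lines that have it."""
    with open(path, 'r', encoding=encoding) as f:
        return Counter(line.count(sep) for line in f)
```

A fully repaired file should produce a histogram with a single key. Any additional keys point to lines that still need attention, for example rows with too few delimiters, which file_sub does not repair.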