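Python: handling broken rows and shifted columns in txt/csv files

Contents: 1. Detecting broken rows and shifted columns; 2. Repairing shifted columns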

1. Detecting broken rows and shifted columns

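1.1 Checking the delimiter count on each line

Count the delimiter on every line and compare it with the count on the first line. A line whose count differs is either a broken row or has shifted columns.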

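# Detect broken rows: report every line whose delimiter count differs from the first line's
file = 'tmp.csv'  # path of the file to check
with open(file, 'r', encoding='utf-8') as f:
    rows = f.readlines()
sep_cnt = rows[0].count('|')  # expected delimiter count, taken from the first line
num = 0  # running count of bad lines
for line_no, row in enumerate(rows, start=1):
    cnt = row.count('|')
    if cnt != sep_cnt:
        num = num + 1
        # running count | 1-based line number, delimiter count : line text
        print(num, '|', line_no, cnt, ':', row)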
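The check above loads the whole file with readlines(). As a minimal sketch of the same comparison done lazily, the hypothetical helper find_bad_lines below accepts any iterable of lines (an open file works) and yields the 1-based line number, the delimiter count, and the text of each line that disagrees with the first line. The helper name and the sample data are illustrative assumptions, not part of the original post.

```python
def find_bad_lines(lines, sep='|'):
    """Yield (line_no, delimiter_count, text) for lines that disagree with the first line."""
    expected = None
    for line_no, line in enumerate(lines, start=1):
        cnt = line.count(sep)
        if expected is None:
            expected = cnt  # the first line defines the expected count
        elif cnt != expected:
            yield line_no, cnt, line.rstrip('\n')


# Usage with a real file (an open file is an iterable of lines):
# with open('tmp.csv', encoding='utf-8') as f:
#     for line_no, cnt, text in find_bad_lines(f):
#         print(line_no, cnt, text)

# Made-up sample: line 3 has one delimiter too many, line 4 one too few
sample = ['a|b|c\n', '1|2|3\n', '4|5|x|6\n', '7|8\n']
print(list(find_bad_lines(sample)))
# [(3, 3, '4|5|x|6'), (4, 1, '7|8')]
```

Because it is a generator, only one line is held in memory at a time, which may help with files as large as the one in this post.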

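1.2 Finding which line has shifted columns

pandas raises an error when it reads the file. The message reports which line has an extra field, so you can use it to inspect that specific line.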

pandas.errors.ParserError: Error tokenizing data. C error: Expected 35 fields in line 191072, saw 36
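If the line number is needed programmatically, it can be pulled out of the message text. This is a minimal sketch that assumes the message keeps exactly the wording shown above; parse_parser_error is a hypothetical helper, and in practice the string would come from str(exc) of a caught pandas.errors.ParserError.

```python
import re

# Assumes the message wording shown above: "Expected N fields in line L, saw M"
ERROR_PATTERN = re.compile(r"Expected (\d+) fields in line (\d+), saw (\d+)")


def parse_parser_error(message):
    """Return the expected field count, line number and actual field count, or None."""
    m = ERROR_PATTERN.search(message)
    if m is None:
        return None
    expected, line_no, saw = map(int, m.groups())
    return {"expected": expected, "line": line_no, "saw": saw}


msg = "Error tokenizing data. C error: Expected 35 fields in line 191072, saw 36"
print(parse_parser_error(msg))
# {'expected': 35, 'line': 191072, 'saw': 36}
```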

1.2.1 Reading the offending line with pandas

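import pandas as pd
file = 'tmp.csv'
# Read the offending line. The error reports a 1-based line number, so skip the
# 191071 lines before it and read line 191072 plus the line after it for comparison.
dat = pd.read_table(file, sep='|', encoding='utf8',
                    dtype=str, skiprows=191071, nrows=2, low_memory=False,
                    header=None
                    )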

1.2.2 Reading the offending line with linecache

import linecache
file = 'tmp.csv'
# linecache line numbers are 1-based, the same as in the pandas error message
error_line = linecache.getline(file, 191072)
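linecache reads and caches the whole file to serve a single line. As a sketch of an alternative, itertools.islice can stream just a small window of lines around a given line number. The helper read_window and its before/after parameters are hypothetical names introduced for illustration.

```python
from itertools import islice


def read_window(lines, line_no, before=1, after=1):
    """Return [(n, text), ...] for the 1-based line numbers around line_no."""
    first = max(1, line_no - before)
    last = line_no + after
    # islice uses 0-based, end-exclusive positions
    window = islice(lines, first - 1, last)
    return [(first + i, text.rstrip('\n')) for i, text in enumerate(window)]


# Usage with a real file:
# with open('tmp.csv', encoding='utf-8') as f:
#     for n, text in read_window(f, 191072):
#         print(n, text)

# Made-up sample of ten lines: 'l1' .. 'l10'
sample = ['l%d\n' % i for i in range(1, 11)]
print(read_window(sample, 5))
# [(4, 'l4'), (5, 'l5'), (6, 'l6')]
```

Seeing the neighbouring lines next to the reported one makes it easier to tell a shifted column from a row that was split across two lines.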

2. Repairing shifted columns

2.1 Replacing the nth occurrence of a character

def nth_repl(s, sub, repl, n):
    """
    Replace the nth occurrence of a substring.
    :s: the input string
    :sub: the substring to be replaced
    :repl: the new string to put in its place
    :n: which occurrence of sub to replace (1-based)
    -------------------------
    Example: replace the 7th '|' in z
    nth_repl(z,'|','_',7)
    """
    find = s.find(sub)
    # If find is not -1 we have found at least one match for the substring
    i = 1 if find != -1 else 0
    # Loop until we find the nth match or run out of matches
    while find != -1 and i != n:
        # find + 1 means we start searching from after the last match
        find = s.find(sub, find + 1)
        i += 1
    # Replace only if the nth match really exists; without the find != -1 check,
    # a string with exactly n-1 matches would be mangled
    if i == n and find != -1:
        return s[:find] + repl + s[find+len(sub):]
    return s

Reference: https://stackoverflow.com/questions/35091557/replace-nth-occurrence-of-substring-in-string
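For comparison, the same replacement can be sketched with str.split and str.join. This is an alternative formulation, not the code from the Stack Overflow answer; nth_repl_split is a hypothetical name. It counts non-overlapping occurrences, which is equivalent for a single-character delimiter such as '|'.

```python
def nth_repl_split(s, sub, repl, n):
    """Replace the nth (1-based) non-overlapping occurrence of sub in s with repl."""
    parts = s.split(sub)
    # len(parts) - 1 is the number of occurrences; leave s unchanged if n is out of range
    if n < 1 or n >= len(parts):
        return s
    return sub.join(parts[:n]) + repl + sub.join(parts[n:])


print(nth_repl_split('a|b|c|d', '|', '_', 2))  # a|b_c|d
print(nth_repl_split('a|b|c|d', '|', '_', 4))  # a|b|c|d (only three '|' exist)
```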

2.2 Removing null and whitespace characters

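import re
# Remove NUL bytes (\x00) and whitespace (\r, \n, \t); note that \s also removes ordinary spaces
# `line` is the raw text of one row
z = re.sub(r'\x00', '', re.sub(r'\s', '', line))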
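Since \s also matches ordinary spaces, the substitution above strips spaces inside field values too. If those spaces should be kept, a narrower character class is one option. This is a sketch under that assumption; strip_control_chars is a hypothetical helper and the sample string is made up.

```python
import re

# Only NUL, carriage return, newline and tab; ordinary spaces are left alone
CONTROL_CHARS = re.compile(r'[\x00\r\n\t]')


def strip_control_chars(text):
    """Remove NUL, CR, LF and TAB characters from text."""
    return CONTROL_CHARS.sub('', text)


print(repr(strip_control_chars('Hong Kong|a\x00b\t|c\r\n')))
# 'Hong Kong|ab|c'
```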

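2.3 Replacing line by line

Collecting the cleaned lines in a list and writing them out afterwards is more efficient.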

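import re
from tqdm import tqdm  # third-party progress bar: pip install tqdm

# Clean the data (uses nth_repl from section 2.1)
def file_sub(old_file, new_file):

    file_data = []  # cleaned lines

    with open(old_file, "r", encoding="utf-8") as f:
        print('Replacing...')
        rows = f.readlines()
        sep_cnt = rows[0].count('|')  # expected delimiter count, taken from the first line

        for line in tqdm(rows):  # clean the text one line at a time
            # Strip NUL bytes and all whitespace, including the trailing newline
            a = re.sub(r'\x00', '', re.sub(r'\s', '', line))
            # While the line still has more delimiters than expected,
            # replace the 30th one (30 is hard-coded for this particular file)
            while a.count('|') > sep_cnt:
                a = nth_repl(a, '|', '_', 30)
            file_data.append(a)

    with open(new_file, "w", encoding="utf-8") as f:  # write the cleaned text
        print('Writing the cleaned text...')
        for line in tqdm(file_data):
            f.write(line + '\n')

    print('Batch replacement finished')

def main():
    file_sub('hdx_finance_20211130_f.csv', 'hdx_finance_20211130_new.csv')

if __name__ == '__main__':
    main()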
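Repeatedly replacing the 30th delimiter amounts to gluing the surplus fields onto the 30th field. The same idea can be sketched on fields instead of delimiters, which avoids a hard-coded occurrence number inside a loop. merge_extra_fields and its parameters are hypothetical; col is the 0-based index of the field assumed to contain all the stray delimiters (col=29 would correspond to the 30 used above).

```python
def merge_extra_fields(line, expected_fields, col, sep='|', repl='_'):
    """Merge surplus fields into the field at 0-based index col.

    Assumes every stray delimiter belongs to that single field.
    """
    fields = line.split(sep)
    extra = len(fields) - expected_fields
    if extra <= 0:
        return line  # nothing to repair (too few fields are left untouched)
    merged = repl.join(fields[col:col + extra + 1])
    return sep.join(fields[:col] + [merged] + fields[col + extra + 1:])


# Made-up example: 4 fields expected, the second field contains a stray '|'
print(merge_extra_fields('a|b|c|d|e', 4, 1))  # a|b_c|d|e
print(merge_extra_fields('a|b|c|d|e', 3, 1))  # a|b_c_d|e
```

Lines with too few fields, which usually indicate a row broken across lines, are returned unchanged and need separate handling.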