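Python: handling broken rows and shifted columns in txt/csv files (1. Detecting broken rows and shifted columns; 2. Repairing shifted columns)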

Original post

2022-11-11 04:54

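1. Detecting broken rows and shifted columns

1.1 Detecting shifted columns by counting delimiters per line

Count how many times the delimiter appears on each line. A line whose count differs from that of the first line has shifted columns (too many delimiters) or is a broken row (too few).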

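# Detect malformed lines by comparing delimiter counts with the first line
file = 'tmp.csv'  # path of the file to check
with open(file, 'r', encoding='utf-8') as f:
    rows = f.readlines()

sep_cnt = rows[0].count('|')  # expected delimiter count, taken from the first line
num = 0                       # running count of malformed lines
for line_no, row in enumerate(rows, start=1):
    cnt = row.count('|')
    if cnt != sep_cnt:
        num = num + 1
        # running count | line number in file | delimiter count : line content
        print(num, '|', line_no, '|', cnt, ':', row)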
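For large files it may be preferable not to load every line at once. The following is a minimal sketch of the same check that streams the file instead. The function name `find_bad_lines` and the sample data are hypothetical, and the sketch assumes the first line has the correct number of delimiters.

```python
import os
import tempfile


def find_bad_lines(path, sep='|', encoding='utf-8'):
    """Return (line_number, delimiter_count) for every line whose
    delimiter count differs from that of the first line."""
    bad = []
    expected = None
    with open(path, 'r', encoding=encoding) as f:
        for line_no, line in enumerate(f, start=1):
            cnt = line.count(sep)
            if expected is None:
                expected = cnt              # first line defines the expected count
            elif cnt != expected:
                bad.append((line_no, cnt))
    return bad


# Made-up sample: line 3 has an extra '|', line 4 is cut short.
sample = "id|name|amount\n1|apple|10\n2|pe|ar|20\n3|plum\n"
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

bad_lines = find_bad_lines(sample_path)
print(bad_lines)  # [(3, 3), (4, 1)]
os.remove(sample_path)
```

Returning the line numbers rather than printing them makes it easy to pass them on to the inspection steps in section 1.2.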

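1.2 Finding which line has shifted columns

pandas raises an error when it reads such a file. The message states which line contains an extra field, so you can use it to inspect that specific line.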

pandas.errors.ParserError: Error tokenizing data. C error: Expected 35 fields in line 191072, saw 36
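If the line number needs to be picked up programmatically, the message can be parsed with a regular expression. This is only a sketch: the pattern is derived from the single message shown above, the wording may differ between pandas versions, and the helper name `parse_tokenizing_error` is hypothetical.

```python
import re

# Pattern built from the example message above (an assumption, not a pandas API).
_PATTERN = re.compile(r"Expected (\d+) fields in line (\d+), saw (\d+)")


def parse_tokenizing_error(message):
    """Return (expected_fields, line_number, seen_fields), or None if
    the message does not have the expected shape."""
    m = _PATTERN.search(message)
    if m is None:
        return None
    expected, line_no, seen = (int(g) for g in m.groups())
    return expected, line_no, seen


msg = ("pandas.errors.ParserError: Error tokenizing data. C error: "
       "Expected 35 fields in line 191072, saw 36")
info = parse_tokenizing_error(msg)
print(info)  # (35, 191072, 36)
```

In practice the message would come from `str(exc)` inside an `except pandas.errors.ParserError as exc:` block.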

1.2.1 Reading the offending line with pandas

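import pandas as pd
file = 'tmp.csv'
# Read the rows around the reported line.
# skiprows=191072 skips the first 191072 lines of the file, so with nrows=2
# the rows returned are file lines 191073 and 191074; adjust skiprows if the
# line you want to inspect falls outside this window.
dat = pd.read_table(file, sep='|', encoding="utf8",
                    dtype=str, skiprows=191072, nrows=2, low_memory=False,
                    header=None
                    )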

1.2.2 Reading the offending line with linecache

import linecache
# getline uses 1-based line numbers and returns '' if the line cannot be read
error_line = linecache.getline(file, 191073)
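It is often useful to see the neighbouring lines as well, since a broken row usually continues on the next line. The sketch below wraps `linecache.getline` in a small helper; `show_context`, its `radius` parameter and the sample file are hypothetical. Note that `linecache` was designed for Python source files and decodes as UTF-8 by default, so it suits UTF-8 data only.

```python
import linecache
import os
import tempfile


def show_context(path, line_no, radius=1):
    """Return [(number, text)] for line_no and `radius` lines on either side."""
    linecache.checkcache(path)   # drop stale cache entries if the file changed
    out = []
    for n in range(max(1, line_no - radius), line_no + radius + 1):
        text = linecache.getline(path, n)
        if text:                 # getline returns '' past the end of the file
            out.append((n, text.rstrip('\n')))
    return out


# Made-up sample in which line 3 has one field too many.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("a|b\n1|2\n3|4|5\n6|7\n")
    sample_path = tmp.name

context = show_context(sample_path, 3)
print(context)  # [(2, '1|2'), (3, '3|4|5'), (4, '6|7')]
os.remove(sample_path)
```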

2. Repairing shifted columns

2.1 Replacing the n-th occurrence of a character

def nth_repl(s, sub, repl, n):
    """
    Replace the n-th occurrence of a substring.
    :s: the input string
    :sub: the substring to be replaced
    :repl: the new string to put in its place
    :n: which occurrence of sub to replace (1-based)
    -------------------------
    Example: replace the 7th occurrence
    nth_repl(z,'|','_',7)
    """
    find = s.find(sub)
    # If find is not -1 we have found at least one match for the substring
    i = int(find != -1)
    # loop until we find the nth or we find no match
    while find != -1 and i != n:
        # find + 1 means we start searching from after the last match
        find = s.find(sub, find + 1)
        i += 1
    # If i is equal to n we found nth match so replace
    if i == n:
        return s[:find] + repl + s[find+len(sub):]
    return s

Reference: https://stackoverflow.com/questions/35091557/replace-nth-occurrence-of-substring-in-string
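For a single-character delimiter such as '|', the same replacement can also be written with `str.split` and `str.join`. This is an alternative sketch, not the code from the referenced answer, and the name `nth_repl_split` is made up here. Unlike the find-based version it does not count overlapping matches, which only matters for multi-character substrings.

```python
def nth_repl_split(s, sub, repl, n):
    """Replace the n-th (1-based) occurrence of sub in s with repl.
    Returns s unchanged if there are fewer than n occurrences."""
    parts = s.split(sub)
    if n < 1 or n >= len(parts):   # len(parts) - 1 occurrences exist
        return s
    # Re-join the pieces, putting repl where the n-th separator used to be.
    return sub.join(parts[:n]) + repl + sub.join(parts[n:])


second = nth_repl_split('a|b|c|d', '|', '_', 2)
third = nth_repl_split('a|b|c|d', '|', '_', 3)
missing = nth_repl_split('a|b|c|d', '|', '_', 4)
print(second)   # a|b_c|d
print(third)    # a|b|c_d
print(missing)  # a|b|c|d (only three separators, so nothing changes)
```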

2.2 Removing null bytes and whitespace

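import re
# Remove \x00 (NUL) and whitespace; note that \s matches spaces as well as \r, \n and \t
z = re.sub('\x00', '', re.sub(r'\s', '', count))  # count holds the text to clean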
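Because `\s` also matches ordinary spaces, the one-liner above removes the spaces inside field values too. If that is not wanted, a narrower character class can be used. The helper below is a sketch; its name `strip_control` and the `keep_spaces` flag are hypothetical.

```python
import re


def strip_control(text, keep_spaces=True):
    """Remove NUL bytes and line-breaking whitespace from text.

    keep_spaces=True removes only NUL, CR, LF and TAB.
    keep_spaces=False removes NUL and every whitespace character,
    which matches the behaviour of the one-liner above.
    """
    if keep_spaces:
        return re.sub(r'[\x00\r\n\t]', '', text)
    return re.sub(r'[\x00\s]', '', text)


raw = '1|New York\x00|10\r\n'
kept = strip_control(raw)
stripped = strip_control(raw, keep_spaces=False)
print(kept)      # 1|New York|10
print(stripped)  # 1|NewYork|10
```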

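2.3 Line-by-line replacement

Collecting the cleaned lines in a list is more efficient than, for example, repeatedly concatenating them into one string.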

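import re
from tqdm import tqdm

# Clean the data (uses nth_repl from section 2.1)
def file_sub(old_file, new_file):

    file_data = []  # cleaned lines are collected here

    with open(old_file, "r", encoding="utf-8") as f:
        print('Starting replacement...')
        rows = f.readlines()
        sep_cnt = rows[0].count('|')  # expected delimiter count, from the first line

        for line in tqdm(rows):      # clean the text line by line
            # Remove NUL bytes and all whitespace (including the trailing newline)
            a = re.sub('\x00', '', re.sub(r'\s', '', line))
            # 30 is hard-coded for this particular file: on lines with too many
            # delimiters the 30th '|' is assumed to be the stray one.
            # Keep replacing it until the count matches the first line.
            while a.count('|') > sep_cnt:
                b = nth_repl(a, '|', '_', 30)
                if b == a:           # no 30th delimiter: stop instead of looping forever
                    break
                a = b
            file_data.append(a)

    with open(new_file, "w", encoding="utf-8") as f:   # write the cleaned text
        print('Writing cleaned text...')
        for line in tqdm(file_data):
            f.write(line + '\n')

    print('Batch replacement finished')

def main():
    file_sub('hdx_finance_20211130_f.csv','hdx_finance_20211130_new.csv')

if __name__ == '__main__':
    main()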
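When the file is too large to hold in memory, the same cleaning can be done one line at a time, writing each cleaned line immediately. The following is a sketch under stated assumptions: the names `merge_extra_seps` and `clean_file_streaming` are hypothetical, the first line is assumed to have the correct delimiter count, `repl` is assumed not to contain `sep`, and the position `n` of the stray delimiter must be supplied for the file at hand.

```python
import os
import re
import tempfile


def merge_extra_seps(line, expected, n, sep='|', repl='_'):
    """Replace the n-th separator until at most `expected` remain."""
    while line.count(sep) > expected:
        parts = line.split(sep)
        if n < 1 or n >= len(parts):
            break                      # no n-th separator left: give up
        line = sep.join(parts[:n]) + repl + sep.join(parts[n:])
    return line


def clean_file_streaming(old_file, new_file, n, sep='|', repl='_'):
    """Clean old_file into new_file line by line; return how many lines changed."""
    fixed = 0
    expected = None
    with open(old_file, 'r', encoding='utf-8') as src, \
         open(new_file, 'w', encoding='utf-8') as dst:
        for line in src:
            line = re.sub(r'[\x00\s]', '', line)   # drop NUL bytes and whitespace
            if expected is None:
                expected = line.count(sep)         # first line sets the target
            merged = merge_extra_seps(line, expected, n, sep, repl)
            if merged != line:
                fixed += 1
            dst.write(merged + '\n')
    return fixed


# Made-up sample: the second field of the last line contains two stray '|'.
with tempfile.TemporaryDirectory() as tmp_dir:
    old_path = os.path.join(tmp_dir, 'old.csv')
    new_path = os.path.join(tmp_dir, 'new.csv')
    with open(old_path, 'w', encoding='utf-8') as f:
        f.write("id|note|amount\n1|ok|10\n2|a|b|c|20\n")
    fixed_count = clean_file_streaming(old_path, new_path, n=2)
    with open(new_path, 'r', encoding='utf-8') as f:
        cleaned = f.read()

print(fixed_count)  # 1
print(cleaned)
```

Only one line is held in memory at a time, at the cost of losing the progress total that tqdm can show for a list.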
