Batch-converting file encodings with Python 3

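| Background: I'm a rookie programmer, and one day I discovered that the files in one of my rookie projects were saved in a jumble of different encodings. What to do? Urgent, waiting online for answers.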

Sadly, no expert ever showed up to recommend a handy tool, so I wondered whether I could fix the files in bulk myself.

Preparation

Python 3
pip install chardet (for detecting encodings)

Detecting file encodings

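"Preparation breeds success; the lack of it breeds failure." With so many files in mixed encodings, it pays to plan first: detect the encoding of each file, and only then start fixing.
To detect encodings we can use the open-source chardet library. Usage is simple: pass it the raw bytes directly: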

import chardet

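# path is the file to inspect; read it in binary mode so chardet sees the raw bytes
with open(path, "rb") as f_file:
    content = f_file.read()
# The result is a dict holding the guessed encoding and its confidence
guess_encode = chardet.detect(content)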
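
Since the result is only a guess, it can help to decide up front how much confidence you require before trusting it. Here is a minimal sketch, assuming a result dict shaped like the one chardet.detect returns (with "encoding" and "confidence" keys). The helper name pick_encoding, the 0.8 threshold, and the sample values are my own assumptions for illustration, not part of chardet:

```python
def pick_encoding(guess, min_confidence=0.8, fallback=None):
    """Return the guessed encoding if it looks trustworthy, else the fallback."""
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # chardet reports encoding=None when it cannot make any guess at all
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Sample dicts shaped like chardet.detect() results (the values are made up)
confident = {"encoding": "utf-8", "confidence": 0.99, "language": ""}
shaky = {"encoding": "Windows-1252", "confidence": 0.42, "language": ""}
unknown = {"encoding": None, "confidence": 0.0, "language": None}

print(pick_encoding(confident))
print(pick_encoding(shaky))
print(pick_encoding(unknown, fallback="utf-8"))
```

Files whose guess falls below the threshold are good candidates for checking by hand.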

Collecting all the files to check

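"I have sons, my sons will have grandsons, and those grandsons will have sons of their own; sons upon sons, grandsons upon grandsons, without end." Some folders really do run that deep: their directory trees have many nested levels.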

Whether we are detecting encodings or fixing them, we first have to find all of these files. How?

Recursion is the usual first thought, but for files Python's os module already has this covered: os.walk does the job:

import os
import re


# Recursively collect every file under path, descending into all subfolders
def walk_files(path, regex=r"."):
    if not os.path.isdir(path):
        return [path]
    file_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            # re.match anchors at the start of the path, so filters need a leading ".*"
            if re.match(regex, file_path):
                file_list.append(file_path)

    return file_list
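
To see the walk and the filter in action, here is a small self-contained sketch that builds a throwaway directory tree and lists it. The folder layout, the file names, and the build_demo_tree helper are hypothetical and exist only for this demonstration; walk_files is a compact restatement of the same idea:

```python
import os
import re
import tempfile


def walk_files(path, regex=r"."):
    # Same idea as above: os.walk visits every subfolder, the regex filters paths
    if not os.path.isdir(path):
        return [path]
    return [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(path)
        for name in files
        if re.match(regex, os.path.join(root, name))
    ]


def build_demo_tree(base):
    # Hypothetical layout, created only for this demonstration
    os.makedirs(os.path.join(base, "src", "deep", "deeper"))
    for rel in ("a.java", "src/b.java", "src/deep/deeper/c.java", "src/notes.txt"):
        with open(os.path.join(base, *rel.split("/")), "wb") as fh:
            fh.write(b"demo")


with tempfile.TemporaryDirectory() as base:
    build_demo_tree(base)
    all_files = walk_files(base)
    # re.match starts at the beginning of the path, hence the leading ".*"
    java_files = walk_files(base, r".*\.java$")
    java_names = sorted(os.path.basename(p) for p in java_files)
    print(len(all_files), java_names)
```

The default pattern "." matches any non-empty path, so calling walk_files without a filter returns every file.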

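The regular expression (re module) makes filtering easy, since there are always some files that do not need to be checked or modified.
With the file list in hand, reading each file and detecting its encoding is straightforward: add a loop, record each guess inside it, and either print the results or hold them until the end and write them to a report file. I won't go into the details here.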
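
For readers who want that loop spelled out, here is a rough sketch. To keep it runnable without third-party packages, the detector is passed in as a parameter: build_report and fake_detect are names I made up, and fake_detect is only a stand-in that says "utf-8" when the bytes decode as UTF-8. In real use you would pass chardet.detect and read the bytes from the files on disk:

```python
from collections import Counter


def build_report(files, detect):
    """files maps a path to its raw bytes; detect is any chardet.detect-like callable."""
    stat = Counter()   # how many files were guessed as each encoding
    reports = []       # one (path, guess) pair per file
    for path, content in files.items():
        guess = detect(content)
        stat[guess.get("encoding")] += 1
        reports.append((path, guess))
    return stat, reports


def fake_detect(content):
    # Stand-in detector for this demo only, NOT how chardet works internally
    try:
        content.decode("utf-8")
        return {"encoding": "utf-8", "confidence": 1.0}
    except UnicodeDecodeError:
        return {"encoding": None, "confidence": 0.0}


# In-memory sample "files" (hypothetical names and contents)
sample = {
    "a.java": "中文".encode("utf-8"),
    "b.java": "中文".encode("gbk"),
    "c.java": b"plain ascii",
}
stat, reports = build_report(sample, fake_detect)
print(dict(stat))
for path, guess in reports:
    print(path, guess)
```

The tally makes it easy to spot the odd ones out before changing anything.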

Changing a file's encoding

Python 2's strings were arguably poorly designed: raw bytes also counted as strings, which caused all sorts of confusion.

Python 3 improved on this. Converting bytes from one encoding to another takes just the following:

# Decode the bytes into a str
contentStr = content.decode(original)
# Encode the str as bytes in the target encoding
targetBytes = bytes(contentStr, target)

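Remember to wrap this in a try block, of course. Bytes must be decoded with the right codec; otherwise the call raises an exception or yields garbled text. It is a bit like decryption: the wrong key will not open the door (for example, decoding UTF-8 content as gbk).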
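
Here is a minimal sketch of that try block. The transcode helper and its "return None on failure" convention are my own choices for illustration; text.encode(target) does the same job as bytes(text, target):

```python
def transcode(content, original, target):
    """Return content re-encoded from original to target, or None if that fails."""
    try:
        text = content.decode(original)
        return text.encode(target)
    except (UnicodeDecodeError, UnicodeEncodeError, LookupError) as exc:
        # LookupError covers codec names that Python does not know
        print("skipped: {}".format(exc))
        return None


gbk_bytes = "中文".encode("gbk")

ok = transcode(gbk_bytes, "gbk", "utf-8")                    # right key
bad = transcode(gbk_bytes, "utf-8", "gbk")                   # wrong key: decoding fails
unknown_codec = transcode(gbk_bytes, "no-such-codec", "utf-8")

print(ok == "中文".encode("utf-8"), bad, unknown_codec)
```

Note that a wrong codec does not always fail loudly: some byte sequences decode without error into the wrong characters, so a quiet run is not proof of a correct conversion.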

Once we have the bytes in the new encoding, we still need to save the file:

f_file.seek(0)
f_file.truncate()
f_file.write(targetBytes)

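With the file opened in a writable binary mode such as "rb+", first move the file pointer to the beginning, then call f_file.truncate() to clear everything after the pointer, and finally write the new bytes.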
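
Put together, the read, convert, and write-back steps might look like the sketch below. It runs against a temporary file so nothing real is touched; rewrite_in_place is a hypothetical helper name. One detail worth noticing is that the conversion happens before truncate(), so a decoding error leaves the original file contents intact:

```python
import os
import tempfile


def rewrite_in_place(path, original, target):
    # "rb+" opens an existing file for both reading and writing in binary mode
    with open(path, "rb+") as fh:
        content = fh.read()
        # Convert first: if this raises, the file has not been modified yet
        new_bytes = content.decode(original).encode(target)
        fh.seek(0)       # back to the start of the file
        fh.truncate()    # drop everything from the pointer onward
        fh.write(new_bytes)


# Demo on a throwaway file
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(demo_path, "wb") as fh:
        fh.write("中文編碼".encode("gb18030"))
    rewrite_in_place(demo_path, "gb18030", "utf-8")
    with open(demo_path, "rb") as fh:
        result = fh.read()
finally:
    os.remove(demo_path)

print(result.decode("utf-8"))
```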

Final chapter (sample code and screenshots)

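Most of the above describes the approach, and the code is incomplete. Most important of all: back up before running any batch operation. I have not implemented that myself; consider using shutil.copytree(source_folder, new_folder) to make the backup.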
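
As a rough idea of what that backup step could look like, here is a sketch built on shutil.copytree. The backup_folder helper and the "_backup" suffix are assumptions of mine, and the demo runs inside a temporary directory:

```python
import os
import shutil
import tempfile


def backup_folder(src, backup_root):
    """Copy src into backup_root under a new name and return the path of the copy."""
    name = os.path.basename(os.path.abspath(src))
    dst = os.path.join(backup_root, name + "_backup")
    # copytree creates dst itself and raises FileExistsError if it already exists
    shutil.copytree(src, dst)
    return dst


with tempfile.TemporaryDirectory() as work:
    # Hypothetical project layout, created only for this demonstration
    project = os.path.join(work, "project")
    os.makedirs(os.path.join(project, "src"))
    with open(os.path.join(project, "src", "a.java"), "wb") as fh:
        fh.write(b"original bytes")

    copy_path = backup_folder(project, work)
    with open(os.path.join(copy_path, "src", "a.java"), "rb") as fh:
        copied = fh.read()

print(os.path.basename(copy_path), copied)
```

Because copytree refuses to overwrite an existing destination, an earlier backup cannot be clobbered by accident.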

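As the screenshot shows, chardet's guesses are not always correct. That is why you need a backup, and why some files need individual tweaking until the IDE displays or runs them properly.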
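
When a guess looks wrong, one possible manual tweak (a different technique from chardet's statistical detection) is to try a short list of likely encodings in order and keep the first one that decodes cleanly. This is only a sketch: first_working_encoding and the candidate list are assumptions, and the order matters because permissive codecs accept bytes that really belong to another encoding:

```python
def first_working_encoding(content, candidates):
    """Return the first candidate that decodes content without error, else None."""
    for name in candidates:
        try:
            content.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


# Put the strictest codec first: UTF-8 rejects most non-UTF-8 byte sequences
candidates = ["utf-8", "gb18030"]

utf8_choice = first_working_encoding("中文".encode("utf-8"), candidates)
gbk_choice = first_working_encoding("中文".encode("gbk"), candidates)
no_choice = first_working_encoding(b"\xff\xff", ["utf-8", "ascii"])

print(utf8_choice, gbk_choice, no_choice)
```

A clean decode still does not guarantee the text is right, so it is worth opening a few converted files to check them by eye.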

Here is the complete test code:

# -*- coding: utf-8 -*-
# @Date:2020/1/12 19:04
# @Author: Lu
# @Description

import os
import copy
import re
import chardet


class FileUtil():

    # Recursively collect every file under path, descending into all subfolders
    @staticmethod
    def walk_files(path, regex=r"."):
        if not os.path.isdir(path):
            return [path]
        file_list = []
        for root, dirs, files in os.walk(path):
            for file in files:
                file_path = os.path.join(root, file)
                if re.match(regex, file_path):
                    file_list.append(file_path)

        return file_list


class EncodeTask():

    def __init__(self):
        self.default_config = {
            "workpaths": [u"./"],
            "filefilter": r"."
        }
        self.config = copy.deepcopy(self.default_config)
        self.work_files = []
        self.workpaths = []

    def update(self, config, fill_default_value=False):
        cache = copy.deepcopy(config)
        for k in self.default_config.keys():
            if cache.get(k):
                self.config[k] = cache[k]
            elif fill_default_value:
                self.config[k] = self.default_config[k]
        self.__gen_files(self.config["workpaths"])
        return self

    def __gen_files(self, workpaths):
        self.work_files.clear()
        for workpath in workpaths:
            self.work_files += FileUtil.walk_files(workpath, self.config["filefilter"])

    def check_encoding(self):
        encoding_report = {"stat": {}, "reports": []}
        for path in self.work_files:
            f_file = open(path, "rb")
            content = f_file.read()
            guess_encode = chardet.detect(content)

            encoding = guess_encode.get("encoding")
            encoding_report["reports"].append([path, guess_encode])
            if not encoding_report["stat"].get(encoding):
                encoding_report["stat"][encoding] = 1
            else:
                encoding_report["stat"][encoding] += 1

            f_file.flush()
            f_file.close()

        reportfile = open(u"./encoding_report.txt", "w", encoding="utf-8")
        reportContent = u"{}\n".format(encoding_report["stat"])

        for item in encoding_report["reports"]:
            reportContent += u"\n{}    {}".format(item[0], item[1])

        reportfile.write(reportContent)
        reportfile.flush()
        reportfile.close()
        print(encoding_report)

    def change_encoding(self, original, target):
        for path in self.work_files:
            print(u"\n{}\nchange {} to {}".format(path, original, target))
            f_file = open(path, "rb+")
            content = f_file.read()
            try:
                # Decode the bytes into a str
                contentStr = content.decode(original)
                # Encode the str as unicode-escape bytes (unused)
                # unicodeBytes = contentStr.encode("unicode_escape")

                # Encode the str as bytes in the target encoding
                targetBytes = bytes(contentStr, target)

                # print(targetBytes)

                f_file.seek(0)
                f_file.truncate()
                f_file.write(targetBytes)

            except Exception as e:
                print(u"Error: the encoding may be wrong\n{}".format(e))

            finally:
                f_file.flush()
                f_file.close()


def task():
    print("""You can use it like this code:
# -*- coding: utf-8 -*-

    from conver_encode import EncodeTask

    EncodeTask().update({
        "workpaths": [u"./test"],
        "filefilter": r".*\.(?:java)"
    }).check_encoding()

    EncodeTask().update({
        "workpaths": [u"./test"],
        "filefilter": r".*\.(?:java)"
    }).change_encoding("gb18030", "utf-8")

    # }).change_encoding("utf-8", "gb18030")
    # }).change_encoding("Windows-1252", "utf-8")
    """)


if __name__ == '__main__':
    task()
