File Encoding Conversion (ASCII & UTF-8)

Original post

2020-06-14 19:34

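While using Source Insight [version 3.50.0080] to read code written for Linux, I found that it handles Chinese comments very poorly. Some posts online say to change the comment font to "NSimSun" (or "SimSun"), but I could not get that to work. So I decided to simply re-save the UTF-8 files as ASCII. The first thing that came to mind was Notepad's "Save As", but that is clearly impractical when there are many files.
After a lot of searching I found a fairly well-written post, "sourceinsight中文顯示亂碼問題徹底解決辦法" (a thorough fix for garbled Chinese text in Source Insight).
It is simple and clear, but it seems to have a problem: it corrupts files that were already ASCII. Notepad also turns out to save "UTF-8" as "UTF-8 with BOM", which caused quite a few problems of its own. Below is a slightly improved version; the only improvement is that the target folder is passed on the command line, and it does not fix the ASCII problem -.-: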

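@echo off
rem Pass the target directory as the first argument.
set "DIR=%~1"
if "%DIR%"=="" (
  echo Should input the directory name
) else (
  for /R "%DIR%" %%i in (*.h *.c *.cpp *.cs *.mak *.java) do (
    echo %%i
    rem Read the file as UTF-8 and write non-ASCII characters as \uXXXX escapes.
    native2ascii -encoding UTF-8 "%%i" "%DIR%\temp"
    rem Convert the escapes back, writing in the platform default encoding.
    native2ascii -reverse "%DIR%\temp" "%%i"
  )
  echo ALL DONE
  pause
)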

Some references on native2ascii:

1. "native2ascii命令" (the native2ascii command)

2. "native2ascii命令詳解" (the native2ascii command in detail)
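
The round trip performed by the two native2ascii calls in the batch script can be approximated in Python 3, which also suggests one plausible reason why some files get corrupted. This is only a sketch: `to_ascii_escapes` and `from_ascii_escapes` are hypothetical helpers, they mimic native2ascii only roughly (characters outside the Basic Multilingual Plane, for instance, are not handled the way the Java tool does), and it assumes that the "ASCII" files in question are in fact GBK-encoded.

```python
import re

def to_ascii_escapes(raw, encoding="utf-8"):
    """Decode raw bytes, then write every non-ASCII character as a backslash escape."""
    # Bytes that are not valid in `encoding` become U+FFFD here (an assumption
    # about how the real tool behaves; the information is lost either way).
    text = raw.decode(encoding, errors="replace")
    return text.encode("ascii", errors="backslashreplace").decode("ascii")

def from_ascii_escapes(escaped):
    """Turn \\uXXXX escapes back into characters, leaving other backslashes alone."""
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  escaped)

comment = "// \u4e2d\u6587"  # a C++ comment containing two Chinese characters

# A UTF-8 file read as UTF-8 survives the round trip.
utf8_round_trip = from_ascii_escapes(to_ascii_escapes(comment.encode("utf-8")))

# A GBK file read as UTF-8 does not: its bytes are not valid UTF-8.
gbk_round_trip = from_ascii_escapes(to_ascii_escapes(comment.encode("gbk")))

print(utf8_round_trip == comment)
print("\ufffd" in gbk_round_trip)
```

A file that contains only 7-bit ASCII bytes is also valid UTF-8, so under this reading it would pass through unchanged; the damage is limited to files whose non-ASCII bytes are in some other encoding.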

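So I wrote a Python program myself to do what I needed: convert files between ASCII and UTF-8 in both directions. (In the script, the "ASCII" side is actually read and written as GBK.)

Note: the chardet module has to be installed separately, and my Python environment is 2.7.

Usage: python transformFormat.py fileOrDirName toUTF_8(True/False) fileExtensions(c,cpp,h,cs,mak)[optional]

For example: python transformFormat.py H:\test True c cpp h

This converts every file ending in .c/.cpp/.h under the H:\test folder to UTF-8 (the original encoding does not need to be specified).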
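
The script relies on chardet to guess each file's encoding. For the narrower question "is this file already valid UTF-8?", a strict decode is a simple alternative that needs no extra module. The sketch below is not what the author's script does; `looks_like_utf8` is a hypothetical helper, and it assumes Python 3. Note that a short non-UTF-8 byte sequence can occasionally happen to be valid UTF-8, so this is a heuristic as well.

```python
def looks_like_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8 (pure ASCII also counts)."""
    try:
        raw.decode("utf-8")  # strict by default: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True

text = "\u4e2d\u6587"  # two Chinese characters

print(looks_like_utf8(text.encode("utf-8")))
print(looks_like_utf8(text.encode("gbk")))
print(looks_like_utf8(b"int main(void) { return 0; }"))
```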

"""
transformFormat.py: convert the encoding of files, especially between ASCII and
UTF-8.
"""
class Transform(object):

    def listFiles(self, root=''):
        """Return every file under root, or [root] if root is itself a file."""
        import os
        allFiles = []

        if os.path.isfile(root):  # root is just a file
            allFiles.append(root)
            return allFiles

        for i in os.listdir(root):  # root is a directory
            f = os.path.join(root, i)
            if os.path.isdir(f):
                allFiles += self.listFiles(root=f)  # recurse into subdirectories
            elif os.path.isfile(f):
                allFiles.append(f)

        return allFiles



    def transform(self, fileName, toUTF_8):
        """Rewrite fileName in place as UTF-8 (toUTF_8=True) or GBK (toUTF_8=False)."""
        import chardet
        import codecs
        # Read raw bytes so that line endings are left exactly as they are.
        with open(fileName, 'rb') as f:
            data = f.read()
        if data[:3] == codecs.BOM_UTF8:  # in case of UTF-8 with BOM
            data = data[3:]
        try:
            print('Transform begin, file: ' + fileName + ';toUTF_8: ' + str(toUTF_8))
            # chardet reports None when it cannot guess (e.g. for an empty file).
            encodeType = (chardet.detect(data)['encoding'] or '').upper()
            print(fileName + ' ' + encodeType)

            alreadyUTF_8 = (encodeType.find('UTF') != -1)  # already UTF-8

            if (toUTF_8 and alreadyUTF_8) or (not toUTF_8 and not alreadyUTF_8):
                # no need to transform, already OK
                print(fileName + ' Already')
                return

            # 'ignore' silently drops bytes that cannot be decoded.
            if toUTF_8:  # anything that is not UTF-8 is assumed to be GBK
                data = data.decode('gbk', 'ignore').encode('utf-8')
            else:
                data = data.decode('utf-8', 'ignore').encode('gbk')

            # write back the content
            with open(fileName, 'wb') as f:
                f.write(data)

            print(fileName + ' OK')

        except Exception as e:
            print('WRONG with ' + fileName)
            print(e)


    def main(self, root='', toUTF_8=True, fileExtensions=()):
        import os
        allFiles = self.listFiles(root=root)
        allFiles2 = []
        for f in allFiles:
            # extension without its leading dot, e.g. 'cpp'; '' if there is none
            fends = os.path.splitext(f)[1][1:]
            if fends in fileExtensions:
                allFiles2.append(f)

        if len(allFiles2) == 0:
            print('No file to transform')
            return

        for f in allFiles2:
            self.transform(f, toUTF_8)



#t = Transform()
#root = 'H:\leetcode\wingide\he'
#fE = ['c','cpp','h','cs','mak','txt']
#t.main(root=root,toUTF_8=False, fileExtensions = fE)
#exit()

if __name__ == '__main__':
    import sys
    print('Usage: python transformFormat.py fileOrDirName toUTF_8(True/False)  fileExtensions(c,cpp,h,cs,mak)[optional]')
    if len(sys.argv) < 2:
        print("No file name!")
        sys.exit(1)
    if len(sys.argv) == 2:
        print('Should give toUTF_8')
        sys.exit(1)
    root = sys.argv[1]

    if sys.argv[2] == 'True':
        toUTF_8 = True
    elif sys.argv[2] == 'False':
        toUTF_8 = False
    else:
        # stop here; otherwise toUTF_8 would be undefined below
        print('toUTF_8 should be True or False')
        sys.exit(1)

    # default extensions, overridden by any extra command-line arguments
    fileExtensions = ['c', 'cpp', 'h', 'cs', 'mak']
    if len(sys.argv) > 3:
        fileExtensions = sys.argv[3:]

    print('Transform begin, root: ' + root + ';toUTF_8: ' + str(toUTF_8) + ';fileExtensions:' + str(fileExtensions))
    t = Transform()
    t.main(root=root, toUTF_8=toUTF_8, fileExtensions=fileExtensions)
    print('Transform Over')
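
The BOM check inside transform() can be tried out on its own. The following is a minimal Python 3 sketch, with `strip_utf8_bom` as a hypothetical helper name; it also shows the standard `utf-8-sig` codec, which drops a leading BOM while decoding and may be a convenient alternative when the goal is text rather than bytes.

```python
import codecs

def strip_utf8_bom(raw):
    """Return raw without a leading UTF-8 byte order mark, if it has one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

without_bom = "int x; // \u4e2d\u6587".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom  # what Notepad's "UTF-8" option produces

# Stripping removes exactly the three BOM bytes and leaves BOM-less data alone.
print(strip_utf8_bom(with_bom) == without_bom)
print(strip_utf8_bom(without_bom) == without_bom)

# Decoding with "utf-8-sig" gives the same text whether or not a BOM is present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```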

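References:

1. "python 中文亂碼問題深入分析" (an in-depth analysis of garbled Chinese text in Python)

2. "字符編碼筆記：ASCII，Unicode和UTF-8" (notes on character encoding: ASCII, Unicode and UTF-8)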
