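Cracking Font-Based Anti-Scraping (Part 2)

Introduction

Implementation

Introduction

This article describes how to parse the glyphs in a font file programmatically.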

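Background

The font file formats most often used on web pages are .ttf, .woff and .eot (see "認識 Iconfont 以及什麼是 .eot、.woff、.ttf、.svg" on Jianshu). In my tests, a parser written against the ttf format can generally parse the other formats as well, so the rest of this article uses ttf as the example.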

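TrueType fonts

A TrueType font is usually contained in a single TrueType font file with the .ttf extension. An OpenType font is a PostScript font encoded in a format similar to TrueType; OpenType fonts use the .otf extension. OpenType also allows several fonts to be combined in a single file so that they can share data. Such a file is called a font collection (historically a TrueType Collection) and uses the .ttc extension.

There are many posts that walk through the structure of a TrueType font file. The one I found most clearly laid out is ttf文件結構解析.doc, and what follows is a brief summary of it.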

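A ttf file contains many tables, roughly the following:

head  Font header: global information about the font
cmap  Character-to-glyph mapping: maps character codes to glyph indices
glyf  Glyph data: glyph outline definitions and grid-fitting instructions
maxp  Maximum profile: summary of the memory the font needs
hmtx  Horizontal metrics: per-glyph horizontal metrics
loca  Index to location: converts a glyph index into the location of the glyph
name  Naming table: copyright notice, font name, family name, style name, and so on
hhea  Horizontal header: horizontal layout information (ascent, descent, line gap, maximum advance width, minimum left side bearing, minimum right side bearing)
kern  Kerning: array of kerning pairs
post  PostScript information: PostScript FontInfo dictionary entries and PostScript names for all glyphs
PCLT  PCL 5 data: font information for the HP PCL 5 Printer Language (font number, pitch, x-height, style, symbol set, and so on)
OS/2  OS/2 and Windows metrics: the set of metrics required for TrueType fonts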

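Of these, the glyph data (the glyf table) is the core of a TrueType font, so it is usually the largest table. A glyph is the graphical data for a character: each glyph corresponds to one character, a glyph has one or more contours, and each contour has several points. For example, if a font file contains the character "0", then "0" corresponds to one glyph with two contours, an inner one and an outer one; in this example the outer contour is controlled by 21 points and the inner one by 11.

These points control the shape of a contour but do not all lie on it, because TrueType glyph outlines are defined with second-order (quadratic) Bezier curves. Each curve segment has three points: a point on the curve, a point off the curve, and another point on the curve. Several consecutive off-curve points are allowed, not to define third-order or higher Bezier curves but to reduce the number of control points. For the four-point pattern on-off-off-on, for instance, an implied on-curve point is inserted to give on-off-on-off-on, which defines two quadratic Bezier segments. Once the implied points are counted, the outer contour of "0" has more than 21 points and the inner contour more than 11.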
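The implied-point rule can be sketched in a few lines of Python. This is only an illustration, not fontTools code: the function name expand_implied_points and the (x, y, on_curve) tuple layout are assumptions made for this example.

```python
def expand_implied_points(points):
    """Insert the implied on-curve midpoint between consecutive off-curve points.

    `points` is a closed contour given as (x, y, on_curve) tuples
    (a hypothetical layout chosen for this sketch).
    """
    expanded = []
    count = len(points)
    for i, (x, y, on_curve) in enumerate(points):
        expanded.append((x, y, on_curve))
        # The contour is closed, so the last point is followed by the first.
        next_x, next_y, next_on = points[(i + 1) % count]
        if not on_curve and not next_on:
            # Two control points in a row: the midpoint lies on the curve.
            expanded.append(((x + next_x) / 2, (y + next_y) / 2, True))
    return expanded


# An on-off-off-on contour, as in the text above
contour = [(0, 0, True), (0, 10, False), (10, 10, False), (10, 0, True)]
result = expand_implied_points(contour)
pattern = "-".join("on" if point[2] else "off" for point in result)
print(pattern)  # on-off-on-off-on
```

The inserted point is simply the midpoint of the two neighbouring control points, which is why the font does not need to store it.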

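fontTools: a font file parsing library

fontTools is a font file parsing library written in Python. The document mentioned above, ttf文件結構解析.doc, describes the internal tables of a ttf file as C structures, for example the glyph header: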

typedef struct
{
    SHORT numberOfContours;  // number of contours (signed: negative if the glyph is composite)
    FWord xMin;              // minimum x of the glyph's coordinate data
    FWord yMin;              // minimum y of the glyph's coordinate data
    FWord xMax;              // maximum x of the glyph's coordinate data
    FWord yMax;              // maximum y of the glyph's coordinate data
} GlyphHeader;
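To make the layout concrete, here is a sketch that unpacks such a header with Python's struct module. It assumes the five fields are stored as big-endian signed 16-bit integers, which is how the TrueType specification describes the glyf table; the function name parse_glyph_header and the sample bytes are made up for illustration.

```python
import struct
from collections import namedtuple

GlyphHeader = namedtuple("GlyphHeader", "numberOfContours xMin yMin xMax yMax")


def parse_glyph_header(data):
    """Unpack the first 10 bytes of a glyph record (five big-endian int16 values)."""
    return GlyphHeader(*struct.unpack(">5h", data[:10]))


# Hand-built sample: 2 contours, bounding box from (0, -12) to (508, 719)
sample = struct.pack(">5h", 2, 0, -12, 508, 719)
header = parse_glyph_header(sample)
print(header)
```

In practice fontTools does this decoding for you, so the sketch is only meant to show what the C structure corresponds to on disk.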

Using the fontTools package is more convenient: it can dump the tables of a font file, together with their data, to a file in XML format.

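from fontTools.ttLib import TTFont

# ttfFile is the path to a ttf file, or an open ttf file stream
font = TTFont(ttfFile)

# path can be the path of the XML file to write, or an IO stream;
# tables selects the tables to dump (all tables are dumped if it is empty).
# font.keys() lists every table name, e.g.
# ['GlyphOrder', 'head', 'hhea', 'maxp', 'OS/2', 'hmtx', 'cmap', 'loca', 'glyf', 'name', 'post', 'GSUB']
font.saveXML(path, tables=['glyf'])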

For the character "0", the C structure above looks like this when expressed as XML:

<TTGlyph name="uniE575" xMin="0" yMin="-12" xMax="508" yMax="719"></TTGlyph>

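The fontTools.ttLib.TTFont.getBestCmap method returns the mapping from character codes to glyph names.

Because fontTools dumps the data as XML, we need to parse XML.

xml.dom.minidom: an XML parsing library

Python offers many XML parsing libraries; this is simply one of them.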

from xml.dom.minidom import parseString
doc = parseString(io.getvalue())
root = doc.documentElement

The DOM objects produced by xml.dom.minidom support the basic DOM operations. The only ones we need are element.getElementsByTagName and element.getAttribute.
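As a quick illustration of those two calls, the sketch below reads a glyph name and its points from a small XML string. The snippet is hand-written and only modelled on the fontTools dump (a TTGlyph element containing contour and pt elements); the coordinates are invented.

```python
from xml.dom.minidom import parseString

# Hand-written sample modelled on the fontTools XML dump (shortened)
xml_text = """
<ttFont>
  <glyf>
    <TTGlyph name="uniE575" xMin="0" yMin="-12" xMax="508" yMax="719">
      <contour>
        <pt x="0" y="0" on="1"/>
        <pt x="254" y="719" on="0"/>
        <pt x="508" y="0" on="1"/>
      </contour>
    </TTGlyph>
  </glyf>
</ttFont>
"""

root = parseString(xml_text.strip()).documentElement
glyph = root.getElementsByTagName("TTGlyph")[0]
name = glyph.getAttribute("name")
# getAttribute always returns a string, so convert the values explicitly
points = [
    (int(pt.getAttribute("x")), int(pt.getAttribute("y")), pt.getAttribute("on") == "1")
    for pt in glyph.getElementsByTagName("pt")
]
print(name, points)
```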

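matplotlib: a plotting toolkit

matplotlib lets us draw the glyphs stored in a ttf file.

pytesseract: an OCR library

pytesseract is a Python wrapper for tesseract-ocr, so tesseract-ocr must be installed for it to work. Tesseract is an open-source OCR engine. It was originally developed at HP Labs and later released as open source; Google then improved it, fixed bugs, optimised it and re-released it.
tesseract-ocr can turn the glyph images we draw into text. Using pytesseract is simple: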

import pytesseract
from PIL import Image

image = Image.open(imagePath)
word = pytesseract.image_to_string(image, lang='chi_sim', config='--psm 10')

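Implementation

Our approach: use fontTools to parse the ttf file and dump it as XML, parse the XML with xml.dom.minidom, draw each glyph with matplotlib, and have pytesseract recognise the drawn glyph as a string.

Dumping the ttf file to XML with fontTools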

from io import StringIO

from fontTools.ttLib import TTFont


class TTFParser:
    """Parse a ttf font file."""

    def __init__(self):
        self.ttfFile = ""
        self.font = None

    # Parse the font file and return its glyf table as XML in a StringIO
    def parseFontFile(self, ttfFile):
        self.ttfFile = ttfFile
        try:
            self.font = TTFont(ttfFile)
        except Exception as e:
            # Exceptions have no .message attribute in Python 3, so format the exception itself
            raise Exception("an exception occurred while opening font [%s] (%s)" % (ttfFile, e)) from e
        try:
            io = StringIO()
            self.font.saveXML(io, tables=['glyf'])
            return io
        except Exception as e:
            raise Exception("an exception occurred while saving TTFont [%s] as XML (%s)" % (ttfFile, e)) from e

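io now holds the XML dump of the glyf table: a series of TTGlyph elements like the one shown earlier.

Parsing the XML with xml.dom.minidom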

from xml.dom.minidom import parseString

doc = parseString(io.getvalue())
root = doc.documentElement
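The drawing code in the next section calls glyph.getContours(), contour.getPoint(), contour.size() and point.onCurve, but the classes behind those calls are not shown in this article. The following is one possible sketch, inferred only from those call sites. The class bodies and the loadGlyphs helper are assumptions, not the author's code, and the sample XML is hand-written.

```python
from xml.dom.minidom import parseString


class Point:
    """One contour point; onCurve is False for Bezier control points."""
    def __init__(self, x, y, onCurve=True):
        self.x, self.y, self.onCurve = x, y, onCurve


class Contour:
    def __init__(self, points):
        self.points = points

    def getPoint(self, index):
        return self.points[index]

    def size(self):
        return len(self.points)


class Glyph:
    def __init__(self, element):
        # element is a <TTGlyph> DOM node from the fontTools XML dump
        self.name = element.getAttribute("name")
        # Empty glyphs (such as a space) carry no bounding box, so default to 0
        self.box = {key: int(element.getAttribute(key) or 0)
                    for key in ("xMin", "yMin", "xMax", "yMax")}
        self.contours = [
            Contour([Point(int(pt.getAttribute("x")),
                           int(pt.getAttribute("y")),
                           pt.getAttribute("on") == "1")
                     for pt in contour.getElementsByTagName("pt")])
            for contour in element.getElementsByTagName("contour")
        ]

    def getContours(self):
        return self.contours

    def getXmin(self):
        return self.box["xMin"]

    def getXmax(self):
        return self.box["xMax"]

    def getYmin(self):
        return self.box["yMin"]

    def getYmax(self):
        return self.box["yMax"]


def loadGlyphs(xmlText):
    """Build a Glyph for every <TTGlyph> element in the XML text."""
    root = parseString(xmlText).documentElement
    return [Glyph(node) for node in root.getElementsByTagName("TTGlyph")]


# Hand-written sample data, not taken from a real font
sample = (
    '<ttFont><glyf>'
    '<TTGlyph name="uniE575" xMin="0" yMin="-12" xMax="508" yMax="719">'
    '<contour><pt x="0" y="-12" on="1"/><pt x="254" y="719" on="0"/>'
    '<pt x="508" y="-12" on="1"/></contour>'
    '</TTGlyph>'
    '<TTGlyph name="space"/>'
    '</glyf></ttFont>'
)
glyphs = loadGlyphs(sample)
print(glyphs[0].name, glyphs[0].getContours()[0].size(), glyphs[0].getYmin())
```

With a real font, the string passed to loadGlyphs would be io.getvalue() from the parser above.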

Drawing the glyph with matplotlib

The core code first:

from matplotlib.path import Path

# Point is the point class used by the parser (x, y, onCurve); it is not shown in this article.


class TTFPath(Path):
    """A matplotlib Path built from the contours of one glyph."""
    def __init__(self, glyph):
        self.__verts = []
        self.__codes = []
        self.calculatePath(glyph)

        super(TTFPath, self).__init__(self.__verts, self.__codes)

    # Compute the path
    def calculatePath(self, glyph):
        contours = glyph.getContours()
        for contour in contours:
            # Move the pen to the first point (assumed to lie on the curve)
            first = contour.getPoint(0)
            self.__moveTo(first)
            # Walk the points of the contour, starting from the second one
            for index in range(1, contour.size()):
                point = contour.getPoint(index)
                previous = contour.getPoint(index - 1)
                if point.onCurve:
                    if previous.onCurve:
                        # Two on-curve points in a row: a straight segment
                        self.__lineTo(point)
                    else:
                        # On-curve point after a control point: end point of the curve
                        self.__quadTo(point)
                else:
                    if not previous.onCurve:
                        # Two control points in a row: their midpoint is the implied
                        # on-curve point that ends the previous curve
                        self.__quadTo(self.__midPoint(previous, point))
                    # onCurve is False: this is a control point, which is not on the curve
                    self.__quadTo(point)
            # The contour is finished: return to the starting point
            last = contour.getPoint(contour.size() - 1)
            if last.onCurve:
                self.__lineTo(first)
            else:
                self.__quadTo(first)

    def __moveTo(self, point):
        self.__codes.append(Path.MOVETO)
        self.__verts.append((point.x, point.y))

    def __lineTo(self, point):
        self.__codes.append(Path.LINETO)
        self.__verts.append((point.x, point.y))

    # matplotlib expects the CURVE3 code on both the control point and the end point,
    # so this helper is called once for each of them
    def __quadTo(self, point):
        self.__codes.append(Path.CURVE3)
        self.__verts.append((point.x, point.y))

    def __closePath(self, point):
        self.__codes.append(Path.CLOSEPOLY)
        self.__verts.append((point.x, point.y))

    def __midPoint(self, pointA, pointB):
        return Point((pointA.x + pointB.x) / 2, (pointA.y + pointB.y) / 2)

from io import BytesIO

import matplotlib.pyplot as plt
from matplotlib.patches import PathPatch
from PIL import Image


class TTFRender:
    """Draw a glyph with matplotlib and export it as an image."""
    color = 'black'

    def __init__(self):
        self.fig, self.ax = plt.subplots()

    def draw(self, glyph):
        path = TTFPath(glyph)
        patch = PathPatch(path, facecolor=self.color)
        # Clear whatever the previous glyph left on these axes
        self.ax.cla()
        self.ax.add_patch(patch)
        # Set up the coordinate system
        self.centralize(glyph.getXmin(), glyph.getXmax(), glyph.getYmin(), glyph.getYmax())
        # Hide the axes and their tick marks
        self.ax.axis('off')

    # Centre the image; this handles small glyphs such as commas, full stops and decimal points
    def centralize(self, xMin, xMax, yMin, yMax):
        width = 640
        height = 480
        self.ax.set_xlim(int(xMin - width / 2), int(xMax + width / 2))
        self.ax.set_ylim(int(yMin - height / 2), int(yMax + height / 2))

    def getBufferImage(self, width=80, height=60):
        buffer = BytesIO()
        self.fig.savefig(buffer, format='png')
        # Rewind so that PIL reads the image from the beginning of the buffer
        buffer.seek(0)
        image = Image.open(buffer)
        return image.resize((width, height))

Recognising the glyph image as a string with pytesseract

# encoding: utf-8

import pytesseract
from PIL import Image


class Ocr:
    """Thin wrapper around pytesseract for recognising a single character."""
    def __init__(self, image=None):
        self.image = image
        self.lang = 'chi_sim'
        # Page segmentation mode 10: treat the image as a single character
        self.psm = '10'

    def setImagePath(self, imagePath):
        self.image = Image.open(imagePath)

    def setImageFile(self, imageFile):
        self.image = imageFile

    def setLang(self, lang):
        self.lang = lang

    # Recognise the image stored on this instance
    def getWords(self):
        return self.recognize(self.image)

    # Recognise the image passed in
    def recognize(self, image):
        return pytesseract.image_to_string(image, lang=self.lang, config='--psm ' + self.psm)
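Once every glyph has been recognised, the results can be combined with the character map to decode scraped text. The sketch below assumes cmap has the shape returned by getBestCmap() (code point to glyph name) and that recognized maps glyph names to OCR results. Both dictionaries are made-up sample data, and build_translation and decode are hypothetical helpers, not part of any library.

```python
def build_translation(cmap, recognized):
    """Map each obfuscated code point to the character that OCR recognised."""
    return {code: recognized[name]
            for code, name in cmap.items()
            if name in recognized}


def decode(text, translation):
    """Replace obfuscated characters; anything not in the table is left as-is."""
    return text.translate(translation)


# Made-up sample data for illustration
cmap = {0xE575: "uniE575", 0xE1A2: "uniE1A2"}
recognized = {"uniE575": "0", "uniE1A2": "7"}

table = build_translation(cmap, recognized)
decoded = decode("Price: \ue1a2\ue575\ue575", table)
print(decoded)  # Price: 700
```

str.translate accepts a dictionary keyed by code point, which is why the integer keys of the character map can be used directly.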
