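One of the labs in my compiler principles course was to build the lexical analyzer of a compiler front end. I chose to write a lexical analyzer for C in Python.
A lexical analyzer takes a source program as input and outputs word symbols (tokens). When I first defined the token sequence, where each token is a pair (token kind, attribute value), I represented the kind with a number. Later, when I went on to syntax analysis, I found that numbers were not a good fit, so I revised the kind codes.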
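To make the (token kind, attribute value) pair concrete, here is a minimal sketch that uses readable string kinds instead of numbers. It is written for Python 3 and is not the code from this lab; the names `Token`, `KEYWORDS` and `classify_word` are hypothetical, and the keyword set is only a subset.

```python
from collections import namedtuple

# Hypothetical names for illustration only.
Token = namedtuple("Token", ["kind", "value"])

# A small subset of the C keywords, enough for the example.
KEYWORDS = {"int", "char", "float", "if", "else", "return"}


def classify_word(word):
    """Return a Token whose kind is a readable string rather than a number."""
    if word in KEYWORDS:
        # Keywords use their own upper-cased spelling as the kind.
        return Token(word.upper(), word)
    return Token("ID", word)


if __name__ == "__main__":
    print(classify_word("int"))
    print(classify_word("centigrade"))
```

String kinds such as `ID` or `INT` can be matched directly against grammar symbols in a later parsing stage, which is probably why they are easier to work with than numeric codes.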
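The overall approach of my program is as follows. It first scans the source program once to remove redundant whitespace and comments. It then reads the preprocessed source a second time, recognizes the words, converts them into pairs, and saves them to a token file. It also builds a symbol table to manage identifiers. If it finds an error, it prints the position and an error message.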
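The preprocessing pass can be sketched as a small state machine that copies everything except `/* ... */` comments. This is an illustrative Python 3 version, not the function used in this lab. The name `strip_comments` is hypothetical, and two simplifications are assumed: an unterminated comment is dropped silently, and comment markers inside string literals are not treated specially.

```python
def strip_comments(source):
    """Remove /* ... */ comments, keeping newlines so line numbers stay valid."""
    out = []
    state = 0  # 0: code, 1: saw '/', 2: inside comment, 3: saw '*' inside comment
    for c in source:
        if state == 0:
            if c == '/':
                state = 1          # might be the start of a comment
            else:
                out.append(c)
        elif state == 1:
            if c == '*':
                state = 2          # "/*" confirmed, start skipping
            elif c == '/':
                out.append('/')    # earlier '/' was ordinary; this one may open a comment
            else:
                out.append('/')    # it was just a division sign
                out.append(c)
                state = 0
        elif state == 2:
            if c == '*':
                state = 3
            elif c == '\n':
                out.append(c)      # preserve line breaks inside comments
        else:  # state == 3
            if c == '/':
                state = 0          # "*/" closes the comment
            elif c != '*':
                if c == '\n':
                    out.append(c)
                state = 2
    if state == 1:
        out.append('/')            # trailing '/' at end of input
    return ''.join(out)


if __name__ == "__main__":
    print(strip_comments("a=9/5.0; /*mess/age*/ b=1; /*hello****/"))
```

Building a new list of characters avoids editing the string while it is being scanned, and keeping the newlines means a later pass can still report correct line numbers.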
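For word recognition I used finite automaton theory, so the next action is decided by the current state and the input symbol. I therefore drew many state diagrams to recognize the different kinds of words, such as strings and numbers. I had originally planned to draw the diagrams in Visio, but in the end that seemed like too much trouble, so I drew them by hand, which was the fastest way.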
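Figure 1. State transition diagram for comments
Figure 2. State transition diagram for identifiers
Figure 3. State transition diagram for strings
Figure 4. State transition diagram for delimiters
Figure 5. State transition diagram for integer and floating-point constants
Figure 6. State transition diagram for character constants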
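A state diagram like the one for integer and floating-point constants can be written down as a transition table. The sketch below is an illustration in Python 3, not a transcription of the hand-drawn figure. The state names, `char_class` and `classify_number` are hypothetical, and it assumes that a leading zero followed by another digit is rejected and that at least one digit must follow the decimal point.

```python
def char_class(c):
    """Map an input character to the symbol class used by the table."""
    if c == '0':
        return 'zero'
    if c in '123456789':
        return 'nonzero'
    if c == '.':
        return 'dot'
    return 'other'


# (current state, input class) -> next state; a missing entry means "reject".
TRANSITIONS = {
    ('start', 'zero'): 'zero',
    ('start', 'nonzero'): 'int',
    ('zero', 'dot'): 'dot',
    ('int', 'zero'): 'int',
    ('int', 'nonzero'): 'int',
    ('int', 'dot'): 'dot',
    ('dot', 'zero'): 'frac',
    ('dot', 'nonzero'): 'frac',
    ('frac', 'zero'): 'frac',
    ('frac', 'nonzero'): 'frac',
}

# Accepting states and the token kind each one yields.
ACCEPTING = {'zero': 'DIGIT', 'int': 'DIGIT', 'frac': 'FRACTION'}


def classify_number(text):
    """Return 'DIGIT', 'FRACTION', or None if text is not a valid constant."""
    state = 'start'
    for c in text:
        state = TRANSITIONS.get((state, char_class(c)))
        if state is None:
            return None
    return ACCEPTING.get(state)


if __name__ == "__main__":
    for sample in ["32", "5.0", "095", "8.1.6"]:
        print(sample, classify_number(sample))
```

With the table separated from the loop, changing the accepted language only means editing `TRANSITIONS` and `ACCEPTING`.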
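As for error handling, of the four kinds of errors that can occur during lexical analysis (shown in the figure below), I handle the first three. The third is not handled very well, though: I restricted the characters allowed inside a character constant too severely. For example, my lexical analyzer does not allow a semicolon to appear inside a string.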
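One way to loosen the restriction on character constants is to accept any single character except a quote, a backslash, or a newline, plus a short list of escape sequences. The sketch below is a Python 3 illustration under that assumption and is not part of the lab code. `SIMPLE_ESCAPES` and `is_char_constant` are hypothetical names, and the escape list is deliberately incomplete.

```python
# Characters allowed after a backslash in this simplified sketch.
SIMPLE_ESCAPES = set("ntr0\\'\"")


def is_char_constant(text):
    """Return True if text is a complete constant such as 'a', ';' or '\\n'."""
    if len(text) == 3:
        # Form: quote, one ordinary character, quote.
        return text[0] == "'" and text[2] == "'" and text[1] not in "'\\\n"
    if len(text) == 4:
        # Form: quote, backslash, escape letter, quote.
        return (text[0] == "'" and text[1] == "\\"
                and text[2] in SIMPLE_ESCAPES and text[3] == "'")
    return False


if __name__ == "__main__":
    for sample in ["';'", "'\\n'", "'f"]:
        print(repr(sample), is_char_constant(sample))
```

Excluding a small set of characters instead of listing the allowed ones means punctuation such as `;` is accepted without further changes.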
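Figure 7. The four kinds of errors in lexical analysis
The test program is shown below. It contains the main kinds of C statements and a small number of errors: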
int main()
{
    int _a;
    char ch = 'f;
    floatb,centigrade,fahrj@enheit;
    char fd = '\n';
    printf("please inputa);
    scanf("%d",&a);
    /*mycomment1or***2*/
    printf("please inputb");
    scanf("%f",&b);
    if (a==8.1.6)
    {
        centigrade=095*(b-32)/9; /*itismyc5435omment*/
        printf("TheCentigrade is ",centigrade); /*mess/age*/
    }
    else if (a!=0)
    {
        fahrenheit=(9/5.0)*b++32; /*mycontent*/
        printf("TheFahrenheit is fahrenheit); /*hello****/
    }
    return 0;
}
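The result of running it is shown in the figure below:
Figure 8. Error report for the test program
This is the first halfway-decent thing I have written in Python, so many parts are not written well and the code structure is a bit messy. In short, this is how I got the first compiler principles lab done. My rough code follows.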
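# -*- coding: utf-8 -*-
'''
Created on 2012-10-18
@author: zouliping
'''
# Python 2 source: it relies on string.letters, cmp() and the print statement.
import string

_key = ("auto","break","case","char","const","continue","default",
        "do","double","else","enum","extern","float","for",
        "goto","if","int","long","register","return","short",
        "signed","static","sizeof","struct","switch","typedef","union",
        "unsigned","void","volatile","while")  # the 32 C keywords
_abnormalChar = '@#$%^&*~'  # illegal characters that may turn up in an identifier
_syn = ''       # kind code of the current token
_p = 0          # current index into the program text
_value = ''     # the word (lexeme) just recognized
_content = ''   # program text
_mstate = 0     # state of the string automaton
_cstate = 0     # state of the character-constant automaton
_dstate = 0     # state of the integer / floating-point automaton
_line = 1       # current line number in the source
_mysymbol = []  # symbol table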
def outOfComment():
    '''Remove the comments from the code'''
    global _content
    state = 0
    index = -1
    # The loop walks the original string; index tracks the matching position
    # in _content, which shrinks each time a comment is removed.
    for c in _content:
        index = index + 1
        if state == 0:
            if c == '/':
                state = 1
                startIndex = index
        elif state == 1:
            if c == '*':
                state = 2
            else:
                state = 0
        elif state == 2:
            if c == '*':
                state = 3
            else:
                pass
        elif state == 3:
            if c == '/':
                endIndex = index + 1
                comment = _content[startIndex:endIndex]
                _content = _content.replace(comment,'')  # replace the comment with nothing and move the index back
                index = startIndex - 1
                state = 0
            elif c == '*':
                pass
            else:
                state = 2
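def getMyProm():
    '''Read the code to analyze from a file'''
    global _content
    myPro = open(r'E://test.txt','r')
    for line in myPro:
        if line != '\n':
            _content = "%s%s" %(_content,line.lstrip())  # append the line with its leading whitespace stripped
        else:
            _content = "%s%s" %(_content,line)  # keep blank lines so the line count stays correct
    myPro.close()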
def analysis(mystr):
    '''Analyze the source text and produce the next token'''
    global _p,_value,_syn,_mstate,_dstate,_line,_cstate
    _value = ''
    ch = mystr[_p]
    _p += 1
    while ch == ' ':  # skip spaces
        ch = mystr[_p]
        _p += 1
    if ch in string.letters or ch == '_':  # identifier: letter(letter|digit)*
        while ch in string.letters or ch in string.digits or ch == '_' or ch in _abnormalChar:
            _value += ch
            ch = mystr[_p]
            _p += 1
        _p -= 1
        for abnormal in _abnormalChar:
            if abnormal in _value:
                _syn = '@-6'  # error code: identifier contains an illegal character
                break
            else:
                _syn = 'ID'
        for s in _key:
            if cmp(s,_value) == 0:
                _syn = _value.upper()  # keyword
                break
        if _syn == 'ID':
            inSymbolTable(_value)
    elif ch == '\"':  # string
        while ch in string.letters or ch in '\"% ' :
            _value += ch
            if _mstate == 0:
                if ch == '\"':
                    _mstate = 1
            elif _mstate == 1:
                if ch == '\"':
                    _mstate = 2
            ch = mystr[_p]
            _p += 1
        if _mstate == 1:
            _syn = '@-2'  # error code: string is not closed
            _mstate = 0
        elif _mstate == 2:
            _mstate = 0
            _syn = 'STRING'
        _p -= 1
    elif ch in string.digits:  # integer or floating-point constant
        while ch in string.digits or ch == '.' or ch in string.letters:
            _value += ch
            if _dstate == 0:
                if ch == '0':
                    _dstate = 1
                else:
                    _dstate = 2
            elif _dstate == 1:
                if ch == '.':
                    _dstate = 3
                else:
                    _dstate = 5
            elif _dstate == 2:
                if ch == '.':
                    _dstate = 3
            ch = mystr[_p]
            _p += 1
        _syn = ''  # clear any '@-7' left over from an earlier token before testing it below
        for char in string.letters:
            if char in _value:
                _syn = '@-7'  # error code: digits mixed with letters, e.g. 12AB56
                _dstate = 0
        if _syn != '@-7':
            if _dstate == 5:
                _syn = '@-3'  # error code: number starts with 0
                _dstate = 0
            else:
                _dstate = 0
                if '.' not in _value:
                    _syn = 'DIGIT'  # integer: digit digit*
                else:
                    if _value.count('.') == 1:
                        _syn = 'FRACTION'  # floating-point number
                    else:
                        _syn = '@-5'  # error code: more than one dot, e.g. 1.2.3
        _p -= 1
    elif ch == '\'':  # character constant
        while ch in string.letters or ch in '@#$%&*\\\'\"':
            _value += ch
            if _cstate == 0:
                if ch == '\'':
                    _cstate = 1
            elif _cstate == 1:
                if ch == '\\':
                    _cstate = 2
                elif ch in string.letters or ch in '@#$%&*':
                    _cstate = 3
            elif _cstate == 2:
                if ch in 'nt':
                    _cstate = 3
            elif _cstate == 3:
                if ch == '\'':
                    _cstate = 4
            ch = mystr[_p]
            _p += 1
        _p -= 1
        if _cstate == 4:
            _syn = 'CHARACTER'
            _cstate = 0
        else:
            _syn = '@-4'  # error code: character constant is not closed
            _cstate = 0
    elif ch == '<':
        _value = ch
        ch = mystr[_p]
        if ch == '=':  # '<='
            _value += ch
            _p += 1
            _syn = '<='
        else:  # '<'
            _syn = '<'
    elif ch == '>':
        _value = ch
        ch = mystr[_p]
        if ch == '=':  # '>='
            _value += ch
            _p += 1
            _syn = '>='
        else:  # '>'
            _syn = '>'
    elif ch == '!':
        _value = ch
        ch = mystr[_p]
        if ch == '=':  # '!='
            _value += ch
            _p += 1
            _syn = '!='
        else:  # '!'
            _syn = '!'
    elif ch == '+':
        _value = ch
        ch = mystr[_p]
        if ch == '+':  # '++'
            _value += ch
            _p += 1
            _syn = '++'
        else:  # '+'
            _syn = '+'
    elif ch == '-':
        _value = ch
        ch = mystr[_p]
        if ch == '-':  # '--'
            _value += ch
            _p += 1
            _syn = '--'
        else:  # '-'
            _syn = '-'
    elif ch == '=':
        _value = ch
        ch = mystr[_p]
        if ch == '=':  # '=='
            _value += ch
            _p += 1
            _syn = '=='
        else:  # '='
            _syn = '='
    elif ch == '&':
        _value = ch
        ch = mystr[_p]
        if ch == '&':  # '&&'
            _value += ch
            _p += 1
            _syn = '&&'
        else:  # '&'
            _syn = '&'
    elif ch == '|':
        _value = ch
        ch = mystr[_p]
        if ch == '|':  # '||'
            _value += ch
            _p += 1
            _syn = '||'
        else:  # '|'
            _syn = '|'
    elif ch == '*':  # '*'
        _value = ch
        _syn = '*'
    elif ch == '/':  # '/'
        _value = ch
        _syn = '/'
    elif ch == ';':  # ';'
        _value = ch
        _syn = ';'
    elif ch == '(':  # '('
        _value = ch
        _syn = '('
    elif ch == ')':  # ')'
        _value = ch
        _syn = ')'
    elif ch == '{':  # '{'
        _value = ch
        _syn = '{'
    elif ch == '}':  # '}'
        _value = ch
        _syn = '}'
    elif ch == '[':  # '['
        _value = ch
        _syn = '['
    elif ch == ']':  # ']'
        _value = ch
        _syn = ']'
    elif ch == ',':  # ','
        _value = ch
        _syn = ','
    elif ch == '\n':  # newline: tells the main loop to advance the line counter
        _syn = '@-1'
def inSymbolTable(token):
    '''Store an identifier in the symbol table if it is not there yet'''
    global _mysymbol
    if token not in _mysymbol:
        _mysymbol.append(token)
if __name__ == '__main__':
    getMyProm()
    outOfComment()
    symbolTableFile = open(r'E://symbol_table.txt','w')
    tokenFile = open(r'E://token.txt','w')
    while _p != len(_content):
        analysis(_content)
        if _syn == '@-1':
            _line += 1  # count the lines of the program
        elif _syn == '@-2':
            print 'String ' + _value + ' is not closed! Error in line ' + str(_line)
        elif _syn == '@-3':
            print 'Number ' + _value + ' is wrong, it must not start with 0! Error in line ' + str(_line)
        elif _syn == '@-4':
            print 'Character ' + _value + ' is not closed! Error in line ' + str(_line)
        elif _syn == '@-5':
            print 'Number ' + _value + ' is illegal! Error in line ' + str(_line)
        elif _syn == '@-6':
            print 'Identifier ' + _value + ' must not contain illegal characters! Error in line ' + str(_line)
        elif _syn == '@-7':
            print 'Number ' + _value + ' is illegal, it contains letters! Error in line ' + str(_line)
        else:  # the token has no lexical error
            #print (_syn,_value)
            tokenFile.write(str(_syn)+'@'+_value+'\n')
    tokenFile.close()
    symbolTableFile.write('Entry address\tVariable name\n')
    i = 0
    for symbolItem in _mysymbol:
        symbolTableFile.write(str(i)+'\t\t\t'+symbolItem+'\n')
        i += 1
    symbolTableFile.close()
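The long chain of operator and delimiter branches in `analysis` could probably be folded into two lookup sets. The following is a minimal sketch in Python 3, separate from the listing above. `TWO_CHAR_OPS`, `ONE_CHAR_OPS` and `match_operator` are hypothetical names, and the sets only cover the symbols the listing already recognizes.

```python
# Two-character operators are tried first so that "<=" is not split into "<" and "=".
TWO_CHAR_OPS = {"<=", ">=", "!=", "==", "++", "--", "&&", "||"}
ONE_CHAR_OPS = set("<>!=+-&|*/;(){}[],")


def match_operator(text, pos):
    """Return (lexeme, next_pos) for the operator at pos, or (None, pos)."""
    pair = text[pos:pos + 2]
    if pair in TWO_CHAR_OPS:
        return pair, pos + 2
    if pos < len(text) and text[pos] in ONE_CHAR_OPS:
        return text[pos], pos + 1
    return None, pos


if __name__ == "__main__":
    print(match_operator("a<=b", 1))
    print(match_operator("b++32", 1))
    print(match_operator("x;", 1))
```

Since the kind code of each of these tokens is the lexeme itself, the returned string could serve as both `_syn` and `_value`.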