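A Quick Try of awk
I hadn't used awk in a long time, but I picked it up again these past few days for some text processing, and it is as sharp as ever. Its productivity, its distinctive pattern-action way of processing text line by line, its regular-expression matching, and its C-like syntax keep it among the essential tools in my programming kit. Here is what I wrote:
Note: the original files are a.txt, b.txt and uk.txt, of which only a small part is quoted here; merge.txt is an intermediate file, and illegal.txt is the final output.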
a.txt
[ei]
[Ab]
[Ab5ekstrE]
[Abi5niFiEu]
[7Ab5EuvEu]
......................
b.txt
a
a few
a little
ab
ab extra
ab initio
ab intra
ab ovo
..............................
#merge.awk
#ARGV[1] is the dictionary tones file, ARGV[2] is the word list.
#Merge the two files line by line, skipping blank tones in ARGV[1].
BEGIN {
    ORS = ""
    while ((getline < ARGV[1]) > 0)
    {
        tone = $0    #store the tone line
        getline < ARGV[2]    #read the matching word line into $0
        if (length(tone) > 0)
        {
            print tone "\t\t\t" $0
            print "\n"
        }
    }
}
$ awk -f merge.awk a.txt b.txt > merge.txt
The intermediate file generated here is used below:
merge.txt
[ei] a
[Ab] ab
[Ab5ekstrE] ab extra
[Abi5niFiEu] ab initio
[7Ab5EuvEu] ab ovo
.......................................
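The pairing step in merge.awk can be tried on a tiny input. The following is only a sketch: the two sample files, the `tmp` directory and the variable names `tones`, `words` and `merged` are made up for illustration and are not the real a.txt and b.txt. It reads one line from each file per iteration, parenthesizes each `getline` so the comparison with 0 is unambiguous, and skips pairs whose tone line is empty.

```shell
#!/bin/sh
# Hypothetical sample data (assumption), not the author's real files.
tmp=$(mktemp -d)
printf '[ei]\n\n[Ab]\n' > "$tmp/tones.txt"    # second tone line is blank
printf 'a\na few\nab\n' > "$tmp/words.txt"

merged=$(awk '
BEGIN {
    tones = ARGV[1]; words = ARGV[2]
    # Read one line from each file per iteration.
    while ((getline tone < tones) > 0) {
        if ((getline word < words) <= 0)
            break                      # word list ran out
        if (length(tone) > 0)          # skip pairs with a blank tone
            print tone "\t" word
    }
}' "$tmp/tones.txt" "$tmp/words.txt")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With this sample the blank tone line consumes "a few" without printing it, so only two tab-separated pairs come out.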
uk.txt
ii i:
aa B:
oo C:
uu u:
@@ /:
......................................
#find.awk
#Check for illegal dictionary tones (illegal means the tone cannot be found in the TTS list passed as ARGV[1]).
BEGIN {
    while ((getline < ARGV[1]) > 0)
        tone[$2]++    #the second column holds the legal tone symbols
    while ((getline < ARGV[2]) > 0)
    {
        for (i = 1; i <= length($1); i++)
        {
            c1 = substr($1, i, 1)    #one tone character
            c2 = substr($1, i, 2)    #two tone characters
            if (c2 in tone)
                i++    #a two-character symbol matched, skip its second character
            else if (!(c1 in tone) && c1 != "[" && c1 != "]" && c1 !~ /[57]/)
            {
                if (c1 == ":")
                    c1 = substr($1, i-1, 2)    #report ":" together with the character before it
                print c1 "\t" $0
            }
        }
    }
}
$ awk -f find.awk uk.txt merge.txt > illegal.txt
The final generated file:
illegal.txt
[: [:blQdsQkE] bloodsucker
B [5blBuzi] blowsy
B [5dVBki] jockey
B [5glEub7flBuE] globeflower
B [5hBundztQN] hound's-tongue
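The scan in find.awk tries a two-character symbol first and falls back to a single character. Here is a minimal sketch of that idea on one hard-coded string. The symbol inventory `i: e i b t`, the sample transcription `[bi:tX]` and the names `known`, `syms`, `word` and `out` are assumptions chosen for illustration; they are not taken from uk.txt.

```shell
#!/bin/sh
out=$(awk '
BEGIN {
    # Hypothetical symbol inventory (assumption, not the real uk.txt).
    n = split("i: e i b t", syms, " ")
    for (k = 1; k <= n; k++)
        known[syms[k]] = 1

    word = "[bi:tX]"                   # hypothetical transcription
    i = 1
    while (i <= length(word)) {
        two = substr(word, i, 2)
        one = substr(word, i, 1)
        if (two in known)
            i += 2                     # longest match first
        else if (one in known)
            i += 1
        else {
            if (one != "[" && one != "]")
                print "unknown: " one  # neither length is a known symbol
            i += 1
        }
    }
}')
printf '%s\n' "$out"
```

Here `i:` is consumed as one symbol, so only the stray `X` is reported.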
2. Memorizing vocabulary in the same order every time doesn't work well; the small program below shuffles the order:
#This program rearranges the lines of a text in random order.
{
    lines[NR] = $0    #read the text into an array line by line (index -> line); NR counts up from 1
}
END {
    seed = srand()    #with no argument, srand() seeds from the current time and returns the previous seed
    #print seed
    for (i = 0; i < NR; i++)
    {
        while (1)
        {
            r = int(rand() * NR) + 1    #pick a random r in the range 1..NR
            if (r in lines)    #if r is still in the array, print that line and delete it
            {
                #print r " --------- " lines[r]
                print lines[r]
                delete lines[r]
                break
            }
            #print r
        }
    }
}
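The program above keeps drawing random indexes until it hits a line that has not been printed yet, so the last few lines can take many retries on a large file. One alternative, shown here only as a sketch and not as the author's method, is a Fisher-Yates shuffle, which swaps lines in place and needs exactly one random draw per line. The five input words and the variable name `shuffled` are made up for the demonstration.

```shell
#!/bin/sh
shuffled=$(printf 'one\ntwo\nthree\nfour\nfive\n' | awk '
{ lines[NR] = $0 }                     # collect every input line
END {
    srand()                            # seed from the current time
    # Fisher-Yates: walk from the end, swap each slot with a random
    # slot at or before it.
    for (i = NR; i > 1; i--) {
        j = int(rand() * i) + 1        # random index in 1..i
        tmp = lines[i]; lines[i] = lines[j]; lines[j] = tmp
    }
    for (i = 1; i <= NR; i++)
        print lines[i]
}')
printf '%s\n' "$shuffled"
```

The output order differs between runs, but every input line appears exactly once.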