Web Scraping with Python, Chapter 7

Original

2018-09-09 19:55

Environment: Python 3.6, PyCharm IDE

In the data cleaning section of Chapter 7, the author scrapes text from Wikipedia and processes it into 2-grams. The code is as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string


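def cleaninput(input):
    # Replace newlines with a space, drop citation markers such as [12],
    # then collapse runs of spaces (raw strings avoid invalid-escape warnings)
    input = re.sub(r'\n+', ' ', input)
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(r' +', ' ', input)
    # Drop every non-ASCII character
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        # Keep words longer than one character, plus the words "a" and "I"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == "i"):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    # Return every run of n consecutive cleaned words as a list
    input = cleaninput(input)
    output = []
    for i in range(len(input) - n + 1):
        output.append(input[i:i + n])
    return output

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, 'lxml')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
# Note: this rebinds the name ngrams from the function to its result (a list)
ngrams = ngrams(content, 2)
print(ngrams)
print("2-grams count is:" + str(len(ngrams)))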

The final output should be a list of two-word lists, like this:

[['Python', 'Paradigm'], ['Paradigm', 'Object-oriented'], ['Object-oriented',
 'imperative'], ['imperative', 'functional'], ['functional', 'procedural'], ......]

Next, to merge the duplicate entries in this list and count how often each 2-gram occurs, the author changes the code to:

.....
from collections import OrderedDict

........

ngrams=ngrams(content,2)
ngrams=OrderedDict(sorted(ngrams.items(),key=lambda  t:t[1],reverse=True))
print(ngrams)
print("2-grams count is:"+str(len(ngrams)))

Because ngrams here is a list, not a dictionary, the interpreter raises an error: 'list' object has no attribute 'items'.
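The failure can be reproduced in isolation. This is a small sketch with a hypothetical three-item list standing in for the real 2-gram output; it only shows that .items() exists on dictionaries and not on lists.

```python
# hypothetical stand-in for the list returned by ngrams(content, 2)
bigrams = [['of', 'the'], ['the', 'language'], ['of', 'the']]

try:
    bigrams.items()                 # lists have no .items() method
except AttributeError as err:
    message = str(err)
    print(message)

# .items() is a dict method, so the data must be turned into a dict first
print(hasattr(bigrams, 'items'), hasattr({}, 'items'))
```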

I changed the book's code as follows:

......
from collections import OrderedDict

.........

ngrams=ngrams(content,2)
ngrams_dict={}
for i in ngrams:
    n=ngrams.count(i)
    i=str(i)  # convert the list i to a string so it can serve as a dict key; keys are unique, so duplicates merge
    ngrams_dict[i]=n
ngrams=OrderedDict(sorted(ngrams_dict.items(),key=lambda  t:t[1],reverse=True))
print(ngrams)
print("2-grams count is:"+str(len(ngrams)))

The code then prints a result close to the one in the book:

OrderedDict([("['Python', 'Software']", 37), ("['Software', 'Foundation']", 37), 
("['of', 'the']", 34), ("['Foundation', 'Retrieved']", 30), ("['of', 'Python']", 28), .....]
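Calling list.count() inside the loop rescans the whole list for every 2-gram, which gets slow on a full Wikipedia page. A possible alternative, sketched here on a hypothetical sample list, counts in a single pass and uses tuples instead of strings as keys so the two words stay separate. This is one option under those assumptions, not the book's approach.

```python
from collections import OrderedDict

# hypothetical sample standing in for the output of ngrams(content, 2)
bigrams = [['of', 'the'], ['Python', 'Software'], ['of', 'the'],
           ['Software', 'Foundation'], ['of', 'the'], ['Python', 'Software']]

counts = {}
for gram in bigrams:
    key = tuple(gram)                      # tuples are hashable, lists are not
    counts[key] = counts.get(key, 0) + 1   # one pass, no list.count() rescans

# same sort as before: highest count first
ranked = OrderedDict(sorted(counts.items(), key=lambda t: t[1], reverse=True))
print(ranked)
```

With tuple keys, ranked[('of', 'the')] looks up a count directly, whereas string keys would need the exact text "['of', 'the']".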

While completing the author's unfinished code from Web Scraping with Python, I ran into the following points about Python programming.

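1) A list cannot be used as a dictionary key. Lists are mutable, so Python does not give them a hash value, and a dictionary needs one for every key. For details see the linked article: Why list can't be a dictionary keys?

2) A list cannot be added to a set as an element, for the same reason.

For example: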

a=set([1,2,2])

The result is a={1,2}.

But the code

a=set([[1,2,2],[1,2,2]])

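raises an error: unhashable type: 'list'. Set methods such as add therefore cannot take a list as an element either. To use a list as a key, as with the 2-grams from the book, it first has to be converted to a hashable type, for example a string (as done above) or a tuple.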
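Both restrictions, and the usual workarounds, can be checked in a few lines. The sketch below uses a hypothetical two-word list and an illustrative count value. It records the TypeError raised in each case and then shows a tuple and a string working as hashable stand-ins.

```python
pair = ['Python', 'Software']   # hypothetical 2-gram

# A list as a dict key or set element raises TypeError in every case.
errors = []
for attempt in (lambda: {pair: 1}, lambda: {pair}, lambda: set().add(pair)):
    try:
        attempt()
    except TypeError as err:
        errors.append(str(err))
print(errors)

# Hashable stand-ins: a tuple keeps the items separate, str() flattens them.
as_tuple = tuple(pair)
as_text = str(pair)
seen = {as_tuple, tuple(['Python', 'Software'])}   # equal tuples collapse to one
lookup = {as_tuple: 37, as_text: 37}               # 37 is an illustrative count
print(seen, lookup[('Python', 'Software')])
```

A tuple is often the more convenient choice, because the original words can be read back from the key without parsing a string.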


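3) Ways to count how many times each element occurs in a list (pairing each value with its count n):

Method 1:

List = [1,2,3,4,5,3,2,1,4,5,6,4,2,3,4,6,2,2]
List_set = set(List)  # List_set is a set holding the distinct values of List
for item in List_set:
    print("the %d has found %d" % (item, List.count(item)))

Method 2 (using a dictionary):

List = [1,2,3,4,5,3,2,1,4,5,6,4,2,3,4,6,2,2]
a = {}
for i in List:
    if List.count(i) > 1:  # only values that occur more than once are recorded
        a[i] = List.count(i)
a = sorted(a.items(), key=lambda item: item[0])
print(a)

Method 3:

from collections import Counter
List = [1,2,3,4,5,3,2,1,4,5,6,4,2,3,4,6,2,2]
Counter(List)

Note that Counter is a dictionary subclass, so this method cannot be used directly when List is a nested list whose elements are themselves lists, because those elements are unhashable.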
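The Counter limitation can be worked around by converting each inner list to a tuple before counting. A minimal sketch on hypothetical nested data:

```python
from collections import Counter

nested = [[1, 2], [3, 4], [1, 2], [1, 2], [3, 4], [5, 6]]   # hypothetical data

try:
    Counter(nested)
    failed = False
except TypeError:
    failed = True          # the inner lists cannot be hashed
print(failed)

# Converting each inner list to a tuple makes the elements hashable.
counts = Counter(tuple(item) for item in nested)
print(counts.most_common())   # (element, count) pairs, highest count first
```

Counter.most_common() also returns the pairs already sorted by count, so a separate sorted() call is not needed for a frequency ranking.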
