[Python crawler] A collection of common problems

Original

2020-02-24 03:37

Data Acquisition

1> One tag contains several lines of data: how to get each line separately
This is the situation I ran into.

The page source is:

<p>
	杭州 余杭区 仓前
	<em class="vline"></em>
	1-3年
	<em class="vline"></em>
	本科
</p>

Solution

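# Use selenium to get the inner HTML of this <p> tag, then split it into a list.
# Note: find_element_by_xpath was removed in Selenium 4.3; on newer versions use
# chrome.find_element(By.XPATH, ...) with `from selenium.webdriver.common.by import By`.
p_html = chrome.find_element_by_xpath('//div[@class="job-banner"]//p').get_attribute('innerHTML')
info_ls = [s.strip() for s in p_html.split('<em class="vline"></em>')]	# strip tabs and newlines
item['City'] = info_ls[0]	# city
item['workingExp'] = info_ls[1]	# work experience (second field, e.g. 1-3年)
item['eduLevel'] = info_ls[2]	# education level (third field, e.g. 本科)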
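
The same split can be tried without a browser. The following is a minimal sketch that applies the idea to the HTML snippet above, held in a plain string. The function name `split_job_info` and the `sample` string are illustrative assumptions, not part of the original scraper.

```python
SEPARATOR = '<em class="vline"></em>'


def split_job_info(inner_html: str) -> dict:
    """Split the inner HTML of the <p> tag on the separator tag and label the parts."""
    # strip() removes the tabs and newlines that surround each field in the source
    parts = [part.strip() for part in inner_html.split(SEPARATOR)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    city, working_exp, edu_level = parts
    return {"City": city, "workingExp": working_exp, "eduLevel": edu_level}


# Inner HTML of the <p> tag shown in the page source above
sample = """
	杭州 余杭区 仓前
	<em class="vline"></em>
	1-3年
	<em class="vline"></em>
	本科
"""

item = split_job_info(sample)
print(item)
```

Checking the number of parts first makes a layout change on the page fail loudly instead of silently shifting fields into the wrong keys.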

Pyspider

1> Error while installing pyspider: ERROR: Command errored out with exit status 10: python setup.py egg_info Check…
https://blog.csdn.net/weixin_43810415/article/details/99694315

2> Errors when running pyspider all: ① SyntaxError: invalid syntax, ② - Deprecated option 'domaincontroller': use 'http_au
https://blog.csdn.net/u012424313/article/details/89511520

3> At runtime: ValueError: Invalid configuration
Solution: pip install wsgidav==2.4.1

4> pyspider hangs at result_worker starting… when run
The output looks like this:

(venv1) D:\Fire\PycharmProject\pyspider\test1>pyspider
c:\users\xxx\pycharmprojects\untitled1\venv1\lib\site-packages\pyspider\libs\utils.py:196: FutureWarning: timeout is not supported on your platform.
  warnings.warn("timeout is not supported on your platform.", FutureWarning)
[W 191028 21:30:05 run:413] phantomjs not found, continue running without it.
[I 191028 21:30:07 result_worker:49] result_worker starting...

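Solution:
Download phantomjs, put phantomjs.exe in the Python root directory (the folder that contains python.exe), and run pyspider again.
If that still does not work, see: https://blog.csdn.net/qq_35167821/article/details/89162394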
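
Before rerunning pyspider, it can help to confirm that the executable is actually reachable. The sketch below assumes pyspider finds phantomjs through the PATH environment variable, which would explain why copying it next to python.exe works when that folder is on PATH. The helper name `find_phantomjs` is hypothetical.

```python
import shutil
from typing import Optional


def find_phantomjs(name: str = "phantomjs") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, otherwise None."""
    # On Windows, shutil.which also tries the extensions in PATHEXT, such as .exe
    return shutil.which(name)


path = find_phantomjs()
if path is None:
    print("phantomjs is not on PATH")
else:
    print(f"phantomjs found at {path}")
```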

5> When debugging, the pyspider web preview pane turns out to be very small
See item 3 in this article:
https://www.jianshu.com/p/7bff6fd4dc1b

Data Storage

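1> Giving MongoDB data an expiration time
Use the create_index() method of the pymongo library (createIndex() in the mongo shell). Its expireAfterSeconds parameter takes a number of seconds and creates an index with an expiration time (a TTL index), so that documents written to the collection afterwards expire automatically.
For details, see: https://www.runoob.com/mongodb/mongodb-indexing.html
Note: the indexed field must hold a datetime value when the document is written; otherwise the document is not deleted automatically.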
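
A minimal sketch of the approach described above. The field name `createdAt`, the one-hour lifetime, the connection URI, and the database and collection names are all assumptions chosen for illustration. The pymongo part is wrapped in `main()` so that the helpers can be read and tested without a running server.

```python
from datetime import datetime, timezone

TTL_FIELD = "createdAt"  # hypothetical field name
TTL_SECONDS = 3600       # hypothetical lifetime: one hour


def ensure_ttl_index(collection, field=TTL_FIELD, seconds=TTL_SECONDS):
    """Create a TTL index so documents are removed once `field` is older than `seconds`."""
    return collection.create_index(field, expireAfterSeconds=seconds)


def make_document(payload: dict) -> dict:
    """Copy the payload and add a datetime under the TTL field (strings or numbers would not expire)."""
    doc = dict(payload)
    doc[TTL_FIELD] = datetime.now(timezone.utc)
    return doc


def main():
    # Requires pymongo and a reachable MongoDB server; URI and names are assumptions.
    from pymongo import MongoClient

    collection = MongoClient("mongodb://localhost:27017")["spider"]["results"]
    ensure_ttl_index(collection)
    collection.insert_one(make_document({"url": "https://example.com"}))
```

Calling `main()` against a local server creates the index and inserts one expiring document. MongoDB removes expired documents with a background task that runs about once a minute, so deletion is not immediate.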

Other Problems

1> Fixing cd not switching to a directory on another drive on Windows
https://blog.csdn.net/kakuma_chen/article/details/71173243

2> Converting strings that start with '\u' to Chinese
https://www.cnblogs.com/hahaxzy9500/p/7685955.html
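
One common way to do this in Python 3 is the `unicode_escape` codec. This is a sketch under the assumption that the input contains only ASCII characters, that is, literal backslash-u sequences rather than real Chinese characters mixed in. The function name is illustrative and this is not necessarily the method from the linked article.

```python
def decode_unicode_escapes(text: str) -> str:
    """Turn literal escape sequences such as \\u4e2d into the characters they stand for."""
    # encode("ascii") raises UnicodeEncodeError if the input already contains non-ASCII text
    return text.encode("ascii").decode("unicode_escape")


raw = "\\u4e2d\\u6587"  # 12 characters: a backslash, 'u', and four hex digits, twice
decoded = decode_unicode_escapes(raw)
print(decoded)
```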

3> Changing the virtual environment that jupyter notebook starts in
Read the following two articles together:
https://blog.csdn.net/hao5335156/article/details/81165727
https://blog.csdn.net/weixin_41813895/article/details/84750990

4> A random User-Agent (UA) library for Python crawlers
https://blog.csdn.net/qq_18525247/article/details/81355397
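
For reference, the basic idea behind rotating the UA can be sketched with the standard library alone. This is not the library from the linked article. The three User-Agent strings are illustrative examples, and `random_headers` is a hypothetical helper.

```python
import random

# Illustrative User-Agent strings only; a real crawler would keep a longer, current list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0",
]


def random_headers() -> dict:
    """Build request headers with a User-Agent picked at random from the list."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

The returned dict can be passed as the `headers` argument of an HTTP client call, so that each request may present a different User-Agent.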
