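Two Major Schools of NLP

Knowledge Graphs

Knowledge is represented as triples: domain-specific knowledge is stored and represented in structured form.
Model and algorithm: reasoning is carried out over the relations in the graph, which lets the system learn on its own.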
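
The triple representation and relation-based reasoning described above can be sketched in a few lines of Python. This is only a minimal illustration, not how a production graph store works: the triples, the relation names (`has_symptom`, `treated_by`, `relieved_by`), and the helper functions `query` and `two_hop` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples, for illustration only.
triples = [
    ("influenza", "has_symptom", "fever"),
    ("influenza", "has_symptom", "cough"),
    ("influenza", "treated_by", "oseltamivir"),
    ("fever", "relieved_by", "ibuprofen"),
]

# Index the graph by (head, relation) so that lookups are direct.
index = defaultdict(set)
for head, relation, tail in triples:
    index[(head, relation)].add(tail)


def query(head, relation):
    """Return every tail linked to `head` by `relation`, sorted."""
    return sorted(index.get((head, relation), set()))


def two_hop(head, first_relation, second_relation):
    """Follow two relations in sequence: a very simple form of inference."""
    results = set()
    for middle in index.get((head, first_relation), set()):
        results |= index.get((middle, second_relation), set())
    return sorted(results)


print(query("influenza", "has_symptom"))
print(two_hop("influenza", "has_symptom", "relieved_by"))
```

The second call derives a fact that is not stored as a single triple (something that relieves a symptom of influenza), which is the kind of inference the graph structure makes possible.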

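Deep Learning

Data is described by machine-extracted features: the intrinsic features are extracted automatically.
Model and algorithm: the feature weights are optimized to realize a nonlinear mapping.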
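
As a rough illustration of "optimizing weights for a nonlinear mapping", the sketch below trains a single sigmoid unit with gradient descent. It is the smallest possible instance of the idea, not a deep network: real models stack many such units and learn the features themselves. The toy data (logical OR of two inputs), the step count, and the learning rate are assumptions chosen for the example.

```python
import math

# Toy data (an assumption for illustration): two binary features, label = logical OR.
samples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Nonlinear mapping: a sigmoid applied to the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def mean_loss(weights, bias):
    """Average cross-entropy loss over the toy data."""
    total = 0.0
    for features, label in samples:
        p = predict(weights, bias, features)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(samples)


def train(steps=2000, learning_rate=0.5):
    """Batch gradient descent on the weights and the bias."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in samples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


weights, bias = train()
print(mean_loss([0.0, 0.0], 0.0), mean_loss(weights, bias))
print([round(predict(weights, bias, features)) for features, _ in samples])
```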

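Compared with knowledge graphs, deep learning supports end-to-end models and reduces human involvement in the intermediate steps. Knowledge graphs, by representing relations as triples, capture as much as possible of the interconnections found in the real world.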

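Directions of Development

Pretrain+finetune

Pre-training: a large corpus, unsupervised training, and a deep model together yield semantic representations.
Fine-tuning: task-specific semantic information is added on the downstream task to carry out that task.

Reinforcement Learning

Training: deeper semantic information is acquired automatically.
Testing: faced with a complex context, the model can automatically find a suitable expression.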

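Acquiring Data with Crawlers

Data comes from three sources: public datasets processed with regular expressions and structured parsing, manually curated data, and data collected by crawlers.
What is a crawler: a means of automatically obtaining web page data by parsing web resources.
On the specialist websites of each subdomain, a crawler fetches the page content, parses it, and saves the resulting resources.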
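
The three steps above (fetch, parse, save) can be sketched with the standard library alone. Everything named here is hypothetical and chosen for the example: the functions `fetch`, `parse_links`, and `save`, the `LinkParser` class, the sample HTML, and the output file name `crawl_sample.json`. The demonstration parses a literal page so that it runs without network access; `fetch` is defined but not called.

```python
import json
import os
import tempfile
from html.parser import HTMLParser
from urllib import request


def fetch(url, timeout=10):
    """Step 1: download a page and return its text (not called in this demo)."""
    with request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Step 2: collect href and anchor text from <a> tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"href": self._href, "text": "".join(self._text).strip()})
            self._href = None


def parse_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save(records, path):
    """Step 3: persist the parsed records as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


# Offline demonstration on a literal page (made-up content).
sample_html = '<ul><li><a href="/drug/1">Aspirin</a></li><li><a href="/drug/2">Ibuprofen</a></li></ul>'
records = parse_links(sample_html)
output_path = os.path.join(tempfile.gettempdir(), "crawl_sample.json")
save(records, output_path)
print(records)
```

For a real site, `parse_links(fetch(url))` would replace the literal page, subject to the site's terms of use and robots.txt.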

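Crawler-Related Libraries

urllib: fetches HTML files over the network
re: regular expressions
bs4: simple, convenient HTML processing
selenium: browser automation, originally built for automated testing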

import urllib.request

# urlopen returns an HTTP response object; read() yields the page body as bytes
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# Send a POST request with custom request headers
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict so the built-in type is not shadowed
form_data = {
    'name': 'zhaofan'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# Open a page through a proxy (a proxy must be listening on the address below)
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
# response = opener.open('http://httpbin.org/get')
response = opener.open("http://yao.xywy.com/")
print(response.read())
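
The list above names `re`, but no example of it is given. A minimal sketch of pulling links out of downloaded HTML with a regular expression might look like this; the HTML fragment and the pattern are assumptions chosen for illustration, and a pattern like this is brittle against real-world markup, which is one reason to prefer bs4 for anything non-trivial.

```python
import re

# A short HTML fragment standing in for a downloaded page.
html = '''
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
'''

# Non-greedy groups capture the href value and the anchor text of each link.
link_pattern = re.compile(r'<a\s+href="(.*?)"[^>]*>(.*?)</a>', re.S)
links = link_pattern.findall(html)

for href, text in links:
    print(text, href)
```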

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
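soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
print(soup.prettify())
print(soup.title)  # soup.<tag name> returns the first tag with that name
print(soup.title.name)  # name of the tag itself, i.e. 'title'
print(soup.title.string)  # text content of the first title tag
print(soup.title.parent.name)  # name of the title tag's parent, i.e. 'head'
print(soup.p)  # the first p tag
print(soup.p["class"])  # value of the p tag's class attribute, returned as a list
print(soup.a)  # the first a tag
print(soup.find_all('a'))  # find_all can search by tag name, attributes, or text
print(soup.find(id='link3'))  # the first tag whose id is 'link3'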

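from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

# Start one browser; webdriver.Firefox() can be used in the same way
browser = webdriver.Chrome()
url = 'https://www.baidu.com'
browser.get(url)  # open the page
print(browser.page_source)  # print the page source

# Locate elements with By; the find_element_by_* helpers were removed in recent Selenium 4 releases.
# The id 'q' is kept from the original and must exist on the page that is open.
input_first = browser.find_element(By.ID, 'q')  # locate by id
input_two = browser.find_element(By.CSS_SELECTOR, '#q')  # locate by CSS selector
print(input_first)
print(input_two)

# Selenium's strength is its interactive actions.
# This part assumes the current page has an iframe named 'iframeResult'
# that contains #draggable and #droppable elements.
browser.switch_to.frame('iframeResult')
source = browser.find_element(By.CSS_SELECTOR, '#draggable')
target = browser.find_element(By.CSS_SELECTOR, '#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()

browser.quit()  # close the browser only after all work is done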
