Problem: Write a Python program that uses regular expressions to extract all hyperlinks from an HTML page, strip the HTML tag elements, and generate a text file.

Original

2020-06-16 16:18

import re
import urllib.request  # "import urllib" alone does not load the request submodule
import os

def getHtml(url):
    # Download the page and return its raw bytes; getHref decodes them later.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

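def getHref(html):
    text = []
    #html = "<script src=\" http://hm.baidu.com/h.js?3d8e7fc0de8a2a75f2ca3bfe128e6391\" type=\"text/javascript\"></script>"
    # Both patterns capture everything between the scheme and the next double quote,
    # so URLs that are single-quoted or unquoted are not handled.
    http_res = r"(?<=http://).+?(?=\")"
    https_res = r"(?<=https://).+?(?=\")"
    #res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
    # (?<=exp)  lookbehind: matches a position that is preceded by exp
    # .   matches any single character except the newline \n; use \. to match a literal dot
    # +   matches the preceding subexpression one or more times, e.g. 'zo+' matches "zo" and "zoo" but not "z"; equivalent to {1,}
    # ?   directly after a quantifier, as in .+?, makes it lazy, so the match stops at the first following quote;
    #     on its own it matches zero or one occurrence, e.g. "do(es)?" matches "do" and "does"; equivalent to {0,1}
    # (?=exp)  lookahead: matches a position that is followed by exp

    page = html.decode('utf-8')  # decode the bytes once and reuse the string
    http_urls = re.findall(http_res, page)
    for url in http_urls:
        print("http://"+url)
        text.append("http://"+url)
    https_urls = re.findall(https_res, page)
    for url in https_urls:
        print("https://"+url)
        text.append("https://" + url)

    return text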

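if __name__ == "__main__":
    url = "http://tieba.baidu.com/"
    #url = input("Enter the URL to fetch: ")
    path = "txt"
    html = getHtml(url)
    text = getHref(html)

    folder = os.path.exists(path)
    if not folder:  # create the output folder if it does not exist yet
        os.makedirs(path)  # makedirs also creates any missing parent directories
        print("Directory created!")
    # The with block closes the file even if a write fails.
    with open(path + '/url' + '.txt', 'w', encoding='utf-8') as file:
        for urls in text:
            # Text mode already converts "\n" to the platform line ending,
            # so writing "\r\n" would produce "\r\r\n" on Windows.
            file.write(urls + "\n")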
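
The problem statement also asks for the HTML tag elements to be removed, which the program above does not do. Below is a minimal sketch of that step, together with a variant of the commented-out href pattern that accepts both quote styles. The names extract_hrefs, strip_tags, and the sample string are hypothetical and not part of the original program. A regular expression is only a rough approximation of HTML parsing, so this assumes reasonably well-formed markup.

```python
import html
import re

# Hypothetical helpers for illustration; not part of the original program.

# href value in single or double quotes; \1 requires the same closing quote.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)
# <script> and <style> blocks, whose contents are not visible text.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Any remaining tag such as <p>, </a>, or <br/>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_hrefs(page):
    """Return the value of every quoted href attribute in the page."""
    return [match.group(2) for match in HREF_RE.finditer(page)]


def strip_tags(page):
    """Remove script/style blocks and all tags, leaving plain text."""
    without_blocks = SCRIPT_STYLE_RE.sub("", page)
    without_tags = TAG_RE.sub("", without_blocks)
    text = html.unescape(without_tags)  # turn &amp; and similar entities back into characters
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


sample = (
    '<p>Visit <a href="http://example.com/a">A</a> &amp; '
    "<a href='/b'>B</a></p><script>var x = 1;</script>"
)
print(extract_hrefs(sample))  # ['http://example.com/a', '/b']
print(strip_tags(sample))     # Visit A & B
```

Unlike the http:// and https:// patterns above, this variant also returns relative links such as /b, so the two approaches can give different lists for the same page.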
