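[The Crawler Road] A Few Notes on Learning BeautifulSoup

Even when slacking off, one should follow the basic law..

Reference: http://cuiqingcai.com/1319.html

BeautifulSoup is an HTML parsing library for Python; the current major version is bs4. bs4 runs on Python 3 as well, but this post uses Python 2.7 syntax throughout.

Import the three libraries used here: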

import requests
from bs4 import BeautifulSoup
import re

First, the garbled-text problem:
Python 2 (here on Ubuntu) defaults to the ascii codec, so any Chinese text blows up with an error like this:

'ascii' codec can't encode characters in position 848-851: ordinal not in range(128)

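So I add the following lines to the code to change the default. There are other fixes too, which I won't go into ~~(Python 2 is doomed, pick 3 to stay safe; search for "python2 中文亂碼" (python2 Chinese garbled text) to see for yourself!)~~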

import sys
reload(sys)
sys.setdefaultencoding('utf8')
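
For comparison, here is a minimal sketch of where that error comes from and of the explicit-encoding alternative. It is written in Python 3 syntax, which is an assumption, since the rest of this post targets Python 2.7. The sample string is made up for illustration. The reload(sys) trick above only exists in Python 2.

```python
# Python 3 sketch (assumption: the post itself targets Python 2.7).
text = "樹形dp"  # hypothetical sample string with non-ASCII characters

# Encoding with the ascii codec reproduces the kind of error quoted above.
try:
    text.encode("ascii")
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# Naming the codec explicitly avoids relying on a process-wide default.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

print(failed, decoded == text)
```

Encoding and decoding explicitly at the I/O boundary is usually enough, so the global default never needs to be touched.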

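Next, fetch the HTML source of a page. The example here is a problem write-up on a teammate's blog; the HTML is too long to include, so view it yourself.

Then build a BeautifulSoup object: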

soup = BeautifulSoup(html)
#print soup.prettify()

print soup.prettify() shows that the HTML source has been stored as a tree.

Try looking up content by tag:

print soup.title

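Output:

<title>HDU 4275 Color the Tree (hashing + tree isomorphism + combinatorics + tree DP) - GODSPEED
        - Blog Channel - CSDN.NET</title>

There are many other attributes as well:

print soup.title.string
#HDU 4275 Color the Tree (hashing + tree isomorphism + combinatorics + tree DP) - GODSPEED
#        - Blog Channel - CSDN.NET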
print soup.title.name
#title
...

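Note that looking up by tag this way returns only the first matching node. To get every node with that tag you have to search the document tree, for example with find_all. The structure is loosely like a trie in that you walk down from the root, and searching it is very flexible; see the reference at the top for details.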
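
The difference between attribute access and a full search can be sketched as follows. The markup is a hypothetical sample, not the blog page used in this post. The code uses Python 3 syntax and the built-in "html.parser" backend, both of which are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the page scraped in this post.
sample = """
<html><head><title>demo</title></head>
<body>
<p class="intro">first</p>
<p>second</p>
<p>third</p>
</body></html>
"""

soup = BeautifulSoup(sample, "html.parser")

first_p = soup.p                 # attribute access: only the first <p>
all_p = soup.find_all("p")       # every <p>, in document order
texts = [p.get_text() for p in all_p]

print(first_p.get_text(), texts)
```

find_all returns a list-like result, so it can be indexed or looped over like any other sequence.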

Next, try searching by the id attribute of a tag:

info = soup.select("#blog_rank")
print info

Output:

[<ul id="blog_rank">
<li>Visits: <span>29690</span></li>
<li>Points: <span>2605</span> </li>
<li>Level: <span style="position:relative;display:inline-block;z-index:1">
<img alt="" id="leveImg" src="http://c.csdnimg.cn/jifen/images/xunzhang/jianzhang/blog5.png" style="vertical-align: middle;"/>
<div id="smallTittle" style=" position: absolute;  left: -24px;  top: 25px;  text-align: center;  width: 101px;  height: 32px;  background-color: #fff;  line-height: 32px;  border: 2px #DDDDDD solid;  box-shadow: 0px 2px 2px rgba (0,0,0,0.1);  display: none;   z-index: 999;">
<div style="left: 42%;  top: -8px;  position: absolute;  width: 0;  height: 0;  border-left: 10px solid transparent;  border-right: 10px solid transparent;  border-bottom: 8px solid #EAEAEA;"></div>
            Points: 2605 </div>
</span> </li>
<li>Rank: <span>No. 7622</span></li>
</ul>]

Beyond that there are lookups by class name, combined lookups, and so on.. For details, again see the blog linked at the top..
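
As a rough illustration of class and combined lookups, here is a sketch using select with CSS selectors. The markup is a simplified, made-up version of the blog_rank list above, and the class name "stat" is hypothetical. The code uses Python 3 syntax and the "html.parser" backend, which are assumptions.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical version of the #blog_rank list shown above.
sample = """
<ul id="blog_rank">
  <li class="stat">Visits: <span>29690</span></li>
  <li class="stat">Points: <span>2605</span></li>
  <li>Rank: <span>7622</span></li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")

by_class = soup.select(".stat")                     # lookup by class name
combined = soup.select("#blog_rank li.stat span")   # id + tag + class + descendant
values = [span.get_text() for span in combined]

print(len(by_class), values)
```

select always returns a list, even for an id selector, which is why the post indexes the result with [0] later on.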

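That is more or less it.. This time, for lack of anything better to do, I scraped the text and code of my teammate's write-up. It's of no real use, but it counts as practice with bs and regular expressions.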

soup = BeautifulSoup(html)
code = soup.select("#article_content")
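print code # raw extracted content, unprocessed
code = str(code[0])
code = re.compile(r'<[^>]+>').sub('', code) # strip the tags
code = code.replace('&lt;','<')
code = code.replace('&gt;','>')
code = code.replace('&amp;','&') # replace HTML entities; this could really be a function..
print code

Output, raw extraction first and then the cleaned text: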

[<div class="article_content" id="article_content">
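<p><span style="font-family:SimHei; font-size:18px">Problem: given a tree with n nodes, color each node with one of m colors and count the colorings that are distinct up to rotation.</span></p>
<p><span style="font-family:SimHei; font-size:18px">Idea: the core of this problem is removing duplicates caused by symmetry, so the root must be the center of the tree. First use BFS to find the diameter. If the diameter has an odd number of nodes, run the DFS from its midpoint; if even, insert a new node between the two middle nodes and use that as the root.</span></p>
<p><span style="font-family:SimHei; font-size:18px">Let dp[i] be the number of colorings of the subtree rooted at i. Hash the shapes of the child subtrees and sort them so that equal hashes are adjacent. If n child subtrees share one hash (that is, are isomorphic) and each has m colorings, combinatorics gives C(n+m-1, n) ways to color them together. This follows from stars and bars: place m-1 bars among n subtrees and choose the n subtree positions out of n+m-1 slots, which gives C(n+m-1, n).</span></p>
<p><span style="font-family:SimHei; font-size:18px">After that it is an ordinary tree DP; because the DFS starts from the center, no isomorphic subtrees are missed.</span></p>
<p><span style="font-family:SimHei; font-size:18px">PS: I picked a bad prime for the hash again..... and got WA for a whole evening plus a morning........ Personal takeaway.... don't pick a prime that is too small; ideally its square is near the mod, which keeps the hash from being too linear.</span></p>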
<div style="top:0px">
<p></p><pre class="cpp" name="code">#include&lt;cstdio&gt;
#include&lt;cstring&gt;
#include&lt;cmath&gt;
#include&lt;cstdlib&gt;
#include&lt;iostream&gt;
#include&lt;algorithm&gt;
#include&lt;vector&gt;
#include&lt;map&gt;
#include&lt;queue&gt;
#include&lt;stack&gt;
#include&lt;string&gt;
#include&lt;map&gt;
#include&lt;set&gt;
#include&lt;ctime&gt;
#define eps 1e-6
#define LL long long
#define pii pair&lt;int, int&gt;
#pragma comment(linker, "/STACK:1024000000,1024000000")
using namespace std;

const int MAXN = 50500;
const int MOD = 1e9+7;
const int P = 10003;
const int A = 131237; 
int n, m;
vector&lt;int&gt; G[MAXN];
bool vis[MAXN];
int pre[MAXN], dis[MAXN];
LL inv[MAXN]; 
struct Node {
    LL hash, ans;
    bool operator &lt; (const Node&amp; A) const {
        return hash &lt; A.hash;
    }
};
LL pow_mod(LL a, LL p, LL n) {
    if(p == 0) return 1;
    LL ans = pow_mod(a, p/2, n);
    ans = ans * ans % n;
    if(p%2 == 1) ans = ans * a % n;
    return ans;
} 
void init() {
    for(int i = 1; i &lt;= 50005; i++) inv[i] = pow_mod(i, MOD-2, MOD);
}
LL C(int n, int m) {
    LL ans = 1;
    for(int i = n-m+1; i &lt;= n; i++) ans = ans * i % MOD;
    for(int i = 1; i &lt;= m; i++) ans = ans * inv[i] % MOD; 
    return ans;
}
int bfs1() {
    memset(vis, 0, sizeof(vis));
    queue&lt;int&gt; q;
    q.push(1);
    vis[1] = 1;
    int ans = 1;
    while(!q.empty()) {
        int t = q.front();
        q.pop();
        for(int i = 0; i &lt; G[t].size(); i++) {
            int u = G[t][i];
            if(vis[u]) continue;
            vis[u] = 1;
            q.push(u);
            ans = u;
        }
    }
    return ans;
} 
int bfs2(int cur) {
    memset(vis, 0, sizeof(vis));
    queue&lt;int&gt; q;
    q.push(cur);
    vis[cur] = 1;
    dis[cur] = 1;
    int ans = cur;
    while(!q.empty()) {
        int t = q.front();
        q.pop();
        for(int i = 0; i &lt; G[t].size(); i++) {
            int u = G[t][i];
            if(vis[u]) continue;
            vis[u] = 1;
            q.push(u);
            ans = u;
            pre[u] = t; 
            dis[u] = dis[t] + 1;
        }
    }
    if(dis[ans] &amp; 1) {
        int tmp = dis[ans] / 2; 
        while(tmp--) ans = pre[ans];
        return ans;
    }
    else {

        int pos, tmp = dis[ans]/2 - 1;
        while(tmp--) ans = pre[ans];
        pos = pre[ans];
        int root = n + 1;
        G[root].push_back(ans);
        G[root].push_back(pos);
        for(vector&lt;int&gt;::iterator it = G[ans].begin(); it != G[ans].end(); it++) 
            if(*it == pos) {
                G[ans].erase(it);
                break;
            }
        for(vector&lt;int&gt;::iterator it = G[pos].begin(); it != G[pos].end(); it++) 
            if(*it == ans) {
                G[pos].erase(it);
                break;
            }
        //cout &lt;&lt; ans &lt;&lt; " " &lt;&lt; pos &lt;&lt; endl;
        return root;
    }
} 

Node dfs(int cur, int fa) {
    vector&lt;Node&gt; val;
    for(int i = 0; i &lt; G[cur].size(); i++) {
        int u = G[cur][i];
        if(u == fa) continue;
        Node tmp = dfs(u, cur);
        val.push_back(tmp);
    }
    sort(val.begin(), val.end());
    Node ret;
    ret.hash = A, ret.ans = 1;
    int sz = val.size();
    for(int i = 0; i &lt; sz;) {
        int j = i;
        while(j&lt;sz &amp;&amp; val[j].hash==val[i].hash) {
            ret.hash *= P;
            ret.hash ^= val[i].hash;
            ret.hash %= MOD;
            j++;
        }
        ret.ans *= C(j-i+val[i].ans-1, j-i);
        ret.ans %= MOD;
        i = j;
        //cout &lt;&lt; val[i].ans &lt;&lt; " " &lt;&lt; cur &lt;&lt; " " &lt;&lt; i &lt;&lt; endl;
    }
    if(cur &lt;= n) ret.ans = ret.ans * m % MOD;
    //cout &lt;&lt; ret.ans &lt;&lt; endl;;
    return ret;
}
int main() {
    //freopen("input.txt", "r", stdin);
    init();
    while(scanf("%d%d", &amp;n, &amp;m) == 2) {
        for(int i = 1; i &lt;= n+1; i++) G[i].clear();
        for(int i = 1, u, v; i &lt; n; i++) {
            scanf("%d%d", &amp;u, &amp;v);
            G[u].push_back(v);
            G[v].push_back(u);
        }
        int tmp = bfs1();
        int root = bfs2(tmp);
        Node ans = dfs(root, 0);
        cout &lt;&lt; ans.ans &lt;&lt; endl;
    }
    return 0;
}
</pre><br/>

</div>
<div style="padding-top:20px">
<p style="font-size:12px;">Copyright notice: this is an original article by Godspeed; do not repost without the author's permission.</p>
</div>
</div>]

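Problem: given a tree with n nodes, color each node with one of m colors and count the colorings that are distinct up to rotation.
Idea: the core of this problem is removing duplicates caused by symmetry, so the root must be the center of the tree. First use BFS to find the diameter. If the diameter has an odd number of nodes, run the DFS from its midpoint; if even, insert a new node between the two middle nodes and use that as the root.
Let dp[i] be the number of colorings of the subtree rooted at i. Hash the shapes of the child subtrees and sort them so that equal hashes are adjacent. If n child subtrees share one hash (that is, are isomorphic) and each has m colorings, combinatorics gives C(n+m-1, n) ways to color them together. This follows from stars and bars: place m-1 bars among n subtrees and choose the n subtree positions out of n+m-1 slots, which gives C(n+m-1, n).
After that it is an ordinary tree DP; because the DFS starts from the center, no isomorphic subtrees are missed.
PS: I picked a bad prime for the hash again..... and got WA for a whole evening plus a morning........ Personal takeaway.... don't pick a prime that is too small; ideally its square is near the mod, which keeps the hash from being too linear.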

#include<cstdio>
#include<cstring>
#include<cmath>
#include<cstdlib>
#include<iostream>
#include<algorithm>
#include<vector>
#include<map>
#include<queue>
#include<stack>
#include<string>
#include<map>
#include<set>
#include<ctime>
#define eps 1e-6
#define LL long long
#define pii pair<int, int>
#pragma comment(linker, "/STACK:1024000000,1024000000")
using namespace std;

const int MAXN = 50500;
const int MOD = 1e9+7;
const int P = 10003;
const int A = 131237; 
int n, m;
vector<int> G[MAXN];
bool vis[MAXN];
int pre[MAXN], dis[MAXN];
LL inv[MAXN]; 
struct Node {
    LL hash, ans;
    bool operator < (const Node& A) const {
        return hash < A.hash;
    }
};
LL pow_mod(LL a, LL p, LL n) {
    if(p == 0) return 1;
    LL ans = pow_mod(a, p/2, n);
    ans = ans * ans % n;
    if(p%2 == 1) ans = ans * a % n;
    return ans;
} 
void init() {
    for(int i = 1; i <= 50005; i++) inv[i] = pow_mod(i, MOD-2, MOD);
}
LL C(int n, int m) {
    LL ans = 1;
    for(int i = n-m+1; i <= n; i++) ans = ans * i % MOD;
    for(int i = 1; i <= m; i++) ans = ans * inv[i] % MOD; 
    return ans;
}
int bfs1() {
    memset(vis, 0, sizeof(vis));
    queue<int> q;
    q.push(1);
    vis[1] = 1;
    int ans = 1;
    while(!q.empty()) {
        int t = q.front();
        q.pop();
        for(int i = 0; i < G[t].size(); i++) {
            int u = G[t][i];
            if(vis[u]) continue;
            vis[u] = 1;
            q.push(u);
            ans = u;
        }
    }
    return ans;
} 
int bfs2(int cur) {
    memset(vis, 0, sizeof(vis));
    queue<int> q;
    q.push(cur);
    vis[cur] = 1;
    dis[cur] = 1;
    int ans = cur;
    while(!q.empty()) {
        int t = q.front();
        q.pop();
        for(int i = 0; i < G[t].size(); i++) {
            int u = G[t][i];
            if(vis[u]) continue;
            vis[u] = 1;
            q.push(u);
            ans = u;
            pre[u] = t; 
            dis[u] = dis[t] + 1;
        }
    }
    if(dis[ans] & 1) {
        int tmp = dis[ans] / 2; 
        while(tmp--) ans = pre[ans];
        return ans;
    }
    else {

        int pos, tmp = dis[ans]/2 - 1;
        while(tmp--) ans = pre[ans];
        pos = pre[ans];
        int root = n + 1;
        G[root].push_back(ans);
        G[root].push_back(pos);
        for(vector<int>::iterator it = G[ans].begin(); it != G[ans].end(); it++) 
            if(*it == pos) {
                G[ans].erase(it);
                break;
            }
        for(vector<int>::iterator it = G[pos].begin(); it != G[pos].end(); it++) 
            if(*it == ans) {
                G[pos].erase(it);
                break;
            }
        //cout << ans << " " << pos << endl;
        return root;
    }
} 

Node dfs(int cur, int fa) {
    vector<Node> val;
    for(int i = 0; i < G[cur].size(); i++) {
        int u = G[cur][i];
        if(u == fa) continue;
        Node tmp = dfs(u, cur);
        val.push_back(tmp);
    }
    sort(val.begin(), val.end());
    Node ret;
    ret.hash = A, ret.ans = 1;
    int sz = val.size();
    for(int i = 0; i < sz;) {
        int j = i;
        while(j<sz && val[j].hash==val[i].hash) {
            ret.hash *= P;
            ret.hash ^= val[i].hash;
            ret.hash %= MOD;
            j++;
        }
        ret.ans *= C(j-i+val[i].ans-1, j-i);
        ret.ans %= MOD;
        i = j;
        //cout << val[i].ans << " " << cur << " " << i << endl;
    }
    if(cur <= n) ret.ans = ret.ans * m % MOD;
    //cout << ret.ans << endl;;
    return ret;
}
int main() {
    //freopen("input.txt", "r", stdin);
    init();
    while(scanf("%d%d", &n, &m) == 2) {
        for(int i = 1; i <= n+1; i++) G[i].clear();
        for(int i = 1, u, v; i < n; i++) {
            scanf("%d%d", &u, &v);
            G[u].push_back(v);
            G[v].push_back(u);
        }
        int tmp = bfs1();
        int root = bfs2(tmp);
        Node ans = dfs(root, 0);
        cout << ans.ans << endl;
    }
    return 0;
}

Copyright notice: this is an original article by Godspeed; do not repost without the author's permission.
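
The tag stripping and entity replacement used on #article_content could be folded into one helper, as the comment in the code hints. The sketch below is one possible version in Python 3 (an assumption; the post uses Python 2.7). The function name strip_html and the sample string are hypothetical. It swaps the three hand-written replace calls for the standard library's html.unescape, which also handles other entities.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # same pattern as in the post


def strip_html(fragment):
    """Remove tags first, then decode HTML entities.

    Tags are stripped before decoding so that text such as
    '&lt;cstdio&gt;' is not mistaken for a tag and deleted.
    """
    without_tags = TAG_RE.sub("", fragment)
    return html.unescape(without_tags)


# Hypothetical sample shaped like the scraped <pre> block above.
raw = '<pre class="cpp">#include&lt;cstdio&gt;\nscanf("%d", &amp;n);</pre>'
cleaned = strip_html(raw)
print(cleaned)
```

BeautifulSoup's own get_text() on the selected element is another option that avoids the regular expression entirely; the regex version is kept here because it mirrors the code in this post.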

Another teammate's post (http://www.cnblogs.com/Cw-trip/p/4898112.html):

import requests
from bs4 import BeautifulSoup
import re
import sys

reload(sys)
sys.setdefaultencoding('utf8')
url = 'http://www.cnblogs.com/Cw-trip/p/4898112.html'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:41.0) Gecko/20100101 Firefox/41.0'}
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html)
code = soup.select("#cnblogs_post_body")
code = str(code[0])
code = re.compile(r'<[^>]+>').sub('', code)
code = code.replace('&lt;','<')
code = code.replace('&gt;','>')
code = code.replace('&amp;','&') #
print code

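Problem: given a sequence, quickly answer queries for the number of pairs of equal values within an interval [l, r].
Idea: brute-force each query, but note that earlier results can be reused, so the amount of work depends on the order in which queries are processed. Intuitively, the shorter the Manhattan distance between two query points, the less has to be recomputed. So we could look for the best processing order, which is a shortest Hamiltonian circuit through the query points in the plane. That is NP-hard, so a minimum Manhattan spanning tree can stand in for it. Simplifying further, processing the queries in blocks already gives good speed.
This is the so-called Mo's algorithm.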


/*
* @author:  Cwind
*/
///#pragma comment(linker, "/STACK:102400000,102400000")
#include <iostream>
#include <map>
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <vector>
#include <queue>
#include <stack>
#include <functional>
#include <set>
#include <cmath>
using namespace std;
#define IOS std::ios::sync_with_stdio (false);std::cin.tie(0)
#define pb push_back
#define PB pop_back
#define bk back()
#define fs first
#define se second
#define sq(x) (x)*(x)
#define eps (1e-7)
#define IINF (1<<29)
#define LINF (1ll<<59)
#define INF (1000000000)
#define FINF (1e3)
#define clr(x) memset((x),0,sizeof (x))
#define cp(a,b) memcpy((a),(b),sizeof (a))
typedef long long ll;
typedef unsigned long long ull;
typedef pair<int,int> pii;
typedef pair<int,int> P;

const int maxn=5e4+3000;
struct Q{
    int l,r;
    int id;
}q[maxn];;
bool cmp1(const Q &a,const Q &b){return a.r<b.r;}
vector<Q> B[50];
int n,m;
int c[maxn];
ll a[maxn];
ll ans[maxn][2];
void solve(){
    for(int i=1;i<=m;i++) B[q[i].l/1001].pb(q[i]);
    for(int i=0;i<50;i++) sort(B[i].begin(),B[i].end(),cmp1);
    ll ax=0,ay=0;
    int l=1,r=1;
    a[c[1]]++;
    for(int i=0;i<50;i++){
        for(int j=0;j<B[i].size();j++){
            int tl=B[i][j].l,tr=B[i][j].r;
            for(int k=r;k>tr;k--){ax-=a[c[k]]-1;a[c[k]]--;ay-=k-l;}
            for(int k=r+1;k<=tr;k++){ax+=a[c[k]];a[c[k]]++;ay+=k-l;}
            for(int k=l;k<tl;k++){a[c[k]]--;ax-=a[c[k]];ay-=tr-k;}
            for(int k=l-1;k>=tl;k--){ax+=a[c[k]];a[c[k]]++;ay+=tr-k;}
            l=tl,r=tr;
            ll dd=__gcd(ax,ay);
            int id=B[i][j].id;
            ans[id][0]=ax/dd;
            ans[id][1]=ay/dd;
            if(ax==0) ans[id][1]=1;
        }
    }
}
int main(){
    freopen("/home/slyfc/CppFiles/in","r",stdin);
    cin>>n>>m;
    for(int i=1;i<=n;i++) scanf("%d",&c[i]);
    for(int i=1;i<=m;i++){
        scanf("%d%d",&q[i].l,&q[i].r);
        q[i].id=i;
    }
    solve();
    for(int i=1;i<=m;i++) printf("%lld/%lld\n",ans[i][0],ans[i][1]);
    return 0;

}

So I now have a rough idea of how our original blog posts get lifted by those disgusting sites, but I'm powerless to stop it.
