A C++ Library for Parsing HTML

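HTTP is used everywhere, and the demand for handling it from C++ is growing accordingly. A good parsing library saves a lot of effort, so I am posting the code of mine here for beginners to use.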

hHtmlParse.h

#ifndef __H_HTML_PARSE_H__
#define __H_HTML_PARSE_H__

#pragma once

#include <Windows.h>
#include <WinInet.h>
#include <string>
#include <cstdio>

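class hHtmlParse {
	std::string data;
	int p;
public:
	// Constructor; takes the HTML source
	hHtmlParse (std::string& data);
	// Gets the page's character encoding
	bool GetCharset (std::string& s);
	// Sets the current parsing position (just past the next occurrence of find)
	bool SetPos (const char* find);
	// Sets the current parsing position, searching backwards from the end
	bool SetPos_LastOf (const char* find);
	// Checks whether the target exists ahead of the current position; does not move it
	bool find (const char* find);
	// Matches a string using a sscanf_s format; s must be pre-sized by the caller
	bool MatchString (const char* match, std::string& s);
	// Gets this machine's IP address (as reported by ip138)
	static bool GetLocalIp (std::string& ip);
	// Looks up information about an address or domain name
	static bool GetAddrMessage (const wchar_t* addr, std::string& data);
	// Key function: downloads the HTML at lpURL and stores it in data
	static bool UrlGetHtml (LPCWSTR lpURL, std::string& data);
};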

#endif //__H_HTML_PARSE_H__

hHtmlParse.cpp

#include "hHtmlParse.h"
#include <cstring>

#pragma comment(lib, "WinInet.lib")

hHtmlParse::hHtmlParse (std::string& data) {
	this->data = data;
	this->p = 0;
}

bool hHtmlParse::GetCharset (std::string& s) {
	if (!this->SetPos ("charset=")) return false;
	// The value may be quoted (charset="utf-8") or not (charset=gb2312"),
	// so skip one optional opening quote before matching.
	if (p < static_cast<int> (data.size ()) && (data [p] == '"' || data [p] == '\'')) ++p;
	// MatchString writes into s's existing buffer, so make sure there is one.
	if (s.size () < 32) s.resize (32);
	return this->MatchString ("%[^\"'>; ]", s);
}

bool hHtmlParse::SetPos (const char* find) {
	std::string::size_type t = this->data.find (find, p);
	if (std::string::npos == t) return false;
	this->p = static_cast<int> (t + std::strlen (find));
	return true;
}

bool hHtmlParse::SetPos_LastOf (const char* find) {
	std::string::size_type t = this->data.rfind (find);
	if (std::string::npos == t) return false;
	this->p = static_cast<int> (t + std::strlen (find));
	return true;
}

bool hHtmlParse::find (const char* find) {
	return this->data.find (find, p) != std::string::npos;
}

bool hHtmlParse::MatchString (const char* match, std::string& s) {
	// The match is written straight into s's buffer, so the caller must resize() s
	// first, and s.length() is NOT updated (reassign from s.c_str() afterwards).
	if (s.empty ()) return false;
	// sscanf_s takes the buffer size as an unsigned int, terminator included.
	return sscanf_s (data.c_str () + p, match, const_cast<char*>(s.c_str ()), static_cast<unsigned> (s.size ())) > 0;
}

bool hHtmlParse::GetLocalIp (std::string& ip) {
	std::string page, match;
	page.resize (512);
	match.resize(16);
	if (!hHtmlParse::UrlGetHtml (L"http://1111.ip138.com/ic.asp", page)) return false;
	hHtmlParse hp (page);
	hp.SetPos ("<center>");
	hp.MatchString ("%*[^0-9]%[0-9.]", match);
	ip.clear ();
	ip = match.c_str ();
	return true;
}

bool hHtmlParse::GetAddrMessage (const wchar_t* addr, std::string& data) {
	data.clear ();
	std::wstring link = L"http://www.ip138.com/ips138.asp?ip=";
	std::string page, match;
	link += addr;
	page.resize (16384);
	match.resize (64);
	if (!hHtmlParse::UrlGetHtml (link.c_str (), page)) return false;
	hHtmlParse hp (page);
	hp.SetPos_LastOf ("<table");
	hp.SetPos ("<td");
	hp.SetPos ("<td");
	bool b = true;
	if (hp.find (">>")) {
		b = false;
		hp.SetPos (">>");
		hp.MatchString ("%*[^0-9]%[0-9.]", match);
		data = match.c_str ();
	}
	hp.SetPos ("<ul");
	while (hp.find ("<li")) {
		// Stop when no further "label：value" entry exists; otherwise the
		// position never advances and the loop would spin forever.
		if (!hp.SetPos ("：")) break;
		if (!hp.MatchString ("%[^<]", match)) continue;
		if (b) b = false; else data += "\n";
		data += match.c_str ();
	}
	return true;
}

bool hHtmlParse::UrlGetHtml (LPCWSTR lpURL, std::string& data) {
	HINTERNET hSession = InternetOpenW (L"EmotionSniffer", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
	if (!hSession) return false;
	// The cache flags belong on the request, not on InternetOpenW.
	HINTERNET hFile = InternetOpenUrlW (hSession, lpURL, NULL, 0, INTERNET_FLAG_RELOAD | INTERNET_FLAG_NO_CACHE_WRITE, 0);
	if (!hFile) {
		InternetCloseHandle (hSession); return false;
	}
	// Read in 1024-byte chunks and append, so the string grows as needed and
	// its length always equals the number of bytes received.
	char buf [1024];
	DWORD dwR = 0;
	BOOL ok = FALSE;
	data.clear ();
	while ((ok = InternetReadFile (hFile, buf, sizeof (buf), &dwR)) && dwR != 0)
		data.append (buf, dwR);
	InternetCloseHandle (hFile);
	InternetCloseHandle (hSession);
	return ok != FALSE;
}

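A quick note on usage. First, the three wrapped static functions. GetLocalIp and GetAddrMessage get their results by querying www.ip138.com and are themselves implemented with this library, so their code doubles as a reference for using it. UrlGetHtml downloads the page at a given URL through the Windows Internet API (WinINet).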

Now for how to use the library itself. I will walk through the implementation of hHtmlParse::GetLocalIp; most other uses follow the same pattern.

1. Define the strings

std::string page, match;
page.resize (512);
match.resize (16);

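page holds the HTML source, and match holds the matched string, i.e. the data we want to pull out of the page. The response is small, so 512 bytes is plenty for page. You don't actually have to set a size here, because UrlGetHtml grows the string automatically; presizing only cuts down on memory reallocations and makes things a little faster. match also gets a size here, since MatchString writes into the string's existing buffer. For an IPv4 address, 16 bytes is enough (15 characters plus the terminator).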
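The point about presizing being optional can be sketched portably. This is not the WinINet code from the library: read_all is a hypothetical helper, and a std::istream stands in for the network handle. It only illustrates the idea of reading fixed-size chunks and letting the string grow by itself.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of hHtmlParse): reads everything from `in`
// in 1024-byte chunks and appends it to `out`.
inline bool read_all(std::istream& in, std::string& out, std::size_t reserve_hint = 0) {
    out.clear();
    out.reserve(reserve_hint);             // optional: avoids early reallocations
    char chunk[1024];
    while (in) {
        in.read(chunk, sizeof chunk);
        std::streamsize got = in.gcount(); // bytes actually read this round
        if (got <= 0) break;
        out.append(chunk, static_cast<std::size_t>(got)); // grows automatically
    }
    return in.eof();                       // true once the source is fully consumed
}

// Usage sketch: the result is the same with or without a reserve hint.
inline void read_all_example() {
    std::istringstream source(std::string(3000, 'x'));
    std::string page;
    bool ok = read_all(source, page, 512);
    assert(ok && page.size() == 3000);
}
```

The reserve hint plays the role of page.resize (512) above: it changes how often the string reallocates, not what ends up in it.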

2. Fetch the page's HTML

if (!hHtmlParse::UrlGetHtml (L"http://1111.ip138.com/ic.asp", page)) return false;

This downloads the page and stores its source in page; if the download fails, the function returns.

3. Create the parser object

hHtmlParse hp (page);

This line creates a parser object from the page's HTML string.
Here is the source of the ip138 page:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=gb2312">
<title> 您的IP地址 </title>
</head>
<body style="margin:0px"><center>您的IP是：[123.144.*.*] 來自：XX市 聯通</center></body></html>

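I have masked the address and location, but you get the idea (the body text reads "Your IP is: [...] From: XX City, China Unicom"). Looking at it, the address we want is preceded by a <center> tag, right? So let's set the position there.

4. Set the current parsing position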

hp.SetPos ("<center>");

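Internally the library keeps a character position (the member p). Once the earlier part of a page has been dealt with and only the rest matters, SetPos moves that position forward, and every later SetPos, find and MatchString call works from there. That makes the page easier to walk through and faster to parse, two birds with one stone. This line moves the internal position to just after <center>, so the next call starts right at that point.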
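The cursor idea can be shown in isolation with a minimal, portable sketch. Cursor and advance_past are hypothetical names chosen for illustration, not part of hHtmlParse; they mirror what SetPos does with the member p.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical miniature of the cursor idea; not the library's class.
struct Cursor {
    std::string text;
    std::size_t pos;

    explicit Cursor(std::string t) : text(std::move(t)), pos(0) {}

    // Moves the cursor just past the next occurrence of `token` at or after
    // the current position. Leaves the cursor untouched when nothing is found.
    bool advance_past(const std::string& token) {
        assert(!token.empty());
        std::size_t hit = text.find(token, pos);
        if (hit == std::string::npos) return false;
        pos = hit + token.size();
        return true;
    }
};
```

After advance_past("<center>") succeeds, text.substr(pos) begins with whatever follows the tag, which is exactly where the next match should start.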

5. Extract the matching string, i.e. the IP address

hp.MatchString ("%*[^0-9]%[0-9.]", match);

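This is where the match is extracted. The first argument is a sscanf-style format string, and match receives the result. Here is a brief explanation of the format; search for the details yourself if you want to dig deeper.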

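%*[^0-9]: %* means read but discard (nothing is stored), [] lists the characters to accept, ^ negates the set, and 0-9 is the ten digits. Put together: skip every character that is not 0-9.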

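%[0-9.]: % means read and store, and 0-9. says the accepted characters are the ten digits and the dot; reading continues until a character that is neither.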
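To try the two scansets outside the library, here is a small sketch. It uses standard sscanf with an explicit field width instead of sscanf_s so that it compiles anywhere; first_dotted_number is a hypothetical helper, and the address in the usage note is a made-up documentation address.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: extracts the first run of digits and dots from `text`.
inline bool first_dotted_number(const char* text, std::string& out) {
    assert(text != nullptr);
    char buf[16] = {0};                 // 15 characters plus the terminator
    // %*[^0-9]  : read and discard everything up to the first digit
    // %15[0-9.] : then store at most 15 digits or dots
    if (std::sscanf(text, "%*[^0-9]%15[0-9.]", buf) != 1) return false;
    out = buf;
    return true;
}
```

Called on "Your IP is: [203.0.113.7] From: XX", it stores "203.0.113.7". One thing to keep in mind: a scanset must match at least one character, so this format fails on input that starts directly with a digit.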

At this point the matching IP has been extracted; what remains is some simple post-processing.

6. Post-process and return

ip.clear ();
ip = match.c_str ();
return true;

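Because the match is written straight into the string's buffer through const_cast<char*>(s.c_str()) (you can find this in hHtmlParse::MatchString), the std::string never learns how long the matched text is. Before the code in step 6 runs, match.length() still returns the size set by resize (16 here), not the length of the IP address, even though the buffer does hold the data. So if you really must get the length at that point, measure the C string itself, for example with lstrlenA or strlen on match.c_str(). Plain lstrlen maps to the wide-character version in a Unicode build and will not work on a char buffer.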

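These lines should be self-explanatory; just a word on the second one. It calls std::string's operator=(const char*) overload, which copies up to the terminator and therefore sets the length correctly. After it runs, ip.length() returns the real length. The ip.clear() before it is harmless but not required, since the assignment replaces the contents anyway.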
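The length behaviour is easy to reproduce without the library. In this sketch write_into is a hypothetical stand-in for MatchString: it fills a pre-sized std::string through its buffer, so the recorded length stays at the resized value until the text is reassigned from c_str().

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for MatchString: writes a C string into the caller's
// pre-sized buffer without changing the std::string's recorded length.
inline void write_into(std::string& buffer, const char* value) {
    assert(!buffer.empty());
    std::snprintf(&buffer[0], buffer.size(), "%s", value);
}

// Usage sketch: the length is only correct after reassigning from c_str().
inline void length_example() {
    std::string match;
    match.resize(16);
    write_into(match, "10.0.0.1");
    assert(match.length() == 16);      // still the resized size
    std::string ip = match.c_str();    // copies up to the terminator
    assert(ip.length() == 8);          // the real length of "10.0.0.1"
}
```

Writing through &buffer[0] is used here in place of the const_cast on c_str(); both reach the same storage, but the former is the form the standard permits for modification.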

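That is the whole procedure. To parse a different page, just change the corresponding spots. Also, SetPos and MatchString are not limited to a single call: as long as the markers are easy to find in the page and the internal position has not reached the end, you can keep calling them.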
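Repeated calls are what make list-style pages workable. The following portable sketch walks a run of <li> items the way GetAddrMessage does, using std::string::find directly. list_values is a hypothetical helper, and it looks for an ASCII ':' where the ip138 page uses a full-width colon.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the text after each "label:" inside <li> items.
inline std::vector<std::string> list_values(const std::string& html) {
    std::vector<std::string> values;
    std::size_t pos = 0;                           // plays the role of the cursor
    for (;;) {
        std::size_t li = html.find("<li", pos);
        if (li == std::string::npos) break;        // no more items
        std::size_t colon = html.find(':', li);
        if (colon == std::string::npos) break;     // item without a label
        std::size_t end = html.find('<', colon);   // value runs to the next tag
        if (end == std::string::npos) end = html.size();
        assert(end > colon);
        values.push_back(html.substr(colon + 1, end - colon - 1));
        pos = end;                                 // continue after this value
    }
    return values;
}
```

For "<ul><li>City:Lima</li><li>ISP:ExampleNet</li></ul>" it returns the two values "Lima" and "ExampleNet". Because the position strictly advances on every pass, the loop always terminates.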
