HTML to Markdown (repost)

Reposted from: http://blog.topspeedsnail.com/archives/6787
html2text is a Python module that converts HTML into Markdown-formatted plain text.
# Installing html2text
$ pip install html2text
# Using html2text

import html2text

html = '''
<html>
<body>
<h1>Title</h1>
<p>Hello World</p>
<ul>
<li>Here's one thing</li>
<li>And here's another!</li>
</ul>
</body>
</html>
'''

markdown = html2text.html2text(html)

print(markdown)

Output:

# Title

Hello World

  * Here's one thing
  * And here's another!
 
It also provides a command-line interface:

$ html2text -h
Usage: html2text [(filename|url) [encoding]]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --pad-tables          pad the cells to equal column width in tables
  --no-wrap-links       wrap links during conversion
  --ignore-emphasis     don't include any formatting for emphasis
  --reference-links     use reference style links instead of inline links
  --ignore-links        don't include any formatting for links
  --protect-links       protect links from line breaks surrounding them with
                        angle brackets
  --ignore-images       don't include any formatting for images
  --images-to-alt       Discard image data, only keep alt text
  --images-with-size    Write image tags with height and width attrs as raw
                        html to retain dimensions
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -e, --asterisk-emphasis
                        use an asterisk rather than an underscore for
                        emphasized text
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevant when -g is
                        specified as well
  --escape-all          Escape all special characters. Output is less
                        readable, but avoids corner case formatting issues.
  --bypass-tables       Format tables in HTML rather than Markdown syntax.
  --single-line-break   Use a single line break after a block element rather
                        than two line breaks. NOTE: Requires --body-width=0
  --unicode-snob        Use unicode throughout document
  --no-automatic-links  Do not use automatic links wherever applicable
  --no-skip-internal-links
                        Do not skip internal links
  --links-after-para    Put links after each paragraph instead of document
  --mark-code           Mark program code blocks with [code]...[/code]
  --decode-errors=DECODE_ERRORS
                        What to do in case of decode errors.'ignore', 'strict'
                        and 'replace' are acceptable values

Other tools:
There is also an online tool built on Node.js:
