In Python, there are several ways to parse an HTML document. Here are some commonly used approaches:
1. Using the BeautifulSoup library
To use BeautifulSoup, first install it:
pip install beautifulsoup4
Next, we can parse an HTML document with the following code:
from bs4 import BeautifulSoup

html_doc = """<html><head><title>Page Title</title></head>
<body>
<p class="title"><b>Article Title</b></p>
<p class="content">This is a simple HTML document example.</p>
<a href="http://example.com/link1" class="link">Link 1</a>
<a href="http://example.com/link2" class="link">Link 2</a>
</body></html>"""

# Create a BeautifulSoup object, passing the HTML document and the parser name
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the page title
title = soup.title.string
print("Page title:", title)

# Get the article title
article_title = soup.find('p', class_='title').b.string
print("Article title:", article_title)

# Get all links
links = soup.find_all('a', class_='link')
for link in links:
    print("Link:", link['href'], "Text:", link.string)

2. Using the lxml library
lxml is a high-performance Python library for processing XML and HTML documents. Its core is implemented in C, so it is very fast. To use lxml, first install it:
pip install lxml
Next, we can parse an HTML document with the following code:
from lxml import etree

html_doc = """<html><head><title>Page Title</title></head>
<body>
<p class="title"><b>Article Title</b></p>
<p class="content">This is a simple HTML document example.</p>
<a href="http://example.com/link1" class="link">Link 1</a>
<a href="http://example.com/link2" class="link">Link 2</a>
</body></html>"""

# Parse the HTML document; fromstring returns the root <html> element
root = etree.fromstring(html_doc, parser=etree.HTMLParser())

# Get the page title (<title> is nested under <head>, so search all descendants)
title = root.find('.//title').text
print("Page title:", title)

# Get the article title
article_title = root.find('.//p[@class="title"]/b').text
print("Article title:", article_title)

# Get all links
links = root.xpath('//a[@class="link"]')
for link in links:
    print("Link:", link.get('href'), "Text:", link.text)

3. Using regular expressions (not recommended)
Python's built-in re module can pull simple, predictable fragments out of an HTML string without installing anything. However, regular expressions cannot reliably handle nested tags, varying attribute order, or malformed markup, which is why this approach is not recommended for anything beyond quick one-off extraction. For real documents, prefer BeautifulSoup or lxml.
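As a rough illustration of the regular-expression approach, the sketch below extracts the page title and the links from the same kind of sample document used in the earlier sections. The helper names extract_title and extract_links and both patterns are hypothetical, written for this example only. They assume double-quoted href attributes and no nested <a> tags, and they will break on markup that deviates from that shape.

```python
import re

# Sample document in the same shape as the earlier examples (assumed input).
html_doc = """<html><head><title>Page Title</title></head>
<body>
<p class="title"><b>Article Title</b></p>
<a href="http://example.com/link1" class="link">Link 1</a>
<a href="http://example.com/link2" class="link">Link 2</a>
</body></html>"""

# Hypothetical patterns: they only cover simple, well-formed markup.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(
    r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return a list of (href, text) tuples for each simple <a> element."""
    return [(href, text.strip()) for href, text in LINK_RE.findall(html)]


title = extract_title(html_doc)
links = extract_links(html_doc)

print("Page title:", title)
for href, text in links:
    print("Link:", href, "Text:", text)
```

Even on this small input the patterns encode several assumptions about attribute quoting and tag layout, which is the main reason a real parser is the safer choice.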