Beautiful Soup 01: Node Selectors

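Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so in most cases you do not need to think about encodings.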
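A minimal sketch of that round trip, under a few assumptions: the sample bytes and their meta charset declaration are made up for illustration, and the built-in html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical input: raw UTF-8 bytes, as they might arrive from a file or socket.
raw = '<html><head><meta charset="utf-8"><title>Café</title></head></html>'.encode("utf-8")

soup = BeautifulSoup(raw, "html.parser")

# Inside the parsed tree, text is ordinary Unicode (a str subclass).
title_text = soup.title.string
print(type(title_text), title_text)

# The encoding Beautiful Soup settled on while decoding the input.
print(soup.original_encoding)

# encode() serialises the tree back to bytes, UTF-8 by default.
out = soup.encode()
print(type(out))
```

If the input is already a str, no decoding takes place and original_encoding is None.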

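Installing Beautiful Soup: the package is published on PyPI as beautifulsoup4, and the lxml parser used in the examples below is a separate package. Both can be installed with pip install beautifulsoup4 lxml.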

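Accessing a node's name directly as an attribute selects that node element, and reading its string attribute then returns the text inside the node. This kind of selection is very fast. When the structure and nesting of the nodes you want are clear, it is a good way to parse the document.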

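# Sample document adapted from the "three sisters" example in the Beautiful Soup documentation.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# First print the result of selecting the title node: the output is the title node together with the text inside it.
print(soup.title)
# Next print its type, bs4.element.Tag, which is an important data structure in Beautiful Soup.
print(type(soup.title))
# Every result returned by this kind of selection is a Tag. A Tag has attributes such as string, which returns the node's text content.
print(soup.title.string)
# Selecting the head node also returns the node plus everything inside it.
print(soup.head)
# Selecting the p node returns only the first p node; the later p nodes are not selected.
# When several nodes match, this style of selection picks the first match and ignores the rest.
print(soup.p)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>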
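The first-match behaviour can be checked against find() and find_all(), both of which are part of the Tag API. The sketch below is an illustration rather than part of the original walkthrough: it uses a shortened copy of the sample markup and the built-in html.parser (an assumption, so that lxml is not required). It also shows a limit of string, which only returns text when a tag has a single child.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were <a id="link2">Lacie</a> and
<a id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the same node as find(): the first match only.
first_p = soup.p
same_node = first_p is soup.find("p")
print(same_node)

# find_all() returns every match, in document order.
all_p = soup.find_all("p")
print(len(all_p))

# The second paragraph mixes text and <a> tags, so it has several children
# and .string cannot pick a single one; get_text() joins all the text instead.
story = all_p[1]
print(story.string)
print(story.get_text())
```

Attribute-style selection suits quick lookups of a unique node such as title; for repeated nodes, find_all() is the usual choice.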

Let's look at the attributes and methods available on soup:

print(dir(soup))

Output:

['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'NO_PARSER_SPECIFIED_WARNING', 'ROOT_TAG_NAME', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_check_markup_is_url', '_feed', '_find_all', '_find_one', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_linkage_fixer', 'attrs', 'builder', 'can_be_empty_element', 'cdata_list_attributes', 'childGenerator', 'children', 'clear', 'contains_replacement_characters', 'contents', 'currentTag', 'current_data', 'declared_html_encoding', 'decode', 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 'endData', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name', 'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'object_was_parsed', 'original_encoding', 'parent', 'parentGenerator', 'parents', 'parse_only', 'parserClass', 'parser_class', 'popTag', 'prefix', 'preserve_whitespace_tag_stack', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'pushTag', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 'smooth', 'string', 'strings', 'stripped_strings', 'tagStack', 'text', 'unwrap', 'wrap']
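The full listing is long. One way to narrow it to the names meant for everyday use is to drop everything that starts with an underscore. This is a small sketch, assuming a throwaway one-tag document and the built-in html.parser; the exact names returned depend on the installed bs4 version.

```python
from bs4 import BeautifulSoup

# Any document will do here; only the object's attribute names matter.
soup = BeautifulSoup("<p>hello</p>", "html.parser")

# Keep only names that do not start with an underscore.
public_names = [name for name in dir(soup) if not name.startswith("_")]

print(len(public_names))
print(public_names[:10])
```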

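Names wrapped in double underscores ('__') are generally not used directly. Among the remaining names is 'attrs', which is used to read attribute values. A node may carry several attributes, such as id and class. After selecting a node element, call attrs to get all of its attributes: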

print(soup.p.attrs)
print(soup.p.attrs['name'])

Output:

dromouse
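To make the shape of attrs concrete, here is a hedged sketch on a one-paragraph excerpt. The markup is an assumption chosen to carry the same name value, 'dromouse', and html.parser stands in for lxml so the snippet needs only bs4.

```python
from bs4 import BeautifulSoup

# Illustrative excerpt: one paragraph with a class and a name attribute.
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# attrs is a dict mapping attribute names to values.
print(p.attrs)

# class may hold several values, so Beautiful Soup returns it as a list.
print(p.attrs["class"])

# Indexing the tag is shorthand for indexing attrs.
print(p["name"])

# get() returns None for a missing attribute instead of raising KeyError.
print(p.get("id"))

# Note that p.name is the tag's own name, not the HTML name attribute.
print(p.name)
```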

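Selections can also be nested. Every result is itself a Tag, so you can keep selecting by name on it:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Select head first, then the title node inside it.
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)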

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
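Chained selection relies on every step finding a node. Attribute-style access to a tag name that is not present returns None, so a longer chain then fails with AttributeError at the next step. Below is a minimal sketch of a guard, where select_path is a hypothetical helper (not part of bs4) and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")


def select_path(node, *names):
    """Hypothetical helper: follow a chain of tag names, like soup.a.b.c,
    but return None as soon as any step finds nothing."""
    for name in names:
        if node is None:
            return None
        node = node.find(name)
    return node


# Same result as soup.head.title when every step exists.
title = select_path(soup, "head", "title")
print(title.string)

# head has no h1, so the chain stops and returns None instead of raising.
missing = select_path(soup, "head", "h1", "b")
print(missing)
```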
