7 Web Crawler Basics: Practice

0. Create a new HTML file to practice on, and open it in a browser.
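A minimal sketch of step 0, using only the standard library. The file name `practice.html`, the helper `write_practice_page`, and the sample markup are all assumptions made for illustration; the id and link structure simply mirror the selectors used later in this post.

```python
from pathlib import Path

# Hypothetical practice page; this markup is an assumption, not the author's file.
PRACTICE_HTML = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Practice page</title></head>
<body>
<h1>Hello crawler</h1>
<p id="p1node">First paragraph</p>
<ul>
  <li><a href="https://example.com/a">Item A</a></li>
  <li><a href="https://example.com/b">Item B</a></li>
</ul>
</body>
</html>
"""


def write_practice_page(path="practice.html"):
    """Write the sample markup to disk and return its absolute path."""
    target = Path(path).resolve()
    target.write_text(PRACTICE_HTML, encoding="utf-8")
    return target
```

To view the page, call `webbrowser.open(write_practice_page().as_uri())` from the standard `webbrowser` module, or open the file by hand.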

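1. Use requests.get(url) to fetch the HTML of a web page

import requests

newsurl = ''  # set this to the URL of the page to fetch

res = requests.get(newsurl)  # returns a Response object

res.encoding = 'utf-8'  # decode the response body as UTF-8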
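Because the URL is left blank in this post, the sketch below starts a throwaway local server so that the request has something to hit. The server, the `Handler` class, and the page content are assumptions for illustration only. The server deliberately omits the charset from its Content-Type header, which is the situation where setting `res.encoding` by hand matters.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAGE = "<html><body><h1>café news</h1></body></html>"  # made-up sample page


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")  # no charset, on purpose
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

newsurl = f"http://127.0.0.1:{server.server_port}/"
res = requests.get(newsurl, timeout=5)  # fail fast instead of hanging
res.raise_for_status()                  # raise on 4xx/5xx responses
res.encoding = "utf-8"                  # decode the body as UTF-8
html = res.text

server.shutdown()
server.server_close()
print(html)
```

The `timeout` and `raise_for_status()` calls are not in the original snippet; they are a common precaution so that a slow or failing server does not silently feed an error page to the parser.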

2. Use BeautifulSoup's HTML parser to build the parse tree

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')
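To try the parsing step without a network request, the same call can be run on an inline string. This is a small sketch; the HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for res.text
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)  # text inside <title>
print(soup.prettify())  # indented view of the parsed tree
```

Note that the class name is case-sensitive: `from bs4 import beautifulsoup` raises an ImportError.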

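3. Find HTML elements by tag

soup.p  # tag name; returns the first matching tag

soup.head

soup.p.name  # string: the tag's name

soup.p.attrs  # dict: all attributes of the tag

soup.p.contents  # list: all direct children (tags and text nodes)

soup.p.text  # string: all text inside the tag

soup.p.string  # the tag's single string child, or None if it has several children

soup.select('li')  # list of all li tags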
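The sketch below runs each of these accessors on a small made-up snippet, so the difference between `.text` and `.string` is visible. The markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div><p id="intro" class="lead big">Hi <b>there</b></p>'
    "<p>Second</p><ul><li>one</li><li>two</li></ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
p = soup.p  # first <p> in the document

name = p.name          # 'p'
attrs = p.attrs        # {'id': 'intro', 'class': ['lead', 'big']}
children = p.contents  # two children: the text 'Hi ' and the <b> tag
text = p.text          # 'Hi there' (all nested text joined)
string = p.string      # None, because <p> has more than one child
items = [li.text for li in soup.select("li")]  # ['one', 'two']
```

`class` comes back as a list because HTML allows several class names on one element.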

4. Select elements by CSS id or class

soup.select('#p1node')

soup.select('.news-list-title')
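A short sketch of both selectors on made-up markup. `select` always returns a list, while `select_one` returns the first match or None; the HTML here is an assumption, and only the selector strings come from this post.

```python
from bs4 import BeautifulSoup

html = """
<p id="p1node">Paragraph with an id</p>
<div class="news-list-title">First title</div>
<div class="news-list-title">Second title</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#p1node")               # '#' matches the id attribute
by_class = soup.select(".news-list-title")   # '.' matches a class name
first = soup.select_one(".news-list-title")  # first match, or None

titles = [tag.text for tag in by_class]
```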

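5. Exercises:

Get the text of the h1 tag

Get the link of the a tag

Get the full content of every li tag

Get the attributes of the 3rd div tag inside the a tag of the 2nd li tag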

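# Use requests.get(url) to fetch the HTML of the web page
import requests

newsurl = ''  # set this to the URL of the page to fetch
res = requests.get(newsurl)
res.encoding = 'utf-8'

# Use BeautifulSoup's HTML parser to build the parse tree
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')

# Get the text of the h1 tag
print(soup.h1.text)

# Get the link of the a tag
print(soup.a.attrs['href'])

# Get the full content of every li tag
for i in soup.select('li'):
    print(i.text)

# Get the attributes of the 3rd div tag inside the a tag of the 2nd li tag
print(soup.select('li')[1].a.select('div')[2].attrs)

# Get one news item's title, link, publication time, and **
print(soup.select('.news-list-title')[0].text)
print(soup.select('li')[1].a.attrs['href'])
print(soup.select('.news-list-info')[0].contents[0].text)
print(soup.select('.news-list-info')[0].contents[1].text)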
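Chained indexing like the lines above raises an exception as soon as one element is missing. A possible alternative is a small helper that extracts each news item defensively. This is only a sketch: the page's real markup is not shown in this post, so the HTML, the sample values, and the `extract_news` helper are all hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed structure: li > a > div.news-list-title + div.news-list-info > span
html = """
<ul>
<li><a href="https://example.com/news/1">
  <div class="news-list-title">Title one</div>
  <div class="news-list-info"><span>2018-03-29</span><span>Campus Office</span></div>
</a></li>
</ul>
"""


def extract_news(li):
    """Return the title, link, and info fields of one <li> news entry."""
    title = li.select_one(".news-list-title")
    info = li.select_one(".news-list-info")
    return {
        "title": title.get_text(strip=True) if title else None,
        "link": li.a.get("href") if li.a else None,
        "info": [s.get_text(strip=True) for s in info.find_all("span")] if info else [],
    }


soup = BeautifulSoup(html, "html.parser")
news = [extract_news(li) for li in soup.select("li")]
print(news)
```

Searching inside each `li` keeps the title, link, and info fields of one item together, instead of indexing three separate document-wide lists.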
