Crawling Web Page Content: Example 2

This example crawls every Python 3 tutorial on the ** site and saves the results to the file content.txt.

import requests  # HTTP request library
from bs4 import BeautifulSoup  # HTML parsing library

# The site address is masked in the original; set it before running.
SITE = ''


def start_requests(url):
    # The header values are omitted in the original; {} is only a placeholder.
    headers = {}
    response = requests.get(url, headers=headers)
    return response.content.decode()


if __name__ == '__main__':
    # Open the output file in append mode
    with open('content.txt', 'a+', encoding='utf-8') as f:
        # f.truncate(0)  # clear the existing contents of the file
        url = ''  # start page URL, masked in the original
        html = start_requests(url)
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a', target='_top')  # get all links
        for link in links:
            # Build the absolute URL of each tutorial page
            if link['href'][0] == '/':
                url = SITE + link['href']
            else:
                url = SITE + '/python3/' + link['href']
            page = BeautifulSoup(start_requests(url), 'html.parser')
            title = page.find('title')  # get the page title
            # print(url, title.string.strip())
            texts = page.find_all('p')  # get all paragraphs
            for text in texts:
                f.write(text.get_text())
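The loop above builds each page address by prefixing the href by hand, with one branch for root-relative links and one for page-relative links. As a possible alternative, the standard library's urljoin handles both cases in one call. The sketch below is only an illustration: PAGE_URL and the sample hrefs are hypothetical, since the original masks the real site address.

```python
from urllib.parse import urljoin

# Hypothetical page URL used only for illustration; the original masks the real one.
PAGE_URL = "https://example.com/python3/index.html"


def resolve_link(page_url, href):
    """Return the absolute URL for an href found on page_url."""
    return urljoin(page_url, href)


# A root-relative href replaces the whole path of the page URL.
absolute = resolve_link(PAGE_URL, "/python3/intro.html")
# A page-relative href replaces only the last path segment.
relative = resolve_link(PAGE_URL, "basics.html")
print(absolute)
print(relative)
```

Resolving against the URL of the page that contained the link avoids hard-coding the '/python3/' directory, at the cost of needing to keep that page URL around.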

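Crawling Web Page Content with Python

The sequence diagram is shown in the figure. Given a URL to visit, fetch the HTML and its content, then iterate over one class of links in that HTML, such as the href attribute of a tags. Follow those links to the corresponding HTML pages and extract the content of a fixed set of tags from each one. If the content of several tags is needed, it can be combined by string concatenation. Finally, remove all tags with a regular expression and write the remaining content to a .txt file...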
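The middle steps of that workflow (collecting hrefs, joining the content of several tags, and stripping tags with a regular expression) can be sketched on an in-memory string. This is a minimal illustration using only the standard library, not the referenced article's code: SAMPLE, LinkCollector, extract_links and strip_tags are hypothetical names, and the regular expressions are adequate only for simple, well-formed markup.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all hrefs found in <a> tags of the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", fragment)


# Hypothetical sample page standing in for a fetched document.
SAMPLE = '<a href="/a.html">A</a><p>Hello <b>world</b></p><p>Bye</p>'

links = extract_links(SAMPLE)
# Grab each <p>...</p> block, then join the stripped text with newlines.
paragraphs = re.findall(r"<p>.*?</p>", SAMPLE, flags=re.S)
text = "\n".join(strip_tags(p) for p in paragraphs)
```

In a real crawl, each entry in links would be fetched in turn and text would be written to the .txt file as the last step.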

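Crawling Web Page Content with Python lxml

from lxml import etree; import requests; url = ...; response = requests.get(url); text = response.text; html = etree.HTML(text). First fetch the HTML of the page. Note that XPath is also used here to select nodes; for details on its usage, see...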

Crawling Static Web Page Content (Python)

Using vulnerability scanning as an example: from bs4 import BeautifulSoup; from urllib.request import urlopen; import pymysql as mysqldb; import re; import os. Insert data: def insertdata lis cursor...
