Scraping article titles and links from the Jianshu homepage with Python 3

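from urllib import request
from bs4 import BeautifulSoup  # Beautiful Soup is a Python library that extracts structured data from HTML or XML files

# Build the request headers to mimic a browser visit.
# The original header values were not preserved; this User-Agent is a placeholder assumption.
headers = {'User-Agent': 'Mozilla/5.0'}
url = ""  # target URL (left empty in the source)
page = request.Request(url, headers=headers)
# Open the URL, get the HTTPResponse object, then read and decode its response body
page_info = request.urlopen(page).read().decode('utf-8')
# Parse the fetched content with BeautifulSoup, using html.parser as the parser
soup = BeautifulSoup(page_info, 'html.parser')
# Print the HTML in formatted form
# print(soup.prettify())
titles = soup.find_all('a', 'title')  # find every <a> tag with class='title'
'''
for title in titles:
    print(title.string)
    print("" + title.get('href'))
'''
# open() is the function for reading and writing files; the with statement closes the opened file automatically
# Open (or create) articles.txt on disk in write-only mode; UTF-8 is set explicitly so Chinese titles are written safely
with open(r"d:\articles.txt", "w", encoding="utf-8") as file:
    for title in titles:
        file.write(title.string + '\n')
        file.write("" + title.get('href') + '\n\n')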
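Two details of the script above are easy to trip over: `title.string` is `None` when an `<a>` tag contains nested markup, so concatenating it with `'\n'` raises a `TypeError`, and `href` values on a page are often relative paths. The following is a minimal sketch of one way to handle both, not the original author's method. `BASE_URL`, `build_request`, and `format_entry` are hypothetical names introduced for illustration, and the `example.com` address and the `User-Agent` value are placeholders, since the source leaves its URL and headers unspecified.

```python
from typing import Optional
from urllib import request
from urllib.parse import urljoin

# Hypothetical base URL for illustration only; the source leaves its URL empty.
BASE_URL = "https://example.com/"


def build_request(url: str) -> request.Request:
    """Create a Request that carries a browser-like User-Agent (placeholder value)."""
    return request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def format_entry(text: Optional[str], href: Optional[str],
                 base_url: str = BASE_URL) -> str:
    """Build one title/link record, tolerating a missing title or href."""
    title = (text or "").strip()          # None becomes an empty title
    link = urljoin(base_url, href or "")  # relative paths become absolute
    return f"{title}\n{link}\n\n"


if __name__ == "__main__":
    req = build_request(BASE_URL)
    # urllib stores header names capitalized, hence "User-agent"
    print(req.get_header("User-agent"))
    print(format_entry("  Sample title ", "/p/abc123"), end="")
```

In the write loop, `file.write(format_entry(title.string, title.get('href')))` could then replace the two separate `file.write` calls, assuming the base URL is set to the site actually being scraped.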

The result of running the script is as follows:
