Python3 urllib: a module for fetching network data

This article was written by luzhuo; please keep this notice when reposting.

Original article:

The code below uses Python 3.6.1 as its example.

less is more!

# coding=utf-8
# urllibdemo.py: urllib demo
from urllib import request  # request URLs; supports http (0.9/1.0) / ftp / local files / url
from urllib import parse  # parse URLs; supports file / ftp / gopher / hdl / http / https / imap / mailto / mms / news / nntp / prospero / rsync / rtsp / rtspu / sftp / shttp / sip / sips / snews / svn / svn+ssh / telnet / wais
from urllib import robotparser  # analyze robots.txt files
from urllib import error  # exceptions
import re  # regular expression module
from bs4 import BeautifulSoup
import os


def demo():
    os.makedirs("images", exist_ok=True)  # does not fail if the directory already exists

    # -- fetch the page source --
    f = request.urlopen("")  # supply the page URL here
    data = f.read().decode("utf-8")

    # -- extract the addresses from the page source --
    # Option 1: regular expressions
    # Option 2: Beautiful Soup (install: pip install beautifulsoup4), extracts the content of HTML/XML tags
    soup = BeautifulSoup(data, "html.parser")
    images = soup.find_all("img")  # collect the <img> tags

    # -- close --
    f.close()
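

# Parameter reference
def fun():
    neturl = ""  # supply the URL to request
    imgurl = ""

    # --- urllib.parse --- URL parsing
    # - encode -
    params = {"key": "value"}  # placeholder parameters
    neturl = "%s?%s" % (neturl, parse.urlencode(params))  # build a URL with GET parameters
    data = parse.urlencode(params).encode("ascii")  # build POST data (must be bytes)
    # - decode -
    urls = parse.urlparse(neturl)
    scheme = urls.scheme  # read the individual components
    # - replace -
    url = parse.urljoin("", "fame.html")  # replace the trailing part of the base URL

    # --- urllib.request --- requesting data
    try:
        # - Request - construction
        req = request.Request(neturl)  # GET
        req = request.Request(neturl, headers={"User-Agent": "Mozilla/5.0"})  # add request headers (placeholder value)
        req = request.Request(neturl, data=b"this is some datas.")  # POST: attach request data
        req = request.Request(neturl, data)  # POST: attach request data
        req = request.Request(neturl, data=b"this is some datas.", method="PUT")  # PUT and other request types

        # Read
        url = req.full_url  # the URL
        reqtype = req.type  # scheme (e.g. http)
        host = req.host  # host name (e.g. luzhuo.me / luzhuo.me:8080)
        host = req.origin_req_host  # host name (e.g. luzhuo.me)
        url = req.selector  # URL path (e.g. /blog/base1.html)
        data = req.data  # request body, None if there is none
        boolean = req.unverifiable  # whether the request is unverifiable as defined by RFC 2965
        method = req.get_method()  # request method (e.g. GET / POST)

        # Modify (header names are stored capitalized, so look them up as "Key")
        req.add_unredirected_header("key", "value")  # add a header that is not carried over to redirected requests
        req.remove_header("Key")  # remove a header
        req.get_header("Key")  # get a header, None if absent
        req.get_header("Key", "none.")  # get a header, with a default value
        boolean = req.has_header("Key")  # whether the header is present
        headers = req.header_items()  # list of all request headers
        req.set_proxy("220.194.55.160:3128", "http")  # set a proxy (host, type)

        # - response - request result
        res = request.urlopen(neturl)  # GET: open the URL, returns a response
        res = request.urlopen(neturl, data=b"this is some datas.")  # POST: attach request data
        res = request.urlopen(req)  # a Request object is accepted as well

        # Read information
        data = res.read().decode("utf-8")  # read all data
        data = res.readline().decode("utf-8")  # read one line
        url = res.geturl()  # the URL
        info = res.info()  # meta information, such as headers
        code = res.getcode()  # status code

        # Release resources
        res.close()
    except error.HTTPError as e:
        # HTTP error, carries code / reason / headers
        print(e)
    except error.ContentTooShortError as e:
        print(e)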
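

def robot():
    # --- urllib.robotparser --- robots.txt
    rp = robotparser.RobotFileParser()
    rp.set_url("")  # set the URL that points to the robots.txt file
    rp.read()  # fetch the file and feed it to the parser
    boolean = rp.can_fetch("*", "")  # whether fetching this URL is allowed
    time = rp.mtime()  # time the robots.txt file was last fetched
    rp.modified()  # set the robots.txt fetch time to the current time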


def callback(datanum, datasize, filesize):  # (number of blocks, block size, file size)
    down = 100 * datanum * datasize / filesize
    if down > 100:
        down = 100
    print("%.2f%%" % down)


if __name__ == "__main__":
    demo()
    fun()
    robot()
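
The `callback(datanum, datasize, filesize)` function in the listing has the argument order that `request.urlretrieve` passes to its `reporthook`, although the listing never hands it to anything. Below is a minimal sketch of how such a hook might be wired up. The names `percent_done` and `download_with_progress`, and the temporary files in the demo, are assumptions made for illustration and are not part of the original script. The demo copies a local file through a `file://` URL so that it runs without network access.

```python
import os
import tempfile
from pathlib import Path
from urllib import request


def percent_done(datanum, datasize, filesize):
    """Return the download progress as a percentage, capped at 100."""
    if filesize <= 0:  # urlretrieve reports -1 when the total size is unknown
        return 0.0
    return min(100.0, 100.0 * datanum * datasize / filesize)


def download_with_progress(url, filename):
    """Save url to filename, printing the progress after every block."""
    def hook(datanum, datasize, filesize):
        print("%.2f%%" % percent_done(datanum, datasize, filesize))

    return request.urlretrieve(url, filename, reporthook=hook)


if __name__ == "__main__":
    # Hypothetical demo: a 20000-byte local file stands in for a remote image.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.bin"
        src.write_bytes(b"x" * 20000)
        dst = os.path.join(tmp, "copy.bin")
        download_with_progress(src.as_uri(), dst)
        print(os.path.getsize(dst))
```

The guard for a non-positive `filesize` is an addition: without it, a response that carries no `Content-Length` would make the percentage negative.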

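The `urllib.parse` calls in `fun()` are easier to follow with concrete values. The sketch below runs them against a made-up URL. `http://example.com/blog/base1.html` and the query parameters are assumptions chosen for illustration, not the addresses used in the original article.

```python
from urllib import parse

base = "http://example.com/blog/base1.html"  # hypothetical URL, for illustration only

# Encode: turn a dict into a query string, then build GET and POST forms of it
query = parse.urlencode({"name": "luzhuo", "page": 2})
get_url = "%s?%s" % (base, query)
post_data = query.encode("ascii")  # urlopen expects bytes for POST data

# Decode: split the URL back into its components
parts = parse.urlparse(get_url)

# Replace: swap the last path segment of the base URL
joined = parse.urljoin(base, "fame.html")

print(get_url)
print(parts.scheme, parts.netloc, parts.path)
print(parse.parse_qs(parts.query))
print(joined)
```

Nothing here touches the network, so it is a convenient way to check a URL before passing it to `request.urlopen`.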