Garbled Chinese Text in Python Web Scrapers

When printing content fetched by a scraper, we often run into garbled Chinese text (the site below serves as an example).

When the content is printed, the output looks like the figure below:

Check the encoding declared in the head section of the page source: the page is encoded as gbk.
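Reading the declared encoding by eye can also be done in code. The following is a minimal sketch, assuming the page declares its charset in a `<meta>` tag near the top of the document; the helper name `sniff_meta_charset` and the `scan_bytes` limit are hypothetical and not part of requests.

```python
import re

# Matches both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw_html, scan_bytes=2048):
    """Return the charset declared in a <meta> tag, or None if none is found."""
    # Only the start of the document is scanned; the declaration lives in <head>.
    match = _META_CHARSET.search(raw_html[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()


sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=gbk"></head></html>'
)
print(sniff_meta_charset(sample))
```

The function works on raw bytes (for example `response.content`), because the declaration has to be read before the bytes can be decoded correctly.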

Use the requests library to check which encoding it applies by default:

import requests
url = ''
response = requests.get(url)
print(response.encoding)

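The output is iso-8859-1, which is not the page's actual encoding. requests falls back to iso-8859-1 when the server's Content-Type header is a text type that carries no charset parameter.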
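The effect of the mismatch can be reproduced without any network access. The sketch below uses a made-up sample string and only the standard library; it shows that decoding gbk bytes as iso-8859-1 does not raise an error, it just yields the wrong characters.

```python
# Bytes as a gbk-encoded page would deliver them.
raw = '中文'.encode('gbk')

# iso-8859-1 maps every byte to some character, so the wrong decode
# never fails; it silently produces garbled text.
garbled = raw.decode('iso-8859-1')

# Decoding with the page's real encoding recovers the original text.
correct = raw.decode('gbk')

print(repr(garbled))
print(correct)
```

Because no exception is raised, the only symptom is the garbled output itself, which is why the encoding has to be checked explicitly.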

Use requests to change the encoding applied to the output:

import requests
url = ''
response = requests.get(url)
response.encoding = 'gbk'
print(response.encoding)

The output is now gbk, matching the original page.

With the three steps above, the garbled Chinese text problem is solved:

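import requests


def get_html(url):
    try:
        response = requests.get(url)
        response.encoding = 'gbk'  # override the encoding used to decode response.text
        print(response.encoding)
        html = response.text
        return html
    except requests.RequestException:
        # Covers connection errors, timeouts, and malformed URLs
        print('Request failed')


url = ''
html = get_html(url)
print(html)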
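Hard-coding 'gbk' works for this one site. When a scraper visits pages with mixed encodings, one possible alternative is to try a short list of candidate codecs on the raw bytes. This is a sketch under assumptions, not the original author's method: the helper name `decode_html` and the candidate order are hypothetical, and a few gbk byte sequences can happen to be valid utf-8, so the guess is not foolproof.

```python
def decode_html(raw, candidates=('utf-8', 'gb18030')):
    """Decode raw bytes with the first candidate codec that succeeds.

    Returns a (text, encoding) tuple. gb18030 is a superset of gbk and
    gb2312, so it also covers pages declared with either of those.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and mark bad bytes.
    return raw.decode(candidates[0], errors='replace'), candidates[0]


text, used = decode_html('中文'.encode('gbk'))
print(used, text)
```

With requests, the raw bytes would come from `response.content` rather than `response.text`.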

The result is shown in the figure below:

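For some pages encoded as utf-8, the Chinese text still comes out garbled on output, and in that case we need to re-encode twice. Note that response.text is decoded with whatever value response.encoding holds at the moment it is read, so of two consecutive assignments only the last one takes effect.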

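# url and headers are assumed to be defined earlier
response = requests.get(url, headers=headers)
response.encoding = 'gbk'
response.encoding = 'utf-8'  # the value in effect when response.text is read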
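A different way to read "re-encoding twice" is to repair text that has already been decoded with the wrong codec: encode it back to the original bytes, then decode those bytes correctly. This interpretation is an assumption, and the helper name `repair_mojibake` is hypothetical. The sketch simulates utf-8 bytes that were mistakenly decoded as iso-8859-1.

```python
def repair_mojibake(garbled, wrong='iso-8859-1', actual='utf-8'):
    """Reverse a decode done with the wrong codec, then decode correctly."""
    # Step 1: encoding with the wrong codec restores the original bytes.
    # Step 2: decoding those bytes with the actual codec yields the text.
    return garbled.encode(wrong).decode(actual)


# Simulate utf-8 bytes that were decoded as iso-8859-1.
garbled = '中文亂碼'.encode('utf-8').decode('iso-8859-1')
print(repair_mojibake(garbled))
```

The round trip is lossless here because iso-8859-1 maps each of the 256 byte values to exactly one character. With requests, setting `response.encoding` before reading `response.text` avoids the need for this repair in the first place.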

