A First Look at Python Web Crawlers

Differences between Python 2 and Python 3:

Python 2 reached end of life on January 1, 2020 and is no longer maintained.

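Syntax and default encoding (source files default to ASCII in Python 2 and to UTF-8 in Python 3);

print usage (a statement in Python 2, a function in Python 3);

Changes to functions such as xrange (removed in Python 3, where range is lazy).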
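The differences listed above can be seen in a few lines of Python 3. This is only a small illustrative sketch; the variable names and sample strings are made up for the example.

```python
# Python 3 behavior for the differences listed above.

# 1. print is a function, so its arguments go in parentheses.
print("hello", "crawler", sep=", ")  # prints: hello, crawler

# 2. range() is lazy, like Python 2's xrange(); xrange itself no longer exists.
numbers = range(5)
first_three = list(numbers)[:3]

# 3. Text (str) is Unicode; bytes must be encoded and decoded explicitly.
raw = "爬蟲".encode("utf-8")   # str -> bytes
text = raw.decode("utf-8")     # bytes -> str
```

The explicit bytes-to-str step matters for crawlers, because raw HTTP response bodies arrive as bytes.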

Creating examples:

In Python, web page content is mainly fetched with urllib or requests.

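Creating a urllib example:

from urllib.request import urlopen  # import the urlopen function

response = urlopen('')  # put the target URL between the quotes
html = response.read().decode('utf-8')  # read() returns the response body as bytes; decode() converts it to str
print(html)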

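Creating a requests example:

import requests

r = requests.get('')  # put the target URL between the quotes
print(r)       # prints the Response object, e.g. <Response [200]>, which shows the status code
print(r.text)  # the text attribute holds the response body decoded as a str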
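In practice a request can fail or hang, so a slightly more defensive variant of the requests example may be useful. The following is a minimal sketch and not part of the original example: the helper name `fetch_html` and the 10-second timeout are hypothetical choices.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, raising on HTTP errors.

    The function name and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return response.text
```

Calling `fetch_html` with a page URL returns the same string as `r.text` above, but raises an exception instead of silently returning an error page.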

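The three steps of a crawler

Step 1: Use requests to fetch the page data

1. Import requests

2. Use requests.get to fetch the page data

Step 2: Use Beautiful Soup 4 to parse the data

3. Import bs4

4. Parse the page data

5. Find the data you want

6. Print it with a for loop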
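The parsing step can be tried without any network access by feeding Beautiful Soup a literal HTML string. This is a sketch under stated assumptions: the sample markup is invented for the example, and the standard-library `html.parser` is used here in place of `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a downloaded page.
html = """
<div>
  <p class="comment-content">First comment</p>
  <p class="comment-content">Second comment</p>
  <p class="other">Not a comment</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A string passed as the second argument to find_all filters by CSS class.
items = soup.find_all("p", "comment-content")

# .string is the tag's text when the tag has a single text child.
comments = [item.string for item in items]
for comment in comments:
    print(comment)
```

Note that `.string` is `None` for a tag with several children; `item.get_text()` is the usual alternative in that case.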

Step 3: Use pandas to save the data

7. Import pandas

8. Create a new list

9. Write it out with to_csv
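The saving step can be sketched on its own with a small in-memory list. The sample comments and the column name `comment` are assumptions made for this example, and the CSV is written to an in-memory buffer only so that the result can be inspected without creating a file.

```python
import io

import pandas

# Sample data standing in for the scraped comments.
comments = ["First comment", "Second comment"]

# Build a one-column table; the column name is an illustrative choice.
df = pandas.DataFrame(comments, columns=["comment"])

# to_csv accepts a file path or any file-like object.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row-number column
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name such as `'comments.csv'` instead of the buffer writes the same content to disk.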

Code:

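import requests
from bs4 import BeautifulSoup
import pandas

# Step 1: fetch the page (put the target URL between the quotes)
r = requests.get('').text

# Step 2: parse the HTML and find every <p class="comment-content">
soup = BeautifulSoup(r, 'lxml')
pattern = soup.find_all('p', 'comment-content')
for item in pattern:
    print(item.string)

# Step 3: collect the comments in a list and write them to a CSV file
comments = []
for item in pattern:
    comments.append(item.string)
df = pandas.DataFrame(comments)
df.to_csv('comments.csv')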
