Getting Started with Web Crawlers 2

2021-10-05 09:47:28 · word count 3097 · views 3108

Part 1: Features and usage of bs4 — success

```python
from bs4 import BeautifulSoup  # the class is BeautifulSoup, not beautifulsoup
import requests

r = requests.get('')  # the demo URL was elided by the blog platform
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())  # print the parsed HTML page with hierarchical indentation

tag = soup.a
print(tag.attrs)           # all attributes of the first <a> tag, as a dict
print(tag.attrs['class'])
print(type(tag.attrs))
print(soup.a.prettify())

newsoup = BeautifulSoup('我明白了bs4的使用', 'html.parser')
print(newsoup.prettify())
print(soup.contents)
```
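Since the URL above was elided, here is a minimal offline sketch of the same calls, run against an inline HTML snippet (the markup and attribute values are made up for illustration) so it works without any network access:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the fetched page; no network needed.
demo = '<html><body><p><a class="link" href="https://example.com">demo</a></p></body></html>'
soup = BeautifulSoup(demo, 'html.parser')

tag = soup.a                 # first <a> tag in the document
print(tag.attrs)             # dict of the tag's attributes
print(tag.attrs['class'])    # note: class is multi-valued, so bs4 returns a list
print(type(tag.attrs))
print(soup.a.prettify())
```

One detail worth noticing: `tag.attrs['class']` comes back as a list (e.g. `['link']`), not a plain string, because `class` is a multi-valued HTML attribute.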

Part 2: Crawling ** with re — failed

```python
import requests
import re

def gethtmltext(url):
    try:
        kv = {}  # the header/cookie entries were elided by the blog platform
        r = requests.get(url, timeout=30, headers=kv)
        r.raise_for_status()
        return r.text
    except:
        return "爬取失敗"  # "crawl failed"

def parsepage(glist, html):
    try:
        price_list = re.findall(r'', html)  # price regex elided in the original
        name_list = re.findall(r'', html)   # title regex elided in the original
        for i in range(len(price_list)):
            price = eval(price_list[i].split(":")[1])
            name = eval(name_list[i].split(":")[1])
            glist.append([price, name])
    except:
        print("解析失敗")  # "parse failed"

def printgoodlist(glist):
    tplt = "\t\t"  # the {} format placeholders were elided in the original
    print(tplt.format("序號", "商品**", "商品名稱"))
    count = 0
    for g in glist:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

goods_name = "書包"  # "backpack"
start_url = "" + goods_name  # the search URL was elided in the original
info_list = []
page = 3
count = 0
for i in range(page):
    count += 1
    try:
        url = start_url + "&s=" + str(44 * i)  # each result page holds 44 items
        html = gethtmltext(url)
        parsepage(info_list, html)
        # progress output; the {} placeholder before % was elided in the original
        print("\r爬取頁面當前進度: %".format(count * 100 / page), end="")
    except:
        continue

printgoodlist(info_list)
```
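The two regexes in `parsepage` were elided by the platform, but the `split(":")` / `eval` pattern suggests they matched quoted key:value fields in the raw page source. A self-contained sketch of that technique on a made-up page fragment (the field names `view_price` and `raw_title` and the sample values are assumptions, chosen to mirror the split/eval logic above):

```python
import re

# Hypothetical page fragment mimicking quoted key:value fields in raw HTML/JS.
html = '"view_price":"48.00","raw_title":"帆布書包" ... "view_price":"65.50","raw_title":"雙肩書包"'

price_list = re.findall(r'"view_price":"[\d.]*"', html)
name_list = re.findall(r'"raw_title":".*?"', html)

goods = []
for i in range(len(price_list)):
    # split each match on ":" and eval the quoted value, as parsepage does
    price = eval(price_list[i].split(':')[1])
    name = eval(name_list[i].split(':')[1])
    goods.append([price, name])
print(goods)  # [['48.00', '帆布書包'], ['65.50', '雙肩書包']]
```

Note that `eval` here only strips the surrounding quotes; `price_list[i].split(':')[1]` is the string `'"48.00"'`, which evaluates to `'48.00'`.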

Reason for failure: the page's cookie was obtained incorrectly (though it is unclear where ** went wrong when I filled in my own cookie).
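Since the failure is blamed on the cookie, here is a sketch of how the elided `kv` headers dict is typically filled in for this kind of crawl. The user-agent string and the cookie value below are placeholders, not real credentials; a real run would paste the cookie string copied from a logged-in browser session:

```python
import requests

# Placeholder values; both entries are assumptions, not a working cookie.
kv = {
    'user-agent': 'Mozilla/5.0',
    'cookie': 'PASTE_YOUR_COOKIE_HERE',
}

def gethtmltext(url):
    try:
        r = requests.get(url, timeout=30, headers=kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # narrower than a bare except
        return "爬取失敗"
```

Catching `requests.RequestException` instead of using a bare `except:` also makes debugging easier, since programming errors inside the function are no longer silently swallowed.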

Part 3: Crawling the DXY (丁香園) forum with lxml — failed

```python
from lxml import etree
import requests

url = ''  # the forum URL was elided by the blog platform
req = requests.get(url)
html = req.text
tree = etree.HTML(html)  # note: the function is etree.HTML, not etree.html
print(tree)

user = tree.xpath('')     # XPath expression elided in the original
content = tree.xpath('')  # XPath expression elided in the original
results = []
for i in range(0, len(user)):
    # the start of this append call was lost in extraction; reconstructed
    # to mirror the surviving content[i] half
    results.append(user[i].xpath('string(.)').strip()
                   + ": " + content[i].xpath('string(.)').strip())

for i, result in zip(range(0, len(user)), results):
    print("user" + str(i + 1) + "-" + result)
    print("*" * 100)

Reason for failure:

Is the xpath method being used incorrectly? Just noting it down here for now, haha.
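With the XPath expressions elided it is hard to tell what failed, but one common pitfall is the casing: `etree.html` does not exist, only `etree.HTML`. A self-contained sketch on inline HTML (the forum-like markup and class names below are invented for illustration) showing that the `xpath('string(.)')` pattern itself works:

```python
from lxml import etree

# Hypothetical forum-like markup; the class names are assumptions.
html = '''
<div class="post"><div class="auth">userA</div><div class="postbody"> hello </div></div>
<div class="post"><div class="auth">userB</div><div class="postbody"> world </div></div>
'''
tree = etree.HTML(html)  # must be uppercase HTML

user = tree.xpath('//div[@class="auth"]')
content = tree.xpath('//div[@class="postbody"]')

results = []
for i in range(0, len(user)):
    # string(.) concatenates all text inside the element; strip() trims whitespace
    results.append(user[i].xpath('string(.)').strip()
                   + ": " + content[i].xpath('string(.)').strip())
print(results)  # ['userA: hello', 'userB: world']
```

If this pattern works on inline HTML but not on the live page, the problem is more likely the XPath expressions not matching the real page structure (or the page being rendered by JavaScript) than the method calls themselves.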
