Web Scraping in Practice: Scraping the Dangdang Top 500 Books

1. This seems to be a standard project for anyone starting out with web scraping, so it makes good practice.

Practice:

2. The script uses the requests + bs4 approach because it is fairly simple, so I will not go into much detail.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''Scrape the Dangdang top 500 books.'''
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# The original header values were lost from this listing; fill in your own
# request headers (for example a User-Agent) before running.
headers = {}


class SpiderDangdang(object):
    def __init__(self, url):
        self.url = url

    def get_collection(self):
        '''Return the MongoDB collection that stores the scraped books.'''
        client = MongoClient('localhost', 27017)
        database = client.spider
        collection = database.dangdang
        return collection

    def get_response(self):
        '''Fetch the page and return its HTML, or None if the request fails.'''
        try:
            response = requests.get(self.url, headers=headers)
            response.raise_for_status()  # must be called, not just referenced
            return response.text
        except Exception as e:
            print('Request failed:', e)
            return None

    def get_soup(self, response):
        return BeautifulSoup(response, 'html.parser')

    def get_items(self, soup):
        # One <li> per book in the ranking list
        return soup.select('div.bang_list_box>ul>li')

    def get_text(self, item, selector):
        '''Return the stripped text of the first match, or 'None' if absent.'''
        found = item.select(selector)
        return found[0].text.strip() if found else 'None'

    def get_item_content(self, item):
        num = item.select('div.list_num')[0].text.strip()
        name = item.select('div.name')[0].text.strip()
        star = item.select('div.star')[0].text.strip()
        author = item.select('div.publisher_info')[0].text.strip()
        # Not every book lists all three prices, so each one may be missing
        price_n = self.get_text(item, 'div.price>p>span.price_n')
        price_r = self.get_text(item, 'div.price>p>span.price_r')
        price_s = self.get_text(item, 'div.price>p>span.price_s')
        # The original dict literal was lost from this listing; the keys
        # below simply mirror the variable names (an assumption).
        content = {
            'num': num,
            'name': name,
            'star': star,
            'author': author,
            'price_n': price_n,
            'price_r': price_r,
            'price_s': price_s,
        }
        return content

    def start(self):
        collection = self.get_collection()
        response = self.get_response()
        if response is None:
            return  # the request failed, nothing to parse
        soup = self.get_soup(response)
        items = self.get_items(soup)
        for item in items:
            content = self.get_item_content(item)
            # The original query was lost from this listing; matching on the
            # book name is an assumption.
            query = {'name': content['name']}
            if collection.find_one(query):
                print('\033[1;31mItem already exists, skipping\033[0m')
            else:
                collection.insert_one(content)
                print('\033[1;32mNew item, saving\033[0m')


if __name__ == '__main__':
    # The URL prefix is missing from the source; only the '/1-{page}' tail
    # survives. 25 pages of 20 books each give the top 500.
    urls = ['/1-{page}'.format(page=page) for page in range(1, 26)]
    for page, url in enumerate(urls):
        print('\033[1;33mStart scraping page {page}\033[0m'.format(page=page + 1))
        ss = SpiderDangdang(url)
        ss.start()

3. When storing the data, check each item first and skip any item that has already been saved.
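The duplicate check can be isolated in a small helper. This is only a minimal sketch, not the author's code: the function name `save_if_new` and the choice of `name` as the identifying key are assumptions made for illustration, and `collection` can be any object that exposes `find_one` and `insert_one`, as a pymongo collection does.

```python
def save_if_new(collection, content, key='name'):
    """Insert content unless a document with the same key value exists.

    `collection` is assumed to offer find_one(query) and insert_one(doc).
    Returns True when the item was inserted, False when it was skipped.
    """
    query = {key: content[key]}
    if collection.find_one(query):
        return False  # already stored, do not save again
    collection.insert_one(content)
    return True
```

Check-then-insert is not atomic, so if several scrapers write to the same collection at once, a unique index on the key is likely the more robust choice.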

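Simple for loops are better written as list or dict comprehensions (or generator expressions; Python has no tuple comprehension), which are short and clear.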
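As a rough illustration of that advice, the sketch below builds the 25 page URLs first with a plain loop and then with comprehensions. `URL_TEMPLATE` is a hypothetical placeholder, not the real ranking URL.

```python
# Hypothetical URL template, used only for illustration.
URL_TEMPLATE = 'https://example.com/top/1-{page}'

# Plain loop version
urls_loop = []
for page in range(1, 26):
    urls_loop.append(URL_TEMPLATE.format(page=page))

# Equivalent list comprehension
urls = [URL_TEMPLATE.format(page=page) for page in range(1, 26)]

# Dict comprehension: map each page number to its URL
url_by_page = {page: URL_TEMPLATE.format(page=page) for page in range(1, 26)}

# Generator expression: parentheses produce a lazy generator, not a tuple
total_chars = sum(len(url) for url in urls)
```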

Finally, print the necessary status messages so that you can tell which stage the program has reached.
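One way to keep such status messages consistent is a tiny formatting helper built on the same ANSI escape codes the script uses. The names `colorize` and `progress_message` are assumptions introduced for this sketch.

```python
# ANSI color codes matching those used in the script
RED, GREEN, YELLOW = 31, 32, 33


def colorize(text, color_code):
    """Wrap text in a bold ANSI color sequence and reset afterwards."""
    return '\033[1;{}m{}\033[0m'.format(color_code, text)


def progress_message(page):
    """Build the message announcing which page is being scraped."""
    return colorize('Start scraping page {page}'.format(page=page), YELLOW)


print(progress_message(1))
```

The colors only render in terminals that understand ANSI escape sequences; elsewhere the raw codes appear in the output.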
