Crawler in Practice: Scraping the Douban Movie Top 250

1. Essential background for getting started with crawlers

Target to crawl:

2. How the crawler works:

a) Understand how the URL changes from page to page

Page 1:

Page 2:
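The paging rule can be sketched as below. The page URLs themselves are not shown here, so the base URL and the `start` parameter name are assumptions; only the offsets (0, 25, ..., 225) follow from 250 movies listed 25 per page.

```python
# Assumption: the list pages differ only in a "start" offset in the query string.
# BASE_URL is a guess at the address; the article's own URLs are not shown.
BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Return one URL per list page, changing only the start offset."""
    return [
        "{base}?start={start}".format(base=BASE_URL, start=start)
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))  # 10 pages in total
for url in urls[:2]:
    print(url)
```

Keeping the offset logic in one small function makes it easy to check the page count before sending any request.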

b) Locate the content to extract on each page:

Each page lists 25 movies.

c) Work out how to extract each movie's details
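Steps b) and c) can be tried offline before touching the real site. The sketch below runs the same kind of CSS selectors against a small hand-written HTML sample; the sample markup, the `first_text` helper, and the dictionary keys are assumptions made for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the list markup the selectors expect.
# It is NOT copied from the real page.
SAMPLE_HTML = """
<div class="article">
  <ol>
    <li>
      <div class="item">
        <div class="hd"><a href="#"><span class="title">Movie A</span></a></div>
        <div class="bd">
          <p class="">Director: Someone</p>
          <div class="star"><span class="rating_num">9.0</span></div>
          <p class="quote"><span>A short quote.</span></p>
        </div>
      </div>
    </li>
    <li>
      <div class="item">
        <div class="hd"><a href="#"><span class="title">Movie B</span></a></div>
        <div class="bd">
          <p class="">Director: Another</p>
          <div class="star"><span class="rating_num">8.5</span></div>
        </div>
      </div>
    </li>
  </ol>
</div>
"""


def first_text(item, selector, default="None"):
    """Return the stripped text of the first match, or a default if nothing matches."""
    matches = item.select(selector)
    return matches[0].get_text(strip=True) if matches else default


def parse_items(html):
    """Locate every list entry, then pull the fields out of each one."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.article>ol>li"):
        rows.append({
            "head": first_text(item, "div.hd"),
            "people": first_text(item, "p[class='']"),
            "star": first_text(item, "div.star"),
            "comment": first_text(item, "p.quote"),
        })
    return rows


rows = parse_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

Returning a default instead of raising keeps one missing field (the second sample entry has no quote) from discarding the whole movie.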

3. Complete code:

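#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# The original header values were lost; a generic User-Agent is assumed here.
headers = {'User-Agent': 'Mozilla/5.0'}


class SpiderDouban(object):
    def __init__(self, url):
        self.url = url

    def get_collection(self):
        # Connect to a local MongoDB and use the spider.douban collection.
        client = MongoClient('localhost', 27017)
        database = client.spider
        collection = database.douban
        return collection

    def get_response(self):
        # Fetch the page; fall back to a placeholder string if the request fails.
        try:
            response = requests.get(self.url, headers=headers)
            response.raise_for_status()
            html = response.text
        except Exception:
            html = 'None'
        return html

    def get_soup(self, html):
        # Parse with the built-in HTML parser, falling back to the XML parser.
        try:
            soup = BeautifulSoup(html, 'html.parser')
        except Exception:
            soup = BeautifulSoup(html, 'xml')
        return soup

    def get_items(self, soup):
        # Each <li> under div.article > ol is one movie entry.
        items = soup.select('div.article>ol>li')
        return items

    def get_item_content(self, item):
        # Every field falls back to 'None' when its element is missing.
        try:
            head = item.select('div.hd')[0].text.strip()
        except Exception:
            head = 'None'
        try:
            people = item.select("div.article>ol>li>div p[class='']")[0].text.strip().replace(' ', '')
        except Exception:
            people = 'None'
        try:
            star = item.select('div.article>ol>li>div div.star')[0].text.strip().replace('\n', ' ')
        except Exception:
            star = 'None'
        try:
            comment = item.select('div.article>ol>li>div p.quote')[0].text.strip()
        except Exception:
            comment = 'None'
        # The original dict was lost; the key names below are assumed.
        content = {'head': head, 'people': people, 'star': star, 'comment': comment}
        return content

    def start(self):
        collection = self.get_collection()
        html = self.get_response()
        soup = self.get_soup(html)
        items = self.get_items(soup)
        for item in items:
            content = self.get_item_content(item)
            # Skip items that are already stored so reruns do not create duplicates.
            if collection.find_one(content):
                print('\033[1;31mThis item is already in the database; not saving it\033[0m')
            else:
                collection.insert_one(content)
                print('\033[1;32mThis item is new; saving it\033[0m')


if __name__ == '__main__':
    # The URL template was lost; fill it in with a string that contains
    # a {num} placeholder for the page offset.
    urls = [''.format(num=num) for num in range(0, 250, 25)]
    for page, url in enumerate(urls):
        print('\033[1;33mStart crawling page {page}\033[0m'.format(page=page + 1))
        ss = SpiderDouban(url)
        ss.start()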
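The "check first, then insert" step in `start` can be exercised without a running MongoDB. The sketch below uses a hypothetical in-memory stand-in, `InMemoryCollection`, that exposes only `find_one` and `insert_one` with simple exact-match semantics; it is not a pymongo class, and `save_if_new` is likewise an illustrative helper.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection (exact-match lookup only)."""

    def __init__(self):
        self._docs = []

    def find_one(self, query):
        # Return the first stored document whose fields all equal the query's.
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        # Store a copy so later changes to the caller's dict do not leak in.
        self._docs.append(dict(doc))


def save_if_new(collection, content):
    """Insert content unless an identical document exists; return True if inserted."""
    if collection.find_one(content):
        return False
    collection.insert_one(content)
    return True


collection = InMemoryCollection()
movie = {"head": "Movie A", "star": "9.0"}
print(save_if_new(collection, movie))  # first time: stored
print(save_if_new(collection, movie))  # second time: skipped
```

Because the lookup compares every field, a movie whose rating text changes between runs would be stored again; matching on the title alone is one possible alternative if that matters.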
