Scraping the Maoyan movie leaderboard with Scrapy

2022-05-13 15:44:11

If you write crawlers, the one framework you cannot avoid is Scrapy. For small jobs the requests module alone will get you results, but once the volume of data to scrape grows, a framework becomes essential.

As a warm-up, let's use Scrapy to write a program that scrapes the Maoyan movie leaderboard. Environment setup and Scrapy installation are omitted here.

The first step is to create the project and the spider file from the terminal:

```shell
# create the project
scrapy startproject maoyan
cd maoyan

# create the spider file
# (named maoyan3 so it does not clash with the project name)
scrapy genspider maoyan3 maoyan.com
```

Then, in the generated items.py file, define the data structure to scrape:

```python
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
```

Next, open maoyan.py and write the spider. Remember to import the MaoyanItem class from items.py and instantiate it:

```python
import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan3'
    allowed_domains = ['maoyan.com']
    # the start_urls attribute is removed;
    # start_requests() is overridden instead
    def start_requests(self):
        for offset in range(0, 91, 10):
            # the board URL was elided in the original post;
            # the TOP100 board is assumed here
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # instantiate the MaoyanItem class from items.py
        item = MaoyanItem()
        # base XPath (elided in the original; one <dd> per film is assumed)
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # iterate over the film entries,
        # assigning the class fields declared in items.py
        for dd in dd_list:
            item['name'] = dd.xpath('./a/@title').get().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()
            # hand the item object to the pipelines
            yield item
```
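The start_requests() loop above pages through the board in steps of 10. The pagination arithmetic can be checked on its own (the board URL is an assumption, as it was elided in the original post):

```python
# offsets 0, 10, ..., 90 cover the 100 films on the board, 10 per page
offsets = list(range(0, 91, 10))

# board URL assumed; it was elided in the original post
urls = ['https://maoyan.com/board/4?offset={}'.format(o) for o in offsets]

print(len(urls))    # 10 requests issued
print(offsets[-1])  # 90, the last page offset
```

Ten requests with offsets 0 through 90 is exactly one request per page of the 100-film board.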

Define the pipelines in pipelines.py for persistence:

```python
import pymysql
from .settings import *

class MaoyanPipeline(object):
    # item: the item yielded from the spider file maoyan.py
    def process_item(self, item, spider):
        print(item['name'], item['time'], item['star'])
        return item

# custom pipeline - MySQL database
class MaoyanMysqlPipeline(object):
    # runs once when the crawl starts;
    # typically used to open the database connection
    def open_spider(self, spider):
        print('output from open_spider()')
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHAR
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into filmtab values(%s,%s,%s)'
        # execute() takes the parameters as a sequence
        l = [item['name'], item['star'], item['time']]
        self.cursor.execute(ins, l)
        self.db.commit()
        return item

    # runs once when the crawl ends;
    # typically used to close the database connection
    def close_spider(self, spider):
        print('output from close_spider()')
        self.cursor.close()
        self.db.close()
```
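The MySQL pipeline assumes a filmtab table already exists in maoyandb. Its parameterized-insert logic can be exercised without a MySQL server by swapping in Python's built-in sqlite3 module; this is only a stand-in sketch, where the filmtab schema is assumed and sqlite's ? placeholders replace pymysql's %s:

```python
import sqlite3

# in-memory SQLite stands in for MySQL; the filmtab schema is an assumption
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('create table filmtab (name text, star text, time text)')

# a sample item shaped like what the spider yields
item = {'name': 'Farewell My Concubine',
        'star': 'Leslie Cheung',
        'time': '1993-01-01'}

# parameterized insert, mirroring the pipeline's execute(ins, l) call
# (sqlite uses ? placeholders where pymysql uses %s)
ins = 'insert into filmtab values(?,?,?)'
cursor.execute(ins, [item['name'], item['star'], item['time']])
db.commit()

print(cursor.execute('select count(*) from filmtab').fetchone()[0])  # 1 row stored
```

Passing the values as a list keeps the driver responsible for quoting, which is why the pipeline builds `l` instead of formatting the SQL string itself.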

Next, update the configuration file settings.py:

```python
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False

# headers were elided in the original post; a minimal set is assumed
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# pipeline priorities were elided in the original post; values are assumed
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
    'maoyan.pipelines.MaoyanMysqlPipeline': 200,
}

# MySQL-related variables
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHAR = 'utf8'
```

Finally, create a run.py file, and the project is ready to run:

```python
from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan3'.split())
```
