A simple Scrapy crawler: the Douban drama-film chart


Goal: a simple Scrapy exercise that scrapes the top-scoring 20% of Douban's drama-film chart and writes the results to a file.

Page notes:

1. The 100:90 segment in the request URL restricts the chart to the top-scoring 20% of movies (a sketch of the request follows these notes).

2. The page-analysis walkthrough is skipped here.

System and software: Windows 7, PyCharm, Python 3.6.
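The URLs in the original post were stripped when it was republished, so the exact endpoint below is an assumption: Douban's JSON chart API, with type=11 for the drama category and interval_id=100:90 for the top-scoring 20%, which matches the 100:90 note above and the paging logic of the spider in step 2. A standalone sanity check of that assumed endpoint might look like this:

import json
import urllib.request

# Assumed Douban chart API endpoint (not preserved in the original post).
url = ('https://movie.douban.com/j/chart/top_list'
       '?type=11&interval_id=100%3A90&action=&start=0&limit=20')
# Douban tends to reject clients without a browser-like user agent.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    movies = json.loads(resp.read().decode())

# Each element should carry at least the three fields the item collects.
for movie in movies[:3]:
    print(movie['title'], movie['score'], movie['url'])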

Code: 1. Write the item (items.py)

# -*- coding: utf-8 -*-
import scrapy


class DoubanMovieItem(scrapy.Item):
    # One record per movie: title, rating, and detail-page URL.
    name = scrapy.Field()
    score = scrapy.Field()
    url = scrapy.Field()
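A Scrapy Item behaves like a dict with a fixed key set, so a typo in a field name fails loudly instead of silently creating a new key. A quick illustration with hypothetical values (run anywhere the project is importable):

from douban_movie.items import DoubanMovieItem

item = DoubanMovieItem()
item['name'] = '肖申克的救赎'   # OK: 'name' is a declared field
# item['rating'] = 9.7          # would raise KeyError: 'rating' is not declared
print(dict(item))               # {'name': '肖申克的救赎'}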

2. Write the spider

# -*- coding:utf-8 -*-
import scrapy
import json

from douban_movie.items import DoubanMovieItem


class CatchMovieSpider(scrapy.Spider):
    name = 'catch_movie'
    allowed_domains = ['douban.com']
    # The original post's URL was stripped; this endpoint is assumed from the
    # 100:90 note above (type=11 selects the drama category).
    start_urls = ['https://movie.douban.com/j/chart/top_list'
                  '?type=11&interval_id=100%3A90&action=&start=0&limit=20']
    offset = 0

    def parse(self, response):
        # print(response.body.decode())
        movie_list = json.loads(response.body.decode())
        if not movie_list:  # an empty JSON list means the chart is exhausted
            return
        for movie in movie_list:
            item = DoubanMovieItem()
            item['name'] = movie['title']
            item['score'] = movie['score']
            item['url'] = movie['url']
            yield item
        self.offset += 20
        # Request the next page of twenty movies (URL template assumed, as above).
        new_url = ('https://movie.douban.com/j/chart/top_list'
                   '?type=11&interval_id=100%3A90&action=&start={}&limit=20').format(self.offset)
        yield scrapy.Request(url=new_url, callback=self.parse)
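The spider pages through the chart twenty movies at a time: each parse call yields the items on the current page, bumps offset, and requests the next page until the API answers with an empty list. The same termination pattern, isolated from Scrapy as a runnable sketch (fetch_page is a made-up stand-in that fakes a 50-movie chart):

def fetch_page(offset, limit=20):
    # Stand-in for the Douban API: pretend the 100:90 interval holds 50 movies.
    total = 50
    return [{'title': 'movie-{}'.format(i)}
            for i in range(offset, min(offset + limit, total))]

offset = 0
while True:
    movies = fetch_page(offset)
    if not movies:          # an empty page means the chart is exhausted
        break
    print('offset={}: {} movies'.format(offset, len(movies)))
    offset += 20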

3. Write the pipeline (pipelines.py)

# -*- coding: utf-8 -*-
import json


class DoubanMoviePipeline(object):

    def open_spider(self, spider):
        # One output file for the whole crawl, opened when the spider starts.
        self.file = open('douban_movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese titles readable.
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()
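Because the pipeline writes one JSON object per line (the JSON Lines format), the output is easy to post-process. A sketch of reading douban_movie.txt back and ranking locally (assumes the crawl has already produced the file, and that Douban returns score as a string like '9.7'):

import json

# Parse the JSON-Lines output back into a list of dicts.
with open('douban_movie.txt', encoding='utf-8') as f:
    movies = [json.loads(line) for line in f if line.strip()]

# Sort by score, highest first, and show the top five.
movies.sort(key=lambda m: float(m['score']), reverse=True)
for m in movies[:5]:
    print(m['score'], m['name'])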

4. Edit settings (settings.py). The ITEM_PIPELINES value was lost in the original post, so the registration below is reconstructed from the project and class names used above:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# (value assumed: the original was stripped, and Douban tends to reject Scrapy's default UA)
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_movie.pipelines.DoubanMoviePipeline': 300,
}
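The number after the pipeline path is a priority from 0 to 1000; when several pipelines are enabled, items flow through them in ascending order. A hypothetical example with a second, made-up pipeline class:

ITEM_PIPELINES = {
    # Hypothetical filter that would drop low-score items before writing.
    'douban_movie.pipelines.ScoreFilterPipeline': 200,
    # The writer from step 3 runs afterwards (higher number = later).
    'douban_movie.pipelines.DoubanMoviePipeline': 300,
}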

5. Write main (the debug entry point)

from scrapy.cmdline import execute

execute('scrapy crawl catch_movie'.split())
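Note that cmdline.execute ends the interpreter when the crawl finishes (it calls sys.exit internally), so nothing placed after it will run. If the script needs to continue afterwards, CrawlerProcess is an alternative, sketched here assuming the same project layout:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('catch_movie')   # look up the spider by its name attribute
process.start()                # blocks until the crawl is done, then returns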

Screenshot of the saved file contents (image not preserved here).

Notes: 1. main exists so the spider can be launched and debugged from PyCharm instead of the command line.

2. This chart fixes a score interval in the request URL (the 100:90-style parameter).
