Scrapy Crawler Practice

This is a practice write-up from learning Scrapy.

1. Create the project

scrapy startproject movie

2. Create the spider

cd movie

scrapy genspider meiju meijutt.com

3. The project directory structure is generated automatically

4. Define the item (the data storage template)

items.py

import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()

5. Write the spider

meiju.py

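# -*- coding: utf-8 -*-
import scrapy
from movie.items import MovieItem

class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    # The start URL is blank in the source; set it to the listing page to crawl.
    start_urls = ['']

    def parse(self, response):
        # Each <li> under the "top-list fn-clear" list is one show.
        movies = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for each_movie in movies:
            item = MovieItem()
            # The show name is the title attribute of the link inside <h5>.
            item['name'] = each_movie.xpath('./h5/a/@title').extract()[0]
            yield item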
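To see what the two XPath expressions in parse select, here is a small sketch that runs equivalent queries with the standard library's xml.etree.ElementTree instead of Scrapy's selector. The HTML fragment, the show names, and the helper extract_names are made up for illustration; the real page is only assumed to have the structure the spider's XPath implies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the list the spider targets.
SAMPLE = """
<html><body>
  <ul class="top-list fn-clear">
    <li><h5><a href="/a.html" title="Show A">Show A</a></h5></li>
    <li><h5><a href="/b.html" title="Show B">Show B</a></h5></li>
  </ul>
  <ul class="other"><li><h5><a title="Ignored">x</a></h5></li></ul>
</body></html>
"""


def extract_names(markup):
    """Return the title attribute of each h5/a link in the target list."""
    root = ET.fromstring(markup)
    names = []
    # ElementTree supports only a subset of XPath, so the attribute is read
    # with .get("title") instead of the "/@title" step used in the spider.
    for li in root.findall(".//ul[@class='top-list fn-clear']/li"):
        link = li.find("./h5/a")
        if link is not None and link.get("title"):
            names.append(link.get("title"))
    return names


if __name__ == "__main__":
    print(extract_names(SAMPLE))
```

Note that in the spider, .extract()[0] raises IndexError for any <li> without a matching link, whereas this sketch simply skips such entries.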

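6. Configure settings

Add the following to settings.py

ITEM_PIPELINES =

This setting enables item pipelines and assigns each one a priority; pipelines with lower numbers run first. With a single pipeline the particular number does not matter, but the pipeline still has to be listed here, otherwise Scrapy will not call it.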
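The value of ITEM_PIPELINES is not shown above. As a sketch only, assuming the pipeline class is named MoviePipeline and lives in movie/pipelines.py, the entry might look like the following; the priority 100 is an arbitrary choice.

```python
# settings.py (fragment)
# Keys are import paths of pipeline classes; values are integers
# (conventionally in the 0-1000 range), and lower values run first.
ITEM_PIPELINES = {
    "movie.pipelines.MoviePipeline": 100,  # assumed class path and priority
}
```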

7. Write the data processing script

pipelines.py

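class MoviePipeline(object):
    def process_item(self, item, spider):
        # Open in binary append mode and write UTF-8 bytes, one name per line.
        with open("my_meiju.txt", 'ab') as fp:
            fp.write(item['name'].encode("utf8") + b'\r\n')
        # Return the item so any later pipeline can still receive it.
        return item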
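Opening the file inside process_item works, but it reopens the file for every item. As a sketch of an alternative, Scrapy also calls a pipeline's open_spider and close_spider methods when they are defined, so the file can be opened once per crawl. The class name MovieFilePipeline and its path argument are hypothetical and not part of the original tutorial.

```python
import os
import tempfile


class MovieFilePipeline(object):
    """Hypothetical variant: keep one file handle for the whole crawl."""

    def __init__(self, path="my_meiju.txt"):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.fp = open(self.path, "ab")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.fp is not None:
            self.fp.close()
            self.fp = None

    def process_item(self, item, spider):
        # Same bytes as before: UTF-8 name followed by CRLF.
        self.fp.write(item["name"].encode("utf8") + b"\r\n")
        return item


if __name__ == "__main__":
    # Exercise the pipeline by hand, without running a crawl.
    with tempfile.TemporaryDirectory() as tmp:
        pipeline = MovieFilePipeline(os.path.join(tmp, "out.txt"))
        pipeline.open_spider(None)
        pipeline.process_item({"name": "Show A"}, None)
        pipeline.close_spider(None)
        with open(pipeline.path, "rb") as fp:
            print(fp.read())
```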

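The tutorial I followed used this instead:

with open("my_meiju.txt",'a') as fp:
    fp.write(item['name'].encode("utf8") + '\n')

This raises TypeError: can't concat str to bytes. The immediate cause is that encode() returns bytes, and Python 3 refuses to concatenate bytes with the str '\n'. The explanation I found reads: "Python 3 added a new parameter named encoding to the open function, and its default value is 'utf-8'. As a result, read and write operations on the file handle require instances containing Unicode characters and do not accept bytes instances containing binary data." (Strictly speaking, the default encoding has traditionally depended on the platform locale rather than always being UTF-8.) So I changed the file mode from text mode 'a' to binary mode 'ab' and added a b prefix to '\n'. With that change alone, however, the output showed no line breaks when viewed in a plain-text editor, although opening my_meiju.txt in Word did show them. Binary mode writes the bytes exactly as given, with no newline translation, so I added \r to make the line breaks visible in the text editor as well.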
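The failure and two ways around it can be reproduced without Scrapy. The sketch below uses a made-up show name and temporary files, and all helper names are hypothetical. It shows the bytes + str concatenation failing, the binary-mode approach, and an alternative that stays in text mode and lets open() handle both the encoding and the line endings.

```python
import os
import tempfile

name = "美劇"  # hypothetical item['name'] value


def concat_fails():
    """Reproduce the error: bytes + str is not allowed in Python 3."""
    try:
        name.encode("utf8") + "\n"
    except TypeError as exc:
        return str(exc)
    return None


def write_binary(path):
    """Binary mode: write bytes and spell out the CRLF line ending."""
    with open(path, "ab") as fp:
        fp.write(name.encode("utf8") + b"\r\n")


def write_text(path):
    """Text mode: pass str; open() encodes it and writes LF as CRLF."""
    with open(path, "a", encoding="utf-8", newline="\r\n") as fp:
        fp.write(name + "\n")


def read_bytes(path):
    with open(path, "rb") as fp:
        return fp.read()


if __name__ == "__main__":
    print(concat_fails())
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "binary.txt")
        b = os.path.join(tmp, "text.txt")
        write_binary(a)
        write_text(b)
        # Compare the bytes each approach left on disk.
        print(read_bytes(a) == read_bytes(b))
```

Passing encoding="utf-8" explicitly avoids relying on the platform default, and newline="\r\n" makes the text-mode version write the same line endings as the binary one.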

8. Run the spider

cd movie

scrapy crawl meiju --nolog

The --nolog option suppresses the log output and keeps the command line clean; remove --nolog to see the run log.
