Scraping with Scrapy and Saving to a TXT File

Starting from an already created project and spider, this post writes the code that saves the scraped results to a txt file.

1. In settings.py, set ROBOTSTXT_OBEY to False

2. In settings.py, enable ITEM_PIPELINES (uncomment it)
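The two settings above might look like the fragment below. This is only a sketch: the package name `baike` and the class name `BaikePipeline` are assumptions based on the item and pipeline names used in this post, so substitute the dotted path of your own pipeline class.

```python
# settings.py (fragment)

# Do not consult robots.txt before fetching pages.
ROBOTSTXT_OBEY = False

# Enable the pipeline. The key is the dotted path to the pipeline class
# ("baike" is an assumed project package name); the integer sets the order
# in which pipelines run, with lower values running first.
ITEM_PIPELINES = {
    "baike.pipelines.BaikePipeline": 300,
}
```

With only one pipeline enabled, the exact integer does not matter; it only determines ordering when several pipelines are listed.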

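An Item is a dictionary-like data container provided by Scrapy. Its main difference from a plain dictionary is that it declares a fixed set of fields, so the data has a uniform structure. This makes the data easier to store and process, and it guards against misspelled field names and inconsistent records.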

import scrapy


class BaikeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

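The parse() method controls which links are crawled and how the results are handled. Typically, once a link has been obtained, scrapy.Request(url, callback=...) is used to fetch the page, and the method given after callback= parses the response.

In the parsing method, define a dictionary, dic = {}, fill it with the parsed results following the fields declared in items.py, and return it with return or yield. The returned value is then passed on to the pipelines.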

class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['mp.csdn.net']
    start_urls = ['']

    def parse(self, response):
        # Build one detail-page URL per index and request each page.
        for i in range(45, 1000):
            url = '' + str(i) + '_03.shtml'
            try:
                yield scrapy.Request(url, callback=self.parse_history)
            except Exception:
                # Skip URLs that cannot be turned into a request.
                continue

    def parse_history(self, response):
        dic = {}
        try:
            # Take the first matching link text inside <h1>.
            school = response.css('h1 a::text').extract()[0]
            dic['name'] = school
            yield dic
        except Exception as e:
            print(e)

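Methods defined on the pipeline class in pipelines.py:

open_spider(self, spider)

---- runs when the spider starts

close_spider(self, spider)

---- runs when the spider closes

process_item(self, item, spider)

---- runs each time a parse callback in the spider returns or yields an item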

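In open_spider we open a txt file in text write mode; the file is created if it does not exist.

The write encoding is set to 'utf-8' here (on Chinese-locale Windows the default is 'gbk'):

def open_spider(self, spider):
    self.file = open('items2.txt', 'w', encoding='utf-8')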

In close_spider we close the txt file:

def close_spider(self, spider):
    self.file.close()

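In process_item we write the contents of the item to the txt file in the chosen format, one name per line:

def process_item(self, item, spider):
    try:
        res = dict(item)
        line = res['name']
        self.file.write(line + '\n')
    except Exception:
        # Ignore items that lack a 'name' or cannot be written.
        pass
    # Return the item so that any later pipeline still receives it.
    return item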
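The three hooks above can be exercised without running a crawl. The sketch below is not Scrapy itself: `TxtPipeline` and `run_lifecycle` are hypothetical names, the sample items are made up, and the driver function only imitates the order in which Scrapy calls the hooks (open once, process each item, close once).

```python
import os
import tempfile


class TxtPipeline:
    """Hypothetical pipeline with the same three hooks as in this post."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Opened once, before any item arrives.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Closed once, after the last item.
        self.file.close()

    def process_item(self, item, spider):
        # Called once per item: write the 'name' field on its own line.
        self.file.write(dict(item)["name"] + "\n")
        return item


def run_lifecycle(items, path):
    """Imitate the call order of the hooks; a stand-in for a real crawl."""
    pipeline = TxtPipeline(path)
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)


out_path = os.path.join(tempfile.mkdtemp(), "items2.txt")
run_lifecycle([{"name": "北京大学"}, {"name": "清华大学"}], out_path)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Keeping the file handle on `self` is what lets one open/close pair serve every item, instead of reopening the file for each call to process_item.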

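Notes:

On Chinese-locale Windows, files are written with the 'gbk' encoding by default, so the encoding usually has to be changed before the file is written correctly. Passing encoding='utf-8' to open() is the usual way to prevent garbled text and write failures.

1. To make the item easier to handle, first convert it to a dictionary with dict().

2. In Python 2, the scraped text is a unicode object, which cannot be written to the txt file as is; it has to be encoded as 'utf-8' first.

Converting the unicode value with str() may give something that can be written to the txt file, but the content is still in unicode form rather than readable Chinese.

Calling .encode('utf-8') on the unicode value instead means the txt file shows Chinese text when opened.

Python 2 handles Chinese characters poorly, which is why this step causes extra trouble. In Python 3, opening the file with encoding='utf-8' as shown above is enough, and the string can be written directly.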
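The encoding point can be checked with a short Python 3 sketch. The sample string and the temporary file are made up for illustration; the code only shows that writing a str through a file opened with encoding='utf-8' stores the same bytes as calling .encode('utf-8') by hand.

```python
import os
import tempfile

text = "北京大学"  # sample scraped value (an assumption for this example)

# The same four characters occupy a different number of bytes per encoding.
utf8_bytes = text.encode("utf-8")  # 3 bytes per character here
gbk_bytes = text.encode("gbk")     # 2 bytes per character here

path = os.path.join(tempfile.mkdtemp(), "encoding_demo.txt")

# newline="\n" keeps the line ending identical on every platform.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text + "\n")

# Read the raw bytes back to see what was actually stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(len(utf8_bytes), len(gbk_bytes), raw == utf8_bytes + b"\n")
```

Reading the file back with a different encoding than the one used for writing is what produces garbled text, so the same encoding should be named on both sides.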

Full code:

spiders/demo.py

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['mp.csdn.net']
    start_urls = ['']

    def parse(self, response):
        # Build one detail-page URL per index and request each page.
        for i in range(45, 1000):
            url = '' + str(i) + '_03.shtml'
            try:
                yield scrapy.Request(url, callback=self.parse_history)
            except Exception:
                # Skip URLs that cannot be turned into a request.
                continue

    def parse_history(self, response):
        dic = {}
        try:
            # Take the first matching link text inside <h1>.
            school = response.css('h1 a::text').extract()[0]
            dic['name'] = school
            yield dic
        except Exception as e:
            print(e)

items.py

import scrapy


class BaikeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

pipelines.py

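class BaikePipeline(object):
    def open_spider(self, spider):
        # Open with an explicit encoding so Chinese text is written correctly.
        self.file = open('items2.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    # The item is converted to a dict here and has to be converted back if it
    # is used later; the try/except only keeps an error from stopping the run.
    def process_item(self, item, spider):
        try:
            res = dict(item)
            line = res['name']
            # Python 3: write the str directly. Under Python 2, the original
            # wrote line.encode('utf-8') + '\n' to a file opened without encoding=.
            self.file.write(line + '\n')
        except Exception:
            pass
        # Return the item so that any later pipeline still receives it.
        return item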
