Crawling Web Articles with Scrapy

settings.py defines the crawl settings, as follows:

# -*- coding: utf-8 -*-
# Scrapy settings for jobbole project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation.
BOT_NAME = 'jobbole'
SPIDER_MODULES = ['jobbole.spiders']
NEWSPIDER_MODULE = 'jobbole.spiders'
ITEM_PIPELINES = {}  # value missing in the source; the pipeline class must be registered here
PAGES_STORE = 'f:\\spidertest\\pagetest'  # custom setting: root folder for the saved pages
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
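
The ITEM_PIPELINES value is missing from the settings listing. As a minimal sketch, assuming the pipeline class lives at jobbole.pipelines.JobbolePipeline (a hypothetical path inferred from the project and class names, not confirmed by the source), the setting usually maps a class path to an integer priority:

```python
# Sketch of an ITEM_PIPELINES entry for settings.py.
# Assumption: the class path below is hypothetical; adjust it to the real module.
# Scrapy runs the enabled pipelines in ascending order of these integer values,
# which are customarily chosen in the 0-1000 range.
ITEM_PIPELINES = {
    'jobbole.pipelines.JobbolePipeline': 300,
}
```

Without an entry like this, Scrapy never calls the pipeline's process_item method, so nothing would be written to PAGES_STORE.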
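
The spider is as follows:

# -*- coding: utf-8 -*-

import scrapy
from jobbole.items import JobboleItem
from bs4 import BeautifulSoup


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = []  # domain list missing in the source
    start_urls = [""]  # start URL missing in the source

    def parse(self, response):
        item = JobboleItem()
        # The source omits the line that fills item['page_urls'] with the article links on this page.
        print('page_urls', item['page_urls'])
        yield item
        # Pagination: follow the "next page" link until there is none.
        new_url = response.xpath('//*[@class="next page-numbers"]//@href').extract_first()
        print('new_url', new_url)
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)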
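
To make the pagination step concrete, here is a minimal sketch of what the "next page" lookup does. It deliberately uses Python's standard library instead of Scrapy's XPath selectors so that it runs without a crawl, and it is simplified to read the href on the matching element itself. The names NextLinkParser and find_next_url, the example.com URLs, and the HTML fragments are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first element whose class is exactly 'next page-numbers'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.next_href is None and attr_map.get('class') == 'next page-numbers':
            self.next_href = attr_map.get('href')


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Resolve relative links against the URL of the current page.
    return urljoin(page_url, parser.next_href)


# Hypothetical listing-page fragments, invented for illustration.
SAMPLE_PAGE = ('<div><a class="page-numbers" href="/page/1/">1</a>'
               '<a class="next page-numbers" href="/page/3/">Next</a></div>')
LAST_PAGE = '<div><a class="page-numbers" href="/page/1/">1</a></div>'

print(find_next_url('http://example.com/page/2/', SAMPLE_PAGE))
print(find_next_url('http://example.com/page/9/', LAST_PAGE))
```

On the last page no such link exists, the lookup returns None, and the spider stops issuing new requests, which is why the parse method checks new_url before yielding.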

items.py defines the item model for the crawl, as follows:

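import scrapy


class JobboleItem(scrapy.Item):
    # define the fields for your item here like:
    # article links; field inferred from the item['page_urls'] lookups in the spider and pipeline
    page_urls = scrapy.Field()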

pipelines.py defines how the crawled links are used to fetch the desired content, as follows:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import os
import urllib.request

from bs4 import BeautifulSoup

from jobbole import settings


class JobbolePipeline(object):
    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.PAGES_STORE, spider.name)  # storage path
        print('dir_path', dir_path)
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for page_url in item['page_urls']:
            # Assumption: the source omits the download step; urllib is imported, so a plain fetch is used here.
            html = urllib.request.urlopen(page_url).read()
            soup1 = BeautifulSoup(html, 'html.parser')
            # The attrs filters are missing in the source; supply the attributes of the title and body divs.
            headitems = soup1.find("div", attrs={}).get_text()
            print(headitems)
            list_name = page_url.split('/')
            print('listname', list_name)
            file_name = headitems.strip('\n') + '.txt'  # strip newlines so the title can serve as the file name
            print('filename', file_name)
            file_path = '%s/%s' % (dir_path, file_name)
            print('filepath', file_path)
            if os.path.exists(file_path):  # skip articles that were already saved
                continue
            with open(file_path, 'wb') as file_writer:
                # Important: encode the crawled text as gb18030 before writing the bytes.
                content = soup1.find("div", attrs={}).get_text().encode("gb18030", 'ignore')
                file_writer.write(content)
        return item
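
The file-writing step of the pipeline can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper save_article that is not part of the original project: it turns a title into a file name, skips files that already exist, and writes the body as gb18030 bytes. Replacing characters that Windows forbids in file names is an addition of this sketch, not something the original code does.

```python
import os
import re
import tempfile


def save_article(dir_path, title, body, encoding='gb18030'):
    """Write body to <dir_path>/<title>.txt and return the path.

    Returns None without writing when the file already exists.
    """
    # Drop surrounding whitespace, then replace characters Windows forbids in file names.
    safe_title = re.sub(r'[\\/:*?"<>|\r\n]', '_', title.strip())
    file_path = os.path.join(dir_path, safe_title + '.txt')
    if os.path.exists(file_path):
        return None
    os.makedirs(dir_path, exist_ok=True)
    with open(file_path, 'wb') as file_writer:
        # errors='ignore' silently drops characters the codec cannot encode.
        file_writer.write(body.encode(encoding, 'ignore'))
    return file_path


# Usage with a temporary directory and made-up article text.
with tempfile.TemporaryDirectory() as tmp_dir:
    first = save_article(tmp_dir, 'Hello: Scrapy\n', 'article body')
    second = save_article(tmp_dir, 'Hello: Scrapy\n', 'article body')
    print(os.path.basename(first))
    print(second)
```

The existence check is made against the full path rather than the bare file name, so a second run over the same article is skipped instead of overwriting the saved file.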

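Run scrapy crawl followed by the spider name (here, scrapy crawl jobbole), and the articles are saved in bulk to the specified folder, with each file named after its article title.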
