Scrapy basics: crawling Dangdang (dangdang.com)

2021-10-02 22:37:06 · 3,210 words · 8,415 reads

XPath vs. regular expressions, briefly compared

1. XPath expressions are more efficient.

2. Regular expressions are more powerful.

3. In general, prefer XPath; fall back to a regex only when XPath cannot do the job.
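As a minimal illustration of the trade-off, the same title can be pulled out with an XPath-style query or with a regex. This sketch uses only Python's standard library (ElementTree supports a limited XPath subset); a real Scrapy project would use response.xpath instead. The snippet and its content are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet so ElementTree can parse it.
doc = "<html><head><title>dangdang books</title></head></html>"

# XPath-style extraction (ElementTree supports a limited XPath dialect).
title_xpath = ET.fromstring(doc).findtext("head/title")

# The same extraction with a regular expression.
title_re = re.search(r"<title>(.*?)</title>", doc).group(1)

print(title_xpath)  # dangdang books
print(title_re)     # dangdang books
```

Both approaches work here; the XPath query stays readable as the document structure gets deeper, which is why it is the usual first choice.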

XPath extraction rules

1. / extracts level by level.

2. text() extracts the text inside a tag, e.g.

/html/head/title/text()

3. //tagname extracts every tag with that name, wherever it appears.

4. //tagname[@attr='value'] extracts only the tags whose attribute has the given value.

@attr selects an attribute itself.

For example, to select the div tags whose class is tools:

//div[@class='tools']
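The rules above can be tried out with Python's standard-library ElementTree, which understands a limited XPath dialect (unlike Scrapy's selectors, it cannot select attributes with /@attr; you read attributes with .get() instead). The HTML snippet and values here are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<html><body>
  <div class="tools"><a title="book one" href="/1.html">book one</a></div>
  <div class="other"><a title="book two" href="/2.html">book two</a></div>
</body></html>"""

root = ET.fromstring(doc)

# //tagname : every <a> anywhere in the tree
links = root.findall(".//a")

# //tagname[@attr='value'] : only the div whose class is 'tools'
tools_div = root.find(".//div[@class='tools']")

# @attr : read an attribute from a matched element
titles = [a.get("title") for a in links]

print(len(links))         # 2
print(tools_div[0].text)  # book one
print(titles)             # ['book one', 'book two']
```

In Scrapy itself the attribute rule is written directly into the query, e.g. response.xpath("//a/@title").extract().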

MySQL setup (at the mysql command line):

enter password:  // log-in password; initially root
create database dangdang;  // create the database
use dangdang;  // switch to it
create table goods(
    id int(32) auto_increment primary key,
    title varchar(100),
    link varchar(100) unique,
    comment varchar(100)
);  // create the goods table that stores the scraped records
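If no MySQL server is at hand, the same schema can be sketched in SQLite to see how the goods table behaves. Note the syntax differences (INTEGER PRIMARY KEY AUTOINCREMENT instead of int(32) auto_increment); this is an illustration with made-up sample data, not the MySQL statements the pipeline actually runs.

```python
import sqlite3

# SQLite equivalent of the goods table from the MySQL setup above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE goods(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title VARCHAR(100),
        link VARCHAR(100) UNIQUE,
        comment VARCHAR(100)
    )
""")

# Insert one sample row and read it back; id auto-increments from 1.
conn.execute("INSERT INTO goods(title, link, comment) VALUES(?, ?, ?)",
             ("some book", "http://example.com/1.html", "1200 comments"))
row = conn.execute("SELECT id, title FROM goods").fetchone()
print(row)  # (1, 'some book')
```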

Parts of the Scrapy project that change

1. items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items.
# See documentation in the Scrapy docs (the link was lost in the original post).
import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # three fields to hold the scraped data
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()

2. The spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from dangdang.items import DangdangItem  # adjust the package name to your project

class DdSpider(scrapy.Spider):
    # the class header, name and allowed_domains were lost in the original
    # post; this is a minimal reconstructed skeleton
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['']  # start URL (elided in the original)

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@dd_name='單品標題']/@title").extract()
        item["link"] = response.xpath("//a[@dd_name='單品標題']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        yield item
        # follow the next pages (the original comment says "first ten pages",
        # but range(2, 10) actually covers pages 2-9)
        for i in range(2, 10):
            url = '' + str(i) + '-cid4008149.html'  # base URL elided in the original
            yield Request(url, callback=self.parse)
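The pagination loop builds one URL per results page by splicing the page number into the path. The base URL is elided in the original post, so BASE below is a stand-in placeholder; the sketch only shows the string arithmetic.

```python
# BASE is a hypothetical placeholder, not the real Dangdang URL prefix.
BASE = "http://example.com/pg"

# range(2, 10) yields 2..9, i.e. eight follow-up pages.
urls = [BASE + str(i) + "-cid4008149.html" for i in range(2, 10)]

print(len(urls))  # 8
print(urls[0])    # http://example.com/pg2-cid4008149.html
```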

3. pipelines.py

# -*- coding: utf-8 -*-
import pymysql

# Define your item pipelines here.
# Don't forget to add this pipeline to the ITEM_PIPELINES setting.

class DangdangPipeline(object):
    def process_item(self, item, spider):
        # connect to the database
        conn = pymysql.connect(host="127.0.0.1", user="root",
                               passwd="root", db="dangdang")
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            # build the SQL statement
            sql = "insert into goods(title,link,comment) values('" \
                  + title + "','" + link + "','" + comment + "')"
            # print(sql)
            try:
                conn.query(sql)
            except Exception as err:
                print(err)
        conn.commit()  # pymysql does not autocommit by default
        conn.close()
        return item
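One caution about the pipeline above: building SQL by string concatenation breaks as soon as a title contains a quote, and it invites SQL injection. A parameterized query avoids both problems. This sketch uses sqlite3 so it runs standalone (with pymysql the placeholder is %s rather than ?, but the idea is the same); the item dict is a made-up stand-in for the Scrapy item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE goods(title TEXT, link TEXT, comment TEXT)")

# Stand-in for the Scrapy item; note the apostrophe that would break
# the concatenated SQL string in the pipeline above.
item = {"title": ["it's a title"], "link": ["/x.html"], "comment": ["5 comments"]}

# zip() walks the three parallel lists together, replacing range(len(...)).
for title, link, comment in zip(item["title"], item["link"], item["comment"]):
    conn.execute("INSERT INTO goods(title, link, comment) VALUES(?, ?, ?)",
                 (title, link, comment))

first = conn.execute("SELECT title FROM goods").fetchone()[0]
print(first)  # it's a title
```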
