Scraping Lianjia data with the Scrapy framework

# -*- coding: utf-8 -*-
import scrapy
from pachong6.items import pachong6item


class lianjiaspider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['m.lianjia.com']
    # The listing URL prefix is blank in the source; the page number (1 to 3) is appended to it.
    start_urls = ['' + str(x) for x in range(1, 4)]

    def parse(self, response):
        # Each matched node is one agent entry on the listing page.
        agentlist = response.xpath('//*[@class="jingjiren-list__agent-item"]')
        for agent in agentlist:
            item = pachong6item()
            # These XPaths are relative, so they are evaluated against the current agent node.
            item['name'] = agent.xpath('div/div/div[2]/div[1]/span/text()').extract_first()
            item['region'] = agent.xpath('div/div/div[2]/p/text()').extract_first()
            item['tran_num'] = agent.xpath('div/div/div[2]/div[3]/div[1]/span/text()').extract_first()
            # print("Agent name:", item['name'])
            # print("Agent's region:", item['region'])
            # print("Agent's historical transaction count:", item['tran_num'])
            yield item
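
The per-agent extraction in the spider can be tried out without running a crawl. The following is only a rough sketch under stated assumptions: the markup in SAMPLE is invented for illustration (the real page structure is not shown in this post), and the standard library's xml.etree.ElementTree is swapped in for Scrapy's selectors, with findtext() standing in for xpath('.../text()').extract_first(). Both return None when nothing matches, which is what happens to tran_num for the second agent here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Lianjia page is not reproduced in the post.
SAMPLE = """
<ul>
  <li class="jingjiren-list__agent-item">
    <div><div>
      <div>photo</div>
      <div>
        <div><span>Agent A</span></div>
        <p>Region X</p>
        <div>tags</div>
        <div><div><span>12</span></div></div>
      </div>
    </div></div>
  </li>
  <li class="jingjiren-list__agent-item">
    <div><div>
      <div>photo</div>
      <div>
        <div><span>Agent B</span></div>
        <p>Region Y</p>
      </div>
    </div></div>
  </li>
</ul>
"""


def parse_agents(root):
    """Collect one dict per agent node, using the same relative paths as the spider."""
    agents = []
    for agent in root.findall(".//*[@class='jingjiren-list__agent-item']"):
        agents.append({
            # Paths are relative to the current agent node, as in the spider.
            "name": agent.findtext("div/div/div[2]/div[1]/span"),
            "region": agent.findtext("div/div/div[2]/p"),
            "tran_num": agent.findtext("div/div/div[2]/div[3]/div[1]/span"),
        })
    return agents


agents = parse_agents(ET.fromstring(SAMPLE.strip()))
for agent in agents:
    print(agent)
```

Because a missing node yields None rather than an error, it may be worth checking for None before storing a field such as tran_num.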

Storing the data in MongoDB:

# -*- coding: utf-8 -*-
from pymongo import MongoClient


class pachong6pipeline(object):
    # Connect to MongoDB and select the database and collection in open_spider;
    # the same setup could also be done in __init__.
    def open_spider(self, spider):
        # 'datahost' stands in for a variable name that is masked in the source as "da****".
        datahost = '127.0.0.1'
        dataport = 27017
        dbname = 'lianjia_db'
        sheetname = 'collections_db'
        # Get a database connection
        self.client = MongoClient(datahost, dataport)
        # Select the database
        self.db = self.client[dbname]
        # Select the collection
        self.collection = self.db[sheetname]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Convert the item to a dict, then insert it
        # print("item data:", item)
        self.collection.insert_one(dict(item))
        # Return the item so that any later pipeline still receives it.
        return item
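
Scrapy only calls a pipeline that is registered in the project's settings.py. A minimal sketch of that registration, assuming the class lives in pachong6/pipelines.py (the default project layout; the post does not show the file path):

```python
# settings.py (fragment)
# The dotted path is an assumption based on the default Scrapy project layout.
ITEM_PIPELINES = {
    "pachong6.pipelines.pachong6pipeline": 300,
}
```

The integer sets the run order when several pipelines are enabled: lower values run first, and values are conventionally kept in the 0 to 1000 range.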
