Building a Small Web Crawler on GAE with Django

2021-08-30 10:11:17

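Next comes a very important module: bootstrap.py. We don't need to understand exactly how it works. Once a request comes in, it maps the URL to Django automatically, so no other URLs need to be configured here. From that point on, Django's own urls module takes over.

My views module code: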

\s+(\d)室(\d)廳(\d)衛\s+(.?).?(.?)\s+"""

    items = re.compile(re_s, re.M | re.S).findall(data)
    return items

def sub_parse(data):
    # The attribute filters originally passed to find() are missing from
    # this listing, so the calls are kept as published.
    soup = BeautifulSoup(data)
    stag = soup.find('span',)
    p = soup.find('span',)
    if p:
        p = p.string.strip()
    try:
        price = float(p)
    except (TypeError, ValueError):
        price = 0
    # Pull the timestamp out directly: splitting on ':' and taking the
    # last piece would keep only the minutes of "HH:MM".
    str_date = re.search(r'\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{2}',
                         stag.string).group()
    pub_date = datetime.strptime(str_date, '%Y-%m-%d %H:%M')
    ctag = soup.find('code')
    content = returntext(ctag)
    table = soup.find('table',)
    trs = table.findAll('tr')
    # First row: the district link and, when present, the area-name link.
    atag = trs[0].findAll('a')
    subregion = atag[0].string.strip()
    area_name = atag[1].string.strip() if len(atag) > 1 else ''
    try:
        community = trs[1].find('a').string.strip()
    except AttributeError:
        # No link in the second row: fall back to the text of its last cell.
        dtag = trs[1].findAll('td')[-1]
        community = dtag.string.strip()
    room_layout = 0  # default, so the name is bound even without a layout row
    for tr in trs[2:]:
        txt = tr.find('td').string.strip()
        if txt == u'居室:' or txt == u'房型:':
            tmp = tr.findAll('td')[-1].string.strip()
            num = tmp.split(u'室')[0]
            # `dict` is evidently a module-level lookup table defined in a
            # part of the module that is not shown; it shadows the built-in.
            value = dict[num]
            n = re.sub(r'\D', '', tmp)  # keep only the digits
            room_layout = int(str(value) + str(n))
            break
    for tr in trs[2:]:
        txt = tr.find('td').string.strip()
        if txt == u'建築面積:' or txt == u'合租情況:':
            tmp = tr.findAll('td')[-1].string.strip()
            area = re.sub(r'\D', '', tmp)  # keep only the digits
            area = int(area) if area else 0
            break
    else:
        # The loop finished without finding an area row.
        area = 0
    return (price, pub_date, content, subregion, area_name, community, area, room_layout)

def deal_func(item):
    try:
        price = float(item[0])
    except (TypeError, ValueError):
        price = 0
    link = item[4]
    # The numeric listing ID is whatever digits appear in the last path segment.
    tmp = link.split('/')[-1]
    original = int(re.sub(r'\D', '', tmp))
    title = item[5]
    # Fetch and parse the detail page for the remaining fields.
    data = open_page(link)
    result = sub_parse(data)
    source_id = 1
    article = articles()
    article.price = price
    article.titles = title
    article.link = link
    article.pub_date = result[1]
    article.source_id = source_id
    article.content = result[2]
    article.region = 'beijing'
    article.subregion = result[3]
    article.area = result[6]
    article.area_name = result[4]
    article.community = result[5]
    article.room_layout = result[7]
    article.crawl_date = datetime.now()
    article.original_id = original
    article.put()

def open_page(url):
    res = urlfetch.fetch(url)
    data = res.content
    return data

def home_page(request):
    url = ""
    data = open_page(url)
    items = parse(data)
    for item in items:
        deal_func(item)
    variable = 'if you want to check the data,please click the link'
    # The template context argument is missing from this listing.
    return shortcuts.render_to_response('index.html',)

def result_page(request):
    temp = db.GqlQuery("select * from articles")
    results = temp.fetch(1000)
    # The template context argument is missing from this listing.
    return shortcuts.render_to_response('result.html',)
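
The views module relies on three small text-handling techniques: a `findall` over the whole page that yields one tuple per listing, stripping non-digit characters before calling `int`, and pulling a timestamp out of a labelled string. The following is a minimal standalone sketch of those techniques in plain Python. The sample markup, the pattern, and the names `LISTING_RE`, `STAMP_RE`, `digits_only`, and `parse_pub_date` are made up for illustration and are not the author's.

```python
import re
from datetime import datetime

# Hypothetical sample markup; the real page layout is not shown in the article.
SAMPLE = u"""
<li><a href="http://example.com/rent/12345.html">Two rooms near the park</a>
    <b>3500</b></li>
<li><a href="http://example.com/rent/67890.html">Studio</a>
    <b>1800</b></li>
"""

# re.S lets '.' cross line breaks; re.M makes ^ and $ match at each line.
LISTING_RE = re.compile(
    r'<a href="(.*?)">(.*?)</a>\s+<b>(\d+)</b>', re.M | re.S)

# A 'YYYY-MM-DD HH:MM' stamp, wherever it sits in the string.
STAMP_RE = re.compile(r'\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{2}')


def digits_only(text):
    """Drop every non-digit character and return an int (0 if none remain)."""
    digits = re.sub(r'\D', '', text)
    return int(digits) if digits else 0


def parse_pub_date(text):
    """Find the timestamp inside a labelled string and parse it."""
    match = STAMP_RE.search(text)
    if match is None:
        raise ValueError('no timestamp in %r' % (text,))
    return datetime.strptime(match.group(), '%Y-%m-%d %H:%M')


# One (link, title, price) tuple per listing.
items = LISTING_RE.findall(SAMPLE)
# Numeric ID taken from the last path segment of each link.
ids = [digits_only(link.split('/')[-1]) for link, _, _ in items]
```

Searching for the timestamp rather than splitting on a colon keeps the helper indifferent to which kind of colon follows the label on the page.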

The urls module:

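from django.conf.urls.defaults import *
from views import *

# patterns() takes a view-name prefix as its first argument; an empty
# string is used here because the views are passed as callables.
urlpatterns = patterns(
    '',
    (r'^$', home_page),
    (r'^result$', result_page),
)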
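
To make the dispatch step concrete, here is a toy sketch of what a URL table like this one does: each pattern is tried in order against the request path without its leading slash, and the first match selects the view. This is plain Python written for illustration, not Django's actual resolver; `URL_TABLE`, `resolve`, and the stub views are hypothetical.

```python
import re


def home_page(request):
    # Stub standing in for the real view.
    return 'home'


def result_page(request):
    # Stub standing in for the real view.
    return 'result'


# The same two patterns as the urls module, paired with the stub views.
URL_TABLE = [
    (r'^$', home_page),
    (r'^result$', result_page),
]


def resolve(path):
    """Return the first view whose pattern matches the path, or None."""
    path = path.lstrip('/')  # patterns are written without the leading slash
    for pattern, view in URL_TABLE:
        if re.search(pattern, path):
            return view
    return None
```

Because `^result$` is anchored at both ends, a request for `/result/` with a trailing slash does not match in this sketch.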

The settings module:

Note that because this runs on GAE, many of the admin components and some of the middleware cannot be used, so I commented them all out.
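
The settings listing itself is not reproduced here. As a rough sketch of what such a change often looks like, the fragment below comments out the session, auth, and admin entries, which depend on Django's relational ORM rather than the GAE datastore. Which entries the author actually disabled is an assumption, as is the `ROOT_URLCONF` value; treat the whole fragment as illustrative.

```python
# settings.py (fragment) -- illustrative only; the author's real file is not shown.

MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    # Assumed to be disabled: both rely on Django's database-backed models.
    # 'django.contrib.sessions.middleware.SessionMiddleware',
    # 'django.contrib.auth.middleware.AuthenticationMiddleware',
)

INSTALLED_APPS = (
    # Assumed to be disabled for the same reason.
    # 'django.contrib.auth',
    # 'django.contrib.contenttypes',
    # 'django.contrib.sessions',
    # 'django.contrib.admin',
)

# Assumed: the urls module sits at the project root, next to views.
ROOT_URLCONF = 'urls'
```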
