Scraping Han Han's Blog

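# -*- coding:utf-8 -*-
__author__ = 'fybhp'
import os
import urllib2

from allfiledir import allfilrdir

# NOTE: the string passed to content.find() to locate each article link is
# missing from the source. Set marker to the text that precedes every link.
marker = r''

links = [''] * 7
n = 0
page = 0
while n < 7:
    # Put the URLs of the seven blog index pages into the links list.
    # NOTE: the URL prefix is empty in the source and must be supplied.
    links[n] = '' + str(n + 1) + '.html'
    n += 1
while page < 7:
    # Directory to create for this index page.
    mkpath = allfilrdir + '/' + str(page + 1)
    if not os.path.exists(mkpath):
        os.mkdir(mkpath)
        print mkpath + u' created'
    else:
        print mkpath + u' already exists'
    # content is the HTML string of the blog index page.
    content = urllib2.urlopen(links[page]).read()
    url = [''] * 100
    i = 0
    set_local = content.find(marker)
    start = content.find(r'href', set_local)
    end = content.find(r'html', start)
    # find() returns -1 (not 0) when nothing more is found.
    while set_local != -1 and i < 100:
        # start + 6 skips 'href="'; end + 4 keeps the trailing 'html'.
        url[i] = content[start + 6:end + 4]
        # The second argument of find() sets the start position, so the
        # search keeps moving forward through the string.
        set_local = content.find(marker, end)
        start = content.find(r'href', set_local)
        end = content.find(r'html', start)
        i += 1
    j = 0
    while j < 50 and url[j] != '':
        # Name the local file after the 'blog_...' part of the URL.
        filename = url[j][url[j].find(r'blog_'):]
        con = urllib2.urlopen(url[j]).read()
        with open(mkpath + '/' + filename, 'w') as f:
            f.write(con)
        print 'downloading', url[j]
        j += 1
    print 'page' + str(page + 1) + 'ok'
    page += 1
print 'all finished!'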
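
The forward-scanning use of find() in the script above can be tried without any network access. The following is only a minimal sketch, written in Python 3 rather than the script's Python 2. The function name extract_links, the sample HTML, and the marker '<a title=' are all assumptions made for illustration; they are not taken from the original script, whose marker string is missing.

```python
def extract_links(content, marker, limit=100):
    """Collect the href value that follows each occurrence of marker."""
    urls = []
    pos = content.find(marker)
    # find() returns -1 once the marker no longer occurs.
    while pos != -1 and len(urls) < limit:
        start = content.find('href="', pos)
        if start == -1:
            break
        start += len('href="')          # skip past the opening quote
        end = content.find('"', start)  # closing quote of the attribute
        if end == -1:
            break
        urls.append(content[start:end])
        # Resume the search after the link just read, so the scan
        # only ever moves forward through the string.
        pos = content.find(marker, end)
    return urls


# Hypothetical sample page; example.com stands in for the real blog host.
sample = (
    '<a title="first" href="http://example.com/blog_01.html">first</a>'
    '<a title="second" href="http://example.com/blog_02.html">second</a>'
)
links = extract_links(sample, '<a title=')
print(links)
```

Collecting results in a list that grows as needed avoids the fixed-size url array and the empty-string sentinel, at the cost of departing slightly from the original layout.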

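Scraping Blog Comments

Fetching data by capturing network traffic also requires finding the real URL, which is usually listed under XHR in the browser's Network panel. The code imports requests and json, defines link and headers, calls r = requests.get(link, headers=headers), and prints the page's response status code, r.status_code. At this point...

Scraping CSDN Blogs with WebCollector

Scraping news sites and blogs is a common data collection requirement, and also the easiest one to implement. Some developers do this with tools such as HttpClient and jsoup, but most of those implementations are single-threaded crawlers that handle URL deduplication and resumable crawling poorly. A crawler framework solves these problems well: open-source crawler frameworks usually ship with a stable thread pool, a URL deduplication mechanism...

Scraping CSDN Blog Articles with Python

Create a new module that gets the article URLs for a given user name. It starts with # coding: utf-8, from bs4 import BeautifulSoup, and import requests, then defines get_page_size(user_name) to get the number of blog articles, beginning with article_list_url = user_n...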
