Python Crawler: CSDN Series III

The core code was already covered in the previous two articles, and it is not particularly difficult.
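# -*- coding: utf-8 -*-
import sys
import os
import codecs
import urllib
import urllib2
import cookielib
import MySQLdb
import re
from bs4 import BeautifulSoup
from article import csdnarticle

# Python 2 only: make utf-8 the default for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf-8')

# The attrs filters passed to find()/find_all() were lost from this listing.
# The two dicts below are placeholders (assumptions), not the original
# values; set them to match the blog's markup before running.
LIST_ITEM_ATTRS = {}
TITLE_LINK_ATTRS = {}


class csdncrawler(object):
    def __init__(self, author='whiterbear'):
        self.author = author
        # Base URL of the blog; the value is empty in this listing
        self.domain = ''
        # Not defined in the listing; a browser-like User-Agent is assumed
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.articles = []

    # Given a url, collect every article listed on that page
    def getarticlelists(self, url=None):
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(response.read())
        listitem = soup.find(id='article_list').find_all(attrs=LIST_ITEM_ATTRS)
        href_regex = r'href="(.*?)"'
        for item in listitem:
            link = item.find(attrs=TITLE_LINK_ATTRS).contents[0]
            enitem = link.contents[0]
            href = re.search(href_regex, str(link)).group(1)
            art = csdnarticle()
            art.author = self.author
            art.title = enitem.lstrip()
            art.href = (self.domain + href[1:]).lstrip()
            # Needed by getallarticles(); the append is not shown in the listing
            self.articles.append(art)

    # Walk the list pages of the blog and collect the articles on each one
    def getpagelists(self, url=None):
        # The URL prefix appears to have been stripped from this listing
        url = url if url else '%s?viewmode=list' % self.author
        req = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(req)
        soup = BeautifulSoup(response.read())
        num_regex = r'[1-9]\d*'
        pagelist = soup.find(id='papelist')
        # The first page is always processed
        self.getarticlelists(url)
        if pagelist:
            # The second number in the pager text is taken as the page count
            pagenum = int(re.findall(num_regex, pagelist.contents[1].contents[0])[1])
            for i in range(2, pagenum + 1):
                self.getarticlelists(self.domain + self.author + '/article/list/%s' % i)

    def getallarticles(self):
        # Create a folder named after the author to store the author's articles
        if not os.path.exists(self.author):
            os.mkdir(self.author)
        for subarticle in self.articles:
            articleurl = subarticle.href
            req = urllib2.Request(articleurl, headers=self.headers)
            response = urllib2.urlopen(req)
            soup = BeautifulSoup(response.read())
            article_content = soup.find(id='article_content')
            title = subarticle.title.rstrip().encode('utf-8')
            # Wrap the extracted content in an HTML-formatted string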
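The listing stops at the step that wraps the extracted content in an HTML string. The following is only a minimal sketch of what that step might look like, not the author's code. It is written for Python 3 with the standard library only (the listing itself is Python 2), and the names `wrap_as_html`, `safe_filename`, and `save_article` are hypothetical helpers introduced for illustration.

```python
import html
import io
import os
import re


def wrap_as_html(title, body_html):
    """Wrap an extracted article body in a minimal standalone HTML page."""
    return (
        '<!DOCTYPE html>\n'
        '<html>\n<head>\n<meta charset="utf-8">\n'
        '<title>{title}</title>\n</head>\n'
        '<body>\n<h1>{title}</h1>\n{body}\n</body>\n</html>\n'
    ).format(title=html.escape(title), body=body_html)


def safe_filename(title):
    """Replace characters that common file systems do not allow in names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'


def save_article(directory, title, body_html):
    """Write the wrapped page to <directory>/<title>.html as UTF-8."""
    if not os.path.exists(directory):
        os.makedirs(directory)
    path = os.path.join(directory, safe_filename(title) + '.html')
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(wrap_as_html(title, body_html))
    return path
```

In the crawler, `body_html` would be the string form of the `article_content` element and `directory` would be the author folder. Declaring `charset="utf-8"` in the page and writing the file as UTF-8 keeps the saved copy readable in a browser regardless of the system locale.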
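1> Chinese encoding problems. Although I already know several ways to deal with encoding issues, I still get stuck on them regularly.
2> Keep the code orthogonal. I have not worked on a large project yet, but I can already feel the benefit: the more orthogonal two modules are, the less a change in one affects the normal operation of the other. Aiming for this forces you to think through a clear structure instead of writing a tangled mess of code.
3> A common illusion. I keep thinking "this is simple, I can finish it today", and then I run into one problem after another. I still lack experience.
4> Other: keep the code clean and work iteratively. Start from a small piece of code and add new code bit by bit, and always keep two versions, one of which prints a lot of output to help you confirm what happens at each step.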
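For the encoding problem in point 1, one approach that often helps is to decode bytes to text once, at the point where the response is read, and to encode again only when writing output. The sketch below illustrates that idea in Python 3 and is not taken from the crawler. `decode_page` is a hypothetical helper, and the fallback order (declared charset, then utf-8, then gbk) is an assumption about what Chinese pages commonly use.

```python
import re


def decode_page(raw, declared=None):
    """Decode raw response bytes to text.

    Tries the charset passed in by the caller, then one found in the first
    2 KB of the page, then utf-8 and gbk. If all of them fail, falls back
    to utf-8 with undecodable bytes replaced.
    """
    candidates = [declared] if declared else []
    meta = re.search(br'charset=["\']?([\w-]+)', raw[:2048])
    if meta:
        candidates.append(meta.group(1).decode('ascii'))
    candidates += ['utf-8', 'gbk']
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or wrong guess: try the next candidate
            continue
    return raw.decode('utf-8', errors='replace')
```

With this in place, the rest of the program only ever handles text, and the encoding is chosen explicitly again when a file is opened for writing.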

The next series will probably be a Weibo crawler, which will involve the related data processing and analysis. Hopefully it goes smoothly.