Personal Zhihu, Basics 9: Getting Started with Crawlers Using PySpider

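Install: pip install pyspider

scheduler: the scheduler; it decides which URL is handed out for processing next

processor: processes each fetched page and extracts new URLs from it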
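The division of labour between the two roles can be sketched with the standard library alone. This is only a toy illustration, not pyspider's implementation or API: `FAKE_SITE`, `process` and `crawl` are hypothetical names, and the "pages" are an in-memory dict rather than real HTTP responses.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> URLs linked from that page.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}


def process(url):
    """Processor role: 'parse' one page and return the URLs found on it."""
    return FAKE_SITE.get(url, [])


def crawl(start_url):
    """Scheduler role: hand out every URL exactly once, in first-seen order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for new_url in process(url):
            # Only schedule URLs that have not been queued before.
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return visited


print(crawl("http://example.com/"))
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other.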


References:

pyquery

CSS selector reference

Structure of a fetched page
doc
url
text
header
cookies
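CSS selectors: picking out tags
Select exactly the HTML tags you want
.class: elements with class='class'
#id: the element with id='id'
div.inner a[href^="http://"]: a tags under div.inner whose href starts with "http://"
p>div>span: a span directly inside a div directly inside a p (each step is exactly one level)
p div: any div nested inside a p; a direct parent-child relationship is not required
[target=_blank]: elements with target=_blank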

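# Example
from pyquery import PyQuery

q = PyQuery(open('v2ex.html').read())
print(q('title').text())
for each in q('div.inner>a').items():
    # get an attribute
    print(1, each.attr.href)
    # get the inner HTML (use each.text() for plain text only)
    print(2, each.html())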
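To make the meaning of the child combinator in `div.inner>a` concrete, here is a standard-library sketch that collects the same kind of result by hand. It deliberately swaps pyquery for `html.parser`, so it is not how pyquery works internally. `SAMPLE_HTML`, `VOID_TAGS` and `InnerLinkParser` are hypothetical names, the markup is made up rather than taken from v2ex, and only a few void tags are handled.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a saved page.
SAMPLE_HTML = """
<html><head><title>demo</title></head><body>
<div class="inner"><a href="http://example.com/a">first</a></div>
<div class="inner"><p><a href="http://example.com/nested">nested</a></p></div>
<div class="other"><a href="/relative">second</a></div>
</body></html>
"""

# Tags that never get a closing tag; a small, incomplete set for this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class InnerLinkParser(HTMLParser):
    """Collect (href, text) for <a> tags that are direct children of div.inner."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, attrs) for every currently open element
        self.links = []        # finished (href, text) pairs
        self._current = None   # [href, text parts] while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if tag == "a" and self.stack:
            parent_tag, parent_attrs = self.stack[-1]
            classes = (parent_attrs.get("class") or "").split()
            # '>' means the parent itself must be div.inner, not any ancestor.
            if parent_tag == "div" and "inner" in classes:
                self._current = [attrs.get("href"), []]
        self.stack.append((tag, attrs))

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self._current is not None:
            href, parts = self._current
            self.links.append((href, "".join(parts)))
            self._current = None
        # Pop back to the element being closed.
        while self.stack:
            if self.stack.pop()[0] == tag:
                break


parser = InnerLinkParser()
parser.feed(SAMPLE_HTML)
for href, text in parser.links:
    print(href, text)
```

The link nested inside a p is skipped because its parent is the p, not the div. With the descendant form `div.inner a` it would match as well.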

Embedded SQL in Python

import random

import MySQLdb

# Connect to the database
db = MySQLdb.connect('localhost', 'root', 'nowcoder', 'wenda', charset='utf8')
try:
    # A cursor executes statements and walks through multi-row results
    cursor = db.cursor()
    # Insert
    sql = ('insert into question(title, content, user_id, created_date, '
           'comment_count) values ("%s","%s",%d, %s, %d)') % (
        'title', 'content', random.randint(1, 10), 'now()', 0)
    # print(sql)
    cursor.execute(sql)
    # id of the row that was just inserted
    qid = cursor.lastrowid
    # Every transaction has to be committed to the database
    db.commit()
    print(qid)
# Exception handling
except Exception as e:
    print(e)
    # Roll back the transaction
    db.rollback()
# Close the connection
db.close()

# Query
db = MySQLdb.connect('localhost', 'root', 'nowcoder', 'wenda', charset='utf8')
try:
    cursor = db.cursor()
    sql = 'select * from question order by id desc limit 2'
    cursor.execute(sql)
    # fetchall returns the list of rows
    for each in cursor.fetchall():
        # each row is a tuple of column values
        for row in each:
            print(row)
    # db.commit()
except Exception as e:
    print(e)
    db.rollback()
db.close()
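The insert above builds its SQL with string formatting, which breaks as soon as a title contains a quote and is open to SQL injection once the values come from crawled pages. A safer pattern is to let the driver bind the values. The sketch below shows the same insert, commit, rollback and fetchall flow with bound parameters. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for MySQL so that it runs anywhere. The table layout and the helper names `make_demo_db`, `insert_question` and `latest_titles` are assumptions for illustration only.

```python
import random
import sqlite3


def make_demo_db():
    """Create an in-memory database with a cut-down question table (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE question ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT, "
        "user_id INTEGER, created_date TEXT, comment_count INTEGER)"
    )
    return db


def insert_question(db, title, content):
    """Insert one row using bound parameters and return the new row id."""
    cursor = db.cursor()
    try:
        cursor.execute(
            "INSERT INTO question "
            "(title, content, user_id, created_date, comment_count) "
            "VALUES (?, ?, ?, datetime('now'), ?)",
            (title, content, random.randint(1, 10), 0),
        )
        db.commit()
        return cursor.lastrowid
    except sqlite3.Error:
        # Undo the partial transaction, then let the caller see the error.
        db.rollback()
        raise


def latest_titles(db, limit=2):
    """Return the titles of the newest rows, newest first."""
    cursor = db.cursor()
    cursor.execute(
        "SELECT title FROM question ORDER BY id DESC LIMIT ?", (limit,)
    )
    return [row[0] for row in cursor.fetchall()]


db = make_demo_db()
insert_question(db, 'a "quoted" title', "content")
insert_question(db, "second", "content")
print(latest_titles(db))
db.close()
```

With MySQLdb the shape is the same, except that the placeholder is %s instead of ? and the values are still passed as the second argument to cursor.execute rather than formatted into the string.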

#v2ex

#Zhihu
