title: Scraping CSDN Articles
date: 2019-06-09 13:17:26
tags:
Find the article list page and extract each article's URL, then fetch each article, parse out its content, and save it locally. Everything is written in Python, using the pyquery library for parsing.
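Not from the original post, but as a minimal sketch of the fetch-and-parse step it describes: download a page with requests and wrap it in a PyQuery document. The list-page URL pattern and the User-Agent value below are assumptions.

import requests
from pyquery import PyQuery as pq

def fetch_doc(url):
    # Placeholder User-Agent: CSDN tends to reject clients without a
    # browser-like header (assumption, value not from the original post)
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    return pq(resp.text)

# Hypothetical list-page URL; CSDN list pages typically look like
# https://blog.csdn.net/<username>/article/list/<page>
doc = fetch_doc('https://blog.csdn.net/someuser/article/list/1')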
Trying to preserve the article's styling as well:
article = doc('.blog-content-box')
# article title
title = article('.title-article').text()
# article body
content = article('.article_content')
Saving the article:
dir = "f:/python-project/spiderlearner/csdnblogspider/article/"+title+'.txt'
with open(dir, 'a', encoding='utf-8') as file:
file.write(title+'\n'+content.text())
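Saving with content.text() keeps only plain text, so the styling this section set out to preserve is lost. A sketch of one way to keep the markup instead, assuming content is the pyquery selection from the snippet above (the path and HTML shell are illustrative):

html_path = "f:/python-project/spiderlearner/csdnblogspider/article/" + title + '.html'
# content.html() returns the element's inner HTML (or None when empty)
shell = ('<html><head><meta charset="utf-8"><title>%s</title></head>'
         '<body>%s</body></html>') % (title, content.html() or '')
with open(html_path, 'w', encoding='utf-8') as file:
    file.write(shell)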
Extracting the article URLs:
urls = doc('.article-list .content a')
return urls
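urls is a pyquery selection of <a> nodes; the actual link targets come out by iterating with .items() and reading each href, just as the integrated script below does:

for a in urls.items():
    # each item is itself a PyQuery object, so attributes come via .attr()
    print(a.attr('href'))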
Paginated crawling:
for i in range(3):
    print(i)
    main(offset=i + 1)
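The loop fetches three list pages back to back; when crawling more pages it is polite to pause between requests. A small, optional variation (not in the original):

import time

for i in range(3):
    main(offset=i + 1)
    time.sleep(1)  # brief pause between list pages to avoid hammering the server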
**Putting it all together**
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# @time     : 2019/6/8 11:00 PM
# @author   : 喜歡二福的滄月君 ([email protected])
# @filename : csdn.py
# @software : PyCharm
import requests
from pyquery import PyQuery as pq

def find_html_content(url):
    # The headers value was truncated in the original; a browser-like
    # User-Agent is a reasonable placeholder (assumption)
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = requests.get(url, headers=headers).text
    return html
def read_and_write_blog(html):
    doc = pq(html)
    article = doc('.blog-content-box')
    # article title
    title = article('.title-article').text()
    # article body
    content = article('.article_content')
    try:
        path = "f:/python-project/spiderlearner/csdnblogspider/article/" + title + '.txt'
        with open(path, 'a', encoding='utf-8') as file:
            file.write(title + '\n' + content.text())
    except Exception:
        print("Failed to save the article")
def geturls(url):
    content = find_html_content(url)
    doc = pq(content)
    urls = doc('.article-list .content a')
    return urls
def main(offset):
    url = 'blog list URL goes here' + str(offset)  # placeholder kept from the original
    urls = geturls(url)
    for a in urls.items():
        a_url = a.attr('href')
        print(a_url)
        html = find_html_content(a_url)
        read_and_write_blog(html)
if __name__ == '__main__':
    for i in range(3):
        print(i)
        main(offset=i + 1)
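One fragility worth noting: the title is pasted straight into a Windows file path, so a title containing characters such as ? : or | makes open() fail and lands in the except branch. A hypothetical helper (not in the original) that strips those characters first:

import re

def safe_filename(title):
    # replace characters Windows forbids in file names with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'

# e.g. path = base_dir + safe_filename(title) + '.txt'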