Python Crawler for the Biquge Novel Site (with Source Code)

2021-09-25 20:48:00

Libraries used:

# import required libraries
import requests
from pyquery import PyQuery as pq
import re
import os
from multiprocessing import Process
from redis import StrictRedis
import logging   # imported in the original post but never used below
import chardet   # imported in the original post but never used below

Global variables:

con = StrictRedis(host='localhost', port=6379, db=10, password='')
con2 = StrictRedis(host='localhost', port=6379, db=10, password='')  # same DB as con; redundant but kept as in the original
base_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '**')  # the directory name was elided as '**' in the original post
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder: the original headers dict was stripped when the post was published
base_url = ""  # the site's base URL was also stripped from the original post

To reduce code redundancy, two recurring problems are factored into helper functions:

inconsistent string encodings

non-conforming file names

# fix text-encoding problems
def fix_code(text):
    try:
        text = text.encode('iso-8859-1').decode('gbk')
    except (UnicodeEncodeError, UnicodeDecodeError) as e:
        if "latin-1" in str(e):
            # the text was never mis-decoded in the first place, leave it alone
            return text
        elif 'gbk' in str(e):
            print(e)
            # the raw bytes are UTF-8 rather than GBK
            return text.encode('iso-8859-1').decode()
    return text
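The round-trip in fix_code works because requests falls back to ISO-8859-1 when a response carries no charset, so re-encoding the mis-decoded text as ISO-8859-1 recovers the site's original GBK bytes. A quick self-contained check (the sample string is mine, not from the post):

    raw = "第一章 重生".encode('gbk')      # bytes as a GBK-encoded site would serve them
    garbled = raw.decode('iso-8859-1')     # how requests mis-decodes them by default
    print(fix_code(garbled))               # -> 第一章 重生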

# fix non-conforming file names
def format_filename(filename):
    # strip characters that are illegal or awkward in file names
    filename = re.sub(r"[、\"<>\/!,;:??「」\\'\*]", '', filename)
    if "(" in filename:
        # drop any parenthesised suffix; the original passed re.s as
        # re.sub's count argument by mistake, and its pattern lost the
        # backslash escapes, so it is reconstructed here
        filename = re.sub(r"\(.*?\)", '', filename, flags=re.S)
    return filename
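A quick example with a made-up chapter title: the ? is stripped by the character class, and the parenthesised suffix is removed by the second substitution:

    print(format_filename("第12章 風雲突變?(加更)"))   # -> 第12章 風雲突變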

The spider class that crawls a Biquge novel:

class Spider():
    """Crawl a single novel; its URL is popped from the Redis 'novel' set."""

    def __init__(self):
        self.base_url = base_url    # the actual value was stripped from the post, see the module-level globals
        self.headers = headers

    def get_article_url(self):
        # pop one novel URL from the pending set and mark it as taken
        url = con.spop('novel')
        if url:
            con2.sadd('down', url)
            return url.decode()
        return None
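For a quick standalone test you can seed the pending set by hand; the URL below is a made-up placeholder, not a real Biquge address:

    con.sadd('novel', 'http://example.com/book/1/')  # hypothetical URL, for testing only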

    def run(self):
        """Fetch the chapter URLs of one novel.

        :return:
        """
        url = self.get_article_url()
        while url:
            aa = requests.get(url, headers=self.headers).text
            # NOTE: the HTML tags inside this pattern were stripped when the
            # post was published; it captured two groups, the novel title
            # and the chapter-list block
            pattern = '...'
            results = re.findall(pattern, aa, re.S)[0]
            global caption
            caption = fix_code(results[0])  # shared with detail() below
            if not os.path.exists(os.path.join(base_file, caption)):
                os.makedirs(os.path.join(base_file, caption))  # one directory per novel
            # this pattern was stripped as well; it captured each chapter's
            # relative href and its title from the chapter-list block
            pattern = '...'
            res = re.findall(pattern, results[1], re.S)
            for i in res:
                title = fix_code(i[1])
                title_url = self.base_url + fix_code(i[0])
                # keep only real chapters such as "第12章 ..."
                if "第" in title and "章" in title:
                    self.detail(title_url)
            # move on to the next novel (missing in the original, which
            # would otherwise loop on the same URL forever)
            url = self.get_article_url()
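Since both regex patterns were lost, here is a hedged pyquery rewrite of the same extraction (pq is already imported). The selectors #info h1 for the novel title and #list dd a for the chapter links are assumptions based on common Biquge-style markup, not anything confirmed by the post:

    def run_with_pyquery(self):
        # same flow as run(), with pyquery in place of the lost regexes
        url = self.get_article_url()
        while url:
            doc = pq(requests.get(url, headers=self.headers).text)
            global caption
            caption = fix_code(doc('#info h1').text())   # assumed selector
            os.makedirs(os.path.join(base_file, caption), exist_ok=True)
            for a in doc('#list dd a').items():          # assumed selector
                title = fix_code(a.text())
                if "第" in title and "章" in title:
                    self.detail(self.base_url + a.attr('href'))
            url = self.get_article_url()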

    def detail(self, url):
        """Fetch the body text of one chapter and write it to a file.

        NOTE: most of this method was lost when the post was published;
        only the docstring and an `f.write(txts)` fragment survived.
        """
        pass
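Judging from the surviving f.write(txts) fragment, the helpers above, and the global caption set in run(), the lost body probably fetched the chapter page, fixed its encoding, and wrote the text into the novel's directory. A minimal sketch of that, assuming Biquge-style selectors (h1 for the chapter title, #content for the body), which the post itself does not confirm:

    def detail(self, url):
        # hedged reconstruction of the lost method body
        doc = pq(requests.get(url, headers=self.headers).text)
        title = format_filename(fix_code(doc('h1').text()))      # assumed selector
        txts = fix_code(doc('#content').text())                  # assumed selector
        path = os.path.join(base_file, caption, title + '.txt')  # caption comes from run()
        with open(path, 'w', encoding='utf-8') as f:
            f.write(txts)  # the one line that survived from the original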

Collecting the novel URLs and storing them in Redis (those interested can extend this into a distributed crawl):

# crawl every novel on Biquge
def get_caption():
    """Top-level novel categories:

    fantasy (玄幻)
    cultivation (修真)
    urban (都市)
    time-travel (穿越)
    online games (網遊)
    sci-fi (科幻)
    other (其他)

    # the two categories below are supersets of the ones above
    ranking lists
    finished novels

    :return: the set of novel URLs

    # the category paths below were collected manually
    """
    urls = ["/xuanhuanxiaoshuo/", "/xiuzhenxiaoshuo/", "/dushixiaoshuo/", "/chuanyuexiaoshuo/",
            "/wangyouxiaoshuo/", "/kehuanxiaoshuo/", "/qitaxiaoshuo/"]
    for url in urls:
        res = requests.get(base_url + url, headers=headers).text
        doc = pq(res)
        for cap_url in doc('.s2 a').items():
            cap_url = base_url + cap_url.attr.href
            # queue only novels that have not been taken yet; the original
            # used con2.sadd('down', cap_url) here, which inverts the test
            # (sadd returns 1 for a *new* member), so sismember is used instead
            if not con2.sismember('down', cap_url):
                con.sadd('novel', cap_url)  # Redis sets deduplicate for us

What I implemented here is a multi-process crawler; its throughput is only average. Interested readers can build on this crawler and combine async I/O, processes, and threads.
if __name__ == "__main__":
    spider = Spider()
    queue = []  # the initialiser was stripped in the original; the join loop below implies a list of processes
    for i in range(6):
        p = Process(target=spider.run)
        p.start()
        queue.append(p)  # reconstructed so the join loop has something to wait on
    for i in queue:
        i.join()
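Note that the __main__ block never calls get_caption(), so the 'novel' set must already hold URLs before the workers start. My reading of the intended two-step run, which the post does not state explicitly:

    # step 1: fill the Redis queue once
    get_caption()
    # step 2: start the six worker processes shown above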

That is the complete code of the Biquge crawler. Interested readers can copy and run it as-is, or modify it to improve throughput.

The code is not fully polished and still has some issues; feel free to copy it and find them for yourself.
