Python Crawler for the Biquge Novel Site (with Source Code)

2021-09-25 20:48:00

Libraries used:

# import required libraries
import requests
from pyquery import PyQuery as pq
import re
import os
from multiprocessing import Process
from redis import StrictRedis
import logging   # imported in the original post but never used below
import chardet   # imported in the original post but never used below

Global variables:

con = StrictRedis(host='localhost', port=6379, db=10, password='')
con2 = StrictRedis(host='localhost', port=6379, db=10, password='')  # same DB as con; redundant but kept as in the original
base_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '**')  # the directory name was elided as '**' in the original post
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder: the original headers dict was stripped when the post was published
base_url = ""  # the site's base URL was also stripped from the original post

To reduce code redundancy, two recurring problems are factored into helper functions:

inconsistent string encodings

non-conforming file names

# fix text-encoding problems
def fix_code(text):
    try:
        text = text.encode('iso-8859-1').decode('gbk')
    except (UnicodeEncodeError, UnicodeDecodeError) as e:
        if "latin-1" in str(e):
            # the text was never mis-decoded in the first place, leave it alone
            return text
        elif 'gbk' in str(e):
            print(e)
            # the raw bytes are UTF-8 rather than GBK
            return text.encode('iso-8859-1').decode()
    return text
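The round-trip in fix_code works because requests falls back to ISO-8859-1 when a response carries no charset, so re-encoding the mis-decoded text as ISO-8859-1 recovers the site's original GBK bytes. A quick self-contained check (the sample string is mine, not from the post):

    raw = "第一章 重生".encode('gbk')      # bytes as a GBK-encoded site would serve them
    garbled = raw.decode('iso-8859-1')     # how requests mis-decodes them by default
    print(fix_code(garbled))               # -> 第一章 重生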

# fix non-conforming file names
def format_filename(filename):
    # strip characters that are illegal or awkward in file names
    filename = re.sub(r"[、\"<>\/!,;:??「」\\'\*]", '', filename)
    if "(" in filename:
        # drop any parenthesised suffix; the original passed re.s as
        # re.sub's count argument by mistake, and its pattern lost the
        # backslash escapes, so it is reconstructed here
        filename = re.sub(r"\(.*?\)", '', filename, flags=re.S)
    return filename
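A quick example with a made-up chapter title: the ? is stripped by the character class, and the parenthesised suffix is removed by the second substitution:

    print(format_filename("第12章 風雲突變?(加更)"))   # -> 第12章 風雲突變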

The spider class that crawls a Biquge novel:

class Spider():
    """Crawl a single novel; its URL is popped from the Redis 'novel' set."""

    def __init__(self):
        self.base_url = base_url    # the actual value was stripped from the post, see the module-level globals
        self.headers = headers

    def get_article_url(self):
        # pop one novel URL from the pending set and mark it as taken
        url = con.spop('novel')
        if url:
            con2.sadd('down', url)
            return url.decode()
        return None
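For a quick standalone test you can seed the pending set by hand; the URL below is a made-up placeholder, not a real Biquge address:

    con.sadd('novel', 'http://example.com/book/1/')  # hypothetical URL, for testing only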

    def run(self):
        """Fetch the chapter URLs of one novel.

        :return:
        """
        url = self.get_article_url()
        while url:
            aa = requests.get(url, headers=self.headers).text
            # NOTE: the HTML tags inside this pattern were stripped when the
            # post was published; it captured two groups, the novel title
            # and the chapter-list block
            pattern = '...'
            results = re.findall(pattern, aa, re.S)[0]
            global caption
            caption = fix_code(results[0])  # shared with detail() below
            if not os.path.exists(os.path.join(base_file, caption)):
                os.makedirs(os.path.join(base_file, caption))  # one directory per novel
            # this pattern was stripped as well; it captured each chapter's
            # relative href and its title from the chapter-list block
            pattern = '...'
            res = re.findall(pattern, results[1], re.S)
            for i in res:
                title = fix_code(i[1])
                title_url = self.base_url + fix_code(i[0])
                # keep only real chapters such as "第12章 ..."
                if "第" in title and "章" in title:
                    self.detail(title_url)
            # move on to the next novel (missing in the original, which
            # would otherwise loop on the same URL forever)
            url = self.get_article_url()
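Since both regex patterns were lost, here is a hedged pyquery rewrite of the same extraction (pq is already imported). The selectors #info h1 for the novel title and #list dd a for the chapter links are assumptions based on common Biquge-style markup, not anything confirmed by the post:

    def run_with_pyquery(self):
        # same flow as run(), with pyquery in place of the lost regexes
        url = self.get_article_url()
        while url:
            doc = pq(requests.get(url, headers=self.headers).text)
            global caption
            caption = fix_code(doc('#info h1').text())   # assumed selector
            os.makedirs(os.path.join(base_file, caption), exist_ok=True)
            for a in doc('#list dd a').items():          # assumed selector
                title = fix_code(a.text())
                if "第" in title and "章" in title:
                    self.detail(self.base_url + a.attr('href'))
            url = self.get_article_url()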

    def detail(self, url):
        """Fetch the body text of one chapter and write it to a file.

        NOTE: most of this method was lost when the post was published;
        only the docstring and an `f.write(txts)` fragment survived.
        """
        pass
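Judging from the surviving f.write(txts) fragment, the helpers above, and the global caption set in run(), the lost body probably fetched the chapter page, fixed its encoding, and wrote the text into the novel's directory. A minimal sketch of that, assuming Biquge-style selectors (h1 for the chapter title, #content for the body), which the post itself does not confirm:

    def detail(self, url):
        # hedged reconstruction of the lost method body
        doc = pq(requests.get(url, headers=self.headers).text)
        title = format_filename(fix_code(doc('h1').text()))      # assumed selector
        txts = fix_code(doc('#content').text())                  # assumed selector
        path = os.path.join(base_file, caption, title + '.txt')  # caption comes from run()
        with open(path, 'w', encoding='utf-8') as f:
            f.write(txts)  # the one line that survived from the original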

Collecting the novel URLs and storing them in Redis (those interested can extend this into a distributed crawl):

# crawl every novel on Biquge
def get_caption():
    """Top-level novel categories:

    fantasy (玄幻)
    cultivation (修真)
    urban (都市)
    time-travel (穿越)
    online games (網遊)
    sci-fi (科幻)
    other (其他)

    # the two categories below are supersets of the ones above
    ranking lists
    finished novels

    :return: the set of novel URLs

    # the category paths below were collected manually
    """
    urls = ["/xuanhuanxiaoshuo/", "/xiuzhenxiaoshuo/", "/dushixiaoshuo/", "/chuanyuexiaoshuo/",
            "/wangyouxiaoshuo/", "/kehuanxiaoshuo/", "/qitaxiaoshuo/"]
    for url in urls:
        res = requests.get(base_url + url, headers=headers).text
        doc = pq(res)
        for cap_url in doc('.s2 a').items():
            cap_url = base_url + cap_url.attr.href
            # queue only novels that have not been taken yet; the original
            # used con2.sadd('down', cap_url) here, which inverts the test
            # (sadd returns 1 for a *new* member), so sismember is used instead
            if not con2.sismember('down', cap_url):
                con.sadd('novel', cap_url)  # Redis sets deduplicate for us

What I implemented here is a multi-process crawler; its throughput is only average. Interested readers can build on this crawler and combine async I/O, processes, and threads.
if __name__ == "__main__":
    spider = Spider()
    queue = []  # the initialiser was stripped in the original; the join loop below implies a list of processes
    for i in range(6):
        p = Process(target=spider.run)
        p.start()
        queue.append(p)  # reconstructed so the join loop has something to wait on
    for i in queue:
        i.join()
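Note that the __main__ block never calls get_caption(), so the 'novel' set must already hold URLs before the workers start. My reading of the intended two-step run, which the post does not state explicitly:

    # step 1: fill the Redis queue once
    get_caption()
    # step 2: start the six worker processes shown above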

That is the complete code of the Biquge crawler. Interested readers can copy and run it as-is, or modify it to improve throughput.

The code is not fully polished and still has some issues; feel free to copy it and find them for yourself.
