Python Web Crawling for Beginners

0x01 Environment Setup

import os
import requests
from lxml import etree
from urllib.parse import urljoin
import urllib.request

pip install package-name

For example: pip install requests lxml

0x02 Introduction

Here I wrote a crawler script that scrapes **.

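If the response text comes out garbled and nothing else fixes it, set the encoding manually. The page declares its encoding in the meta tag under head.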

response.encoding='gb2312'

In effect, this sets the encoding of ** to match the encoding of **.
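As a rough sketch of how the declared encoding could be read from the page instead of being hard-coded, the helper below scans the start of the raw bytes for a charset declaration. The function name declared_charset and the sample bytes are assumptions made for illustration; they do not come from the original script.

```python
import re

def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Only the start of the document is inspected, since the declaration sits in <head>.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r"<meta[^>]*charset=[\"']?([\w-]+)", head, re.IGNORECASE)
    return match.group(1).lower() if match else default

# Hypothetical page bytes standing in for response.content
sample = '<html><head><meta charset="gb2312"><title>图片</title></head></html>'.encode("gb2312")
encoding = declared_charset(sample)
text = sample.decode(encoding)   # decodes cleanly once the right charset is used
```

With requests, the result could be assigned to response.encoding before reading response.text. requests also exposes response.apparent_encoding, a guess based on the body, which may be enough in many cases.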

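The expression //div[@class='feilei_a']/a finds the div whose class is 'feilei_a' and stores the a elements under it (strictly speaking, references to those elements rather than their content) in a list. The steps below narrow the selection further.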

response=requests.get(url)
root=etree.HTML(response.text)
categorys=root.xpath("//div[@class='feilei_a']/a")

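Starting from the elements found above, query further to locate or extract content more precisely.

text() extracts the text content.

@href extracts the value of the href attribute.

Every xpath() call returns a list, so append [0] to take the first item.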

category_name=category.xpath("text()")[0]
category_href=category.xpath("@href")[0]
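The two-step lookup can be tried offline on a small piece of markup. The HTML below is made up to imitate the category bar (the real page is not shown in this post), so the link names and paths are assumptions.

```python
from lxml import etree

# Hypothetical markup imitating the category bar
html = """
<div class="feilei_a">
  <a href="/index.html">All</a>
  <a href="/cats/list.html">Cats</a>
  <a href="/dogs/list.html">Dogs</a>
</div>
"""

root = etree.HTML(html)
links = root.xpath("//div[@class='feilei_a']/a")   # list of <a> elements

pairs = []
for link in links:
    name = link.xpath("text()")[0]   # first text node inside the link
    href = link.xpath("@href")[0]    # value of the href attribute
    pairs.append((name, href))

print(pairs)
```

Because xpath() returns an empty list when nothing matches, indexing with [0] raises IndexError on a miss. Checking the list before indexing is a safer habit on real pages.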

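Download each image twice: first the thumbnail, then the full-size version, whose URL is obtained by replacing "_s.jpg" with ".jpg". In the file paths, "縮略" means thumbnail and "高畫質" means HD.

urllib.request.urlretrieve(img_src,path+"/"+"縮略"+img_name)
urllib.request.urlretrieve(img_src.replace("files","files").replace("_s.jpg",".jpg"),path+"高畫質"+"/"+img_name)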
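The URL rewrite can be pulled out into a small function and checked without touching the network. This is only a sketch: full_size_url and local_path are hypothetical helpers, and the example URL is invented.

```python
import os
from urllib.parse import urlparse

def full_size_url(thumb_url: str) -> str:
    """Map a thumbnail URL ending in _s.jpg to its full-size counterpart."""
    suffix = "_s.jpg"
    if thumb_url.endswith(suffix):
        return thumb_url[: -len(suffix)] + ".jpg"
    return thumb_url  # not a thumbnail URL, leave it unchanged

def local_path(folder: str, name: str, url: str) -> str:
    """Build a save path that keeps the file extension found in the URL."""
    extension = os.path.splitext(urlparse(url).path)[1]  # e.g. ".jpg"
    return os.path.join(folder, name + extension)

# Hypothetical thumbnail URL; the real site is not named in the post
thumb = "http://example.com/files/2020/kitten_s.jpg"
full = full_size_url(thumb)
```

The script passes img_name alone as the file name, so the saved files carry no extension. Something like local_path is one way to keep it.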

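Create the folder. The first argument is the path. The second, exist_ok, is an option: when it is True, no error is raised if the folder already exists, so the script can be run again.

os.makedirs(path,exist_ok=True)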
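A quick way to see the effect of exist_ok is to call os.makedirs twice on the same path. The sketch below works inside a temporary directory; the folder names are placeholders.

```python
import os
import tempfile

base = tempfile.mkdtemp()                  # throwaway directory for the demonstration
path = os.path.join(base, "img", "cats")   # hypothetical category folder

os.makedirs(path, exist_ok=True)   # creates img/ and img/cats/
os.makedirs(path, exist_ok=True)   # second call does nothing and raises no error

try:
    os.makedirs(path)              # exist_ok defaults to False
    raised = False
except FileExistsError:
    raised = True
```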

0x03 Posting the **

# -*- coding:utf-8 -*-
import os
import urllib.request

import requests
from lxml import etree
from urllib.parse import urljoin

url = ""  # start page of the site to crawl (left blank in the original)
response = requests.get(url)
# Cause of garbled text: the page is not UTF-8, so set the encoding by hand
response.encoding = 'gb2312'
root = etree.HTML(response.text)
categorys = root.xpath("//div[@class='feilei_a']/a")
categorys.pop(0)  # skip the first link
print(categorys)
for category in categorys:
    category_name = category.xpath("text()")[0]
    category_href = category.xpath("@href")[0]
    category_href = urljoin(url, category_href)  # turn the link into an absolute URL
    print(category_name, category_href)
    path = "img/" + category_name
    os.makedirs(path, exist_ok=True)
    os.makedirs(path + "高畫質", exist_ok=True)  # "高畫質" = HD folder
    page = 0
    while True:
        if page == 0:
            page_href = category_href
        else:
            # Build from the first-page URL so the suffixes do not pile up
            page_href = category_href.replace(".html", "_%s.html" % page)
        response = requests.get(page_href)
        response.encoding = 'gb2312'
        root = etree.HTML(response.text)
        imgs = root.xpath("//div[@id='container']/div/div/a")
        if not imgs:
            break  # an empty page means the last page has been passed
        for img in imgs:
            img_name = img.xpath("img/@alt")[0]
            img_src = img.xpath("img/@src2")[0]
            print("\t", img_name, img_src)
            # Thumbnail first ("縮略" = thumbnail), then the full-size image
            urllib.request.urlretrieve(img_src, path + "/" + "縮略" + img_name)
            urllib.request.urlretrieve(img_src.replace("files", "files").replace("_s.jpg", ".jpg"), path + "高畫質" + "/" + img_name)
        page += 1
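The paging rule in the loop (page 0 is the category URL itself, later pages add _N before .html) is easy to get wrong if each new URL is built from the previous one, because the suffixes then accumulate into names like _1_2.html. A minimal sketch of the rule as a standalone helper follows; page_url and the example.com addresses are assumptions for illustration.

```python
from urllib.parse import urljoin

def page_url(first_page_url: str, page: int) -> str:
    """Return the URL of the given page of a category (page 0 is the first page)."""
    if page == 0 or not first_page_url.endswith(".html"):
        return first_page_url
    # Always derive from the first-page URL so suffixes never accumulate.
    stem = first_page_url[: -len(".html")]
    return "%s_%s.html" % (stem, page)

# Hypothetical site root and relative category link
site = "http://example.com/"
first = urljoin(site, "/cats/list.html")
urls = [page_url(first, n) for n in range(3)]
print(urls)
```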
