Web Scraping Summary 3

//div[@id='xx']/../*[last()]/a[2]/@href
# The value of the href attribute of the second a tag inside the last child tag of the parent of the div whose id is xx
/html//a[text()="***"]/./text()
# The text content of every a tag under html whose text is ***; "." means the current node, so it is still that same a tag
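
To make the two expressions above concrete, here is a minimal sketch that runs them against a small made-up page. The HTML snippet, the ids other than xx, and the link text "two" (standing in for ***) are all assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical sample page; only the id "xx" comes from the notes above.
html_str = """
<html><body>
  <div id="wrap">
    <div id="xx">first child</div>
    <p>middle</p>
    <div class="links">
      <a href="/one">one</a>
      <a href="/two">two</a>
    </div>
  </div>
</body></html>
"""
root = etree.HTML(html_str)

# div#xx -> its parent (div#wrap) -> last child tag (div.links) -> second a -> href
hrefs = root.xpath("//div[@id='xx']/../*[last()]/a[2]/@href")
print(hrefs)  # ['/two']

# "." keeps the context on the matched a tag, so this is simply its text
texts = root.xpath('/html//a[text()="two"]/./text()')
print(texts)  # ['two']
```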

from lxml import etree
html_element = etree.HTML(html_str)
rets = html_element.xpath('xpath_str')
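# rets is returned as a list or 
# If xpath_str selects tag elements, every item in rets is an Element object, and you can call xpath on it again!
# If xpath_str extracts an @attribute or text(), rets is a list of strings!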
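
The difference between the two kinds of result can be sketched as follows. The list markup and the variable names items and pairs are assumptions made for this example, not part of the original notes.

```python
from lxml import etree

# Hypothetical markup used only for this demonstration.
html_element = etree.HTML(
    '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
)

# Selecting tags returns a list of Element objects.
items = html_element.xpath('//li')
print(etree.iselement(items[0]))  # True

# Each Element supports xpath again, relative to itself ("./...").
pairs = []
for li in items:
    href = li.xpath('./a/@href')   # list of strings
    text = li.xpath('./a/text()')  # list of strings
    pairs.append((href[0], text[0]))
print(pairs)  # [('/a', 'A'), ('/b', 'B')]

# When nothing matches, the result is an empty list, so index with care.
missing = html_element.xpath('//table')
print(missing)  # []
```

Grouping by the parent li first and then running relative expressions keeps each href paired with its own text, which is usually safer than zipping two independent top-level queries.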

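# lxml.etree.HTML() normalizes the markup in html_str while parsing (for example, it adds missing tags), so the parsed tree can differ from the original source
# When extracting data, go by what lxml.etree.tostring() returns, not by the raw source!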
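
A small sketch of why tostring() is the thing to inspect: the fragment below is deliberately broken (a made-up example with unclosed tags and no html or body), and printing the parsed tree shows the structure that XPath will actually run against.

```python
from lxml import etree

# Hypothetical broken fragment: no <html>/<body>, unclosed <p> and <a>.
html_str = '<div><p>unclosed paragraph<a href="/x">link</div>'

html_element = etree.HTML(html_str)

# Serialize the parsed tree; encoding='unicode' returns str instead of bytes.
repaired = etree.tostring(html_element, encoding='unicode')
print(repaired)
```

Write XPath expressions against the printed structure rather than against html_str, since wrapper tags and closing tags that the parser added are part of the tree.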

json.dumps # Python data type --> json_str
json.loads # json_str --> Python data type
json.dump # Python data type --> written to a file-like object
json.load # read from a file-like object --> Python data type
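
The four functions can be seen together in one round trip. This is a minimal sketch; the sample dictionary is invented, and io.StringIO stands in for a real file opened with open().

```python
import io
import json

# Hypothetical sample data.
data = {"name": "crawler", "pages": [1, 2, 3], "ok": True, "note": None}

# dumps / loads work with strings.
json_str = json.dumps(data)
restored = json.loads(json_str)
print(type(json_str).__name__, restored == data)  # str True

# dump / load work with file-like objects.
buffer = io.StringIO()
json.dump(data, buffer)
buffer.seek(0)  # rewind before reading back
loaded = json.load(buffer)
print(loaded == data)  # True
```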

from jsonpath import jsonpath
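rets = jsonpath(python_data, '$..***')
# Wherever it sits inside python_data, every value whose key is *** is collected into the returned list
# A quick way to bulk-extract the values of one given key!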
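
To show what a recursive-descent query such as $..name is doing, here is a pure-Python sketch of the same idea. It is not the jsonpath library's implementation; find_key and the sample data are hypothetical names introduced for illustration, and the library's result order or no-match behaviour may differ.

```python
def find_key(data, key):
    """Collect every value stored under `key`, at any depth (illustrative helper)."""
    found = []
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                found.append(v)
            # Keep descending: the value itself may contain the key again.
            found.extend(find_key(v, key))
    elif isinstance(data, list):
        for item in data:
            found.extend(find_key(item, key))
    return found


# Hypothetical nested data, as json.loads might return it.
data = {
    "name": "root",
    "children": [
        {"name": "a", "meta": {"name": "deep"}},
        {"id": 2},
    ],
}

names = find_key(data, "name")
print(names)  # ['root', 'a', 'deep']
print(find_key(data, "missing"))  # []
```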

a = '\n' # a is a newline character!
b = r'\n' # b is just a backslash followed by n, not a newline!
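
The difference is easy to check by length, and it is the reason regular expression patterns are usually written as raw strings. A short sketch, with a made-up sample string:

```python
import re

a = '\n'   # one character: newline
b = r'\n'  # two characters: backslash and n

print(len(a), len(b))  # 1 2
print(b == '\\n')      # True

# With a raw string, \d reaches the re module unchanged.
numbers = re.findall(r'\d+', 'page 12 of 34')
print(numbers)  # ['12', '34']
```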

Structured
    json
        json module
        jsonpath
        re
    xml
        lxml (xpath)
        re
Unstructured
    html
        lxml (xpath)
        re
