Simple Crawler Architecture

2022-07-25 07:42:13 · 3,920 words · 2,286 reads

Crawler architecture

Execution flow
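The classic teaching architecture for this kind of crawler is a scheduler driving three parts in a loop: a URL manager (tracks which URLs are pending and which have been seen), a downloader, and a parser/outputter. Below is a minimal sketch of that loop; the page names and contents are hypothetical, and the downloader is stubbed with an in-memory dict so the flow runs without network access.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add(self, url):
        # Ignore URLs we have already queued or crawled
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


# Stubbed "web": each fake page has a title and a list of outgoing links
FAKE_PAGES = {
    "page1": ("title one", ["page2", "page3"]),
    "page2": ("title two", ["page3"]),
    "page3": ("title three", []),
}


def download(url):
    """Downloader stub: returns the page, or None if it does not exist."""
    return FAKE_PAGES.get(url)


def crawl(root):
    """Scheduler: pull a URL, download, parse, output, queue new links."""
    manager = UrlManager()
    manager.add(root)
    results = []
    while manager.has_new():
        url = manager.get()
        page = download(url)
        if page is None:
            continue
        title, links = page            # "parse" step
        results.append((url, title))   # "output" step
        for link in links:
            manager.add(link)
    return results


# Each page is visited exactly once, even though page3 is linked twice
print(crawl("page1"))
```

In a real crawler, `download` would fetch the page with `urllib.request` and the parse step would extract the title and links with BeautifulSoup, as the examples below show.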

HTML parser

HTML parser: BeautifulSoup syntax

Simple parsing example 1

```python
from bs4 import BeautifulSoup
import re

# The sample markup was lost from this page; it is restored here from the
# well-known "three sisters" document in the BeautifulSoup documentation,
# which this example clearly follows.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# from_encoding only applies when parsing bytes; html_doc is already a str,
# so it is omitted here.
soup = BeautifulSoup(html_doc, "html.parser")

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the Lacie link')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex match')
# Match any href containing "ill" (i.e. the Tillie link)
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())
```

Simple parsing example 2

```python
from bs4 import BeautifulSoup as bs
import re

# Same "three sisters" sample document as in example 1
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# Parse with the html.parser backend
soup = bs(html_doc, "html.parser")
print(soup.prettify())

# Get the title tag and its contents
print(soup.title)
# Get only the contents of the title tag
print(soup.title.string)
# Get the name of the parent tag
print(soup.title.parent.name)
# Get the first p tag and its contents
print(soup.p)
# Get the value of the class attribute of that p tag
print(soup.p['class'])
# Get the first a tag and its contents
print(soup.a)
'''soup.tag only returns the first tag with that name in the document'''
# Get all a tags and their contents
print(soup.find_all('a'))
# Get the tag whose id is link1, with its contents
print(soup.find(id='link1'))
# Get only the contents of the tag whose id is link1
print(soup.find(id='link1').string)
# Get the link and text of every a tag
for link in soup.find_all('a'):
    print('link: ' + link.get('href') + ' text: ' + link.string)

# Get the p tag whose class attribute is "story", with all its contents
print(soup.find("p", class_="story"))
# Get only the text under that p tag
print(soup.find("p", class_="story").get_text())

# Get tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Get all a tags whose href matches a pattern (the code was elided in
# the source): soup.find_all('a', href=re.compile(...))
```

Comprehensive example: scraping Wikipedia entries

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Import the packages used below
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# The target URL was elided in the source; supply a Wikipedia page URL here
resp = urlopen("").read().decode('utf-8')
# Parse the response with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Find links whose href starts with /wiki/
listUrls = soup.find_all("a", href=re.compile("^/wiki/"))
# Print the name and URL of every entry
for url in listUrls:
    # Skip links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url["href"]):
        # Print the link text and the corresponding href
        print(url.get_text() + ' ' + url['href'])
```
