A Simple Little Crawler Using BeautifulSoup

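Installing BeautifulSoup

Download the package from the official site, unpack it, and install it with Python.

Official site address

For the details, just search online; there are plenty of guides.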

Crawling module

Tieba URLs are actually fairly easy to piece together, which is why quite a few people use Tieba for practice.

def start(self):
    # Each list page is addressed by an offset that grows in steps of 50.
    for i in range(self.topic_limit // 50):
        self.spide_listpage(i * 50)

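Since the plan is to page through the topic list, and the page offset appears in the URL in exactly this form, a loop calls the list-page method once per page.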
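As a rough sketch of this paging scheme, the snippet below only builds the list-page URLs that such a loop would request. The function name, the base URL, and the limit of 150 are made-up placeholders rather than values from the post; only the "&pn=" offset in steps of 50 is taken from the code above. It is written for Python 3, whereas the post's own code targets Python 2.

```python
def list_page_urls(base_url, topic_limit, page_size=50):
    """Return one list-page URL per offset: 0, 50, 100, ..."""
    urls = []
    for i in range(topic_limit // page_size):
        urls.append(base_url + "&pn=" + str(i * page_size))
    return urls


# Hypothetical base URL, for illustration only.
example_base = "http://example.com/f?kw=python"
for url in list_page_urls(example_base, 150):
    print(url)
```

Note that a limit smaller than 50 produces no pages at all, because the integer division rounds down.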

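# Module-level imports this method relies on (Python 2):
#   import urllib2
#   from bs4 import BeautifulSoup
def spide_listpage(self, num):
    url = self.baseurl + "&pn=" + str(num)
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Topic links on the list page carry the j_th_tit class.
    topic_list = soup.find_all('a', attrs={'class': 'j_th_tit'})
    for topic in topic_list:
        if self.keyword in topic['title']:
            # Remember the first matching topic and stop searching.
            self.theurl = (self.domain + topic['href']).strip()
            print topic['title'], self.theurl
            break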

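url is the concatenated address and html is the page fetched from it. BeautifulSoup parses that page and finds every link whose class includes j_th_tit, which gives the title and hyperlink of each topic.

The idea is simply to look for the matching CSS class in the HTML, since items of the same kind share the same markup. Most readers will know this already, so I won't go into detail.

The loop then prints the first topic whose title contains keyword and saves its URL.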
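The lookup described above can be tried offline on a small piece of markup. This is a minimal sketch for Python 3 with the bs4 package installed; the sample HTML, the function name, and the example.com domain are invented for illustration and are not copied from Tieba. Only the j_th_tit class and the title and href attributes come from the post.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates a topic list; not copied from Tieba.
SAMPLE_HTML = """
<ul>
  <li><a class="j_th_tit" title="Python basics" href="/p/1001">Python basics</a></li>
  <li><a class="j_th_tit" title="Crawler tips" href="/p/1002 ">Crawler tips</a></li>
  <li><a class="other" title="Crawler ad" href="/p/9999">ad</a></li>
</ul>
"""


def first_matching_topic(html, keyword, domain):
    """Return (title, url) for the first j_th_tit link whose title contains keyword."""
    soup = BeautifulSoup(html, "html.parser")
    for topic in soup.find_all("a", attrs={"class": "j_th_tit"}):
        if keyword in topic["title"]:
            return topic["title"], (domain + topic["href"]).strip()
    return None


# "http://example.com" is a placeholder domain.
print(first_matching_topic(SAMPLE_HTML, "Crawler", "http://example.com"))
```

Returning None when nothing matches lets the caller move on to the next list page instead of failing.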

File-writing module

Once the content has been scraped, we might as well write it to a txt file.

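import urllib2
from bs4 import BeautifulSoup


class writeinfile:
    def __init__(self, url):
        self.url = url

    def gettheweb(self):
        html = urllib2.urlopen(self.url).read()
        soup = BeautifulSoup(html, 'html.parser')
        # Match on the single class d_post_content; matching the full
        # 'd_post_content j_d_post_content ' string depends on exact whitespace.
        context_list = soup.find_all('div', attrs={'class': 'd_post_content'})
        for context in context_list:
            # print context.text
            self.wirtefile(context.text)

    def wirtefile(self, text):
        with open('spider.txt', 'a') as f:
            # context.text is unicode; encode it before writing in Python 2.
            f.write(text.encode('utf-8'))
            f.write('\n')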

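Pass the URL found earlier into this class, use BeautifulSoup to pull the text out of the posts in that thread, and finally call Python's built-in write method to append it to a txt file.

This mostly repeats what the previous module did.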
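The append-to-file step can be sketched on its own, without any network access. The snippet assumes Python 3, where open accepts an encoding argument; the function name and the sample strings are hypothetical, and the file is written to a temporary directory rather than next to the script as in the post.

```python
import os
import tempfile


def append_lines(path, texts):
    """Append each text to path on its own line, encoded as UTF-8."""
    with open(path, "a", encoding="utf-8") as f:
        for text in texts:
            f.write(text)
            f.write("\n")


# Write to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spider.txt")
    append_lines(path, ["first post"])
    append_lines(path, ["second post", "第三個帖子"])
    with open(path, encoding="utf-8") as f:
        print(f.read())
```

Opening the file in "a" mode means repeated runs keep adding to the same file, so delete spider.txt first if a fresh result is wanted.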

Uh... this thread seems to be a bit explicit. I'll try a different keyword next time.
