Honestly, this crawler is badly written; I'm posting it here hoping for help

2021-08-25 19:46:38 · 2,164 characters · 6,373 reads

This is my crawler code for scraping cnbeta.

Analyzing the pages took a long time, and the result is still not what I wanted.

The parsing part is not done well. I'd be grateful if someone more capable could point out a better approach. Many thanks.

import requests
import re
from bs4 import BeautifulSoup


def article_num():
    # Fetch the list page and return the id of the newest article.
    headers = {}  # the original header dict was lost in the paste; add a User-Agent here
    url = ''      # the original list-page URL was also lost in the paste
    wb_data = requests.get(url, headers=headers)
    wb_data.encoding = 'utf-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('dt > a')
    url = titles[0].get('href')
    article_num = re.findall(r'\d+', url).pop(0)
    # print("first number: " + article_num)
    return int(article_num)


def get_url(num):
    # Note: the scheme/host prefix was lost in the paste; requests needs a full URL.
    return 'articles/%d.htm' % num


the_num = article_num()


def get_data():
    global the_num
    url = get_url(the_num)
    the_num -= 2
    headers = {}  # same lost header dict as above
    wb_data = requests.get(url, headers=headers)
    wb_data.encoding = 'utf-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # print(soup)
    data = {}
    title = soup.select('header.title > h1')
    # print(title)
    data['title'] = title[0].text
    article_summary = soup.select('div.article-summary > p')
    summary = article_summary[0].text
    article_content = soup.select('div.article-content > p')
    all_content = summary
    for content in article_content:
        all_content = all_content + '\n\n' + content.text.replace('\n', '')
    data['content'] = all_content
    return data


def main(num):
    with open('.//wenzhang/' + 'cnbeta1.txt', 'w', encoding='utf-8') as f:
        for i in range(1, num + 1):
            try:
                data = get_data()
                f.write('>> ' + str(i) + ' >>>> ' + data['title'] + ' <<<<\n\n')
                f.write(data['content'] + '\n\n---------------------------------\n\n\n\n')
                print(str(the_num + 2) + " ok ...", end='\n------------------------\n\n\n')
            except Exception:
                f.write('>> ' + str(i) + ' >>>> ' + str(the_num + 2) + " this one failed..." + ' <<<<\n\n' + '\n\n---------------------------------\n\n\n')
                print(str(the_num + 2) + " noooooo ...", end='\n------------------------\n\n\n')


num = 10
main(num)
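One fragile spot in the parsing is `re.findall('\d+', url).pop(0)`: if the href contains no digits it raises a bare IndexError deep inside the crawl loop, which the broad except then swallows. A minimal sketch of a more defensive version, using only the standard library (the example href and article id below are hypothetical, made up to illustrate the cnbeta `articles/<id>.htm` pattern used in the code above):

```python
import re


def extract_article_id(href):
    """Return the first run of digits in an article href as an int.

    Raises ValueError with the offending href instead of IndexError,
    which makes failures in the crawl loop far easier to diagnose.
    """
    match = re.search(r'\d+', href)
    if match is None:
        raise ValueError('no article id in href: %r' % href)
    return int(match.group())


print(extract_article_id('/articles/1173481.htm'))  # prints 1173481
```

Swapping this helper in for the `findall(...).pop(0)` line, and narrowing the `except` in `main` to log the actual exception, would turn silent "noooooo" lines into actionable error messages. For the fetching side, a shared `requests.Session` with a retry adapter would also be worth considering, though that is a separate change.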

