Crawling All the Campus News

2022-05-14 22:30:28

The requirements for this assignment come from:

0. Get the click count from a news url, and wrap it up as a function

1. Practice using re.search(), match(), and findall()

2. Get the news details from a news url: a dict, anews

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re

def click(url):
    # The news id is the last run of digits in the url
    newsid = re.findall(r'(\d+)', url)[-1]
    # The click-count endpoint url was elided in the original post
    clickurl = ''.format(newsid)
    resclick = requests.get(clickurl)
    # The endpoint returns JavaScript like "$('...').html('123');":
    # keep what follows ".html", then strip the ('...'); wrapper
    newsclick = int(resclick.text.split('.html')[-1].lstrip("('").rstrip("');"))
    return newsclick
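The string surgery inside click() can be checked offline; the response string below is hypothetical, but it mirrors the shape of the JavaScript the click-count endpoint returns.

```python
def parse_click_count(text):
    # Same surgery as click(): keep what follows ".html",
    # then strip the surrounding ('...'); wrapper
    return int(text.split('.html')[-1].lstrip("('").rstrip("');"))

# Hypothetical sample response
print(parse_click_count("$('#todaydown').html('3159');"))  # → 3159
```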

def newsdt(showinfo):
    # showinfo looks like "發佈時間:2019-04-01 11:57:00  作者:..."
    newsdate = showinfo.split()[0].split(':')[1]
    newstime = showinfo.split()[1]
    newsdt = newsdate + ' ' + newstime
    # The format codes must be upper case: %Y for a four-digit year,
    # %H:%M:%S for 24-hour time
    dt = datetime.strptime(newsdt, '%Y-%m-%d %H:%M:%S')
    return dt
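The date parsing can be tried offline too; the show-info line below is a hypothetical sample in the same "label:date time" layout the page uses.

```python
from datetime import datetime

# Hypothetical show-info text in the layout newsdt() expects
showinfo = '發佈時間:2019-04-01 11:57:00  作者:admin'
newsdate = showinfo.split()[0].split(':')[1]   # '2019-04-01'
newstime = showinfo.split()[1]                 # '11:57:00'
dt = datetime.strptime(newsdate + ' ' + newstime, '%Y-%m-%d %H:%M:%S')
print(dt)  # → 2019-04-01 11:57:00
```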

def anews(url):
    newsdetail = {}
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsdetail['newstitle'] = soup.select('.show-title')[0].text
    showinfo = soup.select('.show-info')[0].text
    newsdetail['newsdt'] = newsdt(showinfo)
    newsdetail['newsclick'] = click(url)
    return newsdetail

def alist(listurl):
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for news in soup.select('li'):
        # Only the <li> elements that carry a news title are news items
        if len(news.select('.news-list-title')) > 0:
            newsurl = news.select('a')[0]['href']
            newsdesc = news.select('.news-list-description')[0].text
            newsdict = anews(newsurl)
            newsdict['newsurl'] = newsurl
            newsdict['description'] = newsdesc
            newslist.append(newsdict)  # collect every news dict on this page
    return newslist

# The list-page url was elided in the original post
listurl = ''
alist(listurl)
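The selectors alist() relies on can be exercised against a hand-written fragment; the markup below is hypothetical, but it mirrors the structure of the list page.

```python
from bs4 import BeautifulSoup

# Hypothetical list-page markup mirroring the selectors alist() uses
html = '''<ul>
<li><a href="http://example.com/1234.html">
  <div class="news-list-title">Title one</div>
  <div class="news-list-description">Desc one</div></a></li>
<li>Not a news item</li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:  # skip non-news <li> elements
        print(news.select('a')[0]['href'])        # → http://example.com/1234.html
```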

Screenshot:

4. Generate the urls of all the list pages and fetch all the news: list.extend(list), allnews

*Each student crawls the 10 list pages starting from the last digit of their student id

res = requests.get('')  # the news index url was elided in the original post
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
soup.select('#pages')[0].text
# The #pages text ends with something like "... 978 下一頁 末頁";
# pull out the total page count
int(re.search(r'..(\d+).下', soup.select('#pages')[0].text).group(1))
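The page-count regex can be checked on a sample pagination string; the text below is hypothetical, but the real #pages element ends the same way, with the last page number standing just before 下一頁 ("next page").

```python
import re

# Hypothetical pagination text; the captured group is the number
# immediately before 下一頁
pages_text = '首頁 1 2 3 ... 978 下一頁 末頁'
total = int(re.search(r'..(\d+).下', pages_text).group(1))
print(total)  # → 978
```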

listurl = ''  # the first list page; url elided in the original post
allnews = alist(listurl)

# 10 list pages starting from the last digit of the student id
for i in range(5, 15):
    listurl = '{}.html'.format(i)  # base url elided in the original post
    allnews.extend(alist(listurl))
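The loop accumulates with extend() rather than append() because each alist() call already returns a list; a tiny sketch with hypothetical page contents:

```python
# Each alist() call returns a list of dicts; extend() flattens the
# per-page lists into one allnews list (records here are hypothetical)
page1 = [{'newstitle': 'a'}, {'newstitle': 'b'}]
page2 = [{'newstitle': 'c'}]

allnews = []
for page in (page1, page2):
    allnews.extend(page)

print(len(allnews))  # → 3
```

With append() instead, allnews would hold 2 nested lists rather than 3 news dicts.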

Screenshot: crawling a single page;

crawling multiple pages;

5. Set a reasonable crawl interval

import time
import random

# Sleep a random interval of up to 3 seconds between requests
time.sleep(random.random() * 3)

for i in range(1, 3):
    print(i)
    time.sleep(random.random() * 3)  # sleep up to 3 seconds per iteration

print(tennews)  # tennews is not defined in the snippets shown here

6. Use pandas for some simple data processing and save the result

Save to a csv or excel file

import pandas as pd

newsdf = pd.DataFrame(allnews)
newsdf.to_csv(r'e:\gzcc.csv')
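A self-contained sketch of the pandas step, using hypothetical records and an in-memory buffer in place of the e:\ path:

```python
import pandas as pd
from io import StringIO

# Hypothetical news records in the shape anews() produces
allnews = [
    {'newstitle': 'Title one', 'newsclick': 3159},
    {'newstitle': 'Title two', 'newsclick': 287},
]
newsdf = pd.DataFrame(allnews)

buf = StringIO()
newsdf.to_csv(buf, index=False)  # index=False drops the row-number column
print(buf.getvalue().splitlines()[0])  # → newstitle,newsclick
```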

Screenshot: (contents of pages 5-15)
