Scraping nn-online data with a Python crawler

2021-09-03 02:38:16 · 3373 characters · 7908 reads

Write a crawler that scrapes the PWA93 theoretical values of each partial wave for np scattering from nn-online.org.

You need to install python3, python3-pip, selenium, and geckodriver first.

The Python crawler code is as follows:

# Wang Jianfeng  Dec 14 2018
# Python 3
# Install selenium first: pip3 install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re
from urllib.request import urlopen

# Install geckodriver first
driver = webdriver.Firefox()  # the driver is Firefox

# Partial waves to fetch. The scraped copy repeated '3p2, 3p2, 3f2' in the
# J = 3 block, which looks like a copy error; following the pattern of the
# other J blocks it should read 3D3, 3F3, 3G3.
phaselist = ['1S0', '3P0',
             '1P1', '3S1', '3P1', '3D1', 'E1',
             '1D2', '3P2', '3D2', '3F2', 'E2',
             '1F3', '3D3', '3F3', '3G3', 'E3',
             '1G4', '3F4', '3G4', '3H4', 'E4',
             '1H5', '3G5', '3H5', '3I5', 'E5']

url1 = ""  # the base query URL was stripped from the scraped copy; it is the
           # nn-online.org phase-shift query URL, ending in the tmin parameter
url2 = "&tmax="
url3 = "&tint=0.01&ps="
nntype = "np_"
txt = ".txt"
tmin = 0.01
tmax = 10.00

for phase in phaselist:
    fw = open(nntype + phase + txt, "w", encoding="utf-8")
    fw.write("\b\b")  # preserved from the original; consumed by the clean-up script below
    tmin = 0.01
    tmax = 10
    # walk the energy range in 10 MeV windows up to 300 MeV
    while tmax <= 300:
        url = url1 + str(round(tmin, 2)) + url2 + str(tmax) + url3 + phase
        driver.get(url)
        html = driver.page_source
        # Extract the PWA93 table from the rendered page. The tail of this
        # regex was lost in the scraped copy; "</pre>" as the closing
        # delimiter is an assumption about the page markup.
        res = re.findall(r"PWA93(.+?)</pre>", html, flags=re.DOTALL)
        fw.write(res[0].strip())
        fw.write("\n")
        if tmax < 100:
            fw.write("\b")  # preserved from the original
        tmin = tmin + 10
        tmax = tmax + 10
    fw.close()
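The extraction step can be checked offline against a mocked page source. Note the mock HTML below is invented for illustration, and the closing </pre> delimiter is an assumption, since the real marker was lost in the scraped copy:

```python
import re

# Mocked page source; the real nn-online markup is assumed (not confirmed)
# to wrap the PWA93 table in a <pre> block. The numbers are made up.
html = "<pre>PWA93\n   10.00   59.9564\n   20.00   50.9024\n</pre>"

# Same extraction as in the crawler: capture everything between "PWA93"
# and the closing </pre>, with DOTALL so the match spans newlines.
res = re.findall(r"PWA93(.+?)</pre>", html, flags=re.DOTALL)
print(res[0].strip())  # prints the two data rows
```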

Because the last line of the data returned by nn-online is sometimes duplicated, write another Python script to re-edit the output files.

The code is as follows:

# Same partial-wave list as above, with the J = 3 block corrected
phaselist = ['1S0', '3P0',
             '1P1', '3S1', '3P1', '3D1', 'E1',
             '1D2', '3P2', '3D2', '3F2', 'E2',
             '1F3', '3D3', '3F3', '3G3', 'E3',
             '1G4', '3F4', '3G4', '3H4', 'E4',
             '1H5', '3G5', '3H5', '3I5', 'E5']

nntype = "np_"
txt = ".txt"
dat = ".dat"
path = "out/"

for phase in phaselist:
    ii = 0
    fr = open(nntype + phase + txt, "r", encoding="utf-8")
    fw = open(path + nntype + phase + dat, "w", encoding="utf-8")
    # readline(8) reads the fixed-width energy column, readline(11) the
    # phase-shift column, and the plain readline() consumes the rest of the row
    line1 = fr.readline(8)
    line2 = fr.readline(11)
    line3 = fr.readline()
    while line1:
        line11 = line1
        line22 = line2
        line1 = fr.readline(8)
        line2 = fr.readline(11)
        line3 = fr.readline()
        if line11 != line1:  # skip rows whose energy repeats in the next row
            fw.write(line11 + line22 + '\b\n')
            ii = ii + 1
    fr.close()
    fw.close()
    print(phase + " complete. line = " + str(ii))
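The duplicate-row filter above can be sketched on in-memory data (the sample numbers are made up for illustration): each row is compared on its first 8 characters, the fixed-width energy column, and a row is kept only when the next row's energy differs:

```python
# Hypothetical fixed-width rows; the last one duplicates the row before it,
# which is the situation nn-online sometimes produces.
rows = [
    "   10.00    59.9564\n",
    "   20.00    50.9024\n",
    "   20.00    50.9024\n",
]

kept = []
for i, row in enumerate(rows):
    # energy column of the following row ("" after the last row)
    nxt = rows[i + 1][:8] if i + 1 < len(rows) else ""
    if row[:8] != nxt:  # keep only when the next energy differs
        kept.append(row)

print(len(kept))  # 2: the duplicated 20.00 row collapses to one
```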
