Python: Keeping Illegal Characters from Breaking UTF-8 File Reads

2021-09-03 01:58:07

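I recently worked on a regex-matching project that opens UTF-8 files with open() and reads them line by line. Some of the files contain bytes that are not valid UTF-8, so the script raised errors. While debugging, I found that whether you call read(1) (in text mode, read one character) or readline() (read one line), the Python library reads a large chunk of the file at once, and if that chunk contains an illegal byte, the whole call fails.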
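The chunked behaviour can be seen with a tiny file. The following is a minimal sketch rather than the code from the project: the temporary file and its contents are made up for illustration, and the result relies on CPython's text layer decoding a whole buffered chunk before returning anything, which is an implementation detail.

```python
import os
import tempfile

# Hypothetical sample data: the invalid byte 0xa0 sits on the third line,
# far away from the first character of the file.
data = b"first line\nsecond line\n\xa0third line\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

failed = False
try:
    with open(path, mode="r", encoding="utf-8") as f:
        try:
            # Only one character is requested, but the text layer first
            # decodes the whole chunk it pulled from the file.
            first_char = f.read(1)
        except UnicodeDecodeError as exc:
            failed = True
            print("read(1) failed:", exc.reason)
finally:
    os.remove(path)
```

Because the file is smaller than one chunk, the bad byte ends up in the very first decode, so even read(1) cannot return the valid first character.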

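For example, the following code reads and prints each line. In practice, the line containing the illegal byte, along with a number of lines before and after it, is never printed.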

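    # `file` was opened earlier with open(..., mode='r'), without an errors argument
    while True:
        try:
            line = file.readline()
            print(line, end='')
        except UnicodeDecodeError:
            # the failed call discards the whole buffered chunk, not just one line
            continue
        if line == '':
            # readline() returns an empty string only at end of file
            break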

The test file example.txt contains 1000 lines of the following form. Note that an illegal byte, 0xa0, was inserted at the start of line no. 500:
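A file like this can be produced with a few lines of Python. This is a sketch under assumptions: the helper name write_example and the temporary output path are hypothetical, and the line format is copied from the listing in this article.

```python
import os
import tempfile


def write_example(path, total=1000, bad_line=500):
    """Write `total` numbered lines, prefixing line `bad_line` with byte 0xa0."""
    with open(path, "wb") as f:  # binary mode, so the raw byte is written as-is
        for i in range(total):
            if i == bad_line:
                f.write(b"\xa0")  # a lone 0xa0 is not valid UTF-8
            f.write(f"this is line no. {i}\n".encode("utf-8"))


# Hypothetical location, chosen only so the sketch runs anywhere.
example_path = os.path.join(tempfile.mkdtemp(), "example.txt")
write_example(example_path)
```

Writing in binary mode is what makes it possible to place a byte in the file that no UTF-8 encoder would ever emit.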

this is line no. 0

...

this is line no. 488

this is line no. 489

this is line no. 490

this is line no. 491

this is line no. 492

this is line no. 493

this is line no. 494

this is line no. 495

this is line no. 496

this is line no. 497

this is line no. 498

this is line no. 499

<illegal byte 0xa0>this is line no. 500

this is line no. 501

this is line no. 502

this is line no. 503

this is line no. 504

this is line no. 505

this is line no. 506

this is line no. 507

...

this is line no. 999

Running the Python code produces the following output:

...

this is line no. 383

this is line no. 384

this is line no. 385

this is line no. 386

this is line no. 387

this is line no. 388

this is line no. 389

this is line no. 390

this is line no. 391

this is line no. 392

this is line no. 393

this is line no. 394

line no. 785

this is line no. 786

this is line no. 787

this is line no. 788

this is line no. 789

this is line no. 790

this is line no. 791

this is line no. 792

this is line no. 793

this is line no. 794

this is line no. 795

this is line no. 796

this is line no. 797

...

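The output jumps from line 394 straight to a truncated line 785. Everything in between, from line 395 before line 500 up to line 785 after it, was actually read, but because that chunk contained the illegal byte the call failed, and none of those lines were displayed.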
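The gap is far wider than the one damaged line because decoding happens per chunk rather than per line. One way to confirm that only a single line is actually invalid is to read the file in binary mode and decode each line on its own. This is a sketch, not part of the original script: the function name find_undecodable_lines and the small demo file are hypothetical.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_index, error) pairs for lines that fail to decode."""
    bad = []
    with open(path, "rb") as f:  # binary mode: no decoding happens while reading
        for index, raw in enumerate(f):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((index, exc))
    return bad


# Small made-up file for demonstration: only the line at index 2 is damaged.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"line 0\nline 1\n\xa0line 2\nline 3\n")

bad_lines = find_undecodable_lines(demo_path)
print([index for index, _ in bad_lines])
```

Since each line is decoded separately, a failure here affects exactly one line, which makes it easy to see where the bad bytes are before deciding how to handle them.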

To skip the illegal bytes and still read the rest of each line normally, pass errors='ignore' when opening the file.

    file = open('/home/hyphen/example.txt', mode='r', errors='ignore')

With this change, running the same code reads the file normally:

this is line no. 484

this is line no. 485

this is line no. 486

this is line no. 487

this is line no. 488

this is line no. 489

this is line no. 490

this is line no. 491

this is line no. 492

this is line no. 493

this is line no. 494

this is line no. 495

this is line no. 496

this is line no. 497

this is line no. 498

this is line no. 499

this is line no. 500

this is line no. 501

this is line no. 502

this is line no. 503

this is line no. 504

this is line no. 505

this is line no. 506

this is line no. 507

this is line no. 508

this is line no. 509

this is line no. 510

this is line no. 511

this is line no. 512
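The effect of errors='ignore' can be checked on a small file. This is a minimal sketch with made-up data in a temporary location; it also passes encoding='utf-8' explicitly, which is an addition to the call in this article, so the result does not depend on the platform's default encoding.

```python
import os
import tempfile

# Made-up three-line file; the middle line starts with the invalid byte 0xa0.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write(
        b"this is line no. 499\n"
        b"\xa0this is line no. 500\n"
        b"this is line no. 501\n"
    )

# errors='ignore' drops bytes that cannot be decoded instead of raising.
with open(path, mode="r", encoding="utf-8", errors="ignore") as file:
    lines = [line.rstrip("\n") for line in file]

print(lines)
```

Note that errors='ignore' drops the bad bytes silently. If it matters where the damage was, errors='replace' is an alternative that substitutes U+FFFD for each undecodable byte, so the affected lines stay visible.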
