Handling Chinese Encoding in Python 2

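Today I wrote several scripts, and every one of them had to deal with mixed Chinese and English text. The requirement was to convert the Chinese punctuation in that text to English (ASCII) punctuation.

For example: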

tags = '你好，good, 國語'

The full-width Chinese comma needs to be replaced with the ASCII comma, to make later processing easier.

A first attempt:

tags = tags.replace('，', ',')

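If tags is a unicode string (as values returned by a web framework usually are, unlike the literal above), this raises the following exception:

UnicodeDecodeError: 'ascii' codec can't decode byte ...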

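Python 2 has two kinds of strings: byte strings (type str) and unicode strings (type unicode).

Generally, once #coding=utf-8 is set, every plain string literal in the source file that contains Chinese is a UTF-8 encoded byte string.

Strings returned by many library functions (web frameworks, for example), however, are unicode strings.

When a byte string and a unicode string are mixed, Python 2 implicitly decodes the byte string with the default ascii codec, so any non-ASCII byte raises UnicodeDecodeError.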

byte_str = 'hello, this is byte string'

unicode_str = u'hello, this is unicode string'
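The implicit conversion described above can be made explicit. The following is a minimal sketch, not what Python 2 literally runs internally: the helper name implicit_decode is hypothetical, and the bytes are written as escapes so the example does not depend on the source file encoding. It should behave the same on Python 2 and Python 3.

```python
# UTF-8 bytes of the full-width comma (U+FF0C), written as escapes.
fullwidth_comma_bytes = b'\xef\xbc\x8c'


def implicit_decode(byte_str):
    """Mimic the ascii decode Python 2 applies when str meets unicode.

    Hypothetical helper for illustration only.
    """
    try:
        return byte_str.decode('ascii')
    except UnicodeDecodeError:
        return 'UnicodeDecodeError'


# Pure-ASCII bytes decode fine, which is why the problem only shows up
# once Chinese characters are involved.
ascii_result = implicit_decode(b'good')
chinese_result = implicit_decode(fullwidth_comma_bytes)
```

This is why mixing the two string types appears to work until the first non-ASCII byte arrives.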

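So there are three possible solutions:

1. Convert everything to byte strings

2. Convert everything to unicode strings

3. Set the system default encoding

1. Convert everything to byte strings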

'你好' + request.forms.tags.encode('utf-8')

2. Convert everything to unicode strings

u'你好' + request.forms.tags

Converting between byte strings and unicode strings:

b_s = 'test'
u_s = unicode(b_s, 'utf-8')
back_to_b_s = u_s.encode('utf-8')
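The same round trip can be sketched with the decode and encode methods, which exist on both Python 2 and Python 3 (the unicode() built-in above exists only on Python 2). The sample bytes are an assumption chosen for illustration: they are the UTF-8 encoding of 你好.

```python
# UTF-8 bytes of 你好, written as escapes so the file encoding is irrelevant.
b_s = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# bytes -> unicode: the encoding must be stated explicitly.
u_s = b_s.decode('utf-8')

# unicode -> bytes: encoding with the same codec gives back the original.
back_to_b_s = u_s.encode('utf-8')

# Two characters, but six bytes: lengths differ between the two types.
char_count = len(u_s)
byte_count = len(b_s)
```

The differing lengths are a quick way to tell which kind of string a value is when debugging.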

3. Set the system default encoding

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

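After this, implicit conversions use UTF-8 instead of ascii, so byte strings and unicode strings can be mixed, provided the byte strings really are UTF-8 encoded.

So the problem above can now be solved: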

tags = tags.replace(unicode('，','utf-8'), ',')

or

tags = tags.encode('utf-8').replace('，', ',')

or

call setdefaultencoding to set the system default encoding, as in option 3.
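The "convert everything to unicode" approach can be wrapped in a small helper. This is a sketch under stated assumptions: the function name normalize_commas and the constant FULLWIDTH_COMMA are hypothetical, and byte input is assumed to be UTF-8 unless another encoding is passed. It should run on both Python 2 and Python 3.

```python
# The full-width Chinese comma as a unicode escape.
FULLWIDTH_COMMA = u'\uff0c'


def normalize_commas(tags, encoding='utf-8'):
    """Return tags as unicode with full-width commas replaced by ','.

    Hypothetical helper: accepts either a byte string or a unicode string.
    """
    if isinstance(tags, bytes):
        # Decode explicitly instead of relying on the implicit ascii decode.
        tags = tags.decode(encoding)
    return tags.replace(FULLWIDTH_COMMA, u',')


# Byte input (UTF-8 for 你好 followed by a full-width comma) and
# unicode input both produce the same unicode result.
from_bytes = normalize_commas(b'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8cgood, ok')
from_text = normalize_commas(u'\u4f60\u597d\uff0cgood, ok')
```

Decoding at the boundary and working in unicode everywhere else avoids having to remember which type each value is.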

A related task is reading UTF-8 files.

The codecs module can be used for this:

import codecs
handler = codecs.open('test', 'r', 'utf-8')
u = handler.read() # returns a unicode string from the utf-8 bytes in the file

codecs can also convert the unicode strings passed to write into any encoding.
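A minimal sketch of the write side, assuming GBK as the target encoding and a temporary file as the destination (both are choices made for this illustration, not taken from the original scripts):

```python
import codecs
import os
import tempfile

text = u'\u4f60\u597d, good'  # 你好, good
path = os.path.join(tempfile.mkdtemp(), 'test_gbk.txt')

# The writer encodes each unicode string to GBK as it is written.
with codecs.open(path, 'w', 'gbk') as writer:
    writer.write(text)

# Reading back with the same codec returns the original unicode string.
with codecs.open(path, 'r', 'gbk') as reader:
    round_tripped = reader.read()

# The raw bytes on disk are GBK, not UTF-8.
with open(path, 'rb') as raw:
    raw_bytes = raw.read()
```

Reading a file with a different codec than the one used to write it is a common source of garbled text, so the encoding is best kept next to the file name in the code.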

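When writing a script, Python 2 assumes by default that the source file is ASCII-encoded (identifiers must be ASCII in any case). To write Chinese in the file, Python needs to be told that the file is not ASCII-encoded.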

Below the line

#!/usr/bin/env python

add

# -*- coding: utf-8 -*-

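All of the above applies to Python 2. Python 3 strictly separates unicode strings (str) from byte strings (bytes), and the default source encoding is UTF-8 rather than ascii.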
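The Python 3 behavior can be sketched as follows. This block is Python 3 only, and the helper name try_mix is hypothetical. Instead of an implicit ascii decode, mixing the two types fails immediately with a TypeError, even for pure-ASCII bytes.

```python
def try_mix(text, data):
    """Try to concatenate a str and a bytes object (hypothetical helper)."""
    try:
        return text + data
    except TypeError:
        return 'TypeError'


greeting = '\u4f60\u597d'  # 你好; a plain literal is already unicode

# No implicit conversion: str + bytes is rejected outright.
mixed = try_mix(greeting, b'good')

# The conversion has to be spelled out.
explicit = greeting + b'good'.decode('utf-8')
```

The error appears at the point where the types are mixed, not later when a Chinese character happens to show up in the data.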
