Python string encoding problems

The problem

The program is as follows:

# -*- coding: utf-8 -*- 
raw_input(u'輸入')

Running it raises an exception.

The print statement, however, works without any problem:

# -*- coding: utf-8 -*-
print u'輸入'

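The likely cause is that raw_input, on receiving the argument u'輸入', converts it with the ascii codec. The coding: utf-8 comment on the first line only declares how the .py file itself is decoded when it is read, and has no effect on raw_input.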
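The failure can be reproduced without raw_input. The sketch below is written in Python 3 syntax (the article's own code is Python 2), and the helper name try_encode is hypothetical. It only shows that the ASCII codec cannot represent the prompt text, while UTF-8 and GBK can.

```python
# Python 3 sketch; try_encode is a hypothetical helper, not part of raw_input.
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

prompt = '輸入'
ascii_result = try_encode(prompt, 'ascii')  # ASCII has no Chinese characters
utf8_result = try_encode(prompt, 'utf-8')
gbk_result = try_encode(prompt, 'gbk')

print(ascii_result)
print(utf8_result)
print(gbk_result)
```

The first call fails because ASCII only covers code points below 128, which is the same kind of failure the raw_input call runs into.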

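Solution:

Open the site.py file in the lib directory under the python27 directory and, in the setencoding function, change encoding = "ascii" to encoding = "gbk".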

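I looked up many approaches online. Some of them got rid of the exception but produced garbled output instead. The approach above worked in my own testing; if you follow it and run into problems, please let me know (*^__^*).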

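Cause

Now that the problem is solved, let's analyze the cause. I am a Python beginner, so what follows is only a guess; if anything is wrong, please point it out.

The likely cause is that raw_input relies, directly or indirectly, on the default string encoding, which site.py sets at startup through the setencoding function: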

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

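The fifth line of this function sets encoding = "ascii", so raw_input handles the argument u'輸入' as ASCII. Because these characters have no entry in the ASCII table, an exception is raised. The fix is to set encoding to the Chinese encoding 'gbk', so that the argument u'輸入' can be handled correctly. Replacing 'gbk' with 'cp936' works as well.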
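To see which encodings are actually in effect, the defaults can be inspected directly. This is a sketch in Python 3 syntax, where sys.getdefaultencoding() reports 'utf-8' rather than the 'ascii' of an unmodified Python 2. It also checks the claim that 'gbk' and 'cp936' name the same codec.

```python
import codecs
import locale
import sys

# Default string encoding of the Unicode implementation
# ('ascii' on Python 2 unless site.py is changed, 'utf-8' on Python 3).
default_encoding = sys.getdefaultencoding()

# Encoding suggested by the locale; on Simplified Chinese Windows this is
# typically cp936, elsewhere it is often UTF-8.
locale_encoding = locale.getpreferredencoding(False)

# Python registers cp936 as an alias of its gbk codec.
same_codec = codecs.lookup('cp936').name == codecs.lookup('gbk').name

print(default_encoding, locale_encoding, same_codec)
```

Because the two names resolve to one codec, encoding the prompt with either of them yields the same bytes.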

For more on the cause and on character encoding in general, see the linked article.

Supplement

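You may have heard that UTF-8 does not need a BOM. That is not correct; it is only that most editors read a file as UTF-8 by default when no BOM is present. Even Notepad, which saves as ANSI (MBCS) by default, first tests whether a file can be decoded as UTF-8 when opening it, and uses UTF-8 if the test succeeds. This awkward behaviour causes a bug: if you create a new text file, enter "姹塧", save it as ANSI (MBCS) and open it again, the content becomes "汉a". Try it yourself :)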

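Let's use Python to explore why a new text file containing "姹塧", saved as ANSI (MBCS), turns into "汉a" when it is reopened.

When Notepad saved the file, the encoding used was ANSI. A careful reading of the linked article shows that on Simplified Chinese Windows, ANSI means GBK. The following Python program prints the UTF-8 encoding of "汉a" and the GBK encoding of "姹塧":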

# -*- coding: utf-8 -*-
u1 = u'汉a'
print repr(u1)
print repr(u1.encode('utf-8'))  # UTF-8 bytes of 汉a
u2 = u'姹塧'
print repr(u2)
print repr(u2.encode('gbk'))    # GBK bytes of 姹塧

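The output shows that the UTF-8 encoding of "汉a" and the GBK encoding of "姹塧" are identical: both are the bytes E6 B1 89 61.

So the cause is clear. When Notepad saves "姹塧", it stores it as ANSI, that is, GBK. When it opens the file, it first tests whether the bytes decode as UTF-8. They do, so Notepad decodes them as UTF-8 and displays "汉a".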
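The behaviour described above can be imitated with a short sketch. It is an assumption about how such a heuristic might look, not Notepad's actual algorithm; the function name guess_decode is hypothetical, and the code uses Python 3 syntax.

```python
import codecs

def guess_decode(data, fallback='gbk'):
    """Hypothetical detector: BOM first, then strict UTF-8, then the fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode('utf-8'), 'utf-8 (BOM)'
    try:
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return data.decode(fallback), fallback

# What "save as ANSI" writes on a GBK system.
saved = '姹塧'.encode('gbk')
# These bytes happen to be valid UTF-8, so the UTF-8 branch wins.
text, used = guess_decode(saved)
print(text, used)

# Most GBK text is not valid UTF-8 and reaches the fallback branch.
other, used_other = guess_decode('中文'.encode('gbk'))
print(other, used_other)
```

A file that starts with a BOM takes the first branch and is never guessed at, which is why the BOM still has a use for UTF-8 files.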

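Next, let's look at the difference between u'string' and 'string'.

The u prefix means the string that follows is a unicode string. Without u, the string is stored byte by byte, like the char type in C, using the encoding of the source file. The following program shows the difference: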

# -*- coding: utf-8 -*-
u1 = '漢'   # byte string (str)
u2 = u'漢'  # unicode string
print repr(u1)
print repr(u2)
print len(u1)
print len(u2)

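The output shows that '漢' is a byte string of length 3 (its UTF-8 bytes), while u'漢' is a unicode string of length 1. The difference resembles the one between the char and wchar_t data types in C/C++.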
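In Python 3 the same distinction is spelled bytes versus str. The following sketch uses Python 3 syntax, so it is not the article's original Python 2 code. It shows that the byte length depends on the encoding while the character count does not.

```python
text = '漢'                        # one Unicode character (str in Python 3)
utf8_bytes = text.encode('utf-8')  # comparable to Python 2 '漢' in a UTF-8 source file
gbk_bytes = text.encode('gbk')     # the same character in GBK

print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes in UTF-8
print(len(gbk_bytes))   # number of bytes in GBK
print(utf8_bytes.decode('utf-8') == text)
```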
