[Repost] Splitting a mixed Chinese and English string with Python

liyanrui posted about 1 year ago in Programming, tagged python, 614 views

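I solved this problem while writing a preprocessor for mkiv. The task is to split a string that mixes Chinese and English into its Chinese and English substrings; for example, to split "我的 english 學的不好" into the three substrings "我的", " english " and "學的不好".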

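The least painful way to handle a mixed Chinese and English string is to convert everything to Unicode, so that a Chinese character and an English letter occupy the same width in memory. Python is well suited to the job (the code in this post is Python 2): its internal encoding is Unicode, it has a built-in unicode type that reimplements all the methods of the string object in order to support Unicode string operations, and it provides the codecs module for decoding and encoding strings in all kinds of encodings.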
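As a rough illustration of what the codecs module offers, the sketch below round-trips one character through two encodings. It assumes Python 3, where str is already Unicode, whereas the post itself targets Python 2; the sample character is only illustrative.

```python
import codecs

text = "中"  # one Chinese character, already Unicode in Python 3

# Encoding turns text into bytes; the byte count depends on the encoding.
utf8_bytes = codecs.encode(text, "utf-8")
gbk_bytes = codecs.encode(text, "gbk")
print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching codec recovers the same text.
assert codecs.decode(utf8_bytes, "utf-8") == text
assert codecs.decode(gbk_bytes, "gbk") == text
```

The same character takes a different number of bytes in each encoding, which is why the post converts to Unicode before doing any character-level work.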

For example, the following Python code converts a UTF-8 encoded mixed Chinese and English string to Unicode:

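    # -*- coding:utf-8 -*-

    # a is a byte string holding UTF-8 encoded text (Python 2 str)
    a = "我的 english 學的不好"
    print type(a), len(a), a

    # decode the UTF-8 bytes into a unicode string
    b = unicode(a, "utf-8")
    print type(b), len(b), b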

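String a is UTF-8 encoded; Python's built-in unicode object converts it into the Unicode string b. Running the code above prints the output below. Comparing the lengths of the two strings, len(b) is clearly the sensible result: it counts 15 characters, whereas len(a) counts 27 bytes.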

    <type 'str'> 27 我的 english 學的不好
    <type 'unicode'> 15 我的 english 學的不好
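For readers on Python 3, the same comparison can be sketched as follows. This is an assumption-laden translation rather than the post's own code: in Python 3 the bytes type plays the role of the Python 2 str, and str plays the role of unicode.

```python
# Python 3 sketch: bytes stands in for the Python 2 str above,
# and str stands in for unicode.
a = "我的 english 學的不好".encode("utf-8")  # UTF-8 bytes
b = a.decode("utf-8")                          # Unicode text

print(type(a).__name__, len(a))  # length counted in bytes
print(type(b).__name__, len(b))  # length counted in characters
```

The six Chinese characters take three bytes each in UTF-8 and the nine ASCII characters take one each, which accounts for the difference between the two lengths.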

One thing to watch out for: although Unicode is billed as the "unified code", it still comes in two forms, namely UCS-2 and UCS-4.

The value of the variable maxunicode, provided by Python's sys module, tells you whether the current Python uses UCS-2 or UCS-4 Unicode.

    import sys
    print sys.maxunicode

If sys.maxunicode is 1114111, the build is UCS-4; if it is 65535, the build is UCS-2.
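The check above can be wrapped in a small helper, sketched here for Python 3. The function name unicode_build is hypothetical. Note that Python 3.3 and later always report 1114111, so the UCS-2 branch only matters for older builds; the last lines show why the distinction mattered for characters beyond U+FFFF.

```python
import sys

def unicode_build(maxunicode):
    """Name the build type implied by a sys.maxunicode value."""
    if maxunicode == 0x10FFFF:   # 1114111
        return "UCS-4"
    if maxunicode == 0xFFFF:     # 65535
        return "UCS-2"
    raise ValueError("unexpected maxunicode: %r" % maxunicode)

print(unicode_build(sys.maxunicode))

# A character beyond U+FFFF needs two 16-bit units in UTF-16, which is
# why a UCS-2 build sees it as two items rather than one.
ch = "\U00020000"
utf16_units = len(ch.encode("utf-16-le")) // 2
print(utf16_units)
```

On a UCS-2 build, a loop over the string therefore never yields a single item whose ord() reaches 0x20000, so range checks above U+FFFF only take effect on UCS-4 builds.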

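Once the Chinese and English text share one encoding, splitting them is simple. First, prepare one collector each for the Chinese and the English substrings; two empty string objects will do, say zh_gather and en_gather. Then prepare a list object that stores the values of zh_gather and en_gather in the order in which they are split off. The following Python function takes a mixed Chinese and English unicode string and returns the list holding the Chinese and English substrings.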

    def split_zh_en(zh_en_str):
        zh_en_group = []
        zh_gather = ""
        en_gather = ""
        zh_status = False  # True while inside a run of Chinese characters

        for c in zh_en_str:
            if not zh_status and is_zh(c):
                # English -> Chinese: flush the English run collected so far
                zh_status = True
                if en_gather != "":
                    zh_en_group.append([mark["en"], en_gather])
                    en_gather = ""
            elif not is_zh(c) and zh_status:
                # Chinese -> English: flush the Chinese run collected so far
                zh_status = False
                if zh_gather != "":
                    zh_en_group.append([mark["zh"], zh_gather])

            if zh_status:
                zh_gather += c
            else:
                en_gather += c
                zh_gather = ""

        # flush whichever run is still pending at the end of the string
        if en_gather != "":
            zh_en_group.append([mark["en"], en_gather])
        elif zh_gather != "":
            zh_en_group.append([mark["zh"], zh_gather])

        return zh_en_group

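In detail, the code above walks through the mixed string zh_en_str and classifies it character by character: a Chinese character is appended to zh_gather, and an English character (anything that is not Chinese) is appended to en_gather. zh_status records which kind of run is in progress; whenever its value flips, the Chinese or English substring collected so far is appended to zh_en_group.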
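The same run-splitting idea can also be sketched with itertools.groupby, which groups consecutive characters that share a key. This is an alternative technique, not the post's method, and it is written for Python 3. The names MARK, is_zh_basic and split_runs are hypothetical, the tag values are assumed, and the predicate is a deliberately narrow stand-in that only covers the main CJK block.

```python
from itertools import groupby

# Hypothetical tags; the values of the post's mark dict are not shown here.
MARK = {"zh": "zh", "en": "en"}

def is_zh_basic(c):
    """Stand-in predicate: main CJK Unified Ideographs block only."""
    return 0x4E00 <= ord(c) <= 0x9FFF

def split_runs(text, is_zh=is_zh_basic):
    """Group consecutive characters that agree on is_zh() into tagged runs."""
    return [
        [MARK["zh"] if zh else MARK["en"], "".join(chars)]
        for zh, chars in groupby(text, key=is_zh)
    ]

print(split_runs("我的 english 學的不好"))
```

A fuller predicate can be passed as the second argument. The explicit state machine in the post does the same job and makes the flush points easier to instrument.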

The condition that decides whether a character of zh_en_str is Chinese calls an is_zh() function, which is implemented as follows:

    def is_zh(c):
        x = ord(c)
        # Punct & Radicals
        if x >= 0x2e80 and x <= 0x33ff:
            return True
        # Fullwidth Latin Characters
        elif x >= 0xff00 and x <= 0xffef:
            return True
        # CJK Unified Ideographs &
        # CJK Unified Ideographs Extension A
        # (note: Extension A itself is U+3400..U+4DBF, which this range does not cover)
        elif x >= 0x4e00 and x <= 0x9fbb:
            return True
        # CJK Compatibility Ideographs
        elif x >= 0xf900 and x <= 0xfad9:
            return True
        # CJK Unified Ideographs Extension B
        elif x >= 0x20000 and x <= 0x2a6d6:
            return True
        # CJK Compatibility Supplement
        elif x >= 0x2f800 and x <= 0x2fa1d:
            return True
        else:
            return False

This code comes from the xetex preprocessor written by jjgod.
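One way to make the ranges easier to audit is to keep them as data, sketched below for Python 3. The names ZH_RANGES and is_zh_table are hypothetical; the numeric ranges are the ones used by is_zh() above, unchanged.

```python
# Same inclusive ranges as the is_zh() above, kept as data.
ZH_RANGES = (
    (0x2E80, 0x33FF),    # punctuation and radicals
    (0xFF00, 0xFFEF),    # fullwidth forms
    (0x4E00, 0x9FBB),    # CJK unified ideographs
    (0xF900, 0xFAD9),    # CJK compatibility ideographs
    (0x20000, 0x2A6D6),  # CJK unified ideographs extension B
    (0x2F800, 0x2FA1D),  # CJK compatibility supplement
)

def is_zh_table(c):
    """Return True if the code point of c falls in any listed range."""
    x = ord(c)
    return any(lo <= x <= hi for lo, hi in ZH_RANGES)

for ch in ("我", "，", "。", "a", " ", "\u3400"):
    print(repr(ch), is_zh_table(ch))
```

With these ranges, fullwidth punctuation such as "，" and "。" counts as Chinese, an ASCII space counts as English, and U+3400 (the first Extension A code point) is not matched.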

For convenience, I tag the separated Chinese and English substrings when storing them in the zh_en_group list, using mark["zh"] and mark["en"] respectively. mark is a dict object, defined as follows:

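    mark =    (the dict literal is missing from this copy of the post; the code above requires the keys "zh" and "en")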

When you later process the English or Chinese strings in zh_en_group, the tag lets you tell quickly whether a string is Chinese or English, for example:

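    for item in zh_en_group:
        # item[0] is the tag, item[1] is the substring
        if item[0] == mark["en"]:
            pass  # do something with the English substring
        else:
            pass  # do something with the Chinese substring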
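As a concrete instance of such a loop, the Python 3 sketch below upper-cases the English runs and leaves the Chinese runs alone. The tag values in mark and the hand-written zh_en_group list are assumptions made for illustration; they stand in for whatever split_zh_en() returns.

```python
# Hypothetical tag values and a hand-written result list for illustration.
mark = {"en": "en", "zh": "zh"}
zh_en_group = [
    [mark["zh"], "我的"],
    [mark["en"], " english "],
    [mark["zh"], "學的不好"],
]

# Upper-case the English runs and leave the Chinese runs untouched.
pieces = []
for tag, text in zh_en_group:
    if tag == mark["en"]:
        pieces.append(text.upper())
    else:
        pieces.append(text)

result = "".join(pieces)
print(result)
```

Because the list preserves the original order, joining the processed pieces rebuilds the full string.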
