Detecting Whether a Byte Stream Is UTF-8 Encoded

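A few days ago I happened to see someone post the question "How can I automatically tell whether the Chinese parameters in a URL are encoded in GB2312 or UTF-8?"

I also read wcwtitxu's algorithm, which detects UTF-8 encoding with a very impressive regular expression.

However, a regular expression built from a great many alternation (OR) branches does not perform well in practice.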

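First, the principle.

The UTF-8 encoding rules are usually presented as a table of bit patterns. They look complicated, but can be summarized as follows:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each, with the high bit set to 0.

All other characters are encoded by these rules: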

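• The first byte, written in binary, consists of n 1 bits followed by a 0 bit (n >= 2). The bits after that 0 store part of the actual character code, and n is the total number of bytes in the multi-byte sequence (including the first byte).

• It is followed by n - 1 bytes that each begin with 10; the low 6 bits of each store the rest of the character code.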

So by checking the whole byte stream against these patterns, we can judge whether it is UTF-8 encoded.
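As a worked illustration of the rule above, the following sketch prints the bits of each UTF-8 byte of one sample character. The class name `Utf8BitsDemo`, the helper `ToBitStrings`, and the sample character U+4E2D are illustrative assumptions, not part of the original post.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical demo class, used only to illustrate the bit patterns.
public static class Utf8BitsDemo
{
    // Returns each UTF-8 byte of the text as an 8-character binary string.
    public static string[] ToBitStrings(string text)
    {
        return Encoding.UTF8.GetBytes(text)
            .Select(b => Convert.ToString(b, 2).PadLeft(8, '0'))
            .ToArray();
    }

    public static void Main()
    {
        // U+4E2D takes three bytes: a lead byte of the form 1110xxxx (n = 3)
        // followed by two continuation bytes of the form 10xxxxxx.
        Console.WriteLine(string.Join(" ", ToBitStrings("\u4E2D")));
        // prints: 11100100 10111000 10101101
    }
}
```

The lead byte starts with three 1 bits and a 0, so the sequence is three bytes long in total, and the two bytes after it both start with 10.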

Based on this rule, my C# code is as follows:

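/// <summary>
///   Determines whether the given bytes are UTF-8 encoded.
/// </summary>
/// <param name="inputStream">The input stream.</param>
/// <returns><c>true</c> if the given byte stream is in UTF-8 encoding; otherwise, <c>false</c>.</returns>
/// <remarks>A stream containing only ASCII chars is regarded as not UTF-8 encoded.</remarks>
public static bool IsTextUTF8(ref byte[] inputStream)
{
    int encodingBytesCount = 0; // continuation bytes still expected
    bool allTextsAreAsciiChars = true;
    foreach (byte b in inputStream)
    {
        byte current = b;
        if ((current & 0x80) == 0x80) allTextsAreAsciiChars = false;
        // First byte
        if (encodingBytesCount == 0)
        {
            if ((current & 0x80) == 0) continue; // ASCII char, 0x00-0x7F
            if ((current & 0xC0) == 0xC0)
            {
                // Each leading 1 bit after the first announces one continuation byte.
                encodingBytesCount = 1;
                current <<= 2;
                while ((current & 0x80) == 0x80)
                {
                    current <<= 1;
                    encodingBytesCount++;
                }
                if (encodingBytesCount > 3) return false; // RFC 3629 allows at most 4 bytes per character
            }
            else return false; // a 10xxxxxx byte cannot start a sequence
        }
        else
        {
            // Following bytes must start with 10.
            if ((current & 0xC0) == 0x80) encodingBytesCount--;
            else return false;
        }
    }
    // The stream must not end in the middle of a multi-byte sequence.
    if (encodingBytesCount != 0) return false;
    // Although UTF-8 can encode ASCII chars, an input stream whose contents are all ASCII is regarded as the default encoding.
    return !allTextsAreAsciiChars;
}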

And here is the unit test code:

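using System.Collections.Generic;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
///   This is a test class for EncodingHelperTest and is intended
///   to contain all EncodingHelperTest unit tests.
/// </summary>
[TestClass()]
public class EncodingHelperTest
{
    /// <summary>Check with non-ASCII chars.</summary>
    [TestMethod]
    public void IsTextUTF8Test()
    {
        // The sample text is an assumption; the original input generation is not shown.
        List<char> chars = new List<char>("日本語");
        string str = new string(chars.ToArray());
        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool expected = true;
        bool actual;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, string.Format("UTF8_Assert fails at:{0}", str));
        // Code page 932 is Shift-JIS (on .NET Core / .NET 5+, register CodePagesEncodingProvider first).
        inputStream = Encoding.GetEncoding(932).GetBytes(str);
        expected = false;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert fails at:{0}", str));
    }

    /// <summary>Check with all ASCII chars.</summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        List<char> chars = new List<char>();
        for (int i = 0; i < 128; i++) chars.Add((char)i);
        string str = new string(chars.ToArray());
        byte[] inputStream = Encoding.ASCII.GetBytes(str);
        Assert.AreEqual(false, EncodingHelper.IsTextUTF8(ref inputStream), string.Format("ASCII_Assert fails at:{0}", str));
    }
}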

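Also: if you only need to decide whether a file is UTF-8 encoded, you do not necessarily need this method, because files saved as UTF-8 often begin with a BOM (the three bytes EF BB BF) that marks the file as UTF-8. The BOM is optional, though, so a file without one may still be UTF-8.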
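A minimal sketch of such a BOM check follows. The class name `BomDemo` and the method `StartsWithUtf8Bom` are hypothetical names chosen for this illustration; `Encoding.UTF8.GetPreamble()` is the .NET call that returns the UTF-8 BOM bytes.

```csharp
using System;
using System.Text;

// Hypothetical helper, used only to illustrate the BOM check.
public static class BomDemo
{
    // Returns true when the data starts with the UTF-8 BOM (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x68, 0x69 };
        byte[] withoutBom = { 0x68, 0x69 };
        Console.WriteLine(StartsWithUtf8Bom(withBom));    // True
        Console.WriteLine(StartsWithUtf8Bom(withoutBom)); // False
    }
}
```

A result of `false` here only means no BOM was found; the content would still need to be checked byte by byte to decide whether it is UTF-8.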

Reference: Wikipedia
