Handling Chinese Text in C/C++ on Linux

Introduction to UTF-8

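First, we can assume that the strings we receive are UTF-8 encoded. On a local machine this can be guaranteed through the environment configuration. Run the locale command on the command line; LC_CTYPE should be a UTF-8 locale. Open a file in vim and type the :set command; the output should include a line reading fileencoding=utf-8. That gives us a basis to work from.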

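UTF-8 is an encoding of the Unicode character set. It is a variable-length encoding: each Unicode character is encoded as 1 to 4 bytes. As a working approximation, an English (ASCII) character takes 1 byte in UTF-8 and a Chinese character takes 3 bytes.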
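
To make the variable-length scheme concrete, here is a minimal sketch that counts characters in a UTF-8 string by skipping continuation bytes. The name utf8_length is hypothetical (it is not a library function), and the sketch assumes well-formed input without validating it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts characters (code points) in a UTF-8 string.
// Every byte of a multi-byte sequence except the first has the bit pattern
// 10xxxxxx, so counting the bytes that do NOT match it counts characters.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {  // range-based for needs C++11
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the two Chinese characters U+4E2D U+6587, the std::string holds 6 bytes while utf8_length reports 2.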

For a detailed introduction to UTF-8, see:

What is the difference between Unicode and UTF-8? (Zhihu)

UTF-8 (Wikipedia)

Design Approach

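Knowing the byte lengths of Chinese and English characters in UTF-8, one scheme suggests itself: walk the string and check whether the current byte belongs to a Chinese or an English character. For an English character, add one to the length and continue from the next byte; for a Chinese character, add one to the length and skip ahead three bytes. Once the required text length is reached, break out of the loop and return the substring traversed so far.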

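But an implementation like this feels hacky and brute-force, and it is easy to get wrong. Besides, the 1-byte/3-byte assumption only holds in the common case: if a four-byte character ever shows up, unlikely as that is, the program goes badly wrong.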
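
For comparison, the byte-scanning approach can be sketched so that it reads the sequence length from the lead byte instead of assuming 1 or 3 bytes, which covers the four-byte case. The name truncate_utf8 is hypothetical, and the sketch assumes well-formed UTF-8 input and does not validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: keeps at most max_chars characters of a UTF-8 string.
// Assumes well-formed input; an unexpected lead byte is treated as 1 byte.
std::string truncate_utf8(const std::string& s, std::size_t max_chars) {
    std::size_t i = 0;      // byte offset into s
    std::size_t chars = 0;  // characters consumed so far
    while (i < s.size() && chars < max_chars) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;                      // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) len = 2;       // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        if (i + len > s.size()) break;  // incomplete trailing sequence: drop it
        i += len;
        ++chars;
    }
    return s.substr(0, i);
}
```

Even in this form the bit masks are easy to get wrong, which is the motivation for looking at a fixed-width representation next.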

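If there were an encoding or data type in which every character, Chinese or English, occupied the same width, the processing would be much simpler. This is where the C++ wstring type comes in: the size() function of a wstring returns the number of wchar_t elements, which on Linux equals the number of Chinese and English characters it contains. Like string, wstring is based on the basic_string class template; the difference is that string uses char as its element type while wstring uses wchar_t. wchar_t can store Unicode characters. It is two bytes on Windows and four bytes on Linux, and sizeof(wchar_t) shows the size directly.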
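
As a small illustration of the fixed-width idea, the sketch below builds a wide string from two Chinese characters (written as universal character names so that the source file encoding does not matter) plus two ASCII letters, and returns its size. The function name sample_wide_length is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical demo: U+4E2D and U+6587 followed by "ab".
// Each character occupies exactly one wchar_t, so size() counts characters.
std::size_t sample_wide_length() {
    const std::wstring w = L"\u4e2d\u6587" L"ab";
    return w.size();  // 4
}
```

Printing sizeof(wchar_t) next to this is a quick way to confirm the element width on a given platform.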

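At this point we have the basic idea: implement conversion between string and wstring in both directions, use the wstring to count characters, and truncate when the text is too long.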

Converting Between string and wstring

Conversion, Version 1

If your g++ is recent enough (5.0 or later), you can write it as follows, which is the cleanest option:

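#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> wide string
std::wstring s2ws(const std::string& str) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(str);
}

// wide string -> UTF-8 bytes
std::string ws2s(const std::wstring& wstr) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wstr);
}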

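std::wstring_convert is the C++11 standard library facility for converting between string and wstring, giving Unicode support at the language and library level. However, GCC/g++ only supports it from version 5.0 onward. Note also that it has been deprecated since C++17, so newer compilers may emit a warning.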
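
One detail worth knowing when using std::wstring_convert: from_bytes throws std::range_error when the input is not valid in the source encoding. The sketch below wraps the conversion so the failure is reported through the return value. The name try_s2ws is hypothetical, not part of the standard library.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: converts UTF-8 to a wide string and reports
// malformed input through the return value instead of an exception.
bool try_s2ws(const std::string& utf8, std::wstring* out) {
    assert(out != nullptr);
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        *out = conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;  // from_bytes throws std::range_error on bad input
    }
}
```

Whether to swallow the exception like this or let it propagate is a design choice; for text coming from untrusted sources, handling the failure explicitly is usually the safer default.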

References:

How to convert wstring into string? (Stack Overflow)

std::wstring_convert (cppreference)

std::wstring_convert (cplusplus.com)

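If your g++ supports only part of C++11, a second version can use unique_ptr to manage the memory. This avoids the awkwardness of handling raw pointers directly and makes the program safer.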

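#include <locale.h>
#include <stdlib.h>
#include <memory>
#include <string>

std::wstring s2ws(const std::string& str) {
    size_t len = str.size() + 1;
    setlocale(LC_CTYPE, "en_US.UTF-8");
    std::unique_ptr<wchar_t[]> p(new wchar_t[len]);
    // mbstowcs returns (size_t)-1 if the input is invalid in this locale
    if (mbstowcs(p.get(), str.c_str(), len) == static_cast<size_t>(-1)) {
        return std::wstring();
    }
    std::wstring w_str(p.get());
    return w_str;
}

std::string ws2s(const std::wstring& w_str) {
    size_t len = w_str.size() * 4 + 1;
    setlocale(LC_CTYPE, "en_US.UTF-8");
    std::unique_ptr<char[]> p(new char[len]);
    // wcstombs returns (size_t)-1 if a character cannot be converted
    if (wcstombs(p.get(), w_str.c_str(), len) == static_cast<size_t>(-1)) {
        return std::string();
    }
    std::string str(p.get());
    return str;
}

The length of the array passed to new needs some thought. Each character takes between 1 and 4 bytes in UTF-8 and exactly one 4-byte wchar_t on Linux. So for s2ws, the length of the wstring is always less than or equal to the length of the string, and for ws2s, the length of the string is always less than or equal to 4 times the length of the wstring. The +1 reserves room for the string terminator '\0'.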

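The setlocale function sets the locale used at run time. You can inspect the current system locale settings with the locale command; LC_CTYPE covers character classification and conversion. Many versions found online use setlocale(LC_CTYPE, "");, where the empty string as the second argument selects the system's current default locale. The problem is that a program may run correctly on your own machine and then fail on a server, because the server's locale is not necessarily UTF-8. So here we force it to en_US.UTF-8. That locale must be installed on the target machine; otherwise setlocale returns NULL and the locale is left unchanged.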
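
Because setlocale fails silently unless its return value is checked, a defensive variant might try more than one locale name. The sketch below is one way to do that. The helper name use_utf8_ctype and the fallback name "C.UTF-8" are assumptions: that locale exists on many recent Linux distributions but not on all systems.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>

// Hypothetical helper: tries UTF-8 locale names in order and returns the
// first one the system accepts, or NULL if none of them is installed.
const char* use_utf8_ctype() {
    // "C.UTF-8" is an assumed fallback; adjust the list for your systems.
    static const char* const kCandidates[] = {"en_US.UTF-8", "C.UTF-8"};
    const std::size_t count = sizeof(kCandidates) / sizeof(kCandidates[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (std::setlocale(LC_CTYPE, kCandidates[i]) != NULL) {
            return kCandidates[i];
        }
    }
    return NULL;  // LC_CTYPE is left unchanged in this case
}
```

A caller would typically invoke this once at startup and log or abort when it returns NULL, rather than calling setlocale inside every conversion.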

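mbstowcs and wcstombs are two C functions that convert between multibyte strings and wide-character strings. Both depend on the character encoding specified by the current locale.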
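
As an alternative to sizing the buffer by a worst-case bound, POSIX specifies that mbstowcs called with a null destination returns the number of wide characters required. The sketch below relies on that POSIX behavior and checks the (size_t)-1 error return. The name mbs_to_wide is hypothetical, and the sketch assumes the caller has already selected a suitable LC_CTYPE locale.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical variant of the conversion: asks mbstowcs for the exact
// length first. Returns false when the input is not valid in the encoding
// of the current LC_CTYPE locale, which the caller must have set already.
bool mbs_to_wide(const std::string& in, std::wstring* out) {
    assert(out != NULL);
    const std::size_t kError = static_cast<std::size_t>(-1);
    std::size_t n = std::mbstowcs(NULL, in.c_str(), 0);  // POSIX length query
    if (n == kError) return false;
    std::vector<wchar_t> buf(n + 1);  // +1 for the terminating L'\0'
    if (std::mbstowcs(&buf[0], in.c_str(), n + 1) == kError) return false;
    out->assign(&buf[0], n);
    return true;
}
```

Like the other mbstowcs-based versions, this stops at the first '\0' byte in the input, since it goes through c_str().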

If your g++ does not even support unique_ptr, the only option is plain new/delete, as below.

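#include <locale.h>
#include <stdlib.h>
#include <string>

std::wstring s2ws(const std::string& str) {
    size_t len = str.size() + 1;
    setlocale(LC_CTYPE, "en_US.UTF-8");
    wchar_t* p = new wchar_t[len];
    size_t n = mbstowcs(p, str.c_str(), len);
    // On failure mbstowcs returns (size_t)-1; produce an empty string then.
    std::wstring w_str(p, n == static_cast<size_t>(-1) ? 0 : n);
    delete[] p;  // memory from new[] must be released with delete[]
    return w_str;
}

std::string ws2s(const std::wstring& w_str) {
    size_t len = w_str.size() * 4 + 1;
    setlocale(LC_CTYPE, "en_US.UTF-8");
    char* p = new char[len];
    size_t n = wcstombs(p, w_str.c_str(), len);
    // On failure wcstombs returns (size_t)-1; produce an empty string then.
    std::string str(p, n == static_cast<size_t>(-1) ? 0 : n);
    delete[] p;
    return str;
}

With the conversion between string and wstring in place, the rest is simple. Implement the processing function formattext, then add a main function to test it. The complete code follows: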

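#include <locale.h>
#include <stdlib.h>
#include <iostream>
#include <string>

// maximum number of characters to keep
static const size_t ktextsize = 10;

std::wstring s2ws(const std::string& str) {
    size_t len = str.size() + 1;
    setlocale(LC_CTYPE, "");
    wchar_t* p = new wchar_t[len];
    size_t n = mbstowcs(p, str.c_str(), len);
    // On failure mbstowcs returns (size_t)-1; produce an empty string then.
    std::wstring w_str(p, n == static_cast<size_t>(-1) ? 0 : n);
    delete[] p;  // memory from new[] must be released with delete[]
    return w_str;
}

std::string ws2s(const std::wstring& w_str) {
    size_t len = w_str.size() * 4 + 1;
    setlocale(LC_CTYPE, "");
    char* p = new char[len];
    size_t n = wcstombs(p, w_str.c_str(), len);
    std::string str(p, n == static_cast<size_t>(-1) ? 0 : n);
    delete[] p;
    return str;
}

bool formattext(std::string* txt) {
    std::cout << "before:" << *txt << std::endl;
    std::wstring w_txt = s2ws(*txt);
    std::cout << "wstring size:" << w_txt.size() << std::endl;
    std::cout << "string size:" << (*txt).size() << std::endl;
    if (w_txt.size() > ktextsize) {
        // keep only the first ktextsize characters
        *txt = ws2s(w_txt.substr(0, ktextsize));
    }
    std::cout << "after:" << *txt << std::endl;
    return true;
}

int main() {
    // sample input, longer than ktextsize characters
    std::string txt = "Linux C 中文處理測試";
    formattext(&txt);
    return 0;
}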
