Similarities and Differences Between Python and R (2): String Operations

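R was originally designed for matrix computation and similar mathematical work, so its string handling is relatively weak. Python is a general-purpose language with no particular bias toward numerical computing, and its string handling is good. Years ago I heard that it still fell short of Perl in this area; I do not know how the two compare today.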

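In R, strings always live in a vector, no matter how many there are; more elaborate containers are array, matrix, data.frame, and list. Python has a dedicated string type (str), and to hold several strings you use a container type such as list, dict, tuple, or set.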

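Below are some ways of creating strings by hand. R is essentially missing just one thing that Python has, the ''' syntax, which Python uses to assign multi-line strings (an ordinary quoted R string can itself span several lines, so little is lost). R has no multi-line comments either, so this is not surprising.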

# r
s <- 'abc'; s <- "abc"; s <- "s'b"; s <- 's\'b'
ss <- c('abc', 'efg')
ss_matrix <- matrix(c('ab', 'bc', 'cd', 'de'), nrow = 2)
ss_list <- list('a', 'b', 'c', 'd')
# python
ss = 'abc'; ss = "abc"; ss = "what's your name"; ss = 'what\'s your name'; ss = '''what's your name; "sb?" '''
ss_list = ['abc', 'edf']
ss_dict= 
ss_tuple = ('a','b')
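
To make the choice of container a little more concrete, here is a minimal sketch of the four Python containers mentioned earlier, each holding strings. All variable names, values, and dictionary keys are made up for illustration and do not come from the original post.

```python
# Hypothetical example values; none of these come from the original post.
names_list = ['abc', 'edf', 'abc']               # ordered, mutable, duplicates allowed
names_tuple = ('a', 'b')                         # ordered, immutable
names_set = set(names_list)                      # unordered, duplicates removed
names_dict = {'first': 'abc', 'second': 'edf'}   # key -> string lookup

# list and tuple are sequences, so they support positional indexing
first_item = names_list[0]
last_item = names_tuple[-1]

# a dict is indexed by key; a set only answers membership questions
second_value = names_dict['second']
has_abc = 'abc' in names_set
```

Only list and tuple are sequences in the strict sense; dict and set are containers without positional indexing, which matters once you start slicing.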

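When the data come from a text file, Python first uses open to create a file object. Because a file object is iterable, its lines can then be stored in a container such as a list: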

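# iterate over the file object directly
lines = [line for line in open('file.txt', 'r')]
# or let a with block close the file automatically
with open('file.txt', 'r') as f:
    strings = f.readlines()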
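
The two approaches can be compared with a self-contained sketch. It writes a small throwaway file first so that it runs anywhere; the file contents and variable names are assumptions made for illustration only.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
# The file contents are made up for illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'file.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

# readlines() keeps the trailing newline on every element
with open(path, 'r', encoding='utf-8') as f:
    raw_lines = f.readlines()

# iterating over the file object lets you strip the newline as you go
with open(path, 'r', encoding='utf-8') as f:
    clean_lines = [line.rstrip('\n') for line in f]
```

Each element returned by readlines() still ends with its newline character, so stripping it is usually the first step before any further string work.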

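R was built for data analysis, so it normally reads multi-column data with the read.table family of functions, which return the data.frame objects used in later steps. For ordinary text files, the counterparts of Python's open and readlines are file and readLines. Mind the capital L and the plural: readline is a different function, which reads one line of interactive input from the console.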

# similar to open
filea <- file("text.txt", "r")
# similar to Python's readline and readlines
# n sets how many lines to read, e.g. readLines(filea, n = 1); all lines are read by default
text <- readLines(filea)
close(filea)
length(text)

Each element of text holds one line of text.txt.

As an exercise, try reading a FASTA file and storing it as an R list.
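
The exercise is meant for R, but a Python version of the same idea may help as a point of comparison. The sketch below is only one possible approach: the function name parse_fasta and the sample records are hypothetical, and it assumes the simple convention that a line starting with > opens a new record.

```python
def parse_fasta(lines):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # a sequence may span several lines
    # join the pieces of each sequence into one string
    return {name: ''.join(parts) for name, parts in records.items()}


# Made-up FASTA content for illustration
example = ['>seq1 demo', 'ACGT', 'TTGA', '', '>seq2', 'GGCC']
fasta = parse_fasta(example)
```

Because the function only needs an iterable of lines, an open file object can be passed in place of the example list.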

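R was not built for text processing, so it is understandable that its built-in facilities are modest. The basic functions are roughly these:

nchar(): returns the length of a string

paste(), paste0(): concatenate several strings

sprintf(): formatted output

toupper(): convert to upper case

tolower(): convert to lower case

substr(): extract or replace substrings of a character vector

Regular-expression functions such as grep, grepl, regexpr, gregexpr, sub, gsub, and strsplit.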
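
For readers coming from R, the base functions listed above map onto Python roughly as follows. This is a sketch rather than an exact equivalence table: the sample string is made up, and the R call in each comment is the closest counterpart, not a guaranteed identical behavior.

```python
import re

s = 'Hello, World'   # made-up sample string

length = len(s)                                 # nchar(s)
joined = ' '.join(['abc', 'efg'])               # paste('abc', 'efg')
joined0 = ''.join(['abc', 'efg'])               # paste0('abc', 'efg')
formatted = '%s has %d chars' % (s, len(s))     # sprintf('%s has %d chars', s, nchar(s))
upper = s.upper()                               # toupper(s)
lower = s.lower()                               # tolower(s)
sub = s[0:5]                                    # substr(s, 1, 5): R is 1-based and inclusive

first_only = re.sub('o', '0', s, count=1)       # sub('o', '0', s)
all_hits = re.sub('o', '0', s)                  # gsub('o', '0', s)
has_world = re.search('World', s) is not None   # grepl('World', s)
parts = re.split(r',\s*', s)                    # strsplit(s, ',\\s*')[[1]]
```

The main trap is indexing: Python slices are 0-based and exclude the end position, whereas substr in R is 1-based and includes it.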

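Later Hadley Wickham, apparently unable to put up with this any longer, wrote stringr to strengthen string handling in R, and it works remarkably well.

The stringr functions fall into four main groups:

Character manipulation, operating on individual characters within a character vector: str_length, str_sub, str_dup

Adding, removing, and manipulating whitespace: str_pad, str_trim, str_wrap

Case conversion: str_to_lower, str_to_upper, str_to_title

Pattern matching: str_detect, str_subset, str_count, str_locate, str_locate_all, str_match, str_match_all, str_replace, str_replace_all, str_split_fixed, str_split, str_extract, str_extract_all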
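
Most of the stringr groups above have close counterparts in Python's str methods and in the re and textwrap modules. The sketch below pairs a few of them; the sample strings are invented, and the stringr call in each comment is an approximate match (str_wrap in particular may break lines slightly differently).

```python
import re
import textwrap

s = '  abc  '   # made-up sample string

trimmed = s.strip()                                # str_trim(s)
padded = 'abc'.rjust(6)                            # str_pad('abc', 6), pads on the left by default
repeated = 'abc' * 2                               # str_dup('abc', 2)
wrapped = textwrap.fill('aa bb cc dd', width=5)    # str_wrap('aa bb cc dd', 5), roughly
title = 'hello world'.title()                      # str_to_title('hello world')

words = ['apple', 'banana', 'cherry']
detected = [re.search('an', w) is not None for w in words]   # str_detect(words, 'an')
subset = [w for w in words if re.search('an', w)]            # str_subset(words, 'an')
counts = [len(re.findall('a', w)) for w in words]            # str_count(words, 'a')
extracted = re.findall(r'[a-z]+', 'ab12cd')                  # str_extract_all('ab12cd', '[a-z]+')[[1]]
replaced = re.sub('a', 'A', 'banana')                        # str_replace_all('banana', 'a', 'A')
```

Note that the pattern-matching lines need an explicit loop over words in Python, while the stringr versions accept the whole vector at once.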

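Python's str type comes with many methods of its own, and standard-library modules add more: re provides regular expressions, and string supplies some further helpers.

Use dir to see which methods the string type offers: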

dir(str)

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
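
The raw dir output mixes special methods with everyday ones. A small filter, shown here as a sketch with made-up variable names, makes the list easier to scan:

```python
# Keep only the public methods (names that do not start with an underscore)
public_methods = [name for name in dir(str) if not name.startswith('_')]

# The is* methods are predicates that test what a string contains
predicates = [name for name in public_methods if name.startswith('is')]
```

From there, help(str.title) or any other method name prints the documentation for a single entry.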

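For operations on a single string, R and Python have functions that correspond almost one to one: R's str_to_lower(), str_to_upper(), and str_to_title() match Python's lower(), upper(), and title(), and R's pattern-matching functions match Python's re module.

For a collection of strings, however, keep in mind that R functions are vectorized, whereas in Python you typically apply the method to each element yourself, most often with a list comprehension. For example: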

# r
library(stringr)
ss <- c('abc', 'efg')
str_to_upper(ss)
# python
ss = ['abc', 'efg']
[string.upper() for string in ss]
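
As a small extension of the Python half of this example, the sketch below shows that map is an alternative to the list comprehension, and that calling the method on the list itself fails. The variable names are chosen for illustration.

```python
ss = ['abc', 'efg']

# list comprehension, as in the example above
upper_comp = [s.upper() for s in ss]

# map applies str.upper to every element; list() materialises the result
upper_map = list(map(str.upper, ss))

# a method call on the list itself does not work: lists have no upper()
try:
    ss.upper()
    list_has_upper = True
except AttributeError:
    list_has_upper = False
```

Both forms give the same result; the comprehension is usually preferred once the per-element expression grows beyond a single method call.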
