(pandas) Cleaning Comment Data

# Drop rows whose comment is missing
df = df.dropna(subset=['comment'])

# Use user_id and comment together as the key: when both values match an earlier row, keep only the first occurrence.
df.drop_duplicates(subset=['user_id', 'comment'], keep='first', inplace=True)
# Reset the index
df.reset_index(drop=True, inplace=True)
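To make the effect of these first steps concrete, here is a minimal sketch on a made-up DataFrame. The sample rows and the helper name `drop_missing_and_duplicates` are assumptions for illustration and are not part of the original data.

```python
import pandas as pd


def drop_missing_and_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with no comment, then rows repeating the same (user_id, comment) pair."""
    out = df.dropna(subset=['comment'])
    out = out.drop_duplicates(subset=['user_id', 'comment'], keep='first')
    return out.reset_index(drop=True)


# Hypothetical sample data: user 1 posts 'good' twice, user 3 has no comment
sample = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 1],
    'comment': ['good', 'good', 'good', None, 'bad'],
})

cleaned = drop_missing_and_duplicates(sample)
print(cleaned)
```

The same comment from two different users is kept, because duplicates are judged on the pair of columns rather than on `comment` alone.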

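# Replace comments made up only of digits (e.g. '123') with an empty string.
# regex=True is passed explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default.
df['comment'] = df['comment'].str.replace(r'^[0-9]*$', '', regex=True)

# Replace comments made up of a single repeated character (e.g. '111', 'aaa', '....') with an empty string
df['comment'] = df['comment'].str.replace(r'^(.)\1*$', '', regex=True)

# Remove timestamps such as '2020/11/20 20:00:00'
df['comment'] = df['comment'].str.replace(r'\d+/\d+/\d+ \d+:\d+:\d+', '', regex=True)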
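The three patterns can be checked on a small Series before running them on real data. The sketch below uses invented sample comments; only the regular expressions come from the steps above.

```python
import pandas as pd

# Hypothetical sample comments
comments = pd.Series([
    '123',
    'aaa',
    '....',
    'posted 2020/11/20 20:00:00 ok',
    'nice phone',
])

# Digits only -> empty string
step1 = comments.str.replace(r'^[0-9]*$', '', regex=True)
# One character repeated for the whole comment -> empty string
step2 = step1.str.replace(r'^(.)\1*$', '', regex=True)
# Timestamp anywhere in the comment -> removed, the rest of the text is kept
step3 = step2.str.replace(r'\d+/\d+/\d+ \d+:\d+:\d+', '', regex=True)

print(step3.tolist())
```

Note that `^(.)\1*$` also matches a comment consisting of a single character, so one-character comments are emptied as well. Whether that is desirable depends on the dataset.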

4. Compress runs of repeated characters at the start or end of a comment

Effect: 'aaabdc' -> 'abdc'

'很好好好好' -> '很好'

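# df_comment is assumed to be the comment column as a Series with the default 0..n-1 index
# Strip a trailing run of one repeated character; what is left is the prefix
prefix_series = df_comment.str.replace(r'(.)\1+$', '', regex=True)
# Strip a leading run of one repeated character; what is left is the suffix
suffix_series = df_comment.str.replace(r'^(.)\1+', '', regex=True)
for index in range(len(df_comment)):
    # Trailing run: keep a single copy of the repeated character (e.g. 'bdcaaa' -> 'bdca')
    if prefix_series[index] != df_comment[index]:
        char = df_comment[index][-1]
        df_comment[index] = prefix_series[index] + char
    # Leading run: keep a single copy of the repeated character (e.g. 'aaabdc' -> 'abdc')
    # Because of the elif, a comment with runs at both ends only has the trailing run compressed
    elif suffix_series[index] != df_comment[index]:
        char = df_comment[index][0]
        df_comment[index] = char + suffix_series[index]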
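As an alternative to the row-by-row loop, the same compression can be sketched with a backreference in the replacement string, which keeps one copy of the repeated character directly. This is a swapped-in technique, not the original author's method, and the helper name `compress_edge_runs` is hypothetical. Unlike the loop, it compresses both ends when a comment has a run at each end.

```python
import pandas as pd


def compress_edge_runs(comments: pd.Series) -> pd.Series:
    """Collapse a run of one repeated character at the start and at the end to a single character."""
    # Leading run: replace the whole run with the captured character
    out = comments.str.replace(r'^(.)\1+', r'\1', regex=True)
    # Trailing run: same idea, anchored at the end of the string
    return out.str.replace(r'(.)\1+$', r'\1', regex=True)


# Hypothetical sample comments
sample = pd.Series(['aaabdc', 'bdcaaa', '很好好好好', 'aabcc', 'abc'])
result = compress_edge_runs(sample)
print(result.tolist())
```

Because the work is done by two vectorized `str.replace` calls, no integer-position indexing is needed, so the Series does not have to carry a freshly reset index.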

Convert empty strings to np.nan, then remove them with dropna()

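import numpy as np

# Turn empty or whitespace-only comments into NaN.
# The result is assigned back, since an inplace replace on a selected column may not update df.
df['comment'] = df['comment'].replace(to_replace=r'^\s*$', value=np.nan, regex=True)
# Drop rows whose comment is now missing, then reset the index
df = df.dropna(subset=['comment'])
df.reset_index(drop=True, inplace=True)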
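A quick way to confirm this final step behaves as intended is to run it on a tiny frame containing an empty and a whitespace-only comment. The sample rows below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 1 and 2 hold comments that were blanked by earlier steps
sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'comment': ['great', '', '   ', 'ok'],
})

# Empty or whitespace-only strings become NaN so that dropna() can remove them
sample['comment'] = sample['comment'].replace(r'^\s*$', np.nan, regex=True)
sample = sample.dropna(subset=['comment']).reset_index(drop=True)
print(sample)
```

Only the rows for users 1 and 4 should remain, renumbered from 0.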

