Cleaning Mutation Data

mutation_data_tidy

perl

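A recent project happened to need mutation data, so here is a brief summary, offered as a starting point for discussion.

The manifest.txt file lists every sample file and looks like this: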

## 6e92e6639f6c1f76588ae54777f1c289 tcga-g2-a2ef-01.maf.txt
## 4acdd7c2af2f0bd89e4562759113b8ef tcga-fd-a3na-01.maf.txt
## b47e7e2572ff8155cfac01a4381aaa2a tcga-dk-a1a5-01.maf.txt
## 60232e72368959618ddc019ad4e66b9a tcga-bt-a20u-01.maf.txt
## f98a76eddbc68f9a1969551057a68bb0 tcga-g2-a2ej-01.maf.txt
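As a rough sketch of how a sample ID could be pulled out of a manifest line: this assumes each line holds a checksum followed by a file name, that the leading ## in the listing is display output rather than part of the file, and that the sample ID is the file name minus its -01.maf.txt suffix. The helper name sample_id_from_file is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the "-01.maf.txt" suffix to get the sample ID.
# Assumption: every file name ends in exactly that suffix.
sample_id_from_file() {
    local file_name=$1
    printf '%s\n' "${file_name%-01.maf.txt}"
}

# One manifest line: checksum, then file name (taken from the listing above).
line="6e92e6639f6c1f76588ae54777f1c289 tcga-g2-a2ef-01.maf.txt"
file_name=$(printf '%s\n' "$line" | awk '{print $2}')
sample_id_from_file "$file_name"
```

If the files carry other suffixes, taking a fixed-length prefix of the name may be the safer choice.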

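Take one of the files as an example. Of its contents, only columns 2 and 9 are useful (sadly, the rest do not matter much here).

Next, we use manifest.txt to loop over all the sample files and output the genes that carry non-synonymous mutations. Here is a simple script: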
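To make the column selection concrete, here is a minimal sketch on made-up data. It assumes a tab-separated file with a header row, the gene ID in column 2 and the mutation classification in column 9. The helper name non_silent_genes and the demo rows are invented for illustration and are not taken from a real sample file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print column 2 (gene ID) of every row whose
# column 9 (classification) is not "silent".
non_silent_genes() {
    local maf_file=$1
    # NR > 1 skips the header; tolower() makes the comparison case-insensitive.
    awk -F'\t' 'NR > 1 && tolower($9) != "silent" {print $2}' "$maf_file"
}

# Tiny made-up input with nine tab-separated columns (c3..c8 are filler).
demo_maf=$(mktemp)
printf 'symbol\tentrez\tc3\tc4\tc5\tc6\tc7\tc8\tclassification\n' > "$demo_maf"
printf 'GENE_A\t111\t.\t.\t.\t.\t.\t.\tmissense_mutation\n' >> "$demo_maf"
printf 'GENE_B\t222\t.\t.\t.\t.\t.\t.\tsilent\n' >> "$demo_maf"
printf 'GENE_C\t333\t.\t.\t.\t.\t.\t.\tnonsense_mutation\n' >> "$demo_maf"

non_silent_genes "$demo_maf"
```

Unlike a plain grep -v silent, this only inspects column 9, so a row is not dropped just because the word appears in some other field.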

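#!/usr/bin/env bash
# Note: the awk programs and the sample_id expansion were missing from the
# original listing; the versions below are reconstructed from the surrounding text.
cd path
if [ -d result ]
then echo "result exists"
else
mkdir result
fi
# Column 2 of manifest.txt is the file name.
for file_name in $(awk '{print $2}' manifest.txt)
do
# Assumption: the sample ID is the first 12 characters, e.g. tcga-g2-a2ef.
sample_id=${file_name:0:12}
# Keep columns 2 and 9, drop silent mutations and the header row,
# append the sample ID, then write gene ID and sample ID (\t in sed needs GNU sed).
awk -F'\t' '{print $2"\t"$9}' "$file_name" | grep -vi silent | sed -e "1d" | sed -e "s/$/\t${sample_id}/" | awk '{print $1"\t"$3}' >> result/result.txt
done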

The resulting file looks like this. The data is stored in this long form so that it can then be reshaped into a matrix with an R package:

## 26155 tcga-g2-a2ef
## 84069 tcga-g2-a2ef
## 254173 tcga-g2-a2ef
## 23013 tcga-g2-a2ef
## 55672 tcga-g2-a2ef

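Next comes the familiar R part. The program is simple enough that it needs little explanation; it produces the mutation profile, with genes as rows and samples as columns: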

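library(igraph)
library(dplyr)
setwd("path")
# Mapping table between Ensembl and Entrez gene IDs
ensembl_entrez <- as_tibble(read.table("ensembl_entrez.txt", header = TRUE, stringsAsFactors = FALSE, sep = "\t"))
# Gene/sample pairs written by the shell script
data <- as_tibble(read.table("mutation_data/result/result.txt", header = FALSE, stringsAsFactors = FALSE, sep = "\t"))
colnames(data) <- c("entrezgene", "sample_id")
# Attach the mapped gene ID and drop pairs without a mapping
final_data <- data %>% left_join(ensembl_entrez, by = "entrezgene") %>% na.omit
# Column 3 comes from ensembl_entrez via the join; column 2 is the sample ID
need_data <- as.data.frame(final_data[, c(3, 2)])
# Treat each gene/sample pair as an edge, then read the profile off the adjacency matrix
mutation_net <- graph_from_edgelist(as.matrix(need_data), directed = FALSE)
mutation_adj <- as_adjacency_matrix(mutation_net)
sample_id <- unique(unlist(need_data[, 2]))
gene_id <- unique(unlist(need_data[, 1]))
# Keep only the gene rows and sample columns
final_adj <- mutation_adj[gene_id, sample_id]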

And that is the whole workflow; it is not very complicated. Thanks to Quan Fei for suggesting an alternative approach:

library(dplyr)
setwd("path")
ensembl_entrez <- as_tibble(read.table("ensembl_entrez.txt", header = TRUE, stringsAsFactors = FALSE, sep = "\t"))
data <- as_tibble(read.table("mutation_data/result/result.txt", header = FALSE, stringsAsFactors = FALSE, sep = "\t"))
colnames(data) <- c("entrezgene", "sample_id")
final_data <- data %>% left_join(ensembl_entrez, by = "entrezgene") %>% na.omit
# Cross-tabulate genes (rows) against samples (columns)
final_adj <- table(unlist(final_data[, 3]), unlist(final_data[, 2]))
# NOTE: the body of function(x) was lost in the original listing;
# converting the counts to 0/1 here is an assumption, not the original code.
result <- t(apply(as.matrix(final_adj, ncol = ncol(final_adj)), 1, function(x) as.numeric(x > 0)))
rownames(result) <- rownames(final_adj)
colnames(result) <- colnames(final_adj)
write.table(result, "mutation_profile.txt", col.names = TRUE, row.names = TRUE, sep = "\t", quote = FALSE)
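For readers who want to see the pivot itself spelled out, here is a rough shell sketch that turns gene/sample pairs into a 0/1 matrix with genes as rows and samples as columns. It is not the method used in the R scripts, only an illustration of the same reshaping; the helper name pairs_to_matrix and the IDs g1, g2, s1, s2 are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "gene<TAB>sample" pairs and print a 0/1 matrix
# with one row per gene and one column per sample, in order of first appearance.
pairs_to_matrix() {
    awk -F'\t' '
        {
            if (!($1 in gene_seen))   { gene_seen[$1] = 1;   genes[++n_genes] = $1 }
            if (!($2 in sample_seen)) { sample_seen[$2] = 1; samples[++n_samples] = $2 }
            hit[$1, $2] = 1
        }
        END {
            # Header row: an empty corner cell followed by the sample IDs.
            header = ""
            for (j = 1; j <= n_samples; j++) header = header "\t" samples[j]
            print header
            for (i = 1; i <= n_genes; i++) {
                row = genes[i]
                for (j = 1; j <= n_samples; j++)
                    row = row "\t" (((genes[i], samples[j]) in hit) ? 1 : 0)
                print row
            }
        }
    ' "$@"
}

# Made-up pairs: g1 is mutated in s1 and s2, g2 only in s1.
printf 'g1\ts1\ng2\ts1\ng1\ts2\n' | pairs_to_matrix
```

Repeated pairs collapse to a single 1, so the output records whether a gene is mutated in a sample, not how many times.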
