PH525.4x第二周内容围绕GRange类的操作和使用Annotation进行数据关联两个主题展开,并展示了几个比较“炫”的功能。由于内容繁多,信息量大,故笔记之以便日后参考。
该课程的演示数据为ChIP-seq的实验数据,背景为人类肝细胞(cell line:HepG2和GM12878)中被ESRRA (estrogen related receptor alpha)绑定的基因片段。
在展示数据操作之前,首先检查bioconductor的版本号,不同版本的输出可能存在差异。
library(BiocInstaller) biocVersion() #[1] ‘3.0’
载入演示数据:
library(devtools)
install_github("genomicsclass/ERBS")
library(ERBS)
data(GM12878)
data(HepG2)
class(HepG2)
#[1] "GRanges"
#attr(,"package")
#[1] "GenomicRanges"
HepG2
# GRanges object with 303 ranges and 7 metadata columns:
#   seqnames                 ranges strand   |      name     score       col signalValue    pValue
# <Rle>              <IRanges>  <Rle>   | <numeric> <integer> <logical>   <numeric> <numeric>
#   [1]     chr2 [ 20335378,  20335787]      *   |      <NA>         0      <NA>      58.251    75.899
# [2]    chr20 [   328285,    329145]      *   |      <NA>         0      <NA>      10.808    69.685
# [3]     chr6 [168135432, 168136587]      *   |      <NA>         0      <NA>      17.103    54.311
# [4]    chr19 [  1244419,   1245304]      *   |      <NA>         0      <NA>      12.427    43.855
# [5]    chr11 [ 64071828,  64073069]      *   |      <NA>         0      <NA>       10.85    40.977
# ...      ...                    ...    ... ...       ...       ...       ...         ...       ...
# [299]     chr4 [  1797182,   1797852]      *   |      <NA>         0      <NA>       9.681    10.057
# [300]     chr1 [110198573, 110199126]      *   |      <NA>         0      <NA>       7.929    10.047
# [301]    chr17 [ 17734052,  17734469]      *   |      <NA>         0      <NA>       5.864      9.99
# [302]     chr1 [ 48306453,  48306908]      *   |      <NA>         0      <NA>        5.66     9.948
# [303]    chr12 [123867207, 123867554]      *   |      <NA>         0      <NA>      13.211     9.918
# qValue      peak
# <numeric> <integer>
#   [1] 6.143712e-72       195
# [2] 5.028065e-66       321
# [3] 7.930665e-51       930
# [4] 1.359756e-40       604
# [5] 7.333863e-38       492
# ...          ...       ...
# [299] 1.423343e-08       402
# [300] 1.442076e-08       197
# [301] 1.638918e-08       227
# [302] 1.799414e-08       211
# [303] 1.921805e-08       163
# -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome
我们可以看见该HepG2数据有3个column为位置信息,7个为特征信息(metadata column)。我们可以直接使用granges提取所有位置信息或seqnames,ranges,strand提取相关信息,并支持IRanges操作,譬如start,end,width等;使用values或mcols函数可以提取metadata;提取row的方法同dataframe类操作或使用chr=="xxx"的形式索引。值得注意的是seqnames的输出并不是character类,有character需求的话需要转换。同时seqnames按原始order计算染色体位置出现频率,即同一个染色体位置如果分开于不相邻的row按不同染色体位置计数,需要直接计算每一个染色体位置出现的频率可以对原始数据order以后再提取或使用table函数。
seqnames(HepG2) # factor-Rle of length 303 with 292 runs # Lengths: 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 # Values : chr2 chr20 chr6 chr19 chr11 chr20 chr19 chr2 ... chr17 chr3 chr4 chr1 chr17 chr1 chr12 # Levels(93): chr1 chr2 chr3 chr4 chr5 ... chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249 seqnames(HepG2[order(HepG2)]) # factor-Rle of length 303 with 23 runs # Lengths: 18 38 15 4 8 24 14 11 ... 21 6 16 27 3 5 4 # Values : chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr17 chr18 chr19 chr20 chr21 chr22 chrX # Levels(93): chr1 chr2 chr3 chr4 chr5 ... chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
与GRanges类相关的是IRanges类的操作,以下代码一图读懂基本操作~
ir <- IRanges(5,10)
# set up a plotting window so we can look at range operations
plot(0,0,xlim=c(0,23),ylim=c(0,13),type="n",xlab="",ylab="",xaxt="n")
axis(1,0:15)
abline(v=0:14 + .5,col=rgb(0,0,0,.5))
# plot the original IRange
plotir <- function(ir,i) { arrows(start(ir)-.5,i,end(ir)+.5,i,code=3,angle=90,lwd=3) }
plotir(ir,1)
# draw a red shadow for the original IRange
polygon(c(start(ir)-.5,start(ir)-.5,end(ir)+.5,end(ir)+.5),c(-1,15,15,-1),col=rgb(1,0,0,.2),border=NA)
# draw the different ranges
plotir(shift(ir,-2), 2)
plotir(narrow(ir, start=2), 3)
plotir(narrow(ir, end=5), 4)
plotir(flank(ir, width=3, start=TRUE, both=FALSE), 5)
plotir(flank(ir, width=3, start=FALSE, both=FALSE), 6)
plotir(flank(ir, width=3, start=TRUE, both=TRUE), 7)
plotir(ir * 2, 8)
plotir(ir * -2, 9)
plotir(ir + 2, 10)
plotir(ir - 2, 11)
plotir(resize(ir, 1), 12)
text(rep(15,12), 1:12, c("ir","shift(ir,-2)","narrow(ir,start=2)",
                         "narrow(ir,end=5)",
                         "flank(ir, start=T, both=F)",
                         "flank(ir, start=F, both=T)",
                         "flank(ir, start=T, both=T)",
                         "ir * 2","ir * -2","ir + 2","ir - 2",
                         "resize(ir, 1)"), pos=4)
GRanges类是IRanges类的拓展,除了IRanges类的基本特征之外,还需要提供染色体位置seqnames和正链还是反链strand两个关键信息(strand默认为“+”)。c函数可以合并两个GRanges类,gaps可以寻找未被原始序列覆盖的序列(补集),disjoint可以将原始序列分割成互不相交的序列,reduce输出整个序列覆盖的范围和gaps互补。%over%命令,如"HepG2 %over% GM12878",可以生成长度为HepG2长度的逻辑向量判断HepG2的各序列是否与GM12878存在交集。值得一提的是findOverlaps函数,输出Hits类对象:
fo <- findOverlaps(HepG2, GM12878) fo # Hits of length 75 # queryLength: 303 # subjectLength: 1873 # queryHits subjectHits # <integer> <integer> # 1 1 12 # 2 2 78 # 3 4 777 # 4 5 8 # 5 8 13 # ... ... ... # 71 285 621 # 72 287 174 # 73 291 1855 # 74 294 512 # 75 300 144 queryHits(fo) subjectHits(fo)
queryHits输出在HepG2中与GM12878重叠的序列,subjectHits反之。当然也可以用%over%。
HepG2 %over% GM12878 HepG2[HepG2 %over% GM12878]
除了findOverlaps,还可以使用distanceToNearest函数找到最近序列,输出类似findOverlaps,distance可以输出距离值。
distanceToNearest(HepG2[17],GM12878) # Hits of length 1 # queryLength: 1 # subjectLength: 1873 # queryHits subjectHits distance # <integer> <integer> <integer> # 1 1 945 2284 mcols(d)$distance
当然这里所有的操作都需要注意strand方向。
接着就看看我们的ERBS究竟绑了些什么基因。载入基因注释(annotation)数据Homo.sapiens并查看基因信息。
library(Homo.sapiens) ghs = genes(Homo.sapiens) genome(ghs)
我们捕捉在两种cell line,HepG2和GM12878,ERBS绑定一致的区域。
res = findOverlaps(HepG2,GM12878) erbs = HepG2[queryHits(res)] erbs = granges(erbs)
#Another way
erbs = intersect(HepG2, GM12878)
precede函数可以找出转录起始端在绑定位置上游的基因。
index <- precede(erbs, subject = ghs) ghs[index,]
更好的做法是首先对ghs使用resize函数收集转录起始端位置,再使用distanceToNearest函数,可以找出绑定位置即为转录起始端的基因和对距离进行筛选(precede命令不能做到)。
tssge <- resize(ghs, 1) d <- distanceToNearest(erbs, tssge) queryHits(d) dist <- values(d)$distance index <- subjectHits(d)[dist < 1000] tssge[index,] # GRanges object with 41 ranges and 1 metadata column: # seqnames ranges strand | GENEID # <Rle> <IRanges> <Rle> | <IntegerList> # 80023 chr20 [ 327370, 327370] + | 80023 # 2101 chr11 [ 64073044, 64073044] + | 2101 # 7086 chr3 [ 53290130, 53290130] - | 7086 # 5478 chr7 [ 44836241, 44836241] + | 5478 # 286262 chr9 [140095163, 140095163] - | 286262 # ... ... ... ... ... ... # 55090 chr17 [ 17380300, 17380300] + | 55090 # 11165 chr6 [ 34360457, 34360457] - | 11165 # 56181 chr1 [ 26146397, 26146397] + | 56181 # 347734 chr6 [ 44225283, 44225283] - | 347734 # 2948 chr1 [110198698, 110198698] + | 2948 # ------- # seqinfo: 93 sequences (1 circular) from hg19 genome
使用select函数对Homo.sapiens进行注释查询,select参数包括查询数据库,keys和对应的key type(关联注释的关键)及注释column。
首先我们看Homo.sapiens支持的key type。
keytypes(Homo.sapiens) # [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION" "ENTREZID" "PFAM" # [7] "IPI" "PROSITE" "ACCNUM" "ALIAS" "CHR" "CHRLOC" # [13] "CHRLOCEND" "ENZYME" "MAP" "PATH" "PMID" "REFSEQ" # [19] "SYMBOL" "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME" # [25] "UNIPROT" "GO" "EVIDENCE" "GOALL" "EVIDENCEALL" "ONTOLOGYALL" # [31] "OMIM" "UCSCKG" "GENEID" "TXID" "TXNAME" "EXONID" # [37] "EXONNAME" "CDSID" "CDSNAME"
我们使用GENEID进行关联,关联注释选择“SYMBOL”和“GENENAME”。注意keys一般接受character输入,需要转换。
res <- select(Homo.sapiens, keys = as.character(values(tssge[index, ])$GENEID), keytype = "GENEID", columns = c("SYMBOL", "GENENAME"))
head(res)
# GENEID  SYMBOL                                   GENENAME
# 1  80023   NRSN2                                neurensin 2
# 2   2101   ESRRA            estrogen-related receptor alpha
# 3   7086     TKT                              transketolase
# 4   5478    PPIA peptidylprolyl isomerase A (cyclophilin A)
# 5 286262    TPRN                                    taperin
# 6  79696 ZC2HC1C       zinc finger, C2HC-type containing 1C
接着查找绑定的序列(motif),需要载入人类基因组数据库。注意在这里载入Hsapines数据和载入BSgenome.Hsapiens.UCSC.hg19是一样的(不是载入包,而是数据!)。提取单条染色体基因可以使用$。
library(BSgenome.Hsapiens.UCSC.hg19) Hsapines # Human genome # | # | organism: Homo sapiens (Human) # | provider: UCSC # | provider version: hg19 # | release date: Feb. 2009 # | release name: Genome Reference Consortium GRCh37 # | 93 sequences: # | chr1 chr2 chr3 chr4 # | chr5 chr6 chr7 chr8 # | chr9 chr10 chr11 chr12 # | chr13 chr14 chr15 chr16 # | chr17 chr18 chr19 chr20 # | ... ... ... ... # | chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 # | chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 # | chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 # | chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 # | chrUn_gl000249 # | (use ‘seqnames()‘ to see all the sequence names, use the ‘$‘ or ‘[[‘ operator to access a given # | sequence)
getSeq命令可以输出目标位置的基因序列以便对其进一步分析。
hepseq <- getSeq(Hsapiens, erbs) # A DNAStringSet instance of length 75 # width seq # [1] 410 GAGACAGGGTTTCACCATGTTGGCCAGGCTGGTTTCGAACTCCTGA...GAAGGAGATGCAACCTTCCAGGAAGCAGAAATGTTCAAGGACTCTC # [2] 861 TGGGAAGGACACAACTGAATGAGGCTGTGCAGAGGCGACAGATTCC...CCGTGTGACTCCAAGCAGAACCTCCAACCGTGTGTGTGTGTGTGTG # [3] 886 GTGAAGGCCCTGGAGTAGGCGGTGCGTACCCGGTGTCCCGAGGCCC...GTGCCGAATCTGCAGTGTTTTTGGCACCTCCGTGGGCACCTAGGCT # [4] 1242 CATCCTCCACCTTAACACTCAGCACCCTTAGAGATGCAAAGTTCCC...GGCCGCGATGTCCTTTTGTGTCCTACAAGCAGCCGGCGGCGCCGCC # [5] 477 AGGAATGAGCACCCACGGAGAAATGCAGTCCCGACCCCTTCCACCC...GTGATTGGACAAGGATGGTCAAAATGTATCCTTCTGGAATCTGAGC # ... ... ... # [71] 429 TGAGCCCCAACAAAACAAAAAAGCCAATCCTGCTGGTCTAAACTGG...CCATTCAACAAAGCACAGTATGCACCATGCAGTATTTTCTTGTGAT # [72] 551 GTGGCGGAAGGAAGCGGCGCGCTTGGCCGCCAGGAGCCGCTTGCGC...CAGACAAGGTCCAGGGACCTGGCGGCTGAGGGGACCCTCAAGGCTC # [73] 805 TACCAGGCCGAGGGATCTCCAAACGCAGCCAACTCCTCCCGGGGAC...GGAGAGGGTCCGCGCGGGTCTCCGGCCACTGCTGCGCGCGGCAAAG # [74] 1037 ATCCCCTCCGCCCTCCAGAGGGGACCACACTGGAGTAGTGCCCGGA...CGAGGCTCTACATGGCCTCGCCCGCGTTCCTCGCCCACTCACTCGC # [75] 554 TCCGGAGAAGAAGAAACGGGGGAAGAACTTTTCTCTTACGATCTGG...CGCGTGGGCGGGAAGTGCCGAGCGGCTGGGGACCGGCTCTAGGGAC
使用countPattern,vcountPattern,alphabetFrequency等函数可对某种pattern在序列中的出现进行统计分析。
vcountPattern("GC", hepseq)
# [1]  34  75  94 139  28  52  47  79   9 109 187  58  32  21 113  18  23  83 110  58 142  33 106  40  89
# [26]  24  22 116  64  25  91 128  58  19  38 101  42  89  85  68  23 184  88  58  38 138  56  40  23  83
# [51] 101  90  43  20  47 104  62  71 133  62 144  67  42 119  82 127  78  27  47  99  22  64  98 121  51
待续...
HarvardX: PH525.4x Introduction to Bioconductor第二周笔记
原文:http://www.cnblogs.com/nutastray/p/4416020.html