首页 > 代码库 > GOSemSim
GOSemSim
GOSemSim:GO-terms Semantic Similarity Measures
Installation
Install GOSemSim
is easy, follow the guide in the Bioconductor page:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
## biocLite("BiocUpgrade") ## you may need this
biocLite("GOSemSim")
GO ID
找到 Gene Ontology (GO)勾选下面几个选项
然后获得所有酵母蛋白的Gene ontology数据
提取Gene Ontology ID
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 28 19:04:38 2016
@author: sun
"""
import pandas as pd
import re
yeast=pd.read_csv(‘yeast.csv‘)
#Gene ontology (biological process)
#Gene ontology (molecular function)
#Gene ontology (cellular component)
bp=yeast[‘Gene ontology (biological process)‘]
bp=bp.fillna(value=‘‘)
for i in range(len(bp)):
temp=re.findall(r"GO:\d{7}",bp[i])
bp[i]=‘;‘.join(temp)
mf=yeast[‘Gene ontology (molecular function)‘]
mf=mf.fillna(value=‘‘)
for i in range(len(mf)):
temp=re.findall(r"GO:\d{7}",mf[i])
mf[i]=‘;‘.join(temp)
cc=yeast[‘Gene ontology (cellular component)‘]
cc=cc.fillna(value=‘‘)
for i in range(len(cc)):
temp=re.findall(r"GO:\d{7}",cc[i])
cc[i]=‘;‘.join(temp)
yeast[‘Gene ontology (biological process)‘]=bp
yeast[‘Gene ontology (molecular function)‘]=mf
yeast[‘Gene ontology (cellular component)‘]=cc
yeast.to_csv(‘go.csv‘,index=False,columns =[‘Entry‘,
‘Gene ontology (cellular component)‘,
‘Gene ontology (molecular function)‘,
‘Gene ontology (biological process)‘])
获得Gene ontology
获取gold_yeast的Gene ontology
yeast=pd.read_csv(‘yeast_gold_protein_pair.csv‘)
go=pd.read_csv(‘go.csv‘,index_col=0)
protein_a=go.loc[yeast.idA,:]
protein_b=go.loc[yeast.idB,:]
protein_a.to_csv(‘GOProteinA.csv‘)
protein_b.to_csv(‘GOProteinB.csv‘)
R语言的安装
R的一个可视化界面(RStudio)的安装
软件界面如下
然后开始安装GOSemSim,运行文章开头安装的代码
如果没FQ的话,一般会有如下错误
GOSemSim 说明文档
刚刚下载好的Supported organisms的安装
For IC-based methods, information of GO term is species specific. We need to calculate IC
for all GO terms of a species before we measure semantic similarity. GOSemSim support all organisms that have an OrgDb
object available.
Bioconductor have already provided OrgDb
for about 20 species, seehttp://bioconductor.org/packages/release/BiocViews.html#___OrgDb.
Once we have OrgDb
, we can build annotation data needed by GOSemSim via godata
function.
library(GOSemSim)hsGO <- godata(‘org.Hs.eg.db‘, ont="MF")
## [1] "preparing gene to GO mapping data..."
## [1] "preparing IC data..."
User can set computeIC=FALSE
if they only want to use Wang’s method.
goSim 和mgoSim的介绍
In GOSemSim, we implemented all these IC-based and graph-based methods. goSim function calculates semantic similarity between two GO terms, while mgoSim function calculates semantic similarity between two sets of GO terms.
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
## [1] 0.155
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
## [1] 0.158
go1 = c("GO:0004022","GO:0004024","GO:0004174")go2 = c("GO:0009055","GO:0005515")mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
## GO:0009055 GO:0005515
## GO:0004022 0.205 0.158
## GO:0004024 0.185 0.141
## GO:0004174 0.205 0.158
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.192
> library(GOSemSim)
> hsGO <- godata(‘org.Hs.eg.db‘, ont="MF")
[1]"preparing gene to GO mapping data..."
[1]"preparing IC data..."
> goSim("GO:0004022","GO:0005515", semData=hsGO, measure="Jiang")
[1]0.155
> goSim("GO:0004022","GO:0005515", semData=hsGO, measure="Wang")
[1]0.158
> go1 = c("GO:0004022","GO:0004024","GO:0004174")
> go2 = c("GO:0009055","GO:0005515")
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
GO:0009055 GO:0005515
GO:00040220.2050.158
GO:00040240.1850.141
GO:00041740.2050.158
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
[1]0.192
使用gold_yeast数据获得GO特征
library(GOSemSim)
library(org.Sc.sgd.db)
GOProteinA<- read.csv("GOProteinA.csv", stringsAsFactors = F)
GOProteinB<- read.csv("GOProteinB.csv", stringsAsFactors = F)
cc <- c()
mf <- c()
bp <- c()
scGo <- godata(‘org.Sc.sgd.db‘, ont ="cc", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,2], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,2], split =";")[[1]])
cc[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
scGo <- godata(‘org.Sc.sgd.db‘, ont ="mf", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,3], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,3], split =";")[[1]])
mf[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
scGo <- godata(‘org.Sc.sgd.db‘, ont ="bp", computeIC = F)
for(i in1:length(GOProteinA$Entry)){
go1 <- c(strsplit(GOProteinA[i,4], split =";")[[1]])
go2 <- c(strsplit(GOProteinB[i,4], split =";")[[1]])
bp[[i]]<- mgoSim(go1, go2, semData = scGo, measure ="Wang")
}
GOFeature<-
data.frame(GOProteinA$Entry,GOProteinB$Entry, cc, mf, bp)
write.csv(GOFeature,
‘GOFeature.csv‘,
na =‘0‘,#将nan值填充为0
row.names = FALSE)
检查错误
- 以第一对蛋白质为例,首先单独获得P00546和P25302的GO ID
P00546
cellular bud neck [GO:0005935];
cyclin-dependent protein kinase holoenzyme complex [GO:0000307];
cytoplasm [GO:0005737];
endoplasmic reticulum [GO:0005783];
nucleus [GO:0005634]
ATP binding [GO:0005524];
cyclin-dependent protein serine/threonine kinase activity [GO:0004693];
histone binding [GO:0042393];
protein serine/threonine kinase activity [GO:0004674];
RNA polymerase II core binding [GO:0000993]
7-methylguanosine mRNA capping [GO:0006370];
cell division [GO:0051301];
meiotic DNA double-strand break processing [GO:0000706];
mitotic sister chromatid biorientation [GO:1990758];
negative regulation of double-strand break repair via nonhomologous end joining [GO:2001033];
negative regulation of meiotic cell cycle [GO:0051447];
negative regulation of mitotic cell cycle [GO:0045930];
negative regulation of sister chromatid cohesion [GO:0045875];
negative regulation of transcription, DNA-templated [GO:0045892];
negative regulation of transcription from RNA polymerase II promoter during mitosis [GO:0007070];
peptidyl-serine phosphorylation [GO:0018105];
peptidyl-threonine phosphorylation [GO:0018107];
phosphorylation of RNA polymerase II C-terminal domain [GO:0070816];
positive regulation of meiotic cell cycle [GO:0051446];
positive regulation of mitotic cell cycle [GO:0045931];
positive regulation of nuclear cell cycle DNA replication [GO:0010571];
positive regulation of spindle pole body separation [GO:0010696];
positive regulation of transcription, DNA-templated [GO:0045893];
positive regulation of transcription from RNA polymerase II promoter [GO:0045944];
positive regulation of triglyceride catabolic process [GO:0010898];
protein localization to nuclear periphery [GO:1990139];
protein localization to nucleus [GO:0034504];
protein phosphorylation involved in cellular protein catabolic process [GO:1902002];
protein phosphorylation involved in DNA double-strand break processing [GO:1990802];
protein phosphorylation involved in double-strand break repair via nonhomologous end joining [GO:1990804];
protein phosphorylation involved in mitotic spindle assembly [GO:1990801];
protein phosphorylation involved in protein localization to spindle microtubule [GO:1990803];
regulation of budding cell apical bud growth [GO:0010568];
regulation of double-strand break repair via homologous recombination [GO:0010569];
regulation of filamentous growth [GO:0010570];
regulation of nucleosome density [GO:0060303];
regulation of spindle assembly [GO:0090169];
regulation of telomere maintenance via telomerase [GO:0032210];
synaptonemal complex assembly [GO:0007130];
vesicle-mediated transport [GO:0016192]
P25302
nuclear chromatin [GO:0000790];
SBF transcription complex [GO:0033309]
DNA binding [GO:0003677];
identical protein binding [GO:0042802];
transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding [GO:0001077]
positive regulation of transcription from RNA polymerase II promoter in response to heat stress [GO:0061408];
positive regulation of transcription involved in G1/S transition of mitotic cell cycle [GO:0071931]
- 分别计算CC,MF,BP
ccA <- c(strsplit(GOProteinA[1,2], split =";")[[1]])
ccB <- c(strsplit(GOProteinB[1,2], split =";")[[1]])
mfA <- c(strsplit(GOProteinA[1,3], split =";")[[1]])
mfB <- c(strsplit(GOProteinB[1,3], split =";")[[1]])
bpA <- c(strsplit(GOProteinA[1,4], split =";")[[1]])
bpB <- c(strsplit(GOProteinB[1,4], split =";")[[1]])
scGo <- godata(‘org.Sc.sgd.db‘, ont ="cc")
cc <- mgoSim(ccA, ccB, semData = scGo, measure ="Wang")
scGo <- godata(‘org.Sc.sgd.db‘, ont ="mf")
mf <- mgoSim(mfA, mfB, semData = scGo, measure ="Wang")
scGo <- godata(‘org.Sc.sgd.db‘, ont ="bp")
bp <- mgoSim(bpA, bpB, semData = scGo, measure ="Wang")
未找到错误继续
- 检查原论文说明。
结论。
- 可以肯定一点。程序没错。
- 可能是数据更新了。原始论文是2014的论文。2年可是GO数据的更新导致数据不一致。
附件列表
GOSemSim