Biopython - basics

Introduction

According to the Biopython website, the project's goal is to "make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts." These modules use the Biopython tutorial as a template for what you will learn here. Here is a list of some of the most common tools, databases and data formats in computational biology that Biopython supports.

Uses                 Note
Blast                finds regions of local similarity between sequences
ClustalW             multiple sequence alignment program
GenBank              NCBI sequence database
PubMed and Medline   document database
ExPASy               SIB resource portal (Enzyme and Prosite)
SCOP                 Structural Classification of Proteins (e.g. 'dom', 'lin')
UniGene              computationally identifies transcripts from the same locus
SwissProt            annotated and non-redundant protein sequence database

Biopython's other principal features include:

  • A standard sequence class that deals with sequences, ids on sequences, and sequence features.
  • Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
  • Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
  • Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
  • Code making it easy to split up parallelizable tasks into separate processes.
  • GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
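To make the "translation, transcription" item in the list above concrete, here is a minimal pure-Python sketch of what those two operations compute. The function names `transcribe` and `translate` are illustrative helpers written for this note, not Biopython's own implementation, and the sketch assumes the standard genetic code and unambiguous upper-case DNA.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA string (T becomes U)."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate DNA three bases at a time; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(transcribe(coding_dna))
print(translate(coding_dna))
```

In Biopython itself these operations are available as methods on the Seq object, so a hand-written table like this is only useful for seeing what happens underneath.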

Getting started

>>> import Bio
>>> Bio.__version__
'1.58'

Some examples will also require a working internet connection in order to run.

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> aStringSeq = str(my_seq)
>>> aStringSeq
'AGTACACTGGT'
>>> my_seq_complement = my_seq.complement()
>>> my_seq_complement
Seq('TCATGTGACCA', Alphabet())
>>> my_seq_reverse = my_seq[::-1]  # Seq has no reverse() method; slice instead
>>> my_seq_rc = my_seq.reverse_complement()
>>> my_seq_rc
Seq('ACCAGTGTACT', Alphabet())
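As a rough illustration of what the complement and reverse_complement calls above compute, the following sketch does the same thing on plain strings. It is not how Biopython implements these methods; the `COMPLEMENT` table and both function names are assumptions made for this example, and only the four unambiguous DNA bases are handled.

```python
# Watson-Crick pairing for the four unambiguous DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(dna):
    """Swap each base for its pairing partner, keeping the order."""
    return "".join(COMPLEMENT[base] for base in dna)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


seq = "AGTACACTGGT"
print(complement(seq))          # TCATGTGACCA
print(reverse_complement(seq))  # ACCAGTGTACT
```

The printed strings match the Seq results in the session above, which is a quick way to check that the two approaches agree.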

There is much more, but before we get into it we should figure out how to get sequences in and out of Python.

File download

FASTA is the standard text format for storing sequence data. Here is a reminder of the single-letter codes used for nucleic acid sequences.

Nucleic acid code   Note            Nucleic acid code   Note
A                   adenosine       K                   G/T (keto)
T                   thymidine       M                   A/C (amino)
C                   cytidine        R                   G/A (purine)
G                   guanine         S                   G/C (strong)
N                   A/G/C/T (any)   W                   A/T (weak)
U                   uridine         B                   G/T/C
D                   G/A/T           Y                   T/C (pyrimidine)
H                   A/C/T           V                   G/C/A
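The ambiguity codes in the table above can be written down as a lookup from each code to the bases it stands for. The sketch below is one way to do that in plain Python; `IUPAC_DNA`, `matches` and `expand` are hypothetical names chosen for this example, and U is left out because it belongs to RNA rather than DNA.

```python
from itertools import product

# Each DNA code mapped to the set of bases it may represent.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "M": "AC", "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(code, base):
    """Return True if a concrete base is allowed by an ambiguity code."""
    return base in IUPAC_DNA[code]


def expand(pattern):
    """List every concrete sequence that an ambiguous pattern can stand for."""
    choices = (IUPAC_DNA[code] for code in pattern)
    return ["".join(combo) for combo in product(*choices)]


print(matches("R", "G"))
print(expand("AR"))
```

Note that the number of expansions grows multiplicatively with each ambiguous position, so `expand` is only practical for short patterns.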

Here is a quick look at how Biopython works with sequences.

>>> import os
>>> from Bio import SeqIO
>>> for seq_record in SeqIO.parse(os.path.join("data", "ls_orchid.fasta"), "fasta"):
...     print(seq_record.id)
...     print(repr(seq_record.seq))
...     print(len(seq_record))
...
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
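To show roughly what a FASTA parser has to do, here is a minimal pure-Python sketch that reads records from any file-like object. It is a simplified stand-in, not the SeqIO implementation: `parse_fasta` is a hypothetical helper, the two records in `example` are made up for illustration, and real files may need more careful handling than this.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The id is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            identifier, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up in-memory data standing in for a file on disk.
example = io.StringIO(
    ">seq1 first made-up record\nACGT\nACGT\n>seq2 second made-up record\nGGCC\n"
)
records = list(parse_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

Because the function is a generator, it holds only one record in memory at a time, which is the same reason a loop over a parser is preferred to loading a whole file for large data sets.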
