
Genomic Informatics glossary & taxonomy
Evolving Terminology for Emerging Technologies
Comments? Questions? Revisions?  Mary Chitty
Last revised November 14, 2013


Related glossaries include Applications: Drug discovery & development, Drug Targets, Molecular Diagnostics;
Informatics: Drug discovery informatics, Bioinformatics, Cheminformatics, Ontologies & Taxonomies, Protein Informatics;
Technologies: Microarrays, PCR, Sequencing; Biology: Genetic variations.
This glossary covers both technologies for detecting and informatics for interpreting genetic variants.

ab initio gene prediction: Traditionally, gene prediction programs that rely only on the statistical qualities of exons have been referred to as performing ab initio predictions. Ab initio prediction of coding sequences is an undeniable success by the standards of the machine-learning algorithm field, and most of the widely used gene prediction programs belong to this class of algorithms. It is impressive that the statistical analysis of raw genomic sequence can detect around 77-98% of the genes present ... This is, however, little consolation to the bench biologist, who wants the complete sequences of all genes present, with some certainty about the accuracy of the predictions involved. As Ewan Birney (European Bioinformatics Institute, UK) put it, what looks impressive to the computer scientist is often simply wrong to the biologist. Meeting report "Gene prediction: the end of the beginning" Colin Semple, Genome Biology 2000 1(2): reports 4012.1-4012.3

All ab initio gene prediction programs have to balance sensitivity against accuracy.  Broader term: gene prediction.

alignment: The process of lining up two or more sequences to achieve maximal levels of  identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.  NCBI BLAST Glossary

assembled: The term used to describe the process of using a computer to join up bits of sequence into a larger whole. Peer Bork, Richard Copley "Filling in the gaps" Nature 409: 818-820, 15 Feb. 2001

This is different from assembly language, and the source of some confusion between biologists and computer scientists. 

Related terms: contig assembly, genome assembly

biocomputing:  Biocomputing could be defined as the construction and use of computers which function like living organisms or contain biological components, so-called biocomputers (Kaminuma, 1991). Biocomputing could, however, also be defined as the use of computers in biological research and it is this definition which I am going to use in this essay. With this interpretation of biocomputing the complicated ethical questions connected with concepts like artificial life and intelligence are not dealt with.  Peter Hjelmström, Ethical issues in biocomputing  

biological computing: Simson Garfinkel "Biological computing" Technology Review, May/ June 2000  Related terms: biocomputing, DNA computing    

BLAST (Basic Local Alignment Search Tool): Software program from NCBI for searching public databases for homologous sequences or proteins. Designed to explore all available sequence databases regardless of whether query is protein or DNA.

comparative genome annotation: Recent advances in genome sequencing technology and algorithms have made it possible to determine the sequence of a whole genome quickly in a cost-effective manner. As a result, there are more than 200 completely sequenced genomes. However, annotation of a genome is still a challenging task. One of the most effective methods to annotate a newly sequenced genome is to compare it with well-annotated and closely related genomes using computational tools and databases. Comparing genomes requires use of a number of computational tools and produces a large amount of output, which should be analyzed by genome annotators. Because of this difficulty, genome projects are mostly carried out at large genome sequencing centers. To alleviate the requirement for expert knowledge in computational tools and databases, we have developed a web-based genome annotation system, called CGAS (a comparative genome annotation system). CGAS: a comparative genome annotation system. Choi K, Yang Y, Kim S. Methods Mol Biol. 2007;395:133-146.  Broader term: genome annotation Related terms: Functional genomics, comparative genomics

complex genomes: Is there a specific definition of complex genomes?  Or is it a more general category (beyond viral, bacterial,  microbial?)  

computational gene recognition: Interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure and functional class of protein-coding genes. JW Fickett 1996

Gene recognition is much more difficult in higher eukaryotes than in prokaryotes, as coding regions (exons) are often interrupted by non-coding regions (introns) and genes are highly variable in size. This is particularly so for human genes. As someone remarked some time ago, people have non-coding regions occasionally interrupted by genes.
Broader terms: gene recognition, molecular recognition.

computational genomics: Our laboratory develops new machine learning techniques and algorithms to model the transcriptional regulatory networks that control gene expression programs in living cells. We have a very productive interdisciplinary collaboration with leading biologists that has allowed us to tackle extraordinarily difficult and interesting problems that underlie cellular function and development. Computational Genomics Research Group, CSAIL, MIT   Related terms: Expression, Microarrays

concordance: Similarity of results between different microarray platforms. Related terms: discordance, mismatches

consensus sequence: A theoretical representative nucleotide or amino acid sequence in which each nucleotide or amino acid is the one that occurs most frequently at that site in the different forms which occur in nature. The phrase also refers to an actual sequence that approximates the theoretical consensus. A known CONSERVED SEQUENCE set is represented by a consensus sequence. Commonly observed supersecondary protein structures (AMINO ACID MOTIFS) are often formed by conserved sequences. MeSH, 1991

A sequence of DNA, RNA, protein or carbohydrate derived from a number of similar molecules, which comprises the essential features for a particular function. IUPAC Bioinorganic
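The "most frequent residue at each site" rule in these definitions can be sketched in a few lines of Python. The `consensus` helper and the sample alignment below are illustrative assumptions, not taken from any cited source:

```python
from collections import Counter

def consensus(aligned_seqs):
    """Consensus of equal-length aligned sequences: the most
    frequent residue at each column (ties broken arbitrarily)."""
    length = len(aligned_seqs[0])
    assert all(len(s) == length for s in aligned_seqs)
    # zip(*...) iterates over alignment columns
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*aligned_seqs))

print(consensus(["GATTACA", "GATTTCA", "GATCACA"]))  # GATTACA
```

Real consensus building must also handle gaps and ambiguity codes, which this sketch ignores.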

conserved sequence: A sequence of amino acids in a polypeptide or of nucleotides in DNA or RNA that is similar across multiple species. A known set of conserved sequences is represented by a CONSENSUS SEQUENCE. AMINO ACID MOTIFS are often composed of conserved sequences. MeSH, 1993

A "highly conserved sequence" is a DNA sequence that is very similar in several different kinds of organisms. Scientists regard these cross species similarities as evidence that a specific gene performs some basic function essential to many forms of life and that evolution has therefore conserved its structure by permitting few mutations to accumulate in it. NHGRI

contig: A contiguous (i.e. without gaps) stretch of DNA sequence which has been assembled solely on the basis of direct sequencing information, i.e. sequencer reads. Note however that 'contig' is used in other contexts in genomics to mean a contiguous assembly of something (e.g. clones), without necessarily implying that all the bases in the assembly have been determined. Ensembl Glossary

Published genome sequence has many gaps and interruptions. The concept of "contig" is crucial to understanding current limitations. David Galas "Making sense of the sequence" Science 291 (5507): 1257, Feb. 16, 2001  Wikipedia

contig assembly: One of the most difficult and critical functions in DNA sequence analysis is putting together fragments from sets of overlapping segments. Some programs do this better than others, particularly when dealing with sequences containing gaps. Laura De Francesco "Some things considered" Scientist 12[20]:18, Oct. 12, 1998

contig mapping: Maps & mapping

DDBJ DNA DataBank of Japan: Shares information daily with EMBL and GenBank.

discordance: Lack of concordance among results from different microarray experiments.  Related terms: concordance, mismatches

distributed sequence annotation: The pace of human genomic sequencing has outstripped the ability of sequencing centers to annotate and understand the sequence prior to submitting it to the archival databases. Multiple third-party groups have stepped into the breach and are currently annotating the human sequence with a combination of computational and experimental methods. Their analytic tools, data models, and visualization methods are diverse, and it is self-evident that this diversity enhances, rather than diminishes, the value of their work.  Lincoln Stein, et al. Distributed Sequence Annotation, 2000

DNA computers: Seeks to use biological molecules such as DNA and RNA to solve basic mathematical problems. Fundamentally, many of these experiments recapitulate natural evolutionary processes that take place in biology, especially during the early evolution of life and the creation of genes. Laura Landweber, "DNA Computing" Princeton Univ. Freshman Seminar, 1999.   

DNA computing: An interdisciplinary field that draws together molecular biology, chemistry, computer science and mathematics. There are currently several research disciplines driving towards the creation and use of DNA nanostructures for both biological and non-biological applications. These converging areas are: the miniaturization of biosensors and biochips into the nanometer scale regime; the fabrication of nanoscale objects that can be placed in intracellular locations for monitoring and modifying cell function; the replacement of silicon devices with nanoscale molecular-based computational systems; and the application of biopolymers in the formation of novel nanostructured materials with unique optical and selective transport properties. DNA Computing & Informatics at Surfaces, Univ. of Wisconsin-Madison, June 1-4, 2003.

Wikipedia   Related terms: molecular computing, quantum computing Or are these the same/overlapping? 

Ensembl: A joint project between EMBL- EBI and the Sanger Centre (UK) to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.

exon parsing: Identifying precisely the 5' and 3' boundaries of genes (the transcription unit) in metazoan genomes, as well as the correct sequences of the resulting mRNA ("exon parsing") has been a major challenge of bioinformatics for years. Yet, the current program performances are still totally insufficient for a reliable automated annotation (Claverie 1997; Ashburner 2000). It is interesting to recapitulate quickly the research in this area to illustrate the essential limitation plaguing modern bioinformatics. Encoding a protein imposes a variety of constraints on nucleotide sequences, which do not apply to noncoding regions of the genome. These constraints induce statistical biases of various kinds, the most discriminant of which was soon recognized to be the distribution of six-nucleotide-long "words" or hexamers (Claverie and Bougueleret 1986; Fickett and Tung 1992).  JM Claverie "From Bioinformatics to Computational Biology" Genome Res 10 (9): 1277-1279, Sept. 2000
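The hexamer statistic Claverie describes reduces to counting overlapping six-letter words in a sequence. A minimal sketch (the `hexamer_counts` helper and the toy sequence are hypothetical, for illustration only; real coding-potential methods compare such counts against hexamer frequencies trained on known coding and noncoding DNA):

```python
from collections import Counter

def hexamer_counts(seq):
    """Count overlapping six-nucleotide words ("hexamers"),
    the statistical signal found to be most discriminant
    between coding and noncoding DNA."""
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))

counts = hexamer_counts("ATGGCGATGGCG")
print(counts["ATGGCG"])  # 2
```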

exon prediction:  Since prokaryotes don't have introns, exon prediction implies working with eukaryotes. Is exon prediction equivalent to gene prediction in prokaryotes?  Related terms: ab initio gene prediction; GRAIL Sequencing

exon shuffling theory: Contends that introns act as spacers where breaks for genetic recombination occur. Under this scenario, exons - which usually contain instructions for building a protein subunit - remain intact when shuffled during recombination. In this way, proteins with new functional repertoires can evolve.  Peter Schmidt, "Shuffling, Recombination, and the Importance of ...Nonsense"  Swarthmore College  
Wikipedia   Related terms: DNA shuffling, domain shuffling, gene shuffling, protein shuffling  

extreme phenotype selection studies: Systematic collection of phenotypes and their correlation with molecular data has been proposed as a useful method to advance in the study of disease. Although some databases for animal species are being developed, progress in humans is slow, probably due to the multifactorial origin of many human diseases and to the intricacy of accurately classifying phenotypes, among other factors. An alternative approach has been to identify and to study individuals or families with very characteristic, clinically relevant phenotypes. This strategy has shown increased efficiency to identify the molecular features underlying such phenotypes. While on most occasions the subjects selected for these studies presented harmful phenotypes, a few studies have been performed in individuals with very favourable phenotypes. The consistent results achieved suggest that it seems logical to further develop this strategy as a methodology to study human disease, including cancer. The identification and the study with high-throughput techniques of individuals showing a markedly decreased risk of developing cancer or of cancer patients presenting either an unusually favourable prognosis or striking responses following a specific treatment, might be promising ways to maximize the yield of this approach and to reveal the molecular causes that explain those phenotypes and thus highlight useful therapeutic targets.  Selection of extreme phenotypes; the role of clinical observation in translational research José Luis Pérez-Gracia  Clinical and Translational Oncology 2010 Mar;12(3):174-80.  Broader term: phenotype

false negative: The chance of declaring an expression change (e.g., in gene expression) to be insignificant when in fact a change has occurred. The opposite situation is the false positive. 

false positive: The chance of declaring an expression change to be significant when in fact no change has occurred. This tends to be a more pressing concern than false negatives in microarray experiments. 

filtering: A process whose aim is to reduce a microarray dataset to a more manageable size, by getting rid of genes that show no significant expression changes across the experiment or that are uninteresting for biological reasons. 
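A fold-change cutoff across conditions is one simple criterion of the kind described above. This sketch is illustrative only (the function name, the 2-fold threshold, and the toy data are assumptions; real pipelines combine several filters, including absolute-intensity and statistical tests):

```python
def filter_flat_genes(expression, min_fold=2.0):
    """Keep genes whose max/min expression ratio across
    conditions reaches min_fold; drop 'flat' genes.
    `expression` maps gene -> list of positive values."""
    return {g: vals for g, vals in expression.items()
            if max(vals) / min(vals) >= min_fold}

data = {"geneA": [10.0, 30.0, 20.0],   # 3-fold span: kept
        "geneB": [15.0, 16.0, 14.5]}   # ~1.1-fold span: dropped
print(sorted(filter_flat_genes(data)))  # ['geneA']
```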

finished sequence - human: Sequence in which bases are identified to an accuracy of no more than 1 error in 10,000 and are placed in the right order and orientation along a chromosome with almost no gaps. "A Genome Glossary" Science 291: pullout chart, Feb. 16, 2001

"At some level it's a little arbitrary when you declare a sequence essentially complete," says NHGRI Director Francis Collins. The definition of finished is evolving. "Our definition today is different from 10 years ago. Ten years ago we didn't even think at the level of genomes," says Laurie Goodman, editor of Genome Research. "I think the community at large should define done. Not everyone is going to agree, but when you're using the word you should define what it means." Francis Collins says, "You're done when you've exhausted the standard methods for closing the gaps. There should be some biological reason why those last bits of sequence eluded you - not because you just didn't bother." "Are we there yet?" The Scientist :12, July 19, 1999

fold change: A way of describing how much larger or smaller one number is compared with another. When the first number is larger than the second, it is simply the ratio of the first to the second. When the first number is smaller than the second, it is the ratio of the second to the first with a minus sign in front. When the numbers are equal, it is 1. For example, the fold change of 50 versus 10 is 50/10 = 5, while the fold change of 10 versus 50 is -5. 
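The signed convention above translates directly into code; a minimal sketch (the function name is illustrative):

```python
def fold_change(a, b):
    """Signed fold change of a versus b: the ratio a/b when
    a >= b, otherwise -(b/a); equal values give 1."""
    if a >= b:
        return a / b
    return -(b / a)

print(fold_change(50, 10))  # 5.0
print(fold_change(10, 50))  # -5.0
```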

gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. NCBI BLAST Glossary
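The open-versus-extend penalty scheme in the NCBI definition can be illustrated with a toy scorer over an already-gapped alignment. The penalty values and function name here are arbitrary, for illustration only (they are not BLAST defaults, and real aligners search over all gappings rather than score a fixed one):

```python
def score_alignment(s1, s2, match=1, mismatch=-1,
                    gap_open=-2, gap_extend=-1):
    """Score two equal-length aligned strings ('-' = gap):
    opening a gap costs gap_open, each further gapped
    position in the same run costs gap_extend."""
    score, in_gap = 0, False
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

# 5 matches (+5), one gap opened (-2) and extended once (-1)
print(score_alignment("GATTACA", "GA--ACA"))  # 2
```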

GenBank: Located at NCBI, shares information daily with DDBJ and EMBL. NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Currently estimated (early 2000) that over 2 million bases are deposited here each day. This growth will only accelerate in the future.

Now accommodates > 10^10 nucleotides and more than doubles in size every year. [David Roos "Bioinformatics -- Trying to Swim in a Sea of Data" Science 291: 1260-1261, Feb. 16, 2001]

gene finding programs: Bioinformatics Resource, Center for Molecular and Genetic Medicine, Stanford Univ. School of Medicine. List of programs has been compiled and updated from James W. Fickett, "Finding genes by computer: the state of the art" Trends in Genetics, August 1996, 12 (8) 316- 320

gene identification: The effectiveness of finding genes by similarity to a given sequence segment is determined by a much simpler statistic, the total coverage of the genome by the collective set of sequence contigs. As the overall coverage of the genome is virtually complete (> 90%), there is a strong likelihood that every gene is represented, at least in part, in the data. Thus, finding any gene by sequence similarity searches using sufficient sequence to ensure significance is almost always possible using the data published this week. Caution must be exercised, however, as the identification of the gene may still be ambiguous. This is because a highly similar sequence from a receptor gene from Drosophila, for example, could be found in several different, homologous genes, which may have similar or entirely different functions or are nonfunctioning pseudogenes. In other words, common domains or motifs can be present in many different genes. The use of the approximate similarity search tool BLAST is probably still the best way to find similar sequences. David Galas "Making Sense of the Sequence" Science 291: 1257-1260, Feb. 16, 2001

There are two basic approaches to gene identification: homology-based and ab initio approaches. Marker SNPs can be used to home in on otherwise hard-to-find genes.

Gene Ontology™ (GO): Functional genomics

gene parsing: Initial gene parsing methods were then simply based on word frequency computation, eventually combined with the detection of splicing consensus motifs. The next generation of software implemented the same basic principles into a simulated neural network architecture (Uberbacher and Mural 1991). Finally, the last generation of software, based on Hidden Markov Models, added an additional refinement by computing the likelihood of the predicted gene architectures (e.g., favoring human genes with an average of seven coding exons, each 150 nucleotides long) (Kulp et al. 1996; Burge and Karlin 1997). These ab initio methods are used in conjunction with a search for sequence similarity with previously characterized genes or expressed sequence tags (ESTs). JM Claverie "From Bioinformatics to Computational Biology" Genome Res 10 (9): 1277-1279, Sept. 2000

gene prediction: Wikipedia 

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification during the past decade, the accuracy of gene prediction tools is not sufficient to locate the genes reliably in higher eukaryotic genomes. Thus, while the precise sequence of the human genome is increasingly deciphered, gene number estimations are becoming increasingly variable. ... In 1996 we published a comprehensive evaluation of gene prediction programs' accuracy (Burset and Guigó, 1996). ... Recently we have published a revised version of this evaluation (Guigó et al., 2000). This revised evaluation suggests that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology. Genome Bioinformatics Research Lab, Center for Genomic Regulation (Centre de Regulació Genòmica - CRG), Barcelona, 2004 

Many methods for predicting genes are based on compositional signals that are found in the DNA sequence. These methods detect characteristics that are expected to be associated with genes, such as splice sites and coding regions, and then piece this information together to determine the complete or partial sequence of a gene. Unfortunately, these ab initio methods tend to produce false positives, leading to overestimates of gene numbers, which means that we cannot confidently use them for annotation. They also do not work well with unfinished sequence that has gaps and errors, which may give rise to frameshifts, when the reading frame of the gene is disrupted by the addition or removal of bases. ... The most effective algorithms integrate gene- prediction methods with similarity comparisons.... The most powerful tool for finding genes may be other vertebrate genomes. Comparing conserved sequence regions between two closely related organisms will enable us to find genes and other important regions in both genomes with no previous knowledge of the gene content of either.  Ewan Birney et. al "Mining the draft human genome" Nature 409: 827-828 15 Feb. 2001 

Sadly, it is often claimed that matching back cDNA to genomic sequences is the best gene identification protocol; hence, admitting that the best way to find genes is to look them up in a previously established catalog! Thus, the two main principles behind state- of- the- art gene prediction software are (1) common statistical regularities and (2) plain sequence similarity. From an epistemological point of view, those concepts are quite primitive. JM Claverie "From Bioinformatics to Computational Biology" Genome Res 10: (9) 1277- 1279.Sept. 2000 

Algorithms have been developed and are combined to recognize gene structural components.  Narrower/synonymous? term: ab initio gene prediction Related term: comparative genomics

gene recognition: Principally used for finding open reading frames, tools of this type also recognize a number of features of  genes, such as regulatory regions, splice junctions, transcription and  translation stops and starts, GC islands, and poly adenylation sites. Laura De Francesco "Some things considered" Scientist 12[20]:18, Oct. 12, 1998

genetic association studies: The analysis of a sequence such as a region of a chromosome, a haplotype, a gene, or an allele for its involvement in controlling the phenotype of a specific trait, metabolic pathway, or disease. MeSH 2010   See also Genome Wide Association Studies GWAS

genetic models: Theoretical representations that simulate the behavior or activity of genetic processes or phenomena. They include the use of mathematical equations, computers, and other electronic equipment. MeSH 1980

genome annotation: It is now apparent that the bottleneck in genomics is no longer in sequencing the genomes, but lies in their annotation. Large- scale annotation efforts require handling massive amounts of genome data through automated pipelines, with a need to combine diverse sources of data and methods. In addition, it requires visualisation tools to manually examine the automatic annotation, since integration of human expertise to assess the validity and authenticity of all computational results goes a long way to improve the quality of gene annotation. The "Annotation Jamboree", a collaboration between Celera, the Berkeley Drosophila Genome Project, and a team of experts on the annotation of the Adh region of Drosophila, is an exemplary attempt on how to transform the process of manual annotation into a high- throughput operation. [Paradigm Shifts in the Approaches for Gene Annotation, a special issue of "Briefings in Bioinformatics" which reports on the proceedings from the recently concluded symposium on "Genome Based Gene Structure Determination" conducted at the EMBL European Bioinformatics Institute (EBI) during June 1- 2, 2000.]  Narrower term: comparative genome annotation
Genome Annotation Data Warehouse: Databases & software

genome database mining: The identification of the protein- encoding regions of a genome and the assignment of functions to these genes on the basis of sequence similarity homologies against other genes of known function.  [John L. Houle et. al., White Paper: Database Mining in the Human Genome Initiative, AMITA Corp. 2000]  Related terms: Expression, genes & beyond: gene expression database mining; Proteomics: proteome database mining

genome informatics: The Twelfth International Conference on Genome Informatics (GIW 2001) focuses on Genome Informatics, including, but not limited to, the following areas: genomic database, knowledge extraction from literature, knowledge discovery and data mining from databases, structural genomics, protein structure and function prediction, proteome analysis, pathway analysis, functional genomics, gene expression analysis, gene network analysis, gene structure and function prediction, sequence analysis, motif extraction and search, multiple alignment, phylogenetic tree, linkage analysis program, systems for supporting experimental works (mapping, sequencing, primer design, etc.), high performance computing, simulation of biological system, DNA computing, artificial life, etc. [GIW 2001 homepage, Dec. 17-19, 2001, Tokyo, Japan]

Genome informatics can be divided into a few large categories: data acquisition and sequence assembly, database management, and genome analysis tools. ...Managing such a diverse informatics effort is a considerable challenge [JASON Program Office, Human Genome Project report "Genome Informatics" 1997]

genome map: Maps & mapping genetic & genomic

genome visualization: A significant challenge for genome centers is to make the data being generated available to biologists in a succinct and meaningful way. We are addressing this problem by creating extensible, reusable graphical components specifically designed for developing genome visualization applications. With careful planning and design this toolkit enhances the ability for others and ourselves to rapidly develop genome visualization applications for the Internet and as editing applications. Data Visualization for Distributed Bioinformatics, Gregg Helt, Suzanna Lewis, Nomi Harris, Gerald M. Rubin, DOE Human Genome Program Contractor- Grantee Workshop VII, Jan. 12-16, 1999 Oakland, CA

genomic computing: A genomic computing network is a variant of a neural network for which a genome encodes all aspects, both structural and functional, of the network. The genome is evolved by a genetic algorithm to fit particular tasks and environments. The genome has three portions: one for specifying links and their initial weights, a second for specifying how a node updates its internal state, and a third for specifying how a node updates the weights on its links. Preliminary experiments demonstrate that genomic computing networks can use node internal state to solve POMDPs more complex than those solved previously using neural networks. Association for Computing Machinery, ACM Digital Library, Guide to Computing Literature   

genomic data:
The strength of genomic studies lies in the global comparisons between biological systems rather than detailed examination of single genes or proteins. Genomic information is often misused when applied exclusively to individual genes. If one is interested only in one particular gene, there are many more conclusive experiments that should be consulted before using the results from genomic datasets. Therefore, genomic data should not be used in lieu of traditional biochemistry, but as an initial guideline to identify areas for deeper investigation and to see how those results fit in with the rest of the genome. Moreover, most genomics datasets give relative rather than absolute information, which means that information about a single gene has little meaning in isolation. [Dov Greenbaum, Mark Gerstein et al. "Interrelating Different Types of Genomic Data" Dept. of Biochemistry and Molecular Biology, Yale Univ., 2001]  Related terms: Expression genes & proteins; -Omes & -Omics interactome; Proteomics



GRAILexp: Gene Recognition and Assembly Internet Link software  

The GRAILexp FAQ, with references to Perceval, an exon prediction program; Galahad, a gene message alignment program; and Gawain, a gene assembly program, clearly has scientific and literary finesse. Does this name relate in any way to Walter Gilbert's description of the Human Genome Project as the "Holy Grail" of molecular biology? I should ask them.

global normalization or mean scaling: The standard solution for errors that affect entire arrays is to scale the data so that the average measurement is the same for each array (and each color). The scaling is accomplished by computing the average expression level for each array, calculating a scale factor equal to the desired average divided by the actual average, and multiplying every measurement from the array by that scale factor. The desired average can be arbitrary, or computed from the average of a group of arrays. 
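The recipe above (scale factor = desired average / actual average) can be sketched directly; the function name and toy data are illustrative assumptions:

```python
def global_normalize(arrays, target=None):
    """Scale each array so its mean equals `target`
    (default: the mean of all the array means)."""
    means = [sum(a) / len(a) for a in arrays]
    if target is None:
        target = sum(means) / len(means)
    # multiply every measurement by target / actual mean
    return [[x * (target / m) for x in a]
            for a, m in zip(arrays, means)]

raw = [[2.0, 4.0, 6.0],      # mean 4
       [10.0, 20.0, 30.0]]   # mean 20
norm = global_normalize(raw, target=12.0)
# each normalized array now has mean 12
```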

GWAS Genome Wide Association Studies: An analysis comparing the allele frequencies of all available (or a whole GENOME representative set of) polymorphic markers in unrelated patients with a specific symptom or disease condition, and those of healthy controls to identify markers associated with a specific disease or condition. MeSH 2009

The NIH is interested in advancing genome-wide association studies (GWAS) to identify common genetic factors that influence health and disease. For the purposes of this policy, a genome-wide association study is defined as any study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (such as blood pressure or weight), or the presence or absence of a disease or condition. Whole genome information, when combined with clinical and other phenotype data, offers the potential for increased understanding of basic biological processes affecting human health, improvement in the prediction of disease and patient care, and ultimately the realization of the promise of personalized medicine. In addition, rapid advances in understanding the patterns of human genetic variation and maturing high-throughput, cost-effective methods for genotyping are providing powerful research tools for identifying genetic variants that contribute to health and disease.   GWAS, NIH 2008 

Pronounced gee-wahs  Related term: next generation sequencing

high throughput nucleotide sequencing: [analysis] Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced. The sequencing may be done by analysis of the synthesis or ligation products, hybridization to preexisting sequences, etc. MeSH 2011

homology: Narrower terms: sequence homology, sequence homology - nucleic acid; Functional genomics homology. Related terms: homolog (homologue), similarity, ortholog, paralog, xenology; Molecular modeling homology modeling   

International Nucleotide Database: Composed of  DDBJ, EMBL and GenBank.

local alignment: The alignment of some portion of two nucleic acid or protein sequences. NCBI BLAST glossary

Best alignment method for sequences for which no evolutionary relatedness is known. See Smith-Waterman alignment. Compare global alignment.

log ratios: DNA microarray assays typically compare two biological samples and present the results of those comparisons gene-by-gene as the logarithm base two of the ratio of the measured expression levels for the two samples. The limits of log ratios, Vasily Sharov, Ka Yin Kwong, Bryan Frank, Emily Chen, Jeremy Hasseman, Renee Gaspard, Yan Yu, Ivana Yang and John Quackenbush, BMC Biotechnology 4, 2004, doi: 10.1186/1472-6750-4-3. 
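A minimal sketch of the log base two ratio computation, using hypothetical expression values and NumPy:

```python
import numpy as np

# Expression measurements for the same genes in two samples (hypothetical values).
sample1 = np.array([100.0, 400.0, 50.0])
sample2 = np.array([100.0, 100.0, 200.0])

log_ratios = np.log2(sample1 / sample2)
# 0.0 -> no change; +2.0 -> 4-fold up in sample1; -2.0 -> 4-fold down.
```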

MAGE  Microarray and Gene Expression: The group aims to provide a standard for the representation of microarray expression data that would facilitate the exchange of microarray information between different data systems.

MAGE-ML MicroArray and Gene Expression Markup Language: A language designed to describe and communicate information about microarray based experiments. MAGE-ML is based on XML and can describe microarray designs, microarray manufacturing information, microarray experiment setup and execution information, gene expression data and data analysis results. MAGE-ML has been automatically derived from Microarray Gene Expression Object Model (MAGE-OM), which is developed and described using the Unified Modelling Language (UML) -- a standard language for describing object models. [Robin Cover, XML Cover Pages: Microarray and Gene Expression Markup Language, 2002]  Related terms: GEML, MAML, MIAME

MAML Microarray Markup Language: MAML (Microarray Markup Language) is no longer supported by MGED and has been replaced by MAGE-ML.  Broader term: standards; Related terms: data analysis - microarray, MGED, MIAME  

MGED Microarray Gene Expression Database group: The MGED group is a grass-roots movement whose goal is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalization methods. The group was founded at the Microarray Gene Expression Database meeting MGED1 (November 1999, Cambridge, UK). There are four major standardization projects being pursued by the group: MIAME, MAGE, Ontologies, Normalisation. Broader term: standards; Related terms: data analysis - microarray, MAGE, MAML, MIAME.

MGED Network, Ontology Working Group: The primary purpose of the MGED Ontology is to provide standard terms for the annotation of microarray experiments. These terms will enable structured queries of elements of the experiments. Furthermore, the terms will also enable unambiguous descriptions of how the experiment was performed. The terms will be provided in the form of an ontology, which means that the terms will be organized into classes with properties and will be defined. A standard ontology format will be used.

MIAME Minimum Information About a Microarray Experiment: MIAME aims to outline the minimum information required to unambiguously interpret microarray data and to subsequently allow independent verification of this data at a later stage if required. MIAME is not a dogma for microarray experiments to follow, but just a set of guidelines. This set of guidelines will then assist with the development of microarray repositories and data analysis tools. [MIAME homepage 2003]
Broader term: standards; Related terms: data analysis - microarray, MAGE, MAML, MGED

MIAME Checklist, MGED, 2003
MIAME glossary, MGED, 2003
MIAME software, MGED, 2003. A list of possibly MIAME-compliant software.
The boundaries between MIAME concepts, the MIAME-compliant MAGE-OM and the MGED ontology (which try to define and structure the MIAME concepts) are neither well defined nor easy to understand. In order to provide some help, this webpage contains explanatory documentation to understand the MIAME concepts, how its requirements map to the MAGE-OM and where the MGED ontology inclusion is required. [MGED, MIAME MAGE-OM, 2002]

microarray analysis techniques: Wikipedia 

microarrays - data analysis: It is obvious to reviewers of submitted manuscripts that many researchers have used microarrays to perform experiments that provide no biological insight whatsoever. That's not due to any failure on the part of the technology, but rather to the failure to design experiments or appreciate the limitations of microarray technology. The use of microarrays will not turn a poorly conceived or poorly executed experiment into a groundbreaking scientific achievement, any more than buying a sports car will turn one into a NASCAR driver. Catherine Ball, Director, Stanford Microarray Database, as part of a panel on "Has the promise of microarrays been oversold?" Science Functional Genomics weblog 

Microarrays have revolutionized molecular biology. The number of applications for microarrays is growing as quickly as their probe density. Paradoxically, microarray data still contain a large number of variables and a small number of replicates, creating unique data analysis challenges. Still, the first and most important goal is to design microarray experiments that yield statistically defensible results. Related terms: image analysis - microarrays; standards; cluster analysis, pattern recognition Algorithms & data management glossary

Stanford MicroArray Database (SMD), Stanford Univ., US. Stores raw and normalized data from microarray experiments, as well as their corresponding image files. In addition, SMD provides interfaces for data retrieval, analysis and visualization. 
Terry Speed's Microarray Data Analysis Group Page, UC-Berkeley, US http://www.stat.Berkeley.EDU/users/terry/zarray/Html/index.html
Microarray databases: Databases & software directory

microarrays image analysis: Although the visual image of a microarray panel is alluring, its information content, per se, is minimal without significant image processing. To mine its lode effectively, quantitative signal must be determined optimally, which means subtracting background, calculating confidence intervals - outside of which a difference in signal ratio is deemed to be significant - and calibrated. Editorial “Getting hip to the chip” Nature Genetics 18(3): 195- 197 March 1998

This process starts with the image of a microarray that is produced in the laboratory and produces intensity information indicating the amount of light emitted by each probe. In particular, after the array has been hybridized, it is scanned to obtain an image that shows the amount of light emitted across the surface of the microarray. The image is then analyzed to identify the "spots" (i.e., the parts of the image corresponding to the DNA probes on the microarray) and the amount of light that can be attributed to target molecules bound to each probe.  Related term: normalization
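One common quantification step, averaging the pixels inside a spot and subtracting a local background estimate, can be sketched as below. This is a simplified illustration with a toy image patch; real image analysis packages also handle spot finding, flagging and confidence estimation:

```python
import numpy as np

def spot_intensity(patch, spot_mask):
    """Background-subtracted signal for one spot in an image patch.

    patch: 2-D array of pixel intensities around one printed spot.
    spot_mask: boolean array, True for pixels inside the spot.
    Background is estimated as the median of the surrounding pixels.
    """
    background = np.median(patch[~spot_mask])
    signal = patch[spot_mask].mean() - background
    return max(signal, 0.0)   # clip negative values from noisy background

# Toy 4x4 patch: a bright 2x2 spot on a background of 10.
patch = np.full((4, 4), 10.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
patch[mask] = 60.0
# spot_intensity(patch, mask) -> 50.0
```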

microarray informatics: The microarray field is experiencing an overwhelming push toward robust statistics and mathematical analytic methods that go far beyond the simple fold analysis and basic clustering that were once the mainstays of researchers in this area. This push toward better statistics is also driving the recognition of the need for more replication of experiments. These stronger analytical techniques also help researchers identify problem areas in the technology and laboratory processes, and these improvements, in turn, greatly improve the quality of results that can be provided.  Related terms microarray analysis, microarray data analysis

mismatches: Gene expression microarray data is notoriously subject to high signal variability. Moreover, unavoidable variation in the concentration of transcripts applied to microarrays may result in poor scaling of the summarized data, which can hamper analytical interpretations. This is especially relevant in a systems biology context, where systematic biases in the signals of particular genes can have severe effects on subsequent analyses. Conventionally it would be necessary to replace the mismatched arrays, but individual time points cannot be rerun and inserted because of experimental variability. It would therefore be necessary to repeat the whole time series experiment, which is both impractical and expensive. Correction of scaling mismatches in oligonucleotide microarray data, Martino Barenco, Jaroslav Stark, Daniel Brewer, Daniela Tomescu, Robin Callard and Michael Hubank, BMC Bioinformatics 2006, 7:251, doi: 10.1186/1471-2105-7-251 

molecular sequence annotation: The addition of descriptive information about the function or structure of a molecular sequence to its MOLECULAR SEQUENCE DATA record. MeSH 2011  

Next-Gen Sequencing Informatics April 29-May 1, 2014 • Boston, MA 

noise characterization:  Noise is a big problem in analyzing gene expression microarray data. Of course noise is a problem with biological data in general. 

normality: The collection of log ratios from a single microarray experiment is typically quite unlike a random sample from a single normal population. This is particularly so when a lot (say > 10%) of genes are differentially expressed. ... Conclusion: It is dangerous to use normal statistical theory to guide your selection of differentially expressed genes. The normal thinking which says that about 68% should be within 1 standard deviation (SD), 95% within 2 SDs and 99% within 3 SDs of the mean does not apply, even when no differential expression is present. Avoid assuming normality, Terry Speed Group Microarray Page, 2000. Related term: Clinical genomics glossary normal
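The failure of normal-theory cutoffs can be illustrated numerically. The sketch below uses a hypothetical heavy-tailed set of log ratios (mostly unchanged genes plus a few strongly differentially expressed ones); the fraction of values within one standard deviation is nowhere near the 68% that normal thinking predicts:

```python
import statistics

# Hypothetical log ratios: 98 unchanged genes plus two extreme outliers.
log_ratios = [0.0] * 98 + [10.0, -10.0]

mean = statistics.mean(log_ratios)
sd = statistics.pstdev(log_ratios)   # population SD of the sample
within_1sd = sum(abs(x - mean) <= sd for x in log_ratios) / len(log_ratios)
# Normal theory predicts ~68% of values within 1 SD; here 98% are,
# so normal-theory thresholds would badly misjudge the tails.
```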

normalization: One approach is to place a modest number of control probes on the array and add known quantities of matching target molecules to the sample. This is often called a spike-in method, because the sample is "spiked" with known quantities of control target. The idea is that by correlating the readout of each control with the known amount of target, it should be possible to better account for variations in the process. A nice study of this approach for Affymetrix GeneChips was done by Gene Brown’s laboratory at Genetics Institute/ Wyeth- Ayerst Research. Footnote: Hill AA, Brown EL, et al. "Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls." Genome Biology. 2001 2(12): research0055.1-0055.13 
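A minimal sketch of the spike-in idea: fit a scale factor mapping measured control intensities to their known spiked-in amounts, then apply that factor to the rest of the array. The values are hypothetical, and a no-intercept least-squares fit is assumed for simplicity:

```python
import numpy as np

def spikein_scale(measured_controls, known_amounts):
    """Least-squares scale factor (no intercept) mapping measured control
    intensities to the known amounts of spiked-in target."""
    measured = np.asarray(measured_controls, dtype=float)
    known = np.asarray(known_amounts, dtype=float)
    return (measured @ known) / (measured @ measured)

# Hypothetical spike-in controls whose intensities read ~2x the true amounts:
scale = spikein_scale([20.0, 40.0, 80.0], [10.0, 20.0, 40.0])
# Multiplying all intensities on the array by `scale` corrects the readout.
```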

Sources of systematic variation addressed by normalization include differences in the efficiency with which RNA is labelled between the two channels, and spatial biases in ratios across the surface of the microarray. MGED Normalization Working Group, 2002

The conversion of intensity information (from image analysis) into estimates of gene expression levels. For researchers who are using statistical methods, this process also characterizes the uncertainty in the measurements. The goal of normalization is to convert the intensity measurements generated by image analysis into estimates of gene expression levels in the original biological source. Concretely, the challenge is to compensate for as many sources of error as possible.  Related terms: fold changes, image analysis, log ratios; See also normalization: Algorithms glossary

Normalization for cDNA microarrays, Yee Hwa Yang, Sandrine Dudoit, Percy Luu and Terry Speed, 2001

oligonucleotide array sequence analysis: Hybridization of a nucleic acid sample to a very large set of oligonucleotide probes, which are attached to a solid support, to determine sequence or to detect variations in a gene sequence or expression or for gene mapping. MeSH, 1999

Useful to know this MeSH heading for microarrays, but use free- text as well to search PubMed.

Ontology Working Group: Charged with developing an ontology for describing samples used in microarray experiments. MGED Network, Ontology Working Group

ORF prediction: Related terms: exon prediction, gene prediction, gene recognition.

ORF recognition: ESTs provide candidate genes, useful in positional cloning (during walks and for recognizing ORFs) and for ORF recognition in cloning of insertion sites. Report from the Workshop on Genomic and Genetic Tools for the Zebrafish May 10-11, 1999, Trans- NIH Zebrafish Initiative

Phred: Base calling program for DNA sequence traces; ... developed by Drs. Phil Green and Brent Ewing, and is distributed under license from the University of Washington.
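Phred expresses base-call confidence on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion:

```python
import math

def phred_quality(error_prob):
    """Phred quality score Q = -10 * log10(P), where P is the estimated
    probability that the base call is wrong."""
    return -10.0 * math.log10(error_prob)

# phred_quality(0.001) -> 30.0  (a 1-in-1000 chance the call is wrong)
```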


reverse transfection: A microarray- based system for the functional analysis in mammalian cells of many genes in parallel. Mammalian cells are cultured on a glass slide printed in defined locations with solutions containing different DNAs. Cells growing on the printed areas take up the DNA, creating spots of localized transfection within a lawn of non- transfected cells. ... we have developed two methods to reverse transfect cells....  By printing sets of complementary DNAs (cDNAs) cloned in expression vectors, we can make microarrays whose features are groups of live cells that express a defined cDNA at each location. These 'transfected cell microarrays' should be of broad utility for the high- throughput expression cloning of genes, particularly in areas such as signal transduction and drug discovery. For many applications these arrays can serve as substitutes for protein microarrays, particularly for proteins that are difficult to purify, such as membrane proteins.  David Sabatini "Reverse transfection" Whitehead Institute, MIT, US

RNA sequence analysis: A multistage process that includes cloning, physical mapping, subcloning, sequencing, and information analysis of an RNA SEQUENCE. MeSH 1993

scaffolds: A series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence. "A Genome Glossary," History of the Human Genome Project, Science 291: pullout chart, Feb. 16, 2001

Contig sequences separated by gaps NCBI Whole Genome Shotgun Submissions 

The definition of a scaffold appears to be quite different in the Science and Nature draft published sequences. David Galas "Making sense of sequence" Science 291: 1257-  Feb. 16, 2001 This is also different from the scaffold defined in Drug discovery and development.  
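In sequence files, a scaffold is conventionally represented by joining the ordered contigs with runs of N characters standing in for the gaps. A minimal sketch with hypothetical sequences:

```python
def build_scaffold(contigs, gap_sizes, gap_char="N"):
    """Join ordered contig sequences into one scaffold string, padding each
    gap of unknown sequence with N characters (a common FASTA convention)."""
    assert len(gap_sizes) == len(contigs) - 1
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        parts.append(gap_char * gap)
        parts.append(contig)
    return "".join(parts)

# Three ordered contigs separated by gaps of 3 and 2 bases:
scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [3, 2])
# scaffold == "ACGTNNNGGCCNNTTAA"
```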

scoring methods: Many choices, best choice often problem dependent. Nice review: "Sequence Analysis: Which scoring method should I use?" Pittsburgh Supercomputing Center, Carnegie Mellon Univ., 1999. Narrower term: SNP scoring

sequence alignment:  The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms. MeSH, 1991 Broader term? alignments.

sequence homology:  The degree of similarity between sequences. Studies of amino acid and nucleotide sequences provide useful information about the genetic relatedness of certain species. MeSH, 1993 Broader term Functional genomics homology;   Related terms Functional genomics evolutionary homology; Proteomics regulatory homology;  

sequence homology - nucleic acid: The sequential correspondence of nucleotide triplets in a nucleic acid molecule which permits nucleic acid hybridization. Sequence homology is important in the study of mechanisms of oncogenesis and also as an indication of the evolutionary relatedness of different organisms. The concept includes viral homology. MeSH, 1991 Broader term sequence homology

Sequence Ontology Project:  The Sequence Ontology is a set of terms and relationships used to describe the features and attributes of biological sequence.    

sequencing algorithms: See BLAST, FASTA, Needleman-Wunsch, Smith-Waterman   

similarity search: BLAST, FASTA and Smith- Waterman are examples of similarity search algorithms.
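For concreteness, here is a minimal Smith-Waterman scorer (score only, without traceback; the match, mismatch and gap parameters are illustrative):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment: return the best local alignment score.

    Classic dynamic programming; cells are floored at zero so the
    alignment can start and end anywhere in either sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# The local block "ACGT" matches fully inside the second sequence:
# smith_waterman("ACGT", "TACGTT") -> 8
```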

Smith-Waterman alignment: An amino acid sequence alignment that illustrates sequence similarity. The alignment is generated using the Smith-Waterman algorithm (Temple Smith and MS Waterman, J Mol Biol. 147: 195-197, 1981; WR Pearson, Genomics 11: 635-650, 1991). SGD Saccharomyces Genome Database glossary, Stanford Univ.

standards - microarrays: See GEML, MGED, MIAME  

DOE, Human Genome Project Information, Oak Ridge National Laboratory, Dictionary of Genetic Terms. 2007, 100+ definitions. 
MeSH Medical Subject Headings, PubMed 
NHGRI (National Human Genome Research Institute), Talking Glossary of Genetic Terms, 100+ definitions. Includes extended audio definitions.
Schlindwein Birgid, Hypermedia Glossary of Genetic Terms, 2006, 670 definitions.
Systems Biology Gateway, BioMedCentral 

Informatics Conferences
BioIT World Expo
Molecular Medicine Tri Conference

Informatics CDs, DVDs  
Informatics Short courses

BioIT World magazine   
   BioIT World archives

Insight Pharma Reports Informatics series 

Alpha glossary index
How to look for other unfamiliar  terms

IUPAC definitions are reprinted with the permission of the International Union of Pure and Applied Chemistry.

Contact | Privacy Statement | Alphabetical Glossary List | Tips & glossary FAQs | Site Map