
Gene definitions & taxonomy
  Evolving terminology for emerging technologies

Comments? Suggestions? Questions?
Mary Chitty MSLS
mchitty@healthtech.com
Last revised January 09, 2020



One of the most unfortunate legacies of Mendelian genetics is the lumping together of gene defects and genes. People with various genetic defects may or may not manifest a disease phenotype. As both Horace Freeland Judson and Sydney Brenner point out in the articles cited below, classical genetics was so firmly based on gene defects that only recently have we begun to determine what "normal" or wild-type genes really are. Careful reading and listening will often reveal that people use the word gene and a number of related words and phrases (mutations and other variants) very loosely and interchangeably. And we are only starting to realize the full extent of the diversity which characterizes "normal" variants.

Biology term index
Related glossaries include Gene categories and Nomenclature. Genomics and Proteomics are also key, since it is the gene's protein products which are ultimately of interest.
Informatics: Bioinformatics; Drug discovery informatics
Technologies: Sequencing
Not until the technologies for working with DNA and proteins are better integrated will their researchers be better integrated than they are now.
Biology: DNA; Expression; Proteins; RNA; SNPs & genetic variations; Sequences, DNA & beyond
  

How past history leads to present confusion
Horace Freeland Judson, writing in the Feb. 2001 human genome issue of Nature, notes problems with terminology. "The phrases current in genetics that most plainly do violence to understanding begin "the gene for": the gene for breast cancer, the gene for hypercholesterolaemia, the gene for schizophrenia, the gene for homosexuality, and so on. We know of course that there are no single genes for such things. We need to revive and put into public use the term "allele". Thus, "the gene for breast cancer" is rather the allele, the gene defect - one of several - that increases the odds that a woman will get breast cancer. "The gene for" does, of course, have a real meaning: the enzyme or control element that the unmutated gene, the wild-type allele, specifies. But often, as yet, we do not know what the normal gene is for. ... Pleiotropy. Polygeny. Perhaps these terms will not easily become common parlance; but the critical point never to omit is that genes act in concert with one another - collectively with the environment. Again, all this has long been understood by biologists, when they break free of habitual careless words. We will not abandon the reductionist Mendelian programme for a hand-wringing holism: we cannot abandon the term gene and its allies. On the contrary, for ourselves, for the general public, what we require is to get more fully and precisely into the proper language of genetics." Horace Freeland Judson "Talking about the genome" Nature 409: 769, 15 Feb. 2001

Sydney Brenner, writing in the special Drosophila genome issue of Science, made a similar observation: "Old geneticists knew what they were talking about when they used the term "gene", but it seems to have become corrupted by modern genomics to mean any piece of expressed sequence, just as the term algorithm has become corrupted in much the same way to mean any piece of a computer program. I suggest that we now use the term "genetic locus" to mean the stretch of DNA that is characterized either by mapped mutations as in the old genetics or by finding a complete open reading frame as in the new genomics. In higher organisms, we often find closely related genes that subserve closely related, but subtly different, functions." Sydney Brenner "The End of the beginning" Science 287 (5451): 2173, Mar. 24, 2000

Don't expect to know anytime soon exactly how many human genes there are. About 60% of our genes exhibit alternative splicing, making the number of protein products close to 100,000, not a very different number from the more recent estimates.  Expect to hear more about genes and the cell cycle, and how gene expression differs throughout it. 

After all, the yeast (Saccharomyces cerevisiae) genome has been sequenced since 1996 and the precise number of genes is not yet confirmed. It is also useful to read the Oxford English Dictionary's definitions for genome and note the quotation from Scientific American Oct. 1970 "The human genome consists of perhaps as many as 10 million genes."

Definitions of gene
Gene is a good example of a word in the process of evolving away from its classical genetics meanings (fairly abstract concepts, rooted in the Mendelian model of monogenic diseases with high penetrance). The concept of "gene" has been changing so fast that most print resources (and some online) are out of date. The absolute best source I've found is http://www.ergito.com/, a project of Benjamin Lewin and colleagues (Molecular Biology: the best-selling textbook GENES online, which also has an extensive glossary).

Michael Snyder and Mark Gerstein: Even with the availability of the genome sequences of many different organisms, we are still left wondering about the definition of a true gene. In their Perspective, Snyder and Gerstein discuss different criteria that can be used to define what a gene is in the era of genomics. "GENOMICS: Defining Genes in the Genomics Era", Science 300: 258-260, Apr. 11, 2003

Rat Genome Database definition: "the DNA sequence necessary and sufficient to express the complete complement of functional products derived from a unit of transcription". Splice variants for each gene are also listed in the gene report ... Does not include: variations such as mutations, pseudogenes, protein products which are derived from the modification of precursor proteins. Rat Genome Database, Medical College of Wisconsin, Milwaukee, Wisconsin, accessed May 30, 2003 http://rgd.mcw.edu/tu/genes/#definition

The definition of gene is evolving (and lengthening) as we tease apart the incredible complexity of biological and molecular processes and discover that "junk DNA" has important regulatory functions. Gene identification in prokaryotes is almost trivial, as their genomes consist almost entirely of coding sequence uninterrupted by introns. In humans, however, protein-coding sequences make up only about 2% of total DNA, and human exons are often widely separated by immense stretches of introns.

The concept of "gene" didn't come along until 1909, three years after the term genetics in 1906 (Evelyn Fox Keller, The century of the gene, Harvard University Press, 2000).  For some time it remained a quite abstract term.  With advances in molecular biology the definition is far from settled. Is a monolithic gene concept still valid?  

William Gelbart, writing on "Databases in Genomic Research" in Science 282 (5389): 659-661, 23 Oct. 1998, notes: "Nonetheless, we may well have come to the point where the use of the term "gene" is of limited value and might in fact be a hindrance to our understanding of the genome. Although this may sound heretical, especially coming from a card-carrying geneticist, it reflects the fact that, unlike chromosomes, genes are not physical objects but are merely concepts that have acquired a great deal of historic baggage over the past decades. Ultimately, we want to understand the relationships between heritable units, their gene products, and their phenotypes. ... the realities of genome organization are much more complex than can be accommodated in the classical gene concept. Genes reside within one another, share some of their DNA sequences, are transcribed and spliced in complex patterns, and can overlap in function with other genes of the same sequence families. Consider so-called alternative splicing, in which one or more exons are shared among multiple transcripts. There is a continuum ranging from cases in which two transcripts are almost identical along their entire length to examples in which only a small portion of the two mRNAs is shared. Sometimes these products have very similar biological activities, whereas in other cases their activities are disparate. What are the rules for deciding whether two partially overlapping mRNAs should be declared to be alternative transcripts of the same gene or products of different genes? We have none.

Independent of this question is the question of how to relate a mutant phenotype to alterations in multiple overlapping gene products. Suppose that we have a missense mutation that falls within one or more exons that contribute to more than one mRNA and thereby to more than one polypeptide chain. How do we assess the contributions of defects in the different polypeptides to the ultimate phenotype elicited by this mutation? For reasons such as these, I believe that we are entering a period in which we must shift to the view that the genome largely encodes a series of functional RNAs and polypeptides that are expressed in characteristic spatial, temporal, and quantitative patterns. The classical concept of the gene ultimately forms a barrier to trying to understand phenotypes in terms of encoded functional products. This is not a purely abstract discussion but may well demand that we reexamine how we are organizing data within genome-related databases. In most or all of these databases, much biological data is attached to these suspect units called genes. Although some aspects of these phenotypes might be associated with different subsets of alternative products of these genes, the databases might not support the most rigorous parsing of this phenotypic information." https://www.ncbi.nlm.nih.gov/pubmed/9784119

The concept of the difference between the potential for a trait and the trait proper, i.e., between the genotype and the phenotype, became clear only during the first decade of the century, mainly through the work of Johannsen. Although Johannsen insisted that the terms he coined were only helpful devices to organize data about heredity, it is obvious that they were bound from the beginning to the hypothesis that there was "something" in the gametes that could be rendered to analysis as discrete units. These units were the genes. This reductionist yet materially non-committed attitude has been developed into what I called instrumental-reductionism: the genes were hypothetical constructs that were accepted "as if" they were real entities. The research program developed on such a concept was very successful, not least because this instrumental approach allowed maximum flexibility in the attachment of meaning of the genes. While most geneticists accepted one or another position of this flexible concept, others took more extreme positions. At the one extreme end of the conceptual continuum was the realist approach that argued that genes were discrete, measurable, material particles, and on the other end, the claim that the attempts to identify discrete units only led to hyperatomism of a holistic view appropriate to heredity. The acceptance of the gene as a material and discrete unit, at the beginning of the 1950s, opened the way to a deeper level of conceptualizing both its structure ("cistron-recon-muton") and function ("one gene--one enzyme"). The discovery of the structure of DNA finally offered a chemical-physical explanation to the geneticist's requirements of a material gene. Thus, within less than 20 years the gene has been established as a "sharply limited segment of the linear structure" that is involved in the structure of a product or its regulation.
However, with the turning of much of the attention to eukaryotic DNA, it was necessary to accommodate the gene to an increasing flood of findings that did not tally with its concept as a discrete material unit. Falk R. The gene in search of an identity. Hum Genet. 1984;68(3):195-204. https://www.ncbi.nlm.nih.gov/pubmed/6389318

See also Does defining gene only get harder?

Petter Portin in "The Origin, Development and Present Status of the Concept of the Gene: A Short Historical Account of the Discoveries" Current Genomics (2000) writes "The current view of the gene is of necessity an abstract, general, and open one, despite the fact that our comprehension of the structure and organization of the genetic material has greatly increased. Simply, our comprehension has outgrown the classical and neoclassical terminology. ... In fact it should be stressed that our comprehension of the very concept of gene has always been abstract and open as indicated already by Wilhelm Johannsen [2]. Due to the openness of the concept of the gene, it takes different meanings depending on the context. Maxine Singer and Paul Berg [148] have pointed out that many different definitions of the gene are possible. If we want to adopt a molecular definition, they suggest the following definition: "A eukaryotic gene is a combination of DNA segments that together constitute an expressible unit. Expression leads to the formation of one or more specific functional gene products that may be either RNA molecules or polypeptides. Each gene includes one or more DNA segments that regulate the transcription of the gene and thus its expression." (p. 622). Thus the segments of a gene include [1] a transcription unit, which includes the coding sequences, the introns, the flanking sequences - the leader and trailer sequences, and [2] the regulatory sequences, which flank the transcription unit and which are necessary for its specific function."

Portin P, Wilkins A. The Evolving Definition of the Term "Gene". Genetics. 2017;205(4):1353-1364.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5378099/

Bioinformatics expert Nat Goodman, writing in the April 2001 issue of Genome Technology, states that gene "is a highly nuanced noun like "truth". Ten years ago, it commonly meant "genetic locus" - a region of the genome linked to a disease or other phenotype. Over time biologists became more comfortable thinking of a gene as a transcribed region of the genome that results in functional molecular product. In its published human genome paper [Science Feb. 16, 2001] Celera defines a gene as "a locus of cotranscribed exons" in order to emphasize the importance of alternative splicing. Ensembl's Gene Sweepstake Web page [see below] took the definition to new depth: "A gene is a set of connected transcripts. ... Two transcripts are connected if they share at least part of one exon in the genomic coordinates." Implicit in the new definitions of a gene is a belief that the genome can be partitioned into regions such that all exons in a given region belong to a single gene. These regions are the loci of Celera's definition. A theoretically possible alternative is that the genome might contain long chains of overlapping transcripts in which the first transcript overlaps the second which overlaps the third, but the first and third don't overlap. I'm not aware of any such examples, but if they exist, then all bets are off." Nat Goodman "Human Transcriptome Project" Genome Technology: 55-58 April 2001

While some of the terms included below are relevant to all genes, some are specific to humans and/or other organisms.

Gene definitions 
gene: (cistron) Structurally, a basic unit of hereditary material; an ordered sequence of nucleotide bases that encodes one polypeptide chain (via mRNA). The gene includes, however, regions preceding and following the coding region (leader and trailer) as well as (in eukaryotes) intervening sequences (introns) between individual coding segments (exons). Functionally, the gene is defined by the cis-trans test that determines whether independent mutations of the same phenotype occur within a single gene or in several genes involved in the same function. IUPAC Compendium

There are many discussions between biologists to find a comprehensive definition of a gene, which is not easy, if possible at all. For our purposes a gene is a continuous stretch of a genomic DNA molecule, from which a complex molecular machinery can read information (encoded as a string of A, T, G, and C) and make a particular type of a protein or a few different proteins. Alvis Brazma et al., A quick introduction to elements of biology: 3.3 Genes and protein synthesis, European Bioinformatics Institute, Draft, 2001 https://lost-contact.mit.edu/afs/ific.uv.es/user/t/tortosa/public/biology_intro.html#Genomes

Specific sequences of nucleotides along a molecule of DNA (or, in the case of some viruses, RNA) which represent the functional units of heredity. The majority of eukaryotic genes contain coding regions (codons) that are interrupted by non-coding regions (introns) and are therefore labeled split genes. MeSH, 1965

The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein. [NHGRI glossary] This definition doesn't specify that it applies only to humans, but by specifying "parents" it seems to rule out non-animal genes, and almost implies mammals, or at least warm-blooded organisms.

A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology. HUGO, J.A. White et al., Guidelines for Human Gene Nomenclature, HGNC Human Genome Nomenclature Committee, https://www.genenames.org/about/guidelines https://beta.genenames.org/

From Genesweep, Ensembl, European Bioinformatics Institute, UK http://www.ensembl.org/genesweep.html  At the 2000 Cold Spring Harbor Genome conference [May 10-14] “one of the hotly debated topics was the number of human genes. This has been estimated at anything from 35,000 to 150,000. Considering the spread of opinion, the only way to resolve was to get people to bet on it … This led to an interesting debate on the definition of a gene … and how to assess that number.”

A gene is a set of connected transcripts. A transcript is a set of exons via transcription followed (optionally) by pre-mRNA splicing. Two transcripts are connected if they share at least part of one exon in the genomic coordinates. At least one transcript must be expressed outside of the nucleus and one transcript must encode a protein (see Footnotes).
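The "connected transcripts" rule reads like a grouping procedure, so here is a minimal sketch of one way it could be applied. It is not Ensembl's code. The function names, the toy coordinates, and the simplification that all transcripts sit on the same chromosome and strand are assumptions made for illustration. The sketch also treats connection as transitive, so a chain of overlapping transcripts ends up in one gene, and it ignores the protein-coding and expression requirements.

```python
from itertools import combinations


def exons_overlap(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def transcripts_connected(t1, t2):
    """Two transcripts are connected if any pair of their exons overlaps."""
    return any(exons_overlap(e1, e2) for e1 in t1 for e2 in t2)


def group_into_genes(transcripts):
    """Group transcripts (name -> list of exons) into sets of connected transcripts.

    Assumption: every transcript lies on the same chromosome and strand.
    """
    parent = {name: name for name in transcripts}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(transcripts, 2):
        if transcripts_connected(transcripts[a], transcripts[b]):
            parent[find(a)] = find(b)  # merge the two groups

    genes = {}
    for name in transcripts:
        genes.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in genes.values())


# Hypothetical coordinates: t1 overlaps t2, t2 overlaps t3, t1 and t3 do not overlap.
example_transcripts = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(150, 250), (500, 600)],
    "t3": [(550, 650)],
    "t4": [(1000, 1100)],
}
genes = group_into_genes(example_transcripts)
```

With these toy coordinates t1, t2 and t3 fall into one gene even though t1 and t3 share no exon, and t4 forms a gene of its own. Whether chains of that kind should count as a single gene is a choice this sketch makes, not something the definition settles.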

MGI Glossary, Mouse Genome Informatics, Jackson Lab outlines 5 different possibilities referred to by "gene". http://www.informatics.jax.org/glossary/gene 

Assessment of the method used to determine the gene will occur by voting at Cold Spring Harbor Genome Meeting 2002.

Footnotes

  • We are restricting ourselves to protein coding genes to allow an effective assessment. RNA genes were considered too difficult to assess by 2003.

  • The key definition in the gene is that alternatively spliced transcripts all belong to the same gene, even if the proteins that are produced are different.

  • The hope is that by 2003 we should have at least a hard floor to the gene numbers. The voting should be able to determine the best method. [The cost of betting goes up over the years because people will have more information.]

  • The scope of the genome are the autosomal chromosomes and X and Y. No epigenetic nor mitochondrial genes are counted.

  • Encoding a protein assumes that the translation machinery does translate the sequence at some time. The scope of the expression of genes is across all cell types and all developmental stages (obviously!).

  • The genome is defined as the reference sequence (hence a mosaic of haplotypes) as defined by Greg Schuler, NCBI.

  • Somatic recombinant loci are counted after recombination: i.e., Ig [immunoglobulin] and TCR [T cell receptor] loci will form one gene per locus.

  • Transcripts from repetitive regions are not counted even if expressed. A repetitive region is an element which is both repeated in the genome and has good evidence that the method of replication is based on a selfish replication strategy.

  • If trans-splicing is found in humans (which it has not been so far, and is unlikely to occur, but just in case), the definition of the transcript occurs after the trans-splicing event. This will split trans-spliced, polycistronic transcripts into multiple genes by this definition.

June 3, 2003 And the winner is... Nature "Human Gene Number Wager Won"  https://www.nature.com/news/2003/030602/full/news030602-3.html
But the definitive answer is still some time off. (What were they thinking in 2000?) After all, does Saccharomyces cerevisiae, whose genome was completed in 1996, have an absolutely definitive gene count yet?

Although the sequence of the human genome has been (almost) completely determined by DNA sequencing, it is not yet fully understood. Most (though probably not all) genes have been identified by a combination of high throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products. Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance.

There are an estimated 19,000-20,000 human protein-coding genes.[4] The estimate of the number of human genes has been repeatedly revised down from initial predictions of 100,000 or more as genome sequence quality and gene finding methods have improved, and could continue to drop further.[5][6] Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA molecules, regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined.[7] Wikipedia, accessed 2018 March 22, https://en.wikipedia.org/wiki/Human_genome

Does defining “gene” only get harder? Or are we making progress by recognizing how complicated it really is? This is not a new problem. The report of the Invitational DOE Workshop on Genome Informatics (26-27 April 1993, Baltimore MD) pointed out "The concept of “gene” is perhaps even more resistant to unambiguous definition now than before the advent of molecular biology. Our inability to produce a single definition for “gene” has no adverse effect upon bench research, [is this true?] but it poses real challenges for the development of federated genome databases." http://www.esp.org/rjr/white.pdf

Molecular biology has a communication problem. There are many databases using their own labels and categories for storing data objects and some using identical labels and categories but with a different meaning. Conversely, one concept is often found under different names. Prominent examples are the concepts "gene" and "protein sequence" which are used with different semantics by major international genomic and protein databases thereby making database integration difficult and strenuous. Schulze-Kremer S. Adding semantics to genome databases: towards an ontology for molecular biology. Proc Int Conf Intell Syst Mol Biol. 1997;5:272-5. https://www.ncbi.nlm.nih.gov/pubmed/9322049

An account of a Gene Nomenclature workshop held in conjunction with the annual American Society of Human Genetics meeting in Philadelphia PA, US, Oct. 2, 2000 reported on discussion between the human and mouse nomenclature committees (and other interested parties): "A gene can be defined as an abstraction that is useful for the purposes of nomenclature and for the assignment of a symbol. It was originally described as a "unit of inheritance" and has since been described as a "set of features on the genome that can produce a functional unit", but this latter term does not encompass all of those objects to which symbols are assigned. Designations in MGD [Jackson Lab's Mouse Genome Database] specify whether each object is a marker, gene, D segment etc., so in this context the actual definition of a gene is not so important.

The GeneSweep definition is not particularly useful for nomenclature as it indicates all genes must code for a protein, and hence does not include mRNAs etc. It was agreed that the term "gene" has been used for a collection of object types and should not be removed as it is still a very useful term, particularly for the clinician and for those with a clearly defined locus of interest; however, perhaps it is not so useful for nomenclature, and the term "genomic feature" should be used instead. Possible definitions of genomic features were discussed, including an object which shares exons, that are assumed to be transcripts from the same gene. Another suggestion was that the term "symbol" should be defined, rather than "gene", as this is what nomenclature committees work with, and it can incorporate a number of variations on the term "gene"." HM Wain et al. "Report of ASHG-NW Gene Nomenclature Workshop", HUGO, Jan. 2001 http://www.gene.ucl.ac.uk/nomenclature/ashgnw_report.html See also under gene family.

GeneSweep winners  A guy walks into a bar and asks, “What’s the difference between a weed, a mouse, and a human?” The answer -- if it refers to total number of genes -- is “not much.” That is the result of the three-year betting pool known as Genesweep, where scientists bet on the sum total of genes in the human genome.  The winning number, announced by Ewan Birney of the European Bioinformatics Institute (EBI) during a May symposium at Cold Spring Harbor Laboratory (CSHL) on “The Genome of Homo sapiens,” came to a mere 21,000 -- far short of the conventional wisdom of the 1990s of about 100,000. ... Birney initially hesitated about announcing a winner, given the lingering uncertainty over the precise number of human genes despite the completion of the sequence last April. Bets ranged from Lee Rowen’s low bid to more than 300,000 genes. But the original rules of Genesweep dictated a winner be announced this year. Several researchers at this year’s meeting presented data showing that the number of protein-coding regions (the definition of a gene, according to Genesweep rules) was well under 25,000.  Laurie Goodman, 2003 BioIT World
http://www.bio-itworld.com/archive/071503/genesweep/

Nomenclature and terminology promise to remain challenges as comparative genomics matures.

Gene structure, parts of genes (and potential genes) and gene processes: Parts of genes and gene processes constitute the rest of this section. Broader term: DNA. Is protein broader, narrower or somewhere in between? The genome is smaller, in a sense, than the proteome: the number of proteins is far larger than the number of genes. At the biochemical and molecular level these hierarchies are being redefined, in ways we are just beginning to comprehend. See also Gene categories

alternative exons: When interrupted genes produce messenger RNA, tissue- and stage-specific alternative splicing occurs in certain genes. The interrupted gene produces as its primary transcription product a heterogeneous nuclear RNA molecule, in which both exons and introns are represented. Introns, however, are removed from the primary transcript during the processing of messenger RNA in specific splicing reactions. Splicing is usually constitutive, which means that all exons are joined together in the order in which they occur in the heterogeneous nuclear RNA. In many genes, however, alternative splicing has also been observed, in which the exons may be combined in some other way (Fig. 2). For example, some exon or exons may be skipped in the splicing reaction. The primary order of the exons is not, however, altered even in alternative splicing. Thus alternative splicing makes it possible for a single gene to produce more than one messenger RNA molecule, which contradicts the basic conceptual framework of the neoclassical view of the gene. Petter Portin in "The Origin, Development and Present Status of the Concept of the Gene: A Short Historical Account of the Discoveries" Current Genomics, 2000 https://pdfs.semanticscholar.org/a61a/4e1a2c28e517d6e4ca9a43fd63bbb65379e4.pdf Related term: constitutive exons
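To make the exon-skipping idea concrete, the following minimal sketch enumerates the messenger RNAs that one gene could in principle yield when optional exons are skipped and the exon order is never changed. The function name, the exon labels and the choice of which exons count as constitutive are hypothetical. The result is only a combinatorial upper bound, since real genes use a small, regulated subset of these combinations.

```python
from itertools import combinations


def splice_variants(exons, constitutive=()):
    """List exon combinations obtainable by skipping non-constitutive exons.

    The original exon order is preserved in every variant, as in the
    description above; only skipping is modeled (an assumption of this sketch).
    """
    optional = [e for e in exons if e not in constitutive]
    variants = []
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            variant = tuple(e for e in exons if e not in skipped)
            if variant:  # ignore the empty transcript
                variants.append(variant)
    return variants


# Hypothetical four-exon gene in which the first and last exons are always kept.
variants = splice_variants(["E1", "E2", "E3", "E4"], constitutive=("E1", "E4"))
for v in variants:
    print("-".join(v))
```

Two optional exons give four possible mRNAs from a single gene. Each additional optional exon doubles that ceiling, which is one reason transcript counts and gene counts diverge.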

alternative splicing: The production of two or more distinct mRNAs from RNA transcripts having the same sequence via differences in splicing (by the choice of different exons).   Mouse Genome Informatics, Jackson Lab

Different ways of combining a gene's exons to make variants of the complete protein  DOE, Genome Glossary, Oak Ridge National Lab, US

Recent genome-wide analyses of alternative splicing indicate that 40-60% of human genes have alternative splice forms, suggesting that alternative splicing is one of the most significant components of the functional complexity of the human genome. Here we review these recent results from bioinformatics studies, assess their reliability and consider the impact of alternative splicing on biological functions. Although the 'big picture' of alternative splicing that is emerging from genomics is exciting, there are many challenges. High-throughput experimental verification of alternative splice forms, functional characterization, and regulation of alternative splicing are key directions for research. B. Modrek, C. Lee, "A genomic view of alternative splicing" Nature Genetics 30 (1): 13-19, Jan. 2002

Alternative splicing was first observed in animal viruses [87 - 95]. The first observations of alternative splicing in the genes of eukaryotes concerned murine immunoglobulin genes [96 - 99]. Since then, alternative splicing has been observed in hundreds of genes in various eukaryotic organisms, man included [see 100 for review]. The tissue specificity of alternative splicing was first shown in the fibrinogen genes of rat and man [101]. The first observations of developmental stage specificity concerned the alcohol dehydrogenase gene of Drosophila melanogaster [102]. The first demonstration that alternative splicing was both tissue and stage-specific concerned the tropomyosin gene of D. melanogaster and rat [103, 104]. The tissue and stage specificity of alternative splicing naturally constitutes a previously unknown and effective mechanism of gene regulation. Petter Portin in "The Origin, Development and Present Status of the Concept of the Gene: A Short Historical Account of the Discoveries" Current Genomics, 2000 https://pdfs.semanticscholar.org/a61a/4e1a2c28e517d6e4ca9a43fd63bbb65379e4.pdf
Broader term: splicing
Related terms: alternative splice sites; RNA glossary: pre-mRNA splicing, RNA splicing; Sequences, DNA & beyond: protein splicing, trans-splicing

biochemical genomics: Can identify genes by the function of their products. Related term: Functional genomics

cDNA complementary DNA: A single stranded DNA molecule with a nucleotide sequence that is complementary to an RNA molecule; cDNA is formed by the action of the enzyme reverse transcriptase on an RNA template. After conversion to the double stranded form, cDNA is used for molecular cloning or for hybridization studies. IUPAC Biotech
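The complementarity in this definition can be shown with a few lines of string handling. This is a minimal sketch, not a model of reverse transcriptase: the function names and the six-base sequence are made up for illustration, sequences are assumed to be written 5' to 3', and only the unambiguous bases A, C, G, U and T are handled.

```python
# Base-pairing rules: RNA template base -> DNA base, and DNA base -> DNA base.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Single-stranded cDNA (5'->3') complementary to an mRNA written 5'->3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna.upper()))


def second_strand(cdna):
    """Complementary DNA strand (5'->3'), giving the double-stranded form."""
    return "".join(DNA_TO_DNA[base] for base in reversed(cdna.upper()))


mrna = "AUGGCC"                  # hypothetical six-base message
cdna = first_strand_cdna(mrna)   # "GGCCAT"
sense = second_strand(cdna)      # "ATGGCC"
```

The second strand carries the same sequence as the mRNA with T in place of U, which is why a cDNA clone can stand in for the message in cloning and hybridization work.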

A complementary DNA for a messenger RNA molecule. Unlike an mRNA, a cDNA can be easily propagated and sequenced. NCBI

Single-stranded complementary DNA synthesized from an RNA template by the action of RNA-dependent DNA polymerase. cDNA (i.e., complementary DNA, not circular DNA, not C-DNA) is used in a variety of molecular cloning experiments as well as serving as a specific hybridization probe. MeSH, 1994

The term cDNA can encompass "proper" cDNAs and ESTs. "Proper" cDNAs are long segments of genes, often full-length. Many specialists believe that cDNAs (including ESTs) are the highest-value sequences, because they represent experimentally determined genes.  CHI Outlook for DNA Microarrays report

Logic of Molecular Approaches to Biological Problems, John Wagner (Cornell Univ. Graduate School of Medical Science, US) has an extensive and articulate section on the use of cDNA in experimental design. http://www-users.med.cornell.edu/~jawagne/cDNA_cloning.html
Narrower term: cDNA maps
Related terms: transcript clusters; DNA: EST expressed sequence tag, genomic DNA; Expression: gene expression

cis-:  This side of; compare with  trans-, meaning across.

cis-trans test: In the cis-trans test cis- and trans-heterozygotes are compared. In the cis-heterozygote the mutations are in the same chromosome but in the trans-heterozygote in homologous chromosomes. Thus the genotype of the cis-heterozygote is designated as a b/+ + and that of the trans-heterozygote as a +/+ b. If the cis-heterozygote is of a wild type phenotype and the trans-heterozygote is mutant, a and b are mutations of the same cistron. If, however, both cis- and trans-heterozygotes are phenotypically of a wild type, a and b are mutations of different cistrons. The cistron is a synonym of the gene, but this term should be used only when it is based on cis-trans test or biochemical evidence. Petter Portin in "The Origin, Development and Present Status of the Concept of the Gene: A Short Historical Account of the Discoveries" Current Genomics, 2000 https://pdfs.semanticscholar.org/a61a/4e1a2c28e517d6e4ca9a43fd63bbb65379e4.pdf
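The decision rule in the cis-trans test can be written out as a small Python function. This is a minimal sketch of the two cases stated in the definition above; the function name and the "not interpretable" outcome (for a mutant cis-heterozygote, which the definition does not cover) are assumptions added for illustration.

```python
# Illustrative sketch of the cis-trans test decision rule.
# The function name and the third outcome are assumptions of this example.
VALID_PHENOTYPES = {"wild type", "mutant"}


def interpret_cis_trans(cis_phenotype: str, trans_phenotype: str) -> str:
    """Classify mutations a and b from the two heterozygote phenotypes."""
    if cis_phenotype not in VALID_PHENOTYPES or trans_phenotype not in VALID_PHENOTYPES:
        raise ValueError("phenotype must be 'wild type' or 'mutant'")
    if cis_phenotype == "wild type" and trans_phenotype == "mutant":
        # a b/+ + is wild type but a +/+ b is mutant: no complementation.
        return "same cistron"
    if cis_phenotype == "wild type" and trans_phenotype == "wild type":
        # Both heterozygotes are wild type: the mutations complement.
        return "different cistrons"
    # A mutant cis-heterozygote is outside the cases described above.
    return "not interpretable"


print(interpret_cis_trans("wild type", "mutant"))
print(interpret_cis_trans("wild type", "wild type"))
```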

cistron: HF Judson in the Eighth Day of Creation tells how Seymour Benzer "wanted to scrap the word "gene" and replace it with three new terms, "muton" for the smallest spot at which mutation could take place, "recon" for the irreducibly shortest length on the map that could not be split by a genetic recombination even at the fine scale he had reached, and "cistron" for the shortest stretch that comprised a functional genetic unit. (The last was derived from the mating tactic Benzer used to determine which mutations lay near each other on the map, which was technically called the "cis-trans test"...) Over the next decade, Benzer's new terms came into a considerable vogue, especially "cistron". But the other two were superfluous once mutations and recombinations could be thought of in terms of base pairs, while the cistron was, in effect, the gene in its principal sense; it is the older usage that has lasted and the newer one that has died away." Horace F Judson Eighth Day of Creation, Cold Spring Harbor Laboratory Press, 1996 pp. 320-321

Term coined by Seymour Benzer in 1955 referring to DNA coding for a single polypeptide. Originally did not include the start and stop codons. Related term: polycistronic

coding regions: The part of a gene that specifies the structure of a protein. [SNP Consortium] Also referred to as a "coding sequence" or protein coding region or sequence. Narrower terms mature peptide or protein coding sequence, signal peptide coding sequence, transit peptide coding sequence 

coding sequence CDS: Sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon). Feature includes amino acid conceptual translation. DDBJ/ EMBL/ GenBank Feature Table  http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html  Related terms: coding regions, mature peptide or protein coding sequence
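The "conceptual translation" attached to a CDS feature can be illustrated with a short Python sketch. It assumes the standard genetic code and a CDS that starts in frame; the function name is hypothetical and this is not the code used by DDBJ, EMBL or GenBank.

```python
# Illustrative sketch: conceptual translation of a CDS with the standard
# genetic code. The function name is an assumption of this example.
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons are enumerated in TCAG order at each position to match AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # the stop codon is part of the CDS location
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))
```

The stop codon is consumed but not emitted, which mirrors the note above that the feature location includes the stop codon while the translation lists only amino acids.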

complementary DNA: See cDNA.

constitutive exons: Splicing is usually constitutive, which means that all exons are joined together in the order in which they occur in the heterogeneous nuclear RNA. Petter Portin in "The Origin, Development and Present Status of the Concept of the Gene: A Short Historical Account of the Discoveries" Current Genomics, 2000 https://pdfs.semanticscholar.org/a61a/4e1a2c28e517d6e4ca9a43fd63bbb65379e4.pdf

CpG islands: Areas of increased density of the dinucleotide sequence cytosine--phosphate diester--guanine. They form stretches of DNA several hundred to several thousand base pairs long. In humans there are about 45,000 CpG islands, mostly found at the 5' ends of genes. They are unmethylated except for those on the inactive X chromosome and some associated with imprinted genes. MeSH, 1996
Wikipedia http://en.wikipedia.org/wiki/CpG_island
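"Increased density" of CpG is usually quantified with two numbers: the G+C fraction of a stretch of DNA, and the ratio of observed CpG dinucleotides to the number expected from the C and G counts alone. The Python sketch below computes both. The function names are hypothetical, and no cutoff values are applied, since thresholds vary between tools and are not given in the definition above.

```python
# Illustrative sketch: two common measures of CpG density.
# Function names are assumptions of this example; no thresholds are applied.
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by (C count * G count / length)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


window = "CGCGCGCG"
print(gc_fraction(window), cpg_observed_expected(window))
```

In practice these measures are computed over sliding windows along a chromosome rather than on a single short string.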
 

EST expressed sequence tag: See DNA. ESTs may, but do not necessarily, represent genes.

exons: A section of DNA which carries the coding sequence for a protein or part of it. Exons are separated by intervening, non-coding sequences (introns). In eukaryotes most genes consist of a number of exons. IUPAC Bioinorganic

The portion of the genome that is expressed as a processed mRNA. NHLBI

The parts of a genetic transcript remaining after the INTRONS are removed and which are spliced together to become a messenger or structural RNA. MeSH, 1987

The term "exon" is normally applied for regions which are not spliced out from a pre-mRNA sequence (5' untranslated region (5' UTR), coding sequences (CDS) and 3' untranslated region (3' UTR)). But this term is often used also to indicate the protein-coding regions only. Martin J Bishop, editor, Guide to Human Genome Computing, Academic Press, 1998 https://books.google.com/books?id=g_bJGN-MecIC&printsec=frontcover#v=onepage&q&f=false

Exons contain the coding sequences of a gene, in contrast to introns (sometimes loosely labeled "junk DNA"), which are excised before mRNA is translated into a protein. Narrower terms: alternative exons, constitutive exons; Sequences, DNA & beyond non-coding first exons

expressed sequence: See coding sequence (coding regions)

gene: Definitions and history of the term are at the beginning of this glossary.

Gene categories: Narrower terms include antibody genes, candidate genes, chimeric gene, constitutive genes, developmental genes, differentiated genes, DNA library, DNA repair genes, essential genes, extranuclear genes, gene components, gene library, genomic library, housekeeping genes, hypothetical genes, immediate early genes, immunoglobulin genes Ig, lethal genes, luxury genes, mitochondrial genes, nested genes, non-structural genes, nuclear gene, operator genes, orphan genes, overlapping genes, parasitism genes, plasmid genes, pleiotropic genes, pseudogenes, putative genes, regulator genes, regulatory region, reporter genes, silent genes, split genes, structural genes, suppressor genes, virulence genes

gene cluster:
A set of closely related genes that code for the same or similar proteins and which are usually grouped together on the same chromosome. Life Science

In UniGene (an experimental system for automatically partitioning GenBank sequences into a non- redundant set of gene- oriented clusters), each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. NCBI, UniGene, US http://www.ncbi.nlm.nih.gov/UniGene/index.html

King's Dictionary of Genetics cross references gene cluster with "multigene family" (which can be on the same or different chromosomes, and descended by gene duplication) and reiterated genes (which are multiple genes on the same chromosome). The MeSH term "multigene family" is based on King's definition. Y and M. Zhang's Dictionary of Gene Technology Terms definition specifies "identical or related genes" coding for "the same or similar proteins". The Oxford Dictionary of  Biochemistry & Molecular Biology defines "gene cluster or gene complex" and specifies "functionally related" and "closely linked" genes on a chromosome and notes these are "often structural genes coding for the enzymes that catalyse the various steps of a metabolic pathway".  Is a consensus definition possible?

In bacteria see operon 

gene coding: See coding regions, coding sequences.

gene discovery methods: See SNPs & other Genetic variations : candidate gene approach, direct approach, functional cloning,  indirect approach, linkage analysis, positional cloning, random genome-wide association studies; Functional genomics;  https://www.nature.com/articles/ng1196  Molecular modeling: gene identification, gene prediction 

gene families: The HUGO Gene Nomenclature Committee http://www.genenames.org/ has been working to develop a unique symbol, as well as a longer and more descriptive name, for each human gene. Thus, members of many gene families, previously cloned in different laboratories and known by a variety of terms, now share a common gene symbol. A text search in any of the genome browsers will often return links to all named members of a gene family that have been mapped to the genome. Whereas Ensembl and UCSC currently return lists of the genes, the NCBI presents both a list and a graphical overview. "How can one find all the members of a human gene family?" Nature Genetics 32 supplement: 49-52, 2002 https://www.nature.com/articles/ng1196  Related terms: Functional genomics
Wikipedia http://en.wikipedia.org/wiki/Gene_families
 

gene grouping: HGNC Gene Grouping/ Family Nomenclature, HUGO Gene Nomenclature Committee, with link to gene families currently under review  http://www.genenames.org/genefamily.html

gene imprinting: A phenomenon in which the phenotype of the disease depends on which parent passed on the disease gene. For instance, Prader- Willi syndrome and Angelman syndrome are both inherited when the same part of chromosome 15 is missing. When the father's complement of 15 is missing, then the child has Prader-Willi, but when the mother's complement of 15 is missing, the child has Angelman syndrome. [PhRMA] See also under epigenetics

gene order:
The sequential location of genes on a chromosome. MeSH, 2001

gene products:  The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases … The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, the development and maintenance of the ontologies themselves; second, the annotation of gene products, which entails making associations between the ontologies and the genes and gene products in the collaborating databases; and third, development of tools that facilitate the creation, maintenance and use of ontologies. … It is easy to confuse a gene product name with its molecular function, and for that reason many GO molecular functions are appended with the word "activity". The documentation on the function ontology explains more about GO functions and the rules governing them. Gene Ontology, Documentation ftp://ftp.geneontology.org/go/www/GO.doc.shtml

The biochemical material, either RNA or protein, resulting from expression of a gene. The amount of gene product is used to measure how active a gene is; abnormal amounts can be correlated with disease causing alleles.  DOE

gene regulation: http://en.wikipedia.org/wiki/Regulation_of_gene_expression 
BioBase http://www.gene-regulation.com/  Databases

gene structural components: Includes exons, introns, regulatory sequences, splice sites, other?

gene superfamily: Gene superfamily is defined as "a cluster of evolutionarily related sequences" (Dayhoff, 1976), and consists of homologous gene families, which are clusters of genes from different genomes that include both orthologs and paralogs (Tatusov et al., 1997).

gene validation: Genetic validation of predicted genes. See under transcript clusters. Related term: Targets target validation

genetic structures: The biological objects that contain genetic information and that are involved in transmitting genetically encoded traits from one organism to another. MeSH 2003

Subcategories include base sequence, chromosome structures, chromosomes, gene library, genes, genetic code, genetic vectors, genome components, plasmids and others. 

localize: Determination of the original position (locus) of a gene or other marker on a chromosome. [DOE] Related terms: gene localization; Labels: immunohistochemistry;  Proteins protein localization, subcellular localization

locus (plural loci):
The word "locus" is not a synonym for gene but refers to a map position. A more precise definition is given in the Rules and Guidelines from the International Committee on Standardized Genetic Nomenclature for Mice which states: "A locus is a point in the genome, identified by a marker, which can be mapped by some means. It does not necessarily correspond to a gene; it could, for example, be an anonymous non-coding DNA segment or a cytogenetic feature. A single gene may have several loci within it (each defined by different markers) and these markers may be separated in genetic or physical mapping experiments. In such cases, it is useful to define these different loci, but normally the gene name should be used to designate the gene itself, as this usually will convey the most information." HUGO Gene Nomenclature Committee, HGNC Guidelines https://www.genenames.org/about/guidelines

Position on a chromosome of a gene or other chromosome marker; also the DNA at that position. The use of locus is sometimes restricted to mean regions of DNA that are expressed. DOE

Any genomic site, whether functional or not, that can be mapped through formal genetic analysis. NHLBI Related term: Expression gene expression;

Mendelian genetics: See Genomics
metagenes: See Expression

mobile genetic elements: Includes retrotransposons, transposons; Sequences, DNA & beyond LINEs, SINEs
molecular function: See Functional genomics Gene Ontology
muton: See under cistron

operon: A functional unit consisting of a promoter, an operator and a number of structural genes, found mainly in prokaryotes. The structural genes commonly code for several functionally related enzymes, and although they are transcribed as one (polycistronic) mRNA each is independently translated. In the typical operon, the operator region acts as a controlling element in switching on or off the synthesis of mRNA. (operator gene) IUPAC Biotech

The genetic unit consisting of a feedback system under the control of an operator gene, in which a  structural gene transcribes its message in the form of mRNA upon blockade of a repressor produced by a regulator gene. Included here is the attenuator site of bacterial operons where transcription termination is regulated. MeSH, 1972

ORF open reading frame: See Sequences, DNA & beyond. ORFs may, but do not necessarily, represent genes. Broader term: reading frames (Sequences, DNA & beyond)
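One reason an ORF does not necessarily represent a gene is that ORFs are found purely by pattern: any in-frame run from a start codon to a stop codon qualifies. The Python sketch below shows such a scan under simplifying assumptions: forward strand only, ATG as the only start codon, and a hypothetical min_codons cutoff. It is an illustration, not the method of any particular gene-finding program.

```python
# Illustrative sketch: naive ORF scan on the forward strand only.
# The function name and the min_codons parameter are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for each ATG...stop run, per frame."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # end index is exclusive and includes the stop
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

Every hit from a scan like this is only a candidate; whether it is transcribed and translated has to be established experimentally.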

ORFans: ORFans comprise 20-30% of the ORFs of most completely sequenced genomes. Because nothing can be learnt about ORFans via sequence homology, the functions and evolutionary origins of ORFans remain a mystery... We find that functional and structural studies of ORFans are not as underemphasized as previously suggested. These recently determined structures correspond to ORFans from all Kingdoms of life, and include proteins that have previously been functionally characterized, as well as structural genomics targets of unknown function labeled as "hypothetical proteins". This suggests that many of the ORFans in the databases are likely to correspond to expressed, functional (and even essential) proteins. Furthermore, the recently determined structures include examples of the various types of ORFans, suggesting that the functions and evolutionary origins of ORFans are diverse.  N. Siew and D. Fischer, Structural Biology Sheds Light on the Puzzle of Genomic ORFans, J Mol Biol. 342(2): 369- 373, Sept. 10, 2004

Protein encoding regions [ORFs] with no apparent similarity to proteins in other genomes. D. Fischer and D. Eisenberg, Finding families for genomic ORFans, Bioinformatics, 15(9): 759- 762, Sept. 1999  Related terms: Gene categories: hypothetical genes, pleiotropic gene; -Omes & -omics ORFeome.

polycistronic: Implies coding for two or more proteins. See also cistron.

polygene: Genetics. A gene which acts together with other genes to influence quantitative traits (such as size or weight).  Oxford English Dictionary

Seems to have begun as a concept which referred to a hypothetical single "gene" which acted with other genes in a less than Mendelian fashion, and evolved into a class of  "genes" which we have yet to truly begin to understand.  Related terms   Genomics polygenic, post- genomic, post- Mendelian

regulon: In eukaryotes, a genetic unit consisting of a noncontiguous group of genes under the control of a single regulator gene. In bacteria, regulons are global regulatory systems involved in the interplay of pleiotropic regulatory domains. These regulatory systems consist of several operons. MeSH, 1994

repressors: See under regulator genes

retrotransposon: DNA fragments copied from viral RNA with reverse transcriptase that insert in the host chromosomes. Edward Bollenbach, Life Sciences Dictionary Related term transposons.

signal peptide coding sequence: Coding sequence for an N-terminal domain of a secreted protein; this domain is involved in attaching nascent polypeptide to the membrane; leader sequence. DDBJ/ EMBL/ GenBank Feature Table http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

splice variants: Splice variants play an important role within the cell in both increasing the proteome diversity and in cellular function. Splice variants are also associated with disease states and may play a role in their etiology. Information about splice variants has, until now, mostly been derived from the primary transcript or through cellular studies. In this study information from the transcript and other studies is combined with tertiary structure information derived from homology models. Splice variants: a homology modeling approach. Furnham N, Ruffle S, Southan C. Proteins. 2004 Feb 15; 54(3): 596-608.

synteny: Two genes which occur on the same chromosome are syntenic; however, syntenic genes may or may not be linked. NHLBI

The presence of two or more genetic loci on the same chromosome. Extensions of this original definition refer to the similarity in content and organization between chromosomes, of different species for example. MeSH, 2002
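In its original sense, synteny is simply a same-chromosome check and says nothing about genetic linkage. A minimal Python sketch, using three well-known human genes as example data (BRCA1 and TP53 on chromosome 17, BRCA2 on chromosome 13); the dictionary and function name are hypothetical.

```python
# Illustrative sketch: synteny in its original sense, a same-chromosome test.
# The lookup table and function name are assumptions of this example.
GENE_TO_CHROMOSOME = {"BRCA1": "17", "TP53": "17", "BRCA2": "13"}


def are_syntenic(gene_a: str, gene_b: str, locations=GENE_TO_CHROMOSOME) -> bool:
    """True if both genes lie on the same chromosome; linkage is not tested."""
    return locations[gene_a] == locations[gene_b]


print(are_syntenic("BRCA1", "TP53"))   # same chromosome
print(are_syntenic("BRCA1", "BRCA2"))  # different chromosomes
```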

transcript clusters: [Bo] Yuan [Ohio State Univ.] avoids calling the index entries genes, preferring to call them transcript clusters, a careful term referring to how cDNAs and ESTs from different databases are grouped together based on homology. "They should be genes, but we don't have the evidence yet," he says. "We still have to confirm that all those transcripts and ESTs that align with the genome are functional." ... Confirming that predictions are real genes, known as validation, is a major reason the gene count will remain open for a while. "A prediction is just a prediction," says [Michael] Cooke [Genomics Institute, Novartis Research Foundation]. "You have to validate the prediction experimentally before you can call it a gene." Tom Hollon "Human Genes: How Many?" Scientist 15 (20): 1, Oct. 15, 2001   
Related terms: gene validation; In silico & Molecular modeling: gene prediction

transit peptide coding sequence:  Coding sequence for an N-terminal domain of a nuclear- encoded organellar protein; this domain is involved in post- translational import of the protein into the organelle. DDBJ/ EMBL/ GenBank Feature Table   http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

Genes resources
DDBJ/ EMBL/ GenBank Feature Table, 2013, 100+ definitions. http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html  
IUPAC  International Union of Pure and Applied Chemistry, Glossary of Terms used in Bioinorganic Chemistry, Recommendations, 1997. 450+ definitions. http://www.chem.qmw.ac.uk/iupac/bioinorg/
IUPAC  International Union of Pure and Applied Chemistry, Glossary for Chemists of terms used in biotechnology. Recommendations, Pure & Applied Chemistry 64 (1): 143-168, 1992. 200 + definitions.
Jackson Lab, Mouse Genome Informatics Glossary, Jackson Lab, US, 250+ definitions, 2013 http://www.informatics.jax.org/glossary
MeSH Qualifiers with Scope Notes, National Library of Medicine,
https://www.nlm.nih.gov/mesh/topsubscope.html  


IUPAC definitions are reprinted with the permission of the International Union of Pure and Applied Chemistry
