With changes in sequencing technology and methods, the rate
of acquisition of human and other genome data over the next few years will
be ~100 times higher than originally anticipated. Assembling and interpreting
these data will require new and emerging levels of coordination and collaboration
in the genome research community to develop the necessary computing algorithms,
data management and visualization system. Lawrence Berkeley
Lab, US "Advanced Computational Structural Genomics"
The dividing line between this glossary and Information management &
interpretation is fuzzy - in general Algorithms & data analysis focuses on
structured data, while Information management & interpretation centers on
unstructured data.
Other related glossaries include
Applications: Drug Discovery & Development, Proteomics
Informatics: Bioinformatics, Chemoinformatics, Computers & computing,
In silico & molecular Modeling, Ontologies, Research
Technologies: Microarrays & protein chips, Sequencing
Biology: Protein Structures, Sequences, DNA & beyond.
3D-QSAR Three-Dimensional Quantitative Structure-Activity Relationships: In
silico & molecular
modeling glossary
adaptive clinical
trials: Clinical trials and drug approvals
ANOVA Analysis Of Variance: Error model based on a standard
statistical approach. A generalization of the familiar t-test that, unlike the
t-test, allows multiple effects to be compared simultaneously. An
ANOVA model is expressed as a large set of equations that can be solved, given a
dataset of measurements, using standard software.
affinity based data mining:
Large and complex data sets are analyzed
across multiple dimensions, and the data mining system identifies data
points or sets that tend to be grouped together. These systems differentiate
themselves by providing hierarchies of associations and showing any underlying
logical conditions or rules that account for the specific groupings of
data. This approach is particularly useful in biological motif analysis.
"Data mining" Nature Biotechnology 18: 237-238 Supp. Oct. 2000
Broader term: data
mining
agglomerative method: See under cluster analysis
algorithm:
A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task.
MeSH, 1987
Algorithms fuel the
scientific advances in the life sciences. They are required for dealing with the
large amounts of data produced in sequencing projects, genomics or proteomics.
Moreover, they are crucial ingredients in making new experimental approaches
feasible... Algorithm development for Bioinformatics applications combines
Mathematics, Statistics, Computer Science as well as Software Engineering to
address the pressing issues of today's biotechnology and build a sound
foundation for tomorrow's advances. Algorithmics Group, Max Planck
Institute for Molecular Genetics, Germany http://algorithmics.molgen.mpg.de/
Rules or a process, particularly in computer
science. In medicine a step by step process for reaching a diagnosis or
ruling out specific diseases. May be expressed as a flow chart in
either sense. Greater efficiencies in algorithms, as well as improvements in computer
hardware have led to advances in computational biology. A computable set of steps to achieve a desired result.
From the Persian author Abu Ja'far
Mohammed ibn Mûsâ al-Khowârizmî who wrote a book with
arithmetic rules dating from about 825
A.D. [NIST]
Narrower terms: docking algorithms, sequencing algorithms, genetic
algorithm, heuristic algorithm. Related terms heuristic, parsing; Sequencing
dynamic programming methods.
artificial intelligence (AI): A wide-ranging term encompassing
computer applications that have the ability to make decisions; the ability
to explain reasoning is evidence of intelligence. Also covers methods
that have the ability to learn. [J Glassey et al. “Issues in the development
of an industrial bioprocess advisory system” Trends in Biotechnology 18
(4):136-41 April 2000]
Or, as some people have noted, laboriously trying to get computers to
do what people do intuitively, without great effort. Conversely, there are things
computers can do (relatively) effortlessly, such as massive numbers of
error-free calculations. The most promising applications seem to combine
computer-aided consideration of many possibilities with human judgment.
Narrower terms: cellular automata, expert systems, fuzzy logic, genetic algorithms, neural nets
Related term: training sets.
American Association of Artificial Intelligence:
Topics http://www.aaai.org/AITopics/html/current.htmlz
How to do research in the MIT AI
Lab,
a whole bunch of current, former, and honorary MIT AI Lab graduate students,
1988-1997? http://www.cs.indiana.edu/mit.research.how.to/mit.research.how.to.html Virtual Library Artificial
Intelligence,
Jonathan Bowen, South Bank Univ. UK, 2002 http://www.afm.sbu.ac.uk/ai/
University and government research sites, newsgroups, commercial sites
and products, programming languages, journals, bibliographies, “interactive
things” and other information
artificial neural nets:
Algorithms that simulate the functioning
of human neurons and may be used for pattern recognition problems,
e.g., to establish quantitative structure-activity relationships.
[IUPAC Computational]
Broader term neural nets; Related
terms: Drug
discovery and development drug design
Bayesian clinical
trials: Drug approvals
Bayesian network
modeling: This report describes a powerful and novel predictive tool
called Bayesian network modeling and demonstrates its application in clinical
forecasting. Insight Pharma Reports, Bayesian
Forecasting of Phase III Outcomes: The Next Wave in Predictive Tools,
2007
Bayesian network:
Wikipedia http://en.wikipedia.org/wiki/Bayesian_network
Bayesian
networks: A quick intro, Karen Sachs, Biomedical
Computation Review, Summer 2005 http://www.biomedicalcomputationreview.org/1/1/9.pdf
A computational analysis approach, machine learning tool.
Bayesian
statistics: The fundamental idea in Bayesian
statistics is that one’s uncertainty about an unknown quantity of interest is
represented by probabilities for possible values of that quantity.... The
Bayesian paradigm states that probability is the only measure of one’s
uncertainty about an unknown quantity. In a Bayesian clinical trial, uncertainty
about an endpoint (also called parameter) is quantified according to
probabilities, which are updated as information is gathered from the
trial. Center for Devices & Radiological Health, FDA, Guidance for the
Use of Bayesian Statistics in Medical Device Clinical Trials - Draft Guidance
for Industry and FDA Staff , This guidance document is being distributed for
comments purposes only. Draft released for comment on May 23, 2006 docket number
2006D-0191. http://www.fda.gov/cdrh/osb/guidance/1601.html#4
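The updating of probabilities "as information is gathered from the trial" can be illustrated with a minimal sketch. It assumes a Beta prior on a response rate and binomial outcomes (a conjugate pair chosen here only for simplicity; the guidance quoted above does not prescribe it). The function names and the patient counts are hypothetical.

```python
from fractions import Fraction


def update_beta(alpha, beta, successes, failures):
    """Return Beta posterior parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Assumption: a uniform Beta(1, 1) prior on the response rate.
prior = (1, 1)
# Made-up interim data: 7 responders out of 10, then 12 out of 20.
after_batch_1 = update_beta(*prior, successes=7, failures=3)
after_batch_2 = update_beta(*after_batch_1, successes=12, failures=8)

for label, params in [("prior", prior),
                      ("after batch 1", after_batch_1),
                      ("after batch 2", after_batch_2)]:
    print(label, params, float(beta_mean(*params)))
```

Each batch simply adds its counts to the current parameters, so the posterior from one interim look becomes the prior for the next.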
biomathematics:
The application of mathematics to
problems in biology and medicine. An essential tool in fields such as population
genetics, cellular neurobiology, comparative genetics, biomedical imaging,
pharmacokinetics, and epidemiology. It plays an increasingly vital role in the
effort to understand diseases and disorders, and to improve therapies.
Collection Development Manual, National Library of Medicine, US 2004 http://www.nlm.nih.gov/tsd/acquisitions/cdm/subjects14.html Related
terms: bioinformatics, computational
biology, molecular modeling
biometrics:
The information age is quickly revolutionizing the way
transactions are completed. Everyday actions are increasingly being handled
electronically, instead of with pencil and paper or face to face. This growth in
electronic transactions has resulted in a greater demand for fast and accurate
user identification and authentication. Biometric technology is a way to achieve
fast, user- friendly authentication with a high level of accuracy. [Biometrics
Consortium] http://www.biometrics.org/REPORTS/CTSTG96/
Bonferroni correction:
A multiple test correction
method. To address false positives through this method, you can simply
divide your desired false-positive rate by the number of tests, and use that
modified number to declare any single change to be significant. The Bonferroni
correction is extremely conservative, and many statisticians argue against its
use.
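A minimal sketch of the division described above. The function name and the p-values are made up for illustration; the 0.05 default is only a conventional choice.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the Bonferroni-adjusted threshold."""
    if not p_values:
        return []
    # Desired false-positive rate divided by the number of tests.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Four hypothetical tests, so the per-test threshold is 0.05 / 4 = 0.0125.
p_values = [0.001, 0.012, 0.03, 0.2]
print(bonferroni_significant(p_values))
```

Note how 0.03, which would pass an uncorrected 0.05 cutoff, is no longer declared significant; this is the conservatism the entry refers to.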
bootstrapping:
Kerr and Churchill use a bootstrapping procedure to
calculate confidence intervals for the fitted values. Any bootstrapping
procedure works by perturbing the original dataset and re-solving the model many
times, often thousands of times. [Similar methods are sometimes called resampling
or jackknife methods.] This generates a large number of values for
each variable (one for each perturbed dataset), and one then estimates the true
values of the variable, confidence intervals, and so on, from these values. The
tricky part of the procedure is deciding how to perturb the dataset.
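A rough sketch of the general idea, not of Kerr and Churchill's procedure: here the dataset is perturbed by resampling with replacement, the "model" is just the sample mean, and the interval is read off the percentiles of the re-computed values. All of these choices, and the function name, are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(data, statistic=statistics.mean, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Perturb the dataset many times and re-compute the statistic each time.
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


measurements = [2.1, 2.5, 1.9, 2.8, 2.4, 2.2, 2.6, 2.0]  # made-up values
low, high = bootstrap_ci(measurements)
print(f"mean = {statistics.mean(measurements):.3f}, "
      f"95% CI = ({low:.3f}, {high:.3f})")
```

For structured experiments, resampling individual values may not respect the design; that is the "tricky part" the entry mentions.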
chaos theory and biology:
Skinner JE et al, Application of chaos theory to biology and medicine, Integr
Physiol Behav Sci, 1992 Jan-Mar; 27 (1): 39- 53 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=1576087&dopt=Abstract
Chaos Hypertextbook,
Glen Elert, 1995- 2002 http://hypertextbook.com/chaos/
cluster analysis:
The clustering, or grouping, of large
data sets (e.g., chemical and/or pharmacological data sets) on the basis
of similarity criteria for appropriately scaled variables that represent
the data of interest. Similarity criteria (distance based, associative,
correlative, probabilistic) among the several clusters facilitate the recognition
of patterns and reveal otherwise hidden structures (Rouvray, 1990; Willett,
1987, 1991). [IUPAC Computational]
A set of statistical methods used to group variables or observations into strongly
inter-related subgroups. In epidemiology, it may be used to analyze a closely
grouped series of events or cases of disease or other health-related phenomenon
with well-defined distribution patterns in relation to time or place or both.
MeSH, 1990
Has been used in medicine to create taxonomies of diseases and diagnosis and in
archaeology to establish taxonomies of stone tools and funerary objects. Cluster
analysis can be supervised, unsupervised or partially supervised.
Related terms: clustering analysis, dendrogram, heat map, pattern
recognition, profile chart.
Narrower terms: hierarchical clustering, k-means clustering, self- organizing maps
ClogP values: In
silico & molecular modeling
glossary
clustering analysis: This is a general type of
analysis that involves grouping gene or array expression profiles based on
similarity. Clustering is a major subfield within the broad world of numerical
analysis, and many specific clustering methods are known.
coefficient of variation (CV):
The standard
deviation of a set of measurements divided by their mean.
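As a small worked example of this ratio: the sketch below assumes the sample standard deviation (n - 1 denominator), which the definition above does not specify, and uses made-up replicate values.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation of the values divided by their mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(values) / mean


replicates = [9.0, 10.0, 11.0]  # hypothetical replicate measurements
print(coefficient_of_variation(replicates))
```

Because the result is a ratio, it is unitless and is often reported as a percentage.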
common factor analysis: See under principal components analysis PCA
comparative data mining:
Focuses on overlaying large and complex
data sets that are similar to each other ...particularly useful in all
forms of clinical trial meta analyses ... Here the emphasis is on
finding dissimilarities, not similarities. "Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000
Broader term: data mining
Comparative Molecular Field Analysis CoMFA: In
silico & molecular
modeling glossary
curse of dimensionality:
(Bellman
1961) refers to the exponential growth of hypervolume as a function of
dimensionality. In the field of NNs [neural nets], curse of dimensionality
expresses itself in two related problems. Janne Sinkkonen "What is
the curse of dimensionality?" Artificial Intelligence FAQ, at
comp-ai-neuralnets.org http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-13.html
Related
term: high-dimensionality
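One way to see the exponential growth is to count the cells needed to cover a unit hypercube at a fixed resolution. The sketch below is only an illustration; the function name and the choice of 10 bins per axis are arbitrary.

```python
def grid_cells(bins_per_axis, dimensions):
    """Number of equal-sized cells needed to tile a unit hypercube."""
    return bins_per_axis ** dimensions


# With 10 bins per axis, each added dimension multiplies the cell count by 10,
# so a fixed number of samples covers a rapidly shrinking fraction of the space.
for d in (1, 2, 3, 10, 20):
    print(f"{d:2d} dimensions -> {grid_cells(10, d)} cells")
```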
data cleaning:
Removal and/or correction of erroneous data introduced
by data entry errors, expired validity of data, or by some other means. Lawrence Berkeley Lab "Advanced Computational Structural Genomics"
Glossary
The quality of data in sequence databases is highly variable. This is
receiving increasing attention. Ensembl Bioinformatics
glossary
differentiates data of varying quality.
Related terms: data cleansing, data scrubbing
Data integrity and cleansing tools, DMOZ, 2003
http://dmoz.org/Computers/Software/Databases/Data_Warehousing/Data_Integrity...
data credibility: Different labs have different
reputations, and scientists look at work produced by their peers in a subjective
light. This data credibility issue creates a need to tag almost every data item
with a confidence factor. This is so that, as you create your next experimental
hypothesis, you know that the quality of the information you are relying upon is
high enough that you can go profitably down the scientific line of inquiry that
you are pursuing. The Life Science Industry Represents
Unique Opportunities for Informatics Companies: An Interview with Shiv Tasker of
Blackstone Computing, CHI's GenomeLink 25.1 http://www.healthtech.com/newsarticles/issue25_1.asp
data integration: The term "data integration" is used
generically within the industry for describing disparate situations.
Consequently, considerable confusion results regarding the best practices for
solving specific, data integration problems. There are a number of markedly
different approaches to data integration, each with its own strengths and
weaknesses, and many different technologies are available for each approach. All
data integration efforts are initiated to support particular research
objectives. Although they are aimed toward the same strategic goal, they can
differ substantially in the specific problems that they are trying to solve, in
the scale of the integration, and in the types of data that are integrated. The
strategies and technologies that best apply to address specific objectives are
unlikely to be the same. Key Trends Influencing Informatics Initiatives in
Life Science Companies: An Interview with Eric Meyers and Jack Pollard of 3rd
Millennium, CHI's GenomeLink 29.2 http://www.healthtech.com/newsarticles/issue29_2.asp
Related terms: data mining - integrating, data
reduction methods; Information
management & interpretation glossary interoperability; Computers
XML; Omes & -omics glossary integromics
data management:
Each new generation of DNA sequencers, mass spectrometers,
microscopes, and other lab equipment produces a richer, more detailed set of
data. We’re already way beyond gigabytes (GB): a single next-generation
sequencing experiment can produce terabytes (TB) of data in a single run. As a
result, any organization running hundreds of routine experiments a month or
year, or trying to handle the output of next-generation sequence instruments,
quickly finds itself with a massive data management problem. Data
Management: The Next Generation, Salvatore Salamone, BioIT World, Oct 2007
http://www.bio-itworld.com/issues/2007/oct/cover-story-data-management
See also algorithms, artificial
intelligence, data cleaning, data mining, data reduction methods, expert
systems, factorial design, fuzzy logic, knowledge based systems, neural
networks, normalization, parsing, pattern recognition, SPC Structure-Property Correlations, visualization and
various statistical methods. CoMFA, decision trees, mosaic plots,
multivariate statistics, Partial Least Squares PLS, Principal
Components Analysis PCA, recursive partitioning; Clinical
trials glossary meta-analysis; Information management
& interpretation glossary
data mart:
A department specific data warehouse. There are two types
of data marts - independent and dependent. An independent data mart is fed data
directly from the legacy environment. A dependent data mart is fed data from the
enterprise data warehouse. In the long run, dependent data marts are
architecturally much more stable than independent data marts. Bill Inmon, Glossary of
Data Warehousing, 2002-2005 http://www.inmoncif.com/library/glossary/
data mining:
The
biopharmaceutical industry is grappling not only with sheer data volume but with
the ability of researchers to extract information through identification and
contextual analysis of those data that are relevant to a particular set of
investigations.
Data Mining in Drug Development and Translational Medicine, July 2009
Nontrivial extraction
of implicit, previously unknown and potentially useful information from
data, or the search for relationships and global patterns that exist in
databases. W. Frawley, G. Piatetsky-Shapiro and C. Matheus,
“Knowledge Discovery in Databases: An Overview.” AI Magazine, 213-
228, Fall 1992
Exploration and analysis, by automatic
or semi-automatic means, of large quantities of data in order to discover
meaningful patterns or rules. Berry, MJA, Data Mining Techniques for
Marketing, Sales and Customer Support John Wiley & Sons, New York
1997 cited in Nature Genetics 21(15): 51-55 ref 11, 1999
May need to incorporate related techniques such as
cluster analysis or visualization.
Narrower terms: affinity based data mining, comparative data mining,
data mining - integrated, data mining- structure based, influence-based data mining, predictive data mining, text mining, time delay data mining,
trends analysis
data mining. Molecular Imaging image
data mining.
Related terms: data warehouse
BioMedCentral
for data mining http://www.biomedcentral.com/info/about/datamining/
Data Mine
http://www.the-data-mine.com/
KDNuggets
http://www.kdnuggets.com/
SIGKDD, Special Interest Group Knowledge
Discovery in Data and Data Mining, Association for Computing Machinery
http://www.acm.org/sigkdd/
data reduction methods:
Includes cluster analysis, currently the best
known data reduction method in the microarray field.
Related terms: data cleaning, data cleansing, data scrubbing
data scrubbing: Correcting, completing,
verifying and deduping data, particularly for data warehouses.
Data scrubbing,
Tommy Peterson, ComputerWorld, Feb. 10, 2003 http://www.computerworld.com/databasetopics/data/story/0,10801,78230,00.html
What is.com definition,
2003 http://searchdatabase.techtarget.com/sDefinition/0,,sid13_gci880972,00.html
Related terms: data cleaning, data cleansing, data
reduction methods
data sharing: Research
glossary
data visualization: Information management
& interpretation glossary
data warehouse:
An integrated repository of data from multiple,
possibly heterogeneous data sources, presented with consistent and coherent
semantics. Warehouses usually contain summary information represented on
a centralized storage facility. Lawrence Berkeley Lab "Advanced Computational
Structural Genomics" Glossary
A collection of
integrated subject oriented data bases designed to support the DSS [decision
support system] function,
where each unit of data is relevant to some moment in time. The data warehouse
contains atomic data and lightly summarized data. A data warehouse is a subject
oriented, integrated, non volatile, time variant collection of data designed to
support management DSS needs. Inmon, Bill, Glossary
of Data Warehousing, 2002-2005 http://www.inmoncif.com/library/glossary/
Related
terms: data mining, global schema
database mining: See data mining
databases: Bioinformatics glossary
Databases & software directory
decision trees:
Hierarchical series of questions leading to specific action
steps -- to guide manufacturers and reviewers in determining the level and
extent of safety testing needed at various stages. Report Recommends More
Explicit Guidelines For Assessing Safety of New Ingredients Added to Infant
Formula, National Academy of Sciences press release, 2004 http://www4.nationalacademies.org/news.nsf/isbn/0309091500?OpenDocument
dendrogram (also spelled dendogram):
A tree diagram that depicts the results of hierarchical
clustering. Often the branches of the tree are drawn with lengths that are
proportional to the distance between the profiles or clusters. Dendrograms are
often combined with heat maps, which can give a clear visual representation of
how well the clustering has worked.
Related terms: cluster analysis, heat maps, profile charts
distance functions or similarity scores:
The key issue in comparing expression
profiles is deciding what it means for two profiles to be
"similar." Mathematically, we need a function that takes two
expression profiles and calculates a similarity score. It is sometimes easier to
work with the opposite concept of distance, and people often speak of distance
functions instead of similarity scores. Many similarity or distance functions
are used in microarray work, and there is no consensus as to which one is best.
Narrower terms: Euclidean distance, Pearson correlation
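The similarity/distance duality can be sketched with Pearson correlation. Converting the similarity to a distance as 1 - r is one common convention, assumed here rather than taken from the text above; the function names and the three profiles are made up.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(a * b for a, b in zip(dx, dy)) / denominator


def correlation_distance(x, y):
    """Assumed convention: distance = 1 - similarity, ranging from 0 to 2."""
    return 1.0 - pearson_correlation(x, y)


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend
print(correlation_distance(gene_a, gene_b))
print(correlation_distance(gene_a, gene_c))
```

Correlation ignores overall magnitude and compares only the shape of the profiles, which is one reason different functions can rank the same pairs differently.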
divisive method: See under cluster analysis
error model: A mathematical formulation that identifies the sources of
error in an experiment. An error model provides a mathematical means of
compensating for the errors in the hope that this will lead to more accurate
estimates of the true expression levels and also provides a means of estimating
the uncertainty in the answers. An error model is generally an approximation of
the real situation and embodies numerous assumptions; therefore, its utility
depends on how good these assumptions are. The model can be expressed as a set
of equations, as an algorithm, or using any other mathematical formalisms. ...
The term error model has become very popular among software providers,
particularly in light of the success of Rosetta’s Resolver, which incorporates
an error model. As a result, some software developers may use the term
inappropriately. Not everything that is called an error model really is one.
Euclidean distance: Commonly used distance function, which works by
treating each expression profile as defining a point in a multidimensional
space.
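A minimal sketch, treating each profile as a point whose coordinates are its expression values (the function name and the example values are illustrative only):

```python
import math


def euclidean_distance(profile_a, profile_b):
    """Straight-line distance between two profiles viewed as points."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of measurements")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Unlike correlation-based measures, this distance is sensitive to the absolute magnitude of the values, so profiles are often scaled before it is applied.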
evolutionary computation:
Encompasses methods of simulating evolution
on a computer. The term is relatively new and represents an effort to bring
together researchers who have been working in closely related fields but
following different paradigms. The field is now seen as including research in
genetic algorithms, evolution strategies, evolutionary programming, artificial
life, and so forth. For a good overview see the editorial
introduction to Vol. 1, No. 1 of "Evolutionary Computation"
(MIT Press, 1993). That, along with the papers in the issue, should give you a
good idea of representative research. Evolutionary computing glossary, Hitch
Hiker's Guide to Evolutionary Computation Issue 8.1, released 29 March 2000
expert systems:
A computer-based program that encodes rules obtained from process experts
usually in the form of “if - then” statements. J Glassey et al.
“Issues in the development of an industrial bioprocess advisory system”
Trends in Biotechnology 18 (4):136-41 April 2000
Related term: artificial intelligence.
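The "if - then" form can be sketched as a list of condition/action pairs. The rules, thresholds and reading names below are entirely hypothetical and are not taken from Glassey et al.; a real advisory system would encode rules elicited from process experts.

```python
# Each rule is (condition on the readings, recommended action).
# All thresholds are invented for illustration.
RULES = [
    (lambda r: r["ph"] < 6.5, "raise pH"),
    (lambda r: r["dissolved_oxygen"] < 30, "increase aeration"),
    (lambda r: r["temperature_c"] > 37.5, "increase cooling"),
]


def advise(readings, rules=RULES):
    """Return the actions of every rule whose 'if' part holds."""
    return [action for condition, action in rules if condition(readings)]


readings = {"ph": 6.2, "dissolved_oxygen": 45, "temperature_c": 38.0}
print(advise(readings))
```

Keeping the rules as data, separate from the code that evaluates them, is what lets experts review or extend the rule base without touching the inference step.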
factorial design FD:
An experimental design technique in which each
variable (factor or descriptor) is investigated at fixed levels.
In a two- level FD, each variable can take two values, e.g., high and low
lipophilicity. [IUPAC Computational]
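A two-level design can be enumerated directly. The sketch below assumes a full factorial (every combination of levels is run), which is only one kind of factorial design; apart from lipophilicity, the factor names are hypothetical.

```python
from itertools import product


def full_factorial(levels_by_factor):
    """List every combination of factor levels as one experimental run."""
    names = list(levels_by_factor)
    level_lists = [levels_by_factor[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


design = full_factorial({
    "lipophilicity": ["low", "high"],
    "molecular_weight": ["low", "high"],   # hypothetical second factor
    "charge": ["low", "high"],             # hypothetical third factor
})
for run in design:
    print(run)
```

With k two-level factors the full design has 2^k runs, which is why fractional designs are often used once k grows.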
fuzzy:
In contrast to binary (true/ false) terms allows for looser
boundaries for sets or concepts.
fuzzy logic:
A superset of conventional (Boolean) logic that
has been extended to handle the concept of partial truth: truth values
between “completely true” and “completely false”. Introduced by Dr.
Lotfi Zadeh (Univ. of California - Berkeley) in the 1960s as a means to model the uncertainty
of natural language. AI FAQ, Carnegie Mellon University Computer Science
Department http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/fuzzy/part1/faq-doc-2.html
Approximate, quantitative reasoning that is concerned with the linguistic ambiguity which exists in natural or
synthetic language. At its core are variables such as good, bad, and young as well as modifiers such as more, less,
and very. These ordinary terms represent fuzzy sets in a particular problem. Fuzzy logic plays a key role in many
medical expert systems. MeSH, 1993
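Partial truth, a fuzzy set such as "young", and a modifier such as "very" can be sketched as follows. The age breakpoints are invented, and the operators used (min for AND, max for OR, 1 - x for NOT, squaring for "very") are one common choice associated with Zadeh's formulation, not the only one.

```python
def young(age):
    """Degree of membership in 'young' (made-up breakpoints: 25 and 45)."""
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20


def very(membership):
    """Modifier 'very', modeled here by squaring the membership value."""
    return membership ** 2


def fuzzy_and(a, b):
    return min(a, b)


def fuzzy_or(a, b):
    return max(a, b)


def fuzzy_not(a):
    return 1.0 - a


age = 35
print("young:", young(age))
print("very young:", very(young(age)))
print("not young:", fuzzy_not(young(age)))
```

When every membership value is exactly 0 or 1, these operators reduce to ordinary Boolean logic, which is the sense in which fuzzy logic is a superset of it.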
genetic algorithm GA: Method for library design by evaluating
the fit of a parent library to some desired property (e.g. the level of
activity in a biological assay, or the computationally determined diversity
of the compound set) as measured by a fitness function. The design of more
optimal daughter libraries is then carried out by a heuristic process
with similarities to genetic selection in that it employs replication, mutation, deletions etc. over a number of generations. [IUPAC Combinatorial
Chemistry]
An optimization algorithm based on the mechanisms of Darwinian evolution
which uses random mutation, crossover and selection procedures to breed
better models or solutions from an originally random starting population
or sample. (Rogers and Hopfinger, 1994). IUPAC Computational
Related terms: evolutionary computation ; Drug
discovery & development drug design. Narrower term: genetic programming
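The mutation, crossover and selection steps named above can be sketched on a toy problem. This is not a library-design workflow: the fitness function (count of 1 bits), tournament selection, single-point crossover, elitism, and every parameter value are assumptions chosen to keep the example short.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1s in the bit string."""
    return sum(bits)


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Originally random starting population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]

    def tournament():
        # Selection: the fitter of two randomly drawn individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        # Elitism (an assumption): carry the current best over unchanged.
        next_population = [max(population, key=fitness)[:]]
        while len(next_population) < pop_size:
            mother, father = tournament(), tournament()
            cut = rng.randrange(1, n_bits)          # single-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]              # random mutation
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


best = genetic_algorithm(one_max)
print(one_max(best), best)
```

In a real application the bit string would encode a candidate solution (for example, a compound set) and the fitness function would score the property of interest.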
genetic programming:
A subset of genetic algorithms. The members
of the populations are the parse trees of computer programs whose fitness
is evaluated by running them. The reproduction operators (e.g. crossover)
are refined to ensure that the child is syntactically correct (some protection
may be given against semantic errors too). This is achieved by acting upon
subtrees. Genetic programming is most easily implemented where the computer
language is tree structured so there is no need to explicitly evaluate its
parse tree. This is one of the reasons why Lisp is often used for genetic
programming. This is the common usage of the term genetic programming
however it has also been used to refer to the programming of cellular automata
and neural networks using a genetic algorithm. William Langdon "Genetic programming and data structures glossary"
UK, 1997
Genetic Programming Organization
http://www.genetic-programming.org
International Society for Genetic and Evolutionary
Computation http://www-illigal.ge.uiuc.edu:8080/
genome mining: In an initial data-mining effort, the draft human genome was searched to find paralogs of known tumor suppressor genes, and for gene arrangements typical of
oncogenes in cancer cells. The results were disappointing, indicating that although knowledge of the human genome will undoubtedly be of great help, other approaches to identify new oncogenes are needed. TG Boyer et al. "Genome mining for human cancer genes: wherefore art thou?"
Trends in Molecular Med 189, May 2001
genomic data: Genomics glossary
global schema: A schema, or a map of the data content of a data
warehouse that integrates the schemata from several source repositories.
It is "global", because it is presented to warehouse users as the schema
that they can query against to find and relate information from any of
the sources, or from the aggregate information in the warehouse. Lawrence
Berkeley Lab "Advanced Computational Structural Genomics" Glossary
Broader term: schema
Hansch analysis:
The investigation of the quantitative relationship
between the biological activity of a series of compounds and their
physicochemical substituent or global parameters representing hydrophobic,
electronic, steric and other effects using multiple regression correlation
methodology. [IUPAC Medicinal Chemistry]
Related term: QSAR
heat map:
A rectangular display that is a direct translation of
a Cluster-format data table. Each cell of the data table is represented as a
small color-coded square in which the color indicates the expression value.
Generally green indicates low values, black medium values, and red high ones,
although this is user-settable. The net effect is a colored picture in which
regions of similar color indicate similar profiles or parts of profiles.
Related terms: cluster analysis, dendrogram, profile chart;
Expression glossary
heuristic:
Tools such as
genetic algorithms or neural
networks employ heuristic methods to derive solutions which may be
based on purely empirical information and which have no explicit rationalization.
[IUPAC Combinatorial Chemistry]
Trial and error methods.
Narrower terms: heuristic
algorithm, metaheuristics
heuristic algorithm: A programming strategy for solving
computationally resistant problems that utilizes self-educating techniques
(i.e., feedback evaluation) to improve performance (e.g., FASTA). Problem
solving by such experimental, trial-and-error methods does not guarantee
the optimal solution. [labvelocity.com]
Hidden Markov Models HMM:
In
silico & molecular
modeling glossary
Related term: simulated annealing
hierarchical clustering:
Unsupervised clustering approach used to
determine patterns in gene expression data. Output is a tree- like structure.
Related term: cluster analysis, self- organizing maps
high-dimensionality: Many applications of machine learning methods
in domains such as information retrieval, natural language processing, molecular
biology, neuroscience, and economics have to be able to deal with various sorts
of discrete data that is typically of very high dimensionality.
One standard approach to deal with high dimensional data is to perform a
dimension reduction and map the data to some lower dimensional representation.
Reducing the data dimensionality is often a valuable analysis by itself, but it
might also serve as a pre- processing step to improve or accelerate subsequent
stages such as classification or regression. Two closely related methods that
are often used in this context and that can be found in virtually every textbook
on unsupervised learning are principal component analysis (PCA) and
factor analysis. Thomas Hofmann, Brown Univ. Statistical Learning in
High Dimensions, Breckenridge CO, Dec. 1999 http://www-2.cs.cmu.edu/~mmp/workshop-nips99/speakers.html
See also under learning algorithms;
Related terms: cluster analysis, curse of dimensionality, dimensionality
reduction, ill- posed problem, neural nets, principal components analysis
ill-posed problems:
In the 1960s [Russian mathematician Andrei Nikolaevich]
Tikhonov began to produce an important series of papers on ill-posed problems. He defined a class of regularisable
ill-posed problems and introduced the concept of a regularising operator which was used in the solution of these problems. Combining his computing
skills with solving problems of this type Tikhonov gave computer implementations
of algorithms to compute the operators which he used in the solution of these
problems. "Andrei Nikolaevich Tikhonov",
MacTutor History of Mathematics, Univ. of St. Andrews, Scotland, 1999
Problems without a unique solution, or without any solution. Life
sciences data tends to be very noisy, leading to ill-posed problems.
Interpretation of microarray gene expression
data is an ill-posed problem.
Compare well-posed problem
influence based data mining:
Complex and granular (as opposed
to linear) data in large databases are scanned for influences between specific
data sets, and this is done along many dimensions and in multi- table formats.
These systems find applications wherever there are significant cause and
effect relationships between data sets - as occurs, for example in large
and multivariant gene expression studies, which are behind areas such as
pharmacogenomics. "Data mining" Nature
Biotechnology Vol. 18: 237- 238 Supp. Oct. 2000
Broader
term: data mining
informatics: Information
management & interpretation glossary
information management: Information
management & interpretation glossary
information theory:
Founded
by Claude Shannon in the 1940's, has had an enormous impact on communications
engineering and computer sciences.
Shannon's work,
Bell Labs http://cm.bell-labs.com/cm/ms/what/shannonday/work.html
Information theory
primer, Tom Schneider,
National Cancer Institute, US, 2002 http://www.lecb.ncifcrf.gov/~toms/paper/primer/
jackknife: See under bootstrapping
k-means clustering:
The researcher picks a value for k, say k = 10,
and the algorithm divides the data into that many clusters in such a way that
the profiles within each cluster are more similar to one another than to profiles in other clusters.
The actual algorithms for this are quite sophisticated. Although the core
algorithms require that a value of k be selected up front, methods exist that
adaptively select good values for k by running the core algorithm several times
with different values. A non-hierarchical method.
Broader terms: cluster analysis, neural nets
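As a rough illustration of the core procedure, the sketch below uses Lloyd's algorithm, one common way to implement k-means (the entry does not say which algorithm it has in mind). The function names, the toy two-value profiles, and the choice to seed the clusters with the first k profiles are assumptions made for the example; real tools usually randomise the starting points.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_profile(profiles):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return tuple(sum(values) / n for values in zip(*profiles))


def k_means(profiles, k, max_iterations=100):
    """Assign each profile to one of k clusters (Lloyd's algorithm)."""
    # Assumption: seed the centroids with the first k profiles.
    centroids = [tuple(p) for p in profiles[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Step 1: assign every profile to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering is stable
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_profile(members)
    return assignment, centroids


# Hypothetical expression profiles: three low and three high.
profiles = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0),
            (5.1, 4.9), (1.1, 0.9), (4.9, 5.0)]
labels, centres = k_means(profiles, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Choosing k adaptively, as the entry mentions, would amount to calling `k_means` with several values of k and comparing the resulting clusterings by some quality score.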
knowledge based systems:
An extension of the expert system concept
wherein additional forms of knowledge, such as mathematical models, are
incorporated with the expert rules. J Glassey et al. “Issues in the development
of an industrial bioprocess advisory system” Trends in Biotechnology 18
(4): 136-141, April 2000
Related term: data mining.
Knowledge Discovery in Databases (KDD):
The notion of Knowledge Discovery in Databases (KDD) has been given various
names, including data mining, knowledge extraction, data pattern
processing, data archaeology, information harvesting, siftware, and even (when
done poorly) data dredging. Whatever the name, the essence of KDD is the
"nontrivial extraction of implicit, previously unknown, and potentially
useful information from data" (Frawley et al 1992). KDD encompasses a
number of different technical approaches, such as clustering, data
summarization, learning classification rules, finding dependency networks,
analyzing changes, and detecting anomalies (see Matheus et al 1993). Gregory
Piatetsky-Shapiro, KDD Nuggets FAQ, KDD Nuggets News, 1994 http://www.kdnuggets.com/news/94/n6.txt
Google = about 241,000 July 19, 2002;
about 321,000 Oct 22, 2007
Related term: data mining
knowledge management: Information management
& interpretation glossary
lexical parsing:
See under parsing
machine learning:
Wikipedia http://en.wikipedia.org/wiki/Machine_learning
In Knowledge Discovery, machine
learning is most commonly used to mean the application of induction
algorithms, which is one step in the knowledge discovery process. This is
similar to the definition of empirical learning or inductive learning in
Readings in Machine Learning by Shavlik and Dietterich. Note that in their
definition, training examples are "externally supplied," whereas here they are
assumed to be supplied by a previous stage of the knowledge discovery process. Machine
Learning is the field of scientific study that concentrates on induction
algorithms and on other algorithms that can be said to "learn." Glossary of
terms, Ron Kohavi, Machine Learning, 30, 271-274, 1998
Related terms: supervised, training sets
AAAI Machine
Learning, http://www.aaai.org/AITopics/html/machine.html
American Association for Artificial Intelligence
MathML:
Intended to
facilitate the use and re-use of mathematical and scientific content on the Web,
and for other applications such as computer algebra systems, print typesetting,
and voice synthesis. W3C http://www.w3.org/Math/whatIsMathML.html
medical informatics: Molecular
Medicine glossary
metadata: Information management
& interpretation glossary
metaheuristics:
Widely used to solve important practical combinatorial
optimization problems. However, due to the variety of techniques and concepts
comprised by metaheuristics, there is still no commonly agreed definition for
metaheuristics. The definition used in the Metaheuristics Network is the
following.
A metaheuristic
is a set of concepts that can be used to define heuristic methods that can be
applied to a wide set of different problems. In other words, a metaheuristic can
be seen as a general algorithmic framework which can be applied to different
optimization problems with relatively few modifications to make them adapted to
a specific problem. Examples of metaheuristics include simulated annealing (SA),
tabu search (TS), iterated local search (ILS), evolutionary algorithms (EC), and
ant colony optimization (ACO). Project Summary, Metaheuristics Network,
Improving Human Potential, European Community http://www.metaheuristics.org/index.php?main=1
molecular information theory: In our laboratory we
use Claude
Shannon's information
theory, computers (Unix, Pascal and PostScript
graphics on Sun workstations) and genetic engineering (protein and DNA gels,
cloning, sequencing and magnetic bead technology) to study genetic control
patterns on DNA and RNA. "Molecular Information Theory" Tom
Schneider, National Cancer Institute, US, 2002 http://www.lecb.ncifcrf.gov/~toms/introduction.html
Molecular Information Theory and the theory of
molecular machines, Tom Schneider, NCI, US http://www.lecb.ncifcrf.gov/~toms/
molecular pattern recognition:
Developing computational methodologies
for the analysis and interpretation of large-scale expression datasets
generated by DNA microarray experiments. Analysis of genome-wide
expression patterns and their correlations with phenotypes of interest
may provide unique insights into the structure of genetic networks and
into biological processes not yet understood at the molecular level.
Whitehead/MIT [US] Genome Center's Molecular Pattern Recognition
web site. http://www.genome.wi.mit.edu/MPR/index.html
Broader term: pattern recognition. Related terms: Expression glossary
mosaic plots:
A graphical alternative for qualitative, or categorical,
data … display cross-classified data by constructing rectangles of area
proportional to the counts … likely to become more familiar [to scientists]
and their use is likely to grow. Are to categorical variables what scatterplots
are to continuous variables, and their purpose is the same, to find interesting
patterns of association between variables. RD Meyer & D Cook “Visualization
of data” Current Opinion in Biotechnology 11: 89-96, 2000
multivariate statistics:
A set of statistical tools to analyze
data (e.g., chemical and biological) matrices using regression and/or pattern
recognition techniques. [IUPAC Computational]
neural networks:
Technique for optimizing a desired property
given a set of items which have been previously characterized with respect
to that property (the 'training set'). Features of members
of the training set which correlate with the desired property are 'remembered'
and used to generate a model for selecting new items with the desired property
or to predict the fit of an unknown member. [IUPAC Combinatorial Chemistry]
Communication between statisticians and neural net researchers is often
hindered by the different terminology used in the two fields. There is a
comparison of neural net and statistical jargon in ftp://ftp.sas.com/pub/neural/jargon
IEEE Neural
Networks, Institute of Electrical
and Electronics Engineers http://www.ieee-nns.org/
Neural Network FAQ
ftp://ftp.sas.com/pub/neural/FAQ.html
Narrower terms: artificial neural networks, probabilistic neural
networks. Often uses fuzzy logic; Related terms: artificial
intelligence; In silico & molecular
modeling glossary self-organizing maps
nonparametric: See under parametric versus nonparametric methods
normalization:
A knotty area in any measurement process, because
it is here that imperfections in equipment and procedures are addressed. The
specifics of normalization evolve as a field matures since the process usually
gets better, and one’s understanding of the imperfections also gets better. In
the microarray field, even larger changes are occurring as robust statistical
methods are being adopted.
See also normalization, Microarrays glossary
Narrower term: thresholding
ontology, ontologies: Information management
& interpretation glossary
paraphrase problem: Information
management & interpretation glossary
parsing:
Using algorithms to analyze data into components. Semantic
parsing involves trying to figure out what the components mean. Lexical
parsing refers to the process of deconstructing the data into components.
Narrower term: In
silico & molecular modeling glossary gene parsing
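The two stages can be separated in a few lines of code. In the sketch below, the record format (`key=value` fields joined by semicolons), the field names, and both function names are hypothetical, chosen only to show lexical parsing (splitting text into components) followed by semantic parsing (deciding what the components mean).

```python
def lexical_parse(record):
    """Lexical step: split 'key=value;key=value' text into string pairs."""
    return [tuple(field.split("=", 1)) for field in record.split(";") if field]


def semantic_parse(pairs):
    """Semantic step: interpret the pairs, e.g. coordinates become integers."""
    meaning = {}
    for key, value in pairs:
        meaning[key] = int(value) if key in ("start", "end") else value
    # Derived meaning: an inclusive interval covers end - start + 1 positions.
    meaning["length"] = meaning["end"] - meaning["start"] + 1
    return meaning


record = "name=geneA;start=100;end=250"   # made-up annotation record
pairs = lexical_parse(record)
info = semantic_parse(pairs)
print(pairs)  # [('name', 'geneA'), ('start', '100'), ('end', '250')]
print(info)   # {'name': 'geneA', 'start': 100, 'end': 250, 'length': 151}
```

After the lexical step everything is still an uninterpreted string; only the semantic step knows that `start` and `end` are coordinates from which a length can be computed.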
Partial Least Squares PLS:
Projection to latent structures
(PLS) is a robust multivariate generalized regression method using projections
to summarize multitudes of potentially collinear variables (Wold et al.,
1993). [IUPAC Computational]
pattern recognition PR:
The identification of patterns in large
data sets using appropriate mathematical methodologies. Examples
are principal component analysis (PCA), SIMCA, partial least squares
(PLS) and artificial neural networks (ANN) (Rouvray, 1990;
Van de Waterbeemd, 1995ab) [IUPAC Computational]
Narrower terms: artificial neural networks,
molecular
pattern recognition, principal component analysis (PCA), SIMCA, partial least squares
(PLS)
Pearson correlation:
Commonly used similarity function which looks
explicitly at the shape of the expression profile, avoiding the need to
transform the data beforehand. It’s easiest to understand what this function
does by using a different spatial representation of the data. Take two
expression profiles and draw a scatter plot of corresponding values. In other
words, pair the first value of the first profile with the first value of the
second, the second value of the first profile with the second value of the
second, and so forth. The Pearson correlation measures how well a straight line
can be fit to the data. A correlation of +1 means the fit is perfect to a line
that slants up, 0 means there is no linear relationship, and –1 means the fit is perfect to
a line that slants down.
predictive data mining:
Combines pattern matching, influence
relationships, time set correlations, and dissimilarity analysis to offer
simulations of future data sets...these systems are capable of incorporating
entire data sets into their working, and not just samples, which make their
accuracy significantly higher ... used often in clinical trial analysis
and in structure-function correlations. "Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000
Broader term: data mining
predictor: See under classifier
Principal Components Analysis PCA:
Computational approach to
reducing the complexity of, for example, a set of descriptors, by identifying
those features which provide the major contributions to observed properties,
and thus reducing the dimensionality of the relevant property space. [IUPAC
Combinatorial Chemistry]
A data reduction method using mathematical techniques to identify patterns
in a data matrix. The main element of this approach consists of the construction
of a small set of new orthogonal, i.e., non-correlated, variables derived
from a linear combination of the original variables. [IUPAC Computational]
Often confused with common factor analysis. [Neural Network FAQ Part 1] ftp://ftp.sas.com/pub/neural/FAQ.html
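The idea of deriving a new variable as a linear combination of the original ones can be illustrated with power iteration on the covariance matrix. This is a minimal sketch that finds only the first principal component; the function name and the toy data are assumptions, and it presumes the data are not constant and the starting vector is not orthogonal to the leading direction. Production code would normally use an eigenvalue or singular value decomposition from a numerical library.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) for the leading component."""
    n, d = len(rows), len(rows[0])
    # Centre each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the original variables.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Variance along v, as a fraction of the total variance.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    return v, explained


# Toy data: the second variable is exactly twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component, explained = first_principal_component(rows)
print(component, explained)
```

For this toy data the two variables are perfectly collinear, so a single component pointing along (1, 2) captures essentially all of the variance and the two-dimensional table reduces to one dimension.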
probabilistic neural networks: Statsoft
probability:
Probability web http://www.mathcs.carleton.edu/probweb/probweb.html
profile chart: A line graph that is a direct translation of a
Cluster-format data table. Each cell of the data table is represented as a point
whose Y coordinate indicates the expression value, and whose X coordinate is the
ordinal position of the value in its profile. The points for each profile are
connected by lines. A profile chart is a good way to visualize individual
clusters.
Related terms: cluster analysis, dendrogram, heat map
protein and mRNA data: Proteomics glossary
Quantitative Structure-Activity Relationships QSAR: In
silico & molecular
modeling glossary
recursive partitioning:
Process for identifying complex structure-activity
relationships in large sets by dividing compounds into
a hierarchy of smaller and more homogeneous sub-groups on the basis of
the statistically most significant descriptors.
Related terms: clustering,
principal components analysis. [IUPAC Combinatorial Chemistry]
regression analysis:
The use of statistical methods for
modeling a set of dependent variables, Y, in terms of combinations of
predictors, X. It includes methods such as multiple linear
regression (MLR) and partial least squares (PLS). [IUPAC Computational]
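For the simplest case, one predictor X and one dependent variable Y, ordinary least squares has a closed form. The sketch below covers only that case, not MLR or PLS; the function name and the data points are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariation of X with Y, and variation of X about its mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


x = [1.0, 2.0, 3.0, 4.0]   # predictor values (hypothetical)
y = [3.1, 4.9, 7.2, 8.8]   # measured responses (hypothetical)
slope, intercept = fit_line(x, y)
print(round(slope, 2), round(intercept, 2))  # 1.94 1.15
```

The fitted line always passes through the point (mean of X, mean of Y), which is why the intercept is computed from the two means once the slope is known.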
resampling: See under bootstrapping
regression to the mean:
A common misconception about genetics has to
do with overgeneralization about the likelihood of increased quality by selective breeding.
Two very tall parents will tend to produce offspring who are taller than the
population average -- but less tall than the average of the parents'
heights. George Bernard Shaw is supposed to have been approached by a famous
beauty who suggested they have a child: "With your brains and my looks
..." He is said to have replied, "But what if the child had my looks and your
brains?"
remembrance agents: Information management
& interpretation glossary
robust:
A statistical test
that yields approximately correct results despite the falsity of certain
of the assumptions on which it is based. [Oxford English Dictionary]
Hence, can refer to
a process which is relatively insensitive to human foibles and variations
in the way it (for example, an assay) is carried out.
Idiot-proof.
SIMCA (SIMple Classification Analysis or Soft Independent Modeling
of Class Analogy):
This method is a pattern recognition and
classification technique (Dunn and Wold, 1995). [IUPAC Computational]
SPC Structure-Property Correlations: In
silico & molecular
modeling glossary
scalable, scaling: Drug discovery &
development glossary
schema (plural schemata):
A description of the data represented
within a database. The format of the description varies but includes a
table layout for a relational database or an entity-relationship diagram.
Lawrence Berkeley Lab "Advanced Computational Structural Genomics"
Glossary
Narrower term: global schema
self-organization: Typically
refers to a process by which systems organize themselves without external
direction, manipulation or control. The term is difficult to define precisely
because it is used in reference to a variety of processes generating a variety
of systems. M. Beth L. Dempster, Glossary, A Self-Organizing Systems
Perspective on Organizing for Sustainability, Univ. of Waterloo, Canada, 1998 http://www.nesh.ca/jameskay/ersserver.uwaterloo.ca/jjkay/grad/bdempster/gloss.html
A process where the organization (constraint, redundancy) of a system spontaneously increases, i.e. without this increase being controlled by the environment or an encompassing or otherwise external system.
[F. Heylighen, "Self Organization" Jan 27, 1997
in: F. Heylighen, C. Joslyn and V. Turchin (editors): Principia Cybernetica Web (Principia
Cybernetica, Brussels) http://pespmc1.vub.ac.be/SELFORG.html]
self-organizing map:
Similar to
k-means, but the algorithm organizes
the clusters in a two-dimensional grid, such that clusters that are close
together in the grid are more similar than those further apart. This is a very
useful feature when working with large numbers of clusters.
A type of mathematical cluster analysis
that is particularly well suited for recognizing and classifying features
in complex, multidimensional data. The method has been implemented in a
publicly available computer package, GENECLUSTER, that performs the analytical
calculations and provides easy data visualization … Expression patterns
of some 6,000 human genes were assayed, and an online database was created.
GENECLUSTER was used to organize the genes into biologically relevant clusters
that suggest novel hypotheses about hematopoietic differentiation. [P. Tamayo
et al. “Interpreting patterns of gene expression with self-organizing maps:
methods and application to hematopoietic differentiation” PNAS 96(6): 2907-2912,
Mar 16, 1999]
Related term: neural networks
semantic data integration:
Information management & interpretation glossary
Google = about 214 July 19, 2002;
about 24,200 Oct 8, 2007
semantic parsing: See under parsing
Google = about 1,380 Aug. 20, 2002;
about 45,200 Oct 8, 2007
sequencing algorithms: See Sequencing Glossary
BLAST, FASTA, Needleman-Wunsch,
Smith-Waterman
similarity scores: See under distance functions or similarity
scores
simulated annealing: In
silico & molecular modeling
stochastic:
"Aiming, proceeding by guesswork" (Webster's Collegiate
Dictionary). Term which is often applied to combinatorial processes involving
true random sampling, such as selection of beads from an encoded library,
or certain methods for library design. [IUPAC Combinatorial Chemistry]
Truly random, based on probability.
Structure Activity
Relationship SAR: Drug
discovery & development; Narrower terms: 3D-QSAR, QSAR
Support Vector Machines SVMs:
A new generation learning system based
on recent advances in statistical learning theory. SVMs deliver state-of-the-art
performance in real-world applications such as text categorisation,
hand-written character recognition, image classification, biosequences analysis,
etc., and are now established as one of the standard tools for machine learning
and data mining. Nello Cristianini, John Shawe-Taylor, An Introduction to
Support Vector Machines and Other Kernel-based Learning Methods, Cambridge
University Press, 2000 http://uk.cambridge.org/engineering/catalogue/0521780195/default.htm
taxonomy, taxonomies: Information management
& interpretation glossary
text mining: Information management
& interpretation glossary
thresholding:
The researcher defines minimum and maximum values that
are considered reliable; measurements that are too low or too high are dropped
from the dataset or marked as unreliable. It also makes sense to subtract the
minimum value from all other measurements, because this reflects baseline noise.
This approach implicitly assumes that microarrays normally operate in the linear
part of the dynamic range, and that the transitions between the linear and flat
regimes occur abruptly.
Broader term: normalization
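A minimal sketch of this procedure follows. The function name, the cut-off values, and the raw intensities are hypothetical; here unreliable measurements are marked with `None` rather than dropped, and the minimum is subtracted from the rest as a baseline.

```python
def threshold(measurements, minimum, maximum):
    """Mark out-of-range values as None; subtract the baseline from the rest."""
    cleaned = []
    for value in measurements:
        if value < minimum or value > maximum:
            cleaned.append(None)             # too low or too high: unreliable
        else:
            cleaned.append(value - minimum)  # remove baseline noise
    return cleaned


raw = [12.0, 150.0, 980.0, 70000.0, 20.0]   # made-up intensities
result = threshold(raw, minimum=20.0, maximum=60000.0)
print(result)  # [None, 130.0, 960.0, None, 0.0]
```

The hard cut-offs mirror the assumption stated above, that the transition between the linear and flat regimes is abrupt. Where the transition is gradual, values near the limits would need gentler treatment than this sketch gives them.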
time delay data mining:
The data is collected over time and systems
are designed to look for patterns that are confirmed or rejected as the
data set increases and becomes more robust. This approach is geared
toward long-term clinical trial analysis and multicomponent
mode of action
studies. "Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct.
2000
Broader term: data mining
training set:
An initial dataset for which the correct answers are
known. The data and correct answers are fed into a training program that adjusts the
parameters of the general model
so that the model works well on the given dataset. There are usually
enough parameters so that this can be accomplished, provided the dataset is
reasonably consistent. The training set usually has to be very large to produce
a good classifier.
trends-based data mining:
Software analyzes large and complex
data sets in terms of any changes that occur in specific data sets over
time. Data sets can be user-defined or the system can uncover them
itself...This is especially important in cause-and-effect biological experiments.
Screening is a good example. "Data mining" Nature Biotechnology Vol. 18: 237-238
Supp. Oct. 2000
Broader term: data mining
unsupervised training sets:
Unsupervised training is where the network has to make sense of the inputs
without outside help. ... Unsupervised training is used to perform some initial
characterization on inputs. However, in the full blown sense of being truly self
learning, it is still just a shining promise that is not fully understood, does
not completely work, and thus is relegated to the lab. Artificial Neural
Networks Technology, Data and Analysis Software, Dept. of Defense, 2000 http://www.dacs.dtic.mil/techs/neural/neural3.html
visualization: Information management
& interpretation glossary
well-posed problem:
A problem is well-posed if (and only if):
it has one and only one solution; and a small change in the data (such as prescribed
boundary conditions, source strengths, coefficients in the PDE, etc.) produces
only a small change in the solution. [Nils Andersson "Appropriate boundary
conditions", Partial Differential Equations, Univ. of Southampton, UK,
2001] http://www.maths.soton.ac.uk/staff/Andersson/MA361/node38.html
Related term: ill-posed problems
IUPAC definitions are reprinted with the permission of the International
Union of Pure and Applied Chemistry.
Bibliography
M. Beth L. Dempster, Glossary, A Self-Organizing Systems
Perspective on Organizing for Sustainability, Univ. of Waterloo, Canada, 1998,
30+ terms. http://www.nesh.ca/jameskay/ersserver.uwaterloo.ca/jjkay/grad/bdempster/gloss.html
Evolutionary
Algorithms, terms and definitions, Hans-Georg Beyer, Eva Brucherseifer, Wilfried
Jakob, Hartmut Pohlheim, Bernhard Sendhoff, Thanh Binh To, 2002
http://ls11-www.cs.uni-dortmund.de/people/beyer/EA-glossary/
Flake, Gary, Computational Beauty of Nature: Computer Explorations of
Fractals, Chaos, Complex Systems and Adaptation. Glossary, MIT Press, 2000.
280+ definitions. http://mitpress.mit.edu/books/FLAOH/cbnhtml/glossary-intro.html
Glossary of terms, Ron Kohavi, Machine Learning, 30, 271-274,
1998, 45 definitions. http://ai.stanford.edu/~ronnyk/glossary.html
Inmon, Bill, Glossary
of Data Warehousing, 2002-2005 http://www.inmoncif.com/library/glossary/
[IUPAC Combinatorial] International Union of Pure and Applied
Chemistry, Glossary of Terms Used in Combinatorial Chemistry, D. Maclean, J.J.
Baldwin, V.T. Ivanov, Y. Kato, A. Shaw, P. Schneider, and E.M. Gordon, Pure
Appl. Chem., Vol. 71, No. 12, pp. 2349-2365, 1999, 100+ definitions http://www.iupac.org/reports/1999/7112maclean/
[IUPAC Computational] International Union of Pure and Applied Chemistry,
Glossary of Terms used in Computational Drug Design, H. van de Waterbeemd, R.E.
Carter, G. Grassy, H. Kubinyi, Y.C. Martin, M.S. Tute, P. Willett, 1997. 125+
definitions. http://www.iupac.org/reports/1997/6905vandewaterbeemd/glossary.html
NIST National Institute of Standards and Technology, Dictionary of
Algorithms, Data Structures and Problems, Paul Black, 2001, 1300+ terms
http://www.nist.gov/dads/
Statsoft, Inc. Statistics glossary, Electronic Statistics Textbook, Tulsa
OK, US 2001, 1200+ definitions. http://www.statsoft.com/textbook/stathome.html
Tollenaere
JP, EE Moret, Hyperglossary of [Molecular Modelling in Drug Design] Terminology,
Utrecht University, 1996. 150+ definitions. http://wwwcmc.pharm.uu.nl/webcmc/glossary.html
Hao Zhang, A Statistical Learning/Pattern Recognition Glossary, Univ. of
Wisconsin - Madison, 1999, 80 terms. http://www.cs.wisc.edu/~hzhang/glossary.html
|