With changes in sequencing technology and methods, the rate
of acquisition of human and other genome data over the next few years will
be ~100 times higher than originally anticipated. Assembling and interpreting
these data will require new and emerging levels of coordination and collaboration
in the genome research community to develop the necessary computing algorithms,
data management, and visualization systems. Lawrence Berkeley
Lab, US "Advanced Computational Structural Genomics"
The dividing line between this glossary and
Information management & interpretation is fuzzy; in general, Algorithms
& data analysis focuses on structured data, while Information management
& interpretation centers on unstructured data. Data Science & Machine Learning
is another closely related glossary.
Finding guide to terms in these glossaries: Informatics term index.
Related glossaries include Drug Discovery & Development; Proteomics.
Informatics: Bioinformatics, Chemoinformatics, Clinical informatics, Drug discovery informatics, IT infrastructure, Ontologies.
Research Technologies: Microarrays & protein chips, Sequencing.
Biology: Protein Structures; Sequences, DNA & beyond.
affinity-based data mining:
Large and complex data sets are analyzed
across multiple dimensions, and the data mining system identifies data
points or sets that tend to be grouped together. These systems differentiate
themselves by providing hierarchies of associations and showing any underlying
logical conditions or rules that account for the specific groupings of
data. This approach is particularly useful in biological motif analysis.
"Data mining" Nature Biotechnology 18: 237-238 Supp. Oct. 2000 Broader term: data
mining
algorithm:
A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task.
MeSH, 1987
Algorithms fuel the
scientific advances in the life sciences. They are required for dealing with the
large amounts of data produced in sequencing projects, genomics or proteomics.
Moreover, they are crucial ingredients in making new experimental approaches
feasible... Algorithm development for Bioinformatics applications combines
Mathematics, Statistics, Computer Science as well as Software Engineering to
address the pressing issues of today's biotechnology and build a sound
foundation for tomorrow's advances. Algorithmics Group, Max Planck
Institute for Molecular Genetics, Germany http://algorithmics.molgen.mpg.de/
Rules or a process, particularly in computer
science. In medicine, a step-by-step process for reaching a diagnosis or
ruling out specific diseases. May be expressed as a flow chart in
either sense. Greater efficiencies in algorithms, as well as improvements in computer
hardware, have led to advances in computational biology. A computable set of steps to achieve a desired result.
From the Persian author Abu Ja'far
Mohammed ibn Mûsâ al-Khowârizmî, who wrote a book of
arithmetic rules dating from about 825
A.D. NIST Narrower terms: docking algorithms, sequencing algorithms, genetic
algorithm, heuristic algorithm. Related terms heuristic, parsing; Sequencing
dynamic programming methods.
ANOVA Analysis Of Variance:
Error model based on a standard
statistical approach: a generalization of the familiar t-test that allows
multiple effects to be compared simultaneously. An
ANOVA model is expressed as a large set of equations that can be solved, given a
dataset of measurements, using standard software.
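The comparison ANOVA makes can be seen in a short sketch: a pure-Python one-way ANOVA that returns the F statistic as the ratio of between-group to within-group mean squares. Function name and data layout are illustrative; real analyses would use a statistics package.

```python
# Minimal sketch of one-way ANOVA: compares variation between group means
# to variation within groups via an F statistic (illustrative only).

def one_way_anova_f(groups):
    """Return the F statistic for a list of samples (one list per group)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: each group mean vs. the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: each point vs. its own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)        # mean squares
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

A large F (here, a group with a clearly shifted mean) signals that between-group differences dominate within-group noise.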
bootstrapping: In statistics, bootstrapping is
any test or metric that relies on random
sampling with replacement.
Bootstrapping allows assigning measures of accuracy (defined in
terms of bias, variance, confidence
intervals, prediction error or
some other such measure) to sample estimates.[1][2] This
technique allows estimation of the sampling distribution of almost
any statistic using random sampling methods.[3][4]
Wikipedia accessed 2018 Oct 27
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
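The resampling idea above can be sketched in a few lines: draw many same-size resamples with replacement, recompute the statistic on each, and read a percentile confidence interval off the replicates. Names and defaults are illustrative.

```python
import random
import statistics

# Minimal bootstrap sketch: percentile confidence interval for a statistic.

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)                    # fixed seed for reproducibility
    replicates = sorted(
        stat([rng.choice(data) for _ in data])   # one resample, same size as data
        for _ in range(n_boot)
    )
    lo = replicates[int(n_boot * alpha / 2)]         # 2.5th percentile
    hi = replicates[int(n_boot * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi
```

Any statistic (median, correlation, a fitted parameter) can be plugged in as `stat`, which is exactly why the technique is so widely applicable.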
chaos
theory: a branch of mathematics focusing
on the behavior of dynamical
systems that are highly
sensitive to initial
conditions. 'Chaos' is an
interdisciplinary theory stating that within the apparent randomness
of chaotic
complex systems, there are
underlying patterns, constant feedback
loops, repetition, self-similarity, fractals, self-organization,
and reliance on programming at the initial point known as sensitive
dependence on initial conditions. The butterfly
effect describes how a small
change in one state of a deterministic nonlinear system can result
in large differences in a later state, e.g. a butterfly flapping its
wings in Brazil can cause a hurricane in Texas.[1]
Wikipedia accessed 2018 Oct 27
https://en.wikipedia.org/wiki/Chaos_theory
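Sensitive dependence on initial conditions is easy to demonstrate with the logistic map x → r·x·(1−x) in its chaotic regime (r = 4): two nearly identical starting points stay close at first, then diverge completely. A minimal sketch:

```python
# Logistic-map sketch of the 'butterfly effect': iterate x -> r*x*(1-x)
# with r = 4 (chaotic regime) and watch nearby trajectories diverge.

def logistic_trajectory(x0, steps, r=4.0):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))   # one iteration of the map
    return xs
```

Starting from 0.2 and 0.2000001, the first few iterates agree to several decimal places, yet within a few dozen steps the trajectories bear no resemblance to each other.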
cluster analysis:
The clustering, or grouping, of large
data sets (e.g., chemical and/or pharmacological data sets) on the basis
of similarity criteria for appropriately scaled variables that represent
the data of interest. Similarity criteria (distance based, associative,
correlative, probabilistic) among the several clusters facilitate the recognition
of patterns and reveal otherwise hidden structures (Rouvray, 1990; Willett,
1987, 1991). IUPAC Computational
A set of statistical methods used to group variables or observations into strongly
interrelated subgroups. In
epidemiology, it may be used to analyze a closely grouped series of events or cases of disease or other
health-related
phenomena with well-defined distribution patterns in relation to time or place or both.
MeSH, 1990 Has been used in medicine to create
taxonomies of diseases and
diagnosis and in archaeology to establish taxonomies of stone tools and funereal
objects. Cluster analysis can be
supervised, unsupervised, or
partially supervised. Related terms: clustering
analysis, dendrogram, heat map, pattern
recognition, profile chart.
Narrower terms: hierarchical clustering, k-means clustering
clustering analysis:
This is a general type of
analysis that involves grouping gene or array expression profiles based on
similarity. Clustering is a major subfield within the broad world of numerical
analysis, and many specific clustering methods are known.
coefficient of variation (CV):
The standard
deviation of a set of measurements divided by their mean.
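The definition above is a one-liner in code; a sketch with an illustrative function name:

```python
import statistics

# Coefficient of variation: standard deviation divided by the mean,
# a unitless measure of relative spread.

def coefficient_of_variation(xs):
    return statistics.stdev(xs) / statistics.mean(xs)
```

Because the CV is unitless, it is often used to compare the noise of measurements made on different scales.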
comparative data mining:
Focuses on overlaying large and complex
data sets that are similar to each other ...particularly useful in all
forms of clinical trial meta-analyses ... Here the emphasis is on
finding dissimilarities, not similarities. "Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000 Broader term: data mining
curse of dimensionality:
(Bellman
1961) refers to the exponential growth of hypervolume as a function of
dimensionality. In the field of NNs [neural nets], the curse of dimensionality
expresses itself in two related problems. Janne Sinkkonen "What is
the curse of dimensionality?" Artificial Intelligence FAQ http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-13.html
refers to various phenomena that arise
when analyzing and organizing data in high-dimensional
spaces (often with hundreds or
thousands of dimensions) that do not occur in low-dimensional
settings such as the three-dimensional physical
space of everyday experience.
The expression was coined by Richard
E. Bellman when considering
problems in dynamic
optimization.[1][2]
There are multiple phenomena referred to by this name in domains
such as numerical
analysis, sampling, combinatorics, machine
learning, data
mining and databases.
The common theme of these problems is that when the dimensionality
increases, the volume of
the space increases so fast that the available data become sparse.
Wikipedia accessed 2018 Dec 9
https://en.wikipedia.org/wiki/Curse_of_dimensionality
Related
term: high-dimensionality
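The "volume grows so fast that data become sparse" point can be illustrated numerically: the fraction of a unit hypercube occupied by its inscribed hypersphere collapses toward zero as the dimension rises. A Monte Carlo sketch (parameters illustrative):

```python
import random

# Sketch of the curse of dimensionality: uniformly sampled points in the
# cube [-1, 1]^dim fall inside the unit sphere less and less often as the
# dimension grows, i.e. the data become sparse relative to any fixed region.

def fraction_inside_sphere(dim, n_points=20000, seed=0):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(x * x for x in point) <= 1.0:   # inside the unit sphere?
            inside += 1
    return inside / n_points
```

In 2 dimensions about 79% of points land inside the circle; by 10 dimensions, well under 1% do.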
data warehouse:
In computing,
a data warehouse (DW or DWH),
also known as an enterprise data
warehouse (EDW),
is a system used for reporting and data
analysis,
and is considered a core component of business
intelligence. DWs
are central repositories of integrated data from one or more
disparate sources. They store current and historical data in one
single place that
are used for creating analytical reports for workers throughout the
enterprise.
Wikipedia accessed 2018 Aug 25
https://en.wikipedia.org/wiki/Data_warehouse
decision tree:
a decision
support tool
that uses a tree-like graph or model of
decisions and their possible consequences, including chance event
outcomes, resource costs, and utility.
It is one way to display an algorithm that
only contains conditional control statements. Wikipedia accessed 2018 Jan 26
https://en.wikipedia.org/wiki/Decision_tree
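The point that a decision tree is "an algorithm that only contains conditional control statements" can be shown directly; the thresholds and labels below are invented for illustration (loosely echoing the classic iris example), not a fitted model:

```python
# A tiny decision tree written as plain conditional control statements.
# Thresholds and class labels are illustrative, not a trained classifier.

def classify_iris_like(petal_length, petal_width):
    if petal_length < 2.5:
        return "setosa-like"
    elif petal_width < 1.8:
        return "versicolor-like"
    else:
        return "virginica-like"
```

Each path from the root to a leaf corresponds to one chain of if/else tests, which is exactly how tree-learning software renders its output.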
dendrogram:
A tree diagram that depicts the results of hierarchical
clustering. Often the branches of the tree are drawn with lengths that are
proportional to the distance between the profiles or clusters. Dendrograms are
often combined with heat maps, which can give a clear visual representation of
how well the clustering has worked. Related terms: cluster analysis, heat maps, profile charts
dimensionality reduction:
In statistics, machine
learning, and information
theory, dimensionality
reduction or dimension reduction is the process of reducing the
number of random variables under consideration[1] by
obtaining a set of principal variables. It can be divided into feature
selection and feature
extraction.[2]
Wikipedia accessed 2018 Dec 9 https://en.wikipedia.org/wiki/Dimensionality_reduction
Narrower term: Principal
Components Analysis PCA
error model:
A mathematical formulation that identifies the sources of
error in an experiment. An error model provides a mathematical means of
compensating for the errors in the hope that this will lead to more accurate
estimates of the true expression levels and also provides a means of estimating
the uncertainty in the answers. An error model is generally an approximation of
the real situation and embodies numerous assumptions; therefore, its utility
depends on how good these assumptions are. The model can be expressed as a set
of equations, as an algorithm, or using any other mathematical formalisms. ...
The term error model has become very popular among software providers,
particularly in light of the success of Rosetta’s Resolver, which incorporates
an error model. As a result, some software developers may use the term
inappropriately. Not everything that is called an error model really is one.
evolutionary
algorithm:
An
umbrella term used to describe computer-based problem-solving systems which use
computational models of some of the known mechanisms of EVOLUTION
as key elements in their design and implementation. A variety of EVOLUTIONARY
Algorithms have been proposed. The major ones are: GENETIC
Algorithms (see Q1.1),
EVOLUTIONARY
PROGRAMMING (see Q1.2),
EVOLUTION
Strategies (see Q1.3),
CLASSIFIER
Systems (see Q1.4),
and GENETIC
PROGRAMMING (see Q1.5).
They all share a common conceptual base of simulating the evolution of INDIVIDUAL
structures via processes of SELECTION,
MUTATION,
and REPRODUCTION.
The processes depend on the perceived PERFORMANCE
of the individual structures as defined by an ENVIRONMENT.
More
precisely, EAs
maintain a POPULATION
of structures, that evolve according to rules of selection and other operators,
that are referred to as "search operators", (or GENETIC
Operators), such as RECOMBINATION
and mutation. Each individual in the population receives a measure of its FITNESS
in the environment. Reproduction focuses attention on high fitness individuals,
thus exploiting (cf. EXPLOITATION)
the available fitness information. Recombination and mutation perturb those
individuals, providing general heuristics for EXPLORATION.
Although simplistic from a biologist's viewpoint, these algorithms are
sufficiently complex to provide robust and powerful adaptive search mechanisms.
Heitkötter, Jörg
and Beasley, David, eds. (2001) "The Hitch-Hiker's Guide to Evolutionary
Computation: A list of Frequently Asked Questions (FAQ)",
USENET: comp.ai.genetic Available via anonymous
FTP
from ftp://rtfm.mit.edu/pub/usenet/news.answers/ai-faq/genetic/
evolutionary computation:
In computer
science, evolutionary computation is
a family of algorithms for global
optimization inspired
by biological
evolution,
and the subfield of artificial
intelligence and soft
computing studying
these algorithms. In technical terms, they are a family of population-based trial-and-error problem
solvers with a metaheuristic or stochastic
optimization character.
Wikipedia accessed 2018 Sept 7
https://en.wikipedia.org/wiki/Evolutionary_computation
expert systems:
A computer-based program that encodes rules obtained from process experts,
usually in the form of “if-then” statements. J. Glassey et al.
“Issues in the development of an industrial bioprocess advisory system”
Trends in Biotechnology 18(4): 136-41, April 2000. Related term: artificial intelligence.
fuzzy:
In contrast to binary (true/false) terms, allows for looser
boundaries for sets or concepts.
fuzzy logic:
A superset of conventional (Boolean) logic that
has been extended to handle the concept of partial truth: truth values
between “completely true” and “completely false”. Introduced by Dr.
Lotfi Zadeh (Univ. of California, Berkeley) in the 1960s as a means to model the uncertainty
of natural language. AI FAQ, Carnegie Mellon University Computer Science
Department http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/fuzzy/part1/faq-doc-2.html
Approximate, quantitative reasoning that is concerned with the linguistic ambiguity which exists in natural or
synthetic language. At its core are variables such as good, bad, and young as well as modifiers such as more, less,
and very. These ordinary terms represent fuzzy sets in a particular problem. Fuzzy logic plays a key role in many
medical expert systems. MeSH, 1993
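Partial truth is easy to make concrete. In the common min/max formulation, membership degrees lie in [0, 1] and AND/OR/NOT become min/max/complement; the "young" membership curve below is an invented ramp, purely for illustration:

```python
# Sketch of fuzzy truth values: degrees in [0, 1] rather than {0, 1},
# with AND/OR/NOT modeled as min/max/complement (one common convention).

def membership_young(age):
    """Degree (0..1) to which an age counts as 'young' (illustrative ramp)."""
    if age <= 25:
        return 1.0
    if age >= 50:
        return 0.0
    return (50 - age) / 25.0     # linear falloff between 25 and 50

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a
```

So an age of 37.5 is "young" to degree 0.5, neither completely true nor completely false, which is the linguistic ambiguity the definitions above describe.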
Generative Adversarial Networks (GANs)
are a class of artificial
intelligence algorithms
used in unsupervised
machine learning,
implemented by a system of two neural
networks contesting
with each other in a zero-sum
game framework.
They were introduced by Ian
Goodfellow et
al. in 2014.
…One
network generates candidates (generative) and the
other evaluates them (discriminative). Typically,
the generative network learns to map from a latent
space to a particular data distribution of interest, while the
discriminative network discriminates between instances from the true
data distribution and candidates produced by the generator. The
generative network's training objective is to increase the error
rate of the discriminative network (i.e., "fool" the discriminator
network by producing novel synthesised instances that appear to have
come from the true data distribution).
In practice, a known dataset serves as the initial training data
for the discriminator. Training the discriminator involves
presenting it with samples from the dataset, until it reaches some
level of accuracy. Typically the generator is seeded with a
randomized input that is sampled from a predefined latent space (e.g.
a multivariate normal
distribution).
Thereafter, samples synthesized by the generator are evaluated by
the discriminator. Backpropagation is
applied in both networks so
that the generator produces better images, while the discriminator
becomes more skilled at flagging synthetic images. The
generator is typically a deconvolutional neural network, and the
discriminator is a convolutional neural network.
Wikipedia accessed 2018 Dec 8
https://en.wikipedia.org/wiki/Generative_adversarial_network
Related
terms: dimensionality reduction, high dimensionality
Genetic Algorithm GA: In computer
science and operations
research, a genetic algorithm (GA) is a
metaheuristic
inspired by the process of natural
selection that belongs to the larger class of evolutionary
algorithms (EA). Genetic algorithms are commonly used to
generate high-quality solutions to
optimization and
search
problems by relying on bio-inspired operators such as
mutation,
crossover and
selection.[1]
John Holland introduced the genetic algorithm (GA) in the 1960s, based on
Darwin’s theory of evolution. Wikipedia accessed 2018
October 27
https://en.wikipedia.org/wiki/Genetic_algorithm
Method for library design by evaluating the
fit of a parent library to some desired property (e.g. the level of
activity in a biological assay,
or the computationally determined diversity of the compound set) as
measured by a fitness function. The design of more optimal daughter
libraries is then carried out by a heuristic process with
similarities to genetic selection in that it employs replication,
mutation, deletions etc. over a number of generations. IUPAC
Combinatorial Chemistry
An optimization algorithm based on the
mechanisms of Darwinian evolution which uses random mutation,
crossover and selection procedures to breed better models or
solutions from an originally random starting population or sample.
(Rogers and Hopfinger, 1994). IUPAC Computational Related
terms: evolutionary computation, drug design. Narrower term: genetic
programming
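The selection/crossover/mutation loop described above fits in a short program. Below is a toy GA for the classic OneMax problem (maximize the number of 1-bits in a string); population size, mutation rate, and tournament selection are illustrative choices, not any specific published algorithm:

```python
import random

# Toy genetic algorithm on OneMax: tournament selection, one-point
# crossover, and bit-flip mutation over a small bit-string population.

def genetic_algorithm_onemax(n_bits=20, pop_size=30, generations=60, seed=1):
    rng = random.Random(seed)
    def fitness(ind):
        return sum(ind)                       # count of 1-bits
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: the fitter of two random individuals survives
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = rng.randrange(1, n_bits)    # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Bit-flip mutation with a low per-bit probability
                children.append([bit ^ (rng.random() < 0.02) for bit in child])
        pop = children
    return max(pop, key=fitness)
```

Random initial strings average about half 1-bits; after a few dozen generations of selection pressure, the best individual is at or near all ones.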
genetic programming: In artificial
intelligence, genetic programming (GP) is a technique whereby
computer programs are encoded as a set of genes that are then
modified (evolved) using an evolutionary
algorithm (often a genetic
algorithm, "GA") – it is an application of (for example) genetic
algorithms where the space of solutions consists of computer
programs. The results are computer programs that are able to perform
well in a predefined task. Wikipedia accessed 2018 Oct 27
https://en.wikipedia.org/wiki/Genetic_programming
A subset of genetic algorithms. The members
of the populations are the parse trees of computer programs whose
fitness is evaluated by running them. The reproduction operators
(e.g. crossover) are refined to ensure that the child is
syntactically correct (some protection may be given against semantic
errors too). This is achieved by acting upon subtrees. Genetic
programming is most easily implemented where the computer language
is tree structured so there is no need to explicitly evaluate its
parse tree. This is one of the reasons why Lisp is often used for
genetic programming. This is the common usage of the term genetic
programming; however, it has also been used to refer to the
programming of cellular automata and neural networks using a genetic
algorithm. William Langdon "Genetic programming and data structures
glossary" 2012
https://books.google.com/books?id=SVHhBwAAQBAJ&dq=William+Langdon+%22Genetic+programming+and+data+structures+glossar&source=gbs_navlinks_s
Genetic Programming Organization
http://www.geneticprogramming.org
global schema: A schema, or a map of the data content of a data
warehouse that integrates the schemata from several source repositories.
It is "global", because it is presented to warehouse users as the schema
that they can query against to find and relate information from any of
the sources, or from the aggregate information in the warehouse. Lawrence
Berkeley Lab "Advanced Computational Structural Genomics" Glossary Broader term: schema
Hansch analysis:
The investigation of the quantitative relationship
between the biological activity of a series of compounds and their
physicochemical substituent or global parameters representing hydrophobic,
electronic, steric and other effects using multiple regression correlation
methodology. IUPAC Medicinal Chemistry Related term: QSAR
heat map:
A rectangular display that is a direct translation of
a Cluster format data table. Each cell of the data table is represented as a
small color-coded square in which the color indicates the expression value.
Generally green indicates low values, black medium values, and red high ones,
although this is user-settable. The net effect is a colored picture in which
regions of similar color indicate similar profiles or parts of profiles. Related terms: cluster analysis, dendrogram, profile chart;
Expression
heuristic:
Tools such as
genetic algorithms or neural
networks employ heuristic methods to derive solutions which may be
based on purely empirical information and which have no explicit rationalization.
IUPAC Combinatorial Chemistry
Trial and error methods.
Narrower terms: heuristic
algorithm, metaheuristics
heuristic algorithms:
one
that is designed to solve a problem in a faster and more efficient fashion than
traditional methods by sacrificing optimality, accuracy, precision, or
completeness for speed. Heuristic algorithms are often used to solve
NP-complete problems, a class of decision problems. In these problems, there is
no known efficient way to find a solution quickly and accurately, although
solutions can be verified when given. Heuristics can produce a solution
on their own or be used to provide a good baseline to be supplemented with
optimization algorithms.
https://optimization.mccormick.northwestern.edu/index.php/Heuristic_algorithms
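The trade-off described above can be shown with the classic greedy value-density rule for the (NP-complete) 0/1 knapsack problem; the rule is fast and often good, but sacrifices the guarantee of optimality. A minimal sketch with illustrative names:

```python
# Greedy heuristic for 0/1 knapsack: take items in descending
# value-per-weight order while they fit. Fast, but not always optimal.

def greedy_knapsack(items, capacity):
    """items: list of (value, weight); returns (total_value, chosen_items)."""
    total_value, weight_used, chosen = 0, 0, []
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if weight_used + weight <= capacity:
            chosen.append((value, weight))
            weight_used += weight
            total_value += value
    return total_value, chosen
```

On items [(60, 10), (100, 20), (120, 30)] with capacity 50, the greedy rule returns value 160, while the optimum is 220 (taking the last two items): a concrete case of optimality sacrificed for speed.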
hierarchical clustering:
Unsupervised clustering approach used to
determine patterns in gene expression data. Output is a tree-like structure. Related terms: cluster analysis, self-organizing maps
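The bottom-up (agglomerative) variant can be sketched concisely: start with every point as its own cluster and repeatedly merge the two closest clusters. This toy uses 1-D points and single linkage (distance between closest members); function names are illustrative.

```python
# Sketch of agglomerative hierarchical clustering with single linkage
# on 1-D points: merge the closest pair of clusters until n_clusters remain.

def single_linkage_clusters(points, n_clusters):
    clusters = [[p] for p in points]          # start: every point is a cluster

    def linkage(c1, c2):                      # single linkage: closest pair
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Recording the sequence of merges (rather than stopping at n_clusters) is what yields the tree-like structure drawn as a dendrogram.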
high-dimensionality:
Many applications of machine learning methods
in domains such as information retrieval, natural language processing, molecular
biology, neuroscience, and economics have to be able to deal with various sorts
of discrete data that is typically of very high dimensionality. One standard approach to deal with high-dimensional data is to perform a
dimension reduction and map the data to some lower dimensional representation.
Reducing the data dimensionality is often a valuable analysis by itself, but it
might also serve as a pre-processing step to improve or accelerate subsequent
stages such as classification or regression. Two closely related methods that
are often used in this context and that can be found in virtually every textbook
on unsupervised learning are principal component analysis (PCA) and
factor analysis. Thomas Hofmann, Brown Univ. Statistical Learning in
High Dimensions, Breckenridge CO, Dec. 1999 http://www2.cs.cmu.edu/~mmp/workshopnips99/speakers.html
See also under learning algorithms;
Related terms: cluster analysis, curse of dimensionality, dimensionality
reduction, Generative Adversarial Networks GANs, ill-posed problem, neural nets, principal components analysis
ill-posed problems:
Problems that are not
well-posed in the sense of Hadamard are termed ill-posed. Inverse
problems are often
ill-posed. ... Continuum models must often be discretized in
order to obtain a numerical solution. While solutions may be continuous with
respect to the initial conditions, they may suffer from numerical
instability when
solved with finite precision, or with errors in the data. Even if a problem is
well-posed, it may still be ill-conditioned, meaning that a small error
in the initial data can result in much larger errors in the answers. An
ill-conditioned problem is indicated by a large condition
number…
If [a problem] is not well-posed, it needs to be reformulated for numerical
treatment. Typically this involves including additional assumptions, such as
smoothness of solution. This process is known as regularization.
Wikipedia accessed 2018
Sept 7
https://en.wikipedia.org/wiki/Well-posed_problem
Problems without a unique solution, or without any solution. Life
sciences data tend to be very noisy, leading to ill-posed problems;
interpretation of microarray gene expression
data is an ill-posed problem. Compare: well-posed problem
influence-based data mining:
Complex and granular (as opposed
to linear) data in large databases are scanned for influences between specific
data sets, and this is done along many dimensions and in multi-table formats.
These systems find applications wherever there are significant cause-and-effect relationships between data sets, as occurs, for example, in large
and multivariate gene expression studies, which are behind areas such as
pharmacogenomics. "Data mining" Nature
Biotechnology Vol. 18: 237-238 Supp. Oct. 2000 Broader
term: data mining
information theory:
Founded by Claude Shannon in the 1940s, information theory has had an
enormous impact on communications engineering and computer sciences.
https://www.scientificamerican.com/article/claude-e-shannon-founder/
k-means clustering:
The researcher picks a value for k, say k = 10,
and the algorithm divides the data into that many clusters in such a way that
the profiles within each cluster are more similar than those across clusters.
The actual algorithms for this are quite sophisticated. Although the core
algorithms require that a value of k be selected up front, methods exist that
adaptively select good values for k by running the core algorithm several times
with different values. A non-hierarchical method. Broader terms: cluster analysis, neural nets
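The core assign-then-update loop described above is compact enough to sketch; this toy works on 1-D data, but clustering expression profiles works the same way in more dimensions. Names and defaults are illustrative.

```python
import random

# Minimal k-means sketch on 1-D data: assign each point to its nearest
# centroid, recompute centroids as cluster means, repeat until stable.

def kmeans_1d(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct starting points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]   # update step
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged
            break
        centroids = new_centroids
    return sorted(centroids)
```

On two well-separated groups of points, the centroids settle on the two group means regardless of which points were chosen as starting centroids.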
Knowledge Discovery in Databases (KDD):
The notion of Knowledge Discovery in Databases (KDD) has been given various
names, including data mining, knowledge extraction, data pattern
processing, data archaeology, information harvesting, siftware, and even (when
done poorly) data dredging. Whatever the name, the essence of KDD is the
"nontrivial extraction of implicit, previously unknown, and potentially
useful information from data" (Frawley et al 1992). KDD encompasses a
number of different technical approaches, such as clustering, data
summarization, learning classification rules, finding dependency networks,
analyzing changes, and detecting anomalies (see Matheus et al 1993). Gregory
Piatetsky-Shapiro, KDD Nuggets FAQ, KDD Nuggets News, 1994 http://www.kdnuggets.com/news/94/n6.txt
Related term: data mining
latent variables:
In statistics, latent
variables (from Latin: present
participle of lateo (“lie
hidden”), as opposed to observable
variables), are variables that
are not directly observed but are rather inferred (through
a mathematical
model) from other variables
that are observed (directly measured). Mathematical models that aim
to explain observed variables in terms of latent variables are
called latent
variable models. Latent
variable models are used in many disciplines, including
psychology,
demography,
economics, engineering, medicine, physics, machine
learning/artificial
intelligence, bioinformatics, natural
language processing, econometrics, management and
the social
sciences. …One advantage of
using latent variables is that they can serve to reduce
the dimensionality of data. A
large number of observable variables can be aggregated in a model to
represent an underlying concept, making it easier to understand the
data. In this sense, they serve a function similar to that of
scientific theories. At the same time, latent variables link
observable ("subsymbolic")
data in the real world to symbolic data in the modeled world.
Wikipedia accessed 2018 Dec 9
https://en.wikipedia.org/wiki/Latent_variable
MathML:
Intended to
facilitate the use and reuse of mathematical and scientific content on the Web,
and for other applications such as computer algebra systems, print typesetting,
and voice synthesis. W3C http://www.w3.org/Math/whatIsMathML.html
metadata: Ontologies & taxonomies
metaheuristic:
In computer
science and mathematical
optimization,
a metaheuristic is
a higher-level procedure or heuristic designed
to find, generate, or select a heuristic (partial search
algorithm)
that may provide a sufficiently good solution to an optimization
problem,
especially with incomplete or imperfect information or limited computation
capacity. Metaheuristics
sample a set of solutions which is too large to be completely sampled.
Metaheuristics may make few assumptions about the optimization problem being
solved, and so they may be usable for a variety of problems.
Wikipedia accessed 2018 Jan 26
https://en.wikipedia.org/wiki/Metaheuristic
Monte Carlo technique: A
simulation procedure consisting of randomly sampling the
conformational space of a molecule. IUPAC Computational Broader
term: simulation
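The underlying idea, estimating a quantity by drawing random samples, is generic; the same sample-and-score approach used to explore a molecule's conformational space is illustrated here with the textbook estimate of π from random points in a square (names and sample size are illustrative):

```python
import random

# Generic Monte Carlo sketch: estimate a quantity from random samples.
# Here, the area ratio of a quarter circle to the unit square estimates pi.

def monte_carlo_pi(n_samples=100000, seed=0):
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0   # inside quarter circle
    )
    return 4.0 * hits / n_samples
```

Accuracy improves only as the square root of the sample count, which is why Monte Carlo conformational searches need many samples.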
multivariate statistics:
A set of statistical tools to analyze
data (e.g., chemical and biological) matrices using regression and/or pattern
recognition techniques. IUPAC Computational
neural networks: Data Science
normalization:
A knotty area in any measurement process, because
it is here that imperfections in equipment and procedures are addressed. The
specifics of normalization evolve as a field matures since the process usually
gets better, and one’s understanding of the imperfections also gets better. In
the microarray field, even larger changes are occurring as robust statistical
methods are being adopted. See also normalization Microarrays
Narrower terms: thresholding
OASIS Organization for the
Advancement of Structured Information Standards:
A
not-for-profit, global consortium that drives the development, convergence and
adoption of e-business standards. http://www.oasis-open.org/who/
OASIS Glossary of terms
http://www.oasis-open.org/glossary/index.php
Open standards
parsing:
Using algorithms to analyze data into components. Semantic
parsing involves trying to figure out what the components mean. Lexical
parsing refers to the process of deconstructing the data into components.
Narrower term:
gene parsing (Drug discovery informatics)
pattern
recognition PR: The
identification of patterns in large data sets using appropriate
mathematical methodologies. Examples are principal component
analysis (PCA), SIMCA, partial least squares (PLS) and artificial
neural networks (ANN) (Rouvray, 1990; Van de Waterbeemd, 1995ab)
IUPAC Computational
Narrower terms: artificial neural networks, molecular pattern
recognition, principal component analysis (PCA), SIMCA, partial
least squares (PLS)
predictive data mining:
Combines pattern matching, influence
relationships, time set correlations, and dissimilarity analysis to offer
simulations of future data sets... these systems are capable of incorporating
entire data sets into their workings, and not just samples, which makes their
accuracy significantly higher ... used often in clinical trial analysis
and in structure-function correlations. "Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000 Broader term: data mining
Principal Components Analysis PCA:
Computational approach to reducing the complexity of, for example, a
set of descriptors, by identifying those features which provide the
major contributions to observed properties, and thus reducing the
dimensionality of the relevant property space. IUPAC Combinatorial
Chemistry
A data reduction method using mathematical
techniques to identify patterns in a data matrix. The main element
of this approach consists of the construction of a small set of new
orthogonal, i.e., non-correlated, variables derived from a linear
combination of the original variables. IUPAC Computational
Often confused
with common factor analysis. Neural Network FAQ Part 1
ftp://ftp.sas.com/pub/neural/FAQ.html
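The core of PCA, finding the direction of greatest variance as the dominant eigenvector of the covariance matrix, can be sketched on 2-D data using simple power iteration. This is a minimal illustration, not a full PCA implementation:

```python
import math

# Sketch of PCA's key step: the first principal component of 2-D data,
# found as the dominant eigenvector of the covariance matrix by power
# iteration. Illustrative only; real PCA uses a linear-algebra library.

def first_principal_component(data, iters=200):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # 2x2 covariance matrix entries
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    v = (1.0, 0.0)                      # arbitrary starting direction
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)  # renormalize each step
    return v
```

For points lying along the line y = x, the returned unit vector points along (1/√2, 1/√2), the direction carrying all the variance.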
probability:
Probability web http://www.mathcs.carleton.edu/probweb/probweb.html
Probability web resources include journals, societies, and quotes.
recursive partitioning: Process for
identifying complex structure-activity relationships in large sets
by dividing compounds into a hierarchy of smaller and more
homogeneous subgroups on the basis of the statistically most
significant descriptors. IUPAC Combinatorial Chemistry
Related terms: clustering, principal components analysis
regression analysis:
The use of
statistical methods for
modeling
a set of dependent variables, Y, in terms of combinations of
predictors, X. It includes methods such as multiple linear
regression (MLR) and partial least squares (PLS). IUPAC
Computational
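The simplest case, one predictor fitted by ordinary least squares, can be written out from the closed-form normal-equation solution (function name illustrative):

```python
# Sketch of the simplest regression analysis: ordinary least squares for
# one predictor, y ~ slope * x + intercept, via the closed-form solution.

def linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Multiple linear regression (MLR) and PLS generalize this same idea to many predictors at once.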
regression to the mean:
A common misconception about genetics has to
do with overgeneralization about the likelihood of increased quality by selective breeding.
Two very tall parents will tend to produce offspring who are taller than the
average population, but less tall than the average of the parents'
heights. Or, as the story goes, a famous beauty suggested to George Bernard
Shaw that they have a child: "With your brains and my looks ..." He is said
to have replied, "But what if the child had my looks and your brains?"
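The effect is easy to reproduce in a toy simulation (Python with NumPy; the 70% regression factor and the height figures are illustrative assumptions, not data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
mean, sd = 170.0, 7.0                          # population height, cm

father = rng.normal(mean, sd, n)
mother = rng.normal(mean, sd, n)
midparent = (father + mother) / 2
# Offspring inherit only 70% of the midparent deviation from the mean
offspring = mean + 0.7 * (midparent - mean) + rng.normal(0, 5.0, n)

tall = midparent > 182                         # select very tall parent pairs
```

For the selected families, offspring average well above the population mean yet below their parents' average, exactly as the glossary entry describes.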
self-organization:
also called (in the social
sciences) spontaneous
order, is a process
where some form of overall order arises
from local interactions between parts of an initially disordered system.
The process is spontaneous, not needing control by any external agent. It is
often triggered by random fluctuations,
amplified by positive
feedback. The
resulting organization is wholly decentralized, distributed over
all the components of the system. As such, the organization is typically robust and
able to survive or self-repair substantial perturbation. Chaos
theory discusses
self-organization in terms of islands of predictability in
a sea of chaotic unpredictability.
Self-organization occurs in many physical, chemical, biological, robotic,
and cognitive systems.
Examples of self-organization include crystallization,
thermal convection of
fluids, chemical
oscillation, animal swarming, neural
circuits, and artificial
neural networks.
Wikipedia accessed 2018 Sep 7
https://en.wikipedia.org/wiki/Selforganization
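A minimal illustration (Python with NumPy; an illustrative sketch, not from the Wikipedia source): a one-dimensional majority-rule cellular automaton, in which purely local interactions smooth away an isolated defect with no external control:

```python
import numpy as np

def majority_step(state):
    """One update of a 1-D majority-rule cellular automaton: each cell
    adopts the majority value of itself and its two neighbours
    (periodic boundary conditions)."""
    left = np.roll(state, 1)
    right = np.roll(state, -1)
    return ((left + state + right) >= 2).astype(int)

# A lone defect in an ordered background is erased by local interactions:
state = majority_step(np.array([0, 0, 0, 1, 0, 0, 0, 0]))
```

Each cell sees only its immediate neighbours, yet the array as a whole settles into ordered domains, a small-scale analogue of order arising from local interactions.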
SIMCA (SIMple Classification Analysis or Soft Independent Modeling
of Class Analogy): This
method is a pattern recognition and classification technique (Dunn
and Wold, 1995). IUPAC Computational
time delay data mining:
The data is collected over time and systems
are designed to look for patterns that are confirmed or rejected as the
data set increases and becomes more robust. This approach is geared
toward long term clinical trial analysis and multicomponent
mode of action
studies. "Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct.
2000 Broader term: data mining
trends-based data mining:
Software analyzes large and complex
data sets in terms of any changes that occur in specific data sets over
time. Data sets can be user-defined or the system can uncover them
itself... This is especially important in cause-and-effect biological experiments.
Screening is a good example. "Data mining" Nature Biotechnology Vol. 18:
237-238 Supp. Oct. 2000 Broader term: data mining
well-posed problem:
The mathematical term well-posed
problem stems from a definition given by Jacques
Hadamard. He
believed that mathematical models of physical phenomena should have the
properties that: a solution exists, the solution is unique, the solution's
behavior changes continuously with the initial conditions. ...
If the problem is well-posed, then it stands a good chance of solution on a
computer using a stable
algorithm.
Wikipedia
accessed 2018 Sept 7
https://en.wikipedia.org/wiki/Wellposed_problem
Compare: ill-posed problems
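As an illustration (Python with NumPy; the Hilbert-matrix example is an assumption of this sketch, not from the cited source), a severely ill-conditioned linear system shows how Hadamard's third condition, continuous dependence on the data, can fail in practice on a computer:

```python
import numpy as np

# The Hilbert matrix is a classic ill-conditioned system: tiny changes
# in the data b produce enormous changes in the computed solution.
n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = H @ x_true

x = np.linalg.solve(H, b)                 # recovered from exact data

# Perturb b by ~1e-10 along the worst-case (smallest singular) direction
U, s, Vt = np.linalg.svd(H)
x_noisy = np.linalg.solve(H, b + 1e-10 * U[:, -1])

cond = np.linalg.cond(H)                  # on the order of 1e13
```

The exact-data solve recovers x_true well, while a perturbation of order 1e-10 in b destroys the solution, so no stable algorithm can rescue the computation without regularization.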
Algorithms resources
Algorithms, terms and definitions, Hans-Georg Beyer, Eva Brucherseifer, Wilfried
Jakob, Hartmut Pohlheim, Bernhard Sendhoff, Thanh Binh To, 2002
http://ls11www.cs.unidortmund.de/people/beyer/EAglossary/
Gary Flake, Computational Beauty of Nature: Computer Explorations of
Fractals, Chaos, Complex Systems and Adaptation. Glossary MIT Press, 2000.
280+ definitions. http://mitpress.mit.edu/books/FLAOH/cbnhtml/glossaryintro.html
Glossary of Probability and Statistics, Wikipedia
https://en.wikipedia.org/wiki/Glossary_of_probability_and_statistics
IUPAC Glossary of Terms Used in Combinatorial Chemistry, D. Maclean, J.J.
Baldwin, V.T. Ivanov, Y. Kato, A. Shaw, P. Schneider, and E.M. Gordon, Pure
Appl. Chem., Vol. 71, No. 12, pp. 2349-2365, 1999, 100+ definitions http://www.iupac.org/reports/1999/7112maclean/
IUPAC
Glossary of Terms used in Computational Drug Design Part II 2015
https://www.degruyter.com/downloadpdf/j/pac.2016.88.issue3/pac20121204/pac20121204.pdf
NIST National Institute of Standards and Technology, Dictionary of
Algorithms, Data Structures and Problems, Paul Black, 2001, 1300+ terms
http://www.nist.gov/dads/
How to look for other unfamiliar terms
IUPAC definitions are reprinted with the permission of the International
Union of Pure and Applied Chemistry.
