BIOINFORMATICS Vol. 17 no. 5 2001, Pages 438–444

Unsupervised classification of noisy chromosomes

Tony Y. T. Chan

The University of Aizu, Aizu-Wakamatsu Shi, Fukushima Ken, 965-80 Japan

Received on May 17, 2000; accepted on December 22, 2000

ABSTRACT
Motivation: Almost all methods of chromosome recognition assume supervised training; i.e. we are given correctly classified chromosomes to start the training phase. Noise, if any, is confined to the representation of the chromosomes and not to their classification. During the recognition phase, the problem is simply to calculate the string edit distance from the unknowns to the representatives chosen during the training phase and to classify the unknowns accordingly.
Results: In this paper, a general method for tackling the difficult unsupervised induction problem is described. The success of the method is demonstrated by showing how the inductive agent learns weights in a dynamic manner that allows it to distinguish between noisy median and telocentric chromosomes without knowing their proper labels. The process of learning is characterized as the process of finding the right distance function, i.e. the distance function that can cleanly separate the classes.
Contact: [email protected]

INTRODUCTION
Lindenmayer (1968) introduced the idea of using formal languages to model the developmental processes and structures of organisms (see also Prusinkiewicz and Lindenmayer, 1990). Fu (1982) popularized the use of formal grammars for chromosome recognition. Chromosomes are represented by strings, i.e. sentences of formal languages, and classes of chromosomes are represented by grammars. The problem of identifying a chromosome becomes that of parsing a sentence to find out to which of the languages it belongs. Goldfarb (1984) pioneered a new metric approach to pattern recognition. Goldfarb and Chan (1984) applied the metric approach to chromosome recognition; edit distance is used to calculate the distance between two chromosomes represented by strings. Clarson and Liang (1989) used the metric approach to classify brain waves. Myers and Miller (1988) presented a space-saving algorithm for the calculation of edit distances, and Chan (1999b) described a parallel space-saving version written in APL. Uses of principal component analysis are common (Goldfarb and Chan, 1984; Forster et al., 1999). All these methods assume the supervised case, i.e. correct classifications for all training examples.

Based on Goldfarb's metric approach to pattern recognition, Chan (1999a) proposed a model for unsupervised inductive learning with two classes. The model was applied successfully to the benchmark exclusive-or and parity problems. This paper deals with the problems of supervised and unsupervised recognition of chromosomes. The computer learning agent is never told what median and telocentric chromosomes are, yet it succeeds in autonomously setting up an on-line recognition system that checks for median and telocentric chromosomes. Unlike the case of the parity problem, these chromosomes can be distinguished without resorting to the macro transformation system; the basic transformation system is sufficient.

A pattern language describes a set P of objects. An object could be represented by a vector, a string, a graph, etc., depending on the language. We will call this set the object set. The object set is divided into exactly one special subset Q ⊂ P and its complement Q̄ ⊂ P. Presumably, each of these subsets is a class of objects. The 'teacher' supplies two finite training groups of objects to the agent, one from Q and one from Q̄. Training group Q̌ is labeled positive; training group Q̄̌ is, by default, labeled negative. Their union Q̌ ∪ Q̄̌ is called the training set. The agent then autonomously devises a way to separate the (possibly infinite) positive subset Q from the (possibly infinite) negative subset Q̄ on the basis of the two training groups. Hence, the next time an unknown object p ∈ P is presented to the agent, it will be able to classify it as either positive or negative. The induction problem is the problem of building such an agent. Such an agent is said to be intelligent, to be able to learn, and to be able to perform self-organization. The more adaptive the agent, the better it can learn to classify objects from diverse and complex pattern languages and classes.

THE BASIC MODEL
Two foundational concepts of the new learning model for the unsupervised inductive learning process are transformation systems and stability optimization. Let T = (P, S) be a transformation system, where P is an underlying set of structural objects described by a pattern language. Let s = x ↔ y be a substitution operation, where x and y are subobjects. An object X can be transformed into the object Y via rule s by matching a subobject x in X and replacing it with y. Substitution rules are bidirectional, in that the substitution of subobject x by subobject y implies that the substitution of y by x is also possible. Either x or y could be the empty string θ (or the empty graph in general); in that case, the operation is called an insertion or a deletion. The set S = (s_1, s_2, ..., s_m) is a list of m bidirectional substitution operations.

We associate each transformation system (P, S) with its stability optimization through the following very natural and convenient ideas. First, we introduce into the transformation system (P, S) an intrinsic distance function Δ_1 such that Δ_1(p_1, p_2) is the minimum number of substitutions needed to transform object p_1 into object p_2, where p_1, p_2 ∈ P. Second, with each substitution s_i we associate a weight w_i, w_i ≥ 0, so that it costs w_i to apply s_i. Let W = (w_1, w_2, ..., w_m), and let Δ_W(p_1, p_2) be the smallest total cost of transforming p_1 into p_2. It follows from the earlier definition that Δ_1 = Δ_(1,1,...,1), where all the substitution rules have the same weight. Third, in order to lead to the optimization procedure later on, we let W vary, subject to a constraint: we associate the transformation system T with the parametric family of distance functions {Δ_W | Σ_{i=1}^{m} w_i = 1}, i.e. a family of distance functions whose weights sum to unity.

DEFINITION 1. The average intra-group distance for a positive training group Q̌ is

    \rho_{\check{Q}}(W) \;=\; \frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \Delta_W(\check{q}_i, \check{q}_j),

where Q̌ = {q̌_1, q̌_2, ..., q̌_n}. Given a specific training group and a specific distance function, ρ returns the average distance within a group of training objects. Note that for this formula to work, there must be at least two positive training objects. When n is equal to 1, we trivially define ρ = 0.

DEFINITION 2. The average inter-group distance between groups Q̌ and Q̄̌ is

    \upsilon_{\check{Q},\check{\bar{Q}}}(W) \;=\; \frac{1}{n\bar{n}} \sum_{i=1}^{n} \sum_{j=1}^{\bar{n}} \Delta_W(\check{q}_i, \check{\bar{q}}_j),

where Q̄̌ = {q̄̌_1, q̄̌_2, ..., q̄̌_n̄}. Here n̄ can equal 1; only one negative training object is needed. In essence, these two rather standard definitions capture the idea of the average distance in a distance table, where distances are listed between pairs of objects.

DEFINITION 3. The stability quotient is

    Z_{\check{Q},\check{\bar{Q}}}(W) \;=\; \frac{\rho_{\check{Q}}(W)}{\upsilon_{\check{Q},\check{\bar{Q}}}(W)}.

This important quotient serves as a goodness or fitness measure for the clustering/separation of groups (i.e. clusters). For these three definitions, we will drop the subscripts when it is clear from the context which subscripts are meant.

The learning algorithm, or strategy, is extremely simple. The stability quotient serves as the objective function of an optimization procedure, so that we simultaneously minimize the within-group distance and maximize the between-group distance. Obviously, we would like to configure the topology in such a way that, within the positive group, objects are close to each other, while at the same time they are far from objects not in the positive group.

DEFINITION 4. The stability optimization is to minimize Z_{\check{Q},\check{\bar{Q}}}(W) subject to the constraint that

    \sum_{i=1}^{m} w_i = 1.

In other words, we want Z to get as close to zero as possible, since negative costs are not allowed. In the supervised case, we consider Q̌ and Q̄̌ as fixed, each containing correctly classified objects, and we vary only the weights. In the unsupervised case, we vary the pair (Q̌, Q̄̌) as well as the weights.

After the training phase is over, one can prepare for the recognition stage, or performance stage. Let W* be the optimal cost vector and p be the unknown. In the ideal situation (Z* = 0), one Δ_W* distance calculation between any positive training object q̌ and the unknown is sufficient to classify the unknown. If the distance is zero, then the unknown belongs to the positive class; otherwise, it belongs to the negative class. A good rule of thumb here is to choose the positive training object with the shortest description; e.g. in the case of strings, it is good to choose the one with the smallest length. This choice makes the distance calculation more efficient.
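The quantities in Definitions 1-3 translate directly into code. The following is a minimal sketch (the function names are illustrative, not part of the notation above) of ρ, υ and Z, parameterized by an arbitrary distance function such as Δ_W; it assumes the two groups are not at zero distance from each other, so that υ > 0:

    from typing import Callable, Sequence

    Distance = Callable[[str, str], float]

    def intra_group_distance(group: Sequence[str], dist: Distance) -> float:
        """rho: average distance over all unordered pairs; 0 for a singleton group."""
        n = len(group)
        if n < 2:
            return 0.0
        total = sum(dist(group[i], group[j]) for i in range(1, n) for j in range(i))
        return 2.0 * total / (n * (n - 1))

    def inter_group_distance(pos: Sequence[str], neg: Sequence[str], dist: Distance) -> float:
        """upsilon: average distance over all positive/negative pairs."""
        return sum(dist(p, q) for p in pos for q in neg) / (len(pos) * len(neg))

    def stability_quotient(pos: Sequence[str], neg: Sequence[str], dist: Distance) -> float:
        """Z = rho / upsilon, the objective minimized by the learning strategy."""
        return intra_group_distance(pos, dist) / inter_group_distance(pos, neg, dist)

    # Toy example with a deliberately crude distance (absolute length difference):
    toy = lambda u, v: abs(len(u) - len(v))
    print(stability_quotient(["ab", "abc"], ["abcdefgh"], toy))   # 1/5.5, about 0.18

Any constrained optimizer over the weight simplex can then use stability_quotient as its objective.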


Fig. 1. Primitive patterns for chromosomes.

In the less than ideal cases, the stable substitution cost vector W, the associated distance function Δ_W, and the learning groups Q̌ and Q̄̌ form a finite metric space (Q̌ ∪ Q̄̌, Δ_W), which can then be transformed into a vector space of training vectors. A set of reference objects from the learning groups must be selected to form the 'basis' of the vector space. Unknown objects can be transformed into vectors via this 'basis' and classified using the classical techniques available in pattern recognition. Alternatively, conventional nearest-neighbor classification can be used within the metric space itself, without the vector space. To be more precise, there are three distinct stages. The first stage is usually off-line; it involves learning the optimal cost vector. The second stage is also usually off-line and is sometimes optional; it involves the isometric embedding. The third stage is usually on-line; it involves real-time recognition.
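As a simple stand-in for the embedding of the second stage (whose details are beyond the scope of this sketch), the following maps an object to the vector of its distances to a chosen set of reference objects, and also shows the nearest-neighbor alternative just mentioned; the helper names are hypothetical:

    from typing import Callable, Sequence

    def embed(obj: str, references: Sequence[str],
              dist: Callable[[str, str], float]) -> list[float]:
        """Map an object to the vector of its distances to the reference objects,
        after which any conventional vector-space classifier can be applied."""
        return [dist(obj, ref) for ref in references]

    def nearest_neighbor_label(obj: str, labelled: Sequence[tuple[str, str]],
                               dist: Callable[[str, str], float]) -> str:
        """Classify directly in the metric space: return the label of the closest
        labelled training object."""
        return min(labelled, key=lambda pair: dist(obj, pair[0]))[1]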

LEARNING MEDIAN AND TELOCENTRIC CHROMOSOMES
We now apply the general learning strategy from the previous section to the problem of classifying chromosomes. To simplify the discussion, we shall limit our substitution operations to insertions/deletions only. (A substitution can, in any case, be simulated by a deletion followed by an insertion.) We will use the term insertion or deletion to mean the insertion/deletion operation, since when we have one we always have the other, in accordance with the definition of the transformation system. Consider the set P of cyclic strings made up from the five primitives in Figure 1. The median chromosome in Figure 2 can be represented by the cyclic string abbdbbabbcbbabbdbbabbcbb, and the telocentric chromosome in Figure 3 is encoded by ebbbabbbcbbbabbb. The transformation system for this problem consists of the set of all finite strings made up from a, b, c, d and e, plus the set containing the insertion (deletion) of a, of b, of c, of d and of e. So we have an infinite number of objects in P and five operations in S. This transformation system is complete in the sense that any object from P can be transformed into any other object from P.


Fig. 2. A median chromosome abbdbbabbcbbabbdbbabbcbb.


Fig. 3. A telocentric chromosome ebbbabbbcbbbabbb.

We assign each of the five insertion operations a weight, or cost, subject to the constraint that the sum of the five weights is 1. Now we can define the distance between any two objects as the cheapest total cost of transforming one object into the other.
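A minimal sketch of this distance follows, under the assumption that the distance between two cyclic strings is taken as the minimum, over all rotations of one of them, of the weighted insertion/deletion distance between the corresponding linear strings (the function names are illustrative):

    def weighted_indel_distance(s: str, t: str, costs: dict[str, float]) -> float:
        """Cheapest way to turn the linear string s into t using weighted
        single-symbol insertions and deletions (dynamic programming)."""
        d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(1, len(s) + 1):
            d[i][0] = d[i - 1][0] + costs[s[i - 1]]                 # delete s[i-1]
        for j in range(1, len(t) + 1):
            d[0][j] = d[0][j - 1] + costs[t[j - 1]]                 # insert t[j-1]
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                d[i][j] = min(d[i - 1][j] + costs[s[i - 1]],        # delete
                              d[i][j - 1] + costs[t[j - 1]],        # insert
                              d[i - 1][j - 1] if s[i - 1] == t[j - 1] else float("inf"))
        return d[len(s)][len(t)]

    def cyclic_indel_distance(s: str, t: str, costs: dict[str, float]) -> float:
        """Minimum of the linear distance over all rotations of s."""
        rotations = (s[k:] + s[:k] for k in range(len(s) or 1))
        return min(weighted_indel_distance(r, t, costs) for r in rotations)

    median = "abbdbbabbcbbabbdbbabbcbb"
    telocentric = "ebbbabbbcbbbabbb"
    equal = {sym: 0.2 for sym in "abcde"}                     # weights summing to one
    e_only = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1}
    print(cyclic_indel_distance(median, telocentric, equal))
    print(cyclic_indel_distance(median, telocentric, e_only))  # 1.0: only the e is charged

For long strings, the rotation sweep can be combined with the linear-space alignment of Myers and Miller (1988), cited above, to keep the memory of the inner dynamic program proportional to the shorter string.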

Supervised: no noise
We begin the application of the unifying metric theory with the simplest case. All the positive objects in the positive training group are perfect examples of the median chromosome, and all the negative objects in the negative training group are perfect examples of the telocentric chromosome. In other words, there are no mis-classifications in the training groups and no noise associated with any of the training objects. Presented with the problem in Table 1, the learning agent wants to find the optimal costs for inserting a, b, c, d and e, subject to the constraint that the sum of the costs is unity, in order to produce the most stable Z value. Basically, it wants to find the metric space (Q̌ ∪ Q̄̌, Δ_W) such that the distance between the two median chromosomes is as small as possible, and the average distance between the telocentric chromosome and the median chromosomes is as large as possible.


Table 1. Case 1 training set

       Positive objects
    1  abbdbbabbcbbabbdbbabbcbb
    2  babdbabcbabdbabc
       Negative object
    3  ebbbabbbcbbbabbb

In less than 1 s of CPU time on a Sun workstation, it found the cost vector (0, 0, 0, 0, 1), so that Z* = 0. This solution is ideal.

Let us examine the weight space generated by W = (w_1, w_2, w_3, w_4, w_5) such that w_1 + w_2 + w_3 + w_4 + w_5 = 1. This is the 4-dimensional unit simplex. The stability values for the five corners and the midpoints of the various edges and faces are shown in Table 2. We see that 15 of the 31 possible cost vectors have the ideal stability value: the optimization function is riddled with global optima, which explains the quickness of the optimization process in this case. The learning agent found the first cost vector in this list to be a perfect answer to the problem of classifying the two types of chromosomes represented by the training set. It could have found any one of those 15, depending on the actual optimization procedure used.

To prepare for the on-line recognition stage, the agent chose the second training object from the positive group as the reference object, or prototype, because it was shorter than the other training object in the positive group. So it stored away W* = (0, 0, 0, 0, 1) and babdbabcbabdbabc. Now, at the on-line recognition stage, when the agent is presented with a chromosome x ∈ P, it calculates Δ_W*(x, babdbabcbabdbabc). If the resulting distance is 0, x is classified as median; otherwise, x is classified as telocentric. For all unknown chromosomes represented without noise, this simple calculation results in 100% correct classification. For those unknowns with noise, all are classified correctly except when the noise involves the primitive e. Even with this noise, the classification is sometimes correct; and when it is not, a human, too, would have the same trouble, because the unknown is simply too noisy to be identified with certainty.

Contrast this with the neural net approach, in which the Euclidean geometry is fixed by the object representation, classes are regions in the space, and learning amounts to finding a partition of this fixed Euclidean space; the unifying metric approach is of a fundamentally different character. Even from this simple example, we can see how a continuous structure (a metric geometry) is built on a set of discrete training objects. This construction is automatic, given only training objects and substitution rules, i.e. the transformation system.
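With W* and the prototype stored, the on-line stage is a one-line check; a minimal sketch, reusing the cyclic_indel_distance helper from the sketch above:

    W_STAR = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 1.0}
    PROTOTYPE = "babdbabcbabdbabc"

    def classify(x: str) -> str:
        """Median if x is at W*-distance zero from the stored prototype."""
        dist = cyclic_indel_distance(x, PROTOTYPE, W_STAR)
        return "median" if dist == 0.0 else "telocentric"

    print(classify("abbdbbabbcbbabbdbbabbcbb"))   # median
    print(classify("ebbbabbbcbbbabbb"))           # telocentric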

Table 2. Cost vectors and stability values (corners of the weight simplex; of the full set of 31 corner and midpoint cost vectors examined, 15 attain the ideal value Z = 0)

    Cost vector          Z
    (0, 0, 0, 0, 1)      0
    (0, 0, 0, 1, 0)      0
    (0, 0, 1, 0, 0)      0
    (0, 1, 0, 0, 0)      2
    (1, 0, 0, 0, 0)      0

This construction is meaningful for the purpose of classification. And this construction is dynamic; i.e. it is not based on a fixed space. In fact, a whole family of metric spaces, indexed by Δ_W, is examined implicitly before the optimal metric space is found.
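In the spirit of Table 2, the sweep below evaluates the stability quotient of the Case 1 training set at the corners and midpoints of the simplex, i.e. at the uniform cost vectors over every non-empty subset of the five operations; it reuses cyclic_indel_distance and stability_quotient from the earlier sketches:

    from itertools import combinations

    POSITIVE = ["abbdbbabbcbbabbdbbabbcbb", "babdbabcbabdbabc"]
    NEGATIVE = ["ebbbabbbcbbbabbb"]
    SYMBOLS = "abcde"

    def uniform_cost_vector(subset: tuple[str, ...]) -> dict[str, float]:
        """Equal weights summing to one on the chosen operations, zero elsewhere."""
        return {s: (1.0 / len(subset) if s in subset else 0.0) for s in SYMBOLS}

    ideal = []
    for size in range(1, len(SYMBOLS) + 1):
        for subset in combinations(SYMBOLS, size):
            costs = uniform_cost_vector(subset)
            z = stability_quotient(POSITIVE, NEGATIVE,
                                   lambda u, v, c=costs: cyclic_indel_distance(u, v, c))
            if z == 0.0:
                ideal.append(costs)
    print(len(ideal), "of the 31 cost vectors attain the ideal value Z = 0")  # the text reports 15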

Supervised: noise
The second case that we want to consider is supervised training with noise in the strings that represent the chromosomes. For the training objects, we simply corrupt the ones from Case 1; still, each chromosome is kept in its correct training group. Table 3 contains the training set. Again, in less than a second of CPU time, the optimization achieves Z* = 0 with the same cost vector (0, 0, 0, 0, 1). However, when we examine the optimization function under these two training groups, we find that there are only seven points in the simplex with the ideal stability value; in Case 1, there were 15. We can see how the introduction of noise reduces the number of locations where a global minimum is found.


Table 3. Case 2 training set

       Positive objects
    1  abbdbbbcbbabbdbbabbcbb
    2  babdbabcbabdbabc
       Negative object
    3  ebbbabbbcbbabbb

Table 4. Case 3 training set

       Positive objects
    1  abbdbbbcbbabbdbbabbcbb
    2  babdbabcbabdbabc
       Negative objects
    3  ebbbabbbcbbabbb
    4  cbabdbbabbcbbabbdbbab

The same Case 1 pre-recognition stage occurs here: the agent stores the same prototype and cost vector. For the on-line recognition stage, the same correct classification performance as in Case 1 is obtained.

Mis-labels in the negative training group: noise
The third case that we want to consider is semi-supervised training with noise. We keep the same first three training objects as in Case 2 and add a median chromosome to the negative group, which supposedly contains only telocentric chromosomes (see Table 4). As far as the learning agent is concerned, however, it has no knowledge that one of the objects has been mis-labeled. This example took a little longer than the last one because of the increase in the size of the training set. Still, in less than a second of CPU time, the optimum was achieved by the same cost vector (0, 0, 0, 0, 1) and Z* = 0! When we examine the stability function under these two training groups, we find again that there are seven points in the simplex with the ideal stability value, as in Case 2.

The learning agent learned the classes from this problem, with a mis-labeled negative training object, as easily as in the previous case, when all the training objects were correctly labeled. This is accomplished by the use of the average inter-group distance. No other existing general method, including neural networks and the maximization of the quality of the class perception function, can handle these two training groups so beautifully. During the pre-recognition stage, the learning agent stored the same prototype and cost vector as it did for Case 1. For the on-line recognition stage, the same correct classification performance as in Case 1 is obtained.

There is an important insight that we can learn from this example. Let us consider this example from the point of view of supervised learning, or learning with a teacher.


When we check the training set of these four training objects against the class labels given by the supposedly infallible teacher, we find that the teacher incorrectly labeled one object in the set. Nevertheless, the learner proceeded with the training smoothly and came out stable. When the learning process stopped, it was ready to classify any of the objects from the infinite set P, including those labeled incorrectly in the training set. This gives rise to the idea that unsupervised learning is equivalent to supervised learning in a mis-labeled, noisy environment. I will elaborate on this idea in the next subsection.

Unsupervised: noise
The fourth and final case in this paper that we want to consider is unsupervised training with noise. We keep the same four training objects as in Case 3 and add a telocentric chromosome (see Table 5). As far as the learning agent is concerned, it has been given five noisy training objects with no labels.

Table 5. Case 4 training set

       Training objects
    0  bbabbcbbabbbe
    1  abbdbbbcbbabbdbbabbcbb
    2  babdbabcbabdbabc
    3  ebbbabbbcbbabbb
    4  cbabdbbabbcbbabbdbbab

The first thing that the agent needs to do is to partition the training set into two groups. Since mis-labeling can be tolerated in the negative group and minimizing the size of the positive group results in a faster optimization, it seems good to begin by choosing two objects from the training set to form the positive group and putting the rest into the negative group. The danger of choosing only two is that, if the training set is very noisy, the two chosen objects may accidentally form a noisy stable class according to the specification of the optimization. This danger can be lessened by choosing more than two training objects to form the positive group, at the expense of efficiency: statistically, the bigger the positive group, the lower the chance that the agent will pick up a noisy stable class. The danger can be eliminated altogether at the expense of careful extraction of perfect primitives, so that the training objects are noiseless, but this is not always practical.

Without any external knowledge or heuristics, the agent simply and systematically chooses two objects at a time to form a positive group and then performs the optimization, until a stable (e.g. ideal) metric space is found. In this case, the agent began by putting objects 0 and 1 into the positive training group and objects 2, 3 and 4 into the negative training group, and then performed the optimization. After a few seconds of CPU time, no stable metric space could be found; the best stability value was 0.857.


The agent repartitioned the training set by putting objects 0 and 2 into the positive training group and objects 1, 3 and 4 into the negative training group, and performed the optimization. Here, too, no stable metric could be found; the best stability value was 0.24. Then the agent put objects 0 and 3 into the positive group and the rest into the negative group. After less than a second of CPU time, the cost vector (0, 0, 0, 0, 1) was found to yield Z* = 0. By further examining the structure of the optimization function, the agent found fourteen other global minima equal to 0. Interestingly enough, the fifteen cost vectors that yield 0 here are the same fifteen cost vectors that yielded 0 for the Case 1 training groups (see Tables 1 and 2). Here we have telocentric chromosomes in the positive group and median chromosomes in the negative group, while in Case 1 it was the other way around. Also, here we have five training objects, while in Case 1 we had only three.

To prepare for the on-line recognition stage, the agent chose the second training object from the positive group as the reference object, or prototype, because it was shorter than the other training object in the positive group. So it stored away W* = (0, 0, 0, 0, 1) and ebbbabbbcbbabbb. Note that, unlike the previous three cases, this last case happened to place the telocentric chromosomes in the positive group. At the on-line recognition stage, when the agent is presented with a chromosome x ∈ P, it calculates Δ_W*(x, ebbbabbbcbbabbb). If the resulting distance is 0, it classifies x as telocentric; otherwise, it classifies x as median. For all unknown chromosomes represented without noise, this simple calculation results in 100% correct classification. For those unknowns with noise, all are classified correctly except when the noise involves the primitive e. The on-line performance is as it was for Case 1.
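The pair-by-pair search that the agent carries out can be sketched as follows, again reusing the helpers from the earlier sketches; the coarse simplex sweep below merely stands in for the actual optimization procedure:

    from itertools import combinations

    TRAINING = ["bbabbcbbabbbe",             # object 0
                "abbdbbbcbbabbdbbabbcbb",    # object 1
                "babdbabcbabdbabc",          # object 2
                "ebbbabbbcbbabbb",           # object 3
                "cbabdbbabbcbbabbdbbab"]     # object 4

    def best_z(pos, neg):
        """Lowest stability value found over the swept cost vectors."""
        zs = []
        for size in range(1, 6):
            for subset in combinations("abcde", size):
                costs = uniform_cost_vector(subset)
                zs.append(stability_quotient(pos, neg,
                          lambda u, v, c=costs: cyclic_indel_distance(u, v, c)))
        return min(zs)

    for i, j in combinations(range(len(TRAINING)), 2):
        pos = [TRAINING[i], TRAINING[j]]
        neg = [obj for k, obj in enumerate(TRAINING) if k not in (i, j)]
        z = best_z(pos, neg)
        print(f"positive group = objects {i} and {j}: best Z = {z:.3f}")
        if z == 0.0:
            break        # a stable (ideal) metric space has been found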

DISCUSSION
For more complicated problems, e.g. the case of three or more clusters being discovered, not only does the learner not have the class label information, it does not even know how many classes are present in the training set. Nevertheless, the basic model presented here, with its transformation system and stability optimization, naturally lends itself to general unsupervised learning. One could, for example, concentrate on learning one class at a time, temporarily putting all the other clusters together as the negative class; learning then proceeds in a hierarchical (vertical), or tall, manner. Other schemes, such as learning in a flat (horizontal) manner, are also possible. For example, for the case of c classes, one can try minimizing the sum of the average intra-group distances of the c groups divided by the (continued) product of the average inter-group distances between all c(c − 1)/2 pairs of groups.
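Written out, this flat objective takes roughly the following form, where G_1, ..., G_c denote the c groups (the notation is introduced here only for illustration):

    Z_{G_1,\ldots,G_c}(W) \;=\; \frac{\sum_{k=1}^{c} \rho_{G_k}(W)}{\prod_{1 \le k < l \le c} \upsilon_{G_k,\,G_l}(W)}.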

The goal is to keep all the intra-group distances equal to zero and none of the inter-group distances equal to zero. I will treat this generalization to multiple classes more rigorously in a future paper.

The learning agent was never told what median and telocentric chromosomes were, yet it was successful in autonomously setting up an on-line recognition system to check for median and telocentric chromosomes. There is a tradeoff between the specificity of the so-called bias (i.e. the problem assumptions) and the training time: more bias, less CPU time; less bias, more CPU time. The model proposed in this paper achieves, I believe, the best of that tradeoff. The learning agent, based on the transformation system and stability optimization, assumes very little, yet it can learn a lot quickly. It assumes only knowledge of the structural (discrete) representation of the objects. We have also done some work that derives the structural representation from raw (numerical) images (Chan and Goldfarb, 1992) based on a learning approach.

During the learning phase, noise in the representation of objects is always acceptable, but it can affect learning efficiency. From the efficiency point of view, during the learning stage it is best to keep the training set small, noiseless, and with the correct label for each training object. At the on-line stage, the recognition system automatically generated by the learning agent recognizes unknown noiseless chromosomes with a perfect classification record for these two types of chromosomes. For those unknowns with noise, all are classified correctly except when the noise involves the primitive e; even with this noise, the classification is sometimes correct.

Induction is seen here as the process of deriving a stable metric space that separates the training groups. A stable metric space is one containing well-separated, compact clusters. The unifying metric approach goes beyond the limitations of Euclidean space to general metric spaces, and beyond the limitations of a fixed space to a dynamic selection from an infinite family of spaces.

REFERENCES
Chan,T.Y.T. (1999a) Inductive pattern learning. IEEE Trans. Syst., Man, Cybern., 29, 667–674.
Chan,T.Y.T. (1999b) Running parallel algorithms with APL on a sequential machine. APL Quote Quad, 29, 25–26.
Chan,T.Y.T. and Goldfarb,L. (1992) Primitive pattern learning. Pattern Recognit., 25, 883–889.
Clarson,V. and Liang,J.J. (1989) Mathematical classification of evoked potential waveforms. IEEE Trans. Syst., Man, Cybern., 19, 68–73.
Forster,M.J., Heath,A.B. and Afzal,M.A. (1999) Application of distance geometry to 3D visualization of sequence relationships. Bioinformatics, 15, 89–90.
Fu,K.S. (1982) Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Goldfarb,L. (1984) A unified approach to pattern recognition. Pattern Recognit., 17, 572–582.
Goldfarb,L. and Chan,T.Y.T. (1984) An application of a new approach to pictorial pattern recognition. In Proceedings of the 4th IASTED International Symposium on Robotics and Automation. Acta Press, Zurich, pp. 70–73.
Lindenmayer,A. (1968) Mathematical models for cellular interactions in development. J. Theor. Biol., 18, 280–315.
Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Appl. Biosci., 4, 11–17.
Prusinkiewicz,P. and Lindenmayer,A. (1990) The Algorithmic Beauty of Plants. Springer, New York.

APPENDIX: METRIC DEFINITION
Given a set S, a real-valued scalar function δ of two arguments,

    δ : S × S → ℝ,

is called a metric function on S if, for any x, y, z ∈ S, the following conditions are satisfied:

    non-negativity:       δ(x, y) ≥ 0,
    semi-reflexivity:     x = y ⇒ δ(x, y) = 0,
    symmetry:             δ(x, y) = δ(y, x),
    triangle inequality:  δ(x, y) + δ(y, z) ≥ δ(x, z).

The pair (S, δ) is called a metric space, or unifying metric space. Note that the semi-reflexive condition means that the distance between an element and itself must be 0, while the distance between two different members of the set S could possibly be 0 as well.

In order to see how this definition of a metric space relates to the usual one, let us define a standard metric space by the inclusion of the following additional axiom:

    definiteness:  δ(x, y) = 0 ⇒ x = y.

Now induce an equivalence relation on the set S by saying that x ∼ y if δ(x, y) = 0, and define a function δ̃ : S̃ × S̃ → ℝ on the set of equivalence classes S̃ = S/∼ by δ̃(x̃, ỹ) = δ(x, y), where x ∈ x̃ (i.e. x̃ is the equivalence class of x) and y ∈ ỹ.

The pair (S̃, δ̃) is a standard metric space. Proof by contradiction: assume that (S̃, δ̃) is not a standard metric space. Then there exist x̃ and ỹ such that x̃ ≠ ỹ and δ̃(x̃, ỹ) = 0. But

    δ̃(x̃, ỹ) = 0 ⇒ δ(x, y) = 0
                ⇒ x ∼ y
                ⇒ x, y ∈ x̃ and x, y ∈ ỹ
                ⇒ x̃ = ỹ,

a contradiction. In this way, we can always induce a standard metric space from a given (unifying) metric space. Hence, in this regard, we are justified in our usage of the term metric space.
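A minimal illustration of this quotient construction on a finite set is sketched below (the helper names are hypothetical); grouping by zero distance is transitive precisely because of the triangle inequality:

    from typing import Callable, Sequence

    def equivalence_classes(objects: Sequence[str],
                            delta: Callable[[str, str], float]) -> list[list[str]]:
        """Group the objects into classes of mutually zero-distance elements."""
        classes: list[list[str]] = []
        for obj in objects:
            for cls in classes:
                if delta(obj, cls[0]) == 0:
                    cls.append(obj)
                    break
            else:
                classes.append([obj])
        return classes

    def induced_distance(cls1: Sequence[str], cls2: Sequence[str],
                         delta: Callable[[str, str], float]) -> float:
        """delta-tilde on the set of classes; well defined by the triangle inequality."""
        return delta(cls1[0], cls2[0])

    # Absolute length difference is a metric in the above sense but not a standard one:
    length_delta = lambda u, v: abs(len(u) - len(v))
    print(equivalence_classes(["ab", "cd", "abc"], length_delta))   # [['ab', 'cd'], ['abc']]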
