(Physical attributes of the running example's areas: large font size, bold, black font color; green font color, no background; lilac background color, black font color; black font color, no background; image.)
Basic operations
With CF-grammar GƊ a forest FƊ = {T1, T2, …, TN, …} is associated, such that each grammar rule has a corresponding subtree (bush) in FƊ. The nodes of these trees correspond to terminal and non-terminal symbols from grammar GƊ, and the edges reflect the production rules. This correspondence is registered in the function
ℑ: PƊ → ℙƊ
that associates each production of grammar GƊ with some bush from ℙƊ, where ℙƊ is the set of all possible bushes from FƊ. In our running example, grammar rule (4) is represented by the bush in Figure 2.
Figure 2. The bush representing rule (4): root LECTION with children LEC_NUM, LEC_THEME and LEC_BODY.

Let p1, p2 ∈ PƊ be productions of grammar GƊ and ℘1, ℘2 ∈ ℙƊ their corresponding bushes. Let productions p1 and p2 be as follows:
p1: α → β1 β2 … βk,
p2: γ → δ1 δ2 … δm,
where α, γ ∈ NƊ and βi, δj ∈ NƊ ∪ TƊ, i = 1, 2, …, k; j = 1, 2, …, m. By analogy with traditional syntactic analysis [1] we introduce the following definitions.
Definition 4. Bush ℘1 is an ancestor of ℘2 (equivalently, ℘2 is a descendant of ℘1) if the right-hand side of rule p1 contains the left-hand side of rule p2: γ = βi for some i ∈ [1, k].
Definition 5. The bushes ℘1 and ℘2 are neighbors if they have a common ancestor p3 ∈ PƊ,
p3: ϕ → ψ1 ψ2 … ψl,
where ϕ ∈ NƊ, ψi ∈ NƊ ∪ TƊ, i = 1, 2, …, l, such that among the units of its right part there are two units ψq and ψs with α = ψq ∧ γ = ψs; q, s ∈ [1, l].
Definition 6. Bush ℘1 is the left (right) neighbor of ℘2 if ℘1 is a neighbor of ℘2 and q < s (q > s). Note that two bushes can be both left and right neighbors of each other. Graphic representations of these definitions and relations are shown in Figure 3.
To define a context-probability model of the document class it is necessary to define four relations between bushes in a tree [4]. Let ℘, ℘′, ℘′′ ∈ ℙƊ. Then
(a) NT(℘, ℘′, ℘′′) ≡ ℘ is an ancestor of ℘′ and ℘′ an ancestor of ℘′′;
(b) NB(℘, ℘′, ℘′′) ≡ ℘ is a descendant of ℘′ and ℘′ a descendant of ℘′′;
(c) NL(℘, ℘′, ℘′′) ≡ ℘ is a left neighbor of ℘′ and ℘′ a left neighbor of ℘′′;
(d) NR(℘, ℘′, ℘′′) ≡ ℘ is a right neighbor of ℘′ and ℘′ a right neighbor of ℘′′.
These relations are not independent: NB(℘, ℘′, ℘′′) = NT(℘′′, ℘′, ℘) and NL(℘, ℘′, ℘′′) = NR(℘′′, ℘′, ℘).
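To make Definitions 4–6 and the relations (a)–(d) concrete, here is a minimal Python sketch; the encoding of a bush as its production — a (left part, right-part tuple) pair — and all names are illustrative assumptions, not the thesis's implementation.

```python
# A bush is modeled by its production: a pair (lhs, rhs-tuple).

def is_ancestor(p1, p2):
    """Definition 4: p1 is an ancestor of p2 if lhs(p2) occurs in rhs(p1)."""
    return p2[0] in p1[1]

def neighbor_positions(p1, p2, p3):
    """Definition 5: positions q, s in rhs(p3) holding lhs(p1) and lhs(p2)."""
    return [(q, s)
            for q, sym_q in enumerate(p3[1])
            for s, sym_s in enumerate(p3[1])
            if q != s and sym_q == p1[0] and sym_s == p2[0]]

def is_left_neighbor(p1, p2, productions):
    """Definition 6: p1 is a left neighbor of p2 under some common ancestor."""
    return any(q < s
               for p3 in productions
               for q, s in neighbor_positions(p1, p2, p3))

def NT(p, p1, p2):
    """(a): p is an ancestor of p1 and p1 an ancestor of p2."""
    return is_ancestor(p, p1) and is_ancestor(p1, p2)

def NB(p, p1, p2):
    """(b): via the duality NB(p, p', p'') = NT(p'', p', p)."""
    return NT(p2, p1, p)

def NL(p, p1, p2, productions):
    """(c): p left neighbor of p1 and p1 left neighbor of p2."""
    return (is_left_neighbor(p, p1, productions)
            and is_left_neighbor(p1, p2, productions))

def NR(p, p1, p2, productions):
    """(d): via the duality NR(p, p', p'') = NL(p'', p', p)."""
    return NL(p2, p1, p, productions)
```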
Figure 3. Graphic representation of the relations: (a) NT(℘, ℘′, ℘′′); (b) NB(℘, ℘′, ℘′′); (c) NL(℘, ℘′, ℘′′); (d) NR(℘, ℘′, ℘′′).
1.3 The probability model
The probabilities that a bush ℘ is an ancestor of ℘′ and ℘′ an ancestor of ℘′′, for all ℘, ℘′, ℘′′, can be collected into a cubic context-probability matrix
ΜT = ||μijk||, i, j, k = 1, …, N + 1,
whose elements express the following conditional probability:
μijk = Ρ(℘ = ℘i | NT(℘, ℘j, ℘k)).
Note that ℘ is not a tree but a bush in some tree of FƊ (see Figure 3). This conditional probability can be computed, and estimated from frequencies, as follows:
Ρ(℘ = ℘i | NT(℘, ℘j, ℘k)) = Ρ(NT(℘i, ℘j, ℘k)) / Ρ(NT(℘j, ℘k)),
Ρ̂(℘ = ℘i | NT(℘, ℘j, ℘k)) = W(NT(℘i, ℘j, ℘k)) / W(NT(℘j, ℘k)).
Here W(A) is the frequency of event A, and Ρ̂ is a practical estimate of the conditional probability Ρ. The cubic context-probability matrices ΜB, ΜL and ΜR are built analogously. The resulting context-probability model of the document class comes closer to powerful linguistic models while still allowing effective algorithmization of the methods. Based on this context-probability model, an iterative algorithm for determining the logical structure grammar can be designed; this algorithm solves the inverse parsing problem for the given case.
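A minimal sketch of how the estimates Ρ̂ could be computed, assuming the ancestor chains NT(℘i, ℘j, ℘k) have already been enumerated from the training trees (the enumeration itself, the indexing of bushes and any smoothing are left out):

```python
from collections import Counter

def estimate_mu_T(ancestor_chains):
    """Estimate mu[i,j,k] = W(NT(bush_i, bush_j, bush_k)) / W(NT(bush_j, bush_k)).

    `ancestor_chains` is an iterable of (i, j, k) index triples, one per
    observed occurrence of an ancestor chain bush_i -> bush_j -> bush_k.
    Returns a sparse dict; absent triples have estimated probability 0."""
    triple = Counter()  # frequencies W(NT(bush_i, bush_j, bush_k))
    pair = Counter()    # marginal frequencies W(NT(bush_j, bush_k))
    for i, j, k in ancestor_chains:
        triple[i, j, k] += 1
        pair[j, k] += 1
    return {(i, j, k): w / pair[j, k] for (i, j, k), w in triple.items()}
```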
2 Constructing the structure grammar

The construction of a document logical structure grammar can be divided into two main stages: low-level grammar construction and top-level grammar construction. The separation of the top-level grammar lexemes according to knowledge base rules belongs to the low-level grammar construction. From the point of view of systems analysis this process corresponds to the transformation of an abstract model into a concrete one: the knowledge base, which stores the initial information about lexeme selection, allows populating the designed abstract model of the document class with concrete data about the considered document class. First we consider the construction process of a document class logical structure grammar G. The input for this process is a (possibly infinite) sequence of documents:
Ɗ = {D1, D2, …, DN, …},
where each document Di (i ≥ 1) is described by some grammar Gi. The grammar construction may then be carried out by an iterative method with initial approximation G[0] = ∅ and successive approximations
G[k] = ψ(G[k−1] ∪ ψ ∘ ϕ ∘ R(Dk)), k = 1, 2, …, N,
where ψ is a function that associates grammar alternatives, ϕ a grammar generalization function, and R a function that converts the lexeme set into the document class grammar. This grammar sequence converges to some solution G̃:
G[k] → G̃ as k → ∞.
Let L[k] (k > 0) be the language generated by grammar G[k], and let the language L̃ be generated by the limit grammar G̃. Then the generated languages grow: L[k] ⊆ L[k+1], so the languages L[k] (k > 0) gradually approach the language L̃.
Let us define a residual function f as the residual of the precise and approximated languages L̃ and L[k]. The iterative approximation process should minimize the absolute value |f(L̃, L[k])| of the residual function:
min → |f(L̃, L[k])| = |L̃ \ L[k]|.
Obviously this criterion cannot be used in practice since:
(a) the language L̃ is not known beforehand and is a theoretical abstraction;
(b) determining the language L̃ generated by an arbitrary CF-grammar G̃ without additional information is an algorithmically unsolvable problem.
More practical is the following criterion: if at iteration N grammar G[N] does not change in comparison with grammar G[N−1], the iterative process can be terminated. The hardest task in determining a document class logical structure grammar is the construction of the low-level grammar set from some set of lexemes. In general this task is ill-posed and unsolvable without additional information; therefore a predetermined knowledge base is necessary.
3 The algorithm in more detail

3.1 The algorithm description
Next we analyze this construction process. Let Di ∈ Ɗ be a training document of the selected class, ℒi the set of all logical areas into which document Di is partitioned, and Gi the logical structure grammar of this document Di. The grammar construction is divided into two fundamental stages: construction of the grammar set {Gi(1), Gi(2), …, Gi(q)} on the lower level (this grammar set is constructed from the logical areas ℒi through the function R) and construction of grammar Gi on the upper level (this grammar is constructed from the grammar set {Gi(1), Gi(2), …, Gi(q)} through the iterative process above). The functions ϕ and ψ are described below, as far as the size of this abstract allows. Assume each document Di (i > 0) from the training set has been fragmented into a set of strings
Si = {Si(1), Si(2), …, Si(q)}, i = 1, 2, …, N,
according to its logical structure. This splitting should be such that each string Si(j) (j = 1, 2, …, q) uniquely corresponds to some logical area (mi(j), Γi(j)) ∈ ℒi that contains no other nested logical areas, and defines some initial language Li(j). According to the data stored in the knowledge base, a generating grammar G̃i(j) can be obtained from the language Li(j). However, the grammar G̃i(j) constructed this way will only determine the string Si(j). Therefore the language Li(j) is expanded so that it also includes strings syntactically similar to Si(j). Applying the grammar generalization function [6] to the resulting grammar G̃i(j) yields the low-level grammar Gi(j):
Gi(j) = ϕ(G̃i(j)), i = 1, 2, …, N, j = 1, 2, …, q.
Example. In the running example (see Figure 1), the string "What the searcher wants" is an atomic fragment Si(j) for some i and j. The corresponding grammar G̃i(j) is
ΔƊ = { (3) … → … ; (4) … → … ; (5) … → … },
and the grammar Gi(j) = ϕ(G̃i(j)) is
ΔƊ = { … },
PƊ = { (1) … → … ; (2) … → … ; (3) … → … ; (4) … → … ; (5) … → … }.
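The abstract does not spell ϕ out; as one plausible illustration of the kind of generalization meant here (not the author's actual ϕ), the sketch below collapses a repeated symbol in a right part into a fresh recursive nonterminal, so a grammar inferred from one string also accepts syntactically similar strings with other repetition counts.

```python
def generalize(rules):
    """Toy generalization step: replace a run of a repeated symbol X in a
    right-hand side by a fresh nonterminal X_LIST with rules
    X_LIST -> X | X X_LIST, so any repetition count is accepted."""
    out = []
    for lhs, rhs in rules:
        new_rhs, i = [], 0
        while i < len(rhs):
            j = i
            while j < len(rhs) and rhs[j] == rhs[i]:
                j += 1
            if j - i > 1:                       # a run of length >= 2
                lst = rhs[i] + "_LIST"
                new_rhs.append(lst)
                out.append((lst, (rhs[i],)))
                out.append((lst, (rhs[i], lst)))
            else:
                new_rhs.append(rhs[i])
            i = j
        out.append((lhs, tuple(new_rhs)))
    return out

# e.g. generalize([("BODY", ("word", "word", "word"))]) yields
# BODY -> word_LIST and word_LIST -> word | word word_LIST.
```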
Based on the constructed set of low-level grammars {Gi(1), Gi(2), …, Gi(q)} and the set of logical areas ℒi labeled by the user, we create an initial high-level grammar G̃i. This grammar is extended by first applying the generalization function ϕ and then the association function ψ to rules with identical left parts. The resulting top-level grammar is
Gi = ψ(ϕ(G̃i)), i = 1, 2, …, N.
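A minimal sketch of the association step: rules sharing a left part are folded into one rule whose right part lists the alternatives. The names and the encoding of rules as (lhs, rhs-tuple) pairs are illustrative assumptions, not the thesis's data structures.

```python
from collections import defaultdict

def associate_alternatives(rules):
    """Sketch of the association function psi: rules with an identical
    left part are merged into a single rule with alternative right parts
    (duplicates removed, original order kept)."""
    groups = defaultdict(list)
    for lhs, rhs in rules:
        if rhs not in groups[lhs]:
            groups[lhs].append(rhs)
    return {lhs: tuple(alts) for lhs, alts in groups.items()}

# e.g. associate_alternatives([("A", ("b",)), ("A", ("c",)), ("B", ("a",))])
# -> {"A": (("b",), ("c",)), "B": (("a",),)}
```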
Next we consider the metagrammar structure in more detail. Assume the knowledge base has a preset metagrammar GM = {NM, TM, PM, ΔM}, and consider the limitations to be obeyed by its set of rules PM = {π1, π2, …, πq}. Let rule πi of the metagrammar GM have the form
αi → βi1 βi2 … βiki,
where αi ∈ NM, βij ∈ TM ∪ NM, ki are positive integers, i = 1, 2, …, q, j = 1, 2, …, ki. Let Α ⊆ NM be the set of all nonterminal symbols αi standing on the left-hand sides of rules πi, and Β the set of all right-hand sides βi1 βi2 … βiki of rules πi, i = 1, 2, …, q. Then the set of rules PM should obey the following main condition: μ: Β → Α is an injective function. When this condition holds on PM, the relation ⊴ can be introduced as follows:
π1 ⊴ π2 ⇔ (π1 = π2) ∨ (∃ j ∈ [1, k2]: β2j = α1) ∨ (∃ π(1), π(2), …, π(r) ∈ PM: π(1) = π1 ∧ π(r) = π2 ∧ π(i) ⊴ π(i+1) for i = 1, 2, …, r − 1).
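Under the assumption that each rule πi is encoded as a pair (αi, (βi1, …, βiki)), the relation ⊴ can be decided by simple reachability over the one-step relation "left part occurs in the other rule's right part"; a sketch:

```python
def precedes(p1, p2, P):
    """Decide p1 <= p2 (the relation above) for metagrammar rules P, each
    encoded as a hashable pair (alpha, beta_tuple): true if p1 == p2, if
    lhs(p1) occurs in rhs(p2), or if a chain of such one-step relations
    connects p1 to p2 (reflexive-transitive closure via DFS)."""
    seen, stack = set(), [p1]
    while stack:
        p = stack.pop()
        if p == p2:
            return True
        if p in seen:
            continue
        seen.add(p)
        # one step up: all rules whose right side contains lhs(p)
        stack.extend(r for r in P if p[0] in r[1])
    return False
```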
Theorem 1. The relation ⊴ is an ordering of the set of rules PM, and the set of rules PM of metagrammar GM forms a lattice.

3.2 General algorithm of logical structure recognition
This leads to the following algorithm for obtaining the document class logical structure description (a sketch of the loop follows the list):
1. Reset the iteration counter: k ← 0; set the initial approximation G[0] ← ∅.
2. Extract the next document Dk+1 from the learning sample.
3. Divide document Dk+1 into a set of logical areas ℒk+1.
3a. From ℒk+1 construct the tree of its logical structure Tk+1.
3b. From tree Tk+1 construct the cubic probability matrices ΜT(k+1), ΜB(k+1), ΜL(k+1) and ΜR(k+1) for document Dk+1.
3c. Merge the matrices ΜT(k+1), ΜB(k+1), ΜL(k+1) and ΜR(k+1) into the general probability matrices ΜT, ΜB, ΜL and ΜR of document class Ɗ.
4. Construct the set of low-level grammars {Gk+1(1), Gk+1(2), …, Gk+1(q)}.
5. Construct the top-level grammar Gk+1.
6. Integrate grammars G[k] and Gk+1.
7. Merge rules with identical left parts in the obtained grammar G[k+1].
8. Compare grammars G[k] and G[k+1]. If the grammars coincide (T[k] = T[k+1], N[k] = N[k+1] and P[k] = P[k+1]), then exit; else set k ← k + 1 and go to step 2.
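A sketch of the loop, with every stage the text names (splitting, tree building, matrix counting, the low-level construction R, top-level construction, the association function ψ) passed in as a parameter bundle, since the abstract fixes their roles but not their interfaces:

```python
def learn_class_grammar(documents, stages):
    """Skeleton of steps 1-8; `stages` bundles the stage operations,
    whose concrete interfaces are not prescribed by the text."""
    G = frozenset()                              # step 1: G[0] = empty
    matrices = stages.init_matrices()            # global M_T, M_B, M_L, M_R
    for D in documents:                          # step 2
        areas = stages.split(D)                  # step 3
        tree = stages.build_tree(areas)          # step 3a
        stages.merge_matrices(matrices, tree)    # steps 3b-3c
        low_level = [stages.R(a) for a in areas]       # step 4
        G_doc = stages.top_level(low_level, areas)     # step 5
        G_next = stages.psi(G | G_doc)           # steps 6-7: union + merge
        if G_next == G:                          # step 8: fixed point
            break
        G = G_next
    return G
```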
The result of this algorithm is the approximated grammar G̃. On the basis of the constructed model of the document class and its logical structure grammar, parsing methods and algorithms for the document logical structure that use the physical structure and contextual-probability dependences are designed.
3.3 The algorithm using physical structure and contextual-probability dependences
Let Λ = {λ1, λ2, …, λh} be the set of leaves and Δ the root of the tree. Let P = {p1, p2, …, pg} be the set of logical units defined by the physical attributes. Select a subset P′ = {p(1), p(2), …, p(q)} ⊆ P, where p(i) ∈ P, i = 1, 2, …, q and q ≤ g, such that each unit p(i), i = 1, 2, …, q, is at the lowest tree level above the leaves. Then it is possible to partition Λ into classes whose members are those leaves of Λ that are descendants of the same unit of the set P′ ∪ {Δ}. Thus we obtain (q + 1) classes Ci, i = 1, 2, …, q + 1, where the first q classes correspond to the q units of set P′ and class Cq+1 corresponds to the initial unit Δ. Let P′′ = P′ ∪ {Δ} = {p(1), p(2), …, p(q), p(q+1)}, where p(q+1) = Δ, and let the relation Des(a, b) mean that a is a descendant of b; then
λ ∈ Ci ⇔ Des(λ, p(i)) = true, where p(i) ∈ P′′, i = 1, 2, …, q + 1, λ ∈ Λ.
The following parsing method for the document logical structure using the physical structure is offered:
1. Partition Λ into classes Ci, i = 1, 2, …, q + 1.
2. Bottom-up selection of the subtrees T(i) with root p(i) and leaves from class Ci, i = 1, …, q + 1.
3. Top-down syntactic analysis of the subtree T(i).
4. Insertion of the obtained subtree T(i) into the common tree T.
5. If the root of the subtree T(i) coincides with the root of the tree T, then exit; else go to step 2.
The use of contextual-probability dependences increases the efficiency of the top-down parsing method through fuzzy sets defining the probabilities of bush appearances in a tree, {(℘, ξΒ(℘))}, ∀ ℘ ∈ ℙƊ, and of given tree appearances in a set of trees, {(T, ξΘ(T))}, ∀ T ∈ FƊ. The characteristic functions of these fuzzy sets can be described as follows [4]:
ξΒ(℘; α) = [((ΜT(℘))^α + (ΜB(℘))^α + (ΜL(℘))^α + (ΜR(℘))^α) / 4]^(1/α),
ξΘ(℘1, ℘2, …, ℘q; α) = [((ξΒ(℘1))^α + (ξΒ(℘2))^α + … + (ξΒ(℘q))^α) / q]^(1/α).
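Both characteristic functions are power means of order α. A direct transcription in Python (the function names are ours):

```python
def xi_B(mT, mB, mL, mR, alpha):
    """Membership of a bush: the power mean of order alpha of its four
    context-probability values taken from M_T, M_B, M_L and M_R."""
    vals = (mT, mB, mL, mR)
    return (sum(v ** alpha for v in vals) / len(vals)) ** (1.0 / alpha)

def xi_Theta(bush_memberships, alpha):
    """Membership of a tree given the xi_B values of its q bushes:
    again a power mean of order alpha."""
    q = len(bush_memberships)
    return (sum(v ** alpha for v in bush_memberships) / q) ** (1.0 / alpha)

# alpha = 1 gives the arithmetic mean; as alpha decreases toward -inf the
# mean approaches the minimum, so alpha tunes how strongly one improbable
# bush penalizes the whole tree.
```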
Let ϑ ∈ NƊ and ϕ ∈ TƊ be the initial and final units of a chain. Then the «following chain search» method [11] used in classical top-down parsing can be replaced by the proposed «most approaching chain search» method, which selects the chain having the highest probability of appearance in a given context:
1. c(0) = ϑ.
2. Using the classical «following chain search» method, find the next unit ck[r] that can continue the current chain c(k − 1) = c0 c1 … ck−1.
3. Calculate the probability of appearance of the chain {c(k − 1) & ck[r]} in the parse tree: ξΘ[r](℘0 ℘1 … ℘k[r]; α), where ℘i is the bush with root node ci and ℘k[r] the bush with root node ck[r].
4. Excluding element ck[r] and incrementing r, repeat steps 2–3 while a next element ck[r] can be found.
5. Consider the set of probabilities {ξΘ[1], ξΘ[2], …, ξΘ[r]} and select the maximal probability ξΘ[q]; then ck = ck[q].
6. Append the new unit to the chain: c(k) = c(k − 1) & ck, where c(i) = c0 c1 … ci, i = 0, 1, …, k − 1.
7. Repeat steps 2–6 until ck = ϕ or until all possible chains have been tried.
As a result of this algorithm we obtain a chain c(n) = c0 c1 … cn, where c0 = ϑ, cn = ϕ and ξΘ(℘0 ℘1 … ℘n; α) is maximal, i.e. the most probable chain in this context.
Theorem 2. There is a constant c such that, when processing an entry chain w of length n ≥ 1, the algorithm performs no more than q·c^n elementary operations (where q is the number of logical units defined by the physical attributes), provided that the calculation of one step of the algorithm requires a fixed number of elementary operations.
Theoretical evaluations of the complexity of the designed algorithms have shown that without statistical information they are exponential; however, as statistical information accumulates, their behavior approaches that of deterministic parsing methods.
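Steps 2–5 reduce to scoring each admissible continuation and keeping the best; a sketch, with the membership function ξΘ passed in as a callable:

```python
def most_probable_continuation(chain, candidates, xi_theta):
    """Steps 2-5 of the «most approaching chain search»: score every unit
    that the classical «following chain search» admits as a continuation
    of `chain` and keep the one whose extended chain has the maximal
    membership value xi_Theta."""
    best, best_score = None, float("-inf")
    for c in candidates:                  # candidates = c_k[1], ..., c_k[r]
        score = xi_theta(chain + [c])     # probability of the extended chain
        if score > best_score:
            best, best_score = c, score
    return best, best_score
```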
4 Structured information storage and retrieval methods

In the thesis a sentential calculus for structured queries is designed. It is based on the following mappings:
I: ΩINDEX → Z;
K: Z → ΩINDEX / Ker I;
J: S → ΩTAGS.
Here the mapping I, from the set of indexed terms of the document to the set of indexes, is surjective and associates every indexed term with a unique index according to its layout in the document. By the main mapping theorem there is a bijective mapping K, where ΩINDEX / Ker I is the factor set over the kernel, which allows each index to uniquely define the coset of the corresponding indexed term. Let S be the set of strings of an electronic document, with S ⊇ ΩINDEX / Ker I. Consider the mapping J that, according to some heuristic observations, singles out from a string s ∈ S a substring s′ ⊆ s and associates it with a marking tag. Then the mapping
K ∘ J ∘ I: Z → Z
means that a unit r1 ∈ Z can be associated (when possible) with a unit r2 ∈ Z such that K(r2) is the tag corresponding to K(r1). From these mappings it is clear that every indexed term present in the document can be associated with an index defining its layout relative to other terms. Therefore any part of the document can be represented as a segment (i, j), where i is the index of the first word and j the index of the last word of the selected part [5].
Definition 7. An extent a = (i, j) is a pair of numbers representing the beginning and the end of the segment corresponding to some part of the document.
Reasoning from the mappings, the following predicates are offered:
(a) INCLUDE(a, b) — determines whether extent a includes extent b;
(b) ORDER(a, b, c) — determines whether extent c is such that it begins with extent a and ends with extent b, with a ≠ b;
(c) CONTENTS(a, b) — determines whether the contents of extent b is a substring of the contents of extent a, a ⊇ b;
(d) CORRESPOND(a, b) — determines whether the content of extent b is the tag corresponding to the content of extent a, i.e. J(a) = b;
(e) EQUAL(x, y) — determines whether argument x is equal to argument y.
Next we add some definitions based on the terminology of relational calculus.
Definition 8. The universe of extent components U is the set of numbers to which all possible extent components belong.
Definition 9. The active extent domain dom is the set of extents (i, j) such that i, j ∈ U and i ≤ j.
Definition 10. Let ψ be a formula of the calculus. The extended active extent domain with respect to ψ, denoted edom(ψ), is the set of extents (i, j) ∈ dom whose components i and j appear explicitly or implicitly in ψ. All predicates occurring in a well-formed formula (WFF) ψ are determined on the set edom(ψ); thus the extended active extent domain limits the calculus to finite sets only.
Theorem 3. All predicates offered in the sentential calculus are independent, in the sense that none of them can be expressed through the remaining predicates alone.
Proof. Note that the mutual independence of the predicate groups {INCLUDE, ORDER}, {CONTENTS, CORRESPOND} and {EQUAL} is obvious: the first group determines the positional relationship of extents irrespective of their contents; the second determines the relation between extents and their contents using the mappings K
and J; finally, the predicate EQUAL determines equality not only of extents and strings but also of functions of extents and strings, so it forms an independent group. Hence, to prove the theorem it is enough to show that the predicates INCLUDE and ORDER, and likewise the predicates CONTENTS and CORRESPOND, are mutually independent.
Let ψ(x1, x2, …, xk) be a WFF. By the de Morgan laws we may assume that ψ contains only the operators ∨ and ¬. Let edom(ψ) be the extended active extent domain with respect to ψ.
(I) Independence of the predicate INCLUDE. Suppose that, apart from the operators ∨ and ¬, the formula ψ contains only the predicate ORDER, yet ψ is equivalent to INCLUDE(a, b) for any a = (a1, a2) and b = (b1, b2) from edom(ψ). We prove by induction on the number of operators in any subformula ω of ψ that from ω it is impossible to conclude that a includes b. Assume, in the general case, a1 < b1 and a2 > b2.
Basis. Zero operators in ψ. Then ω is ORDER(x, y, z). If we substitute a and b for x, y and z then, in general, the predicate ORDER cannot show that a includes b.
Induction. Let ω contain at least one operator and let the induction hypothesis hold for all subformulas of ψ having fewer operators than ω.
Case 1. ω = ω1 ∨ ω2. Since by the induction hypothesis neither ω1 nor ω2 allows concluding that a includes b, ω1 ∨ ω2 also satisfies the hypothesis.
Case 2. ω = ¬ω1. In this case, from ω it is possible to conclude that a includes b if and only if ω1 asserts that a does not include b. However, this statement also cannot be constructed without the predicate INCLUDE. Hence ω satisfies the induction hypothesis.
(II) The independence of the predicate ORDER(a, b, c) is proved analogously to case (I). Indeed, the predicate INCLUDE can only show the containment of a and b in c; it cannot express the additional condition that c begins with a and ends with b. The treatment of cases 1 and 2 of the inductive proof coincides with (I).
(III) Independence of the predicate CONTENTS(a, b). Obviously, it can be expressed through the predicate CORRESPOND only if the string a contains a tag name itself; in the general case this is impossible.
(IV) The independence of the predicate CORRESPOND follows from the fact that it is the only predicate using the mapping J; therefore it obviously cannot be expressed through the other predicates, which do not involve J. The theorem is proved. ■
Relational calculus is the prevailing mathematical tool of databases, so the reducibility of the constructed calculus to relational calculus is an important fact. Denote by B the set of binary relations. Any set of extents can be represented as a relation from B whose tuples are the extents of this set, i.e. there is an injective mapping E: dom → B.
Definition 11. A GC-relation (Generalized Concordance relation) is a binary relation from B whose tuples represent extents not nested in each other [5].
Denote by G the set of GC-relations. Then the following theorem holds:
Theorem 4. If G is the set of all GC-relations and G* = G ∪ {∅}, then a surjective mapping N: B → G* exists.
Proof. Let S ∈ B and let the tuples of S be extents. We construct a relation S′ = N(S) whose tuple-extents are not nested in each other.
The relation S′ can be constructed in relational calculus with tuple variables as follows:
S′ = N(S) = {s | S(s) ∧ ¬(∃s′)(S(s′) ∧ s′[1] > s[1] ∧ s′[2] < s[2])}.
This relation S′ is either a GC-relation or empty, i.e. S′ ∈ G*. The surjectivity of the mapping N follows from G ⊂ B: for any S′ ∈ G* there is at least one S ∈ B such that S′ = N(S). ■
Corollary. From Theorem 4 it follows that any set of extents from dom can be associated with some (possibly empty) GC-relation, i.e. there exists a mapping EN: dom → G*. Since at the construction of the calculus all WFFs satisfied the basic restriction, in relational calculus the result of each expression is a GC-relation.
Definition 12. An extended relational calculus with tuple variables is a relational calculus extended by the functions I, K and J, and also by the string comparison and concatenation functions ⊃, ⊂, =, +.
Theorem 5. For every WFF of the calculus there exists an equivalent safe expression of the extended relational calculus with tuple variables.
Proof. Let ψ(t1, t2, …, tk) be a WFF of the calculus. Proceeding from the purpose of the query, one of the variables t1, t2, …, tk bound by an existential quantifier is moved forward in the relational calculus expression as a free tuple variable. In other words, if ψ(t1, t2, …, tk) = (∃ti)(ϕ(t1, …, ti−1, ti+1, …, tk)) then this formula is replaced by the expression G({ti | ϕ′(t1, …, ti−1, ti+1, …, tk)}). Here ϕ′ represents the formula ϕ in which all quantifiers are kept and the predicates INCLUDE, ORDER, CONTENTS, CORRESPOND and EQUAL are replaced by the corresponding formulas of relational calculus with tuple variables:
(a) INCLUDE(a, b) ∼ (a[1] ≤ b[1] ∧ a[2] ≥ b[2]);
(b) ORDER(a, b, c) ∼ (b[1] > a[2] ∧ c[1] = a[1] ∧ c[2] = b[2]);
(c) CONTENTS(a, b) ∼ (K(a) ⊇ K(b));
(d) CORRESPOND(a, b) ∼ (K(b) = J(K(a)));
(e) EQUAL(x, y) ∼ (x = y).
Next we prove the existence of a safe formula ϕ′′ equivalent to ϕ. Since the formula ϕ of the calculus is determined on the extended active extent domain edom(ϕ), the formula ϕ′ is determined on the extended active domain edom(ϕ′), i.e. it is a limited interpretation of a tuple calculus formula. By the classical theory, for any tuple calculus formula ϕ′ under limited interpretation there is an equivalent safe formula ϕ′′. As ϕ′′ is equivalent to ϕ′ and ϕ′ is equivalent to ϕ, the formula ϕ′′ is equivalent to ϕ. ■
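The five predicates translate almost verbatim into executable checks. The sketch below assumes extents are plain (begin, end) pairs and takes the mappings K (extent contents) and J (heuristic tagging) as callables, since the text defines them only abstractly.

```python
def include(a, b):
    """INCLUDE(a, b): extent a = (a1, a2) contains extent b = (b1, b2)."""
    return a[0] <= b[0] and a[1] >= b[1]

def order(a, b, c):
    """ORDER(a, b, c): c starts with a, ends with b, and b follows a."""
    return b[0] > a[1] and c[0] == a[0] and c[1] == b[1]

def contents(a, b, K):
    """CONTENTS(a, b): the text of b is a substring of the text of a;
    K maps an extent to its string contents (the mapping K of the text)."""
    return K(b) in K(a)

def correspond(a, b, K, J):
    """CORRESPOND(a, b): the contents of b is the tag that the heuristic
    tagging mapping J assigns to the contents of a."""
    return K(b) == J(K(a))

def equal(x, y):
    """EQUAL(x, y): plain equality of extents, strings, or functions."""
    return x == y
```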
According to the offered sentential calculus, a language for structured queries is designed:
Main query operator: [Let …] …
Add operator: Append (…)
Delete operator: Delete [From …] (… SuchAs …)
Modify operator: …
According to the offered method of structured document storage, different database models are designed, with the main attention given to the application of a relational DBMS for structured information storage and retrieval. A theorem on the reduction of the offered structured query language to SQL is proved.
Theorem 6. Any query composed in the offered structured query language can be reduced to an SQL query.
The practical significance of this theorem is that it allows developing an SQL query generation program for the compiler of the automated structured information storage and retrieval system.
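Theorem 6 can be illustrated under an assumed relational encoding (the thesis does not publish its schema here): if every extent is stored as a row of a hypothetical table extents(doc_id, tag, lo, hi), the predicate INCLUDE turns into the two comparisons of translation (a) above, and a structured query becomes a self-join. A sketch:

```python
def include_query(outer_tag: str, inner_tag: str) -> str:
    """Build the SQL for 'find extents tagged inner_tag INCLUDEd in an
    extent tagged outer_tag' over the hypothetical table
    extents(doc_id, tag, lo, hi); INCLUDE(a, b) becomes the comparisons
    a.lo <= b.lo AND a.hi >= b.hi.  (Production code would bind the tag
    values as query parameters rather than interpolating them.)"""
    return f"""
        SELECT b.doc_id, b.lo, b.hi
        FROM extents a
        JOIN extents b ON b.doc_id = a.doc_id
        WHERE a.tag = '{outer_tag}' AND b.tag = '{inner_tag}'
          AND a.lo <= b.lo AND a.hi >= b.hi
    """
```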
5 Structured information storage and retrieval automated system

The designed structured information storage and retrieval automated system, shown in Figure 4, includes a set of independent units joined with one another:
∑A = ∑(I)Created → ∑(II)Structured → ∑(III)Stored.
Here ∑(I) is an editor of XML documents with a graphical interface, together with a similar editor for the creation and validation of DTD rules; ∑(II) is an interactive document logical structure recognition system for XML document creation; ∑(III) is an information system for structured information storage and retrieval with the specialized query language.
The testing results of the designed learning system, represented in Figure 5, are evaluated as the ratio of the number of unrecognized units n to the total number of units N in the documents. These results show a rather fast asymptotic convergence of the obtained grammar to the common grammar of the document class. Note also that the convergence speed varies depending on the learning document sample, a fact the user should take special note of. The comparison testing of classical parsing methods, methods using the physical structure, and contextual-probability methods was carried out as the ratio of the distance d between the automatically obtained and the true structures to the total number of units N in the document; this characteristic estimates the number of corrections the user must bring to the automatically labeled document. The results of this testing are shown in Figure 6. The measurement of the query time interval of the information system measures the time necessary for the execution of each separate query [9]. The averaged time interval T(n) as a function of the number of documents n in the database is shown in Figure 7; the dependence of the time interval on the number of documents is logarithmic. Therefore the designed information system spends, at any rate, no more time on query processing than classical information retrieval methods.
Figure 4. Structured information storage and retrieval automated system (unit I: document creation; unit II: parser and document structure editor that analyze the document structure using the knowledge base — logical structure grammar, physical attributes, context probabilities — and integrate the obtained grammar into the stored structure grammar; unit III: compiler translating queries in the structured query language into SQL queries against the database).
Figure 5. Testing results of the learning system: the n/N ratio versus the document count in the learning sample, for the method using the physical structure and for the context-probability method.
Figure 6. Comparison testing of parsing methods (the d/N ratio).
Figure 7. Measurement of the query time interval of the information system: average time interval T(n), sec, versus document count n, with fitted curve y = 84.77 ln x − 199.86.
6 Conclusion

This research yields the following results:
1. As a result of a systems analysis of the information processes in management systems, the use of an automated system for structured information storage and retrieval is proposed, and the primary problems in the research and development of automated systems of this class are formulated.
2. An abstract mathematical model of the document class is devised that defines not only the physical and logical structure of documents but also the set of contextual-probability dependences of structure units.
3. Iterative methods and algorithms for representing a given finite document class using the contextual-probability model are designed and investigated.
4. Methods and algorithms of combined parsing using the physical structure and contextual-probability dependences are designed and investigated.
5. A special calculus for structured queries is offered, whose predicates are aimed at capturing the document logical structure, and a corresponding query language is designed.
6. Methods of structured information storage in relational databases with flexible logical structure representation are designed and investigated.
7. Experiments with the designed structured information storage and retrieval system yielded numerical characteristics and experimental dependences confirming the efficiency of the offered models and methods.
7 References
1. Aho A.V., Ullman J.D. The Theory of Parsing, Translation and Compiling, 1973.
2. Bapst F., Brugger R., Ingold R. Towards an interactive document structure recognition system. Internal working paper, Institute of Informatics, University of Fribourg, Switzerland, 1995.
3. Birkhoff G., Bartee T.C. Modern Applied Algebra, 1971.
4. Brugger R., Zramdini A., Ingold R. Document modeling using generalized n-grams. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, 1995.
5. Clarke Ch., Cormack G., Burkowski F. An algebra for structured text search and a framework for its implementation. Technical Report CS-94-30, University of Waterloo, Waterloo, Canada, 1994.
6. Fankhauser P., Xu Y. MarkItUp! An incremental approach to document structure recognition. Electronic Publishing: Origination, Dissemination and Design, Vol. 6(4), 1993.
7. Knuth D.E. The Art of Computer Programming, 1968–1973.
8. Kuikka E., Penttonen M. Transformation of structured documents. Electronic Publishing: Origination, Dissemination and Design, Vol. 8(4), 1995.
9. Lancaster F.W. Information Retrieval Systems: Characteristics, Testing and Evaluation, 1965.
10. Liang J. Document structure analysis and performance evaluation. PhD thesis, University of Washington, Seattle, USA, 1999.
11. Rayward-Smith V.J. A First Course in Formal Language Theory, 1974.
12. Suen C.Y., Liu K., Strathy N.W. Sorting and recognizing cheques and financial documents. Proceedings of the 5th International Conference on Document Analysis and Recognition, New York, USA, 1999.
13. Summers K.M. Automatic discovery of logical document structure. PhD thesis, Cornell University, Ithaca, USA, 1998.
14. Zimakova M.V. Mathematical Models and Methods for Automized Systems of Structured Information Processing. PhD thesis, Penza State University, Penza, Russia, 2001.