Chinese Characters in a Small World (Draft)


Shu-Kai Hsieh, Department of Computational Linguistics, Universität Tübingen, D-72076, Germany, [email protected]

Abstract

This paper presents statistical analyses of the structure of a specific network of Chinese characters. We find that this network exhibits the small world property, characterized by sparse connectivity, small average shortest paths between characters, and strong local clustering. In addition, due to its dynamic property, it appears to exhibit an asymptotic scale-free feature, with a connectivity that follows a power-law distribution. We believe these findings are important for the construction of lexical resources in Chinese NLP.

1 Introduction

The interest here is focused primarily on how the Chinese writing system behaves and how that behaviour is affected by its connectivity, seen from a statistical viewpoint. In this paper, we introduce two important quantitative features which have reshaped the current study of network science: the small world and scale-free features.

2 Background

2.1 The small world phenomenon

Research on the small world phenomenon in networks began with the study of "social networks" by sociologists in the 1960s. The small world phenomenon formalises the anecdotal notion that "you are only ever six 'degrees of separation' away from anybody else on the planet." The claim is that even when two people do not have a friend in common, they are separated by only a short chain of intermediaries (Watts 2004).

From the viewpoint of graph theory, regular graphs (networks) have high clustering and large average shortest paths, while random graphs (networks) lie at the opposite end of the spectrum, with small average shortest paths and low clustering. For a long time, nothing of interest seemed to exist between regular (or deterministic) networks and random networks. By the middle of the 1990s, networks were being discovered and studied everywhere, whether natural (e.g., biological networks) or artificial (e.g., the WWW). These networks all share a specific architecture based on a self-organizing, fat-tailed, non-Poisson distribution of the number of connections of vertices, which differs crucially from that of the "classical random graphs". The structure of networks with random connections has thus become an object of immense interest for researchers in various sciences. These new trends have also been brought into the study of lexical semantic networks; see the pioneering work carried out by Steyvers and Tenenbaum (2002).

Based on the characteristics of real-world networks, it is now recognized that small-world networks fall somewhere in between these two extremes. For the remainder of the paper, we use the term small world structure to refer to the combination of a small average shortest path length (as small as that of a random network with the same parameters) and a relatively high clustering coefficient (as high as that of a regular network). The formal definitions are given as follows:

Definition 2.1. (Path Length)

The path length L of a graph G is the median of the means of the shortest path lengths connecting each vertex $v \in V(G)$ to all other vertices. Namely, calculate $d(v, j)$ for all $j \in V(G)$ and find the mean $\bar{d}_v$ for each v. Then define L as the median of $\{\bar{d}_v\}$.

Definition 2.2. (Clustering Coefficient)
The clustering coefficient $C_v$ depicts the extent to which vertices adjacent to any vertex v are adjacent to each other,

$C_v = \frac{|E(\Gamma_v)|}{\binom{k_v}{2}}$ (1)

where the neighbourhood $\Gamma_v$ of a vertex v is the subgraph that consists of the vertices adjacent to v (not including v itself); $|E(\Gamma_v)|$ is the number of edges in the neighbourhood of v, and $\binom{k_v}{2}$ is the total number of possible edges in $\Gamma_v$ [1]. For example, suppose we have an undirected network in which a vertex v has 4 nearest neighbours, with 2 edges between these nearest neighbours. Then the clustering coefficient of this vertex is $C_v = 2 / \binom{4}{2} = 1/3$. To obtain the clustering coefficient of the whole network, we take C, the average value of $C_v$. One can easily see that the clustering coefficient of a completely connected network is equal to 1, while the clustering coefficient of a tree is 0.

[1] In fact, such subgraphs can be regarded as small loops of length 3 in the network.

For the purpose of comparison, the statistical features of a classical random graph will also be computed. Suppose that a classical random graph consists of N vertices randomly connected by M edges, with mean degree $\bar{k}$. Each pair of vertices is connected with the same probability $\approx \bar{k}/N$, where $\bar{k} = 2M/N$. So the clustering coefficient of a classical random graph is $C_{random} = \bar{k}/N$.

Definition 2.3. (Small-world network)
A small-world network is a graph G with n vertices and average degree k that exhibits $L \approx L_{random}(n, k)$ but $C \gg C_{random}$.

2.2 Scale-free Network

The term scale-free network was first coined by the physicist Albert-Laszlo Barabasi and his colleagues (Barabasi 1998). It denotes a specific kind of network in which the distribution of connectivity is extremely uneven. In scale-free networks, some nodes act as "very connected" hubs, and the node degrees follow a power-law distribution. Formally, scale-free networks are networks whose degree distribution (i.e., the fraction of nodes with k connections) behaves as:

$P(k) \propto k^{-\lambda}, \quad k \neq 0, \quad m \le k \le K,$ (2)

where λ is the exponent, m is the lower cutoff, and K is the upper cutoff. There are no nodes with degree below m or above K.

To date, extensive studies have shown that many large natural and artificial networks have both the small world and the scale-free features. In the field of lexical networks, Steyvers and Tenenbaum (2002) investigated the graph-theoretic properties of the semantic networks created from WordNet, Roget's Thesaurus, and the associative word lists built by Nelson et al. The results showed that all three lexical resources share the distinctive statistical features of both small-world and scale-free structure. This result motivates the conjecture that the same kind of property will be found to be widespread among networks of Chinese characters.

2.3 Do Chinese Characters Constitute an Affiliation Network?

The Chinese writing system employs many thousands of characters (Hanzi), which link together in a fairly sophisticated way. Quite unlike European writing systems, the Chinese writing system is constructed in such a way that it carries an intricate complex of conceptual, phonological and semantic information. In order to survey the statistical properties of Chinese characters, we need a database of characters. Whether two characters count as connected, however, depends on how we define the relation between them. Previous work such as that of Fujiwara et al. (2002) is based on 6500 Hanzi (Kanji) used in Japan, extracted from a character database (UTF2000). In that experiment, characters are decomposed into components in a tree-like hierarchy, but without explicit justification of the decomposition rules.

Figure 1: (a) Bipartite graph of characters (numerically indexed row) and components (alphabetically indexed row); (b) reduced graph from (a) containing only characters.

In this paper, we construct a bipartite network from the entries in a concept knowledge base of Chinese characters (HGD) [2]. The vertices in this bipartite network are split into two sets, A (components: A, B, C, D, E, F, ...) and B (characters: 1, 2, 3, 4, 5, 6, 7, 8, ...), in such a way that each edge has one end in A and one end in B. The construction of the network is depicted in Figure 1.
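The reduction from the bipartite graph to a character-only graph can be sketched as follows. This is a minimal illustration, not the procedure actually used with HGD: it assumes that two characters are linked whenever they share at least one component, and the function name `project_characters` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_characters(char_components):
    """Reduce a character-component bipartite graph to a character-only graph.

    char_components maps each character to the set of components it contains.
    Assumption: two characters are linked when they share a component.
    """
    # Invert the mapping: component -> characters that contain it.
    members = defaultdict(set)
    for char, components in char_components.items():
        for comp in components:
            members[comp].add(char)

    # Every pair of characters under the same component gets an edge.
    adjacency = {char: set() for char in char_components}
    for chars in members.values():
        for a, b in combinations(sorted(chars), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical toy data, not taken from HGD.
toy = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"C"}, 4: {"D"}}
reduced = project_characters(toy)
```

In the toy data, characters 1 and 2 share component B and characters 2 and 3 share component C, so the reduced graph is the chain 1, 2, 3, while character 4 stays isolated.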

3 Data

The data meet the requirement $N \ge k \ge \log(N) \ge 1$, where $k \ge \log(N)$ guarantees that a random graph will be connected. In addition, the character network considered here is an undirected sparse network. Sparseness here means $M \ll \binom{N}{2}$: each node is connected to only a small fraction of the network, in comparison with a "fully connected" graph. This is a necessary condition for the notion of small world to make sense.
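Before turning to the experiment, the two quantities from Definitions 2.1 and 2.2 can be sketched in code. This is an illustrative sketch only, assuming the network is stored as a dictionary mapping each vertex to the set of its neighbours. The function names are hypothetical, and two conventions are assumptions rather than part of the definitions: $C_v$ is taken to be 0 for vertices with fewer than two neighbours, and unreachable vertices are ignored when averaging distances.

```python
from collections import deque
from itertools import combinations
from statistics import mean, median


def clustering(adj, v):
    """C_v: edges among the neighbours of v over all possible such edges."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumption: C_v = 0 when no neighbour pair exists
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """C: the average of C_v over all vertices."""
    return mean(clustering(adj, v) for v in adj)


def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def path_length(adj):
    """L: median over v of the mean shortest-path length from v."""
    means = []
    for v in adj:
        dist = bfs_distances(adj, v)
        others = [d for u, d in dist.items() if u != v]
        if others:  # assumption: isolated vertices are skipped
            means.append(mean(others))
    return median(means)


# The worked example from Section 2.1: v has 4 neighbours joined by 2 edges.
example = {
    "v": {"a", "b", "c", "d"},
    "a": {"v", "b"}, "b": {"v", "a"},
    "c": {"v", "d"}, "d": {"v", "c"},
}
c_v = clustering(example, "v")  # 2 / 6

# A complete graph on 4 vertices and a 3-vertex path (a tree).
complete = {i: {j for j in range(4) if j != i} for i in range(4)}
path = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
```

With these inputs, `c_v` reproduces the value 1/3 from the worked example, the complete graph has clustering coefficient 1, and the tree has clustering coefficient 0.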

4 Experiment and Results

The experiment estimates both the average number of bindings per character and the probability that two randomly selected members of the character network are connected by a chain of two or more intermediate characters. As mentioned previously, two statistical quantities are used to describe the static structural properties of this network: the path length L and the clustering coefficient C. L measures the typical separation between two vertices in the graph (a global property), whereas C measures the cliquishness of a typical neighbourhood (a local property) (Watts 1998).

[2] Hanzi Genes Dictionary. http://www.cbflabs.com. Apart from pictographs, each character in this dictionary is decomposed into two parts, Character Head and Character Body, which contribute to the meaning composition. The numbers of components and characters are 256 and 6493 respectively.

Conventionally (Watts and Strogatz (1998); Watts (1999)), $\bar{L}$ can be computed as $\ln N / \ln \bar{k}$ (the random-graph estimate); C can be calculated as the ratio between the total number of edges in Γ(v) (the edges connecting the nearest neighbours of v) and the total number of possible edges in Γ(v) (all possible edges between these nearest neighbours),

$C_v = \frac{\text{number of direct links between neighbours of } v}{\text{number of all such possible links}}$ (3)

and therefore reflects the 'cliquishness' of a typical neighbourhood (Watts 1998). Further, the clustering coefficient of graph G is C, defined as the average of $C_v$ over the entire graph G.

The scale-free property, on the other hand, is defined by an algebraic behaviour in the probability distribution P(k, N) of k. Since the character network in this experiment is undirected and its vertices can be distinguished, for each vertex we can obtain the degree distribution p(k, v, N). This is the probability that the vertex v in a network of size N has k connections. Knowing the degree distribution of each vertex in the network, the total degree distribution can be calculated as:

$P(k, N) = \frac{1}{N} \sum_{v=1}^{N} p(k, v, N)$ (4)
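For a single observed network, equation (4) can be sketched in a few lines. The sketch assumes that p(k, v, N) reduces to an indicator (1 if vertex v has exactly k connections, 0 otherwise), so that P(k, N) becomes the fraction of vertices with degree k. The function names and the star-shaped toy graph are hypothetical.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)}, the fraction of vertices with exactly k connections."""
    n = len(adj)
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: c / n for k, c in counts.items()}


def mean_degree(dist):
    """First moment of the distribution: the sum over k of k * P(k)."""
    return sum(k * p for k, p in dist.items())


# Toy star graph: one hub linked to three leaves (3 edges in total).
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
dist = degree_distribution(star)
k_bar = mean_degree(dist)
edges = k_bar * len(star) / 2
```

For the star graph, a quarter of the vertices have degree 3 and the rest have degree 1, which gives a mean degree of 1.5 and recovers the 3 edges of the graph.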

The first moment of the distribution, that is, the mean degree of this network, is $\bar{k} = \sum_k k P(k)$, and the total number of edges in this network is equal to $\bar{k}N/2$ (Dorogovtsev and Mendes 2003).

Our first result is that the conceptual network is highly clustered and at the same time has a very small path length, i.e., it is a small world model in the static aspect. Specifically, $L \gtrsim L_{random}$ but $C \gg C_{random}$. Results for the network of characters, and a comparison with a corresponding random network with the same parameters, are shown in Table 1. N is the total number of nodes (characters), k is the average number of links per node, C is the clustering coefficient, and L is the average shortest path.

Table 1: Statistical characteristics of the character network: N is the total number of nodes (characters), k is the average number of links per node, C is the clustering coefficient, L is the average shortest-path length, and Lmax is the maximum length of the shortest path between a pair of characters in the network.

                        N      k     C       L
Actual configuration    6493   306   0.048   1.5
Random configuration    6493   306   0.006   1.3

Next we consider the dynamic feature of the conceptual network. The distribution of the number of connections follows a power law, which indicates a scale-free pattern of connectivity: most nodes have relatively few connections and are joined together through a small number of hubs with many connections. The degree distribution is shown in log-log coordinates, with the line showing the best-fitting power-law distribution (Figure ??).
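A straight-line fit in log-log coordinates can be sketched as follows. The fitting method used for the figure is not stated here, so ordinary least squares on (log k, log P(k)) is an assumption, and `fit_power_law` is a hypothetical name. The input distribution is synthetic and is constructed to follow $k^{-3}$ exactly; it is not the character data.

```python
import math


def fit_power_law(dist):
    """Least-squares line through (log k, log P(k)).

    Returns (lambda, intercept), where P(k) is proportional to k ** -lambda.
    """
    points = [(math.log(k), math.log(p))
              for k, p in dist.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


# Synthetic distribution proportional to k^-3 (illustrative only).
raw = {k: k ** -3.0 for k in range(1, 51)}
total = sum(raw.values())
synthetic = {k: value / total for k, value in raw.items()}
exponent, _ = fit_power_law(synthetic)
```

On the synthetic input the fitted exponent comes out at 3 up to rounding error. On real degree data, a least-squares fit of this kind is sensitive to the noisy tail of the distribution, so the exponent it yields should be read as a rough estimate.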

5 Conclusion

In conclusion, the character network we consider here shares statistical features with other lexical resources in both its small-world and its scale-free structure: a high degree of sparsity, a single connected component containing the vast majority of nodes, very short average distances between nodes, high local clustering, and a power-law degree distribution with an exponent near 3 for the undirected networks. The real character network, if it exists, may be more complicated than the thumbnail sketch presented here. However, the statistical regularities that we have uncovered in this paper could be helpful in thinking about the construction of Chinese lexical resources.

References

Fujiwara, Y., Y. Suzuki and T. Morioka. 2002. Network of Words. Artificial Life and Robotics.
Dorogovtsev, S. N. and J. F. F. Mendes. 2003. Evolution of Networks. Oxford University Press, Oxford, UK.
Steyvers, M. and J. B. Tenenbaum. 2002. The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth. Cognitive Science.
Watts, D. J. and S. H. Strogatz. 1998. Collective dynamics of 'small-world' networks. Nature 393:440-42.
Watts, D. J. 2004. Small worlds: The dynamics of networks between order and randomness. Princeton University Press.
