Low-Complexity Detection of POI Boundaries Using Geo-Tagged Tweets: A Geographic Proximity Based Approach

Dung D. Vu
Dankook University, Yongin 448-701, Republic of Korea
[email protected]

Won-Yong Shin
Dankook University, Yongin 448-701, Republic of Korea
[email protected]

ABSTRACT

Users tend to check in and post their statuses in location-based social networks (LBSNs) to indicate that their interests are related to a point-of-interest (POI). Since the relevance of the data to the POI varies according to the geographic distance between the POI and the locations where the data are generated, it is important to characterize an area-of-interest (AOI) so that the location information can be used in a variety of businesses, services, and place advertisements. While previous studies on discovering AOIs were based mostly on density-based clustering methods applied to geo-tagged photos collected from LBSNs, we focus on detecting a POI boundary, which corresponds to only one cluster containing its POI center. Using geo-tagged tweets recorded from Twitter users, this paper introduces a low-complexity two-phase strategy to detect a POI boundary by finding a suitable radius reachable from the POI center. We detect a polygon-type boundary of the POI as the convex hull (i.e., the outermost region) of selected geo-tags through our two-phase approach, where each phase proceeds with a different radius increment, thus yielding a more precise boundary. It is shown that our approach outperforms the conventional density-based clustering method in terms of runtime complexity.

Categories and Subject Descriptors
J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms
Algorithms, Human Factors, Measurement

Keywords
Area-of-Interest (AOI), Geographic Distance, Geo-Tagged Tweet, Point-of-Interest (POI) Boundary, Two-Phase Approach, Twitter

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
LBSN’15, November 03, 2015, Bellevue, WA, USA
© 2015 ACM. ISBN 978-1-4503-3975-9/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2830657.2830663

1. INTRODUCTION

Location-based social networks (LBSNs) such as Foursquare and Flickr have grown rapidly in recent years. They provide a platform for millions of users to share location-tagged media content such as photos, videos, music, and texts. Owing to the location information from geo-tags, there has been a steady push to study a variety of point-of-interest (POI) issues [1–3] through LBSNs. In general, when users visit a POI, they are likely to check in online and post their statuses to indicate that their interests are related to the POI. There have been two types of studies in the literature that reveal and utilize the characteristics of POIs for various applications: POI recommendation and area-of-interest (AOI) discovery. Because the relevance of the data to the POI varies according to the geographic distance between the POI and the positions where the data are generated, it is of fundamental importance to detect AOIs [4–11].

Previous studies on discovering AOIs were based mostly on density-based clustering methods applied to geo-tagged photos collected from LBSNs. Density-based spatial clustering of applications with noise (DBSCAN) [5–7, 11, 12] is the most commonly used density-based clustering algorithm, even though it was not originally designed for AOI discovery. DBSCAN can find multiple arbitrarily shaped clusters with an overall average runtime complexity of $O(n_{in} \log n_{in})$ [12] (and a worst-case complexity of $O(n_{in}^2)$), where $n_{in}$ denotes the number of input records.

In our work, by reflecting the geographic proximity for POIs, we focus on characterizing a “POI boundary”, which is defined as a single cluster with a convex hull shape that contains the corresponding POI center. That is, this POI boundary represents one high-density cluster within the discovered AOI (possibly the cluster with the highest density) on a much smaller scale. We aim at detecting such a boundary, which corresponds to the most attractive cluster to which users pay attention for the POI, at the cost of a runtime complexity that scales only linearly in $n_{in}$. By detecting POI boundaries, one can think of a variety of applications including, but not limited to:

• Location advertisements: As a marketing strategy of companies aimed at a POI (e.g., shopping malls), online leaflets/brochures will be disseminated only to the people who come to visit the place. Thus, company managers will not only be aware of the explicit marketing zone but also reduce the marketing cost.

• Traffic control: When there is a festival at a POI, traffic congestion can be significantly reduced by recommending the best route based on the POI boundary for festival participants.

Instead of geo-tagged photos collected from LBSNs, we utilize geo-tagged tweets on Twitter [13–15], which holds substantially more user accounts and records than LBSNs. More specifically, to determine how POI boundaries are geographically formed with a convex hull shape, we analyze a single-source dataset that contains a huge number of geo-tagged tweets from users in both the United Kingdom (UK) and the United States (US). These two location sets were selected as demographically comparable, leading adopters of Twitter with sufficient data to enable meaningful comparative analysis for our intentionally exploratory study. Then, this paper introduces a new two-phase strategy to detect a POI boundary, whose computational complexity scales linearly with the number of input records for the algorithm (i.e., $O(n_{in})$) and thus is much lower than that in [5–7, 11]. The proposed detection algorithm is composed of the following steps:

• We first describe the POI with Wikipedia concepts.

• Rather than exploring topical similarities with check-in data in LBSNs, and since users are allowed to tweet within a 140-character limit, we collect all the relevant geo-tagged tweets through query processing, i.e., the tweets whose text contains the POI name (e.g., full name and abbreviated name).

• Thereafter, we compute the distance between the POI center and the location where each tweet is posted.

• After simply finding a suitable radius reachable from the POI center, we detect a polygon-type boundary of the POI as the convex hull (i.e., the outermost region) of selected geo-tags through the two-phase approach. To provide a more precise result, each phase proceeds with a different radius increment.

Owing to a huge amount of geo-tagged Twitter data, our analysis provides a much more fine-grained POI boundary with linear scaling runtime complexity.

2. DATASET

We use a dataset collected via the Twitter Streaming API. The dataset consists of a large number of geo-tagged tweets recorded from Twitter users from July 29, 2015 to August 29, 2015 (about one month) in the following two countries: the US and the UK. Note that this short-term (one month) dataset is sufficient to detect boundaries for widely-known POIs. The dataset collected in the UK consists of 18,682,819 geo-tagged tweets from 629,881 different users, while the dataset collected in the US consists of 58,118,361 geo-tagged tweets from 2,139,483 different users. Each tweet contains a number of entities that are distinguished by their field names. For data analysis, we adopted the following four essential fields from the metadata of tweets:

• user_id_str: string representation of the unique identifier for a certain user
• text: actual UTF-8 text of the status update containing a POI name
• lat: latitude of the tweet’s location
• lon: longitude of the tweet’s location

We first represent each POI with Wikipedia concepts. Through query processing, we then obtain the filtered geo-tagged tweets whose text field is associated with the POI names. Four POIs, two located in London and two in Los Angeles, are used for our analysis. In our work, POIs are not restricted to those considered here, but can be any of a whole variety of points potentially interesting to tourists and places where people tend to check in. We selected POIs whose boundary sizes are expected to differ from one another.

Table 1: Four POIs

POI name                                   (latitude, longitude)          Number of geo-tagged tweets   Initial radius $\Delta r_1$ (m)
London Eye                                 (51.503300°, −0.119700°)       2,178                         80
Victoria and Albert Museum (V&A)           (51.496667°, −0.171944°)       1,098                         120
Dodger Stadium                             (34.072686°, −118.240603°)     3,666                         120
Los Angeles International Airport (LAX)    (34.053718°, −118.242642°)     4,100                         2,000

Representative attributes of the four POIs are summarized in Table 1. The second and third columns of Table 1 represent the POI center’s coordinate and the total number of geo-tagged tweets whose text contains the POI names, respectively. As a rule of thumb, an approximate minimum distance from the POI center covering the geographic area of the POI is obtained from Google Maps and is referred to as the initial radius $\Delta r_1$, which is shown in the last column of Table 1. As depicted in the table, since we selected different types of POIs, the number of relevant geo-tagged tweets and the initial radius differ significantly across the POIs.
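The query-processing step, which keeps only the tweets whose text field contains a POI name, might be sketched as follows. This is a minimal illustration rather than the pipeline used to build the dataset: the flat dictionary layout of a tweet record, the function names, and the case-insensitive substring match are all assumptions.

```python
def matches_poi(text, poi_names):
    """True if any POI name (full or abbreviated) occurs in the tweet text.

    Case-insensitive substring matching is an assumed rule for illustration.
    """
    lowered = text.lower()
    return any(name.lower() in lowered for name in poi_names)


def filter_tweets(tweets, poi_names):
    """Keep (text, (lat, lon)) for tweets that mention the POI.

    `tweets` is assumed to be an iterable of dicts carrying the four
    fields named in this section: user_id_str, text, lat, lon.
    """
    return [
        (t["text"], (t["lat"], t["lon"]))
        for t in tweets
        if matches_poi(t["text"], poi_names)
    ]
```

Plain substring matching is deliberately simple. A short abbreviation can also match inside unrelated words, so a real query would likely need word-boundary handling.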

3. METHODOLOGY

We start by introducing the following definition of “POI boundary”.

Definition 1. A POI boundary is a single convex-hull-type high-density cluster that contains the corresponding POI center. Within the boundary, all annuli created with a predetermined radius increment from the POI center should include at least one geo-tag.

Note that our definition differs from the conventional definition of AOI [6, 7, 11], which may consist of more than two high-density clusters on a larger scale. This POI boundary commonly reveals the highest density among all the clusters.

Now, we describe our two-phase detection for a precise POI boundary. As mentioned earlier, we are interested in finding a suitable radius for each POI. Let $(t_u, l_u)$ and $c$ denote the geo-tagged textual data of user $u$ and the coordinate of a POI center, respectively, where $t_u$ and $l_u$ are the text and the coordinate, respectively, of user $u$. We assume that $D_{in}$ and $D_{out}$ indicate the set of all geo-tagged tweets whose text includes a POI name and the set of geo-tagged tweets within the detected POI boundary, respectively. We also denote the set of geo-tagged data within a circle having center $c$ and radius $r$ by $D_{(c,r)}$. The geographic distance between the POI center $c$ and the location of user $u$, denoted by $d(c, l_u)$, can be computed using the spherical law of cosines, which gives a well-conditioned estimate down to distances as small as 1 meter.

For the first phase, given a POI, we start by using a circle centered at $c$ with radius $r_i = \Delta r_1 > 0$, which is a discretely increasing variable. The radius $r_i$ is increased by $\Delta r_1$ at each step if a certain condition is satisfied, and the update history is archived. When $d(c, l_u)$ is smaller than $r_i$, it follows that $(t_u, l_u) \in D_{(c,r_i)}$. Now, we focus on describing the update condition for the first phase. Let us define $dn_1 \triangleq |D_{(c,r_i)} \setminus D_{(c,r_{i-1})}|$, which is the number of geo-tagged tweets in the corresponding annulus. If $dn_1$ is greater than a given threshold $\Delta\eta$, then the three variables $r_i$, $D_{(c,r_i)}$, and $dn_1$ for the POI are updated iteratively. Otherwise, this process is terminated. Here, the threshold $\Delta\eta > 0$ can be determined adaptively according to the radius $\Delta r_1$, which will be specified later.

Let us turn to the second phase, which yields a more fine-grained result compared to the single-phase method. Let us start by using a circle centered at $c$ with radius $r_{i,j} = r_{i-1} > 0$. We denote the radius increment for this phase by $\Delta r_2$, which is set to a value greater than the distance due to the GPS error. More precisely, we set $\Delta r_2 = \frac{\Delta r_1}{Q}$, where $Q > 1$ is the parameter representing the interval granularity. Likewise, $r_{i,j}$ is increased by $\Delta r_2$ at each step under another condition, and the update history is also archived. In this phase, we define $dn_2 \triangleq |D_{(c,r_{i,j})} \setminus D_{(c,r_{i,j-1})}|$, which is the number of geo-tagged tweets in the corresponding annulus. If $dn_2 \ge 1$, then $r_{i,j}$, $D_{(c,r_{i,j})}$, and $dn_2$ are updated iteratively. Otherwise, this process is terminated, and the set $D_{out} = D_{(c,r_{i,j})}$ is finally obtained. The overall procedure is summarized in Algorithm 1. Thereafter, by solving the convex-hull problem (e.g., with quickhull), we can find the smallest convex polygon that contains the given points in $D_{out}$ as well as the POI center, which corresponds to the POI boundary.
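The final convex-hull step can be illustrated with a short sketch. The text names quickhull as one option; the code below swaps in Andrew's monotone chain instead, because it is compact and produces the same hull. Treating (longitude, latitude) pairs as planar coordinates is an assumption that is only reasonable over small areas, and the function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def poi_boundary(selected_locations, center):
    """Hull of the selected geo-tags together with the POI center."""
    return convex_hull(list(selected_locations) + [center])
```

Adding the POI center to the point set before computing the hull mirrors the requirement that the polygon contain the center as well as the selected geo-tags.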

Algorithm 1 Two-phase detection algorithm.
Input: D_in, c, ∆r_1, and ∆r_2
Output: D_out
Initialization: i ← 1; j ← 0; r_i ← 0; r_{i,j} ← 0; dn_1 ← 0; dn_2 ← 0;
                D(c, r_i) ← {(t_u, l_u) | d(c, l_u) < ∆r_1, (t_u, l_u) ∈ D_in}
do
    i ← i + 1
    r_i ← r_i + ∆r_1
    D(c, r_i) ← {(t_u, l_u) | d(c, l_u) < r_i, (t_u, l_u) ∈ D_in}
    dn_1 ← |D(c, r_i) \ D(c, r_{i−1})|
while dn_1 > ∆η
i ← i − 1
r_{i,j} ← r_i
do
    j ← j + 1
    r_{i,j} ← r_{i,j} + ∆r_2
    D(c, r_{i,j}) ← {(t_u, l_u) | d(c, l_u) < r_{i,j}, (t_u, l_u) ∈ D_in}
    dn_2 ← |D(c, r_{i,j}) \ D(c, r_{i,j−1})|
while dn_2 ≥ 1
D_out ← D(c, r_{i,j})
return D_out
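As a rough illustration, the two phases can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the function names, the Earth radius constant, and the choice to start the first phase at radius ∆r_1 (following the textual description of the first phase) are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumed constant


def spherical_cosine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres via the spherical law of cosines."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    # Clamp against rounding so acos never sees a value outside [-1, 1].
    return EARTH_RADIUS_M * math.acos(max(-1.0, min(1.0, cos_angle)))


def count_in_annulus(distances, inner, outer):
    """Number of geo-tags with inner <= distance < outer."""
    return sum(1 for d in distances if inner <= d < outer)


def two_phase_radius(distances, dr1, q, eta):
    """Final radius after the coarse phase (step dr1) and fine phase (step dr1 / q)."""
    # Phase 1: grow by dr1 while the newest annulus holds more than eta geo-tags.
    r = dr1
    while count_in_annulus(distances, r, r + dr1) > eta:
        r += dr1
    # Phase 2: grow by dr2 while the newest annulus holds at least one geo-tag.
    dr2 = dr1 / q
    while count_in_annulus(distances, r, r + dr2) >= 1:
        r += dr2
    return r


def detect_boundary_points(tweets, center, dr1, q, eta):
    """Return the (text, (lat, lon)) pairs inside the final radius.

    `tweets` is assumed to be an iterable of (text, (lat, lon)) pairs
    and `center` a (lat, lon) pair.
    """
    tagged = [
        (spherical_cosine_distance(center[0], center[1], loc[0], loc[1]), text, loc)
        for text, loc in tweets
    ]
    radius = two_phase_radius([d for d, _, _ in tagged], dr1, q, eta)
    return [(text, loc) for d, text, loc in tagged if d < radius]
```

For example, with dr1 = 100, q = 10, and eta = 3, geo-tags at distances of 120, 130, 140, and 150 m let the first phase advance to 200 m, and further geo-tags at 205, 215, and 221 m then let the second phase advance in 10 m steps to 230 m.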

4. ANALYSIS RESULTS

Using the proposed detection algorithm in Section 3, we first show our experimental results. In our work, we simply assume $Q = 10$. Then, a suitable $\Delta\eta$ can be set to 100 according to the relationship between $\Delta r_1$ and $\Delta r_2$. The detected POI boundaries on Google Maps are illustrated in Figure 1. One blue circle and red pins indicate the POI center and the selected geo-tagged records in the set $D_{out}$, respectively (note that some pins almost overlap).

Figure 1: Detection of POI boundaries. (a) London Eye; (b) Victoria and Albert Museum (V&A); (c) Dodger Stadium; (d) Los Angeles International Airport (LAX).

Next, we perform comparative studies between the state-of-the-art DBSCAN algorithm and the proposed two-phase algorithm in terms of computational complexity. We evaluated the overall average runtime, measured as the CPU time charged for the execution of instructions of the calling process, in detecting the LAX boundary using the dataset in Section 2. The DBSCAN algorithm returns 37 clusters, where the radius and the minimum number of neighbors, which are the two parameters of DBSCAN, are set to 2 km and 5, respectively. To obtain the POI boundary, the one high-density cluster out of the 37 clusters that includes the POI center is then chosen (that is, the rest of the clusters are filtered out). In this case, the maximum distance from the POI center within the boundary is 3.782 km. On the other hand, the proposed algorithm immediately returns only one cluster. In our scheme, by setting $\Delta r_1 = 2$ km, $Q = 10$, and $\Delta\eta = 100$, the maximum distance reachable from the POI center is 3.872 km, which is quite similar to the DBSCAN case.

In Figure 2, the horizontal and vertical axes represent the number of geo-tagged tweets in the set $D_{in}$, $n_{in} \triangleq |D_{in}|$, and the execution time in seconds, respectively. As long as $|D_{out}|$ scales more slowly than $n_{in}$ (i.e., $|D_{out}| = o(n_{in})$), one can see that the complexity of the proposed algorithm is $O(n_{in})$. On the other hand, the complexity of the DBSCAN algorithm is known to scale as $n_{in} \log n_{in}$. For large $n_{in}$, the performance gap between the two methods increases significantly. Asymptotic curves are also shown in Figure 2, and their trends are consistent with our experimental results.
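One way to see why linear scaling in the number of input records is plausible is that the annulus counts needed by both phases can be gathered in a single pass over the distances. The sketch below is a hypothetical reformulation for illustration only, not the implementation that was timed for Figure 2. It assumes that Q is an integer, so that every coarse annulus is exactly Q fine annuli, and the function name is invented.

```python
from collections import Counter


def two_phase_radius_binned(distances, dr1, q, eta):
    """Two-phase radius search using one pass over the distances.

    Assumes q is a positive integer. Fine bin k counts the geo-tags with
    k * dr2 <= d < (k + 1) * dr2, where dr2 = dr1 / q.
    """
    dr2 = dr1 / q
    fine = Counter(int(d // dr2) for d in distances)  # the single O(n) pass

    def coarse(i):
        # Geo-tags with i * dr1 <= d < (i + 1) * dr1, summed from q fine bins.
        return sum(fine.get(i * q + k, 0) for k in range(q))

    # Phase 1: advance by one coarse annulus while it holds more than eta geo-tags.
    i = 1
    while coarse(i) > eta:
        i += 1
    # Phase 2: advance by one fine annulus while it is non-empty.
    j = i * q
    while fine.get(j, 0) >= 1:
        j += 1
    return j * dr2
```

After the counting pass, the two loops only touch the bins between the POI center and the final radius, so their cost depends on the number of radius steps rather than on the number of tweets.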

Figure 2: Runtime complexity

5. ACKNOWLEDGMENT

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2014R1A1A2054577).

6. REFERENCES

[1] Q. Yuan, G. Cong, Z. Ma, A. Sun, and N. Magnenat-Thalmann. Time-aware point-of-interest recommendation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’13), pages 363–372, July 2013.

[2] M. Ye, P. Yin, W.-C. Lee, and D.-L. Lee. Exploiting geographical influence for collaborative point-of-interest recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’11), pages 325–334, July 2011.

[3] J.-W. Son, A.-Y. Kim, and S.-B. Park. A location-based news article recommendation with explicit localized semantic analysis. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’13), pages 293–302, July 2013.

[4] M. Berg, W. Meulemans, and B. Speckmann. Delineating imprecise regions via shortest-path graphs. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL2011), pages 271–280, November 2011.

[5] J.-K. Parker and J.-A. Downs. Footprint generation using fuzzy-neighborhood clustering. Geoinformation, 17(2): 285–299, April 2013.

[6] J. Liu, Z. Huang, L. Chen, H.-T. Shen, and Z. Yan. Discovering areas of interest with geo-tagged images and check-ins. In Proceedings of the 20th ACM International Conference on Multimedia (MM’12), pages 589–598, October 2012.

[7] D. Laptev, A. Tikhonov, P. Serdyukov, and G. Gusev. Parameter-free discovery and recommendation of areas-of-interest. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL2014), pages 113–122, November 2014.

[8] P. Brindley, J. Goulding, and M.-L. Wilson. A data driven approach to mapping urban neighbourhoods. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL2014), pages 437–440, November 2014.

[9] E. Cunha and B. Martins. Using one-class classifiers and multiple kernel learning for defining imprecise geographic regions. Geographical Information Science, 28(11): 2220–2241, November 2014.

[10] C. Grothe and J. Schaab. Automated footprint generation from geotags with Kernel Density Estimation and Support Vector Machines. Spatial Cognition & Computation: An Interdisciplinary, 9(3): 195–211, August 2009.

[11] S. Kisilevich, F. Mansmann, and D. Keim. P-DBSCAN: A density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In Proceedings of the 1st International Conference and Exhibition of Computing for Geospatial Research & Application (COM.Geo2010), June 2010.

[12] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Data Mining and Knowledge Discovery, 96(34): 226–231, 1996.

[13] Y. Takhteyev, A. Gruzd, and B. Wellman. Geography of Twitter networks. Social Networks, 34(1): 73–81, January 2012.

[14] J. Kulshrestha, F. Kooti, A. Nikravesh, and K. P. Gummadi. Geographic dissection of the Twitter network. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM-12), pages 202–209, June 2012.

[15] W.-Y. Shin, B. C. Singh, J. Cho, and A. M. Everett. A new understanding of friendships in space: Complex networks meet Twitter. Journal of Information Science, to appear.
