Predicting the Number of Mobile Subscribers: An Accurate Forecasting System and its Application

Chang Liu*, Stefan Valentin+, and Michael Tangemann+
* Department of Electrical Engineering, Columbia University, New York City, United States
+ Bell Labs, Alcatel-Lucent, Stuttgart, Germany
Abstract—Forecasting economic quantities can bring high benefits to business planning and operation. To provide accurate forecasts and to investigate the factors behind the trends, we present a new prediction system in this paper. Applying it to the Number of Mobile Subscribers (NMS) in the United States, China, and Germany, we observe interesting differences between saturated and emerging markets. Moreover, we find that prediction accuracy increases substantially when (i) external factors are systematically included in the prediction model, while factors with ambiguous effects are removed, and (ii) the time lag between a factor's change and its use for prediction is optimized. By doing both, our system is not only more accurate than common forecasting methods but also reveals if and how quickly a factor affects a trend's accuracy.
I. INTRODUCTION
The Number of Mobile Subscribers (NMS) is highly relevant for planning network extensions, operating cellular networks, and developing business models. Thus, operators and vendors aim (i) to understand which social and economic factors affect the NMS per market and (ii) to predict how this important quantity will evolve within the upcoming years.

Accurately providing such trends is difficult. Without a "crystal ball", the NMS for the next years is not available. This makes it impossible to measure the accuracy of a prediction method and the resulting trend. Moreover, an NMS statistic is a given quantity. Unlike experimental data, it cannot be obtained for different situations of the same market. This makes it difficult to directly extract how individual factors affect the NMS. Understanding the effects behind a trend and assessing its accuracy, thus, requires further effort.

A. Contributions
To analyze and to predict the NMS, we make three contributions. First, we combine methods from econometrics, optimization, and experimental design into a prediction system. This new system models arbitrary economic quantities, such as the NMS. Extrapolating the model provides a trend. By establishing this trend for previous years and by comparing it to historical data, we can measure the prediction's accuracy. Second, we investigate the social and economic factors behind the trends. Here, we utilize that our prediction system automatically derives those factors and model parameters leading
Chang Liu’s work was supported by the program RISE of the DAAD and by the program PhD@BL of Bell Labs, Stuttgart.
to highest accuracy. By observing these optimal values, we can rate how strongly a factor affects the NMS accuracy and how long this takes. For instance, while a population change quickly affects the NMS in Germany and in the United States (US), it needs an average of 6 years to influence the Chinese market. Further differences between saturating and developing markets are presented in Section IV. While we obtained these results by studying the past, finally, we provide forecasts for the upcoming years. As future data is missing, the accuracy of our trends for the US, Germany, and China cannot be evaluated. However, our prediction system assures that all model parameters and factors are optimal with respect to historical data.

B. Related work
Business analysts publish various predictions of the NMS. Many of these forecasts are based on qualitative assessments [4], ambiguously mix the NMS with traffic forecasts [5], or provide only little information on the quantitative method [5,6], the method's accuracy [5,6], and the employed data sets [5]. This makes an objective interpretation difficult, if not impossible. However, there is some insight on forecasting methods in telecommunications. Methods for forecasting traffic were summarized in [7] and their accuracy was studied in [8]. After comparing various models to extrapolation, the authors conclude that regression-based models (e.g., ARIMA) provide the highest accuracy [8]. By using a linear regression model, we follow this approach. However, unlike autoregressive models such as ARMA or ARIMA, we base our prediction on external factors. To obtain these factors, we combine regression with methods from convex optimization and factorial design. Such a combination has not been proposed in the literature so far. A wide range of forecasting methods for NMS and traffic is reviewed in [9].
After surveying various methods (e.g., based on diffusion or choice modeling), the author highlights that, unlike other fields, telecommunication completely lacks forecasting methods that have been evaluated for their effectiveness [9, p. 37]. This strong conclusion motivates our rigorous study in Section IV. Although we cannot cover the various models from the literature, we objectively compare our prediction system to two standard forecasting methods and to real data.
C. Structure
Section II describes basic methods from econometrics, optimization, and experimental design, which are combined into a prediction system in Section III. Applying this system to historical data in Section IV, we study the prediction accuracy and the underlying effects. Finally, Section V presents our NMS forecasts and Section VI concludes the paper.

II. BASIC METHODS
Before we detail our prediction system, let us briefly describe the underlying methods for prediction and factor screening.

A. Curve Fitting
Curve fitting aims for the closest fit of a model function to the observed data. More precisely, for a given model function F and data vector D = (D_1, ..., D_T), the parameters a_1, a_2, ..., a_M have to be chosen such that a norm of the residuals r := D − F is minimal. Minimizing the residual norm ||r||_2^2 = Σ_{t=1}^{T} (D_t − F_t)^2, we can state this problem as the least sum of squares (LSS)

  minimize_{a_1,...,a_M}  Σ_{t=1}^{T} ( D_t − F(t; a_1, ..., a_M) )^2 .   (1)
As long as the model function F is convex, the residual norm maintains convexity and standard algorithms can reliably solve (1) in quadratic time [1, Sec. 1.2.1].

B. Regression with external factors
The core method of our prediction system is a linear regression model with external factors, sometimes called the Causal Method [2]. Assuming that K factors linearly and independently affect the predicted variable, this model is given by

  Y = b_0 + b_1 X_1 + ... + b_K X_K + e ,   (2)

where the value Y is predicted using the factor values X_1, ..., X_K and the model parameters b_0, ..., b_K. The prediction error is expressed by the residual e ~ N(0, s^2). Since a trend is a time series, we extend (2) by the time axis t ∈ [1, T] to

  Y = X · b + e ,   (3)
where Y and e are now time-vectors, each of T elements. As above, the model parameter vector b has K+1 elements, and the regressor matrix X is of size T × (K+1) with X_{t,1} = 1 for all t. The residuals e of model (2) are calculated by choosing the model parameters in b such that the residual norm ||e||_2^2 is minimal. Consequently, b can be found by LSS minimization similar to (1). Alternatively, for an arbitrarily small residual norm, we can solve (3) for b by inverting X. Then,

  b = (X^T X)^{−1} X^T Y   (4)
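In matrix form, (4) is a one-liner. The following sketch uses synthetic data (all values hypothetical); in practice, np.linalg.lstsq solves the same least-squares problem more stably than forming (X^T X)^{−1} explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 20, 3                      # T observations, K external factors

# Regressor matrix X of size T x (K+1); first column is all ones (X_t,1 = 1).
X = np.hstack([np.ones((T, 1)), rng.normal(size=(T, K))])
b_true = np.array([2.0, 0.5, -1.0, 3.0])
Y = X @ b_true + rng.normal(0, 0.01, T)   # model (3) with small residuals

# Equation (4): b = (X^T X)^(-1) X^T Y, written as a linear solve;
# lstsq computes the same minimizer without the normal equations.
b_normal = np.linalg.solve(X.T @ X, X.T @ Y)
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Both routes recover the generating parameter vector up to the injected noise.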
provides all parameters of the linear regression model.

C. Factor screening
To screen for the factor set that maximizes the prediction accuracy, we combine the regression model with a modified 2^K factorial design [3, Sec. 12.2]. Unlike the typical factorial design, we do not study high and low level effects but either deactivate a factor by level 0 or activate a factor by level 1. Excluding the trivial case, where all factors are deactivated, leads to N = 2^K − 1 factor combinations. We express these combinations by the experimental design matrix D ∈ {0,1}^{N×K}, where we insert the unit column by D' = [1 D] to account for the first element in b. For instance, K = 3 factors yield

  D'^T = [ 1 1 1 1 1 1 1
           0 0 0 1 1 1 1
           0 1 1 0 0 1 1
           1 0 1 0 1 0 1 ] .

Using D', we can now formulate the regressor matrix for each factor combination n ∈ [1, N] by

  ∀ n ∈ [1, N]:  X'_t = D'_n ∘ X_t ,   (5)
where the operator ∘ calculates the Hadamard product between row n of the experimental design matrix and row t of the regressor matrix. Iterating over the whole time axis yields one full regressor matrix X' per factor combination, which we insert in (3) to obtain Y for each n. Using these N predicted vectors to choose the "best" factor set will be described in the next section.

III. PREDICTION SYSTEM
Let us now combine the above methods into a prediction system. After describing the main structure of the system, we will detail how to measure its accuracy. Finally, we will reveal the time lag as an important model parameter and describe its optimization.

A. System structure
Figure 1 illustrates the prediction system. This system operates on K factors, the main quantity Y (e.g., NMS), and two time periods. For the base period [1, T_b] we use historical data, while models and trends are derived for the prediction period [1, T_p]. As we split the complete time axis t ∈ [1, T] into these two periods, naturally, T = T_b + T_p. To predict the main quantity, our regression model relies on the regressor matrix X. As these external factor values have to be predicted for [1, T_p], the first function in Figure 1 is curve fitting.

Figure 1. Structure of the prediction system. For each factor k, curve fitting (with time lag optimization) provides the extrapolated factor values F_{t,k}(a*_{1,k}, ..., a*_{M,k}, Δt*_k), which feed the Causal Method for the main quantity (e.g., NMS); its historical fit is fed back to time lag optimization and factor screening, yielding Y(b, Δt*_0).
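The factor screening loop of Sections II.C and III.A can be sketched as follows: enumerate all 2^K − 1 activation masks, fit the regression for each masked regressor matrix per (5), and keep the MAPE-minimizing factor set. The data and factor count below are hypothetical, chosen only to illustrate the procedure:

```python
import numpy as np
from itertools import product

def mape(Y_hat, A):
    """Mean Absolute Percentage Error, Eq. (6)."""
    return np.mean(np.abs(Y_hat - A) / A)

def screen_factors(X, A):
    """Try all 2^K - 1 non-trivial factor combinations (Sec. II.C).

    X: T x (K+1) regressor matrix with a leading column of ones.
    A: actual data over the base period.
    Returns the best activation mask and its historical-fit MAPE.
    """
    K = X.shape[1] - 1
    best = None
    for mask in product((0, 1), repeat=K):
        if not any(mask):
            continue                     # exclude the all-deactivated case
        d = np.array((1,) + mask)        # design row D'_n with unit entry
        Xp = X * d                       # Eq. (5): Hadamard product per row
        b, *_ = np.linalg.lstsq(Xp, A, rcond=None)
        err = mape(Xp @ b, A)
        if best is None or err < best[1]:
            best = (mask, err)
    return best

# Hypothetical example with K = 3 factors; only factors 1 and 3 matter.
rng = np.random.default_rng(2)
T = 25
X = np.hstack([np.ones((T, 1)), rng.uniform(1, 2, (T, 3))])
A = 4.0 + 2.0 * X[:, 1] + 3.0 * X[:, 3] + rng.normal(0, 0.01, T)
mask, err = screen_factors(X, A)
```

Any winning mask must activate the two relevant factors; whether the irrelevant one is also kept depends on how much of the noise it happens to absorb in the historical fit.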
For each factor k ∈ [1, K], curve fitting matches a model function F_{1,k}, ..., F_{T_b,k} to the data D_{1,k}, ..., D_{T_b,k}. Both time-vectors are given for the base period and will be defined in Section IV.A. The model function is initialized with the starting parameters a_{1,k}, ..., a_{M,k} until solving (1) yields the optimal parameters a*_{1,k}, ..., a*_{M,k} that minimize the residuals. Using this "best fit", the Regression block computes the model function for each factor. This extrapolates the regressor matrix over the prediction period to X_{T_b+1,k}, ..., X_{T_b+T_p,k}. Adding this extrapolation to the data-based regressors X_{1,k}, ..., X_{T_b,k}, finally, provides X over the full period T = T_b + T_p.

Based on the complete X, the Regression block can obtain the model vector b by solving (4). Inserting b in (3) provides the main quantity Y for both time periods. We call Y_1, ..., Y_{T_b} the historical fit and Y_{T_b+1}, ..., Y_{T_b+T_p} the trend. As illustrated in Figure 1, Y_1, ..., Y_{T_b} are fed back to the functions time lag optimization and factor screening. These functions choose the free parameter time lag Δt or the factor set, respectively, such that the Mean Absolute Percentage Error (MAPE) of the historical fit is minimal. To do so, the factor screening block computes the MAPE and Y for each factor combination. Out of these N combinations, we select the factor set for which the MAPE over [1, T_b] is minimal. Computing the MAPE and optimizing the time lag will be described next.

B. Measuring prediction accuracy
To optimize the factor set and the time lag as well as to rate the quality of our prediction, we have to measure prediction accuracy. To do so, we compute the Mean Absolute Percentage Error (MAPE) as

  E = (1/T) Σ_{t=1}^{T} |Y_t − A_t| / A_t .   (6)
This error metric is common in trending [2] and it intuitively captures the deviation of the predicted value Y_t from the actual data A_t as a percentage of A_t. For a perfect fit, the MAPE is zero. Note that (6) causes a division by zero if A_t = 0; such data values, however, do not exist for the NMS in the studied time span.

C. Time lag optimization
This optimization follows the intuition that a change of a factor may need some time until it affects the predicted quantity. Consequently, excluding recent data from the base period can lead to a more accurate prediction. Defining this exclusion period as the time lag Δt and searching for the time lag that minimizes the MAPE over [1, T_b − Δt], we can not only compensate for recent outliers in the factors (e.g., baby booms, GDP spikes). Moreover, we identify the delay between the most relevant (i.e., MAPE-minimizing) factor change and its effect on the trend's accuracy.

As illustrated in Figure 1, time lag optimization is done separately for the regression model and for curve fitting. Curve fitting extrapolates each factor individually and, thus, requires finding K optimal time lags Δt*_1, ..., Δt*_K. An optimal time lag Δt*_k is found by step-wise decreasing the base time period [1, T_b − Δt_k], where Δt_k ∈ [0, T_b[. Iteratively increasing Δt_k excludes more and more measurement times until, at Δt_k = T_b − 1, only the first measurement time is included. For each Δt_k and for each factor, we now solve the LSS (1) for each t ∈ [1, T_b − Δt_k]. Finally, the MAPE-minimizing time lag is selected as Δt*_k := arg min_{Δt_k} E.

As regression combines all external factors in (3), we cannot individually obtain an optimal time lag for each factor. Instead, we choose one optimal time lag for the main quantity, Δt*_0, by, first, computing (3) for each factor and for each value of Δt_0 ∈ [0, T_b[. For each of the resulting K × T_b solutions for Y we, second, solve (6) to obtain the MAPE. Finally, we choose the MAPE-minimizing time lag value Δt*_0 as above.

IV. STUDYING HISTORICAL DATA
Before we provide the trends, we predict the number of mobile subscribers for previous years. This historical study allows us to rate our method's accuracy and to screen for relevant factors.

A. Scenario and data sets
We focus on the Number of Mobile Subscribers (NMS) in the United States (US), Germany, and China. The US and Germany are developed countries where the NMS and other economic factors saturate. With curve fitting, we model such saturation by using the inverse logit function

  F_t = a_1 / (1 + a_2 e^(−a_3 t)) ,   (7)

also called the "logistic function". As this function is log-concave [1, Ch. 3] and the residual norm ||r||_2^2 maintains concavity, the LSS (1) remains concave and can be solved in polynomial time [1, Ch. 11]. Curve fitting achieves the same complexity order when quantities in emerging markets (e.g., the NMS in China) are modeled by the exponential function

  F_t = a_1 e^(a_2 t) + a_3 ,   (8)
which is strictly convex. When applying the regression model, we restrict our system to K = 4 factors and denote these factors by single letters for brevity. For each market, we study the factors total population (p), Gross Domestic Product (g), number of fixed telephone lines per person (f), and percentage of urban population (u). Note that the factors can be seen as mutually independent but that factor f may negatively interact with the NMS. Table I lists our data sources. Whenever two sources were available, we cross-verified their data by comparison. This was possible and showed only negligible differences for the NMS and for the factors f and u. On the time axis, we employ data from year 1986 to 2010 to predict the factors and the NMS from year 2006 to 2010, i.e., T_b = 20, T_p = 5.
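The exhaustive time lag search of Section III.C reduces to a small grid search over Δt. The following is a minimal sketch for a single factor, assuming a simple linear model and hypothetical data with a spike in the most recent years; the model class and spike are illustrative only:

```python
import numpy as np

def mape(pred, actual):
    """Mean Absolute Percentage Error, Eq. (6)."""
    return np.mean(np.abs(pred - actual) / actual)

def optimal_time_lag(t, D, fit, T_b, max_lag):
    """Exhaustive time lag search in the spirit of Sec. III.C.

    Step-wise shrinks the base period [1, T_b - dt], refits the model on
    the shortened period, and keeps the MAPE-minimizing lag dt*.
    `fit(t, D)` must return a callable model function of t.
    """
    best_dt, best_err = 0, np.inf
    for dt in range(max_lag):
        n = T_b - dt                     # measurement times kept
        model = fit(t[:n], D[:n])
        err = mape(model(t[:n]), D[:n])
        if err < best_err:
            best_dt, best_err = dt, err
    return best_dt, best_err

# Hypothetical factor data: linear growth with a spike in the last 3 years.
t = np.arange(1.0, 21.0)
D = 2.0 * t + 1.0
D[-3:] += 15.0

def linear_fit(tt, dd):
    a, b = np.polyfit(tt, dd, 1)         # slope a, intercept b
    return lambda x: a * x + b

# Search at most half of the base period, as an illustration.
dt_star, err = optimal_time_lag(t, D, linear_fit, T_b=20, max_lag=10)
```

Once the recent spike is excluded (Δt ≥ 3 here), the fit becomes essentially exact, which is precisely the outlier-compensation effect described in Section III.C.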
TABLE I. DATA SOURCES

Quantity                           | Source Issuer                               | URL (verified Sep. 30, 2011)
Number of mobile subscribers (NMS) | International Telecommunication Union (ITU) | http://www.itu.int/ITUD/ict/statistics/index.html
                                   | Nation Master                               | http://www.nationmaster.com/statistics
Population p                       | US Census Bureau                            | http://www.census.gov
GDP g                              | World Bank                                  | http://data.worldbank.org/
Fixed telephone lines f            | International Telecommunication Union (ITU) | http://www.itu.int/ITUD/ict/statistics/index.html
                                   | World Bank                                  | http://data.worldbank.org/
Urban population percentage u      | World Bank                                  | http://data.worldbank.org/
                                   | Central Intelligence Agency (CIA)           | https://www.cia.gov/
B. Results from factor screening
Performing factor screening according to Sections II.C and III.A leads to the results in Figure 2. For a different number of factors, each subplot shows the MAPE on the y-axis, while the x-axis denotes the factor combination. Comparing the y-axes between the subplots indicates that increasing K always decreases the MAPE. For China, however, a different result is obtained when comparing K = 4 to K = 3. Here, excluding the number of fixed telephone lines f nearly doubles the prediction accuracy. As f does not only represent the overall degree of telecommunication but can also compete with mobile communication, it can suppress as well as increase the NMS. Due to this ambiguous effect, we ignore factor f in our further studies of the Chinese market. The resulting high accuracy and the reduced size of D, X, and b clearly demonstrate that factor screening is a powerful tool to improve the prediction's accuracy and complexity at the same time.

Figure 2. MAPE for the three studied markets and all combinations of the external factors: population p; GDP g; fixed telephone lines per person f; urbanization u.

C. Results from time lag optimization
Table II shows the values of the optimal time lag Δt*_0 for the NMS obtained by regression. As discussed above, the number of fixed telephone lines was excluded for China. Its time lag is, thus, not optimized here. We observe interesting differences between the three markets. Population changes quickly affect the NMS in the US and in Germany but need an average of 6 years to influence the Chinese market. On the other hand, a change in urban population quickly affects the Chinese NMS while the US and German markets remain stable. Finally, a change of the GDP needs only 2 years to influence the German and Chinese NMS but 6 years in the US. These different time lags are interesting and provoke intuitive explanations. However, their causes require further study, remain speculative and are, thus, not further discussed here.

TABLE II. OPTIMAL TIME LAGS WITH REGRESSION

Market        | Population p | GDP g | Tel. lines f | Urbanization u
United States | 2            | 6     | 2            | 7
Germany       | 3            | 2     | 7            | 10
China         | 6            | 2     | n.a.         | 2
D. Results for prediction accuracy
To study the accuracy of our prediction system, we compute the NMS trend from 2006 to 2010 for the US, Germany, and China. The results are illustrated in Figure 3. Here, we can compare the data points to the predictions using (i) curve fitting alone and (ii) our complete prediction system (Figure 1). Both prediction methods are studied with and without time lag optimization. In all following plots, the base period and the prediction period are separated by a vertical line.

As shown, our complete prediction system with time lag optimization provides higher prediction accuracy than the other studied methods. This is the case for saturated as well as for emerging markets. Without time lag optimization, the regression model cannot follow diminishing trends and, thus, poorly predicts saturating quantities (cp. Figures 3a and 3b). A similar behavior is shown for curve fitting alone, which cannot closely follow the saturating phase even when Δt is optimized. Note that we could increase accuracy with a different modeling function, e.g., a higher-order polynomial. However, such models require optimizing a larger set of parameters and may violate convexity. Higher-order model functions are, thus, less general and more complex than the widely used "logistic" function class (7). Curve fitting remains an option for predicting the exponential trends in emerging markets. However, even here time lag optimization is crucial. As shown in Figure 3c, curve fitting and regression fail completely when Δt is not optimized.
TABLE III. PREDICTION ERROR (MAPE) FROM 2006 TO 2010

Market        | Curve fitting             | Regression model
              | Without opt. | With opt.  | Without opt. | With opt.
United States | 0.5226       | 0.1311     | 0.6436       | 0.0325
Germany       | 0.9182       | 0.2729     | 2.0104       | 0.0494
China         | 3.5458       | 0.0441     | 3.9117       | 0.0174
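The accuracy gains discussed in the text follow directly from the MAPE values in Table III; a quick arithmetic check:

```python
# MAPE values from Table III (prediction error from 2006 to 2010):
# (curve fitting w/o opt., curve fitting w/ opt.,
#  regression w/o opt., regression w/ opt.)
mape = {
    "United States": (0.5226, 0.1311, 0.6436, 0.0325),
    "Germany":       (0.9182, 0.2729, 2.0104, 0.0494),
    "China":         (3.5458, 0.0441, 3.9117, 0.0174),
}

gains = {}
for market, (cf0, cf1, rg0, rg1) in mape.items():
    gains[market] = {
        "time_lag_gain": rg0 / rg1,   # regression: without vs. with opt.
        "system_vs_cf":  cf1 / rg1,   # complete system vs. curve fitting (opt.)
    }
```

Rounding the ratios reproduces the factors of 20 (US), 41 (Germany), and 225 (China) for time lag optimization, and 4, 5.5, and 2.5 for the complete system over optimized curve fitting.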
Again, the highest accuracy is reached by our complete prediction system. These observations are confirmed by the MAPE results in Table III. For the regression model, adding time lag optimization increases the accuracy by a factor of 20 (US), 41 (Germany), or even 225 (China). With time lag optimization, the complete prediction system provides 2.5 times higher accuracy than curve fitting when the predictions are already close (China) or even increases the accuracy by a factor of 4 (US) and 5.5 (Germany).

Figure 3. Accuracy test for the number of mobile subscribers in the US, Germany, and China, using data from 1986 to 2005 to predict values from 2006 to 2010. Each panel compares the data to the prediction system and to curve fitting, with and without time lag optimization. (Figure 3a: United States; Figure 3b: Germany; Figure 3c: China.)

V. THE TRENDS
Having demonstrated the high accuracy of our prediction system for the last 5 years, we now forecast the NMS for the upcoming years 2011 to 2015. We provide the trends in Figure 4 and compare sole curve fitting to the complete prediction system. As time lag optimization increased accuracy in each case in Section IV, it is now always used.

In Figure 4, both prediction methods agree on an exponential increase for the Chinese market. For the US, our complete system predicts a linear increase while curve fitting predicts that the growth of the US market will slow down towards saturation. The trends for Germany are interesting. Curve fitting only loosely follows the data points in the base period and predicts an increasing NMS, while our system forecasts a strong downturn. Additional factor screening revealed that this downturn is solely caused by the factor percentage of urban population u. As removing this factor significantly decreases the prediction accuracy (cp. Figure 2), we consider the whole range of this prediction to be relevant.

VI. CONCLUSIONS AND FUTURE WORK
We presented a prediction system to forecast arbitrary economic quantities. Taking external factors into account, our system automatically chooses model parameters, factors, and time lags such that accuracy is maximal. Applying our approach to the Number of Mobile Subscribers (NMS) in three different markets demonstrates its high accuracy. Moreover, we observed that:
• Curve fitting shows acceptable accuracy for exponential trends but fails for saturated markets. Here, adding external factors via the linear regression leads to very high accuracy.
• Supporting both prediction methods by factor screening and time lag optimization substantially increases accuracy. For emerging markets, it turns the prediction methods from a failure to high accuracy. The same result is observed for saturating markets, where regression is only useful if time lag optimization is used.
• Analyzing the time lags for saturated and emerging markets indicates that the factors population, Gross Domestic Product (GDP), and urbanization are key differentiators between saturated and emerging markets. Hence, these factors need to be handled with care when comparing the telecommunication markets of different countries.

While these findings are based on historical data from 1986 to 2010, finally, we present NMS trends for 2011 to 2015. Predictors with and without external factors agree on an exponentially increasing NMS for China and on a linear increase for the US. Although they strongly differ in quantity, both methods predict a significant saturation of the German NMS.

Future work will (i) investigate the mathematical properties of time lag optimization to overcome the currently used exhaustive search, (ii) screen for further dominating factors and study their effects, and (iii) further investigate the time lag differences between saturating and emerging markets. Nevertheless, even at their current state, our studies and our prediction system can be useful for operators and vendors in mobile communication as well as for further economic studies.

Figure 4. Trends for the number of mobile subscribers in the US, Germany, and China from 2011 to 2015, using the previous parameters and methods. Each panel compares the data to the prediction system and to curve fitting, both with time lag optimization. (Figure 4a: United States; Figure 4b: Germany; Figure 4c: China.)

ACKNOWLEDGEMENT
We thank Valentine Ilogu (University of Ottawa) and Alistair Urie (Alcatel-Lucent) for their helpful comments.

REFERENCES
[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[2] J. S. Armstrong, Principles of Forecasting, Kluwer Academic, 2001.
[3] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis, 3rd ed., McGraw-Hill, 2000.
[4] Sandvine Inc., "2010 Mobile Internet Phenomena Report", Technical Report, Mar. 2010.
[5] Allied Business Intelligence Inc., "Mobile Data Traffic Analysis", Technical Report RR-TRAF-11, 2011.
[6] J. Moss, "The Mobile Internet Is 10 Years Old: It's Time for a Reality Check", Seminar slides, 2010.
[7] ITU-T, "Models for forecasting international traffic", Rec. E.507, 1993.
[8] G. Madden and J. Tan, "Forecasting telecommunications data with linear models", Telecommunications Policy, vol. 31, pp. 31-44, 2007.
[9] R. Fildes, "Telecommunications demand forecasting – a review", International Journal of Forecasting, vol. 18, pp. 489-522, 2002.