Data mining and the Implementation of a Prospective Payment System for Inpatient Rehabilitation

Daniel Relles, Greg Ridgeway, and Grace Carter
{relles, gregr, carter}@rand.org
RAND, Santa Monica, CA 90407-2138

Abstract

This paper describes the development of a new Medicare Prospective Payment System (PPS) for inpatient rehabilitation care. Congress mandated such a system in the Balanced Budget Act of 1997. To help implement this system, we assembled four years of Medicare hospitalization data, linked it to rehabilitation hospitals’ information about impairment and the functional status of patients, and developed case mix groups using the CART algorithm, a common method for determining groups in health services.

While CART readily produces simple and effective rules for prediction, it adheres to a restrictive functional form and its fitting algorithm does not necessarily produce a global optimum. We wanted to know how these limitations affected our results, so we compared CART's performance with methods receiving attention in the data mining community and in the statistics literature. We estimated that the CART models explained about 90 percent of the potentially explainable variance in individual cost, and they predicted annual hospital costs that were essentially identical to other methods' predictions.

Keywords: Health care financing, prospective payment, rehabilitation, regression trees, data mining

1 Introduction

Partitioning patient stays into groups of homogeneous resource use is a recurring theme in health services research. Partitioning schemes such as the Diagnosis Related Groups (DRGs), Resource Utilization Groups (Fries et al., 1994), and Psychiatric Patient Classes (Ashcraft et al., 1989) are in widespread use and form the fundamental basis for reporting and resource allocation.


The Health Care Financing Administration (HCFA), renamed the Centers for Medicare & Medicaid Services (CMS) in 2001, manages Medicare’s $4.2 billion budget for inpatient rehabilitation. As a result of the Balanced Budget Act of 1997, CMS must implement a prospective payment system based on classifying patients into case mix groups. This new classification system for patients in rehabilitation must be based on empirical evidence that resource use within each case mix group is relatively constant. Also, the system must make clinical sense and provide adequate compensation to hospitals providing the services.

The overall goal of our analysis is to group together patients with similar features, such as impairment, age, and functional ability, so that resource use within that category is relatively constant. In this paper, we describe the construction of a set of Function Related Groups (FRGs) that partition the population into groups that are medically similar and that have similar expected resource needs. We measure resource use by the logarithm of wage-adjusted dollars spent from admission to discharge. To group patients we consider using 21 impairment categories, measures of motor and cognitive function, and age. We use the CART (Classification and Regression Trees) algorithm (Breiman et al., 1984) within each impairment category to develop the partitioning that best predicts cost.

In addition to simply fitting the CART models, we wished to further explore the strengths and limitations of our payment system. Even after computing an unbiased estimate of the predictive performance of a particular regression tree, it is still difficult to judge how much better we might have done if we were not subject to CART's limitations. We know that R-squared is always between 0.0 and 1.0 with higher values indicating better prediction, but when a model's R-squared is potentially much lower than 1.0 we need a way to judge whether CART's performance is as good as could be expected from competing modeling strategies. To further investigate this we compared CART's performance with other methods that have received attention in the statistics and data mining literature: generalized additive models (GAM) and multiple additive regression trees (MART). Both methods are automated, flexible, and effective at fitting complex prediction formulas, and both have gained acceptance in the statistical and data mining communities.


Section 2 describes the study design and the data available for developing the system. Section 3 describes CART and our application of CART to the problem of determining rehabilitation FRGs. Section 4 discusses the other methods we examined to evaluate the performance of CART and the results of that evaluation. Section 5 offers some general conclusions. A complete description of this payment system is available in Carter et al. (2001).

2 Study Data and Design

The population of interest here is all Medicare patients who used inpatient rehabilitation services following an acute care stay. Our initial goal is to produce a patient-level dataset that has measures of resource use as well as medical condition information. Next, we want to group patients that are similar in terms of impairment, functional ability, and age, so that all the patients in a group have roughly the same cost. We obtained patient data for those facilities providing rehabilitation services for Medicare and used these data to stratify patients into 21 internally homogeneous clinical groups that have generally been accepted in the rehabilitation community. Then, we designed and ran a computational study to produce and evaluate a cost classification system.

2.1 Data

We combined data from two sets of patient files. Medicare data provided the population frame, information on resource use, and characteristics of each rehabilitation hospital stay. Rehabilitation hospital data provided information on impairment and functional status. On the Medicare side, we examined discharge abstracts that HCFA collected on all Medicare patients in the course of administering the program, recorded for all rehabilitation hospitals. HCFA provided us with records of calendar year 1996 through 1999 discharges from the Medicare Provider Analysis and Review (MEDPAR) file. From this file we extracted information on departmental charges, age at admission, and characteristics of the stay. Payments for transfer cases and deaths are based on adjustments to the standard payments for cases discharged to the community. This paper focuses on developing the payment system for community discharged cases.


We used data on costs and charges from the Hospital Cost Reporting Information System to estimate accounting cost from the MEDPAR charge data. The method used is described in Newhouse et al. (1989). We adjusted each cost estimate for area wages using the hospital wage index from the acute care PPS.
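In outline, the computation scales each stay's departmental charges by a department-level cost-to-charge ratio and then deflates by the wage index. The sketch below uses hypothetical column and department names and ignores the cost-report details handled in the actual analysis.

```python
# Illustrative sketch only: hypothetical column and department names, not the
# actual HCRIS/MEDPAR processing described in Newhouse et al. (1989).
import pandas as pd

def estimate_wage_adjusted_cost(stays: pd.DataFrame) -> pd.Series:
    """Scale departmental charges by cost-to-charge ratios, then wage-adjust.

    Expects one row per stay with columns charges_<dept>, ccr_<dept>, and
    wage_index (the acute care PPS hospital wage index).
    """
    departments = ["routine", "ancillary", "therapy"]  # hypothetical departments
    cost = sum(stays[f"charges_{d}"] * stays[f"ccr_{d}"] for d in departments)
    # Simplification: deflate the full amount by the wage index; the actual
    # adjustment may apply only to the labor-related share of cost.
    return cost / stays["wage_index"]
```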

On the rehabilitation side, we measured the functional status of individual rehabilitation patients using the Functional Independence Measure (FIM) data. The FIM is an 18-item measure covering six domains: self-care (six activities of daily living), sphincter control (two items on bowel and bladder management), mobility (three transfer items), locomotion (two items on walking/wheelchair use and stairs), communication (two items on comprehension and expression), and social cognition (three items on social interaction, problem solving, and memory). All 18 items are scored into one of seven levels of function ranging from complete dependence (1) to complete independence (7). FIM data also contain an impairment code that gives the primary reason for the rehabilitation admission. We collected FIM data from the Uniform Data System for Medical Rehabilitation (UDSmr), from the Clinical Outcomes Systems (COS) data for medical rehabilitation, and from HealthSouth hospitals.

The MEDPAR and FIM files described the same set of patients, and we needed to link them in order to develop our resource use models. For privacy reasons there were no patient identifiers available to link them together. The literature on techniques for dealing with this problem is rich, and we turned to a probabilistic matching technique (Jaro, 1989) to accomplish the linking. Probabilistic matching takes a set of candidate match variables (here, admission date, discharge date, age, sex, race, and zip code) and attempts to develop a linear scoring function such that scores above a certain cutoff level offer high probabilities of correct matching. Using this technique, we were able to match roughly 95% of the FIM data with patients in the MEDPAR record to form our final dataset. The development of the MEDPAR/FIM dataset is described in Relles and Carter (2002). The merged MEDPAR/FIM data contained several variables useful for modeling and classification.
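The scoring idea is simple to sketch: agreement on each candidate variable contributes a weight, and pairs whose total score exceeds a cutoff are accepted as links. The weights and cutoff below are hypothetical, chosen only to illustrate the mechanics; the actual linkage followed Jaro (1989).

```python
# Simplified sketch of a linear match score over the candidate variables;
# the weights and cutoff are hypothetical, not those used in the actual linkage.
MATCH_WEIGHTS = {
    "admission_date": 4.0, "discharge_date": 4.0, "age": 1.5,
    "sex": 1.0, "race": 1.0, "zip_code": 2.5,
}
SCORE_CUTOFF = 9.0  # accept candidate pairs scoring above this as links

def match_score(medpar_rec: dict, fim_rec: dict) -> float:
    """Sum the agreement weights over the candidate match variables."""
    return sum(weight for var, weight in MATCH_WEIGHTS.items()
               if medpar_rec.get(var) == fim_rec.get(var))

def is_link(medpar_rec: dict, fim_rec: dict) -> bool:
    return match_score(medpar_rec, fim_rec) > SCORE_CUTOFF
```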

Table 1 identifies these variables and indicates at which stages of the process they were used. The selection variables define what we think of as the typical case. We exclude transfers to hospitals and to long term care settings, deaths, cases of three days or less duration, and statistical outliers (cases more than three standard deviations from the mean of log(cost)). Also, the clinical partitioning and resource use variables needed to be present and in range. The FIM data measure functional independence in two main dimensions, the cognitive and the motor. The sum of the thirteen motor components represents an overall measure of motor ability and the sum of the five cognitive components does likewise for cognition. Case selection was based on the intersection of the rules shown in Table 2.


Table 1: MEDPAR/FIM Variables and Stages of Use

Selection
  AGE         (MEDPAR)   age
  DISSTAY     (FIM†)     discharge stay indicator
  LOS         (MEDPAR)   length of stay
  IMPCD       (FIM)      rehabilitation impairment codes
  PROVCODE    (MEDPAR)   provider code
  PROVNO      (MEDPAR)   provider number
  TCOST       (MEDPAR)   total cost estimates, based on cost to charge ratios, adjusted by area wage index

Clinical partitioning
  IMPCD       (FIM†)     impairment code
  RIC         (FIM†)     Rehabilitation Impairment Category – indicates one of 21 clinical groups resulting from impairment code mappings

Resource use
  TCOST       (MEDPAR)   total cost estimates, based on cost to charge ratios, adjusted by area wage index

Functional items
  COGNITIVE*  (FIM†)     Cognitive score total – sum of 5 components: comprehension, expression, social interaction, problem solving, memory
  MOTOR*      (FIM†)     Motor score total – sum of 13 components: eating, grooming, bathing, dressing (upper body), dressing (lower body), toileting, bladder management, bowel management, bed/chair/wheelchair transfer, toilet transfer, tub or shower transfer, walking or wheelchair, stair ascending and descending

* These individual components are organized into various types of indices, according to body areas and types of impairment. Each component of the motor and cognitive scores is an ordinal scale that ranges from 1 (complete dependence) to 7 (complete independence). Therefore the cognitive scores can range from 5 to 35 and the motor scores can range from 13 to 91.
† Patients' FIM data came from either UDSmr, COS, or HealthSouth.


Table 2: Rules for Selecting Cases

  Variable                   Selection requirement
  AGE                        between 16 and 105
  DISSTAY                    indicates discharged to the community
  LOS                        more than three days, less than one year
  IMPCD, TCOST               we excluded cases with log(wage-adjusted cost) more than three standard deviations from its average within RIC
  IMPCD                      contained in an impairment list for assignment to one of the 21 rehabilitation categories (see Table 4)
  TCOST, COGNITIVE, MOTOR    greater than zero
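Applied to the merged file, these rules amount to a straightforward filter. The sketch below (Python/pandas, with column names taken from Table 1) is illustrative only; in particular, DISSTAY == 1 is assumed here to mean "discharged to the community".

```python
# Sketch of the Table 2 rules applied as a pandas filter; column names follow
# Table 1 and the coding of DISSTAY is an assumption.
import numpy as np
import pandas as pd

def select_typical_cases(df: pd.DataFrame) -> pd.DataFrame:
    keep = (df["AGE"].between(16, 105)
            & (df["DISSTAY"] == 1)
            & (df["LOS"] > 3) & (df["LOS"] < 365)
            & (df["TCOST"] > 0)
            & (df["COGNITIVE"] > 0) & (df["MOTOR"] > 0)
            & df["RIC"].notna())            # impairment code maps to one of 21 RICs
    df = df[keep].copy()

    # Trim cases more than three standard deviations from the mean of
    # log(wage-adjusted cost) within each RIC.
    log_cost = np.log(df["TCOST"])
    grouped = log_cost.groupby(df["RIC"])
    z = (log_cost - grouped.transform("mean")) / grouped.transform("std")
    return df[z.abs() <= 3]
```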

Table 3 shows the effects of the selection rules in 1998 and 1999 on the number of cases available for analysis. The full population is reduced by at least a third, owing mostly to the nonparticipation of hospitals in our FIM sources. Missing cost information accounted for another three percent drop, mostly from all-inclusive providers for whom separating out rehabilitation charges was not possible. About a quarter of the remaining cases were discharged someplace other than the community. Other drops in sample sizes were small.

Table 3: Number of observations at each stage of selection

  Population Sizes                              1998       1999
  Population of Medicare rehab patients       370,352    390,048
  Matched cases at participating hospitals    234,622    259,017
  With cost information                       228,622    250,254
  With cost and FIM information               228,248    249,941
  Discharged to community                     174,011    191,924
  Exclude transfers to hospitals              170,270    187,258
  Exclude age, cost, and LOS outliers         169,816    186,766

2.2 Case Stratification and Sample Sizes

The first step in developing case mix groups is to partition the data into clinically similar groups, called rehabilitation impairment categories (RICs), based on the primary reason for the rehabilitation admission. Previous work had established 21 such groups within which we would be fitting models. Table 4 describes those groupings and the sample sizes available for the modeling effort according to the selection rules in Table 2. Over time the sample size increased largely due to an increase in the number of hospitals participating in the source databases. Table 4 also includes the final number of FRGs in each RIC, discussed later.

After establishing the 21 RICs we fit models within each RIC predicting cost from patient features. This is equivalent to interacting RIC with all other covariates used to predict cost.

Table 4: RIC Definitions, Sample Sizes, and Number of FRGs

  RIC  Rehabilitation Impairment Category        1996     1997     1998     1999   FRGs
   1   Stroke                                  32,687   35,026   37,012   37,340     14
   2   Traumatic brain injury                   1,383    1,629    1,871    2,053      5
   3   Non-traumatic brain injury               2,517    2,863    3,402    3,758      4
   4   Traumatic spinal cord                      738      810      930      953      4
   5   Non-traumatic spinal cord                3,782    4,340    5,295    5,837      5
   6   Neurological                             4,730    5,717    7,832    8,875      4
   7   Hip fracture                            16,017   17,167   18,774   20,627      5
   8   Replacement of lower extremity joint    31,151   37,383   40,931   43,427      6
   9   Other orthopedic                         5,292    6,547    8,022    9,310      4
  10   Amputation, lower extremity              4,810    5,423    5,930    6,156      5
  11   Amputation, other                          354      477      542      662      3
  12   Osteoarthritis                           2,340    2,854    3,983    5,036      5
  13   Rheumatoid, other arthritis              1,169    1,521    1,944    2,350      4
  14   Cardiac                                  4,097    5,662    6,885    8,104      4
  15   Pulmonary                                2,442    3,561    4,340    5,382      4
  16   Pain Syndrome                            1,321    1,873    2,529    2,993      2
  17   MMT, no brain or spinal cord injury      1,188    1,288    1,540    1,679      3
  18   MMT, with brain or spinal cord injury      156      222      221      256      4
  19   Guillain-Barre                             240      278      299      313      3
  20   Miscellaneous                           10,097   13,398   17,423   21,553      5
  21   Burns                                       70      103      111      102      2
       Total                                  126,581  148,142  169,816  186,766     95

2.3 Computational Experiment

The main goal of this study was to evaluate the out-of-sample predictive performance of the different prediction methods and compare them with CART. Out-of-sample evaluation is the process of fitting a model on one sample and evaluating its predictive performance on a new sample of subjects, yielding realistic estimates of the prediction error likely to be observed upon the PPS's implementation. An important element of a payment system is whether payment formulas offer accurate prospective estimates of cost. In particular we wished to determine whether CART was capable of capturing most of the cost information available in the predictor variables, age and functional ability.

Table 5 shows the layout of the computational experiment consisting of four experimental factors that we varied. The first factor was the algorithm used to predict cost from the patient covariates. Subsequent sections of this paper discuss the three candidate methods in greater detail. Linear least squares regression is not among these candidates since age and the functional measures almost certainly have non-linear relationships to cost and, as expected, the method performed poorly in initial experiments. Moreover, linear regression can be viewed as a special case of the GAM discussed later. For each method we considered five candidate sets of predictor variables to predict cost. Table 5 lists the five candidate predictor sets in increasing order of granularity. These were the only predictors allowable at this stage. Other variables were either not acceptable (e.g. sex, marital status, wheelchair status) or set apart as adjustments to a base payment (e.g. comorbidities and facility characteristics). To validate the various predictors of log(cost) we fit the various models using the five candidate sets of predictor variables with data from each of the years and predicted in the other years. Naturally we are interested in whether we can estimate the model in, say, 1997 and accurately predict cost in 1999. We initially tried fitting separate models for each year and seeing how well they performed on all other years. This would yield 12 out-of-sample evaluations. We later improved on that by observing that some RICs (e.g., 04, 11, 18, 19, 21) were quite small, and it might be advantageous to pool their data. This led to experimenting with fitting periods 1996-97 and 1998-99. Thus, Table 5 describes the full set of fits and predictions, with the exception that we did not fit and predict on the same year's data.
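The evaluation loop itself is straightforward. The sketch below (Python, with placeholder data structures and a placeholder model constructor) shows the fit-on-one-period, score-on-the-other-years computation described above; it is illustrative, not the code used in the study.

```python
# Sketch of the cross-period evaluation. `make_model` stands in for any of the
# three methods; data_by_year maps a year to (X, y) arrays with
# y = log(wage-adjusted cost).
import numpy as np

def evaluate(data_by_year, predictor_columns, make_model, fit_periods):
    results = {}
    for fit_years in fit_periods:               # e.g. ("1996",) or ("1996", "1997")
        X_fit = np.vstack([data_by_year[y][0][:, predictor_columns] for y in fit_years])
        y_fit = np.concatenate([data_by_year[y][1] for y in fit_years])
        model = make_model().fit(X_fit, y_fit)
        for eval_year, (X, y) in data_by_year.items():
            if eval_year in fit_years:           # never fit and predict on the same year
                continue
            residuals = y - model.predict(X[:, predictor_columns])
            results[(fit_years, eval_year)] = np.sqrt(np.mean(residuals ** 2))
    return results
```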


Table 5: The computational experiment

  Experimental factor    Possible values for each experimental factor
  1. Model               CART; GAM; MART
  2. Predictor set       (a) Age, Standard FIM Motor and Cognitive Score
                         (b) Age, Standard but remove transfer to tub from the Motor score
                         (c) Age, Standard but decompose Motor into ADLs and mobility (without tub transfer)
                         (d) Age, Standard but decompose Motor into transfer (without tub transfer), locomotion, sphincter, and self-care
                         (e) Age, and the 18 individual FIM components
  3. Fitting year        1996, 1997, 1998, 1999, 1996-7, 1998-9
  4. Evaluation year     1996, 1997, 1998, 1999

Section 3 will detail our use of the CART algorithm and Section 4 describes the other two modeling methods that we were considering.

3 Modeling Cost Using CART

The rehabilitation PPS is to be based on discharges classified according to function related groups. CART is the traditional method of generating FRGs (Stineman et al., 1997) and a reasonable method of determining rules to classify patients into groups that explain cost. Various algorithms have been proposed to build tree structured regression models, many of which are variations on the CART theme.

CART requires a dependent variable (here, the logarithm of wage-adjusted cost) and it seeks to develop a predictor of the dependent variable through a series of binary splits from a candidate set of independent variables. Here the predictor variables are age, the FIM motor score, and the FIM cognitive score. The FIM cognitive score is simply the sum of the five components of cognition. Section 5.1 describes in more detail our use of the FIM motor score, for which we used a sum of 12 of the 13 motor components.

The CART algorithm is recursive. First, it examines the set of independent variables and searches the dataset for a partition that best explains variation in the dependent variable. For example, CART might examine the partition separating patients with motor score exceeding 50 from those with motor score less than 50. For those patients with motor scores less than 50, CART would predict the average log(cost) of all patients with motor scores less than 50. A similar prediction strategy applies to those patients with motor score exceeding 50. We can evaluate the quality of the split using the squared prediction error. CART searches amongst all variables and split points, choosing the variable to split and the split point so that the new partitions minimize the estimated squared prediction error.
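A minimal version of this split search, written out directly rather than through a CART package, looks like the following sketch (illustrative only).

```python
# Minimal version of the split search described above: for each candidate
# variable and cut point, predict each side by its mean log(cost) and keep
# the split with the smallest squared prediction error.
import numpy as np

def best_split(X: np.ndarray, y: np.ndarray):
    best_var, best_cut, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j])[:-1]:          # candidate cut points
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_var, best_cut, best_sse = j, cut, sse
    return best_var, best_cut, best_sse
```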

CART then recursively splits each partition until it satisfies a stopping criterion. Naturally we want to stop the partitioning process when prospective prediction is optimal. As a surrogate we considered 10-fold cross-validation (Breiman et al., 1984) to estimate prospective prediction error. However, this method estimated that the collection of trees from each RIC should have a total of 359 terminal nodes. This is not too surprising since CART adds nodes as long as the decrease in mean squared error seems statistically significant, and with large samples even minor differences can be statistically significant. However, 359 terminal nodes means 359 FRGs, too many to administer. We took additional steps to decrease the number of nodes within each RIC. This included using the "1 standard error rule" (Breiman et al., 1984). This method effectively stops the recursion when a cross-validated estimate of the prediction error is within one standard error of the minimum estimated prediction error. We also enforced "practical significance" on each node: the predicted costs in neighboring nodes must differ by more than $1500, and merging nodes would not change the predictions by more than $1000 from their original values. With these steps we obtained a more manageable 95 terminal nodes or, equivalently, 95 FRGs.
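The selection logic of the 1 SE rule is easy to sketch with scikit-learn's cost-complexity pruning standing in for the original CART software: among pruned trees whose 10-fold cross-validated error is within one standard error of the minimum, keep the most heavily pruned (simplest) one. The code below is illustrative only.

```python
# Sketch of the 1 SE rule using scikit-learn's cost-complexity pruning as a
# stand-in for the original CART software.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def prune_with_one_se_rule(X, y, cv=10):
    alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
    cv_mse, cv_se = [], []
    for alpha in alphas:
        scores = -cross_val_score(DecisionTreeRegressor(ccp_alpha=alpha, random_state=0),
                                  X, y, cv=cv, scoring="neg_mean_squared_error")
        cv_mse.append(scores.mean())
        cv_se.append(scores.std(ddof=1) / np.sqrt(cv))
    cv_mse, cv_se = np.array(cv_mse), np.array(cv_se)
    threshold = cv_mse.min() + cv_se[cv_mse.argmin()]
    best_alpha = alphas[cv_mse <= threshold].max()   # simplest tree within 1 SE
    return DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```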

Another policy constraint required predicted costs only to decrease with increasing levels of functional independence. Our technical expert panel believed that if CMS paid less for cases with less function, it would provide incentives that many clinicians would find unacceptable. In fact the data only rarely result in a violation of this monotonicity constraint.
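One simple way to impose such a constraint after the tree is grown is to pool adjacent FRG means that violate it. The sketch below uses a decreasing isotonic regression as a stand-in for that pooling step; it is illustrative only, not the adjustment procedure used in the study.

```python
# Illustrative only: pool adjacent FRG mean log(cost) values, ordered by motor
# score, so that predictions never increase as functional independence rises.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def monotone_frg_means(motor_midpoints, mean_log_cost, group_sizes):
    """Return FRG predictions that never increase as the motor score rises."""
    iso = IsotonicRegression(increasing=False)
    return iso.fit_transform(np.asarray(motor_midpoints, dtype=float),
                             np.asarray(mean_log_cost, dtype=float),
                             sample_weight=np.asarray(group_sizes, dtype=float))
```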


[Figure 1: CART predictions of log(cost) from motor and cognitive scores for N=74,352 stroke cases from 1998-9; the axes are the motor score and the cognitive score, and the average log(cost) within the partitions ranges from 8.734 to 9.909.]

To demonstrate, we used patient data from RIC 01 (Stroke) for 1998 and 1999 combined and fit a CART model predicting log(cost) from the motor and cognitive scores. Figure 1 shows how CART partitions these data. The lines show the partitions and the number in each partition is the average log(cost) of the patients with the associated motor and cognitive scores. Costs decrease as the shading gets darker. We can see that motor is the primary effect, although at high motor scores cognitive ability can be influential. Figure 2 shows the same CART model as a decision tree. Positive answers to the questions at each node traverse to the left and negative answers move to the right until we reach a prediction in one of the leaves of the tree.


[Figure 2: The CART partition from Figure 1 as a decision tree. The terminal nodes show the average log(cost), average cost in dollars, and number of cases; splits are on the motor score (at 27.5, 34.5, 38.5, 41.5, 45.5, 50.5, 56.5, and 62.5) and the cognitive score (at 25.5, 26.5, and 29.5).]

Although CART seems flexible and interpretable, it has some drawbacks. By recursively partitioning the data, CART essentially fits a progressively more complex interaction term and thus can miss main additive effects. The boundaries between groups are abrupt and discontinuous. This lack of smoothness can potentially detract from the quality of the predictions. CART has the pleasing theoretical property that as the sample size grows the prediction rule converges to the one that minimizes the expected prediction error. However, even in fairly large samples CART can fail to fit curvature well (underfit) or can infer curvature where none exists (overfit). We also saw that with large data sets CART can produce a tree with many partitions, causing difficulty in interpretation, evaluation, and implementation of the inferred rules. CART is a high-variance regression method, meaning that small fluctuations in the data set can produce very different tree structures and prediction rules. An early split will influence the shape of the tree and can produce results that may be nonsensical. This sensitivity casts doubt on the reliability of any interpretation assigned to the tree structure.

Despite these limitations, CART is useful and powerful, producing groups defined simply by ranges of the independent variables. It becomes easy to classify a new patient by comparing the values of the patient's set of independent variables with the ranges that define each of the CART-determined groups. Only the order of the predictor variables matters in forming the partitions. In this sense the algorithm is invariant to one-to-one transformations (such as log transforms and rescaling) of the predictor variables.

4 Other Models

So far we have discussed techniques that we considered to produce CART models that predicted as accurately as possible while still adhering to policy constraints, such as the variables that are allowable for payment group definitions, limits on the size and complexity of the trees, the $1500/$1000 rule for difference in adjacent FRG cost predictions, and the monotonicity constraint. In this section we explore methods outside the CART realm to search for a gold standard model to which we can compare our final CART model. These methods do not necessarily produce case mix groups and, therefore, are not alternatives to CART for developing the ultimate payment formulas. For our purposes, their main use is for assessing CART's performance.

In modeling cost of medical care the signal-to-noise ratio is relatively small, so an R-squared of, say, 0.30 might be the best that we can accomplish. By searching for a competitor to CART we can try to estimate the best possible performance on the given dataset, thereby gauging the quality of our FRGs derived from CART. We compared CART to generalized additive models (GAM) and multiple additive regression trees (MART). These models are discussed in the statistics and data mining literature. We used the version of GAM (Hastie and Tibshirani, 1990) implemented in the statistical package S-plus. MART is described in Friedman (2001), and we used software provided by the author. All of these methods are extensively discussed and compared in Hastie et al. (2001). We assessed each model's predictive performance on preceding and subsequent years. That is, we fit each model (CART, GAM, and MART) to 1997 data, for example, and used that model to predict cost for 1996, 1998, and 1999. A model that consistently predicts cost well, in terms of the average squared difference between the actual and predicted cost, across the various years and RICs, is our gold standard model.

The following two sections describe the types of models we fit and our reasons for fitting them. Included with each of the methods is a two-dimensional visualization of the surface that each model fits to the data. As in the CART example in Section 3, these data come from RIC 01 (Stroke) combining 1998 and 1999 data. The darkest regions of the plots show the regions where the model predicts the lowest cost for the motor and cognitive score combination. Since such visualization is limited to two dimensions, the plots intentionally exclude age.

4.1 Generalized additive models (GAM)

GAM permits slightly more flexible relationships between the dependent and independent variables than linear regression allows. GAM approximates the relationship as a sum of smooth (rather than linear) functions of the independent variables. For example, a GAM cost prediction model may have the form

log(cost) = β0 + f_m(motor score) + f_c(cognitive score) + ε

where f_m and f_c are smooth functions estimated from the data. This means that a change in motor score from 20 to 21 might decrease predicted cost by a different percentage than a change from 60 to 61. GAM does not model interactions, but only produces estimates of additive effects. Because the relationship is assumed additive, the decrease in predicted cost due to a change in motor score from 20 to 21 will be the same regardless of the values of the other independent variables. The top two panels of Figure 3 compare the ordinary linear regression model to GAM for predicting cost from motor score. Although the two fits seem to agree closely, the GAM fit shows evidence that the marginal increase in cost tapers off as motor score gets smaller. The discussion of the bottom two plots is left for a later section.
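A minimal sketch of this kind of additive fit, using the pygam package in Python (an assumption made purely for illustration; the analysis itself used the S-plus GAM implementation):

```python
# Minimal sketch of the additive model above; pygam is assumed here only for
# illustration and is not the software used in the study.
import numpy as np
from pygam import LinearGAM, s

def fit_gam(motor, cognitive, log_cost):
    X = np.column_stack([motor, cognitive])
    # One smooth term per predictor: log(cost) = b0 + f_m(motor) + f_c(cognitive)
    return LinearGAM(s(0) + s(1)).fit(X, log_cost)

# To inspect the fitted f_m curve, the usual pattern is:
#   gam = fit_gam(motor, cognitive, log_cost)
#   grid = gam.generate_X_grid(term=0)
#   curve = gam.partial_dependence(term=0, X=grid)
```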


[Figure 3: Comparison of Univariate Models fit to Stroke 1998-9 Data. The four panels (OLS, GAM, MART, and CART) each plot fitted log(cost) against the motor score.]

Although the additivity restriction may prevent the discovery of interaction effects in multivariate data, the benefits of additivity include fast computation, easy interpretation, and constraints on prediction variance. To interpret GAM we can plot, for each covariate, the value of the covariate versus the contribution it makes toward the log(cost) estimate. We can then visually look for irregularities, saturation effects, and threshold effects. For example, we may learn that patients with motor scores exceeding a particular value have roughly constant cost, an example of a saturation effect. GAM does use more degrees of freedom than OLS but conserves them by imposing the additive constraint and restricting the additive components to be very smooth, spending roughly four degrees of freedom per predictor. GAM will also work well in small RICs.

Figure 4 shows the shape of the GAM fit. Clearly, GAM picks up curvature that the linear model cannot. It is still apparent that the motor score is the most influential. However, GAM also seems to pick up that at extreme values on the cognitive scale the cost is slightly lower than for cognitive scores in the middle of the range.

[Figure 4: GAM predictions of log(cost) from motor and cognitive scores for N=74,352 stroke cases from 1998-9; the axes are the motor score and the cognitive score.]

The cost of the additional flexibility is greater model complexity and variability. However, that same flexibility that makes GAM more complex also can make its predictions more accurate than the linear model when the relationship between the dependent and independent variables is non-linear.

4.2 Multiple additive regression trees (MART)

MART is a state-of-the-art statistical method and the most flexible and most complex of the models under consideration as a gold standard. Like GAM, it is nonparametric with the ability to find non-linear relationships. However, it is also able to find interaction effects in the predictor variables.

The MART prediction is the sum of predictions from many simple CART models. The algorithm constructs the CART models sequentially in such a way that each additional CART model reduces prediction error. Since each CART model fits an interaction effect, the sum of many of them (100s to 1000s) results in a prediction model that permits complex, non-linear relationships between the dependent and independent variables. We can control the depth of interaction effects MART tries to capture by controlling the depth of the individual CART models. The bottom left plot in Figure 3 shows the MART fit for predicting log(cost) from motor. The fit is actually a sum of a few hundred CART models with two terminal nodes each. Its shape is similar to the GAM fit with a much greater emphasis on the saturation and threshold effects at the extreme motor scores. Note the similarity to the single CART model in the bottom right of Figure 3. However, CART reveals its discontinuous prediction rule, the best it can do given its restrictive functional form.
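A sketch of this boosting idea, using scikit-learn's GradientBoostingRegressor as a readily available analogue of Friedman's MART software (the settings shown are illustrative, not those used in the study):

```python
# Sketch of a MART-style fit: many small regression trees added in sequence,
# each fit to reduce the remaining prediction error.
from sklearn.ensemble import GradientBoostingRegressor

def fit_mart_like(X, log_cost):
    model = GradientBoostingRegressor(
        n_estimators=500,        # hundreds of small CART models
        max_depth=2,             # tree depth caps the order of interactions captured
        learning_rate=0.05,      # shrink each tree's contribution
        loss="squared_error",
    )
    return model.fit(X, log_cost)
```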

If cost varies in a non-additive way across the motor and cognitive scores, then MART might be able to capture this information and provide predictions that are more accurate than GAM. Figure 5 shows the shape of the MART fit on the 1998-9 stroke data. Like GAM, MART determines that the high cognitive values have lower costs than the lower cognitive scores at a fixed motor score. Furthermore, MART shows that costs decrease much faster at the high cognitive scores for very low motor scores. This is a feature that the functional form of GAM cannot detect. When such effects are strong, MART would likely outperform GAM. This makes it a good candidate for the gold standard.


[Figure 5: MART predictions of log(cost) from motor and cognitive scores for N=74,352 stroke cases from 1998-9; the axes are the motor score and the cognitive score.]

As with GAM, the additional complexity complicates interpretation. MART is difficult to interpret and it is difficult to quantify the number of degrees of freedom that it spends. However, some measures of variable influence and visualization tools are available for evaluating the predictor's rationale. It is not clear if it will always work well for very small RICs, but results show that it has been competitive with GAM.

5 Results

5.1 Selecting a Predictor Set

The FIM instrument contains 13 measures of motor ability and 5 measures of cognitive ability. We fit some linear regression models to the 18 individual scores to predict cost. We found that the tub transfer item consistently had the wrong sign. That is, patients with greater independence on that item were more expensive on average. Patients may or may not be provided with various assistive devices, including tub benches and handrails. Those patients with lesser motor ability may be offered access to devices that increase the chance of easy transitions into the tub or shower. In addition, the FIM instructions require that unobserved outcomes be recorded as maximum dependence. This is especially problematic for tub transfer since it is frequently unobserved at admission. Lastly, we observed that the predictor sets without tub transfer simply performed better. Thus, we removed tub transfer for this analysis.

Besides the tub transfer component, we considered various decompositions of the motor score into finer measures. The finer decompositions resulted in an unwieldy number of patient categories with little to no improvement in predictive performance. Therefore we concluded that the payment system should be based on age, the standard cognitive score, and the standard motor score minus the transfer to tub score. Fitting CART within each RIC using this predictor set produced a total of 95 FRGs. Table 4 shows how the FRGs are distributed across the 21 RICs.

5.2 MART predicts best at the individual level

Having a gold standard model does two things. First, it helps us understand how well CART is doing by giving us a measure of attainable root mean squared error (RMSE) with which we can compare the RMSE we get from CART. Second, it will enable us in a simulation exercise to assess the prediction bias for various combinations of demographic and hospital factors.

MART and GAM are the candidates for gold standard status. We have theoretical reasons to prefer MART. It is extremely flexible, and it detects interactions. However, its prediction formula is rather unwieldy. Also, some RIC sample sizes are small, and it may be that without forcing some structure, one effectively fits too many parameters and gets a model that does not predict very well. On the other hand, GAM uses fewer degrees of freedom, and produces a curve to describe each input variable’s effects, so it is a little easier to decide whether the GAM fits make clinical sense. Without a clear a priori winner, we decided to perform our computations on both GAM and MART.

Table 6 shows the results comparing the methods. The Constant column represents the model using only the RIC to predict cost. RIC explains roughly 16% of the variance. Using CART within each of the 21 RICs reduced the RMSE a fair amount, indicating that CART can extract information from age and the function measures. MART sometimes does slightly better than GAM, but both consistently outperform CART. About 34 percent of the total variance in the wage adjusted cost of cases discharged to the community is predicted by the CART based FRG system and 37 percent by our gold standard models.

Table 6: Root Mean Squared Errors – Candidate Gold Standard Models and CART

  Fit year  # Nodes  Evaluation year  Constant  CART (1 SE rule)   GAM    MART   Prediction interval reduction
  96           95         97           0.541        0.480         0.473  0.473           1.9%
                          98           0.545        0.486         0.479  0.479           1.9%
                          99           0.546        0.489         0.482  0.482           1.9%
  97           97         96           0.535        0.473         0.467  0.467           1.6%
                          98           0.545        0.485         0.479  0.478           1.6%
                          99           0.546        0.488         0.482  0.482           1.6%
  98          123         96           0.535        0.473         0.468  0.468           1.4%
                          97           0.541        0.479         0.474  0.473           1.3%
                          99           0.546        0.486         0.481  0.481           1.3%
  99          126         96           0.535        0.474         0.468  0.468           1.6%
                          97           0.541        0.479         0.474  0.473           1.3%
                          98           0.545        0.483         0.479  0.478           1.1%
  96-97       142         98           0.545        0.483         0.479  0.478           1.1%
                          99           0.546        0.486         0.482  0.481           1.1%
  98-99       180         96           0.535        0.471         0.468  0.467           0.8%
                          97           0.541        0.477         0.473  0.473           1.1%

Since our models are fit on the log scale, the reported RMSEs are standard errors of prediction on the log dollar scale. To map the performance to the dollar scale, the last column in Table 6 shows the reduction in the width of the 95% prediction interval that the gold standard models offer. The gold standard models offer between a 1% and 2% reduction in prediction error.
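As a rough cross-check (a back-of-the-envelope calculation, not one reported in the original analysis), these variance figures follow from the RMSEs in Table 6 once the total standard deviation of log(cost) is backed out from the 16 percent explained by RIC alone. Using the 1996-fit, 1997-evaluation row:

sigma_total ≈ 0.541 / sqrt(1 - 0.16) ≈ 0.59
R-squared(CART) ≈ 1 - (0.480 / 0.59)^2 ≈ 0.34
R-squared(gold standard) ≈ 1 - (0.473 / 0.59)^2 ≈ 0.36

which lines up with the 34 and 37 percent figures quoted above.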

We also looked at reductions in RMSE within each RIC. Percentage reductions varied from about 20 percent for stroke (RIC 01) to about 10 percent for the three orthopedic RICs (07, 08, 09). These orthopedic RICs are substantially more homogeneous in cost than other RICs, so that despite the fact that we predict a smaller fraction of the variance in these RICs, they have RMSEs that are among the lowest of all RICs. Figure 6 and Figure 7 show what was typical of most RICs. Each panel is labeled based on which year's data we used to fit the model.


The curves within the panels describe the RMSE when we used that model to predict on the other years. The curves labeled “R” refer to using only the average cost to predict cost so that RMSE equals the RIC’s standard deviation of cost. MART and GAM perform about equally well, but MART does a little better across prediction periods. Although CART, shown here using the 1 SE rule, does not predict as well as the gold standards, it seems to move substantially away from the average cost prediction (R) toward the gold standards.

CART did not perform nearly as well as the gold standards in the smallest RIC (see Figure 8). GAM substantially outperformed the other methods in the burn RIC, which has about 100 cases annually in our dataset. This reveals that CART's structure seems to prevent it from extracting essential information from age and the FIM scores in small datasets. The less restricted GAM is able to utilize this information. If, in the aggregate, payments for burn rehabilitation become too large or too small, a refinement based on GAM may offer a more accurate solution but does require redefining the rules for an acceptable payment system. Note that when we pooled data, the panels labeled 1996-7 and 1998-9, CART greatly improved. Because of the relative instability of CART with small sample sizes, we recommended that two years of data be used to create FRGs for the smallest RICs.

[Figure 6: RMSEs by fit and prediction years, RIC = Stroke (N = 32,687; 35,026; 37,012; 37,340). Panels correspond to fitting periods 1996, 1997, 1998, 1999, 1996-7, and 1998-9; within each panel, RMSE is plotted by prediction year for R (RIC only), C (CART), G (GAM), and M (MART).]

[Figure 7: RMSEs by fit and prediction years, RIC = Traumatic brain injury (N = 1,383; 1,629; 1,871; 2,053). Same layout as Figure 6.]

[Figure 8: RMSEs by fit and prediction years, RIC = Burns (N = 70; 103; 111; 102). Same layout as Figure 6.]

In summary, we saw that MART seemed to be a little better than GAM for many RICs, validating the observations we made previously at the aggregate level. Our current recommendation is to use age, the cognitive score, and the motor score without tub transfer with CART and the 1 SE stopping rule. Such models traverse 90% of the way from the constant prediction to the gold standard, indicating that they are achieving near-gold-standard performance. Overall the CART RMSEs are about 0.005 larger than the gold standard, indicating that prediction error intervals on the dollar scale expand by at most 2%.

5.3 Aggregated to the hospital level, CART and MART have similar predictions

CART provided both classification groups and predictions, namely the average log wage-adjusted cost within each group. But CART is a constrained fitting procedure. We would like to know whether a more complex formula, one that adheres to fewer constraints and has the ability to trace complex cost curves, would pay hospitals differently in total.

To answer this question, we performed a simulation exercise utilizing the recommended FRGs and MART as the gold standard model. For each hospital we computed the total predicted cost based on CART produced FRGs and MART. The recommended FRGs will ultimately be used in a payment formula that adjusts their predicted costs by a variety of factors, such as hospital characteristics, patient comorbidities, and budget neutrality requirements. The dependent variable fit in all these models is log(cost / wage adjustment). For a patient with particular features at a particular hospital we transformed the fitted value to dollars by

prediction = exp(model fit) × wage adjustment × hospital adjustment × budgetary constant

where the constant is chosen to assure budget neutrality (i.e., it makes predictions sum to costs). We computed these predictions using CART and MART, aggregated the predicted costs to the hospital level, and compared the ratio of the two cost predictions. Conveniently, the wage and hospital adjustments and the neutrality constant cancel out of the payment ratios.
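The hospital-level comparison therefore reduces to summing each model's predicted payments by provider and taking the ratio. The sketch below (Python/pandas, with hypothetical column names) shows this computation; the common adjustment factors multiply both predictions identically and drop out of the ratio.

```python
# Sketch of the hospital-level comparison: aggregate each model's predicted
# payments by provider and take the ratio. Column names are hypothetical.
import numpy as np
import pandas as pd

def hospital_payment_ratios(cases: pd.DataFrame) -> pd.Series:
    """cases: one row per case with columns PROVNO, frg_fit, mart_fit, where the
    *_fit columns hold fitted log(wage-adjusted cost) from the two models."""
    predicted = cases.assign(frg_pred=np.exp(cases["frg_fit"]),
                             mart_pred=np.exp(cases["mart_fit"]))
    by_hospital = predicted.groupby("PROVNO")[["frg_pred", "mart_pred"]].sum()
    return by_hospital["frg_pred"] / by_hospital["mart_pred"]
```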

Table 7 summarizes the results for the MART versus FRG payment formulas in each of four years. For each prediction year, over 90 percent of hospitals have payment ratios between 0.98 and 1.02, and only about 1 percent of hospitals have payment ratios outside of 0.96 to 1.04.


Ninety-four percent of cases go to hospitals with payment ratios between 0.98 and 1.02. These numbers are remarkably close to 1.0. They affirm that the CART prediction formula would pay about the same in total as the more flexible MART prediction formula at the hospital level.

Table 7: Hospital Payment Ratios for Final FRGs vs. MART

  Hospital              Percent of Hospitals (Facility Weighted)      Percent of Hospitals (Case Weighted)
  Payment Ratio (%)       1996    1997    1998    1999                  1996    1997    1998    1999
   90                      0.4     0.0     0.0     0.0                   0.0     0.0     0.0     0.0
   94                      0.2     0.0     0.0     0.0                   0.1     0.0     0.0     0.0
   95                      0.0     0.2     0.4     0.1                   0.0     0.0     0.2     0.1
   96                      0.9     0.8     1.0     1.0                   0.4     0.3     0.3     1.1
   97                      2.5     3.5     2.5     2.9                   2.5     2.5     2.4     2.1
   98                     10.7    10.0    10.6     9.5                  11.6     9.8    11.9     8.0
   99                     20.7    22.7    20.8    21.6                  21.6    25.8    21.8    24.6
  100                     27.4    26.9    27.9    26.0                  28.9    28.4    30.9    29.7
  101                     21.2    21.5    22.2    23.9                  22.3    22.0    21.5    24.3
  102                     11.6    10.5     9.6    10.4                   9.7     8.3     8.3     7.0
  103                      2.2     3.2     2.4     3.5                   1.9     2.7     2.2     2.4
  104                      1.6     0.5     1.5     0.9                   0.8     0.2     0.5     0.6
  105                      0.4     0.2     0.4     0.3                   0.1     0.1     0.1     0.2
  106                      0.4     0.0     0.3     0.0                   0.2     0.0     0.0     0.0
  107                      0.0     0.0     0.1     0.0                   0.0     0.0     0.0     0.0
  Total                  100.0   100.0   100.0   100.0                 100.0   100.0   100.0   100.0

6 Conclusions

Developing the rehabilitation PPS payment groups requires tradeoffs between simplicity and accuracy. Simplicity requires case mix groupings that are clinically homogeneous and parsimonious. Accuracy requires adequately capturing the expected cost for a subject with particular characteristics. CART did well on the first dimension. It produced simple, understandable patient groupings (FRGs). But it is a constrained fitting procedure and provides little internal evidence that it has done a good job of capturing cost. We checked its performance by fitting two other methods, MART and GAM, which are less constrained and are known to be good predictive models.

We found that MART consistently achieved the best prediction and that GAM was not far behind. Their prediction formulas are not as simple as CART's and, therefore, could not be used to determine payments. But we viewed their predictions as gold standards against which to evaluate those of CART. We could thus answer two important questions. How much of the explainable variation was CART able to capture? How different would (annual) payments to hospitals have been using the simpler, but less accurate, predictions of CART? We observed that the CART models traversed a substantial fraction of the distance between the constant model and the gold standards, capturing about 90 percent of the potentially explainable reduction in RMSE. More importantly, our results show that groupings based on CART would have produced annual hospital-level payments nearly the same as the gold standards. This lends credibility to the CART FRGs being a reasonable component of the rehabilitation payment system.

Of course creating the basic classification system is only part of a payment system. In addition to the classification system, payment under the rehabilitation PPS is determined by the rules for unusual cases (including transfers, in-hospital deaths, atypically short stay cases and high cost outlier supplements), case weights, and facility adjustments such as a wage index, rural payment, and disproportionate share payment. We used simulations to examine the actual payment levels for policy relevant subgroups, but these analyses are outside the scope of this paper.

Data mining is about employing very general techniques to find patterns in data. Good data mining techniques must be successful at finding the patterns, but must also be easy to apply to a new problem. MART, GAM, and CART all satisfy the latter requirement, but one or both of MART and GAM can generally be expected to beat CART in pattern detection. We feel that CART modeling should always be accompanied by a comparison of performance with other techniques such as these. Such a defensive analysis will at best show that CART is performing well, as was the case here. At worst, the analysis can indicate whether certain segments of the population are not being captured with CART, or whether certain explanatory variables need to be redefined, and the CART models might thus be improved.

7 References

3M Health Information Systems (1994). DRGs: Diagnosis Related Groups Definitions Manual, Version 12.0. Wallingford, CT.

Ashcraft, M.F., B.E. Fries, D.R. Nerenz, S.P. Falcon, S.V. Srivastava, C.Z. Lee, S.E. Berki, and P. Errera (1989). "A psychiatric patient classification system: An alternative to DRGs." Medical Care, 27:543–557.

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone (1984). Classification and Regression Trees. Belmont, CA: Wadsworth, Inc.

Carter, G.M., M. Beeuwkes Buntin, O. Hayden, J. Kawata, S. Paddock, D.A. Relles, G.K. Ridgeway, M. Totten, and B. Wynn (2001). Analyses for the Initial Implementation of the Inpatient Rehabilitation Facility Prospective Payment System. Santa Monica, CA: RAND, MR-1500-HCFA.

Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." The Annals of Statistics, 29(5):1189–1232.

Fries, B.E., D.P. Schneider, J.W. Foley, M. Gavazzi, R. Burke, and E. Cornelius (1994). "Refining a case-mix measure for nursing homes: Resource Utilization Groups (RUG-III)." Medical Care, 32:668–685.

Hastie, T. and R.J. Tibshirani (1990). Generalized Additive Models. London: Chapman and Hall.

Hastie, T., R.J. Tibshirani, and J.H. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Jaro, M.A. (1989). "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida." Journal of the American Statistical Association, 84:414–420.

Newhouse, J.P., S. Cretin, and C.J. Witsberger (1989). "Predicting Hospital Accounting Costs." Health Care Financing Review, 11(1):25–33.

Relles, D.A. and G.M. Carter (2002). Linking Medicare and Rehabilitation Hospital Records to Support Development of a Rehabilitation Facility Prospective Payment System. Santa Monica, CA: RAND, MR-1502-HCFA.

Stineman, M.G., C.J. Tassoni, J.J. Escarce, J.E. Goin, C.V. Granger, R.C. Fiedler, and S.V. Williams (1997). "Development of Function-Related Groups Version 2.0: A Classification System for Medical Rehabilitation." Health Services Research, 32(4):529–548.

