Feature Selection Blog 2: Goals of Feature Selection

Data Science PSI
6 min read · Aug 3, 2022

Carsten Henneges

In the previous blog, I touched on my intention to provide another perspective on feature selection and its aspects related to drug development. I plan to present a series of blog posts on this topic as part of the PSI Data Science SIG. This is the second blog in that series, and it asks: what are the goals of feature selection?

Before starting, it is important to understand the problem: feature selection. We plan to generate, or already have, a dataset describing observations by variables or features. In the feature selection context, variables are understood as the raw, unprocessed input, while features are derived from them. Usually this distinction is not essential, so I will write about features here, including variables. Feature selection requires us to select a subset of features for a given dataset, that is, to choose one element from the power set of all features. What are the goals that should determine such a selection?
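To make the "power set" framing concrete: for n features there are 2^n candidate subsets, which is why exhaustive search is rarely feasible. A minimal sketch, with purely hypothetical feature names:

```python
# Feature selection as subset search: enumerate the power set of a toy
# feature list. The feature names here are illustrative placeholders.
from itertools import combinations

features = ["age", "weight", "biomarker", "region"]  # hypothetical features

def powerset(items):
    """Yield every subset of `items` (2**n subsets for n items)."""
    for r in range(len(items) + 1):
        yield from combinations(items, r)

subsets = list(powerset(features))
print(len(subsets))  # 2**4 = 16 candidate feature sets
```

In practice the selection goal (prediction, explanation, description) supplies the criterion by which one of these subsets is chosen.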

Following Isabelle Guyon (Guyon and Elisseeff 2003) the goals can be three-fold:

1. improving the prediction performance of prediction models,

2. providing faster and more cost-effective prediction models, and

3. gaining better understanding of the underlying process that generated the data.

These goals are similarly well reflected, and more concretely described, by Shmueli et al. (Shmueli and others 2010), who distinguish three types of statistical modeling approaches:

1. Explanatory modeling,

2. Descriptive modeling, and

3. Predictive modeling.

Let me explain the differences and how they relate back to the goals put forward by Guyon. Predictive modeling is at the heart of Guyon’s first two goals: selecting the best variables for a set of machine learning or other models to improve the measurable prediction error on unknown observations. Usually, this error is estimated on hold-out data or via cross-validation or bootstrap estimation. Explanatory modeling refers to her third goal and entails setting up an analytic or mathematical model that describes how observations are functionally related. Techniques include regression modeling, structural equation modeling and, in a simpler form, influence diagrams. Descriptive modeling is more concerned with modeling the data itself: the goal here is neither to predict new observations nor to test an underlying causal theory with possibly unobserved variables. Descriptive modeling applies regression to capture the associations between dependent and independent variables.¹
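The cross-validated error estimate mentioned above can be sketched in a few lines; here with scikit-learn on synthetic data, so the model and dataset are placeholders rather than anything from a real trial:

```python
# Sketch: estimating prediction performance via k-fold cross-validation.
# Synthetic classification data stand in for real clinical variables.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Mean accuracy over 5 folds approximates performance on unseen data.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

The same scaffold works for comparing candidate feature subsets: refit and re-score the model on each subset and keep the one with the best cross-validated estimate.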

I think it is relevant to note that in our clinical setting we are exposed to two types of data that can be used for these tasks: observational data, from studies or registries, and data from randomized controlled trials (RCTs). The difference between the two is usually captured by the terms external and internal validity, that is, how well the data resemble the general population, and how comparable the treatment groups are with respect to their course of disease.

Feature selection on RCT data is facilitated by the low number of standardized and comparable treatment groups. The populations may be assumed comparable due to randomization, leaving treatment as the only causal variable that must be considered during feature selection. However, inferences from the selection may not generalize easily to a broader population — and thus be useful for predictions on unknown patients — due to the highly selected nature of enrolled RCT patients.

The picture changes slightly when it comes to observational studies. In contrast to RCTs, the enrolled population is less restricted and more real-world like, and there is no single dominating causal variable present in the data. Thus, observational studies may look like the data of choice for both explanatory and predictive modeling. However, dealing with differing treatment regimens, especially treatment switching, and with unbalanced group sizes can make feature selection difficult.

Sauerbrei et al. (Sauerbrei et al. 2020) concentrate their manuscript on state-of-the-art feature selection in an observational study setting. Essential goals for variable selection in their view are interpretability, transportability, and general usability; predictions are considered of secondary priority. Their emphasis is on avoiding induced selection biases, so as to identify a model that is useful in clinical practice by including few, not too costly, variables that are relevant and/or have strong effects. Additional considerations go in the direction of avoiding bias and inflated type I error through pre-specification.² They propose to use the standard set of regression models (linear, logistic, Cox) together with forward or, better, backward selection, shrinkage approaches like LASSO or ElasticNet, boosting, and generalized additive models (GAMs).
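As an illustration of one of the shrinkage approaches they list, LASSO drives some coefficients exactly to zero and thereby performs selection. A hedged sketch on synthetic data (in practice the columns of X would be candidate clinical variables):

```python
# Sketch: variable selection via LASSO with cross-validated penalty choice.
# make_regression plants 5 truly informative features among 20 candidates.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Features whose coefficient survived shrinkage are the "selected" ones.
selected = np.flatnonzero(lasso.coef_)
print(len(selected))
```

Forward or backward selection would instead add or drop one variable at a time against a fit criterion; the shrinkage route has the advantage of handling selection and estimation in a single fit.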

Sauerbrei et al. (Sauerbrei et al. 2020) also highlight that variable selection goes hand-in-hand with the determination of the functional form (i.e., feature construction as per Guyon and Elisseeff 2003). Aside from a priori linear features and categorization (by median, quantiles, and so forth), they put forward the use of fractional polynomials (which can be determined by a closed-test procedure) and splines, i.e., regressing the functional form with spline-based polynomials.

Thinking back, I realize that most statistical analyses I have been doing were hypothesis tests, hence neither explanatory nor predictive nor descriptive models. Variables selected for that purpose were the causal (treatment) and stratification factors (design variables) of the trial, plugged into the primary analysis method, a regression model, to produce a p-value followed by an effect estimate with 95% confidence interval, e.g., see (Montorsi et al. 2014). Interaction variables were selected if statistically significant at the 10% level, and variables were selected or modified in a pre-specified way to ensure that the (logistic) regression model converged [e.g., by coarsening from site towards geographic region in a pre-specified way]. Nearly all secondary or subgroup analyses then included these variables as context when testing additional predictors or outcomes. Variable selection for the goal of hypothesis testing seems to follow different rules than those mentioned before. Should there be an additional variable selection category, i.e., when is a test valid? This also adds another item to the topic: variable selection can only go as far as the model can be reliably identified by (converges on) the data. I have tried to collapse all goals into a visual figure:

Putting explanatory modeling aside, and defining descriptive modeling as the search for a “good” model without a clearly measurable concept, I think it is best to concentrate on predictive modeling in what follows in this blog series. The reasons are that the methods, measures, and methodology are clear, and that — even if they could be made more complex — the resulting models can still be used or analyzed to gain insights (aka explainable machine learning).

My blog will continue by touching on essential properties of clinical prediction models next.

See you then!

References

Guyon, Isabelle, and André Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3: 1157–82.

Montorsi, Francesco, Gerald Brock, Jens-Uwe Stolzenburg, John Mulhall, Ignacio Moncada, Hitendra R. H. Patel, Daniel Chevallier, et al. 2014. “Effects of Tadalafil Treatment on Erectile Function Recovery Following Bilateral Nerve-Sparing Radical Prostatectomy: A Randomised Placebo-Controlled Study (REACTT).” European Urology 65 (3): 587–96. https://doi.org/10.1016/j.eururo.2013.09.051.

Sauerbrei, Willi, Aris Perperoglou, Matthias Schmid, Michal Abrahamowicz, Heiko Becher, Harald Binder, Daniela Dunkler, et al. 2020. “State of the Art in Selection of Variables and Functional Forms in Multivariable Analysis–Outstanding Issues.” Diagnostic and Prognostic Research 4 (1): 3. https://doi.org/10.1186/s41512-020-00074-3.

Shmueli, Galit, and others. 2010. “To Explain or to Predict?” Statistical Science 25 (3): 289–310.

¹ That is what is routinely done in analyzing randomized trials or observational studies. In randomized trials the only causal variable explaining all subsequent observations is the assigned treatment, possibly together with stratification factors. However, other than using this regression model for a hypothesis test, there is — usually — no further causal theory, using mediator or instrumental variables or the like, being set up.

² Interestingly, observational studies may aim at estimating a certain quantity of interest in a general population and thus base sample size estimation on the width of confidence intervals. Since groups are biased or confounded, and thus require either instrumental variable analyses or propensity scoring methods for drawing conclusions from tests, inflated type I error is possibly of less concern. Propensity scoring makes use of all features in the dataset, while instrumental variables are constructed or validated based on a domain theory, i.e., fall into the explanatory modeling realm. And yet there is certainly a need to build data-driven models that neither explain nor predict but represent the data well enough that they can be used to communicate, and to interpret associations in clinical data based on a handy set of variables.
