Feature Selection Blog 1: Aims of the PSI Data Science SIG Blog on Feature Selection
Carsten Henneges
Hello and welcome!
The topic of feature or variable selection covers algorithms that identify a subset of relevant features out of a larger pool in order to construct statistical models. There are many blogs on the topic, and a ton of publications have come out in recent years. This blog is thus just another stream of thoughts on the subject, laying out my perspective on what I have read. If you have a different opinion, please let me know or leave your thoughts below in the comments section of the blog. Otherwise, I hope that my perspective resonates with you and helps you to become a better statistician or data scientist.
I joined the PSI Data Science SIG at the end of 2019. Before that I was a PSI member as a statistician working in the pharmaceutical industry and supporting projects in the late phase setting. As a computer scientist who had worked on machine learning algorithms for a PhD, I found this an interesting perspective to learn. Now, after a career change, I have landed back in the early phase and exploratory area, where feature selection is a relevant topic to learn more about.
Having these two roots, I hope to provide thoughts and perspectives that bridge both worlds: what data scientists do (developing algorithms and solving computational statistical problems) and what statisticians do (planning analyses, estimating and interpreting treatment effects with quantified uncertainty, and drawing conclusions from statistical tests). Data is what connects the two cultures, yet each looks at data with a different background and different ideas. Leo Breiman’s seminal paper on the two statistical cultures of data or task modelers outlines these differences (Breiman 2001).
These two cultures have shaped the topic of feature selection, steering its aims and goals in different but perhaps complementary directions. This will be the first topic in the series of blog posts to follow! When thinking about feature selection, I laid out a series of subtopics that I plan to write about. They will gradually lead you through the different aspects, summarising key approaches, providing references and touching on points that are relevant to our work as drug developers.
See you then!
References
Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures.” Statist. Sci. 16 (3): 199–231.