Bringing Statistics to Life
Interview with Trevor Hastie (The John A. Overdeck Professor, Professor of Statistics and Professor of Biomedical Data Science at Stanford University).
During the TI Econometrics Lectures 2017, held May 10-12, Trevor Hastie gave TI students a glimpse of a new world. During the lectures we walked into a land of random forests, for a couple of hours becoming cowboys with lassos and fishermen with elastic nets.
You have been a researcher in statistics in both academia and the corporate sector. What are the main differences between the two worlds, and what do you like best/least about each?
That’s interesting. Regarding the corporate sector, I was at Bell Labs, which was the research lab of AT&T, and it wasn’t quite the same as the corporate sector today; Bell Labs was famous for letting its researchers do whatever they wanted. This was done with the hope that they would produce something useful for the company, but this was not a requirement. Nowadays, if you work at Google or Facebook, and you are in a research group, they expect you to come up with some deliverables pretty quickly; otherwise you’ll be nudged into another group. So my experience at Bell Labs was a little different: we were supported and were able to go to conferences; they encouraged us in that.
I went to Bell Labs as soon as I graduated. If I had gone to university straightaway I would have been teaching and writing grant proposals to get research money, and doing research as well. In a way, working in industry was much easier for me. And then, when I eventually went to university, I had already done a lot of research and so was able to get tenure immediately. By the time I left Bell Labs I had written two books already and was, after eight or nine years, fairly well established.
But getting back to your original question… I think it’s a little different today. Today, if you graduate with a PhD and go into industry straightaway, there are very few places where that does not mean shutting the door to academia. Maybe Microsoft Research is still one of the places where you can be for a few years and then go back if you want to, but such places are rare.
Related to the corporate sector, data science has become a buzzword nowadays. Do you think this is hype, a fad that will disappear in a few years?
I think that there is a bit of hype around the term “data science”, but I don’t think there is hype around the need for smart, computationally minded, applied statisticians and modelers in industry. Things like artificial intelligence (AI) are getting used more and more (almost everything we use today is smart), and I think for that you need these kinds of skills. Right now it’s called data science, which is a catch-all phrase, but I think that the need will stay… what they will call it in the future, who knows?
Do you perceive the growing popularity of data science careers as a threat to the quality of the analyses conducted? Should practitioners be cognizant of all the statistical assumptions and numerical algorithms underlying the packages they use?
Many people who worry about the potential threat to the field of statistics seem to feel like computer scientists are “stealing our turf”, as now computer science departments run data science programs. I believe that if you want to be a good data scientist you need a firm statistical grounding— and if you don’t have that you won’t be a good data scientist in the long run. There are plenty of examples where people have implemented ideas— and perhaps the program was really good and all that— but the statistical underpinnings were weak, and in the end it was not successful.
“if somebody has a good grounding in statistics they are going to be able to do a better job”
I would like to believe that if somebody has a good grounding in statistics they are going to be able to do a better job. I know plenty of very smart people, good programmers, but they lack the right sense about which methods they should be using and which not. There is a huge array of statistical methods out there (and people write papers all the time), and many of these methods aren’t any good. The ability to select the right methods comes from both good training in statistics and common sense in working with data. So you often see these smart people chasing after the wrong methods and wasting everyone’s time.
It seems like the competitive pay and the evident link between applied statistics/econometrics and data science entice many very talented PhD students to pursue an applied career in a company as opposed to an academic path. Do you think this is a loss for research in the field?
It’s a bit of a loss, yes. I am not sure what can be done about that, say, in the European community. I know in the U.S., for example, that at any given university the salary structure for professors can be different across fields, and in some cases statistics is remunerated less than economics. That is going to change. With the rise of data science, statisticians will be respected a bit more and salaries will go up. I think in the European system salaries are much more standardized, though.
“You do pay a price to be an academic, but you get a lifestyle which is very nice”
All in all: yes, you make much more money if you go into industry. Still, being an academic comes with a lifestyle that is important, too. You do pay a price to be an academic, but you get a lifestyle which is very nice and you don’t get as stressed out as you do in industry, so there are two sides to the story.
So, you do not feel that the research frontier has now mainly shifted into the corporate sector and that people who would have done research in a university will simply switch to conducting it in this corporate environment?
Not so much. Most corporations, unless they are very big, can’t really afford a genuine research group. Some very big investment companies might have research groups. So I think that the research is generally still done at the universities. And these days, if you are a statistician or a data scientist at a university you can get really good consulting jobs with industry. Then you can have the best of both worlds. I consult for several companies, and it is not just that you make money; you also learn about new, interesting problems, which is nice.
Finally, what skills and qualities do you consider to be of paramount importance for an applied statistician?
Well, having a good load of common sense is an essential quality of any successful applied statistician. Experience is always important, and I think you need to be comfortable programming. You need to be able to try things out yourself, so you need an environment where you can do that (whether MATLAB or Python or R; I would rather use R). That is actually really important, because when you see a problem you want to be able to get some data and try an idea out, to see if it makes sense. Finally, you need to have done some training in the core methods.
“A good load of common sense”
Regarding the methods themselves, it seems that the ‘big data’ approach is appreciated for its ability to fit and to forecast/predict. However, this often occurs in the absence of a clear causal link between the predictors and the outcomes. How important is data mining relative to having a structural model that can inform the estimation?
When you say “structure”, you are actually saying you would like to learn this causal pathway from the data. That is more in the realm of traditional statistical inference, where you understand the roles of variables and causality. In data mining it is harder— much harder.
Machine learning was developed in the computer science world and I don’t think causality was close to their heart, as it was all about prediction. Machine learning is now coming to statistics and people are starting to think more about how these techniques can be used and incorporated into more traditional tasks. For example, we look for causality when we try to estimate treatment effects in observational data, traditionally using propensity scores or instrumental variables. There is a lot of work going on right now in this area, studying some aspects of causality and trying to incorporate these more modern methods in those techniques. We are working on that right now, in fact…
Say we have obtained predictions. How can we use these for recommendations/policy? For example, in healthcare, how can you distinguish cause from effect?
That is harder. The specific example that I have been working on is using observational data from electronic health records to try and choose between two different treatments for heart disease. Both treatments had been approved by the FDA; in big clinical trials they both seemed to be equally effective. There is a sense in the community, however, that for some patients one drug is better than the other, so this is an area of personalized medicine. For a subset of people defined by their ethnicity, age and perhaps some other factors, we tried to carry out a causal analysis to find which treatment works better. In this case, it amounted to the estimation of a treatment effect.
Finally, in what directions do you think the field of statistics can develop and grow? What are the main challenges you see for the field?
I think it has always been the case that statistics has had to keep up with other fields, such as machine learning and computer science. Computer scientists operate more or less as engineers, coming up with new ideas, often expressed in terms of algorithms, for solving a real problem. They find a real problem and they put together a solution that often works very well. Statistics is always in catch-up mode, striving to find a formulation that expresses the solution in a way that lets us understand it and incorporate it with other things we have done in the past. I think there is a big role for that and it is important that we do it, because otherwise we will be left behind.
“Statistics is always in catch-up mode”
Can statistics still be considered an independent field from computer science nowadays? If not, how do these fields differ?
The fields have gotten closer in the past, especially as we increasingly need to use computing in statistics to deal with bigger and bigger data. Statisticians, just like economists, have other aspects of modelling in mind that are important to them. The real question is how statisticians can take some of the ideas from these new exotic methods and incorporate them into things that we are used to doing, in a way that can beef up what we are doing and make it more suitable for the modern day. We have much more data, so we should be able to learn more.