Statistics came well before computers. It would be very different if it were the other way around.

The stats most people learn in high school or college come from the time when computations were done with pen and paper. “Statistics were constrained by the computational technology available at the time,” says Stanford statistics professor Robert Tibshirani.
“People use certain methods because that is how it all started and that’s what they are used to. It’s hard to change it.”

People who have taken intro statistics courses might recognize terms like “normal distribution,” “t-distribution,” and “least squares regression.” We learn about them, in large part, because these were convenient things to calculate with the tools available in the early 20th century. We shouldn’t be learning this stuff anymore, or at least it shouldn’t be the first thing we learn. There are better options.

As a former data scientist, I get asked no question more often than, “What is the best way to learn statistics?” I always give the same answer: Read. Then, if you finish that and want more, read. These two books, written by statistics professors at Stanford University, the University of Washington, and the University of Southern California, are the most intuitive and relevant books I’ve found on how to do statistics with modern technology.
Tibshirani is a coauthor of both. You can download them for free.
Number crunchers

The books are based on the concept of “statistical learning,” a mashup of stats and machine learning. The field of machine learning is all about feeding huge amounts of data into algorithms to make accurate predictions. Statistics is concerned with predictions as well, says Tibshirani, but also with determining how confident we can be about the importance of certain inputs.

This is important in areas like medicine, where a researcher doesn’t just want to know whether a medicine worked, but also why it worked. Statistical learning is meant to take the best ideas from machine learning and computer science, and explain how they can be used and interpreted through a statistician’s lens.

The beauty of these books is that they make seemingly impenetrable concepts, such as “cross-validation,” “logistic regression,” and “support vector machines,” easily understandable. This is because the authors focus on intuition rather than mathematics. Unlike many statisticians, Tibshirani and his coauthors don’t come from a math background. He believes this helps them think conceptually.
“We try to explain concepts intuitively by explaining the underlying idea first,” he says. “Then we give examples of a situation you would expect it [to] work. And also, a situation where it might not work. I think people really appreciate that.” I certainly did.

For example, a section of An Introduction to Statistical Learning is dedicated to explaining the use of “bootstrapping,” a statistical technique only available in the age of computers. Bootstrapping is a way to assess the accuracy of an estimate by generating multiple datasets from the same data.

Say you collected the weights of 1,000 randomly selected adult women in the US, and found that the average was 130 pounds. How confident can you be in this number? In conventional statistics, to answer this question you would use a formula developed more than a century ago, which relies on many assumptions.
Today, rather than make those assumptions, you can use a computer to take thousands of samples of 500 people from your original 1,000 (this is the bootstrapping) and see how many of the resulting averages are close to 130. If most of them are, you can be more confident in the estimate.

Theory and application

These books, mercifully, don’t require high-level math, like multivariate calculus or linear algebra. (If you’re into that sort of thing, there is a wealth of worthy but dry academic literature out there for you.) “While knowledge of those topics is very valuable, we believe that they are not required in order to develop a solid conceptual understanding of how statistical learning methods work, and how they should be applied,” says Daniela Witten, a coauthor of An Introduction to Statistical Learning.

Helpfully, the books also provide code you can use to apply the tools with the. I recommend putting their examples to work on a dataset you are excited about. If you are into novels, use it to analyze.
If you like basketball, apply their examples to numbers at. The statistical learning tools are wonderful in themselves, but I’ve found they work best for people who are motivated by a personal or professional project.

Data and statistics are an increasingly, and nearly everyone would be better off with a deeper understanding of the tools that help explain our world. Even if you don’t want to become a data analyst (which happens to be one of the out there, just so you know), these books are invaluable guides to help explain what’s going on.
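To make the bootstrapping idea from the weights example concrete, here is a minimal Python sketch. It is not code from either book. The weights are simulated (the mean of 130 and spread of 25 pounds are assumptions for illustration), and the function names `bootstrap_means` and `percentile_interval` are hypothetical. Note that the usual bootstrap convention is to resample with replacement at the full size of the original sample; the `sample_size` parameter is there if you would rather draw 500 at a time, as described above.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, sample_size=None, seed=0):
    """Return the mean of each bootstrap resample drawn from `data`."""
    rng = random.Random(seed)
    size = len(data) if sample_size is None else sample_size
    # Sampling WITH replacement: the same person can be drawn more than once.
    return [
        statistics.fmean(rng.choices(data, k=size))
        for _ in range(n_resamples)
    ]


def percentile_interval(values, lower=2.5, upper=97.5):
    """Return the values found at the given percentiles of `values`."""
    ordered = sorted(values)

    def at(percent):
        index = round(percent / 100 * (len(ordered) - 1))
        return ordered[index]

    return at(lower), at(upper)


# Simulated stand-in for 1,000 measured weights in pounds (assumed data).
data_rng = random.Random(42)
weights = [data_rng.gauss(130, 25) for _ in range(1000)]

means = bootstrap_means(weights)
low, high = percentile_interval(means)

print(f"sample mean: {statistics.fmean(weights):.1f} pounds")
print(f"middle 95% of bootstrap means: {low:.1f} to {high:.1f} pounds")
```

A narrow range of bootstrap means suggests the original average is a stable estimate; a wide one suggests it would move around a lot if you collected the data again.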
David Spiegelhalter
Basic Books, $32

There are, as the saying goes, three kinds of lies: lies, damned lies and statistics. David Spiegelhalter is here to keep you from being duped by data.

If you’re seeking a plain-language intro to statistics, or just want to get better at judging the reliability of numbers in the news, Spiegelhalter’s The Art of Statistics is a solid crash course. The book is less about learning how to use specific mathematical tools than it is about exploring the myriad ways statistics can help solve real-world problems, and why statistical claims often have to be padded with caveats.

Spiegelhalter, a statistician at the University of Cambridge, keeps things lively by tying new concepts to questions. For instance, should you fret that eating bacon will increase your risk of bowel cancer? The relative risk might make you think so: People who eat a bacon sandwich every day have an 18 percent higher risk of bowel cancer than those who don’t. But looking at the absolute risk (a rise from 6 to 7 cases per 100 people) may put your mind at ease.

Spiegelhalter’s narration is encouraging, and he knows where beginners are likely to get tripped up.
He makes dense sections easier to parse by including frequent recaps and lots of data visualizations, and tucking equations into footnotes.

The Art of Statistics is alight with his enthusiasm for how statistics can be used to glean information for court cases, city planning and a host of other sectors. But Spiegelhalter warns readers not to forget the assumptions and uncertainties inherent in any analysis, and tells many cautionary tales about the ways statistics can go astray.
Patchy samples and logical missteps can lead to faulty conclusions. And bad-faith statistical practices have contributed to the and other areas of science (SN: 4/2/16, p. Perhaps the most flagrant example is how social psychologist Daryl Bem manipulated study designs and cherry-picked data to publish statistically significant results in 2011 that suggested humans have extrasensory perception.

Spiegelhalter doesn’t let the media off the hook, either.
Many of the questions he uses to introduce topics are drawn from misleading news reports.
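The relative-versus-absolute risk arithmetic in the bacon example earlier in this review can be checked in a few lines of Python. This is a rough sketch, not a calculation taken from the book: it assumes the lower figure in the review, about 6 cases per 100 people, is the baseline for people who don't eat bacon daily, and the function name `risk_with_exposure` is hypothetical.

```python
def risk_with_exposure(baseline_per_100, relative_increase):
    """Scale a baseline rate (cases per 100 people) by a relative increase."""
    return baseline_per_100 * (1 + relative_increase)


baseline = 6.0            # assumed cases per 100 people without daily bacon
relative_increase = 0.18  # the "18 percent higher risk" quoted in the review

with_bacon = risk_with_exposure(baseline, relative_increase)
extra_cases = with_bacon - baseline

print(f"without daily bacon: {baseline:.1f} cases per 100 people")
print(f"with daily bacon:    {with_bacon:.1f} cases per 100 people")
print(f"extra cases:         {extra_cases:.1f} per 100 people")
```

Under that assumption, an 18 percent relative increase works out to roughly one extra case per 100 people, which is why the absolute figure sounds so much less alarming than the relative one.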