I see two reasons for why the challenges of machine learning are misunderstood. First, as the name suggests, machine learning is software that learns by itself as opposed to being instructed on every single rule by a developer. This is an oversimplification that many media outlets with little or no knowledge of the actual challenges of writing machine learning algorithms often use when speaking of the ML trade. [Read: How the Dutch government uses data to predict the weather and prepare for natural disasters] The second reason, in my opinion, are the many books and courses that promise to teach you the ins and outs of machine learning in a few hundred pages (and the ads on YouTube that promise to net you a machine learning job if you pass an online course). Now, I don’t what to vilify any of those books and courses. I’ve reviewed several of them (and will review some more in the coming weeks), and I think they’re invaluable sources for becoming a good machine learning developer. But they’re not enough. Machine learning requires both good coding and math skills and a deep understanding of various types of algorithms. If you’re doing Python machine learning, you have to have in-depth knowledge of many libraries and also master the many programming and memory-management techniques of the language. And, contrary to what some people say, you can’t escape the math. And all of that can’t be summed up in a few hundred pages. Rather than a single volume, the complete guide to machine learning would probably look like Donald Knuth’s famous The Art of Computer Programming series. So, what is all this tirade for? In my exploration of data science and machine learning, I’m always on the lookout for books that take a deep dive into topics that are skimmed over by the more general, all-encompassing books. In this post, I’ll look at Python for Data Analysis and Practical Statistics for Data Scientists, two books that will help deepen your command of the coding and math skills required to master Python machine learning and data science.
Python for data analysis
Python for Data Analysis, 2nd Edition, is written by Wes McKinney, the creator of the pandas, one of key libraries using in Python machine learning. Doing machine learning in Python involves loading and preprocessing data in pandas before feeding them to your models. In Python for Data Analysis, McKinney takes you through the entire functionality of pandas and manages to do so without making it read like a reference manual. There are lots of interesting examples that build on top of each other and help you understand how the different functions of pandas tie in with each other. You’ll go in-depth on things such as cleaning, joining, and visualizing data sets, topics that are usually only discussed briefly in most machine learning books.Most books and courses on machine learning provide an introduction to the main pandas components such as DataFrames and Series and some of the key functions such as loading data from CSV files and cleaning rows with missing data. But the power of pandas is much broader and deeper than what you see in a chapter’s worth of code samples in most books. You’ll also get to explore some very important challenges, such as memory management and code optimization, which can become a big deal when you’re handling very large data sets in machine learning (which you often do). What I also like about the book is the finesse that has gone into choosing subjects to fit in the 500 pages. While most of the book is about pandas, McKinney has taken great care to complement it with material about other important Python libraries and topics. You’ll get a good overview of array-oriented programming with numpy, another important Python library often used in machine learning in concert with pandas, and some important techniques in using Jupyter Notebooks, the tool of choice for many data scientists. All this said, don’t expect Python for Data Analysis to be a very fun book. It can get boring because it just discusses working with data (which happens to be the most boring part of machine learning). There won’t be any end-to-end examples where you’ll get to see the result of training and using a machine learning algorithm or integrating your models in real applications. My recommendation: You should probably pick up Python for Data Analysis after going through one of the introductory or advanced books on data science or machine learning. Having that introductory background on working with Python machine learning libraries will help you better grasp the techniques introduced in the book.
Practical statistics for data scientists
While Python for Data Analysis improves your data-processing and -manipulation coding skills, the second book we’ll look at, Practical Statistics for Data Scientists, 2nd Edition, will be the perfect resource to deepen your understanding of the core mathematical logic behind many key algorithms and concepts that you often deal with when doing data science and machine learning. But again, the key here is specialization.The book starts with simple concepts such as different types of data, means and medians, standard deviations, and percentiles. Then it gradually takes you through more advanced concepts such as different types of distributions, sampling strategies, and significance testing. These are all concepts you have probably learned in math class or read about in data science and machine learning books. On the one hand, the depth that Practical Statistics for Data Scientists brings to each of these topics is greater than you’ll find in machine learning books. On the other hand, every topic is introduced along with coding examples in Python and R, which makes it more suitable than classic statistics textbooks on statistics. Moreover, the authors have done a great job of disambiguating the way different terms are used in data science and other fields. Each topic is accompanied by a box that provides all the different synonyms for popular terms. As you go deeper into the book, you’ll dive into the mathematics of machine learning algorithms such as linear and logistic regression, K-nearest neighbors, trees and forests, and K-means clustering. In each case, like the rest of the book, there’s more focus on what’s happening under the algorithm’s hood rather than using it for applications. But the authors have again made sure the chapters don’t read like classic math textbooks and the formulas and equations are accompanied by nice coding examples. Like Python for Data Analysis, Practical Statistics for Data Scientists can get a bit boring if you read it end to end. There are no exciting applications or a continuous process where you build your code through the chapters. But on the other hand, the book has been structured in a way that you can read any of the sections independently without the need to go through previous chapters. My recommendation: Read Practical Statistics for Data Scientists after going through an introductory book on data science and machine learning. I definitely recommend reading the entire book once, though to make it more enjoyable, go topic by topic in-between your exploration of other machine learning courses. Also keep it handy. You’ll probably revisit some of the chapters from time to time.
Some closing thoughts
I would definitely count Python for Data Analysis and Practical Statistics for Data Scientists as two must-reads for anyone who is on the path of learning data science and machine learning. Although they might not be as exciting as some of the more practical books, you’ll appreciate the depth they add to your coding and math skills. This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech and what we need to look out for. You can read the original article here.