HarvardX Data Science course - First final project
I recently finished the penultimate final assignment for the HarvardX Data Science course. The Stanford course was clearly about machine learning. This one is definitely lighter on machine learning and much heavier on data science: how to source, clean and visualise data are key skills. The targeted knowledge is more traditional probability and statistics. Long-established fundamental techniques like inference and polling are there.
This time R is the central tool of the course. It makes clear sense. When I started learning it about 15 years ago, I loathed the multiple gotchas. Since then, new libraries have simplified base R and smoothed over its exceptions and exceptions to exceptions. In addition, the Rcpp library has eased the implementation of efficient algorithms and interfacing with popular libraries. Still not a speed demon, but not the snail it used to be.
I won’t go through the project and my models. No revolutionary concepts. Just great results. I took half a day to reimplement it in Julia, both to cross-check and as personal training. As expected, the Julia version is a lot easier to read. But the big surprise was the speed difference. Although I didn’t time it, Julia only felt about twice as fast. Credit to the R project folks (I only used matrix operations, no modeling libraries).
On this report, I got grades that can’t be improved upon. Happy camper.