Course Review

I am writing this blog post to summarize what I learned in ST 558 (Fall 2022) at NC State.

Things I am going to do differently

I had a summer internship before taking this course in the fall, where I had to provide R samples for clients that involved working with data frames, manipulating data, drawing insights, and generating plots. I used base R extensively to work with the data frames because I was not familiar with the tidyverse. Now that I’ve completed the course, I see how well the tidyverse packages fit together. I would have preferred to use dplyr for data manipulation and ggplot2 for richer plots. Going forward, I’m going to use the tidyverse whenever I need to work with data frames in R, because it makes the process easier and the code much more readable.
I had a general understanding of regression and classification models before taking this course, but after working on Projects 2 and 3, I picked up many more tools for exploratory data analysis (EDA) and learned more about how to use various models. These projects also showed me how much EDA matters in machine learning tasks. Going forward, I want to concentrate more on data preprocessing, EDA, and variable selection before I begin training models on the data.
I also want to look into doing cross-validation and fitting random forests using parallel computing, if possible.
I learned the value of Git and GitHub from the group projects mentioned above. Following GitHub best practices can occasionally be challenging with the workload of several courses. Going forward, I want to make sure I stick to them as much as possible: working in feature branches, using pull requests to merge code into the main branch, and creating issues to track root causes and resolutions.

What I want to learn more about

I currently have a part-time job as a member of a development team building a web application for data science and analytics. Its architecture consists of a variety of components, many of which I learned about during the course. For example, the data is retrieved through calls to API endpoints, users have access to Docker containers, and a Kubernetes cluster hosts the Jupyter notebooks that each user is actively working in. I still have a lot to learn about these technologies, and I want to dive deeper into how they work.
I’m interested in learning more about the “inference” part of data science: how the variables are chosen and how well the model fits the data. Although I’ve taken a statistics course on this before, that course didn’t emphasize the implementation side. So I suppose I just need to go over everything again and make the connections in order to understand data modeling better.
Deep learning and neural networks. Although these were not the course’s primary focus, and I am unsure how they are implemented in R, I do want to learn more about them, so I plan to enroll in a course on neural networks in the upcoming spring semester.
