In today’s class, we explored the CDC 2018 diabetes dataset and learned about linear regression. Linear regression is a statistical method that models the relationship between a dependent variable (Y) and one or more independent variables (X). It is commonly used to predict the value of Y from the value(s) of X.
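To make the idea concrete for myself, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The numbers are toy values I made up so that the answer is easy to check; they are not from the CDC dataset, and `fit_line` is just an illustrative helper name.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Toy values (an assumption for illustration), chosen to lie exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(x, y)
predicted = slope * 6.0 + intercept  # predict Y for a new X value
```

With real data the points will not fall exactly on a line, so the fitted slope and intercept describe the best straight-line approximation rather than an exact rule.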
While examining the CDC diabetes dataset, I noticed that it contains three variables: obesity, inactivity, and diabetes. Each of these variables has a different number of data points. To keep our analysis consistent, we first found the data points shared by all three variables using the intersection operation. The intersection contains 354 data points, so this is the common subset we can use when analyzing all three variables together.
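The intersection step can be sketched roughly as follows. I am assuming here that each variable is stored as a mapping from some shared identifier to a value; the identifiers and numbers below are hypothetical placeholders, not rows from the actual dataset.

```python
# Hypothetical identifiers and values (assumptions for illustration only).
obesity = {"A": 30.1, "B": 28.4, "C": 35.2}
inactivity = {"A": 25.0, "C": 29.9, "D": 31.3}
diabetes = {"A": 9.8, "B": 8.1, "C": 12.4, "D": 11.0}

# Keep only the identifiers present in all three variables.
common = set(obesity) & set(inactivity) & set(diabetes)

# Build one consistent table: identifier -> (obesity, inactivity, diabetes).
merged = {key: (obesity[key], inactivity[key], diabetes[key])
          for key in sorted(common)}
```

In this toy case only "A" and "C" survive, in the same way that only 354 data points survived in the class dataset.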
Next, we analyzed each variable individually, looking at its data points and distribution. We plotted a smooth histogram for each variable, which made the shape of each distribution easier to see than raw counts alone.
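One common way to produce a smooth histogram is a Gaussian kernel density estimate. Whether this is exactly what was used in class is an assumption on my part, and both the bandwidth and the values below are arbitrary toy choices. A minimal sketch:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump centered on each observation.
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
                          for d in data)

    return density

# Toy values (an assumption for illustration), not from the CDC dataset.
values = [7.2, 8.1, 8.4, 9.0, 9.3, 10.5, 12.0]
density = gaussian_kde(values, bandwidth=1.0)

# Evaluate the smooth curve on a grid from 5.0 to 15.0 in steps of 0.1.
step = 0.1
grid = [5.0 + step * i for i in range(101)]
curve = [density(x) for x in grid]

# Rough area under the curve; close to 1 for a proper density.
area = sum(curve) * step
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths away detail, so the choice affects what the plot appears to show.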
During our exploration, we also encountered some new terminology. “Kurtosis” measures the tailedness of a distribution, that is, how much of the data lies in the tails compared with a normal distribution. In the context of linear regression, we discussed that if the residuals (the differences between observed and predicted values) “fan out,” meaning their spread changes as the predicted or independent values change, this is called “heteroscedasticity.”
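Both terms can be illustrated with a short sketch. The kurtosis function below uses the fourth standardized moment (about 3 for a normal distribution). The spread ratio is only an informal check that I am using for illustration, not a formal statistical test, and the residuals are made-up toy values.

```python
from statistics import mean

def kurtosis(data):
    """Fourth standardized moment (population form)."""
    m = mean(data)
    n = len(data)
    m2 = sum((v - m) ** 2 for v in data) / n  # variance
    m4 = sum((v - m) ** 4 for v in data) / n  # fourth central moment
    return m4 / m2 ** 2

def residual_spread_ratio(x, residuals):
    """Mean squared residual in the upper half of x divided by the lower half.

    A ratio far from 1 hints that the residual spread changes with x.
    """
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    low = mean([r ** 2 for _, r in pairs[:half]])
    high = mean([r ** 2 for _, r in pairs[half:]])
    return high / low

# Toy residuals (an assumption for illustration) that fan out as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.5, -1.6]

ratio = residual_spread_ratio(x, residuals)
k = kurtosis([1, 2, 3, 4, 5])
```

Here the ratio comes out much larger than 1, which matches the fanning-out picture. Some software reports excess kurtosis instead, which subtracts 3 so that a normal distribution scores 0.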