So, in my last post, I showed how to create two histograms from a certain data set and then how to plot the two variables to see if there is any relationship.
Visually, it was easy to tell that there was a negative relationship between the weight of an automobile and its fuel economy. But is there a more objective way to understand the relationship? There is: we can summarize it with a single number. This number is called Pearson’s Correlation Coefficient or, in the vernacular, simply the “correlation.” Essentially, this number measures the strength and direction of the linear relationship between two variables. It is the square of the correlation, not the correlation itself, that gives the share of the fluctuation in one variable that can be explained by the other.
A correlation of 1 means the variables move in perfect unison, a correlation of -1 means the variables move in exactly opposite directions, and a correlation of 0 means there is no linear relationship between the two variables.
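To make these reference points concrete, here is a small sketch using made-up vectors (they are not part of the motorcars data). It also shows why a correlation of 0 only rules out a straight-line relationship: the last pair is perfectly related, just not linearly.

```r
# Toy data, invented purely for illustration
x <- 1:10

# Perfectly linear and increasing
r_up <- cor(x, 2 * x + 3)

# Perfectly linear and decreasing
r_down <- cor(x, -x)

# A U-shaped variable that depends entirely on x, but not in a straight line
u_shaped <- (x - mean(x))^2
r_u <- cor(x, u_shaped)

r_up     # 1
r_down   # -1
r_u      # essentially 0
```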
So, how do we retrieve the correlation between two variables in R? First, we import the same data set we used last time. When we run the head() function on motorcars, we get the first 6 rows of every column in the data set.
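The import step depends on where your copy of the data lives, so the sketch below makes an assumption: it uses R's built-in mtcars data as a stand-in for motorcars and moves the model names into a first column (called "model" here, a name chosen for illustration) so the layout matches the description in this post.

```r
# Assumption: the built-in mtcars data stands in for the motorcars data set.
# The model names are row names in mtcars, so we copy them into a first column.
motorcars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Column names, including "wt" (weight) and "mpg" (miles per gallon)
colnames(motorcars)

# First 6 rows of every column
head(motorcars)
```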
When we view the data set (using colnames() or head()), we see that the column names for the variables we are trying to measure are “wt” and “mpg.” Now, all we need to do is subset these two variables with the dollar sign and place them within the cor() function. When we run this code, we can see that the correlation is -0.87, which indicates a strong negative linear relationship: heavier cars tend to get fewer miles per gallon. Squaring the correlation tells us that roughly three quarters of the variation in mpg can be explained by a linear relationship with weight. If you plot the two variables using the plot() function, you can see that this relationship is fairly clear visually.
Could there be other things that are related to the fuel economy of the vehicle, besides weight? What if we want to see how all of these variables are related to one another? Well, we could run a correlation on every single combination we can think of, but that would be tedious.
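The single-pair correlation described above can be sketched as follows. As an assumption, motorcars is rebuilt from the built-in mtcars data; with your own copy of the data, only the cor() and plot() lines are needed.

```r
# Assumption: mtcars stands in for the motorcars data set
motorcars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Subset the two columns with the dollar sign and pass them to cor()
wt_mpg_cor <- cor(motorcars$wt, motorcars$mpg)

round(wt_mpg_cor, 2)     # -0.87
round(wt_mpg_cor^2, 2)   # 0.75, the share of variation in mpg explained by wt

# Scatter plot of the same two variables
plot(motorcars$wt, motorcars$mpg,
     xlab = "Weight (wt)", ylab = "Miles per gallon (mpg)")
```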
Is there a way we can view all the correlations with a single line of code? There is. First, we create a separate data frame that includes only the numeric data from motorcars, that is, every column to the right of the vehicle model name.
Then, we simply run a correlation on the new data frame, which we’ve called “mc_data.” To clean things up a bit, I’ve nested the cor() function within the round() function to round the result to two decimal places.
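A minimal sketch of these two steps, again assuming the built-in mtcars data as a stand-in for motorcars. Dropping the first column by negative index is one way to keep everything to the right of the model name; your own subsetting may differ.

```r
# Assumption: mtcars stands in for the motorcars data set
motorcars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Keep every column except the first one (the model name)
mc_data <- motorcars[, -1]

# Correlation of every column with every other column, rounded to 2 decimals
cor_matrix <- round(cor(mc_data), 2)
cor_matrix

# Just the mpg row for a few variables of interest
cor_matrix["mpg", c("cyl", "disp", "hp", "wt")]
```

cor() needs numeric input, which is why the model name column has to be removed before the matrix is computed.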
When we enter this code, we get the full correlation matrix. We can see that there are several other variables that are related to mpg, such as cyl, disp, and hp.
Now, we can plot the variables that are most correlated with miles per gallon using this code (refer to previous post for explanation).
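One possible way to do this is sketched below. It assumes the built-in mtcars data as a stand-in for motorcars, and the choice of the four strongest correlates and the 2-by-2 layout are arbitrary choices for illustration, not necessarily the layout used in the previous post.

```r
# Assumption: mtcars stands in for the motorcars data set
motorcars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)
mc_data <- motorcars[, -1]

# Correlation of each variable with mpg, leaving out mpg itself
mpg_cor <- cor(mc_data)[, "mpg"]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# The four variables with the largest absolute correlation (arbitrary cutoff)
top_vars <- names(sort(abs(mpg_cor), decreasing = TRUE))[1:4]

# Draw one scatter plot per variable in a 2-by-2 grid, then restore the layout
old_par <- par(mfrow = c(2, 2))
for (v in top_vars) {
  plot(mc_data[[v]], mc_data$mpg,
       xlab = v, ylab = "mpg", main = paste("mpg vs.", v))
}
par(old_par)
```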