Wednesday, 2 September 2015

What is the difference between correlation and covariance?



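Correlation is the standardized form of covariance: the covariance divided by the product of the two standard deviations. It has no units and always lies between -1 and 1, so you can compare correlations across data sets measured in different units.

You cannot compare covariances that way, because a covariance carries the units of both variables and its size changes whenever the units change.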
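To make the point concrete, here is a small sketch using the built-in mtcars data. The choice of columns (mpg and wt) and the pounds-to-kilograms conversion are only for illustration.

```r
# Built-in mtcars data: mpg (miles per gallon) and wt (weight in 1000 lbs)
x <- mtcars$mpg
y <- mtcars$wt

# Correlation is covariance divided by the product of the standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)   # TRUE

# Change the units of y (1000 lbs -> kg): covariance changes, correlation does not
y_kg <- y * 453.592
cov(x, y)
cov(x, y_kg)
cor(x, y_kg)
```

The covariance gets multiplied by the conversion factor, while the correlation stays exactly where it was.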

Now, how do we find correlation and covariance in R?




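There are two functions:
  • cor() for calculating correlations
  • cov() for calculating covariances
The syntax is the same for both:
  • cor(x, y, use, method)
Here x and y can each be a vector, a matrix or a data frame. If x is a matrix or data frame, y can be omitted, and the correlations between the columns of x are returned.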
"use" controls how missing values are handled. It can take the following options:
  • "everything" (default) - all observations are used, and any result that involves an NA value is itself NA
  • "all.obs" - if any NA is present, an error is returned
  • "complete.obs" - listwise deletion: every row that contains an NA is dropped
  • "pairwise.complete.obs" - pairwise deletion: for each pair of variables, only the rows where either one is NA are dropped
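The four options are easier to tell apart on a tiny example. This is a sketch on a made-up data frame (the values are invented purely for illustration):

```r
# Small made-up data frame with missing values (values are illustrative only)
df <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, NA, 8, 10, 13),
  c = c(6, 5, 4, NA, 2, 1)
)

# "everything" (default): any pair of columns involving an NA yields NA
r_everything <- cor(df)

# "all.obs": stops with an error because NAs are present
r_all_obs <- tryCatch(cor(df, use = "all.obs"), error = function(e) "error")

# "complete.obs": drops rows 3 and 4 entirely, then uses the remaining 4 rows
r_complete <- cor(df, use = "complete.obs")

# "pairwise.complete.obs": for each pair, drops only the rows where that pair has an NA
r_pairwise <- cor(df, use = "pairwise.complete.obs")

print(r_everything)
print(r_complete)
print(r_pairwise)
```

Note that the a/b correlation differs slightly between the last two results, because pairwise deletion keeps row 4 for that pair while listwise deletion throws it away.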
"method" specifies which correlation coefficient to calculate:
  • "pearson" (default)
  • "spearman"
  • "kendall"
Most of the time we use Pearson, which is a parametric correlation, that is, it makes assumptions: the relationship between the variables is linear and, for the usual significance test, the data are roughly normally distributed.

Spearman is used mostly for ordinal data, or when the variables have a monotonic relationship rather than a linear one. It is non-parametric: it works on ranks and makes no assumption about the distribution of the data.
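A quick sketch of the difference, using made-up data where y always increases with x but not in a straight line (the exponential curve is just an assumed example):

```r
# Monotonic but non-linear relationship: y grows exponentially with x
x <- 1:20
y <- exp(x / 2)

r_pearson  <- cor(x, y, method = "pearson")   # measures linear association
r_spearman <- cor(x, y, method = "spearman")  # rank-based
r_kendall  <- cor(x, y, method = "kendall")   # rank-based

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

The rank-based coefficients come out as 1 because the ordering of y matches the ordering of x perfectly, while Pearson is noticeably lower because the points do not fall on a line.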

But the case is not solved yet. The correlation coefficient alone does not tell us whether two data sets have a strong, weak or no relationship. The value depends on the sample we happened to draw, so it could simply be due to chance. Thus, we also need to find the significance of the correlation.

To decide whether to reject the null hypothesis (here, that the true correlation is zero) we usually use the p-value. We reject the null hypothesis when the p-value is less than the chosen level of significance.

So, if the coefficient of correlation is high but the p-value is more than the level of significance, we can't comment on the relationship: the result could plausibly be due to chance, and a different sample from the same population might produce a different result.
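For a single pair of variables, one way to get both numbers is base R's cor.test(). The post itself does not use this function, so treat the following as a side sketch; the 0.05 level of significance is an assumed choice.

```r
# Test the correlation between miles per gallon and weight in mtcars
ct <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

print(ct$estimate)  # the correlation coefficient
print(ct$p.value)   # p-value for the null hypothesis of zero correlation

alpha <- 0.05       # level of significance, chosen here for illustration
is_significant <- ct$p.value < alpha
print(is_significant)
```

Here the p-value is far below 0.05, so the strong negative correlation is unlikely to be a product of chance.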

In comes the "Hmisc" package.

It has a function that serves our purpose: rcorr(). This function provides both the strength of each correlation and the corresponding p-values.

rcorr() takes x and an optional y, much like cor(). The input must be a numeric matrix, so use as.matrix() on inputs that are not of the required type, such as data frames. Unlike cor(), it has no "use" argument: missing values are always handled by pairwise deletion.

The kind of correlation is chosen with the "type" argument, which can be "pearson" or "spearman".
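A minimal sketch of a call, assuming the Hmisc package is installed and that the correlation kind is passed through rcorr()'s type argument. The five mtcars columns are picked only to keep the output small.

```r
# Assumes Hmisc is installed: install.packages("Hmisc")
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr() needs a numeric matrix, so convert the data frame first
  m <- as.matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])

  res <- Hmisc::rcorr(m, type = "pearson")

  print(round(res$r, 2))  # matrix of correlation coefficients
  print(res$P)            # matrix of p-values (NA on the diagonal)
  print(res$n)            # number of observations used for each pair
}
```

Reading res$r and res$P side by side shows, for each pair of variables, how strong the correlation is and whether it is significant.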


But, you know what would be really interesting? Plotting the correlations.

For this we need the corrgram() function of the corrgram package.

Let's plot the correlations of all the variables in the mtcars data set.

>library(corrgram)
>corrgram(mtcars)

The color scale goes from red to blue. The intensity of red indicates the strength of a negative correlation and the intensity of blue the strength of a positive one.

In the above plot we can clearly see that miles per gallon has a negative correlation with cylinders, displacement, horsepower and weight.

Sweet, now my post has a picture.
