DataVinci: Data Visualization in R

Welcome, my friend, welcome!

I have just got what you are looking for. I have got amazing codes and beautiful graphs and my oh my all in your favorite language - R :)

Enough drama. In this post I am going to combine my previous posts on data visualization in R into one single post. It's going to be pretty long, so keep some snacks handy :)
Alright,

So, the general approach to creating a good visualization is as follows -

Decide upon the most suitable visualization for the available data set, the one that provides the most insight in the least time
Decide upon the type of output file required (pdf, jpeg, png, bmp, ...)
Create a basic structure
Keep on adding details till you get what you want

Let's look at an example. One of the things I really appreciate about R is that it comes with many popular data sets to practice on. I will take the "mtcars" dataset.

Let's have a look at the first few lines of the dataset by using the head() function.
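For reference, a minimal sketch of that peek. head() shows the first six rows by default; the n argument and the str() call are only here for illustration and are not part of the original post.

```r
# mtcars ships with base R (the datasets package), so nothing needs loading.
first_rows <- head(mtcars)        # first 6 rows by default
print(first_rows)

# Ask for a different number of rows with n.
print(head(mtcars, n = 3))

# The two columns used in this post: weight (wt) and miles per gallon (mpg).
str(mtcars[, c("wt", "mpg")])
```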

This data set provides specs of some popular cars, taken from the US magazine Motor Trend. Now, suppose I want to show the relationship between the weight of the car (wt) and its miles per gallon (mpg). Let's start by simply plotting the relationship between the two variables by using the plot command.

And yes, I want the output in a png file. The name of the output will be "UncleSamCars.png". Here is how the first few lines of our code will look -

png("UncleSamCars.png")
attach(mtcars)
plot(wt,mpg)

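The mission is not yet accomplished. One catch: once png() is open, the plot is written to the file and nothing shows up on screen. To preview it in your R GUI, run the last two lines without the png() call and caressing "enter" will bloom the following -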

Hmmm, so ladies and gentlemen, we can see here that bulky cars get lower mileage. So, for the sake of my unborn child(ren), start buying lighter cars and reduce the load of your existence on the planet.

Now, suppose I want to add the linear regression line for the two variables to this graph. For that I need to add another line to my code -

abline(lm(mpg~wt))

And I want to give it a title - "Regression of mpg on wt", yep another line -

title("Regression of mpg on wt")

And here is how our final code will look when it grows up -

png("UncleSamCars.png")
attach(mtcars)
plot(wt,mpg)
abline(lm(mpg~wt))
title("Regression of mpg on wt")
detach(mtcars)
dev.off()

And here will be the final output -

Of course, we will discuss the code now :)

So the first and the last lines, png("file name.extension") and dev.off(), are used to save the graphical output in a file. If you want pdf or jpeg or anything else instead of png, simply swap png() for the matching function (pdf(), jpeg(), bmp(), ...) and change the file extension. The dev.off() stays the same and always needs to be present at the end of your graphics code, because it closes the file. Alright bruh?
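As a small sketch of switching the output type, here is the same scatter plot written to a PDF. The temporary file path is an assumption made for this example; use any file name you like.

```r
# Same plot, written to a PDF instead of a PNG.
# The path is a temporary file chosen for this sketch.
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                      # open the file device
plot(mtcars$wt, mtcars$mpg)        # everything drawn now goes to the file
dev.off()                          # close the device so the file is complete

print(file.exists(out_file))
```

The bitmap devices (png(), jpeg(), bmp()) follow the same open, draw, dev.off() pattern, though which of them are available can depend on how R was built.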

Next, I have used attach(mtcars) and detach(mtcars). These have nothing to do with the graph; it's just a convenient way to access the columns of a data frame by name. The attach comes before we start accessing the columns of the data frame and the detach comes once our code is done with them.
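If you would rather not attach anything, there are other ways to reach the same columns. This is a sketch of two common alternatives, not something the code in this post relies on.

```r
# Two ways to reach the same column without attach().
a <- mtcars$wt                     # explicit data$column
b <- with(mtcars, wt)              # evaluate an expression inside the data frame
print(identical(a, b))

# Many functions also accept a data argument, so a formula can use bare names.
fit <- lm(mpg ~ wt, data = mtcars)
print(class(fit))
```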

Next, the plot function. As the name suggests, it plots! Since a scatter plot has two axes, we give it two variables as input. The first argument goes on the x-axis and the second? Yep, the y-axis.

Now, the abline. This one is too scared to go out alone. So, if you try to draw an abline without first calling its momma, the plot() function, it will start crying (with an error)! So, this must follow the plot function.
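To see the crying for yourself, here is a small sketch. It uses pdf(file = NULL) as a scratch device so that nothing is written to disk, and the horizontal line at 20 is an arbitrary value picked for the example.

```r
pdf(file = NULL)                   # scratch device: nothing is written to disk

# No plot() yet, so abline() has nothing to draw on and raises an error.
msg <- tryCatch({
  abline(h = 20)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# After plot(), the same call works.
plot(mtcars$wt, mtcars$mpg)
abline(h = 20)
dev.off()
```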

As the name suggests abline is used to plot a line. But, you might have noticed the input of abline is another function - lm

lm() is used to fit a linear regression model. In our case we have modelled mpg based on wt. The line is the least squares fit: it minimizes the sum of the squared differences between the predicted and the actual values. To know more, search for "linear regression".

So, abline(lm(mpg~wt)) plots a line for linear regression of mpg on wt
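To peek at what abline() receives, here is a minimal sketch that pulls the intercept and slope out of the fitted model. The helper rss_of is made up for this example; it only checks that nudging the slope makes the squared error worse.

```r
fit <- lm(mpg ~ wt, data = mtcars) # fit mpg as a linear function of wt
coefs <- coef(fit)                 # named vector: (Intercept) and wt
print(coefs)

# abline(fit) draws the line  mpg = intercept + slope * wt.
# rss_of (hypothetical helper) gives the sum of squared errors for any line.
rss_of <- function(intercept, slope) {
  sum((mtcars$mpg - (intercept + slope * mtcars$wt))^2)
}

best   <- rss_of(coefs[["(Intercept)"]], coefs[["wt"]])
nudged <- rss_of(coefs[["(Intercept)"]], coefs[["wt"]] + 0.1)
print(c(best = best, nudged = nudged))   # the fitted line has the smaller value
```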

In the end we have title(), no extra points for guessing, it puts the title of the graph.

So, here is again the code and the final output -

png("UncleSamCars.png")
attach(mtcars)
plot(wt,mpg)
abline(lm(mpg~wt))
title("Regression of mpg on wt")
detach(mtcars)
dev.off()

Next we are going to explore the various options available with which we can control the presentation of our graphs. Coz, presentation matters!

*wink*

For this one I will consider the "iris" data set. Let's have a look at the first few rows.

Hmm, interesting. Let's see which species are here -

I will create a subset of iris, called iris1, where Species is virginica
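The commands for these three steps are not shown above, so here is one way they might look. The name iris1 matches the code that follows; the use of subset() is an assumption, and any equivalent filter would do.

```r
print(head(iris))                  # first few rows
species <- levels(iris$Species)    # the species present in the data
print(species)

# Keep only the virginica rows; iris1 is the name used in the code that follows.
iris1 <- subset(iris, Species == "virginica")
print(nrow(iris1))
```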

Now,

I want to see how, for virginica, the petal length varies with the sepal length. Let's do a simple plot for that.

> attach(iris1)
> plot(Sepal.Length,Petal.Length)
> title("Flower Science WOW!")

Pretty bland. But I know someone who can transform this.

Now, let me introduce you to my friend, the par() function.

So, my friend par is good at manipulating the graphical settings of the presentation. It can manipulate both the graph and the text of the visualization.

Let's see what weapons this Neo is carrying. The next few sections list the various options that can be used with par(). After that we will see an example.

For manipulating the symbols and lines -

pch - the symbol used for the points on the plot (integer codes 0 to 25)
cex (Sex in R :P) - it actually controls the size :P of the symbol, relative to the default of 1
lty - the line type - 6 options (1 to 6)
lwd - the line width

pch options in par() function

lty options in par() function
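If the two reference charts are not at hand, a rough version can be drawn directly. This is a sketch only; the grid arrangement and the temporary PDF path are choices made for this example.

```r
out_file <- tempfile(fileext = ".pdf")   # temporary path for this sketch
pdf(out_file, width = 8, height = 4)
old <- par(mfrow = c(1, 2))

# Left panel: the plotting symbols pch = 0..25, each labelled with its code.
pch_codes <- 0:25
x <- pch_codes %% 6
y <- pch_codes %/% 6
plot(x, y, pch = pch_codes, cex = 2, axes = FALSE, xlab = "", ylab = "",
     ylim = c(-0.5, 4.5), main = "pch")
text(x, y - 0.35, labels = pch_codes, cex = 0.7)

# Right panel: the six line types lty = 1..6, one horizontal line each.
lty_codes <- 1:6
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0.5, 6.5), axes = FALSE,
     xlab = "", ylab = "", main = "lty")
abline(h = lty_codes, lty = lty_codes)
axis(2, at = lty_codes, las = 1)

par(old)
dev.off()
```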

For manipulating colors -

col - the default plotting color. When given directly to a plotting function such as plot(), it can also be a vector, e.g. col=c("red","blue","green")
col.axis - color for the axis text
col.lab - color for the axis label
col.main - color of the main title
col.sub - color of the subtitle
fg - color of the foreground
bg - color of the background

We can specify the colors through names, hexadecimal, rgb, hsv

Names - col="red"

Hex - col="#FFFFFF"

rgb - col=rgb(1,1,1) (red, green and blue, each between 0 and 1)

hsv - col=hsv(0,0,1) (hue, saturation and value, each between 0 and 1)
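All four notations can describe the same color. A small sketch using white; note that rgb() and hsv() expect components between 0 and 1 by default.

```r
by_name <- "white"
by_hex  <- "#FFFFFF"
by_rgb  <- rgb(1, 1, 1)            # red, green, blue, each between 0 and 1
by_hsv  <- hsv(0, 0, 1)            # hue, saturation, value, each between 0 and 1

# rgb() and hsv() return hex strings, so they can be compared directly.
print(c(by_rgb, by_hsv))

# col2rgb() converts any color specification to 0-255 RGB components.
print(col2rgb(c(by_name, by_hex, by_rgb, by_hsv)))
```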

For manipulating the size of the text -

cex - You know what it does, it controls the size
cex.axis - controls the size of axis text
cex.lab - controls the size of the axis label
cex.main - controls the size of the title
cex.sub - controls the size of the subtitle

For manipulating the Font of the text -

font - the bold and italics. It accepts integer values. 1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.main - the font of the main title. Accepts number
font.axis - the font of the axis text. Accepts number
font.lab - the font of the axis label. Accepts number
font.sub - the font of the subtitle. Accepts number
family - the font family. Text input in quotes, e.g. family="serif"

We can also control the size and margins of the graph through par() -

pin - plot dimensions in inches, as c(width,height)
mai - margins in inches, as c(bottom, left, top, right)
mar - margins in lines of text, as c(bottom, left, top, right)

Alright, so we now know that par can manipulate the plots, the colors, the text (including font and size; cex.....ok it's no longer funny) and the dimensions of the plot area. Yay!

A few more things. Most of the properties that I have mentioned for par() can be passed directly to the various graph functions as well, e.g. plot(x,y,pch=4). When set through par(), they apply to all the graph commands that come after that par() call. You can still override them individually for each graph command, even when it comes after a par() call. Also, you can use multiple par() calls throughout your code to apply different attributes to different sets of graphs.

If you simply type par() on the console, it will return all the present graphical settings.
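A handy pattern that follows from this: par() hands back the previous values when you set something, so you can save them and restore them later. The sketch below also shows that a per-call override does not change the stored setting. It draws on a scratch device (pdf(file = NULL)) so nothing is written to disk.

```r
pdf(file = NULL)                   # scratch device for this sketch

# Setting parameters through par() returns the previous values invisibly.
old <- par(col = "red", pch = 19)
print(par("col"))                  # the new default

plot(mtcars$wt, mtcars$mpg)                 # uses the par() defaults
plot(mtcars$wt, mtcars$mpg, col = "blue")   # per-call value applies to this plot only
after_plot <- par("col")           # the stored default is unchanged by the override

par(old)                           # restore what was there before
restored <- par("col")
dev.off()
```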

Alright, I will now put some code below that covers a few properties from each kind of manipulation. Hmmm, so I want filled circles on the plot and not hollow ones, and they should be filled with blue color. I also want a dashed red line. I want all the text to be 25% larger than it is right now. I want everything in bold and I want the axis labels to be green.

> attach(iris1)
> par(col="red",cex=1.25,pch=19,font=2,col.lab="darkgreen")
> plot(Sepal.Length,Petal.Length)

Here you can see that since I have mentioned col="red" in the par() function, the plot is also red. But I want blue, so I will add the col attribute in the plot() function -

> attach(iris1)
> par(col="red",cex=1.25,pch=19,font=2,col.lab="darkgreen")
> plot(Sepal.Length,Petal.Length,col="blue")

Next, I will add the line and Title

> attach(iris1)
> par(col="red",cex=1.25,pch=19,font=2,col.lab="darkgreen")
> plot(Sepal.Length,Petal.Length,col="blue")
> abline(lm(Petal.Length~Sepal.Length),lty=5)
> title("Colored Flower Science WOW! and cex :P")
> detach(iris1)

Here again you can see that since we did not explicitly mention the color of line, by default it picked up the red color.

Multiple graphs eh? You got it bruh!

Next, I am going to write about general methods that can be used to combine multiple graphs together.

Here again we are going to call our friend the par() function with its mfrow and mfcol super powers

So, let's start by understanding mfrow [stress on row] and mfcol [stress on column]. Both are arguments to par(), not functions of their own.

If the objective is to have a panel of graphs with 3 columns and 2 rows then the syntax will be as follows -

>par(mfrow=c(2,3))

>par(mfcol=c(2,3))

The above code is essentially par(mfrow=c(number of rows, number of columns)) or par(mfcol=c(number of rows, number of columns))

So yes, at a time you will only use mfrow or mfcol. mfrow will simply fill the panels with graphs from left to right starting with the first row, that is along the rows, and mfcol will fill the panels with graphs from top to bottom starting with the first column, that is along the columns.

Have a look at the panel -

If you are using mfrow, the panel with the heart (first row, second column) will be the second graph, and if you are using mfcol, it will be the third graph. Hope that helps :)
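The fill order can also be checked from R itself. In this sketch, fill_panels is a helper made up for the example: it draws numbered empty panels on a 2 by 3 grid and reports, via par("mfg"), the row and column where the last one landed. Both calls should land in row 1, column 2.

```r
pdf(file = NULL)                   # scratch device for this sketch

# fill_panels (hypothetical helper): draw n numbered empty panels with the
# given par() settings and return c(row, column) of the last panel drawn.
fill_panels <- function(n, ...) {
  old <- par(...)
  on.exit(par(old))
  for (i in seq_len(n)) {
    plot.new()
    text(0.5, 0.5, labels = i, cex = 3)
  }
  par("mfg")[1:2]
}

second_by_row <- fill_panels(2, mfrow = c(2, 3))   # 2nd graph, filling along rows
third_by_col  <- fill_panels(3, mfcol = c(2, 3))   # 3rd graph, filling along columns
print(second_by_row)
print(third_by_col)
dev.off()
```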

Let's take a simple example where we are going to plot four graphs in a 2x2 grid.

>attach(mtcars)

> par(mfrow=c(2,2))

> plot(wt,mpg,main="scatter plot of wt vs mpg")

> plot(wt,disp,main="Scatter plot of wt vs disp")

> hist(wt,main="histogram of wt")

> boxplot(wt,main="boxplot of wt")

> detach(mtcars)

And this will produce the following plot -

Let us repeat the same thing, but this time we will use mfcol. And here is how the code will be -

> attach(mtcars)

> par(mfcol=c(2,2))

> plot(wt,mpg,main="scatter plot of wt vs mpg")

> plot(wt,disp,main="Scatter plot of wt vs disp")

> hist(wt,main="histogram of wt")

> boxplot(wt,main="boxplot of wt")

> detach(mtcars)

And here is how your result will be -

Now, what if we want to control how many graphs should be there in each row or column?

So, suppose we don't want all the slots of the grid to be filled. If we do this with mfrow/mfcol, we have to create an empty plot at the location where we want nothing, and that leaves a blank space where one can clearly discern something missing. See, I am not taking anything away from mfrow/mfcol. They are awesome, but even Superman slows down in the presence of Kryptonite. Got it...yeah

So, here is our Superman struggling in front of Kryptonite, and in comes the savior - layout().

Let's start by understanding the syntax.

> layout(matrix)

So, yes it needs a matrix. And how should that matrix be?

Suppose we want to plot only three graphs such that there are two graphs in the top row and one graph in the bottom row. For this the matrix will need to have 2 rows and 2 columns, and we will merge the columns of row 2 and keep the columns of row 1 as they are.

And what should be the data of the matrix? For convenience set the byrow argument of matrix() to TRUE, so that the matrix is filled row wise from left to right. Each graph has a serial number, and the graphs are drawn in that order. In our example we have three graphs, so there will be three serial numbers, viz. 1, 2 and 3.

Now, we only want one graph in the entire bottom row. For this the data of our matrix will be 1,2,3,3. By doing this we are specifying that cell(2,1) and cell(2,2) will share graph 3, while cell(1,1) and cell(1,2) will hold graphs 1 and 2.

So, the matrix will look something like this -

matrix(c(1,2,3,3),2,2,byrow=T)

And this will be our input to layout function

> layout(matrix(c(1,2,3,3),2,2,byrow=T))

So how will the overall code look? Like this -

> layout(matrix(c(1,2,3,3),2,2,byrow=T))
> hist(mtcars$wt)
> hist(mtcars$wt)
> hist(mtcars$wt)

And how will the graph look ? Like this -
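To preview a layout before drawing any real graphs, layout.show() outlines each numbered region. A small sketch, drawn on a scratch device (pdf(file = NULL)) so nothing is written to disk:

```r
m <- matrix(c(1, 2, 3, 3), nrow = 2, ncol = 2, byrow = TRUE)
print(m)                           # row 1 holds graphs 1 and 2, row 2 is all graph 3

pdf(file = NULL)                   # scratch device for this sketch
n <- layout(m)                     # layout() returns the number of figures
layout.show(n)                     # outline and number each region
dev.off()
```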

Let me give another example. This time we will have two graphs in the first column and only one graph in the second column.

For this we will modify our matrix as follows -

matrix(c(1,2,3,2),2,2,byrow=T)

The code will be as follows

> layout(matrix(c(1,2,3,2),2,2,byrow=T))
> hist(mtcars$wt)
> hist(mtcars$wt)
> hist(mtcars$wt)

And the output as follows -

The super awesomeness of layout() does not stop there. What if, after controlling the number of graphs in each row/column, you want to control their sizes as well? Yes sir, even that is possible!

Let me reveal to you two more arguments to layout() apart from the matrix: the widths argument and the heights argument.

The input to the widths argument is a vector of values that controls the relative widths of the columns. So, if you have two columns in your chart, you will input two values to the vector and the widths of the columns will be divided in the ratio of the inputs.

widths=c(1,3,2) will arrange the widths of columns 1, 2 and 3 in the ratio of 1:3:2

The heights argument controls the heights of the rows. The usage is the same as that of the widths argument.

heights=c(2,3,4) will arrange the heights of rows 1, 2 and 3 in the ratio of 2:3:4

Let us put this to use for our above example. Let’s arrange the column widths in the ratio of 2:1 and the heights of the rows in the ratio of 1:2. Our code will be modified as follows –

> layout(matrix(c(1,2,3,2),2,2,byrow=T),widths=c(2,1),heights=c(1,2))
> hist(mtcars$wt)
> hist(mtcars$wt)
> hist(mtcars$wt)

And voila –

Now, we are going to get even more sci-fi. We are going to see how to superimpose one graph over another. And this time again we gonna call our old friend, yep you got it, the par() function.

By now we know that par() has many super powers. In this post I will reveal to you another one – “fig”

The syntax of fig argument is as follows –

fig=c(x1,x2,y1,y2)

Where x1 and x2 are start and end points on the x axis from 0 to 1. And y1 and y2 are the start and end points on the y axis from 0 to 1. These dimensions will include the entire graph area including the labels. It will get clear with the examples.

One more thing: after the first plot, we need to set the new argument to TRUE in the par() call before each further plot, so that the next plot is added to the same page instead of starting a fresh one.

Alright. * popping fingers *

>par(fig=c(0,0.8,0,0.8))

>plot(mtcars$wt,mtcars$mpg)

>par(fig=c(0,0.8,0.62,1),new=T)

>hist(mtcars$wt)

>par(fig=c(0.6,1,0,1),new=T)

>boxplot(mtcars$mpg,axes=FALSE)

In the above code our first graph is a simple scatter plot with x1=0, x2=0.8, y1=0, y2=0.8. Things get interesting with the second plot. Here I have come up with the values of fig after trial and error, and most of the time you will need to do the same. Also notice that I have used the new=T argument.

If you increase the size of the y dimension for the second plot by lowering down y1, you will notice that the graphs will start overlapping.

For the final graph again I have come up with dimensions after trial and error and kept new argument as T.
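The trial and error gets easier if you can see how much two fig regions share. Since each region includes its margins, a thin shared strip usually falls on empty margin space, while a larger one makes the drawings collide. The helper fig_overlap below is made up for this example; it only does the rectangle arithmetic and draws nothing.

```r
# fig_overlap (hypothetical helper): shared width and height of two regions.
# Each region is c(x1, x2, y1, y2) on the 0 to 1 scale used by par(fig = ...).
fig_overlap <- function(a, b) {
  width  <- max(0, min(a[2], b[2]) - max(a[1], b[1]))
  height <- max(0, min(a[4], b[4]) - max(a[3], b[3]))
  c(width = width, height = height)
}

scatter <- c(0, 0.8, 0, 0.8)       # first plot in the code above
top     <- c(0, 0.8, 0.62, 1)      # second plot in the code above
print(fig_overlap(scatter, top))   # a thin strip shared along the y direction

# Lowering y1 of the top region widens the shared strip.
print(fig_overlap(scatter, c(0, 0.8, 0.4, 1)))
```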

And massaging enter will give... Boom –

Hope this post was worthy of your time. I will sincerely appreciate your feedback.

Till then.

Stay Awesome

Friday, 28 August 2015
