Rajat's Analyticzone: First steps to my ggplots_05Jan17-Session

The data we collect is only as good as our ability to understand it and communicate it to others, which is why choosing the right visualization is essential. If our data is misrepresented or presented ineffectively, key insights are lost and the message we wanted to deliver is weakened.

This is my first ggplot guide. It shows the most common charts and visualizations and should help in choosing the right presentation for the data.

Information can be visualized in a number of ways, each of which can provide a specific insight. When we start to work with data, it’s important to identify and understand the story we are trying to tell and the relationship we are looking to show. Knowing this information will help us select the proper visualization to best deliver our message.

When analyzing data, we search for patterns or interesting insights that can be a good starting place for finding our story, such as trends, correlations, and outliers.

BAR CHARTS

Bar charts are very versatile. They are best used to show change over time, compare different categories, or compare parts of a whole.

VERTICAL (COLUMN CHART)

- Best used for chronological data (time-series should always run left to right), or when visualizing negative values below the x-axis.

HORIZONTAL

- Best used for data with long category labels

STACKED

- Best used when there is a need to compare multiple part-to-whole relationships. These can use discrete or continuous data, oriented either vertically or horizontally.

100% STACKED

- Best used when the total value of each category is unimportant and percentage distribution of subcategories is the primary message.
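As a rough preview of how these bar chart variants can be drawn with ggplot2 (the package installed further down in this post), here is a minimal sketch on the built-in mtcars data. Using cyl and gear as the categories is only my assumption for illustration.

```r
library(ggplot2)

# Shared base: one bar per cylinder count, split by number of gears.
base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear)))

p_stacked    <- base + geom_bar()                   # stacked counts (default position)
p_full       <- base + geom_bar(position = "fill")  # 100% stacked: every bar sums to 1
p_horizontal <- p_stacked + coord_flip()            # same bars, drawn horizontally

print(p_stacked)
print(p_full)
print(p_horizontal)
```

The only difference between the stacked and the 100% stacked version is the position argument, so switching between them is cheap.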

PIE CHARTS

Pie charts are best used for making part-to-whole comparisons with discrete or continuous data. They are most impactful with a small data set.

STANDARD

- Used to show part-to-whole relationships.

DONUT

- Stylistic variation that enables the inclusion of a total value or design element in the center.
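ggplot2 has no dedicated pie geom, so a common approach is to draw one stacked bar in polar coordinates. The sketch below assumes the mtcars data and counts cars per cylinder class; the x = 2 and xlim() values for the donut are arbitrary choices that leave a hole in the middle.

```r
library(ggplot2)

# Standard pie: a single stacked bar wrapped around a circle.
pie <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

# Donut: place the bar at x = 2 and extend the x-axis inwards,
# so the centre of the circle stays empty.
donut <- ggplot(mtcars, aes(x = 2, fill = factor(cyl))) +
  geom_bar(width = 1) +
  xlim(0.5, 2.5) +
  coord_polar(theta = "y")

print(pie)
print(donut)
```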

LINE CHARTS

Line charts are used to show time-series relationships with continuous data. They help show trend, acceleration, deceleration, and volatility.

AREA CHARTS

Area charts also depict a time-series relationship, but they differ from line charts in that they can represent volume.

AREA CHART

- Best used to show or compare a quantitative progression over time.

STACKED AREA

- Best used to visualize part-to-whole relationships, helping show how each category contributes to the cumulative total.

100% STACKED AREA

- Best used to show distribution of categories as part of a whole, where the cumulative total is unimportant.
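A minimal sketch of the stacked and 100% stacked area charts in ggplot2, assuming the built-in Orange data (circumference of five orange trees measured at the same ages). Treating the five trees as the parts of a whole is just an illustration.

```r
library(ggplot2)

base <- ggplot(Orange, aes(x = age, y = circumference, fill = Tree))

p_stacked <- base + geom_area()                   # areas stacked to the cumulative total
p_full    <- base + geom_area(position = "fill")  # 100% stacked: shares of the total

print(p_stacked)
print(p_full)
```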

SCATTER PLOT

Scatter plots show the relationship between two variables measured on the same items. They are best used to show correlation in a large amount of data.

BUBBLE CHART

Bubble charts are good for displaying nominal comparisons or ranking relationships.

BUBBLE PLOT

- This is a scatter plot with bubbles, best used to display an additional variable.

BUBBLE MAP

- Best used for visualizing values for specific geographic regions.

HEAT MAP

Heat maps display categorical data, using intensity of color to represent values of geographic areas or data tables.
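For a data table, a heat map can be sketched in ggplot2 with geom_tile(). The example below assumes the diamonds data that ships with ggplot2 and colors each cut/clarity cell by the number of diamonds in it; the choice of variables is mine.

```r
library(ggplot2)

# Count diamonds for every combination of cut and clarity.
counts <- as.data.frame(table(cut = diamonds$cut, clarity = diamonds$clarity))

# One tile per combination, colored by the count.
heat <- ggplot(counts, aes(x = cut, y = clarity, fill = Freq)) +
  geom_tile()

print(heat)
```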

INTRO TO GGPLOTS

The grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. This blog post is my introduction to ggplot2, a visualization package in R. It assumes a very basic knowledge of R, like vectors, data frames and reading csv files. ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar.
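To make the idea of a grammar a little more concrete, here is a small sketch (assuming ggplot2 is already installed; installation is covered in the next section). The data and the aesthetic mapping are declared once, and different geoms are then added as layers on top of the same base.

```r
library(ggplot2)

# Data + aesthetic mapping: shared by every plot below.
base <- ggplot(iris, aes(x = Species, y = Petal.Length))

p_box    <- base + geom_boxplot()                             # summary per species
p_points <- base + geom_jitter(width = 0.2)                   # every observation
p_both   <- base + geom_boxplot() + geom_jitter(width = 0.2)  # two layers combined

print(p_box)
print(p_points)
print(p_both)
```

None of these three plots needs its own named function; they are just different combinations of the same components.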

GGPLOT2 INSTALLATION

One of R’s greatest strengths is its excellent set of packages. To install a package, we can use the install.packages() function.

To install ggplot2 package we write the following:

install.packages("ggplot2")

To load a package into our current R session, we use library() like below:

library(ggplot2)
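If the script is run more than once, it can be convenient to install the package only when it is missing. A minimal sketch using requireNamespace() from base R:

```r
# Install ggplot2 only when it is not available yet, then attach it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Show which version is in use.
packageVersion("ggplot2")
```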

Scatter plots with qplot():

We now create a scatterplot in ggplot2. We’ll use the “iris” data frame that’s automatically loaded into R.

We can use the head function to look at the first few rows of the data frame:

head(iris)

The data frame actually contains three types of species: setosa, versicolor, and virginica. Let’s plot Sepal.Length against Petal.Length using ggplot2’s qplot() function:

qplot(Sepal.Length, Petal.Length, data = iris)

# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.

# * First argument `Sepal.Length` goes on the x‐axis.

# * Second argument `Petal.Length` goes on the y‐axis.

# * `data = iris` means to look for this data in the `iris` data frame.

To see where each species is located in this graph, we can color each point by adding a color = Species argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

Similarly, we can let the size of each point denote petal width, by adding a size = Petal.Width argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)

# We see that Iris setosa flowers have the narrowest petals.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width , alpha = I(0.7))

# By setting the alpha of each point to 0.7, we reduce the effects of overplotting.

Finally, let’s fix the axis labels and add a title to the plot.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, xlab = "Sepal Length", ylab = "Petal Length", main = "Sepal vs. Petal Length in Iris data")
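For comparison, the same kind of plot can be written in the full ggplot() syntax. This is a sketch of what I understand to be the equivalent call, combining the color, size, alpha and label settings used above in one plot.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                      colour = Species, size = Petal.Width)) +
  geom_point(alpha = 0.7) +                       # fixed transparency for all points
  labs(x = "Sepal Length", y = "Petal Length",
       title = "Sepal vs. Petal Length in Iris data")

print(p)
```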

Other common geoms:

In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot().

# These two commands are equivalent and give the same output.

qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)

But we can also easily use other types of geoms to create more kinds of plots.

Barcharts: geom = “bar”

movies = data.frame(director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
   movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
   minutes = c(124, 163, 195, 600, 187))

# Plot the number of movies each director has.

qplot(director, data = movies, geom = "bar", ylab = "# movies")

# By default, the height of each bar is simply a count.

But we can also supply a different weight.

# Here the height of each bar is the total running time of the director's movies.

qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "Total Length")
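To check what the weighted bars represent, we can compute the totals ourselves with aggregate() from base R and compare them with the bar heights. The sketch below repeats the movies data frame so that it runs on its own, and uses the ggplot() syntax for the plot.

```r
library(ggplot2)

movies <- data.frame(
  director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
  movie    = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
  minutes  = c(124, 163, 195, 600, 187)
)

# Total running time per director, computed directly.
totals <- aggregate(minutes ~ director, data = movies, FUN = sum)
print(totals)

# The same totals as bar heights: each row is weighted by its minutes.
p_weight <- ggplot(movies, aes(x = director, weight = minutes)) +
  geom_bar() +
  ylab("Total Length")
print(p_weight)
```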

Line charts: geom = “line”

qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)

# Using a line geom doesn't really make sense here.

`Orange` is another built‐in data frame that describes the growth of orange trees.

qplot(age, circumference, data = Orange, geom = "line", color = Tree, main = "How does tree circumference vary with age?")

# We can also plot both points and lines.

qplot(age, circumference, data = Orange, geom = c("point", "line"), color = Tree,  main = "How does tree circumference vary with age?")

`diamonds` is a data frame that ships with ggplot2 and describes diamonds by their carat, cut, color, clarity, price, dimensions, etc. `mtcars` is built into R and describes different car models with their respective mpg, number of cylinders, displacement, horsepower, etc.

We can show the info about the data:

head(diamonds)
head(mtcars)

We can also do a comparison between qplot vs ggplot – both give the same output:

# qplot bar chart (clarity is categorical, so this is a bar chart rather than a histogram)

qplot(clarity, data=diamonds, fill=cut, geom="bar")

# ggplot bar chart

ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

Here we use a continuous color scale and then a discrete one (by converting cyl to a factor):

head(mtcars)
qplot(wt, mpg, data=mtcars, colour=cyl)
# cyl is stored as a number, so it has no levels and this returns NULL
levels(mtcars$cyl)
qplot(wt, mpg, data=mtcars, colour=factor(cyl))
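The difference between the two scales comes from the type of the column. A short sketch that inspects cyl before and after the conversion (cyl_factor is just a name chosen for this example):

```r
library(ggplot2)

class(mtcars$cyl)    # numeric, so ggplot2 picks a continuous color gradient
levels(mtcars$cyl)   # NULL, a numeric vector has no levels

cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)   # "4" "6" "8", one discrete color per level

p_continuous <- ggplot(mtcars, aes(wt, mpg, colour = cyl)) + geom_point()
p_discrete   <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point()

print(p_continuous)
print(p_discrete)
```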

We can also use other aesthetic mappings, such as shape and size:

qplot(wt, mpg, data=mtcars, shape=factor(cyl))
qplot(wt, mpg, data=mtcars, size=qsec)

We now combine mappings (hint: hollow points, geom-concept, legend combination)

qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb))
qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb), shape=I(1))
qplot(wt, mpg, data=mtcars, size=qsec, shape=factor(cyl), geom="point")
qplot(wt, mpg, data=mtcars, size=factor(cyl), geom="point")

We now make use of the bar-plot:

qplot(factor(cyl), data=mtcars, geom="bar")

We can flip the bar-plot by 90 degrees:

qplot(factor(cyl), data=mtcars, geom="bar") + coord_flip()

The code below shows the difference between the fill and colour aesthetics for bars (fill colors the inside of the bar, colour only its outline):

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))
qplot(factor(cyl), data=mtcars, geom="bar", colour=factor(cyl))

We can fill by variable also:

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))

We can use the ‘ddply’ function from the ‘plyr’ package to split a data.frame into sub-frames and apply functions to each of them, as below:

library(plyr)

ddply(diamonds, "cut", "nrow")
ddply(diamonds, c("cut", "clarity"), "nrow")
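# mean() cannot be applied to a whole data frame, so we take the mean of every numeric column per cut
ddply(diamonds, "cut", numcolwise(mean))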
ddply(diamonds, "cut", summarise, meanDepth = mean(depth))
ddply(diamonds, "cut", summarise, lower = quantile(depth, 0.25, na.rm=TRUE),
                                  median = median(depth, na.rm=TRUE),
                                  upper = quantile(depth, 0.75, na.rm=TRUE))
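A summary table like the last one can be passed straight back to ggplot2. The sketch below assumes the plyr package is installed and draws the median depth per cut with the quartiles as a range; geom_pointrange() is my choice for this, other geoms would work too.

```r
library(ggplot2)
library(plyr)

# One row per cut with the lower quartile, median and upper quartile of depth.
depth_summary <- ddply(diamonds, "cut", summarise,
                       lower  = quantile(depth, 0.25, na.rm = TRUE),
                       median = median(depth, na.rm = TRUE),
                       upper  = quantile(depth, 0.75, na.rm = TRUE))
print(depth_summary)

# Point at the median, vertical line from lower to upper quartile.
p_depth <- ggplot(depth_summary, aes(x = cut, y = median, ymin = lower, ymax = upper)) +
  geom_pointrange()
print(p_depth)
```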

Now we look at histograms (geom = "histogram") and how they change with different bin widths:

qplot(carat, data = diamonds, geom = "histogram")
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1)
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01)
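The effect of binwidth can also be checked numerically. In this sketch the histograms are written in ggplot() syntax, and layer_data() is used to count how many bins each one ends up with.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat))

p_coarse <- base + geom_histogram(binwidth = 0.1)
p_fine   <- base + geom_histogram(binwidth = 0.01)

# A smaller binwidth means more (and narrower) bins over the same range.
n_coarse <- nrow(layer_data(p_coarse))
n_fine   <- nrow(layer_data(p_fine))
c(coarse = n_coarse, fine = n_fine)

print(p_coarse)
print(p_fine)
```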

We can combine several geoms in one plot; the order in the geom vector sets the order of the layers:

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"))
qplot(wt, mpg, data = mtcars, geom = c("smooth", "point"))
qplot(wt, mpg, data = mtcars, color = factor(cyl), geom = c("point", "smooth"))

We can remove the standard-error band around the smoothed line with the following code:

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), se = FALSE)

We can make the line more or less wiggly (span: 0-1)

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), span = 0.6)

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), span = 1)

Now by using linear modelling:

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), method = "lm")
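In ggplot() syntax the smoothing options belong to geom_smooth(), which keeps them away from the point layer. A minimal sketch, assuming an untouched copy of mtcars; the lm() call at the end fits the same straight line directly so that the slope can be read off.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Local (loess) smoother without the standard-error band.
p_loess <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.6, se = FALSE)

# Straight line from a linear model, with the band kept.
p_lm <- base + geom_smooth(method = "lm", formula = y ~ x)

print(p_loess)
print(p_lm)

# The line drawn by method = "lm" corresponds to this model.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```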

We can save a plot in a variable (hint: the data is stored inside the plot object, so later changes to the data frame do not change the plot's data):

p.tmp = qplot(factor(cyl), wt, data = mtcars, geom = "boxplot")
p.tmp

We now keep a copy of the original mtcars in a temporary variable:

t.mtcars = mtcars
head(mtcars)
# change mtcars
mtcars = transform(mtcars, wt = wt^2)
# draw plot without/with update of plot data
p.tmp
p.tmp %+% mtcars
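# the line above gives the same plot as the one below, which uses the transformed mtcars
qplot(factor(cyl), wt, data = mtcars, geom = "boxplot")
# restore the original mtcars so that later examples use the untransformed weights
mtcars = t.mtcars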
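The same behaviour can be verified without touching the global mtcars. This sketch works on a copy (cars is a name made up for the example) and compares the data stored in the plot before and after the update with %+%.

```r
library(ggplot2)

cars <- mtcars
p <- ggplot(cars, aes(x = factor(cyl), y = wt)) + geom_boxplot()

# Changing the data frame afterwards does not touch the copy inside the plot.
cars$wt <- cars$wt^2
identical(p$data$wt, mtcars$wt)            # the plot still holds the original weights

# %+% swaps in the new data and returns a new plot object.
p_updated <- p %+% cars
identical(p_updated$data$wt, mtcars$wt^2)  # the updated plot holds the squared weights
```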

Now to get information about plot:

summary(p.tmp)

We now save the plot object (with its data included):

save(p.tmp, file = "temp.rData")
# save an image of the plot on disk (hint: the svglite package must be installed for svg output)
library(svglite)

# without a plot argument, ggsave() saves the last plot that was displayed
ggsave(filename = "test.pdf")
ggsave(filename = "test.jpeg", dpi = 72)
ggsave(filename = "test.svg", plot = p.tmp, width = 10, height = 5)
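When saving from a script, it is safer to pass the plot explicitly and to check that the file was really written. A small sketch that writes a PDF into the session's temporary directory (the file name boxplot.pdf is arbitrary):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) + geom_boxplot()

# Write into the temporary directory so the working directory stays clean.
out_file <- file.path(tempdir(), "boxplot.pdf")
ggsave(filename = out_file, plot = p, width = 10, height = 5)  # size in inches by default

file.exists(out_file)
```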

We can add layers with shortcut functions of the form geom_XXX(mapping, data, stat, position, ...):

p.tmp + geom_point()

# using ggplot-syntax with qplot (hint: qplot creates layers automatically)

qplot(mpg, wt, data = mtcars, color = factor(cyl), geom = "point") + geom_line()
qplot(mpg, wt, data = mtcars, color = factor(cyl), geom = c("point","line"))

We can add an additional layer with different mapping

p.tmp + geom_point()
p.tmp + geom_point() + geom_point(aes(y=disp))

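We can set an aesthetic to a fixed value instead of mapping it to data:

# set: every point is drawn in dark blue
p.tmp + geom_point(color = "darkblue")
# mapped: "darkblue" is treated as a data value, so the points get the default palette color and a legend
p.tmp + geom_point(aes(color = "darkblue"))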
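The difference between setting and mapping shows up in the data that ggplot2 computes for the layer. The sketch below builds both variants on a fresh plot and reads the resulting point colors with layer_data().

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(cyl), y = wt))

p_set    <- base + geom_point(colour = "darkblue")       # fixed value
p_mapped <- base + geom_point(aes(colour = "darkblue"))  # treated as a data value

# Set: the color is used as given.
unique(layer_data(p_set)$colour)

# Mapped: the string goes through the color scale and comes out as a palette color.
unique(layer_data(p_mapped)$colour)
```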

We now show how to deal with overplotting (hollow points, pixel-sized points, alpha between 0 and 1):

t.df = data.frame(x = rnorm(2000), y = rnorm(2000))
p.norm = ggplot(t.df, aes(x,y))
p.norm + geom_point()

p.norm + geom_point(shape = 1)

p.norm + geom_point(shape = ".")

p.norm + geom_point(colour = alpha("black", 1/2))

p.norm + geom_point(colour = alpha("blue", 1/10))
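As far as I can tell, transparency can also be given directly through the alpha argument of geom_point() instead of wrapping the color in alpha(). The sketch below fixes the random seed (42 is an arbitrary choice) so that the picture is reproducible.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(2000), y = rnorm(2000))

# Each point is 10% opaque, so dense regions appear darker.
p_alpha <- ggplot(df, aes(x, y)) +
  geom_point(colour = "blue", alpha = 1/10)

print(p_alpha)
```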

Sunday, 8 January 2017
