The data we collect is only as good as our ability to
understand it and communicate it to others, which is why choosing the right
visualization is essential. If our data is misrepresented or presented
ineffectively, key insights are lost and the message fails in its
overall purpose.
This is my first ggplot guide. It covers the most common
charts and visualizations and helps choose the right presentation for the data.
Information can be visualized in a number of ways, each of
which can provide a specific insight. When we start to work with data, it’s
important to identify and understand the story we are trying to tell and the
relationship we are looking to show. Knowing this information will help us
select the proper visualization to best deliver our message.
When analyzing data, search for patterns or interesting
insights that can be a good starting place for finding our story, such as
trends, correlations, and outliers.
BAR CHARTS
Bar charts are very versatile. They are best used to show
change over time, compare different categories, or compare parts of a whole.
VERTICAL (COLUMN CHART)
- Best used for chronological data (time-series
should always run left to right), or when visualizing negative values below the
x-axis.
HORIZONTAL
- Best used for data with long category labels.
STACKED
- Best used when there is a need to compare
multiple part-to-whole relationships. These can use discrete or continuous
data, oriented either vertically or horizontally.
100% STACKED
- Best used when the total value of each category
is unimportant and percentage distribution of subcategories is the primary
message.
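As a minimal sketch of the stacked and 100% stacked variants, here is how they might look in ggplot2 (the package introduced later in this post), using the mtcars data that ships with R. The object names p_stacked, p_fill and p_horizontal are my own and only for illustration.

```r
library(ggplot2)

# Stacked bars: bar height is the number of cars per cylinder count,
# and each bar is split by the number of gears.
p_stacked <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "stack")

# 100% stacked bars: every bar is rescaled to height 1, so only the
# share of each gear group within a bar is compared.
p_fill <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill") +
  ylab("proportion")

# Horizontal variant, handy when category labels are long.
p_horizontal <- p_stacked + coord_flip()

print(p_stacked)
print(p_fill)
print(p_horizontal)
```

The only difference between the first two plots is the position argument, which is why the totals disappear from the second one.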
PIE CHARTS
Pie charts are best used for making part-to-whole
comparisons with discrete or continuous data. They are most impactful with a
small data set.
STANDARD
- Used to show part-to-whole relationships.
DONUT
- Stylistic variation that enables the inclusion
of a total value or design element in the center.
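ggplot2 has no dedicated pie geom, so a common workaround is to draw a single stacked bar and bend it with polar coordinates. The sketch below assumes that approach; the names counts, p_pie and p_donut are hypothetical, and the x = 2 with xlim() trick for the donut hole is just one way to do it.

```r
library(ggplot2)

# Count cars per cylinder class with base R.
counts <- as.data.frame(table(cyl = mtcars$cyl))

# Standard pie: one stacked bar wrapped around a circle.
p_pie <- ggplot(counts, aes(x = "", y = Freq, fill = cyl)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")

# Donut: place the bar at x = 2 and leave empty space toward the
# center, which becomes the hole after the polar transformation.
p_donut <- ggplot(counts, aes(x = 2, y = Freq, fill = cyl)) +
  geom_col(width = 1) +
  xlim(0.5, 2.5) +
  coord_polar(theta = "y")

print(p_pie)
print(p_donut)
```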
LINE CHARTS
Line charts are used to show time-series relationships with
continuous data. They help show trend, acceleration, deceleration, and
volatility.
AREA CHARTS
Area charts also depict a time-series relationship, but unlike
line charts they can represent volume.
AREA CHART
- Best used to show or compare a quantitative
progression over time.
STACKED AREA
- Best used to visualize part-to-whole
relationships, helping show how each category contributes to the cumulative
total.
100% STACKED AREA
- Best used to show distribution of categories as
part of a whole, where the cumulative total is unimportant.
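A rough sketch of the stacked and 100% stacked area variants in ggplot2, using the Orange data frame that ships with R (every tree is measured at the same ages, which stacking needs). The names orange, p_area and p_area_fill are my own.

```r
library(ggplot2)

# Orange carries extra grouping classes, so convert it to a plain data frame.
orange <- as.data.frame(Orange)

# Stacked area: the top edge is the cumulative circumference of all trees.
p_area <- ggplot(orange, aes(age, circumference, fill = Tree)) +
  geom_area(position = "stack")

# 100% stacked area: only each tree's share of the total is shown.
p_area_fill <- ggplot(orange, aes(age, circumference, fill = Tree)) +
  geom_area(position = "fill") +
  ylab("share of total circumference")

print(p_area)
print(p_area_fill)
```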
SCATTER PLOT
Scatter plots show the relationship between two variables
measured on the same items. They are best used to show correlation in a large amount
of data.
BUBBLE CHART
Bubble charts are good for displaying nominal comparisons or
ranking relationships.
BUBBLE PLOT
- This is a scatter plot with bubbles, best used
to display an additional variable.
BUBBLE MAP
- Best used for visualizing values for specific
geographic regions.
HEAT MAP
Heat maps display categorical data, using intensity of color
to represent values of geographic areas or data tables.
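For the data-table case, a heat map can be sketched in ggplot2 with geom_tile(). The example below is only an illustration: it colors a small correlation table built from four mtcars columns that I picked arbitrarily, and the names corr and p_heat are hypothetical.

```r
library(ggplot2)

# Correlation matrix of a few numeric columns, reshaped to long form:
# one row per cell of the table.
corr <- as.data.frame(as.table(cor(mtcars[, c("mpg", "wt", "hp", "disp")])))
names(corr) <- c("var1", "var2", "r")

# One tile per cell; fill intensity encodes the correlation value.
p_heat <- ggplot(corr, aes(var1, var2, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1))

print(p_heat)
```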
INTRO TO GGPLOT2
The grammar of graphics is a tool that enables us to
concisely describe the components of a graphic. Such a grammar allows us to
move beyond named graphics (e.g., the “scatterplot”) and gain insight into the
deep structure that underlies statistical graphics. This blog post is my introduction
to ggplot2, a visualization package in R. It assumes a very basic knowledge of
R, like vectors, data frames and reading csv files. ggplot2 is an R package for
producing statistical, or data, graphics, but it is unlike most other graphics
packages because it has a deep underlying grammar.
GGPLOT2 INSTALLATION
One of R’s greatest strengths is its excellent set of packages.
To install a package, we can use the install.packages() function.
To install the ggplot2 package, we write the following:
install.packages("ggplot2")
To load a package into our current R session, we use
library() like below:
library(ggplot2)
Scatter plots with qplot():
We now create a scatterplot in ggplot2. We’ll use the “iris”
data frame that’s automatically loaded into R.
We can use the head function to look at the first few rows
of the data frame:
head(iris)
The data frame contains three species of iris:
setosa, versicolor, and virginica. Let’s plot
Sepal.Length against Petal.Length using ggplot2’s qplot() function:
qplot(Sepal.Length, Petal.Length, data = iris)
# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.
# * First argument `Sepal.Length` goes on the x-axis.
# * Second argument `Petal.Length` goes on the y-axis.
# * `data = iris` means to look for this data in the `iris` data frame.
To see where each species is located in this graph, we can
color each point by adding a color = Species argument.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
Similarly, we can let the size of each point denote petal
width, by adding a size = Petal.Width argument.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
# We see that Iris setosa flowers have the narrowest petals.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width, alpha = I(0.7))
# By setting the alpha of each point to 0.7, we reduce the effects of overplotting.
Finally, let’s fix the axis labels and add a title to the
plot.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, xlab = "Sepal Length", ylab = "Petal Length", main = "Sepal vs. Petal Length in Iris data")
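For comparison, the same kind of plot can be written with the full ggplot() syntax. This is a sketch rather than the one true translation; as far as I know, newer ggplot2 releases mark qplot() as deprecated, so the layered form is worth knowing. The name p_iris is my own.

```r
library(ggplot2)

p_iris <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  # Species and Petal.Width are mapped inside aes();
  # alpha is a fixed setting, so it sits outside aes().
  geom_point(aes(color = Species, size = Petal.Width), alpha = 0.7) +
  labs(x = "Sepal Length", y = "Petal Length",
       title = "Sepal vs. Petal Length in Iris data")

print(p_iris)
```

Writing alpha outside aes() plays the same role as wrapping it in I() in the qplot() call.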
Other common geoms:
In the scatterplot examples above, we implicitly used a
point geom, the default when you supply two arguments to qplot().
# These two commands are equivalent and give the same output.
qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)
But we can also easily use other types of geoms to create
more kinds of plots.
Bar charts: geom = “bar”
movies = data.frame(director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
minutes = c(124, 163, 195, 600, 187))
# Plot the number of movies each director has.
qplot(director, data = movies, geom = "bar", ylab = "# movies")
# By default, the height of each bar is simply a count.
# But we can also supply a different weight.
# Here the height of each bar is the total running time of the director's movies.
qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "Total Length")
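To make the weighted bar heights concrete, we can compute the totals by hand with base R and draw them directly. This is only a sketch: the data frame repeats the toy values from above (without the movie titles), and the names totals and p_total are my own.

```r
library(ggplot2)

movies <- data.frame(
  director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
  minutes  = c(124, 163, 195, 600, 187)
)

# Sum the running time per director; these totals are the bar
# heights that the weighted bar chart draws.
totals <- aggregate(minutes ~ director, data = movies, FUN = sum)
print(totals)

# Same chart from the pre-summed data: geom_col() uses y as given.
p_total <- ggplot(totals, aes(director, minutes)) +
  geom_col() +
  ylab("Total Length")
print(p_total)
```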
Line charts: geom = “line”
qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)
# Using a line geom doesn't really make sense here.
`Orange` is another built‐in data frame that describes the
growth of orange trees.
qplot(age, circumference, data = Orange, geom = "line", color = Tree, main = "How does tree circumference vary with age")
# We can also plot both points and lines.
qplot(age, circumference, data = Orange, geom = c("point", "line"), color = Tree, main = "How does tree circumference vary with age")
`diamonds` is a data frame that ships with ggplot2 and describes
diamonds by their cut, clarity, carat, color, price,
etc. `mtcars` is built into R and describes different car models with their
respective mpg, number of cylinders, displacement, horsepower, etc.
We can look at the first few rows of each:
head(diamonds)
head(mtcars)
We can also compare qplot and ggplot; both give
the same output:
# qplot bar chart
qplot(clarity, data=diamonds, fill=cut, geom="bar")
# ggplot bar chart
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
Here we color by cyl, first on a continuous scale and then on a discrete
scale (by converting it to a factor):
head(mtcars)
qplot(wt, mpg, data=mtcars, colour=cyl)
# cyl is numeric, so it has no factor levels and this returns NULL
levels(mtcars$cyl)
qplot(wt, mpg, data=mtcars, colour=factor(cyl))
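The difference between the two scales comes down to the type of the column. A small sketch, with the hypothetical names p_cont and p_disc, shows what ggplot2 sees in each case:

```r
library(ggplot2)

# cyl is stored as a plain numeric vector, so it has no factor levels
# and ggplot2 treats it as continuous (a color gradient).
print(is.numeric(mtcars$cyl))
print(levels(mtcars$cyl))

# After conversion it has one level per distinct value,
# so each cylinder count gets its own color.
print(levels(factor(mtcars$cyl)))

p_cont <- ggplot(mtcars, aes(wt, mpg, colour = cyl)) + geom_point()
p_disc <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point()

print(p_cont)
print(p_disc)
```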
By using different aesthetic mappings:
qplot(wt, mpg, data=mtcars, shape=factor(cyl))
qplot(wt, mpg, data=mtcars, size=qsec)
We now combine mappings (hint: hollow points, geom-concept,
legend combination)
qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb))
qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb), shape=I(1))
qplot(wt, mpg, data=mtcars, size=qsec, shape=factor(cyl), geom="point")
qplot(wt, mpg, data=mtcars, size=factor(cyl), geom="point")
We now make use of the bar plot:
qplot(factor(cyl), data=mtcars, geom="bar")
We can flip the bar-plot by 90 degrees:
qplot(factor(cyl), data=mtcars, geom="bar") + coord_flip()
The code below shows the difference between the fill and colour
aesthetics for bars:
qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))
qplot(factor(cyl), data=mtcars, geom="bar", colour=factor(cyl))
qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))
We can use the ddply() function from the plyr package to split a
data frame into sub-frames and apply a function to each, as below:
library(plyr)
ddply(diamonds, "cut", "nrow")
ddply(diamonds, c("cut", "clarity"), "nrow")
# mean() does not work on a whole data frame, so average each numeric column
ddply(diamonds, "cut", numcolwise(mean))
ddply(diamonds, "cut", summarise, meanDepth = mean(depth))
ddply(diamonds, "cut", summarise, lower = quantile(depth, 0.25, na.rm=TRUE),
median = median(depth, na.rm=TRUE),
upper = quantile(depth, 0.75, na.rm=TRUE))
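The same split-apply-combine idea can be checked with base R alone. The sketch below is not the plyr way, just an assumed equivalent for comparison; the names n_by_cut, mean_depth and q_depth are my own.

```r
library(ggplot2)  # only needed here for the diamonds data

# Rows per cut (the same counts as the "nrow" call above).
n_by_cut <- table(diamonds$cut)
print(n_by_cut)

# Mean depth per cut.
mean_depth <- aggregate(depth ~ cut, data = diamonds, FUN = mean)
print(mean_depth)

# Lower quartile, median and upper quartile of depth per cut.
q_depth <- tapply(diamonds$depth, diamonds$cut, quantile,
                  probs = c(0.25, 0.5, 0.75))
print(q_depth)
```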
Now we create histograms with geom = "histogram",
trying different bin widths:
qplot(carat, data = diamonds, geom = "histogram")
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1)
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01)
We can remove the confidence band (based on the standard error) around
the smooth line with the following code:
qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), se = FALSE)
We can make the line more or less wiggly (span: 0-1)
qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), span = 0.6)
qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), span = 1)
Now we fit a straight line with a linear model:
qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), method = "lm")
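In the layered syntax, the smoother is its own layer, which keeps its options away from the point layer. A sketch under that assumption follows; p_base, p_loess and p_lm are hypothetical names, and the explicit formula argument is only there to avoid the informational message.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Loess smoother without the confidence band; a smaller span
# gives a wigglier line.
p_loess <- p_base +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.6, se = FALSE)

# Straight-line fit from a linear model, with the default band.
p_lm <- p_base + geom_smooth(method = "lm", formula = y ~ x)

# The line drawn by method = "lm" corresponds to this model.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

print(p_loess)
print(p_lm)
```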
We can save a plot in a variable (hint: the data is stored in the plot,
so later changes to the data do not change the plot's data):
p.tmp = qplot(factor(cyl), wt, data = mtcars, geom = "boxplot")
p.tmp
We now keep a copy of mtcars in a temporary variable:
t.mtcars = mtcars
head(mtcars)
# change mtcars
mtcars = transform(mtcars, wt = wt^2)
# draw plot without/with update of plot data
p.tmp
p.tmp %+% mtcars
# the line above gives the same plot as the call below, now that mtcars is transformed
qplot(factor(cyl), wt, data = mtcars, geom = "boxplot")
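To see that a plot object really keeps its own data, we can compare the medians it draws before and after the data frame changes. This is a small sketch that works on a copy so that mtcars itself stays untouched; cars_copy, p_box and the other names are my own.

```r
library(ggplot2)

cars_copy <- mtcars   # hypothetical working copy
p_box <- ggplot(cars_copy, aes(factor(cyl), wt)) + geom_boxplot()

# Change the data frame after the plot object has been created.
cars_copy$wt <- cars_copy$wt^2

# Medians drawn by the stored plot: still computed from the original weights.
old_medians <- layer_data(p_box)$middle

# Rebuilding the plot from the changed data frame picks up the new values.
p_box_new <- ggplot(cars_copy, aes(factor(cyl), wt)) + geom_boxplot()
new_medians <- layer_data(p_box_new)$middle

print(old_medians)
print(new_medians)
```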
Now to get information about plot:
summary(p.tmp)
We now save the plot (with its data included):
save(p.tmp, file = "temp.rData")
# save an image of the plot on disk (hint: the svglite package must be installed for svg output)
library(svglite)
# with no plot argument, ggsave() saves the last plot displayed
ggsave(filename = "test.pdf")
ggsave(filename = "test.jpeg", dpi = 72)
ggsave(filename = "test.svg", plot = p.tmp, width = 10, height = 5)
Layers can be added with shortcut functions of the form
geom_XXX(mapping, data, ..., stat, position):
p.tmp + geom_point()
# using ggplot-syntax with qplot (hint: qplot creates layers automatically)
qplot(mpg, wt, data = mtcars, color = factor(cyl), geom = "point") + geom_line()
qplot(mpg, wt, data = mtcars, color = factor(cyl), geom = c("point","line"))
We can add an additional layer with different mapping
p.tmp + geom_point()
p.tmp + geom_point() + geom_point(aes(y=disp))
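We can set an aesthetic to a fixed value instead of mapping it to data:
# setting: every point is drawn in dark blue
p.tmp + geom_point(color = "darkblue")
# mapping: "darkblue" is treated as a data value, so the points get a default color and a legend
p.tmp + geom_point(aes(color = "darkblue"))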
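The colors that actually reach the points can be inspected with layer_data(). The sketch below uses a plain scatter plot rather than p.tmp; the names p_scatter, p_set and p_map are hypothetical.

```r
library(ggplot2)

p_scatter <- ggplot(mtcars, aes(wt, mpg))

# Setting: color is a fixed parameter outside aes().
p_set <- p_scatter + geom_point(color = "darkblue")

# Mapping: inside aes() the string becomes a one-level variable,
# and the color scale chooses the actual color.
p_map <- p_scatter + geom_point(aes(color = "darkblue"))

# Colors computed for the points in each plot.
print(unique(layer_data(p_set)$colour))
print(unique(layer_data(p_map)$colour))
```

The first call prints "darkblue"; the second prints whatever color the default discrete palette assigns to its first level.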
We now show how to deal with overplotting (hollow points, pixel points,
alpha between 0 and 1):
t.df = data.frame(x = rnorm(2000), y = rnorm(2000))
p.norm = ggplot(t.df, aes(x,y))
p.norm + geom_point()
p.norm + geom_point(shape = 1)
p.norm + geom_point(shape = ".")
p.norm + geom_point(colour = alpha("black", 1/2))
p.norm + geom_point(colour = alpha("blue", 1/10))
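Two further options, sketched here as assumptions rather than as part of the original recipe: alpha can be passed directly as a parameter of the geom, and the points can be replaced by binned counts with geom_bin2d(). The seed, the bin count and the object names are arbitrary choices of mine.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
p_cloud <- ggplot(df, aes(x, y))

# Transparency given directly as a geom parameter.
p_alpha <- p_cloud + geom_point(alpha = 1/10)

# Replace individual points by rectangular bins colored by count.
p_bins <- p_cloud + geom_bin2d(bins = 30)

print(p_alpha)
print(p_bins)
```

Binning scales better than transparency once the number of points gets very large, because the number of drawn objects no longer grows with the data.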















































