Monday, 30 January 2017

Session6-Plotting graphs with seaborn_31Jan

Different cubehelix palettes:

!pip install seaborn

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
sns.set(style="dark")
rs = np.random.RandomState(50)

# Set up the matplotlib figure
f, axes = plt.subplots(3, 3, figsize=(9, 9), sharex=True, sharey=True)
# Rotate the starting point around the cubehelix hue circle
for ax, s in zip(axes.flat, np.linspace(0, 3, 10)):

    # Create a cubehelix colormap to use with kdeplot
    cmap = sns.cubehelix_palette(start=s, light=1, as_cmap=True)

    # Generate and plot a random bivariate dataset
    x, y = rs.randn(2, 50)
    # seaborn >= 0.11 takes x and y as keywords and uses fill= in place of shade=
    sns.kdeplot(x=x, y=y, cmap=cmap, fill=True, cut=5, ax=ax)
    ax.set(xlim=(-3, 3), ylim=(-3, 3))

f.tight_layout()

Discovering structure in heatmap data:

!pip install seaborn

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font="monospace")

%matplotlib inline
# Load the brain networks example dataset
df = sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)

# Select a subset of the networks
used_networks = [1, 5, 6, 7, 8, 11, 12, 13, 16, 17]
used_columns = (df.columns.get_level_values("network")
                          .astype(int)
                          .isin(used_networks))
df = df.loc[:, used_columns]

# Create a custom palette to identify the networks
network_pal = sns.cubehelix_palette(len(used_networks),
                                    light=.9, dark=.1, reverse=True,
                                    start=1, rot=-2)
network_lut = dict(zip(map(str, used_networks), network_pal))

# Convert the palette to vectors that will be drawn on the side of the matrix
networks = df.columns.get_level_values("network")
network_colors = pd.Series(networks, index=df.columns).map(network_lut)

# Create a custom colormap for the heatmap values
cmap = sns.diverging_palette(h_neg=210, h_pos=350, s=90, l=30, as_cmap=True)

# Draw the full plot
sns.clustermap(df.corr(), row_colors=network_colors, linewidths=.5,
               col_colors=network_colors, figsize=(13, 13), cmap=cmap)

Multiple linear regression:

import seaborn as sns
sns.set(style="ticks", context="talk")

# Load the example tips dataset
tips = sns.load_dataset("tips")

# Make a custom sequential palette using the cubehelix system
pal = sns.cubehelix_palette(4, 1.5, .75, light=.6, dark=.2)

# Plot tip as a function of total bill across days
g = sns.lmplot(x="total_bill", y="tip", hue="day", data=tips,
               palette=pal, height=7)  # "height" replaced "size" in seaborn 0.9

# Use more informative axis labels than are provided by default
g.set_axis_labels("Total bill ($)", "Tip ($)")

Scatterplot Matrix:

import seaborn as sns
sns.set()

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

Scatterplot with categorical variables:

import pandas as pd
import seaborn as sns
sns.set(style="whitegrid", palette="muted")

# Load the example iris dataset
iris = sns.load_dataset("iris")

# "Melt" the dataset to "long-form" or "tidy" representation
iris = pd.melt(iris, "species", var_name="measurement")

# Draw a categorical scatterplot to show each observation
sns.swarmplot(x="measurement", y="value", hue="species", data=iris)
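
To make the "melt" step above more concrete, here is a minimal pure-Python sketch of what a wide-to-long reshape does. The helper name melt and the two sample rows are made up for illustration (they are not the real iris values), and the row order may differ from what pd.melt returns.

```python
# Pure-Python sketch of wide-to-long reshaping ("melting").
# The rows below are made-up illustration values, not the real iris data.
wide_rows = [
    {"species": "setosa", "sepal_length": 5.1, "petal_length": 1.4},
    {"species": "virginica", "sepal_length": 6.3, "petal_length": 6.0},
]


def melt(rows, id_var, var_name="measurement", value_name="value"):
    """Return one long-form record per (row, non-id column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_var:
                continue  # the id column is repeated on every record instead
            long_rows.append(
                {id_var: row[id_var], var_name: column, value_name: value}
            )
    return long_rows


long_rows = melt(wide_rows, "species")
for record in long_rows:
    print(record)
```

Each wide row with two measurements becomes two long records, which is the shape that the x="measurement", y="value" arguments of the swarmplot call expect.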

Grouped Boxplots:

import seaborn as sns
sns.set(style="ticks")

# Load the example tips dataset
tips = sns.load_dataset("tips")

# Draw a nested boxplot to show bills by day and sex
sns.boxplot(x="day", y="total_bill", hue="sex", data=tips, palette="PRGn")
sns.despine(offset=10, trim=True)

Tuesday, 24 January 2017

Session5-Plotting Cities on Map with data_25Jan

The number of registered vehicles in different years is recorded in this Google Docs spreadsheet, and we use this data to draw a Google GeoChart.

We need to do a little tweaking to the html code, which executes our map display function.

We need Google Maps API credentials, that is, an API key, which can be obtained from the link below:
https://developers.google.com/maps/documentation/javascript/get-api-key

Then we add this key to the second line of the HTML code that displays the map.

Monday, 23 January 2017

Session4-Web hosting with charts_16Jan

Scatterplot Matrix

HTML Iframes


The scatterplot matrix visualizes pairwise correlations for multi-dimensional data; each cell in the matrix is a scatterplot.

Wednesday, 18 January 2017

Session3-My First Google Charts-part4_12Jan

Creating a Dashboard - Part II

Here we take the same data used in the previous post and convert it into a dashboard. We have replaced the chart with a chart wrapper, added three filters, and added a dashboard component that binds the three filters to the chart wrapper. You can see the result both in this blog and on this regular HTML page.

The Dashboard on this blog

Data Source

One can choose one or multiple states. Also one can specify the range of cow and buffalo milk production and so select only the states that have this production.

Tuesday, 17 January 2017

Session3-My First Google Charts-part3_12Jan

Creating a Dashboard - Part 1

When you have a lot of data to show on a page, it makes sense to let the viewer filter some of it to get a cleaner view. In this case, we first draw a rather clumsy Column Chart and then, in the next section, convert it into a dashboard with filters. The data for the chart is drawn from this spreadsheet. The chart shown below can also be seen in this regular HTML page.

Basic Column Chart Showing All Data

Data Source

Link for Data Source

Note how we have specified:

the Google Docs spreadsheet : https://docs.google.com/spreadsheets/d/1k0xYnDU78GYGuMivC-JTD3AakxyftIS4-Z_ptboPsQs/edit?usp=sharing
the sheet : sheet=MilkProduction
range : range=B2:H37
headers : headers=1
columns : query.setQuery('select B,E,F');
chart type : var chartMQ = new google.visualization.ColumnChart(document.getElementById('chart_divMQ'));

Session3-My First Google Charts-part2_12Jan

Specifying Range of Data and Selecting Columns: Google Charts with Google Docs data

In this example, the Google Docs spreadsheet has four sheets. For the purpose of drawing our chart, we would like to specify the following:

1. Data to be picked up from sheet named "Demo3"
2. Within this sheet, from the range C3:I23
3. Within this range from the columns C, D, G, H
4. Given the nature of the data we would like to multiply column G by 1000 before it is plotted

If you look at the spreadsheet in a browser, the URL will show up as:

https://docs.google.com/spreadsheets/d/1RZ-KVY4h6ZqUY743-R2qtsdJZ1DLDe_YauqX7CDfTYY/edit?usp=sharing
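
As a rough sketch of how the four requirements above could be turned into a data-source URL, the Python snippet below builds a query string from the spreadsheet key in that URL. It assumes the /gviz/tq endpoint and the sheet, range, headers and tq parameters of the Google Visualization API; the helper name gviz_url and the value headers=1 are assumptions made for illustration, and the query text select C, D, G*1000, H is simply my reading of points 3 and 4.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def gviz_url(spreadsheet_key, sheet, cell_range, query, headers=1):
    """Build a Google Visualization data-source URL (hypothetical helper)."""
    params = {
        "sheet": sheet,        # which sheet to read
        "range": cell_range,   # which cells within the sheet
        "headers": headers,    # number of header rows (assumed to be 1 here)
        "tq": query,           # query selecting and transforming columns
    }
    base = "https://docs.google.com/spreadsheets/d/%s/gviz/tq" % spreadsheet_key
    return base + "?" + urlencode(params)


url = gviz_url(
    "1RZ-KVY4h6ZqUY743-R2qtsdJZ1DLDe_YauqX7CDfTYY",
    sheet="Demo3",
    cell_range="C3:I23",
    query="select C, D, G*1000, H",
)
print(url)

# Decode the URL again to check that every parameter survived the encoding.
decoded = parse_qs(urlparse(url).query)
print(decoded)
```

Building the URL with urlencode rather than by hand keeps the spaces, colon and asterisk in the parameters correctly escaped.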

The chart will look as follows:

Sheet 2 Chart - Sheet, Range, Cols

Wednesday, 11 January 2017

Session3-My First Google Charts_part1_12Jan


The data used to create the chart below lies in this Google Docs spreadsheet, and I want to display it in this blog post or in any other HTML page using Google Charts.

The first task is to publish this data on the web and make it visible to anyone who has the URL.

Data from a Spreadsheet

7 Data Visualisation techniques in R_11Jan17-session-2

Continuing from my previous blog post, this post looks in more detail at the visualization tools for the charts discussed there. With the ever-increasing volume of data, it is impossible to tell stories without visualizations. Data visualization is the art of turning numbers into useful knowledge.

R Programming lets us learn this art by offering a set of inbuilt functions and libraries to build visualizations and present data. Firstly, we see how to select the right chart type.

There are four basic presentation types:

Comparison
Composition
Distribution
Relationship

To determine which amongst these is best suited for your data, I suggest you should answer a few questions like:

- How many variables do you want to show in a single chart?

- How many data points will you display for each variable?

- Will you display values over a period of time, or among items or groups?

Below is a great explanation on selecting a right chart type by Dr. Andrew Abela.

In our day-to-day work, the 7 charts listed below will be used most of the time.

Scatter Plot
Histogram
Bar & Stack Bar Chart
Box Plot
Area Chart
Heat Map
Correlogram

We’ll use ‘Big Mart data’ example as shown below to understand how to create visualizations in R.

You can download the full dataset from here.

library(readr)
Big_Mart_data = read.csv("C:/Users/Rajat/Downloads/Big Mart Dataset - Sheet1.csv")

Now we see how to use these visualizations in R

1. Scatter Plot

When to use: Scatter Plot is used to see the relationship between two continuous variables.

In our above mart dataset, if we want to visualize the items as per their cost data, then we can use scatter plot chart using two continuous variables, namely Item_Visibility & Item_MRP as shown below.

Here is the R code for simple scatter plot:

library(ggplot2)          # ggplot2 is an R library for visualizations.

ggplot(Big_Mart_data, aes(Item_Visibility, Item_MRP)) + geom_point() + 
  scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+ 
  scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+ theme_bw()

We can also show a third variable in the same chart, say a categorical variable (Item_Type), which gives the category of each data point. The different categories of Item_Type are depicted by different colors in the chart below.

R code with an addition of category:


ggplot(Big_Mart_data, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) + 
  scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
  scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
  theme_bw() + labs(title="Scatterplot")

We can even make it more visually clear by creating separate scatter plots for each separate Item_Type as shown below.

R code for separate category wise chart:


ggplot(Big_Mart_data, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) + 
  scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
  scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+ 
  theme_bw() + labs(title="Scatterplot") + facet_wrap( ~ Item_Type)

2. Histogram

When to use: A histogram is used to plot a continuous variable. It breaks the data into bins and shows the frequency distribution of these bins. We can always change the bin size and see the effect it has on the visualization.
From our mart dataset, if we want to know the count of items on the basis of their cost, we can plot a histogram using the continuous variable Item_MRP as shown below.
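
To illustrate what "breaking the data into bins" means, here is a small Python sketch that counts values per bin for two different bin widths. The prices are made-up numbers, not taken from the Big Mart data, and the bin edges here start at 0, which may not match how ggplot2 aligns its bins.

```python
from collections import Counter


def bin_counts(values, binwidth):
    """Count how many values fall into each bin [k*binwidth, (k+1)*binwidth)."""
    counts = Counter(int(v // binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))


# Made-up prices for illustration only.
prices = [31.5, 48.3, 52.0, 87.9, 107.8, 109.1, 141.6, 182.1, 249.8]

narrow = bin_counts(prices, 50)    # keys are the left edge of each bin
wide = bin_counts(prices, 100)

print(narrow)  # {0: 2, 50: 2, 100: 3, 150: 1, 200: 1}
print(wide)    # {0: 4, 100: 4, 200: 1}
```

The same nine values give five bars with a width of 50 but only three with a width of 100, which is why the choice of binwidth changes the shape of the histogram.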

Here is the R code for simple histogram plot using function ggplot() with geom_histogram().

ggplot(Big_Mart_data, aes(Item_MRP)) + geom_histogram(binwidth = 2)+
  scale_x_continuous("Item MRP", breaks = seq(0,270,by = 30))+
  scale_y_continuous("Count", breaks = seq(0,200,by = 20))+
  labs(title = "Histogram")

3. Bar & Stack Bar Chart

When to use: Bar charts are recommended when you want to plot a categorical variable or a combination of continuous and categorical variables.

From our dataset, if we want to know the number of marts established in a particular year, a bar chart is the most suitable option, using the variable Outlet_Establishment_Year as shown below.

Here is the R code for simple bar plot using function ggplot() for a single continuous variable.

ggplot(Big_Mart_data, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red") + theme_bw()+
  scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) + 
  scale_y_continuous("Count", breaks = seq(0,1500,150)) +
  coord_flip()+ labs(title = "Bar Chart") + theme_gray()

Vertical Bar Chart:
As a variation, you can remove coord_flip() to draw the above bar chart vertically. The code below draws another vertical bar chart, showing Item Weight for each Item Type.

ggplot(Big_Mart_data, aes(Item_Type, Item_Weight)) + geom_bar(stat = "identity", fill = "darkblue") + 
  scale_x_discrete("Item Type")+ 
  scale_y_continuous("Item Weight", breaks = seq(0,15000, by = 500))+ 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Bar Chart")

Stacked Bar chart:

A stacked bar chart is an advanced version of the bar chart, used for visualizing a combination of categorical variables.
From our dataset, if we want to know the count of outlets on the basis of two categorical variables, their type (Outlet Type) and location (Outlet Location Type), a stacked bar chart shows both at once.

Here is the R code for simple stacked bar chart using function ggplot().

ggplot(Big_Mart_data, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
  labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")

4. Box Plot

When to use: Box plots are used to plot a combination of categorical and continuous variables. This plot is useful for visualizing the spread of the data and detecting outliers. It shows five summary statistics: the minimum, the 25th percentile, the median, the 75th percentile and the maximum.
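
The five numbers mentioned above can be computed by hand. The Python sketch below does so for a made-up list of sales figures and flags outliers with the common 1.5 × IQR rule; the helper names and the data are illustrative assumptions, and the quartile method used here may differ slightly from the one a plotting library applies.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]


# Made-up sales figures for illustration only.
sales = [120, 150, 180, 200, 210, 230, 250, 900]

summary = five_number_summary(sales)
outliers = iqr_outliers(sales)

print(summary)   # (120, 172.5, 205.0, 235.0, 900)
print(outliers)  # [900]
```

The value 900 is the maximum of the data, yet it lies far above the upper quartile, so a box plot would draw it as a separate outlier point rather than as the end of a whisker.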

From our dataset, if we want to identify each outlet’s detailed item sales including minimum, maximum & median numbers, box plot can be helpful. In addition, it also gives values of outliers of item sales for each outlet as shown in below chart.

Here is the R code for simple box plot using function ggplot() with geom_boxplot:

ggplot(Big_Mart_data, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
  scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=500))+
  labs(title = "Box Plot", x = "Outlet Identifier")

5. Area Chart

When to use: An area chart is used to show continuity across a variable or data set. It is very similar to a line chart and is commonly used for time series plots. Alternatively, it is also used to plot continuous variables and analyze the underlying trends.

From our dataset, when we want to analyze the trend of item outlet sales, area chart can be plotted as shown below. It shows count of outlets on basis of sales.

Here is the R code for simple area chart showing continuity of Item Outlet Sales using function ggplot() with geom_area:

ggplot(Big_Mart_data, aes(Item_Outlet_Sales)) + geom_area(stat = "bin", bins = 30, fill = "steelblue") + 
  scale_x_continuous(breaks = seq(0,11000,1000))+ 
  labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")

6. Heat Map

When to use: Heat Map uses intensity (density) of colors to display relationship between two or three or many variables in a two dimensional image. It allows you to explore two dimensions as the axis and the third dimension by intensity of color.

From our dataset, if we want to know cost of each item on every outlet, we can plot heatmap as shown below using three variables Item MRP, Outlet Identifier & Item Type from our mart dataset.
The dark portion indicates Item MRP is close to 50. The brighter portion indicates Item MRP is close to 250.

Here is the R code for simple heat map using function ggplot().

ggplot(Big_Mart_data, aes(Outlet_Identifier, Item_Type))+
  geom_raster(aes(fill = Item_MRP))+
  labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+
  scale_fill_continuous(name = "Item MRP")

7. Correlogram

When to use: A correlogram is used to test the level of correlation among the variables available in the data set. The cells of the matrix can be shaded or colored to show the correlation value.

The darker the color, the higher the correlation between variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation value.

From our dataset, let’s check the correlation between Item cost, weight and visibility, along with Outlet establishment year and Outlet sales, in the plot below.
In our example, we can see that Item cost and Outlet sales are positively correlated, while Item weight and its visibility are negatively correlated.
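
For readers who want to see what a single cell of a correlogram represents, here is a minimal Python sketch of the Pearson correlation coefficient. The three short lists are made-up values chosen to show one positive and one negative relationship; they are not taken from the Big Mart data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only.
item_mrp = [50, 100, 150, 200, 250]
outlet_sales = [700, 1500, 2100, 3200, 3600]
item_weight = [5, 8, 10, 14, 18]
item_visibility = [0.20, 0.15, 0.12, 0.06, 0.02]

r_positive = pearson(item_mrp, outlet_sales)        # close to +1
r_negative = pearson(item_weight, item_visibility)  # close to -1

print(round(r_positive, 2), round(r_negative, 2))
```

A value near +1 would be shaded dark blue and a value near -1 dark red under the color convention described above.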

install.packages("corrgram")
library(corrgram)

cor(Big_Mart_data[,c(2,4,6,8,12)])

Here is the R code for simple correlogram using function corrgram().

corrgram(Big_Mart_data, order=NULL, panel=panel.shade, text.panel=panel.txt,
         main="Correlogram")

Sunday, 8 January 2017

First steps to my ggplots_05Jan17-Session_1

The data which we collect is only as good as our ability to understand and communicate it to others, which is why choosing the right visualization is essential. If our data is misrepresented or presented ineffectively, key insights and understanding are lost, which affects the overall purpose of our message.

This is my first ggplot guide which will show the most common charts and visualizations and help choose the right presentation for the data.

Information can be visualized in a number of ways, each of which can provide a specific insight. When we start to work with data, it’s important to identify and understand the story we are trying to tell and the relationship we are looking to show. Knowing this information will help us select the proper visualization to best deliver our message.

When analyzing data, search for patterns or interesting insights that can be a good starting place for finding our story, such as trends, correlations and outliers.

BAR CHARTS

Bar charts are very versatile. They are best used to show change over time, compare different categories, or compare parts of a whole.

VERTICAL (COLUMN CHART)

- Best used for chronological data (time-series should always run left to right), or when visualizing negative values below the x-axis.

HORIZONTAL

- Best used for data with long category labels

STACKED

- Best used when there is a need to compare multiple part-to-whole relationships. These can use discrete or continuous data, oriented either vertically or horizontally.

100% STACKED

- Best used when the total value of each category is unimportant and percentage distribution of subcategories is the primary message.

PIE CHARTS

Pie charts are best used for making part-to-whole comparisons with discrete or continuous data. They are most impactful with a small data set.

STANDARD

- Used to show part-to-whole relationships.

DONUT

- Stylistic variation that enables the inclusion of a total value or design element in the center.

LINE CHARTS

Line charts are used to show time-series relationships with continuous data. They help show trend, acceleration, deceleration, and volatility.

AREA CHARTS

Area charts depict a time-series relationship, but they are different than line charts in that they can represent volume.

AREA CHART

- Best used to show or compare a quantitative progression over time.

STACKED AREA

- Best used to visualize part-to-whole relationships, helping show how each category contributes to the cumulative total.

100% STACKED AREA

- Best used to show distribution of categories as part of a whole, where the cumulative total is unimportant.

SCATTER PLOT

Scatter plots show the relationship between items based on two sets of variables. They are best used to show correlation in a large amount of data.

BUBBLE CHART

Bubble charts are good for displaying nominal comparisons or ranking relationships.

BUBBLE PLOT

- This is a scatter plot with bubbles, best used to display an additional variable.

BUBBLE MAP

- Best used for visualizing values for specific geographic regions.

HEAT MAP

Heat maps display categorical data, using intensity of color to represent values of geographic areas or data tables.

INTRO TO GGPLOTS

The grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. This blog post is my introduction to ggplot2, a visualization package in R. It assumes a very basic knowledge of R, like vectors, data frames and reading csv files. ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar.

GGPLOT2 INSTALLATION

One of R’s greatest strengths is its excellent set of packages. To install a package, we can use the install.packages() function.

To install ggplot2 package we write the following:

install.packages("ggplot2")

To load a package into our current R session, we use library() like below:

library(ggplot2)

Scatter plots with qplot():

We now create a scatterplot in ggplot2. We’ll use the “iris” data frame that’s automatically loaded into R.

We can use the head function to look at the first few rows of the data frame:

head(iris)

The data frame actually contains three types of species: setosa, versicolor, and virginica. Let’s plot Sepal.Length against Petal.Length using ggplot2’s qplot() function:

qplot(Sepal.Length, Petal.Length, data = iris)

# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.

# * First argument `Sepal.Length` goes on the x‐axis.

# * Second argument `Petal.Length` goes on the y‐axis.

# * `data = iris` means to look for this data in the `iris` data frame.

To see where each species is located in this graph, we can color each point by adding a color = Species argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

Similarly, we can let the size of each point denote petal width, by adding a size = Petal.Width argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)

# We see that Iris setosa flowers have the narrowest petals.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width , alpha = I(0.7))

# By setting the alpha of each point to 0.7, we reduce the effects of overplotting.

Finally, let’s fix the axis labels and add a title to the plot.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, xlab = "Sepal Length", ylab = "Petal Length", main = "Sepal vs. Petal Length in Iris data")

Other common geoms:

In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot().

# These two commands are equivalent and give the same output.

qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)

But we can also easily use other types of geoms to create more kinds of plots.

Barcharts: geom = "bar"

movies = data.frame(director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
   movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
   minutes = c(124, 163, 195, 600, 187))

# Plot the number of movies each director has.

qplot(director, data = movies, geom = "bar", ylab = "# movies")

# By default, the height of each bar is simply a count.

But we can also supply a different weight.

# Here the height of each bar is the total running time of the director's movies.

qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "Total Length")

Line charts: geom = "line"

qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)

# Using a line geom doesn't really make sense here.

`Orange` is another built‐in data frame that describes the growth of orange trees.

qplot(age, circumference, data = Orange, geom = "line", color = Tree, main = "How does tree circumference vary with age")

# We can also plot both points and lines.

qplot(age, circumference, data = Orange, geom = c("point", "line"), color = Tree,  main = "How does tree circumference vary with age")

`diamonds` is a data frame that ships with ggplot2 and describes diamonds by their cut, clarity, carat, color, etc. `mtcars` is a built‐in data frame that describes different car models with their respective mpg, number of cylinders, displacement, horsepower, etc.

We can show the info about the data:

head(diamonds)
head(mtcars)

We can also do a comparison between qplot vs ggplot – both give the same output:

# qplot bar chart

qplot(clarity, data=diamonds, fill=cut, geom="bar")

# ggplot bar chart

ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

Here we use continuous scale and also a discrete scale(by converting to factors)

head(mtcars)
qplot(wt, mpg, data=mtcars, colour=cyl)
levels(mtcars$cyl)
qplot(wt, mpg, data=mtcars, colour=factor(cyl))

By using different aesthetic mappings:

qplot(wt, mpg, data=mtcars, shape=factor(cyl))
qplot(wt, mpg, data=mtcars, size=qsec)

We now combine mappings (hint: hollow points, geom-concept, legend combination)

qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb))
qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb), shape=I(1))
qplot(wt, mpg, data=mtcars, size=qsec, shape=factor(cyl), geom="point")
qplot(wt, mpg, data=mtcars, size=factor(cyl), geom="point")

We now make use of the bar-plot:

qplot(factor(cyl), data=mtcars, geom="bar")

We can flip the bar-plot by 90 degrees:

qplot(factor(cyl), data=mtcars, geom="bar") + coord_flip()

The below code tells us the difference between fill/color bars

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))
qplot(factor(cyl), data=mtcars, geom="bar", colour=factor(cyl))

We can fill by variable also:

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))

We can use the ‘ddply’ function from the ‘plyr’ package to split a data.frame into sub-frames and apply functions to each, as below:

library(plyr)

ddply(diamonds, "cut", "nrow")
ddply(diamonds, c("cut", "clarity"), "nrow")
ddply(diamonds, "cut", numcolwise(mean))   # mean of every numeric column per cut
ddply(diamonds, "cut", summarise, meanDepth = mean(depth))
ddply(diamonds, "cut", summarise, lower = quantile(depth, 0.25, na.rm=TRUE),
                                  median = median(depth, na.rm=TRUE),
                                  upper = quantile(depth, 0.75, na.rm=TRUE))
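
The split-apply-combine idea behind these calls can be sketched in a few lines of Python. This is only an illustration of the concept, not of how plyr is implemented; the helper name group_summary and the five rows are made up.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only.
diamonds = [
    {"cut": "Ideal", "depth": 61.5},
    {"cut": "Ideal", "depth": 62.1},
    {"cut": "Premium", "depth": 59.8},
    {"cut": "Premium", "depth": 60.4},
    {"cut": "Good", "depth": 63.3},
]


def group_summary(rows, key, column, func):
    """Split rows by `key`, apply `func` to `column` per group, combine."""
    groups = defaultdict(list)
    for row in rows:                      # split
        groups[row[key]].append(row[column])
    return {group: func(values)           # apply and combine
            for group, values in groups.items()}


counts = group_summary(diamonds, "cut", "depth", len)
mean_depth = group_summary(diamonds, "cut", "depth", mean)

print(counts)
print(mean_depth)
```

Passing len plays the role of "nrow" above, and passing mean corresponds to the summarise call with meanDepth = mean(depth).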

Now we see different forms of creating ggplots with geom = histogram by changing different binwidths:

qplot(carat, data = diamonds, geom = "histogram")
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1)
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01)

We use geom to combine plots by changing the order of layers:

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"))
qplot(wt, mpg, data = mtcars, geom = c("smooth", "point"))
qplot(wt, mpg, data = mtcars, color = factor(cyl), geom = c("point", "smooth"))

We can remove the standard error portion from the diagram by the following code:

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), se = FALSE)

We can make the line more or less wiggly (span: 0-1)

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), span = 0.6)

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), span = 1)

Now by using linear modelling:

qplot(wt, mpg, data = mtcars, geom = c("point", "smooth"), method = "lm")
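
As a rough idea of what a straight-line fit computes, here is a minimal ordinary least squares sketch in Python. The helper name fit_line and the five weight/mpg pairs are made up for illustration; they are not rows of mtcars.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values for illustration only.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [30.0, 26.0, 22.0, 20.0, 16.0]

slope, intercept = fit_line(weight, mpg)
print(round(slope, 2), round(intercept, 2))  # -6.8 43.2
```

A negative slope means the fitted line falls from left to right, that is, heavier cars in this toy data get fewer miles per gallon.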

We can save plot in variable (hint: data is saved in plot, changes in data do not change plot-data)

p.tmp = qplot(factor(cyl), wt, data = mtcars, geom = "boxplot")
p.tmp

We now save mtcars in tmp-var

t.mtcars = mtcars
head(mtcars)
# change mtcars
mtcars = transform(mtcars, wt = wt^2)
# draw plot without/with update of plot data
p.tmp
p.tmp %+% mtcars
# the above line is same as below now with transformed mtcars
qplot(factor(cyl), wt, data = mtcars, geom = "boxplot")

Now to get information about plot:

summary(p.tmp)

We now save plot (with data included):

save(p.tmp, file = "temp.rData")
# save image of plot on disk (hint: svg device must be installed)
library(svglite)

ggsave(file = "test.pdf")
ggsave(file = "test.jpeg", dpi = 72)
ggsave(file = "test.svg", plot = p.tmp, width = 10, height = 5)

We can use shortcuts like this format geom_XXX(mapping, data, ..., geom, position)

p.tmp + geom_point()

# using ggplot-syntax with qplot (hint: qplot creates layers automatically)

qplot(mpg, wt, data = mtcars, color = factor(cyl), geom = "point") + geom_line()
qplot(mpg, wt, data = mtcars, color = factor(cyl), geom = c("point","line"))

We can add an additional layer with different mapping

p.tmp + geom_point()
p.tmp + geom_point() + geom_point(aes(y=disp))

We can set aesthetics instead of mapping:

p.tmp + geom_point(color = "darkblue")
p.tmp + geom_point(aes(color = "darkblue"))

We now show how to deal with overplotting (hollow points, pixel points, alpha[0-1] )

t.df = data.frame(x = rnorm(2000), y = rnorm(2000))
p.norm = ggplot(t.df, aes(x,y))
p.norm + geom_point()

p.norm + geom_point(shape = 1)

p.norm + geom_point(shape = ".")

p.norm + geom_point(colour = alpha("black", 1/2))

p.norm + geom_point(colour = alpha("blue", 1/10))
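
To see why a low alpha helps with overplotting, the following Python sketch works out how opaque a spot becomes when several semi-transparent points of the same color are drawn on top of each other. It assumes standard "over" alpha compositing, which is an assumption on my part about how the graphics device blends the points.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n same-colored points, each with the given alpha."""
    return 1 - (1 - alpha) ** n_points


# With alpha = 1/2, two overlapping points are already 75% opaque.
print(stacked_opacity(1 / 2, 1))   # 0.5
print(stacked_opacity(1 / 2, 2))   # 0.75

# With alpha = 1/10, ten overlapping points are still only about 65% opaque.
ten_faint_points = stacked_opacity(1 / 10, 10)
print(round(ten_faint_points, 2))  # 0.65
```

With a smaller alpha, more points have to overlap before a region looks solid, so dense areas stay distinguishable from sparse ones.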