This module explores patterns and similarities among people involved in the data visualisation field, using a survey that was conducted in 2019 with 1700 participants by the Data Visualisation Society.
Data visualization is the graphical representation of information and data. It plays an important role in data analytics and decision making, as it can reveal insights that are difficult to convey in other forms. Understanding the current state of data visualization is therefore crucial: it gives organizations and practitioners a comprehensive picture of where the field stands today, and it gives people with an interest in data visualization a better understanding of the field.
In this sub-module, the focus is on cluster analysis. The aim is to find an appropriate visualisation for exploring patterns and similarities among respondents, using a survey that was conducted in 2019 with 1700 participants and has 50+ questions.
For this session, the tidyverse, ggforce, GGally, plotly and parcoords R packages will be used. The knitr package is also loaded.
The code chunk below installs any of these packages that are missing and then loads them in R.
packages <- c('tidyverse', 'ggforce', 'GGally', 'plotly', 'parcoords', 'knitr')
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

# import processed dataset
survey <- read_csv("data_visualization_survey-master/data/processed_survey.csv")
# drop the first column (assumed here to be a row index written when the file was saved)
survey <- survey[, 2:ncol(survey)]
survey
# A tibble: 1,057 x 40
Learning_method Role Yearly_pay Visualisation_tool Charts
<chr> <chr> <chr> <chr> <chr>
1 Equal Parts Scho… Acade… $40k - $6… ArcGIS,Excel,ggplot… PieChart,…
2 Mostly Self-Taug… Leade… < $20k Excel,Tableau,Pikto… LineChart…
3 Equal Parts Scho… Leade… $120k - $… Excel,PowerBI,R,Tab… LineChart…
4 Mostly Self-Taug… Leade… $100k - $… Excel,Googledatastu… LineChart…
5 Mostly Self-Taug… Leade… $100k - $… Excel,Googledatastu… LineChart…
6 Mostly Self-Taug… Devel… $80k - $1… D3,Illustrator,Leaf… BarChart,…
7 Mostly Self-Taug… Analy… $100k - $… PowerBI,Tableau,Pen… LineChart…
8 Mostly Self-Taug… Leade… $20k - $4… D3,Angular,Excel,Ja… LineChart…
9 Equal Parts Scho… Acade… $100k - $… ArcGIS,D3,Excel,Ill… LineChart…
10 Mostly Self-Taug… Leade… $140k - $… D3,Excel,Java,Mapbo… LineChart…
# … with 1,047 more rows, and 35 more variables:
# Organization_area <chr>, Undergraduate_major <chr>, Gender <chr>,
# Country <chr>, tool_Excel <chr>, tool_Tableau <chr>,
# tool_R <chr>, tool_ggplot2 <chr>, tool_D3 <chr>,
# tool_Python <chr>, tool_Pen&Paper <chr>, tool_Illustrator <chr>,
# tool_PowerBI <chr>, tool_Plotly <chr>, tool_Mapbox <chr>,
# tool_QGIS <chr>, tool_Leaflet <chr>, tool_ArcGIS <chr>,
# tool_Matplotlib <chr>, tool_React <chr>, chart_BarChart <chr>,
# chart_LineChart <chr>, chart_Scatterplot <chr>,
# chart_PieChart <chr>, chart_Hexbin/Heatmap <chr>,
# chart_Infographics <chr>, chart_Treemap <chr>,
# chart_FlowChart) <chr>, chart_FlowDiagram(Sankey <chr>,
# chart_DAGRE <chr>, chart_ChoroplethMap <chr>,
# chart_NetworkDiagram <chr>, chart_PictorialVisualization <chr>,
# chart_Force-DirectedGraph <chr>, chart_RasterMap <chr>
# column information
col_list <- colnames(survey)
col_list
[1] "Learning_method" "Role"
[3] "Yearly_pay" "Visualisation_tool"
[5] "Charts" "Organization_area"
[7] "Undergraduate_major" "Gender"
[9] "Country" "tool_Excel"
[11] "tool_Tableau" "tool_R"
[13] "tool_ggplot2" "tool_D3"
[15] "tool_Python" "tool_Pen&Paper"
[17] "tool_Illustrator" "tool_PowerBI"
[19] "tool_Plotly" "tool_Mapbox"
[21] "tool_QGIS" "tool_Leaflet"
[23] "tool_ArcGIS" "tool_Matplotlib"
[25] "tool_React" "chart_BarChart"
[27] "chart_LineChart" "chart_Scatterplot"
[29] "chart_PieChart" "chart_Hexbin/Heatmap"
[31] "chart_Infographics" "chart_Treemap"
[33] "chart_FlowChart)" "chart_FlowDiagram(Sankey"
[35] "chart_DAGRE" "chart_ChoroplethMap"
[37] "chart_NetworkDiagram" "chart_PictorialVisualization"
[39] "chart_Force-DirectedGraph" "chart_RasterMap"
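The later chunks pick the tool and chart columns by position (10:25 and 26:40). As a hedged alternative, the columns could be picked by their name prefix, which would keep working if the column order changed. The sketch below uses a short made-up vector of names (`example_cols` is hypothetical and only reproduces a few entries of `col_list`) together with base R's `startsWith()`.

```r
# Hypothetical stand-in for col_list; only a few of the real names are reproduced
example_cols <- c("Learning_method", "Role", "Yearly_pay",
                  "tool_Excel", "tool_R", "chart_BarChart", "chart_PieChart")

# Select the indicator columns by prefix instead of by position
tool_cols  <- example_cols[startsWith(example_cols, "tool_")]
chart_cols <- example_cols[startsWith(example_cols, "chart_")]

# Same shape as the position-based selections: Role first, then the indicators
selected_tool_by_name  <- c("Role", tool_cols)
selected_chart_by_name <- c("Role", chart_cols)

print(selected_tool_by_name)
print(selected_chart_by_name)
```

With the real data, `example_cols` would simply be replaced by `col_list`.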
Since this is survey data, all columns in the dataset hold categorical variables. Clustering techniques such as hierarchical clustering (usually shown as a dendrogram) and k-means rely, in their usual form, on numeric distances, so they are not directly appropriate here. A parallel sets chart, which is compatible with categorical data, is considered instead. The width of each ribbon indicates how often a combination of categories occurs, and the size of each segment on an axis reflects the percentage of each response. The code below examines the relationship between the usage of different tools and the role of respondents.
Use geom_parallel_sets(), geom_parallel_sets_axes() and geom_parallel_sets_labels() to plot the parallel sets chart
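To make the encoding concrete, here is a minimal sketch on made-up data (the `toy` data frame and its "yes"/"no" values are invented for illustration and are not taken from the survey). It counts how often each combination of categories occurs, which is the quantity a parallel sets chart encodes through width.

```r
# Made-up categorical responses, for illustration only
toy <- data.frame(
  Role       = c("Analyst", "Analyst", "Designer", "Designer", "Analyst"),
  tool_Excel = c("yes", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Count every Role x tool_Excel combination
combo <- as.data.frame(table(toy$Role, toy$tool_Excel),
                       stringsAsFactors = FALSE)
names(combo) <- c("Role", "tool_Excel", "freq")

# Keep only the combinations that actually occur
combo <- combo[combo$freq > 0, ]
print(combo)
```

Each row of `combo` would correspond to one ribbon between the two axes, with `freq` as its width.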
# count respondents for every combination of Role and the tool columns
selected_tool <- c(col_list[2], col_list[10:25])
tool <- survey %>%
  group_by(across(all_of(selected_tool))) %>%
  summarise(freq = n(), .groups = "drop")
# gather the data frame into long form (17 axes: Role plus 16 tools)
tool <- gather_set_data(tool, 1:17)
# plot the parallel sets; Role of respondents is used to fill the ribbons
ggplot(tool, aes(x, id = id, split = y, value = freq)) +
  geom_parallel_sets(aes(fill = Role), alpha = 0.3, axis.width = 0.2) +
  geom_parallel_sets_axes(axis.width = 0.2) +
  geom_parallel_sets_labels(colour = 'orangered1', angle = 0, size = 3) +
  theme(axis.text.x = element_text(face = "bold", angle = 90, size = 14))

As the parallel sets chart shows, Designer, Developer, Engineer, Leadership, Other and Scientist respondents show a similar trend in their usage of the tools ‘ArcGIS’, ‘D3’, ‘Excel’, ‘ggplot2’, ‘Illustrator’, ‘Leaflet’, ‘Mapbox’, ‘Matplotlib’, ‘Pen&Paper’, ‘Plotly’ and ‘Python’. The Analyst category is quite distinctive compared with the others: Analyst respondents don’t use the majority of the tools listed, except for Excel, R and Tableau. The share of Academic respondents is relatively small, but their responses are interesting: the majority of Academic respondents don’t use any of the tools listed in the survey. Based on this chart, we can roughly summarise the data into 3-4 groups.
Similarly, let’s take a look at the relationship between the charts used in production in the last 6 months and the role of respondents.
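A visual reading like this can be cross-checked numerically. The sketch below, again on invented data (`toy_roles` and its values are hypothetical), computes the share of each role that reports using a tool with `table()` and `prop.table()`. Roles with similar rows would be candidates for the same group.

```r
# Invented responses, for illustration only
toy_roles <- data.frame(
  Role   = c("Analyst", "Analyst", "Analyst", "Analyst",
             "Designer", "Designer", "Academic", "Academic"),
  uses_R = c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Cross-tabulate role against tool usage
counts <- table(toy_roles$Role, toy_roles$uses_R)

# Convert counts to row proportions: each role sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Repeating this for each tool column would give a role-by-tool table of shares that can be compared row by row.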
# count respondents for every combination of Role and the chart columns
selected_chart <- c(col_list[2], col_list[26:40])
chart <- survey %>%
  group_by(across(all_of(selected_chart))) %>%
  summarise(freq = n(), .groups = "drop")
# gather the data frame into long form (16 axes: Role plus 15 charts)
chart <- gather_set_data(chart, 1:16)
# plot the parallel sets; Role of respondents is used to fill the ribbons
ggplot(chart, aes(x, id = id, split = y, value = freq)) +
  geom_parallel_sets(aes(fill = Role), alpha = 0.3, axis.width = 0.2) +
  geom_parallel_sets_axes(axis.width = 0.2) +
  geom_parallel_sets_labels(colour = 'orangered1', angle = 0, size = 3) +
  theme(axis.text.x = element_text(face = "bold", angle = 90, size = 14))

Again, Designer, Developer, Engineer, Leadership, Other and Scientist respondents behave similarly with respect to chart usage. Academic respondents don’t use the majority of the charts, except for ‘Bar Chart’, ‘Line Chart’ and ‘Scatter plot’.
In general, parallel sets charts are informative and useful for finding similarities and patterns in the data. We can incorporate one into our Shiny-based Visual Analytics Application (Shiny-VAA) and add interactivity. The role of respondents is used for grouping in the two examples above. For the final application, we can add a selection input with options including role, yearly pay, educational background, area of organisation, gender and country.
Use ggparcoord() to plot the parallel coordinate plot
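The selection input described above could look roughly like the following. This is a UI-only sketch, not the final application: the input id `group_var`, the output id `parallel_set` and the labels are hypothetical, and mapping "educational background" to the `Undergraduate_major` column is an assumption. The code is guarded so that it only builds the UI when the shiny package is installed.

```r
# Display label = column name; the column names come from the survey's column list.
# Mapping "Educational background" to Undergraduate_major is an assumption.
group_choices <- c("Role" = "Role",
                   "Yearly pay" = "Yearly_pay",
                   "Educational background" = "Undergraduate_major",
                   "Area of organisation" = "Organization_area",
                   "Gender" = "Gender",
                   "Country" = "Country")

if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    # drop-down that chooses the grouping variable (hypothetical id "group_var")
    shiny::selectInput(inputId = "group_var",
                       label = "Group respondents by",
                       choices = group_choices,
                       selected = "Role"),
    # placeholder for the parallel sets chart (hypothetical id "parallel_set")
    shiny::plotOutput("parallel_set")
  )
}
```

In a matching server function, `input$group_var` would take the place of the hard-coded `Role` column when the grouped frequency table and the fill aesthetic are built.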
ggparcoord(data = survey,
           columns = c(1:3, 6:9),
           groupColumn = 13,  # column 13 is tool_ggplot2
           scale = "uniminmax",
           showPoints = TRUE,
           boxplot = TRUE) +
  theme(axis.text.x = element_text(face = "bold", angle = 90, size = 14))

This parallel coordinate plot can also be used to investigate the similarity between groups. However, one drawback is that the values of each x variable can’t be labelled on the plot, so it is hard to summarise the information into groups when you can’t tell which category each line passes through.
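The labelling problem arises because categorical columns have to be turned into numeric positions before they can be drawn on a parallel coordinate axis. The sketch below illustrates the idea in base R on an invented vector (`role` is hypothetical data): each category becomes an integer code, and a min-max rescaling maps the codes to the range [0, 1], which is the kind of scaling `scale = "uniminmax"` applies to each axis. Keeping a lookup table makes it possible to translate positions back to labels.

```r
# Invented categorical values, for illustration only
role <- c("Analyst", "Designer", "Academic", "Analyst", "Scientist")

role_factor <- factor(role)        # levels are sorted alphabetically
codes <- as.integer(role_factor)   # one integer code per category

# Min-max rescaling of the codes to the range [0, 1]
scaled <- (codes - min(codes)) / (max(codes) - min(codes))

# Lookup table from axis position back to the category label
lookup <- data.frame(label = levels(role_factor),
                     code = seq_along(levels(role_factor)))
lookup$position <- (lookup$code - min(codes)) / (max(codes) - min(codes))
print(lookup)
```

A table like `lookup` could be shown next to the plot so that each axis position can be read as a category.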
ggparcoord(data = survey,
           columns = 10:25,
           groupColumn = 2,
           scale = "uniminmax",
           showPoints = TRUE,
           boxplot = TRUE) +
  theme(axis.text.x = element_text(face = "bold", angle = 90, size = 10)) +
  facet_wrap(~ Role)

Use parcoords() to plot interactive parallel coordinates
# first three columns: Learning_method, Role and Yearly_pay
parcoords(survey[, 1:3],
          rownames = FALSE,
          reorderable = TRUE,
          brushMode = '1D-axes')
This plot gets very messy when a variable has many distinct values.
As discussed in the previous sections, the parallel sets chart with interactivity is informative, clear and useful, so it will be used for this sub-module.
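One way to anticipate this is to count the distinct values in each column before plotting. The sketch below uses an invented data frame (`toy_survey`, its values and the threshold `max_levels` are all assumptions for illustration) and keeps only the columns with few distinct values.

```r
# Invented data, for illustration only
toy_survey <- data.frame(
  Role   = c("Analyst", "Designer", "Analyst", "Academic"),
  Gender = c("Woman", "Man", "Woman", "Man"),
  Visualisation_tool = c("Excel,R", "D3,Illustrator", "Excel,Tableau", "R,ggplot2"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_distinct_values <- sapply(toy_survey, function(col) length(unique(col)))
print(n_distinct_values)

# Keep columns at or below a hypothetical threshold of distinct values
max_levels <- 3
plot_cols <- names(n_distinct_values)[n_distinct_values <= max_levels]
print(plot_cols)
```

Free-text or multi-select columns such as the toy `Visualisation_tool` tend to have almost one distinct value per respondent, which is what clutters the axes.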
