Lies, Damned Lies, & Data Visualization

Syllabus

Christian Swinehart — Fri, 19 Feb 2016 20:41:33 GMT

Charts and graphs have an indisputable aura of objectivity and yet, much like statistics, they have an immense power to either elucidate or mislead. What makes an information graphic ‘trustworthy’ and how do designers know that their work is telling the ‘whole story’? In this course we will deal with the nuts and bolts of collecting & processing data and explore different ways of communicating its meaning in a quantitatively rigorous and visually engaging way.

Assignments will involve the use of scripting, databases, and other numerical tools to transform data into something that is understood rather than simply ‘seen’. Students are encouraged to consider data sources from their surroundings or the larger world and to break away from the screen-based status quo, eschewing the expected line graphs, pie charts, and tables in favor of unconventional visualizations of their own devising.

Objectives

to give students an understanding of the process of acquiring, analyzing, refactoring, and visualizing data
to develop an understanding of the building-blocks of visual data representation (bar charts, scatter plots, network diagrams, etc.), know when each is appropriate, and learn to avoid their associated pitfalls
to discuss the epistemological issues raised by being an irrational primate attempting to make systematic sense of an unverifiable world
to establish Hypothesis Testing as a working method for developing visual explanations and discovering the ‘story’ within a dataset

Prerequisites

basic bitmap, vector, and (potentially) video-editing skills
familiarity with statistical reasoning (mean, median, sorting, normalization, etc.)
facility with a scripting language/data visualization library (d3, SciPy, R) or other data analysis tool (Mathematica, MATLAB, Excel)
not required but helpful: knowledge of databases, server-side programming, interaction design, and animation (or audio)

Readings

Readings will be assigned weekly covering both formal and conceptual issues involved in data science. We will discuss the readings in class in relation to the current assignment and each other’s coursework. Each student must submit 3 questions to the class website before 8 a.m. the day of class.

These questions will act as prompts for the in-class discussion, so anything that can be answered with a ‘yes’ or ‘no’ is probably not up to snuff. The questions should not be questions for the instructor but are intended for your fellow students. You must come to class prepared to discuss the texts.

Assignments

There will be 4 assignments over the course of the semester.

Presentations

Each student will give a pair of presentations on an artist, designer, or technical topic (visualizations tools, algorithms, etc.) during the course of the semester. Presentations should be about 10 minutes long in the format of your choice (slideshow, website, mini-lecture).

You must submit an online summary of your presentation as part of your class documentation by the final day of class. This PDF or online summary should cover the main points of your presentation as well as including appropriate visuals and links to resources or additional information on your topic.

Presentation topics (and dates) will be chosen in the second week of class. One of your presentations must be drawn from a provided list of options while the other will be a topic of your own devising.

Grading

Participation: 20%
Final assignment: 15%
Presentations: 15%
Assignments 1–3: 10%
Reading Questions: 10%
Attendance: 2 unexcused absences==instafail

Final-grade Archetypes

F – frequently late and/or absent. insufficient participation. little to no understanding of formal and quantitative practices.
D – occasional lateness and more than one unexcused absence. basic understanding of subject matter.
C – occasional lateness. demonstrated an understanding of subject matter. failed to take risks. work holds together. makes only obligatory contributions to discussions.
B – always present. work in on time. demonstrated a solid understanding of subject matter. was able to seek out new design principles and technological approaches. work has good form and content (and took some risks). able to make interesting contributions to the class.
A – always present. work in on time. demonstrated a solid understanding of data visualization. was able to explore new approaches. work has excellent form and content which took major risks. always makes interesting contributions to the class and frequently led class discussions.

Week 1

Christian Swinehart — Sat, 20 Feb 2016 00:16:47 GMT

February 22

Assessment of student skills, levels, and interests
- What do you want to learn in this class?
- What sorts of data/information graphics work have you done previously?
- Any coding experience?
Introduction to course goals and expectations
Intro talk
Exercise: Catalog & Classify

Assignment

Next week we'll have an in-class programming workshop. In preparation for that, please read this selection of tutorial chapters.
As with every reading assignment, you will be expected to post 3 questions to the class blog the night before the next class meeting to help guide our in-class discussion. Be sure to add the tag "R1" to your post by clicking on the gear icon at the top of the screen.

Catalog & Classify

Christian Swinehart — Sun, 21 Feb 2016 15:08:36 GMT

A collective research project providing examples and discussion of the basic building blocks of visual data representation.

Reading #1

Christian Swinehart — Sun, 21 Feb 2016 21:56:49 GMT

Python Crash Course

Getting started with Python & PlotDevice

Of the many scripting languages in popular use, Python has a reputation for ease-of-learning and power that few others can match. This week's reading assignment is designed to give you a sense of the language's syntax and how it might be applied to the creation of data graphics using PlotDevice.

Primary Reading (chapters 1–3, 9–12)

Getting Started
Environment
Primitives
Variables
Strings
Collections
Serialization

Supplementary Materials

The official Python Tutorial
Learn Python the Hard Way

Dendrogram

Dipesh Chawla — Mon, 22 Feb 2016 19:31:03 GMT

Dendrogram

A dendrogram is best used for showing classification, probability, decision-making, and hierarchy: possible options, decision trees, probability growth, evolutionary growth, or generations. It is useful for showing how things differentiate while still having certain things in common.

A dendrogram needs to be visually efficient, in the sense that it should display the information in the most visually pleasing way without overwhelming the user with too much data or too many decisions.

Three particularly good examples include the following:

In number 4, despite being very data heavy, the information is presented spaced out enough not to overwhelm the user.

In this case there is too much information, but the graph along the edge redeems it overall.

Some bad examples include:
Too much information, not at all clear for the user.

No classification, no information about the branches, and illegible writing.

Star Plots ⭐️

Jina Aris Yoon — Mon, 22 Feb 2016 19:31:04 GMT

Star Plots ⭐️

Star plots, sometimes called radar charts or web charts, are a graphical method for displaying multivariate data. Multivariate here means that each observation has multiple characteristics (variables) to observe. Each variable must also take values within a known range.
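As a rough illustration of why ranged values matter, here is a minimal Python sketch of how one observation could be turned into the polygon of a star plot. The function name `star_plot_vertices`, the sample car, and its ranges are all made up for this example; the idea is only that each value is rescaled to 0..1 within its range and placed along its own evenly spaced axis.

```python
import math

def star_plot_vertices(values, ranges, radius=1.0):
    """Return (x, y) polygon vertices for one observation on a star plot."""
    n = len(values)
    points = []
    for i, (value, (lo, hi)) in enumerate(zip(values, ranges)):
        # Rescale each variable to 0..1 so axes with different units are comparable.
        t = (value - lo) / (hi - lo)
        # Spread the axes evenly around the circle, starting at 12 o'clock.
        angle = math.pi / 2 - 2 * math.pi * i / n
        points.append((radius * t * math.cos(angle),
                       radius * t * math.sin(angle)))
    return points

# Hypothetical car: price, mileage, weight (invented numbers and ranges).
car = [9000, 20, 3000]
ranges = [(3000, 16000), (10, 40), (1500, 5000)]
for x, y in star_plot_vertices(car, ranges):
    print(round(x, 3), round(y, 3))
```

Connecting the returned points in order (and closing the loop) gives the familiar star shape; a variable without a sensible minimum and maximum has no obvious position on its axis.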

Star plots are often used to display several different observations of the same type of data. For example, this set of star plots compares different car models in the same nine variables:

Price
Mileage (MPG)
1978 Repair Record (1 = Worst, 5 = Best)
1977 Repair Record (1 = Worst, 5 = Best)
Headroom
Rear Seat Room
Trunk Space
Weight
Length

As with many other star plots, labeling the variables on the chart itself can make it cluttered, so the variables are listed separately above. In this set of plots, we can see that Cadillacs are the most expensive of the cars observed.

This method of comparing similar items can be done on the same plot (rather than on separate axes) using different colors or line styles, as in the following. This makes it easy to compare one observation to another, but can become difficult to read with a larger number of observations. Here, we can easily see that Design 1 (in green) is strongly dominant in mass, making it an outlier on that variable.

A popular presentation of this data visualization method is in "personality type" analysis tests. Typically, there is no relationship between the placement of variables, but this chart made use of this spatial technique and grouped personality variables that are commonly related, into five groups, indicated by color:

The star plot is useful in answering three major questions:

Which variables, if any, are dominant (or lacking) in a given observation?
What observations are similar? (i.e. Are there any trends between some or all observations?)
Are there any observations that are outliers?

It can also be used to show changes between discrete periods of time. The following graphic compares revenue of companies between two years. The years are marked as different observations by colored lines.

Note how much easier this display is to process than mentally calculating differences from a table like the one below.

Conversely, one can display the variables (the axes) as discrete times as well. The following graphic compares item stock by line/observations and labels the months as variables.

The main weakness of star plots is that they are limited to displaying a few variables at a time (practically speaking, no more than 20, although some have as many as 100). Beyond that, the web becomes overwhelming or extremely large. The plot below is difficult to understand because of its sheer number of variables, obscurely labeled variables, and bad color choice (two types of blue for very similar observations).

Rubber Sheet

Rhea laroya — Mon, 22 Feb 2016 19:31:09 GMT

Rubber Sheet

Like a heat map, but used to map four or more dimensions, through the use of a colored, three dimensional surface.

Example 1

Example 2
A large 3D physical visualization made by the Detroit Edison Company showing electricity consumption for the year 1935, with a slice per day and each day split into 30 min intervals.

More physical data visualizations: http://dataphys.org/list/

Linear and radial parallel coordinates

Mengyuan (Amily) He — Mon, 22 Feb 2016 19:31:13 GMT

Linear and radial parallel coordinates

Parallel coordinates is a common way of visualizing high-dimensional geometry and analyzing multivariate data. Vertical bars represent each dimension. Each element of the data set has values for each dimension, which are shown as points along the vertical axis and then connected together.
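The construction described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `parallel_coordinates`, `axis_gap`, and `height` are names invented here, each dimension is rescaled to its own minimum and maximum, and y grows upward.

```python
def parallel_coordinates(rows, axis_gap=100, height=200):
    """Turn each data row into a polyline with one point per vertical axis."""
    columns = list(zip(*rows))
    bounds = [(min(col), max(col)) for col in columns]
    lines = []
    for row in rows:
        line = []
        for i, value in enumerate(row):
            lo, hi = bounds[i]
            # Position along the axis: 0 at the column minimum, 1 at the maximum.
            t = (value - lo) / (hi - lo) if hi > lo else 0.5
            line.append((i * axis_gap, t * height))
        lines.append(line)
    return lines

# Two invented cars: (mpg, cylinders, horsepower).
cars = [(18, 8, 150), (30, 4, 90)]
for line in parallel_coordinates(cars):
    print(line)
```

Each returned list is one data element; drawing a line through its points from left to right gives that element's path across the axes.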

Good parallel coordinates clearly present:
- Data structure
- Data trend
- Correlations

Good example 1:

In the space between MPG and cylinders, you can tell that eight-cylinder cars generally have lower mileage than six- and four-cylinder ones. Just follow the lines and look at how they cross: lots of crossing lines are an indication of an inverse relationship, and that is clearly the case here: the more cylinders, the lower the mileage.

The correlation is much more direct between cylinders and horsepower: more cylinders means more horses. There are some crossing lines here as well, of course, so more cylinders do not always mean more power, but the general trend is clearly there.
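The "look at how the lines cross" reading can be made concrete. Two lines between adjacent axes cross exactly when the two data elements swap order from one axis to the next, so counting swapped pairs gives a rough measure of how inverse the relationship is. The sketch below assumes both axes are drawn in the same direction; `crossing_fraction` and the sample numbers are invented for illustration.

```python
from itertools import combinations

def crossing_fraction(left, right):
    """Fraction of line pairs that cross between two adjacent axes."""
    crossings = 0
    comparable = 0
    for i, j in combinations(range(len(left)), 2):
        d_left = left[i] - left[j]
        d_right = right[i] - right[j]
        if d_left == 0 or d_right == 0:
            continue  # tied values: the lines meet on an axis, so skip the pair
        comparable += 1
        if d_left * d_right < 0:
            crossings += 1  # the order flipped between the axes, so the lines cross
    return crossings / comparable if comparable else 0.0

# Invented sample: mileage, cylinders, horsepower for four cars.
mpg = [15, 18, 24, 31]
cylinders = [8, 8, 6, 4]
horsepower = [165, 150, 110, 70]
print(crossing_fraction(mpg, cylinders))
print(crossing_fraction(cylinders, horsepower))
```

A value near 1 means almost every pair crosses (an inverse relationship), while a value near 0 means the lines run mostly side by side (a direct relationship).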

Good example 2:

Good example 3:

Radial parallel coordinates

A fairly large set of data is represented in a relatively small amount of space. If this were straightened out linearly, the power and functionality would be lost. Within a small amount of visual space, many different stories are unfolding before our eyes. While we see that most information flows from the highlighted blue section, we also see very clearly that four sources come into it from the outside. Another quick insight is that even though the blue seems to cover the most territory, the bar charts in the outermost ring show us that, relative to the other segments, it has the least activity: only one bar of green stands out with a higher value. Even though we don’t know what this data represents, a quick inspection allows us to see relationships, and that’s what we’re after.

Most RCDs that I’ve come across so far are interactive. In this example, a hover over the blue fades out all the other segments so that you can concentrate on what relationships are shared by it.

Bad example 1:

Bad example 2:

Bad example 3:

Heat Map

Hanna McLaughlin — Mon, 22 Feb 2016 19:31:18 GMT

HEAT MAP

a graphical representation of data where the values contained in a matrix are represented as colors.
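A minimal Python sketch of that definition: every value in the matrix is rescaled between the smallest and largest value and then blended between a "cold" and a "hot" color. The function names and the blue-to-red ramp are assumptions made for this example, not a standard.

```python
def lerp_color(t, cold=(0, 0, 255), hot=(255, 0, 0)):
    """Blend two RGB colors; t=0 gives cold, t=1 gives hot."""
    return tuple(round(c + (h - c) * t) for c, h in zip(cold, hot))

def heat_map_colors(matrix):
    """Replace every value in a matrix with an RGB color."""
    flat = [value for row in matrix for value in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [[lerp_color((value - lo) / span) for value in row] for row in matrix]

# Invented 2x3 matrix of counts.
for row in heat_map_colors([[0, 5, 10], [2, 8, 10]]):
    print(row)
```

The choice of color ramp matters as much as the data: a simple two-color blend like this one is easy to compute but not necessarily the most readable.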

Can be used for understanding behaviors on multiple scales: digital, built environment, geography
UI/UX, shopping, air traffic

Incorporating Interests with Geography: NYT Geography of baseball fandom

NYT

Facebook Distribution of baseball team fans
facebook

COMPARISON: These two heat map examples depict data gathered from Facebook about where baseball team fans are geographically located. The two visuals serve different purposes: the NYT graphic focuses on the transitions between large team territories, while Facebook's visualization shows the distribution of team fans across the country. Regardless of intent, I find the NYT graphic to be richer in content and to use the heat map technique to greater effect.

BAD EXAMPLE: Reddit Interest Network
This infographic depicts the frequency of subreddits, with frequent re-posting in red and less frequent re-posting in blue.

This graph highlights the importance of having a known geography to map information on.

BAD EXAMPLE: Understanding What Emotions Goes Viral for Marketing Campaigns

emotion

ex - permutation matrix

Yolanda Lam — Mon, 22 Feb 2016 19:31:20 GMT

permutation matrix

Bertin's sortable bar charts for the display of multi-dimensional data

features

Allows arrangements to transform initial matrix - can rearrange rows/columns to reveal information of interest.
Visual comparison of data to reveal or prove insights.
Custom organization of data helps communicate insights with efficiency.
Work flow:

examples

Results for US Presidential Election: sorting into chronological order reveals a pattern (image 2), and ordering them into three regions shows a clearer picture (image 3).
Reorderings of the Character-appearances Network in Victor Hugo's "Les Miserables"
Heat Map
Hotel Occupancy Within Two Years Highlighted bars represent a value above a certain threshold. Indexing is arbitrary.
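The rearranging of rows and columns listed under features can be sketched in Python. Note the assumption: sorting by row and column totals is just one simple ordering rule chosen for this illustration, not Bertin's own procedure (which relied on reordering by eye), and the name `reorder` and the sample matrix are invented.

```python
def reorder(matrix, row_labels, col_labels):
    """Permute rows and columns so the largest totals come first."""
    row_order = sorted(range(len(matrix)),
                       key=lambda r: sum(matrix[r]), reverse=True)
    col_order = sorted(range(len(col_labels)),
                       key=lambda c: sum(row[c] for row in matrix), reverse=True)
    # Apply both permutations; the data itself is unchanged, only its order.
    new_matrix = [[matrix[r][c] for c in col_order] for r in row_order]
    return (new_matrix,
            [row_labels[r] for r in row_order],
            [col_labels[c] for c in col_order])

# Invented presence/absence matrix.
matrix = [[0, 1, 0],
          [1, 1, 1],
          [0, 1, 1]]
new_matrix, rows, cols = reorder(matrix, ["A", "B", "C"], ["x", "y", "z"])
print(cols)
for label, row in zip(rows, new_matrix):
    print(label, row)
```

After reordering, the filled cells gather toward one corner, which is the kind of pattern that stays hidden when the rows and columns are in arbitrary order.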

Tree Maps

Alyssa Mayo — Mon, 22 Feb 2016 19:31:20 GMT

Tree Maps

A tree map is an area-based visualization for a hierarchically-ordered (tree-structured) set of data. It is presented as nested rectangles whose relative size and color each convey a dimension of the data.

Here is a visualization of soft drink preferences across a set of people.
Source: Wikipedia

Tree maps are good for quickly seeing whether or not there is a pattern between two dimensions of a data set. The tetris-like geometry also makes Treemapping good for displaying especially large sets in one view.

Creating a tree map involves choosing two dimensions of the data, color-coding one dimension and defining a "tiling algorithm" for the dimension represented by area. The tiling algorithm determines how the rectangles are sub-divided into rectangles of specific area (corresponding to the data). Tree maps are most legible when the sub-rectangles have an aspect ratio close to one.
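To make "tiling algorithm" less abstract, here is a minimal Python sketch of one of the simplest ones, usually called slice-and-dice: each rectangle is split among its children in proportion to their values, alternating between splitting the width and splitting the height at each level. The tree format (a label paired with either a number or a list of children) and the function names are assumptions made for this example.

```python
def total(node):
    """Sum of all leaf values under a node."""
    label, content = node
    if isinstance(content, list):
        return sum(total(child) for child in content)
    return content

def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    """Collect (label, x, y, w, h) for every leaf under node."""
    if out is None:
        out = []
    label, content = node
    if not isinstance(content, list):
        out.append((label, x, y, w, h))
        return out
    whole = total(node)
    offset = 0.0
    for child in content:
        share = total(child) / whole
        if vertical:
            # Split the width now; children split the height one level down.
            slice_and_dice(child, x + offset * w, y, w * share, h, False, out)
        else:
            slice_and_dice(child, x, y + offset * h, w, h * share, True, out)
        offset += share
    return out

# Invented drink preferences.
tree = ("drinks", [("cola", 50), ("other", [("tea", 25), ("juice", 25)])])
for rect in slice_and_dice(tree, 0, 0, 100, 100):
    print(rect)
```

Slice-and-dice keeps areas proportional to the data but tends to produce long, thin rectangles when there are many children, which is why other tiling algorithms try to keep aspect ratios closer to one.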

While not the most beautiful example, the following tree map of greenhouse gas emissions by buildings on the campus of the University of North Carolina at Chapel Hill shows what is meant by "sub-rectangles".

Source: blogs.sas.com

Another good example, and description of how to read it:

"For UNAIDS (the Joint United Nations Programme on HIV/AIDS), Michael Lindsay of studiovertex designed a pair of treemaps that detail the prevalence of HIV worldwide. The first treemap (left side of the above) visualizes the status of those living with HIV (new infections vs. fatalities; those receiving treatment vs. those waiting for treatment). The second treemap (right side) depicts the geographic regions where those with HIV live, drawing attention to the disproportionate incidence of HIV in Sub-Saharan Africa and Asia as compared to the United States and Western Europe." (Source: arcadenw.org)

Bad example 1:
Ungrouped colors make it hard to compare area

Bad example 2:
Unlabeled groupings with no key

matrix

Katie Wen — Mon, 22 Feb 2016 19:31:20 GMT

matrix

any two dimensional set of numbers, colors, intensities, sized dots, or other glyphs

good for plotting and analyzing a wide range of data. usually arranged in rows and columns and often used in the sciences.

there are several types of matrices, including numeric, linear, scatterplot, dot correlation, reorderable, etc...

examples

Chernoff Faces

Tim Duschenes — Mon, 22 Feb 2016 19:31:20 GMT

Chernoff Faces

Chernoff Faces are a way to visualise many different variables at once. Invented in 1973 by Herman Chernoff (who studied applied mathematics at Brown), they are based on the premise that humans are very good at recognising minute changes in faces. Each measurement of the face, including eye width, head height, distance between mouth and nose and many others, can be mapped to a particular variable. Chernoff faces are therefore a type of glyph, a graphical object whose properties represent data values.
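The mapping from variables to facial measurements can be sketched in Python. Everything named here is hypothetical: the three features, their drawing ranges, and the sample crime record are invented to show the idea of rescaling a data value into the range of a facial feature.

```python
# Hypothetical drawing ranges for three facial features.
FEATURE_RANGES = {
    "eye_width": (0.1, 0.4),
    "head_height": (0.8, 1.4),
    "mouth_curve": (-1.0, 1.0),  # -1 is a frown, +1 is a smile
}

def face_parameters(record, assignment, data_ranges):
    """Map each data variable onto the facial feature assigned to it."""
    params = {}
    for variable, feature in assignment.items():
        lo, hi = data_ranges[variable]
        # Rescale the data value to 0..1, clamping anything out of range.
        t = (record[variable] - lo) / (hi - lo)
        t = min(1.0, max(0.0, t))
        f_lo, f_hi = FEATURE_RANGES[feature]
        params[feature] = f_lo + t * (f_hi - f_lo)
    return params

# Invented record and variable-to-feature assignment.
state = {"murder": 10, "robbery": 0}
assignment = {"murder": "mouth_curve", "robbery": "eye_width"}
data_ranges = {"murder": (0, 10), "robbery": (0, 100)}
print(face_parameters(state, assignment, data_ranges))
```

The `assignment` dictionary is where the designer's judgment enters: swapping which variable drives which feature changes how the same data reads.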

Some features (e.g.: eyebrow slant and eye size) carry more perceived importance, so care must be taken when choosing which variables to assign to which features. Note that this will vary hugely from person to person, causing major differences in how two people interpret the data.

Chernoff faces can be plotted as points on Cartesian co-ordinates, with X and Y as the two most important variables and the features filling in the other details. In 1981, Bernhard Flury and Hans Riedwyl suggested making the faces asymmetrical, thereby doubling the amount of data that could be stored on each face, but also ruining the basic premise of it being a face.

The main issue with this technique is that its fundamental principle, that face recognition can help us see subtle data changes, isn't true. Our ability to recognise faces is a preattentive capability, and does not assist at all in comparison between different faces, which is a serial search task.

Another huge problem with this method is that one personifies the faces in completely inaccurate ways. A downturned mouth may simply mean that a baseball coach used more pinch hitters, but we will always read it as anger or sadness. One ends up making arbitrary assumptions that are not seen in the data. It can be particularly dangerous when facial proportions result in racial stereotypes being read into the visualisation.

This method is fun, but great care is needed, and there may well be a better method.

Crime data from all the states of the US as Chernoff faces. It looks like the data used here was not appropriate to the technique, as the faces have come out very similar looking. The key is also extremely difficult to use.

A good example of how meaningless these are without a key.

Created using 18 different variables, these tests show the maximum, minimum and average faces.

Be careful about colour.

It has been noted that Chernoff faces work better for negative associations than positive.

Isosurfaces

Janice Gan — Mon, 22 Feb 2016 19:31:28 GMT

Isosurfaces

Isosurfaces can be used to create data maps that resemble topographies in 2D or 3D. They have traditionally been used in mathematics, engineering, and medical fields to represent heat distribution, fluid dynamics, or surface qualities of everything from equations to human bones. Water turbulence around a propeller, airflow around a spacecraft reentering the atmosphere, or (as seen below) building ventilation and heating can all be represented this way.

Air velocity around and within a naturally-ventilated building

Passive-solar heat distribution within a lecture hall

While these visualizations can be very useful, they can also become confusing very quickly if colors and axes are used indiscriminately...

...or if the relationships depicted are unlabeled or decontextualized.

This idea of mapping with concentric surfaces, however, can also be used in very elegant ways outside of scientific applications. Interactive isosurface maps seem to have great potential for helping people visualize spatial relationships. This web-hosted map, for example, shows the distance one can travel by bus from Portland's city center in any given increment of time. (For comparison, this somewhat more confusing map aims to do the same thing for Japan by morphing the geography itself rather than creating a spatial overlay.)
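The data behind a travel-time map like this can be sketched in Python: compute the shortest travel time from the center to every stop, then group stops into bands of equal time increments, which become the concentric contours. The tiny network below is invented, and the use of Dijkstra's shortest-path algorithm is an assumption made for this illustration, not a description of how the Portland map was actually built.

```python
import heapq

def travel_times(graph, start):
    """Shortest travel time from start to every reachable stop (Dijkstra)."""
    times = {start: 0}
    queue = [(0, start)]
    while queue:
        t, stop = heapq.heappop(queue)
        if t > times[stop]:
            continue  # a faster route to this stop was already found
        for neighbor, minutes in graph.get(stop, []):
            candidate = t + minutes
            if candidate < times.get(neighbor, float("inf")):
                times[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return times

def isochrone_bands(times, step):
    """Group stops by the upper bound of their time band (0, step, 2*step, ...)."""
    bands = {}
    for stop, t in times.items():
        band = -(-t // step) * step  # round up to the next multiple of step
        bands.setdefault(band, []).append(stop)
    return bands

# Invented bus network: stop -> [(next stop, minutes)].
network = {
    "center": [("A", 5), ("B", 12)],
    "A": [("B", 4), ("C", 20)],
    "B": [("C", 3)],
}
times = travel_times(network, "center")
print(isochrone_bands(times, 10))
```

Drawing a contour around each band's stops gives the nested shapes of the map: each ring is the set of places with the same travel time, just as each isosurface is the set of points with the same value.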

Finally, though I was not able to find examples of this, it may be possible to use isosurfaces to represent information that has both breadth and depth, with size of ring depicting breadth and number/height of rings depicting depth.

Survey Plot / Table Lens

Kevin Ma — Mon, 22 Feb 2016 19:32:07 GMT

Survey Plot / Table Lens

Popularized in the "Table Lens" project from Xerox, these resemble series of bar graphs that can be sorted independently.

What is a survey plot?

"We present a new visualization, called the Table Lens, for visualizing and making sense of large tables... The Table Lens fuses symbolic and graphical representations into a single coherent view that can be fluidly adjusted by the user. This fusion and interactivity enables an extremely rich and natural style of direct manipulation exploratory data analysis." (Source: Table Lens)

A survey plot is a simple multi-attribute visualization technique that can help to spot correlations between any two variables, especially when the data is sorted according to a particular dimension. Each horizontal slice in a plot corresponds to a particular data instance. The data on a specific attribute is shown in a single column, where the length of the line corresponds to the dimensional value. When data includes a discrete or continuous class, the slices (data instances) are colored correspondingly. via Orange
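A text-only Python sketch of the description above: rows are sorted by one chosen attribute, and each value becomes a bar whose length is proportional to that column's maximum. The function name `survey_plot`, the `width` parameter, and the sample statistics are invented, and the sketch assumes non-negative values.

```python
def survey_plot(rows, columns, sort_by, width=10):
    """Return one text line per data row, with a bar for each attribute."""
    index = columns.index(sort_by)
    # Largest value per column, used to scale the bars (1 if a column is all zeros).
    maxima = [max(row[i] for row in rows) or 1 for i in range(len(columns))]
    lines = []
    for row in sorted(rows, key=lambda r: r[index]):
        bars = []
        for value, top in zip(row, maxima):
            length = round(width * value / top)
            bars.append(("#" * length).ljust(width))
        lines.append(" ".join(bars).rstrip())
    return lines

# Invented player statistics: (hits, runs, walks).
players = [(120, 60, 30), (80, 20, 45), (150, 75, 10)]
for line in survey_plot(players, ["hits", "runs", "walks"], sort_by="hits"):
    print(line)
```

With the rows sorted by hits, a column whose bars grow in step (runs) or shrink (walks) hints at a correlation with the sorted attribute, which is the pattern this kind of plot is meant to expose.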

What is it for?

Survey plots are useful in data analysis for quickly finding correlations and patterns in large amounts of data. They are less useful if you want to read individual values; this is more of a "bigger picture" graphing technique.

Compresses large quantities of data into an easily digested visual. For example, a baseball statistics table contains 323 rows by 23 columns = 7429 cells in total, which would not fit onto a standard screen. However, using the Table Lens, you could easily display all the data with room to spare (see below).