Data is everywhere today. Terabytes of data are collected every second from weblogs, sensors, network devices, social media, and so on. Data is rich, and it can be powerful. But only if we can unleash the power hidden in the complex web of data all around us.
How can we unleash this power and put it to work? Not just for a handful of the smartest people, but for everyone, in every field, at every level? The answer lies in data visualization.
In this article, we explain the core elements of a data story, using a case study to show how a data scientist explains, sells, and motivates an audience. The article describes the common mistakes that data producers make, and the confusion that results from them, which often leaves the audience with the perception that the real question was never answered. The article also provides a framework to help avoid these mistakes.
CASE STUDY
Understand the Challenge
To understand the challenge, let’s look at the core elements of the problem statement.
- WeTrainYou
- A London-based training company.
- It has 58 full-time trainers with Salesforce certifications.
- Its core competency is the Salesforce platform.
- It is an expert at training engineers and placing them as full-time or part-time Salesforce developers.
- WeTrainYou wants to start a local training facility in California.
- According to Wikipedia, California is divided into 58 counties and contains 482 municipalities.
- California law makes no distinction between "city" and "town," and municipalities may use either term in their official names.
- According to the 2010 Census, 30,908,614 of California's 37,253,956 residents lived in urban areas, accounting for 82.97 percent of the population.
- Our challenge is to find out:
- Which city in California has the most Salesforce developer jobs?
Rubric for a Good Solution
- Our job:
- Using data, determine which city in California has the maximum number of Salesforce developer jobs
- Use a method that a decision maker can easily understand
- We can expect our stakeholders to be
- Executives and board members
- Sales and marketing personnel
- Legal and public relations personnel
- An effective solution would:
- Be a simple, actionable visual
- Suggest locations where the most Salesforce developer jobs are available in California
Anatomy of a Data Story
Now let’s examine the framework to solve this challenge.
Figure 1 illustrates the data science pipeline, showing the steps from data ingestion to data visualization.
Figure 1. Data science pipeline.
A data science pipeline starts with data. There are three steps, framed as questions below, that we must address in the process of formulating a compelling answer for this data challenge.
- Where will the data come from and how will we ingest it?
- Once the data is ingested, how will we cleanse it? How will we correlate it?
- And finally, how will we present it?
Figure 2. Core steps in the process.
Figure 2 illustrates these three steps. We will focus on the last step: presenting the data with data visualization.
Questions to Ask (and Answer) in Creating a Data Visualization Pipeline
Outlining the Story
- What question or questions do you need to answer?
- Is this a long-term question? Or do they need to act on it now?
- Is this an informational, a motivational, or a sales story?
- What will be the flow of my data story?
- How many slides do I need?
- What should be the message on each slide?
Understanding the problem and the “so what” test
- Why are the stakeholders asking this question?
- What are they going to do with my recommendation?
- What action(s) will they take?
- What is the goal of this data story?
The challenge is to understand the problem so we can present an accurate and meaningful response using data visuals.
Data
- What data do I need to answer the question?
- Where are the data sources?
- Do I need primary data (that is, data I collect myself through surveys or Python* scripts) or secondary data (for example, open government data) to answer the previous question?
- What tools and techniques do I need to collect the data?
- Am I going to use BeautifulSoup* to scrape the data? Or do I need to send one million surveys to get the data?
- How am I going to store the data? Is it going to need billions of rows?
- Do I need to worry about setting up a NoSQL database such as MongoDB*? Or can I get away with saving the data in a simple flat file (for example, CSV)?
- How dirty is the data going to be?
- Do I have access to clean government data in a wonderful Excel* format? Or do I need to write Python scripts to cleanse it (for example, deduplication, normalization, and handling missing values)?
- How many features am I getting in the data? Do I need heavy feature engineering to answer this question?
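The cleansing steps named in the list above (deduplication, normalization, and handling missing values) can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the case study: the function name `cleanse` and the sample rows are hypothetical, and the column names assume a two-column file with Title and Location fields.

```python
import csv
import io


def cleanse(rows):
    """Normalize, drop incomplete rows, and deduplicate scraped rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Normalization: trim whitespace and collapse runs of spaces.
        title = " ".join((row.get("Title") or "").split())
        location = " ".join((row.get("Location") or "").split())
        # Missing values: skip rows that lack either field.
        if not title or not location:
            continue
        # Deduplication: compare case-insensitively, so that
        # "SalesForce Developer" and "Salesforce Developer" match.
        key = (title.lower(), location.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"Title": title, "Location": location})
    return cleaned


# Hypothetical sample data, made up for illustration only.
raw = io.StringIO(
    'Title,Location\n'
    'Salesforce Developer,"Burbank, CA"\n'
    'SalesForce Developer,"Burbank, CA"\n'
    'Salesforce  Developer ,"San Diego, CA"\n'
    'Salesforce Developer,\n'
)
rows = cleanse(csv.DictReader(raw))
for row in rows:
    print(row)
```

Whether to deduplicate at all is a judgment call. Two identical rows on a job board may be the same posting listed twice or two separate openings, and the choice changes the counts that feed the final visual.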
Algorithm
Now we have tons of data. We understand the format, structure, and features of the data.
- What algorithm do I need to answer the question?
- Is this a supervised machine learning problem (we provide true labels to the engine and use the model to predict)?
- Is this an unsupervised (for example, clustering) problem?
- Will this algorithm be too slow for this question?
- Do I need to support near-real-time visuals?
- Do I really need any algorithm? Or can my question be answered with the data available?
Visual encoding
- What markers (for example, lines or circles) and channels (for example, color, size, or tilt) are best suited to present the story?
- What colors should I use?
- Am I presenting this to an audience that is sensitive to certain markers or channels?
- Is the audience tech-savvy?
- How interactive are my data visuals going to be?
- Is this fully interactive?
- How is the audience going to consume my visuals? Are they going to pinch or zoom in on a visual?
- Are they going to use their smartphones or tablets?
- What tools am I going to use to develop the visuals?
- Can Tableau offer me what I want to present, or do I need D3.js to create my visuals?
Story flow and insights
- Are my three slides making sense?
- Is the flow working?
- Is the question answered?
- “Email test”: If I email this visual to someone in England, will they understand it (without my explaining each visual component in the email)?
Act on the story
- Will they act on my story?
- Will they act on my data visuals?
- Did I motivate them enough to act (assuming this was a motivational story)?
- Did their question get answered?
Avoiding Pitfalls (Working on Feedback)
A situation may arise in which you don’t have the right data. If this happens, you need to go back to the drawing board and collect new data from scratch.
Another situation is receiving feedback that the data story is not working (for example, the flow or the visualization doesn’t work well).
To mitigate these situations, we present a framework of questions to ask while troubleshooting your data story.
You did not understand the problem
- Why are they asking for this visual?
- What is the story behind it?
- A simple answer is to get a domain expert on your team. If you are building a healthcare data story visualization, it’s a good idea to have a doctor as a mentor to the team.
- Interview target users both as a group and one-on-one.
- Document the “ask” very clearly for the entire team.
Visual encoding is incorrect
- Did we use too many channels?
- Did we understand how much color to use?
- Did we use too much animation? 3D?
- Were the visual elements too complex?
- How can we keep it simple?
Algorithm is too slow
- Did we use the right algorithm?
- Is this a staging issue?
- Is this a production infrastructure issue?
- Is there another issue?
- Run tests to see what component is slow.
- Measure how the run time of each component changes with the size of the data set.
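One way to find the slow component is to time each stage separately at several data sizes. The sketch below assumes a pipeline with two stages; the stage functions `load` and `transform` are hypothetical stand-ins for your own code, and only the standard library timer `time.perf_counter` is used.

```python
import time


def timed(label, func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed


# Hypothetical pipeline stages, used only to have something to time.
def load(n):
    return list(range(n))


def transform(data):
    return sorted(data, key=lambda x: -x)


# Time each stage at growing data sizes to see which one scales badly.
for size in (1_000, 10_000, 100_000):
    data, t_load = timed(f"load      n={size}", load, size)
    _, t_transform = timed(f"transform n={size}", transform, data)
```

If one stage's time grows much faster than the data size, the algorithm is the likely culprit. If every stage is fast in isolation, the staging or production infrastructure deserves a closer look.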
Act
- Why has no one used the feature, product, or visual?
- How can you build a talkback? Where did they click?
- Can we use tools like Amazon Mobile Analytics* to understand each component?
- Did they click to “drill-down” on data?
- Were they trying to download?
- What part of the visual data story was not even viewed?
Back to the Case Study
Now let’s look at our problem statement one more time.
The problem statement
- The CEO needs to decide on a location (city) in which to open a training facility in California.
- This is a priority action task and they need to make a decision quickly.
What data do we need and where can we get it from?
- Dice.com, a job board, will provide the raw data.
- We will use a BeautifulSoup Python script to scrape the data.
- To keep it simple, we will just grab “Title” and “Location.”
- We need city information in this data.
- We need to have geocoding information in the data.
- To display the map, we need longitude and latitude. Luckily, Tableau has built-in geocoding.
Sample code
Here is the Python script.
## (C) DataTiles.ai
## (C) DataTiles.io
## This is Proof of concept script, please do not use in production
## Sudhir Wadhwa, Jyoti Wadhwa, January 2016
import csv

import bs4 as bs
import requests

try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen


def to_ascii(text):
    # Drop non-ASCII characters and return text (not bytes) for the CSV writer
    return text.encode('ascii', 'ignore').decode('ascii')


holder = dict()
myurl = 'https://www.dice.com/jobs?q=Salesforce+Developer&l=CA'

# Fetch and parse the search results page
sourcehtml = urlopen(myurl)
soup = bs.BeautifulSoup(sourcehtml, "lxml")

with open('TableauJobsLocations.csv', 'w') as csvfile:
    fieldnames = ['Title', 'Location']
    jobwriter = csv.DictWriter(csvfile, fieldnames=fieldnames, dialect="excel", lineterminator='\n')
    jobwriter.writeheader()
    for a in soup.find_all('a', {"class": "dice-btn-link"}, href=True):
        url = a['href']
        # Follow only the links that point to job detail pages
        if url.find('jobs/detail') > 0:
            response = requests.get(url)
            # Parse the detail page under its own name so the results page is not overwritten
            job_soup = bs.BeautifulSoup(response.text, "lxml")
            # The description is extracted here but is not written to the CSV
            jobDesc = to_ascii(job_soup.find("div", {"id": "jobdescSec"}).get_text()).upper()
            holder['Title'] = to_ascii(job_soup.find("h1", {"class": "jobTitle"}).get_text()).strip()
            holder['Location'] = to_ascii(job_soup.find("li", {"class": "location"}).get_text()).strip()
            jobwriter.writerow(holder)
            holder.clear()
Sample output
Here is the output stored in TableauJobsLocations.csv:
sudhirwadhwa ~/Desktop/tbd/SCU $ cat TableauJobsLocations.csv
Title,Location
Sr. Salesforce Developer,"San Marcos, CA"
Salesforce Developer,"Los Angeles, CA"
Senior Salesforce Developer,"San Francisco, CA"
Salesforce Developer - FTE,"San Francisco, CA"
Salesforce Developer - Burbank - 125k+ DOE,"Burbank, CA"
Salesforce Developer,"San Francisco, CA"
Senior Salesforce Developer,"Los Angeles, CA"
Senior Salesforce Developers,"San Diego, CA"
Junior Salesforce developer,"Aromas (monterey County), CA"
Sr. Salesforce Developer,"Santa Clara, CA"
Mid-Level Salesforce Developer,"El Segundo, CA"
Lead Salesforce Developer,"San Bruno, CA"
Lead Salesforce Developer,"San Bruno, CA"
Salesforce Developer,"San Diego, CA"
Salesforce Dev/Admin,"Los Angeles, CA"
Salesforce Developer,"Burbank, CA"
Salesforce developer/Admin,"Oakland, CA"
Sr. Salesforce Developer,"San Marcos, CA"
Salesforce Developer,"San Francisco, CA"
Salesforce Developer,"Vista, CA"
Salesforce Developer,"San Ramon, CA"
SalesForce Developer,"Burbank, CA"
Senior Salesforce Developer,"San Francisco, CA"
Salesforce Developer,"San Ramon, CA"
SalesForce Developer,"Burbank, CA"
Salesforce Developer,"Burbank, CA"
Salesforce Developer,"San Rafael, CA"
Salesforce Developer,"San Francisco, CA"
Salesforce Developer,"San Diego, CA"
Senior Salesforce Developer,"Milpitas, CA"
Sr. Salesforce Developer,"San Marcos, CA"
Salesforce Developer,"Los Angeles, CA"
Senior Salesforce Developer,"San Francisco, CA"
Salesforce Developer - FTE,"San Francisco, CA"
Salesforce Developer - Burbank - 125k+ DOE,"Burbank, CA"
Salesforce Developer,"San Francisco, CA"
Senior Salesforce Developer,"Los Angeles, CA"
Senior Salesforce Developers,"San Diego, CA"
Junior Salesforce developer,"Aromas (monterey County), CA"
Sr. Salesforce Developer,"Santa Clara, CA"
Mid-Level Salesforce Developer,"El Segundo, CA"
Lead Salesforce Developer,"San Bruno, CA"
Lead Salesforce Developer,"San Bruno, CA"
Salesforce Developer,"San Diego, CA"
Salesforce Dev/Admin,"Los Angeles, CA"
Salesforce Developer,"Burbank, CA"
Salesforce developer/Admin,"Oakland, CA"
Sr. Salesforce Developer,"San Marcos, CA"
Salesforce Developer,"San Francisco, CA"
Salesforce Developer,"Vista, CA"
Salesforce Developer,"San Ramon, CA"
SalesForce Developer,"Burbank, CA"
Senior Salesforce Developer,"San Francisco, CA"
Salesforce Developer,"San Ramon, CA"
SalesForce Developer,"Burbank, CA"
Salesforce Developer,"Burbank, CA"
Salesforce Developer,"San Rafael, CA"
Salesforce Developer,"San Francisco, CA"
Salesforce Developer,"San Diego, CA"
Senior Salesforce Developer,"Milpitas, CA"
sudhirwadhwa ~/Desktop/tbd/SCU $
Figure 3. Sample output.
Manipulate the output in Tableau*
Next we bring the data into Tableau and split the location into State and City columns. A snapshot of the data source looks like Figure 4.
Figure 4. Data source split in Tableau* with geocoding.
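The article performs this split inside Tableau. For readers who prefer to prepare the columns before loading the file, the same idea can be sketched in Python. The helper name `split_location` is hypothetical, and the sketch assumes every location has the form "City, ST", with the state after the last comma.

```python
def split_location(location):
    """Split a 'City, ST' string into (city, state) on the last comma."""
    city, sep, state = location.rpartition(",")
    if not sep:
        # No comma found: keep the whole string as the city, state unknown.
        return location.strip(), ""
    return city.strip(), state.strip()


# Two location values taken from the sample output.
print(split_location("San Francisco, CA"))
print(split_location("Aromas (monterey County), CA"))
```

Splitting on the last comma rather than the first keeps a city name intact even if it contains a comma of its own.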
Create a dashboard
Next, create two worksheets and use them in a dashboard (see Figure 5).
Figure 5. The winner is San Francisco (based on this data set).
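The ranking behind the dashboard is a count of rows per location, which can be cross-checked outside Tableau. The sketch below is an assumption about how one might do that in Python, not part of the original workflow. The function name `jobs_per_location` is hypothetical, and the sample holds only the first eight data rows of the output shown earlier, so its counts are smaller than the full data set's.

```python
import csv
import io
from collections import Counter


def jobs_per_location(csv_text):
    """Count rows per Location value in CSV text with a Title,Location header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Location"].strip() for row in reader)


# First eight data rows of the sample output, kept short for illustration.
sample = (
    'Title,Location\n'
    'Sr. Salesforce Developer,"San Marcos, CA"\n'
    'Salesforce Developer,"Los Angeles, CA"\n'
    'Senior Salesforce Developer,"San Francisco, CA"\n'
    'Salesforce Developer - FTE,"San Francisco, CA"\n'
    'Salesforce Developer - Burbank - 125k+ DOE,"Burbank, CA"\n'
    'Salesforce Developer,"San Francisco, CA"\n'
    'Senior Salesforce Developer,"Los Angeles, CA"\n'
    'Senior Salesforce Developers,"San Diego, CA"\n'
)

counts = jobs_per_location(sample)
# Print locations from most to fewest listings.
for location, n in counts.most_common():
    print(f"{location}: {n}")
```

Even on this small excerpt, San Francisco leads with three listings, which matches the direction of the dashboard result.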
Conclusion
Effective storytelling with data visualization not only helps us unleash the power hidden in the terabytes of complex data, but also enables our audience to understand the data and in turn make it actionable.
For instance, data visualization can do the following:
- Meaningfully answer the questions executives are asking.
- Enable data scientists and data engineers to not get lost in the data ocean.
- Empower middle management to draw actionable insights.
- Enable data-driven decision making at every level.
- And more.