🇨🇴 @guerravis
🇺🇸 @duto_guerra

Visual Insights:
a data science introduction


Questions?

Write to me on Twitter at @guerravis
@guerravis Profile

Data Science ?

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data
Wikipedia
... to work effectively with heterogeneous, real-world data and to extract insights from the data using the latest tools and analytical methods.
UC Berkeley MIDS program brochure

However

When you search online this is what you see

Popularity

Data Science is way more than Machine Learning!

The purpose of visualization is insight, not pictures

The purpose of data analysis is insight, not (just) models

But what are insights?

  • Deep understanding
  • Meaningful
  • Non obvious
  • Actionable
  • Based on data

My insights toolset?

What do I use?

Insights

A new niece 👶🏼!!!

How should they name her 🤔?
https://www.babynamewizard.com

What are Colombia's most common names?

https://www.registraduria.gov.co/rev_electro/articulos/jose_maria.htm
https://www.wradio.com.co/noticias/actualidad/estos-fueron-los-nombres-mas-populares-en-colombia-durante-el-2019/20191222/nota/3994345.aspx

Colombian National Registry

  • Me: Hey, do you have data of Colombians' names?
  • CNR: Sure of course!
  • Me: Great, can I have all of them?
  • CNR: Of course, it is just $0.40 per name
  • Me: 🤦‍♂️ Failed!

A couple months later...

  • Brother: What school do you like for your nephew?
  • Me: I wonder which one has the best scores? 🤔🤔🤔🤔
https://especiales.dinero.com/buscador-mejores-colegios-2019/
https://johnguerra.co/viz/colegiosColombia

What car should I buy?

Normal procedure

Ask friends and family

Renault 4
Renault 4 JP4
Partially folded Renault 4 at the roadside

Problem

That's inferring statistics from a sample of n=1

Better approach

Data based decisions

Screenshot Tucarro.com
https://tucarro.com

Presidential Elections

Twitter Influentials

Twitter election analysis

http://old.tweetometro.co/robots_May25.html

How do our Senators vote?

https://johnguerra.co/viz/senadoColombia/

Remember

  • 👉🏼 Insights! 👈🏼
  • Users, tasks and data

John Alexis Guerra Gómez

johnguerra.co
@duto_guerra
@guerravis

Visualization Science

Problem Abstraction

What/Why/How

  • What is visualized?
    • data abstraction
  • Why is the user looking at it?
    • task abstraction
  • How is it visualized?
    • idiom visual encoding and interaction

Abstract language avoids domain-specific pitfalls
What/Why/How helps you navigate the design space systematically

Marks and Channels

Analyze Idiom Structure

Marks

Point
Line
Area

Channels

Channel Types

Information Visualization

Why should we visualize?

     I               II             III              IV
    x      y        x      y        x      y        x      y
 10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
  8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
 13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
  9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
 11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
 14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
  6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
  4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
 12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
  7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
  5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89

Property                                                 Value
Mean of x                                                9
Variance of x                                            11
Mean of y                                                7.50
Variance of y                                            4.125
Correlation between x and y                              0.816
Linear regression                                        y = 3.00 + 0.500x
Coefficient of determination of the linear regression    0.67
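
The summary values above can be checked with a few lines of code. This is a minimal sketch using only the Python standard library; the names `QUARTET`, `summarize` and `SUMMARIES` are illustrative, and the variance is assumed to be the sample variance (n - 1 denominator).

```python
from statistics import mean, variance

# The four datasets from the table; I, II and III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                      6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                       6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Compute the summary statistics listed in the table for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mx,
        "var_x": variance(xs),  # sample variance
        "mean_y": my,
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four datasets produce nearly the same numbers, which is why the summary alone cannot tell them apart and a plot is needed.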

https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/

Datasaurus!


https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/

In Infovis we look for Insights

  • Deep understanding
  • Meaningful
  • Non obvious
  • Actionable
  • Based on data

How do I do it?

What do I use?

Bonus

Types of Visualization

  • Infographics
  • Scientific Visualization (sciviz)
  • Information Visualization (infovis, datavis)

Infographics

Scientific Visualization

  • Inherently spatial
  • 2D and 3D

Information Visualization

Visualization Mantra

  • Overview first
  • Zoom and Filter
  • Details on Demand
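
The three steps of the mantra can be sketched as three small operations over a collection. This is only an illustration: the records, field names and function names below are hypothetical and not taken from any real dataset or tool.

```python
from collections import Counter

# Hypothetical records: ids, days, labels and texts are made up.
RECORDS = [
    {"id": 1, "day": "Mon", "label": "negative", "text": "Order arrived late"},
    {"id": 2, "day": "Mon", "label": "positive", "text": "Fast delivery"},
    {"id": 3, "day": "Tue", "label": "negative", "text": "Wrong item"},
    {"id": 4, "day": "Tue", "label": "neutral", "text": "Is the app down?"},
    {"id": 5, "day": "Tue", "label": "negative", "text": "No refund yet"},
]


def overview(records, field):
    """Overview first: aggregate the whole collection by one field."""
    return Counter(r[field] for r in records)


def zoom_and_filter(records, **criteria):
    """Zoom and filter: keep only the records matching every criterion."""
    return [r for r in records
            if all(r[key] == value for key, value in criteria.items())]


def details_on_demand(records, record_id):
    """Details on demand: return the full record the user selected."""
    return next(r for r in records if r["id"] == record_id)


SUMMARY = overview(RECORDS, "label")
TUESDAY_NEGATIVE = zoom_and_filter(RECORDS, day="Tue", label="negative")
SELECTED = details_on_demand(TUESDAY_NEGATIVE, 5)
```

In an interactive visualization each function would be triggered by a user action (loading the view, brushing or selecting a filter, clicking a mark) rather than called directly.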

Data Types

1-D Linear: Document Lens, SeeSoft, Info Mural
2-D Map: GIS, ArcView, PageMaker, Medical imagery
3-D World: CAD, Medical, Molecules, Architecture
Multi-Var: Spotfire, Tableau, GGobi, TableLens, ParCoords
Temporal: LifeLines, TimeSearcher, Palantir, DataMontage, LifeFlow
Tree: Cone/Cam/Hyperbolic, SpaceTree, Treemap, Treeversity
Network: Gephi, NodeXL, Sigmajs

Take home messages

  • Data Analytics is way more than just models
  • Focus on insights!!!
  • Infovis: Choose the best marks and channels

Big Data?

You might have heard of the Vs of Big Data

  • Volume
  • Velocity
  • Variety
  • and Veracity and Value

Too ambiguous!! 🤦🏽‍♀️ Let's go beyond that

How Big is big?

Can you fit it in one computer?

Yes? 👉🏼 Then it is not really big 🤷🏽‍♀️

Why this criterion?

Big data 👉🏼 Big overhead

Example: photo collection

  • One photo 👉🏼 10MB
  • 1k photos in a 📱 👉🏼 10MB * 1k = 10000MB = 10GB
  • 50k photos in your 💻 👉🏼 10MB * 50k = 500GB
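
The same arithmetic as a tiny script, in case you want to plug in your own numbers. The 10MB average photo size is the assumption from the list above; the function name is illustrative, and decimal units (1GB = 1000MB) are used to match the slide.

```python
PHOTO_SIZE_MB = 10  # assumed average size of one photo


def collection_size_gb(num_photos, photo_size_mb=PHOTO_SIZE_MB):
    """Total storage in GB, using decimal units (1 GB = 1000 MB)."""
    return num_photos * photo_size_mb / 1000


phone_gb = collection_size_gb(1_000)    # 1k photos on a phone
laptop_gb = collection_size_gb(50_000)  # 50k photos on a laptop

if __name__ == "__main__":
    print(f"Phone: {phone_gb:g} GB, laptop: {laptop_gb:g} GB")
```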

Big Data? 🙅🏽‍♂️

How many blue photos are in my collection?

How do you compute this?

  • Put all your photos in one 💻
  • Go through all the collection and count the blue ones

Flickr scale

80+ trillion photos (80,000,000,000,000)

That's big data

How many blue photos are on Flickr?

How do you compute this?

  • Distribute the data among 100s of 💻💻💻 (a cluster).
  • Compute subtotals on each data part. (Map)
  • Aggregate the subtotals into one big total. (Reduce)
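
The three steps above can be sketched on a single machine. This is a toy illustration, not how Hadoop or any real cluster is programmed: the photos are hypothetical (each one reduced to an average RGB color), `is_blue` is an assumed rule, and the "machines" are just list chunks processed one after another.

```python
from functools import reduce

# Hypothetical stand-in for a photo: its average (red, green, blue) color.
PHOTOS = [(20, 40, 200), (220, 30, 30), (10, 90, 250), (120, 120, 120),
          (30, 30, 180), (200, 200, 50), (0, 0, 255), (90, 160, 60)]


def is_blue(photo):
    """Assumed rule: a photo is 'blue' when blue dominates red and green."""
    red, green, blue = photo
    return blue > red and blue > green


def split(items, parts):
    """Distribute items round-robin into `parts` chunks (one per machine)."""
    return [items[i::parts] for i in range(parts)]


def map_count(chunk):
    """Map step: each machine counts the blue photos in its own chunk."""
    return sum(1 for photo in chunk if is_blue(photo))


def count_blue(photos, machines=3):
    """Run the map step per chunk, then reduce the subtotals to one total."""
    subtotals = [map_count(chunk) for chunk in split(photos, machines)]
    return reduce(lambda left, right: left + right, subtotals, 0)


TOTAL_BLUE = count_blue(PHOTOS)
```

The total does not depend on how many chunks the data is split into, which is what makes the count safe to distribute.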

How many computers do you need?

What if one computer breaks? ☒️

Conclusion

Big Data? 👉🏼 Only if it doesn't fit on one 💻

⚠️ Use it only if you must ⚠️

But don't panic!

Let me share a secret

🤫

My wife tells it to me all the time!

Size doesn't really matter

What matters are the insights

Insights ?

Machine Learning?

Machine Learning?

Classical programming: data + rules = answers. Machine Learning: data + answers = rules
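
A toy contrast between the two, as a minimal sketch. Temperature conversion is an illustrative example chosen here, not one from the slides: in the classical version a person writes the rule, while in the learning version a least-squares fit recovers the rule from data and answers.

```python
def fahrenheit_rule(celsius):
    """Classical programming: a human writes the rule, data goes in."""
    return 1.8 * celsius + 32


def learn_linear_rule(data, answers):
    """Machine learning (least squares): infer the rule from examples."""
    n = len(data)
    mean_x = sum(data) / n
    mean_y = sum(answers) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(data, answers))
    sxx = sum((x - mean_x) ** 2 for x in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Data plus known answers (generated here with the rule, for illustration).
CELSIUS = [-40, -10, 0, 8, 15, 22, 38]
FAHRENHEIT = [fahrenheit_rule(c) for c in CELSIUS]

# The learner only sees the examples and recovers slope and intercept.
SLOPE, INTERCEPT = learn_linear_rule(CELSIUS, FAHRENHEIT)

if __name__ == "__main__":
    print(f"learned rule: F = {SLOPE:.2f} * C + {INTERCEPT:.2f}")
```

Real data is noisy, so a learned rule is only as good as the examples it was fitted on.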

What can you use ML for?

  • Photos 🖼
  • Videos 📹
  • Document/Text Processing 📃
  • Speech 👄👂🏼
  • Structured data 💾?

What can I detect in photos 🖼?

  • Objects 🐈 🐕 🍎
  • Faces 👱🏽‍♂️👱‍♀️
  • Celebrities 🍾
  • Landmarks 🗼
  • Text in images 🗼
Video 📹 is about the same, but on a stream

How can I use it?

Develop locally

Pose Detection

https://johnguerra.co/viz/mlPose/

Object Detection

https://johnguerra.co/viz/mlObject/

How can I use it?

Demos

What can I do with documents 📃?

  • OCR 🖼 → 🔤
  • Sentiment analysis 😆😡
  • Topic extraction 🟡🟠🟣
  • Entity detection
  • Political Affiliation? 👔🎉
  • Psychological Profile?

Demos

What can I do with Speech 👄👂🏼?

  • Speech recognition 👂🏼
  • Speech generation 👄

That's hip, but...

The purpose of visualization is insight, not pictures

The purpose of data analytics is insight, not (just) models

Let's compare them with a real world example

How is Rappi doing on Twitter?

  • 30k tweets in a week of 2019

Approach 1

😡😠😒😐😃🥰?

  • Machine learning 🎩! ???
  • Detects sentiment ! ???

I hired a data 🐒 (might be me)

Analyzed 180 tweets

  • 😡😠😒😐😃🥰

Here are some of them

Rappi tweet
😐 -10%
Rappi tweet
😡 -80%
Rappi tweet
🥰 80%
Rappi tweet
😐 -10%
Rappi tweet
😐 -20%
Rappi tweet
🥰 90%
Rappi tweet
😒 -40%
Rappi tweet
😒 -30%

Would you hire this data 🐒?

Well, actually

  • It wasn't a data 🐒
  • It was a 💻
  • Would you use it?

Well, actually, actually

Will you trust it?

I don't

Approach 2

Approach 3

It's up to you!

  • Interactivity 👉 Ask questions
  • Slice and dice
  • Overview first, Zoom/Filter, then details on demand

Rappi Dashboard Link 😉

Don't swallow Machine Learning, eat 🍖!

Machine Learning

  • Prediction vs Training
  • How was it trained?
  • Garbage in - garbage out

Other Insights

FDA

Task: Change in drug's adverse effects reports

User: FDA Analysts

State of the art

https://treeversity.cattlab.umd.edu/

Health insurance claims

Task: Detect fraud networks

User: Undisclosed Analysts

Clustering

Overview

Ego distance

Who to follow on Twitter

https://johnguerra.co/slides/untanglingTheHairball/#/

Who am I?

PhD

Silicon Valley

Many other projects

Big Data Technologies

Technologies

  • MapReduce (Hadoop, Hive, Pig, Spark ...)
  • NoSQL Databases (Redis, Cassandra, MongoDB, Neo4J)
  • Distributed Relational (SQL) Databases (MySQL, PostgreSQL, Oracle, SQL Server)
  • Many others

Hadoop

  • Computing platform for big data
  • Uses clusters for storing and processing the data

Hadoop Architecture

Spark

A distributed computing alternative to MapReduce.

  • Easier to use
  • Integrates better with traditional programming models

NoSQL Databases

  • Scalable storage platforms that use techniques different from those of traditional SQL databases
  • Sacrifice features for performance

Types of NoSQL

  • Column Oriented: Cassandra, HBase, Redshift ...
  • Key-value: Redis, memcached, Aerospike ...
  • Document based: MongoDB, CouchDB, DynamoDB ...
  • Graph based: Neo4J, Titan, ...

Bonus

Introduction to NoSQL for Web Developers

Distributed Relational DB

  • You can also use traditional databases in a distributed way.
  • Divides the database into shards.
  • Usually doesn't scale that well.
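
One common way to divide rows into shards is to hash each row's key. The sketch below assumes hash-based sharding with a fixed number of shards; the function names and the `user-N` keys are hypothetical, and real databases offer their own (often range- or directory-based) schemes.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a row key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would route the same key differently per run.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would store their rows."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


USER_IDS = [f"user-{i}" for i in range(1000)]  # hypothetical row keys
SHARDS = distribute(USER_IDS, 4)

if __name__ == "__main__":
    print([len(shard) for shard in SHARDS])
```

With this modulo scheme, changing the number of shards remaps most keys to a different shard, which is one reason growing a sharded relational database can be painful.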

Others

  • Google DataFlow
    • Google's replacement for MapReduce based on flows.
    • Supposed to scale better.
    • AFAIK can only be used with Google's Cloud.

How to make sense of data?

  • Statistical Analysis
  • Machine Learning and Artificial Intelligence
  • Visual Analytics (and data analytics)

Visual Analytics

Traditional

  • Query for known patterns
  • Display results using traditional techniques

Pros:
  • Many solutions
  • Easier to implement

Cons:
  • Can't search for the unexpected

Data Mining/ML

  • Based on statistics
  • Black box approach
  • Output outliers and correlations
  • Human out of the loop

Pros:
  • Scalable

Cons:
  • Analysts have to make sense of the results
  • Makes assumptions on the data

InfoVis

  • Visual Interactive Interfaces
  • Human in the loop

Pros:
  • Visual bandwidth is enormous
  • Experts decide what to search for
  • Identify unknown patterns and errors in the data

Cons:
  • Scalability can be an issue