🇨🇴 @guerravis
🇺🇸 @duto_guerra

Data to Insights

The importance of good data in data analytics



Creative Commons License

Data Analytics

To extract insights from data

But, what are insights?

Story time

What car should I buy?

Normal procedure

Ask friends and family

Renault 4
Renault 4 JP4
Partially folded Renault 4 by the roadside

Problem

That's inferring statistics from a sample of n=1

Better approach

Data based decisions

Screenshot Tucarro.com
https://tucarro.com
To extract good insights 👉🏼 You need good data

Cars analysis data

  • Scraped
  • Asking price
  • Few examples for rare cases

What is
data analytics?

The process of extracting insights from data

Why should we
visualize?

Anscombe's quartet

         I             II            III             IV
      x      y      x      y      x      y      x      y
   10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
    8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
   13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
    9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
   11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
   14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
    6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
    4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
   12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
    7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
    5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

Property                                                 Value
Mean of x                                                9
Variance of x                                            11
Mean of y                                                7.50
Variance of y                                            4.125
Correlation between x and y                              0.816
Linear regression                                        y = 3.00 + 0.500x
Coefficient of determination of the linear regression    0.67
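
The claim that all four datasets share these properties can be checked by hand. The following is a minimal sketch in plain Python, assuming the values in the table above; the names QUARTET, summarize and SUMMARIES are made up for this illustration.

```python
# Anscombe's quartet, copied from the table above (sets I-III share the same x).
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Compute the summary statistics listed in the property table."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # least-squares slope
    corr = sxy / (sxx * syy) ** 0.5    # Pearson correlation
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),        # sample variance
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "corr": corr,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r2": corr ** 2,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows come out nearly identical, which is the point: the numbers alone cannot tell the datasets apart, only a plot can.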

https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/

Datasaurus!


https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/

The purpose of visualization
is insight,
not pictures

But what are insights?

  • Deep understanding
  • Meaningful
  • Non obvious
  • Actionable
  • Based on data

My insights toolbelt?

What do I use?

Let me show you
more insights

Another story

A new niece 👶🏼!!!

How should they name her 🤔?
https://namerology.com/baby-name-grapher/

What are Colombia's most common names?

https://www.registraduria.gov.co/rev_electro/articulos/jose_maria.htm
https://www.wradio.com.co/noticias/actualidad/estos-fueron-los-nombres-mas-populares-en-colombia-durante-el-2019/20191222/nota/3994345.aspx

Colombian National Registry

  • Me: Hey, do you have data on Colombians' names?
  • CNR: Sure, of course!
  • Me: Great, can I have all of them?
  • CNR: Of course, it is just $0.40 per name
  • Me: 🤦‍♂️ Failed!

A couple months later...

  • Brother: What school do you like for your nephew?
  • Me: I wonder which one has the best scores? 🤔🤔🤔🤔
https://especiales.dinero.com/buscador-mejores-colegios-2019/
https://johnguerra.co/viz/colegiosColombia

What we learned

  • No data 👉🏼 No Insights
  • Dig deeper
  • Ask questions
  • The 👺 is in the details

More Insights

Covid Colombia

Presidential Election

Remember

  • πŸ‘‰πŸΌ Insights! πŸ‘ˆπŸΌ
  • No data πŸ‘‰πŸΌ No Insights
  • Bad Data πŸ‘‰πŸΌ No Insights

John Alexis Guerra Gómez

johnguerra.co
@duto_guerra
@guerravis

Influentials

Influentials CHI2021
https://johnguerra.co/viz/influentials/story/?hashtag=chi2021

Other Insights!

https://johnguerra.co/viz/resultadosPlebiscito/
https://johnguerra.co/viz/USElections2016/
https://johnguerra.co/viz/eliminatoriaMundial/
https://johnguerra.co/viz/resultadosFutbolColombiano/

Results Bogota Council

https://public.tableau.com/app/profile/john.alexis.guerra.g.mez/viz/votosConcejo/

Reject
Buzzwords

Or: Don't swallow Machine Learning, follow the insights

Data Science ?

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data
Wikipedia
... to work effectively with heterogeneous, real-world data and to extract insights from the data using the latest tools and analytical methods.
UC Berkeley MIDS program brochure

However

When you search online this is what you see

Popularity

Data Science is way more than Machine Learning!

The purpose of visualization
is insight,
not pictures

The purpose of data analysis
is insight,
not (just) models

Machine Learning?

Classical programming: data + rules = answers. Machine learning: data + answers = rules.
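
The contrast can be sketched with a toy example. This is only an illustration under made-up assumptions: the prices, the labels and the "one threshold" rule are hypothetical, and real machine learning models are far richer than a single cut-off.

```python
def classify(prices, threshold):
    """Classical programming: data + rule -> answers."""
    return [price > threshold for price in prices]


def learn_threshold(prices, labels):
    """Toy machine learning: data + answers -> rule.

    Tries the midpoints between consecutive distinct prices and keeps
    the cut-off that misclassifies the fewest examples.
    Assumes at least two distinct prices.
    """
    distinct = sorted(set(prices))
    cuts = [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]

    def mistakes(cut):
        return sum((price > cut) != label for price, label in zip(prices, labels))

    return min(cuts, key=mistakes)


# Hypothetical data: asking prices and a human-provided "is expensive" answer.
prices = [12, 15, 18, 40, 45, 60]
labels = [False, False, False, True, True, True]

rule = learn_threshold(prices, labels)   # the rule is learned, not written by hand
answers = classify(prices, rule)         # the learned rule is then applied like any other
print(rule, answers)
```

The learned rule is only as good as the data and answers it was given, which is why the quality of the input matters so much.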

What can you use ML for?

  • Photos 🖼
  • Videos 📹
  • Document/Text Processing 📃
  • Speech 👄👂🏼
  • Structured data 💾?

What can I detect in photos 🖼?

  • Objects 🐈 🐕 🍎
  • Faces 👱🏽‍♂️👱‍♀️
  • Celebrities 🍾
  • Landmarks 🗼
  • Text in images 🗼
Video 📹 is about the same, but on a stream

How can I use it?

Develop locally

Pose Detection

https://johnguerra.co/viz/mlPose/

Object Detection

https://johnguerra.co/viz/mlObject/

How can I use it?

Demos

What can I do with documents 📃?

  • OCR 🖼 → 🔤
  • Sentiment analysis 😆😡
  • Topic extraction 🟡🟠🟣
  • Entity detection
  • Political Affiliation? 👔🎉
  • Psychological Profile?

Demos

What can I do with Speech 👄👂🏼?

  • Speech recognition 👂🏼
  • Speech generation 👄

That's hip, but...

The purpose of visualization is insight, not pictures

The purpose of data analytics is insight, not (just) models

Traditional

  • Query for known patterns
  • Display results using traditional techniques

Pros:
  • Many solutions
  • Easier to implement

Cons:
  • Can't search for the unexpected

Data Mining/ML

  • Based on statistics
  • Black box approach
  • Output outliers and correlations
  • Human out of the loop

Pros:
  • Scalable

Cons:
  • Analysts have to make sense of the results
  • Makes assumptions on the data

InfoVis

  • Visual Interactive Interfaces
  • Human in the loop

Pros:
  • Visual bandwidth is enormous
  • Experts decide what to search for
  • Identify unknown patterns and errors in the data

Cons:
  • Scalability can be an issue

Rappi

How is Rappi doing on Twitter?

  • 30k tweets in a week of 2019

Approach 1

😡😠😒😐😁😃🥰?

  • Machine learning 🎩! ???
  • Detects sentiment ! ???

I hired a data 🐒 (might be me)

Analyzed 180 tweets

  • 😡😠😒😐😁😃🥰

Here are some of them

Rappi tweet
😐 -10%
Rappi tweet
😡 -80%
Rappi tweet
🥰 80%
Rappi tweet
😐 -10%
Rappi tweet
😐 -20%
Rappi tweet
🥰 90%
Rappi tweet
😒 -40%
Rappi tweet
😒 -30%

Would you hire this data 🐒?

Well, actually

  • It wasn't a data 🐒
  • It was a 💻
  • Would you use it?

Well, actually, actually

Will you trust it?

I don't

Approach 2

Approach 3

It's up to you!

  • Interactivity 👉 Ask questions
  • Slice and dice
  • Overview first, Zoom/Filter, then details on demand

Rappi Dashboard Link 😉

Don't swallow Machine Learning, eat 🍖!

Machine Learning

  • Prediction vs Training
  • How was it trained?
  • Garbage in - garbage out

AI ?

Big Data?

You might have heard of the Vs of Big Data

  • Volume
  • Velocity
  • Variety
  • and Veracity and Value

Too ambiguous!! 🤦🏽‍♀️ Let's go beyond that

How Big is big?

Can you fit it in one computer?

Yes? 👉🏼 Then it is not really big 🤷🏽‍♀️

Why this criterion?

Big data 👉🏼 Big overhead

Example: photo collection

  • One photo 👉🏼 10MB
  • 1k photos in a 📱 👉🏼 10MB * 1k = 10000MB = 10GB
  • 50k photos in your 💻 👉🏼 10MB * 50k = 500GB
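
The same back-of-the-envelope arithmetic as a small sketch, assuming the 10MB average photo size from the slide and decimal units (1GB = 1000MB). The function name is made up for this illustration.

```python
PHOTO_MB = 10  # assumed average size of one photo, as on the slide


def collection_size_gb(num_photos, photo_mb=PHOTO_MB):
    """Estimate the size of a photo collection in GB (1 GB = 1000 MB)."""
    return num_photos * photo_mb / 1000


for device, num_photos in [("phone", 1_000), ("laptop", 50_000)]:
    print(f"{num_photos} photos on a {device}: {collection_size_gb(num_photos)} GB")
```

Both totals still fit on one machine, so neither collection counts as big data under the "does it fit in one computer?" test.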

Big Data? 🙅🏽‍♂️

How many blue photos are in my collection?

How do you compute this?

  • Put all your photos in one 💻
  • Go through all the collection and count the blue ones

Flickr scale

80+ trillion photos (80,000,000,000,000)

That's big data

How many blue photos are on Flickr?

How do you compute this?

  • Distribute the data among 100s of 💻💻💻s (a cluster).
  • Compute subtotals on each data part. (Map)
  • Aggregate the subtotals into one big total. (Reduce)
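
The three steps above can be sketched on a single machine. This is a minimal illustration, not how Hadoop or any real cluster is implemented: each photo is assumed to be summarized as an average (red, green, blue) tuple, the definition of "blue" is made up, and the chunks stand in for the parts stored on different computers.

```python
from functools import reduce
from operator import add


def is_blue(photo):
    """Hypothetical definition: a photo is blue if blue dominates its average color."""
    red, green, blue = photo
    return blue > red and blue > green


def distribute(photos, num_machines):
    """Split the collection into one chunk per (simulated) machine."""
    return [photos[i::num_machines] for i in range(num_machines)]


def map_count(chunk):
    """Map step: each machine counts the blue photos in its own chunk."""
    return sum(1 for photo in chunk if is_blue(photo))


def count_blue(photos, num_machines=4):
    chunks = distribute(photos, num_machines)
    subtotals = [map_count(chunk) for chunk in chunks]  # would run in parallel on a cluster
    return reduce(add, subtotals, 0)                    # Reduce step: combine the subtotals


photos = [(10, 20, 200), (200, 10, 10), (0, 0, 255), (50, 50, 50)]
print(count_blue(photos, num_machines=3))
```

Because each subtotal depends only on its own chunk, the map step can run on as many machines as there are chunks, and a failed machine only forces its own chunk to be recomputed.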

How many computers do you need?

What if one computer breaks? ☒️

Conclusion

Big Data? 👉🏼 Only if it doesn't fit on one 💻

⚠️ Use it only if you must ⚠️

But don't panic!

Let me share a secret

🤫

My wife tells it to me all the time!

Size doesn't really matter

What matters are the insights 👍

Insights ?

Machine Learning?

Big Data Technologies

Technologies

  • MapReduce (Hadoop, Hive, Pig, Spark ...)
  • NoSQL Databases (Redis, Cassandra, MongoDB, Neo4J)
  • Distributed Relational (SQL) Databases (MySQL, PostgreSQL, Oracle, SQL Server)
  • Many others

Hadoop

  • Computing platform for big data
  • Uses clusters for storing and processing the data

Hadoop Architecture

Spark

A distributed computing alternative to MapReduce.

  • Easier to use
  • Integrates better with traditional programming models

NoSQL Databases

  • Scalable storage platforms that use techniques different from those of traditional SQL databases
  • Sacrifice features for performance

Types of NoSQL

  • Column Oriented: Cassandra, HBase, Redshift ...
  • Key-value: Redis, memcached, Aerospike ....
  • Document based: MongoDB, CouchDB, DynamoDB ...
  • Graph based: Neo4J, Titan, ...

Bonus

Introduction to NoSQL for Web Developers

Distributed Relational DB

  • You can also use traditional databases in a distributed way.
  • Divides the database into shards.
  • Usually doesn't scale that well.
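
One common way to divide a database into shards is to hash each row's key and use the hash to pick a shard. The sketch below assumes that scheme; the in-memory dictionaries stand in for separate database servers, and all names are hypothetical.

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one database server


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key, row):
    shards[shard_for(key)][key] = row


def get(key):
    return shards[shard_for(key)].get(key)


put("user:1", {"name": "Ana"})
put("user:2", {"name": "Luis"})
print(get("user:1"), [len(shard) for shard in shards])
```

With this simple modulo scheme, changing the number of shards moves most keys to a different shard, and queries that join rows across shards have to contact several servers. Both are reasons why this approach can be hard to scale.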

Others

  • Google DataFlow
    • Google's replacement for MapReduce based on flows.
    • Supposed to scale better.
    • AFAIK can only be used with Google's Cloud.

Types of Visualization

  • Infographics
  • Scientific Visualization (sciviz)
  • Information Visualization (infovis, datavis)

Infographics

Scientific Visualization

  • Inherently spatial
  • 2D and 3D

Information Visualization

Visualization Mantra

  • Overview first
  • Zoom and Filter
  • Details on Demand
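
The mantra maps naturally onto three kinds of queries over a dataset. This is a minimal sketch on a handful of made-up car listings; in a real tool each step would drive a chart rather than a print statement, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical listings, invented for this illustration only.
LISTINGS = [
    {"id": 1, "brand": "Renault", "year": 1985, "price": 3000},
    {"id": 2, "brand": "Mazda", "year": 2012, "price": 9000},
    {"id": 3, "brand": "Renault", "year": 1990, "price": 3500},
    {"id": 4, "brand": "Chevrolet", "year": 2015, "price": 11000},
]


def overview(rows, field):
    """Overview first: summarize the whole dataset by one attribute."""
    return Counter(row[field] for row in rows)


def zoom_filter(rows, **conditions):
    """Zoom and filter: keep only the rows matching every condition."""
    return [row for row in rows
            if all(row[field] == value for field, value in conditions.items())]


def details(rows, row_id):
    """Details on demand: fetch the full record for one item."""
    return next((row for row in rows if row["id"] == row_id), None)


print(overview(LISTINGS, "brand"))
renaults = zoom_filter(LISTINGS, brand="Renault")
print([row["id"] for row in renaults])
print(details(LISTINGS, 3))
```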

Data Types

1-D Linear: Document Lens, SeeSoft, Info Mural
2-D Map: GIS, ArcView, PageMaker, Medical imagery
3-D World: CAD, Medical, Molecules, Architecture
Multi-Var: Spotfire, Tableau, GGobi, TableLens, ParCoords
Temporal: LifeLines, TimeSearcher, Palantir, DataMontage, LifeFlow
Tree: Cone/Cam/Hyperbolic, SpaceTree, Treemap, Treeversity
Network: Gephi, NodeXL, Sigmajs

Visualization Science

Problem Abstraction

What/Why/How

  • What is visualized?
    • data abstraction
  • Why is the user looking at it?
    • task abstraction
  • How is it visualized?
    • idiom: visual encoding and interaction

Abstract language avoids domain-specific pitfalls
What/Why/How helps navigate the design space systematically

Marks and Channels

Analyze Idiom Structure

Marks

Point
Line
Area

Channels

Channels

Channel Types

Channels

Bonus

Other Insights

FDA

Task: Change in drug's adverse effects reports

User: FDA Analysts

State of the art

https://treeversity.cattlab.umd.edu/

Health insurance claims

Task: Detect fraud networks

User: Undisclosed Analysts

Clustering

Overview

Ego distance

Who to follow on Twitter

https://johnguerra.co/slides/untanglingTheHairball/#/

Who am I?

PhD

Silicon Valley

Twitter Influentials

Twitter election analysis

http://old.tweetometro.co/robots_May25.html