John Alexis Guerra Gómez
@duto_guerra
http://johnguerra.co/viz/bigDataQuest
Can you fit it in one computer?
Yes? > Then it is not really big
Let's call it big data only if it doesn't fit on one computer (and has the 3Vs)
Because if it fits on one computer, you don't need all the overhead of big data technologies; just use a traditional relational database.
How do you compute this?
Put all your photos in one computer
Go through all the collection and count
80+ trillion photos (80,000,000,000,000)
That's big data
How do you compute this?
Distribute the data among hundreds of thousands of computers (a cluster).
Compute subtotals on each chunk of the data. (Map)
Aggregate the subtotals into one big total. (Reduce)
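The three steps above can be sketched on a single machine. This is only a toy illustration, assuming the task is counting photos; the names split_into_chunks, map_count and reduce_total are hypothetical, and on a real cluster each chunk would live on a different computer and the map step would run in parallel.

```python
from functools import reduce


def split_into_chunks(items, n_chunks):
    """Distribute: split the collection into roughly equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_count(chunk):
    """Map: each (hypothetical) machine counts the photos in its own chunk."""
    return len(chunk)


def reduce_total(subtotals):
    """Reduce: aggregate the subtotals into one big total."""
    return reduce(lambda a, b: a + b, subtotals, 0)


# Toy stand-in for the photo collection (assumption: 1,000 file names).
photos = [f"photo_{i}.jpg" for i in range(1000)]

chunks = split_into_chunks(photos, 8)
subtotals = [map_count(chunk) for chunk in chunks]  # parallel on a real cluster
total = reduce_total(subtotals)

print(subtotals)
print(total)
```

Only the small subtotals travel between machines, never the photos themselves.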
How many computers? Total data size / capacity of one computer
What if one computer breaks down?
We need redundancy > Each photo is stored in many computers
How do we control versions? How to keep records? What goes where?
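One possible answer to "what goes where", sketched under assumptions: hash each photo's id to pick a starting computer, then store copies on the next few computers. The function replica_nodes, the node numbering and the default of 3 copies are all hypothetical choices for illustration, not what any particular system does.

```python
import hashlib


def replica_nodes(photo_id, n_nodes, n_replicas=3):
    """Return the distinct node indexes that should hold copies of a photo.

    Assumes n_replicas <= n_nodes. The starting node comes from a stable
    hash of the id, so every computer can work out the same placement
    without asking a central directory.
    """
    if n_replicas > n_nodes:
        raise ValueError("cannot place more replicas than there are nodes")
    digest = hashlib.sha256(photo_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % n_nodes
    return [(start + k) % n_nodes for k in range(n_replicas)]


nodes = replica_nodes("photo_42.jpg", n_nodes=10)
print(nodes)
```

If one of those computers breaks down, the photo can still be read from the other two.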
That's why we need big data!!
A distributed computing alternative to MapReduce.
Traditional
Pros:
Cons:

Data Mining/ML
Pros:
Cons:

InfoVis
Pros:
Cons:

     I              II              III             IV
   x      y       x      y       x      y       x      y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89
Property  Value 

Mean of x  9 
Variance of x  11 
Mean of y  7.50 
Variance of y  4.125 
Correlation between x and y  0.816 
Linear regression  y = 3.00 + 0.500x 
Coefficient of determination of the linear regression  0.67 
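The properties above can be checked directly from the four datasets (the values match the well-known Anscombe's quartet). The sketch below recomputes them with the standard library only; the helper name summarize is hypothetical, and the regression is an ordinary least-squares fit computed by hand.

```python
from statistics import mean, variance

# x values shared by datasets I, II and III.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics listed in the table for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # least-squares slope
    intercept = my - slope * mx       # least-squares intercept
    r = sxy / (sxx * syy) ** 0.5      # Pearson correlation
    return {
        "mean_x": mx,
        "var_x": variance(xs),        # sample variance (n - 1)
        "mean_y": my,
        "var_y": variance(ys),
        "r": r,
        "slope": slope,
        "intercept": intercept,
        "r_squared": r ** 2,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four summaries agree to about two decimal places, yet the datasets look completely different when plotted, which is the argument for visualizing the data.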
Task: Changes in drugs' adverse effect reports
User: FDA Analysts
Task: Detect fraud networks
User: Undisclosed Analysts
Task: Who are the most interesting people to follow on #OpenVisConf?
User: Conference attendees
Number of followers overall vs followers in #OpenVisConf
Task: Twitter behavior during Presidential Elections
User: Me
Task: What's the best car to buy?
User: Me
Ask friends and family
That's inferring statistics from a sample of n = 1
Data-based decisions
Adapted from: Tamara Munzner, book chapter
1D Linear  Document Lens, SeeSoft, Info Mural 
2D Map  GIS, ArcView, PageMaker, Medical imagery 
3D World  CAD, Medical, Molecules, Architecture 
MultiVar  Spotfire, Tableau, GGobi, TableLens, ParCoords 
Temporal  LifeLines, TimeSearcher, Palantir, DataMontage, LifeFlow 
Tree  Cone/Cam/Hyperbolic, SpaceTree, Treemap, Treeversity 
Network  Gephi, NodeXL, Sigmajs 