background-size: cover class: title-slide count: false # .monash-blue[Multivariate Data Visualization and Thinking] <br> <h2 style="font-weight:900!important;">IFFRUG</h2> .bottom_abs.width100[ <br> *Nicholas Spyrison* <br> [nspyrison.netlify.app](https://nspyrison.netlify.app/)<br>Nicholas.Syrison@iff.com <br> IFF, Madison <br> April 2023 <br><br> Slides -- [nspyrison.github.io/pred_ebs](https://nspyrison.github.io/pred_ebs/#1) ] --- # Terminology & bias - Stats background, your preferred terms may vary - __Variable__ - _p_, attribute, measure, column - __Observation__ - _n_, item, instance, sample, repetition, row - __Projection__ - an embedded space, of _d < p_ dimension - __Dimension__ - Overloaded, but __p__ data or approximated space, __d__ projection - R, Grammar of Graphics, & __ggplot2__ <br> # Etiquette - Please interrupt for clarifications & elaboration - Tangential or extension questions at the end, time permitting --- # Contents - Context and data - Traditional techniques - Dimension reduction - Tours - Cheem --- # Context: data types <img src="./figures/munzner_datatypes.PNG" width="70%" style="display: block; margin: auto;" /> *Munzner, 2014* - Complete numerical matrix - Alternatively, feature decomposition of other formats --- # Example data -- Palmer penguins - Penguins near Palmer Station, Antarctica - 330 observations - X variables: 4 physical measurements - Species of penguin mapped to color & shape <table> <thead> <tr> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> <th style="text-align:right;"> flipper_length_mm </th> <th style="text-align:right;"> body_mass_g </th> <th style="text-align:left;"> species </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 39.1 </td> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 181 </td> <td style="text-align:right;"> 3750 </td> <td style="text-align:left;"> Adelie </td> </tr> <tr> <td style="text-align:right;"> 
39.5 </td> <td style="text-align:right;"> 17.4 </td> <td style="text-align:right;"> 186 </td> <td style="text-align:right;"> 3800 </td> <td style="text-align:left;"> Adelie </td> </tr> <tr> <td style="text-align:right;"> 40.3 </td> <td style="text-align:right;"> 18.0 </td> <td style="text-align:right;"> 195 </td> <td style="text-align:right;"> 3250 </td> <td style="text-align:left;"> Adelie </td> </tr> <tr> <td style="text-align:right;"> 36.7 </td> <td style="text-align:right;"> 19.3 </td> <td style="text-align:right;"> 193 </td> <td style="text-align:right;"> 3450 </td> <td style="text-align:left;"> Adelie </td> </tr> <tr> <td style="text-align:right;"> 39.3 </td> <td style="text-align:right;"> 20.6 </td> <td style="text-align:right;"> 190 </td> <td style="text-align:right;"> 3650 </td> <td style="text-align:left;"> Adelie </td> </tr> <tr> <td style="text-align:right;"> 38.9 </td> <td style="text-align:right;"> 17.8 </td> <td style="text-align:right;"> 181 </td> <td style="text-align:right;"> 3625 </td> <td style="text-align:left;"> Adelie </td> </tr> </tbody> </table> --- # Scatterplot matrix (SPLOM, small multiples) <img src="images/index.rmd/unnamed-chunk-4-1.png" width="50%" style="display: block; margin: auto;" /> _Chambers, 1983_ ``` GGally::ggpairs(...) ``` - Scalability? - Structure in 3+ variables? --- # Parallel coordinate plot <img src="images/index.rmd/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" /> _Ocagne, 1885_ ``` GGally::ggparcoord(...) ``` - Scalability? - Poor comparison across non-adjacent variables; asymmetric across variable order - Correlation is harder to extract --- # Bonus: observation-based (observation-mapped axes) <img src="./figures/pixel_based.png" width="80%" style="display: block; margin: auto;" /> - Scalability? Interpretability? Correlation? - Asymmetric across variable order --- # Bonus: Chernoff faces - Scalability? Interpretability? Correlation? ``` tourr::animate_faces() ## Only for fun? 
``` <img src="images/index.rmd/unnamed-chunk-7-1.png" width="50%" style="display: block; margin: auto;" /><img src="images/index.rmd/unnamed-chunk-7-2.png" width="50%" style="display: block; margin: auto;" /> --- # Dimension reduction **Linear** - *Affine* transformations: "parallel lines stay parallel" - Examples: - Principal Component Analysis (PCA), oriented by variance - Linear Discriminant Analysis (LDA), oriented by the separation of the supervised class - Visualization _tours_ (next section) -- **Nonlinear** - Other transformations: variable interactions, exponents, _etc._ - Examples: - Sammon mapping - Self Organizing Maps (SOM) - t-distributed Stochastic Neighbor Embedding (tSNE) - Uniform Manifold Approximation and Projection (UMAP) --- # Nonlinear projections Good news: you are already familiar with `\((p=3, d=2)\)` nonlinear projections! <img src="./figures/intuition_nonlinear_proj.PNG" width="60%" style="display: block; margin: auto;" /> [Wikipedia - Map projections](https://en.wikipedia.org/wiki/Map_projection) - Inconsistent space; hard to explain & interpret - Many methods, many hyperparameters; how _faithful_ is the representation? “All non-linear projections are wrong, but some are useful.” <br>--- Anastasios Panagiotelis (play on George Box’s quote about models) <!-- quote from: NUMBAT Seminar, 04/20/2020 --> --- class: transition ## Linear projections & _tours_ --- # Linear projections -- intuition Good news: you are already familiar with `\((p=3, d=2)\)` linear projections!
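A minimal sketch of the `\((p=3, d=2)\)` case in R, assuming made-up Gaussian data and a random orthonormal basis (the names `X`, `A`, and `Y` are illustrative and not from any package). A linear projection is the matrix product of the `\(n \times p\)` data with a `\(p \times d\)` orthonormal basis.

```r
## Sketch only: the data and the basis are simulated for illustration.
set.seed(2023)
n <- 100
X <- matrix(rnorm(n * 3), ncol = 3)  ## n x p data, p = 3

## Orthonormalize a random p x d matrix (d = 2) via QR to obtain a basis.
A <- qr.Q(qr(matrix(rnorm(3 * 2), ncol = 2)))  ## p x d, t(A) %*% A = I

## The projection: each row of Y is an observation in the 2D view.
Y <- X %*% A  ## n x d
dim(Y)
```

Rotating the view amounts to swapping in a different orthonormal `A`; the data `X` is untouched.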
<img src="./figures/intuition_linear_proj.PNG" width="100%" style="display: block; margin: auto;" /> - Not all orientations hold interesting structure - But structural information can be gleaned from rotation --- # Linear projections -- rotation <br> <img src="./figures/linear_proj_wide.png" width="100%" style="display: block; margin: auto;" /> LDA: _Fisher, 1936_, supervised cluster separation. Given the distributions and orientations of different clusters, find a basis (rotation) that separates the clusters the most -- - Often our output space has the same dimensionality as our input space - e.g., PCA returns a `\(p \times p\)` basis - The reduction happens when the space is approximated with fewer components, often involving guided but _subjective_ selection, e.g., finding an "elbow" in a screeplot - Or worse -- only showing PC1:2, with no recognition or discussion of the others --- # Linear projections -- traditional process <br><br> 1) Scale each variable to [0, 1] or to standard deviations away from the mean (z-score)<br> 2) Some people 'whiten' or 'sphere' the data, transforming the covariance matrix to an identity matrix; this should be justified<br> 3) If `\(p\)` is sizable, say more than 10 or so, first approximate the data in fewer components to reach a dimensionality that is realistic to view - Typically with PCA, by eyeballing an elbow in the screeplot - "We approximate 90% of the variation of our 20 variables in the first 5 principal components" - Alternative mathematical approximations: "intrinsic dimensionality estimation" 4) Visualize data- or component-space -- <br><br> Visualize _one rotation (or more if lucky);_ **static, not interactive** --- # Tours - Instead of viewing only static views, *animate* small changes in the basis over time - Object permanence between frames; can see observations and clusters moving together <img src="./figures/tour_frames.PNG" width="50%" style="display: block; margin: auto;" /> *Buja et al., 2005* --- # Grand tour *Asimov, 1984* .left-two-thirds[
<img src="./figures/grand_tour.gif" width="100%" style="display: block; margin: auto;" /> ] .right-third[

```
library(spinifex)
## Assumes spinifex's bundled penguins data
dat  <- scale_sd(penguins_na.rm[, 1:4])
clas <- penguins_na.rm$species
## Random target bases; ggtour() interpolates between them
gt   <- save_history(dat, tourr::grand_tour(), max_bases = 10)
ggt  <- ggtour(gt, dat, angle = .2) +
  proto_point(list(color = clas, shape = clas)) +
  proto_basis() +
  proto_origin()
animate_gganimate(ggt)
```

] --- # Tour taxonomy Distinguished by how the basis path is produced <br> <table> <thead> <tr> <th style="text-align:left;"> Tour type </th> <th style="text-align:left;"> Target bases </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> grand </td> <td style="text-align:left;"> random bases </td> </tr> <tr> <td style="text-align:left;"> guided </td> <td style="text-align:left;"> objective function (simulated annealing) </td> </tr> <tr> <td style="text-align:left;"> manual </td> <td style="text-align:left;"> change the contribution of a selected variable </td> </tr> <tr> <td style="text-align:left;"> local </td> <td style="text-align:left;"> random bases within a local vicinity </td> </tr> <tr> <td style="text-align:left;"> _et al._ </td> <td style="text-align:left;"> slicing, lensing </td> </tr> </tbody> </table> -- <br> - The grand tour is good for EDA, but selects target frames randomly, with no objective function - Hypothesis: bill length is important for distinguishing the orange cluster - Issue: component spaces and the grand tour have no user interaction to "steer" the basis - Response: control its contribution with a _manual tour_ --- # Manual tour *Cook & Buja, 1997.
Spyrison & Cook, 2020* .left-two-thirds[ <img src="./figures/manual_tour.gif" width="100%" style="display: block; margin: auto;" /> ] .right-third[

```
## Assumes library(spinifex) and its bundled penguins data
dat  <- scale_sd(penguins_na.rm[, 1:4])
clas <- penguins_na.rm$species
bas  <- basis_pca(dat)
## Vary the contribution of variable 1 (bill length)
mt   <- manual_tour(bas, manip_var = 1)
ggt  <- ggtour(mt, dat, angle = .2) +
  proto_point(list(color = clas, shape = clas)) +
  proto_basis() +
  proto_origin()
animate_gganimate(ggt)
```

] --- # Tours in R <table> <thead> <tr> <th style="text-align:left;"> Package </th> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Authors </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> {tourr} </td> <td style="text-align:left;"> Tour paths, geodesic interpolation, display in *interactive* base R </td> <td style="text-align:left;"> Wickham et al., 2011 </td> </tr> <tr> <td style="text-align:left;"> _{spinifex}_ </td> <td style="text-align:left;"> Manual tours, compose animations exportable to plotly/gganimate </td> <td style="text-align:left;"> Spyrison & Cook, 2020 </td> </tr> <tr> <td style="text-align:left;"> {ferrn} </td> <td style="text-align:left;"> Diagnostic plots for projection pursuit (guided tour), tracing basis-paths </td> <td style="text-align:left;"> Zhang et al., 2021 </td> </tr> <tr> <td style="text-align:left;"> {liminal} </td> <td style="text-align:left;"> Ensemble graphics with tSNE and tours side-by-side </td> <td style="text-align:left;"> Lee, 2021 </td> </tr> <tr> <td style="text-align:left;"> {loon.tour} </td> <td style="text-align:left;"> Graphics display system 'loon' for tours </td> <td style="text-align:left;"> Xu & Oldford, 2021 </td> </tr> <tr> <td style="text-align:left;"> _{cheem}_ </td> <td style="text-align:left;"> Explore local explanations of non-linear models with the radial tour </td> <td style="text-align:left;"> Spyrison, 2022 </td> </tr> <tr> <td style="text-align:left;"> {detour} </td> <td style="text-align:left;"> Alternative HTML with brushing and 3D proj, but no basis </td> <td style="text-align:left;"> Hart & Wang, 2022 </td> </tr> </tbody> </table> -- - Reminder: other geometric displays can be used when `\(d \neq 2\)` - 1D
density curves - 3D scatterplot - `\(d\)`-dim scatterplot matrix, parallel coordinate plots --- # Overview tours Also see: [A review of State-of-the-Art on Tours, arxiv.org/pdf/2104.08016.pdf](https://arxiv.org/pdf/2104.08016.pdf), preprint of an accepted WIREs article - User interaction - Basis paths, as inscribed on `\(p-\)`spheres/tori - Slicing to explore hollowness and structure - Support Vector Machine boundaries - Guided tours on classification Random Forests - Visualizing neural networks --- class: transition ## Local explanations of black-box models & application of radial tours, _cheem_ --- # Local explanation - Given a non-linear ("black-box") model, how can we maintain interpretability? - Approximate the linear variable importance to the model in the vicinity of _one_ observation <img src="./figures/lime_nonlinear.png" width="60%" style="display: block; margin: auto;" /> _Ribeiro, et al. (2017). Why Should I Trust You?_ --- .pull-left[ # SHAP values - SHapley Additive exPlanations - FIFA 2020 data, 5000 observations, ~35 skill measurements aggregated to 9 variables, Y: wages [2020 Euros] - Model: Random forest regressing wages on the 9 skill variables - SHAP is a model-agnostic local explanation - Approximates linear variable importance at one observation: the mean importance, permuting over combinations of the explanatory variables **The model has very different variable importance across player positions** ] -- .pull-right[ <img src="./figures/cheem_fifa_messi_dijk.png" width="100%" style="display: block; margin: auto;" /> ] --- # Cheem, concept - Create a non-linear model - Extract explanations for every observation (computationally expensive) - __Global view__: approximate data- and attribution-space side-by-side - Explore with linked brushing, tooltips, interactive tabular display - Select a primary and comparison point to explore their explanations - __Cheem radial tour__: - The attribution of the primary point becomes the initial basis -
Evaluate/explore the explanation by testing the support of variable contributions -- <br> __New__, at a GitHub server near you, - Model- and local explanation-agnostic (BYO) - Illustrated with random forests and tree SHAP (reduced computational complexity) --- # Global View
- Select a primary and a comparison point, typically a misclassified point and a neighboring correctly classified point - Use the SHAP values of the primary point as the basis, then perform a 1D radial (manual) tour to interrogate the model's explanation ??? - PC1:2 of data (left) and SHAP (right) - Point color and shape are mapped to *predicted* cluster - Misclassified points have a red circle --- .left-third[ # Cheem radial tour - Primary & comparison observation: dashed & dotted lines - SHAP values displayed as parallel coordinate lines on the basis <br><br> - **The misclassified point downplays the weight on V2; this discrepancy is a measure of how different this misclassified point is, compared with its peers** ] .right-two-thirds[ <img src="./figures/cheem_tour.gif" width="100%" height="600px" style="display: block; margin: auto;" /> ] --- # Demo the app Explore interactively with an R Shiny application <br> Local resources:

```r
cheem::run_app()
```

<br> - Externally hosted Shiny app (outdated): [ebsmonash.shinyapps.io/cheem_initial/](https://ebsmonash.shinyapps.io/cheem_initial/) - __cheem__ (back) on CRAN soon, available on GitHub: [github.com/nspyrison/cheem](https://github.com/nspyrison/cheem) --- # Cheem -- getting started <br><br>

```
## Install cheem development version & its CRAN dependencies.
remotes::install_github("nspyrison/cheem", dependencies = TRUE)
## Run the {cheem} app
cheem::run_app()
## BYO data, model, local explanation:
?cheem::cheem_ls
```

--- # Namesake - __Cheem__ are a fictional race of bipedal trees from the Dr.
Who universe - The original implementation targeted tree-based models and the __DALEX__ ecosystem <img src="./figures/cheem.jpg" width="60%" style="display: block; margin: auto;" /> --- # Acknowledgments <br><br> Thanks to Professor Przemyslaw Biecek for his guidance and input, in addition to the __DALEX__ package ecosystem and _Exploratory Model Analysis_ book <br> Thanks to Di Cook and Kim Marriott for their supervision <br> This research was supported by an Australian Government Research Training Program (RTP) scholarship. These slides were created in __R__ using __rmarkdown__ and __xaringan__ *(R Core Team, 2021; Xie et al., 2018; Xie, 2018)* <br> _Tentatively,_ IFF may support continued development of related content: broader preprocessing functions, more/better geom-like displays --- background-size: cover class: title-slide count: false # Thank you for attending <hr><br> <h2 class="monash-blue" style="font-size: 20pt!important;">Multivariate Data Visualization and Thinking</h2> <br> <h3 style="font-weight:900!important;">IFFRUG</h3> .bottom_abs.width100[ <br> *Nicholas Spyrison* <br> [nspyrison.netlify.app](https://nspyrison.netlify.app/)<br>Nicholas.Syrison@iff.com <br> IFF, Madison <br> April 2023 <br><br> Slides -- [nspyrison.github.io/pred_ebs](https://nspyrison.github.io/pred_ebs/#1) ]