A picture is worth a thousand words, and even more so in data-centric projects. Data exploration is the first step in any machine learning project, and it is pivotal to how well the rest of the project turns out. Although libraries like Plotly and Seaborn provide a huge collection of plots and options, they require the user to first decide what to visualize and what the visualization should look like. This slows exploration down and helps make it the most time-consuming part of the machine learning life cycle.
Well, what if you could get visualizations recommended to you? Lux is a Python package created by the folks at RISELab that aims to make data exploration easier and quicker with its simple one-line syntax and visualization recommendations. As the developers put it, “Lux is built on the philosophy that users should always be able to visualize anything they want without having to think about how the visualization should look like”.
In Lux, you don’t explicitly create plots; you simply specify your analysis intent, i.e., which attributes or subsets interest you, and Lux takes care of the rest. Lux is also tightly integrated with Pandas: a single import statement enables it, with no changes to existing code. It preserves the Pandas data frame semantics, so all commands from the Pandas API work in Lux as expected.
Installation
Install Lux from PyPI
pip install lux-api
Install and activate the Lux notebook extension (lux-widget) included in the package.
For VS Code and Jupyter Notebook
jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget
For JupyterLab
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install luxwidget
Note: Lux does not work in Colab because Colab doesn’t support custom widgets yet.
Check other methods of installation here.
Data Exploration with Lux
Enable Lux by importing it.
import pandas as pd
import lux
That’s it. Now every time you print a data frame, you’ll get a toggle option to view the Lux visualizations. Let’s load some data and try this out.
df = pd.read_csv("https://raw.githubusercontent.com/Aditya1001001/English-Premier-League/master/EDA_data.csv")
df
This creates several plots divided into three tabs:
- Correlation: Visualizes the relationships between pairs of quantitative attributes. The plots are arranged from the most to the least correlated pair of attributes.
- Distribution: Shows histogram distributions of different quantitative attributes, ranked from the most to the least skewed.
- Occurrence: Displays bar chart distributions of different categorical attributes, ranked from the most to the least uneven.
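To make the Correlation tab’s ordering concrete, here is a minimal sketch of ranking attribute pairs by the strength of their correlation. It is not Lux’s implementation: the use of the Pearson coefficient, the helper names pearson and rank_pairs_by_correlation, and the toy values are all assumptions chosen for illustration, and Lux’s own scoring may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_pairs_by_correlation(columns):
    """Return (name_a, name_b, r) tuples sorted by |r|, strongest first."""
    scored = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


# Made-up toy values (hypothetical, not taken from the dataset above).
toy = {
    "overall": [60, 70, 80, 90],
    "value_eur": [1.0, 2.0, 3.0, 4.0],
    "age": [30, 22, 27, 25],
}
ranked = rank_pairs_by_correlation(toy)
for name_a, name_b, r in ranked:
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

Sorting by the absolute value means a strong negative relationship ranks as high as a strong positive one, which is usually what you want when scanning for related attributes.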
In addition to visualizing the intermediate steps of data exploration, Lux has a simple language for specifying your analysis intent, i.e., the attributes and values you’re interested in. There are two ways of specifying intent in Lux:
- Using the intent property of data frames.
- Through the lux.Clause object.
Simple intent specification with intent
The intent property provides a convenient, string-based way to describe the intent of an analysis.
Specifying attributes of interest
Let’s say value_eur is an attribute of interest:
df.intent = ['value_eur']
df
Lux recommends a number of interesting plots in two tabs:
- Enhance tab: Lets the user visualize the relationship between the specified attribute and other attributes. For example, a plot of value_eur vs overall.
- Filter tab: Adds filters to the intended visualization, letting the user quickly browse through subsets of the data. For example, the distribution plot for value_eur with Goals = 1.
Another thing to note here is that Lux doesn’t simply create all possible plots; it determines the channel mappings and plot type based on a set of best practices.
If there are multiple attributes of interest, they can be mentioned in the form of a list. Let’s say we have two attributes of interest: overall and value_eur.
df.intent = ['overall', 'value_eur']
df
This creates recommendations depicting the effect other attributes and filters have on the specified attributes.
There is also a new tab called Generalize, which recommends plots with one of the specified attributes removed.
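The idea behind Generalize can be sketched in a few lines: given an intent, produce every smaller intent obtained by dropping exactly one attribute. The helper name generalize is hypothetical and this is only an illustration of the idea, not how Lux computes its recommendations.

```python
def generalize(intent):
    """Return every intent obtained by removing exactly one attribute."""
    return [intent[:i] + intent[i + 1:] for i in range(len(intent))]


# The two-attribute intent used above yields two single-attribute intents.
candidates = generalize(["overall", "value_eur"])
print(candidates)
```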
Specifying subset of the dataset via filters
Let’s say we are only interested in midfielders.
df.intent = ["Position=Midfielder"]
df
This creates the same correlation, distribution, and occurrence plots as before, but using only the midfielder data.
Multiple values of interest can be specified using the | notation. Let’s say we are interested in midfielders and defenders.
df.intent = ["Position=Midfielder|Defender"]
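To illustrate what such a filter string means, the sketch below splits an "Attribute=v1|v2" specification into an attribute and a list of accepted values, then keeps only the matching rows. The helpers parse_filter and apply_filter and the sample rows are hypothetical; this shows the semantics of the | notation and is not Lux’s own parser.

```python
def parse_filter(spec):
    """Split 'Attribute=v1|v2' into (attribute, [v1, v2])."""
    attribute, _, values = spec.partition("=")
    return attribute.strip(), [value.strip() for value in values.split("|")]


def apply_filter(rows, spec):
    """Keep the rows whose attribute matches any of the listed values."""
    attribute, wanted = parse_filter(spec)
    return [row for row in rows if row[attribute] in wanted]


# Made-up sample rows (hypothetical, not taken from the dataset above).
players = [
    {"name": "A", "Position": "Midfielder"},
    {"name": "B", "Position": "Defender"},
    {"name": "C", "Position": "Forward"},
]
selected = apply_filter(players, "Position=Midfielder|Defender")
print([player["name"] for player in selected])
```

In plain Pandas, the same subset would be written as df[df["Position"].isin(["Midfielder", "Defender"])].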
Advanced intents with lux.Clause
There’s only so much one can accomplish with string-based intent specifications. lux.Clause offers a more expressive way of specifying intent. Additionally, it allows us to override auto-inferred details about the plots, such as an attribute’s default axis or the aggregation function used for quantitative attributes.
The lux.Clause equivalent for specifying interest in overall would be:
df.intent = [lux.Clause(attribute='overall')]
Let’s say that we want to create plots with overall on the y-axis.
df.intent = [lux.Clause(attribute='overall', channel='y')]
Or say we want to use sum as the aggregation function instead of mean.
df.intent = ["value_eur",lux.Clause("overall",aggregation="sum")]
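The choice of aggregation function changes what a chart shows: mean describes a typical member of each group, while sum also grows with the size of the group. The sketch below illustrates the difference in plain Python. The aggregate helper and the sample rows are hypothetical and only illustrate the concept; they are not part of Lux.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows, group_key, value_key, func):
    """Group rows by group_key and reduce each group's values with func."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: func(values) for group, values in groups.items()}


# Made-up sample rows (hypothetical, not taken from the dataset above).
players = [
    {"Position": "Midfielder", "overall": 80},
    {"Position": "Midfielder", "overall": 70},
    {"Position": "Forward", "overall": 90},
]
by_mean = aggregate(players, "Position", "overall", mean)
by_sum = aggregate(players, "Position", "overall", sum)
print(by_mean)
print(by_sum)
```

With the mean, the lone forward ranks above the midfielders; with the sum, the two midfielders together outrank the forward simply because there are more of them.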
Create individual visualizations with Vis objects
A Vis object represents an individual visualization displayed in Lux. To generate a Vis, you need a source data frame and the intent of analysis. The intent is expressed using the same specification described earlier, as either strings or lux.Clause objects. For example, here we describe our intent to visualize the overall attribute of the data frame df.
from lux.vis.Vis import Vis
intent = ["overall"]
vis = Vis(intent, df)
vis
You can easily replace a Vis’s data source and intent without changing its definition. For example, the following represents the overall distribution on the subset of data containing only forwards, with a bin size of 50.
new_intent = [lux.Clause("overall", bin_size=50), "Position=Forward"]
vis.set_intent(new_intent)
vis
You can learn more about Vis here.
The visualizations can be stored as stand-alone HTML files. The default file name is export.html; you can optionally pass a different HTML filename as the input parameter.
df.save_as_html('overall_vs_value.html')
Vis objects can also be exported to code in Altair or as Vega-Lite.
vis.to_Altair()
vis.to_VegaLite()
You can find more information about saving and exporting visualizations here.
Code for the above implementation is available in this Jupyter notebook.
References
For a more in-depth understanding of Lux, see: