Titanic Visions

A gentle introduction to using and extending the visions library.

import pandas as pd
import numpy as np

import visions as v
from visions.typesets import CompleteSet
from visions.functional import infer_type, cast_to_inferred, compare_detect_inference_frame
%matplotlib inline

The Basics

visions is a library for deterministic evaluation of semantic data types. Under normal conditions, particularly when working with unprocessed data, the type of a series is defined by its physical representation in memory. So although it is obvious to a practitioner that the sequence [1.0, 2.0, 3.0] should be integer, to the machine those values are all floats. Here we will demonstrate how you can use visions to infer the semantic representation of your data, both for this simple case and for more complicated data types.
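
To make that gap concrete, pandas itself only reports the physical dtype of such a sequence:

# Physically these values are floats, even though they are semantically integers
pd.Series([1.0, 2.0, 3.0]).dtype
dtype('float64')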

# Load dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)


df.head()

Plotting

visions relies upon pygraphviz to visualize the relationship graph constructed between types. If you don’t already have a working C compiler available, you might need to install pygraphviz from a prebuilt package rather than from source.

Let’s make sure pygraphviz is installed before going any further.

!conda install pygraphviz -y
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.

visions Building Blocks

Within visions there are two main ideas we need to get started.

Types

The first is, unsurprisingly, a Type. Think of a type as the semantic meaning of a sequence. The String type represents all sequences consisting exclusively of strings (and the missing value indicator np.nan), the DateTime type represents all sequences of Timestamps or datetimes, and so on.
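
As a quick illustration, using the membership check we’ll explore in more depth later in this post (and assuming DateTime is importable from visions.types alongside String):

from visions.types import String, DateTime

pd.Series(['a', 'b', 'c']) in String                 # True: every value is a string
pd.Series([pd.Timestamp('1912-04-15')]) in DateTime  # True: a sequence of Timestamps
pd.Series([1.0, 2.0, 3.0]) in String                 # False: floats, not strings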

Typesets

The second is the Typeset, which represents a collection of types. Now that we’ve moved beyond the realm of physical types, it’s possible to imagine conflicting notions of a sequence’s data type. For example, some users might define probabilities as any sequence of numbers between 0 and 1, while others might instead consider them bounded between 0 and 100. To resolve such conflicts, users encapsulate their preferred types into a single object, the Typeset. So long as each type follows a few simple rules, visions will automatically determine the relationships between types and construct a traversable relationship graph between them.
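
To make the probability conflict concrete, the two conventions could be expressed as competing contains checks. This is just a hypothetical sketch to illustrate the ambiguity; neither type ships with visions:

# Two incompatible notions of "probability" over the same numeric data
def probability_unit_interval(series):
    return series.between(0, 1).all()    # probabilities as values in [0, 1]

def probability_percentage(series):
    return series.between(0, 100).all()  # probabilities as values in [0, 100]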

typeset = CompleteSet()
typeset.plot_graph()

[png: relationship graph of the CompleteSet typeset]

Using a Typeset

Now that we have the basic building blocks down, let’s start using our typeset. Our first task is to see visions’ type inference capabilities in action. To make the problem a little more interesting, let’s first convert all of our data to strings and see what visions does.

# Compare each column's detected (physical) type with its inferred (semantic) type
inferred_types = compare_detect_inference_frame(df.astype(str), typeset)
pd.DataFrame({name: [data[i] for data in inferred_types]
              for i, name in enumerate(['Columns', 'OriginalType', 'InferredType'])})

Great! As we can see, visions is able to exploit the relationships intrinsic to a Typeset to infer the semantic types of our data regardless of its initial physical representation.
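
The same inference also works on a single series. Since infer_type was imported above, a spot check might look like this (a sketch; whether infer_type accepts a single series in exactly this way may depend on your installed version of visions):

# Infer the semantic type of one stringified column; we expect an Integer type back
infer_type(df.astype(str)['Pclass'], typeset)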

We can go a little further, though, and have visions losslessly coerce our data into its most semantically meaningful representation without our lifting a finger.

cast_df = cast_to_inferred(df, typeset)
cast_df.head()

cast_df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

As an aside, utilities like cast_to_inferred from visions’ functional API are thin wrappers around an object-oriented API defined on the typeset. Feel free to use whichever is most appealing to your personal style.
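
For example, the cast above could likely also be written as a method call on the typeset itself. The method name below is assumed to mirror the functional name, so treat this as a sketch and check your installed version:

# Object-oriented equivalent of cast_to_inferred(df, typeset) (method name assumed)
cast_df = typeset.cast_to_inferred(df)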

Custom Typing

This is all well and good, but often we are interested in defining our own custom semantic associations with our data, whether as part of an EDA pipeline, a data validation test, or something else altogether. With visions, defining those types is easy.

Let’s create a BiologicalSex type corresponding to the Sex field in our data, which has only two unique values: male and female.

from visions.types import VisionsBaseType

class BiologicalSex(VisionsBaseType):
    @classmethod
    def contains_op(cls, series):
        # A series is BiologicalSex if every value is 'male' or 'female'
        return series.isin({'male', 'female'}).all()

    @classmethod
    def get_relations(cls):
        # No relationships to other types (yet)
        return []

As we can see, defining a new type is easy, requiring only two methods: contains_op and get_relations. The contains_op is a test that determines whether a provided series is of the semantic type represented by the class; in this case, whether the data is composed exclusively of the strings 'male' and 'female'.

This simple definition gives us a friendly API for working with data. For example, if we want to determine whether a sequence is of type BiologicalSex, we can do the following:

df.Sex in BiologicalSex
True
df.Age in BiologicalSex
False

What about get_relations, though? This method allows us to declare relationships to other semantic data types. In this case, our definition of BiologicalSex assumes that the data is a String, so let’s define a relationship between the two and create a new typeset that includes BiologicalSex.

from visions.relations import IdentityRelation, InferenceRelation
from visions.types import String

class BiologicalSex(VisionsBaseType):
    @classmethod
    def contains_op(cls, series):
        return series.isin({'male', 'female'}).all()
    
    @classmethod
    def get_relations(cls):
        # Every BiologicalSex series is also a valid String series
        return [IdentityRelation(cls, String)]

new_typeset = typeset + BiologicalSex
new_typeset.plot_graph()

[png: relationship graph of the typeset extended with BiologicalSex]

Voilà, just like that we have a new type appropriately inserted into the relationship graph of our new typeset, and we can do all the same work as above.
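
For instance, inference against the new typeset should now single out the Sex column (a sketch; the printed representation of the returned type may vary by version):

# With the new typeset, the Sex column should infer as BiologicalSex rather than String
infer_type(df.astype(str)['Sex'], new_typeset)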

Dealing with Nulls

We do have one potential problem, though: we haven’t accounted for any missing value indicators in our contains_op, which means data with missing values won’t be considered BiologicalSex. There are numerous ways to resolve this issue (should you wish to), but visions offers a utility for easily defining nullable contains operations.

We can even use both in the same typeset by appropriately defining relations.

null_biosex_series = pd.Series([np.nan, 'male', 'female']) 

null_biosex_series in BiologicalSex
False
from visions.utils.series_utils import nullable_series_contains


class NullBiologicalSex(VisionsBaseType):
    @classmethod
    @nullable_series_contains  # utility that makes the contains check tolerate missing values
    def contains_op(cls, series):
        return series.isin({'male', 'female'}).all()
    
    @classmethod
    def get_relations(cls):
        return [IdentityRelation(cls, String)]

    
class BiologicalSex(VisionsBaseType):
    @classmethod
    def contains_op(cls, series):
        return series.isin({'male', 'female'}).all()
    
    @classmethod
    def get_relations(cls):
        # BiologicalSex is the special case of NullBiologicalSex with no missing values
        return [IdentityRelation(cls, NullBiologicalSex, relationship=lambda s: not s.hasnans)]

null_biosex_series in NullBiologicalSex
True

Bringing it all together

new_typeset = typeset + NullBiologicalSex + BiologicalSex
new_typeset.plot_graph()

[png: relationship graph of the typeset extended with BiologicalSex and NullBiologicalSex]

import random
new_df = df.astype(str)
new_df['NullSex'] = [random.choice(['male', 'female', np.nan]) for i in range(df.shape[0])]

inferred_types = compare_detect_inference_frame(new_df, new_typeset)
pd.DataFrame({name: [data[i] for data in inferred_types]
              for i, name in enumerate(['Columns', 'OriginalType', 'InferredType'])})
