import pandas as pd
import seaborn as snsNotebook goals:
- Summarize key dataset properties
- Quickly evaluate missing data and data types
- Insights into the whole dataset without manual review
Import packages and Datasets
iris = sns.load_dataset("iris")
type(iris)pandas.core.frame.DataFrame
titanic = sns.load_dataset("titanic")
type(titanic)pandas.core.frame.DataFrame
Pandas Built-in Functions
Pandas is very powerful and popular data tool.
Dataframes (df) are 2D tabular datasets.
These features make understanding dfs quick and painless: - info : structure, column types, non-null counts - describe : basic stats for numeric columns - shape : dimensions (rows, columns) - head : first 5 rows (or head(n)) - tail : last 5 rows (or tail(n)) - columns : column names - dtypes : type of data in each column - nunique : count of unique values/column - isnull.sum() : determines if a value is missing, combined with sum gives null count/column
We’ll check each for both the iris and titanic datasets.
Info
Returns structure, column types, non-null counts.
iris.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
titanic.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
Describe
Returns basic stats for numeric columns.
iris.describe()| sepal_length | sepal_width | petal_length | petal_width | |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
titanic.describe()| survived | pclass | age | sibsp | parch | fare | |
|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Pro Tip: Describe All
Add include='all' for results that include non-numeric columns as well!
iris.describe(include='all')| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150 |
| unique | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | setosa |
| freq | NaN | NaN | NaN | NaN | 50 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 | NaN |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | NaN |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 | NaN |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 | NaN |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 | NaN |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 | NaN |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 | NaN |
titanic.describe(include='all')| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891 | 714.000000 | 891.000000 | 891.000000 | 891.000000 | 889 | 891 | 891 | 891 | 203 | 889 | 891 | 891 |
| unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | 3 | 3 | 3 | 2 | 7 | 3 | 2 | 2 |
| top | NaN | NaN | male | NaN | NaN | NaN | NaN | S | Third | man | True | C | Southampton | no | True |
| freq | NaN | NaN | 577 | NaN | NaN | NaN | NaN | 644 | 491 | 537 | 537 | 59 | 644 | 549 | 537 |
| mean | 0.383838 | 2.308642 | NaN | 29.699118 | 0.523008 | 0.381594 | 32.204208 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 0.486592 | 0.836071 | NaN | 14.526497 | 1.102743 | 0.806057 | 49.693429 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 0.000000 | 1.000000 | NaN | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 0.000000 | 2.000000 | NaN | 20.125000 | 0.000000 | 0.000000 | 7.910400 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 0.000000 | 3.000000 | NaN | 28.000000 | 0.000000 | 0.000000 | 14.454200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 1.000000 | 3.000000 | NaN | 38.000000 | 1.000000 | 0.000000 | 31.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 1.000000 | 3.000000 | NaN | 80.000000 | 8.000000 | 6.000000 | 512.329200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Shape
Returns tuple of dimensions (# ofrows, # of columns).
iris.shape(150, 5)
titanic.shape(891, 15)
Head
Returns first 5 rows of the df.
Can supply n, to control number of rows returned df.head(n=10).
iris.head()| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
titanic.head()| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
Tail
Returns last 5 rows of the df.
Can supply n, to control number of rows returned df.tail(n=10).
iris.tail()| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
titanic.tail()| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.00 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.00 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | NaN | 1 | 2 | 23.45 | S | Third | woman | False | NaN | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.00 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.75 | Q | Third | man | True | NaN | Queenstown | no | True |
Columns
Returns the df column names.
iris.columnsIndex(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
titanic.columnsIndex(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
'alive', 'alone'],
dtype='object')
dtypes
Returns the datatype of each column.
iris.dtypessepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
titanic.dtypessurvived int64
pclass int64
sex object
age float64
sibsp int64
parch int64
fare float64
embarked object
class category
who object
adult_male bool
deck category
embark_town object
alive object
alone bool
dtype: object
nunique
Returns a count of the unique values in each column.
iris.nunique()sepal_length 35
sepal_width 23
petal_length 43
petal_width 22
species 3
dtype: int64
titanic.nunique()survived 2
pclass 3
sex 2
age 88
sibsp 7
parch 7
fare 248
embarked 3
class 3
who 3
adult_male 2
deck 7
embark_town 3
alive 2
alone 2
dtype: int64
isnull().sum()
Returns a count of missing values by column.
Chains isnull() which returns a Boolean mask with sum() to get a count.
iris.isnull().sum()sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
titanic.isnull().sum()survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
Pandas Built-in Functions Review
These built-in features of pandas dfs make it easy to summaize and investigate your datasets.
Knowing the number of rows, columns combined with missing data and datatypes can solve many analysis woes.
Deeper Dive: Skim
Skimpy is a package that profiles a df with visually digestible information.
from skimpy import skimskim(iris)╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮ │ Data Summary Data Types │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ │ │ ┃ Dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ │ │ │ Number of rows │ 150 │ │ float64 │ 4 │ │ │ │ Number of columns │ 5 │ │ string │ 1 │ │ │ └───────────────────┴────────┘ └─────────────┴───────┘ │ │ number │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━━━┓ │ │ ┃ column ┃ NA ┃ NA % ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ │ │ sepal_length │ 0 │ 0 │ 5.843 │ 0.8281 │ 4.3 │ 5.1 │ 5.8 │ 6.4 │ 7.9 │ ▃██▇▅▂ │ │ │ │ sepal_width │ 0 │ 0 │ 3.057 │ 0.4359 │ 2 │ 2.8 │ 3 │ 3.3 │ 4.4 │ ▁▇█▇▂▁ │ │ │ │ petal_length │ 0 │ 0 │ 3.758 │ 1.765 │ 1 │ 1.6 │ 4.35 │ 5.1 │ 6.9 │ █ ▂▇▆▂ │ │ │ │ petal_width │ 0 │ 0 │ 1.199 │ 0.7622 │ 0.1 │ 0.3 │ 1.3 │ 1.8 │ 2.5 │ █ ▂▆▄▄ │ │ │ └───────────────────┴──────┴────────┴──────────┴───────────┴───────┴───────┴────────┴──────┴───────┴─────────┘ │ │ string │ │ ┏━━━━━━━━━┳━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ │ │ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ chars per ┃ words per ┃ ┃ │ │ ┃ column ┃ NA ┃ NA % ┃ shortest ┃ longest ┃ min ┃ max ┃ row ┃ row ┃ total words ┃ │ │ ┡━━━━━━━━━╇━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ │ │ species │ 0 │ 0 │ setosa │ versicolor │ setosa │ virginica │ 8.33 │ 1 │ 150 │ │ │ └─────────┴────┴──────┴──────────┴────────────┴────────┴───────────┴─────────────┴─────────────┴─────────────┘ │ ╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
skim(titanic)╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮ │ Data Summary Data Types Categories │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ Dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ Number of rows │ 891 │ │ string │ 5 │ │ class │ │ │ │ Number of columns │ 15 │ │ int64 │ 4 │ │ deck │ │ │ └───────────────────┴────────┘ │ float64 │ 2 │ └───────────────────────┘ │ │ │ category │ 2 │ │ │ │ bool │ 2 │ │ │ └─────────────┴───────┘ │ │ number │ │ ┏━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━┳━━━━━━━━┓ │ │ ┃ column ┃ NA ┃ NA % ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist ┃ │ │ ┡━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━╇━━━━━━━━┩ │ │ │ survived │ 0 │ 0 │ 0.3838 │ 0.4866 │ 0 │ 0 │ 0 │ 1 │ 1 │ █ ▅ │ │ │ │ pclass │ 0 │ 0 │ 2.309 │ 0.8361 │ 1 │ 2 │ 3 │ 3 │ 3 │ ▄ ▃ █ │ │ │ │ age │ 177 │ 19.865319865319865 │ 29.7 │ 14.53 │ 0.42 │ 20.12 │ 28 │ 38 │ 80 │ ▂██▃▁ │ │ │ │ sibsp │ 0 │ 0 │ 0.523 │ 1.103 │ 0 │ 0 │ 0 │ 1 │ 8 │ █ │ │ │ │ parch │ 0 │ 0 │ 0.3816 │ 0.8061 │ 0 │ 0 │ 0 │ 0 │ 6 │ █▁▁ │ │ │ │ fare │ 0 │ 0 │ 32.2 │ 49.69 │ 0 │ 7.91 │ 14.45 │ 31 │ 512.3 │ █ │ │ │ └───────────┴──────┴──────────────────────┴─────────┴─────────┴───────┴───────┴───────┴─────┴───────┴────────┘ │ │ category │ │ ┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓ │ │ ┃ column ┃ NA ┃ NA % ┃ ordered ┃ unique ┃ │ │ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩ │ │ │ class │ 0 │ 0 │ False │ 3 │ │ │ │ deck │ 688 │ 77.21661054994388 │ False │ 8 │ │ │ └─────────────────┴───────────┴─────────────────────────────────────────┴───────────────────┴────────────────┘ │ │ bool │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ column ┃ true ┃ true rate ┃ hist ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ adult_male │ 537 │ 0.6 │ ▅ █ │ │ │ │ alone │ 537 │ 0.6 │ ▅ █ │ │ │ └───────────────────────────────────┴─────────────────┴───────────────────────────────┴──────────────────────┘ │ │ string │ │ ┏━━━━━━━━━━━┳━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓ │ │ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ chars ┃ words per ┃ total ┃ │ │ ┃ column ┃ NA ┃ NA % ┃ shortest ┃ longest ┃ min ┃ max ┃ per row ┃ row ┃ words ┃ │ │ ┡━━━━━━━━━━━╇━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩ │ │ │ sex │ 0 │ 0 │ male │ female │ female │ male │ 4.7 │ 1 │ 891 │ │ │ │ embarked │ 2 │ 0.2244668 │ S │ S │ C │ S │ 1 │ 1 │ 889 │ │ │ │ │ │ 911335578 │ │ │ │ │ │ │ │ │ │ │ who │ 0 │ 0 │ man │ woman │ child │ woman │ 3.79 │ 1 │ 891 │ │ │ │ embark_to │ 2 │ 0.2244668 │ Cherbour │ Southampt │ Cherbour │ Southampt │ 10.5 │ 1 │ 889 │ │ │ │ wn │ │ 911335578 │ g │ on │ g │ on │ │ │ │ │ │ │ alive │ 0 │ 0 │ no │ yes │ no │ yes │ 2.38 │ 1 │ 891 │ │ │ └───────────┴────┴───────────┴──────────┴───────────┴──────────┴───────────┴──────────┴───────────┴──────────┘ │ ╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
Deeper Dive: missingno
Missing data matters, and just if it exists. Relationships between missing data can heavily impact analysis results.
missingno provides insightful and easy insight into missing data.
All below examples focus on the titantic df. Iris has no missing values.
import missingno as msnomsno.matrix(titanic)
msno.bar(titanic);
Dataset Overview Review
Knowing how much data you have, how it is labeled, if any is missing, and the datatypes is essential to working with data.
- df key features:
- info: structure, column types, non-null counts
- describe: basic stats for numeric columns
- shape: dimensions (rows, columns)
- head: first 5 rows (or head(n))
- tail: last 5 rows (or tail(n))
- columns: column names
- dtypes: type of data in each column
- nunique: count of unique values/column
- isnull.sum: determines if a value is missing, combined with sum gives null count/column
- Deep dive packages
- skimpy
- missingno
Verify your data when loaded and use these tools if you see unexpected behavior.