During the last lesson and its exercise, we took a brief glance at a storage drive failure dataset by Backblaze. The dataset contains almost 130k individual drives. Each drive has 131 attributes, one of which is the date; it is the same for all drives in the chosen CSV file.
At this point, we might want to perform operations such as:

* selecting certain columns
* slicing a range of rows
* indexing along both axes at once
* filtering rows by some criteria
This lesson is about all that!
What these operations have in common is that they all produce subsets of the dataset. Let's define some terminology for the various kinds of subsets:
| Term | Definition | Example |
| --------- | ------------------------------------------------------------ | ------------------- |
| Selecting | Selecting an entire column or multiple columns. | `df["column_name"]` |
| Slicing | Creates a shallow copy of certain rows. | `df[:5]` |
| Indexing | Usually multi-axis indexing: rows and columns. | `df.iloc[:5, :]` |
| Filtering | Filtering the subset using some criteria, such as a Boolean mask or the `filter()` method. | `df[mask]` |
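Each of the four subset kinds can be demonstrated on a small, self-contained frame. The models and values below are made up for illustration; they only mimic the real dataset's columns:

```python
import pandas as pd

# A tiny stand-in for the drive data; models and values are invented
df = pd.DataFrame({
    "model": ["ST4000DM000", "ST12000NM0007", "ST12000NM0007", "ST8000NM0055"],
    "failure": [0, 0, 1, 0],
    "capacity_bytes": [4000787030016, 12000138625024, 12000138625024, 8001563222016],
})

selected = df["model"]             # Selecting: one column as a Series
sliced = df[:2]                    # Slicing: the first two rows
indexed = df.iloc[:2, :2]          # Indexing: rows AND columns by position
filtered = df[df["failure"] == 1]  # Filtering: rows matching a Boolean mask
```

Note that selecting a single column returns a `Series`, while the other three return a `DataFrame`.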
Remember, pandas' own '10 minutes to pandas' guide should be used as a companion for these lessons. Read it, compare it to this lesson, and keep it at hand while doing the exercise.
`loc` and `iloc` behave similarly, but they do have some differences that you need to know. The examples below assume that we are handling a 2-axis DataFrame with rows and columns.
| `loc` | `iloc` |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| Based on labels. | Based on integer positions. |
| The slice `[:5]` returns rows up to and including label `5`. Assuming the first labels are `[0, 1, 2, 3, 4, 5]`, the returned rows are the ones with labels `[0, 1, 2, 3, 4, 5]`. | The slice `[:5]` returns rows up to but excluding position `5`, i.e. the rows at positions `[0, 1, 2, 3, 4]`. In other words, it returns the first 5 rows, no matter what their labels are. |
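The inclusive/exclusive difference is easy to verify on a toy Series (the values here are arbitrary):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70])

by_label = s.loc[:5]      # label-based: INCLUDES label 5 -> 6 values
by_position = s.iloc[:5]  # position-based: EXCLUDES position 5 -> 5 values

# loc can also slice by non-integer labels
letters = pd.Series([1, 2, 3], index=["a", "b", "c"])
up_to_b = letters.loc[:"b"]  # includes label "b" -> 2 values
```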
Notice that the row Index does not necessarily consist of integer labels from `range(0, len(df))`. The Index may contain, for example, serial codes mixing letters and numbers.
Both of the following lines will return the same subset of the dataset `hdd`:
```python
# Rows from the start up to and including label 5,
# columns up to and including label "failure"
hdd.loc[:5, :"failure"]

# Rows up to the 6th, columns up to the 5th from the start
hdd.iloc[:6, :5]
```
The output of both is as follows:
| | date | serial_number | model | capacity_bytes | failure |
| ---: | ---------- | -------------- | -------------------- | -------------: | ------: |
| 0 | 2020-01-01 | Z305B2QN | ST4000DM000 | 4000787030016 | 0 |
| 1 | 2020-01-01 | ZJV0XJQ4 | ST12000NM0007 | 12000138625024 | 0 |
| 2 | 2020-01-01 | ZJV0XJQ3 | ST12000NM0007 | 12000138625024 | 0 |
| 3 | 2020-01-01 | ZJV0XJQ0 | ST12000NM0007 | 12000138625024 | 0 |
| 4 | 2020-01-01 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | 4000787030016 | 0 |
| 5 | 2020-01-01 | ZA16NQJR | ST8000NM0055 | 8001563222016 | 0 |
Now, let's change the index to something other than the default range from 0 to `len(hdd)`.
# "Serial number" column will replace the current row index
hdd.set_index("serial_number", inplace=True)
For reference, the first row (up to "failure") would look like this:
| serial_number | date | model | capacity_bytes | failure |
| ------------- | ---------- | ----------- | -------------: | ------: |
| Z305B2QN | 2020-01-01 | ST4000DM000 | 4000787030016 | 0 |
And `hdd.index`, which holds the row labels of the DataFrame, now contains values such as:
```python
['Z305B2QN', 'ZJV0XJQ4', 'ZJV0XJQ3', 'ZJV0XJQ0', 'PL1331LAHG1S4H', 'ZA16NQJR', 'ZJV02XWG', 'ZJV1CSVX', 'ZJV02XWA', 'ZA18CEBS', '...']
```
Now, notice that the index no longer contains a label 1, 2, 3 or any other integer label. All the labels are strings.
Calling the following line will throw an error:
hdd.loc[:5, "failure"]
# Summarized output of the Notebook cell:
"""
...
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [5] of <class 'int'>
"""
If we wanted a subset containing the first three disks of this dataset, in which the index labels are strings, we could use either `loc` or `iloc`:
```python
# Rows up to and including the label "ZJV0XJQ3",
# columns up to and including the label "failure"
hdd.loc[:"ZJV0XJQ3", :"failure"]

# Rows up to the 3rd, columns up to the 4th from the start
hdd.iloc[:3, :4]
```
| serial_number | date | model | capacity_bytes | failure |
| ------------- | ---------- | ------------- | -------------: | ------: |
| Z305B2QN | 2020-01-01 | ST4000DM000 | 4000787030016 | 0 |
| ZJV0XJQ4 | 2020-01-01 | ST12000NM0007 | 12000138625024 | 0 |
| ZJV0XJQ3 | 2020-01-01 | ST12000NM0007 | 12000138625024 | 0 |
If the notation `[:5, 2:3]` looks completely strange, you have not met Python slicing before. NumPy arrays can be sliced in the same way as Python's built-in lists, tuples and strings.
The notation for a 1-dimensional array (or similar) is:

```python
variable[start:stop:step]
```

For a 2-dimensional array it is:

```python
variable[start:stop:step, start:stop:step]
```

For a 3-dimensional array:

```python
variable[start:stop:step, start:stop:step, start:stop:step]
```
`start`, `stop` and `step` all default to `None`, so you will not often see all three values written out. In many cases, you will see something like:
```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

crop = numbers[:5]
# crop: [1, 2, 3, 4, 5]

crop = numbers[::2]
# crop: [1, 3, 5, 7, 9]

crop = numbers[1::2]
# crop: [2, 4, 6, 8, 10]

crop = numbers[2:4]
# crop: [3, 4]

crop = numbers[1]
# crop: 2 (Notice! Python lists are 0-indexed)
```
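The same notation extends to two dimensions in NumPy, which is where the `[:5, 2:3]`-style expressions above come from. A quick sketch:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

top_left = grid[:2, :3]         # first 2 rows, first 3 columns
every_other_col = grid[:, ::2]  # all rows, every second column
last_row = grid[-1]             # negative indices count from the end
```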
The Python docs contain some information on slicing, but since slicing is used heavily in NumPy, one of the best sources of information is the NumPy documentation. Read more from: https://numpy.org/doc/1.18/reference/arrays.indexing.html
If all of this is completely new to you, make sure to spend some time practicing list and array slicing. Otherwise, your Python skills will limit how well you can perform in machine learning. Read the documentation linked above; you must understand this part of Python/NumPy as a data scientist. That said, when you need a subset of a DataFrame, prefer the `iloc` indexer.
Whether you are accessing the whole DataFrame or a sliced Series, you have access to a wide variety of methods. The output of the previous method becomes the input of the next, which allows chaining.
```python
hdd.filter(like='raw').isna().sum().sort_values().head(15)
```
An example of this can be seen above. Let's break it into pieces.
```python
# Filter columns whose names contain the string "raw".
# 63 columns will remain, e.g. "smart_1_raw"
hdd.filter(like='raw')

# Feed the 63 columns to the isna() method.
# Every value is replaced with True or False,
# depending on whether the cell is empty (NaN)
hdd.filter(like='raw').isna()

# Sum up the Trues. Essentially, this counts how
# many Trues there are per column
hdd.filter(like='raw').isna().sum()

# Sort the values (ascending by default)
hdd.filter(like='raw').isna().sum().sort_values()

# Display the first 15 rows
hdd.filter(like='raw').isna().sum().sort_values().head(15)
```
Do your best to understand what is happening in the example above. The syntax can become a bit difficult to read, especially when multiple methods have been chained.
The output of the line is as follows:
```
smart_1_raw        60
smart_192_raw      60
smart_194_raw      60
smart_9_raw        60
smart_12_raw       60
smart_7_raw       346
smart_3_raw       346
smart_197_raw     346
smart_5_raw       346
smart_199_raw     346
smart_198_raw     346
smart_10_raw      346
smart_4_raw       346
smart_193_raw     835
smart_240_raw   30105
```
What does this mean? Think! For the feature 'smart_1_raw', only 60 drives lack the value. The same applies to the other top-5 features. After the 14th item, the number of missing values skyrockets.
What should we do? What if over half of our dataset is filled with NaN, as is the case with 'smart_184_raw'? It is missing in 79 thousand out of 125 thousand cases, so over 60 % of the dataset doesn't have that value.
This is why we examine the data thoroughly.
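To see that the chain really counts NaNs per column, here is the same expression on a tiny hand-made frame. The column names mimic the dataset, but the values are invented:

```python
import numpy as np
import pandas as pd

# Toy frame with some missing SMART values (invented data)
toy = pd.DataFrame({
    "smart_1_raw": [117.0, np.nan, 93.0],
    "smart_9_raw": [8231.0, 12044.0, 20511.0],
    "smart_240_raw": [np.nan, np.nan, 4412.0],
    "model": ["A", "B", "C"],
})

# filter(like="raw") keeps only the *_raw columns, so "model" is dropped
missing = toy.filter(like="raw").isna().sum().sort_values()
```

Here `missing` is a Series with one NaN count per raw column: 0 for `smart_9_raw`, 1 for `smart_1_raw` and 2 for `smart_240_raw`.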
Note that you don't always need to access the pandas documentation in a web browser. Assuming you have imported pandas as pd, you can also print the docstring of the `count` method with:
```python
help(pd.Series.count)
# or
help(pd.DataFrame.count)
```
Here are some methods you will either meet during the exercise or should try for yourself. Call Python's built-in help() function or head to the web documentation if you are ever unsure how to use a certain method.
| DataFrame / Series / Both | How to call the method |
| ------------------------- | ---------------------- |
| Both | `.count()` |
| Both | `.nunique()` |
| Both | `.sum()` |
| Both | `.mean()` |
| Both | `.max()` |
| Both | `.idxmax()` |
| Series | `.unique()` |
| Series | `.value_counts()` |
| Both | `.dropna()` |
| Both | `.drop()` |
| Both | `.nsmallest()` |
| Both | `.isna()` |
| Series | `.str.startswith()` |
| Both | `.filter()` |
| Both | `.take()` |
All methods can be found in the pandas Series and DataFrame documentation. Don't be overwhelmed by the volume of choices; it is highly likely that you won't ever need the majority of the available methods. Also, you will often learn to use methods by reading code that others have written. Reading the documentation is useful, but maybe not the most efficient way to learn.
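A few of the Series methods from the table, tried on a toy Series of model names (the data is invented, not from the real dataset):

```python
import pandas as pd

models = pd.Series(["ST4000DM000", "ST12000NM0007", "ST12000NM0007", "ST8000NM0055"])

n_models = models.nunique()      # number of distinct values: 3
counts = models.value_counts()   # counts per value, most frequent first
distinct = models.unique()       # distinct values as an array

# idxmax() returns the *label* of the largest value, not the value itself
capacities = pd.Series([4.0, 12.0, 12.0, 8.0], index=models.values)
biggest = capacities.idxmax()    # "ST12000NM0007"
```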
This topic is likely familiar from other computer science courses. A boolean array is an array filled with Trues and Falses (or 1s and 0s). In pandas, any NaN values in a comparison will be treated as False.
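A quick sketch of that NaN behaviour: a comparison against NaN evaluates to False, so NaN rows will be dropped by the resulting mask.

```python
import numpy as np
import pandas as pd

s = pd.Series([91.0, np.nan, 99.0])
mask = s > 92  # the NaN element compares as False
```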
Let's investigate this quickly before starting the exercise. We will start by creating a small subset of the whole dataset, as learned above. Let's keep a couple of S.M.A.R.T. stat columns, smart_3_normalized, smart_5_normalized and smart_7_normalized, along with the model name.
```python
# Either iloc or loc is fine here. If I used loc, what would 2, 9, 13 and 15 be?
subset = hdd.iloc[:4, [2, 9, 13, 15]]
```
| | model | smart_3_normalized | smart_5_normalized | smart_7_normalized |
| ---: | ------------- | -----------------: | -----------------: | -----------------: |
| 0 | ST4000DM000 | 91.0 | 100.0 | 89.0 |
| 1 | ST12000NM0007 | 91.0 | 100.0 | 88.0 |
| 2 | ST12000NM0007 | 99.0 | 100.0 | 82.0 |
| 3 | ST12000NM0007 | 93.0 | 100.0 | 88.0 |
We can create a boolean array by using a comparison operator, which in this case is "greater than".
```python
mask = subset.smart_3_normalized > 92
print(mask)
```
Output will be:
```
0    False
1    False
2     True
3     True
Name: smart_3_normalized, dtype: bool
```
Now, this mask can be used to select rows from the subset that we created before: only the rows where the mask is True are kept.
```python
subset[mask]

# Can you figure out what the following line would do?
# subset[~mask]
```
| | model | smart_3_normalized | smart_5_normalized | smart_7_normalized |
| ---: | ------------- | -----------------: | -----------------: | -----------------: |
| 2 | ST12000NM0007 | 99.0 | 100.0 | 82.0 |
| 3 | ST12000NM0007 | 93.0 | 100.0 | 88.0 |
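Masks can also be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). When you write the comparisons inline, wrap each one in parentheses, e.g. `subset[(subset.a > 1) & (subset.b > 2)]`, because `&` binds more tightly than `>`. A sketch on a toy subset (values invented to resemble the table above):

```python
import pandas as pd

subset = pd.DataFrame({
    "model": ["ST4000DM000", "ST12000NM0007", "ST12000NM0007", "ST12000NM0007"],
    "smart_3_normalized": [91.0, 91.0, 99.0, 93.0],
    "smart_7_normalized": [89.0, 88.0, 82.0, 88.0],
})

high_3 = subset["smart_3_normalized"] > 92
high_7 = subset["smart_7_normalized"] > 85

both = subset[high_3 & high_7]    # rows where BOTH conditions hold
either = subset[high_3 | high_7]  # rows where EITHER condition holds
negated = subset[~high_3]         # rows where high_3 is False
```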