Grouping and Combining Data in Pandas

Grouping in Pandas is handy way of organizing for data analytics. During this course, we will take a very brief peak at the tool. Feel free to spend your time to get to known it better.

Let's pretend we are designing a role playing game with dices and figures. Maybe we would want to analyse and keep track of various game pieces and their features. Our Characters database could look like this when imported to Pandas.

| | Name | Race | Attack | Dice | TokenMat | CanSwim | | ---: | -------: | ------: | -----: | ---: | -------: | ------- | | 0 | Gargoyle | Magical | 10.2 | D20 | Wood | True | | 1 | Peter | Human | 9.8 | D20 | Wood | False | | 2 | Lisa | Human | 10.5 | D6 | Wood | False | | 3 | Goblin | Magical | 8.7 | D6 | Metal | True | | 4 | Blob | Magical | 9.5 | D20 | Jelly | True |

Now, we could create a GroupBy object by calling:

by_race = gamedata.groupby("Race")

This doesn't output anything useful yet, it juts constructs an object.

In the game, there are two races: human and magical, so the GroupBy object would include two groups. If we would want to count how many individual characters we have per each race, we would have to count the unique values of column "Name" in each group.

by_race[["Name"]].nunique()

The output is:

| Race | Name | | ------: | ---: | | Human | 2 | | Magical | 3 |

If your dataset is some time series style data, and DataFrame Index is in a datetime format, you can perform clever tricks such as group per certain time limits or days (which is called resampling.)

Similar but different: Concat, Merge and Join

So far, we have been shaping data within a single database: selecting, slicing, indexing, filtering and placing items into groups. Note that we can also add new data. The simplest way is to add data in a similar way as adding new key-value pairs into a dictionary:

# All rows have a value of 1
df["new_column"] = 1

# Values for rows are found in a list 
df["another_column"] = list_of_values

In many cases, we might have data outside the current DataFrame that we are interested in. In the BackBlaze example, our data is spread across multiple CSV files on a daily basis. We might want to combine 2020-01-01 and 2020-01-02 to analyse how the values have changed over time. Pandas is highly optimised for merging dataframes and provided an easy interface for doing so. The three main tools for combining DataFrames (or Series) are: append, concat, join and merge. Append is a shortcut to concat, so it will not be included in the comparison table below.

If you are familiar with SQL, you might find Pandas' comparison with SQL documentation useful.

| concat | join | merge | | ---------------------------- | ---------------------- | ----------------------- | | Any number of objects | Any number of objects | Exactly two DataFrames | | pd.concat() | df.join() | df.merge() | | By default, outer join. | By default, left join. | By default, inner join. | | Vertically (or Horizontally) | Horizontally | Horizontally |

By default, two dataframes would be combined as:

Concat: stacked vertically as they are. No data would be dropped out.
Join: combined horizontally, keeping only the indices from the first dataset (also known as the 'left' dataset).
Merge: combined horizontally, keeping only the indices that are present in both datasets (in 'left 'and 'right 'datasets).

This is not a database course, so we will not delve deeper into database relationships. If you are no database or SQL courses in your curriculum, you might want to add some courses on your TODO-list.

Notice that concat can be used for any operation that append, join or merge can perform. Concat and append are the only methods that can combine DataFrames vertically. Even though concat can be used for combining in any direction, the syntax will be shorter and easier to read if you select a method that suits the purpose closest by default.