This lesson will introduce many of the useful Pandas commands that you will need when reading files and understanding what they include. Reading the data into a DataFrame object is simple. Assuming we have a CSV file ready to be read, we can simply type:
import pandas as pd
my_film_collection = pd.read_csv("films.csv")
After this, we could confirm that the object is of type Pandas DataFrame. A curious data scientist is a good data scientist. If you meet terms you don't recognize during this course, stop reading and find out. Do you know what a CSV file is? If not, find out now!
type(my_film_collection)
Output:
pandas.core.frame.DataFrame
In comparison, you can download the same files locally and try working with them in, say, Microsoft Excel. I even suggest you do this, especially with large dataset files. It will give you some perspective on the practicality and speed of Pandas. Before diving blindly into the practical section, let's cover some of the important theory.
That said, Pandas is an in-memory tool: the DataFrame has to fit into your computer's memory. If your dataset is far too large for in-memory computation, you would have to use some (usually slower) on-disk alternative such as HDF5 files, the Hadoop file system (HDFS), or MongoDB. While you are still learning AI, it is highly likely that most of your datasets will fit into memory, so Pandas is a good choice.
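There is also a middle ground before reaching for on-disk tools: `read_csv` can read a large file in chunks via its `chunksize` parameter, so the whole file never has to sit in memory at once. Here is a minimal sketch that uses a small in-memory CSV (the film data below is made up for illustration):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a large file on disk.
csv_data = io.StringIO("title,year\nAlien,1979\nHeat,1995\nArrival,2016\n")

# With chunksize set, read_csv returns an iterator of smaller
# DataFrames instead of one big DataFrame.
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 3
```

Chunked reading trades convenience for memory: you process the data piece by piece rather than slicing the full table at will.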
Having the file in an object is useful, but we still don't know what the file consists of. If you did the task in the previous lesson, you have most likely noticed the "10 minutes to pandas" guide and the Cookbook. If not, read the 10 minutes guide and glance through the Cookbook to see what's there.
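A quick first look at a freshly loaded DataFrame usually starts with a few inspection methods. The DataFrame below is made up for illustration; in practice you would call these on the object you read from a file:

```python
import pandas as pd

# A small made-up DataFrame standing in for a freshly loaded CSV.
films = pd.DataFrame({
    "title": ["Alien", "Heat", "Arrival"],
    "year": [1979, 1995, 2016],
})

print(films.head())   # the first rows (up to 5 by default)
print(films.shape)    # (rows, columns)
films.info()          # column names, dtypes, and non-null counts
```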
This first lesson is about reading (and viewing) files in Pandas. After reading the lesson material and going through the Notebook, you should be able to describe what the following functions perform:
Use the "10 minutes to pandas" guide as a companion for this course's pandas-related content. I suggest that you bookmark the site and return to it whenever your memory needs a refresher.
When you handle data in Pandas, your data will be structured as either DataFrames or Series.
When you slice a column or a row from a DataFrame, you often end up with a Series object. The actual data is usually stored in NumPy arrays, contained in one of the structures mentioned above. Both data structure types use an Index, which stores the labels for the object.
Notice that an Index is used on both rows (axis=0) and columns (axis=1). Thus, one can usually perform the same or similar operations on both rows and columns.
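The relationships above can be seen directly in code. This sketch uses a tiny made-up DataFrame: slicing a column yields a Series, and both the row labels and the column labels are Index objects:

```python
import pandas as pd

# Small made-up DataFrame for illustration.
films = pd.DataFrame(
    {"title": ["Alien", "Heat"], "year": [1979, 1995]},
    index=["a", "b"],
)

years = films["year"]     # slicing one column yields a Series
print(type(years))        # a pandas Series
print(years.to_numpy())   # the underlying data is a NumPy array

print(films.index)    # row labels (axis=0) are an Index
print(films.columns)  # column labels (axis=1) are an Index too
```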
In the Jupyter Notebook exercise, we will use a fairly large (over 120,000 × 100) table of real-world data. The dataset will not be used for machine learning during this course. The Pandas library is well documented, and it would be unnecessary to replicate its "10 minutes to pandas" documentation. Instead, we will use the same tools and methods on a much larger dataset than the examples in the documentation. This will give you perspective on how fast data handling is with Pandas even when there are hundreds of thousands of rows in your dataset.
Later on, we will use Pandas on a much simpler dataset and train a model on the data. If you follow the exercises thoroughly, you will gain enough knowledge to face datasets you've never met before and make some sense of the data.
In short, when used for viewing data, Pandas is like a "NumPy front end" for structured data files such as CSV, Excel, and HDF5 files. When working with data, the first step is always understanding the data. You must know what is in the dataset.
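The "NumPy front end" framing is concrete: a DataFrame's data can be pulled out as a plain NumPy array whenever you need it. A minimal sketch with made-up numeric data:

```python
import numpy as np
import pandas as pd

# Made-up numeric data for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

arr = df.to_numpy()   # the DataFrame's data as a NumPy array
print(type(arr))      # numpy.ndarray
print(arr.shape)      # (3, 2)
```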
A machine learning algorithm should not be treated as a black box that will blindly learn from any data, whether there are correlations or not. If you put garbage in, you get garbage out.