During this course, we will use some of the tools that make data scientists' lives easier. There are many libraries you will end up finding useful. Pandas and scikit-learn are perhaps the most crucial ones to understand before you start working on data, so the next lessons will focus on each of them individually.
Jupyter Notebook is an open-source platform that runs Notebook documents. It is an integrated development environment (IDE) that lets you write code in an ordinary web browser. You do not need to install Python or any libraries locally; the code runs on a server as well. This has downsides, such as picking up bad programming habits and not getting used to the powerful tools that full-featured IDEs (such as PyCharm or VS Code) offer. For online courses, however, Jupyter Notebook is a great choice. Just keep in mind that you might end up installing Python locally (or on a virtual server) in future courses. For many learners, Notebooks are a stepping stone before they start installing everything themselves.
For this course, Jupyter Notebook is more than enough, and the libraries are already installed. If you ever install Python locally, you will have to install the packages yourself. The libraries mentioned here can be found in the most common package indices, such as PyPI and conda. In this course, you can simply import them within the Notebook and they are ready to use.
NumPy stands for Numerical Python, and it is THE tool for scientific computing with Python. The library is mainly used for working with arrays (and matrices). Whether your project involves images, sounds, text files, or some other data, you will most likely end up using NumPy in one form or another.
By convention, it is usually imported with the name 'np'.
import numpy as np
Many other libraries use NumPy arrays internally to represent data. The code below creates a Full HD RGB image (1080 rows by 1920 columns, with three color channels) with all-white pixels:

white_image = np.full((1080, 1920, 3), 255, dtype=np.uint8)
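A quick sketch of how such an array behaves in practice (the red-corner modification is just an illustration, not part of the course material):

```python
import numpy as np

# A white Full HD image: 1080 rows (height), 1920 columns (width), 3 channels.
white_image = np.full((1080, 1920, 3), 255, dtype=np.uint8)

print(white_image.shape)    # (1080, 1920, 3)
print(white_image[0, 0])    # [255 255 255] -- the top-left pixel

# Slicing lets us modify whole regions at once:
# turn the top-left 100x100 corner red (R=255, G=0, B=0).
white_image[:100, :100] = [255, 0, 0]
print(white_image[50, 50])  # [255   0   0]
```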
Pandas is built on top of NumPy. It is used for manipulating tabular data (spreadsheets, database tables). Whereas NumPy data usually consists of numbers of a single type, Pandas works with DataFrame objects, which feel similar to spreadsheets in Excel and comparable applications. The data may be numeric, textual, or both, and may include time series.
By convention, it is usually imported with the name 'pd'.
import pandas as pd
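A minimal sketch of a DataFrame mixing textual and numeric columns (the city data below is made up for illustration):

```python
import pandas as pd

# One column of text, one column of numbers -- both live in the same DataFrame.
df = pd.DataFrame({
    "city": ["Helsinki", "Oslo", "Tallinn"],
    "population_m": [0.66, 0.71, 0.45],
})

print(df.dtypes)                    # city: object (text), population_m: float64
print(df["population_m"].mean())    # summary statistics work per column
```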
A usual first operation would be opening a file. Below is an example of reading an Excel file:
df = pd.read_excel('Example.xlsx', sheet_name='Sheet1')
Visualizing Pandas DataFrames is fairly easy using Matplotlib. That library has already been used in this course. By convention, its pyplot interface is usually imported with the name 'plt'.

import matplotlib.pyplot as plt
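A minimal sketch of plotting a DataFrame with made-up numbers (in a Notebook you would not need the `Agg` backend line; it is only there so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a Notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly sales figures, purely for illustration.
df = pd.DataFrame({"year": [2019, 2020, 2021], "sales": [10, 14, 13]})

# DataFrames have a .plot() method that uses Matplotlib under the hood.
ax = df.plot(x="year", y="sales")
plt.savefig("sales.png")  # in a Notebook, the plot is shown inline instead
```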
Seaborn is based on Matplotlib and is often used for visualizing Pandas DataFrames. It is a very high-level library: creating a useful plot may not require more than one or two lines of code. Seaborn is usually imported as sns.
import seaborn as sns
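A sketch of how little code a Seaborn plot needs, using made-up study data (again, the `Agg` backend line is only needed outside a Notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a Notebook
import pandas as pd
import seaborn as sns

# Made-up data: hours studied versus exam score.
df = pd.DataFrame({"hours": [1, 2, 3, 4], "score": [55, 62, 74, 80]})

# One line of Seaborn produces a labeled scatter plot from the DataFrame.
ax = sns.scatterplot(data=df, x="hours", y="score")
```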
Scikit-learn is called 'sklearn' in the PyPI package index, and it is a library used for machine learning in Python. You have seen it being used within this course. Unless you are writing code for neural networks, you will almost certainly be importing this library. Sklearn is rarely imported as a whole; usually, you import just the module or class you need.
A typical import would look like this:
from sklearn.ensemble import RandomForestClassifier
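To give a feel for the typical sklearn workflow, here is a minimal sketch that fits the imported classifier on a toy dataset (the points and labels below are made up, not from the course):

```python
from sklearn.ensemble import RandomForestClassifier

# Two well-separated clusters of 2D points with class labels 0 and 1.
X = [[0, 0], [1, 1], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# The usual sklearn pattern: construct, fit, predict.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
predictions = clf.predict([[0.5, 0.5], [5.5, 5.5]])
print(predictions)  # one label per query point
```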
In Chapter 5, we will use TensorFlow, which is one of many frameworks for deep learning (neural networks). For now, let's stick with the libraries listed above.
TASK: Find the websites of the libraries mentioned above. Take a quick tour around the webpages and familiarize yourself with their documentation. When you are writing code, knowing where to find documentation and how to read it is crucial. Get used to it early on!