Introduction

pandas is a data manipulation package for Python. It supports data alignment, merging and joining of data sets, aggregation, time-series functionality, and more, and it is used in both academic and commercial settings.

To use pandas, you'll typically start with the following line of code:

import pandas as pd

Creating Data

There are two core objects in pandas: the DataFrame and the Series.

Create DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

Input:

df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]}, index=[1, 2, 3])

df = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=["a", "b", "c"]
)

Output:

	a	b	c
1	4	7	10
2	5	8	11
3	6	9	12
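The two constructor calls above describe the same table in different ways. As a small sketch (the variable names `from_columns` and `from_rows` are illustrative, not part of the original), the following builds both and checks that they match:

```python
import pandas as pd

# Dict form: each key is a column name, each list holds that column's values.
from_columns = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]}, index=[1, 2, 3]
)

# Nested-list form: each inner list is one row; column names are passed separately.
from_rows = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=["a", "b", "c"]
)

# equals() reports whether the two frames hold the same labels and values.
same = from_columns.equals(from_rows)
print(same)
```

The dict form tends to read more naturally when the data is already organized by column, the nested-list form when it arrives row by row.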

Create Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.

Input:

pd.Series([1, 2, 3, 4, 5])

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame, so you can assign row labels to a Series the same way as before, using the index parameter. A Series does not have column names, however; it has only one overall name, set with the name parameter.

Input:

pd.Series(
    [30, 35, 40], index=["2015 Sales", "2016 Sales", "2017 Sales"], name="Product A"
)

Output:

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
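To illustrate how a Series relates to a DataFrame column, the sketch below converts the Series above into a one-column DataFrame and back. The variable names `sales`, `as_frame`, and `column` are chosen here for illustration only.

```python
import pandas as pd

sales = pd.Series(
    [30, 35, 40], index=["2015 Sales", "2016 Sales", "2017 Sales"], name="Product A"
)

# Converting the Series to a one-column DataFrame reuses its name as the column label.
as_frame = sales.to_frame()
print(list(as_frame.columns))

# Going the other way, selecting that column returns a Series with the same name.
column = as_frame["Product A"]
print(column.name)
```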

Reading Data Files

Data can be stored in any of a number of different forms and formats.

Read CSV file

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
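If the CSV file is not at hand, the same call can be tried on a small in-memory sample. This is a minimal sketch: the CSV text, `csv_text`, and `sample` are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV text; the first, unnamed column holds the row labels.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,France,90,32.0
"""

# read_csv accepts a path or a file-like object; StringIO plays the role of a file here.
# index_col=0 tells pandas to use the first column as the index instead of creating a new one.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)   # (number of rows, number of columns)
print(sample.head())  # the first rows of the table
```

Checking `shape` and `head()` right after loading is a quick way to confirm the file was parsed as expected. Note that the empty price field is read as NaN.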

Indexing data

Let's consider this DataFrame:

Input:

reviews

Output:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

Native Accessors

Select a Column

Input:

reviews.country

reviews['country']

Output:

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

💡

In general, when we select a single column from a DataFrame, we'll get a Series.
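The two column accessors are interchangeable for simple column names. The sketch below uses a made-up miniature table (`mini`, with placeholder taster names) to show that they return the same Series, and that only the bracket form works when a column name contains a space.

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
mini = pd.DataFrame(
    {"country": ["Italy", "Portugal", "France"], "taster name": ["Kerin", "Roger", "Roger"]}
)

# Attribute access and bracket access return the same Series.
print(mini.country.equals(mini["country"]))

# Only bracket access works when the column name is not a valid Python identifier.
print(mini["taster name"])

# Either way, a single selected column is a Series.
print(type(mini["country"]).__name__)
```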

Select a Single Entry

Input:

reviews['country'][0]

Output:

'Italy'

Indexing in pandas

pandas has its own accessor operators, loc and iloc.

iloc selects data based on its numerical position in the data; loc selects data by row and column labels.
iloc follows Python's usual slicing convention, where the first element of a range is included and the last one is excluded. loc, by contrast, includes both ends of a range.
Both loc and iloc are row-first, column-second. This is the opposite of the native accessors shown above, which are column-first, row-second (for example, reviews['country'][0]).
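The difference in slicing is easiest to see on a frame with the default integer index, where the same slice returns a different number of rows from each accessor. This is a small sketch on a made-up frame named `nums`:

```python
import pandas as pd

nums = pd.DataFrame({"x": [10, 20, 30, 40]})  # default index: 0, 1, 2, 3

by_position = nums.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = nums.loc[0:2]      # labels 0, 1 and 2; the end label is included

print(len(by_position), len(by_label))

# Both accessors take the row selector first and the column selector second.
print(nums.iloc[1, 0], nums.loc[1, "x"])
```

This is a common source of off-by-one errors when switching between the two accessors.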

iloc

reviews.iloc[0] # selects the first row

reviews.iloc[:, 0] # selects the first column

💡

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.

reviews.iloc[:3, 0] # selects the first three rows of the first column

reviews.iloc[1:3, 0] # selects just the second and third entries from the first column

It's also possible to pass a list:

reviews.iloc[[0, 1, 2], 0]
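Because iloc works by position, negative numbers count from the end, as they do for Python lists. The following sketch on a made-up table (`mini`) shows a negative slice, a single cell, and a list of positions:

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
mini = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "France"], "points": [87, 87, 87, 90]}
)

last_two = mini.iloc[-2:]       # last two rows, all columns
last_points = mini.iloc[-1, 1]  # last row, second column
picked = mini.iloc[[0, 2], 0]   # first and third rows of the first column

print(last_two)
print(last_points)
print(picked.tolist())
```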

loc

When we use iloc, we treat the dataset like a big matrix (a list of lists) that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since a dataset usually has meaningful indices, it is often easier to do things using loc instead.

For example, here's one operation that's much easier using loc:

Input:

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Output:

	taster_name	taster_twitter_handle	points
0	Kerin O’Keefe	@kerinokeefe	87
1	Roger Voss	@vossroger	87
...	...	...	...
129969	Roger Voss	@vossroger	90
129970	Roger Voss	@vossroger	90
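Label-based selection becomes more useful once the index carries meaning. As a sketch under that assumption, the example below builds a made-up table with placeholder titles, uses `set_index` to turn the title column into the index, and then looks rows up by title:

```python
import pandas as pd

mini = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],  # placeholder titles
        "country": ["Italy", "Portugal", "France"],
        "points": [87, 87, 90],
    }
).set_index("title")

# Look up a single value by row label and column label.
print(mini.loc["Wine B", "country"])

# Select several columns by name for a label-based range of rows (end label included).
subset = mini.loc["Wine A":"Wine B", ["points", "country"]]
print(subset)
```

The columns come back in the order they are listed in the selector, not in their original order.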

Conditional Selection

reviews.loc[reviews.country == 'Italy']

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
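Each condition above is itself a boolean Series, which loc then uses as a row filter. The sketch below runs the same patterns on a made-up table (`mini`) and adds two built-in helpers, `isin` and `notna`, that also produce boolean masks:

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
mini = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "Italy", "France"],
        "points": [87, 87, 92, 90],
        "price": [None, 15.0, 40.0, 32.0],
    }
)

# A comparison on a column produces a boolean Series, one True/False per row.
is_italian = mini.country == "Italy"
print(is_italian.tolist())

# & and | bind more tightly than == and >=, so each condition needs its own parentheses.
italian_and_good = mini.loc[(mini.country == "Italy") & (mini.points >= 90)]
italian_or_good = mini.loc[(mini.country == "Italy") | (mini.points >= 90)]

# isin() matches any value in a list; notna() keeps rows where the value is present.
from_two = mini.loc[mini.country.isin(["Italy", "France"])]
priced = mini.loc[mini.price.notna()]

print(len(italian_and_good), len(italian_or_good), len(from_two), len(priced))
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used instead.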

Aaronspace

Introduction to pandas
