Introduction to pandas

Introduction

pandas is a data manipulation package for Python that helps with data alignment, merging and joining data sets, aggregation, and time-series functionality, among other tasks, in both academic and commercial settings.

To use pandas, you'll typically start with the following line of code:

import pandas as pd

Creating Data

There are two core objects in pandas: the DataFrame and the Series.

Create DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

Input:

df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]}, index=[1, 2, 3])

or

df = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=["a", "b", "c"]
)

Output:

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
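
As a quick check, the two constructions above should describe the same table. Here is a minimal sketch that builds both and compares them; the names from_dict and from_rows are illustrative and do not come from the original snippets.

```python
import pandas as pd

# Column-oriented: each dict key becomes a column label.
from_dict = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]}, index=[1, 2, 3]
)

# Row-oriented: each inner list is one row; column labels are passed separately.
from_rows = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=["a", "b", "c"]
)

# equals() checks that both frames have the same shape and the same elements.
same = from_dict.equals(from_rows)
print(same)
print(from_dict.shape)  # (number of rows, number of columns)
```

Which form to use is mostly a matter of how the data arrives: the dictionary form is convenient when you already have columns, the list-of-lists form when you have rows.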

Create Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.

Input:

pd.Series([1, 2, 3, 4, 5])

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name.

Input:

pd.Series(
    [30, 35, 40], index=["2015 Sales", "2016 Sales", "2017 Sales"], name="Product A"
)

Output:

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
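
The link between a Series and a DataFrame column can be sketched as follows. The variable names sales, frame, and column are illustrative; to_frame() is used here only to show where the Series name ends up.

```python
import pandas as pd

sales = pd.Series(
    [30, 35, 40], index=["2015 Sales", "2016 Sales", "2017 Sales"], name="Product A"
)

# Converting the Series to a one-column DataFrame:
# the Series name becomes the column label.
frame = sales.to_frame()
print(list(frame.columns))

# Selecting that column again returns a Series with the same name and index.
column = frame["Product A"]
print(column.name)
```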

Reading Data Files

Data can be stored in any of a number of different forms and formats.

Read CSV file

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")
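
If you do not have the wine reviews file at hand, the same call can be tried on a small in-memory CSV. This is a sketch under assumptions: the CSV text below is a hypothetical three-row stand-in, not the real data set, and index_col is an optional read_csv parameter that the line above does not use.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (illustrative rows only).
csv_text = "id,country,points\n0,Italy,87\n1,Portugal,87\n2,France,90\n"

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (number of rows, number of columns)
print(sample.head(2))  # first two rows

# index_col=0 uses the first CSV column as the row index
# instead of generating a new 0..n-1 index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```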

Indexing data

Let's consider this DataFrame:

Input:

reviews

Output:

       | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery
0      | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia
1      | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos
...    | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit

Native Accessors

Select a Column

Input:

reviews.country

or

reviews['country']

Output:

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In general, when we select a single column from a DataFrame, we'll get a Series.

Select a Single Entry

Input:

reviews['country'][0]

Output:

'Italy'

Indexing in pandas

pandas has its own accessor operators, loc and iloc.

  • iloc selects data based on its numerical position in the data; loc selects data by row and column labels.

  • iloc follows Python's standard slicing convention, where the first element of the range is included and the last one excluded. loc, meanwhile, slices inclusively: both endpoints are part of the result.

  • Both loc and iloc are row-first, column-second. This is the opposite of the native accessors shown earlier, where we select the column first and the row second, as in reviews['country'][0].
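
The difference in how the two accessors treat the end of a range is easy to miss, so here is a minimal sketch on a hypothetical four-row DataFrame (toy and its values are made up for illustration). With the default integer index, the same slice 0:2 returns a different number of rows depending on the accessor.

```python
import pandas as pd

# Toy frame with the default integer index 0, 1, 2, 3.
toy = pd.DataFrame({"a": [4, 5, 6, 7], "b": [7, 8, 9, 10]})

# iloc slices by position: rows 0 and 1, the stop position 2 is excluded.
by_position = toy.iloc[0:2, 0]

# loc slices by label: rows labelled 0, 1, and 2, the stop label is included.
by_label = toy.loc[0:2, "a"]

print(len(by_position))
print(len(by_label))
```

Note that both calls put the row selector first and the column selector second.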

iloc

reviews.iloc[0] # selects the first row
reviews.iloc[:, 0] # selects the first column

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.

reviews.iloc[:3, 0] # selects just the first, second, and third row from the first column
reviews.iloc[1:3, 0] # selects just the second and third entries from the first column

It's also possible to pass a list:

reviews.iloc[[0, 1, 2], 0]

loc

When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it is often easier to work with loc.

For example, here's one operation that's much easier using loc:

Input:

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Output:

          taster_name  taster_twitter_handle  points
0       Kerin O’Keefe           @kerinokeefe      87
1          Roger Voss             @vossroger      87
...               ...                    ...     ...
129969     Roger Voss             @vossroger      90
129970     Roger Voss             @vossroger      90
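
To see label-based selection without loading the full data set, here is a small sketch on a hypothetical frame whose row labels (wine_a, wine_b, wine_c) and values are invented for illustration. It also shows the position-based equivalent for comparison.

```python
import pandas as pd

# Toy frame with meaningful row labels instead of the default 0..n-1 index.
toy = pd.DataFrame(
    {"points": [87, 87, 90], "price": [10.0, 15.0, 32.0]},
    index=["wine_a", "wine_b", "wine_c"],
)

# One row label and one column label: returns a single value.
single = toy.loc["wine_b", "price"]

# Lists of row and column labels: returns a DataFrame.
subset = toy.loc[["wine_a", "wine_c"], ["points"]]

# The same cell by position: second row, second column.
same_cell = toy.iloc[1, 1]

print(single, same_cell)
print(subset.shape)
```

With loc the code names what it selects, so it keeps working if columns are reordered; the iloc version depends on the layout staying the same.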

Conditional Selection

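reviews.loc[reviews.country == 'Italy']  # rows where the country is Italy
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]  # Italian wines scoring 90 or more
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]  # wines that are Italian, score 90 or more, or both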
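
What makes these selections work is that a comparison such as reviews.country == 'Italy' produces a Series of True/False values, one per row, which loc then uses as a filter. The sketch below shows this on a hypothetical four-row frame (toy and its values are made up). The parentheses around each comparison matter, because & and | bind more tightly than == and >= in Python.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "Italy", "France"], "points": [87, 87, 92, 90]}
)

# A comparison yields a boolean Series with one entry per row.
is_italy = toy.country == "Italy"
print(is_italy.tolist())

# & keeps rows where both conditions hold; | keeps rows where at least one holds.
both = toy.loc[is_italy & (toy.points >= 90)]
either = toy.loc[is_italy | (toy.points >= 90)]

print(len(both), len(either))
```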