Introduction to pandas
Introduction
pandas is a data manipulation package that helps with data alignment, merging and joining data sets, aggregating data, time-series functionality, and more, for both academic and commercial purposes.
To use pandas, you'll typically start with the following line of code:
import pandas as pd
Creating Data
There are two core objects in pandas: the DataFrame and the Series.
Create DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
Input:
df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]}, index=[1, 2, 3])
or
df = pd.DataFrame(
[[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=["a", "b", "c"]
)
Output:
 | a | b | c |
1 | 4 | 7 | 10 |
2 | 5 | 8 | 11 |
3 | 6 | 9 | 12 |
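As a quick check that the two constructors above describe the same table, the following sketch builds the DataFrame both ways and compares the results. The variable names from_dict and from_rows are illustrative only.

```python
import pandas as pd

# Column-oriented: each dict key becomes a column label.
from_dict = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]}, index=[1, 2, 3]
)

# Row-oriented: each inner list becomes one row.
from_rows = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=["a", "b", "c"]
)

# equals() compares shape, labels, dtypes, and values.
same = from_dict.equals(from_rows)
shape = from_dict.shape  # (number of rows, number of columns)
print(same, shape)
```

The dict form tends to read better when the data arrives column by column; the list-of-lists form fits data that arrives record by record.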
Create Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.
Input:
pd.Series([1, 2, 3, 4, 5])
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name.
Input:
pd.Series(
[30, 35, 40], index=["2015 Sales", "2016 Sales", "2017 Sales"], name="Product A"
)
Output:
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64
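To make the relationship between the two objects concrete, here is a minimal sketch that pulls a column out of a DataFrame as a Series and then glues two named Series back into a DataFrame. The "Product B" figures are made up for illustration and do not come from any dataset.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]}, index=[1, 2, 3])

# Selecting one column returns a Series whose name is the column label
# and whose index is the DataFrame's row index.
col = df["a"]
print(type(col).__name__, col.name, list(col.index))

years = ["2015 Sales", "2016 Sales", "2017 Sales"]
product_a = pd.Series([30, 35, 40], index=years, name="Product A")
product_b = pd.Series([21, 34, 31], index=years, name="Product B")  # hypothetical values

# Concatenating along axis=1 turns each Series name into a column label.
sales = pd.concat([product_a, product_b], axis=1)
print(sales)
```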
Reading Data Files
Data can be stored in any of a number of different forms and formats.
Read CSV file
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")
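The path above only exists in its original environment, so here is a self-contained sketch that reads a tiny in-memory CSV instead. It assumes the first column of the file holds the row index, which is why index_col=0 is passed; the name reviews_sample and the two-row sample text are illustrative.

```python
import io

import pandas as pd

# A two-row stand-in for a CSV file; the empty price field becomes NaN.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
"""

# read_csv accepts a path or any file-like object.
# index_col=0 tells pandas to use the first column as the row index
# instead of creating a new one.
reviews_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(reviews_sample.shape)  # (rows, columns)
print(reviews_sample.head())
```

Checking shape and head() right after loading is a cheap way to confirm the file was parsed as expected.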
Indexing data
Let's consider this DataFrame:
Input:
reviews
Output:
 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
Native Accessors
Select a Column
Input:
reviews.country
or
reviews['country']
Output:
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
Select a Single Entry
Input:
reviews['country'][0]
Output:
'Italy'
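The two column accessors are interchangeable for simple names, but only the bracket form can reach a column whose name is not a valid Python identifier. A small sketch on a made-up frame (the "unit price" column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"country": ["Italy", "Portugal"], "unit price": [float("nan"), 15.0]}
)

by_attr = df.country    # attribute access
by_key = df["country"]  # dictionary-style access

# A name containing a space can only be reached with brackets.
spaced = df["unit price"]

# Column first, then the row label within that column.
first = df["country"][0]
print(first)
```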
Indexing in pandas
pandas has its own accessor operators, loc and iloc.
iloc selects data based on its numerical position in the data; loc selects data by row and column labels.
iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. loc, meanwhile, indexes inclusively.
Both loc and iloc are row-first, column-second. This is the opposite of the native accessors shown above, such as reviews['country'][0], which are column-first, row-second.
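The difference in range handling is easy to miss, so here is a minimal sketch on small made-up data showing that the same-looking slice returns a different number of rows with each accessor.

```python
import pandas as pd

# Default integer index: labels 0..4 coincide with positions 0..4.
df = pd.DataFrame({"points": [87, 87, 90, 90, 91]})

by_position = df.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = df.loc[0:3]      # labels 0, 1, 2, 3 (end included)
print(len(by_position), len(by_label))

# Inclusive slicing is what makes label ranges practical:
# you do not need to know which label comes "after" the last one.
letters = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
b_to_c = letters.loc["b":"c"]
print(list(b_to_c.index))
```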
iloc
reviews.iloc[0] # selects the first row
reviews.iloc[:, 0] # selects the first column
The : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.
reviews.iloc[:3, 0] # selects just the first, second, and third row from the first column
reviews.iloc[1:3, 0] # selects just the second and third entries from the first column
It's also possible to pass a list:
reviews.iloc[[0, 1, 2], 0]
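Because iloc follows Python's positional rules, negative numbers count from the end, just as they do with lists. The sketch below tries this on a small made-up frame rather than the full reviews data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US", "France"],
        "points": [87, 87, 87, 87, 90],
    }
)

first_three = df.iloc[:3, 0].tolist()  # first three entries of the first column
last_two = df.iloc[-2:]                # last two rows, all columns
last_country = df.iloc[-1, 0]          # last row, first column
print(first_three, list(last_two.index), last_country)
```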
loc
When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead.
For example, here's one operation that's much easier using loc:
Input:
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
Output:
 | taster_name | taster_twitter_handle | points |
0 | Kerin O’Keefe | @kerinokeefe | 87 |
1 | Roger Voss | @vossroger | 87 |
... | ... | ... | ... |
129969 | Roger Voss | @vossroger | 90 |
129970 | Roger Voss | @vossroger | 90 |
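Label-based selection becomes more useful once the index carries meaning. As a sketch, the three-row frame below reuses winery, country, and points values from the table shown earlier and promotes winery to the index; the variable name by_winery is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "winery": ["Nicosia", "Quinta dos Avidagos", "Domaine Schoffit"],
        "country": ["Italy", "Portugal", "France"],
        "points": [87, 87, 90],
    }
)

# set_index returns a new frame whose row labels are the winery names.
by_winery = df.set_index("winery")

# Row label first, column label second.
nicosia_points = by_winery.loc["Nicosia", "points"]

# Lists of labels work on both axes at once.
subset = by_winery.loc[["Nicosia", "Domaine Schoffit"], ["country", "points"]]
print(nicosia_points)
print(subset)
```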
Conditional Selection
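reviews.loc[reviews.country == 'Italy'] # selects wines from Italy
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)] # selects Italian wines rated 90 or higher
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)] # selects wines that are Italian or rated 90 or higher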
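What makes these expressions work is that a comparison on a column produces a Series of True/False values, and loc keeps the rows where that Series is True. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "Italy", "France"],
        "points": [87, 87, 92, 90],
    }
)

# The comparison itself yields a boolean Series, one value per row.
is_italian = df.country == "Italy"
print(is_italian.tolist())

italian = df.loc[is_italian]

# & means "and", | means "or". Each condition needs its own parentheses
# because & and | bind more tightly than == and >= in Python.
italian_and_good = df.loc[(df.country == "Italy") & (df.points >= 90)]
italian_or_good = df.loc[(df.country == "Italy") | (df.points >= 90)]

print(len(italian), len(italian_and_good), len(italian_or_good))
```

Python's own and/or keywords do not work here, since they try to reduce a whole Series to a single truth value.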