Week 5 - pandas

pandas is an open source library that brings high-performance, easy-to-use data structures and analysis tools to the Python programming language. It’s core features are its 1-dimensional Series and 2-dimensional DataFrame objects. The former is like a NumPy array (pandas is built on top of NumPy) with an explicit index, and the latter is a bit like an Excel worksheet.

The 10 minutes to pandas and getting started guides are excellent places to learn more, but read on for my own tour of the library which draws comparisons between pandas and familiar spreadsheet programs (e.g., Microsoft Excel, Apple Numbers, LibreOffice Calc, Google Sheets).

Importing pandas

The community agreed custom for importing pandas is:

import pandas as pd

This gives us access to all pandas functionality, which means it is a bit like clicking on the desktop icon that launches a spreadsheet application. Let’s import pandas, and while we are at it, let’s also import numpy and matplotlib (and set some configuration options).

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

We’ll begin with a detailed look at the Series and DataFrame objects.

Series

pd.Series is much like a 1-dimensional NumPy array, but with explicit axis labels, the ability to hold any data type, and various feature and method enhancements to simplify working with complex data. We can think of a Series as the rough equivalent of a column in a spreadsheet application.

The basic way of making a Series is as follows:

s = pd.Series(data, index=index)

Note that the argument for data can take various forms, such as a Python dictionary, a numpy array, a list, or even a single value. If provided, index must be a list of axis labels with the same length as data.

Here’s a simple Series of random numbers.

[2]:
s = pd.Series(np.random.random(100))
print(s)
0     0.965606
1     0.500888
2     0.485584
3     0.704136
4     0.001702
        ...
95    0.199428
96    0.274759
97    0.653682
98    0.639472
99    0.949416
Length: 100, dtype: float64

We didn’t specify an index, so a default integer-based index (starting at 0) was applied, which we can see to the left of the actual values. Below the index and values we can also see the length and data type.

Now, let’s make a Series containing the letters of the alphabet and assign also an index and a name.

[3]:
letters = list('abcdefghijklmnopqrstuvwxyz')
s = pd.Series(letters, index=range(1, 27), name='lowercase')
print(s)
1     a
2     b
3     c
4     d
5     e
6     f
7     g
8     h
9     i
10    j
11    k
12    l
13    m
14    n
15    o
16    p
17    q
18    r
19    s
20    t
21    u
22    v
23    w
24    x
25    y
26    z
Name: lowercase, dtype: object

A few things to note here. First, the index doesn’t default to zero because we specified our own labels. Second, the name we provided is shown below the values (this is analogous to a column header in spreadsheet applications). Third, the data type is object, denoting that the Series contains str data.

DataFrame

A DataFrame is a 2-dimensional data structure consisting of rows and columns, which makes it similar to a worksheet in Microsfot Excel. Each column in a DataFrame is a Series, which by now we know to be a feature-enhanced, explicitly indexed, NumPy array that can contain any type of data. A DataFrame, then, is a collection of Series that share a common Index.

An easy way to create a basic DataFrameis to pass a Python dict to the pd.DataFrame(...) constructor. In this scenario, the keys will serve as names for the columns, and the values, which must be convertible to a series-like data structure, will be the data.

[4]:
# It's the familiar shopping example again...
df = pd.DataFrame(
    {
        'item': ['apples', 'bananas', 'bread', 'yoghurt', 'pasta', 'coffee', 'butter', 'carrots'],
        'unit_price': np.array([2.20, 1.85, 1.10, 1.20, 1.45, 2.80, 2.40, .89], dtype='float32'),
        'quantity': np.array([2, 1, 2, 4, 2, 1, 1, 1]),
        'date_of_purchase': pd.Timestamp('20221208'),
        'shop': pd.Categorical(['Aldi'] * 8),
        'paid_cash': True
    }
)
df
[4]:
item unit_price quantity date_of_purchase shop paid_cash
0 apples 2.20 2 2022-12-08 Aldi True
1 bananas 1.85 1 2022-12-08 Aldi True
2 bread 1.10 2 2022-12-08 Aldi True
3 yoghurt 1.20 4 2022-12-08 Aldi True
4 pasta 1.45 2 2022-12-08 Aldi True
5 coffee 2.80 1 2022-12-08 Aldi True
6 butter 2.40 1 2022-12-08 Aldi True
7 carrots 0.89 1 2022-12-08 Aldi True

As with NumPy arrays, the DataFrame object stores some basic information as attributes. Below, we get information about shape, number of dimensions, size (number of elements), and the data types for each column of df.

[5]:
print('Shape:                ', df.shape)
print('Number of dimensions: ', df.ndim)
print('Number of elements:   ', df.size)
print('Data types:\n')
print(df.dtypes)
Shape:                 (8, 6)
Number of dimensions:  2
Number of elements:    48
Data types:

item                        object
unit_price                 float32
quantity                     int64
date_of_purchase    datetime64[ns]
shop                      category
paid_cash                     bool
dtype: object

We can also see how much memory (in bytes) is being used by each column.

[6]:
df.memory_usage()
[6]:
Index               128
item                 64
unit_price           32
quantity             64
date_of_purchase     64
shop                124
paid_cash             8
dtype: int64

The actual values stored in a DataFrame are accessible via the .values attribute. Inspection of these values shows that pandas is built right on top of NumPy.

[7]:
print("The values of the data frame are: ")
print(df.values)
print("The type of the values is ", type(df.values))
The values of the data frame are:
[['apples' 2.200000047683716 2 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['bananas' 1.850000023841858 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['bread' 1.100000023841858 2 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['yoghurt' 1.2000000476837158 4 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['pasta' 1.4500000476837158 2 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['coffee' 2.799999952316284 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['butter' 2.4000000953674316 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['carrots' 0.8899999856948853 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]]
The type of the values is  <class 'numpy.ndarray'>

Viewing, selecting and indexing

If you are new to pandas, it may sometimes feel like the data are hidden away and out of reach, especially in comparison to spreadsheet environments where data are generally always on the screen and can be selected and browsed by clicking and scrolling the mouse. Thankfully, pandas has some handy tools for viewing a DataFrame.

DataFrame.head() can be used to view the top of a frame, and DataFrame.tail() to view the bottom. The default number of rows to display is 5, but we can change this by specifying a number.

[8]:
# Show the first three rows of df
df.head(3)
[8]:
item unit_price quantity date_of_purchase shop paid_cash
0 apples 2.20 2 2022-12-08 Aldi True
1 bananas 1.85 1 2022-12-08 Aldi True
2 bread 1.10 2 2022-12-08 Aldi True
[9]:
# Show the last three rows of df
df.tail(3)
[9]:
item unit_price quantity date_of_purchase shop paid_cash
5 coffee 2.80 1 2022-12-08 Aldi True
6 butter 2.40 1 2022-12-08 Aldi True
7 carrots 0.89 1 2022-12-08 Aldi True

The row and column indices themselves can also be accessed and operated upon (note that we didn’t specify a row index at the time of creation, so we got a default RangeIndex).

[10]:
# View the row index of df
df.index
[10]:
RangeIndex(start=0, stop=8, step=1)
[11]:
# View the column index of df
df.columns
[11]:
Index(['item', 'unit_price', 'quantity', 'date_of_purchase', 'shop',
       'paid_cash'],
      dtype='object')

Let’s talk about selecting columns. To select a single column of a DataFrame, put its name in square brackets. This is a bit like clicking on the column header in a spreadsheet to highlight the entire column. The result of this operation is a Series.

[12]:
df['item']
[12]:
0     apples
1    bananas
2      bread
3    yoghurt
4      pasta
5     coffee
6     butter
7    carrots
Name: item, dtype: object

As long as the column name doesn’t contain whitespace or special characters, single columns may also be accessed in equivalent fashion via . notation. I often find this more convenient, as it’s easier to type.

[13]:
df.unit_price
[13]:
0    2.20
1    1.85
2    1.10
3    1.20
4    1.45
5    2.80
6    2.40
7    0.89
Name: unit_price, dtype: float32

Selecting multiple columns at the same time requires a list of column names, and returns a DataFrame (this is like doing Ctrl+click or Ctrl+drag to select multiple columns in a spreadsheet).

[14]:
df[['item', 'quantity', 'unit_price']]
[14]:
item quantity unit_price
0 apples 2 2.20
1 bananas 1 1.85
2 bread 2 1.10
3 yoghurt 4 1.20
4 pasta 2 1.45
5 coffee 1 2.80
6 butter 1 2.40
7 carrots 1 0.89

[] can be used to slice by row

[15]:
df[2:5]
[15]:
item unit_price quantity date_of_purchase shop paid_cash
2 bread 1.10 2 2022-12-08 Aldi True
3 yoghurt 1.20 4 2022-12-08 Aldi True
4 pasta 1.45 2 2022-12-08 Aldi True

pandas also supports label-based indexing, which is extremely useful for selecting individual values or cross-sections of a DataFrame.

Label-based indexing is performed using the .loc[] and .at[] access methods. Note these are square, and not round, brackets.

[16]:
# Select rows 4:7 and the specified columns
df.loc[4:7, ['item', 'unit_price', 'quantity']]
[16]:
item unit_price quantity
4 pasta 1.45 2
5 coffee 2.80 1
6 butter 2.40 1
7 carrots 0.89 1
[17]:
# Select the value in row=3, column='item'
df.at[3, 'item']
[17]:
'yoghurt'

Positional indexing can be achieved with the .iloc[] and .iat[] access methods. The difference here is that we must provide integers (starting at 0). The behavior is very similar to array slicing in Python and NumPy.

[18]:
# Select the 6th row
df.iloc[5]
[18]:
item                             coffee
unit_price                          2.8
quantity                              1
date_of_purchase    2022-12-08 00:00:00
shop                               Aldi
paid_cash                          True
Name: 5, dtype: object
[19]:
# Select the value in the second row of the first column
df.iat[1, 0]
[19]:
'bananas'

Boolean indexing is another common way to select a subset of data from a frame. Suppose we cared only about rows where the unit_price was over £2.00.

[20]:
df[df['unit_price'] > 2.]
[20]:
item unit_price quantity date_of_purchase shop paid_cash
0 apples 2.2 2 2022-12-08 Aldi True
5 coffee 2.8 1 2022-12-08 Aldi True
6 butter 2.4 1 2022-12-08 Aldi True

This works, because the expression inside the square brackets evaluates as a Series of bool, and the indexing operation ensures that we only get the rows where the expression comes up True.

[21]:
df['unit_price']>2.
[21]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7    False
Name: unit_price, dtype: bool

We can get quite fancy by chaining multiple expressions together. For example, if we only wanted rows where the unit_price is greater than £1.00 and less than £2.00.

[22]:
df[((df['unit_price'] > 1.) & (df['unit_price'] < 2.))]
[22]:
item unit_price quantity date_of_purchase shop paid_cash
1 bananas 1.85 1 2022-12-08 Aldi True
2 bread 1.10 2 2022-12-08 Aldi True
3 yoghurt 1.20 4 2022-12-08 Aldi True
4 pasta 1.45 2 2022-12-08 Aldi True

World population dataset

The best way to learn pandas is to use it with real data, so let’s explore some more advanced features with an actual dataset.

In the data folder of the course materials, there’s a file called world_population.csv which I sourced from an excellent data science website called Kaggle. The file contains historical population data for every country/territory in the world, along with various other parameters. Here’s what’s in the file:

Variable

Definition

Rank

Rank by Population

CCA3

3 Digit Country/Territories Code

Country

Name of the Country/Territories

Capital

Name of the Capital

Continent

Name of the Continent

2022 Population

Population of the Country/Territories in the year 2022

2020 Population

Population of the Country/Territories in the year 2020

2015 Population

Population of the Country/Territories in the year 2015

2010 Population

Population of the Country/Territories in the year 2010

2000 Population

Population of the Country/Territories in the year 2000

1990 Population

Population of the Country/Territories in the year 1990

1980 Population

Population of the Country/Territories in the year 1980

1970 Population

Population of the Country/Territories in the year 1970

Area (km²)

Area size of the Country/Territories in square kilometer

Density (per km²)

Population Density per square kilometer

Growth Rate

Population Growth Rate by Country/Territories

World Population Percentage

The population percentage by each Country/Territories

Let’s start by loading and inspecting the data.

Because its a CSV file, pd.read_csv(...) is right tool for the job.

[23]:
df = pd.read_csv('../data/world_population.csv')
df
[23]:
Rank CCA3 Country Capital Continent 2022 Population 2020 Population 2015 Population 2010 Population 2000 Population 1990 Population 1980 Population 1970 Population Area (km²) Density (per km²) Growth Rate World Population Percentage
0 36 AFG Afghanistan Kabul Asia 41128771 38972230 33753499 28189672 19542982 10694796 12486631 10752971 652230 63.0587 1.0257 0.52
1 138 ALB Albania Tirana Europe 2842321 2866849 2882481 2913399 3182021 3295066 2941651 2324731 28748 98.8702 0.9957 0.04
2 34 DZA Algeria Algiers Africa 44903225 43451666 39543154 35856344 30774621 25518074 18739378 13795915 2381741 18.8531 1.0164 0.56
3 213 ASM American Samoa Pago Pago Oceania 44273 46189 51368 54849 58230 47818 32886 27075 199 222.4774 0.9831 0.00
4 203 AND Andorra Andorra la Vella Europe 79824 77700 71746 71519 66097 53569 35611 19860 468 170.5641 1.0100 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
229 226 WLF Wallis and Futuna Mata-Utu Oceania 11572 11655 12182 13142 14723 13454 11315 9377 142 81.4930 0.9953 0.00
230 172 ESH Western Sahara El Aaiún Africa 575986 556048 491824 413296 270375 178529 116775 76371 266000 2.1654 1.0184 0.01
231 46 YEM Yemen Sanaa Asia 33696614 32284046 28516545 24743946 18628700 13375121 9204938 6843607 527968 63.8232 1.0217 0.42
232 63 ZMB Zambia Lusaka Africa 20017675 18927715 16248230 13792086 9891136 7686401 5720438 4281671 752612 26.5976 1.0280 0.25
233 74 ZWE Zimbabwe Harare Africa 16320537 15669666 14154937 12839771 11834676 10113893 7049926 5202918 390757 41.7665 1.0204 0.20

234 rows × 17 columns

We now have a DataFrame with 17 columns and 234 rows, which appear to be sorted by Country in ascending alphabetical order. All in all, these are clean and well organized data, but if we sorted the rows by Rank, we could get a quick and easy insight into the most and least populous countries. We can do this using .sort_values(...).

[24]:
# Sort the rows by Rank in descending order
df.sort_values('Rank', ascending=True)
[24]:
Rank CCA3 Country Capital Continent 2022 Population 2020 Population 2015 Population 2010 Population 2000 Population 1990 Population 1980 Population 1970 Population Area (km²) Density (per km²) Growth Rate World Population Percentage
41 1 CHN China Beijing Asia 1425887337 1424929781 1393715448 1348191368 1264099069 1153704252 982372466 822534450 9706961 146.8933 1.0000 17.88
92 2 IND India New Delhi Asia 1417173173 1396387127 1322866505 1240613620 1059633675 870452165 696828385 557501301 3287590 431.0675 1.0068 17.77
221 3 USA United States Washington, D.C. North America 338289857 335942003 324607776 311182845 282398554 248083732 223140018 200328340 9372610 36.0935 1.0038 4.24
93 4 IDN Indonesia Jakarta Asia 275501339 271857970 259091970 244016173 214072421 182159874 148177096 115228394 1904569 144.6529 1.0064 3.45
156 5 PAK Pakistan Islamabad Asia 235824862 227196741 210969298 194454498 154369924 115414069 80624057 59290872 881912 267.4018 1.0191 2.96
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
137 230 MSR Montserrat Brades North America 4390 4500 5059 4938 5138 10805 11452 11402 102 43.0392 0.9939 0.00
64 231 FLK Falkland Islands Stanley South America 3780 3747 3408 3187 3080 2332 2240 2274 12173 0.3105 1.0043 0.00
150 232 NIU Niue Alofi Oceania 1934 1942 1847 1812 2074 2533 3637 5185 260 7.4385 0.9985 0.00
209 233 TKL Tokelau Nukunonu Oceania 1871 1827 1454 1367 1666 1669 1647 1714 12 155.9167 1.0119 0.00
226 234 VAT Vatican City Vatican City Europe 510 520 564 596 651 700 733 752 1 510.0000 0.9980 0.00

234 rows × 17 columns

This is great, but everyone knows that China is the most populated country in the world. Surely there is a more nuanced story to be told. Let’s start by calculating the most recent estimate of the global population by summing all of the values in the 2022 Population column.

[25]:
print('The global population in 2022 is: ')
df['2022 Population'].sum()
The global population in 2022 is:
[25]:
7973413042

That is believable. But now I’m curious how that is distributed by Continent.

A .groupby operation is perfect for this job.

.groupby() - Read more here

[26]:
# Group by Continent and sum the values
df.groupby('Continent')['2022 Population'].sum()
[26]:
Continent
Africa           1426730932
Asia             4721383274
Europe            743147538
North America     600296136
Oceania            45038554
South America     436816608
Name: 2022 Population, dtype: int64

Unsurprisingly, Asia is the most highly populated continent. A bar chart may help us to appreciate this more fully. We can easily make a bar chart using the .plot() method for Series and DataFrames.

.plot() - Read more here

[27]:
# Repeat the above operation, but this time chain a plotting method to the end
(
 df.groupby('Continent')['2022 Population']
 .sum()
 .plot(kind='bar', rot=45, xlabel='', ylabel='Population')
);
_images/05_pandas_51_0.png

So Asia is the most populated continent, but surely that’s just because its the biggest. How does the population relate to the total size of the continent? The Density (per km²) column looks like it was calculated by dividing the population of a country by its total area, so repeating the operation with this variable will help to develop the story.

[28]:
# As above, but with the 'Density (per km²)' column instead
(
 df.groupby('Continent')['Density (per km²)']
 .sum()
 .plot(kind='bar', rot=45, xlabel='', ylabel='Population density (per km²)')
);
_images/05_pandas_53_0.png

This paints a different picture. Most striking is the difference for Europe. Though it is the third most populated continent, Europe comes second for population density, and by a long way in comparison to the others. This is especially true in comparison with South America, whose population is very sparse indeed.

The complete picture - population growth

To tell the overall story of these population data, I came up with my own two-figure solution, which I feel does a pretty good job. I sort the data by Rank, calculate a column-wise differential on the <date> Population columns, and then plot the result for each country in a stacked horizontal bar chart, with bars to the left show population decline, and bars to right show population growth for that period. To avoid ambiguity, I put the total population values to the side of each bar. On the right, these values indicate the population for that country in 2022, and on the left, they indicate the total population decline since 1970.

[29]:
# Get the columns with the population data
pop_cols = df.columns[df.columns.str.endswith('Population')]

# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Rank', ascending=True)[pop_cols]  # Sort by 'Rank' and keep only the columns with population data
 .iloc[0:117]  # Choose 117 MOST populated countries with location-based indexing
 .loc[:, lambda df_: reversed(df_.columns)]  # Reverse the order of the columns in a fancy way
)

# Pull out the base population at 1970
base_pop_1970 = data['1970 Population']

# Calculate a column-wise differential. This means the data will
# now reflect change from the previous timepoint, rather than the
# total population at that time.
data = data.diff(axis=1)

# Put the base population back in
data['1970 Population'] = base_pop_1970

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
# Note the reversal of the data with [::-1], which is done
# to make sure the longest bars are at the top
data[::-1].plot(kind='barh', stacked=True, ax=ax)
lines = ax.hlines(range(0,117), 0, data[::-1].sum(axis=1).values, color='k', lw=1.5)

# Add the numbers as text
for i, (country, row) in enumerate(data[::-1].iterrows()):
    pop = row.sum()
    text_pos = row[row>0].sum()
    ax.text(text_pos+1e7, i, '{:,}'.format(pop), va='center', fontsize=8)
    text_pos_neg = row[row < 0].abs().sum()
    if text_pos_neg > 0:
        ax.text(-(text_pos_neg+1e7), i, f'{text_pos_neg:,}', va='center', ha='right', fontsize=8)

# Add a vertical black line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Format the x axis
def billions_formatter(x, pos):
    return f'{x / 1000_000_000}'
ax.xaxis.set_major_formatter(plt.FuncFormatter(billions_formatter))

# Tweak axis
ax.set(
    ylabel='',
    title='Population change in the 117 most populated countries/territories (1970-2022)',
    xlabel='Population in billions',
    xlim=(-.25e9, 1.75e9)
);

# Put legend in lower right of figure
handles, labels = ax.get_legend_handles_labels()
handles.append(lines)
labels = [
    '1970 Population',
    '1970 - 1980',
    '1980 - 1990',
    '1990 - 2000',
    '2000 - 2010',
    '2010 - 2015',
    '2015 - 2020',
    '2020 - 2022',
    '2022 Population'
]
ax.legend(handles, labels, loc='lower right')

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/most_populated_countries_2022.png', bbox_inches='tight', dpi=300)
_images/05_pandas_55_0.png

Rinse and repeat for the other half of the data. By using a second figure for the less-populated half of the dataset, we reset the scale on the x-axis and dramatically improve the explanatory power of the visualisations.

[30]:
# Get the columns with the population data
pop_cols = df.columns[df.columns.str.endswith('Population')]

# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Rank', ascending=True)[pop_cols]  # Sort by 'Rank' and keep only the columns with population data
 .iloc[-117:]  # Choose the 117 LEAST populated countries with location-based indexing
 .loc[:, lambda df_: reversed(df_.columns)]  # Reverse the order of the columns in a fancy way
)

# Pull out the base population at 1970
base_pop_1970 = data['1970 Population']

# Calculate a column-wise differential. This means the data will
# now reflect change from the previous timepoint, rather than the
# total population at that time.
data = data.diff(axis=1)

# Put the base population back in
data['1970 Population'] = base_pop_1970

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
# Note the reversal of the data with [::-1], which is done
# to make sure the longest bars are at the top
data[::-1].plot(kind='barh', stacked=True, ax=ax)
lines = ax.hlines(range(0,117), 0, data[::-1].sum(axis=1).values, color='k', lw=1.5)

# Add the numbers as text
for i, (country, row) in enumerate(data[::-1].iterrows()):
    pop = row.sum()
    text_pos = row[row > 0].sum()
    ax.text(text_pos+1e5, i, '{:,}'.format(pop), va='center', fontsize=8)
    text_pos_neg = row[row < 0].abs().sum()
    if text_pos_neg > 0:
        ax.text(-(text_pos_neg+1e5), i, '{:,}'.format(text_pos_neg), va='center', ha='right', fontsize=8)

# Add a vertical line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Format the x axis
def millions_formatter(x, pos):
    return f'{x / 1_000_000}'
ax.xaxis.set_major_formatter(plt.FuncFormatter(millions_formatter))

# Tweak axis
ax.set(
    ylabel='',
    title='Population change in the 117 least populated countries/territories (1970-2022)',
    xlabel='Population in millions',
    xlim=(-3e6, 8e6)
)

# Put legend in lower right of figure
handles, labels = ax.get_legend_handles_labels()
handles.append(lines)
labels = [
    '1970 Population',
    '1970 - 1980',
    '1980 - 1990',
    '1990 - 2000',
    '2000 - 2010',
    '2010 - 2015',
    '2015 - 2020',
    '2020 - 2022',
    '2022 Population'
]
ax.legend(handles, labels, loc='lower right')


#fig.suptitle('Population change in the 117 least populated countries/territories (1970-2022)')

#ax.text(-5.5e5, 117, 'Decline | Growth')

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/least_populated_countries_2022.png', bbox_inches='tight', dpi=300)
_images/05_pandas_57_0.png

The big picture - population density

After plotting the above, I realised I could do the same using only the Density (per km²) column. This was a little bit easier as it didn’t require stacking the bars.

[31]:
# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Density (per km²)', ascending=True)
 .iloc[-117:]  # Choose the 117 MOST densely populated countries with location-based indexing
)

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
data.plot(kind='barh', y='Density (per km²)', ax=ax, legend=False)

# Add the numbers in text
for i, (country, row) in enumerate(data.iterrows()):
    pop = row['Density (per km²)']
    ax.text(pop+100, i, '{:,}'.format(round(pop, 2)), va='center', fontsize=8)

# Add a vertical line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Tweak axis
ax.set(
    ylabel='',
    title='Population density in the 117 most populated countries/territories',
    xlabel='Population (per km²)',
    xlim=(0, 25000)
)

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/most_dense_countries_2022.png', bbox_inches='tight', dpi=300)
_images/05_pandas_59_0.png

What’s it like to be one of 23,172 people in a square kilometer of land? Go to Macau and you’ll find out. Also, I didn’t realise Gibralter was so dense.

Now for the other half…

[32]:
# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Density (per km²)', ascending=True)
 .iloc[0:117]  # Choose the 117 LEAST densely populated countries with location-based indexing
)

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
data.plot(kind='barh', y='Density (per km²)', ax=ax, legend=False)

# Add the numbers in text
for i, (country, row) in enumerate(data.iterrows()):
    pop = row['Density (per km²)']
    ax.text(pop+1, i, '{:,}'.format(round(pop, 2)), va='center', fontsize=8)

# Add a vertical line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Tweak axis
ax.set(
    ylabel='',
    title='Population density in the 117 least populated countries/territories',
    xlabel='Population (per km²)',
    xlim=(0, 100)
)

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/least_dense_countries_2022.png', bbox_inches='tight', dpi=300)
_images/05_pandas_61_0.png

Iceland… A beautiful country, where a kilometer of land is shared by only 3.62 people.

That’s it for now. These examples are complicated, but if you study them carefully, change bits, and come up with your own variations, you will learn a lot about Python and pandas in the process! If you are feeling adventures, why not try and replicate these plots for the Area (km²) column? In this situation, countries like Russia and Canada would be right at the top, and at the bottom would be The Vatican City and small island nations like the Falkland and Faroe Islands.

[ ]: