Week 5 - pandas¶

pandas is an open source library that brings high-performance, easy-to-use data structures and analysis tools to the Python programming language. It’s core features are its 1-dimensional Series and 2-dimensional DataFrame objects. The former is like a NumPy array (pandas is built on top of NumPy) with an explicit index, and the latter is a bit like an Excel worksheet.

The 10 minutes to pandas and getting started guides are excellent places to learn more, but read on for my own tour of the library which draws comparisons between pandas and familiar spreadsheet programs (e.g., Microsoft Excel, Apple Numbers, LibreOffice Calc, Google Sheets).

Importing `pandas`¶

The community agreed custom for importing pandas is:

import pandas as pd

This gives us access to all pandas functionality, which means it is a bit like clicking on the desktop icon that launches a spreadsheet application. Let’s import pandas, and while we are at it, let’s also import numpy and matplotlib (and set some configuration options).

[1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

We’ll begin with a detailed look at the Series and DataFrame objects.

`Series`¶

pd.Series is much like a 1-dimensional NumPy array, but with explicit axis labels, the ability to hold any data type, and various feature and method enhancements to simplify working with complex data. We can think of a Series as the rough equivalent of a column in a spreadsheet application.

The basic way of making a Series is as follows:

s = pd.Series(data, index=index)

Note that the argument for data can take various forms, such as a Python dictionary, a numpy array, a list, or even a single value. If provided, index must be a list of axis labels with the same length as data.

Here’s a simple Series of random numbers.

[2]:

s = pd.Series(np.random.random(100))
print(s)

0     0.965606
1     0.500888
2     0.485584
3     0.704136
4     0.001702
        ...
95    0.199428
96    0.274759
97    0.653682
98    0.639472
99    0.949416
Length: 100, dtype: float64

We didn’t specify an index, so a default integer-based index (starting at 0) was applied, which we can see to the left of the actual values. Below the index and values we can also see the length and data type.

Now, let’s make a Series containing the letters of the alphabet and assign also an index and a name.

[3]:

letters = list('abcdefghijklmnopqrstuvwxyz')
s = pd.Series(letters, index=range(1, 27), name='lowercase')
print(s)

1     a
2     b
3     c
4     d
5     e
6     f
7     g
8     h
9     i
10    j
11    k
12    l
13    m
14    n
15    o
16    p
17    q
18    r
19    s
20    t
21    u
22    v
23    w
24    x
25    y
26    z
Name: lowercase, dtype: object

A few things to note here. First, the index doesn’t default to zero because we specified our own labels. Second, the name we provided is shown below the values (this is analogous to a column header in spreadsheet applications). Third, the data type is object, denoting that the Series contains str data.

`DataFrame`¶

A DataFrame is a 2-dimensional data structure consisting of rows and columns, which makes it similar to a worksheet in Microsfot Excel. Each column in a DataFrame is a Series, which by now we know to be a feature-enhanced, explicitly indexed, NumPy array that can contain any type of data. A DataFrame, then, is a collection of Series that share a common Index.

An easy way to create a basic DataFrameis to pass a Python dict to the pd.DataFrame(...) constructor. In this scenario, the keys will serve as names for the columns, and the values, which must be convertible to a series-like data structure, will be the data.

[4]:

# It's the familiar shopping example again...
df = pd.DataFrame(
    {
        'item': ['apples', 'bananas', 'bread', 'yoghurt', 'pasta', 'coffee', 'butter', 'carrots'],
        'unit_price': np.array([2.20, 1.85, 1.10, 1.20, 1.45, 2.80, 2.40, .89], dtype='float32'),
        'quantity': np.array([2, 1, 2, 4, 2, 1, 1, 1]),
        'date_of_purchase': pd.Timestamp('20221208'),
        'shop': pd.Categorical(['Aldi'] * 8),
        'paid_cash': True
    }
)
df

[4]:

	item	unit_price	quantity	date_of_purchase	shop	paid_cash
0	apples	2.20	2	2022-12-08	Aldi	True
1	bananas	1.85	1	2022-12-08	Aldi	True
2	bread	1.10	2	2022-12-08	Aldi	True
3	yoghurt	1.20	4	2022-12-08	Aldi	True
4	pasta	1.45	2	2022-12-08	Aldi	True
5	coffee	2.80	1	2022-12-08	Aldi	True
6	butter	2.40	1	2022-12-08	Aldi	True
7	carrots	0.89	1	2022-12-08	Aldi	True

As with NumPy arrays, the DataFrame object stores some basic information as attributes. Below, we get information about shape, number of dimensions, size (number of elements), and the data types for each column of df.

[5]:

print('Shape:                ', df.shape)
print('Number of dimensions: ', df.ndim)
print('Number of elements:   ', df.size)
print('Data types:\n')
print(df.dtypes)

Shape:                 (8, 6)
Number of dimensions:  2
Number of elements:    48
Data types:

item                        object
unit_price                 float32
quantity                     int64
date_of_purchase    datetime64[ns]
shop                      category
paid_cash                     bool
dtype: object

We can also see how much memory (in bytes) is being used by each column.

[6]:

df.memory_usage()

[6]:

Index               128
item                 64
unit_price           32
quantity             64
date_of_purchase     64
shop                124
paid_cash             8
dtype: int64

The actual values stored in a DataFrame are accessible via the .values attribute. Inspection of these values shows that pandas is built right on top of NumPy.

[7]:

print("The values of the data frame are: ")
print(df.values)
print("The type of the values is ", type(df.values))

The values of the data frame are:
[['apples' 2.200000047683716 2 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['bananas' 1.850000023841858 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['bread' 1.100000023841858 2 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['yoghurt' 1.2000000476837158 4 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['pasta' 1.4500000476837158 2 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['coffee' 2.799999952316284 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['butter' 2.4000000953674316 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]
 ['carrots' 0.8899999856948853 1 Timestamp('2022-12-08 00:00:00') 'Aldi'
  True]]
The type of the values is  <class 'numpy.ndarray'>

Viewing, selecting and indexing¶

If you are new to pandas, it may sometimes feel like the data are hidden away and out of reach, especially in comparison to spreadsheet environments where data are generally always on the screen and can be selected and browsed by clicking and scrolling the mouse. Thankfully, pandas has some handy tools for viewing a DataFrame.

DataFrame.head() can be used to view the top of a frame, and DataFrame.tail() to view the bottom. The default number of rows to display is 5, but we can change this by specifying a number.

[8]:

# Show the first three rows of df
df.head(3)

[8]:

	item	unit_price	quantity	date_of_purchase	shop	paid_cash
0	apples	2.20	2	2022-12-08	Aldi	True
1	bananas	1.85	1	2022-12-08	Aldi	True
2	bread	1.10	2	2022-12-08	Aldi	True

[9]:

# Show the last three rows of df
df.tail(3)

[9]:

	item	unit_price	quantity	date_of_purchase	shop	paid_cash
5	coffee	2.80	1	2022-12-08	Aldi	True
6	butter	2.40	1	2022-12-08	Aldi	True
7	carrots	0.89	1	2022-12-08	Aldi	True

The row and column indices themselves can also be accessed and operated upon (note that we didn’t specify a row index at the time of creation, so we got a default RangeIndex).

[10]:

# View the row index of df
df.index

[10]:

RangeIndex(start=0, stop=8, step=1)

[11]:

# View the column index of df
df.columns

[11]:

Index(['item', 'unit_price', 'quantity', 'date_of_purchase', 'shop',
       'paid_cash'],
      dtype='object')

Let’s talk about selecting columns. To select a single column of a DataFrame, put its name in square brackets. This is a bit like clicking on the column header in a spreadsheet to highlight the entire column. The result of this operation is a Series.

[12]:

df['item']

[12]:

0     apples
1    bananas
2      bread
3    yoghurt
4      pasta
5     coffee
6     butter
7    carrots
Name: item, dtype: object

As long as the column name doesn’t contain whitespace or special characters, single columns may also be accessed in equivalent fashion via . notation. I often find this more convenient, as it’s easier to type.

[13]:

df.unit_price

[13]:

0    2.20
1    1.85
2    1.10
3    1.20
4    1.45
5    2.80
6    2.40
7    0.89
Name: unit_price, dtype: float32

Selecting multiple columns at the same time requires a list of column names, and returns a DataFrame (this is like doing Ctrl+click or Ctrl+drag to select multiple columns in a spreadsheet).

[14]:

df[['item', 'quantity', 'unit_price']]

[14]:

	item	quantity	unit_price
0	apples	2	2.20
1	bananas	1	1.85
2	bread	2	1.10
3	yoghurt	4	1.20
4	pasta	2	1.45
5	coffee	1	2.80
6	butter	1	2.40
7	carrots	1	0.89

[] can be used to slice by row

[15]:

df[2:5]

[15]:

	item	unit_price	quantity	date_of_purchase	shop	paid_cash
2	bread	1.10	2	2022-12-08	Aldi	True
3	yoghurt	1.20	4	2022-12-08	Aldi	True
4	pasta	1.45	2	2022-12-08	Aldi	True

pandas also supports label-based indexing, which is extremely useful for selecting individual values or cross-sections of a DataFrame.

Label-based indexing is performed using the .loc[] and .at[] access methods. Note these are square, and not round, brackets.

[16]:

# Select rows 4:7 and the specified columns
df.loc[4:7, ['item', 'unit_price', 'quantity']]

[16]:

	item	unit_price	quantity
4	pasta	1.45	2
5	coffee	2.80	1
6	butter	2.40	1
7	carrots	0.89	1

[17]:

# Select the value in row=3, column='item'
df.at[3, 'item']

[17]:

'yoghurt'

Positional indexing can be achieved with the .iloc[] and .iat[] access methods. The difference here is that we must provide integers (starting at 0). The behavior is very similar to array slicing in Python and NumPy.

[18]:

# Select the 6th row
df.iloc[5]

[18]:

item                             coffee
unit_price                          2.8
quantity                              1
date_of_purchase    2022-12-08 00:00:00
shop                               Aldi
paid_cash                          True
Name: 5, dtype: object

[19]:

# Select the value in the second row of the first column
df.iat[1, 0]

[19]:

'bananas'

Boolean indexing is another common way to select a subset of data from a frame. Suppose we cared only about rows where the unit_price was over £2.00.

[20]:

df[df['unit_price'] > 2.]

[20]:

	item	unit_price	quantity	date_of_purchase	shop	paid_cash
0	apples	2.2	2	2022-12-08	Aldi	True
5	coffee	2.8	1	2022-12-08	Aldi	True
6	butter	2.4	1	2022-12-08	Aldi	True

This works, because the expression inside the square brackets evaluates as a Series of bool, and the indexing operation ensures that we only get the rows where the expression comes up True.

[21]:

df['unit_price']>2.

[21]:

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7    False
Name: unit_price, dtype: bool

We can get quite fancy by chaining multiple expressions together. For example, if we only wanted rows where the unit_price is greater than £1.00 and less than £2.00.

[22]:

df[((df['unit_price'] > 1.) & (df['unit_price'] < 2.))]

[22]:

	item	unit_price	quantity	date_of_purchase	shop	paid_cash
1	bananas	1.85	1	2022-12-08	Aldi	True
2	bread	1.10	2	2022-12-08	Aldi	True
3	yoghurt	1.20	4	2022-12-08	Aldi	True
4	pasta	1.45	2	2022-12-08	Aldi	True

World population dataset¶

The best way to learn pandas is to use it with real data, so let’s explore some more advanced features with an actual dataset.

In the data folder of the course materials, there’s a file called world_population.csv which I sourced from an excellent data science website called Kaggle. The file contains historical population data for every country/territory in the world, along with various other parameters. Here’s what’s in the file:

Variable	Definition
Rank	Rank by Population
CCA3	3 Digit Country/Territories Code
Country	Name of the Country/Territories
Capital	Name of the Capital
Continent	Name of the Continent
2022 Population	Population of the Country/Territories in the year 2022
2020 Population	Population of the Country/Territories in the year 2020
2015 Population	Population of the Country/Territories in the year 2015
2010 Population	Population of the Country/Territories in the year 2010
2000 Population	Population of the Country/Territories in the year 2000
1990 Population	Population of the Country/Territories in the year 1990
1980 Population	Population of the Country/Territories in the year 1980
1970 Population	Population of the Country/Territories in the year 1970
Area (km²)	Area size of the Country/Territories in square kilometer
Density (per km²)	Population Density per square kilometer
Growth Rate	Population Growth Rate by Country/Territories
World Population Percentage	The population percentage by each Country/Territories

Let’s start by loading and inspecting the data.

Because its a CSV file, pd.read_csv(...) is right tool for the job.

[23]:

df = pd.read_csv('../data/world_population.csv')
df

[23]:

	Rank	CCA3	Country	Capital	Continent	2022 Population	2020 Population	2015 Population	2010 Population	2000 Population	1990 Population	1980 Population	1970 Population	Area (km²)	Density (per km²)	Growth Rate	World Population Percentage
0	36	AFG	Afghanistan	Kabul	Asia	41128771	38972230	33753499	28189672	19542982	10694796	12486631	10752971	652230	63.0587	1.0257	0.52
1	138	ALB	Albania	Tirana	Europe	2842321	2866849	2882481	2913399	3182021	3295066	2941651	2324731	28748	98.8702	0.9957	0.04
2	34	DZA	Algeria	Algiers	Africa	44903225	43451666	39543154	35856344	30774621	25518074	18739378	13795915	2381741	18.8531	1.0164	0.56
3	213	ASM	American Samoa	Pago Pago	Oceania	44273	46189	51368	54849	58230	47818	32886	27075	199	222.4774	0.9831	0.00
4	203	AND	Andorra	Andorra la Vella	Europe	79824	77700	71746	71519	66097	53569	35611	19860	468	170.5641	1.0100	0.00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
229	226	WLF	Wallis and Futuna	Mata-Utu	Oceania	11572	11655	12182	13142	14723	13454	11315	9377	142	81.4930	0.9953	0.00
230	172	ESH	Western Sahara	El Aaiún	Africa	575986	556048	491824	413296	270375	178529	116775	76371	266000	2.1654	1.0184	0.01
231	46	YEM	Yemen	Sanaa	Asia	33696614	32284046	28516545	24743946	18628700	13375121	9204938	6843607	527968	63.8232	1.0217	0.42
232	63	ZMB	Zambia	Lusaka	Africa	20017675	18927715	16248230	13792086	9891136	7686401	5720438	4281671	752612	26.5976	1.0280	0.25
233	74	ZWE	Zimbabwe	Harare	Africa	16320537	15669666	14154937	12839771	11834676	10113893	7049926	5202918	390757	41.7665	1.0204	0.20

234 rows × 17 columns

We now have a DataFrame with 17 columns and 234 rows, which appear to be sorted by Country in ascending alphabetical order. All in all, these are clean and well organized data, but if we sorted the rows by Rank, we could get a quick and easy insight into the most and least populous countries. We can do this using .sort_values(...).

[24]:

# Sort the rows by Rank in descending order
df.sort_values('Rank', ascending=True)

[24]:

	Rank	CCA3	Country	Capital	Continent	2022 Population	2020 Population	2015 Population	2010 Population	2000 Population	1990 Population	1980 Population	1970 Population	Area (km²)	Density (per km²)	Growth Rate	World Population Percentage
41	1	CHN	China	Beijing	Asia	1425887337	1424929781	1393715448	1348191368	1264099069	1153704252	982372466	822534450	9706961	146.8933	1.0000	17.88
92	2	IND	India	New Delhi	Asia	1417173173	1396387127	1322866505	1240613620	1059633675	870452165	696828385	557501301	3287590	431.0675	1.0068	17.77
221	3	USA	United States	Washington, D.C.	North America	338289857	335942003	324607776	311182845	282398554	248083732	223140018	200328340	9372610	36.0935	1.0038	4.24
93	4	IDN	Indonesia	Jakarta	Asia	275501339	271857970	259091970	244016173	214072421	182159874	148177096	115228394	1904569	144.6529	1.0064	3.45
156	5	PAK	Pakistan	Islamabad	Asia	235824862	227196741	210969298	194454498	154369924	115414069	80624057	59290872	881912	267.4018	1.0191	2.96
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
137	230	MSR	Montserrat	Brades	North America	4390	4500	5059	4938	5138	10805	11452	11402	102	43.0392	0.9939	0.00
64	231	FLK	Falkland Islands	Stanley	South America	3780	3747	3408	3187	3080	2332	2240	2274	12173	0.3105	1.0043	0.00
150	232	NIU	Niue	Alofi	Oceania	1934	1942	1847	1812	2074	2533	3637	5185	260	7.4385	0.9985	0.00
209	233	TKL	Tokelau	Nukunonu	Oceania	1871	1827	1454	1367	1666	1669	1647	1714	12	155.9167	1.0119	0.00
226	234	VAT	Vatican City	Vatican City	Europe	510	520	564	596	651	700	733	752	1	510.0000	0.9980	0.00

234 rows × 17 columns

This is great, but everyone knows that China is the most populated country in the world. Surely there is a more nuanced story to be told. Let’s start by calculating the most recent estimate of the global population by summing all of the values in the 2022 Population column.

[25]:

print('The global population in 2022 is: ')
df['2022 Population'].sum()

The global population in 2022 is:

[25]:

7973413042

That is believable. But now I’m curious how that is distributed by Continent.

A .groupby operation is perfect for this job.

.groupby() - Read more here

[26]:

# Group by Continent and sum the values
df.groupby('Continent')['2022 Population'].sum()

[26]:

Continent
Africa           1426730932
Asia             4721383274
Europe            743147538
North America     600296136
Oceania            45038554
South America     436816608
Name: 2022 Population, dtype: int64

Unsurprisingly, Asia is the most highly populated continent. A bar chart may help us to appreciate this more fully. We can easily make a bar chart using the .plot() method for Series and DataFrames.

.plot() - Read more here

[27]:

# Repeat the above operation, but this time chain a plotting method to the end
(
 df.groupby('Continent')['2022 Population']
 .sum()
 .plot(kind='bar', rot=45, xlabel='', ylabel='Population')
);

So Asia is the most populated continent, but surely that’s just because its the biggest. How does the population relate to the total size of the continent? The Density (per km²) column looks like it was calculated by dividing the population of a country by its total area, so repeating the operation with this variable will help to develop the story.

[28]:

# As above, but with the 'Density (per km²)' column instead
(
 df.groupby('Continent')['Density (per km²)']
 .sum()
 .plot(kind='bar', rot=45, xlabel='', ylabel='Population density (per km²)')
);

This paints a different picture. Most striking is the difference for Europe. Though it is the third most populated continent, Europe comes second for population density, and by a long way in comparison to the others. This is especially true in comparison with South America, whose population is very sparse indeed.

The complete picture - population growth¶

To tell the overall story of these population data, I came up with my own two-figure solution, which I feel does a pretty good job. I sort the data by Rank, calculate a column-wise differential on the <date> Population columns, and then plot the result for each country in a stacked horizontal bar chart, with bars to the left show population decline, and bars to right show population growth for that period. To avoid ambiguity, I put the total population values to the side of each bar. On the right, these values indicate the population for that country in 2022, and on the left, they indicate the total population decline since 1970.

[29]:

# Get the columns with the population data
pop_cols = df.columns[df.columns.str.endswith('Population')]

# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Rank', ascending=True)[pop_cols]  # Sort by 'Rank' and keep only the columns with population data
 .iloc[0:117]  # Choose 117 MOST populated countries with location-based indexing
 .loc[:, lambda df_: reversed(df_.columns)]  # Reverse the order of the columns in a fancy way
)

# Pull out the base population at 1970
base_pop_1970 = data['1970 Population']

# Calculate a column-wise differential. This means the data will
# now reflect change from the previous timepoint, rather than the
# total population at that time.
data = data.diff(axis=1)

# Put the base population back in
data['1970 Population'] = base_pop_1970

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
# Note the reversal of the data with [::-1], which is done
# to make sure the longest bars are at the top
data[::-1].plot(kind='barh', stacked=True, ax=ax)
lines = ax.hlines(range(0,117), 0, data[::-1].sum(axis=1).values, color='k', lw=1.5)

# Add the numbers as text
for i, (country, row) in enumerate(data[::-1].iterrows()):
    pop = row.sum()
    text_pos = row[row>0].sum()
    ax.text(text_pos+1e7, i, '{:,}'.format(pop), va='center', fontsize=8)
    text_pos_neg = row[row < 0].abs().sum()
    if text_pos_neg > 0:
        ax.text(-(text_pos_neg+1e7), i, f'{text_pos_neg:,}', va='center', ha='right', fontsize=8)

# Add a vertical black line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Format the x axis
def billions_formatter(x, pos):
    return f'{x / 1000_000_000}'
ax.xaxis.set_major_formatter(plt.FuncFormatter(billions_formatter))

# Tweak axis
ax.set(
    ylabel='',
    title='Population change in the 117 most populated countries/territories (1970-2022)',
    xlabel='Population in billions',
    xlim=(-.25e9, 1.75e9)
);

# Put legend in lower right of figure
handles, labels = ax.get_legend_handles_labels()
handles.append(lines)
labels = [
    '1970 Population',
    '1970 - 1980',
    '1980 - 1990',
    '1990 - 2000',
    '2000 - 2010',
    '2010 - 2015',
    '2015 - 2020',
    '2020 - 2022',
    '2022 Population'
]
ax.legend(handles, labels, loc='lower right')

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/most_populated_countries_2022.png', bbox_inches='tight', dpi=300)

Rinse and repeat for the other half of the data. By using a second figure for the less-populated half of the dataset, we reset the scale on the x-axis and dramatically improve the explanatory power of the visualisations.

[30]:

# Get the columns with the population data
pop_cols = df.columns[df.columns.str.endswith('Population')]

# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Rank', ascending=True)[pop_cols]  # Sort by 'Rank' and keep only the columns with population data
 .iloc[-117:]  # Choose the 117 LEAST populated countries with location-based indexing
 .loc[:, lambda df_: reversed(df_.columns)]  # Reverse the order of the columns in a fancy way
)

# Pull out the base population at 1970
base_pop_1970 = data['1970 Population']

# Calculate a column-wise differential. This means the data will
# now reflect change from the previous timepoint, rather than the
# total population at that time.
data = data.diff(axis=1)

# Put the base population back in
data['1970 Population'] = base_pop_1970

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
# Note the reversal of the data with [::-1], which is done
# to make sure the longest bars are at the top
data[::-1].plot(kind='barh', stacked=True, ax=ax)
lines = ax.hlines(range(0,117), 0, data[::-1].sum(axis=1).values, color='k', lw=1.5)

# Add the numbers as text
for i, (country, row) in enumerate(data[::-1].iterrows()):
    pop = row.sum()
    text_pos = row[row > 0].sum()
    ax.text(text_pos+1e5, i, '{:,}'.format(pop), va='center', fontsize=8)
    text_pos_neg = row[row < 0].abs().sum()
    if text_pos_neg > 0:
        ax.text(-(text_pos_neg+1e5), i, '{:,}'.format(text_pos_neg), va='center', ha='right', fontsize=8)

# Add a vertical line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Format the x axis
def millions_formatter(x, pos):
    return f'{x / 1_000_000}'
ax.xaxis.set_major_formatter(plt.FuncFormatter(millions_formatter))

# Tweak axis
ax.set(
    ylabel='',
    title='Population change in the 117 least populated countries/territories (1970-2022)',
    xlabel='Population in millions',
    xlim=(-3e6, 8e6)
)

# Put legend in lower right of figure
handles, labels = ax.get_legend_handles_labels()
handles.append(lines)
labels = [
    '1970 Population',
    '1970 - 1980',
    '1980 - 1990',
    '1990 - 2000',
    '2000 - 2010',
    '2010 - 2015',
    '2015 - 2020',
    '2020 - 2022',
    '2022 Population'
]
ax.legend(handles, labels, loc='lower right')


#fig.suptitle('Population change in the 117 least populated countries/territories (1970-2022)')

#ax.text(-5.5e5, 117, 'Decline | Growth')

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/least_populated_countries_2022.png', bbox_inches='tight', dpi=300)

The big picture - population density¶

After plotting the above, I realised I could do the same using only the Density (per km²) column. This was a little bit easier as it didn’t require stacking the bars.

[31]:

# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Density (per km²)', ascending=True)
 .iloc[-117:]  # Choose the 117 MOST densely populated countries with location-based indexing
)

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
data.plot(kind='barh', y='Density (per km²)', ax=ax, legend=False)

# Add the numbers in text
for i, (country, row) in enumerate(data.iterrows()):
    pop = row['Density (per km²)']
    ax.text(pop+100, i, '{:,}'.format(round(pop, 2)), va='center', fontsize=8)

# Add a vertical line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Tweak axis
ax.set(
    ylabel='',
    title='Population density in the 117 most populated countries/territories',
    xlabel='Population (per km²)',
    xlim=(0, 25000)
)

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/most_dense_countries_2022.png', bbox_inches='tight', dpi=300)

What’s it like to be one of 23,172 people in a square kilometer of land? Go to Macau and you’ll find out. Also, I didn’t realise Gibralter was so dense.

Now for the other half…

[32]:

# Sort and select data
data = (df
 .set_index('Country')  # Set 'Country' as the index
 .sort_values('Density (per km²)', ascending=True)
 .iloc[0:117]  # Choose the 117 LEAST densely populated countries with location-based indexing
)

# Make a figure and axis that's big enough to show the data
fig, ax = plt.subplots(figsize=(12, 24))

# Plot a horizontal bar chart with pandas plotting method.
data.plot(kind='barh', y='Density (per km²)', ax=ax, legend=False)

# Add the numbers in text
for i, (country, row) in enumerate(data.iterrows()):
    pop = row['Density (per km²)']
    ax.text(pop+1, i, '{:,}'.format(round(pop, 2)), va='center', fontsize=8)

# Add a vertical line at zero as a visual aid
ax.axvline(0, 0, 1, c='k', lw=.5)

# Tweak axis
ax.set(
    ylabel='',
    title='Population density in the 117 least populated countries/territories',
    xlabel='Population (per km²)',
    xlim=(0, 100)
)

# Save the figure with a tight bounding box and high resolution (300 dots-per-inch)
fig.savefig('../images/least_dense_countries_2022.png', bbox_inches='tight', dpi=300)

Iceland… A beautiful country, where a kilometer of land is shared by only 3.62 people.

That’s it for now. These examples are complicated, but if you study them carefully, change bits, and come up with your own variations, you will learn a lot about Python and pandas in the process! If you are feeling adventures, why not try and replicate these plots for the Area (km²) column? In this situation, countries like Russia and Canada would be right at the top, and at the bottom would be The Vatican City and small island nations like the Falkland and Faroe Islands.

[ ]: