Exploring and visualizing the CIA World Factbook using SQL and Python

The World Factbook is a compendium of data collected by the CIA from 267 countries around the world. The data consists in information on population, geography, culture, politics and many more fields.

In this project, we are going to explore and get some insight in the information of this dataset by using SQL and python for visualization.

The factbook database is from the year 2015 and is available in here.

Data Import

We are going to import and have a quick look of the database using sqlite.

In [1]:
import sqlite3
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="http")

import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()

# connection to the factbook database 
conn = sqlite3.connect("factbook.db")
cursor = conn.cursor()
In [2]:
# query to see the table content in the factbook.db
q1 = '''
SELECT *
  FROM sqlite_master
 WHERE type='table';
 '''
pd.read_sql_query(q1, conn)
Out[2]:
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
1 table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...
In [3]:
# overview of the facts table in the factbook database
q2 = '''
SELECT *
  FROM facts
 LIMIT 5;
 '''
pd.read_sql_query(q2, conn)
Out[3]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
0 1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
1 2 al Albania 28748 27398 1350 3029278 0.30 12.92 6.58 3.30
2 3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
3 4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.00
4 5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

As seen in the table above, each row in the database consist of one country with several information of it. The columns names are self-descriptive.

Data Preparation

First, we are going to run some summary statistics to look for outliers.

In [4]:
q3 = '''
SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
  FROM facts;
'''
pd.read_sql_query(q3, conn)
Out[4]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 0 7256490011 0.0 4.02

There are in fact some outliers:

  • There is one country with no population (0).
  • There is one country with nearly 7.2 billion people.

Let's see which countries are responsible for these values.

In [5]:
q4 = '''
SELECT *
  FROM facts
 WHERE population == (SELECT MIN(population) FROM facts)
    OR population == (SELECT MAX(population) FROM facts);
'''
pd.read_sql_query(q4, conn)
Out[5]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
0 250 ay Antarctica None 280000.0 None 0 NaN NaN NaN None
1 261 xx World None NaN None 7256490011 1.08 18.6 7.8 None

The country with no population is Antartica, as there is in fact no permanent population in the continent, there are just temporal science and militar populations over there.

And the country with the highest population is a World row, with values of all the world.

Convert to pandas DataFrame

With the outliers identified, now we are going to create a pandas dataframe based on the facts table for visualization.

In [6]:
# query to obtain the data to convert to DataFrame without the World row
q5 = '''
SELECT *
  FROM facts
 WHERE name <> 'World';
'''
factb = pd.read_sql_query(q5, conn)
factb.head(10)
Out[6]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
0 1 af Afghanistan 652230.0 652230.0 0.0 32564342.0 2.32 38.57 13.89 1.51
1 2 al Albania 28748.0 27398.0 1350.0 3029278.0 0.30 12.92 6.58 3.30
2 3 ag Algeria 2381741.0 2381741.0 0.0 39542166.0 1.84 23.67 4.31 0.92
3 4 an Andorra 468.0 468.0 0.0 85580.0 0.12 8.13 6.96 0.00
4 5 ao Angola 1246700.0 1246700.0 0.0 19625353.0 2.78 38.78 11.49 0.46
5 6 ac Antigua and Barbuda 442.0 442.0 0.0 92436.0 1.24 15.85 5.69 2.21
6 7 ar Argentina 2780400.0 2736690.0 43710.0 43431886.0 0.93 16.64 7.33 0.00
7 8 am Armenia 29743.0 28203.0 1540.0 3056382.0 0.15 13.61 9.34 5.80
8 9 as Australia 7741220.0 7682300.0 58920.0 22751014.0 1.07 12.15 7.14 5.65
9 10 au Austria 83871.0 82445.0 1426.0 8665550.0 0.55 9.41 9.42 5.56

With the Dataframe, now we are going to search for null values in the dataset.

In [7]:
factb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 260 non-null    int64  
 1   code               260 non-null    object 
 2   name               260 non-null    object 
 3   area               249 non-null    float64
 4   area_land          246 non-null    float64
 5   area_water         243 non-null    float64
 6   population         241 non-null    float64
 7   population_growth  235 non-null    float64
 8   birth_rate         227 non-null    float64
 9   death_rate         227 non-null    float64
 10  migration_rate     223 non-null    float64
dtypes: float64(8), int64(1), object(2)
memory usage: 22.5+ KB

Let's investigate the countries with a null value in the population column.

In [8]:
factb[factb['population'].isnull()]
Out[8]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
197 198 at Ashmore and Cartier Islands 5.0 5.0 0.0 NaN NaN NaN NaN NaN
200 201 cr Coral Sea Islands 3.0 3.0 0.0 NaN NaN NaN NaN NaN
201 202 hm Heard Island and McDonald Islands 412.0 412.0 0.0 NaN NaN NaN NaN NaN
207 208 ip Clipperton Island 6.0 6.0 0.0 NaN NaN NaN NaN NaN
209 210 fs French Southern and Antarctic Lands NaN NaN NaN NaN NaN NaN NaN NaN
221 222 bv Bouvet Island 49.0 49.0 0.0 NaN NaN NaN NaN NaN
222 223 jn Jan Mayen 377.0 377.0 0.0 NaN NaN NaN NaN NaN
227 228 io British Indian Ocean Territory 54400.0 60.0 54340.0 NaN NaN NaN NaN NaN
239 240 sx South Georgia and South Sandwich Islands 3903.0 3903.0 0.0 NaN NaN NaN NaN NaN
243 244 bq Navassa Island 5.0 5.0 0.0 NaN NaN NaN NaN NaN
247 248 wq Wake Island 6.0 6.0 0.0 NaN NaN NaN NaN NaN
248 249 um United States Pacific Island Wildlife Refuges NaN NaN NaN NaN NaN NaN NaN NaN
251 252 pf Paracel Islands 7.0 7.0 0.0 NaN NaN NaN NaN NaN
252 253 pg Spratly Islands 5.0 5.0 0.0 NaN NaN NaN NaN NaN
255 256 xq Arctic Ocean NaN NaN NaN NaN NaN NaN NaN NaN
256 257 zh Atlantic Ocean NaN NaN NaN NaN NaN NaN NaN NaN
257 258 xo Indian Ocean NaN NaN NaN NaN NaN NaN NaN NaN
258 259 zn Pacific Ocean NaN NaN NaN NaN NaN NaN NaN NaN
259 260 oo Southern Ocean NaN NaN NaN NaN NaN NaN NaN NaN

As it can be seen above, these rows are unhabitated islands and oceans, so we are going to remove them from the dataset.

In [9]:
factb = factb[factb['population'].notnull()]
factb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 0 to 254
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 241 non-null    int64  
 1   code               241 non-null    object 
 2   name               241 non-null    object 
 3   area               237 non-null    float64
 4   area_land          234 non-null    float64
 5   area_water         231 non-null    float64
 6   population         241 non-null    float64
 7   population_growth  235 non-null    float64
 8   birth_rate         227 non-null    float64
 9   death_rate         227 non-null    float64
 10  migration_rate     223 non-null    float64
dtypes: float64(8), int64(1), object(2)
memory usage: 22.6+ KB

There are still some null values for the population_growth, birth_rate, death_rate and migration_rate but we are not going to remove these countries from the dataset.

For last, we are going to review the area column in order to remove null values.

In [10]:
factb[factb['area'].isnull()]
Out[10]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
34 35 cd Chad NaN 1259200.0 24800.0 11631456.0 1.89 36.60 14.28 3.45
127 128 ng Niger NaN 1266700.0 300.0 18045729.0 3.25 45.45 12.42 0.56
211 212 tb Saint Barthelemy NaN NaN NaN 7237.0 NaN NaN NaN NaN
249 250 ay Antarctica NaN 280000.0 NaN 0.0 NaN NaN NaN NaN

For Chad and Niger we can get the area using area_land and area_water, and for Saint Barthelemy and Antartica we can remove the rows.

In [11]:
factb.loc[factb['area'].isnull(), 'area'] = factb['area_land'] + factb['area_water']
factb = factb[factb['area'].notnull()]

With the null values of the area column removed, let's look for outliers in the same column.

In [12]:
factb['area'].describe()
Out[12]:
count    2.390000e+02
mean     5.887453e+05
std      1.769758e+06
min      0.000000e+00
25%      2.137500e+03
50%      7.174000e+04
75%      4.225345e+05
max      1.709824e+07
Name: area, dtype: float64
In [13]:
factb[factb['area']==0]
Out[13]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
189 190 vt Holy See (Vatican City) 0.0 0.0 0.0 842.0 0.0 NaN NaN NaN

There is one outlier with 0 area, this is the Vatican City. The reason of the 0 area should be due roundin, as the Vatican has less than a squared kilometer in area. We are going to remove this row.

In [14]:
factb.drop(189, inplace=True)

Data analysis

With the dataset cleaned, now we are going to analyze and visualize the countries in terms of:

  • The highest Population
  • The highest Population Density

Population

We are going to determine which are the top 10 countries with the highest population in 2015, and after that we are going to visualize them in a map.

In [15]:
high_pop = factb.sort_values(by='population', ascending=False, ignore_index=True).loc[:9,['name', 'population']]
In [16]:
high_pop
Out[16]:
name population
0 China 1.367485e+09
1 India 1.251696e+09
2 European Union 5.139494e+08
3 United States 3.213689e+08
4 Indonesia 2.559937e+08
5 Brazil 2.042598e+08
6 Pakistan 1.990858e+08
7 Nigeria 1.815621e+08
8 Bangladesh 1.689577e+08
9 Russia 1.424238e+08

As seen in the table above, the European Union appears in the factbook database but we are interested in countries not in groups of countries, so we are going to remove that row.

In [17]:
factb[factb['name']=='European Union']
Out[17]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
196 197 ee European Union 4324782.0 NaN NaN 513949445.0 0.25 10.2 10.2 2.5
In [18]:
factb.drop(196, inplace=True)
In [19]:
high_pop = factb.sort_values(by='population', ascending=False, ignore_index=True).loc[:9,['name', 'population']]
# we are going to express the population in millions
high_pop['population'] = round(high_pop['population']/1000000,1)
In [20]:
high_pop
Out[20]:
name population
0 China 1367.5
1 India 1251.7
2 United States 321.4
3 Indonesia 256.0
4 Brazil 204.3
5 Pakistan 199.1
6 Nigeria 181.6
7 Bangladesh 169.0
8 Russia 142.4
9 Japan 126.9

As we are going to use Plotly Scatter_Mapbox, we need to have the latitude and longitude for each country. To obtain these coordinates, the geopy module is going to be use.

Note: We are using the geopy module in this small part of the dataset, as it easiest to handle than in all the rows of the factbook dataset.

In [21]:
# functions to obtain latitude and longitude
def geo_lat(country):
    loc = geolocator.geocode(country)
    return loc.latitude
def geo_lon(country):
    loc = geolocator.geocode(country)
    return loc.longitude
In [22]:
high_pop['lat'] = high_pop.name.apply(geo_lat)
high_pop['lon'] = high_pop.name.apply(geo_lon)
In [23]:
high_pop
Out[23]:
name population lat lon
0 China 1367.5 35.000074 104.999927
1 India 1251.7 22.351115 78.667743
2 United States 321.4 39.783730 -100.445882
3 Indonesia 256.0 -2.483383 117.890285
4 Brazil 204.3 -10.333333 -53.200000
5 Pakistan 199.1 30.330840 71.247499
6 Nigeria 181.6 9.600036 7.999972
7 Bangladesh 169.0 24.476878 90.293243
8 Russia 142.4 64.686314 97.745306
9 Japan 126.9 36.574844 139.239418
In [27]:
px.set_mapbox_access_token('pk.eyJ1IjoicGFsYXZpY2kiLCJhIjoiY2tlajUwa2Y2MDF1ZjJ6bzg0Y3Ryb2Z3bSJ9.d1_EzLVQiv1ap5XuOxdhsg') #there is a need for a personal mapbox_tocken to use recreate this notebook.

fig = px.scatter_mapbox(high_pop, lat="lat", lon="lon", size='population', color='name', zoom=0, mapbox_style='light',
                       labels={'name':'Countries'}, title='Countries with the highest population',
                       hover_data={'name':False,'population': False, 'lat': False, 'lon':False},
                       custom_data=['name','population'], text='population')

fig.update_layout(
    title={
        'y':0.85,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_traces(
    hovertemplate='<b>%{customdata[0]}</b>' + "<br>Pop:%{customdata[1]} millions<extra></extra>",
    texttemplate='%{customdata[1]} mill', textposition='middle center')
    

plotly.offline.iplot(fig, filename='countries_population')
  • The countries with the highest populations are China (1,367 millions) followed by India (1,251 millions).
  • There is a high concentration of countries in Asia with a big population, 6 of the top 10 countries in terms of population are in Asia.
  • United States is the third country in terms of population and in America is followed by Brazil in 5th place.

Population density

To determine which countries have the highest population density, first we need to calculate it, this is done using this formula:

$population\;density = \cfrac{population}{land\;area}$

The population density is expressed in population per squared kilometer.

In [25]:
factb['pop_density'] = round(factb['population']/factb['area_land'])
high_pop_den = factb.sort_values(by='pop_density', ascending=False, ignore_index=True).loc[:9,['name', 'pop_density']]
high_pop_den['lat'] = high_pop_den.name.apply(geo_lat)
high_pop_den['lon'] = high_pop_den.name.apply(geo_lon)
high_pop_den
Out[25]:
name pop_density lat lon
0 Macau 21169.0 22.175760 113.551414
1 Monaco 15268.0 43.732349 7.427683
2 Singapore 8260.0 1.340863 103.830392
3 Hong Kong 6655.0 22.279328 114.162813
4 Gaza Strip 5192.0 31.433166 34.377929
5 Gibraltar 4876.0 36.140807 -5.354129
6 Bahrain 1772.0 26.155125 50.534461
7 Maldives 1320.0 4.706435 73.328785
8 Malta 1310.0 35.888599 14.447691
9 Bermuda 1300.0 32.301822 -64.760358
In [28]:
fig = px.scatter_mapbox(high_pop_den, lat="lat", lon="lon", size=np.repeat(1,10), color='pop_density', zoom=0, mapbox_style='light',
                       labels={'pop_density':'Population Density'}, title='Countries with the highest population density',
                       hover_data={'name':False,'pop_density': False, 'lat': False, 'lon':False}, size_max=8,
                       custom_data=['name','pop_density'], color_continuous_scale='rainbow')

fig.update_layout(
    title={
        'y':0.85,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_traces(
    hovertemplate='<b>%{customdata[0]}</b>' + "<br>Pop_den:%{customdata[1]}/km2<extra></extra>")


plotly.offline.iplot(fig, filename='countries_population_den')
  • The highest densely populated country/region is Macau, having an extraordinary 21,169 people per squared kilometer.
  • The second country with the highest population density is the Principality of Monaco in Europe with 15,268 people per squared kilometer.
  • The last of the top densely populated countries is Bermuda in America.

Conclusion

The country with the highest population by 2015 is China with 1,367 million people, and also within China is the place with the highest population density, Macau.

In America, we have the United States as the country with the biggest population, 321 million people, and Bermuda being the 10 highest country in terms of population density.