Exploring and visualizing the CIA World Factbook using SQL and Python¶

The World Factbook is a compendium of data collected by the CIA from 267 countries around the world. The data consists in information on population, geography, culture, politics and many more fields.

In this project, we are going to explore and get some insight in the information of this dataset by using SQL and python for visualization.

The factbook database is from the year 2015 and is available in here.

Data Import¶

We are going to import and have a quick look of the database using sqlite.

import sqlite3
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="http")

import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()

# connection to the factbook database 
conn = sqlite3.connect("factbook.db")
cursor = conn.cursor()

# query to see the table content in the factbook.db
q1 = '''
SELECT *
  FROM sqlite_master
 WHERE type='table';
 '''
pd.read_sql_query(q1, conn)

# overview of the facts table in the factbook database
q2 = '''
SELECT *
  FROM facts
 LIMIT 5;
 '''
pd.read_sql_query(q2, conn)

As seen in the table above, each row in the database consist of one country with several information of it. The columns names are self-descriptive.

Data Preparation¶

First, we are going to run some summary statistics to look for outliers.

q3 = '''
SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
  FROM facts;
'''
pd.read_sql_query(q3, conn)

There are in fact some outliers:

There is one country with no population (0).
There is one country with nearly 7.2 billion people.

Let's see which countries are responsible for these values.

q4 = '''
SELECT *
  FROM facts
 WHERE population == (SELECT MIN(population) FROM facts)
    OR population == (SELECT MAX(population) FROM facts);
'''
pd.read_sql_query(q4, conn)

The country with no population is Antartica, as there is in fact no permanent population in the continent, there are just temporal science and militar populations over there.

And the country with the highest population is a World row, with values of all the world.

Convert to pandas DataFrame¶

With the outliers identified, now we are going to create a pandas dataframe based on the facts table for visualization.

# query to obtain the data to convert to DataFrame without the World row
q5 = '''
SELECT *
  FROM facts
 WHERE name <> 'World';
'''
factb = pd.read_sql_query(q5, conn)
factb.head(10)

With the Dataframe, now we are going to search for null values in the dataset.

factb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 260 non-null    int64  
 1   code               260 non-null    object 
 2   name               260 non-null    object 
 3   area               249 non-null    float64
 4   area_land          246 non-null    float64
 5   area_water         243 non-null    float64
 6   population         241 non-null    float64
 7   population_growth  235 non-null    float64
 8   birth_rate         227 non-null    float64
 9   death_rate         227 non-null    float64
 10  migration_rate     223 non-null    float64
dtypes: float64(8), int64(1), object(2)
memory usage: 22.5+ KB

Let's investigate the countries with a null value in the population column.

factb[factb['population'].isnull()]

As it can be seen above, these rows are unhabitated islands and oceans, so we are going to remove them from the dataset.

factb = factb[factb['population'].notnull()]
factb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 0 to 254
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 241 non-null    int64  
 1   code               241 non-null    object 
 2   name               241 non-null    object 
 3   area               237 non-null    float64
 4   area_land          234 non-null    float64
 5   area_water         231 non-null    float64
 6   population         241 non-null    float64
 7   population_growth  235 non-null    float64
 8   birth_rate         227 non-null    float64
 9   death_rate         227 non-null    float64
 10  migration_rate     223 non-null    float64
dtypes: float64(8), int64(1), object(2)
memory usage: 22.6+ KB

There are still some null values for the population_growth, birth_rate, death_rate and migration_rate but we are not going to remove these countries from the dataset.

For last, we are going to review the area column in order to remove null values.

factb[factb['area'].isnull()]

For Chad and Niger we can get the area using area_land and area_water, and for Saint Barthelemy and Antartica we can remove the rows.

factb.loc[factb['area'].isnull(), 'area'] = factb['area_land'] + factb['area_water']
factb = factb[factb['area'].notnull()]

With the null values of the area column removed, let's look for outliers in the same column.

factb['area'].describe()

count    2.390000e+02
mean     5.887453e+05
std      1.769758e+06
min      0.000000e+00
25%      2.137500e+03
50%      7.174000e+04
75%      4.225345e+05
max      1.709824e+07
Name: area, dtype: float64

factb[factb['area']==0]

There is one outlier with 0 area, this is the Vatican City. The reason of the 0 area should be due roundin, as the Vatican has less than a squared kilometer in area. We are going to remove this row.

factb.drop(189, inplace=True)

Data analysis¶

With the dataset cleaned, now we are going to analyze and visualize the countries in terms of:

The highest Population
The highest Population Density

Population¶

We are going to determine which are the top 10 countries with the highest population in 2015, and after that we are going to visualize them in a map.

high_pop = factb.sort_values(by='population', ascending=False, ignore_index=True).loc[:9,['name', 'population']]

high_pop

As seen in the table above, the European Union appears in the factbook database but we are interested in countries not in groups of countries, so we are going to remove that row.

factb[factb['name']=='European Union']

factb.drop(196, inplace=True)

high_pop = factb.sort_values(by='population', ascending=False, ignore_index=True).loc[:9,['name', 'population']]
# we are going to express the population in millions
high_pop['population'] = round(high_pop['population']/1000000,1)

high_pop

As we are going to use Plotly Scatter_Mapbox, we need to have the latitude and longitude for each country. To obtain these coordinates, the geopy module is going to be use.

Note: We are using the geopy module in this small part of the dataset, as it easiest to handle than in all the rows of the factbook dataset.

# functions to obtain latitude and longitude
def geo_lat(country):
    loc = geolocator.geocode(country)
    return loc.latitude
def geo_lon(country):
    loc = geolocator.geocode(country)
    return loc.longitude

high_pop['lat'] = high_pop.name.apply(geo_lat)
high_pop['lon'] = high_pop.name.apply(geo_lon)

high_pop

px.set_mapbox_access_token('pk.eyJ1IjoicGFsYXZpY2kiLCJhIjoiY2tlajUwa2Y2MDF1ZjJ6bzg0Y3Ryb2Z3bSJ9.d1_EzLVQiv1ap5XuOxdhsg') #there is a need for a personal mapbox_tocken to use recreate this notebook.

fig = px.scatter_mapbox(high_pop, lat="lat", lon="lon", size='population', color='name', zoom=0, mapbox_style='light',
                       labels={'name':'Countries'}, title='Countries with the highest population',
                       hover_data={'name':False,'population': False, 'lat': False, 'lon':False},
                       custom_data=['name','population'], text='population')

fig.update_layout(
    title={
        'y':0.85,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_traces(
    hovertemplate='<b>%{customdata[0]}</b>' + "<br>Pop:%{customdata[1]} millions<extra></extra>",
    texttemplate='%{customdata[1]} mill', textposition='middle center')
    

plotly.offline.iplot(fig, filename='countries_population')

The countries with the highest populations are China (1,367 millions) followed by India (1,251 millions).
There is a high concentration of countries in Asia with a big population, 6 of the top 10 countries in terms of population are in Asia.
United States is the third country in terms of population and in America is followed by Brazil in 5th place.

Population density¶

To determine which countries have the highest population density, first we need to calculate it, this is done using this formula:

$population\;density = \cfrac{population}{land\;area}$

The population density is expressed in population per squared kilometer.

factb['pop_density'] = round(factb['population']/factb['area_land'])
high_pop_den = factb.sort_values(by='pop_density', ascending=False, ignore_index=True).loc[:9,['name', 'pop_density']]
high_pop_den['lat'] = high_pop_den.name.apply(geo_lat)
high_pop_den['lon'] = high_pop_den.name.apply(geo_lon)
high_pop_den

fig = px.scatter_mapbox(high_pop_den, lat="lat", lon="lon", size=np.repeat(1,10), color='pop_density', zoom=0, mapbox_style='light',
                       labels={'pop_density':'Population Density'}, title='Countries with the highest population density',
                       hover_data={'name':False,'pop_density': False, 'lat': False, 'lon':False}, size_max=8,
                       custom_data=['name','pop_density'], color_continuous_scale='rainbow')

fig.update_layout(
    title={
        'y':0.85,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_traces(
    hovertemplate='<b>%{customdata[0]}</b>' + "<br>Pop_den:%{customdata[1]}/km2<extra></extra>")


plotly.offline.iplot(fig, filename='countries_population_den')

The highest densely populated country/region is Macau, having an extraordinary 21,169 people per squared kilometer.
The second country with the highest population density is the Principality of Monaco in Europe with 15,268 people per squared kilometer.
The last of the top densely populated countries is Bermuda in America.

Conclusion¶

The country with the highest population by 2015 is China with 1,367 million people, and also within China is the place with the highest population density, Macau.

In America, we have the United States as the country with the biggest population, 321 million people, and Bermuda being the 10 highest country in terms of population density.

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
0	1	af	Afghanistan	652230	652230	0	32564342	2.32	38.57	13.89	1.51
1	2	al	Albania	28748	27398	1350	3029278	0.30	12.92	6.58	3.30
2	3	ag	Algeria	2381741	2381741	0	39542166	1.84	23.67	4.31	0.92
3	4	an	Andorra	468	468	0	85580	0.12	8.13	6.96	0.00
4	5	ao	Angola	1246700	1246700	0	19625353	2.78	38.78	11.49	0.46

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
0	250	ay	Antarctica	None	280000.0	None	0	NaN	NaN	NaN	None
1	261	xx	World	None	NaN	None	7256490011	1.08	18.6	7.8	None

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
0	1	af	Afghanistan	652230.0	652230.0	0.0	32564342.0	2.32	38.57	13.89	1.51
1	2	al	Albania	28748.0	27398.0	1350.0	3029278.0	0.30	12.92	6.58	3.30
2	3	ag	Algeria	2381741.0	2381741.0	0.0	39542166.0	1.84	23.67	4.31	0.92
3	4	an	Andorra	468.0	468.0	0.0	85580.0	0.12	8.13	6.96	0.00
4	5	ao	Angola	1246700.0	1246700.0	0.0	19625353.0	2.78	38.78	11.49	0.46
5	6	ac	Antigua and Barbuda	442.0	442.0	0.0	92436.0	1.24	15.85	5.69	2.21
6	7	ar	Argentina	2780400.0	2736690.0	43710.0	43431886.0	0.93	16.64	7.33	0.00
7	8	am	Armenia	29743.0	28203.0	1540.0	3056382.0	0.15	13.61	9.34	5.80
8	9	as	Australia	7741220.0	7682300.0	58920.0	22751014.0	1.07	12.15	7.14	5.65
9	10	au	Austria	83871.0	82445.0	1426.0	8665550.0	0.55	9.41	9.42	5.56

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
197	198	at	Ashmore and Cartier Islands	5.0	5.0	0.0	NaN	NaN	NaN	NaN	NaN
200	201	cr	Coral Sea Islands	3.0	3.0	0.0	NaN	NaN	NaN	NaN	NaN
201	202	hm	Heard Island and McDonald Islands	412.0	412.0	0.0	NaN	NaN	NaN	NaN	NaN
207	208	ip	Clipperton Island	6.0	6.0	0.0	NaN	NaN	NaN	NaN	NaN
209	210	fs	French Southern and Antarctic Lands	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
221	222	bv	Bouvet Island	49.0	49.0	0.0	NaN	NaN	NaN	NaN	NaN
222	223	jn	Jan Mayen	377.0	377.0	0.0	NaN	NaN	NaN	NaN	NaN
227	228	io	British Indian Ocean Territory	54400.0	60.0	54340.0	NaN	NaN	NaN	NaN	NaN
239	240	sx	South Georgia and South Sandwich Islands	3903.0	3903.0	0.0	NaN	NaN	NaN	NaN	NaN
243	244	bq	Navassa Island	5.0	5.0	0.0	NaN	NaN	NaN	NaN	NaN
247	248	wq	Wake Island	6.0	6.0	0.0	NaN	NaN	NaN	NaN	NaN
248	249	um	United States Pacific Island Wildlife Refuges	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
251	252	pf	Paracel Islands	7.0	7.0	0.0	NaN	NaN	NaN	NaN	NaN
252	253	pg	Spratly Islands	5.0	5.0	0.0	NaN	NaN	NaN	NaN	NaN
255	256	xq	Arctic Ocean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
256	257	zh	Atlantic Ocean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
257	258	xo	Indian Ocean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
258	259	zn	Pacific Ocean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
259	260	oo	Southern Ocean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
34	35	cd	Chad	NaN	1259200.0	24800.0	11631456.0	1.89	36.60	14.28	3.45
127	128	ng	Niger	NaN	1266700.0	300.0	18045729.0	3.25	45.45	12.42	0.56
211	212	tb	Saint Barthelemy	NaN	NaN	NaN	7237.0	NaN	NaN	NaN	NaN
249	250	ay	Antarctica	NaN	280000.0	NaN	0.0	NaN	NaN	NaN	NaN

	type	name	tbl_name	rootpage	sql
0	table	sqlite_sequence	sqlite_sequence	3	CREATE TABLE sqlite_sequence(name,seq)
1	table	facts	facts	47	CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...

	name	population
0	China	1.367485e+09
1	India	1.251696e+09
2	European Union	5.139494e+08
3	United States	3.213689e+08
4	Indonesia	2.559937e+08
5	Brazil	2.042598e+08
6	Pakistan	1.990858e+08
7	Nigeria	1.815621e+08
8	Bangladesh	1.689577e+08
9	Russia	1.424238e+08

	name	population
0	China	1367.5
1	India	1251.7
2	United States	321.4
3	Indonesia	256.0
4	Brazil	204.3
5	Pakistan	199.1
6	Nigeria	181.6
7	Bangladesh	169.0
8	Russia	142.4
9	Japan	126.9

	name	population	lat	lon
0	China	1367.5	35.000074	104.999927
1	India	1251.7	22.351115	78.667743
2	United States	321.4	39.783730	-100.445882
3	Indonesia	256.0	-2.483383	117.890285
4	Brazil	204.3	-10.333333	-53.200000
5	Pakistan	199.1	30.330840	71.247499
6	Nigeria	181.6	9.600036	7.999972
7	Bangladesh	169.0	24.476878	90.293243
8	Russia	142.4	64.686314	97.745306
9	Japan	126.9	36.574844	139.239418

	name	pop_density	lat	lon
0	Macau	21169.0	22.175760	113.551414
1	Monaco	15268.0	43.732349	7.427683
2	Singapore	8260.0	1.340863	103.830392
3	Hong Kong	6655.0	22.279328	114.162813
4	Gaza Strip	5192.0	31.433166	34.377929
5	Gibraltar	4876.0	36.140807	-5.354129
6	Bahrain	1772.0	26.155125	50.534461
7	Maldives	1320.0	4.706435	73.328785
8	Malta	1310.0	35.888599	14.447691
9	Bermuda	1300.0	32.301822	-64.760358