The World Factbook is a compendium of data collected by the CIA from 267 countries around the world. The data consists in information on population, geography, culture, politics and many more fields.
In this project, we are going to explore and get some insight in the information of this dataset by using SQL and python for visualization.
The factbook database is from the year 2015 and is available in here.
We are going to import and have a quick look of the database using sqlite.
import sqlite3
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="http")
import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()
# connection to the factbook database
conn = sqlite3.connect("factbook.db")
cursor = conn.cursor()
# query to see the table content in the factbook.db
q1 = '''
SELECT *
FROM sqlite_master
WHERE type='table';
'''
pd.read_sql_query(q1, conn)
# overview of the facts table in the factbook database
q2 = '''
SELECT *
FROM facts
LIMIT 5;
'''
pd.read_sql_query(q2, conn)
As seen in the table above, each row in the database consist of one country with several information of it. The columns names are self-descriptive.
First, we are going to run some summary statistics to look for outliers.
q3 = '''
SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
FROM facts;
'''
pd.read_sql_query(q3, conn)
There are in fact some outliers:
Let's see which countries are responsible for these values.
q4 = '''
SELECT *
FROM facts
WHERE population == (SELECT MIN(population) FROM facts)
OR population == (SELECT MAX(population) FROM facts);
'''
pd.read_sql_query(q4, conn)
The country with no population is Antartica, as there is in fact no permanent population in the continent, there are just temporal science and militar populations over there.
And the country with the highest population is a World row, with values of all the world.
With the outliers identified, now we are going to create a pandas dataframe based on the facts table for visualization.
# query to obtain the data to convert to DataFrame without the World row
q5 = '''
SELECT *
FROM facts
WHERE name <> 'World';
'''
factb = pd.read_sql_query(q5, conn)
factb.head(10)
With the Dataframe, now we are going to search for null values in the dataset.
factb.info()
Let's investigate the countries with a null value in the population column.
factb[factb['population'].isnull()]
As it can be seen above, these rows are unhabitated islands and oceans, so we are going to remove them from the dataset.
factb = factb[factb['population'].notnull()]
factb.info()
There are still some null values for the population_growth, birth_rate, death_rate and migration_rate but we are not going to remove these countries from the dataset.
For last, we are going to review the area column in order to remove null values.
factb[factb['area'].isnull()]
For Chad and Niger we can get the area using area_land and area_water, and for Saint Barthelemy and Antartica we can remove the rows.
factb.loc[factb['area'].isnull(), 'area'] = factb['area_land'] + factb['area_water']
factb = factb[factb['area'].notnull()]
With the null values of the area column removed, let's look for outliers in the same column.
factb['area'].describe()
factb[factb['area']==0]
There is one outlier with 0 area, this is the Vatican City. The reason of the 0 area should be due roundin, as the Vatican has less than a squared kilometer in area. We are going to remove this row.
factb.drop(189, inplace=True)
With the dataset cleaned, now we are going to analyze and visualize the countries in terms of:
We are going to determine which are the top 10 countries with the highest population in 2015, and after that we are going to visualize them in a map.
high_pop = factb.sort_values(by='population', ascending=False, ignore_index=True).loc[:9,['name', 'population']]
high_pop
As seen in the table above, the European Union appears in the factbook database but we are interested in countries not in groups of countries, so we are going to remove that row.
factb[factb['name']=='European Union']
factb.drop(196, inplace=True)
high_pop = factb.sort_values(by='population', ascending=False, ignore_index=True).loc[:9,['name', 'population']]
# we are going to express the population in millions
high_pop['population'] = round(high_pop['population']/1000000,1)
high_pop
As we are going to use Plotly Scatter_Mapbox, we need to have the latitude and longitude for each country. To obtain these coordinates, the geopy module is going to be use.
Note: We are using the geopy module in this small part of the dataset, as it easiest to handle than in all the rows of the factbook dataset.
# functions to obtain latitude and longitude
def geo_lat(country):
loc = geolocator.geocode(country)
return loc.latitude
def geo_lon(country):
loc = geolocator.geocode(country)
return loc.longitude
high_pop['lat'] = high_pop.name.apply(geo_lat)
high_pop['lon'] = high_pop.name.apply(geo_lon)
high_pop
px.set_mapbox_access_token('pk.eyJ1IjoicGFsYXZpY2kiLCJhIjoiY2tlajUwa2Y2MDF1ZjJ6bzg0Y3Ryb2Z3bSJ9.d1_EzLVQiv1ap5XuOxdhsg') #there is a need for a personal mapbox_tocken to use recreate this notebook.
fig = px.scatter_mapbox(high_pop, lat="lat", lon="lon", size='population', color='name', zoom=0, mapbox_style='light',
labels={'name':'Countries'}, title='Countries with the highest population',
hover_data={'name':False,'population': False, 'lat': False, 'lon':False},
custom_data=['name','population'], text='population')
fig.update_layout(
title={
'y':0.85,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.update_traces(
hovertemplate='<b>%{customdata[0]}</b>' + "<br>Pop:%{customdata[1]} millions<extra></extra>",
texttemplate='%{customdata[1]} mill', textposition='middle center')
plotly.offline.iplot(fig, filename='countries_population')
To determine which countries have the highest population density, first we need to calculate it, this is done using this formula:
$population\;density = \cfrac{population}{land\;area}$
The population density is expressed in population per squared kilometer.
factb['pop_density'] = round(factb['population']/factb['area_land'])
high_pop_den = factb.sort_values(by='pop_density', ascending=False, ignore_index=True).loc[:9,['name', 'pop_density']]
high_pop_den['lat'] = high_pop_den.name.apply(geo_lat)
high_pop_den['lon'] = high_pop_den.name.apply(geo_lon)
high_pop_den
fig = px.scatter_mapbox(high_pop_den, lat="lat", lon="lon", size=np.repeat(1,10), color='pop_density', zoom=0, mapbox_style='light',
labels={'pop_density':'Population Density'}, title='Countries with the highest population density',
hover_data={'name':False,'pop_density': False, 'lat': False, 'lon':False}, size_max=8,
custom_data=['name','pop_density'], color_continuous_scale='rainbow')
fig.update_layout(
title={
'y':0.85,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.update_traces(
hovertemplate='<b>%{customdata[0]}</b>' + "<br>Pop_den:%{customdata[1]}/km2<extra></extra>")
plotly.offline.iplot(fig, filename='countries_population_den')
The country with the highest population by 2015 is China with 1,367 million people, and also within China is the place with the highest population density, Macau.
In America, we have the United States as the country with the biggest population, 321 million people, and Bermuda being the 10 highest country in terms of population density.