Data Analysis and Data Engineering

Import all libraries first

In [24]:
"""
Created on Sun Sep  3 01:17:41 2017

@author: Kalyan
"""
import pandas as pd
import numpy as np
from pandas import DataFrame
import os

Load the Data Set

In [25]:
data= pd.read_csv("http://makemeanalyst.com/python-data-science/gapminder.csv", low_memory=False)
In [26]:
data.head()
Out[26]:
country incomeperperson alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate internetuserate lifeexpectancy oilperperson polityscore relectricperperson suicideper100th employrate urbanrate
0 Afghanistan 0.03 0.5696534 26.8 75944000 25.60000038 3.654121623 48.673 0 6.6843853 55.70000076 24.04
1 Albania 1914.996551 7.29 1.0247361 57.4 223747333.3 42.09999847 44.98994696 76.918 9 636.3413834 7.699329853 51.40000153 46.72
2 Algeria 2231.993335 0.69 2.306817 23.5 2932108667 31.70000076 0.1 12.50007331 73.131 0.420094525 2 590.5098143 4.848769665 50.5 65.22
3 Andorra 21943.3399 10.17 81 5.362178802 88.92
4 Angola 1381.004268 5.57 1.4613288 23.1 248358000 69.40000153 2 9.999953883 51.093 -2 172.9992274 14.55467701 75.69999695 56.7
In [27]:
print("No of Records : ",len(data)) #This will show number of observations
print("No of Features : ",len(data.columns)) #This will show number of Features
No of Records :  213
No of Features :  16

Check the types of the variable

In [28]:
print(data.dtypes)
Out[28]:
country                 object
incomeperperson         object
alcconsumption          object
armedforcesrate         object
breastcancerper100th    object
co2emissions            object
femaleemployrate        object
hivrate                 object
internetuserate         object
lifeexpectancy          object
oilperperson            object
polityscore             object
relectricperperson      object
suicideper100th         object
employrate              object
urbanrate               object
dtype: object

Setting variables you will be working with to numeric

In [30]:
data['incomeperperson']=data['incomeperperson'].convert_objects(convert_numeric=True)
data['urbanrate']=data['urbanrate'].convert_objects(convert_numeric=True)
data['employrate']=data['employrate'].convert_objects(convert_numeric=True)
print(data.dtypes)
C:\Users\Kalyan\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  if __name__ == '__main__':
C:\Users\Kalyan\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  from ipykernel import kernelapp as app
C:\Users\Kalyan\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  app.launch_new_instance()
Out[30]:
country                  object
incomeperperson         float64
alcconsumption           object
armedforcesrate          object
breastcancerper100th     object
co2emissions             object
femaleemployrate         object
hivrate                  object
internetuserate          object
lifeexpectancy           object
oilperperson             object
polityscore              object
relectricperperson       object
suicideper100th          object
employrate              float64
urbanrate               float64
dtype: object

First I want to see are there any countries where incomeperperson, urbanrate and employrate all three features are NA values

In [36]:
sub1=data[np.isnan(data['incomeperperson']) & np.isnan(data['urbanrate']) & np.isnan(data['employrate'])]
sub1
Out[36]:
country incomeperperson alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate internetuserate lifeexpectancy oilperperson polityscore relectricperperson suicideper100th employrate urbanrate
43 Cook Islands NaN 3.23 1045000 4.373364925 NaN NaN
71 Gibraltar NaN 7388333.333 65 0 NaN NaN
134 Nauru NaN 4.81 4814333.333 4.079525471 NaN NaN
143 Niue NaN 8.69 132000 4.217076302 NaN NaN

Now I made the income variable as categorical. I made them into three catagories

In [37]:
data['factor_income']=pd.cut(data['incomeperperson'],[0,1000,12735, data.ix[:,['incomeperperson']].max()], 
                             labels=['Lower Income','Middle Income','Upper Income'])
In [43]:
data.head()
Out[43]:
country incomeperperson alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate internetuserate lifeexpectancy oilperperson polityscore relectricperperson suicideper100th employrate urbanrate factor_income factor_urbanrate EmpRateCatogory
0 Afghanistan NaN 0.03 0.5696534 26.8 75944000 25.60000038 3.654121623 48.673 0 6.6843853 55.700001 24.04 NaN 1=0%tile Average
1 Albania 1914.996551 7.29 1.0247361 57.4 223747333.3 42.09999847 44.98994696 76.918 9 636.3413834 7.699329853 51.400002 46.72 Middle Income 2=25%tile Average
2 Algeria 2231.993335 0.69 2.306817 23.5 2932108667 31.70000076 0.1 12.50007331 73.131 0.420094525 2 590.5098143 4.848769665 50.500000 65.22 Middle Income 3=50%tile Average
3 Andorra 21943.339900 10.17 81 5.362178802 NaN 88.92 Upper Income 4=75%tile NaN
4 Angola 1381.004268 5.57 1.4613288 23.1 248358000 69.40000153 2 9.999953883 51.093 -2 172.9992274 14.55467701 75.699997 56.70 Middle Income 2=25%tile High
In [39]:
print ('counts for original incomeperperson')
c1 = data['factor_income'].value_counts(sort=False, dropna=False)
print(c1) #Here you will get the frequncy of each catagory as well as number of NA values
print(data['factor_income'].describe())
counts for original incomeperperson
Lower Income     54
Middle Income    97
Upper Income     39
NaN              23
dtype: int64
Out[39]:
count               190
unique                3
top       Middle Income
freq                 97
Name: factor_income, dtype: object

Data management for urban rate

I will use quartile split (use qcut function & ask for 4 groups - gives you quartile split)

In [40]:
data['factor_urbanrate']=pd.qcut(data['urbanrate'],4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
print ('counts for original incomeperperson')
c2 = data['factor_urbanrate'].value_counts(sort=False, dropna=False)
c3 = data['factor_urbanrate'].value_counts(sort=False, dropna=False,normalize=True)#It will display percentage of data
#falls in each catagory
print(c2) #Here you will get the frequncy of each catagory as well as number of NA values
print("Percentage for each catagory\n",c3*100)
counts for original incomeperperson
1=0%tile     51
2=25%tile    51
3=50%tile    50
4=75%tile    51
NaN          10
dtype: int64
Percentage for each catagory
 1=0%tile     23.943662
2=25%tile    23.943662
3=50%tile    23.474178
4=75%tile    23.943662
NaN           4.694836
dtype: float64

Now create a new employrate variable which is categorical and cretae three categories as Low, average, high

In [41]:
data['EmpRateCatogory'] =pd.cut( data['employrate'],[0,50,70,data.ix[:,['employrate']].max()],
    labels=['Low','Average','High'])
    
c4 = data['EmpRateCatogory'].value_counts(sort=False, dropna=False)
print(c4)
Low         37
Average    114
High        27
NaN         35
dtype: int64

Try yourself

In [ ]: