Making Data Management Decisions

Import all libraries first

In [1]:
import pandas as pd
import numpy as np
import os
from pandas import DataFrame

Data Analysis and Data Engineering

Set the float display format (a fix to avoid runtime display errors), then load the data set

In [3]:
pd.set_option('display.float_format', lambda x: '%f' % x)
data = pd.read_csv("http://makemeanalyst.com/python-data-science/gapminder.csv", low_memory=False)
print(len(data))  # Get the number of rows in the data set
data.head()       # Print the first few rows to see what the data looks like
213
Out[3]:
  country      incomeperperson  alcconsumption  armedforcesrate  breastcancerper100th  co2emissions  femaleemployrate  hivrate
0 Afghanistan  NaN              0.03            0.5696534        26.8                  75944000      25.60000038       NaN
1 Albania      1914.996551      7.29            1.0247361        57.4                  223747333.3   42.09999847       NaN
2 Algeria      2231.993335      0.69            2.306817         23.5                  2932108667    31.70000076       0.1
3 Andorra      21943.3399       10.17           NaN              NaN                   NaN           NaN               NaN
4 Angola       1381.004268      5.57            1.4613288        23.1                  248358000     69.40000153       2

  internetuserate  lifeexpectancy  oilperperson  polityscore  relectricperperson  suicideper100th  employrate   urbanrate
0 3.654121623      48.673          NaN           0            NaN                 6.6843853        55.70000076  24.04
1 44.98994696      76.918          NaN           9            636.3413834         7.699329853      51.40000153  46.72
2 12.50007331      73.131          0.420094525   2            590.5098143         4.848769665      50.5         65.22
3 81               NaN             NaN           NaN          NaN                 5.362178802      NaN          88.92
4 9.999953883      51.093          NaN           -2           172.9992274         14.55467701      75.69999695  56.7

Now check the data type of each column

In [4]:
print(data.dtypes)
country                 object
incomeperperson         object
alcconsumption          object
armedforcesrate         object
breastcancerper100th    object
co2emissions            object
femaleemployrate        object
hivrate                 object
internetuserate         object
lifeexpectancy          object
oilperperson            object
polityscore             object
relectricperperson      object
suicideper100th         object
employrate              object
urbanrate               object
dtype: object

Every column was read in as object (string) data. Make all columns numeric except the country column, which is categorical data

In [5]:
colnames = data.columns.values.tolist()  # Get all the column names
colnames2 = colnames[1:]                 # Every column except the first (country)
for i in colnames2:
    data[i] = pd.to_numeric(data[i], errors='coerce')  # Unparseable values become NaN
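
As a quick toy illustration of what errors='coerce' does (made-up values, not the gapminder data): any entry that cannot be parsed as a number, such as an empty string, becomes NaN instead of raising an error.

In [ ]:
# Toy example: unparseable entries are coerced to NaN
s = pd.Series(['3.14', '', 'abc', '42'])
print(pd.to_numeric(s, errors='coerce'))  # -> 3.14, NaN, NaN, 42.0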

Now explicitly make the country column categorical

In [6]:
data[colnames[0]] = data[colnames[0]].astype('category')
print(data.dtypes)  # Now every column has the intended data type
country                 category
incomeperperson          float64
alcconsumption           float64
armedforcesrate          float64
breastcancerper100th     float64
co2emissions             float64
femaleemployrate         float64
hivrate                  float64
internetuserate          float64
lifeexpectancy           float64
oilperperson             float64
polityscore              float64
relectricperperson       float64
suicideper100th          float64
employrate               float64
urbanrate                float64
dtype: object
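
If you want to verify the conversion, a categorical column exposes its distinct levels and the integer codes backing each row through the .cat accessor (a quick sanity check, not required for the analysis):

In [ ]:
# Inspect the categorical country column
print(data['country'].cat.categories[:5])  # First five distinct country names
print(data['country'].cat.codes.head())    # Integer code for each row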

I will work with three variables in this example. First, find the number of NA values in each variable of interest

In [7]:
print(data['incomeperperson'].isnull().sum())  # Number of NAs: 23
print(data['urbanrate'].isnull().sum())        # Number of NAs: 10
print(data['employrate'].isnull().sum())       # Number of NAs: 35
23
10
35

You can also get the number of NA values for every column in the data set

In [8]:
print(data.isnull().sum())
country                   0
incomeperperson          23
alcconsumption           26
armedforcesrate          49
breastcancerper100th     40
co2emissions             13
femaleemployrate         35
hivrate                  66
internetuserate          21
lifeexpectancy           22
oilperperson            150
polityscore              52
relectricperperson       77
suicideper100th          22
employrate               35
urbanrate                10
dtype: int64
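
Since isnull() returns booleans, taking the column mean gives the fraction of missing values, which is often easier to compare across columns than raw counts (an extra check, not part of the original steps):

In [ ]:
# Percentage of missing values per column
print((data.isnull().mean() * 100).round(2))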

Now drop all rows where all three variables are NA

In [9]:
data=data.dropna(subset=['incomeperperson','urbanrate','employrate'], how='all')
print(len(data))
209

The code above drops the 4 rows in which 'incomeperperson', 'urbanrate', and 'employrate' are all NA. A toy example of the how argument follows below.
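
The how='all' argument matters here: how='any' would also drop rows where just one of the three variables is missing. A toy example (made-up values) shows the difference:

In [ ]:
# how='all' drops a row only if every listed column is NA
df = DataFrame({'a': [1, np.nan, np.nan], 'b': [2, 5, np.nan]})
print(df.dropna(subset=['a', 'b'], how='all'))  # Drops only the last row
print(df.dropna(subset=['a', 'b'], how='any'))  # Drops the last two rows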

Now find the mean of each of these three columns, because I will replace the remaining NA values in each column with that column's mean

In [10]:
mean = data[['incomeperperson', 'urbanrate', 'employrate']].mean()
mean = np.round(mean, decimals=2)
print(mean)
incomeperperson   8740.970000
urbanrate           56.770000
employrate          58.640000
dtype: float64

Replace the missing data in each of the three columns with that column's mean

In [11]:
data['incomeperperson'] = data['incomeperperson'].replace(np.nan, mean['incomeperperson'])
data['urbanrate'] = data['urbanrate'].replace(np.nan, mean['urbanrate'])
data['employrate'] = data['employrate'].replace(np.nan, mean['employrate'])
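
An equivalent, slightly more compact alternative (same result, just a different pandas idiom): DataFrame.fillna accepts a Series and fills each column with the value whose index label matches the column name.

In [ ]:
# Alternative: fill all three columns at once using the mean Series
cols = ['incomeperperson', 'urbanrate', 'employrate']
data[cols] = data[cols].fillna(mean)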
In [12]:
data = data.iloc[:, 1:]  # Drop the country column
data = data[['incomeperperson', 'urbanrate', 'employrate']]  # Keep only the three variables of interest
print(data.head())
print(data.shape)
   incomeperperson  urbanrate  employrate
0      8740.970000  24.040000   55.700001
1      1914.996551  46.720000   51.400002
2      2231.993335  65.220000   50.500000
3     21943.339900  88.920000   58.640000
4      1381.004268  56.700000   75.699997
(209, 3)

Now make the income variable categorical, with three categories: Lower, Middle, and Upper Income

In [13]:
data['factor_income'] = pd.cut(data['incomeperperson'], [0, 1000, 12735, data['incomeperperson'].max()],
                               labels=['Lower Income', 'Middle Income', 'Upper Income'])

print('counts for each incomeperperson')
c1 = data['factor_income'].value_counts(sort=False, dropna=False)
print(c1)
c5 = data['factor_income'].value_counts(sort=False, normalize=True)
print(c5)
counts for each incomeperperson
Lower Income      54
Middle Income    116
Upper Income      39
dtype: int64
Lower Income    0.258373
Middle Income   0.555024
Upper Income    0.186603
dtype: float64
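
One detail of pd.cut worth knowing: bins are right-closed by default, so with edges [0, 1000, 12735, max] a value of exactly 0 would fall outside every bin and become NaN unless you pass include_lowest=True. A toy illustration (made-up values):

In [ ]:
# Toy example: intervals are left-open, right-closed by default
vals = pd.Series([0, 1000, 1001, 20000])
print(pd.cut(vals, [0, 1000, 12735, 50000],
             labels=['Lower Income', 'Middle Income', 'Upper Income']))
# 0 -> NaN, 1000 -> Lower Income, 1001 -> Middle Income, 20000 -> Upper Income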

Data management for urban rate: do a quartile split (use the qcut function and ask for 4 groups)

In [14]:
data['factor_urbanrate'] = pd.qcut(data['urbanrate'], 4, labels=["1", "2", "3", "4"])
print('counts for each urban rate')
c2 = data['factor_urbanrate'].value_counts(sort=False, dropna=False)
c3 = data['factor_urbanrate'].value_counts(sort=False, normalize=True)  # Fraction of the data in each category
print(c2)  # Frequency of each category, including any NA values
print("Percentage for each category\n", c3)
counts for each urban rate
1    53
2    52
3    52
4    52
dtype: int64
Percentage for each category
 1   0.253589
2   0.248804
3   0.248804
4   0.248804
dtype: float64
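
If you also want the actual cut points that qcut chose, pass retbins=True to get the quartile edges back alongside the categories (a quick way to inspect the boundaries):

In [ ]:
# Retrieve the quartile boundaries that qcut computed
quartiles, edges = pd.qcut(data['urbanrate'], 4, labels=["1", "2", "3", "4"], retbins=True)
print(edges)  # Array of 5 bin edges: min, Q1, median, Q3, max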

Similarly, I will create a new categorical employrate variable with three levels: Low, Average, and High

In [15]:
data['EmpRateCategory'] = pd.cut(data['employrate'], [0, 50, 70, data['employrate'].max()],
                                 labels=['Low', 'Average', 'High'])

c4 = data['EmpRateCategory'].value_counts(sort=False, dropna=False)
print(c4)
c6 = data['EmpRateCategory'].value_counts(sort=False, normalize=True)
print("Percentage for each category\n", c6)
Low         37
Average    145
High        27
dtype: int64
Percentage for each category
 Low       0.177033
Average   0.693780
High      0.129187
dtype: float64
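
As one possible next step (a suggestion, not part of the original walkthrough), you could cross-tabulate two of the new categorical variables to see how income groups relate to employment-rate groups:

In [ ]:
# Cross-tabulate income category against employment-rate category
print(pd.crosstab(data['factor_income'], data['EmpRateCategory']))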

Try it yourself

In [ ]: