Bigdatahabits.com

5 Powerful Python and R Formulas to Convert and Replace Data

Data manipulation is the most important task in any data project, whether the project involves data science, data analytics or business intelligence. Before you can mine gold from your data, you have to wrestle with it first. That's why Kaggle Masters spend a lot of time on data manipulation; it is usually a decisive step in winning competitions. Data manipulation should therefore be an essential part of your data analysis plan.

Data manipulation requires coding. In this article we show the five most powerful convert-and-replace formulas in Python and R:

  • Data Binning
  • Type Casting
  • Category to Number
  • Replacing Cells
  • Renaming Columns

The formulas will be applied to the Titanic dataset, which can be downloaded from Kaggle.


1. Data Binning



Data binning is about labeling data in intervals, called bins. The purpose is mostly to 'tidy' and categorize a column to get a bird's-eye view of your data. For example, in the Titanic dataset age is measured in years, but suppose you want the following age categories:

  1. Child, ages 0-17
  2. Adult, ages 18-39
  3. Middle Aged, ages 40-59
  4. Over 60, ages 60 and above

R (cut)

titanic = read.csv("titanic.csv")  
# define the break points (bin edges) of the age categories and the corresponding labels:
edges <- c(0,18,40,60, 120)  
labels <- c("Child",
            "Adult",
            "Middle Aged",
            "Over 60")  
# we can break the ages in categories with the cut function
age.categories <- cut(titanic$Age,breaks = edges, right = FALSE, labels = labels)
# print the first 10 corresponding age categories
age.categories[1:10]

Output:

[1] Adult       Adult       Adult       Adult       Adult      
[6] <NA>        Middle Aged Child       Adult       Child      
Levels: Child Adult Middle Aged Over 60

Python (pd.cut)

import pandas as pd
titanic = pd.read_csv("titanic.csv")
labels = ["Child", "Adult", "Middle Aged", "Over 60"]
edges = [0, 18, 40, 60, 120]
# right=False makes the bins left-closed ([0,18), [18,40), ...),
# matching the R code and the age ranges defined above
age_categories = pd.cut(titanic["Age"], edges, right=False, labels=labels)
print(age_categories[0:10])

Output:

## 0          Adult
## 1          Adult
## 2          Adult
## 3          Adult
## 4          Adult
## 5            NaN
## 6    Middle Aged
## 7          Child
## 8          Adult
## 9          Child
## Name: Age, dtype: category
## Categories (4, object): [Child < Adult < Middle Aged < Over 60]
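If you also want to know how many passengers fall into each bin, you can attach the binned column to the data frame and count its values. A minimal Python sketch; the column name "AgeGroup" is only an illustration, not part of the dataset:

import pandas as pd
titanic = pd.read_csv("titanic.csv")
labels = ["Child", "Adult", "Middle Aged", "Over 60"]
edges = [0, 18, 40, 60, 120]
# store the bins as a new (hypothetical) column and count passengers per bin
titanic["AgeGroup"] = pd.cut(titanic["Age"], edges, right=False, labels=labels)
print(titanic["AgeGroup"].value_counts())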

2. Type Casting


Sometimes columns are not in the datatype you would like them to have. Type casting is about converting columns to the desired types. The most frequently used type casting conversions are:

  • Numeric to String
  • String to Numeric
  • Double to Int

R (as.character, as.numeric, as.integer)

titanic <- read.csv("titanic.csv")
# numeric to string
age.string <- as.character(titanic$Age)
# string to numeric
age.numeric <- as.numeric(age.string)
# double to int
fare.int <- as.integer(titanic$Fare)

Python (astype)

import pandas as pd
titanic = pd.read_csv("titanic.csv")
# numeric to string
age_string = titanic['Age'].astype(str)
# string to numeric
age_numeric = age_string.astype('float')
# double to int (assumes the column contains no missing values)
fare_int = titanic['Fare'].astype('int')
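Note that casting a column that still contains missing values to int will raise an error in pandas. A small sketch, assuming you are willing to fill the gaps first (the fill value 0 is only an illustration):

import pandas as pd
titanic = pd.read_csv("titanic.csv")
# inspect the current column types before casting
print(titanic.dtypes)
# fill missing values first so the cast to int cannot fail on NaN
fare_int = titanic['Fare'].fillna(0).astype('int')
print(fare_int.dtype)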

3. Category to Number



Category to Number is about converting nominal data to numeric values. Very often, prediction or machine learning functions don't accept nominal data, which makes it necessary to convert these columns to integers.

For instance, the column "Sex" in the Titanic dataset is nominal, consisting of "male" and "female", and can be encoded to the integers 0/1 as follows:

R (as.factor)
titanic = read.csv("titanic.csv")
# factor levels are alphabetical (female = 1, male = 2), so subtracting 1 gives 0/1
gender.encoded <- as.integer(as.factor(titanic$Sex)) - 1
# print subset
head(gender.encoded)
## [1] 1 0 0 0 1 1

Python (pd.Categorical)

import pandas as pd
titanic = pd.read_csv("titanic.csv")
gender_encoded = pd.Categorical(titanic.Sex).codes
# print subset
print(gender_encoded[0:6])
## [1 0 0 0 1 1]
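Because the integer codes follow the order of the categories (alphabetical for strings), it is worth printing the mapping once so you know which number stands for which label. A short sketch:

import pandas as pd
titanic = pd.read_csv("titanic.csv")
gender = pd.Categorical(titanic.Sex)
# build a lookup from code to category, e.g. {0: 'female', 1: 'male'}
mapping = dict(enumerate(gender.categories))
print(mapping)
print(gender.codes[0:6])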

4. Replacing cells



A frequently used data manipulation technique is replacing cells in a column according to a dictionary. The technique consists of two consecutive steps:

  1. Find the occurrences in the target column that match the keys in the dictionary.
  2. Replace those occurrences with the matched values.

In Excel this technique is known as VLOOKUP/HLOOKUP, which is very popular and frequently used.

Suppose you want to anonymize the names of the persons in the Titanic dataset with the fictive names displayed in the dictionary below:

##   gender fictive.name
## 1   male          jan
## 2 female        griet

In other words, if the gender in the Titanic dataset matches "male", replace the target column "Name" with "jan"; if it matches "female", replace the Name with "griet".

R (match)

titanic = read.csv("titanic.csv")
dictionary <- data.frame(gender = c("male","female"), fictive.name = c("jan","griet"))
row.indices <- match(titanic$Sex, dictionary$gender)
head(row.indices)
## [1] 1 2 2 2 1 1

The match function returns the row index of the dictionary that corresponds with the Sex column in the Titanic dataset: 1 corresponds with the first row of the dictionary table (gender = male), while 2 corresponds with female.

Now that you have a matched reference to the dictionary through row.indices, you can replace the Name column with the fictive names as follows:

titanic$Name <- dictionary$fictive.name[row.indices]
head(titanic$Name)
## [1] jan   griet griet griet jan   jan  
## Levels: griet jan

Python (map)

In Python you can use the map function to perform a lookup in the dictionary and replace the cells of the target column with the matched values:

import pandas as pd
titanic = pd.read_csv("titanic.csv")
dictionary = {'male':'jan', 'female':'griet'}
titanic['Name'] = titanic['Sex'].map(dictionary) 
print(titanic['Name'].head())
## 0      jan
## 1    griet
## 2    griet
## 3    griet
## 4      jan
## Name: Name, dtype: object
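One thing to watch out for with map: any value of Sex that is not a key in the dictionary becomes NaN in the target column. If you would rather keep the original cell in that case, you can fall back on the original column, as in this sketch (the reduced dictionary is only for illustration):

import pandas as pd
titanic = pd.read_csv("titanic.csv")
dictionary = {'male': 'jan'}  # 'female' deliberately left out for illustration
# map() yields NaN where the key is missing; fillna keeps the original Name there
titanic['Name'] = titanic['Sex'].map(dictionary).fillna(titanic['Name'])
print(titanic['Name'].head())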

5. Renaming columns



Sometimes you need to rename columns because their names are not clear, or because you want the column names to follow a specific convention.

The column names of the titanic dataset are as follows:

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

You can see that the column names "SibSp", "Pclass" and "Parch" are ambiguous; it is not clear what these fields mean exactly. Suppose you want to replace them with the following clearer descriptions:

##   old.name       new.name
## 1    SibSp NumberSiblings
## 2   Pclass    TicketClass
## 3    Parch  NumberParents

R (names and match)

titanic <- read.csv("titanic.csv")
old.names <- c("SibSp","Pclass","Parch")
new.names <- c("NumberSiblings","TicketClass", "NumberParents")
#get a reference to the old names (match) and replace them with the new names
names(titanic)[match(old.names, names(titanic))] <- new.names

#print the newly updated column names
print(names(titanic))
##  [1] "PassengerId"    "Survived"       "NumberSiblings" "Name"          
##  [5] "Sex"            "Age"            "TicketClass"    "NumberParents" 
##  [9] "Ticket"         "Fare"           "Cabin"          "Embarked"

Python (rename)

import pandas as pd
titanic = pd.read_csv("titanic.csv")
old_names = ["SibSp","Pclass","Parch"]
new_names = ["NumberSiblings","TicketClass", "NumberParents"]
#Make a mapping from old names to new names: {"SibSp": "NumberSiblings", ...}
old_to_new = dict(zip(old_names, new_names))
#rename
titanic.rename(old_to_new, axis = 'columns', inplace = True)
print(titanic.columns.values)
## ['PassengerId' 'Survived' 'TicketClass' 'Name' 'Sex' 'Age'
##  'NumberSiblings' 'NumberParents' 'Ticket' 'Fare' 'Cabin' 'Embarked']
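If the goal is a naming convention rather than a handful of explicit replacements, you can also derive the new names programmatically instead of listing them. A small sketch, using all-lowercase column names as an example convention:

import pandas as pd
titanic = pd.read_csv("titanic.csv")
# apply the (illustrative) convention to every column name at once
titanic.columns = [name.lower() for name in titanic.columns]
print(titanic.columns.values)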