Stochastic Nonsense

Put something smart here.

Cleaning Data in R: Csv Files

When you read csv files, you regularly encounter Excel encoded csv files which include extraneous characters such as commas, dollar signs, and quotes. Such a file might look like

1
2
3
col1, col2, col3
"1,233", "$12.79", "$1,333,233.17"
"470", "$1,113.22", "$0.12"

and call it dirty.csv.

Using read.csv will leave you with:

1
2
3
4
5
6
7
8
9
10
> data <- read.csv(file='~/stuff/blog/dirty.csv', header=T, sep=',')
> data
   col1       col2           col3
1 1,233     $12.79  $1,333,233.17
2   470  $1,113.22          $0.12
> str(data)
'data.frame': 2 obs. of  3 variables:
 $ col1: Factor w/ 2 levels "1,233","470": 1 2
 $ col2: Factor w/ 2 levels " $1,113.22"," $12.79": 2 1
 $ col3: Factor w/ 2 levels " $0.12"," $1,333,233.17": 2 1

which is not ideal. Unfortunately, converting the factors to numbers will fail, as they aren’t in a valid numeric format:

1
2
3
4
5
> data$col1
[1] 1,233 470  
Levels: 1,233 470
> as.numeric(data$col1)
[1] 1 2

The easiest thing to do is to clean the columns of our data frame with a regular expression, as such:

1
2
> gsub( ',', '', data$col1)
[1] "1233" "470" 

A better regular expression would clean all sorts of extraneous characters out of the columns:

1
2
3
4
5
6
7
8
> example <- '\'"$99.37$'
> example
[1] "'\"$99.37$"
> gsub('[^a-zA-Z0-9.]', '', example)
[1] "99.37"
> # -- also remove various money signs
> gsub('[^a-zA-Z0-9.]', '', paste(' €£', example) )
[1] "99.37"

And we can apply this to our columns as such:

1
2
3
4
5
6
7
8
> data2 <- data
> data$col1 <- gsub('[^a-zA-Z0-9.]', '', data$col1)
> data$col1 <- as.numeric(data$col1)
> str(data)
'data.frame': 2 obs. of  3 variables:
 $ col1: num  1233 470
 $ col2: Factor w/ 2 levels " $1,113.22"," $12.79": 2 1
 $ col3: Factor w/ 2 levels " $0.12"," $1,333,233.17": 2 1

But this is annoying to have to type for each column — what if we had many? — so let’s do this programmatically:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
> clean <- function(ttt){
+   as.numeric( gsub('[^a-zA-Z0-9.]', '', ttt))
+ }
> data <- data2
> clean(data$col1)
[1] 1233  470
> 
> # all columns at once
> data[] <- sapply(data, clean)
> str(data)
'data.frame': 2 obs. of  3 variables:
 $ col1: num  1233 470
 $ col2: num  12.8 1113.2
 $ col3: num  1.33e+06 1.20e-01

Thus we can collapse all our work to this:

1
2
3
4
5
6
7
8
9
10
> data <- read.csv(file='~/stuff/blog/dirty.csv', header=T, sep=',')
> clean <- function(ttt){
+   as.numeric( gsub('[^a-zA-Z0-9.]', '', ttt))
+ }
> data[] <- sapply(data, clean)
> str(data)
'data.frame': 2 obs. of  3 variables:
 $ col1: num  1233 470
 $ col2: num  12.8 1113.2
 $ col3: num  1.33e+06 1.20e-01