Reshaping ugly CSV with Python

This is another post based on a demo from my SQLSaturday session about Python integration in Power BI. The data file is also available from this link or download your own file from the Reserve Bank link below.

It is a pattern I use fairly often because flat files lend itself to the artistic license of the data provider i.e. they can do just about anything they want with the flat file.

Lets take a look at a real world example. The South African Reserve Bank provides daily exchange rate data (ZAR per USD) through an embedded Reporting Services environment. When you download the data as Comma Separated Value (CSV), this is what you get:

701

The data represents 3 days of data. We only require the section indicated as “useful data”. If you look at the file structure carefully, you will notice that it could still be treated as a CSV file. When you open the file with the Query Editor, you will see that Power BI already treats the file like a CSV (or an Excel file that does not have all cells populated).

702

I am going to illustrate two methods of dealing with this in Python. You may encounter scenarios where either one could be useful.

Method 1: Treat the file as CSV

This method will essentially mimic what the Power BI Query Editor does with the file. We will use the Pandas read_csv method, but we will use some parameters to handle our requirements. Without indicating that there are three columns, read_csv will throw an exception when it encounters a third field further down the file. To indicate three columns we simply specify the columns.

import pandas
df = pandas.read_csv('D:\\data\\dirty\\ExchangeRateDetail.csv',
                     skiprows=4,
                     names =['Date','Value','Extra'])

You will notice that we have ignored the first four rows that we don’t need.

703

From this point on its a simple process that we have encountered in previous posts. We are going to do three things to get the file into a ready to use state:

  • Drop the extra column
  • Remove (filter) unwanted rows
  • Explicitly cast the data types
import pandas
df = pandas.read_csv('D:\\data\\dirty\\ExchangeRateDetail.csv',
                     skiprows=4,
                     names =['Date','Value','Extra'])
df = df.drop(['Extra'], axis=1) # drop "Extra" column
df = df[df.Date != 'The Value'] # filter rows
df = df[df.Date != 'The Value_label'] # filter rows
df['Date'] = pandas.to_datetime(df['Date']) # explicit data type cast
df['Value'] = pandas.to_numeric(df['Value']) # explicit data type cast

The final output can now be used as a Python data source in Power BI

704

Method 2: Read the file line by line

In this example we will read the data line by line. Each line will get add to a list i.e. we will end up with a list of lines. We will then use some list comprehension techniques filter the data before adding it to a dataframe.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line for line in f]

The new line characters form part of the line as shown below in our list where each item is a line in the file.

705

There are also special characters at the start of the file, but we dont need to worry about this as we will not use this item in the list.

To strip off the “\n” we will use the rstrip method which removes characters from the right hand side of a string. By not specifying a character, whitespace characters are removed. This includes “\n”.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]

706

With the newline removed, we can now start filtering the list of items(lines). We start by filtering out the empty items (blank lines).

lines = list(filter(None, lines))

707

Next we remove the first 3 items, but keep the rest of the list.

lines = lines[3:]

708

Our final filter condition is to remove all items that start with “The”

lines = [t for t in lines if not t.startswith('The')]

709

Now our list is ready to convert to a dataframe.

import pandas
df = pandas.DataFrame(lines)

710

Note that we end up with a single column (0) dataframe. Our final steps to get the dataframe into a ready to use state are:

  • split the column on comma separator
  • drop the original column
  • Explicitly cast the data types

Lets put a complete final script together that include the final steps.

import pandas

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]

lines = list(filter(None, lines)) # drop blank lines
lines = lines[3:] # remove top rows
lines = [t for t in lines if not t.startswith('The')] # remove rows starting with "The"

df = pandas.DataFrame(lines) # add remaining rows to a dataframe
df['Date'] = pandas.to_datetime(df[0].str.split(',').str.get(0)) # split of the Date column
df['Value'] = df[0].str.split(',').str.get(1) # split off the value column
df = df.drop([0], axis=1) # drop the original column (0)
df['Date'] = pandas.to_datetime(df['Date']) # explicit data type cast
df['Value'] = pandas.to_numeric(df['Value']) # explicit data type cast

The final result is the same as method 1.

Hopefully that provides some ideas on how to deal with flat files in general. As mentioned at the start of this post, it is often challenging to deal with flat files as it is up to the data provider to produce a well structured file.

Posted in Power BI, Python, Python4PowerBI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: