Loading JSON … it looks simple … Part 4

This is part 4 and the final post of this JSON series. We are going to play with Smartsheet data because it adds to our examples of varying JSON structure. Note that Power BI already has a connector for Smartsheet. There is also a Python SDK for Smartsheet, but we are going to use the Python requests library so that we can start from scratch and learn how to navigate the JSON structure.

We will follow the same approach as the previous post, i.e. focus on understanding the structure, draw from our existing patterns and add one or two new techniques to our arsenal.

Below is the screenshot of the actual Smartsheet we will be working with.

1100

Accessing the Smartsheet API

API access is available during the free trial period. If you want to play, have a look at how to generate the token so that you can authenticate your API access. If you have been following my posts, you would know by now that I store my secure information (such as API tokens) in the user environment variables on my development computer.

To access the API we need to create the endpoint from the Smartsheet API URL and the sheet ID for the sheet you want to access. We also need the token which we pass to the request header. The code to do this is as follows:

import json, os, requests

auth = os.getenv('SMARTSHEET_TOKEN')
smartsheet = 'https://api.smartsheet.com/2.0/sheets/'
sheetid = '3220009594972036'
uri = smartsheet + sheetid
header = {'Authorization': "Bearer " + auth,
          'Content-Type': 'application/json'}

req = requests.get(uri, headers=header)
data = json.loads(req.text)

In the above script, the last line converts the response text into a Python dictionary structure.

Understanding the data structures

This is the critical step. Understand the problem … and in this case that means reading the Smartsheet documentation and comparing it to the physical structure. This time round we are going to use the pretty print (pprint) library for the latter.

import pprint
pprint.pprint(data)

The documentation tells us that the sheet is the main object. The sheet contains a columns (metadata) object, a rows object and a cells object. It also tells us that a collection of cells forms a row. When we explore the physical structure we see that this is indeed the case.

1101

1102

So the column metadata and rows are referenced directly from the sheet object, and the cells are referenced from each row object. We can thus access the columns and rows collections directly.

pprint.pprint(data['columns'])
pprint.pprint(data['rows'])

Building the dataframe

The process we are going to follow is to create an empty dataframe from the column metadata (using the ‘title’ key: value pair). We are then going to iterate over the rows and cells and append the values to the new dataframe.

We are firstly going to create a list of columns. This has a dual purpose. We will use it to create the empty dataframe, then we will re-use it to create row dictionaries that we will append to the dataframe.

cols = []
for col in data['columns']:
    cols.append(col['title'])

1103

Creating the empty dataframe from this is straightforward.

import pandas
df = pandas.DataFrame(columns=cols)

To append the rows, we iterate over the rows with a loop and we use a nested loop to iterate over the cells. The cells in each row are added to a list, but first we trap empty values in the VAT field.

for row in data['rows']:
    values = [] #re-initialise the values list for each row
    for cell in row['cells']:
        if cell.get('value'): #handle the empty values
            values.append(cell['value'])
        else:
            values.append('')
    print(values)

1104

Now we need a new trick to add these lists to the dataframe. We are going to use a function called zip. Zip pairs up the elements of two (or more) iterables, such as our list of column names and a row's list of values, and wrapping the result in dict() gives us the key: value pairs we need.

1105
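
To make the idea concrete, here is a minimal sketch of what zip does with two lists; the column names and values below are made up for illustration and are not taken from the actual sheet.

cols = ['Invoice', 'Amount', 'VAT']   #hypothetical column titles
values = ['INV-001', 100, '']         #hypothetical cell values

row_dict = dict(zip(cols, values))    #pair each column title with its value
print(row_dict)
#{'Invoice': 'INV-001', 'Amount': 100, 'VAT': ''}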

We are going to use this to combine our columns and rows into a dictionary that we append to the dataframe. The complete script is below.

import json, os, pandas, requests

auth = os.getenv('SMARTSHEET_TOKEN')
smartsheet = 'https://api.smartsheet.com/2.0/sheets/'
sheetid = '3220009594972036'
uri = smartsheet + sheetid
header = {'Authorization': "Bearer " + auth,
          'Content-Type': 'application/json'}

req = requests.get(uri, headers=header)
data = json.loads(req.text)

cols = []
for col in data['columns']:
    cols.append(col['title'])

df = pandas.DataFrame(columns=cols)

for row in data['rows']:
    values = [] #re-initialise the values list for each row
    for cell in row['cells']:
        if cell.get('value'): #handle the empty values
            values.append(cell['value'])
        else:
            values.append('')
    df = df.append(dict(zip(cols, values)), ignore_index=True)

1106
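
One caveat worth noting: DataFrame.append was deprecated and has since been removed in pandas 2.0. If you are on a newer version of pandas, a sketch of the equivalent pattern is to collect the row dictionaries in a list and build the dataframe once at the end:

rows = []
for row in data['rows']:
    values = []
    for cell in row['cells']:
        values.append(cell.get('value', '')) #empty cells default to ''
    rows.append(dict(zip(cols, values)))

df = pandas.DataFrame(rows, columns=cols) #build the dataframe in one step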

The moral of the story

While JSON provides a great data interchange standard, implementations of the format vary. This means we will always need to make sure we understand the structure and select the required techniques to create an analysis-ready format.

This post concludes the JSON series, but I will revisit the topic when new scenarios pop up.

 

Posted in Power BI, Python, Python4PowerBI

Loading JSON … it looks simple … Part 3

This is part 3 of the introduction to handling JSON with Python. This post covers a real world example that accesses data from public API from the New York City Open Data. For fun we are going to look at the most popular baby names by sex and ethnicity.

The key idea is that JSON structures vary, and the first step in dealing with JSON is understanding the structure. Once we understand the structure, we select the appropriate techniques to handle the JSON data. Most of these techniques were covered in part 1 and part 2 of this series.

Browsing the data

You can browse the data from the API URL endpoint. The displayed format is not really going to help us understand the data. Copy the text over to an online JSON formatter such as jsonformatter. Here we see the pretty JSON format on the left:

1001

Click on the JSON Viewer button to see a tree view of the JSON data on the right:

1002

The very first thing you notice right at the top of the tree view is that there are 2 objects at the top level and the JSON consists of nested dictionaries. By collapsing the tree structure to only the top level, we see the following:

1003

Here we see the meta dictionary has 1 sub-dictionary and the data object is a list of 11345 items. Further exploration shows that the data items are themselves lists, i.e. the data object is a list of 11345 lists (rows). Each of these lists contains 14 items, and each item represents a value (column).

1004

The data we have only shows a list index and a list value. We don’t know what those values represent … yet.

Loading the data

It is reasonably easy to get to the data by referencing the data dictionary directly as shown in the script below:

import requests, pandas

url = 'https://data.cityofnewyork.us/api/views/25th-nujf/rows.json?accessType=DOWNLOAD'
req = requests.get(url)
mydict = req.json()

df = pandas.DataFrame.from_dict(mydict['data'])

The results show that the only problem is that the column names are just the index values.

1005

Loading the column meta data

We revisit our JSON tree view and find that the meta dictionary contains a view sub-dictionary, which in turn contains a list of dictionaries called columns. Each item in the columns list is a dictionary with 8 column properties.

1006

Accessing the column names is not that tricky. We will use the same techniques from the previous posts. Perhaps the only “trick” is to reference the nested columns dictionary. The rest of the code accesses the key: value pairs nested inside the list, and adds the column name value to our columns list. The final step replaces the column headers with the newly acquired column names.

cols_lst = []

for obj in mydict['meta']['view']['columns']:
    for key, value in obj.items():
        if key == 'name': cols_lst.append(value)

df.columns = cols_lst

1007

You can clean up by removing the columns you don’t need (see previous blog posts using the drop method for dataframes).
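
For example, the export carries a number of internal metadata fields alongside the name data. Assuming column names along the lines of the ones below (these are illustrative, so check your own cols_lst first), the clean-up might look like this:

drop_cols = ['sid', 'id', 'position', 'created_at', 'created_meta',
             'updated_at', 'updated_meta', 'meta'] #hypothetical metadata column names
df = df.drop(drop_cols, axis=1)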

When the problem is understood …

… the solution is (hopefully) obvious. While this post barely introduced anything new from a Python perspective (except referencing sub-dictionaries), the focus was on understanding the problem and cementing techniques for handling JSON.

In the next blog post I want to conclude this JSON series with another real world example.

Posted in Power BI, Python, Python4PowerBI

Loading JSON … it looks simple … Part 2

The previous post might have left you feeling that it's a bit too complicated and a bit too abstract. Don't worry, the concepts will be clearer after a bit of a visual recap, then exploring how we navigate the JSON/dictionary structures, and finally how we reconstruct a “flattened” dictionary from the original dictionary.

Visual recap

We discussed that the JSON format is the same as a Python dictionary i.e. it consists of key: value pairs separated by a comma and enclosed inside braces. The keys are just names and the values may be of type string, number, boolean, list or dictionary (object).

The image below is the JSON file we created in the previous post. It is a list of 3 dictionaries. Each dictionary has the same keys, but the values may differ. We access the values by referencing the keys.

901

Navigating the JSON file

We can iterate through the JSON file using the for loop structure. Let's start by loading the JSON file and then looping through the outer list of dictionary items. Our image above shows that there are 3 dictionary items. We will observe the same from the code below.

import json

with open('d:\\data\\json\\data.json') as json_file:
    mydict = json.load(json_file) 

#looping through each object in the list
for obj in mydict:
    print(obj)

902

In order to traverse the key: value pairs inside each dictionary, the for loop structure is slightly different. We will nest the dictionary for loop inside the list for loop so that we can access the key: value pairs for each of the 3 dictionaries.

for obj in mydict:
    for key, value in obj.items():
        print(mydict.index(obj), key, value)

For good measure, we have added in the index of the list item so that we know which dictionary the key: value pairs are associated with.

903

We know from the file structure that we still have list and dictionary values to deal with. We will choose our for loop structure according to the value type (list or dictionary). In order to get the values out of the “versions” list, we will use the list for loop structure nested inside our existing nested loop structure.

for obj in mydict:
    for key, value in obj.items():
        if type(value) == list:
            for item in value:
                print(mydict.index(obj), obj['product'], key, item)

Note how we test if the value type is a list. We are excluding all other key values.

904

To keep the data meaningful we have retained the dictionary index and product value that are associated with the list items.

We can use the same technique with the nested dictionary, by using the dictionary for loop structure.

for obj in mydict:
    for key, value in obj.items():
        if type(value) == dict:
            for key2, value2 in value.items():
                print(mydict.index(obj), obj['product'], key2, value2)

905

At this stage we have everything we need in terms of navigating the dictionary. We know the difference in for loop structure for lists vs dictionaries. We also know how to test the data type of the dictionary value and we know how to reference the keys, values and indexes.

Reconstructing the dictionary

To create a flattened dictionary (no list or dictionary values), we will use a combination of the navigation techniques just shown. The only additional techniques required are a bit of list and dictionary manipulation, i.e. how to create and update a dictionary, and how to append items to a list.

We create empty lists and dictionaries in the same way, but we append to these data structures in a different way.

907
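
As a small sketch of the difference, using values from our sample data:

my_list = []                             #create an empty list
my_dict = {}                             #create an empty dictionary

my_list.append('free')                   #lists grow with append
my_dict.update({'product': 'Power BI'})  #dictionaries grow with update (or my_dict['product'] = 'Power BI')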

We will now use the above in conjunction with our nested looping structures to create a flattened dictionary.

The principle of the technique is to isolate every key: value pair, including unpacking list or dictionary values. Each unpacked key: value pair is then added to a new dictionary that no longer has list or dictionary values. Finally we append each flattened dictionary to a new list of dictionaries, and the resulting dictionary list is added to a dataframe.


import json, pandas

with open('d:\\data\\json\\data.json') as json_file:
    mydict = json.load(json_file) 

dict_lst = [] #initialise an empty list to hold the final result

#outer loop through each dictionary
for obj in mydict:
    flat_dict = {} #initialise a target dictionary
    #loop through each key: value pair in dictionary
    for key, value in obj.items():
        #test the data type of each value
        #handle list and dictionary types in the respective ways
        if type(value) == dict:
            for key2, value2 in value.items():
                #use both the original key and value as values in new dict
                flat_dict.update({'cheapest':key2, 'price':value2})
        elif type(value) == list:
            for item in value:
                #create keys version0, version1, version2, etc
                flat_dict.update({'version' + str(value.index(item)): item})
        else:
            flat_dict.update({key: value})
    #append the flattened dictionary to our new dictionary list
    dict_lst.append(flat_dict)

#add the dictionary list to a dataframe
df = pandas.DataFrame.from_dict(dict_lst)

906

This technique has greater flexibility in terms of allowing other ways of manipulating the dictionary. Ironically it also appears easier than the technique shown in the previous post where we used dataframes. Best of all, this technique forces a level of understanding of dictionary data structures.

In part 3 of the JSON series I want to explore a real world example that includes JSON with a metadata section and retrieving data from an API.

Posted in Power BI, Python, Python4PowerBI

Loading JSON … it looks simple … Part 1

In modern application environments we will encounter JSON (JavaScript Object Notation) as the preferred file format for data interchange. Many APIs and files are delivered in JSON format. Working with the JSON format can be challenging, especially if you are new to it. It is easy to get thrown by values that are variable length lists or values that are dictionaries (nested objects). So the three key things I want to cover in this post are:

  • The basics of JSON and how it compares to a Python dictionary
  • Dealing with values that are lists (including variable length lists)
  • Dealing with nested objects (including where keys differ)

There will potentially be many opportunities to improve the code. Please let me know or comment so that I can learn in the process.

The JSON format is best understood by comparing it to a Python dictionary.  Be aware though that the JSON format (rules for data representation in text format) only appears similar to the Python dictionary (a data type that supports add, delete, sort, etc).

Python dictionaries

Dictionaries consist of key: value pairs separated by a comma and enclosed inside braces. The value may be of type string, number, boolean, list or object (dictionary).

string_dict   = {"product": "Power BI"}
number_dict   = {"year": 2015}
boolean_dict  = {"free": True}
array_dict    = {"paid": ["pro", "premium"]}
object_dict   = {"price": {"pro": "$10"} }

mydict = {"product": "Power BI",
          "year": 2015,
          "free": True,
          "paid": ["pro", "premium"],
          "cheapest": {"pro": "$10"}}

To access the values we reference the keys:

801
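
Using mydict from above, the three examples look roughly like this:

print(mydict['product'])         #simple value: 'Power BI'
print(mydict['paid'][0])         #key combined with a list index: 'pro'
print(mydict['cheapest']['pro']) #outer key combined with the nested key: '$10'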

Note how the second example combines a key with a list index, and the third example combines the key for the outer object with the key of the nested object.

List of dictionaries

We can also create an array (list) of dictionaries:

dict_array    = [
                {"product": "Power BI",
                 "year": 2015,
                 "free": True,
                 "versions": ["free",
                              "pro",
                              "premium"],
                 "cheapestpaid": {"pro": "$10"}},
                {"product": "Tableau",
                 "year": 2004,
                 "free": False,
                 "versions": ["creator",
                              "explorer",
                              "viewer"],
                 "cheapestpaid": {"creator": "$70"}},
                {"product": "QlikSense",
                 "year": 2015,
                 "free": True,
                 "versions": ["cloudbasic",
                              "cloudbusiness",
                              "desktop",
                              "enterprise"],
                 "cheapestpaid": {"cloudbusiness": "$15"}}
                ]

We can reference a specific array item by its index, then the dictionary key as follows:

802
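
In code that looks roughly like this:

print(dict_array[0]['product'])     #'Power BI'
print(dict_array[1]['year'])        #2004
print(dict_array[2]['versions'][0]) #'cloudbasic'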

Here the list index precedes the dictionary key. Remember that the list index offset is 0.

Creating a JSON file from a dictionary

First we convert the dictionary to a JSON string using the built-in json package. The json.dumps function converts a dictionary into JSON encoded format. Add the following to the code above:

import json
print(dict_array)
json.dumps(dict_array)

Compare the output of the dictionary and the JSON encoded string. The appearance is almost identical. The dictionary displays the key/value pairs in single quotes, while the JSON string displays the key/value pairs in double quotes and the entire string in single quotes.

803

Now we can write the JSON string to a file. Here we use json.dump (without the s), which writes directly to a file rather than returning a string.

with open('d:\\data\\json\\data.json', 'w') as outfile:
    json.dump(dict_array, outfile, indent=4)

The indent=4 parameter produces a pretty JSON layout, i.e. a human readable format.

Load the JSON file

In the same manner that we wrote the JSON file, we can read the file using the json.load function. At this point the file content is loaded as a list of dictionaries.

with open('d:\\data\\json\\data.json') as json_file:
    dict_lst = json.load(json_file)

Adding the dictionary to a dataframe

import pandas
df = pandas.DataFrame.from_dict(dict_lst)

From the output we can see that we still need to unpack the list and dictionary columns.

808

One of the methods provided by Pandas is json_normalize. This works well for nested columns with the same keys … but not so well for our case where the keys differ.

The dictionary you wish you got

Here is a simple example where the keys are the same:

mydict = [{"company": "ABC",
           "location": {"province": "Gauteng",
                        "city": "Sandton" }},
          {"company": "XYZ",
           "location": {"province": "Western Cape",
                      "city": "Cape Town" }}]

import pandas
pandas.DataFrame.from_dict(mydict)

from pandas.io.json import json_normalize
json_normalize(mydict)

809

Back to the real world

In our example, our keys don't match across dictionaries, so the result is not great.

from pandas.io.json import json_normalize
json_normalize(dict_lst)

810

Because the keys differ they are split across different columns. Also, the list column is not handled. Let's use a different technique to handle the list column. Pandas apply allows us to apply operations across dataframes (rows/columns). Here we use Pandas Series to create a column for each list item.

import pandas
df = pandas.DataFrame.from_dict(dict_lst)
df2 = df['versions'].apply(pandas.Series)

811

The resulting dataframe can be concatenated with the existing one as follows:

 
df3 = pandas.concat([df[:], df2[:]], axis=1)

812

So we still have to deal with the dictionary column. We will use dataframe methods (melt, merge, drop) that were covered in previous posts, along with some new methods. I’ve taken a different approach here by providing all the steps with comments and then a visual that shows the output at each step.

import pandas
df = pandas.DataFrame.from_dict(dict_lst)
#split the versions list column into individual columns
df1 = df['versions'].apply(pandas.Series)
#concatenate the original dataframe with the individual version columns
df2 = pandas.concat([df[:], df1[:]], axis=1)
#split the dictionary column into individual columns
df3 = pandas.DataFrame(df2['cheapestpaid'].values.tolist(), index=df2.index)
#add the product column (to create a lookup column)
df4 = pandas.concat([df['product'], df3[:]], axis=1)
#melt (unpivot) the value columns
df5 = pandas.melt(df4, id_vars=['product'], value_vars=['cloudbusiness','creator','pro'], var_name='cheapest')
#drop NaN values to finalise the lookup table
df6 = df5.dropna()
#merge with the lookup table
df7 = pandas.merge(df2, df6, on='product', how='outer')
#drop unwanted columns
df8 = df7.drop(['cheapestpaid','versions'], axis=1)

813

So the result is not the simplest of code; however, it is not that complex either. We solved the entire problem using Pandas. What I have not tested is how we would solve this without Python in Power BI (visual Query Editor or M). It would also be interesting to compare solution complexity. In my case, I already had these Python patterns, so it was a matter of script re-use.

In the next post I want to explore methods of traversing and reshaping the data without Pandas and dataframes. It will provide a better understanding of the JSON format and expand on the possibilities with Python script.

 

 

 

Posted in Power BI, Python, Python4PowerBI

Reshaping ugly CSV with Python

This is another post based on a demo from my SQLSaturday session about Python integration in Power BI. The data file is also available from this link, or you can download your own file from the Reserve Bank link below.

It is a pattern I use fairly often because flat files lend themselves to the artistic license of the data provider, i.e. they can do just about anything they want with the flat file.

Let's take a look at a real world example. The South African Reserve Bank provides daily exchange rate data (ZAR per USD) through an embedded Reporting Services environment. When you download the data as Comma Separated Values (CSV), this is what you get:

701

The data represents 3 days of data. We only require the section indicated as “useful data”. If you look at the file structure carefully, you will notice that it could still be treated as a CSV file. When you open the file with the Query Editor, you will see that Power BI already treats the file like a CSV (or an Excel file that does not have all cells populated).

702

I am going to illustrate two methods of dealing with this in Python. You may encounter scenarios where either one could be useful.

Method 1: Treat the file as CSV

This method will essentially mimic what the Power BI Query Editor does with the file. We will use the Pandas read_csv method, but with a few parameters to handle our requirements. Without indicating that there are three columns, read_csv will throw an exception when it encounters a third field further down the file. To indicate three columns we simply name them.

import pandas
df = pandas.read_csv('D:\\data\\dirty\\ExchangeRateDetail.csv',
                     skiprows=4,
                     names =['Date','Value','Extra'])

You will notice that we have ignored the first four rows that we don’t need.

703

From this point on it's a simple process that we have encountered in previous posts. We are going to do three things to get the file into a ready to use state:

  • Drop the extra column
  • Remove (filter) unwanted rows
  • Explicitly cast the data types

import pandas
df = pandas.read_csv('D:\\data\\dirty\\ExchangeRateDetail.csv',
                     skiprows=4,
                     names =['Date','Value','Extra'])
df = df.drop(['Extra'], axis=1) # drop "Extra" column
df = df[df.Date != 'The Value'] # filter rows
df = df[df.Date != 'The Value_label'] # filter rows
df['Date'] = pandas.to_datetime(df['Date']) # explicit data type cast
df['Value'] = pandas.to_numeric(df['Value']) # explicit data type cast

The final output can now be used as a Python data source in Power BI.

704

Method 2: Read the file line by line

In this example we will read the data line by line. Each line will be added to a list, i.e. we will end up with a list of lines. We will then use some list comprehension techniques to filter the data before adding it to a dataframe.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line for line in f]

The newline characters form part of each line, as shown below in our list where each item is a line in the file.

705

There are also special characters at the start of the file, but we don't need to worry about this as we will not use this item in the list.

To strip off the “\n” we will use the rstrip method which removes characters from the right hand side of a string. By not specifying a character, whitespace characters are removed. This includes “\n”.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]

706

With the newline removed, we can now start filtering the list of items (lines). We start by filtering out the empty items (blank lines).

lines = list(filter(None, lines))

707

Next we remove the first 3 items, but keep the rest of the list.

lines = lines[3:]

708

Our final filter condition is to remove all items that start with “The”

lines = [t for t in lines if not t.startswith('The')]

709

Now our list is ready to convert to a dataframe.

import pandas
df = pandas.DataFrame(lines)

710

Note that we end up with a single column (0) dataframe. Our final steps to get the dataframe into a ready to use state are:

  • Split the column on the comma separator
  • Drop the original column
  • Explicitly cast the data types

Let's put a complete final script together that includes these final steps.

import pandas

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]

lines = list(filter(None, lines)) # drop blank lines
lines = lines[3:] # remove top rows
lines = [t for t in lines if not t.startswith('The')] # remove rows starting with "The"

df = pandas.DataFrame(lines) # add remaining rows to a dataframe
df['Date'] = df[0].str.split(',').str.get(0) # split off the Date column
df['Value'] = df[0].str.split(',').str.get(1) # split off the Value column
df = df.drop([0], axis=1) # drop the original column (0)
df['Date'] = pandas.to_datetime(df['Date']) # explicit data type cast
df['Value'] = pandas.to_numeric(df['Value']) # explicit data type cast

The final result is the same as method 1.

Hopefully that provides some ideas on how to deal with flat files in general. As mentioned at the start of this post, it is often challenging to deal with flat files as it is up to the data provider to produce a well structured file.

Posted in Power BI, Python, Python4PowerBI

Unpivot Excel data with Python

This post is based on another demo from my SQLSaturday session on Python integration in Power BI. It's a handy real world scenario that could easily be re-applied with minor changes to the script. The primary pattern is unpivot functionality, which is achieved through the melt method in pandas. A few additional useful dataframe techniques are illustrated, including pivot_table.

I am using the Tourist Accommodation data from the Statistics South Africa website. The file (tourism.xlsx) is also available on the SQLSaturday website. Once you open the file observe the following:

601

  • the first 8 columns (starting with the letter H) are the header columns
  • the remaining columns are a series of months i.e. MO012007 is month 1 (Jan), 2007
  • the numeric values have mixed scale including thousands, millions & percentages
  • column H04 is a mix of metric and accommodation type

The Python script will provide a way to transform the data even if the columns increase as months get added.

Step 1: Load Excel data into a dataframe

import pandas
df = pandas.read_excel('D:\\data\\excel\\tourism.xlsx')

The read_excel method accepts a host of parameters, including specifying the sheet_name. Our Excel file has only 1 sheet, so the sheet_name is not required.

Step 2: Split columns into actual headers and months

We start by adding the dataframe columns to a list. Then we loop through the list and split the month columns from the actual headers.

#create a list of all the columns
columns = list(df)
#create lists to hold headers & months
headers = []
months = []
#split columns list into headers and months
for col in columns:
    if col.startswith('MO'):
        months.append(col)
    else:
        headers.append(col)

Here we have the best case scenario because all month columns start with “MO”. If no pattern exists but the columns are sequenced, you could split on position, i.e. headers = columns[:8] and months = columns[8:] will produce the same result. The worst scenario is when the order of the columns is random. In this case you would manually specify the list elements.

Step 3: Melt (unpivot) the month columns

Unpivoting will transpose all month columns to rows. For each month column a new row is created using the same header columns.

df2 = pandas.melt(df,
                  id_vars=headers,
                  value_vars=months,
                  var_name='Date',
                  value_name='Val')

All month columns will be dropped and a new column ‘Date’ will contain the original month column header. The values will be added to a new column called ‘Val’.

602

Now reconstruct and convert the Date column to an actual date. We get the year and month parts from the column and just assume the day is always 1. This is explicitly converted to a date using the to_datetime method.

df2['Date'] = pandas.to_datetime(df2['Date'].str[4:9] + '-' + df2['Date'].str[2:4] + '-01')

Step 4: Standardize the values 

We noted at the start that the values were a mixture of numeric scales which is indicated in column H17. To determine the possible value scales we use df2.H17.unique() which returns the following list of unique values: [‘Thousand’, ‘Percentage’, ‘Rands’, ‘R million’]. We create a function to apply the scale to the Val column.


def noscale(row):
    if row['H17'] == 'Thousand':
        return row['Val'] * 1000
    elif row['H17'] =='R million':
        return row['Val'] * 1000000
    elif row['H17'] =='Percentage':
        return row['Val'] / 100
    else:
        return row['Val']

df2['FullValue'] = df2.apply(noscale, axis=1)

“Rands” are just the local currency (ZAR), so no scale is required.

603

Step 5: Additional clean up 

The next 3 steps in our script splits the composite column and removes unwanted rows and columns.


#split composite column
df2['Metric'] = df2['H04'].str.split(' - ').str.get(0)
df2['Category'] = df2['H04'].str.split(' - ').str.get(1)
#remove aggregate rows
df2 = df2[df2.Category != 'Total industry']
#remove unwanted columns
df2 = df2.drop(['H01','H02','H03','H04','H15','H17','H25','Val'], axis=1)

Note how rows are filtered. We set the dataframe equal to the existing dataframe where the Category column is not equal to ‘Total industry’.

604

At this point the script could be used as is, but in Power BI we would need to consider how the measure(s) would be created. For example, we could have a generic measure [Value] = SUM('Tourism'[FullValue]). We would then use the Metric column as a filter/slicer. More likely you would want a measure such as [Units Available] = CALCULATE(SUM('Tourism'[FullValue]), 'Tourism'[Metric] = "Stay units available"). In this case it is cumbersome to create the measures, so we rather reshape the table to make things easier.

Step 6: Pivot the Metrics (optional)

If the metrics provided are fixed (consistent), then pivoting the metrics is a good idea. This creates a column for each metric. This is possible with the Pandas pivot_table method.

df3 = pandas.pivot_table(df2,
                         values='FullValue',
                         columns=['Metric'],
                         index=['H16','Date','Category'])

From the results below, this appears to be correct.

605

The side effect of the pivot_table method is that the dataframe is re-indexed based on the index parameter passed to the method. This is not normally a problem, but for Power BI it is. You may have noted from earlier posts that the Python script in Power BI does not provide access to the dataframe index, so we would in effect lose the first three columns from our result above.

The fix is as simple as re-indexing the dataframe. We can also drop the ‘Total income’ column as we can let Power BI handle the aggregations.

df3.reset_index(inplace=True)
df3 = df3.drop(['Total income'], axis=1)

Our final script below will produce 3 dataframes which will be available for selection in Power BI.

import pandas

#read Excel into dataframe
df = pandas.read_excel('D:\\data\\excel\\tourism.xlsx')

#create place holder lists for actual headers vs months
headers = []
months = []

#split columns into headers and months
for col in list(df):
    if col.startswith('MO'):
        months.append(col)
    else:
        headers.append(col)

#unpivot months
df2 = pandas.melt(df,
                  id_vars=headers,
                  value_vars=months,
                  var_name='Date',
                  value_name='Val')

#create a date field from the month field
df2['Date'] = pandas.to_datetime(df2['Date'].str[4:] + '-' + df2['Date'].str[2:4] + '-01')

#create a function to resolve mixed numeric scale in the data
def noscale(row):
    if row['H17'] == 'Thousand':
        return row['Val'] * 1000
    elif row['H17'] =='R million':
        return row['Val'] * 1000000
    elif row['H17'] =='Percentage':
        return row['Val'] / 100
    else:
        return row['Val']

#apply the noscale function to the dataframe
df2['FullValue'] = df2.apply(noscale, axis=1)

#split composite column
df2['Metric'] = df2['H04'].str.split(' - ').str.get(0)
df2['Category'] = df2['H04'].str.split(' - ').str.get(1)
#remove aggregate rows
df2 = df2[df2.Category != 'Total industry']
#remove unwanted columns
df2 = df2.drop(['H01','H02','H03','H04','H15','H17','H25','Val'], axis=1)

#unpivot metrics
df3 = pandas.pivot_table(df2,
                         values='FullValue',
                         columns=['Metric'],
                         index=['H16','Date','Category'])
#reset index
df3.reset_index(inplace=True)
#drop aggregate column
df3 = df3.drop(['Total income'], axis=1)

All that is left is to paste your script into Power BI and select the dataframe that works best for you.

606

Of course go and experiment with Python visualization 😊

 

 

Posted in Power BI, Python, Python4PowerBI

Power BI date table with Python

I spoke about Python integration in Power BI at SQLSaturday #793 in Cape Town this past weekend. You can download the deck (which includes the notes) and the Python scripts (which include sample data) from the SQLSaturday site. One of the demos included the creation of a date table to support time intelligence DAX functions in Power BI. I thought the example would be interesting for a number of reasons:

  • Time intelligence is one of my favorite features of Power BI (Power Pivot & SSAS)
  • A few examples of M script to create date tables already exist (it would be interesting to compare)
  • Illustrate some handy dataframe capability (including manual creation of utility dataframes, and treating dataframes like tables to perform join operations)

Before we start, lets quickly recap the basic criteria for a date table in Power BI:

  • A separate date table is required (and joined to the table to calculate against)
  • The date table must be at a day level (1 row equals 1 day)
  • The days must be contiguous (no gaps even if your data is not daily)
  • The days must be available for full years (1 Jan to 31 Dec)
  • A column of data type ‘Date’ must exist on the date table

Step 1: Create a list of sequential dates

For this we use the Pandas date_range method.

501

The output shows the first 10 and last 10 elements of the list, the data type and the number of elements (length).
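
The complete script is in the SQLSaturday download; a minimal sketch of this step, assuming an arbitrary fixed date range, looks something like this:

import pandas
dates = pandas.date_range(start='2015-01-01', end='2020-12-31', freq='D') #one element per day; the dates are placeholders
print(dates)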

Step 2: Create a dataframe from the list

A list can easily be converted into a dataframe. The result is a dataframe with one column.

502

The column is assigned a default name of 0. The second line renames the column.
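
Continuing the sketch from step 1 (again, not the exact code from the download):

df = pandas.DataFrame(dates)        #single column with the default name 0
df = df.rename(columns={0: 'Date'}) #rename the column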

Step 3: Derive additional columns

For this step we make use of the string format time (strftime) method to derive required columns from the Date column. Where a numeric data type column is required, we explicitly convert the column using the dataframe apply function.

503
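
The exact set of derived columns is shown in the screenshot; the pattern is roughly the following, where the column names other than MonthNumber (which is reused in step 5) are illustrative:

df['Year'] = df['Date'].dt.strftime('%Y').apply(int)        #numeric year
df['MonthNumber'] = df['Date'].dt.strftime('%m').apply(int) #numeric month, used for the join in step 5
df['MonthName'] = df['Date'].dt.strftime('%B')              #e.g. January
df['DayOfMonth'] = df['Date'].dt.strftime('%d').apply(int)  #numeric day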

At this step you already have a basic date table. In many cases financial calendars do not end in December. In this example the financial year ends in March.

Step 4: Create a financial calendar lookup dataframe

I took a different approach with the financial calendar columns because I could easily do this inside the script with Python. Instead of using conditional evaluation (if, case, switch, etc.), I decided to create a lookup “table” that I could join to. The lookup table would either provide the column I need, or help me create the column I need.

504

In the creation of this fy lookup dataframe, each column is provided as a list input i.e. each row in the code adds a column to the dataframe. All the lookup dataframe does is remap the financial periods to start in April (calendar month 4). The FinYearDif column will help us calculate the financial year in the next step.
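
As a sketch (the exact values depend on how you label the financial year; here a year ending in March is labelled by the calendar year it ends in):

fy = pandas.DataFrame({'MonthNumber':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                       'FinMonthNumber': [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9],  #April becomes financial month 1
                       'FinYearDif':     [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]})    #offset added to the calendar year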

Step 5: Add the financial year periods

We do this by doing a left outer join to the lookup dataframe. This is achieved through the merge method. We specify MonthNumber as the join column and outer as the join type. Finally we derive the financial year and then drop the helper column (axis=1 indicates the column axis).

505
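
Roughly, with the column names from the sketches above:

df = pandas.merge(df, fy, on='MonthNumber', how='outer') #join the calendar months to the financial lookup
df['FinYear'] = df['Year'] + df['FinYearDif']            #derive the financial year
df = df.drop(['FinYearDif'], axis=1)                     #drop the helper column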

Add to Power BI with the Python Script …

Bonus step: Persist result to a SQL Server database

I hinted previously that it's easy to persist the data to a SQL Server database using the dataframe.to_sql method. Add this after your code:

506
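
The screenshot contains the actual code; as a sketch, assuming SQLAlchemy and the pyodbc driver are installed (the connection string and table name are placeholders):

from sqlalchemy import create_engine

engine = create_engine('mssql+pyodbc://YourServer/YourDatabase?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes')
df.to_sql('DimDate', engine, if_exists='replace', index=False) #the new last line persists the date table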

You may want to revisit the post on SQL Server connectivity. The last line of code is all that’s new.

A thought

I added the to_sql method and used this in my Python script for Power BI. If this will be supported in the Power BI Service, I may consider writing to SQL Server in the Python script, then reading the result from SQL Server. I could even run stored procedures that only load deltas to SQL Server. Should something like this be allowed in the Power BI Service, it could mean that a Power BI refresh also runs my (ETL) process to persist data to SQL Server.

Download the script

As mentioned at the start of this post, this was part of a SQL Saturday demo. The complete script (and more) is available from the SQL Saturday website for download.

I’ll unpack the other examples in future posts.

 

Posted in Power BI, Python, Self-Service BI