No More Tableau Downtime: Metadata API for Proactive Data Health


In today's world, the reliability of data solutions is everything. When we build dashboards and reports, we expect the numbers in them to be correct and up to date. Based on these numbers, insights are drawn and actions are taken. If, for any unexpected reason, the dashboards are broken or the numbers are wrong, it becomes a fire-fight to fix everything. If the issues are not fixed in time, it damages the trust placed in the data team and their solutions.

But why would dashboards break or show flawed numbers? If the dashboard was built correctly the first time, then 99% of the time the issue comes from the data that feeds the dashboards, i.e. from the data warehouse. Some possible scenarios are:

  • A few ETL pipelines failed, so the new data is not in yet
  • A table is replaced with another new one
  • Some columns in the table are dropped or renamed
  • Schemas in the data warehouse have changed
  • And many more.

There is still a chance that the issue is on the Tableau side, but in my experience, most of the time it is due to some changes in the data warehouse. Even though we know the root cause, it is not always straightforward to start working on a fix. There is no central place where you can check which Tableau data sources rely on specific tables. If you have the Tableau Data Management add-on, it could help, but from what I know, it is hard to find dependencies of custom SQL queries used in data sources.

Besides, the add-on is expensive and most companies don't have it. The real pain begins when you have to go through all the data sources manually to start fixing things. On top of that, you have a string of users in your head impatiently waiting for a quick fix. The fix itself might not be difficult; it might just be time-consuming.

What if we could anticipate these issues and identify impacted data sources before anyone notices a problem? Wouldn't that be great? Well, there is a way now, with the Tableau Metadata API. The Metadata API uses GraphQL, a query language for APIs that returns only the data you're interested in. For more information on what's possible with GraphQL, do check out GraphQL.org.

In this blog post, I'll show you how to connect to the Tableau Metadata API using Python's Tableau Server Client (TSC) library to proactively identify data sources that use specific tables, so that you can act fast before any issues arise. Once you know which Tableau data sources are affected by a specific table, you can make some updates yourself or alert the owners of those data sources about the upcoming changes so that they can be prepared for them.

Connecting to the Tableau Metadata API

Let's connect to the Tableau Server using TSC. We need to import all the libraries we will need for this exercise!

### Import all required libraries
import tableauserverclient as t
import pandas as pd
import json
import ast
import re

In order to connect to the Metadata API, you will have to first create a personal access token in your Tableau Account settings. Then replace the <token-name> and <token-secret> placeholders below with the token you just created. Also update <site-name> with your Tableau site name. If the connection is established successfully, then "Connected" will be printed in the output window.

### Connect to Tableau server using a personal access token
tableau_auth = t.PersonalAccessTokenAuth("<token-name>", "<token-secret>", 
                                           site_id="<site-name>")
server = t.Server("https://dub01.online.tableau.com/", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    print("Connected")

Let's now get a list of all the data sources that are published on your site. There are many attributes you can fetch, but for the current use case, let's keep it simple and only get the id, name, and owner contact information for every data source. This will be our master list, to which we will add all the other information.

############### Get the list of all data sources on your Site

all_datasources_query = """ {
  publishedDatasources {
    name
    id
    owner {
      name
      email
    }
  }
}"""
with server.auth.sign_in(tableau_auth):
    result = server.metadata.query(
        all_datasources_query
    )

Since I want this blog to be focused on how to proactively identify which data sources are affected by a specific table, I won't be going into the nuances of the Metadata API. To better understand how the query works, you can refer to Tableau's own very detailed Metadata API documentation.

One thing to note is that the Metadata API returns data in JSON format. Depending on what you're querying, you'll end up with multiple nested JSON lists, and it can get very tricky to convert this into a pandas dataframe. For the above metadata query, you will end up with a result that looks like the one below (this is mock data, just to give you an idea of what the output looks like):

{
  "data": {
    "publishedDatasources": [
      {
        "name": "Sales Performance DataSource",
        "id": "f3b1a2c4-1234-5678-9abc-1234567890ab",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "name": "Customer Orders DataSource",
        "id": "a4d2b3c5-2345-6789-abcd-2345678901bc",
        "owner": {
          "name": "Bob Smith",
          "email": "[email protected]"
        }
      },
      {
        "name": "Product Returns and Profitability",
        "id": "c5e3d4f6-3456-789a-bcde-3456789012cd",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "name": "Customer Segmentation Evaluation",
        "id": "d6f4e5a7-4567-89ab-cdef-4567890123de",
        "owner": {
          "name": "Charlie Lee",
          "email": "[email protected]"
        }
      },
      {
        "name": "Regional Sales Trends (Custom SQL)",
        "id": "e7a5f6b8-5678-9abc-def0-5678901234ef",
        "owner": {
          "name": "Bob Smith",
          "email": "[email protected]"
        }
      }
    ]
  }
}

We need to convert this JSON response into a dataframe so that it is easy to work with. Notice that we need to extract the name and email of the owner from inside the owner object.

### We need to convert the response into a dataframe for easy data manipulation

col_names = result['data']['publishedDatasources'][0].keys()
master_df = pd.DataFrame(columns=col_names)

for i in result['data']['publishedDatasources']:
    tmp_dt = {k:v for k,v in i.items()}
    master_df = pd.concat([master_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the owner name and email from the owner object
master_df['owner_name'] = master_df['owner'].apply(lambda x: x.get('name') if isinstance(x, dict) else None)
master_df['owner_email'] = master_df['owner'].apply(lambda x: x.get('email') if isinstance(x, dict) else None)

master_df.reset_index(inplace=True)
master_df.drop(['index','owner'], axis=1, inplace=True)
print('There are ', master_df.shape[0] , ' datasources in your site')

This is how the structure of master_df would look:

Sample output of code
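
As a side note, if you'd rather not build the dataframe row by row, pandas' json_normalize can flatten the nested owner object in a single call. Below is a minimal sketch; the flattened columns come out as owner.name and owner.email, so a rename is needed to match the master_df structure above.

### Alternative: flatten the nested JSON in one call
master_df_alt = pd.json_normalize(result['data']['publishedDatasources'])

# json_normalize names the nested columns 'owner.name' and 'owner.email'
master_df_alt = master_df_alt.rename(
    columns={'owner.name': 'owner_name', 'owner.email': 'owner_email'}
)
print('There are ', master_df_alt.shape[0], ' datasources in your site')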

Once we have the master list ready, we can go ahead and start getting the names of the tables embedded in the data sources. If you are an avid Tableau user, you know that there are two ways of selecting tables in a Tableau data source: one is to directly select the tables and establish a relationship between them, and the other is to use a custom SQL query with one or more tables to achieve a new resultant table. Therefore, we need to address both cases.

Processing of Custom SQL query tables

Below is the query to get the list of all custom SQLs used on the site along with their data sources. Notice that I have filtered the list to get only the first 500 custom SQL queries. In case there are more in your org, you will have to use an offset to get the next set of custom SQL queries. There is also the option of using the cursor method for pagination if you want to fetch a large list of results (refer here). For the sake of simplicity, I just use the offset method, as I know there are fewer than 500 custom SQL queries used on the site; a cursor-based sketch is also shown right after the query below.

# Get the data sources and the table names from all the custom SQL queries used on your Site

custom_table_query = """  {
  customSQLTablesConnection(first: 500){
    nodes {
        id
        name
        downstreamDatasources {
            name
        }
        query
    }
  }
}
"""

with server.auth.sign_in(tableau_auth):
    custom_table_query_result = server.metadata.query(
        custom_table_query
    )
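
If your site has more than 500 custom SQL queries, you can page through the connection instead. Below is a minimal sketch of the cursor approach, based on the pagination pattern in Tableau's Metadata API docs: ask for pageInfo, then keep passing the endCursor back through the after argument until hasNextPage is false.

### Sketch: cursor-based pagination over customSQLTablesConnection
all_nodes = []
after_token = None

with server.auth.sign_in(tableau_auth):
    while True:
        # Only add the 'after' argument once we have a cursor from the previous page
        after_clause = "" if after_token is None else ", after: " + json.dumps(after_token)
        paginated_query = """ {
          customSQLTablesConnection(first: 500""" + after_clause + """){
            nodes {
                id
                name
                downstreamDatasources {
                    name
                }
                query
            }
            pageInfo {
                hasNextPage
                endCursor
            }
          }
        }"""
        page = server.metadata.query(paginated_query)
        connection = page['data']['customSQLTablesConnection']
        all_nodes.extend(connection['nodes'])
        if not connection['pageInfo']['hasNextPage']:
            break
        after_token = connection['pageInfo']['endCursor']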

Based on our mock data, this is how the output of the customSQLTablesConnection query would look:

{
  "data": {
    "customSQLTablesConnection": {
      "nodes": [
        {
          "id": "csql-1234",
          "name": "RegionalSales_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Regional Sales Trends (Custom SQL)"
            }
          ],
          "query": "SELECT r.region_name, SUM(s.sales_amount) AS total_sales FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Regions r ON s.region_id = r.region_id GROUP BY r.region_name"
        },
        {
          "id": "csql-5678",
          "name": "ProfitabilityAnalysis_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Product Returns and Profitability"
            }
          ],
          "query": "SELECT p.product_category, SUM(s.profit) AS total_profit FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Products p ON s.product_id = p.product_id GROUP BY p.product_category"
        },
        {
          "id": "csql-9101",
          "name": "CustomerSegmentation_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Segmentation Analysis"
            }
          ],
          "query": "SELECT c.customer_id, c.location, COUNT(o.order_id) AS total_orders FROM ecommerce.sales_data.Customers c JOIN ecommerce.sales_data.Orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.location"
        },
        {
          "id": "csql-3141",
          "name": "CustomerOrders_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT o.order_id, o.customer_id, o.order_date, o.sales_amount FROM ecommerce.sales_data.Orders o WHERE o.order_status = 'Accomplished'"
        },
        {
          "id": "csql-3142",
          "name": "CustomerProfiles_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT c.customer_id, c.customer_name, c.segment, c.location FROM ecommerce.sales_data.Customers c WHERE c.active_flag = 1"
        },
        {
          "id": "csql-3143",
          "name": "CustomerReturns_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT r.return_id, r.order_id, r.return_reason FROM ecommerce.sales_data.Returns r"
        }
      ]
    }
  }
}

Just like before, when we were creating the master list of data sources, here too we have nested JSON for the downstream data sources, where we need to extract only the "name" part. In the "query" column, the entire custom SQL is dumped. With a regex pattern, we can easily search for the names of the tables used in the query.

We know that the table names always come after a FROM or a JOIN clause, and they generally follow the format <database>.<schema>.<table>. The <database> part is optional and most of the time not used. There were some queries I found which used this three-part format, and I initially ended up getting only the database and schema names, and not the complete table name. Once we have extracted the names of the data sources and the names of the tables, we need to merge the rows per data source, as there can be multiple custom SQL queries used in a single data source.

### Convert the custom SQL response into a dataframe
col_names = custom_table_query_result['data']['customSQLTablesConnection']['nodes'][0].keys()
cs_df = pd.DataFrame(columns=col_names)

for i in custom_table_query_result['data']['customSQLTablesConnection']['nodes']:
    tmp_dt = {k:v for k,v in i.items()}

    cs_df = pd.concat([cs_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the data source name where the custom SQL query was used
cs_df['data_source'] = cs_df.downstreamDatasources.apply(lambda x: x[0]['name'] if x and 'name' in x[0] else None)
cs_df.reset_index(inplace=True)
cs_df.drop(['index','downstreamDatasources'], axis=1,inplace=True)

### We need to extract the table names from the SQL query. We know the table name comes after a FROM or a JOIN clause
# Note that the table name can be of the format <database>.<schema>.<table>
# Depending on how your tables are named, you may have to modify the regex expression

def extract_tables(sql):
    # Regex to match database.schema.table or schema.table and avoid capturing aliases
    pattern = r'(?:FROM|JOIN)\s+((?:\[\w+\]|\w+)\.(?:\[\w+\]|\w+)(?:\.(?:\[\w+\]|\w+))?)\b'
    matches = re.findall(pattern, sql, re.IGNORECASE)
    return list(set(matches))  # Unique table names

cs_df['customSQLTables'] = cs_df['query'].apply(extract_tables)
cs_df = cs_df[['data_source','customSQLTables']]

# We need to merge data sources as there can be multiple custom SQLs used in the same data source
cs_df = cs_df.groupby('data_source', as_index=False).agg({
    'customSQLTables': lambda x: list(set(item for sublist in x for item in sublist))  # Flatten & make unique
})

print('There are ', cs_df.shape[0], ' data sources with custom SQLs used in them')
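
As a quick sanity check, running extract_tables on the first mock custom SQL from earlier should return both fully qualified table names (the order may vary since a set is used for deduplication):

### Quick sanity check of the regex on one of the mock queries
sample_sql = ("SELECT r.region_name, SUM(s.sales_amount) AS total_sales "
              "FROM ecommerce.sales_data.Sales s "
              "JOIN ecommerce.sales_data.Regions r ON s.region_id = r.region_id "
              "GROUP BY r.region_name")
print(extract_tables(sample_sql))
# Expected (in any order): ['ecommerce.sales_data.Sales', 'ecommerce.sales_data.Regions']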

After we perform all the above operations, this is how the structure of cs_df would look:

Sample output of code

Processing of regular Tables in Data Sources

Now we need to get the list of all the regular tables used in a data source which are not a part of any custom SQL. There are two ways to go about it: either use the publishedDatasources object and check for upstreamTables, or start from DatabaseTable and check which data sources sit downstream of it. I'll go with the first method because I want the results at a data source level (basically, I want some code ready to reuse when I want to check a specific data source in further detail). Here again, for the sake of simplicity, instead of going for pagination, I'm looping through each data source to make sure I have everything. We get the upstreamTables nested inside the fields object, so that has to be cleaned out.

############### Get the data sources with the regular table names used on your site

### It's best to extract the tables information for every data source and then merge the results.
# Since we only get the table information nested under fields, if there are hundreds of fields 
# used in a single data source, we will hit the response limits and will not be able to retrieve all the data.

data_source_list = master_df.name.tolist()

col_names = ['name', 'id', 'extractLastUpdateTime', 'fields']
ds_df = pd.DataFrame(columns=col_names)

with server.auth.sign_in(tableau_auth):
    for ds_name in data_source_list:
        query = """ {
            publishedDatasources (filter: { name: """"+ ds_name + """" }) {
            name
            id
            extractLastUpdateTime
            fields {
                name
                upstreamTables {
                    name
                }
            }
            }
        } """
        ds_name_result = server.metadata.query(
        query
        )
        for i in ds_name_result['data']['publishedDatasources']:
            tmp_dt = {k:v for k,v in i.items() if k != 'fields'}
            tmp_dt['fields'] = json.dumps(i['fields'])
        ds_df = pd.concat([ds_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

ds_df.reset_index(inplace=True)

This is how the structure of ds_df would look:

Sample output of code

We will have to flatten out the fields object and extract the field names as well as the table names. Since the table names will be repeated multiple times, we will have to deduplicate and keep only the unique ones.

# Function to extract the values of fields and upstream tables in json lists
def extract_values(json_list, key):
    values = []
    for item in json_list:
        values.append(item[key])
    return values

ds_df["fields"] = ds_df["fields"].apply(ast.literal_eval)
ds_df['field_names'] = ds_df.apply(lambda x: extract_values(x['fields'],'name'), axis=1)
ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_values(x['fields'],'upstreamTables'), axis=1)

# Function to extract the unique table names 
def extract_upstreamTable_values(table_list):
    values = set()
    for inner_list in table_list:
        for item in inner_list:
            if 'name' in item:
                values.add(item['name'])
    return list(values)

ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_upstreamTable_values(x['upstreamTables']), axis=1)
ds_df.drop(["index","fields"], axis=1, inplace=True)

Once we do the above operations, the final structure of ds_df would look something like this:

Sample output of code

We now have all the pieces, and we just need to merge them together:

###### Join all the data together
master_data = pd.merge(master_df, ds_df, how="left", on=["name","id"])
master_data = pd.merge(master_data, cs_df, how="left", left_on="name", right_on="data_source")

# Save the results to analyse further
master_data.to_excel("Tableau Data Sources with Tables.xlsx", index=False)

This is our final master_data:

Sample Output of code

Table-level Impact Analysis

Let's say there were some schema changes on the “Sales” table and you want to know which data sources will be impacted. You can then simply write a small function that checks whether a table is present in either of the two columns (upstreamTables or customSQLTables), like below.

def filter_rows_with_table(df, col1, col2, target_table):
    """
    Filters rows in df where target_table is part of any value in either col1 or col2 (supports partial match).
    Returns full rows (all columns retained).
    """
    return df[
        df.apply(
            lambda row: 
                (isinstance(row[col1], list) and any(target_table in item for item in row[col1])) or
                (isinstance(row[col2], list) and any(target_table in item for item in row[col2])),
            axis=1
        )
    ]
# For instance 
filter_rows_with_table(master_data, 'upstreamTables', 'customSQLTables', 'Sales')

Below is the output. You can see that 3 data sources will be impacted by this change. You can also alert the data source owners, Alice and Bob, in advance so that they can start working on a fix before something breaks on the Tableau dashboards.

Sample output of code
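
To reach out to Alice and Bob, you could pull their contact details straight from the filtered result, since the owner columns from the master list are already part of master_data. A small sketch:

### Get the impacted data sources along with their owners' contact details
impacted = filter_rows_with_table(master_data, 'upstreamTables', 'customSQLTables', 'Sales')
contacts = impacted[['name', 'owner_name', 'owner_email']].drop_duplicates()
print(contacts)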

You can check out the complete version of the code in my GitHub repository here.

This is just one of the potential use cases of the Tableau Metadata API. You can also extract the field names used in custom SQL queries and add them to the dataset to get a field-level impact analysis. One can also monitor stale data sources with the extractLastUpdateTime field, to see whether those have any issues or need to be archived because they are no longer used. We can also use the dashboards object to fetch information at a dashboard level.
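
As an example of the stale data source idea, here is a rough sketch that flags data sources whose extracts haven't been refreshed in the last 30 days (the 30-day threshold is an arbitrary choice for illustration; data sources without an extract have a null extractLastUpdateTime and are excluded by the comparison):

### Sketch: flag data sources with extracts older than 30 days
master_data['extractLastUpdateTime'] = pd.to_datetime(
    master_data['extractLastUpdateTime'], errors='coerce', utc=True)
cutoff = pd.Timestamp.now(tz='UTC') - pd.Timedelta(days=30)
stale = master_data[master_data['extractLastUpdateTime'] < cutoff]
print(stale[['name', 'owner_name', 'owner_email', 'extractLastUpdateTime']])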

Final Thoughts

If you have come this far, kudos. This is just one use case of automating Tableau data management. It's time to reflect on your own work and think about which other tasks you could automate to make your life easier. I hope this mini-project served as an enjoyable learning experience in understanding the power of the Tableau Metadata API. If you liked reading this, you might also like another one of my blog posts about Tableau, on some of the challenges I faced when dealing with big .

Also do check out my previous blog, where I explored building an interactive, database-powered app with Python, Streamlit, and SQLite.

