
The human in the machine

Runaway Query Termination via Looker and Slack


By Jay Ozer


I believe that the hallmark of tech-forward businesses in the 2020s will be the ability to combine business intelligence (BI) and data democratization (DD). BI tools allow users across the company to monitor sales, operational, and customer metrics live as they happen, and DD allows individual non-technical users to go beyond just monitoring reports, to actively modifying, adapting, and even creating unique views on the fly, using easy point-and-click interfaces.

One inevitable consequence of BI/DD platforms such as Looker is slow, non-performant queries. Because the SQL is generated procedurally from users' point-and-click interactions, strange inefficiencies come up frequently. It is important to identify those queries, shut them down, and then work with the users to improve them.

During overloaded, extremely busy periods, I've found that even the best-built database can lag under these kinds of queries. If your queries are always queued and never seem to complete, or your Looker dashboards are running slow, read on! An ever-growing list of scheduled jobs in the Queries section of Looker with no completed status in sight is never fun, especially on a Friday. I am going to show you, step by step, how to find runaway queries that bog down your system, terminate them, and then post the details to Slack using the Looker SDK.

Before we get into the solution, some background: it all started when Looker users reached out to me for help with two related issues, a steady increase in the number of incomplete queries (resulting in failed dashboards) and slow overall performance. I asked users to stagger their scheduled runs throughout the day instead of concentrating them in the mornings, and we built workarounds for taxing queries such as merged queries and those with custom filters. These measures helped, but they did not completely solve the issues. That motivated me to find a better solution!

At States Title, we are solution-driven and believe that inclusion and teamwork produce the best results. These are part of our core values. Data democratization allows us to move quickly as one team.

One component of data democratization is reducing the complexity and redundancy of data storage and processing. Looker enables more centralized data sharing and easy data discovery and analysis across multiple input systems, but you must build it! We have been building a modern data stack at States Title with the philosophy of storing all data, all the time, with a uniform structure that allows for rapid data ingestion and consumption. Our Data Engineering Manager, Matthew Phillips, recently wrote about the details on the States Title Data Science Blog.

Our current strategy is to allow all Looker users to develop and schedule dashboards, but as mentioned above, this can create inconsistencies and inefficient queries that often slow down the entire system. Query optimization is tough work with many nuances, but it is necessary for faster response times. We at States Title believe in continuous improvement of the entire reporting platform. That is why I will walk you through the method for locating runaway queries, killing them, and sharing the details on Slack so other Looker users can improve them as a community.

Prerequisites

Create the History dashboard:

This dashboard is going to provide Looker SMEs with actionable analytics and help them respond promptly. Creating metadata reports with i__looker is intuitive, and for this post, Looker Customer Love confirmed that this Look should work on any instance. I saved mine as “Health Dash”; I hope you can come up with a better name! Here is the Expanded URL: simply add your instance name and you should be good to go. Remember, users will need the ‘see_system_activity’ permission to see the History explore. To limit visibility of the dashboard, I created a new permission set and assigned it to each of our Looker users individually. After adding your instance name, save it as a dashboard, because the formats available when posting a Look to a Slack channel are limited to .csv and .txt, and neither works here.

Create Looker.ini and a Slack app:

My goal here was not only to find a practical solution to an existing problem but an actionable one. Posting information to a Slack channel brings information in front of the larger Looker community, and allows for instant insights and remediation of queries as indicated.

To set up Looker.ini, I followed the instructions in the “Configuring the SDK” section of the Looker SDK PyPI page. When this script was developed, the only versions available were 3.0 and 3.1. I have not tested it with version 4.0 yet; if you do, please let me know how it works out for you!

Creating the Looker.ini file is as simple as creating a text file, populating it with the info in the instructions, and saving it in your project directory. The Looker.ini file holds the API version, the base URL for the API, your client ID, and your client secret. Your Looker admin can help you obtain your API keys.
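
A Looker.ini file follows standard INI syntax. Here is a sketch based on the SDK's documented fields; the URL and credentials are placeholders you must replace with your own:

```ini
[Looker]
# API version to use (this post uses 3.1)
api_version=3.1
# Base URL of your instance's API endpoint
base_url=https://your_instance.looker.com:19999
# API3 credentials from your Looker admin
client_id=your_client_id
client_secret=your_client_secret
# Leave true unless you have a specific reason not to verify certs
verify_ssl=true
```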

If you would like to replicate the environment I work in, consider AWS SageMaker. First, I created a base model, then installed the looker_sdk and slackclient packages. See requirements.txt.

To post to Slack via the API, you need to create a Slack app and install the slackclient Python package. I aptly named my app “lookerhealth.” There are many Bot Token Scopes to choose from; the combination of chat:write, chat:write.public, channels:read, files:read, and files:write worked for me. One tip: remember to reinstall the Slack app every time a new scope is added. The Slack API documentation is easy to follow and comprehend, and I was able to debug issues quickly with a little trial and error.

The method

1. The following code imports the looker_sdk and slack libraries and loads all running queries. I am using API version 3.1.

import looker_sdk
import slack
from datetime import datetime

# Load connection parameters with Looker.ini
sdk = looker_sdk.init31("/root/Looker_SDK/Looker.ini")

# Load all running queries 
queries = sdk.all_running_queries() 

2. The following code kills queries that have been running longer than a given threshold and collects their query IDs in a list called results. After watching the queries for a couple of days, which I find fairly soothing, I concluded that none of our reports take more than a few minutes to complete, so a ten-minute threshold is safe.

The “sources_to_exclude” variable lists three sources. For my use case, I didn’t want to cancel any jobs from these; you can alter the list to your needs.

results=[]
exceptions=[]
# Job won't terminate queries from these sources 
sources_to_exclude = ['PDT Regenerator', 'alerts', 'regenerator'] 
# Runtime threshold for query termination - currently set for 10 mins (60secs * 10)
threshold = 60 * 10  

for query in queries: 
    query_created_at = datetime.strptime(query.created_at.split('.')[0].replace('T',' '),'%Y-%m-%d %H:%M:%S')
    source = query.source
    running_time = (datetime.utcnow() - query_created_at).total_seconds()
    if running_time > threshold and source not in sources_to_exclude: 
        results.append(query.query.id) 
        try:
            sdk.kill_query(query.query_task_id)
        except Exception:
            # Record the task IDs we could not kill
            exceptions.append(query.query_task_id)
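
The termination step requires a live Looker instance, but the filtering logic itself can be exercised with mock data. Here is a minimal, self-contained sketch; the MockQuery class and its timestamps are hypothetical stand-ins for the objects the SDK returns:

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for the SDK's running-query objects
class MockQuery:
    def __init__(self, created_at, source, query_task_id):
        self.created_at = created_at
        self.source = source
        self.query_task_id = query_task_id

def overdue_tasks(queries, threshold=600,
                  sources_to_exclude=('PDT Regenerator', 'alerts', 'regenerator')):
    """Return task IDs of queries running longer than threshold seconds."""
    overdue = []
    for q in queries:
        # Same timestamp parsing as the script above
        created = datetime.strptime(
            q.created_at.split('.')[0].replace('T', ' '), '%Y-%m-%d %H:%M:%S')
        running_time = (datetime.utcnow() - created).total_seconds()
        if running_time > threshold and q.source not in sources_to_exclude:
            overdue.append(q.query_task_id)
    return overdue

now = datetime.utcnow()
fmt = '%Y-%m-%dT%H:%M:%S.000+00:00'
queries = [
    MockQuery((now - timedelta(minutes=20)).strftime(fmt), 'explore', 'task_a'),  # overdue
    MockQuery((now - timedelta(minutes=2)).strftime(fmt), 'explore', 'task_b'),   # fresh
    MockQuery((now - timedelta(minutes=30)).strftime(fmt), 'alerts', 'task_c'),   # excluded source
]
print(overdue_tasks(queries))  # ['task_a']
```

Only the twenty-minute query from a non-excluded source is flagged; the fresh query and the excluded-source query are left alone.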

3. The next step is to load the Slack API token and set the Slack channel. To append the query IDs to the dashboard URL, some data prep was necessary: we strip all spaces and drop the square brackets from the stringified list.

# Slack token for lookerhealth app
token='my_slack_token' 
# Name of the Slack channel
channel='#looker_health' 

# https://<your_instance.looker.com>/dashboards/777
# Data prep
res = str(results).replace(" ", "")[1:-1]
# Add queryids
dash = 'https://your_instance.looker.com/dashboards/777?QueryID={}'.format(res) 
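
To see what this data prep produces, you can run the same transformation on a sample list (the query IDs here are made up):

```python
results = [101, 202, 303]
# str(results) gives "[101, 202, 303]"; strip spaces, then drop the brackets
res = str(results).replace(" ", "")[1:-1]
print(res)  # 101,202,303

dash = 'https://your_instance.looker.com/dashboards/777?QueryID={}'.format(res)
print(dash)  # https://your_instance.looker.com/dashboards/777?QueryID=101,202,303
```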

4. The final step is to post the link to the Slack channel with the following post_message_to_slack function, adapted from python-slackclient’s GitHub page.

def post_message_to_slack(channel, text):
    if len(results) != 0: 
        client = slack.WebClient(token=token)
        client.chat_postMessage(channel=channel, text=text)
        
# 'exceptions' holds the query task IDs that could not be killed
# Post the dashboard URL with the 'res' variable as the query filter
post_message_to_slack(channel, dash)
if exceptions:
    message = "Exceptions: {}".format(exceptions)
    post_message_to_slack(channel, message)

Putting it all together: https://gist.github.com/jayozer/96d42c9b12eecfd2b3d95eb70198ee04

5. Celebrate success! A link to “Health Dash” is posted to the Slack channel.

Final thoughts

We currently plan to run the script at the top of every hour to catch runaway queries before they become issues. With this method, Looker users can quickly free their data pipeline of runaway queries and keep the system from bogging down, while we help users improve those queries so the slowdown doesn’t happen again. Throw in an upgrade to Snowflake Enterprise, and the States Title dashboards are now running blazing fast!
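
One simple way to schedule an hourly run is a crontab entry; the script path and interpreter below are assumptions, so adjust them to your environment:

```
# Run the runaway-query check at the top of every hour
0 * * * * /usr/bin/python3 /root/Looker_SDK/kill_runaway_queries.py >> /var/log/looker_health.log 2>&1
```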

 
