Export Lists from Databricks

Export Databricks audience data for personalization and segmentation in Airship.

View as Markdown

Data teams use Databricks to generate insights and audience segments with SQL and Python. You can expand on these existing processes by sending the data to Airship to create attributesMetadata used for audience segmentation and personalization. Attributes extend the concept of tags by adding comparison operators and values to determine whether or not to target a user, helping you better evaluate your audience. or Uploaded (Static) ListsA reusable audience list that you create. Uploaded lists are static and updatable. In the API, they are referred to as Static Lists. for targeting and personalization.

Two methods are available:

  • Upload directly to Airship’s SFTP server.
  • Write to an Amazon S3 bucket that Airship retrieves on a schedule.

Both approaches expect CSV files formatted as described in SFTP upload for CSV files.

Sample attributes table

The code examples on this page reference a Delta table named sample_attributes. The schema below shows the shape of the data the examples assume. In practice, your attributes will come from your existing Databricks tables or views. To follow along with the examples as-is, create the table using this SQL:

Create a sample table
CREATE TABLE sample_attributes (
    named_user STRING,
    first_name STRING,
    last_name STRING,
    loyalty_id LONG,
    favourite_color STRING,
    tacos_desired LONG
);
INSERT INTO sample_attributes VALUES ('8732eda2-c13c-4a2a-9123-fb5bf0bccffb', 'John', 'Smith', 60001, 'Green', 77);
INSERT INTO sample_attributes VALUES ('ba06daf2-66f8-43ce-a152-b0605b9b834e', 'Alice', 'Jones', 87301, 'Blue', 1);

Export to Airship using SFTP

Follow these steps to upload data from Databricks to Airship using SFTP on a schedule.

Generate SFTP keys

Airship’s SFTP server uses SSH key pairs for authentication. Follow the steps in SFTP upload for CSV files to generate your key pair and add the public key to Airship. Set the Purpose to Attributes, and note the host, port, and username for use in the following Databricks steps.

Store credentials in Databricks secrets

Next, use the Databricks CLI to store your SFTP username and private key in a Databricks secret scope so they are encrypted and not exposed in notebook code:

Create a secret scope and store credentials
databricks secrets create-scope airship-sftp
databricks secrets put-secret airship-sftp username --string-value 'YOUR_SFTP_USERNAME'
databricks secrets put-secret airship-sftp private-key --string-value "$(cat /path/to/your/private_key.pem)"

Create a Databricks notebook

Now you can create a Databricks notebook using the following Python code. It queries your source table, writes the result to CSV, and uploads it to Airship using SFTP. The first cell installs paramiko, which isn’t included in the Databricks runtime by default.

Upload data to Airship using SFTP
%pip install paramiko

import paramiko
from io import StringIO

# Set these to the host and port from your Airship SFTP setup
host = "sftp.airship.com"
port = 5222

# Retrieve credentials from Databricks secrets
username = dbutils.secrets.get(scope="airship-sftp", key="username")
private_key_str = dbutils.secrets.get(scope="airship-sftp", key="private-key")

# Convert the private key to a file-like object
private_key = paramiko.RSAKey.from_private_key(StringIO(private_key_str))

# Query your source table and write to a temporary CSV
df = spark.sql("SELECT * FROM sample_attributes").toPandas()
df.to_csv("/tmp/out.csv", index=False)

# Connect and upload using SFTP
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    ssh.connect(host, port=port, username=username, pkey=private_key)
    sftp = ssh.open_sftp()
    sftp.put("/tmp/out.csv", "/out.csv", confirm=False)
    # confirm=False is required because Airship SFTP does not support `ls`
    print("Upload successful")
except Exception as e:
    print(f"Error with SFTP: {e}")
finally:
    ssh.close()

Schedule the export

In your Databricks notebook, you can run the export on a recurring schedule:

  1. Select Schedule, then Add a schedule.
  2. Set your desired frequency, then select Create.

Export to Amazon S3

Follow these steps to upload data from Databricks to an Amazon S3 bucket that Airship retrieves on a schedule.

Set up the S3 inbound integration in Airship

Follow the steps in Configure an inbound integration in Amazon Web Services to connect Airship to your S3 bucket. When prompted, select Attributes or Static Lists as the data type.

Note the bucket name and folder path you configure. Your Databricks notebook will write files to that location.

Store AWS credentials in Databricks secrets

Next, use the Databricks CLI to store your AWS access key ID and secret access key in a Databricks secret scope so they are encrypted and not exposed in notebook code:

Create a secret scope and store credentials
databricks secrets create-scope airship-s3
databricks secrets put-secret airship-s3 access-key-id --string-value 'YOUR_AWS_ACCESS_KEY_ID'
databricks secrets put-secret airship-s3 secret-access-key --string-value 'YOUR_AWS_SECRET_ACCESS_KEY'

Create a Databricks notebook

Now you can create a Databricks notebook using the following Python code. It queries your source table, writes the result to CSV, and uploads it to your S3 bucket. boto3 is available on Databricks clusters without additional installation.

Upload data to Amazon S3
import boto3
import io

# Retrieve AWS credentials from Databricks secrets
access_key = dbutils.secrets.get(scope="airship-s3", key="access-key-id")
secret_key = dbutils.secrets.get(scope="airship-s3", key="secret-access-key")

# Query your source table
df = spark.sql("SELECT * FROM sample_attributes").toPandas()

# Write to S3
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)

s3 = boto3.client(
    "s3",
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    region_name="YOUR_BUCKET_REGION",  # for example, us-east-1
)
s3.put_object(
    Bucket="YOUR_BUCKET_NAME",
    Key="airship/attributes.csv",
    Body=csv_buffer.getvalue(),
)
print("Upload successful")

Airship tracks which files it has already processed, so only files added since the last sync are ingested. You are responsible for managing processed files in your bucket, such as setting a time-to-live (TTL) policy to delete old files.

Schedule the export

In your Databricks notebook, you can run the export on a recurring schedule that aligns with your Airship retrieval frequency:

  1. Select Schedule, then Add a schedule.
  2. Set your desired frequency, then select Create.

Targeting users

Once your data is transferred, you can use it to target users. See: