# Export Lists from Databricks

Export Databricks audience data for personalization and segmentation in Airship.

Data teams use Databricks to generate insights and audience segments with SQL and Python. You can expand on these existing processes by sending the data to Airship to create [attributes](https://www.airship.com/docs/reference/glossary/#attribute) or [Uploaded (Static) Lists](https://www.airship.com/docs/reference/glossary/#uploaded_list) for targeting and personalization.

Two methods are available:

- Upload directly to Airship's SFTP server.
- Write to an Amazon S3 bucket that Airship retrieves on a schedule.

Both approaches expect CSV files formatted as described in [SFTP upload for CSV files](https://www.airship.com/docs/guides/audience/segmentation/sftp-upload/).

## Sample attributes table

The code examples on this page reference a Delta table named `sample_attributes`. The schema below shows the shape of the data the examples assume. In practice, your attributes will come from your existing Databricks tables or views. To follow along with the examples as-is, create the table using this SQL:

**Create a sample table**

```sql
CREATE TABLE sample_attributes (
    named_user STRING,
    first_name STRING,
    last_name STRING,
    loyalty_id LONG,
    favourite_color STRING,
    tacos_desired LONG
);
INSERT INTO sample_attributes VALUES ('8732eda2-c13c-4a2a-9123-fb5bf0bccffb', 'John', 'Smith', 60001, 'Green', 77);
INSERT INTO sample_attributes VALUES ('ba06daf2-66f8-43ce-a152-b0605b9b834e', 'Alice', 'Jones', 87301, 'Blue', 1);
```


## Export to Airship using SFTP

Follow these steps to upload data from Databricks to Airship using SFTP on a schedule.

### Generate SFTP keys

Airship's SFTP server uses SSH key pairs for authentication. Follow the steps in [SFTP upload for CSV files](https://www.airship.com/docs/guides/audience/segmentation/sftp-upload/) to generate your key pair and add the public key to Airship. Set the Purpose to **Attributes**, and note the host, port, and username for use in the following Databricks steps.

### Store credentials in Databricks secrets

Next, use the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/install.html) to store your SFTP username and private key in a Databricks secret scope so they are encrypted and not exposed in notebook code:

**Create a secret scope and store credentials**

```bash
databricks secrets create-scope airship-sftp
databricks secrets put-secret airship-sftp username --string-value 'YOUR_SFTP_USERNAME'
databricks secrets put-secret airship-sftp private-key --string-value "$(cat /path/to/your/private_key.pem)"
```


### Create a Databricks notebook

Now you can create a Databricks notebook using the following Python code. It queries your source table, writes the result to CSV, and uploads it to Airship using SFTP. The first cell installs `paramiko`, which isn't included in the Databricks runtime by default.

**Upload data to Airship using SFTP**

```python
%pip install paramiko

import paramiko
from io import StringIO

# Set these to the host and port from your Airship SFTP setup
host = "sftp.airship.com"
port = 5222

# Retrieve credentials from Databricks secrets
username = dbutils.secrets.get(scope="airship-sftp", key="username")
private_key_str = dbutils.secrets.get(scope="airship-sftp", key="private-key")

# Convert the private key to a file-like object
private_key = paramiko.RSAKey.from_private_key(StringIO(private_key_str))

# Query your source table and write to a temporary CSV
df = spark.sql("SELECT * FROM sample_attributes").toPandas()
df.to_csv("/tmp/out.csv", index=False)

# Connect and upload using SFTP
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    ssh.connect(host, port=port, username=username, pkey=private_key)
    sftp = ssh.open_sftp()
    sftp.put("/tmp/out.csv", "/out.csv", confirm=False)
    # confirm=False is required because Airship SFTP does not support `ls`
    print("Upload successful")
except Exception as e:
    print(f"Error with SFTP: {e}")
finally:
    ssh.close()
```


### Schedule the export

In your Databricks notebook, you can run the export on a recurring schedule:

1. Select **Schedule**, then **Add a schedule**.
1. Set your desired frequency, then select **Create**.

## Export to Amazon S3

Follow these steps to upload data from Databricks to an Amazon S3 bucket that Airship retrieves on a schedule.

### Set up the S3 inbound integration in Airship

Follow the steps in [Configure an inbound integration](https://www.airship.com/docs/integrations/aws/#configure-an-inbound-integration) in *Amazon Web Services* to connect Airship to your S3 bucket. When prompted, select **Attributes** or **Static Lists** as the data type.

Note the bucket name and folder path you configure. Your Databricks notebook will write files to that location.

### Store AWS credentials in Databricks secrets

Next, use the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/install.html) to store your AWS access key ID and secret access key in a Databricks secret scope so they are encrypted and not exposed in notebook code:

**Create a secret scope and store credentials**

```bash
databricks secrets create-scope airship-s3
databricks secrets put-secret airship-s3 access-key-id --string-value 'YOUR_AWS_ACCESS_KEY_ID'
databricks secrets put-secret airship-s3 secret-access-key --string-value 'YOUR_AWS_SECRET_ACCESS_KEY'
```


### Create a Databricks notebook

Now you can create a Databricks notebook using the following Python code. It queries your source table, writes the result to CSV, and uploads it to your S3 bucket. `boto3` is available on Databricks clusters without additional installation.

**Upload data to Amazon S3**

```python
import boto3
import io

# Retrieve AWS credentials from Databricks secrets
access_key = dbutils.secrets.get(scope="airship-s3", key="access-key-id")
secret_key = dbutils.secrets.get(scope="airship-s3", key="secret-access-key")

# Query your source table
df = spark.sql("SELECT * FROM sample_attributes").toPandas()

# Write to S3
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)

s3 = boto3.client(
    "s3",
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    region_name="YOUR_BUCKET_REGION",  # for example, us-east-1
)
s3.put_object(
    Bucket="YOUR_BUCKET_NAME",
    Key="airship/attributes.csv",
    Body=csv_buffer.getvalue(),
)
print("Upload successful")
```


Airship tracks which files it has already processed, so only files added since the last sync are ingested. You are responsible for managing processed files in your bucket, such as setting a time-to-live (TTL) policy to delete old files.

### Schedule the export

In your Databricks notebook, you can run the export on a recurring schedule that aligns with your Airship retrieval frequency:

1. Select **Schedule**, then **Add a schedule**.
1. Set your desired frequency, then select **Create**.

## Targeting users

Once your data is transferred, you can use it to target users. See:

* [Targeting your audience using attributes](https://www.airship.com/docs/guides/audience/attributes/targeting/)
* [Using Uploaded Lists](https://www.airship.com/docs/guides/audience/segmentation/audience-lists/uploaded/#using-uploaded-lists)
* [Segments](https://www.airship.com/docs/guides/audience/segmentation/segments/)
