NEW! Advanced PDF + OCR Interface for Document AI

Set up Databricks UC volume storage

Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode (presigned URLs are not supported by Databricks).

Prerequisites

Authentication options

PAT (Personal Access Token)

Traditional authentication using a long-lived token created in the Databricks workspace UI

  • Pros: Simple to set up, works everywhere, does not require a paid account
  • Cons: Long-lived credential, tied to user account
  • Best for: Development, testing, personal projects

Generate a PAT

You can generate tokens from your Databricks workspace under Account > Settings > Developer > Access tokens. See Databricks personal access token authentication.

When configuring storage in Label Studio, you will be asked for your access token.

Databricks Service Principal

OAuth-based authentication using a service principal created in the Databricks Account Console. Works on all cloud platforms (AWS, GCP, Azure).

  • Pros: Not tied to user account, OAuth tokens auto-refresh
  • Cons: Requires Databricks Account Console access (not available on the free tier)
  • Best for: Production workloads, automation, CI/CD
  • Token endpoint: {workspace_host}/oidc/v1/token
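Label Studio handles token acquisition for you, but for reference, the following sketch shows the shape of the client-credentials request sent to the token endpoint above. The `all-apis` scope is the standard Databricks workspace-level OAuth scope; the host and credential values are placeholders.

```python
# Sketch: the OAuth client-credentials request behind Service Principal
# auth. Illustrative only -- Label Studio performs this automatically.
from urllib.parse import urlencode

def build_token_request(workspace_host: str, client_id: str, client_secret: str):
    """Return the token endpoint URL and form body for the
    client-credentials grant against a Databricks workspace."""
    url = f"{workspace_host.rstrip('/')}/oidc/v1/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "scope": "all-apis",  # workspace-level Databricks OAuth scope
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return url, body

url, body = build_token_request(
    "https://dbc-example.cloud.databricks.com/", "my-client-id", "my-secret")
print(url)  # https://dbc-example.cloud.databricks.com/oidc/v1/token
```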

Set up Service Principal authentication

  1. Go to the Databricks Account Console and select User management > Service principals > Add service principal.
  2. Generate an OAuth secret under the service principal settings (Secrets > Generate secret).
  3. Save the client_id and generated secret.
  4. Assign the service principal to your workspace.

See Manage service principals and Authorize service principal access to Databricks with OAuth.

When configuring storage in Label Studio, you will be asked for the following:

  • Client ID: The Application ID for your service principal.
  • Client Secret: The client secret you generated after creating your service principal.

note

For Service Principal authentication, Label Studio automatically acquires and refreshes OAuth access tokens (~1 hour lifetime). No manual token rotation needed.

Microsoft Entra Service Principal for Azure Databricks

OAuth-based authentication using an Entra app registration.
Azure Databricks only.

  • Pros: Integrates with Azure identity management, OAuth tokens auto-refresh
  • Cons: Azure-only, requires Entra configuration, requires Databricks Account Console access (not available on the free tier)
  • Best for: Azure environments with centralized identity management
  • Token endpoint: https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token
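As with the Databricks token endpoint, Label Studio requests tokens for you; this sketch only illustrates the Entra client-credentials request implied by the token endpoint above. The Azure Databricks resource scope shown is a commonly documented well-known value, but treat it as an assumption and verify it for your environment; all IDs are placeholders.

```python
# Sketch: the Entra client-credentials token request backing this auth
# method. Illustrative only -- Label Studio performs this automatically.
from urllib.parse import urlencode

# Azure Databricks resource scope (assumed well-known value; verify).
AZURE_DATABRICKS_SCOPE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"

def build_entra_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Return the Entra token endpoint URL and form body for the
    client-credentials grant."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "scope": AZURE_DATABRICKS_SCOPE,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return url, body

url, _ = build_entra_token_request("my-tenant-id", "app-id", "secret")
print(url)  # https://login.microsoftonline.com/my-tenant-id/oauth2/v2.0/token
```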

Set up Service Principal authentication

  1. Open the Microsoft Entra admin center.
  2. Select App registrations and then click New registration. See Register an application.
  3. Under Overview, note Application (client) ID and Directory (tenant) ID.
  4. Under Certificates & secrets, add a new client secret.
  5. Go to the Databricks Account Console and select User management > Service principals > Add service principal.
  6. Enter the Application ID from Entra.
  7. Assign the service principal to your workspace.

For more information, see Authorize service principal access to Azure Databricks with OAuth.

When configuring storage in Label Studio, you will be asked for the following:

  • Azure Tenant ID: The Directory (tenant) ID, available from the Overview page of your app registration.
  • Client ID: The Application (client) ID, available from the Overview page of your app registration.
  • Client Secret: The client secret you generated after registering your application.

note

For Service Principal authentication, Label Studio automatically acquires and refreshes OAuth access tokens (~1 hour lifetime). No manual token rotation needed.

Create a source storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Source Storage.

Select Databricks Files (UC Volumes) and click Next.

Configure Connection

Complete the following fields and then click Test connection:

  • Storage Title: Enter a name for the storage connection to appear in Label Studio.
  • Workspace Host: Enter your workspace URL, for example https://<workspace-identifier>.cloud.databricks.com.
  • Authentication Method: Select an authentication method and then enter the required information. See Authentication options above.
  • Catalog, Schema, Volume: Specify the Unity Catalog coordinates of your volume. You can find these in the Catalog Explorer in Databricks (see screenshot below).

Screenshot of Databricks UI and LS UI

Import Settings & Preview

Complete the following fields and then click Load preview to ensure you are syncing the correct data:

  • Bucket Prefix: Optionally, enter the directory within the volume that you would like to use, for example data-set-1 or data-set-1/subfolder-2.
  • Import Method: Select whether you want to create a task for each file in your volume, or use a JSON/JSONL/Parquet file to define the data for each task.
  • File Name Filter: Specify a regular expression to filter objects. Use .* to collect all objects.
  • Scan all sub-folders: Enable this option to perform a recursive scan across subfolders within your volume.
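To illustrate what a File Name Filter pattern selects, here is a minimal sketch using Python's `re` module. The exact matching semantics inside Label Studio may differ slightly; the file names are made up for illustration.

```python
# Sketch: how a regex like ".*\.(jpg|png)$" narrows the synced objects.
import re

files = [
    "images/cat.jpg",
    "images/dog.png",
    "annotations/readme.txt",
]

pattern = re.compile(r".*\.(jpg|png)$")
matched = [f for f in files if pattern.search(f)]
print(matched)  # ['images/cat.jpg', 'images/dog.png']
```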

Review & Confirm

If everything looks correct, click Save & Sync to sync immediately, or click Save to save your settings and sync later.

Tip

You can also use the API to sync import storage.

URI schema

To reference Databricks files directly in task JSON (without using source storage), use Label Studio’s Databricks URI scheme:

dbx://Volumes/<catalog>/<schema>/<volume>/<path>

Example:

{ "image": "dbx://Volumes/main/default/dataset/images/1.jpg" }
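In proxy mode, Label Studio resolves these URIs internally before fetching the bytes from Databricks. As a rough mental model (the exact mapping is an assumption for illustration), a `dbx://` URI corresponds to a path served by the Databricks Files API:

```python
# Sketch: mapping a dbx:// URI onto the Databricks Files API download
# endpoint (GET /api/2.0/fs/files/{path}). The mapping shown is an
# illustrative assumption; Label Studio does this resolution for you.
from urllib.parse import quote

def dbx_to_files_api_url(workspace_host: str, uri: str) -> str:
    prefix = "dbx://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dbx URI: {uri}")
    path = uri[len(prefix):]  # e.g. Volumes/main/default/dataset/images/1.jpg
    return f"{workspace_host.rstrip('/')}/api/2.0/fs/files/{quote(path)}"

print(dbx_to_files_api_url(
    "https://dbc-example.cloud.databricks.com",
    "dbx://Volumes/main/default/dataset/images/1.jpg"))
```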

Troubleshooting

  • If your file preview returns zero files, verify the path under /Volumes/<catalog>/<schema>/<volume>/<prefix?> and your permissions.
  • Ensure the Workspace Host has no trailing slash and matches your workspace domain.
  • If previews work but media fails to load, confirm proxy mode is allowed for your organization in Label Studio (Organization > Usage & License > Features) and network egress allows Label Studio to reach Databricks.

Proxy and security

This connector streams data through the Label Studio backend with HTTP Range support. Databricks does not support presigned URLs, so this option is also not available in Label Studio.
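HTTP Range support is what lets the proxy stream large media (for example, seeking within a video) without downloading whole files. The following sketch shows the Range header semantics the proxy relies on; the chunk size is arbitrary and the helper is purely illustrative.

```python
# Sketch: splitting a file of total_size bytes into inclusive HTTP
# Range header values, as used for partial-content streaming.
def range_headers(total_size: int, chunk_size: int):
    """Yield HTTP Range header values covering total_size bytes."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1  # Range ends are inclusive
        yield f"bytes={start}-{end}"

print(list(range_headers(10, 4)))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```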

Create a target storage connection

Repeat the steps from the previous section, but select Add Target Storage instead. Use the same workspace host, credentials, and volume path (UC coordinates).

For your Bucket Prefix, set an export folder to use (e.g., exports/${project_id}) and determine whether you want to allow files to be deleted from target storage.

When file deletion is enabled, if you delete an annotation in Label Studio (via UI or API), Label Studio will also delete the corresponding exported JSON file from your target storage for this storage connection.

Note that this only affects files that were exported by that target storage, not your source media or tasks. Your PAT or SP permissions must also allow deletion.

After adding, click Sync to export annotations as JSON files to your target volume.

Tip

You can also use the API to sync export storage.

Add storage with the Label Studio API

You can also use the API to programmatically create connections. See our API documentation.
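As a hypothetical sketch of what such a call might look like: the endpoint path and field names below are assumptions, not the confirmed schema, so check the API documentation for the exact request shape before using this.

```python
# Hypothetical sketch: creating a Databricks source storage via the
# Label Studio API. Endpoint path and payload fields are ASSUMED --
# consult the API reference for the real schema.
import json

LABEL_STUDIO_URL = "https://app.example.com"  # placeholder
API_TOKEN = "YOUR_API_TOKEN"                  # placeholder

payload = {
    "project": 1,                             # target project ID
    "title": "Databricks source",
    "host": "https://dbc-example.cloud.databricks.com",
    "catalog": "main",
    "schema": "default",
    "volume": "dataset",
}

# A POST like the following would create the connection (path assumed):
# requests.post(f"{LABEL_STUDIO_URL}/api/storages/databricks",
#               headers={"Authorization": f"Token {API_TOKEN}"},
#               json=payload)
print(json.dumps(payload, indent=2))
```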