
Set up Google Cloud Storage

Dynamically import tasks and export annotations to Google Cloud Storage (GCS) buckets in Label Studio. For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Configure access to your Google Cloud Storage bucket

First, review the information in Cloud storage for projects and Secure access to cloud storage.

Then you will need to complete the following prerequisites:

1. Enable programmatic access to your bucket

See Cloud Storage Client Libraries in the Google Cloud Storage documentation for how to set up access to your GCS bucket.

2. Set up authentication to your bucket

Your account must have the Service Account Token Creator and Storage Object Viewer roles and storage.buckets.get access permission. See Setting up authentication and IAM permissions for Cloud Storage in the Google Cloud Storage documentation.

note

If you are using Workload Identity Federation (WIF), see Service account permissions below.

3. Configure CORS

Set up cross-origin resource sharing (CORS) access to your bucket, using a policy that allows GET access from the same host name as your Label Studio deployment. See Configuring cross-origin resource sharing (CORS) in the Google Cloud User Guide.

note

This is only required if you are using pre-signed URLs. If you are using proxying, you do not have to configure CORS. For more information, see Pre-signed URLs vs Storage proxies.

Use or modify the following example:

echo '[
   {
      "origin": ["*"],
      "method": ["GET"],
      "responseHeader": ["Content-Type","Access-Control-Allow-Origin"],
      "maxAgeSeconds": 3600
   }
]' > cors-config.json

Replace YOUR_BUCKET_NAME with your actual bucket name in the following command to update CORS for your bucket:

gsutil cors set cors-config.json gs://YOUR_BUCKET_NAME
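
The example policy above allows any origin ("*"). As a minimal sketch, the same file could be generated with the origin restricted to your Label Studio host, which is what the CORS prerequisite describes. The function name build_cors_config and the host https://label-studio.example.com are illustrative assumptions, not values from Label Studio:

```python
import json


def build_cors_config(origin: str, max_age_seconds: int = 3600) -> list:
    """Return a GCS CORS policy that allows GET requests from one origin."""
    return [
        {
            "origin": [origin],
            "method": ["GET"],
            "responseHeader": ["Content-Type", "Access-Control-Allow-Origin"],
            "maxAgeSeconds": max_age_seconds,
        }
    ]


# Hypothetical Label Studio host; replace it with your own deployment URL.
policy = build_cors_config("https://label-studio.example.com")

# This text can be saved as cors-config.json and applied with gsutil.
cors_json = json.dumps(policy, indent=3)
print(cors_json)
```

The printed JSON has the same shape as the cors-config.json file shown above, so it can be applied with the same gsutil cors set command.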

Google Cloud Storage

Before you begin:

Google Application Credentials

You will need to provide Google Application Credentials. This is a JSON key file that you supply when setting up your storage.

  1. From the Google Cloud Console, go to IAM & Admin > Service Accounts.
  2. Select the specific service account you need credentials for. If you don’t have one, create a new one.
  3. In the service account details, go to the Keys tab and click Add Key > Create new key.
  4. Select the JSON key type and click Create. The JSON file will be generated and automatically downloaded to your computer.

note

If you're using a service account to authorize access to the Google Cloud Platform, make sure to activate it. See gcloud auth activate-service-account.
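
Before pasting the downloaded key into Label Studio, it can help to confirm that the file really is a service account key. The sketch below checks a few fields that Google service account key files contain. The helper name missing_key_fields and the choice of which fields to check are assumptions made for illustration:

```python
import json

# Fields found in a Google service account key file (a subset, for illustration).
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(key_json: str) -> list:
    """Return a list of problems found in the key file contents."""
    data = json.loads(key_json)
    problems = [field for field in REQUIRED_KEY_FIELDS if not data.get(field)]
    # Other credential types (for example "authorized_user") are not service account keys.
    if data.get("type") not in (None, "service_account"):
        problems.append("type (expected 'service_account')")
    return problems
```

An empty list means the basic structure looks right. It does not prove that the key is active or that the account has the required roles.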

Create a source storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Source Storage.

Select Google Cloud Storage and click Next.

Configure Connection

Complete the following fields and then click Test connection:

  • Storage Title: Enter a name to identify the storage connection.
  • Bucket Name: Enter the name of your GCS bucket.
  • Google Application Credentials: Enter the JSON file with the GCS credentials you created to manage authentication for your bucket.

    On-prem users: Alternatively, you can use the GOOGLE_APPLICATION_CREDENTIALS environment variable and/or set up Application Default Credentials, so that users do not need to configure credentials manually.

    See Application Default Credentials for enhanced security below.
  • Google Project ID: Enter the ID of your Google project in which the bucket is located (for example, my-label-studio-project).

    If you're unsure, you can find this in Google Cloud Console under IAM & Admin > Settings.
  • Use pre-signed URLs (On) / Proxy through the platform (Off): This determines how data from your bucket is loaded:
      • Use pre-signed URLs: Label Studio generates time-limited HTTPS links directly to your S3/GCS/Azure objects and redirects the browser there (HTTP 303), so annotators’ browsers download media straight from cloud storage. This is usually faster and scales better, but requires correct CORS and presign permissions on the bucket. It also means traffic flows from browser to storage, not through Label Studio.
      • Proxy through the platform: The backend downloads the file from cloud storage and streams it to the browser, so all media traffic passes through the Label Studio server. This keeps data fully inside the Label Studio/network boundary, enforces task-level access checks on every request, and avoids CORS/presign setup, but uses more Label Studio worker resources and can be slightly slower.

    For more information, see Pre-signed URLs vs Storage proxies.
  • Expire pre-signed URLs (minutes): Control how long pre-signed URLs remain valid.
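
To make the pre-signed URL option more concrete, the sketch below shows how a time-limited GET link could be produced with the generate_signed_url method of a google-cloud-storage Blob. This is an illustration only, not Label Studio's internal code. The function name presign_get_url and the upper-bound check are assumptions; the bound reflects the 7-day maximum lifetime of V4 signed URLs.

```python
from datetime import timedelta

# V4 signed URLs can be valid for at most 7 days.
MAX_V4_EXPIRY_MINUTES = 7 * 24 * 60


def presign_get_url(blob, expire_minutes: int) -> str:
    """Return a time-limited GET URL for a google.cloud.storage.Blob-like object."""
    if not 0 < expire_minutes <= MAX_V4_EXPIRY_MINUTES:
        raise ValueError("expire_minutes must be between 1 and 10080")
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=expire_minutes),
        method="GET",
    )
```

The browser fetches the returned URL directly from GCS, which is why the CORS policy and signing permissions matter for this mode and not for proxying.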

Import Settings & Preview

Complete the following fields and then click Load preview to ensure you are syncing the correct data:

  • Bucket Prefix: Optionally, enter the directory name within your bucket that you would like to use. For example, data-set-1 or data-set-1/subfolder-2.
  • Import Method: Select whether you want to create a task for each file in your bucket or whether you would like to use a JSON/JSONL/Parquet file to define the data for each task.
  • File Name Filter: Specify a regular expression to filter bucket objects. Use .* to collect all objects.
  • Scan all sub-folders: Enable this option to perform a recursive scan across subfolders within your bucket.
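
The interaction between the prefix, the file name filter, and the sub-folder option can be sketched as follows. This is a simplified model for previewing which objects a configuration would pick up. The function name select_objects is hypothetical, and the choice to match the regular expression against the full object name from its start is an assumption, so check the preview in Label Studio for the actual result.

```python
import re


def select_objects(names, prefix="", regex_filter=".*", recursive=True):
    """Return the object names that a prefix/regex/recursive setting would keep."""
    pattern = re.compile(regex_filter)
    folder = prefix.rstrip("/") + "/" if prefix else ""
    selected = []
    for name in names:
        if name.endswith("/"):
            continue  # skip folder placeholder objects
        if folder and not name.startswith(folder):
            continue  # outside the bucket prefix
        relative = name[len(folder):]
        if not recursive and "/" in relative:
            continue  # object lives in a sub-folder
        if pattern.match(name):
            selected.append(name)
    return selected


objects = [
    "data-set-1/a.jpg",
    "data-set-1/subfolder-2/b.jpg",
    "data-set-1/notes.txt",
    "other/c.jpg",
]
print(select_objects(objects, prefix="data-set-1", regex_filter=r".*\.jpg"))
```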

Review & Confirm

If everything looks correct, click Save & Sync to sync immediately, or click Save to save your settings and sync later.

Tip

You can also use the API to sync import storage.

Create a target storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Target Storage.

Select Google Cloud Storage and click Next.

Complete the following fields:

  • Storage Title: Enter a name to identify the storage connection.
  • Bucket Name: Enter the name of your GCS bucket.
  • Bucket Prefix: Optionally, enter the directory name within your bucket that you would like to use. For example, data-set-1 or data-set-1/subfolder-2.
  • Google Application Credentials: Enter the JSON file with the GCS credentials you created to manage authentication for your bucket.

    On-prem users: Alternatively, you can use the GOOGLE_APPLICATION_CREDENTIALS environment variable and/or set up Application Default Credentials, so that users do not need to configure credentials manually.

    See Application Default Credentials for enhanced security below.
  • Google Project ID: Enter the ID of your Google project in which the bucket is located (for example, my-label-studio-project).

    If you're unsure, you can find this in Google Cloud Console under IAM & Admin > Settings.
  • Can delete objects from storage: Enable this option if you want to delete annotations stored in the bucket when they are deleted in Label Studio. Your credentials must include the ability to delete bucket objects.

After adding the storage, click Sync.

Tip

You can also use the API to sync export storage.

Application Default Credentials for enhanced security for GCS

If you use Label Studio on-premises with Google Cloud Storage, you can set up Application Default Credentials to provide cloud storage authentication globally for all projects, so users do not need to configure credentials manually.

The recommended way to do this is by using the GOOGLE_APPLICATION_CREDENTIALS environment variable. For example:

export GOOGLE_APPLICATION_CREDENTIALS=json-file-with-GCP-creds-23441-8f8sd99vsd115a.json
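
A relative path in this variable is resolved against the working directory of the process, so it is worth confirming that the variable points to a readable JSON file from the environment where Label Studio runs. The following is a minimal check under that assumption; the helper name adc_key_path is hypothetical:

```python
import json
import os


def adc_key_path(environ=None):
    """Return the absolute path of the configured key file, or None if unset."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return None
    # Opening and parsing the file raises an error if it is missing or not JSON.
    with open(path, encoding="utf-8") as handle:
        json.load(handle)
    return os.path.abspath(path)
```

Calling adc_key_path() with no arguments inspects the current process environment.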

Google Cloud Storage with Workload Identity Federation (WIF)

You can also use Workload Identity Federation (WIF) pools with Google Cloud Storage.

Unlike with application credentials, WIF allows you to use temporary credentials. Each time you make a request to GCS, Label Studio connects to your identity pool to request temporary credentials.

For more information about WIF, see Google Cloud - Workload Identity Federation.

Before you begin:

Service account permissions

You will need a service account that has the following permissions:

  • Bucket: Storage Admin (roles/storage.admin)
  • Project: Service Account Token Creator (roles/iam.serviceAccountTokenCreator)
  • Project: Storage Object Viewer (roles/storage.objectViewer)

See Create service accounts in the Google Cloud documentation.

Create a Workload Identity Pool

There are several methods you can use to create a WIF pool.

Using Terraform

An example script is provided below. Ensure all required variables are set:

  • GCP project variables:

    • var.gcp_project_name

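    • var.gcp_region

    • var.label_studio_gcp_sa_name (the existing service account; see Service account permissions above)

  • SaaS provided by HumanSignal: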

    • var.aws_account_id = 490065312183

    • var.aws_role_name = label-studio-app-production

Then run:

terraform init
terraform plan
terraform apply

Once applied, you will have a functioning Workload Identity Pool that trusts the Label Studio AWS IAM Role.

## Variables
/* AWS variables are so that AWS-hosted Label Studio resources can reach out to request credentials */

variable "gcp_project_name" {
  type        = string
  description = "GCP Project name"
}

variable "gcp_region" {
  type        = string
  description = "GCP Region"
}

variable "label_studio_gcp_sa_name" {
  type        = string
  description = "GCP Label Studio Service Account Name"
}

variable "aws_account_id" {
  type        = string
  description = "AWS account ID"
}

variable "aws_role_name" {
  type        = string
  description = "AWS Role name"
}

variable "external_ids" {
  type        = list(string)
  default     = []
  description = "List of external ids"
}

## Outputs

output "GCP_WORKLOAD_ID" {
  value = google_iam_workload_identity_pool_provider.label-studio-provider-jwt.workload_identity_pool_id
}

output "GCP_WORKLOAD_PROVIDER" {
  value = google_iam_workload_identity_pool_provider.label-studio-provider-jwt.workload_identity_pool_provider_id
}

## Main

provider "google" {
  project = var.gcp_project_name
  region  = var.gcp_region
}

resource "random_id" "random" {
  byte_length = 4
}

locals {
  aws_assumed_role = "arn:aws:sts::${var.aws_account_id}:assumed-role/${var.aws_role_name}"

  external_id_condition = (
    length(var.external_ids) > 0
    ? format("(attribute.aws_role == \"%s\") && (attribute.external_id in [%s])",
      local.aws_assumed_role,
      join(", ", formatlist("\"%s\"", var.external_ids))
    )
    : format("(attribute.aws_role == \"%s\")", local.aws_assumed_role)
  )
}

resource "google_iam_workload_identity_pool" "label-studio-pool" {
  workload_identity_pool_id = "label-studio-pool-${random_id.random.hex}"
  project                   = var.gcp_project_name
}

resource "google_iam_workload_identity_pool_provider" "label-studio-provider-jwt" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.label-studio-pool.workload_identity_pool_id
  workload_identity_pool_provider_id = "label-studio-jwt-${random_id.random.hex}"

  attribute_condition = local.external_id_condition

  attribute_mapping = {
    "google.subject"        = "assertion.arn"
    "attribute.aws_account" = "assertion.account"
    "attribute.aws_role"    = "assertion.arn.contains('assumed-role') ? assertion.arn.extract('{account_arn}assumed-role/') + 'assumed-role/' + assertion.arn.extract('assumed-role/{role_name}/') : assertion.arn"
    "attribute.external_id" = "assertion.external_id"
  }

  aws {
    account_id = var.aws_account_id
  }
}

data "google_service_account" "existing_sa" {
  account_id = var.label_studio_gcp_sa_name
}

resource "google_service_account_iam_binding" "label-studio-sa-oidc" {
  service_account_id = data.google_service_account.existing_sa.name
  role               = "roles/iam.workloadIdentityUser"

  members = [
    "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.label-studio-pool.name}/attribute.aws_role/${local.aws_assumed_role}"
  ]
}
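
The external_id_condition local in the script above is the least obvious part, so here is a rough Python equivalent of how that condition string is assembled. It mirrors the format, join, and formatlist calls in the locals block. The function name build_attribute_condition is an illustrative assumption:

```python
def build_attribute_condition(aws_account_id, aws_role_name, external_ids=()):
    """Build the provider attribute condition the same way the locals block does."""
    assumed_role = f"arn:aws:sts::{aws_account_id}:assumed-role/{aws_role_name}"
    role_check = f'(attribute.aws_role == "{assumed_role}")'
    if external_ids:
        # Equivalent of join(", ", formatlist("\"%s\"", var.external_ids))
        quoted = ", ".join(f'"{external_id}"' for external_id in external_ids)
        return f"{role_check} && (attribute.external_id in [{quoted}])"
    return role_check


print(build_attribute_condition("490065312183", "label-studio-app-production"))
```

With an empty external_ids list, the condition only checks the AWS role. With one or more IDs, the request must also carry one of the listed external IDs.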

Using the gcloud command line

Replace the bracketed variables ([PROJECT_ID], [POOL_ID], [PROVIDER_ID], etc.) with your own values.

Make sure you escape quotes or use single quotes when necessary.

  1. Create the Workload Identity pool:

    gcloud iam workload-identity-pools create [POOL_ID] \
    --project=[PROJECT_ID] \
    --location="global" \
    --display-name="[POOL_DISPLAY_NAME]"

    Where:

    • [POOL_ID] is the ID that you want to assign to your WIF pool (for example, label-studio-pool-abc123). Note this because you will need to reuse it later.
    • [PROJECT_ID] is the ID of your Google Cloud project.
    • [POOL_DISPLAY_NAME] is a human-readable name for your pool (optional, but recommended).
  2. Create the provider for AWS.

    This allows AWS principals that have the correct external ID and AWS role configured to impersonate the Google Cloud service account. This is necessary because the Label Studio resources making the request are hosted in AWS.

    gcloud iam workload-identity-pools providers create-aws [PROVIDER_ID] \
    --workload-identity-pool="[POOL_ID]" \
    --location="global" \
    --account-id="490065312183" \
    --attribute-condition="attribute.aws_role==\"arn:aws:sts::490065312183:assumed-role/label-studio-app-production\"" \
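    --attribute-mapping="google.subject=assertion.arn,attribute.aws_account=assertion.account,attribute.aws_role=assertion.arn.contains('assumed-role') ? assertion.arn.extract('{account_arn}assumed-role/') + 'assumed-role/' + assertion.arn.extract('assumed-role/{role_name}/') : assertion.arn,attribute.external_id=assertion.external_id"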

    Where:

    • [PROVIDER_ID] is a provider ID (for example, label-studio-app-production).
    • [POOL_ID] is the pool ID you provided in step 1.
  3. Grant the service account that you created earlier the iam.workloadIdentityUser role.

    gcloud iam service-accounts add-iam-policy-binding [SERVICE_ACCOUNT_EMAIL] \
    --role="roles/iam.workloadIdentityUser" \
    --member="principalSet://iam.googleapis.com/projects/[PROJECT_NUMBER]/locations/global/workloadIdentityPools/[POOL_ID]/attribute.aws_role/arn:aws:sts::490065312183:assumed-role/label-studio-app-production"

    Where:

    • [SERVICE_ACCOUNT_EMAIL] is the email associated with your GCS service account (for example, my-service-account@[PROJECT_ID].iam.gserviceaccount.com).

    • [PROJECT_NUMBER] is your Google project number. This is different than the project ID. You can find the project number with the following command:

      gcloud projects describe [PROJECT_ID] --format="value(projectNumber)"

    • [POOL_ID] is the pool ID you provided in step 1.
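
The member string in step 3 is easy to get wrong because it takes the project number, not the project ID. As a small sketch, it could be assembled and checked like this. The function name principal_set_member is hypothetical, and the default account and role values are the ones used in the commands above:

```python
def principal_set_member(
    project_number,
    pool_id,
    aws_account_id="490065312183",
    aws_role_name="label-studio-app-production",
):
    """Build the principalSet member for the workloadIdentityUser binding."""
    if not str(project_number).isdigit():
        raise ValueError("project_number must be numeric; it is not the project ID")
    return (
        "principalSet://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/workloadIdentityPools/{pool_id}"
        f"/attribute.aws_role/arn:aws:sts::{aws_account_id}:assumed-role/{aws_role_name}"
    )


# Example values are placeholders, not real identifiers.
print(principal_set_member("123456789012", "label-studio-pool-abc123"))
```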

Before setting up your connection in Label Studio, note what you provided for the following variables (you will be asked to provide them):

  • [POOL_ID]
  • [PROVIDER_ID]
  • [SERVICE_ACCOUNT_EMAIL]
  • [PROJECT_NUMBER]
  • [PROJECT_ID]

Using the Google Cloud Console

Before you begin, ensure you are in the correct project:

Screenshot of the GCS console with project highlighted

  1. From the Google Cloud Console, navigate to IAM & Admin > Workload Identity Pools.

  2. Click Get Started to enable the APIs.

  3. Under Create an identity pool, complete the following fields:

    • Name: This is the pool ID (for example, label-studio-pool-abc123). Note this ID because you will need it again later.
    • Description: This is the display name for the pool (for example, “Label Studio Pool”).
  4. Under Add a provider pool, complete the following fields:

    • Select a provider: Select AWS. This is where the Label Studio components responsible for issuing requests are hosted.
    • Provider name: Enter Label Studio App Production (you can use a different display name, but you need to ensure that the corresponding provider ID is still label-studio-app-production)
    • Provider ID: Enter label-studio-app-production.
    • AWS Account ID: Enter 490065312183.
  5. Under Configure provider attributes, enter the following:

    • Click Add condition and then enter the following:

      attribute.aws_role=="arn:aws:sts::490065312183:assumed-role/label-studio-app-production"

    • Click Edit mapping and then add the following:

      • google.subject = assertion.arn
      • attribute.aws_role = assertion.arn.contains('assumed-role') ? assertion.arn.extract('{account_arn}assumed-role/') + 'assumed-role/' + assertion.arn.extract('assumed-role/{role_name}/') : assertion.arn (this might be filled in by default)
      • attribute.aws_account = assertion.account
      • attribute.external_id = assertion.external_id
  6. Click Save.

  7. Go to IAM & Admin > Service Accounts and find the service account you want to allow AWS (Label Studio) to impersonate. See Service account permissions above.

  8. From the Principals with access tab, click Grant Access.

    Screenshot of grant access button

  9. In the New principals field, add the following:

    principalSet://iam.googleapis.com/projects/[PROJECT_NUMBER]/locations/global/workloadIdentityPools/[POOL_ID]/attribute.aws_role/arn:aws:sts::490065312183:assumed-role/label-studio-app-production

    Where:

    • [PROJECT_NUMBER] - Replace this with your Google project number. This is different than the project ID. To find the project number, go to IAM & Admin > Settings.
    • [POOL_ID] - Replace this with the pool ID (the Name you entered in step 3 above, e.g. label-studio-pool-abc123).
  10. Under Assign Roles, use the search field in the Role drop-down menu to find the Workload Identity User role.

    Screenshot of principal window

  11. Click Save.

Before setting up your connection in Label Studio, note the following (you will be asked to provide them):

  • Your pool ID - available from IAM & Admin > Workload Identity Pools
  • Your provider ID - available from IAM & Admin > Workload Identity Pools (this should be label-studio-app-production)
  • Your service account email - available from IAM & Admin > Service Accounts. Select the service account and the email is listed under Details.
  • Your Google project number - available from IAM & Admin > Settings
  • Your Google project ID - available from IAM & Admin > Settings
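
The attribute.aws_role mapping entered in step 5 strips the session name from an assumed-role ARN, so the result can be compared with the role ARN in the condition. The following Python function approximates what that mapping expression does. It is an illustration, not an exact reimplementation of the extract() function, and the name normalize_aws_role is hypothetical:

```python
def normalize_aws_role(arn: str) -> str:
    """Approximate the attribute.aws_role mapping: drop the session name."""
    marker = "assumed-role/"
    if "assumed-role" not in arn:
        return arn  # not an assumed role, so the ARN is used unchanged
    account_arn, _, rest = arn.partition(marker)
    role_name = rest.split("/", 1)[0]
    return account_arn + marker + role_name


# The session name below is a made-up example.
session_arn = "arn:aws:sts::490065312183:assumed-role/label-studio-app-production/session-123"
print(normalize_aws_role(session_arn))
```

The normalized value is what the condition attribute.aws_role=="arn:aws:sts::490065312183:assumed-role/label-studio-app-production" is evaluated against.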

Create a source storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Source Storage.

Select Google Cloud Storage (WIF Auth) and click Next.

Configure Connection

Complete the following fields and then click Test connection:

  • Storage Title: Enter a name to identify the storage connection.
  • Bucket Name: Enter the name of your GCS bucket.
  • Workload Identity Pool ID: This is the ID you specified when creating the Workload Identity Pool. You can find this in Google Cloud Console under IAM & Admin > Workload Identity Pools.
  • Workload Identity Provider ID: This is the ID you specified when setting up the provider. You can find this in Google Cloud Console under IAM & Admin > Workload Identity Pools.
  • Service Account Email: This is the email associated with the service account you set up as part of the prerequisites. You can find it in the Details page of the service account under IAM & Admin > Service Accounts. For example, labelstudio@random-string-382222.iam.gserviceaccount.com.
  • Google Project ID: Your Google project ID. You can find this in Google Cloud Console under IAM & Admin > Settings.
  • Google Project Number: Your Google project number. You can find this in Google Cloud Console under IAM & Admin > Settings.
  • Use pre-signed URLs (On) / Proxy through the platform (Off): This determines how data from your bucket is loaded:
      • Use pre-signed URLs: Label Studio generates time-limited HTTPS links directly to your S3/GCS/Azure objects and redirects the browser there (HTTP 303), so annotators' browsers download media straight from cloud storage. This is usually faster and scales better, but requires correct CORS and presign permissions on the bucket. It also means traffic flows from browser to storage, not through Label Studio.
      • Proxy through the platform: The backend downloads the file from cloud storage and streams it to the browser, so all media traffic passes through the Label Studio server. This keeps data fully inside the Label Studio/network boundary, enforces task-level access checks on every request, and avoids CORS/presign setup, but uses more Label Studio worker resources and can be slightly slower.

    For more information, see Pre-signed URLs vs Storage proxies.
  • Expire pre-signed URLs (minutes): Control how long pre-signed URLs remain valid.

Import Settings & Preview

Complete the following fields and then click Load preview to ensure you are syncing the correct data:

  • Bucket Prefix: Optionally, enter the directory name within your bucket that you would like to use. For example, data-set-1 or data-set-1/subfolder-2.
  • Import Method: Select whether you want to create a task for each file in your bucket or whether you would like to use a JSON/JSONL/Parquet file to define the data for each task.
  • File Name Filter: Specify a regular expression to filter bucket objects. Use .* to collect all objects.
  • Scan all sub-folders: Enable this option to perform a recursive scan across subfolders within your bucket.

Review & Confirm

If everything looks correct, click Save & Sync to sync immediately, or click Save to save your settings and sync later.

Tip

You can also use the API to sync import storage.

Create a target storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Target Storage.

Select Google Cloud Storage (WIF Auth) and click Next.

Complete the following fields:

  • Storage Title: Enter a name to identify the storage connection.
  • Bucket Name: Enter the name of your GCS bucket.
  • Bucket Prefix: Optionally, enter the directory name within your bucket that you would like to use. For example, data-set-1 or data-set-1/subfolder-2.
  • Workload Identity Pool ID: This is the ID you specified when creating the Workload Identity Pool. You can find this in Google Cloud Console under IAM & Admin > Workload Identity Pools.
  • Workload Identity Provider ID: This is the ID you specified when setting up the provider. You can find this in Google Cloud Console under IAM & Admin > Workload Identity Pools.
  • Service Account Email: This is the email associated with the service account you set up as part of the prerequisites. You can find it in the Details page of the service account under IAM & Admin > Service Accounts. For example, labelstudio@random-string-382222.iam.gserviceaccount.com.
  • Google Project ID: Your Google project ID. You can find this in Google Cloud Console under IAM & Admin > Settings.
  • Google Project Number: Your Google project number. You can find this in Google Cloud Console under IAM & Admin > Settings.
  • Can delete objects from storage: Enable this option if you want to delete annotations stored in the bucket when they are deleted in Label Studio. Your credentials must include the ability to delete bucket objects.

After adding the storage, click Sync.

Tip

You can also use the API to sync export storage.

Add storage with the Label Studio API

You can also use the API to programmatically create connections. See our API documentation.
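
As a rough sketch of what a programmatic setup might look like, the following builds (but does not send) an HTTP request that creates a GCS source storage. The endpoint path /api/storages/gcs, the payload field names, and the Token authorization scheme are assumptions here and should be checked against the API documentation for your Label Studio version. The function name and all example values are hypothetical.

```python
import json
import urllib.request


def build_create_gcs_storage_request(base_url, api_token, project_id, bucket, title, prefix=""):
    """Build a POST request for creating a GCS source storage (assumed endpoint)."""
    payload = {
        "project": project_id,  # Label Studio project ID (assumed field name)
        "title": title,
        "bucket": bucket,
        "prefix": prefix,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/storages/gcs",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {api_token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_gcs_storage_request(
    "https://label-studio.example.com", "YOUR_API_TOKEN", 1, "my-bucket", "GCS source"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Building it separately makes it easy to inspect the URL and payload first.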

IP filtering for enhanced security for GCS

Google Cloud Storage offers bucket IP filtering as a powerful security mechanism to restrict access to your data based on source IP addresses. This feature helps prevent unauthorized access and provides fine-grained control over who can interact with your storage buckets.

Read more about Source storage behind your VPC.

Common Use Cases:

  • Restrict bucket access to only your organization’s IP ranges
  • Allow access only from specific VPC networks in your infrastructure
  • Secure sensitive data by limiting access to known IP addresses
  • Control access for third-party integrations by whitelisting their IPs

How to Set Up IP Filtering
  1. First, create your GCS bucket through the console or CLI.
  2. Create a JSON configuration file to define IP filtering rules. You have two options.

    For public IP ranges (each entry is a single IP address or a range in CIDR notation):

    {
      "mode": "Enabled",
      "publicNetworkSource": {
        "allowedIpCidrRanges": [
          "xxx.xxx.xxx.xxx",
          "xxx.xxx.xxx.xxx",
          "xxx.xxx.xxx.xxx/xx"
        ]
      }
    }

note

If you're using Label Studio Enterprise at app.humansignal.com and accessing it from your office network:

  • Add Label Studio Enterprise outgoing IP addresses (see IP ranges)
  • Add your office network IP range (e.g. 192.168.1.0/24)
  • If both Label Studio Enterprise and your office are on the same VPN network (e.g. 10.0.0.0/16), you only need to add that VPN subnet

For VPC network sources:

{
  "mode": "Enabled",
  "vpcNetworkSources": [
    {
      "network": "projects/PROJECT_ID/global/networks/NETWORK_NAME",
      "allowedIpCidrRanges": [
        "RANGE_CIDR"
      ]
    }
  ]
}
  3. Apply the IP filtering rules to your bucket using the following command:

    gcloud alpha storage buckets update gs://BUCKET_NAME --ip-filter-file=IP_FILTER_CONFIG_FILE
  4. To remove IP filtering rules when no longer needed:

    gcloud alpha storage buckets update gs://BUCKET_NAME --clear-ip-filter

Limitations to Consider

  • Maximum of 200 IP CIDR blocks across all rules
  • Maximum of 25 VPC networks in the IP filter rules
  • Not supported for dual-regional buckets
  • May affect access from certain Google Cloud services
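
Because a malformed range or an oversized list is only rejected when the filter is applied, it can be useful to validate the public IP configuration first. The sketch below builds the publicNetworkSource structure shown earlier and enforces the 200-block limit listed above. The function name build_public_ip_filter and the example addresses (taken from reserved documentation ranges) are illustrative assumptions:

```python
import ipaddress
import json

MAX_CIDR_BLOCKS = 200  # limit listed under "Limitations to Consider"


def build_public_ip_filter(cidr_ranges):
    """Validate the ranges and return the IP filter configuration as a dict."""
    # ip_network raises ValueError for malformed input; single IPs become /32 networks.
    normalized = [str(ipaddress.ip_network(item, strict=False)) for item in cidr_ranges]
    if len(normalized) > MAX_CIDR_BLOCKS:
        raise ValueError(f"at most {MAX_CIDR_BLOCKS} CIDR blocks are allowed")
    return {
        "mode": "Enabled",
        "publicNetworkSource": {"allowedIpCidrRanges": normalized},
    }


config = build_public_ip_filter(["203.0.113.7", "198.51.100.0/24"])
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and passed to the --ip-filter-file option shown above.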

Read more about GCS IP filtering.