
Workflows, the GCP Orchestrator

GCP - This article is part of a series.
Part 3: This Article

Workflows is one of those products that is deceptively simple but extremely useful. At a glance, it's YAML, it calls APIs, and it glues things together. And that's precisely the point.

A lot of real systems are not hard because one service is complicated. They are hard because five small services need to behave like one coherent process, and the coordination logic ends up smeared across cron jobs, HTTP handlers, retries, ad-hoc queues, and a very optimistic amount of application code.

That is the hole Workflows is trying to fill.

Workflow, what’s that?
#

Basically:

  • a workflow is a definition
  • an execution is one run of that definition
  • Workflows is the managed orchestration layer that runs those executions

It is serverless, it can hold state, it can branch, retry, poll, sleep, wait for callbacks, and keep an execution history you can inspect later. Google’s own overview is explicit about the broad shape here: Workflows orchestrates services including Cloud Run, BigQuery, and generic HTTP APIs, and it can hold state, retry, poll, or wait for up to a year.

In other words: you can model a multi-step process without owning the worker that is sitting around coordinating it.
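To get a feel for the shape of a definition, here is roughly the smallest workflow possible — a sketch that calls no real service, just takes an argument and returns a value:

```yaml
# Minimal workflow definition: one entry point, one parameter, one step.
main:
  params: [input]
  steps:
    - sayHello:
        return: ${"Hello, " + input.name}
```

Everything else in the product is elaboration on this shape: more steps, more control flow, and steps that actually touch the outside world.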

Google Cloud Workflows orchestration diagram showing Workflows coordinating multiple services and APIs.
Source: Google Cloud Workflows documentation. Google Cloud content is licensed under CC BY 4.0.
flowchart LR
    A[Trigger] --> B[Workflow execution]
    B --> C[Call Cloud Run]
    B --> D[Call BigQuery / Google APIs]
    B --> E[Wait, retry, or branch]
    E --> F[Return result or fail cleanly]

Why Not Just Put This in App Code?
#

Because app code is usually a terrible place for some kinds of orchestration.

If your whole “workflow” is one request hitting one service that calls one database, then no, you probably do not need Workflows. Just write the code.

But things get ugly when the process starts looking like this:

  1. call service A
  2. wait for something slow to finish
  3. call service B
  4. branch on the response
  5. retry one part but not the other
  6. maybe wait for an approval or external callback
  7. keep an execution trail so someone can debug it later

You can jam all of that into app code, and people do it all the time! But that does not mean it is a good idea.

Workflows starts making sense when the main problem is not computation, but coordination.

Connectors vs HTTP Calls
#

Workflows can call things in two broad ways:

  • connectors for Google Cloud APIs
  • HTTP requests for generic HTTP endpoints, including Cloud Run

Connectors are not just convenience wrappers. Google’s docs explicitly call out that connectors handle request formatting for Google Cloud APIs, use built-in IAM auth, and come with built-in behavior for retries and long-running operations.

That means connectors are great when you want to talk to Google Cloud services as Google Cloud services.

But HTTP is different.

If you are calling a generic endpoint, or hitting a Cloud Run service as a normal web endpoint, that is an HTTP call. That is also the right place to talk about trust boundaries: HTTP calls are usually where your workflow stops talking to Google-managed APIs and starts talking to application endpoints.
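As a sketch of the difference, the first step below uses the Pub/Sub connector, while the second calls a Cloud Run service over plain HTTP (the project, topic, and URL are placeholder values):

```yaml
# Connector call: Workflows formats the request, authenticates with IAM,
# and applies built-in retry behavior for the Google Cloud API.
- publishMessage:
    call: googleapis.pubsub.v1.projects.topics.publish
    args:
      topic: projects/my-project/topics/my-topic
      body:
        messages:
          - data: ${base64.encode(text.encode("hello"))}
    result: publish_result

# HTTP call: a plain request to an application endpoint, here a Cloud Run
# service, authenticated with an OIDC identity token.
- callService:
    call: http.post
    args:
      url: https://my-service-abc123-uc.a.run.app/process
      auth:
        type: OIDC
      body:
        source: workflow
    result: service_response
```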

Connectors are not the same thing as invoking Cloud Run

Connector-based Google API calls should not be confused with calling a service like Cloud Run. Cloud Run is typically invoked over HTTP.

How Workflows Actually Gets Triggered
#

Workflows is not limited to one trigger style. You can start executions in a few different ways, and each one fits a slightly different kind of problem.

  • manual: console, CLI, API, client libraries
  • scheduled: Cloud Scheduler
  • event-driven: Eventarc, including Pub/Sub-backed events
  • buffered: Cloud Tasks

Manual Execution
#

The obvious one is manual execution. Google’s execution docs cover running a workflow from the console, the CLI, direct API requests, and client libraries. That is the “I want to kick this off now” path. Useful for testing, useful for operators, and useful when another service wants to start an execution programmatically.

Scheduled Execution
#

Then there is the scheduled path. If what you really want is “run this every five minutes” or “run this every Monday at 9 AM,” the usual tool is Cloud Scheduler. Under the hood, Scheduler hits the Workflows executions API on a cron-like schedule. So yes, cron-style orchestration is absolutely a normal Workflows use case.

Event-Driven Execution
#

Then there is the event-driven path, which is where people usually start asking about Pub/Sub. Yes, Workflows can absolutely be event-driven, but the important detail is how. Workflows is not the event bus. It is the thing that gets invoked by the event-routing layer. The usual setup is Eventarc receiving an event, routing it, and using Workflows as the destination. Pub/Sub can be one of those event sources.

  • Eventarc can trigger a workflow from supported events or Pub/Sub messages
  • the event is considered delivered as soon as the workflow execution starts
  • if the workflow starts and later fails, Eventarc does not retry just because the workflow failed afterward
  • the deduplication window for exactly-once event processing is 24 hours
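When Eventarc routes a Pub/Sub message to a workflow, the event arrives as the workflow's argument, with the message payload base64-encoded inside it. A sketch of unpacking it, assuming the standard Pub/Sub event shape:

```yaml
main:
  params: [event]
  steps:
    - decodeMessage:
        assign:
          # Pub/Sub message data arrives base64-encoded inside the event.
          - payload: ${text.decode(base64.decode(event.data.message.data))}
    - done:
        return: ${payload}
```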

Queued Execution
#

There is also a queued or buffered path, which is useful when you expect bursts. Google’s Cloud Tasks integration docs show how to push workflow executions through a task queue instead of firing them directly. That gives you rate control, retry control, and a buffer when requests might otherwise exceed Workflows limits. The important caveat is that Cloud Tasks helps you reliably start the execution. It does not monitor the workflow to completion. If the workflow starts and later fails, that is now the workflow’s problem, not the queue’s.

How to Read a Workflow File
#

You can easily see the steps involved with the visualizer tool in the GCP console, but in my opinion that's not enough. The better way to read a workflow definition is to read the source and treat it like a control-flow document:

flowchart TD
    A[main] --> B[params]
    A --> C[steps]
    C --> D[explicit next]
    C --> E[implicit fallthrough]
    C --> F[control flow]
    C --> G[side effects]
    C --> H[reliability behavior]
    F --> I[switch / for]
    F --> J[parallel branches]
    G --> K[HTTP calls / connectors]
    H --> L[sleep / callbacks / polling]

1. Start with main
#

That is the entry point. Find the main block first and do not get distracted by anything else until you know:

  • what parameters enter the workflow
  • what the first side effect is
  • what the final return path looks like

2. Look for the control flow
#

Once you know where execution starts, find:

  • next
  • switch
  • for
  • parallel
  • subworkflows
  • raise
  • return

That tells you the shape of the process. Is it linear? Branching? Fan-out? Waiting on multiple paths? That is the skeleton.

The first thing I check here is whether the flow is explicit or implicit. In Workflows, a step can point to the next step with next, or it can simply fall through to the next step in order. If the file uses a lot of explicit next, that usually means the author wanted the control flow to be very deliberate. If it mostly relies on fallthrough, the workflow is probably meant to read top to bottom like a script.
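A sketch of both styles side by side:

```yaml
# Explicit: the switch jumps directly to a named step, and the step-level
# `next` catches the case where no condition matched.
- checkMode:
    switch:
      - condition: ${input.mode == "fast"}
        next: fastPath
    next: slowPath

- fastPath:
    return: "took the fast path"

# Implicit: without a `next`, execution falls through to the step below.
- slowPath:
    assign:
      - note: "took the slow path"
- finish:
    return: ${note}
```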

parallel deserves special attention. That is the point where the workflow stops being a straight line and starts doing more than one thing at once. When you see parallel, ask two questions immediately:

  • which branches are independent
  • what has to finish before execution can move on

That is usually where the real intent lives.
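A sketch of a parallel step with two independent branches writing into a shared variable (the URLs are placeholders). Both questions above have visible answers here: the branches share nothing except `results`, and execution only continues past the parallel step once both branches have finished:

```yaml
- initResults:
    assign:
      - results: {}

- fetchBoth:
    parallel:
      # Variables written inside branches must be declared shared.
      shared: [results]
      branches:
        - getUsers:
            steps:
              - callUsers:
                  call: http.get
                  args:
                    url: https://example.com/users
                  result: users_resp
              - saveUsers:
                  assign:
                    - results.users: ${users_resp.body}
        - getOrders:
            steps:
              - callOrders:
                  call: http.get
                  args:
                    url: https://example.com/orders
                  result: orders_resp
              - saveOrders:
                  assign:
                    - results.orders: ${orders_resp.body}
```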

3. Look for the side effects
#

Now find the steps that actually touch the outside world:

  • connector calls
  • HTTP calls
  • callback creation
  • polling
  • pub/sub publish

Those are the steps with operational consequences.

4. Look for reliability behavior
#

  • retries
  • backoff
  • sleeps
  • callbacks
  • timeouts

If a step has retries, that tells you the author expects transient failure. If a step sleeps or polls, that tells you the workflow is coordinating a slow external process. If a step creates a callback, that tells you the process is waiting for another actor to resume it.

A Small Example
#

Here is a deliberately small workflow that calls a Cloud Run service, waits, checks status, and branches based on the result:

main:
  params: [input]
  steps:
    - init:
        assign:
          - run_url: ${sys.get_env("RUN_URL")}
          - job_id: ${input.job_id}

    - startJob:
        call: http.post
        args:
          url: ${run_url + "/start"}
          auth:
            type: OIDC
          body:
            job_id: ${job_id}
        result: start_response

    - waitABit:
        call: sys.sleep
        args:
          seconds: 10

    - checkStatus:
        call: http.get
        args:
          url: ${run_url + "/status/" + job_id}
          auth:
            type: OIDC
        result: status_response

    - decide:
        switch:
          - condition: ${status_response.body.state == "DONE"}
            return: ${status_response.body}
          - condition: ${status_response.body.state == "FAILED"}
            raise: "Job failed"

    - notFinished:
        raise: "Job is still running"

In this scenario, the useful reading is:

  • input enters through job_id
  • the workflow kicks off work in Cloud Run
  • it waits without owning a worker
  • it checks state through another HTTP call
  • the workflow itself owns the branching logic

Retries, Waiting, and Callbacks
#

This is where Workflows starts showing its real value.

Retries
#

Retries are first-class behavior in Workflows. Connector calls also come with built-in retry behavior, and Google’s connector docs call out idempotent retry policy for GET and non-idempotent retry policy for other methods. That means reliability is not an afterthought you bolt on later. It is part of the workflow definition itself.
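For HTTP steps, retries are declared with a try/retry block. A sketch using the built-in default retry predicate (the URL is a placeholder):

```yaml
- fetchWithRetry:
    try:
      call: http.get
      args:
        url: https://example.com/flaky-endpoint
      result: resp
    retry:
      # Retry only on errors the default predicate considers transient
      # (429s and certain 5xx responses).
      predicate: ${http.default_retry_predicate}
      max_retries: 5
      backoff:
        initial_delay: 2
        max_delay: 60
        multiplier: 2
```

Keep the pricing section below in mind when tuning this: every retry attempt is another billed step execution.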

Polling and long-running operations
#

Connectors also know how to deal with long-running operations. The docs note that connectors can poll long-running operations periodically using exponential backoff, and each polling attempt counts as a billable step. That is useful, but not free.

So the tradeoff is straightforward:

  • more polling gives you lower latency
  • less polling gives you lower cost
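Connector polling can be tuned through connector_params. A sketch of backing off the poll rate on a hypothetical BigQuery job, trading latency for fewer billable polling steps (the project and query are placeholders):

```yaml
- runQuery:
    call: googleapis.bigquery.v2.jobs.insert
    args:
      projectId: my-project
      body:
        configuration:
          query:
            query: "SELECT 1"
            useLegacySql: false
      connector_params:
        # Allow up to 30 minutes for the long-running operation, and
        # back off polling aggressively since each poll is a billed step.
        timeout: 1800
        polling_policy:
          initial_delay: 10
          multiplier: 2
          max_delay: 120
    result: query_job
```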

Callbacks
#

Callbacks are one of the more interesting parts of the product. They let a workflow pause and wait for another service to make a request to a callback endpoint, which resumes the execution. Google positions this as a way to wait on an event without polling.

This is great for:

  • approval flows
  • external systems that need to resume the workflow
  • human-in-the-loop steps
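A sketch of the callback pattern: create an endpoint, hand its URL to an external system (the notify step and its URL are illustrative), then block until someone POSTs to it:

```yaml
- createCallback:
    call: events.create_callback_endpoint
    args:
      http_callback_method: "POST"
    result: callback_details

# Hand the callback URL to whoever needs to approve (hypothetical endpoint).
- notifyApprover:
    call: http.post
    args:
      url: https://example.com/request-approval
      body:
        callback_url: ${callback_details.url}

# Pause the execution, without polling, until the approval POST arrives.
- awaitApproval:
    call: events.await_callback
    args:
      callback: ${callback_details}
      timeout: 86400
    result: approval

- done:
    return: ${approval.http_request.body}
```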

Pricing and Limits
#

Workflows pricing is step-based, which is exactly the kind of thing that sounds simple until it is not. The main billing unit is step execution.

Google’s pricing page is explicit about a few things that matter:

  • successful steps count
  • failed steps count
  • retried steps count again
  • each retry attempt is another billed step execution

There is also a distinction between internal and external steps:

  • internal steps include calls to *.googleapis.com, sys.log, and polling by connectors for long-running operations
  • external steps include requests to external APIs and waits for HTTP callbacks

One surprisingly useful nuance from the pricing docs: if you call Google Cloud services through their normal Google-hosted domains like *.run.app, those calls are billed as internal steps, not external ones. Custom domains do not get that treatment.

So if you are using Workflows to orchestrate Cloud Run, the hostname choice can change the bill.

Does retry delay itself get billed?

Not as some separate “waiting fee,” no. The billable unit is step execution. The attempts cost money, and the backoff strategy changes cost indirectly by changing how many attempts happen and how fast they happen.

Limits worth knowing up front
#

There are a few limits worth knowing early so they do not surprise you later:

  • max execution duration: 1 year
  • execution retention after completion: 90 days
  • concurrent executions per region per project: 10,000
  • backlogged executions per region per project: 100,000
  • max steps in a single execution: 100,000
  • source code size: 128 KB
  • cumulative size for variables, arguments, and events: 512 KB

Those numbers are from the current quota docs, and they are enough to tell you what kind of product this is:

  • long-lived orchestration is fine
  • giant payloads are not
  • it can scale, but not infinitely and not without quota thinking

So When Should You Reach for It?
#

Workflows is good at:

  • orchestrating several Cloud Run services
  • stitching together Google Cloud APIs
  • waiting on slow external processes
  • handling approval or callback flows
  • making multi-step operational flows explicit and inspectable

In other words, it is good when the hard part is coordination.

However, it is a poor fit for:

  • CPU-heavy work
  • large data processing
  • high-frequency per-request logic that belongs inside an application
  • giant branching application logic that should just live in code

If your first instinct is “can I shove the whole application into Workflows?” the answer is probably no. That is not a failure of the product. That is just using the wrong tool.

Key Takeaway
#

Reach for Workflows when you have a multi-step process that crosses service boundaries, has visible state transitions, and would become annoying to coordinate cleanly in application code.

Do not reach for it just because you want everything to be declarative. That is how you end up with orchestration YAML pretending to be an application framework, and nobody deserves that.

Further Reading
#

Workflows overview
Official overview of Workflows, including orchestration scope, state, retries, polling, and callbacks.
https://docs.cloud.google.com/workflows/docs/overview

Understand connectors
Official connector docs covering formatting, IAM auth, retries, and long-running operations.
https://cloud.google.com/workflows/docs/connectors

Syntax overview
Official syntax reference for workflow structure, steps, control flow, retries, and expressions.
https://docs.cloud.google.com/workflows/docs/reference/syntax

Trigger a workflow with events or Pub/Sub messages
Official docs for Eventarc and Pub/Sub-triggered Workflows, including delivery and dedup behavior.
https://docs.cloud.google.com/workflows/docs/trigger-workflow-eventarc

Wait using callbacks
Official callback docs for pausing and resuming workflow executions without polling.
https://docs.cloud.google.com/workflows/docs/creating-callback-endpoints

Workflows pricing
Official pricing page for internal steps, external steps, retries, callbacks, and cost optimization.
https://cloud.google.com/workflows/pricing

Quotas and limits
Official quotas and limits for execution duration, retention, concurrency, data size, and workflow size.
https://docs.cloud.google.com/workflows/quotas

Closing Thoughts
#

Workflows is easy to underestimate because it looks simple. But that simplicity is the product. It is there to make process logic explicit, inspectable, and less tangled than the alternative.

If Cloud Run is where your service code lives, Workflows is one of the places where the coordination logic can finally stop pretending to be app code. That alone makes it worth understanding.

Jevin Laudo
Backend engineer passionate about scalable systems, tech, and sharing what I learn.