
Workflows, the GCP Orchestrator

GCP - This article is part of a series.
Part 3: This Article

Workflows is one of those products that is deceptively simple but extremely useful. At a glance, it's YAML, it calls APIs, and it glues things together. And that's precisely the point.

A lot of real systems are not hard because one service is complicated. They are hard because five small services need to behave like one coherent process, and the coordination logic ends up smeared across cron jobs, HTTP handlers, retries, ad-hoc queues, and a very optimistic amount of application code.

That is the hole Workflows is trying to fill.

Workflow, what’s that?
#

Basically:

  • a workflow is a definition
  • an execution is one run of that definition
  • Workflows is the managed orchestration layer that runs those executions

It is serverless, it can hold state, it can branch, retry, poll, sleep, wait for callbacks, and keep an execution history you can inspect later. Google’s own overview is explicit about the broad shape here: Workflows orchestrates services including Cloud Run, BigQuery, and generic HTTP APIs, and it can hold state, retry, poll, or wait for up to a year.

In other words: you can model a multi-step process without owning the worker that is sitting around coordinating it.
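To get a feel for the shape of a definition, here is roughly the smallest workflow possible — a sketch that calls no real service, just takes an argument and returns a value:

```yaml
# Minimal workflow definition: one entry point, one parameter, one step.
main:
  params: [input]
  steps:
    - sayHello:
        return: ${"Hello, " + input.name}
```

Everything else in the product is elaboration on this shape: more steps, more control flow, and steps that actually touch the outside world.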

Google Cloud Workflows orchestration diagram showing Workflows coordinating multiple services and APIs.
Source: Google Cloud Workflows documentation. Google Cloud content is licensed under CC BY 4.0.
flowchart LR
    A[Trigger] --> B[Workflow execution]
    B --> C[Call Cloud Run]
    B --> D[Call BigQuery / Google APIs]
    B --> E[Wait, retry, or branch]
    E --> F[Return result or fail cleanly]

Why Not Just Put This in App Code?
#

Because app code is usually a terrible place for some kinds of orchestration.

If your whole “workflow” is one request hitting one service that calls one database, then no, you probably do not need Workflows. Just write the code.

But things get ugly when the process starts looking like this:

  1. call service A
  2. wait for something slow to finish
  3. call service B
  4. branch on the response
  5. retry one part but not the other
  6. maybe wait for an approval or external callback
  7. keep an execution trail so someone can debug it later

You can jam all of that into app code, and people do it all the time! But that does not mean it is a good idea.

Workflows starts making sense when the main problem is not computation, but coordination.

Connectors vs HTTP Calls
#

Workflows can call things in two broad ways:

  • connectors for Google Cloud APIs
  • HTTP requests for generic HTTP endpoints, including Cloud Run

Connectors are not just convenience wrappers. Google’s docs explicitly call out that connectors handle request formatting for Google Cloud APIs, use built-in IAM auth, and come with built-in behavior for retries and long-running operations.

That means connectors are great when you want to talk to Google Cloud services as Google Cloud services.

But HTTP is different.

If you are calling a generic endpoint, or hitting a Cloud Run service as a normal web endpoint, that is an HTTP call. That is also the right place to talk about trust boundaries: HTTP calls are usually where your workflow stops talking to Google-managed APIs and starts talking to application endpoints.
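As a sketch of the difference, the first step below uses the Pub/Sub connector, while the second calls a Cloud Run service over plain HTTP (the project, topic, and URL are placeholder values):

```yaml
# Connector call: Workflows formats the request, authenticates with IAM,
# and applies built-in retry behavior for the Google Cloud API.
- publishMessage:
    call: googleapis.pubsub.v1.projects.topics.publish
    args:
      topic: projects/my-project/topics/my-topic
      body:
        messages:
          - data: ${base64.encode(text.encode("hello"))}
    result: publish_result

# HTTP call: a plain request to an application endpoint, here a Cloud Run
# service, authenticated with an OIDC identity token.
- callService:
    call: http.post
    args:
      url: https://my-service-abc123-uc.a.run.app/process
      auth:
        type: OIDC
      body:
        source: workflow
    result: service_response
```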

Connectors are not the same thing as invoking Cloud Run

Connector-based Google API calls should not be confused with calling a service like Cloud Run. Cloud Run is typically invoked over HTTP.

How Workflows Actually Gets Triggered
#

Workflows is not limited to one trigger style. You can start executions in a few different ways, and each one fits a slightly different kind of problem.

  • manual: console, CLI, API, client libraries
  • scheduled: Cloud Scheduler
  • event-driven: Eventarc, including Pub/Sub-backed events
  • buffered: Cloud Tasks

Manual Execution
#

The obvious one is manual execution. Google’s execution docs cover running a workflow from the console, the CLI, direct API requests, and client libraries. That is the “I want to kick this off now” path. Useful for testing, useful for operators, and useful when another service wants to start an execution programmatically.

Scheduled Execution
#

Then there is the scheduled path. If what you really want is “run this every five minutes” or “run this every Monday at 9 AM,” the usual tool is Cloud Scheduler. Under the hood, Scheduler hits the Workflows executions API on a cron-like schedule. So yes, cron-style orchestration is absolutely a normal Workflows use case.

Event-Driven Execution
#

Then there is the event-driven path, which is where people usually start asking about Pub/Sub. Yes, Workflows can absolutely be event-driven, but the important detail is how. Workflows is not the event bus. It is the thing that gets invoked by the event-routing layer. The usual setup is Eventarc receiving an event, routing it, and using Workflows as the destination. Pub/Sub can be one of those event sources.

  • Eventarc can trigger a workflow from supported events or Pub/Sub messages
  • the event is considered delivered as soon as the workflow execution starts
  • if the workflow starts and later fails, Eventarc does not retry just because the workflow failed afterward
  • the deduplication window for exactly-once event processing is 24 hours
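When Eventarc routes a Pub/Sub message to a workflow, the event arrives as the workflow's argument, with the message payload base64-encoded inside it. A sketch of unpacking it, assuming the standard Pub/Sub event shape:

```yaml
main:
  params: [event]
  steps:
    - decodeMessage:
        assign:
          # Pub/Sub message data arrives base64-encoded inside the event.
          - payload: ${text.decode(base64.decode(event.data.message.data))}
    - done:
        return: ${payload}
```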

Queued Execution
#

There is also a queued or buffered path, which is useful when you expect bursts. Google’s Cloud Tasks integration docs show how to push workflow executions through a task queue instead of firing them directly. That gives you rate control, retry control, and a buffer when requests might otherwise exceed Workflows limits. The important caveat is that Cloud Tasks helps you reliably start the execution. It does not monitor the workflow to completion. If the workflow starts and later fails, that is now the workflow’s problem, not the queue’s.

How to Read a Workflow File
#

You can easily see the steps involved with the visualizer tool in the GCP console, but in my opinion that's not enough. The better way to read a workflow definition is to read the source and treat it like a control-flow document:

flowchart TD
    A[main] --> B[params]
    A --> C[steps]
    C --> D[explicit next]
    C --> E[implicit fallthrough]
    C --> F[control flow]
    C --> G[side effects]
    C --> H[reliability behavior]
    F --> I[switch / for]
    F --> J[parallel branches]
    G --> K[HTTP calls / connectors]
    H --> L[sleep / callbacks / polling]

1. Start with main
#

That is the entry point. Find the main block first and do not get distracted by anything else until you know:

  • what parameters enter the workflow
  • what the first side effect is
  • what the final return path looks like

2. Look for the control flow
#

Once you know where execution starts, find:

  • next
  • switch
  • for
  • parallel
  • subworkflows
  • raise
  • return

That tells you the shape of the process. Is it linear? Branching? Fan-out? Waiting on multiple paths? That is the skeleton.

The first thing I check here is whether the flow is explicit or implicit. In Workflows, a step can point to the next step with next, or it can simply fall through to the next step in order. If the file uses a lot of explicit next, that usually means the author wanted the control flow to be very deliberate. If it mostly relies on fallthrough, the workflow is probably meant to read top to bottom like a script.
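A sketch of both styles side by side:

```yaml
# Explicit: the switch jumps directly to a named step, and the step-level
# `next` catches the case where no condition matched.
- checkMode:
    switch:
      - condition: ${input.mode == "fast"}
        next: fastPath
    next: slowPath

- fastPath:
    return: "took the fast path"

# Implicit: without a `next`, execution falls through to the step below.
- slowPath:
    assign:
      - note: "took the slow path"
- finish:
    return: ${note}
```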

parallel deserves special attention. That is the point where the workflow stops being a straight line and starts doing more than one thing at once. When you see parallel, ask two questions immediately:

  • which branches are independent
  • what has to finish before execution can move on

That is usually where the real intent lives.
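A sketch of a parallel step with two independent branches writing into a shared variable (the URLs are placeholders). Both questions above have visible answers here: the branches share nothing except `results`, and execution only continues past the parallel step once both branches have finished:

```yaml
- initResults:
    assign:
      - results: {}

- fetchBoth:
    parallel:
      # Variables written inside branches must be declared shared.
      shared: [results]
      branches:
        - getUsers:
            steps:
              - callUsers:
                  call: http.get
                  args:
                    url: https://example.com/users
                  result: users_resp
              - saveUsers:
                  assign:
                    - results.users: ${users_resp.body}
        - getOrders:
            steps:
              - callOrders:
                  call: http.get
                  args:
                    url: https://example.com/orders
                  result: orders_resp
              - saveOrders:
                  assign:
                    - results.orders: ${orders_resp.body}
```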

3. Look for the side effects
#

Now find the steps that actually touch the outside world:

  • connector calls
  • HTTP calls
  • callback creation
  • polling
  • pub/sub publish

Those are the steps with operational consequences.

4. Look for reliability behavior
#

  • retries
  • backoff
  • sleeps
  • callbacks
  • timeouts

If a step has retries, that tells you the author expects transient failure. If a step sleeps or polls, that tells you the workflow is coordinating a slow external process. If a step creates a callback, that tells you the process is waiting for another actor to resume it.

A Small Example
#

Here is a deliberately small workflow that calls a Cloud Run service, waits, checks status, and branches based on the result:

main:
  params: [input]
  steps:
    - init:
        assign:
          - run_url: ${sys.get_env("RUN_URL")}
          - job_id: ${input.job_id}

    - startJob:
        call: http.post
        args:
          url: ${run_url + "/start"}
          auth:
            type: OIDC
          body:
            job_id: ${job_id}
        result: start_response

    - waitABit:
        call: sys.sleep
        args:
          seconds: 10

    - checkStatus:
        call: http.get
        args:
          url: ${run_url + "/status/" + job_id}
          auth:
            type: OIDC
        result: status_response

    - decide:
        switch:
          - condition: ${status_response.body.state == "DONE"}
            return: ${status_response.body}
          - condition: ${status_response.body.state == "FAILED"}
            raise: "Job failed"

    - notFinished:
        raise: "Job is still running"

In this scenario, the useful reading is:

  • input enters through job_id
  • the workflow kicks off work in Cloud Run
  • it waits without owning a worker
  • it checks state through another HTTP call
  • the workflow itself owns the branching logic

Retries, Waiting, and Callbacks
#

This is where Workflows starts showing its real value.

Retries
#

Retries are first-class behavior in Workflows. Connector calls also come with built-in retry behavior, and Google’s connector docs call out idempotent retry policy for GET and non-idempotent retry policy for other methods. That means reliability is not an afterthought you bolt on later. It is part of the workflow definition itself.
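For HTTP steps, retries are declared with a try/retry block. A sketch using the built-in default retry predicate (the URL is a placeholder):

```yaml
- fetchWithRetry:
    try:
      call: http.get
      args:
        url: https://example.com/flaky-endpoint
      result: resp
    retry:
      # Retry only on errors the default predicate considers transient
      # (429s and certain 5xx responses).
      predicate: ${http.default_retry_predicate}
      max_retries: 5
      backoff:
        initial_delay: 2
        max_delay: 60
        multiplier: 2
```

Keep the pricing section below in mind when tuning this: every retry attempt is another billed step execution.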

Polling and long-running operations
#

Connectors also know how to deal with long-running operations. The docs note that connectors can poll long-running operations periodically using exponential backoff, and each polling attempt counts as a billable step. That is useful, but not free.

So the tradeoff is straightforward:

  • more polling gives you lower latency
  • less polling gives you lower cost
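Connector polling can be tuned through connector_params. A sketch of backing off the poll rate on a hypothetical BigQuery job, trading latency for fewer billable polling steps (the project and query are placeholders):

```yaml
- runQuery:
    call: googleapis.bigquery.v2.jobs.insert
    args:
      projectId: my-project
      body:
        configuration:
          query:
            query: "SELECT 1"
            useLegacySql: false
      connector_params:
        # Allow up to 30 minutes for the long-running operation, and
        # back off polling aggressively since each poll is a billed step.
        timeout: 1800
        polling_policy:
          initial_delay: 10
          multiplier: 2
          max_delay: 120
    result: query_job
```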

Callbacks
#

Callbacks are one of the more interesting parts of the product. They let a workflow pause and wait for another service to make a request to a callback endpoint, which resumes the execution. Google positions this as a way to wait on an event without polling.

This is great for:

  • approval flows
  • external systems that need to resume the workflow
  • human-in-the-loop steps
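A sketch of the callback pattern: create an endpoint, hand its URL to an external system (the notify step and its URL are illustrative), then block until someone POSTs to it:

```yaml
- createCallback:
    call: events.create_callback_endpoint
    args:
      http_callback_method: "POST"
    result: callback_details

# Hand the callback URL to whoever needs to approve (hypothetical endpoint).
- notifyApprover:
    call: http.post
    args:
      url: https://example.com/request-approval
      body:
        callback_url: ${callback_details.url}

# Pause the execution, without polling, until the approval POST arrives.
- awaitApproval:
    call: events.await_callback
    args:
      callback: ${callback_details}
      timeout: 86400
    result: approval

- done:
    return: ${approval.http_request.body}
```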

Pricing and Limits
#

Workflows pricing is step-based, which is exactly the kind of thing that sounds simple until it is not. The main billing unit is step execution.

Google’s pricing page is explicit about a few things that matter:

  • successful steps count
  • failed steps count
  • retried steps count again
  • each retry attempt is another billed step execution

There is also a distinction between internal and external steps:

  • internal steps include calls to *.googleapis.com, sys.log, and polling by connectors for long-running operations
  • external steps include requests to external APIs and waits for HTTP callbacks

One surprisingly useful nuance from the pricing docs: if you call Google Cloud services through their normal Google-hosted domains like *.run.app, those calls are billed as internal steps, not external ones. Custom domains do not get that treatment.

So if you are using Workflows to orchestrate Cloud Run, the hostname choice can change the bill.

Does retry delay itself get billed?

Not as some separate “waiting fee,” no. The billable unit is step execution. The attempts cost money, and the backoff strategy changes cost indirectly by changing how many attempts happen and how fast they happen.

Limits worth knowing up front
#

There are a few limits worth knowing early so they do not surprise you later:

  • max execution duration: 1 year
  • execution retention after completion: 90 days
  • concurrent executions per region per project: 10,000
  • backlogged executions per region per project: 100,000
  • max steps in a single execution: 100,000
  • source code size: 128 KB
  • cumulative size for variables, arguments, and events: 512 KB

Those numbers are from the current quota docs, and they are enough to tell you what kind of product this is:

  • long-lived orchestration is fine
  • giant payloads are not
  • it can scale, but not infinitely and not without quota thinking

So When Should You Reach for It?
#

Workflows is good at:

  • orchestrating several Cloud Run services
  • stitching together Google Cloud APIs
  • waiting on slow external processes
  • handling approval or callback flows
  • making multi-step operational flows explicit and inspectable

In other words, it is good when the hard part is coordination.

However, it is a poor fit for:

  • CPU-heavy work
  • large data processing
  • high-frequency per-request logic that belongs inside an application
  • giant branching application logic that should just live in code

If your first instinct is “can I shove the whole application into Workflows?” the answer is probably no. That is not a failure of the product. That is just using the wrong tool.

Key Takeaway
#

Reach for Workflows when you have a multi-step process that crosses service boundaries, has visible state transitions, and would become annoying to coordinate cleanly in application code.

Do not reach for it just because you want everything to be declarative. That is how you end up with orchestration YAML pretending to be an application framework, and nobody deserves that.

Further Reading
#

Workflows overview
Official overview of Workflows, including orchestration scope, state, retries, polling, and callbacks.
https://docs.cloud.google.com/workflows/docs/overview

Understand connectors
Official connector docs covering formatting, IAM auth, retries, and long-running operations.
https://cloud.google.com/workflows/docs/connectors

Syntax overview
Official syntax reference for workflow structure, steps, control flow, retries, and expressions.
https://docs.cloud.google.com/workflows/docs/reference/syntax

Trigger a workflow with events or Pub/Sub messages
Official docs for Eventarc and Pub/Sub-triggered Workflows, including delivery and dedup behavior.
https://docs.cloud.google.com/workflows/docs/trigger-workflow-eventarc

Wait using callbacks
Official callback docs for pausing and resuming workflow executions without polling.
https://docs.cloud.google.com/workflows/docs/creating-callback-endpoints

Workflows pricing
Official pricing page for internal steps, external steps, retries, callbacks, and cost optimization.
https://cloud.google.com/workflows/pricing

Quotas and limits
Official quotas and limits for execution duration, retention, concurrency, data size, and workflow size.
https://docs.cloud.google.com/workflows/quotas

Closing Thoughts
#

Workflows is easy to underestimate because it looks simple. But that simplicity is the product. It is there to make process logic explicit, inspectable, and less tangled than the alternative.

If Cloud Run is where your service code lives, Workflows is one of the places where the coordination logic can finally stop pretending to be app code. That alone makes it worth understanding.

Jevin Laudo
Backend engineer passionate about scalable systems, tech, and sharing what I learn.