AIP-151

Long-running operations

Occasionally, an API may need to expose a method that takes a significant amount of time to complete. In these situations, it is often a poor user experience to simply block while the task runs; rather, it is better to return some kind of promise to the user and allow the user to check back in later.

The long-running operations pattern is roughly analogous to a Python Future or a Node.js Promise. Essentially, the user is given a token that can be used to track progress and retrieve the result.

Guidance

Individual API methods that might take a significant amount of time to complete should return a google.longrunning.Operation object instead of the ultimate response message.

// Create a book.
rpc CreateBook(CreateBookRequest) returns (google.longrunning.Operation) {
  option (google.api.http) = {
    post: "/v1/{parent=publishers/*}/books"
    body: "book"
  };
  option (google.longrunning.operation_info) = {
    response_type: "Book"
    metadata_type: "OperationMetadata"
  };
}

  • The response type must be google.longrunning.Operation. The Operation proto definition must not be copied into individual APIs.
    • The response must not be a streaming response.
  • The method must include a google.longrunning.operation_info annotation, which must define both response and metadata types.
    • The response and metadata types must be defined in the file where the RPC appears, or a file imported by that file.
    • If the response and metadata types are defined in another package, the fully-qualified message name must be used.
    • The response type should not be google.protobuf.Empty (except for Delete methods), unless it is certain that response data will never be needed. If response data might be added in the future, define an empty message for the RPC response and use that.
    • The metadata type is used to provide information such as progress, partial failures, and similar information on each GetOperation call. The metadata type should not be google.protobuf.Empty, unless it is certain that metadata will never be needed. If metadata might be added in the future, define an empty message for the RPC metadata and use that.
  • APIs with methods that return Operation must implement the Operations service. Individual APIs must not define their own interfaces for long-running operations, so that long-running operations behave uniformly across APIs.
  • If an RPC supports a validate-only mode, the response to a validation request must be one of the following:
    • A successful response with an Operation which is already complete, with the done field set to true, and a valid (but potentially empty) response message in the response field, wrapped in a google.protobuf.Any message. The name field may be empty, to avoid the service having to maintain state for successful validation.
    • An immediate error response (typically "bad request").
    • An Operation with the done field set to false, to indicate long-running validation. In this case, the name field must be set, to allow clients to poll the long-running validation operation until it has completed. Successful validation must eventually be represented by an operation with done=true and a valid (but potentially empty) wrapped response message in the response field. Unsuccessful validation must eventually be represented by an operation with done=true and the error details provided in the error field.
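
The validate-only outcomes above that arrive as an Operation can be told apart by a client roughly as follows. This is a sketch under stated assumptions: the Operation dataclass is a simplified stand-in for google.longrunning.Operation, classify_validation_result is a hypothetical helper, and the immediate error response is not modeled because it arrives as an ordinary RPC error rather than as an Operation.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Operation:
    """Simplified stand-in for google.longrunning.Operation (illustration only)."""
    name: str = ""                  # may be empty for an already-complete validation
    done: bool = False
    response: Optional[Any] = None  # valid, but potentially empty, on success
    error: Optional[str] = None     # error details on failed validation


def classify_validation_result(op: Operation) -> str:
    """Return "valid", "invalid", or "pending" for a validate-only response."""
    if not op.done:
        # Long-running validation: the client needs a name to poll.
        if not op.name:
            raise ValueError("a pending validation operation must have a name")
        return "pending"
    if op.error is not None:
        return "invalid"
    return "valid"


immediate = classify_validation_result(Operation(done=True, response={}))
pending = classify_validation_result(Operation(name="operations/validate-1"))
failed = classify_validation_result(
    Operation(name="operations/validate-1", done=True, error="title is required")
)
```

Because a completed validation is just an Operation with done set to true, the same code path that handles ordinary operations also handles validation results.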

Note: User expectations can vary on what is considered "a significant amount of time" depending on what work is being done. A good rule of thumb is 10 seconds.

Standard methods

APIs may return an Operation from the Create, Update, or Delete standard methods if appropriate. In this case, the response type in the operation_info annotation must be the standard and expected response type for that standard method.

When creating or deleting a resource with a long-running operation, the resource should be included in List and Get calls; however, the resource should indicate that it is not usable, generally with a state enum.
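
A minimal sketch of the state enum idea, in Python. The State values, the Book dataclass, and is_usable are hypothetical and exist only for illustration; a real API would define the enum on its own resource message.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Hypothetical lifecycle states for a resource managed by operations."""
    STATE_UNSPECIFIED = 0
    CREATING = 1
    ACTIVE = 2
    DELETING = 3


@dataclass
class Book:
    name: str
    state: State = State.STATE_UNSPECIFIED


def is_usable(book: Book) -> bool:
    """Only fully created, not-yet-deleted resources are usable."""
    return book.state is State.ACTIVE


books = [
    Book("publishers/123/books/a", State.ACTIVE),
    Book("publishers/123/books/b", State.CREATING),
    Book("publishers/123/books/c", State.DELETING),
]

# List returns every book, including those still being created or deleted.
listed = [b.name for b in books]
# Callers inspect the state to decide which ones they can actually use.
usable = [b.name for b in books if is_usable(b)]
```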

Parallel operations

A resource may accept multiple operations that will work on it in parallel, but is not obligated to do so:

  • Resources that accept multiple parallel operations may place them in a queue rather than work on the operations simultaneously.
  • Resources that do not permit multiple operations in parallel (denying any new operation until the one that is in progress finishes) must return ABORTED if a user attempts a parallel operation, and include an error message explaining the situation.
  • Resources with declarative-friendly APIs may allow subsequent updates to preempt existing operations. In this case, the latest update begins processing and previous operations are marked as ABORTED with an error message explaining the situation.
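
Two of the policies above, denying parallel operations and preempting earlier ones, can be sketched server-side as follows. This is an in-memory illustration only: AbortedError stands in for an RPC error with status code ABORTED, and ExclusiveResource, PreemptingResource, and the operation names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class AbortedError(Exception):
    """Stand-in for an RPC error with status code ABORTED."""


@dataclass
class Operation:
    """Simplified stand-in for google.longrunning.Operation (illustration only)."""
    name: str
    done: bool = False
    error: Optional[str] = None


class ExclusiveResource:
    """Denies a new operation while another one is still in progress."""

    def __init__(self) -> None:
        self.current: Optional[Operation] = None

    def start(self, name: str) -> Operation:
        if self.current is not None and not self.current.done:
            raise AbortedError(
                f"{self.current.name} is still in progress; retry after it completes"
            )
        self.current = Operation(name)
        return self.current


class PreemptingResource:
    """Declarative-friendly: the latest update preempts the previous one."""

    def __init__(self) -> None:
        self.current: Optional[Operation] = None

    def start(self, name: str) -> Operation:
        previous = self.current
        if previous is not None and not previous.done:
            # Mark the superseded operation as finished with an ABORTED error.
            previous.done = True
            previous.error = f"ABORTED: preempted by {name}"
        self.current = Operation(name)
        return self.current
```

In both cases the error message names the conflicting operation, so the user can tell why their request did not run to completion.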

Expiration

APIs may allow their operation resources to expire once sufficient time has elapsed since the operation completed.

Note: A good rule of thumb for operation expiry is 30 days.
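
A small sketch of how a service might apply that rule of thumb, in Python. OPERATION_TTL and is_expired are hypothetical names, and the sketch assumes the service records a completion timestamp for each operation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed retention period, following the 30-day rule of thumb.
OPERATION_TTL = timedelta(days=30)


def is_expired(
    completed_at: Optional[datetime],
    now: datetime,
    ttl: timedelta = OPERATION_TTL,
) -> bool:
    """An operation can only expire after it has completed."""
    if completed_at is None:
        return False  # still running: never expire
    return now - completed_at >= ttl


now = datetime(2024, 5, 31, tzinfo=timezone.utc)
old = is_expired(datetime(2024, 5, 1, tzinfo=timezone.utc), now)
recent = is_expired(datetime(2024, 5, 2, tzinfo=timezone.utc), now)
running = is_expired(None, now)
```

The clock starts at completion, not at creation, so an operation that runs for a long time is not at risk of expiring before its result can be read.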

Errors

Errors that prevent a long-running operation from starting must result in an error response (AIP-193), as with any other method.

Errors that occur over the course of an operation may be placed in the metadata message. The errors themselves must still be represented with a google.rpc.Status object.
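
One way to picture errors carried in metadata is the sketch below. The Status dataclass is a simplified stand-in for google.rpc.Status, and ImportBooksMetadata, its fields, and import_books are hypothetical names used only to show the shape; the numeric value 5 is the NOT_FOUND code from google.rpc.Code.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

NOT_FOUND = 5  # google.rpc.Code.NOT_FOUND


@dataclass
class Status:
    """Simplified stand-in for google.rpc.Status (illustration only)."""
    code: int
    message: str


@dataclass
class ImportBooksMetadata:
    """Hypothetical metadata message for a bulk import operation."""
    progress_percent: int = 0
    partial_failures: List[Status] = field(default_factory=list)


def import_books(titles: Iterable[str], known_titles: Set[str]) -> ImportBooksMetadata:
    """Process every title, recording failures instead of stopping at the first."""
    titles = list(titles)
    metadata = ImportBooksMetadata()
    for index, title in enumerate(titles, start=1):
        if title not in known_titles:
            metadata.partial_failures.append(
                Status(NOT_FOUND, f"book {title!r} was not found")
            )
        metadata.progress_percent = 100 * index // len(titles)
    return metadata


metadata = import_books(["a", "b", "c", "d"], {"a", "c"})
```

Each failure keeps its own status code and message, so a client reading the metadata on GetOperation can report exactly which items failed while the operation as a whole continues.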

Backwards compatibility

Changing either the response_type or metadata_type of a long-running operation is a breaking change.

Rationale

Validate-only behavior

The guidance for validate-only responses comes from a tension between clients, which benefit from "fully formed" operations that can be treated uniformly, and servers, which don't wish to maintain additional state for trivial operations. It seems counterintuitive that just validating a request should generate more state, but a full operation response that can be fetched later would either require that or "special" singleton operation IDs. The guidance provided is a compromise: by returning a "done" operation, clients can use existing logic to check that the operation has completed successfully (and therefore doesn't need to be fetched for an updated status), but servers don't need to maintain any additional state.

Changelog

  • 2024-04-23: Provided pattern for validation on RPCs returning long-running operations.
  • 2022-05-31: Added compatibility section.
  • 2020-08-24: Clarified that responses are not streaming responses.
  • 2020-06-24: Added guidance for parallel operations.
  • 2020-03-20: Clarified that both response_type and metadata_type are required.
  • 2019-11-22: Added a short explanation of what metadata_type is for.
  • 2019-09-23: Added guidance on errors.
  • 2019-08-23: Added guidance about fully-qualified message names when the message name is in another package.
  • 2019-08-01: Changed the examples from "shelves" to "publishers", to present a better example of resource ownership.