AIP-236

Policy preview

A policy is a resource that provides rules that admit or deny access to other resources. Generally, the outcome of a policy can be evaluated to a specific set of outcomes.

Changes to policies without proper validation may have unintended consequences that can severely impact a customer’s overall infrastructure setup. To safely update resources, it is beneficial to test these changes via policy rollout APIs.

Preview is a rollout safety mechanism for policy resources, which gives the customer the ability to validate the effect of their proposed changes against production traffic prior to the changes going live. The result of the policy evaluation against traffic is logged in order to give the customer the data required to test the correctness of the change.

Firewall policies exemplify a case that is suitable for previewing. A new configuration can be evaluated against traffic to observe which IPs would be allowed or denied. This gives the customer the data to guide a decision on whether to promote the proposed changes to live.

The expected flow for previewing a policy is as follows:

  1. The user creates an experiment containing a new policy configuration intended to replace the live policy.
  2. The user uses the "startPreview" method to start generating logs which compare the live and experiment policy evaluations against live traffic.
  3. The user inspects the logs to determine whether the experiment has the intended result.
  4. The user uses the "commit" method to promote the experiment to live.

Guidance

Non-goals

This proposal is for a safety mechanism for policy rollouts only. Safe rollouts for non-policy resources are not in scope.

Experiments

A new configuration of a policy to be previewed is stored as a nested collection under the policy. These nested collections are known as experiments.

A hypothetical policy resource called, Policy, is used throughout. It has the following resource name pattern:

projects/{project}/locations/{location}/policies/{policy}

The experimental versions of the resource used for previewing or other safe rollout practices are represented as a nested collection under Policy using a new resource type. The resource type must follow the naming convention RegularResourceTypeExperiment.

The following pattern is used for the experiment collection:

projects/{project}/locations/{location}/policies/{policy}/experiments/{experiment}

A proto used to represent an experiment must contain the following:

1. The required top-level fields for a resource, like `name` and `etag`
2. The policy message that is being tested itself
3. The field, `preview_metadata`, which contains metadata specific to
   previewing the experiment of a specific resource type.
message PolicyExperiment {

  // google.api.resource, name, and other annotations and fields

  // The policy experiment. This Policy will be used to preview the effects of
  // the change but will not affect live traffic.
  Policy policy = 2;

  // The metadata associated with this policy experiment.
  PolicyPreviewMetadata preview_metadata = 3
      [(google.api.field_behavior) = OUTPUT_ONLY];

  // Allows clients to store small amounts of arbitrary data.
  map<string, string> annotations = 4;
}
  • The experiment proto must have a top-level field with the same type as the live policy.
    • It must be named as the live resource type. For example, if the experiment is for FirewallPolicy, then this field must be named firewall_policy.
    • The name inside the embedded policy message must be the name of the live policy.
  • When the user is ready to promote an experiment, they must copy the policy message into the live policy and delete the experiment. This can be done manually or via a "commit" custom method.
  • A product may support multiple experiments concurrently being previewed for a single live policy.
    • Each experiment must generate logs having each entry preceded by log_prefix so that the user can compare the results of the experiment with the behavior of the live policy.
    • The number of experimental configurations for a given live policy may be capped at a certain number and the cap must be documented.
  • Cascading deletes must occur: if the live policy is deleted, all experiments must also be deleted.
  • map<string,string> annotations must allow clients to store small amounts of arbitrary data.

Metadata

preview_metadata tracks all metadata of previewing the experiment. The messages must follow the convention: RegularResourceTypePreviewMetadata. This is so the proto can be defined uniquely for each resource type in the same service with experiments.

message PolicyPreviewMetadata {
  // Possible values of the state of previewing the experiment.
  enum State {
    // Default value. This value is unused.
    STATE_UNDEFINED = 0;

    // The experiment is actively previewing.
    ACTIVE = 1;

    // The previewing of the experiment has been stopped.
    SUSPENDED = 2;
  }

  // The state of previewing the experiment.
  State state = 1;

  // An identifying string common to all logs generated when previewing the
  // experiment. Searching all logs for this string will isolate the results.
  string log_prefix = 2;

  // The most recent time at which this experiment started previewing.
  google.protobuf.Timestamp start_time = 3;

  // The most recent time at which this experiment stopped previewing.
  google.protobuf.Timestamp stop_time = 4;
}
  • PolicyPreviewMetadata must have the fields defined in the proto above.
    • It may have additional fields if the service or resource requires it.
  • When an experiment is first previewed, preview_metadata must be absent.
    • It is present on the experiment once the "startPreview" method is used.
  • All preview_metadata fields must be output only.
  • state changes between ACTIVE and SUSPENDED when previewing is started or stopped. This happens when the "startPreview" or "stopPreview custom methods are invoked, respectively.
  • The first time the "startPreview" custom method is used, the system must create preview_metadata and do the following:
    • It must set the state to ACTIVE
    • It must populate start_time with the current time.
      • start_time must be updated every time state is changed to ACTIVE.
    • It must set a system generated log_prefix string, which is a predefined constant hard coded by the system developers.
    • The same value is used for previewing experiments for the given resource type. For example, "FirewallPolicyPreviewLog" for FirewallPolicy.
  • When the "stopPreview" custom method is used, the system must do the following:
    • It must set the state to SUSPENDED
    • It must populate the stop_time with the current time.

Methods

create

  • The resource must be created using long-running Create and google.longrunning.operation_info.response_type must be PolicyExperiment.
  • Creating a new experiment to preview must support the following use cases:
    • Preview a new policy.
    • Preview an update to an already live policy.
    • Preview a deletion of a current policy.
  • For the update and delete use cases, the policy field in the experiment must have the full payload of the live policy copied into it, including the name.
    • The user must set the rules to the new intended state to preview an update.
    • The user must set set the rules to represent a no-op to preview a delete.
  • To preview a new policy, the system must do the following:
    • If the system does not support a nested collection without a live policy, the user must create a live policy and set the rules to represent a no-op. For example, the rules of a no-op policy may be empty.
      • An experiment is created as a child of the no-op policy.
  • If the system supports previewing multiple experiments for a live policy, calling "create" more than once must create multiple experiments.

update

  • The resource must be updated using long-running Update and google.longrunning.operation_info.response_type must be PolicyExperiment.
  • The name inside policy must not change but the other fields can in order to change the experiment being previewed because this policy is intended to replace the live policy, and the name of the live policy must not change.
  • The system must set the state to SUSPENDED if the state was ACTIVE at the time of an update.
    • This is so the user can easily distinguish between different versions of the experiment being previewed.

get

  • The standard method, Get, must be included for PolicyExperiment resource types.

list

  • The standard method, List, must be included for PolicyExperiment resource types.
  • Filtering on PolicyPreviewMetadata indicates which experiments are actively previewed.
    • For example, the following filter string returns a List response with experiments being previewed: preview_metadata.state = ACTIVE.

delete

  • The resource must be deleted using long-running Delete and google.longrunning.operation_info.response_type must be PolicyExperiment.

startPreview

// Starts previewing a PolicyExperiment. This triggers the system to start
// generating logs to evaluate the PolicyExperiment.
rpc StartPreviewPolicyExperiment(StartPreviewPolicyExperimentRequest)
    returns (google.longrunning.Operation) {
  option (google.api.http) = {
    post: "/v1/{name=policies/*/experiments/*}:startPreview"
    body: "*"
  };
  option (google.longrunning.operation_info) = {
    response_type: "PolicyExperiment"
    metadata_type: "StartPreviewPolicyExperimentMetadata"
  };
}

// The request message for the startPreview custom method.
message StartPreviewPolicyExperimentRequest {
  // The name of the PolicyExperiment.
  string name = 1;
}
  • This custom method is required.
  • google.longrunning.Operation.metadata_type must follow guidance on Long-running operations
  • This method must trigger the system to start generating logs to preview the experiment.
  • Whenever the method is called successfully, the system must set the following values in the PolicyPreviewMetadata:
    • log_prefix to the predefined constant.
    • start_time to the current time
    • state to ACTIVE.
  • If the method is called on an experiment with the rules representing a no-op, then the system must preview the deletion of the live policy.

stopPreview

// Stops previewing a PolicyExperiment. This triggers the system to stop
// generating logs to evaluate the PolicyExperiment.
rpc StopPreviewPolicyExperiment(StopPreviewPolicyExperimentRequest)
    returns (google.longrunning.Operation) {
  option (google.api.http) = {
    post: "/v1/{name=policies/*/experiments/*}:stopPreview"
    body: "*"
  };
  option (google.longrunning.operation_info) = {
    response_type: "PolicyExperiment"
    metadata_type: "StopPreviewPolicyExperimentMetadata"
  };
}

// The request message for the stopPreview custom method.
message StopPreviewPolicyExperimentRequest {
  // The name of the PolicyExperiment.
  string name = 1;
}
  • This custom method is required.
  • google.longrunning.Operation.metadata_type must follow guidance on Long-running operations
  • This method must trigger the system to stop generating logs to preview the experiment.
  • Whenever the method is called successfully, the system must set the following values in the PolicyPreviewMetadata:
    • stop_time to the current time
    • state to SUSPENDED

commit

The resource may expose a new custom method called "commit" to promote an experiment. The system copies policy from the experiment into the live policy and then deletes the experiment.

Declarative clients may manually copy fields from an experiment into the live policy and then delete the experiment rather than calling "commit" if preferable.

// Commits a PolicyExperiment. This copies the PolicyExperiment's policy message
// to the live policy then deletes the PolicyExperiment.
rpc CommitPolicyExperiment(CommitPolicyExperimentRequest)
    returns (google.longrunning.Operation) {
  option (google.api.http) = {
    post: "/v1/{name=policies/*/experiments/*}:commit"
    body: "*"
  };
  option (google.longrunning.operation_info) = {
    response_type: "google.protobuf.Empty"
    metadata_type: "CommitPolicyExperimentMetadata"
  };
}

// The request message for the commit custom method.
message CommitPolicyExperimentRequest {
  string name = 1;
  string etag = 2;
  string parent_etag = 3;
}
  • google.longrunning.Operation.metadata_type must follow guidance on Long-running operations
  • The method must atomically copy policy from the experiment into the live policy, and then delete the experiment.
  • If any experiment fails "commit", previewing it must not stop, and the live policy must not be updated.
  • The method can be called on an experiment in any state.
  • The etag must match that of the experiment in order for commit to be successful. This is so the user does not commit an unintended version of the experiment.
    • If no etag is provided, the API must not succeed to prevent the user from unintentionally committing a different version of the experiment as intended.
    • A parent_etag may be provided to guarantee that the experiment overwrites a specific version of the live policy.
  • The method is not idempotent and calling it twice on the same experiment must return a 404 NOT_FOUND as the experiment is deleted as part of the first call.

Changes to live policy API methods

delete

  • A delete of the live policy must delete all experiments.
  • To maintain the experiments while negating the effect of the live policy, the live policy must be changed to a no-op policy instead of using this method.

Logging

Logging is crucial for the user to evaluate whether an experiment should be promoted to live.

Logs must contain the results of the evaluated experiment, the etag associated with that experiment alongside that of the live policy, and be preceded by the value of log_prefix. - The etag fields help the user identify which configurations of the live and experiment are evaluated in the log. - log_prefix helps the user separate logs specifically generated for previewing the experiment from other use cases.

Overall, these logs help the user make a decision about whether to promote the experiment to live.

Changelog

  • 2023-04-27: Methods for start and stop renamed. State to enum. Annotations added.
  • 2023-03-30: Initial AIP written.