Welcome

Axess is a library for authenticating users (and non-human callers) in Axum applications. Most of what it offers is not unusual: a session layer, factor verification, cookies, policy evaluation. What makes it interesting is what those pieces refuse to do, and what the refusals add up to.

The first refusal is the most consequential. A session in axess is never "half logged in". The authentication state is a typed enum with five variants, and a partially-completed login is one of those variants (Authenticating) rather than an Authenticated session with a missing flag. Handlers that receive a session cannot mistake a one-factor login for a finished one because the types do not permit it. Code reviewers reading a handler do not need to model which fields are populated when; the variant they match on tells them what is in scope and what is not.
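To make the idea concrete, here is a minimal sketch of a five-variant state enum. The variant names come from this book; the payload fields and the authenticated_user helper are assumptions made for illustration, not the library's actual definitions.

```rust
/// Hypothetical stand-in for the session state enum. Variant names follow
/// the book; every field below is invented for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AuthState {
    Guest,
    Identifying { identifier: String },
    Authenticating { user_id: u64, factors_remaining: u8 },
    PendingWorkflow { user_id: u64, workflow: String },
    Authenticated { user_id: u64 },
}

/// Returns the user id only for a finished login. A partial login is a
/// different variant, so it cannot be mistaken for a complete one.
pub fn authenticated_user(state: &AuthState) -> Option<u64> {
    match state {
        AuthState::Authenticated { user_id } => Some(*user_id),
        // Listing the remaining variants explicitly means a new variant
        // added later forces this match to be revisited.
        AuthState::Guest
        | AuthState::Identifying { .. }
        | AuthState::Authenticating { .. }
        | AuthState::PendingWorkflow { .. } => None,
    }
}
```

The point of the sketch is the exhaustive match: there is no "authenticated but second factor pending" combination to forget, because that combination is not representable.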

The second refusal concerns time and entropy. Production authentication code is full of both: session identifiers come from the operating system's random source, one-time-password windows are measured against the wall clock, account lockout clears on a schedule. None of that goes through the operating system directly in axess. Every wall-clock read, every random byte, is sourced from a Clock or SecureRng trait whose production implementation delegates to the system and whose test implementation reproduces a controlled sequence. The same login flow, including all its side effects on the session registry and the audit log, runs in a unit test without infrastructure and without flakes. The discipline is called deterministic simulation testing, and it is the reason a race condition between token issuance and use can become a failing test rather than a postmortem.

The third refusal is to ship scattered if user.role == "admin" checks as the authorisation story. Cedar Policy, the policy language axess uses for authorisation, is declarative, schema-validated, deny-by-default, and one language for what most codebases split between role checks, ownership checks, and contextual rules. Policies live in policy files, not in handlers. The Rust code asks the question and receives an AuthzDecision. Reviewing the policy set is then a single artifact review, not a hunt across handlers.

Most of the rest of axess follows from these three decisions, in combination with one structural choice: the library is split across ten small crates so adopters who do not need (say) a given federation adapter do not compile its dependencies. The split is also the verifier-versus-orchestrator boundary in code. Per-credential algorithms live on one side (axess-factors); the state machine, composition, and federation machinery live on the other (axess-core). The split is the most important line in the workspace, and it is described in detail in the next chapter.

What axess is not

It helps to know the boundaries. Axess is not a SaaS, has no hosted control plane, and does not own your user database. It is a library your application depends on, and your application keeps owning its data. It is not an Identity Provider in its primary use. In OAuth/OIDC terms axess is the Relying Party (the application that delegates identity to an external IdP and runs a session on the resulting tokens), not the OpenID Provider (the IdP itself, with login UI, consent screens, and user database). For the OP role, point axess at Keycloak, Ory Hydra, Okta, Azure AD, or whatever SSO your organisation already runs. The local-idp feature does mint workload JWTs in-process, but that is on-host service-to-service issuance, not a user-facing OP; Local IdP covers the surface. It is not an HTTP server. Axum is the HTTP server; axess plugs into it as a Tower layer plus a set of extractors, and your code is what owns the lifecycle. It is not a general-purpose session library. The session machinery is in service of the authentication state machine, not the other way around; if all you need is HTTP sessions without authentication or authorisation, smaller libraries do that better.

The workspace, in one table

The crate split is structural, and the table below is a fair approximation of which one you reach for in any given situation. The chapter Architecture at a glance expands on the dependency direction and the rules that keep leaf crates from depending on the orchestrator.

| Crate | Role |
| --- | --- |
| axess | Facade. Re-exports the public API. Application code depends on this. |
| axess-core | Session state machine, AuthnService, AuthzStore, federation adapters (OAuth, OIDC, LDAP, mTLS, FIDO2, JWT, K8s SA, GitHub OIDC), device identity, OBO/delegated access, middleware, storage backends. The orchestrator. |
| axess-factors | Per-credential verifier primitives: Argon2id, TOTP, HOTP. Composable on their own. |
| axess-identity | Typed IDs (UserId, TenantId, WorkloadId) and the Principal { Human, Workload } enum. |
| axess-events | Audit event payloads and async sinks. |
| axess-cache | TTL+LRU cache with single-flight. Used by the Cedar entity cache and the OIDC JWKS cache. |
| axess-clock | Clock trait, SystemClock, MockClock. The DST time foundation. |
| axess-rng | SecureRng trait, SystemRng, MockRng. The DST entropy foundation. |
| axess-strings | Shared string newtypes (Arc<str> interning). |
| axess-macros | require_authn!, require_partial_authn!, require_authz! procedural macros. |

When to reach for axess

Axess fits when at least two of the following are true. Multi-factor authentication that varies per user or per tenant is the most common driver, because composing factors and threading their result through a typed state machine is the value over a single-factor session library. Policy-driven authorisation in one language, across roles, relationships, and contextual conditions, is the second. Multi-tenancy is the third, since axess scopes factors, methods, and policies at three tiers (Global, Tenant, User) by default. Device identity, workload identity, and delegated access are the fourth, fifth, and sixth, in the order most adopters need them. Regulated industries land here for the audit pipeline and FAPI 2.0 conformance work, because the trail axess emits is already shaped as evidence.

It does not fit when a single-factor session is all you need. It does not fit when you want a hosted IdP. It does not fit when your protocol is not HTTP, because the state machine is shaped against Axum extractors and middleware. Each of these has better answers elsewhere.

If you are evaluating axess, the next chapter is the one to read. Architecture at a glance covers the verifier-versus-orchestrator line, the dependency direction, the three independent state slices that make up a request, and the DST mechanics that ride underneath. Twenty minutes there will save an hour in every chapter after.

If you are starting an integration, jump to Getting started. It walks through a minimal Axum application end-to-end and points at examples/sqlite/ for the production-shaped version with a real database, encrypted sessions at rest, two-factor login, rate limiting, health checks, and metrics.

If you have inherited an existing axess integration, the left-hand navigation is grouped by concern. Parts II and V (Authentication and Sessions) carry most of the day-to-day surface; the rest reads as reference and can wait until you need it.

If you are responsible for the production deployment, read Security posture and Operations runbook before launch. The defaults in code are conservative for development; production has specific knobs that must be set explicitly, and both chapters name them.

Status

Axess is at v0.2.0, pre-publication. The API is stabilising, and the crates.io publish is named as the next milestone. The breaking changes accumulated against the previous stable line are catalogued in Migration guide. Until the first crates.io release, minor versions may break source compatibility; the goal post-publish is to maintain the SemVer discipline that Rust libraries are held to elsewhere.

Vulnerability reports go through the private channel described in SECURITY.md. Please do not file security issues on the public GitHub tracker.

Architecture at a glance

This chapter describes the shape of the axess workspace: which crate owns what, how the pieces compose, what stays put under which kind of change, and where adopters plug in. The goal is to make the rest of the book pre-cached. Once you have the four architectural decisions below in mind (the verifier-versus-orchestrator line, the three state slices, the DST foundation, and the naming conventions), every later chapter slots into place without further explanation.

If you are evaluating axess, read this chapter end-to-end. If you are already mid-integration, you can skim and come back when something feels surprising.

Workspace shape

Axess is ten library crates plus a set of example applications. The split is not cosmetic. It enforces a structural invariant (leaf crates do not depend on the orchestrator), it gates compile cost for features adopters do not use, and it makes the verifier-versus-orchestrator line explicit in the dependency graph.

flowchart TD
  facade["axess<br/><i>facade</i>"]
  core["axess-core<br/><i>orchestrator</i>"]
  factors["axess-factors<br/><i>verifiers</i>"]
  macros["axess-macros<br/><i>guard macros</i>"]
  identity["axess-identity<br/><i>typed IDs</i>"]
  events["axess-events<br/><i>audit payloads</i>"]
  cache["axess-cache<br/><i>TTL cache</i>"]
  clock["axess-clock<br/><i>Clock trait</i>"]
  rng["axess-rng<br/><i>SecureRng trait</i>"]
  strings["axess-strings<br/><i>Arc&lt;str&gt;</i>"]

  facade --> core
  facade --> factors
  facade --> macros

  core --> factors
  core --> identity
  core --> events
  core --> cache
  core --> clock
  core --> rng
  core --> strings

  factors --> identity
  factors --> clock
  factors --> rng

  cache --> clock
  events --> identity

The axess crate is a thin facade that re-exports the curated public API from axess-core and axess-factors. Application code depends on this crate and only this crate. The internal split is free to reorganise without breaking adopters, provided the types surfaced at the facade level stay compatible.

axess-core is the orchestrator. It owns the session state machine, AuthnService, AuthzStore, the Axum middleware stack (CSRF, rate limit, request id, trace id), session storage backends, the device-identity ladder, the workload identity resolvers, and the audit dispatch. If a type drives a transition or owns persistent state, it lives here.

axess-factors holds the per-credential verifiers. The list is long because the credential surface authentication actually has is long: Argon2id, TOTP, HOTP, email OTP, FIDO2, LDAP bind, mTLS, OAuth and OIDC (with discovery, JWKS cache, and logout-token claim validation), JWT validation, federation adapters for Kubernetes service accounts and GitHub Actions and generic OAuth resource servers, a bearer-token extractor, an outbound OAuth client, and the PKCE helpers. The crate is composable on its own and is the obvious extension point when you need a custom factor: implement the verifier trait, register it with the service, the rest stays the same.

Everything else in the workspace is a leaf. Each leaf crate owns one concept (typed IDs, TTL cache, the Clock trait) and depends, at most, on other leaves. The structural invariant under review is straightforward: no leaf crate may depend on axess-core. Because axess-core itself depends on the leaves it uses, flipping any of those to depend on the orchestrator would create a dependency cycle; such a change is rejected at review.

The verifier-versus-orchestrator line

The most important line in the workspace runs between axess-factors and axess-core. Per-credential algorithms and their data shapes live on the verifier side. The sum types and the composition machinery that combine them live on the orchestrator side.

This is concrete. The Fido2Config struct, the Fido2Verifier trait, and the WebAuthn ceremony itself live in axess-factors. The FactorKind::Fido2 variant, the FactorConfig::Fido2(Fido2Config) wrapping, and the FactorStep::factor(FactorKind::Fido2) composition helper live in axess-core. The same pattern applies to LDAP, to OAuth, to every factor: the algorithm and its config are verifier-side, the enum variant and the composition are orchestrator-side.

The reason for the line is the kind of change each side absorbs. The verifier is the thing you might want to swap (an alternative WebAuthn library, a custom OTP scheme, an LDAP binding that reads from a sidecar rather than directly). The orchestrator is the thing you do not swap (the state machine, the audit dispatch, the storage interface) but do want to extend (add a factor, add a workflow, add a backend). Keeping the two in separate crates makes the swap and the extension into independent operations. A change in axess-factors does not invalidate orchestrator code; a change in axess-core does not touch the verifier crates.

The line also shows up in the dependency direction. axess-core depends on axess-factors, never the reverse.
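The placement described in this section can be sketched with two modules standing in for the two crates. This is an illustration under assumptions: the type names Fido2Config, Fido2Verifier, FactorKind, and FactorConfig come from the text, while the module names, the rp_id field, the verify signature, and the kind helper are hypothetical.

```rust
/// Verifier side (stands in for axess-factors): the algorithm's config and trait.
pub mod verifier {
    /// The `rp_id` field is a hypothetical example field.
    #[derive(Debug, Clone, PartialEq)]
    pub struct Fido2Config {
        pub rp_id: String,
    }

    /// Hypothetical signature; the real trait drives a WebAuthn ceremony.
    pub trait Fido2Verifier {
        fn verify(&self, config: &Fido2Config, assertion: &[u8]) -> bool;
    }
}

/// Orchestrator side (stands in for axess-core): sum types and composition.
pub mod orchestrator {
    // The import points from orchestrator to verifier, never the reverse.
    use super::verifier::Fido2Config;

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum FactorKind {
        Password,
        Totp,
        Fido2,
    }

    /// Wraps a verifier-side config in an orchestrator-side sum type.
    #[derive(Debug, Clone, PartialEq)]
    pub enum FactorConfig {
        Fido2(Fido2Config),
    }

    impl FactorConfig {
        /// Hypothetical helper mapping a config to its kind.
        pub fn kind(&self) -> FactorKind {
            match self {
                FactorConfig::Fido2(_) => FactorKind::Fido2,
            }
        }
    }
}
```

Nothing in the verifier module names an orchestrator type, which is the property the crate boundary enforces at build time.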

The one exception that proves the rule

axess-core hosts one piece of code that does not fit the RP-side-orchestrator framing: the in-process IdP under crate::local_idp (feature local-idp, off by default). LocalIdp mints workload-identity JWTs on-host, which is OP-side issuance, not verifier composition. It lives in axess-core deliberately, not by oversight. The choice is between two costs: carve LocalIdp into a sibling crate that mirrors the verifier/issuer split at workspace shape, or accept one feature-gated OP-side module inside the orchestrator crate. The carve-out has been considered (see the ROADMAP) and rejected on the same reasoning that retired the earlier axess-delegated crate: the structural benefit is real but small, the maintenance overhead of an additional workspace member is real, and no adopter is asking for LocalIdp as a separate dependency. Adopters who do not enable local-idp pay nothing for it; adopters who do enable it find it through axess::local_idp::* regardless of which crate hosts the implementation.

The internal layout reflects the boundary even when the crate boundary does not. Primitives shared between the production LocalIdp and the test LocalIdpFixture live in axess-core/src/local_idp/primitives.rs, outside the testing/ tree, so production code does not have to import from a test module. The fixture itself stays under crate::testing::local_idp and imports the primitives, which is the dependency direction the prior arrangement got backwards.

The three state slices

Most authentication libraries conflate three independent state machines into one bag of fields and call the result a "session". Axess keeps them separate. This is not a stylistic choice; the slices answer different questions, change on different cadences, and are owned by different concerns.

flowchart LR
  subgraph auth["Authentication state"]
    direction TB
    s1["Guest"] --> s2["Identifying"]
    s2 --> s3["Authenticating"]
    s3 --> s4["Authenticated"]
    s3 --> s5["PendingWorkflow"]
    s5 --> s4
  end
  subgraph authz["Authorisation state"]
    direction TB
    a1["AuthzStore<br/><i>policies + schema<br/>(loaded once)</i>"]
    a2["AuthzSession<br/><i>per-request facade</i>"]
    a3["AuthzEntityProvider<br/><i>app-supplied graph</i>"]
    a1 --> a2
    a3 --> a2
  end
  subgraph principal["Principal state"]
    direction TB
    p1["Principal::Human"]
    p2["Principal::Workload"]
  end

Authentication state is AuthState, the session state machine covered in Part II. It transitions through factor verification, lives inside SessionData behind a cookie, and is what AuthnService::verify_factor mutates. It answers the question "is this caller authenticated, and to what tier?" It changes on factor verification, which is rare in absolute terms.

Authorisation state is AuthzStore, holding the Cedar policy set and its schema, loaded once at startup. A per-request AuthzSession then evaluates those policies against an entity graph that the application supplies through an AuthzEntityProvider. It does not live in the session; it is rebuilt fresh per request. It answers the question "is this principal allowed to perform this action against this resource?" It changes when policies are redeployed, which is even rarer.

Principal state is Principal { Human | Workload }. A human principal carries a UserId and TenantId; a workload principal carries a WorkloadId. The principal is extracted from the authentication state for humans and from a workload-identity resolver (bearer JWT, mTLS, K8s service account, and so on) for non-humans. It changes on every single request.

The slices are independent because they answer different questions and change on different cadences. Treating them as one bag conflates the questions and the cadences. Keeping them apart lets each evolve without disturbing the others.
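As a small illustration of the principal slice, the sketch below models typed IDs as newtypes and the principal as a two-variant enum. The type names come from this book; the String payloads, the field names, and the tenant helper are assumptions.

```rust
// Newtype wrappers: a TenantId cannot be passed where a UserId is expected.
// The inner String representation is an assumption for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct UserId(pub String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TenantId(pub String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct WorkloadId(pub String);

/// A human carries a user and tenant id; a workload carries a workload id.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Principal {
    Human { user: UserId, tenant: TenantId },
    Workload { workload: WorkloadId },
}

impl Principal {
    /// Hypothetical helper: in this sketch only humans have a tenant.
    pub fn tenant(&self) -> Option<&TenantId> {
        match self {
            Principal::Human { tenant, .. } => Some(tenant),
            Principal::Workload { .. } => None,
        }
    }
}
```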

Deterministic simulation testing

Every place in axess that reads wall time or sources entropy on the hot path goes through an injected trait. This is the discipline that lets the test suite be reproducible and that lets subtle timing or ordering bugs become failing tests rather than rare incidents.

Two traits carry the foundation. The first is Clock:

pub trait Clock: Send + Sync {
    fn now(&self) -> chrono::DateTime<chrono::Utc>;
}

pub struct SystemClock;          // delegates to chrono::Utc::now()
pub struct MockClock { /* ... */ } // advances under test control

The second is SecureRng:

pub trait SecureRng: Send + Sync {
    fn fill_bytes(&self, dest: &mut [u8]);
}

pub struct SystemRng;          // delegates to getrandom
pub struct MockRng { /* ... */ } // seeded; reproduces byte sequence

A small detail matters here. SecureRng::fill_bytes takes &self, not &mut self. A &mut self method would still be dyn-compatible, but it could not be called through a shared Arc<dyn SecureRng>. The mock implementation therefore guards its internal counter with a Mutex, so concurrent use is serialised without forcing every call site to plumb a mutable borrow through. The trade is a single locked critical section per random fill, which is negligible on the authentication hot path.
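A minimal sketch of that arrangement follows. The trait matches the definition above; the MockRng internals are an assumption (a SplitMix64 step chosen only because it is short and deterministic), and session_id is a hypothetical call site.

```rust
use std::sync::{Arc, Mutex};

pub trait SecureRng: Send + Sync {
    fn fill_bytes(&self, dest: &mut [u8]);
}

/// Hypothetical seeded mock. NOT cryptographically secure; tests only.
pub struct MockRng {
    state: Mutex<u64>,
}

impl MockRng {
    pub fn new(seed: u64) -> Self {
        Self { state: Mutex::new(seed) }
    }
}

impl SecureRng for MockRng {
    fn fill_bytes(&self, dest: &mut [u8]) {
        // Interior mutability: the lock lets a `&self` method advance state.
        let mut state = self.state.lock().expect("rng mutex poisoned");
        for byte in dest.iter_mut() {
            // One SplitMix64 step per output byte.
            *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = *state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            *byte = z as u8;
        }
    }
}

/// Hypothetical call site: only a shared handle is needed.
pub fn session_id(rng: &Arc<dyn SecureRng>) -> [u8; 16] {
    let mut id = [0u8; 16];
    rng.fill_bytes(&mut id);
    id
}
```

Two mocks built from the same seed produce the same byte sequence, which is what lets a test pin down identifiers that would otherwise be random.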

The wiring matches. AuthnService<I, F> holds Arc<dyn SecureRng> and Arc<dyn Clock> as construction-time fields. The service is generic over the identity store (I) and factor store (F) but type-erased over clock and RNG, so swapping in MockRng or MockClock does not change the service's type signature. Tests do this with .with_rng(MockRng::new(seed)) and .with_clock(MockClock::default()); production wires SystemRng and SystemClock.
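The shape of that wiring can be sketched as below, under loud assumptions: Clock is simplified to return Unix seconds so the example needs only the standard library (the real trait returns chrono::DateTime<chrono::Utc>), Service stands in for AuthnService, with_clock takes an Arc so the test keeps a handle it can advance, and lockout_expired is a hypothetical method.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Simplified stand-in for the Clock trait (seconds instead of DateTime).
pub trait Clock: Send + Sync {
    fn now_unix_secs(&self) -> u64;
}

pub struct SystemClock;

impl Clock for SystemClock {
    fn now_unix_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Starts at zero and moves only when the test says so.
#[derive(Default)]
pub struct MockClock {
    now: Mutex<u64>,
}

impl MockClock {
    pub fn advance(&self, by: Duration) {
        *self.now.lock().expect("clock mutex poisoned") += by.as_secs();
    }
}

impl Clock for MockClock {
    fn now_unix_secs(&self) -> u64 {
        *self.now.lock().expect("clock mutex poisoned")
    }
}

/// Generic over the two stores, type-erased over the clock.
pub struct Service<I, F> {
    pub identity: I,
    pub factors: F,
    clock: Arc<dyn Clock>,
}

impl<I, F> Service<I, F> {
    pub fn new(identity: I, factors: F) -> Self {
        Self { identity, factors, clock: Arc::new(SystemClock) }
    }

    /// Swapping the clock does not change the type `Service<I, F>`.
    pub fn with_clock(mut self, clock: Arc<dyn Clock>) -> Self {
        self.clock = clock;
        self
    }

    /// Hypothetical time-dependent decision.
    pub fn lockout_expired(&self, locked_until_unix_secs: u64) -> bool {
        self.clock.now_unix_secs() >= locked_until_unix_secs
    }
}
```

Because the clock sits behind a trait object, a test and production code share one concrete service type; only the injected value differs.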

The same discipline extends to backends. The pattern is uniform: the production implementation talks to a real database or external service, and a Mock* implementation does the same thing in memory under test control.

| Trait | Production implementation | Test mock |
| --- | --- | --- |
| AuthnBackend | real database | MockBackend |
| SessionRegistry | Valkey or memory | MemorySessionRegistry |
| OAuthProvider | HTTP plus JWKS cache | MockOAuthProvider |
| Fido2Provider | WebAuthn ceremony | MockFido2Provider |
| LdapProvider | LDAP directory | MockLdapProvider |
| DeviceStore | SQL or Valkey | MemoryDeviceStore |
| DeviceResolver | header or IP | RedactedResolver, NoopDeviceResolver |

A complete login including session-registry interactions, factor verification, refresh-token rotation, and audit emission can be exercised in a #[tokio::test] with no database, no Valkey, no network. The same test that detects a regression on a development laptop detects it in CI without further configuration.

One carve-out is worth naming. The axess-cache crate has an opt-in moka-cache feature that runs Moka's wall-clock-driven background eviction. That feature breaks DST and is documented as breaking it. The default ClockTtlCache takes a Clock trait and is DST-clean. If your test suite runs against the default configuration, you are inside the determinism envelope.
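To show why a clock-driven cache stays inside the determinism envelope, here is a minimal sketch of a TTL cache that checks expiry lazily on read. It is not the ClockTtlCache implementation: TtlCache, its methods, and the seconds-based Clock are all assumptions, and the sketch omits LRU eviction and single-flight.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the Clock trait (seconds since an arbitrary epoch).
pub trait Clock: Send + Sync {
    fn now_secs(&self) -> u64;
}

#[derive(Default)]
pub struct MockClock(Mutex<u64>);

impl MockClock {
    pub fn advance(&self, secs: u64) {
        *self.0.lock().expect("clock mutex poisoned") += secs;
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        *self.0.lock().expect("clock mutex poisoned")
    }
}

/// Hypothetical TTL cache. No background task reads the wall clock:
/// an entry expires only when a read observes the injected clock.
pub struct TtlCache<V> {
    clock: Arc<dyn Clock>,
    ttl_secs: u64,
    entries: HashMap<String, (u64, V)>, // key -> (expires_at, value)
}

impl<V: Clone> TtlCache<V> {
    pub fn new(clock: Arc<dyn Clock>, ttl_secs: u64) -> Self {
        Self { clock, ttl_secs, entries: HashMap::new() }
    }

    pub fn insert(&mut self, key: &str, value: V) {
        let expires_at = self.clock.now_secs() + self.ttl_secs;
        self.entries.insert(key.to_string(), (expires_at, value));
    }

    pub fn get(&mut self, key: &str) -> Option<V> {
        let now = self.clock.now_secs();
        let live = matches!(self.entries.get(key), Some((expires_at, _)) if now < *expires_at);
        if !live {
            // Expired or absent: drop any stale entry and report a miss.
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, value)| value.clone())
    }
}
```

A test can step the mock clock to one second before and exactly at the expiry boundary, which a wall-clock-driven background evictor cannot guarantee.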

Storage backends

Identity persistence is adopter-owned. Axess does not prescribe a user or tenant or factor schema, because every application already has one and the schemas do not agree on much. What axess does prescribe is the trait surface you implement, split into three tiers so that adopters can narrow what they have to write.

The narrowest tier is IdentityLookup, with ten read verbs. It is enough to support a read-replica path or a test fixture. The middle tier, IdentityAuthnLog: IdentityLookup, adds four per-attempt audit writes; it is required for production because lockout decisions depend on the audit log. The widest tier, IdentityAdmin: IdentityAuthnLog, adds nine verbs covering privileged provisioning, suspension, and GDPR erasure, and is required for any control-plane surface.

The umbrella alias IdentityStore: IdentityAdmin preserves the all-three-tiers shape for production backends. NoopAuthnLog is an adapter that wraps an IdentityLookup and satisfies the IdentityAuthnLog signature with a no-op, suitable for fixtures and read-replica contexts. Production must implement IdentityAuthnLog directly, however; the noop disables lockout, which is a security posture you do not want by accident.
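The tiering is ordinary supertrait layering, sketched below with one invented method per tier. The trait and adapter names come from the text; every method signature, the ReadReplica type, and the is_locked_out helper are hypothetical and far smaller than the real surface.

```rust
/// Narrowest tier: reads only. The method is a hypothetical example.
pub trait IdentityLookup {
    fn find_user(&self, identifier: &str) -> Option<u64>;
}

/// Middle tier: adds per-attempt audit writes (hypothetical methods).
pub trait IdentityAuthnLog: IdentityLookup {
    fn record_failed_attempt(&self, user_id: u64);
    fn failed_attempts(&self, user_id: u64) -> u32;
}

/// Widest tier: privileged operations (hypothetical method).
pub trait IdentityAdmin: IdentityAuthnLog {
    fn suspend_user(&self, user_id: u64);
}

/// Adapter: lifts any IdentityLookup into IdentityAuthnLog by discarding writes.
pub struct NoopAuthnLog<L>(pub L);

impl<L: IdentityLookup> IdentityLookup for NoopAuthnLog<L> {
    fn find_user(&self, identifier: &str) -> Option<u64> {
        self.0.find_user(identifier)
    }
}

impl<L: IdentityLookup> IdentityAuthnLog for NoopAuthnLog<L> {
    fn record_failed_attempt(&self, _user_id: u64) {}

    fn failed_attempts(&self, _user_id: u64) -> u32 {
        0 // nothing was ever recorded
    }
}

/// Hypothetical lockout check that depends on the audit log.
pub fn is_locked_out<S: IdentityAuthnLog>(store: &S, user_id: u64, max_failures: u32) -> bool {
    store.failed_attempts(user_id) >= max_failures
}

/// Hypothetical read-only fixture.
pub struct ReadReplica;

impl IdentityLookup for ReadReplica {
    fn find_user(&self, identifier: &str) -> Option<u64> {
        if identifier == "alice" { Some(1) } else { None }
    }
}
```

With the noop adapter, ten recorded failures still leave is_locked_out false, which is exactly the posture the paragraph above warns against in production.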

Session, refresh-token, and device storage have first-party backends for the obvious targets:

| Trait | Memory | SQLite | Postgres | MySQL | Valkey |
| --- | --- | --- | --- | --- | --- |
| SessionStore | always-on | sqlite | postgres | mysql | valkey |
| SessionRegistry | always-on | (adopter) | (adopter) | (adopter) | valkey |
| RefreshTokenStore | always-on | adopter | adopter | adopter | adopter |
| DeviceStore | device | device, sqlite | device, postgres | (adopter) | device, valkey |
| DelegatedCredentialStore | always-on | adopter | adopter | adopter | adopter |

The word "adopter" means axess defines the trait and provides a memory implementation; the SQL or Valkey-backed implementation is yours. The chapter Identity store implementation walks through the pattern, and examples/sqlite/ ships a complete one.

Session backends are also re-exported through the facade under the axess::backends::{sqlite, postgres, mysql, valkey, memory} namespace. Application code writes use axess::backends::sqlite::{SessionStore, DeviceStore} rather than stitching together flat SqliteSessionStore, SqlDeviceStoreError, and similar symbols. The grouping is a facade detail; backend module paths inside axess-core are internal.

The generic Store<K, V> surface

All five session backends also implement the generic axess_core::store::Store<SessionId, SessionData> trait. Adopters who want a backend-agnostic key/value-with-TTL surface (test doubles, generic operations endpoints, multi-backend deployments) can hold an Arc<dyn Store<…>> or a generic S: Store<…> and dispatch uniformly. SessionStore stays the primary surface for session-domain operations (cycle, find_sessions_for_user) because those carry primitives the generic Store deliberately omits.
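A rough sketch of what uniform dispatch over such a trait looks like is below. It is not the axess_core::store::Store definition: the method set is an assumption, the trait is synchronous to keep the example dependency-free (the real one is presumably async), MemoryStore and purge are hypothetical, and String stands in for SessionId and SessionData.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Hypothetical key/value-with-TTL surface.
pub trait Store<K, V>: Send + Sync {
    fn put(&self, key: K, value: V, ttl: Duration);
    fn get(&self, key: &K) -> Option<V>;
    fn delete(&self, key: &K) -> bool;
}

/// Shared, backend-agnostic handle.
pub type SharedStore<K, V> = Arc<dyn Store<K, V>>;

/// In-memory backend. The TTL is stored but NOT enforced in this sketch;
/// a real backend would compare it against an injected Clock.
pub struct MemoryStore<K, V> {
    entries: Mutex<HashMap<K, (V, Duration)>>,
}

impl<K, V> MemoryStore<K, V> {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }
}

impl<K, V> Store<K, V> for MemoryStore<K, V>
where
    K: Eq + Hash + Send,
    V: Clone + Send,
{
    fn put(&self, key: K, value: V, ttl: Duration) {
        let mut entries = self.entries.lock().expect("store mutex poisoned");
        entries.insert(key, (value, ttl));
    }

    fn get(&self, key: &K) -> Option<V> {
        let entries = self.entries.lock().expect("store mutex poisoned");
        entries.get(key).map(|(value, _ttl)| value.clone())
    }

    fn delete(&self, key: &K) -> bool {
        let mut entries = self.entries.lock().expect("store mutex poisoned");
        entries.remove(key).is_some()
    }
}

/// A generic operation that works against any backend; returns how many
/// of the given keys were actually present and removed.
pub fn purge<K, V>(store: &dyn Store<K, V>, keys: &[K]) -> usize {
    keys.iter().filter(|key| store.delete(key)).count()
}
```

An operations endpoint written against purge never learns which backend it is talking to, which is the use case the generic surface is meant for.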

A fully codec-parameterised SqlStore<K, V, C: Codec<V>> was evaluated and rejected. The dialect-specific SQL bodies are too thin to justify the sqlx::Database bound noise: only ON CONFLICT versus ON DUPLICATE KEY UPDATE plus three placeholder styles differ. The slice that does dedupe cleanly lives in session/storage/sql_helpers.rs.

Naming conventions

A reviewer reading axess code can predict a type's responsibility from its prefix and suffix. The conventions are tight on purpose; they let you scan a module index without reading any function bodies.

Type prefixes

| Prefix | Scope | Examples |
| --- | --- | --- |
| Auth* | Shared across authentication and authorisation | AuthSession, AuthState, AuthEvent, AuthMethod, AuthPrincipal |
| Authn* | Authentication only | AuthnService, AuthnError, AuthnScope, AuthnBackend |
| Authz* | Authorisation only | AuthzStore, AuthzSession, AuthzDecision, AuthzError |

Auth* is shared infrastructure. Authn* is what you reach for when handling a login attempt. Authz* is what you reach for when deciding whether a request may proceed. If you see a function that takes AuthSession and returns AuthzDecision, you know without opening it that it is bridging authentication state into authorisation evaluation.

Type suffixes

| Suffix | Meaning |
| --- | --- |
| *Outcome | Multi-variant result from an authentication operation (LoginOutcome, FactorOutcome, SignupOutcome) |
| *Decision | Binary allow/deny verdict (AuthzDecision) |
| *Config | Configuration or parameters (SessionConfig, TotpConfig, RateLimitConfig) |
| *Store | Persistence trait or implementation (SessionStore, IdentityStore, DeviceStore) |
| *Registry | Session validity tracking (SessionRegistry, MemorySessionRegistry) |
| *Provider | External integration trait (OAuthProvider, Fido2Provider, LdapProvider) |
| *Resolver | Extract typed value from a request (DeviceResolver, PrincipalResolver) |
| *Error | Error type (AuthnError, OAuthError, CryptoError) |
| *Builder | Builder pattern (SessionConfigBuilder, AuthEventBuilder) |

The conventions are not retroactive style guides. They are how the public surface is built today. New types adopt them; PR review catches violations.

Method verb conventions

| Verb | Semantics | Examples |
| --- | --- | --- |
| get_* | Lookup by primary key, deterministic, O(1) | get_user(id) |
| find_* | Search by business criteria, may scan | find_user(identifier, tenant) |
| load_* / save_* | Deserialise / serialise persisted state | load_factor(scope, kind) |
| begin_* / complete_* | Multi-step ceremony start / finish | begin_login(), complete_oauth_login() |
| verify_* | Check a credential or assertion | verify_factor() |

If you read find_user_by_email, you know it may be O(n) and may miss. If you read get_user, you know the id was already validated and the call should succeed unless the user was deleted.

Visibility

Internal types for cross-module use within axess-core (SessionHandle, SessionInner, LoadOutcome, FinalizeOutcome) are pub(crate). The public API surface is defined by the re-exports in axess-core's lib.rs and the facade in axess's lib.rs. The default for new types is pub(crate); promotion to pub requires concrete demand.

Security invariants

Three invariants run through every part of the workspace. They are not advice; they are enforced by lints, by review, and in some cases by the type system.

The first is #![forbid(unsafe_code)], declared at the root of every crate. There is no unsafe code in axess. There never will be unsafe code in axess. If a future change needs it, the change goes elsewhere.

The second is constant-time comparison for any byte-level secret check. HMAC cookie verification, TOTP code verification, OAuth CSRF state, refresh-token device binding, session fingerprint: all of these compare bytes through subtle::ConstantTimeEq. The alternative, == on bytes, leaks timing information and is rejected at review.
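For readers unfamiliar with the technique, the sketch below shows the idea behind a constant-time byte comparison using only the standard library. It is an illustration, not what axess ships: the library goes through subtle::ConstantTimeEq, which also takes steps to keep the optimiser from reintroducing an early exit. The ct_eq name is hypothetical.

```rust
/// Illustrative constant-time equality. Unlike `==` on slices, the loop
/// does not stop at the first mismatching byte, so the running time does
/// not reveal how long the matching prefix is. Production code should
/// prefer a vetted crate, because an optimiser is free to rewrite a
/// hand-rolled loop like this one.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    // The length is treated as public information in this sketch.
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate every differing bit instead of branching per byte.
        diff |= x ^ y;
    }
    diff == 0
}
```

The accumulate-then-test shape is the whole trick: every byte pair is visited regardless of where, or whether, the inputs differ.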

The third is secret zeroization on drop. Password hashes are wrapped in ZeroizedString. TOTP and HOTP shared secrets use Zeroizing. The session signing key zeroes its bytes in its Drop impl. The discipline is not perfect (an attacker with sufficient memory access can still win), but the surface is reduced.

The full production posture, including integration requirements and compliance touch-points, is in Security posture.

What lives where, in one paragraph

If you read nothing else from this chapter: state machines, storage, middleware, federation adapters, device identity, and OBO/delegated access live in axess-core. Factor algorithm primitives (Argon2id, TOTP, HOTP) live in axess-factors. Typed IDs and the principal enum live in axess-identity. Anything that delegates to time or randomness goes through axess-clock or axess-rng. Adopters depend on the axess facade; the internal split is free to reorganise behind that boundary.

Everything else is detail. The rest of the book is detail.

Further reading

  • The session state machine covers the five-state machine in full, including PendingWorkflow.
  • Factors and methods covers verifier composition, method authoring, and the scope hierarchy.
  • Cedar policy fundamentals covers policy loading, the evaluator, and the entity provider contract.
  • Session lifecycle and crypto envelope covers the cookie shape, the AES-256-GCM envelope, and fingerprint binding.
  • Contributing covers the AX-NNN policy, the DST non-negotiable, and the naming conventions tied back to this chapter.

Getting started

This chapter is the on-ramp. It assumes you can read Rust and have seen Axum, but it does not assume you know axess. The goal by the end is a small running Axum application that logs a user in with a password, holds the session in a signed cookie, and rejects requests to a protected route until the login is complete.

We will skip the database for as long as possible. Replacing the in-memory backend with a real SQLite backend is a one-trait swap, covered at the end and walked through in detail in Identity store implementation and the working examples/sqlite/ reference application.

If you already have an Axum application and want the punch list: add the dependencies in Dependencies, drop in the SessionLayer and AuthnService from The minimum viable wiring, and wire the login handler from Adding password login. The rest of the chapter is rationale and a tour of the production-shaped example.

Prerequisites

You need Rust 1.87 or later on the stable channel (the workspace MSRV), Axum 0.8.x, and a Tokio runtime in your binary (#[tokio::main] is fine). Axess does not depend on system libraries, message brokers, or external IdPs by default. The defaults are deliberately zero-infra: the in-memory session store, the in-memory backend, and the password, TOTP, HOTP, and email-OTP factors all work out of the box for development and tests.

Dependencies

The shortest functional Cargo.toml looks like this.

[dependencies]
axess = "0.2"             # facade -- depend on this, never on the internal crates
axum = "0.8"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
tower = "0.5"             # transitively from axum, but listed for clarity

The defaults of the axess facade enable authz and device. Everything else is opt-in via features. For this chapter we will also turn on memory, the in-memory session store used for development and tests.

axess = { version = "0.2", features = ["memory"] }

The complete feature reference lives in the crate-level docs on docs.rs and is surveyed in the project's README. Per-feature chapters in this book (Backends, OAuth, and so on) state their required feature at the top.

The minimum viable wiring

A minimum axess setup has four moving pieces, used in the same order they are wired. The backend looks up users and verifies their factors. The session store persists session data across requests. A signing key HMAC-signs the session cookie so it cannot be tampered with. The AuthnService is what handlers reach for to drive the state machine. On top of those four pieces sits one Tower layer, SessionLayer, which reads the cookie at the start of every request, hydrates the session, and writes it back on response.

Here is the whole thing in one file. We will walk through each line right after.

use axess::{
    AuthnService, InMemoryBackend, InMemorySessionStore,
    SessionLayer, AuthSession,
};
use axum::{Router, routing::get, response::IntoResponse, http::StatusCode};
use std::{sync::Arc, time::Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Backend -- one type implements both IdentityStore and FactorStore.
    let backend = InMemoryBackend::new()
        .with_user_password("alice", "default", "Gnomes2+");

    // 2. Session store + 3. signing key.
    let session_store = InMemorySessionStore::new();
    let signing_key: [u8; 32] = [0; 32]; // PLACEHOLDER, see "Signing keys" below.

    // 4. AuthnService -- type-erased over clock and RNG; production wires
    //    SystemClock + SystemRng.
    let service = Arc::new(AuthnService::new(backend.clone(), backend));

    // 5. SessionLayer threads the session through each request.
    let session_layer = SessionLayer::new(session_store, signing_key)
        .with_ttl(Duration::from_secs(86_400))
        .with_secure(false); // dev only -- see "Cookie security" below.

    let app = Router::new()
        .route("/", get(public_page))
        .route("/dashboard", get(protected_page))
        .with_state(service)
        .layer(session_layer);

    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}

async fn public_page() -> &'static str {
    "everyone can see this"
}

async fn protected_page(session: AuthSession) -> impl IntoResponse {
    if session.is_authenticated() {
        (StatusCode::OK, "welcome").into_response()
    } else {
        (StatusCode::UNAUTHORIZED, "log in first").into_response()
    }
}

This compiles and runs. Visiting http://127.0.0.1:3000/ returns "everyone can see this". Visiting /dashboard returns 401, because no session is authenticated yet. Adding the login flow is the next section.

What each line is doing

InMemoryBackend::new() constructs a backend that holds users, factor configurations, and authentication-attempt logs in memory. The convenience method with_user_password seeds one user (alice, in tenant default, with the Argon2id-hashed password Gnomes2+). Production replaces this with a real backend that implements IdentityStore and FactorStore against your database. The trait surface is identical.

InMemorySessionStore::new() is the trivial session backend. Session data lives in a HashMap behind an RwLock, and disappears on process exit. The first replacement is axess::backends::sqlite::SessionStore (with the sqlite feature), covered in Backends.

AuthnService::new(backend.clone(), backend) takes two arguments because the identity store and the factor store can be different types. In the in-memory case they are the same object, hence the clone. In production they typically remain the same struct (a single backend implementing both traits), again with a clone.

SessionLayer::new(store, key) constructs the Tower layer. The chained .with_ttl(Duration::from_secs(86_400)) sets a one-day session lifetime, and .with_secure(false) permits HTTP cookies for local development. See Cookie security below for the production setting.

AuthSession is an Axum extractor. Receiving it as a handler argument hydrates the session for the current request, and is_authenticated() returns true only when the state is AuthState::Authenticated. There are also is_guest(), is_authenticating(), and a typed .state() accessor if you want to match on the enum directly.
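
Matching on the state directly might look like the sketch below. The AuthState here is a local, simplified stand-in (string ids, no timestamps) so that the block compiles on its own; the real enum carries more fields, and the banner helper is hypothetical.

```rust
// Simplified stand-in for axess's AuthState; the real variants carry more fields.
#[derive(Debug)]
enum AuthState {
    Guest,
    Identifying { user_id: String },
    Authenticating { user_id: String, remaining: Vec<String> },
    Authenticated { user_id: String },
    PendingWorkflow { user_id: String },
}

// Hypothetical helper: pick the text a handler would render for each phase.
fn banner(state: &AuthState) -> String {
    match state {
        AuthState::Guest => "please log in".to_string(),
        AuthState::Identifying { user_id } => {
            format!("hello {user_id}, enter your credential")
        }
        AuthState::Authenticating { user_id, remaining } => {
            format!("{user_id}: {} factor(s) still required", remaining.len())
        }
        AuthState::Authenticated { user_id } => format!("welcome back, {user_id}"),
        AuthState::PendingWorkflow { user_id } => {
            format!("{user_id}: finish the pending workflow")
        }
    }
}

fn main() {
    let state = AuthState::Authenticating {
        user_id: "alice".into(),
        remaining: vec!["totp".into()],
    };
    println!("{}", banner(&state));
}
```

Because the match has no wildcard arm, each arm can only touch the fields its variant carries, which is the property the rest of the book leans on.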

Adding password login

The user seeded by with_user_password is configured with a single-factor method called password. Exercising a login takes two HTTP requests. The first is POST /login with a JSON body carrying the username and password. Axess transitions the session from Guest to Authenticating, verifies the password, and on success transitions to Authenticated. The second, like every request after it, carries the cookie that identifies the session, and AuthSession reads Authenticated.

use axess::{AuthnService, AuthSession, InMemoryBackend, LoginOutcome};
use axum::{extract::State, response::IntoResponse, http::StatusCode, Json};
use serde::Deserialize;
use std::sync::Arc;

#[derive(Deserialize)]
struct LoginForm {
    username: String,
    password: String,
}

async fn login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<InMemoryBackend, InMemoryBackend>>>,
    Json(form): Json<LoginForm>,
) -> impl IntoResponse {
    // 1. Begin the login. Transitions Guest -> Authenticating.
    match service.begin_login(&session, &form.username, "default").await {
        Ok(_) => {}
        Err(e) => return (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }

    // 2. Verify the password factor.
    use axess::FactorCredential;
    match service
        .verify_factor(
            &session,
            FactorCredential::Password(form.password.clone()),
        )
        .await
    {
        Ok(LoginOutcome::Authenticated { .. }) => {
            (StatusCode::OK, "logged in").into_response()
        }
        Ok(LoginOutcome::AwaitingFactor { remaining }) => {
            // Unreachable for a password-only method, but the branch matters
            // when chaining factors (password + TOTP, etc).
            (StatusCode::OK, format!("need more factors: {remaining:?}")).into_response()
        }
        Err(e) => (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }
}

The call to begin_login is what transitions the session from Guest to Authenticating. The transition records the user id, the tenant, and the list of factors still required (just Password for a single-factor method). Then verify_factor consumes one factor and returns a LoginOutcome. The successful terminal case is LoginOutcome::Authenticated, meaning every required factor has passed. The intermediate case is LoginOutcome::AwaitingFactor, meaning the factor verified but more are required; the state stays Authenticating and remaining lists what is still needed.

The branching is the whole point of the explicit state machine. There is no version of "logged in" that means "we believe one factor, you can let them in". is_authenticated() returns true only when every required factor has passed.

Wiring the login route

The minimum-viable router picks up the new handler:

let app = Router::new()
    .route("/", get(public_page))
    .route("/login", axum::routing::post(login))
    .route("/dashboard", get(protected_page))
    .with_state(service)
    .layer(session_layer);

A login flow now works end-to-end. Start the server, curl once to log in, hold the cookie, curl again to reach /dashboard.

$ curl -c jar -X POST http://127.0.0.1:3000/login \
       -H 'content-type: application/json' \
       -d '{"username":"alice","password":"Gnomes2+"}'
logged in

$ curl -b jar http://127.0.0.1:3000/dashboard
welcome

What just happened

A full request walks the following path. The numbers correspond to the wiring steps from The minimum viable wiring.

The browser sends the request with a Cookie: header carrying the session id. SessionLayer (5) extracts the cookie, verifies its HMAC signature against the signing key, looks up the session in the InMemorySessionStore (2), and rebuilds the AuthState. Axum invokes the handler with the hydrated AuthSession extractor. The handler reads or mutates the session through AuthnService (4), and mutations flag the session dirty. On response, SessionLayer re-serialises the session if it is dirty, re-signs the cookie, and sets it on the response.

The state machine, the backend, the session store, and the layer are independent moving parts. Swapping the in-memory backend for a SQLite-backed one does not touch the state machine or the session store. Swapping the session store for Postgres does not touch the state machine or the backend.

Signing keys

The example uses [0; 32] as the signing key. That is fine for a five-minute demonstration. It is not fine for anything else.

In production the signing key is a 32-byte random value loaded from a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, sealed Kubernetes secrets, or your platform's equivalent). The key must be stable across process restarts; the HMAC of an existing session cookie is computed with this key, and if the key changes underneath, every existing session becomes invalid on the next request.

Rotating the signing key is supported via SessionLayer::with_previous_key, which keeps the old key available for a transitional period so that sessions signed with the previous key continue to validate while new sessions sign with the new one. The Operations runbook walks through the rotation sequence in detail.

Cookie security

Setting .with_secure(false) in the example permits the cookie to be sent over HTTP, which is necessary for localhost development. In production, you terminate TLS at the edge and call .with_secure(true). The cookie will then only be sent over HTTPS. The other defaults are already production-shaped: HttpOnly is on, SameSite=Lax is set, and the cookie path is the application root.
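
To make the attribute list concrete, here is a minimal sketch that assembles a Set-Cookie header value with those defaults. It does not use axess; the cookie name axess_session, the session_cookie helper, and the Max-Age attribute derived from the TTL are all assumptions made for illustration.

```rust
// Build a Set-Cookie header value with the attributes described above.
// `secure` mirrors the with_secure(..) toggle; the cookie name is hypothetical.
fn session_cookie(value: &str, secure: bool, max_age_secs: u64) -> String {
    let mut cookie = format!(
        "axess_session={value}; Path=/; Max-Age={max_age_secs}; HttpOnly; SameSite=Lax"
    );
    if secure {
        // Browsers only return a Secure cookie over HTTPS.
        cookie.push_str("; Secure");
    }
    cookie
}

fn main() {
    // Local development over plain HTTP.
    println!("{}", session_cookie("signed-value", false, 86_400));
    // Production behind TLS.
    println!("{}", session_cookie("signed-value", true, 86_400));
}
```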

The Cookies, fingerprinting, hijack detection chapter covers the rest of the surface: the HMAC fingerprint binding that detects when a session cookie is replayed from a different user agent, the trusted-proxy configuration that controls how X-Forwarded-For is interpreted, and the SameSite=Strict trade-off.

Going further

This chapter is deliberately the minimum. The real examples/sqlite/ extends the same shape with everything you will actually want in production: a real SQLite backend (OurBackend implements IdentityStore and FactorStore over a sqlx::SqlitePool), a SQLite-backed session store with AES-256-GCM encryption at rest, a password + TOTP two-factor login for a second user, self-service signup and TOTP enrollment, a password-reset flow with email-OTP, rate limiting on the auth routes, a health check on the session store, atomic auth-attempt counters exposed at /metrics, and a background interval task that purges expired sessions. Read the example, run it, compare its app.rs to the snippet in this chapter. The shape is the same; there are simply more pieces wired in.

After that, the order in which you read the rest of the book depends on your goal.

Goal | Next chapter
Add a second factor (TOTP, FIDO2, OAuth) | Factors and methods
Replace InMemoryBackend with your database | Identity store implementation
Switch the session store to Postgres, MySQL, or Valkey | Backends: SQLite, Postgres, MySQL, Valkey
Add authorisation policies | Cedar policy fundamentals
Run multiple tenants | Multi-tenancy
Federated login (Google, Okta, Azure AD) | OAuth 2.0 and OIDC
Workload identity for non-human callers | Workload identity overview
Production deployment | Operations runbook

Common stumbling points

A handful of failures bite first-time integrators. They are worth naming up front so the chapter that solves them is easy to find.

If your handler cannot see AuthSession, the extractor needs the layer to populate request extensions. Add use axess::AuthSession; and check that SessionLayer is in .layer(...) on the router.

If begin_login returns UserNotFound, the tenant probably does not match. The example seeds alice in tenant default; passing a different tenant returns UserNotFound deliberately, not "user exists in a different tenant". Axess never leaks tenant membership across tenant boundaries.

If sessions disappear on process restart, that is correct for InMemorySessionStore. Use SqliteSessionStore, PostgresSessionStore, or ValkeySessionStore (with their respective features) for persistence. See Backends.

If you need to attach application data to a session, SessionData has a custom field for that. The size cap is 64 KiB to keep oversize cookies from becoming a DoS surface. See Session lifecycle and crypto envelope §"Custom session data".

To log a user out, call AuthSession::clear() or service.logout(&session).await. Either one resets the state to Guest, rotates the session id (defeating fixation), and clears the cookie on response.

Each of these has a dedicated chapter or section later in the book. The goal here was to get you running, not to be complete. You are running. The rest is detail.

The session state machine

AuthState is the most important type in axess. Everything that matters about an authenticated session, both for the type system and for a reviewer reading a handler, is captured by which of its five variants you are looking at. The whole library is built around the representational claim that authentication is not a boolean, not a flag column, and not a row in a sessions table that the handler reads and then trusts. It is an enum, transitions on the enum are methods on the enum, and a partial login is a distinct variant rather than a "finished" session with one field missing.

This chapter walks through the variants, the transition method, the outcome enum that dispatches transitions, the PendingWorkflow escape hatch for signup and password reset, and the orchestration-versus-pure split that keeps the state machine independently testable.

The five variants

The enum lives at axess-core/src/session/data.rs in the workspace. Each variant carries exactly the data its phase needs. There is no field on Authenticated for "current factor being verified" because at that point no factor is in progress, and there is no field on Guest for "tenant" because no user has been identified yet. The absence is the point.

pub enum AuthState {
    Guest,

    Identifying {
        user_id: UserId,
        tenant_id: TenantId,
    },

    Authenticating {
        user_id: UserId,
        tenant_id: TenantId,
        method_name: Arc<str>,
        remaining: Vec<FactorKind>,
        completed: Vec<FactorKind>,
        attempt_count: u32,
        last_attempt: Option<DateTime<Utc>>,
    },

    Authenticated {
        user_id: UserId,
        tenant_id: TenantId,
        authn_time: DateTime<Utc>,
        factors_completed: Vec<FactorKind>,
    },

    PendingWorkflow {
        user_id: UserId,
        tenant_id: TenantId,
        workflow: WorkflowState,
    },
}

Guest is the default. A request with no cookie, or a cookie whose session has been logged out or expired, arrives at the handler with an AuthSession whose state is Guest. There is no user identity in scope.

Identifying is the brief intermediate state for flows that prompt for a username before asking for any credential. Most applications skip it and go straight from Guest to Authenticating. It exists for the two-page login pattern where step one collects the identifier and step two collects the password, possibly with the identifier carried over a hidden form field or a short-lived intermediate token. The variant records who is being identified but says nothing about credentials.

Authenticating is where most of the action happens. The session knows who it is trying to authenticate, which method is in progress (because a tenant might have multiple methods, and the choice is locked in before any factor runs), what factors are still required, what factors have already been verified this attempt, how many credential attempts have been made, and when the last attempt landed. The last two fields exist because lockout decisions depend on them. A method that allows three attempts before locking the user out for fifteen minutes needs exactly this information, and putting it in the variant rather than in a side table keeps the decision local and reviewable.

Authenticated is the terminal success state. It carries the user id, the tenant id, the moment of successful authentication (for the audit trail), and the list of factors that were used. The factor list is load-bearing. A tenant policy that requires Fido2 for certain routes can check factors_completed.contains(&FactorKind::Fido2) directly, without consulting an external store.

PendingWorkflow is the variant most adopters do not initially expect and end up reaching for once they ship a real signup flow. It models the state where a user has authenticated enough to identify themselves but is in the middle of a multi-step ceremony (signup, password reset, email verification, or a custom workflow) and should not be treated as fully logged in until the ceremony completes. The variant wraps a WorkflowState that records which workflow is in progress, which step the user is on, and when the workflow started.

pub struct WorkflowState {
    pub kind: WorkflowKind,
    pub current_step: u32,
    pub total_steps: u32,
    pub initiated_at: DateTime<Utc>,
}

pub enum WorkflowKind {
    Signup,
    PasswordReset,
    EmailVerification,
    Custom(Arc<str>),
}

Custom(Arc<str>) is the extension point. If your application has a KYC flow, a hardware-key registration flow, or a multi-step recovery ceremony, you name it as a custom workflow and the session machinery treats it like the built-in kinds. The name is held in an Arc<str> because workflow names recur: cloning an Arc<str> bumps a reference count instead of allocating a new string, and the cost of repeated allocation would otherwise add up across a busy login surface.
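
The cost argument for Arc<str> can be checked with the standard library alone. The sketch below uses a cut-down local WorkflowKind (two variants only, an assumption for brevity) and shows that cloning a custom kind shares the original allocation.

```rust
use std::sync::Arc;

// Cut-down local copy of WorkflowKind, enough to show the cloning behaviour.
#[derive(Clone, Debug, PartialEq)]
enum WorkflowKind {
    Signup,
    Custom(Arc<str>),
}

fn main() {
    let kyc = WorkflowKind::Custom(Arc::from("kyc-review"));
    // Cloning copies the pointer and bumps a reference count; no new string is allocated.
    let copy = kyc.clone();
    if let (WorkflowKind::Custom(a), WorkflowKind::Custom(b)) = (&kyc, &copy) {
        println!("same allocation: {}", Arc::ptr_eq(a, b));
        println!("handles alive: {}", Arc::strong_count(a));
    }
    println!("built-in kinds carry no string at all: {:?}", WorkflowKind::Signup);
}
```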

The transition method

Factor verification is the central mutation of the state machine. The transition is AuthState::advance_factor, which takes a FactorKind and a timestamp, and returns an AdvanceOutcome that tells the caller what just happened.

impl AuthState {
    pub(crate) fn advance_factor(
        &mut self,
        kind: &FactorKind,
        authn_time: DateTime<Utc>,
    ) -> AdvanceOutcome { ... }
}

pub enum AdvanceOutcome {
    NotApplicable,
    StillAuthenticating,
    Completed,
}

The visibility on the method is pub(crate), which is the choice that keeps the orchestration honest. The pure state mutation is reachable only from within axess-core. Application code never calls it directly. Instead, application code calls AuthnService::verify_factor, which is the orchestrator method that locks the session, performs the factor's cryptographic verification through axess-factors, calls advance_factor on the typed state, and dispatches on the returned outcome.

The three outcomes are exhaustive. NotApplicable means the call was made against a state that does not accept factor verification (you cannot verify a factor against a Guest session, for instance). StillAuthenticating means the factor verified and more factors are required to complete the method. Completed means the final required factor for this method just passed, and the session should transition to Authenticated. The orchestration layer translates Completed into a typed Authenticated variant with the right authn_time and factors_completed, applies session id rotation to defeat fixation, and writes the session back to the store.
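
The real method body is elided above, so the following is only a sketch of what such a transition could look like, written against cut-down local types (three variants, a u64 timestamp instead of DateTime<Utc>). One assumption to note: a factor that is not the next one expected is reported here as NotApplicable, which may differ from how axess treats it.

```rust
#[derive(Clone, Debug, PartialEq)]
enum FactorKind {
    Password,
    Totp,
}

#[derive(Debug, PartialEq)]
enum AuthState {
    Guest,
    Authenticating { remaining: Vec<FactorKind>, completed: Vec<FactorKind> },
    Authenticated { authn_time: u64, factors_completed: Vec<FactorKind> },
}

#[derive(Debug, PartialEq)]
enum AdvanceOutcome {
    NotApplicable,
    StillAuthenticating,
    Completed,
}

impl AuthState {
    fn advance_factor(&mut self, kind: &FactorKind, authn_time: u64) -> AdvanceOutcome {
        // Only an in-progress login accepts a factor.
        let AuthState::Authenticating { remaining, completed } = self else {
            return AdvanceOutcome::NotApplicable;
        };
        // Assumption: only the next expected factor advances the state.
        if remaining.first() != Some(kind) {
            return AdvanceOutcome::NotApplicable;
        }
        completed.push(remaining.remove(0));
        if !remaining.is_empty() {
            return AdvanceOutcome::StillAuthenticating;
        }
        // Last factor passed: replace the whole variant, carrying the factor list over.
        let factors_completed = std::mem::take(completed);
        *self = AuthState::Authenticated { authn_time, factors_completed };
        AdvanceOutcome::Completed
    }
}

fn main() {
    let mut state = AuthState::Authenticating {
        remaining: vec![FactorKind::Password, FactorKind::Totp],
        completed: vec![],
    };
    println!("{:?}", state.advance_factor(&FactorKind::Password, 1_700_000_000));
    println!("{:?}", state.advance_factor(&FactorKind::Totp, 1_700_000_030));
    println!("{:?}", state);
}
```

Nothing in the sketch needs a runtime, a lock, or a store, which is why a transition like this can be exercised with plain assertions.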

The orchestration split

The split between AuthState (the pure data and pure transition methods) and AuthSession (the Axum extractor with its RwLock, dirty flag, and side-effect dispatch) is a deliberate choice with two payoffs.

The first payoff is testability. Unit tests on the state machine do not need tokio, do not need RwLock, do not need a fake session store, and do not need an extractor harness. They construct an AuthState directly, call advance_factor (or one of the other pub(crate) transition methods), and assert on the resulting variant. A regression in the transition logic surfaces as a one-line test against the enum, not as an integration test against a contrived HTTP request.

The second payoff is auditability. Every orchestration side effect (id rotation, fingerprint binding, dirty-flag handling, store write-back) lives in one place (the SessionService::call() method, walked through in Session lifecycle and crypto envelope) rather than scattered across transition methods. A code review of the orchestration is self-contained; a code review of the state machine is self-contained; neither has to mentally reconstruct the other.

The pattern is worth naming because it shows up again in the runtime. Pure state machines compose cleanly with async orchestrators that hold the locks and dispatch side effects, and the two halves get reviewed and tested independently.

When Authenticated is and is not the right shape

The natural temptation when integrating axess for the first time is to treat Authenticated as the "done" state and Guest as the "not done" state, and to ignore the intermediate variants. Resist it. The intermediate variants are how axess represents real-world flows that do not fit a binary, and reaching into them lets your application behave correctly without inventing parallel state on the side.

A signup flow that captures a username and password, mints a session, and then asks the user to verify their email before granting any access should sit in PendingWorkflow with a workflow whose kind is WorkflowKind::EmailVerification, not in Authenticated. A handler that protects the dashboard checks is_authenticated(), which returns true only for the Authenticated variant, so the user sees the email-verification page until the ceremony completes. When it does, the session transitions to Authenticated, the same handler now lets the user in, and the application does not need to model a "needs to verify email" column on the users table.

A password-reset flow follows the same pattern with WorkflowKind::PasswordReset. The user proves identity (with an email-OTP, say), the session enters PendingWorkflow, the password-reset page becomes accessible, the user submits a new password, and the session transitions back to Guest (forcing them to log in fresh with the new password). The reset page is unreachable from Guest and unreachable from Authenticated, which is correct in both directions: a not-logged-in user should not see it, and a fully logged-in user does not need it.

The pattern generalises to any post-identification ceremony. The typical question to ask is "should the user be considered fully logged in during this step?" If the answer is no, PendingWorkflow is the right variant. If the answer is yes, and you simply want the user to do something next, then Authenticated plus a flag on the user record fits better.
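
As a rough illustration of that routing decision, the sketch below maps a cut-down state onto the page a user should land on. The Page enum and the landing_page helper are hypothetical, and the local AuthState keeps only the variants the example needs.

```rust
// Local stand-ins; the real types carry ids, step counters, and timestamps.
#[derive(Debug, PartialEq)]
enum WorkflowKind {
    PasswordReset,
    EmailVerification,
}

struct WorkflowState {
    kind: WorkflowKind,
}

enum AuthState {
    Guest,
    Authenticated,
    PendingWorkflow { workflow: WorkflowState },
}

// Hypothetical page names, for illustration only.
#[derive(Debug, PartialEq)]
enum Page {
    Login,
    Dashboard,
    ResetForm,
    VerifyEmail,
}

fn landing_page(state: &AuthState) -> Page {
    match state {
        AuthState::Guest => Page::Login,
        AuthState::Authenticated => Page::Dashboard,
        // Mid-ceremony sessions see only the page for their workflow.
        AuthState::PendingWorkflow { workflow } => match workflow.kind {
            WorkflowKind::PasswordReset => Page::ResetForm,
            WorkflowKind::EmailVerification => Page::VerifyEmail,
        },
    }
}

fn main() {
    let mid_reset = AuthState::PendingWorkflow {
        workflow: WorkflowState { kind: WorkflowKind::PasswordReset },
    };
    println!("{:?}", landing_page(&mid_reset));
    println!("{:?}", landing_page(&AuthState::Guest));
}
```

The reset form is reachable from exactly one arm, so neither a guest nor a fully authenticated user can land on it.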

Logging out and identifier rotation

AuthnService::logout (and AuthSession::clear, which calls into it) transitions any state to Guest. The transition is more than a state change. The session identifier is rotated, the cookie is cleared on the response, the session row is deleted from the session store, and an audit event is emitted. The combination defeats session fixation. Even if an attacker knew the session id before logout, the id changes on the next login.

The orchestration layer also rotates the session id at the transition to Authenticated, for the same reason. A user who logs in receives a new session id, distinct from any id observed while they were a guest. The cookie is reissued; the old id is unreachable on subsequent requests. The rotation is invisible to application code and lives in the orchestration; the state machine just sees the variant change.
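
The effect of rotation on the store can be modelled in a few lines. This is a sketch with a plain HashMap standing in for the session store; rotate_id is a hypothetical helper, and the new id is passed in by the caller where a real system would draw it from the SecureRng.

```rust
use std::collections::HashMap;

// Move a session record to a fresh id so that the old id stops resolving.
// Returns the new id, or None if the old id was not in the store.
fn rotate_id(
    store: &mut HashMap<String, String>,
    old_id: &str,
    new_id: String,
) -> Option<String> {
    let record = store.remove(old_id)?;
    store.insert(new_id.clone(), record);
    Some(new_id)
}

fn main() {
    let mut store = HashMap::new();
    store.insert("guest-id".to_string(), "session-record".to_string());

    let rotated = rotate_id(&mut store, "guest-id", "fresh-id".to_string());
    println!("rotated to: {rotated:?}");
    // An attacker who fixed "guest-id" before login now holds a dead identifier.
    println!("old id still valid: {}", store.contains_key("guest-id"));
}
```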

Custom session data

Real applications need to attach data to a session that axess does not model: a preference, a feature-flag selection, a partial form draft. SessionData has a custom field for this, and the size cap is 64 KiB. The cap exists because the session is round-tripped through a cookie (or its server-side analogue), and a session that grows without bound becomes a DoS surface. 64 KiB is enough for almost any sensible use; anything larger probably belongs in the database keyed by user id rather than in the session.

Adding a custom field is purely additive. The SessionData struct exposes custom: HashMap<String, serde_json::Value> (the implementation may evolve, but the field-with-cap shape is stable), and you write through accessor methods on the session handle. The state-machine variants do not change. The schema-migration story covered in Schema migration handles upgrade paths without breaking existing sessions.
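
A field-with-cap shape could be sketched as follows. This is not the axess accessor API: set_custom and custom_size are hypothetical names, values are plain strings rather than serde_json::Value to keep the block dependency-free, and measuring size as the sum of key and value byte lengths is an assumption.

```rust
use std::collections::HashMap;

const CUSTOM_CAP_BYTES: usize = 64 * 1024;

#[derive(Default)]
struct SessionData {
    custom: HashMap<String, String>,
}

#[derive(Debug, PartialEq)]
struct CapExceeded;

impl SessionData {
    // Assumed size measure: total bytes of all keys and values.
    fn custom_size(&self) -> usize {
        self.custom.iter().map(|(k, v)| k.len() + v.len()).sum()
    }

    // Reject a write that would push the map past the cap; leave the map untouched on failure.
    fn set_custom(&mut self, key: &str, value: String) -> Result<(), CapExceeded> {
        let replaced = self.custom.get(key).map_or(0, |old| key.len() + old.len());
        let projected = self.custom_size() - replaced + key.len() + value.len();
        if projected > CUSTOM_CAP_BYTES {
            return Err(CapExceeded);
        }
        self.custom.insert(key.to_string(), value);
        Ok(())
    }
}

fn main() {
    let mut data = SessionData::default();
    println!("{:?}", data.set_custom("theme", "dark".to_string()));
    println!("{:?}", data.set_custom("draft", "x".repeat(CUSTOM_CAP_BYTES)));
    println!("bytes used: {}", data.custom_size());
}
```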

What this enables

The state machine is the foundation that lets the rest of the book be shorter. Factor composition (Factors and methods) works because Authenticating::remaining is a list, not a single field. Step-up authentication works because the orchestrator can transition from Authenticated to Authenticating with a non-empty remaining list when a sensitive route demands a stronger factor. Cedar authorisation works because Authenticated carries factors_completed, which the entity provider can serialise into a Cedar attribute the policy can match on. Audit events work because every transition produces a distinct AuthEvent variant with the right fields populated.
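
Step-up is the least obvious of these, so here is a minimal sketch of the idea on cut-down local types. The step_up function is hypothetical and stands in for whatever the orchestrator does internally; it only shows why a list-valued remaining makes the transition expressible.

```rust
#[derive(Clone, Debug, PartialEq)]
enum FactorKind {
    Password,
    Totp,
    Fido2,
}

#[derive(Debug, PartialEq)]
enum AuthState {
    Authenticating { remaining: Vec<FactorKind>, completed: Vec<FactorKind> },
    Authenticated { factors_completed: Vec<FactorKind> },
}

// Hypothetical step-up: if the route needs a factor the session has not
// completed, drop back to Authenticating with that factor outstanding.
fn step_up(state: AuthState, required: FactorKind) -> AuthState {
    match state {
        AuthState::Authenticated { factors_completed }
            if !factors_completed.contains(&required) =>
        {
            AuthState::Authenticating {
                remaining: vec![required],
                completed: factors_completed,
            }
        }
        // Already strong enough, or not authenticated at all: leave as is.
        other => other,
    }
}

fn main() {
    let session = AuthState::Authenticated {
        factors_completed: vec![FactorKind::Password],
    };
    println!("{:?}", step_up(session, FactorKind::Fido2));

    let strong = AuthState::Authenticated {
        factors_completed: vec![FactorKind::Password, FactorKind::Totp],
    };
    println!("{:?}", step_up(strong, FactorKind::Totp));
}
```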

None of these features required a different enum; they all read out of the state machine that was already there. The enum carries the authentication question, and the rest of the library asks it.

Further reading

The chapters that build directly on this one are Factors and methods (which factors fit into the variants, and how methods compose), Scope hierarchy (how begin_login picks the right method given Global, Tenant, and User overrides), and Refresh tokens and session continuity (how a session is kept alive over long lifetimes and across key rotation, and how token theft is detected). Session lifecycle and crypto envelope in Part V covers the cookie, the encryption envelope, and the orchestration's dirty-flag handling.

Factors and methods

A factor is a single credential check: a password, a TOTP code, a WebAuthn assertion, an LDAP bind, an OAuth token exchange. A method is a sequence of factors that together count as a successful login. Composing factors into methods, and scoping methods to apply per-user or per-tenant rather than globally, is the day-to-day surface adopters work with. This chapter explains the vocabulary, the types that carry it, and the pattern for adding a factor that axess does not ship.

Vocabulary

The four words that recur are factor, step, method, and scope. They sound interchangeable in casual writing, and they are not in the code.

A factor is one credential verifier, identified by a FactorKind variant: Password, Totp, Hotp, EmailOtp, Fido2, LdapBind, or Federated(FederatedProvider). Each factor has a config struct (PasswordConfig, TotpConfig, and so on) that the relevant adopter seeds at provisioning time and the service reads at verification time.

A step is one node in a method. A step is either a Required(kind) demand for a specific factor, or an AnyOf(vec![kind1, kind2, ...]) disjunction that lets the user choose among several factors at that position. The step is the unit of authoring; a method is a sequence of steps.

A method is an ordered sequence of steps with a stable name. Examples in the wild: "password-only" (one step, Required(Password)), "password-then-TOTP" (two steps, Required(Password) then Required(Totp)), "password-then-second-factor" (two steps, Required(Password) then AnyOf(vec![Totp, Fido2, EmailOtp])). The name matters because the session records which method is in progress, and the audit trail names the method when recording success or failure.

A scope is the tier at which a method is configured. There are three tiers (Global, Tenant, and User), covered in detail in Scope hierarchy. The short version: a global default applies everywhere; a tenant can override it; a user can override the tenant. Resolution is the simple inversion of authority: user override beats tenant override beats global default.
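
The resolution rule fits in one expression. The sketch below uses method names as plain string slices and a hypothetical resolve_method helper; the real resolver is covered in Scope hierarchy and will have a richer signature.

```rust
// Most specific scope wins: user override, then tenant override, then the global default.
fn resolve_method<'a>(
    user: Option<&'a str>,
    tenant: Option<&'a str>,
    global: &'a str,
) -> &'a str {
    user.or(tenant).unwrap_or(global)
}

fn main() {
    // No overrides: the global default applies.
    println!("{}", resolve_method(None, None, "password-only"));
    // Tenant override only.
    println!("{}", resolve_method(None, Some("password-then-TOTP"), "password-only"));
    // A user override beats the tenant override.
    println!(
        "{}",
        resolve_method(Some("password-then-second-factor"), Some("password-then-TOTP"), "password-only")
    );
}
```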

The factor list

The current FactorKind enum and its companion config sum-type live in axess-core/src/authn/factor.rs.

pub enum FactorKind {
    Password,
    Totp,
    Hotp,
    EmailOtp,
    Fido2,
    LdapBind,
    Federated(FederatedProvider),
}

pub enum FederatedProvider {
    Github,
    Google,
    Microsoft,
    Custom(String),
}

pub enum FactorConfig {
    Password(PasswordConfig),
    Totp(TotpConfig),
    Hotp(HotpConfig),
    EmailOtp(EmailOtpConfig),
    Fido2(Fido2Config),
    LdapBind(LdapBindFactorConfig),
    // Federated configs live with their provider's verifier crate.
}

FactorKind is the discriminator the state machine carries. FactorConfig is the data the verifier needs. They mirror each other because the verifier-versus-orchestrator split (see Architecture at a glance) puts the algorithm and its config in axess-factors and puts the discriminator and the composition machinery in axess-core. A new factor lands as a new FactorKind variant, a new FactorConfig variant, and a new verifier crate (or module) under axess-factors.

The federated case is intentionally a parameterised variant rather than a flat list. Each federated provider has its own configuration shape (Google's audience claim differs from GitHub's; Microsoft adds tenant directory parameters), and the wire formats are different enough that flattening them into one enum would require a discriminator inside the config. Parameterising the kind itself makes the config sum-type smaller and the type system honest about the variation.

Custom(String) is the extension point for IdPs the upstream library does not name explicitly. Adopters who federate against Okta, Auth0, Azure AD as a generic OIDC provider, or an in-house IdP plug in with the OAuth-RS resolver and a custom string identifier; the workload identity chapter (Workload identity overview) describes the same pattern from the inbound-resolver side.

How factors compose

The composition primitives are FactorStep and Method. A FactorStep is one node in a method. A Method is a vector of steps plus a name.

pub enum FactorStep {
    Required(FactorKind),
    AnyOf(Vec<FactorKind>),
}

pub struct Method {
    pub name: Arc<str>,
    pub steps: Vec<FactorStep>,
}

The two-step Required(Password) then AnyOf(vec![Totp, Fido2]) method handles a common shape: the user must enter their password, then must complete one of two second factors, and the choice of second factor is theirs (perhaps because they have not registered a passkey yet, or perhaps because their phone is at home and they only have their hardware key with them). The state machine's Authenticating::remaining field carries the residue of steps yet to complete: after the password step, remaining looks like [AnyOf(vec![Totp, Fido2])] and the application's login page renders the choice between them.

Required(kind) is shorthand for a one-element AnyOf(vec![kind]), but the distinction matters for audit clarity. A successful login that went password + totp reads cleanly when the audit log records "completed Required(Totp)"; the same login through an AnyOf step records "completed AnyOf::Totp" and a reviewer asks why the choice was offered at all. Use Required when there is no choice.

The orchestrator does not support arbitrary expression trees of factors (you cannot say "two of these three" with a single step). The omission is on purpose. Real authentication methods are short sequences with at most one decision point per step, and admitting arbitrary expressions would invite policies that pass formal review but defeat operational understanding.
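
To see how the two step shapes behave, consider this sketch. FactorStep is copied from above with a cut-down FactorKind; the accepts method and the satisfies helper are hypothetical additions, not part of the axess surface.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FactorKind {
    Password,
    Totp,
    Fido2,
    EmailOtp,
}

#[derive(Debug)]
enum FactorStep {
    Required(FactorKind),
    AnyOf(Vec<FactorKind>),
}

impl FactorStep {
    // Does a factor of this kind satisfy the step?
    fn accepts(&self, kind: FactorKind) -> bool {
        match self {
            FactorStep::Required(expected) => *expected == kind,
            FactorStep::AnyOf(choices) => choices.contains(&kind),
        }
    }
}

// Hypothetical helper: check a presented sequence against a method's steps, in order.
fn satisfies(steps: &[FactorStep], presented: &[FactorKind]) -> bool {
    steps.len() == presented.len()
        && steps.iter().zip(presented).all(|(step, kind)| step.accepts(*kind))
}

fn main() {
    let steps = vec![
        FactorStep::Required(FactorKind::Password),
        FactorStep::AnyOf(vec![FactorKind::Totp, FactorKind::Fido2]),
    ];
    println!("{}", satisfies(&steps, &[FactorKind::Password, FactorKind::Fido2]));
    println!("{}", satisfies(&steps, &[FactorKind::Password, FactorKind::EmailOtp]));
}
```

Each step makes one decision about one presented factor, which is the "at most one decision point per step" property described above.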

The verify_factor path

Application code drives factor verification through AuthnService::verify_factor. The signature is

pub async fn verify_factor(
    &self,
    credential: &FactorCredential,
    session: &AuthSession,
) -> Result<FactorOutcome, AuthnError<I::Error>>;

with FactorCredential the runtime credential value:

pub enum FactorCredential {
    Password(ZeroizedString),
    OtpCode(Arc<str>),
    Fido2Assertion(serde_json::Value),
}

and FactorOutcome the result of the call:

pub enum FactorOutcome {
    Authenticated,
    FactorRequired(FactorKind),
    InvalidCredential,
    Locked { until: Option<DateTime<Utc>> },
}

The handler in your application takes the credential off the request (form body, JSON, header, whatever), wraps it in the right FactorCredential variant, and calls verify_factor. Three things then happen inside the service.

First, the service acquires the session's write lock and reads its current state. If the state is not Authenticating, the call returns an AuthnError. If the state is Authenticating, the service inspects remaining to determine which factor is expected next. A mismatch between the credential the client supplied and the factor the method expects returns FactorOutcome::InvalidCredential without engaging the verifier, which keeps the cryptographic cost of failed attempts predictable.

Second, the service dispatches to the appropriate verifier in axess-factors. The password case calls Argon2id. The TOTP case calls the RFC 6238 verifier with the user's stored secret and the current window. The FIDO2 case calls the WebAuthn ceremony, which is itself stateful and threads through the session's challenge field. Federated cases dispatch to their respective OAuth or OIDC handlers.

Third, the service translates the verifier's result into a FactorOutcome and an AdvanceOutcome from the state machine. A successful verification calls AuthState::advance_factor, which returns Completed if no factors remain (the orchestrator promotes the session to Authenticated) or StillAuthenticating if more factors are required (the orchestrator leaves the state in Authenticating and returns FactorOutcome::FactorRequired(kind) with the next expected kind). A failed verification increments attempt_count, updates last_attempt, and returns FactorOutcome::InvalidCredential or Locked depending on the attempt policy.
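
The three phases can be condensed into a synchronous sketch. Everything here is a local stand-in: verify_step is hypothetical, the verifier is a closure rather than a call into axess-factors, lockout is left out, and mapping OtpCode to Totp alone is an assumption (the real credential may also serve other OTP factors).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FactorKind {
    Password,
    Totp,
}

enum FactorCredential {
    Password(String),
    OtpCode(String),
}

impl FactorCredential {
    // Assumption: an OTP code is treated as TOTP in this sketch.
    fn kind(&self) -> FactorKind {
        match self {
            FactorCredential::Password(_) => FactorKind::Password,
            FactorCredential::OtpCode(_) => FactorKind::Totp,
        }
    }
}

#[derive(Debug, PartialEq)]
enum FactorOutcome {
    Authenticated,
    FactorRequired(FactorKind),
    InvalidCredential,
}

fn verify_step(
    remaining: &mut Vec<FactorKind>,
    credential: &FactorCredential,
    verifier: impl FnOnce(&FactorCredential) -> bool,
) -> FactorOutcome {
    // Phase 1: reject a credential of the wrong kind before any expensive work.
    if remaining.first() != Some(&credential.kind()) {
        return FactorOutcome::InvalidCredential;
    }
    // Phase 2: run the (possibly costly) verifier.
    if !verifier(credential) {
        return FactorOutcome::InvalidCredential;
    }
    // Phase 3: advance and report what is still needed.
    remaining.remove(0);
    match remaining.first() {
        Some(next) => FactorOutcome::FactorRequired(*next),
        None => FactorOutcome::Authenticated,
    }
}

fn main() {
    let mut remaining = vec![FactorKind::Password, FactorKind::Totp];
    // Stand-in verifier: a real one would run Argon2id or RFC 6238, not a literal compare.
    let check = |c: &FactorCredential| match c {
        FactorCredential::Password(p) => p == "Gnomes2+",
        FactorCredential::OtpCode(code) => code == "123456",
    };
    let wrong_kind = FactorCredential::OtpCode("123456".into());
    println!("{:?}", verify_step(&mut remaining, &wrong_kind, check));
    let password = FactorCredential::Password("Gnomes2+".into());
    println!("{:?}", verify_step(&mut remaining, &password, check));
}
```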

The Locked outcome is the lockout decision in band. The until: Option<DateTime<Utc>> field carries the unlock time when one is scheduled (an exponential backoff that starts at five minutes once three attempts have failed, for instance) or None when the lockout requires administrative intervention. The application surfaces this to the user with the right copy; the audit log records the lockout regardless.
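
One way such a schedule could be computed from the two fields on Authenticating is sketched below. The threshold, the five-minute base, the doubling rule, and the lock_state helper are all illustrative assumptions rather than axess defaults, and timestamps are plain seconds instead of DateTime<Utc>.

```rust
// Illustrative policy values, not axess defaults.
const MAX_ATTEMPTS: u32 = 3;
const BASE_LOCK_SECS: u64 = 5 * 60;

#[derive(Debug, PartialEq)]
enum LockState {
    Open,
    LockedUntil(u64),
}

// Derive the lock state from the attempt counter and the time of the last attempt.
fn lock_state(attempt_count: u32, last_attempt: u64) -> LockState {
    if attempt_count < MAX_ATTEMPTS {
        return LockState::Open;
    }
    // Double the lock for each failure past the threshold; cap the shift to avoid overflow.
    let doublings = (attempt_count - MAX_ATTEMPTS).min(10);
    LockState::LockedUntil(last_attempt + (BASE_LOCK_SECS << doublings))
}

fn main() {
    for attempts in 2..=5 {
        println!("{attempts} attempts -> {:?}", lock_state(attempts, 1_000));
    }
}
```

Because the inputs are just the counter and the timestamp, the decision stays a pure function of the variant's own fields.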

Begin and complete

verify_factor is the verb that drives a method forward, but a login also has a start and an end. The start is AuthnService::begin_login, which transitions a Guest session into Authenticating. The end is the orchestrator's promotion of Authenticating to Authenticated when the last factor completes (or to PendingWorkflow when a workflow is in progress).

begin_login does three things that are worth naming explicitly. First, it looks up the user in the configured identity store, in the tenant that the caller named, and returns UserNotFound if no user matches. Second, it loads the method that applies to that user under the scope hierarchy (covered in Scope hierarchy); the result is the specific sequence of steps the user will walk. Third, it transitions the session into Authenticating with the loaded method's remaining set to the method's full step list.

complete_signup is the corresponding verb for the PendingWorkflow case. After a signup ceremony completes (email verified, KYC checks passed, terms accepted), the orchestrator transitions the session from PendingWorkflow { kind: Signup, ... } to Authenticated. The factor list on the resulting Authenticated variant is the list that was used during the signup, which is what the audit trail wants and what subsequent policy evaluation reads.

Adding a custom factor

The pattern for adding a factor that axess does not ship is the same pattern that produced the factors that axess does ship. There are four moving parts.

The first part is the verifier itself. It lives in axess-factors (or in a separate crate that depends on axess-factors) and exposes a function or trait that takes the stored config plus the runtime credential and returns a verifier-side result. For a hash-based factor this is straightforward (compute the hash, constant-time compare); for a ceremony-based factor (FIDO2, OAuth) the verifier threads through the session-side challenge and the response.

The second part is the FactorKind variant. Adding a variant is a breaking change to the public surface, which is what you want: any match on FactorKind in adopter code now flags a missing arm, and the adopter chooses to handle the new factor or to reject it with an explicit pattern. There is no "add a variant silently" mechanism in axess, and that omission is intentional.

The third part is the FactorConfig variant and the storage adapter that loads it. The factor config goes into the configured factor store; the load path resolves the scope (Global, Tenant, User) and returns the right config for the user being authenticated. Adopters implement the factor store, so the storage decision is theirs.

The fourth part is the credential type. A new factor that requires a new shape of input adds a variant to FactorCredential. A factor that maps to one of the existing variants (a password-like factor reuses Password, a code-based factor reuses OtpCode) avoids the addition.
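For the hash-based case named in the first part, the verifier reduces to "hash the credential, compare in constant time". The sketch below is a minimal illustration under stated assumptions: verify_hash_factor and its closure parameter are hypothetical names, not the axess-factors contract, and a production verifier would normally lean on a vetted constant-time crate rather than a hand-written loop, since an optimiser is free to rewrite one.

```rust
/// Compare two byte slices without returning early on the first mismatch.
/// The length check does return early, which leaks only the length.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences instead of branching
    }
    diff == 0
}

/// Hypothetical verifier shape: stored config (the hash) plus the runtime
/// credential in, a verifier-side result out. `hash` is whatever one-way
/// function the factor uses.
pub fn verify_hash_factor<H>(stored_hash: &[u8], credential: &[u8], hash: H) -> bool
where
    H: Fn(&[u8]) -> Vec<u8>,
{
    constant_time_eq(&hash(credential), stored_hash)
}
```

The orchestrator never sees the hash function; it sees only the boolean (or, in a fuller design, a richer verifier-side result) and translates it into a FactorOutcome.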

The work is small. The factors that axess ships today each take fewer than a thousand lines of Rust including tests. The reason the work stays small is that the orchestration and the state machine do not change; the verifier is doing one job, behind a fixed contract.

Step-up authentication

Step-up is the pattern where an already-Authenticated session is asked to re-prove identity (or to prove with a stronger factor) before performing a sensitive action. Axess models this by transitioning the state from Authenticated back to Authenticating with a non-empty remaining list. The orchestrator method that drives this is AuthnService::require_step_up, which takes the session and the factor or factors the caller demands.

The state-machine view is uniform. The session is Authenticating again; the factor list contains the stepped-up factors; the session remembers (in completed) which factors it already cleared. verify_factor works the same way it did during the original login, and on the final Completed outcome the session transitions back to Authenticated with a fresh authn_time and an updated factors_completed.

The application controls when step-up is required. The Cedar policy engine can express "this action requires Fido2 in factors_completed" (see Cedar policy fundamentals), or the handler can demand it directly. The state machine does not impose a policy; it provides the shape that lets the policy be enforced.
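The round trip from Authenticated to Authenticating and back can be sketched as two pure functions. The types are simplified stand-ins (only two of the five variants, Unix seconds for authn_time), and the free functions require_step_up and factor_verified are illustrative shapes rather than the AuthnService methods.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
}

/// Two of the five variants, with simplified fields (an assumption).
#[derive(Debug, PartialEq, Eq)]
pub enum AuthState {
    Authenticated { factors_completed: Vec<FactorKind>, authn_time: u64 },
    Authenticating { remaining: Vec<FactorKind>, completed: Vec<FactorKind> },
}

/// Move an Authenticated session back to Authenticating with the demanded
/// factors. An empty demand leaves the state untouched.
pub fn require_step_up(state: AuthState, demanded: &[FactorKind]) -> AuthState {
    match state {
        AuthState::Authenticated { factors_completed, .. } if !demanded.is_empty() => {
            AuthState::Authenticating {
                remaining: demanded.to_vec(),
                completed: factors_completed,
            }
        }
        other => other,
    }
}

/// Record a verified factor; on the last one, return to Authenticated with
/// a fresh `authn_time`.
pub fn factor_verified(state: AuthState, kind: FactorKind, now: u64) -> AuthState {
    match state {
        AuthState::Authenticating { mut remaining, mut completed } => {
            if remaining.first() != Some(&kind) {
                return AuthState::Authenticating { remaining, completed };
            }
            remaining.remove(0);
            if !completed.contains(&kind) {
                completed.push(kind);
            }
            if remaining.is_empty() {
                AuthState::Authenticated { factors_completed: completed, authn_time: now }
            } else {
                AuthState::Authenticating { remaining, completed }
            }
        }
        other => other,
    }
}
```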

What this enables

A method composed of a Required(Password) followed by an AnyOf(vec![Totp, Fido2, EmailOtp]) covers an enormous share of real deployments without any further structure. A per-tenant override for a specific tenant that requires Required(Fido2) instead of the disjunction covers the rare case where one tenant must be stricter. A per-user override that adds Required(EmailOtp) for a flagged user covers the regulatory case where one user is on a watch list.
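To make the Required and AnyOf semantics concrete, here is a minimal sketch of step matching. The enums mirror the variant names used in this chapter, but accepts and satisfies are hypothetical helpers, and "exactly one factor per step, in order" is an assumption about how a method is walked.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
    EmailOtp,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FactorStep {
    /// Exactly this factor must be presented.
    Required(FactorKind),
    /// Any one of the listed factors satisfies the step.
    AnyOf(Vec<FactorKind>),
}

impl FactorStep {
    /// Does a verified factor of `kind` satisfy this step?
    pub fn accepts(&self, kind: FactorKind) -> bool {
        match self {
            FactorStep::Required(required) => *required == kind,
            FactorStep::AnyOf(options) => options.contains(&kind),
        }
    }
}

/// True if `presented` satisfies `steps` in order, one factor per step.
pub fn satisfies(steps: &[FactorStep], presented: &[FactorKind]) -> bool {
    steps.len() == presented.len()
        && steps.iter().zip(presented).all(|(step, kind)| step.accepts(*kind))
}
```

Under this sketch, password followed by any of TOTP, FIDO2, or email OTP passes the first method described above, while password alone or TOTP followed by password does not.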

None of these require new code beyond an entry in the method store. The state machine, the verifier dispatch, and the audit pipeline all read the method out of the configured scope and execute it. The next chapter, Scope hierarchy, covers the configuration tier in detail.

Further reading

Scope hierarchy covers Global, Tenant, and User configuration tiers and how begin_login resolves them at runtime. Cedar policy fundamentals covers how the policy engine reads factors_completed and authorises against it. Part III, Factor cookbooks, has a chapter per real-world factor (Password and TOTP, FIDO2 and WebAuthn passkeys, OAuth 2.0 and OIDC, and so on) that walks through the integration details one factor at a time.

Scope hierarchy

Methods and factor configurations live at three tiers: Global, Tenant, and User. The mechanism is simple; the consequences are not. Done well, the three-tier hierarchy makes multi-tenant SaaS deployment feel like one configuration with two override surfaces. Done badly, it becomes a maze where nobody can answer "what method is this user actually using?" without running a query. This chapter walks through the mechanism and the patterns that keep it operationally clear.

The three tiers

AuthnScope lives in axess-core/src/authn/types.rs. It is a three-variant enum, ordered from broadest to narrowest:

pub enum AuthnScope {
    Global,
    Tenant(TenantId),
    User { tenant_id: TenantId, user_id: UserId },
}

Global is the workspace-wide default. A method or factor configured at global scope applies to every user in every tenant unless something overrides it.

Tenant(TenantId) is a per-tenant override. A method configured at tenant scope applies to every user in that tenant, overriding the global default for that tenant.

User { tenant_id, user_id } is a per-user override. A method configured at user scope applies to that one user, overriding both the tenant and global defaults for that user.

The ordering is the ordering of authority. Narrower beats broader.

How resolution works

At begin_login time the service needs to know which method this user should authenticate against. The resolution walks the scope chain from narrowest to broadest, returning the first match it finds.

The chain helper AuthnScope::lookup_chain produces the ordered sequence of scopes to query. For a user with tenant_id = T and user_id = U, the chain is [User { T, U }, Tenant(T), Global]. The factor store walks this list and returns the first configured method.

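// `factor_store` is the configured factor store, assumed to be in scope here.
async fn load_factor_with_fallback(
    user_scope: &AuthnScope,
    tenant_id: &TenantId,
    kind: &FactorKind,
) -> Result<Option<FactorConfig>, FactorStoreError> {
    // Walk the chain narrowest-first: User, then Tenant, then Global.
    for scope in user_scope.lookup_chain() {
        if let Some(config) = factor_store.load_factor(&scope, kind).await? {
            // The first scope that has a config wins.
            return Ok(Some(config));
        }
    }
    // Nothing is configured at any tier.
    Ok(None)
}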

The same chain is used for each factor in the method. A method that chains password and TOTP looks up the password config first (which might be a user-scoped override) and then the TOTP config (which might be a tenant default). Each factor's configuration is resolved independently, which is the right shape for the common case where the user has chosen their own TOTP device but the tenant has standardised the password policy.
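The chain itself is small enough to sketch in full. This is a minimal model under stated assumptions: the ID newtypes wrap a String, lookup_chain returns an owned Vec (the real return type is not shown in this chapter), and the HashMap-backed resolve function stands in for a real store.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TenantId(pub String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct UserId(pub String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum AuthnScope {
    Global,
    Tenant(TenantId),
    User { tenant_id: TenantId, user_id: UserId },
}

impl AuthnScope {
    /// Scopes to query, narrowest first.
    pub fn lookup_chain(&self) -> Vec<AuthnScope> {
        match self {
            AuthnScope::Global => vec![AuthnScope::Global],
            AuthnScope::Tenant(t) => vec![AuthnScope::Tenant(t.clone()), AuthnScope::Global],
            AuthnScope::User { tenant_id, .. } => vec![
                self.clone(),
                AuthnScope::Tenant(tenant_id.clone()),
                AuthnScope::Global,
            ],
        }
    }
}

/// Return the value configured at the narrowest scope that has one.
/// The HashMap is a stand-in for the factor or method store.
pub fn resolve<'a, V>(store: &'a HashMap<AuthnScope, V>, scope: &AuthnScope) -> Option<&'a V> {
    for candidate in scope.lookup_chain() {
        if let Some(value) = store.get(&candidate) {
            return Some(value);
        }
    }
    None
}
```

A user in a tenant with an override resolves to the tenant's value; a user in any other tenant falls through to the global default.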

The storage convention matches the tier model. The factor store schema typically has tenant_id and user_id columns that are nullable, with the following semantics:

tenant_id   user_id   Scope
NULL        NULL      Global
set         NULL      Tenant(tenant_id)
set         set       User { tenant_id, user_id }

ScopeColumns is the in-code representation of this pair; it lives next to AuthnScope and is what the SQL adapters use when building queries.
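The mapping between the enum and the nullable pair can be sketched as follows. The field names follow the column names in the table; the String-typed IDs, the From impl, and the to_scope helper are assumptions for illustration and may not match the real ScopeColumns API.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AuthnScope {
    Global,
    Tenant(String),
    User { tenant_id: String, user_id: String },
}

/// The nullable (tenant_id, user_id) pair as it appears in a row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScopeColumns {
    pub tenant_id: Option<String>,
    pub user_id: Option<String>,
}

impl From<&AuthnScope> for ScopeColumns {
    fn from(scope: &AuthnScope) -> Self {
        match scope {
            AuthnScope::Global => ScopeColumns { tenant_id: None, user_id: None },
            AuthnScope::Tenant(t) => ScopeColumns { tenant_id: Some(t.clone()), user_id: None },
            AuthnScope::User { tenant_id, user_id } => ScopeColumns {
                tenant_id: Some(tenant_id.clone()),
                user_id: Some(user_id.clone()),
            },
        }
    }
}

impl ScopeColumns {
    /// Inverse mapping. A user_id without a tenant_id matches no tier,
    /// so it is treated as an invalid row.
    pub fn to_scope(&self) -> Option<AuthnScope> {
        match (&self.tenant_id, &self.user_id) {
            (None, None) => Some(AuthnScope::Global),
            (Some(t), None) => Some(AuthnScope::Tenant(t.clone())),
            (Some(t), Some(u)) => Some(AuthnScope::User {
                tenant_id: t.clone(),
                user_id: u.clone(),
            }),
            (None, Some(_)) => None,
        }
    }
}
```

The fourth combination (user_id set, tenant_id NULL) has no row in the table, which is why a schema-level check constraint against it is worth considering.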

What gets scoped

The hierarchy applies to three kinds of object: factor configurations, methods, and lockout policies. Each plays the same game, with the same chain-walking resolution.

Factor configurations are the per-factor stored data: the password hash for a user, the TOTP secret for a user, the FIDO2 credential public keys for a user, the LDAP bind parameters for a tenant. Most factor configurations are user-scoped because they belong to a specific user (a password hash is intrinsically per-user). A few are tenant-scoped because they belong to a tenant configuration (LDAP bind parameters, OIDC discovery URLs). A very few are global (the system default Argon2id parameters, the system default TOTP drift window).

Methods are the ordered sequences of factor steps. A tenant typically configures a single default method (password-plus-TOTP, say), and a small minority of tenants override it (a regulated tenant requires FIDO2 instead of TOTP). Individual users very rarely have a custom method; when they do, it is because policy demands a stronger factor for a flagged user.

Lockout policies are the rate and threshold for locking out a user after repeated failed attempts. Defaults are global. Tenants with stricter risk postures override at tenant scope. Per-user lockout policies exist but are rare; they usually mean "this user is on a watch list and gets locked out faster than the rest".

The pattern across all three is identical. Configure a sensible global default. Let tenants override when they have a real reason. Reach for the user-scoped override only when policy demands per-individual differentiation. The more configuration you do at the narrowest scope, the more state you have to reason about during incidents.

Migration patterns

The scope hierarchy is the right tool for rolling out factor changes in a controlled way. The pattern is to introduce the change at the narrowest scope, verify it on a small population, and broaden the scope as confidence accumulates.

A worked example. A SaaS deployment wants to require FIDO2 for all users, replacing the existing password-plus-TOTP method. The cautious roll-out has three phases.

Phase one is User-scoped pilot. The operations team configures the new method (Required(Password) then Required(Fido2)) at user scope for a small set of internal users. These users go through the new flow first, surface any UX problems, and validate that the FIDO2 ceremony works end-to-end against the application's relying-party configuration.

Phase two is Tenant-scoped pilot. The team configures the new method at tenant scope for a single early-adopter tenant. Their users transition next, and the pilot widens to a population that includes real customer traffic. The user-scoped overrides from phase one are removed (they no longer differ from the tenant default).

Phase three is Global rollout. With confidence from both pilot phases, the team configures the new method at global scope. The tenant-scoped override for the early-adopter tenant is removed at the same time, since it no longer differs from the global default. The roll-out is complete; the method store has one row (the global default) instead of many.

The pattern works in reverse for emergency revocation. If the new method has a bug that surfaces after global rollout, the team can override at tenant scope or user scope for the affected population without redeploying the application or reverting the global config. The narrower scope wins; the affected users walk the old method while the bug is fixed.

How Cedar policy interacts

The scope hierarchy answers "what method does this user authenticate with?" Cedar answers "what is this user allowed to do once authenticated?" The two surfaces are distinct, and confusing them leads to authorization-as-authentication mistakes.

A common pattern is to use Cedar to require a method outcome rather than to choose one. A policy might require that factors_completed.contains("Fido2") for an action against a sensitive resource. The method itself remains the resolved one from the scope hierarchy. If the method does not include FIDO2, the user reaches the sensitive route and gets a deny; the application then offers step-up to add FIDO2 (covered in Factors and methods §"Step-up authentication"), the user completes it, and the policy now passes.

The split between choice (scope hierarchy) and demand (Cedar policy) is what makes this work. The hierarchy decides what factors are available; the policy decides which of them are required for which actions. A user can have a stronger method than the policy minimum and satisfy the policy without effort; a user with a weaker method gets prompted for step-up.
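The handler-side consequence of a policy demand can be sketched in plain Rust. This is not Cedar and not the AuthzDecision type: RouteDecision and decide are hypothetical names that model only the outcome the application acts on, namely "allow" or "offer step-up for the missing factors".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
}

/// Hypothetical handler-level outcome; not the axess AuthzDecision.
#[derive(Debug, PartialEq, Eq)]
pub enum RouteDecision {
    Allow,
    /// The policy demands factors the session has not completed.
    OfferStepUp(Vec<FactorKind>),
}

/// Compare what the policy demands with what the session has completed.
pub fn decide(required: &[FactorKind], factors_completed: &[FactorKind]) -> RouteDecision {
    let missing: Vec<FactorKind> = required
        .iter()
        .copied()
        .filter(|kind| !factors_completed.contains(kind))
        .collect();
    if missing.is_empty() {
        RouteDecision::Allow
    } else {
        RouteDecision::OfferStepUp(missing)
    }
}
```

A session that completed password and TOTP is offered a FIDO2 step-up on a route that demands FIDO2; a session that already completed FIDO2 passes without a prompt.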

Anti-patterns

The hierarchy invites a few mistakes that are worth naming explicitly.

The first is overusing user-scoped configuration. Every user-scoped row in the factor store is a piece of state that an operator has to maintain. If a tenant decides to change its method, the tenant-scoped row updates; the user-scoped overrides do not. After a few months of incremental changes, the user-scoped rows are out of sync with the intended policy, and nobody remembers why each row exists. The fix is to use user scope only when policy genuinely requires per-individual differentiation, and to document the reason in a separate field next to the row.

The second is using the hierarchy as a feature flag. The temptation is to roll out a new factor by user-scoping it to internal users, then forget about the user-scoped rows after the global rollout. The hierarchy is a good migration tool but a bad permanent home for temporary state. After a rollout completes, remove the narrower-scope overrides that no longer differ from the broader-scope default. The audit trail still records the historical use; the live configuration is clean.

The third is conflating method scope with tenant identity. The hierarchy says nothing about which tenants exist; it says only how to resolve a configuration for a given (tenant, user) pair. Tenant provisioning, tenant suspension, and tenant deletion are covered in Multi-tenancy.

What this enables

The hierarchy is the reason an axess deployment scales from "one company with one method" to "a SaaS with hundreds of tenants, each with its own posture, and a few high-risk users on stricter policies" without restructuring the application. The same code path (begin_login, verify_factor, Authenticated) handles the single-tenant case and the hundred-tenant case. The only difference is which scope holds the configuration.

The pattern is not unique to authentication configuration. Within axess, Cedar policies, audit retention policies, and rate-limit thresholds all follow the same three-tier pattern. The vocabulary is consistent across the library so a reviewer who has internalised the resolution rule does not have to re-learn it for each subsystem.

Further reading

Multi-tenancy covers tenant provisioning, the TenantId lifecycle, cross-tenant refusal, and the three-lever lockout. Cedar policy fundamentals covers how authorisation policy reads the resolved method's factors_completed field. Identity store implementation walks through the storage adapter that resolves the scope chain against a relational schema.

Refresh tokens and session continuity

A session cookie keeps a user logged in until it expires or is cleared. A refresh token is the mechanism that extends that lifetime past the cookie's short window, without exposing a long-lived bearer credential to the client. The shape of the mechanism matters more than most adopters initially realise, because the choice between "long cookie" and "short cookie plus refresh token" is the choice between "stolen cookie is valid for a day" and "stolen cookie is valid for an hour and then detectable as theft when the legitimate user next refreshes".

This chapter covers the refresh token shape in axess: hash-only storage, token families for reuse detection, device binding and cascade revocation, and the configuration surface adopters tune. The relevant code lives in axess-core/src/session/refresh.rs.

Why refresh tokens at all

A naive long-lived session is one cookie that lives for a month. If the cookie is stolen, the attacker has a month of access. The legitimate user has no way to know the cookie was stolen unless they notice the attacker's actions in their account.

A short-lived session with a refresh token is two credentials. The session cookie lives for an hour and grants access. The refresh token lives for a month and grants only the right to mint a new session cookie. The refresh exchange happens server-side, typically when the session cookie expires; the client sends the refresh token, the server checks it, and the server issues a fresh session cookie (and optionally a fresh refresh token).

The cost is one extra round-trip per hour. The benefit is twofold. First, a stolen session cookie expires within the hour. Second, and more importantly, a stolen refresh token gets caught the next time either the attacker or the legitimate user attempts to refresh, because the system detects that a token has been used twice and revokes the entire token family.

The stored shape

RefreshToken is the row that lives in the refresh token store:

pub struct RefreshToken {
    pub id: RefreshTokenId,
    pub user_id: UserId,
    pub tenant_id: TenantId,
    pub token_hash: String,
    pub issued_at: DateTime<Utc>,
    pub expires_at: DateTime<Utc>,
    pub revoked: bool,
    pub device_info: Option<String>,
    pub family_id: Option<TokenFamilyId>,
    pub device_id: Option<DeviceId>,
}

Three fields are worth dwelling on.

token_hash is the SHA-256 hash of the token string, not the string itself. The plaintext token is generated when the token is issued (through SecureRng for DST), returned to the client once, and never stored. The hash is what lives in the database. A database breach that leaks every row of the refresh token store does not leak any usable token, because the hash is one-way. The verification path hashes the client-supplied plaintext and compares it in constant time against the stored hash.

The hashing uses an optional pepper, configured through RefreshTokenConfig::hash_pepper. When set, the hash is HMAC-SHA256(pepper, plaintext); when unset, the hash is plain SHA-256(plaintext). The pepper is a deployment-level secret stored outside the database (in the secrets manager that holds the session signing key, typically) and adds defence in depth: an attacker who breaches the database alone cannot mount an offline brute-force attack against the hashes.

family_id is the link to the token's lineage. Every refresh token issued in a single authentication chain shares a TokenFamilyId. The first token issued at login starts a family; each subsequent token issued by rotation extends the same family. When the system detects that a token from a family has been used after rotation (which is what theft looks like), it revokes the entire family.

device_id is the link to the device identity ladder. When a refresh token is bound to a device, revoking the token can cascade to revoke the device, and revoking the device cascades to revoke every token bound to it. The cascade is bidirectional and is the mechanism that makes "log out everywhere on this device" work in practice. Device identity covers the device ladder in detail.

How families catch theft

The interesting part of the design is the family. The mechanism is worth walking through with a concrete sequence.

Alice logs in. The server issues refresh token A, in family F. A is delivered to her browser; the hash of A is stored in the database with family_id = F.

An hour later, Alice's session cookie expires. Her browser sends A back to refresh. The server hashes the plaintext, finds the row, verifies it is not revoked, marks A as revoked (rotation), and issues a new refresh token B in the same family F. B is delivered to the browser.

Meanwhile, an attacker has stolen the cookie and copied A. The attacker now sends A to refresh. The server hashes the plaintext, finds the row, and sees that A is already marked revoked.

The clean refresh-after-rotation invariant says that a revoked token should never be presented again. If it is, either Alice's browser is broken (unlikely), or the network retried (rare and recoverable), or the token has been stolen and the attacker is racing the legitimate user. The conservative response is to assume the worst: revoke the entire family F. Token B (which Alice's browser holds and has not yet used) is now revoked. The next time Alice's browser refreshes, it fails. The user has to log in again, but during the brief window between detection and re-login the attacker has no access either.

The detection-and-revoke pattern is implemented in the refresh_session function: when a revoked token is presented, the function calls revoke_family(user_id, family_id) and emits an audit event noting the suspected compromise. The application can also wire an on_token_compromise callback to receive the event synchronously and take application-specific action (logging Alice out of related sessions, alerting her by email, escalating to fraud review).

The pattern catches a class of attacks that long-lived sessions cannot detect at all. Even a sophisticated attacker who avoids generating alerts cannot avoid the family revoke, because the legitimate user's next refresh inevitably triggers it. The trade-off is one re-login per detected compromise; given the alternative is silent access, the trade-off is worth it.
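The Alice sequence can be replayed against a toy in-memory store. This is a sketch of the technique, not the refresh_session implementation: the store keys rows by the plaintext token where a real store keys by hash, tokens are predictable counters where real ones come from SecureRng, expiry is omitted, and every type name is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum RefreshOutcome {
    /// Normal rotation; carries the replacement token.
    Rotated(String),
    /// A revoked token was presented; the whole family is now revoked.
    ReuseDetected,
    /// The token is not in the store at all.
    Unknown,
}

struct Row {
    family_id: u32,
    revoked: bool,
}

/// Toy store keyed by plaintext token (a real store keys by hash).
#[derive(Default)]
pub struct TokenStore {
    rows: HashMap<String, Row>,
    next_token: u32,
    next_family: u32,
}

impl TokenStore {
    /// A login starts a new family.
    pub fn issue_at_login(&mut self) -> String {
        self.next_family += 1;
        let family = self.next_family;
        self.issue(family)
    }

    fn issue(&mut self, family_id: u32) -> String {
        self.next_token += 1;
        let token = format!("token-{}", self.next_token); // predictable: toy only
        self.rows.insert(token.clone(), Row { family_id, revoked: false });
        token
    }

    pub fn refresh(&mut self, presented: &str) -> RefreshOutcome {
        let (family_id, revoked) = match self.rows.get(presented) {
            Some(row) => (row.family_id, row.revoked),
            None => return RefreshOutcome::Unknown,
        };
        if revoked {
            // Reuse after rotation: assume theft and revoke the lineage.
            self.revoke_family(family_id);
            return RefreshOutcome::ReuseDetected;
        }
        if let Some(row) = self.rows.get_mut(presented) {
            row.revoked = true; // rotation retires the presented token
        }
        RefreshOutcome::Rotated(self.issue(family_id))
    }

    pub fn revoke_family(&mut self, family_id: u32) {
        for row in self.rows.values_mut() {
            if row.family_id == family_id {
                row.revoked = true;
            }
        }
    }

    pub fn is_active(&self, token: &str) -> bool {
        self.rows.get(token).map_or(false, |row| !row.revoked)
    }
}
```

Issuing A, rotating it to B, and then presenting A a second time leaves B revoked, which is the behaviour the walkthrough above describes.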

Device-binding cascade

When the device feature is enabled, refresh tokens are bound to the device that received them. A refresh token issued from a browser on Alice's laptop carries device_id = Some(laptop). A refresh token issued from her phone carries device_id = Some(phone). Family revocation cascades to the device store, marking the relevant device as Revoked; device revocation cascades back to the token store, revoking every token bound to the device.

The cascade is the mechanism behind "log out everywhere on this device" and "this device was lost, revoke all access from it". The operator marks the device revoked in the device store; the cascade revokes every refresh token bound to it; the next refresh from that device fails. The user is logged out of every session that ran through the device, including any session that was idle but still holding a refresh token.

The opposite direction matters too. When a family-revoke triggers from a token-reuse detection, the cascade marks the relevant device as compromised. The device's three-stage trust ladder (Unknown to Seen to Trusted, covered in Device identity) is short-circuited to the terminal Revoked state. Subsequent logins from the same device fingerprint surface as a fresh Unknown device, which the user re-establishes trust on with whatever step-up the application requires.

The collect_family_device_targets helper gathers (TenantId, DeviceId) pairs from a family for the cascade. The helper exists because the device store and the refresh-token store are independent persistence layers, and the cascade is the place where they coordinate. The application's on_token_compromise callback receives the list and decides which cascade to apply (some applications mark devices Revoked directly; others write an intermediate audit event and let an operator confirm).
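Both directions of the cascade can be sketched over a slice of token rows. The function name collect_family_device_targets comes from the text above, but its signature here is an assumption, as are the TokenRow fields (String and u32 stand in for the real ID types) and the revoke_device_tokens helper.

```rust
use std::collections::BTreeSet;

/// Cut-down token row; field types are stand-ins for the real ID types.
#[derive(Debug, Clone)]
pub struct TokenRow {
    pub tenant_id: String,
    pub family_id: Option<u32>,
    pub device_id: Option<String>,
    pub revoked: bool,
}

/// Token-to-device direction: gather the distinct (tenant, device) pairs
/// that a family's tokens are bound to.
pub fn collect_family_device_targets(rows: &[TokenRow], family_id: u32) -> Vec<(String, String)> {
    let targets: BTreeSet<(String, String)> = rows
        .iter()
        .filter(|row| row.family_id == Some(family_id))
        .filter_map(|row| {
            row.device_id
                .as_ref()
                .map(|device| (row.tenant_id.clone(), device.clone()))
        })
        .collect();
    targets.into_iter().collect()
}

/// Device-to-token direction: revoke every live token bound to a device.
/// Returns how many tokens were newly revoked.
pub fn revoke_device_tokens(rows: &mut [TokenRow], tenant_id: &str, device_id: &str) -> usize {
    let mut revoked = 0;
    for row in rows.iter_mut() {
        let bound = row.tenant_id == tenant_id && row.device_id.as_deref() == Some(device_id);
        if bound && !row.revoked {
            row.revoked = true;
            revoked += 1;
        }
    }
    revoked
}
```

Deduplicating through a set matters because a long-lived family holds many rotated tokens for the same device, and the callback should see each device once.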

Configuration

RefreshTokenConfig is the operator's tuning surface:

pub struct RefreshTokenConfig {
    pub ttl: Duration,
    pub max_per_user: usize,
    pub rotation: bool,
    pub hash_pepper: Option<Vec<u8>>,
}

The defaults are conservative for most applications: a thirty-day TTL, ten concurrent tokens per user, rotation enabled, and no pepper. Each field is worth a few words of guidance.

ttl is how long a refresh token is valid before it expires without being used. Thirty days is enough that most users do not feel the expiry in normal use, and short enough that an abandoned device's tokens become unusable in a bounded time. Applications with stricter posture set this lower; applications with weak step-up at re-login set this higher.

max_per_user is the cap on how many refresh tokens a user can have active at once. The cap exists to prevent a runaway "log in from every device the user owns" pattern from filling the token store. Issuing a new token past the cap evicts the oldest one. Ten is generous for most users (a phone, a laptop, a tablet, plus a few spares); applications with operators who routinely log in from ephemeral machines push this higher.

rotation controls whether a refresh issues a new token (true) or extends the existing one (false). Rotation enabled is the default and is what makes family-based theft detection work. Rotation disabled is faster (one less write per refresh) but defeats the family detection mechanism, because a token never moves to revoked under normal use. The recommendation is to leave rotation on; the performance cost is negligible.

hash_pepper is the optional secret used as the HMAC key when the token is hashed. Adding a pepper is a defence-in-depth measure that helps when the database is breached but the secrets manager is not. The pepper must be stable across the deployment (otherwise existing tokens become unverifiable); rotation is supported through the same pattern as the session signing key, covered in Operations runbook.
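The max_per_user behaviour described above ("issuing a new token past the cap evicts the oldest one") can be sketched in a few lines. The Token struct and the function signature are hypothetical, and the sketch ignores atomicity.

```rust
/// Cut-down token record for one user; `issued_at` is Unix seconds.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub id: u32,
    pub issued_at: u64,
}

/// Add `new_token` to the user's active set and evict the oldest tokens
/// until the set fits within `max_per_user`. Returns the evicted tokens.
pub fn issue_with_eviction(
    active: &mut Vec<Token>,
    new_token: Token,
    max_per_user: usize,
) -> Vec<Token> {
    active.push(new_token);
    // Oldest first, so the excess sits at the front.
    active.sort_by_key(|token| token.issued_at);
    let excess = active.len().saturating_sub(max_per_user);
    active.drain(..excess).collect()
}
```

With a cap of two, issuing a third token evicts the one with the earliest issued_at and leaves the two most recent active.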

Atomicity contracts

The RefreshTokenStore trait documents that production backends must implement three methods atomically. The atomicity is what makes the family-based theft detection sound; a non-atomic implementation opens a TOCTOU window where an attacker could race the legitimate user past the detection.

rotate_token must atomically mark the current token revoked and issue a new token in the same family. Two requests racing each other must result in one rotation and one detected reuse, not two rotations.

issue_with_eviction must atomically issue a new token and evict the oldest if the user is at the max_per_user cap. A non-atomic implementation can leave a user with eleven active tokens momentarily, which is harmless, or evict the wrong token under contention, which can log a legitimate session out for no reason.

revoke_family must atomically revoke every token in a family. Partial revocation defeats the detection mechanism: an attacker holding a token from a half-revoked family can still refresh.

The first-party SQL adapters use transactions to satisfy these contracts. Custom adapters need to do the same; the contract is documented on the trait so reviewers can check it explicitly.
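The rotate_token contract ("one rotation and one detected reuse, not two rotations") can be demonstrated with a mutex standing in for a database transaction. This is a sketch of the property, not of any axess adapter: Store, Rotation, and race are hypothetical names, and the in-memory map replaces the token table.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rotation {
    Rotated,
    ReuseDetected,
    Unknown,
}

/// Toy token table: token string mapped to its `revoked` flag.
#[derive(Default)]
pub struct Store {
    revoked: HashMap<String, bool>,
    issued: u32,
}

/// Check, revoke, and issue under a single lock acquisition. The lock
/// plays the role a transaction plays in a SQL adapter.
pub fn rotate_token(store: &Mutex<Store>, presented: &str) -> Rotation {
    let mut guard = store.lock().unwrap();
    let already_revoked = match guard.revoked.get(presented) {
        Some(flag) => *flag,
        None => return Rotation::Unknown,
    };
    if already_revoked {
        return Rotation::ReuseDetected;
    }
    guard.revoked.insert(presented.to_string(), true);
    guard.issued += 1;
    let next = format!("token-{}", guard.issued);
    guard.revoked.insert(next, false);
    Rotation::Rotated
}

/// Present the same token from `contenders` threads at once.
pub fn race(presented: &str, contenders: usize) -> Vec<Rotation> {
    let mut initial = Store::default();
    initial.revoked.insert(presented.to_string(), false);
    let store = Arc::new(Mutex::new(initial));
    let handles: Vec<_> = (0..contenders)
        .map(|_| {
            let store = Arc::clone(&store);
            let token = presented.to_string();
            thread::spawn(move || rotate_token(&store, &token))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

If the check and the revoke were performed under separate lock acquisitions, both threads could observe the token as live and both would rotate, which is the TOCTOU window the contract rules out.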

What this enables

Refresh tokens and session cookies are the two ends of a continuum between "convenience" and "security". A session cookie alone is the convenience end. A refresh token with family-based theft detection and device-binding cascade is what lets axess sit much closer to the security end without compromising user experience: sessions feel permanent because they refresh transparently, and theft gets caught the next time anyone attempts a refresh.

The mechanism is the same one that lets axess support "log out of everything" at the user-account level and "this device was lost" at the device level, because the cascade between tokens, families, and devices is the same in both directions. A session that has lived its whole life behind axess can be revoked through any of the three handles, and the others follow.

Further reading

Device identity covers the three-stage device assurance ladder, the per-tenant fingerprint pepper, and the retention sweep. Session lifecycle and crypto envelope covers the session cookie itself, the AES-256-GCM envelope, and the orchestration that issues and reads cookies. Operations runbook covers key rotation for the session signing key, the refresh-token pepper, and the device fingerprint pepper.

Password and TOTP

The four factors axess-factors ships by default (password, totp, hotp, email_otp) are the ones most adopters reach for first. They require no external IdP, no specialised hardware, no extra infrastructure. This chapter walks through password (Argon2id) and TOTP (RFC 6238), the most common combination in practice, with references to HOTP and email OTP at the end. The pattern these factors illustrate generalises to every other factor in the library.

The feature flags password, totp, hotp, email_otp are all on by default in axess-factors. No Cargo.toml change is needed to use them.

Password (Argon2id)

The password factor verifies a user-supplied secret against a stored Argon2id hash. The choice of Argon2id rather than bcrypt or PBKDF2 is the standard one for new systems built today; the parameter tuning is the operational lever you reach for first.

The configuration struct is PasswordConfig. It carries the Argon2id parameters (memory cost, time cost, parallelism), the minimum and maximum password length, and the optional pepper. Defaults are calibrated for a server class that can spare about fifty milliseconds of CPU per verification, which is what current guidance considers an appropriate cost ceiling for an interactive login.

pub struct PasswordConfig {
    pub argon: Argon2idParams,    // memory, time, parallelism
    pub min_length: usize,        // default 8
    pub max_length: usize,        // default 128
    pub pepper: Option<Vec<u8>>,  // optional, see below
}

The pepper is an optional secret stored outside the database (in the secrets manager that holds the session signing key). When set, the password is first run through HMAC-SHA256(pepper, password), and Argon2id then hashes the result. The defence-in-depth benefit is the same one refresh-token peppers provide: a database breach alone does not enable an offline brute-force attack against the password hashes, because the pepper is not in the database.

The maximum password length matters for DoS protection. Argon2id is deliberately expensive; an attacker who can submit a megabyte of password text per request can wedge the server with a handful of concurrent attempts. The cap is one hundred and twenty-eight characters by default, which is generous for legitimate users (password managers rarely generate anything longer) and bounded enough that the worst case per request stays under a hundred milliseconds.

The minimum is eight characters, which is below the modern recommendation but matches what most users encounter elsewhere. A deployment serious about password quality lifts this to twelve or fourteen, alongside a length-and-character-class meter on the signup form. Axess does not enforce password complexity rules beyond the length range; complexity meters live in the registration UI, where they can produce real feedback.
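The length gate runs before any hashing, so it is cheap to sketch. The function and error names below are hypothetical, and counting Unicode scalar values (rather than bytes or grapheme clusters) is an assumption about what "characters" means; check the PasswordConfig documentation for the actual rule.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum LengthError {
    TooShort,
    TooLong,
}

/// Reject out-of-range passwords before any expensive hashing happens.
/// Lengths are counted in Unicode scalar values (an assumption).
pub fn check_password_length(
    password: &str,
    min_length: usize,
    max_length: usize,
) -> Result<(), LengthError> {
    // A UTF-8 character is at most four bytes, so this byte-length test
    // rejects grossly oversized input without walking it.
    if password.len() > max_length.saturating_mul(4) {
        return Err(LengthError::TooLong);
    }
    let chars = password.chars().count();
    if chars < min_length {
        Err(LengthError::TooShort)
    } else if chars > max_length {
        Err(LengthError::TooLong)
    } else {
        Ok(())
    }
}
```

The early byte-length check is what keeps a megabyte-sized submission from costing more than a comparison.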

TOTP (RFC 6238)

The TOTP factor verifies a six-digit code derived from a shared secret and the current time window. The shared secret is twenty bytes of cryptographic randomness, generated at enrolment time and stored alongside the user's other factor configurations.

The configuration struct is TotpConfig:

pub struct TotpConfig {
    pub secret: ZeroizedString,   // base32-encoded 20-byte secret
    pub digits: u32,              // default 6
    pub period: Duration,         // default 30 s
    pub algorithm: HmacAlgorithm, // default Sha1 (RFC 6238)
    pub drift_window: u32,        // default 1
}

secret is zeroized in memory on drop. The string is base32-encoded because that is what TOTP apps expect when scanning a QR code or pasting a manual key; the bytes underneath are twenty cryptographically random bytes from SecureRng. Adopters serialise the secret to and from their factor store however the store's encryption envelope prefers.

digits is six in line with every TOTP authenticator in production use. RFC 6238 admits up to eight, but no widely deployed TOTP app generates eight-digit codes, so the field exists for symmetry rather than for variability.

period is the time window each code is valid for. Thirty seconds is the RFC default and what every authenticator app expects. Increasing the period (to sixty seconds, say) reduces the chance that a user typing slowly enters a code that has just expired, at the cost of doubling the window an intercepted code remains valid. The recommendation is to keep this at thirty unless you have a specific reason to change it.

algorithm is the HMAC primitive used to derive the code. SHA-1 is the RFC 6238 default and remains universally compatible. SHA-256 is the harder-to-collide choice; some authenticator apps do not yet support it. Stay on SHA-1 unless you have control over the authenticator app the users will use.

drift_window is the count of adjacent time windows the verifier accepts. A drift window of one means the verifier accepts codes from the current window plus one window on either side, covering a ninety-second total acceptance range against a thirty-second period. The drift accommodates a few seconds of clock skew between server and authenticator. Lifting it to two or three reduces user friction at the cost of slightly increasing the brute-force attack surface; the default of one is the right trade for most deployments.
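The arithmetic behind period and drift_window is small enough to show directly. This is a sketch under the assumption that time steps are counted from the Unix epoch (T0 = 0 in RFC 6238 terms); the function names are hypothetical and timestamps are plain seconds rather than values from the Clock trait.

```rust
use std::ops::RangeInclusive;

/// Time-step counter for a Unix timestamp (RFC 6238 with T0 = 0).
pub fn time_step(unix_secs: u64, period_secs: u64) -> u64 {
    unix_secs / period_secs
}

/// Counters a verifier would try: the current step plus `drift_window`
/// steps on either side. Saturates at zero near the epoch.
pub fn accepted_steps(
    unix_secs: u64,
    period_secs: u64,
    drift_window: u64,
) -> RangeInclusive<u64> {
    let current = time_step(unix_secs, period_secs);
    current.saturating_sub(drift_window)..=current.saturating_add(drift_window)
}

/// Total acceptance range in seconds for a given configuration.
pub fn acceptance_range_secs(period_secs: u64, drift_window: u64) -> u64 {
    (2 * drift_window + 1) * period_secs
}
```

With the defaults, acceptance_range_secs(30, 1) is 90, the ninety-second range described above; a drift window of two widens it to 150 seconds.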

Composing password and TOTP

A method that combines password and TOTP is two FactorSteps:

use axess::{FactorKind, FactorStep, Method};

let password_plus_totp = Method {
    name: "password-then-totp".into(),
    steps: vec![
        FactorStep::Required(FactorKind::Password),
        FactorStep::Required(FactorKind::Totp),
    ],
};

The method is stored at whatever scope the deployment wants (Global default, Tenant override, User override; see Scope hierarchy). At begin_login time the resolver loads the method, the session transitions to Authenticating with remaining = [Password, Totp], and the login flow walks the two factors in order.

The application's login page renders the password prompt while the session is in Authenticating with remaining[0] == Password, and the TOTP prompt while in Authenticating with remaining[0] == Totp. A successful TOTP verification calls advance_factor, which returns Completed, and the orchestrator transitions the session to Authenticated. The user is logged in.

A common variant offers TOTP plus another second factor as a choice:

let password_plus_2fa_choice = Method {
    name: "password-then-2fa-choice".into(),
    steps: vec![
        FactorStep::Required(FactorKind::Password),
        FactorStep::AnyOf(vec![
            FactorKind::Totp,
            FactorKind::Fido2,
            FactorKind::EmailOtp,
        ]),
    ],
};

The login page after the password step shows three options. The user picks one; the application calls verify_factor with the appropriate credential; on success, the session is authenticated.
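How a list of steps is consumed can be modelled in a few lines. The types below are local stand-ins that mirror the shapes used in this chapter; they are not the axess types, and the advance function is a hypothetical simplification of what advance_factor does.

```rust
// Local stand-ins mirroring the chapter's shapes; not the axess types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
    EmailOtp,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FactorStep {
    Required(FactorKind),
    AnyOf(Vec<FactorKind>),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Advance {
    /// More steps remain after this one.
    Continue,
    /// The last step was satisfied; the session may become Authenticated.
    Completed,
    /// The verified factor does not satisfy the step at the front.
    Rejected,
}

/// Consumes the front step if `verified` satisfies it.
pub fn advance(remaining: &mut Vec<FactorStep>, verified: FactorKind) -> Advance {
    let satisfied = match remaining.first() {
        Some(FactorStep::Required(kind)) => *kind == verified,
        Some(FactorStep::AnyOf(kinds)) => kinds.contains(&verified),
        None => false,
    };
    if !satisfied {
        return Advance::Rejected;
    }
    remaining.remove(0);
    if remaining.is_empty() {
        Advance::Completed
    } else {
        Advance::Continue
    }
}
```

In this model a TOTP code presented before the password is rejected, because only the step at the front of the list can be satisfied.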

TOTP enrolment

Enrolment is a separate ceremony from login. The user is already authenticated (often immediately after signup), and the application walks them through registering a TOTP device. The shape is uniform across deployments.

The server generates a new TOTP secret through SecureRng. It serialises the secret as a base32 string and as an otpauth://totp/<issuer>:<account>?secret=<base32>&issuer=<issuer> URI suitable for embedding in a QR code. The UI displays the QR code (scanned by the user's TOTP app) and offers a copy of the base32 secret for users whose apps prefer manual entry.
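The serialisation step can be sketched with the standard library alone. The block assumes the secret bytes have already been drawn from SecureRng; base32_nopad, percent_encode, and otpauth_uri are illustrative names, not axess functions, and the URI follows the shape given above.

```rust
/// RFC 4648 base32 without padding, the form authenticator apps accept.
pub fn base32_nopad(bytes: &[u8]) -> String {
    const ALPHABET: &[u8; 32] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
    let mut out = String::new();
    let mut buffer: u32 = 0; // pending bits live in the low end
    let mut bits: u32 = 0; // how many pending bits `buffer` holds
    for &byte in bytes {
        buffer = (buffer << 8) | u32::from(byte);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
        }
    }
    if bits > 0 {
        // Left-align the final partial group of bits.
        out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
}

/// Percent-encodes every byte outside the RFC 3986 unreserved set.
pub fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Builds the otpauth URI that is rendered as a QR code.
pub fn otpauth_uri(issuer: &str, account: &str, secret: &[u8]) -> String {
    format!(
        "otpauth://totp/{}:{}?secret={}&issuer={}",
        percent_encode(issuer),
        percent_encode(account),
        base32_nopad(secret),
        percent_encode(issuer),
    )
}
```

A twenty-byte secret always encodes to thirty-two base32 characters, so no padding is needed for the default secret length.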

The user enters a six-digit code from their app, the server verifies it against the same TOTP algorithm that login uses, and on success the server persists the secret to the factor store under the user's scope. The user is now enrolled. Their next login that demands TOTP will succeed.

Two operational details matter at enrolment.

The first is that the verification at enrolment must succeed before the secret is persisted. A user who scans the QR code but mistypes the verification code (or scans into the wrong app) should not be left with a stored secret that they cannot reproduce. The standard pattern is: generate the secret in memory, display the QR code, hold the secret in a short-lived enrolment record (in the session custom field, for example), verify the user's code, persist on success, discard on failure.

The second is recovery codes. A user who loses access to their TOTP device cannot log in with a method that requires TOTP. The deployment must offer a recovery path: a recovery code printed at enrolment time (a long random string the user stores in a password manager), an email-OTP fallback factor, or an administrative reset flow with identity verification. Axess does not prescribe which path to take; the choice depends on the deployment's risk profile. The common pattern is to generate a recovery code at enrolment, treat it as a one-shot factor stored under the user's scope, and offer it as an alternative second factor.

HOTP and email OTP, briefly

The HOTP factor is the counter-based variant of TOTP. Instead of deriving the code from the current time window, the verifier derives it from a monotonically-increasing counter that advances on every successful verification. HOTP is the right choice for hardware tokens that have no clock (some YubiKey configurations, for instance). The configuration mirrors TotpConfig with a counter field instead of a period.

The email OTP factor verifies a six-digit code delivered to the user out of band by email. The configuration carries the code length, the validity window (default fifteen minutes), and the count of allowed attempts before the code is revoked. The delivery is the application's responsibility; axess provides the verification side, the application provides the email send. The chapter Audit events covers the events emitted at email-OTP issuance and verification.

Threat model

A password-plus-TOTP login is robust against three common attacks and weak against one.

It is robust against a password leak (an attacker with the password alone cannot complete login without the TOTP code), against a TOTP secret leak (an attacker with the TOTP secret alone cannot complete login without the password), and against credential stuffing (an attacker reusing leaked credentials from another service is unlikely to also have the user's TOTP secret).

It is weak against a real-time phishing attack: a fake login page that prompts the user for their password, forwards it to the real server, prompts the user for their TOTP code, forwards that to the real server, and steals the resulting session. FIDO2 (covered in FIDO2 and WebAuthn passkeys) is the standard defence against this class of attack, because the WebAuthn ceremony binds the authentication to the origin and cannot be replayed against a different origin.

For applications where real-time phishing is a credible threat (financial services, healthcare, anything that handles regulated data), the recommendation is to offer FIDO2 as the second factor and treat TOTP as a fallback for users who do not yet have a passkey. The combination is what most regulators are asking for today.

Troubleshooting

A few failures recur often enough to be worth naming.

If TOTP verification fails consistently, the most likely cause is clock skew between the server and the authenticator app. The drift_window config accommodates a few seconds; larger drift points to a misconfigured NTP setup on either side. Logging the generated and accepted windows at debug level surfaces the offset quickly.

If TOTP verification fails for some users but not others, the likely cause is that the affected users scanned the QR code into an app that defaults to SHA-256 (some less common authenticators do), while the server defaults to SHA-1. The fix is to either align the server to SHA-256 (and re-enrol users), or to ensure the QR code URI explicitly specifies SHA-1.

If password verification is slow under load, the Argon2id parameters are probably set higher than the server class can support at the offered concurrency. The fix is to either lower the memory cost or to add CPU. Lower the memory cost first, but not below sixty-four megabytes, which is what current guidance suggests as a minimum.

If password verification is fast but logins occasionally take multiple seconds, the bottleneck is somewhere else (the factor store, the session store, an outbound network call in the login handler). Inspect the trace.

Further reading

Factors and methods covers the composition machinery this chapter exercises. FIDO2 and WebAuthn passkeys covers the WebAuthn second factor that supplants TOTP for the highest-assurance deployments. Identity store implementation covers how the password hash and TOTP secret are persisted alongside the user. Audit events covers the events emitted at every step of the password and TOTP flow.

FIDO2 and WebAuthn passkeys

FIDO2 is the answer to real-time phishing. Every other second-factor mechanism in this book (TOTP, HOTP, email OTP, SMS) is vulnerable to an attacker who proxies the user's input through a fake login page to the real server in real time. WebAuthn, the browser-side half of the FIDO2 standards, binds each authentication to the origin where the credential was registered, and a credential registered against accounts.example.com cannot be exercised against accounts-example-com.attacker.example. The defence is structural, not behavioural: the browser refuses to use the credential at the wrong origin, regardless of what the user clicks.

This chapter walks through the integration: the two-ceremony model, relying-party configuration, storage, the resident-key choice, and the rollout patterns for shipping passkeys alongside an existing password-and-TOTP flow.

The feature flag is fido2 (off by default), enabled with features = ["fido2"] on the axess facade.

The two ceremonies

WebAuthn has two ceremonies, and an integration touches both. The first is registration: the user has authenticated to your application by some other means (signup with email verification, an already-logged-in session, an OAuth callback) and is registering a new authenticator. The second is authentication: the user already has a registered credential, is logging in, and the WebAuthn ceremony proves possession.

Registration is the more involved of the two because it is where the relying-party configuration matters. The server starts the ceremony by calling Fido2Provider::begin_registration, which returns a CreationChallengeResponse. The handler serialises that to JSON and returns it to the browser, which calls navigator.credentials.create() with the JSON deserialised. The browser produces a RegisterPublicKeyCredential, which the page posts back to the server. The server calls Fido2Provider::finish_registration, which verifies the response against the challenge stored in the session, and on success returns the credential public key and metadata. The application persists this to the factor store under the user's scope, indexed by the credential id.

Authentication mirrors registration. The server calls Fido2Provider::begin_authentication, which returns a RequestChallengeResponse listing the credential ids the user has registered. The browser calls navigator.credentials.get() with the serialised challenge. The browser produces a PublicKeyCredential, the page posts it back, and the server calls Fido2Provider::finish_authentication, which verifies the signature against the stored public key.

The challenge in both ceremonies is the part the session machinery threads through. The server generates the challenge from SecureRng at begin, stores it in the session (in a typed field on SessionData::custom or a dedicated extension), and reads it back at finish. The challenge is one-shot; it is consumed at finish, regardless of success, to prevent replay. The whole ceremony lives inside the typed state machine: a begin without a subsequent finish leaves the session in a state where the next call expects the finish, and any other call returns an error.
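The one-shot property is easy to get wrong, so a small model may help. The sketch below is not the axess session type: ChallengeSlot and FinishError are hypothetical names, and a real verifier checks far more than challenge equality. It shows only the rule that finish consumes the challenge whether or not verification succeeds.

```rust
/// Hypothetical holder for an in-flight WebAuthn challenge.
#[derive(Default)]
pub struct ChallengeSlot {
    pending: Option<Vec<u8>>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FinishError {
    /// finish was called without a preceding begin, or the challenge
    /// was already consumed by an earlier finish.
    NoPendingChallenge,
    /// The response was bound to a different challenge.
    ChallengeMismatch,
}

impl ChallengeSlot {
    /// Stores a fresh challenge, replacing any stale one.
    pub fn begin(&mut self, fresh_challenge: Vec<u8>) {
        self.pending = Some(fresh_challenge);
    }

    /// Checks the presented challenge against the stored one.
    pub fn finish(&mut self, presented: &[u8]) -> Result<(), FinishError> {
        // take() empties the slot before the comparison, so a failed
        // attempt consumes the challenge just as a successful one does.
        let expected = self
            .pending
            .take()
            .ok_or(FinishError::NoPendingChallenge)?;
        if expected.as_slice() == presented {
            Ok(())
        } else {
            Err(FinishError::ChallengeMismatch)
        }
    }
}
```

After a mismatch, a second finish with the correct value still fails, because the slot is already empty; the ceremony has to restart from begin.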

Relying-party configuration

The relying party is the server that owns the credentials. WebAuthn identifies the relying party by an origin (the scheme plus host plus port) and by a relying-party id (the origin's host or a registrable suffix of it). Both appear in Fido2Config, alongside the policy fields that matter at registration time:

pub struct Fido2Config {
    pub rp_id: String,            // e.g. "example.com"
    pub rp_name: String,          // human-readable, "Example Inc."
    pub rp_origin: Url,           // e.g. "https://accounts.example.com"
    pub user_verification: UserVerificationPolicy,
    pub attestation: AttestationConveyancePreference,
    pub resident_key: ResidentKeyRequirement,
}

rp_id is the relying-party id, a string equal to or a suffix of the host part of the origin. Setting it to the apex domain (example.com rather than accounts.example.com) lets credentials registered on the accounts subdomain be used across other subdomains of the same apex (app.example.com, admin.example.com), which is usually what you want. Setting it to the full hostname scopes credentials to that hostname alone, which is appropriate when other subdomains belong to other applications you do not trust.
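The suffix rule can be checked at startup, before a browser ever reports a mismatch. The following is a simplified sketch with a hypothetical function name: it tests equality or a suffix on a label boundary, and it deliberately omits the public-suffix check that browsers also apply (which is why a bare "com" would pass here but not in a browser).

```rust
/// Returns true when `rp_id` equals `host` or is a suffix of it on a
/// label boundary. Simplified: browsers additionally reject public
/// suffixes, which this sketch does not model.
pub fn rp_id_matches_host(rp_id: &str, host: &str) -> bool {
    if rp_id.is_empty() {
        return false;
    }
    let rp_id = rp_id.to_ascii_lowercase();
    let host = host.to_ascii_lowercase();
    host == rp_id || host.ends_with(&format!(".{rp_id}"))
}
```

The leading dot in the suffix test is what stops ample.com from matching example.com.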

rp_origin is the full origin where the registration happens. The browser cross-checks this against the page's actual origin and refuses the registration if they do not match. Wildcards are not allowed; multi-region deployments register credentials under each region's specific origin.

user_verification controls whether the authenticator must verify the user's identity (a fingerprint, a PIN, a face scan) at authentication time, in addition to proving possession of the authenticator. Required is the right setting for high-assurance deployments; Preferred is the right setting for usability when some authenticators do not support verification.

attestation controls how much detail the authenticator reports about itself at registration. None is the right default unless you have a specific reason to track which authenticator models your users register (some regulatory frameworks require this for hardware-key deployments). Direct records the attestation statement; the trade-off is privacy (the authenticator vendor is identifiable from the attestation).

resident_key controls whether the authenticator stores the credential identifier on-device (a resident key, or "passkey"), or whether the credential id is server-side and the authenticator only stores the key material. Required is what makes a passkey: the user does not have to type a username, because the authenticator holds the credential id and surfaces it directly to the browser. Preferred allows either, with the device's preference deciding. Discouraged is the legacy mode where the server provides the credential id list.

The passkey-or-not choice is the most consequential for usability. Passkeys are what users mean today when they say "biometric login": the user clicks a button, taps their fingerprint, and they are in. Non-resident credentials require the user to enter a username first, which is the older WebAuthn flow and what most existing TOTP-style second factors look like. New deployments should default to passkeys; older deployments adopting WebAuthn alongside existing flows often start with non-resident and migrate.

Storage

Each registered credential is one row in the factor store. The row carries:

  • The credential id (a byte string, base64-encoded for storage).
  • The public key (the bytes WebAuthn returns at registration).
  • The signature counter (used to detect credential cloning).
  • The attestation statement (if attestation was set above None).
  • The user verification flag from registration.
  • The authenticator transports list (USB, NFC, internal, hybrid).

The signature counter is the part that catches credential cloning. WebAuthn authenticators that implement the counter increment it on each successful signature. The server stores the counter at registration and updates it at each authentication. A counter that fails to increase between authentications suggests the credential has been cloned (the clone's counter started from the same value as the original and diverged on first use). The defence is conservative: revoke the credential and require re-registration.
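The counter comparison reduces to a three-way decision. This is a sketch with hypothetical names, not the axess verifier. Treating a counter that is zero on both sides as "not implemented" follows the WebAuthn specification's guidance; what to do with each outcome remains the deployment's policy.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum CounterCheck {
    /// The counter moved forward; store the new value.
    Advanced(u32),
    /// Both values are zero: the authenticator does not implement a counter.
    NotSupported,
    /// The counter did not increase; the credential may have been cloned.
    PossibleClone,
}

/// Compares the stored counter with the one reported in the assertion.
pub fn check_signature_counter(stored: u32, reported: u32) -> CounterCheck {
    if stored == 0 && reported == 0 {
        CounterCheck::NotSupported
    } else if reported > stored {
        CounterCheck::Advanced(reported)
    } else {
        CounterCheck::PossibleClone
    }
}
```

A caller following the conservative defence would revoke the credential on PossibleClone and persist the new value on Advanced.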

The credential is scoped per user (each user has zero or more registered credentials), which is the natural per-user scope from the chapter Scope hierarchy. A user with multiple authenticators (a phone passkey plus a hardware key, say) has multiple credentials under their scope; the authentication ceremony enumerates them and the user's authenticator (or the browser, in the passkey case) picks one.

Adding passkeys to an existing flow

A common rollout is to keep an existing password-and-TOTP method and to offer passkey enrolment as an opt-in. The method shape:

let passkey_or_password = Method {
    name: "passkey-or-password".into(),
    steps: vec![
        FactorStep::AnyOf(vec![
            FactorKind::Fido2,
            FactorKind::Password,
        ]),
        // When Password is chosen, demand a second factor.
        // Implementing this conditional path takes a richer state
        // machine; the common simplification is two methods.
    ],
};

The conditional in the second step (require TOTP only if the user took the password branch) is the part axess does not natively support, because FactorStep does not nest. Two parallel methods handle the case more cleanly: one method (passkey-only, single-step FIDO2) for users with a registered passkey, and another method (password-then-totp, two-step password-and-TOTP) for users without one. The scope hierarchy chooses the right method per user. When a user enrols a passkey, the application updates their user-scoped method to passkey-only; if the passkey is later revoked, the application reverts to password-then-totp.

The pattern keeps both flows in production simultaneously, lets each user transition independently, and avoids the conditional in the state machine. The audit log records which method ran for which user, so the rollout is observable.
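The two-method rollout amounts to a lookup and a toggle. The sketch below uses plain method names rather than axess types, and it assumes the precedence implied by the scope names (a user override wins over a tenant override, which wins over the global default); both function names are hypothetical.

```rust
/// Picks the method name for a login, assuming user overrides beat
/// tenant overrides, which beat the global default.
pub fn resolve_method<'a>(
    user_override: Option<&'a str>,
    tenant_override: Option<&'a str>,
    global_default: &'a str,
) -> &'a str {
    user_override.or(tenant_override).unwrap_or(global_default)
}

/// The user-scoped method the application would write after a passkey
/// is enrolled or revoked.
pub fn user_method_for(has_registered_passkey: bool) -> &'static str {
    if has_registered_passkey {
        "passkey-only"
    } else {
        "password-then-totp"
    }
}
```

Because the toggle only touches the user scope, revoking the last passkey returns that one user to password-then-totp without affecting anyone else.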

Threat model

A passkey login is robust against the classes of attack that password-and-TOTP is weak against. Real-time phishing is defeated because the credential is origin-bound at registration. Credential stuffing is defeated because the credential is unique to the relying party. Server-side breach is defeated because what is stored is a public key, not a secret.

Three attack surfaces remain.

The first is a compromised endpoint. An attacker with full control of the user's device can ask the authenticator to perform any authentication the device permits. The defence here is user verification: the authenticator must verify the user (biometric or PIN) before it signs. For a deployment where this matters, user_verification: Required is non-negotiable.

The second is account recovery. A user who loses their passkey needs to recover access; the recovery path becomes the weakest link in the chain. The recommendation is to enrol at least two passkeys (a primary on the phone, a backup on a hardware key, say), and to offer a step-up administrative recovery flow with strong identity verification rather than a password-reset email. The recovery flow gets attacked because the primary login is robust; make sure the recovery is at least as strong.

The third is sync-fabric credentials. A passkey synced through Apple iCloud Keychain, Google Password Manager, or 1Password is available on every device the user has signed into that sync fabric. This is what makes passkeys usable; it also means a breach of the sync fabric compromises the credential. The implication is operational, not architectural: deployments that must defend against sync-fabric compromise pair the passkey with a device-bound credential (a hardware key, an attestation-bound device passkey), and require the device-bound credential for the highest-sensitivity actions through Cedar policy.

Troubleshooting

A few failures recur during initial integration.

If the browser rejects the registration with "The relying party ID is not a registrable suffix of the page origin", the rp_id does not match the page origin. Setting rp_id to example.com while the registration page is on accounts.example.com works; setting it to attacker.com does not. Check the host part of the actual URL the browser is on.

If authentication succeeds on one device and fails on another, the likely cause is that the credential is a passkey on one device but not synced to the other. Some authenticators register non-syncable credentials by default; check the resident_key: Required setting and the device's documentation.

If the signature counter check fails for legitimate users, the authenticator may not implement the counter (some legacy hardware keys do not). The fix is to log the counter mismatch and let the authentication proceed, sacrificing clone detection for usability on those specific authenticators. The decision is policy, and the application surfaces it explicitly.

Further reading

Factors and methods covers the composition machinery this chapter exercises. Device identity covers the device-bound trust ladder that complements passkeys for high-assurance deployments. Cedar policy fundamentals covers how a policy demands FIDO2 for specific actions (the factors_completed.contains("Fido2") check).

OAuth 2.0 and OIDC

Federated login through an Identity Provider you do not control is the most common reason adopters reach for OAuth. The user has a Google account, an Okta account, a corporate Azure AD account, and the application accepts a login from any of them rather than asking the user to invent and remember another password. The mechanism is OAuth 2.0 for the authorisation flow and OpenID Connect for the identity assertion layered on top. This chapter walks through what axess wires up automatically, what the integration code has to do, and the failure modes that have specific defences.

The feature flag is oauth (off by default), enabled with features = ["oauth"] on the axess facade. The feature transitively enables oidc (the discovery and JWKS-cache machinery) and jwt (the ID token validator).

Axess supports generic OIDC-based external login and SSO, including standard providers such as Google and Microsoft Entra ID when configured with the appropriate issuer metadata and client credentials. SAML / Shibboleth federation is not currently supported out of the box.

The shape of the flow

A federated login involves the user, the application (the client, in OAuth language, which runs axess), and the Identity Provider (the authorization server, which is the third-party IdP). The flow is the authorisation code grant with PKCE, which is what every modern OIDC deployment uses.

sequenceDiagram
    actor User
    participant App as Application (axess client)
    participant IdP as Identity Provider

    User->>App: GET /auth/login/google
    App->>App: build auth URL with PKCE + state + nonce
    App->>User: 302 to IdP authorize endpoint
    User->>IdP: GET /authorize?...
    IdP->>User: login + consent
    User->>App: GET /auth/callback?code=...&state=...
    App->>IdP: POST /token (code + pkce_verifier)
    IdP->>App: { id_token, access_token, refresh_token }
    App->>App: validate ID token (issuer, audience, nonce, signature)
    App->>App: optionally fetch /userinfo
    App->>App: transition session to Authenticated
    App->>User: 302 to /dashboard

The flow has six pieces axess does for you and three pieces the integration code is responsible for. The six axess-owned pieces are: generating the PKCE verifier and challenge, generating and binding the CSRF state, generating and binding the OIDC nonce, the discovery of the IdP's endpoints and signing keys, the token exchange itself, and the ID token validation including signature, audience, nonce, and the azp check when the audience is multi-valued. The three pieces the integration owns are: the redirect to the IdP authorize URL, the callback handler that picks up the code, and the application-specific mapping from the validated claims to the user record in the local identity store.

The provider

OAuthProvider is the trait that represents an IdP. The trait is asynchronous because every method may need to fetch JWKS, perform discovery, or hit the token endpoint. Adopters do not implement this trait themselves under normal circumstances; the OAuthProviderConfig constructor in axess-factors produces a provider from a discovery URL plus client credentials, and the returned provider implements the trait.

let provider = OAuthProviderConfig::discover(
    "https://accounts.google.com/.well-known/openid-configuration",
    client_id,
    client_secret,
    "https://your-app.example.com/auth/callback/google".parse()?,
)
.await?;

discover fetches the IdP's discovery document, validates it contains the endpoints axess needs (authorization, token, JWKS, userinfo, sometimes end-session), constructs a Discovery value, and sets up the JWKS cache against the IdP's signing-key endpoint. The cache is single-flight (concurrent JWKS misses dedupe to one request) and debounced (the cache refuses to refresh more often than once every few seconds, defeating a denial-of-service that triggers constant JWKS fetches).
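The debounce half of the cache behaviour can be modelled without any networking. This is a sketch, not the axess cache: RefreshDebounce is a hypothetical type, the five-second interval used in the usage below is an assumed value for "a few seconds", timestamps are plain seconds standing in for the Clock trait, and single-flight deduplication is not shown.

```rust
/// Hypothetical debounce gate for JWKS refreshes.
pub struct RefreshDebounce {
    min_interval_secs: u64,
    last_refresh: Option<u64>,
}

impl RefreshDebounce {
    pub fn new(min_interval_secs: u64) -> Self {
        Self {
            min_interval_secs,
            last_refresh: None,
        }
    }

    /// Returns true, and records `now`, when a refresh is allowed.
    /// Returns false when the previous refresh was too recent.
    pub fn try_refresh(&mut self, now: u64) -> bool {
        let allowed = match self.last_refresh {
            None => true,
            Some(last) => now.saturating_sub(last) >= self.min_interval_secs,
        };
        if allowed {
            self.last_refresh = Some(now);
        }
        allowed
    }
}
```

An attacker who sends a stream of tokens with unknown kid values then triggers at most one upstream fetch per interval, whatever the request rate.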

The configuration record carries the client id and secret (both provisioned at the IdP), the redirect URI (where the IdP sends the user after authentication), the ceremony timeout (how long the intermediate state on the session may live before the flow has to restart), and the list of scopes to request (openid and profile at minimum; email if the application needs the user's email address; offline_access if the application needs a refresh token to continue acting as the user after the initial session expires).

Begin the login

The handler that starts the federated login transitions the session into a state that holds the PKCE verifier, the CSRF state, and the nonce, and returns a redirect to the IdP's authorize URL with those values bound in.

use std::sync::Arc;

use axess::{AuthnService, AuthSession, OAuthLoginOptions};
use axum::extract::{Path, State};
use axum::http::StatusCode;
use axum::response::{IntoResponse, Redirect};

async fn begin_oauth_login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<...>>>,
    Path(provider_name): Path<String>,
) -> impl IntoResponse {
    match service
        .begin_oauth_login(&session, &provider_name, OAuthLoginOptions::default())
        .await
    {
        Ok(auth_url) => Redirect::to(auth_url.as_str()).into_response(),
        Err(e) => (StatusCode::BAD_REQUEST, format!("{e}")).into_response(),
    }
}

begin_oauth_login does three things internally. First, it generates the PKCE verifier through SecureRng and derives the S256 challenge that travels in the authorize URL. Second, it generates the CSRF state and the OIDC nonce, also through SecureRng, and stores all three values (verifier, state, nonce) in the session's intermediate state. Third, it composes the authorize URL with the client id, the redirect URI, the requested scopes, the PKCE challenge, the state, and the nonce, and returns it.
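The third step, composing the authorize URL, can be sketched on its own. The block assumes the PKCE challenge, state, and nonce have already been generated; AuthorizeParams and build_authorize_url are hypothetical names rather than the axess API, and the query parameter names are the standard ones from OAuth 2.0, PKCE, and OpenID Connect.

```rust
/// Percent-encodes every byte outside the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical bundle of the values bound into the authorize URL.
pub struct AuthorizeParams<'a> {
    pub client_id: &'a str,
    pub redirect_uri: &'a str,
    pub scopes: &'a [&'a str],
    pub code_challenge: &'a str, // base64url(SHA-256(verifier)), computed elsewhere
    pub state: &'a str,
    pub nonce: &'a str,
}

/// Appends the standard authorization-code parameters to the endpoint.
pub fn build_authorize_url(endpoint: &str, p: &AuthorizeParams<'_>) -> String {
    let scope = p.scopes.join(" ");
    let pairs = [
        ("response_type", "code"),
        ("client_id", p.client_id),
        ("redirect_uri", p.redirect_uri),
        ("scope", scope.as_str()),
        ("code_challenge", p.code_challenge),
        ("code_challenge_method", "S256"),
        ("state", p.state),
        ("nonce", p.nonce),
    ];
    let query: Vec<String> = pairs
        .iter()
        .map(|(key, value)| format!("{}={}", key, percent_encode(value)))
        .collect();
    format!("{}?{}", endpoint, query.join("&"))
}
```

Only the challenge travels in the URL; the verifier stays in the session until the token exchange.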

The redirect URI passed at this step must exactly match the one registered with the IdP at provisioning time. A mismatch is the single most common reason a federated login fails out of the box.

Handle the callback

The IdP, on successful user authentication and consent, redirects the user to the registered redirect URI with a code and a state query parameter. The application's callback handler picks these up and passes them to axess, which verifies that the state matches what was stored on the session (defeating CSRF) and performs the token exchange.

// Reuses the imports from the begin handler, plus the Query extractor.
use axum::extract::Query;

async fn finish_oauth_login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<...>>>,
    Path(provider_name): Path<String>,
    Query(callback): Query<CallbackQuery>,
) -> impl IntoResponse {
    match service
        .finish_oauth_login(&session, &provider_name, &callback.code, &callback.state)
        .await
    {
        Ok(_authenticated) => Redirect::to("/dashboard").into_response(),
        Err(e) => (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }
}

#[derive(serde::Deserialize)]
struct CallbackQuery {
    code: String,
    state: String,
}

finish_oauth_login does seven things internally. First, it reads the PKCE verifier, the CSRF state, and the nonce from the session's intermediate state. Second, it cross-checks the supplied state against the stored state, returning OAuthError::CsrfMismatch if they disagree. Third, it constructs the POST to the IdP's token endpoint, including the code, the PKCE verifier, the client id, and the client secret. Fourth, it parses the response and extracts the ID token, access token, and (optional) refresh token. Fifth, it validates the ID token: signature against the cached JWKS, issuer match, audience match, nonce match, expiry, azp check when the audience is multi-valued. Sixth, it optionally fetches the userinfo endpoint with the access token to supplement the ID token claims. Seventh, it transitions the session to Authenticated (or to PendingWorkflow if the federated flow is part of a multi-step ceremony like signup).

If any of the seven steps fails, the function returns an OAuthError variant naming what failed. The session does not transition; the intermediate state is cleared (to prevent replay); the callback handler can render an error.

ID token validation

The ID token validation is where most of the security of an OIDC integration lives. Axess performs the full set of checks OpenID Connect Core 1.0 and RFC 7519 require; the integration code does not have to write them. The checks are:

The first is signature verification against the IdP's JWKS. The cache holds the current signing keys; if the ID token's kid header does not match a cached key, the cache refreshes (subject to the single-flight and debounce protections). A signature that fails against the refreshed keys produces OAuthError::SignatureInvalid.

The second is the issuer check. The ID token's iss claim must exactly match the discovery document's issuer field. A mismatch indicates either a misconfigured IdP, a discovery-document substitution attack, or an attempt to replay an ID token from a different issuer; all three produce OAuthError::IssuerMismatch.

The third is the audience check. The ID token's aud claim must contain the client's registered client id. If aud is a single value, the check is straightforward. If aud is an array (which happens when the IdP issues tokens valid for multiple clients), the check ensures the client id is in the array, and additionally enforces the azp (authorized party) check: the azp claim must exist and equal the client id, regardless of the array's contents. The azp check defeats a class of attacks where an ID token issued for one client is replayed against a different client whose id is also in the audience array.

The fourth is the nonce check. The ID token's nonce claim must exactly match the nonce that was generated at begin_oauth_login time and stored in the session. The nonce defeats ID token replay: an attacker who captures an ID token cannot reuse it against the same client because the session-bound nonce will not match on a later login.

The fifth is the expiry check. The ID token's exp claim must be in the future at the moment of validation, with a small clock-skew allowance. The clock comes from the injected Clock trait, so deterministic simulation tests can exercise expiry handling without waiting on the wall clock.

The sixth is iat (issued-at) bounds. The token must have been issued within the last few minutes; tokens older than that indicate replay. The bound is configurable but defaults to five minutes, which matches what RFC 7519 implementations typically use.
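The audience, azp, expiry, and iat rules above can be written as pure functions over already-parsed claims. This is a sketch under stated assumptions rather than the axess validator: the names are hypothetical, signature, issuer, and nonce checks are omitted, timestamps are plain seconds, and applying the clock-skew allowance to the iat bound as well as to exp is an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ClaimError {
    ClientNotInAudience,
    AzpMissing,
    AzpMismatch,
    Expired,
    IssuedTooLongAgo,
}

/// Audience check, with the azp check enforced for multi-valued audiences.
pub fn check_audience(
    aud: &[&str],
    azp: Option<&str>,
    client_id: &str,
) -> Result<(), ClaimError> {
    if !aud.iter().any(|entry| *entry == client_id) {
        return Err(ClaimError::ClientNotInAudience);
    }
    if aud.len() > 1 {
        match azp {
            None => return Err(ClaimError::AzpMissing),
            Some(party) if party != client_id => return Err(ClaimError::AzpMismatch),
            Some(_) => {}
        }
    }
    Ok(())
}

/// Expiry and issued-at bounds, all values in seconds.
pub fn check_times(
    exp: u64,
    iat: u64,
    now: u64,
    skew: u64,
    max_age: u64,
) -> Result<(), ClaimError> {
    // exp must still be in the future, allowing `skew` seconds of drift.
    if exp.saturating_add(skew) <= now {
        return Err(ClaimError::Expired);
    }
    // The token must have been issued recently enough.
    if now.saturating_sub(iat) > max_age.saturating_add(skew) {
        return Err(ClaimError::IssuedTooLongAgo);
    }
    Ok(())
}
```

Taking `now` as a parameter, instead of reading the system clock inside the function, is what lets a test pin the time and exercise each boundary exactly.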

Back-channel logout

When the IdP supports OIDC back-channel logout, the IdP sends a POST to a registered logout endpoint at the application with a logout_token. The application validates the token and, on success, revokes the user's session.

The validation mirrors ID token validation with one addition: the audience and issuer checks apply, the azp check applies when audience is multi-valued, and an extra check on the events claim verifies the token is a back-channel logout token (the URI http://schemas.openid.net/event/backchannel-logout must be present). Axess implements this through OAuthProvider::verify_logout_jwt, which returns the claims on success.

The size cap on the logout token is eight kilobytes, the iat bound is five minutes, and the clock-skew tolerance is sixty seconds. The caps protect against denial-of-service through oversize tokens; the bounds defeat replay of a captured logout token after a meaningful delay.
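As a rough sketch, the caps and bounds above translate into two cheap checks: one on the raw token before any parsing, and one on the decoded claims. The function names and error type are hypothetical, and the way the sixty-second skew is added to the five-minute iat bound is an assumption made for illustration; signature, issuer, and audience validation are deliberately left out.

```rust
pub const MAX_LOGOUT_TOKEN_BYTES: usize = 8 * 1024;
pub const MAX_IAT_AGE_SECS: u64 = 5 * 60;
pub const CLOCK_SKEW_SECS: u64 = 60;
pub const BACKCHANNEL_LOGOUT_EVENT: &str =
    "http://schemas.openid.net/event/backchannel-logout";

/// Illustrative error type; not part of the axess API.
#[derive(Debug, PartialEq, Eq)]
pub enum LogoutTokenError {
    TooLarge,
    NotALogoutToken,
    Stale,
}

/// Runs before the token is decoded, so an oversize body is never parsed.
pub fn check_size(raw_token: &str) -> Result<(), LogoutTokenError> {
    if raw_token.len() > MAX_LOGOUT_TOKEN_BYTES {
        Err(LogoutTokenError::TooLarge)
    } else {
        Ok(())
    }
}

/// `event_keys` are the member names of the decoded `events` claim;
/// `iat` and `now` are Unix seconds.
pub fn check_logout_claims(
    event_keys: &[&str],
    iat: u64,
    now: u64,
) -> Result<(), LogoutTokenError> {
    if !event_keys.iter().any(|k| *k == BACKCHANNEL_LOGOUT_EVENT) {
        return Err(LogoutTokenError::NotALogoutToken);
    }
    // Assumption: the skew tolerance widens the five-minute iat bound.
    if now.saturating_sub(iat) > MAX_IAT_AGE_SECS + CLOCK_SKEW_SECS {
        return Err(LogoutTokenError::Stale);
    }
    Ok(())
}
```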

RP-Initiated Logout

The opposite direction is RP-Initiated Logout: the application initiates a logout that propagates to the IdP, so the user is logged out of the IdP session as well as the application session. Axess constructs the end-session URL through OAuthProvider::build_end_session_url, which takes the ID token hint (the user's last issued ID token, signed by the IdP), an optional post_logout_redirect_uri (where to send the user after logout), and an optional state value.

The post_logout_redirect_uri must be on an allowlist that the application configures. The allowlist exists to defeat open-redirect attacks: an attacker who can manipulate the redirect URI could send the user to an arbitrary external site after logout, which is the shape of a phishing setup. The allowlist is a small explicit list of allowed URIs; anything else is rejected at build_end_session_url time.
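The property that matters in an allowlist of this kind is exact matching: prefix or substring comparisons reintroduce the open redirect. A minimal sketch, with a hypothetical function name and under the assumption that entries are compared as whole strings:

```rust
/// Returns true only when `candidate` is byte-for-byte equal to an allowlist entry.
/// Exact comparison is deliberate: a prefix match would accept
/// "https://app.example.com.evil.test/" for an entry "https://app.example.com".
pub fn is_allowed_post_logout_redirect(candidate: &str, allowlist: &[&str]) -> bool {
    allowlist.iter().any(|allowed| *allowed == candidate)
}
```

A caller would reject the logout request outright when this returns false, rather than falling back to a default redirect that the attacker might also influence.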

Multiple providers

A common shape is to offer login with several IdPs side by side (Google, GitHub, Microsoft). Each provider is its own OAuthProvider instance constructed at startup; the application registers them under a provider_name key. The login URL carries the provider name (GET /auth/login/google); the callback URL also carries the name (GET /auth/callback/google). Axess dispatches to the right provider per request.
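The dispatch can be pictured as a name-keyed registry plus a small path parser. Everything in the sketch below is a stand-in: Provider is a placeholder for an OAuthProvider instance, and ProviderRegistry and provider_name_from_path are hypothetical names used only to show the shape of the lookup.

```rust
use std::collections::HashMap;

/// Placeholder for a constructed provider; the real type carries far more state.
pub struct Provider {
    pub issuer: String,
}

/// Hypothetical registry keyed by provider name.
#[derive(Default)]
pub struct ProviderRegistry {
    providers: HashMap<String, Provider>,
}

impl ProviderRegistry {
    pub fn register(&mut self, name: &str, provider: Provider) {
        self.providers.insert(name.to_owned(), provider);
    }

    /// Unknown names resolve to None; the caller should answer with a 404.
    pub fn resolve(&self, name: &str) -> Option<&Provider> {
        self.providers.get(name)
    }
}

/// Extracts the provider name from "/auth/login/<name>" or "/auth/callback/<name>".
pub fn provider_name_from_path(path: &str) -> Option<&str> {
    let rest = path
        .strip_prefix("/auth/login/")
        .or_else(|| path.strip_prefix("/auth/callback/"))?;
    // Reject empty names and anything with further path segments.
    if rest.is_empty() || rest.contains('/') {
        None
    } else {
        Some(rest)
    }
}
```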

A per-tenant variation is also common: each tenant's users federate against the tenant's own IdP (an Okta workspace, an Azure AD directory). The provider name in this case is the tenant slug; the provider is constructed at tenant provisioning time (or lazily, on first use) and cached. The scope hierarchy chapter covers the pattern for storing per-tenant configurations.

Threat model

OAuth and OIDC together are robust against a handful of attacks when the implementation does the validations above correctly.

Against CSRF on the callback: the state parameter binds the callback to the session that started the login. An attacker who tricks a user into hitting the callback URL with a code the attacker supplies cannot complete the login because the state will not match.

Against ID token replay: the nonce binds the ID token to the session's login attempt. An ID token captured by an attacker cannot be replayed against a different session.

Against ID token forgery: signature validation against the JWKS catches an attacker who synthesises an ID token without the IdP's signing key.

Against audience confusion (an ID token issued for one client used against another): the audience check plus the azp check on multi-element audiences catch this.

Against authorization code interception: PKCE binds the code to the verifier the application generated. An attacker who intercepts the code cannot exchange it without the verifier.

Against open-redirect phishing on logout: the allowed_post_logout_redirect_uris allowlist catches an attacker who tries to manipulate the redirect URI.

The attacks OAuth and OIDC do not defend against are the ones FIDO2 defends against (real-time phishing of the IdP login page itself) and the ones that depend on the IdP's own security posture (a compromised IdP issues compromised tokens, and no client-side check catches that). The defence for the latter is operational: monitor which IdPs the application accepts, audit periodically, and rotate the registered client secret if the IdP suffers a breach.

Troubleshooting

A few failure modes recur during initial integration.

If the callback returns an error about state mismatch, the most likely cause is that the user took longer than the ceremony timeout to complete the IdP login. The intermediate state on the session has expired and the state value is no longer recoverable. Increasing the ceremony timeout (a generous fifteen minutes is reasonable) is the fix.

If the token exchange returns an invalid-client error, the client id or secret in OAuthProviderConfig does not match what the IdP has registered. The most common variant is using a public-client id at the IdP while configuring axess with a confidential-client expectation (or vice versa). Check the IdP's client registration page.

If the ID token validation returns an audience mismatch on an IdP that supports multiple clients, the aud claim is probably an array and the azp claim is missing. Some IdPs do not emit azp when they should; configuring the IdP to issue azp is the fix. Axess deliberately refuses to bypass the azp check because doing so would open the audience-confusion attack.

If the userinfo endpoint returns a 401 after a successful token exchange, the access token's scopes do not include the ones the userinfo endpoint requires. The fix is to add the required scopes (typically profile and email) to the scopes configuration.

Further reading

FAPI 2.0 covers the financial-grade extensions (PAR, DPoP, JARM) that layer on top of the OAuth provider for regulated deployments. Workload identity overview covers the inbound resolver side of the same machinery, where the application is the OAuth server accepting tokens issued by federated workload-identity systems. Local IdP covers the in-process IdP, both production LocalIdp for workload-identity issuance and the LocalIdpFixture that mints test tokens against a controllable JWKS for integration tests.

FAPI 2.0

FAPI is the OpenID Foundation's Financial-grade API profile, a set of additional requirements on top of OAuth 2.0 and OIDC that address the threat model of regulated financial APIs. The headline differences from baseline OAuth are mandatory Pushed Authorization Requests (PAR), mandatory sender-constrained tokens through DPoP or mTLS, optional JWT Secured Authorization Response Mode (JARM), and stricter ID token lifetime bounds. This chapter walks through what FAPI adds, how axess exposes it, and when to reach for it.

The feature flag is fapi (off by default), which implies oauth. The base OAuth chapter (OAuth 2.0 and OIDC) covers everything that remains true under FAPI; this chapter covers only what changes.

Axess is the Relying Party, not the OP

A FAPI deployment has two parties. The OpenID Provider (OP, also called the IdP) owns user identity, runs the login UI, and issues tokens; in open-banking this is typically the bank's own SSO or a hosted Keycloak / Ory Hydra / Curity instance. The Relying Party (RP) is the application that delegates identity to the OP, accepts the resulting tokens, and runs a session on top. Axess fills the RP role. PAR, DPoP, JARM, and RP-Initiated Logout are all RP-side protocols that exist to talk to an external OP; without an OP to talk to, none of them make sense.

This is a deliberate architectural choice. Building a FAPI-conformant OP is a multi-year project (Keycloak, Hydra, Curity, and the commercial vendors are the established options) and is largely disjoint from the RP-side machinery axess provides (sessions, MFA verifiers, Cedar authorization). The verifier-vs-orchestrator split in the workspace (covered in Architecture at a glance) is the internal expression of the same boundary; axess does not become the OP, and adopters are expected to point at one. The examples/fapi/ crate ships a pre-configured Keycloak realm in a podman container as a quick way to get an OP locally for the demo, but in production the issuer URL would point at whatever OP your organisation already runs.

The local-idp feature is the one place axess does issuance, but that is on-host workload-identity issuance (service-to-service flows where a sidecar mints JWTs for its own workloads), not a user-facing OP. Local IdP covers that surface.

What FAPI changes

The four headline mechanisms address four specific gaps in baseline OAuth.

Pushed Authorization Requests (PAR, RFC 9126) move the authorization parameters off the redirect URL. Instead of the application constructing a query-string-laden authorize URL and redirecting the user to it, the application makes a direct POST to the IdP's PAR endpoint containing the parameters, receives an opaque request_uri in return, and constructs a much shorter authorize URL containing only the client id and the request URI. The defence is twofold: the authorization parameters never appear in browser history or referer headers, and the parameters cannot be tampered with in transit because the user only carries a reference to them.

DPoP (Demonstrating Proof of Possession, RFC 9449) binds the access token to a key pair the client controls. Each request the client makes to a protected resource carries a JWT signed with the client's DPoP key, and the token validator at the resource server checks that the access token was issued for a thumbprint of that key. The defence is against bearer-token theft: an attacker who captures the access token (from logs, a misconfigured proxy, a debugging surface) cannot use it without also having the DPoP private key, which never leaves the client.

JARM (JWT Secured Authorization Response Mode) has the IdP return the authorization response as a signed JWT instead of as plain query parameters. The defence is integrity: the response cannot be tampered with after the IdP issues it. JARM is optional in FAPI 2.0; some implementations use it, others do not.

Stricter ID token bounds: FAPI 2.0 requires the ID token's nbf (not-before) claim to be enforced and the lifetime to be no longer than a short window (axess defaults to five minutes, and refuses ID tokens with nbf in the future or exp more than five minutes out). The defence is against replay through stale tokens.

When to reach for FAPI

The honest answer is: when a regulator requires it. FAPI 2.0 was designed for the open-banking ecosystem and similar regulated financial APIs, and adopting it imposes operational complexity (every client needs DPoP key management, every authorize call goes through PAR, every IdP must support the PAR endpoint) that is substantial relative to the security benefit for non-regulated deployments. A consumer-facing SaaS that takes credit card payments through Stripe does not need FAPI; an open-banking application that acts as an account-information service provider does.

The decision is binary: either you need FAPI because someone is asking you for compliance evidence, or you do not. If you do, the mechanisms below are non-negotiable, and axess implements them. If you do not, the baseline OAuth chapter covers what you need.

Configuration

FAPI is enabled per-provider by attaching a FapiConfig to an OAuthProviderConfig:

use axess::factors::oauth::{FapiConfig, SenderConstraint, OAuthProviderConfig};

let fapi_config = FapiConfig {
    sender_constraint: SenderConstraint::DPoP,
    require_jarm: false,
    max_id_token_lifetime_secs: 300,
};

let provider = OAuthProviderConfig::discover(
    "https://idp.example.com/.well-known/openid-configuration",
    client_id,
    client_secret,
    redirect_uri,
)
.await?
.with_fapi(fapi_config);

sender_constraint chooses between DPoP and mTLS for the sender-constrained-tokens requirement. DPoP is the right choice for applications that already manage HTTPS in software; mTLS is the right choice for applications that already manage X.509 certificates for service-to-service authentication. The two cannot be combined on a single provider, but different providers in the same application can use different constraints.

require_jarm toggles JARM enforcement. When true, the authorization response from the IdP must arrive as a signed JWT; the configuration's oidc.discovery.jwks_uri is used to verify the signature. When false, the IdP may return the response as query parameters as in baseline OAuth.

max_id_token_lifetime_secs is the upper bound on ID token validity. The FAPI default is three hundred seconds (five minutes), which is short enough that a captured token expires before most replay attacks can succeed and long enough that clock skew does not cause spurious rejections.
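The lifetime rule is simple enough to state as code. The sketch below is illustrative only: check_fapi_lifetime and LifetimeError are hypothetical names, times are Unix seconds, and no clock-skew allowance is modelled because the text does not say how axess combines skew with this bound.

```rust
/// Illustrative error type; not part of the axess API.
#[derive(Debug, PartialEq, Eq)]
pub enum LifetimeError {
    NotYetValid,
    Expired,
    LifetimeTooLong,
}

/// Rejects tokens whose nbf is in the future, whose exp has passed, or whose
/// exp lies further ahead than `max_lifetime_secs` (300 in the FAPI default).
pub fn check_fapi_lifetime(
    now: u64,
    nbf: u64,
    exp: u64,
    max_lifetime_secs: u64,
) -> Result<(), LifetimeError> {
    if nbf > now {
        return Err(LifetimeError::NotYetValid);
    }
    if exp <= now {
        return Err(LifetimeError::Expired);
    }
    // exp > now holds here, so the subtraction cannot underflow.
    if exp - now > max_lifetime_secs {
        return Err(LifetimeError::LifetimeTooLong);
    }
    Ok(())
}
```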

The PAR flow

With FAPI enabled, the application starts a federated login through the PAR-enhanced auth URL rather than the query-parameter auth URL:

let auth_url = service
    .begin_oauth_login(&session, "fapi-provider", OAuthLoginOptions::default())
    .await?;
// auth_url looks like:
//   https://idp.example.com/authorize?client_id=...&request_uri=urn:ietf:params:oauth:request_uri:...

Internally, begin_oauth_login detects the FAPI configuration and takes the PAR branch. The branch performs a POST to the IdP's PAR endpoint with the full set of authorization parameters (client id, redirect URI, scopes, PKCE challenge, CSRF state, nonce), receives the request_uri and its expires_in, and constructs the shorter authorize URL the user is redirected to.
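The last step of the PAR branch, building the short authorize URL from the client id and the returned request_uri, might look like the following. This is a sketch rather than axess's code: par_authorize_url and percent_encode are hypothetical helpers, and the sketch assumes the authorize endpoint carries no query string of its own.

```rust
/// Percent-encodes every byte outside the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Builds the authorize URL that carries only the client id and the opaque
/// request_uri reference returned by the PAR endpoint.
pub fn par_authorize_url(authorize_endpoint: &str, client_id: &str, request_uri: &str) -> String {
    format!(
        "{}?client_id={}&request_uri={}",
        authorize_endpoint,
        percent_encode(client_id),
        percent_encode(request_uri)
    )
}
```

The colons in the urn come out as %3A, which is equivalent to the unencoded form shown in the comment above once the IdP decodes the query string. Because the request_uri expires (expires_in), the redirect should follow the PAR call promptly.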

The PAR exchange happens server-to-server and is authenticated. The authentication is whatever the IdP requires (client secret POST, client secret basic, mTLS, or signed JWT assertion); axess passes through the credential that OAuthProviderConfig was constructed with.

The callback flow on the application side is unchanged. The IdP redirects the user back to the application's callback URL with a code; the application calls finish_oauth_login with the code and state; axess performs the token exchange and ID token validation.

DPoP key management

DPoP binds each access token to a public key the client controls. The application generates a key pair at session start (or at application start, for some deployments), uses the private key to sign a DPoP proof JWT on each request to a protected resource, and the resource server verifies the proof and matches the JWK thumbprint against the access token's binding.

Axess exposes the proof-generation primitive through OAuthProvider::generate_dpop_proof:

let proof: DpopProof = provider.generate_dpop_proof(
    "GET",                                         // HTTP method
    "https://resource.example.com/data",           // target URL
    Some(&access_token),                           // bind to this access token
    &dpop_key,                                     // the application's key
)?;

let response = http_client
    .get("https://resource.example.com/data")
    .header("Authorization", format!("DPoP {}", access_token))
    .header("DPoP", &proof.proof_jwt)
    .send()
    .await?;

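The proof JWT carries the HTTP method (htm), the target URL (htu), a unique identifier (jti), an issued-at timestamp (iat), and, when an access token is presented, a hash of that token (ath); a server-supplied nonce is included when the server demands one. The public key travels in the JWT header as jwk. The resource server checks the method and URL against the actual request, verifies the signature with the embedded key, and compares that key's thumbprint with the jkt value in the access token's cnf (confirmation) claim, which was fixed at token issuance.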
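To make the resource-server side concrete, the sketch below shows the comparisons a verifier performs once the proof has been decoded, using the claim names from RFC 9449 (htm, htu, iat) and the jkt thumbprint from the access token's cnf claim. The struct, function, and error names are hypothetical; signature verification, jti replay tracking, and the ath check are omitted, and the acceptance window is a parameter because deployments choose their own.

```rust
/// Decoded view of a DPoP proof. `key_thumbprint` is the JWK thumbprint of the
/// public key in the proof header, computed by the caller.
pub struct DpopProofView<'a> {
    pub htm: &'a str,
    pub htu: &'a str,
    pub iat: u64,
    pub key_thumbprint: &'a str,
}

/// Illustrative error type; not part of the axess API.
#[derive(Debug, PartialEq, Eq)]
pub enum DpopError {
    MethodMismatch,
    UrlMismatch,
    OutsideWindow,
    KeyMismatch,
}

pub fn check_dpop_proof(
    proof: &DpopProofView,
    request_method: &str,
    request_url: &str,
    token_jkt: &str,
    now: u64,
    window_secs: u64,
) -> Result<(), DpopError> {
    if proof.htm != request_method {
        return Err(DpopError::MethodMismatch);
    }
    // htu is compared without the query string and fragment of the request URL.
    let bare_url = request_url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(request_url);
    if proof.htu != bare_url {
        return Err(DpopError::UrlMismatch);
    }
    // Clock skew in either direction counts against the window.
    if now.abs_diff(proof.iat) > window_secs {
        return Err(DpopError::OutsideWindow);
    }
    // The proof key must be the key the access token was bound to.
    if proof.key_thumbprint != token_jkt {
        return Err(DpopError::KeyMismatch);
    }
    Ok(())
}
```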

Key lifecycle is the operational concern. A DPoP key pair generated per session is the safest choice (a compromised session is bounded to one key); a key pair generated per application instance is the easiest choice (one key to manage). The trade-off is between blast radius and operational complexity. Most deployments choose per-session keys for high-sensitivity flows and per-instance keys for routine flows.

Token revocation

FAPI 2.0 expects that compromised tokens can be revoked through the IdP's revocation endpoint (RFC 7009). The application calls revocation when the user logs out, when a session is administratively ended, or when token theft is detected. Axess exposes revocation through OAuthProvider::revoke_token:

provider.revoke_token(&access_token, Some(TokenTypeHint::AccessToken)).await?;
provider.revoke_token(&refresh_token, Some(TokenTypeHint::RefreshToken)).await?;

The revocation endpoint, when present in the discovery document, is called with the token to revoke and an optional type hint. The IdP responds with a 200 regardless of whether the token was actually revoked (intentionally, to defeat token-existence enumeration).

Revoking the refresh token is the more important call. The access token typically has a short lifetime (matching the FAPI ID token bound) and expires on its own; the refresh token has a longer life and an unrevoked one allows continued access through new access tokens. A logout that revokes only the access token leaves the refresh token active, which is rarely what the application wants.
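One way to make sure a logout never leaves the refresh token behind is to put both calls in a single helper and revoke the longer-lived credential first. The sketch uses a synchronous stand-in trait so it can run without an async runtime; Revoker, TokenKind, revoke_on_logout, and RecordingRevoker are all hypothetical names, and the refresh-first ordering is a design choice of this sketch rather than documented axess behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Access,
    Refresh,
}

/// Synchronous stand-in for a client of the IdP's revocation endpoint.
pub trait Revoker {
    fn revoke(&mut self, token: &str, kind: TokenKind) -> Result<(), String>;
}

/// Revokes the refresh token first so that a failure midway never leaves the
/// longer-lived credential active while the short-lived one is already gone.
pub fn revoke_on_logout<R: Revoker>(
    revoker: &mut R,
    access_token: &str,
    refresh_token: Option<&str>,
) -> Result<(), String> {
    if let Some(refresh) = refresh_token {
        revoker.revoke(refresh, TokenKind::Refresh)?;
    }
    revoker.revoke(access_token, TokenKind::Access)
}

/// Test double that records the order of calls.
#[derive(Default)]
pub struct RecordingRevoker {
    pub calls: Vec<(String, TokenKind)>,
}

impl Revoker for RecordingRevoker {
    fn revoke(&mut self, token: &str, kind: TokenKind) -> Result<(), String> {
        self.calls.push((token.to_owned(), kind));
        Ok(())
    }
}
```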

Testing FAPI flows

There are three useful test modes, picked by what you want to exercise.

For Rust unit and integration tests, the FAPI feature pairs with the local-idp feature. The LocalIdpFixture in axess_core::testing::local_idp mints FAPI-grade tokens with the right nbf/exp bounds and exposes a shared JwkSet handle that a JwtVerifier borrows for signature verification. The fixture is an in-process value, not an HTTP service: PAR and discovery endpoints are not part of its surface. For FAPI flows that need a real PAR exchange, use Keycloak or another OP (see the end-to-end walkthrough below). The pattern for unit tests is to write against an OAuthProvider trait object, parameterise it over fixture and live, and run both in CI. Local IdP covers the fixture in detail.

For an end-to-end browser walkthrough, the examples/fapi/ crate ships with a pre-configured Keycloak realm under examples/fapi/keycloak/. One podman compose up -d brings up Keycloak with PAR required, PKCE S256 required, DPoP-bound tokens enabled, the axess-fapi-client client registered, and a seeded user (alice/alice) ready to log in. The example's OAuthProviderConfig::discover(...) call points at the local Keycloak issuer through env vars, and the same code talks to a real production IdP when those env vars point elsewhere. Docker users can substitute docker compose for podman compose; podman is the documented path.

For compliance certification, the OpenID Foundation runs a free hosted conformance suite at https://www.certification.openid.net/. It acts as a scripted OP that drives an RP through the full FAPI 2.0 test matrix including adversarial cases (missing PAR, bad DPoP, replay, wrong audience). Point it at the example's /auth/callback to produce a certifiable artifact; use Keycloak for everyday development.

Threat model

FAPI 2.0 closes the attacks baseline OAuth leaves open in regulated contexts.

Against authorization-parameter tampering: PAR moves the parameters off the URL, so they cannot be modified by an intermediary.

Against bearer-token theft: DPoP (or mTLS) binds tokens to keys the attacker does not have, so a captured token is unusable.

Against ID token replay through stale tokens: the strict lifetime bound shrinks the replay window to minutes.

The attacks FAPI does not close are the same ones baseline OAuth does not close: a compromised IdP issues compromised tokens regardless of the profile, and a compromised client device gives the attacker access to the DPoP private key alongside everything else.

Troubleshooting

If the PAR exchange fails with invalid_client, the application's PAR endpoint authentication does not match what the IdP expects. Some IdPs require mTLS authentication on PAR even when the rest of the flow uses client secrets; check the IdP's PAR documentation.

If DPoP verification fails at the resource server, the most common cause is a clock-skew issue between the client and the resource server. The DPoP proof's timestamp is checked within a small window (a few seconds typically); larger skew triggers spurious failures. Synchronise both sides against the same NTP source.

If JARM verification fails, the signing key the IdP uses for JARM may differ from the key used for ID token signing. Some IdPs publish separate JWKS for the two; the discovery document should indicate this, but configurations occasionally miss it. Inspect the discovery document.

Further reading

OAuth 2.0 and OIDC covers the base OAuth machinery this chapter extends. Workload identity overview covers the resolver side of OAuth, where axess is the resource server rather than the client. Local IdP covers the test fixture for FAPI-grade integration testing.

LDAP bind

LDAP bind is the right factor for enterprise deployments where the authoritative user store is Active Directory, OpenLDAP, or a similar directory server. The application does not own user passwords; the directory does. The verification mechanism is a simple bind against the directory with the user's distinguished name and password; if the bind succeeds, the user has authenticated.

The feature flag is ldap (off by default), enabled with features = ["ldap"] on the axess facade.

When LDAP fits

LDAP fits when three conditions hold. The first is that the authoritative user identities live in an LDAP directory the application can reach. The second is that the directory administrators have agreed to allow simple binds from the application's deployment network. The third is that the directory speaks LDAP, not some other protocol that wraps LDAP semantics (SAML, OIDC) which would route through the OAuth factor instead.

When those conditions hold, LDAP gives the application authentication-as-a-service from the directory without the application ever storing a user password. New employees added to the directory can log into the application immediately; departed employees removed from the directory lose access immediately. The directory is the source of truth.

When those conditions do not hold (a SaaS deployment where users come from many organisations, a directory the application cannot reach over a stable network, an authoritative store that is not LDAP), the right answer is OAuth or OIDC against an IdP that the organisation does support.

Configuration

LdapProviderConfig carries the connection details:

pub struct LdapProviderConfig {
    pub url: String,                          // ldaps://ad.example.com:636
    pub bind_dn_template: String,             // "uid={user},ou=people,dc=example,dc=com"
    pub starttls: bool,                       // upgrade ldap:// to TLS via STARTTLS
    pub connection_timeout: Duration,         // typical 5-10 seconds
    pub group_search: Option<LdapGroupSearch>,
}

url is the directory's URL. The ldaps:// scheme means TLS is established at the transport layer (port 636 by default); the ldap:// scheme means cleartext, possibly upgraded to TLS via STARTTLS. Cleartext without STARTTLS is acceptable only on a private network where the directory traffic does not leave a trusted segment; production deployments use one of the encrypted forms.

bind_dn_template is the pattern axess uses to construct a user's distinguished name from their login identifier. The string {user} in the template is replaced with the identifier the user typed. The example above turns the username alice into the DN uid=alice,ou=people,dc=example,dc=com, which is then used in the bind request.
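Because the identifier is typed by the user, any substitution into a DN has to escape the characters that are special in distinguished names (RFC 4514), or an input such as alice,ou=admins would change the shape of the DN. The sketch below illustrates that expansion; escape_dn_value and expand_bind_dn are hypothetical helpers, and whether axess escapes in exactly this way is not stated here, so treat this as a model of the technique rather than of the library.

```rust
/// Escapes a value for use inside a distinguished name, following RFC 4514:
/// special characters anywhere, '#' or space at the start, space at the end.
pub fn escape_dn_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    let last = value.chars().count().saturating_sub(1);
    for (i, c) in value.chars().enumerate() {
        let special = matches!(c, '"' | '+' | ',' | ';' | '<' | '>' | '\\')
            || (i == 0 && (c == '#' || c == ' '))
            || (i == last && c == ' ');
        if c == '\0' {
            // NUL is written as a hex pair.
            out.push_str("\\00");
        } else if special {
            out.push('\\');
            out.push(c);
        } else {
            out.push(c);
        }
    }
    out
}

/// Replaces `{user}` in the template with the escaped identifier.
pub fn expand_bind_dn(template: &str, user: &str) -> String {
    template.replace("{user}", &escape_dn_value(user))
}
```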

starttls triggers a STARTTLS upgrade after the initial cleartext connection establishes. The mechanism is widely supported and is the right choice when the directory accepts both cleartext and TLS on the same port (usually 389). When the directory exposes a separate TLS port (usually 636), use ldaps:// instead and leave this false.

connection_timeout bounds how long a bind attempt may take. Five to ten seconds is typical. Longer timeouts admit slow failure modes into the login path; shorter timeouts produce spurious failures when the directory is briefly slow. Tune to match the directory's observed latency.

group_search is optional. When set, after a successful bind axess performs an additional search to enumerate the user's group memberships. The result is returned alongside the bind outcome and can be used by the application to populate the user's authorisation attributes.

pub struct LdapGroupSearch {
    pub base_dn: String,           // "ou=groups,dc=example,dc=com"
    pub filter_template: String,   // "(member={dn})"
    pub group_attr: String,        // "cn" -- attribute identifying the group
}

filter_template interpolates {dn} (the bound user's DN) or {user} (the original identifier) into an LDAP filter. The example filter (member={dn}) matches groups that list the user's DN in their member attribute, which is the OpenLDAP convention. Active Directory typically uses memberOf on the user record itself instead, in which case the group search is unnecessary because the groups are already attributes of the user.
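Search filters have their own escaping rules (RFC 4515): the characters *, (, ), backslash, and NUL must be written as hex escapes, otherwise a crafted identifier can widen the filter. A minimal sketch of the interpolation follows, with hypothetical function names and a single pass over the template so that substituted values are never re-scanned for placeholders.

```rust
/// Escapes a value for use inside an LDAP search filter (RFC 4515).
pub fn escape_filter_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '*' => out.push_str("\\2a"),
            '(' => out.push_str("\\28"),
            ')' => out.push_str("\\29"),
            '\\' => out.push_str("\\5c"),
            '\0' => out.push_str("\\00"),
            _ => out.push(c),
        }
    }
    out
}

/// Substitutes `{dn}` and `{user}` in one left-to-right pass.
pub fn expand_group_filter(template: &str, dn: &str, user: &str) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find('{') {
        out.push_str(&rest[..start]);
        let tail = &rest[start..];
        if let Some(after) = tail.strip_prefix("{dn}") {
            out.push_str(&escape_filter_value(dn));
            rest = after;
        } else if let Some(after) = tail.strip_prefix("{user}") {
            out.push_str(&escape_filter_value(user));
            rest = after;
        } else {
            // Not a known placeholder: keep the brace literally.
            out.push('{');
            rest = &tail[1..];
        }
    }
    out.push_str(rest);
    out
}
```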

The verification flow

The verification flow is straightforward. The user submits a username and password to the application. The application calls AuthnService::verify_factor with the LDAP bind credential; axess expands the bind DN template with the username, opens a TLS connection to the directory, performs a simple bind with the constructed DN and the user's password, optionally searches for groups, and unbinds.

A successful bind transitions the session as any factor would: the state machine calls advance_factor, which returns Completed if LDAP was the last required factor or StillAuthenticating if more factors are required. A failed bind returns FactorOutcome::InvalidCredential, and the user sees the standard failed-login message.

The connection model is per-attempt. Each bind opens a fresh TLS connection, performs the bind, and closes. There is no connection pooling. The trade-off is operational simplicity (no pool to size, no idle-connection management) against per-attempt latency (a TLS handshake on each login). For most deployments the latency is acceptable; busy directories with thousands of binds per second benefit from a connection pool at the network layer (HAProxy, nginx) rather than inside the application.

Mixing LDAP with other factors

LDAP can be the only factor in a method (the directory's bind is the entire authentication), or it can be one factor in a chain.

A common shape in enterprise deployments is LDAP followed by TOTP. The user enters their LDAP credentials, the directory verifies them, and then axess prompts for the user's TOTP code. The TOTP secret is stored in axess's own factor store (not in LDAP), under the user's scope. The combination gives directory-managed passwords with an application-managed second factor; the directory does not need to know about TOTP and the application does not need to know about the password.

A variation is LDAP followed by AnyOf(vec![Totp, Fido2]), allowing the user to register a passkey alongside or instead of TOTP. The flow is otherwise unchanged.

Threat model

LDAP bind is robust against the same attacks any server-verified password method is robust against: credential reuse from other services, local password lists, offline brute-forcing of a stolen hash (the hash never leaves the directory).

It is weak against attacks the directory itself is weak against. A directory that allows anonymous binds is vulnerable to attribute enumeration. A directory whose bind path is misconfigured to accept empty passwords for any DN is catastrophically vulnerable. The defence is operational: configure the directory correctly, audit periodically, and treat the LDAP factor's security as a function of the directory's security posture.
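The empty-password case deserves a guard on the application side as well, because a simple bind with a DN and a zero-length password is an "unauthenticated bind" under RFC 4513 and some servers report it as a success. A minimal sketch of such a pre-check, with hypothetical names and no claim that axess performs it in this form:

```rust
/// Illustrative error type; not part of the axess API.
#[derive(Debug, PartialEq, Eq)]
pub enum PreBindError {
    EmptyUsername,
    EmptyPassword,
}

/// Runs before any network traffic. An empty password must never reach the
/// directory, since the server may treat the bind as unauthenticated and
/// still answer with success.
pub fn precheck_bind_input(username: &str, password: &str) -> Result<(), PreBindError> {
    if username.trim().is_empty() {
        return Err(PreBindError::EmptyUsername);
    }
    if password.is_empty() {
        return Err(PreBindError::EmptyPassword);
    }
    Ok(())
}
```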

The application also has to be careful about what it logs. The bind password should never appear in application logs at any level, including trace. Axess does not log it; adopters' own login handlers need to make the same guarantee. The standard pattern is to mark the password field as zeroized and to route it directly into the verifier without touching it again.

Troubleshooting

If binds fail consistently with "invalid credentials" for known-good passwords, the bind DN template is most likely wrong. Active Directory typically expects userPrincipalName (the user's email address) or sAMAccountName (a short login name) in the bind, not a constructed DN. The template might need to be {user}@example.com rather than uid={user},ou=people,dc=example,dc=com.

If the connection succeeds but the bind times out, the directory is under load or the connection is being inspected by a middlebox that buffers slowly. The connection timeout fires; the user sees a generic failure. Inspect the network path.

If the group search returns nothing, the filter template might be wrong or the bound user might not have permission to read group membership. OpenLDAP often requires explicit ACLs for the bound user to enumerate groups they are members of; Active Directory usually grants this by default. Run the same search through a known-good LDAP client to verify.

If TLS fails with a certificate-validation error, the directory's certificate is probably signed by a private CA that the application's trust store does not include. Add the CA to the rustls trust store via the standard SSL_CERT_FILE or SSL_CERT_DIR environment variables.

Further reading

Factors and methods covers the composition machinery this chapter exercises. Identity store implementation covers how user records referenced by LDAP get provisioned in the application's identity store (typically just-in-time on first successful LDAP login). Multi-tenancy covers the case where different tenants federate to different directories.

mTLS-based authentication

Mutual TLS authenticates the client to the server at the transport layer, before the application sees the request. The client presents an X.509 certificate during the TLS handshake, the server validates the certificate against a trust anchor, and the resulting connection carries a known identity. For service-to-service traffic between parties that own both sides of the connection, mTLS is the strongest practical authentication: there is no credential to phish, no token to leak, no replay window after the handshake.

This chapter covers using mTLS as a factor for human or human-adjacent flows (a kiosk machine, an internal admin host). The other use of mTLS in axess, where the certificate identifies a workload rather than a human, is covered in Workload identity overview and specifically in Inbound: mTLS-SVID. The mechanism is the same; the interpretation of the certificate differs.

The feature flag is mtls (off by default), enabled with features = ["mtls"] on the axess facade.

Where the certificate comes from

The most important detail about an mTLS integration is that axess does not handle the TLS handshake. Axum sits behind a TLS terminator (rustls in process, or nginx, HAProxy, AWS NLB, or Cloudflare in front), and the certificate validation happens at the terminator. Axess receives the validated certificate as part of the request, extracts an identity from it, and proceeds.

The extraction is a Tower middleware the adopter wires in. The middleware reads the certificate from wherever the terminator put it:

  • For rustls in process, the certificate is in axum_server::tls_rustls::RustlsConnectInfo or an equivalent connector callback.
  • For nginx, the certificate is passed through as the X-SSL-Client-Cert header (the exact header is the deployment's choice).
  • For HAProxy, the convention is X-Client-Cert or similar.
  • For AWS NLB with TLS passthrough, rustls handles the validation; for AWS ALB with mTLS, the certificate is in X-Amzn-Mtls-Clientcert.

The middleware reads the certificate, validates that it came from a trusted source (the certificate must be present, the header must have arrived only from the trusted terminator, the deployment must not allow clients to inject the header directly), wraps the certificate chain in a PeerCertChain, and inserts it into the Axum request extensions:

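use axess::factors::mtls::PeerCertChain;
use axum::{http::Request, middleware::Next, response::Response};

// Signature as used with axum 0.6, where Request and Next are generic over
// the body type B; axum 0.7 and later drop that parameter from Next.
async fn mtls_middleware<B>(
    mut req: Request<B>,
    next: Next<B>,
) -> Response {
    // extract_cert_from_terminator is the deployment-specific helper: it reads
    // the chain from the connection info or header the terminator populates,
    // and returns None unless the request arrived through that terminator.
    if let Some(chain) = extract_cert_from_terminator(&req) {
        req.extensions_mut().insert(PeerCertChain::from(chain));
    }
    // Requests without a chain pass through without the extension.
    next.run(req).await
}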

The trusted-terminator check is the critical line. If the deployment accepts the certificate header from anywhere, an attacker who can reach the application directly (bypassing the terminator) can spoof any identity by setting the header themselves. The defence is to either configure the application to listen only on a socket the terminator owns, or to gate the extraction on a token the terminator injects alongside the certificate.
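The second defence (gating on a token the terminator injects) can be sketched without any framework. This is a minimal illustration, not part of axess: the header names x-terminator-auth and x-ssl-client-cert, the function names, and the plain HashMap standing in for the request headers are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Compares two byte strings without returning early on the first mismatch,
/// so the comparison time does not reveal how many leading bytes matched.
/// (The length check still reveals whether the lengths differ.)
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Returns the forwarded certificate only when the request also carries the
/// secret the terminator is configured to inject.
/// Header names are hypothetical; keys are assumed to be lowercased already.
fn cert_from_trusted_terminator<'a>(
    headers: &'a HashMap<String, String>,
    terminator_secret: &str,
) -> Option<&'a str> {
    let presented = headers.get("x-terminator-auth")?;
    if !constant_time_eq(presented.as_bytes(), terminator_secret.as_bytes()) {
        // The request did not come through the terminator: ignore the cert.
        return None;
    }
    headers.get("x-ssl-client-cert").map(String::as_str)
}
```

The gate only holds if the terminator also strips any client-supplied copies of both headers before adding its own; otherwise a client that learns the secret can still inject a certificate.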

The trust anchor

The certificate validation that the TLS terminator performs uses a trust anchor: a set of CA certificates the terminator considers authoritative. A client certificate is accepted only if it chains back to one of those CAs.

For service-to-service mTLS within an organisation, the trust anchor is typically the organisation's own internal CA. The CA issues certificates to known clients, the terminator trusts the CA, and the validation works on the closed set of certificates the organisation has signed.

For broader deployments (a partner integration where the partner runs their own CA), the trust anchor is the partner's CA or a short list of CAs, and the validation accepts clients signed by any of them.

For consumer-facing deployments where clients might use any certificate, mTLS is the wrong factor. Use OAuth or another flow where the client does not need to provision a certificate.

From certificate to user

After the middleware inserts the PeerCertChain into the extensions, the application's login handler reads it back and maps the certificate to a user identity. The mapping depends on the deployment's conventions.

The simplest mapping is from the certificate's Subject Common Name (CN) to a username. The CA issues certificates with CNs that match the deployment's usernames, the login handler reads the CN, and the application looks up the user under that CN.

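use std::sync::Arc;

use axess::factors::mtls::PeerCertChain;
use axess::FactorCredential;
use axum::{
    extract::State,
    http::StatusCode,
    response::{IntoResponse, Redirect},
    Extension,
};

async fn mtls_login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<...>>>,
    Extension(chain): Extension<PeerCertChain>,
) -> impl IntoResponse {
    // The terminator has already validated the chain, so a missing leaf would
    // be a wiring bug rather than a client error.
    let leaf = chain.leaf().expect("validated chain has at least one cert");

    // A certificate can legitimately lack a CN (for example, when the identity
    // is in a SAN URI), so reject the login instead of panicking.
    // Assumes extract_common_name returns an Option.
    let Some(cn) = extract_common_name(leaf) else {
        return (StatusCode::UNAUTHORIZED, "certificate has no CN").into_response();
    };

    // Start the login for the user named by the CN.
    if let Err(e) = service.begin_login(&session, &cn, "default-tenant").await {
        return (StatusCode::UNAUTHORIZED, format!("{e}")).into_response();
    }

    // Present the chain as the factor; the verifier checks it against the
    // user's stored mTLS configuration.
    match service
        .verify_factor(&session, FactorCredential::Mtls { chain })
        .await
    {
        Ok(_) => Redirect::to("/dashboard").into_response(),
        Err(e) => (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }
}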

A more structured mapping uses a SPIFFE URI in the certificate's Subject Alternative Name (SAN). The CA issues certificates with SAN URIs of the form spiffe://<trust_domain>/<path>, and the application's login handler parses the URI to extract the trust domain, the path, and any embedded identifiers. This shape is what the workload identity chapter covers, and it remains the right shape even for human-adjacent flows because it is more structured than a CN.
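The parsing step can be sketched as follows. This is an illustrative parser, not the one axess ships: the SpiffeId struct and parse_spiffe_uri function are names invented for the example, and the checks are a deliberately strict subset of what a full SPIFFE ID validator would enforce. It also rejects a bare trust domain with no path, on the assumption that a login flow needs a path to map to a user.

```rust
/// The two parts of a SPIFFE ID that a login handler typically maps from.
#[derive(Debug, PartialEq)]
struct SpiffeId<'a> {
    trust_domain: &'a str,
    /// Path including the leading slash, e.g. "/kiosk/lobby-1".
    path: &'a str,
}

/// Splits `spiffe://<trust_domain>/<path>` into its parts.
/// Returns None rather than guessing on unexpected input.
fn parse_spiffe_uri(uri: &str) -> Option<SpiffeId<'_>> {
    let rest = uri.strip_prefix("spiffe://")?;
    // Query strings and fragments are not part of a SPIFFE ID.
    if rest.contains('?') || rest.contains('#') {
        return None;
    }
    // The first slash separates the trust domain from the path.
    let slash = rest.find('/')?;
    let (trust_domain, path) = rest.split_at(slash);
    if trust_domain.is_empty() {
        return None;
    }
    // Every path segment must be non-empty and must not be "." or "..".
    let segments_ok = path[1..]
        .split('/')
        .all(|seg| !seg.is_empty() && seg != "." && seg != "..");
    if !segments_ok {
        return None;
    }
    Some(SpiffeId { trust_domain, path })
}
```

A handler would then compare trust_domain against the domains the deployment accepts before mapping path to a user, so that a certificate from an unrelated trust domain with a familiar-looking path is not accepted.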

The verification on the axess side is straightforward. The FactorCredential::Mtls { chain } variant carries the cert chain through verify_factor. The verifier checks that the chain is present (a sanity check, since the middleware put it there), that the leaf certificate has not expired, and that the certificate matches the user's stored mTLS configuration (which CA it should be signed by, which CN or SAN it should have). Verification success advances the state machine; failure returns InvalidCredential.

Composing mTLS with other factors

mTLS as a sole factor is appropriate for service-to-service traffic where the certificate's possession is itself the authentication event. For human flows, mTLS pairs with another factor in a method.

A common shape for a high-assurance admin interface is mTLS followed by FIDO2. The user's machine presents a certificate issued by the organisation's CA (so only employees whose machines have been provisioned can even reach the login page); the user then authenticates with a passkey (so a stolen machine is not enough, the user themselves must be present). The combination is strong against both remote attackers (they have no certificate) and local attackers (they have no passkey).

A variation uses mTLS as a tenant-scoping factor and OAuth as the user-identification factor. The certificate identifies which tenant the request is for (a partner integration's certificate maps to the partner's tenant); the OAuth flow identifies which user within that tenant. The method composes the two.

Threat model

mTLS is robust against the standard authentication attacks: credential reuse, credential stuffing, password phishing, replay. The certificate is hard to steal without compromising the device that holds the private key, and a compromised private key is no easier to use than a compromised password (both require some attacker action and both can be revoked).

It is weak against three specific attacks.

The first is private-key theft from a compromised device. An attacker with full filesystem access to a client can copy the private key, install it on their own machine, and use the certificate. The defence is to store the private key in hardware that keeps it out of the filesystem (a TPM, a hardware security module, a smartcard) rather than in a file. Hardware-backed keys are not exportable, so a full filesystem compromise does not yield a copy of the key; an attacker who still controls the device can use the key in place, but loses that ability once the device is recovered or the certificate is revoked.

The second is CA compromise. An attacker who can issue certificates from a CA the application trusts can authenticate as anyone. The defence is operational: keep the issuing CA offline, use short-lived certificates so revocation is automatic, and monitor the CA's audit log. For service-to-service mTLS, a SPIFFE control plane handles this with rotating, short-lived certificates backed by an attested root.

The third is missing revocation. When a certificate is revoked (employee leaves, machine is lost), the application needs to know. The TLS terminator checks revocation through OCSP or CRL or a short-lived-certificate strategy; an unchecked revocation lets the old certificate continue to work. The defence is to wire revocation checking at the terminator and to monitor the revocation lifecycle.

Troubleshooting

If the middleware never sees a certificate, the most likely cause is that the TLS terminator is not requiring client certificates. Some terminators require explicit configuration to request the client certificate at handshake time; others accept the handshake without a certificate and silently let the request through. Check the terminator's configuration.

If certificates are present but the CN extraction returns nothing, the certificate may use a SAN URI instead of a CN. Inspect the certificate (openssl x509 -in cert.pem -noout -text) to see what fields are present. Updating the extraction to read the SAN URI is the fix; the structured-mapping pattern above is the right shape.

If the trust-anchor configuration accepts a certificate the application does not expect, the terminator's trust store may include a CA the deployment did not intend to trust. Check the terminator's CA-bundle configuration and remove anything that should not be there. Use a dedicated trust store for client certificates rather than reusing the server's general CA bundle.

Further reading

Workload identity overview covers the workload-side use of mTLS, where the certificate identifies a service rather than a human. Inbound: mTLS-SVID covers the SPIFFE X.509-SVID profile that is the standard shape for service-to-service mTLS today. Security posture covers the production crypto requirements that apply to mTLS deployments, including FIPS-routing notes for regulated contexts.

Cedar policy fundamentals

Most application authorisation is the if user.role == "admin" style: a check scattered across handlers, expressed in code, written by whoever happened to be in the file at the time, with no shared schema and no way to review the policy as a whole. The pattern works for small applications and fails for everything else, because the authorisation logic is the part of the application that needs the most review and is also the part most likely to drift.

Cedar is a policy language designed for this exact problem. It is declarative, deny-by-default, statically checkable against a schema, and built to express RBAC, ReBAC, and ABAC in one set of rules. Axess loads a Cedar policy set at startup, validates it against a schema, and exposes per-request evaluation through a small typed interface. This chapter covers the lifecycle: loading, validation, the per-request evaluator, the contract with the application's data layer, and the error modes.

The feature flag is authz (on by default in the axess facade).

The lifecycle

Cedar in axess has three lifecycle phases: load, evaluate, redeploy. Each phase has a specific failure mode, and the design is built so the failures land at the right place.

The load phase happens once at application startup. The application constructs a PolicyStore from one or more policy files, validates the parsed policies against a schema, and produces an AuthzStore that holds the result. A load failure (a malformed policy, a type mismatch against the schema, an action that references an undefined entity) is a startup failure: the application refuses to start. The defence is structural: a broken policy file has no path to production, because the process that would serve it never comes up.

The evaluate phase happens once per authorisation check. The application constructs an AuthzSession from the AuthzStore, a Principal (typically extracted from the session or from a workload-identity resolver), an AuthzEntityProvider that supplies the application's entity graph for this request, and a context (MFA status, IP address, the application's custom attributes). The session offers two verbs: require (allow or deny, returning an error on deny) and decide (a typed AuthzDecision). The evaluation is cheap, predictable, and deterministic.

The redeploy phase happens when policies change. The application loads a new PolicyStore from the new policy files, swaps it in behind the AuthzStore's Arc, and from the next request onward new evaluations use the new policies. A hot reload of policies is supported; the trade-off is that decisions in flight at swap time see the old policies and decisions started after see the new policies. There is no decision-caching layer in axess for this reason: a cached decision from before a redeploy would survive into the new policy regime and produce wrong answers. The chapter Entity providers and request context expands on what does and does not get cached.
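The swap semantics described above (in-flight decisions keep the old policies, later ones see the new) follow from holding the policy set behind a replaceable Arc. The sketch below shows the pattern with the standard library only; SwappableStore and PolicySet are hypothetical stand-ins, not the axess types, and the real AuthzStore may be structured differently.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for a parsed, validated policy set (hypothetical type).
#[derive(Debug, PartialEq)]
struct PolicySet {
    version: u32,
}

/// Holds the current policy set behind a swappable Arc.
struct SwappableStore {
    current: RwLock<Arc<PolicySet>>,
}

impl SwappableStore {
    fn new(initial: PolicySet) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Each evaluation takes a snapshot once and uses it throughout,
    /// so it never observes a half-applied redeploy.
    fn snapshot(&self) -> Arc<PolicySet> {
        let guard = self.current.read().expect("lock poisoned");
        Arc::clone(&*guard)
    }

    /// Redeploy: replace the pointer. Snapshots already handed out keep
    /// the old set alive until their evaluations finish.
    fn swap(&self, next: PolicySet) {
        *self.current.write().expect("lock poisoned") = Arc::new(next);
    }
}
```

The lock is held only long enough to clone or replace the pointer, so a redeploy does not wait for evaluations that are already running.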

Loading policies

The minimal load is a directory of .cedar files plus a schema.cedarschema file:

use axess::authz::{AuthzStore, PolicyStore};

let policy_store = PolicyStore::load_directory("./policies")?;
let schema = std::fs::read_to_string("./policies/schema.cedarschema")?;
policy_store.validate_against(&schema)?;

let authz_store = AuthzStore::new(policy_store);

The load is recursive: every .cedar file under the directory is parsed and added to the policy set. Cedar policies have no import or namespace mechanism beyond the entity-type namespace; the collection of all files is the policy set, evaluated as one.
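To make "recursive" concrete, the following sketch walks a directory tree and collects the .cedar files it finds. It is an illustration of the traversal only, under the assumption that files are matched by extension; collect_cedar_files is a hypothetical helper and says nothing about how PolicyStore::load_directory is implemented.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every `.cedar` file under `dir`, sorted so the
/// resulting list is assembled in a stable order.
fn collect_cedar_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![dir.to_path_buf()];
    while let Some(current) = pending.pop() {
        for entry in fs::read_dir(&current)? {
            let path = entry?.path();
            if path.is_dir() {
                // Descend into subdirectories on a later iteration.
                pending.push(path);
            } else if path.extension().and_then(|ext| ext.to_str()) == Some("cedar") {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

Matching on the exact extension means schema.cedarschema in the same directory is skipped, since its extension is cedarschema rather than cedar.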

validate_against is the call that catches malformed policies before they reach production. The validator checks that every entity type the policies reference is defined in the schema, that every attribute access is on an attribute the schema declares, and that the types align (a policy that asks principal.age > "old" gets caught because the schema declares age as a number and the literal is a string).

The schema is its own discipline. Writing a schema that accurately describes the application's entities is the hardest part of a Cedar integration. The schema names the principal types (User, Workload, Role, Group), the action types (read, write, administer), the resource types (the application's domain objects), and the parent relationships (a User is in Groups, which are in Roles), along with the principal and resource types each action applies to. The Cedar documentation covers schema authoring in detail; the chapter here focuses on what axess does with a schema once it has one.

The per-request evaluator

The AuthzSession is constructed per request and lives only as long as the request:

let session: AuthzSession = authz_store.session()
    .with_principal(principal)
    .with_entity_provider(&app.entity_provider)
    .with_context(StandardRequestContext::from_request(&request))
    .build();

match session.decide(
    Action::View,
    ResourceUid::new("Document", "doc-123"),
).await {
    Ok(AuthzDecision::Allow) => proceed(),
    Ok(AuthzDecision::Deny) => render_forbidden(),
    Err(e) => render_error(e),
}

The with_principal call binds the caller. The principal carries the user id, the tenant id, the factors completed, and the authentication time. Cedar policies can match on any of these.

The with_entity_provider call binds the application's data layer. The entity provider is the application-specific code that loads the relevant entities (the user record, their group memberships, the resource being accessed, its parents) for the evaluation. The provider returns a Cedar entity graph; the session holds it for the duration of the evaluation. The next chapter, Entity providers and request context, covers the provider contract in detail.

The with_context call binds the contextual attributes. The StandardRequestContext covers the common cases: MFA status, IP address, the time of the request. Applications can extend it with custom keys (a custom-headers map, a tenant-feature-flag set, a geographical location).

The decide verb evaluates the policies and returns AuthzDecision::Allow or AuthzDecision::Deny. The verb is async because the entity provider may need to fetch entity data from a database. The require verb is a thin wrapper that returns an error on Deny, suitable for handlers that want to short-circuit on a denied request.
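The relationship between the two verbs can be sketched in a few lines. The types here are simplified stand-ins invented for the example (a synchronous function, a String error, and a hypothetical RequireError), not the axess signatures; the point is only the shape of the mapping.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

/// Hypothetical error type for this sketch; not the axess AuthzError.
#[derive(Debug, PartialEq)]
enum RequireError {
    Denied,
    Evaluation(String),
}

/// Collapses a `decide`-style result into "proceed or stop".
/// Only an explicit Allow yields Ok; a Deny and an evaluation error both stop.
fn require(decided: Result<Decision, String>) -> Result<(), RequireError> {
    match decided {
        Ok(Decision::Allow) => Ok(()),
        Ok(Decision::Deny) => Err(RequireError::Denied),
        Err(reason) => Err(RequireError::Evaluation(reason)),
    }
}
```

Because the only path to Ok is an explicit Allow, a handler that uses the ? operator on the result cannot proceed past a denied or failed evaluation by accident.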

What policies look like

A Cedar policy is a permit or forbid statement against a principal, action, and resource, with optional when conditions. The simplest possible policy:

permit (
    principal,
    action == Action::"read",
    resource
);

This is the "everyone can read everything" policy. It permits any principal to perform the read action against any resource. It is useful for nothing in production but illustrates the shape.

A real RBAC policy:

permit (
    principal in Role::"finance-viewer",
    action == Action::"read",
    resource in TenantData::"acme"
) when {
    principal.tenant_id == "acme"
};

This permits any principal in the finance-viewer role to read any resource in the acme tenant's data, but only when the principal is also in the acme tenant. The in operator is hierarchy membership against the entity graph: it holds when the principal is the named entity or has it among its ancestors, directly or transitively. The evaluator answers "is this principal in this role?" from the entity graph the provider supplied, which the provider in turn builds from the application's data.

A ReBAC policy:

permit (
    principal,
    action == Action::"edit",
    resource
) when {
    resource.owner == principal
};

This permits a principal to edit a resource only when the resource's owner attribute equals the principal. Ownership is the ReBAC relationship; the schema declares Document has an owner attribute of type User, and the entity provider populates it from the document's row.

An ABAC policy:

permit (
    principal,
    action == Action::"write",
    resource in TenantData::"acme"
) when {
    principal.tenant_id == "acme"
    && context.mfa == true
    && context.ip like "10.*"
};

This permits writes to the acme tenant's data when the principal is in the tenant, has completed MFA, and is connecting from an internal IP range. Context attributes come from the StandardRequestContext (or custom extensions); the schema declares them so the validator can type-check the policy.

The three styles compose freely in one rule. A real production policy is typically a mix: roles establish broad permissions, relationships restrict to ownership, attributes restrict to high-assurance contexts. Cedar's deny-by-default behaviour means permit rules accumulate as positive grants, and the absence of a permitting rule is itself a deny. None of the examples above needs a forbid statement; where one is present and matches, it overrides every matching permit.
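How individual policies combine into one decision is worth seeing in miniature. Cedar's rule is that a request is allowed only if at least one permit policy matches and no forbid policy matches. The sketch below models just that combination step; the Effect and Decision enums and the combine function are names invented for the example, and the real evaluator of course also does the matching.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Effect {
    Permit,
    Forbid,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

/// Combines the effects of the policies whose scope and conditions matched
/// the request. Policies that did not match contribute nothing.
fn combine(matched: &[Effect]) -> Decision {
    let any_forbid = matched.contains(&Effect::Forbid);
    let any_permit = matched.contains(&Effect::Permit);
    if any_permit && !any_forbid {
        Decision::Allow
    } else {
        // No matching permit (deny by default) or a matching forbid.
        Decision::Deny
    }
}
```

An empty slice yields Deny, which is the deny-by-default case; adding more permits never turns a Deny caused by a forbid back into an Allow.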

Errors

The AuthzError enum has variants for the cases that go wrong:

pub enum AuthzError {
    PolicySetInvalid(String),       // load-time, should never reach prod
    SchemaValidationFailed(String), // load-time
    EntityNotFound { uid: String }, // evaluator could not load an entity
    ContextMissing(String),         // policy needed a context key not provided
    EvaluationFailed(String),       // Cedar internal error (rare)
    Cancelled,                      // request cancelled during evaluation
}

The load-time variants should never reach production because the PolicyStore::validate_against call catches them at startup.

The runtime variants are recoverable but specific. EntityNotFound means the entity provider returned no entity for a UID a policy referenced; the deployment may have a stale Cedar reference or a race between policy and data. ContextMissing means a policy referenced a context key the request did not provide; the schema should have caught this at load time but did not (a context key the schema declared as optional, used in a policy as if required). EvaluationFailed is the catch-all for Cedar's own errors, which are rare in well-formed policy sets.

Every variant produces a deny. There is no path where an evaluation error produces an allow. The defence is structural and is one of the reasons Cedar was chosen.

When to use require versus decide

The two verbs differ in their failure handling. require returns an error on Deny (so the handler short-circuits with an error without needing an explicit match); decide returns the typed decision (so the handler can branch).

The recommendation is to use require in handlers (the most common case: deny gives a 403, allow proceeds), and decide in code that needs to express a non-binary outcome (a UI that hides buttons rather than displaying them and denying on click, an admin panel that shows what the current user could do).

// require version: handler short-circuits on deny
async fn delete_document(
    session: AuthzSession,
    Path(doc_id): Path<String>,
) -> Result<Json<()>, AppError> {
    session
        .require(Action::Delete, ResourceUid::new("Document", &doc_id))
        .await?;
    // ... proceed with delete
    Ok(Json(()))
}

// decide version: branch on the decision
async fn dashboard(
    session: AuthzSession,
) -> impl IntoResponse {
    let can_create_doc = matches!(
        session.decide(Action::Create, ResourceUid::new("Document", "*")).await,
        Ok(AuthzDecision::Allow)
    );
    render_dashboard(can_create_doc)
}

The "*" resource id in the second example is not a wildcard that Cedar interprets; to the evaluator it is an ordinary entity id. Using it to ask "is the principal allowed to perform this action at all?" is a convention that the policy set and the entity provider must both be written to honour.

What policies cannot do

Cedar is the right tool for asking "is this allowed?". It is not the right tool for everything that pattern-matches like authorisation but is actually something else.

It is not for rate limiting. Rate limits are stateful (they depend on the rate of past requests, not the content of the current request), expensive to express in declarative terms, and not what Cedar is built for. Use the RateLimitLayer middleware (covered in Rate limiting).

It is not for input validation. A request with an invalid body fails at deserialisation, not at authorisation. Cedar policies that try to enforce body-shape constraints duplicate validation logic and run after the body has already been parsed.

It is not for state transitions. A workflow that allows a transition from Pending to Approved but not from Pending to Closed is a state machine, not a policy. Implement the state machine in code (or in an axess-style typed state machine for the workflow); use Cedar to gate access to the transition operations.
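The division of labour can be sketched as a transition function that takes the authorisation outcome as an input and keeps the workflow edges in code. Everything here is illustrative: the Status and Operation enums, the apply function, and the Approved to Closed edge are assumptions added to make the example complete.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Approved,
    Closed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Approve,
    Close,
}

#[derive(Debug, PartialEq)]
enum TransitionError {
    Forbidden,         // the authorisation check said no
    InvalidTransition, // the workflow has no such edge
}

/// `authorised` is the outcome of the policy check for this operation
/// (for example, whether a `require` call succeeded).
fn apply(status: Status, op: Operation, authorised: bool) -> Result<Status, TransitionError> {
    if !authorised {
        return Err(TransitionError::Forbidden);
    }
    // The legal edges of the workflow live here, not in the policy set.
    match (status, op) {
        (Status::Pending, Operation::Approve) => Ok(Status::Approved),
        (Status::Approved, Operation::Close) => Ok(Status::Closed),
        _ => Err(TransitionError::InvalidTransition),
    }
}
```

The two error variants keep the two questions apart: the policy answers whether this caller may perform the operation, and the match answers whether the workflow permits it from the current state.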

It is not for caching decisions across requests. Policies and entity graphs are mutable; cached decisions are stale by construction. Axess deliberately caches entity graphs (which are much more stable) and not decisions.

The next chapter, Entity providers and request context, covers the entity-graph caching mechanism and the contract between Cedar and the application's data layer.

Further reading

Entity providers and request context covers the AuthzEntityProvider trait, the StandardRequestContext extension points, and the caching posture. RBAC, ReBAC, and ABAC patterns walks through worked examples of each style and how they compose in one policy set. The principal model covers the principal types the evaluator binds to.

Entity providers and request context

A Cedar policy evaluates against a request made of a principal, an action, a resource, and a context, plus an entity graph that gives the policies the data they need to reason about (which roles the principal is in, which group owns the resource, what the principal's MFA status is). The policy set is loaded once at startup. The principal and action come from the request. The entity graph and the request context come from the application, per request, through two interfaces this chapter covers: the AuthzEntityProvider trait and the StandardRequestContext extension surface.

Doing both of these well determines whether the Cedar integration holds up under load. A naive entity provider that loads a user's entire group membership on every request will be the slowest part of the request lifecycle. A request context that omits an attribute a policy expects produces denies that are hard to debug. The shapes below avoid both failure modes.

The entity provider contract

AuthzEntityProvider is the trait the application implements. The job is to take a request's principal and resource UIDs, and return a Cedar entity graph rich enough that the evaluator can answer the policy questions:

#[async_trait]
pub trait AuthzEntityProvider: Send + Sync {
    async fn entities(
        &self,
        principal: &Principal,
        resources: &[ResourceUid],
    ) -> Result<EntitySet, AuthzProviderError>;
}

The provider receives the principal (so it can load the principal's groups, roles, and any attributes the policies need) and the list of resource UIDs the request is touching (so it can load the resources, their parents, and their attributes). It returns an EntitySet, which is Cedar's typed entity graph: each entity has a UID, a set of attributes, and a list of parent entities.

The contract is "return enough to answer the policies, no more." An entity set that omits an entity a policy references produces an EntityNotFound error at evaluation time. An entity set that includes hundreds of entities the policy never touches wastes database time. The right shape is the minimum set the policies need for this request.

What "enough" means

The policies that the evaluator runs against the entity set typically need a few categories of data.

The principal's parents. Every role the principal is in, every group they belong to. A policy that says principal in Role::"finance-viewer" needs the principal's parents list to include Role::"finance-viewer" if the principal is in that role. The provider populates this from the application's role-and-group store.

The principal's attributes. The user's tenant id, MFA status, factors completed, custom attributes the policies use. Many of these are already on the Principal value; the provider attaches them as Cedar attributes on the principal entity.

The resource's parents. The tenant that owns it, the project it belongs to, any logical grouping the policies might match against. A policy that says resource in TenantData::"acme" needs the resource's parents list to include TenantData::"acme" if the resource belongs to that tenant.

The resource's attributes. The owner, the visibility setting, the classification level, anything the policies need. The provider populates these from the resource's row.

The principal's relationships to the resource. A ReBAC policy that matches resource.owner == principal needs the resource's owner attribute to equal the principal's UID. If the resource is shared with the principal through a separate sharing record, the provider either expresses it as an attribute on the resource (a shared_with list) or as a parent (the principal is in a "viewers" group attached to the resource).

The application's data model is the source of truth for all of this; the provider's job is to shape the data into Cedar's vocabulary.
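Why the parents lists matter becomes clearer from how a membership test walks them. The sketch below models an entity graph as a map from each entity to its direct parents and checks membership transitively, which is how Cedar's in operator behaves (an entity is also in itself). The ParentMap alias, the is_in function, and the plain-string UIDs such as User::alice are simplifications made for the example, not the EntitySet API.

```rust
use std::collections::{HashMap, HashSet};

/// Maps each entity UID (a plain string here, e.g. "User::alice") to the
/// UIDs of its direct parents. Hypothetical simplified shape.
type ParentMap = HashMap<String, Vec<String>>;

/// True when `ancestor` is `entity` itself or reachable through parent links.
fn is_in(graph: &ParentMap, entity: &str, ancestor: &str) -> bool {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut stack = vec![entity];
    while let Some(current) = stack.pop() {
        if current == ancestor {
            return true;
        }
        if !seen.insert(current) {
            continue; // already visited; guards against cycles
        }
        if let Some(parents) = graph.get(current) {
            stack.extend(parents.iter().map(String::as_str));
        }
    }
    false
}
```

If the provider leaves a link out of the map (for instance, the group's membership in the role), the walk simply never reaches the role and the policy does not match, which surfaces as a deny rather than an error.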

A worked provider

A typical provider for a document-management application looks like this:

struct AppEntityProvider {
    db: PgPool,
}

#[async_trait]
impl AuthzEntityProvider for AppEntityProvider {
    async fn entities(
        &self,
        principal: &Principal,
        resources: &[ResourceUid],
    ) -> Result<EntitySet, AuthzProviderError> {
        let mut set = EntitySet::new();

        // Principal: load roles and groups, attach as parents.
        let user_id = principal.user_id().ok_or(AuthzProviderError::NotHuman)?;
        let memberships = sqlx::query_as::<_, (String,)>(
            "SELECT role_uid FROM user_roles WHERE user_id = $1"
        )
        .bind(user_id.to_string())
        .fetch_all(&self.db)
        .await?;

        let principal_uid = ResourceUid::new("User", &user_id.to_string());
        set.insert(Entity {
            uid: principal_uid.clone(),
            attrs: principal_attrs(principal),
            parents: memberships
                .into_iter()
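                // unwrap panics on a malformed role_uid; production code
                // should map the parse error into AuthzProviderError instead.
                .map(|(uid,)| ResourceUid::parse(&uid).unwrap())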
                .collect(),
        });

        // Resources: load each resource's row + tenant parent.
        for resource in resources {
            if resource.entity_type() == "Document" {
                let row: DocumentRow = sqlx::query_as("SELECT * FROM documents WHERE id = $1")
                    .bind(resource.id())
                    .fetch_one(&self.db)
                    .await?;
                set.insert(Entity {
                    uid: resource.clone(),
                    attrs: document_attrs(&row),
                    parents: vec![ResourceUid::new("TenantData", &row.tenant_id)],
                });
            }
        }

        Ok(set)
    }
}

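The shape is uniform: one principal entity (with parents from the role-and-group store), one or more resource entities (each with parents from the tenant model and attributes from the resource's row). The provider uses Postgres in this example; the choice is the application's. The key shape is that the loads happen once per request, not once per policy: one query for memberships, then the resource loads. As written, the loop issues one query per resource, which is acceptable when a request touches one or two documents; a request that touches many should batch them into a single query per entity type (on Postgres, WHERE id = ANY($1)).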
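One way to keep the resource loads to a single query per entity type is to group the requested UIDs by type before touching the database. The helper below is a sketch of that grouping step only; group_ids_by_type is a hypothetical name, and the (type, id) string pairs stand in for ResourceUid values.

```rust
use std::collections::BTreeMap;

/// Groups `(entity_type, id)` pairs by type so that each type can be loaded
/// with a single query, for example `... WHERE id = ANY($1)` on Postgres.
fn group_ids_by_type(resources: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut grouped: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (entity_type, id) in resources {
        grouped
            .entry(entity_type.to_string())
            .or_default()
            .push(id.to_string());
    }
    grouped
}
```

The provider would then iterate over the map, run one query per key with the id list bound as an array parameter, and insert one entity per returned row.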

Caching entities, not decisions

The single most important performance choice in a Cedar integration is what to cache. Axess takes the conservative line: entity graphs are cached aggressively, decisions are never cached.

Decisions cannot be cached because they are functions of the entity graph, the policy set, and the context. Any of the three can change between the cache write and the cache read: the entity graph because the database has updated (a role granted, a relationship added), the policy set because a redeploy has happened, the context because the request is different. A cached decision that survives any of these changes produces a wrong answer. The defence is to not cache decisions at all.

Entity graphs can be cached because they are functions of the database state at a known point in time. The cache key is the principal UID plus the resource UIDs, scoped by tenant; the cache value is the entity set; the cache TTL is set by how much staleness the application is willing to tolerate.

Axess provides an AuthzSessionCache decorator that wraps an AuthzSession. The decorator caches the entity graph for a configurable TTL: sixty seconds by default, which suits low-sensitivity deployments; one second or less for high-sensitivity deployments; off entirely for the highest-sensitivity ones. The cache is keyed by (tenant_id, principal_uid, resource_uids).

The TTL is the lever. Sixty seconds is fine for a deployment where a role change can take a minute to propagate (most internal admin panels). Anything tighter requires the cache to be invalidated on role changes, which means the application's role-mutation code calls into the cache to flush the affected entries. The CacheInvalidator trait on EntityCache is the surface for this; applications that need stricter consistency wire the invalidations explicitly.
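The interplay between the TTL and explicit invalidation can be sketched with a small cache that takes the current time as a parameter, in keeping with the injected-clock discipline described in the introduction. This is not the axess cache: EntityGraphCache, its methods, and the use of a Duration since an arbitrary epoch as "now" are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// (tenant_id, principal_uid, resource_uids)
type CacheKey = (String, String, Vec<String>);

/// A TTL cache with an injected notion of "now" (time since an arbitrary
/// epoch), so tests can advance time without sleeping.
struct EntityGraphCache<V> {
    ttl: Duration,
    entries: HashMap<CacheKey, (Duration, V)>, // (inserted_at, value)
}

impl<V: Clone> EntityGraphCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, key: CacheKey, value: V, now: Duration) {
        self.entries.insert(key, (now, value));
    }

    /// Returns the cached value only while it is younger than the TTL.
    fn get(&self, key: &CacheKey, now: Duration) -> Option<V> {
        let (inserted_at, value) = self.entries.get(key)?;
        if now.saturating_sub(*inserted_at) < self.ttl {
            Some(value.clone())
        } else {
            None
        }
    }

    /// Explicit invalidation: drop every entry for one principal, as a
    /// role-mutation code path would after changing that principal's roles.
    fn invalidate_principal(&mut self, tenant_id: &str, principal_uid: &str) {
        self.entries.retain(|(tenant, principal, _), _| {
            !(tenant == tenant_id && principal == principal_uid)
        });
    }
}
```

With a sixty-second TTL, an entry written at time zero is served at fifty-nine seconds and treated as a miss at sixty; a call to invalidate_principal makes the miss immediate, which is what a tighter consistency requirement needs.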

The chapter Session lifecycle and crypto envelope covers the generic axess-cache machinery the entity cache uses. Operations runbook covers the operational signals for the cache (hit rate, eviction rate, invalidation rate).

The standard request context

The context is the remaining input to a policy evaluation. It carries the per-request attributes that are not on the principal or the resource: the MFA status, the IP address, the time of the request, the custom keys the application wants to expose to policies.

StandardRequestContext is the built-in implementation:

pub struct StandardRequestContext {
    pub mfa: bool,
    pub ip: Option<IpAddr>,
    pub now: DateTime<Utc>,
    pub custom: BTreeMap<String, serde_json::Value>,
}

impl StandardRequestContext {
    pub fn from_request(req: &Request) -> Self { /* ... */ }

    pub fn with_custom(mut self, k: impl Into<String>, v: serde_json::Value) -> Self {
        self.custom.insert(k.into(), v);
        self
    }
}

The from_request constructor pulls what it can from the request: the IP from the trusted-proxy chain, the MFA status from the session's factors_completed, the time from the clock. The with_custom builder adds application-specific keys.

Policies can match on any of these:

permit (
    principal,
    action == Action::"write",
    resource
) when {
    context.mfa == true
    && context.ip like "10.*"
    && context.custom.region == "eu"
};

The schema declares the context shape:

type Context = {
    mfa: Bool,
    ip: String,
    custom: {
        region?: String,
        ...
    }
};

The validator checks policies against this shape at load time: a policy that reads a field the schema does not declare, or uses a field at the wrong type, fails validation and the application refuses to start (good, caught early). Optional fields are checked at evaluation time: a policy that uses an optional field the request omits produces a deny at runtime with ContextMissing (acceptable, deny is the conservative answer).

When to extend the context

The custom keys exist to bridge application state that does not fit on the principal or the resource. Common cases:

The first is a tenant feature flag. A policy that gates a beta feature on "this tenant has opted in" reads context.custom.beta, which the application sets from the tenant's feature-flag state.

The second is the request's geographical context. A policy that restricts certain actions to certain regions reads context.custom.region, which the application populates from the load balancer's geo-IP information or from an explicit header.

The third is a stepped-up factor that is not in factors_completed because it was completed for a different reason. A policy that wants to know "did the user complete a fresh password challenge in the last five minutes" reads context.custom.password_challenge_at, which the application populates from a sidecar store of recent challenges.

The pattern across all three: the application owns the data, the context is the carrier, the policy sees a typed attribute it can match on.
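The three cases can be sketched as one function that turns application state into the custom map. The example avoids external crates, so CtxValue is a minimal stand-in for the JSON values the real map holds, and RequestFacts and custom_context are hypothetical names; only the keys beta, region, and password_challenge_at come from the text above.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON-like value.
#[derive(Debug, Clone, PartialEq)]
enum CtxValue {
    Bool(bool),
    Str(String),
    Int(i64),
}

/// Application state relevant to policies for one request (hypothetical).
struct RequestFacts<'a> {
    tenant_in_beta: bool,
    region: Option<&'a str>,
    /// Unix seconds of the last fresh password challenge, if any.
    password_challenge_at: Option<i64>,
}

/// Builds the custom context map. Unknown facts are omitted rather than
/// filled with a placeholder, so a policy cannot match on a guessed value.
fn custom_context(facts: &RequestFacts<'_>) -> BTreeMap<String, CtxValue> {
    let mut custom = BTreeMap::new();
    custom.insert("beta".to_string(), CtxValue::Bool(facts.tenant_in_beta));
    if let Some(region) = facts.region {
        custom.insert("region".to_string(), CtxValue::Str(region.to_string()));
    }
    if let Some(at) = facts.password_challenge_at {
        custom.insert("password_challenge_at".to_string(), CtxValue::Int(at));
    }
    custom
}
```

Leaving an unknown key out, instead of inserting an empty string or zero, keeps the failure conservative: a policy that depends on the key does not match, and the request is denied.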

Failure modes and visibility

The two failure modes worth knowing are EntityNotFound and ContextMissing, both of which surface as Deny from the evaluator. The right response is the same in both cases: log the failure with enough detail to diagnose, surface a generic deny to the user, and keep the audit trail.

EntityNotFound typically means the entity provider should have loaded an entity but did not. The fix is in the provider: load the missing entity, or update the policy to not reference it.

ContextMissing typically means a policy was written against a context key the application does not provide. The fix is in the schema: declare the key as optional and update the policy to handle its absence, or update the application to provide it.

Axess emits an AuthzEvent for every evaluation, regardless of outcome. The chapter Audit events covers the event surface; the relevant variants here are AuthzEvent::EntityNotFound and AuthzEvent::ContextMissing, both of which name the missing key and the policy that referenced it. A spike in either suggests a mismatch between the policy set and the rest of the deployment; operational dashboards should alert on it.

What this enables

The provider-and-context contract is what makes Cedar usable against an arbitrary application data model. The schema names the shape; the policies match on the shape; the provider populates the shape from whatever the application's storage actually looks like. The three layers are independent, which means a database migration that changes how roles are stored does not break the policies (the provider updates; the rest stays), and a policy change does not touch the database (the policy file updates; the rest stays).

The chapter RBAC, ReBAC, and ABAC patterns covers worked examples that show the three styles composed in real policies.

Further reading

Cedar policy fundamentals covers the policy lifecycle and the evaluator surface this chapter feeds. RBAC, ReBAC, and ABAC patterns covers the policy authoring style with concrete examples for each pattern. Identity store implementation covers how the provider's principal-loading queries fit into the application's identity-store implementation. Audit events covers the AuthzEvent variants the evaluator emits.

RBAC, ReBAC, and ABAC patterns

The three letter-soup acronyms RBAC, ReBAC, and ABAC name the three standard styles of authorisation. Cedar is one of the few policy languages that admits all three in the same set of rules. This chapter walks through each style with worked examples, then shows how to compose them in a single policy set without the rules fighting each other. The examples are concrete enough that you should be able to paste them into a .cedar file and have them type-check against a corresponding schema.

RBAC: roles as groups

Role-based access control assigns users to roles and assigns permissions to roles. The model has been the workhorse of enterprise authorisation since the 1990s and remains the right starting point for most applications.

The schema declares the entity types, including the Role groups that users belong to, and the actions that apply to them:

entity User in [Role] {
    tenant_id: String,
};

entity Role;

entity Document {
    tenant_id: String,
    owner: User,
};

action read appliesTo {
    principal: [User],
    resource: [Document],
};

action edit appliesTo {
    principal: [User],
    resource: [Document],
};

The policy grants the role-action mappings:

permit (
    principal in Role::"viewer",
    action == Action::"read",
    resource
);

permit (
    principal in Role::"editor",
    action in [Action::"read", Action::"edit"],
    resource
);

The entity provider, on each request, attaches the user's role memberships as parent entities. A user in Role::"viewer" has that role in their parents list; a user in Role::"editor" has that role in their parents list and inherits read permission through the second policy's action set.

The shape works for most applications until two situations arise. The first is when permissions need to depend on the relationship between the principal and the resource (a user can edit their own documents but not others'), which is the ReBAC case below. The second is when permissions need to depend on the request context (MFA must be present for sensitive actions), which is the ABAC case below.

ReBAC: relationships as paths

Relationship-based access control assigns permissions based on the relationship between the principal and the resource, not on a role label. The classic example is ownership: a user can edit a document they own.

The schema does not change much; the relationship is already on the entity:

entity Document {
    tenant_id: String,
    owner: User,
    shared_with: Set<User>,
};

The policy expresses the relationship:

permit (
    principal,
    action == Action::"edit",
    resource
) when {
    resource.owner == principal
};

permit (
    principal,
    action == Action::"read",
    resource
) when {
    resource.owner == principal
    || principal in resource.shared_with
};

The first rule grants edit to the owner. The second rule grants read to the owner or to anyone in the resource's shared_with set. The set membership principal in resource.shared_with is the ReBAC primitive: the principal is in some set on the resource, and the policy matches on that.

More elaborate relationships involve multi-hop paths. Consider a "team" model where a user belongs to a team, the team owns projects, and the projects contain documents. The schema:

entity Team;

entity Project {
    owner_team: Team,
};

entity Document {
    project: Project,
};

entity User in [Team];

The policy that says "anyone in the team that owns the project that contains this document can read the document":

permit (
    principal,
    action == Action::"read",
    resource
) when {
    principal in resource.project.owner_team
};

The in operator follows the entity graph: resource.project yields a Project entity, .owner_team yields a Team entity, and principal in Team checks whether that team appears among the principal's ancestors. The entity provider populates the graph: the document with its project attribute, the project with its owner_team attribute, the user with their team memberships as parents. Cedar walks the graph at evaluation time.
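The graph walk can be pictured with a small stand-alone sketch. This is not Cedar's implementation; it only illustrates the semantics of a reflexive, transitive membership test over parent edges. The Parents alias, the is_in function, and the string ids are all hypothetical names chosen for the illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Parent edges of the entity graph: entity id -> direct parents.
/// Ids are plain strings here purely for illustration.
pub type Parents = HashMap<String, Vec<String>>;

/// Returns true when `entity` equals `target` or reaches it through
/// any chain of parent edges.
pub fn is_in(entity: &str, target: &str, parents: &Parents) -> bool {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut stack: Vec<&str> = vec![entity];
    while let Some(current) = stack.pop() {
        if current == target {
            return true;
        }
        // Skip entities already visited so a cyclic graph cannot loop forever.
        if !seen.insert(current) {
            continue;
        }
        if let Some(direct) = parents.get(current) {
            stack.extend(direct.iter().map(String::as_str));
        }
    }
    false
}
```

With a user whose only parent is a team, and a team whose only parent is an organisation, the user is "in" both the team and the organisation, which is why a policy only needs to name the outermost group it cares about.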

The pattern generalises to any depth, though policies that walk more than two or three hops become hard to review. When the depth gets uncomfortable, extract the relationship into a precomputed attribute (a "can_view" set on the document that the application's data layer computes ahead of time) and let the policy match on the simpler shape.

ABAC: attributes as conditions

Attribute-based access control adds context to the decision. The attributes might be on the principal (MFA status, last authentication time), on the resource (sensitivity level), or on the request (IP address, time of day). A policy applies only when the attributes match.

The schema declares the attribute shapes:

entity User {
    tenant_id: String,
    mfa_completed: Bool,
    last_authn_at: Long,  // unix seconds
};

entity Document {
    tenant_id: String,
    classification: String, // "public" | "internal" | "secret"
};

type Context = {
    ip: String,
    now: Long,
};

The policy combines attribute conditions:

permit (
    principal,
    action == Action::"read",
    resource
) when {
    principal.tenant_id == resource.tenant_id
    && (
        resource.classification == "public"
        || (
            resource.classification == "internal"
            && principal.mfa_completed
        )
        || (
            resource.classification == "secret"
            && principal.mfa_completed
            && context.now - principal.last_authn_at < 900  // last 15 min
        )
    )
};

The rule grants read access in three tiers: public documents to anyone in the tenant, internal documents to anyone in the tenant with MFA completed, secret documents to anyone in the tenant with MFA completed who also authenticated within the last fifteen minutes. The attributes drive the gradations; the policy expresses them in one statement.

ABAC is the right tool for time-sensitive, location-sensitive, and context-sensitive policies. It is the wrong tool for static permissions (use RBAC) or for relationship checks (use ReBAC). When in doubt, write the policy and read it back: if the rule says "users in X role can perform Y," it is RBAC; if it says "users with relationship Z to this resource can perform Y," it is ReBAC; if it says "users can perform Y when condition W," it is ABAC.

Composing the three styles

A real production policy set mixes the three. A user who has the editor role (RBAC) can edit any document, but a user who owns a document (ReBAC) can edit it regardless of role, and a user trying to edit a secret document must have MFA completed (ABAC).

// RBAC layer: editors get full access.
permit (
    principal in Role::"editor",
    action in [Action::"read", Action::"edit", Action::"delete"],
    resource
);

// ReBAC layer: owners get full access to their own.
permit (
    principal,
    action in [Action::"read", Action::"edit", Action::"delete"],
    resource
) when {
    resource.owner == principal
};

// ReBAC layer: shared-with users get read access.
permit (
    principal,
    action == Action::"read",
    resource
) when {
    principal in resource.shared_with
};

// ABAC layer: secret documents require fresh MFA, forbid otherwise.
forbid (
    principal,
    action,
    resource
) when {
    resource.classification == "secret"
    && (
        !principal.mfa_completed
        || context.now - principal.last_authn_at >= 900
    )
};

The forbid rule overrides any permit that would otherwise match. The structure works because Cedar evaluates all rules: if any permit matches and no forbid matches, the decision is Allow; if any forbid matches, the decision is Deny regardless of what permits also match; if nothing matches at all, the decision is Deny by default.
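The combination rule is small enough to write down. The sketch below is an illustration of the rule as described here, not the evaluator's code; Effect, Decision, and combine are hypothetical names, and the input is assumed to contain only the policies whose scope and conditions matched the request.

```rust
/// Effect of a single policy, mirroring Cedar's two policy kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Effect {
    Permit,
    Forbid,
}

/// Outcome of a whole evaluation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Combine the effects of every policy that matched the request.
/// The order of `matched` is irrelevant to the result.
pub fn combine(matched: &[Effect]) -> Decision {
    let any_forbid = matched.iter().any(|e| *e == Effect::Forbid);
    let any_permit = matched.iter().any(|e| *e == Effect::Permit);
    if any_permit && !any_forbid {
        Decision::Allow
    } else {
        // Covers both "a forbid matched" and "nothing matched".
        Decision::Deny
    }
}
```

Because the result does not depend on order, a forbid can live in any file of the policy set and still override every permit.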

The pattern is to express the broad grants through permit rules in increasing specificity (role, relationship, context), then to express the absolute constraints through forbid rules. The forbid rules are typically about high-sensitivity resources or about high-risk principal states; they are the small set of cases where a positive grant is not enough.

Tenant isolation as a structural rule

Multi-tenant applications need a structural rule that no policy should ever leak data across tenants. The right shape is a single top-level forbid:

forbid (
    principal,
    action,
    resource
) when {
    principal.tenant_id != resource.tenant_id
};

The rule applies to every action on every resource. Any permit that would have allowed a cross-tenant access is overridden, wherever it sits in the policy set, because Cedar policies are unordered. The rule is the structural defence against the worst class of authorisation bug a multi-tenant application can have: an operator from tenant A accessing tenant B's data because of a mistake in another policy.

The rule is also the right place to validate that the principal has a tenant id at all. A workload principal might be in a global trust domain (no tenant), in which case the comparison fails the type system and the rule denies. The policy authoring style is to treat tenant id as a required attribute on every multi-tenant entity, and to let this forbid catch any drift.

Step-up as a policy concern

Step-up authentication is the pattern where a user is asked to re-prove identity (or to prove with a stronger factor) before performing a sensitive action. The mechanism is in the state machine (see Factors and methods §"Step-up authentication"); the policy expresses when step-up is required.

The shape:

forbid (
    principal,
    action == Action::"delete-account",
    resource
) when {
    !principal.factors_completed.contains("Fido2")
};

The rule denies the account-deletion action unless FIDO2 is in the user's completed factors. The user reaches the action with a password-and-TOTP session, gets denied, and the application offers step-up: the user completes the FIDO2 ceremony, the session's factors_completed now includes Fido2, the next request to the delete-account action passes the policy.

The pattern composes with the other styles. A permit rule says who can delete an account (RBAC: the user themselves, ReBAC: the admin who owns the user). The forbid rule adds the contextual requirement (ABAC: FIDO2 in factors_completed). The three rules together produce a policy that says "the user themselves can delete their own account, but only after completing FIDO2 in this session."

Anti-patterns

The two patterns most likely to mislead are worth naming.

The first is duplicating ReBAC as RBAC. The temptation is to materialise the ownership relationship as a per-resource role ("owner of document 123"), then write an RBAC policy that grants edit to the role. The shape works but produces an explosion of roles (one per resource), is hard to invalidate when ownership changes, and obscures the relationship that the policy is actually expressing. The right shape is to express ownership as an attribute (resource.owner == principal) and write the ReBAC policy directly.

The second is encoding state machines in policies. A workflow that allows transitions only from certain states is a state machine, not a policy. Cedar admits a rule such as permit ... when { resource.state == "draft" && action == Action::"submit" }, but writing it makes the policy set the source of truth for what the state machine allows. The right shape is to put the state machine in code (or in a typed state machine in the application), and to use Cedar only for "who can invoke this transition" rather than "which transition is valid right now."

Schema discipline

The most consequential decision in any Cedar integration is the schema. The schema names every entity type, every attribute on every entity, every action that applies to every principal-resource pair, every required and optional context key. Getting the schema right is most of the work; getting the policies right is what follows naturally from a good schema.

Three rules help:

The first is to name entities by their domain meaning, not by the table they live in. User is the right name; usersRow is the wrong name. The policies that read like English are the ones that let reviewers do their job.

The second is to declare attributes as required only when every production deployment guarantees the attribute is present. An attribute declared as required forces the entity provider to return it on every load, which often forces the application to add an INSERT default. Optional attributes are the right default; require only when the policy logically depends on it.

The third is to update the schema whenever a policy expression needs an attribute that is not yet declared. The validator catches the inconsistency at load time; the alternative is a runtime deny that is hard to debug. The schema is not optional; treat it as part of the policy set.

Further reading

Cedar policy fundamentals covers the policy lifecycle and the evaluator surface. Entity providers and request context covers the data-loading contract the policies in this chapter depend on. Audit events covers the AuthzEvent variants the evaluator emits, including the policy id that produced each decision. The Cedar documentation covers the language in full detail and is the authoritative reference for syntax and semantics.

Session lifecycle and crypto envelope

A session in axess is a server-side record that holds the authentication state, the bound principal, and any application data the session carries. The cookie that travels between the browser and the server identifies the session, but the cookie itself does not contain the session data. This separation is what lets the session shape evolve across deployments without invalidating existing cookies, and what lets the data be encrypted at rest with keys the client never sees.

This chapter walks through the lifecycle of a single session from its creation through its expiry, the cookie shape and signing, the AES-256-GCM envelope that encrypts the data at rest, the fingerprint binding that catches cookie replay, and the dirty-flag and write-back machinery that makes the lifecycle invisible to application code.

The session cookie is small. By default it carries an opaque session id (the SessionId newtype, sixteen bytes of cryptographic randomness from SecureRng, base64-encoded for transport) plus an HMAC signature computed from the id and the deployment's signing key. The whole cookie is well under two hundred bytes.

session=<base64(session_id)>.<base64(hmac_sha256(signing_key, session_id))>

The signature defeats forgery. An attacker who guesses or brute-forces a session id cannot use it without also producing the HMAC, which requires the signing key. The signing key is the operational secret covered in the Getting started chapter: a 32-byte value loaded from a secrets manager, stable across process restarts, rotated on a schedule.
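The verification path has two mechanical steps that are easy to get subtly wrong: splitting the cookie value at the separator and comparing the received signature against the recomputed one. The sketch below shows both using only the standard library. It is an illustration, not the axess implementation: split_cookie_value and constant_time_eq are hypothetical names, base64 decoding and the HMAC computation are left out, and a production deployment would normally rely on a vetted constant-time comparison from its crypto library rather than a hand-written loop.

```rust
/// Split a cookie value of the form `<id>.<signature>` into its two parts.
/// Returns None when the separator is missing, a part is empty, or a
/// second separator appears.
pub fn split_cookie_value(value: &str) -> Option<(&str, &str)> {
    let (id, sig) = value.split_once('.')?;
    if id.is_empty() || sig.is_empty() || sig.contains('.') {
        return None;
    }
    Some((id, sig))
}

/// Compare two byte strings without returning early on the first
/// difference, so timing does not reveal how many leading bytes matched.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

Any cookie that fails either step is treated exactly like an absent cookie: the request proceeds as a guest and nothing is looked up in the store.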

The cookie attributes are conservative by default: HttpOnly (client-side JavaScript cannot read it), SameSite=Lax (it is sent on top-level cross-site navigations but not on cross-site sub-requests), Path=/ (it applies to the whole application), and Secure when configured (it is only sent over HTTPS). The session lifetime is set through SessionLayer::with_ttl; the constructor default is twenty-four hours.

The cookie is opaque. The session id maps to a row in the session store, and the row carries the actual data. A user who copies the cookie has the session id and the signature, both of which the server already has; nothing on the cookie carries the user's identity, the factors completed, or any other session state.

The session store

The session store is the persistence layer for the data the cookie identifies. Each row in the store carries:

  • The session id (the primary key).
  • The serialised SessionData (covered below).
  • The created-at and updated-at timestamps.
  • The expiry timestamp.
  • The optional fingerprint binding (covered below).

SessionData is the application's view of the session:

pub struct SessionData {
    pub auth_state: AuthState,                  // see Part II
    pub principal_hint: Option<PrincipalHint>,  // cache of recent extractor outputs
    pub custom: HashMap<String, serde_json::Value>,  // application data
    pub schema_version: u32,                    // see Schema migration
}

The auth_state carries the state-machine variant (Guest, Authenticating, Authenticated, PendingWorkflow, Identifying). The principal_hint is an optional cache of the principal extracted during this session's authentication, kept on the session so the PrincipalResolver does not have to recompute it on every request. The custom map carries application-defined data with a sixty-four kilobyte cap. The schema_version is the field that lets the data shape evolve.

The serialisation format is MessagePack: faster than JSON, more compact, and stable across versions of serde. Backends that support binary blobs persist the bytes directly; backends that require text (some configurations of MySQL, for instance) encode the bytes as base64 first. The format is the same across all backends; switching backends does not require re-serialisation.

The AES-256-GCM envelope

The serialised session bytes are encrypted before storage. The envelope is AES-256-GCM, a standard authenticated-encryption scheme that produces a ciphertext, a tag, and a nonce. The encryption key is a 32-byte value loaded from a secrets manager at process start.

The shape of one envelope:

nonce (12 bytes) | ciphertext (variable) | tag (16 bytes)
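Reading that layout back is a matter of two slice splits. The following sketch assumes exactly the byte order shown above; EnvelopeParts and split_envelope are hypothetical names for illustration and no decryption is performed.

```rust
pub const NONCE_LEN: usize = 12;
pub const TAG_LEN: usize = 16;

/// Borrowed view of one stored envelope.
#[derive(Debug, PartialEq, Eq)]
pub struct EnvelopeParts<'a> {
    pub nonce: &'a [u8],
    pub ciphertext: &'a [u8],
    pub tag: &'a [u8],
}

/// Split `nonce | ciphertext | tag`.
/// Returns None when the blob is too short to hold a nonce and a tag.
pub fn split_envelope(blob: &[u8]) -> Option<EnvelopeParts<'_>> {
    if blob.len() < NONCE_LEN + TAG_LEN {
        return None;
    }
    let (nonce, rest) = blob.split_at(NONCE_LEN);
    let (ciphertext, tag) = rest.split_at(rest.len() - TAG_LEN);
    Some(EnvelopeParts { nonce, ciphertext, tag })
}
```

The length check matters: a truncated row should surface as a decode failure, never as a panic inside the session layer.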

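The nonce is generated fresh per write through SecureRng. AES-GCM is sensitive to nonce reuse (a reused nonce against the same key catastrophically compromises confidentiality and authenticity). A twelve-byte nonce is 96 bits, so by the birthday bound the chance of any collision under one key stays below about 2^-32 for up to 2^32 writes (the limit NIST SP 800-38D sets for random nonces) and only approaches one in two around 2^48 writes. That is comfortably safe for any realistic session volume, and key rotation resets the count.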
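For capacity planning, the risk can be estimated with the standard birthday approximation: for n random values drawn from a space of 2^96, the probability of at least one collision is roughly n^2 / 2^97 while that value is small. The helper below is a sketch for back-of-envelope use; nonce_collision_probability is a hypothetical name, not part of the library.

```rust
/// Approximate probability that at least two of `writes` random 96-bit
/// nonces collide, using the birthday approximation n^2 / 2^(bits + 1).
/// The approximation is only meaningful while the result is small;
/// the value is capped at 1.0.
pub fn nonce_collision_probability(writes: f64) -> f64 {
    const NONCE_BITS: i32 = 96;
    let p = writes * writes / 2f64.powi(NONCE_BITS + 1);
    p.min(1.0)
}
```

A deployment that writes a thousand sessions per second would take over a century to reach 2^32 writes under one key, so routine key rotation keeps the estimate far below any practical concern.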

The additional authenticated data (AAD) carries the session id. The binding means that an encrypted blob from one session cannot be swapped into another session's row even if an attacker can write to the database. The session id is plaintext in the cookie, so this adds no confidentiality, but it adds integrity: the database is not the source of truth for "which session is this blob from."

Key rotation is the operational lever. SessionCrypto::new(key) constructs an envelope with one current key. .with_previous_key(old_key) keeps the old key available for reads, so sessions encrypted with the old key continue to decrypt while new writes use the new key. After a transition window long enough for every existing session to be rewritten (which happens naturally over the next session write, or can be forced through a background scan), the previous key can be removed.
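The read side of rotation reduces to "try the current key, then fall back to the previous one". The sketch below shows that control flow in isolation. It is not the SessionCrypto implementation: decrypt_with_rotation is a hypothetical name, and try_decrypt is a caller-supplied stand-in for the real authenticated-decryption call, assumed to return None when the tag does not verify under the given key.

```rust
/// Try the current key first, then the previous key if one is configured.
/// Returns the plaintext plus a flag that is true when the previous key
/// was the one that worked, which signals that the row should be
/// re-encrypted under the current key on its next write.
pub fn decrypt_with_rotation<K, F>(
    current: &K,
    previous: Option<&K>,
    blob: &[u8],
    try_decrypt: F,
) -> Option<(Vec<u8>, bool)>
where
    F: Fn(&K, &[u8]) -> Option<Vec<u8>>,
{
    if let Some(plain) = try_decrypt(current, blob) {
        return Some((plain, false));
    }
    // No previous key configured, or neither key authenticates the blob.
    let plain = try_decrypt(previous?, blob)?;
    Some((plain, true))
}
```

Because an AEAD tag fails closed under the wrong key, trying keys in sequence cannot produce a wrong plaintext; the only cost is one extra decryption attempt for rows still on the old key.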

The chapter Operations runbook covers the rotation sequence and the staged rollout for both the signing key and the envelope key.

The fingerprint binding

A session id alone is not enough to defend against cookie theft. An attacker who captures a session cookie can replay it from a different browser, IP, and operating system, and the session machinery on the server cannot tell the difference without additional signal.

The fingerprint binding is the additional signal. At session creation (typically at first login), the server computes a fingerprint from the user agent header, the IP address (read through the trusted-proxy configuration), and any other coarse features the deployment chooses to include. The fingerprint is HMAC-signed and stored alongside the session id. On every subsequent request, the server recomputes the fingerprint from the incoming request and compares it (constant-time) against the stored value.

The match has three outcomes:

  • Match exactly: the session is allowed to proceed.
  • Match within a tolerance: the session is allowed to proceed, but the divergence is logged.
  • Mismatch beyond tolerance: the session is treated as compromised and one of three responses fires (warn, re-authenticate, full logout), depending on the configured policy.

The tolerance accommodates legitimate change: a user's IP can change when they switch from wifi to cellular; their user agent can update overnight when the browser auto-updates. Strict matching on either signal produces too many false positives. The default is coarse: the IP must remain within the same /24 (for IPv4) or /64 (for IPv6), and the user agent must share its major version.
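The coarse comparison can be expressed directly with the standard library's address types. This is a sketch of the default tolerance as described above, not the library's code: same_coarse_network and major_version are hypothetical names, and extracting the browser version token from a full User-Agent header is deliberately left out.

```rust
use std::net::IpAddr;

/// True when both addresses fall in the same /24 (IPv4) or /64 (IPv6).
/// Addresses of different families never match.
pub fn same_coarse_network(a: IpAddr, b: IpAddr) -> bool {
    match (a, b) {
        (IpAddr::V4(x), IpAddr::V4(y)) => x.octets()[..3] == y.octets()[..3],
        (IpAddr::V6(x), IpAddr::V6(y)) => x.octets()[..8] == y.octets()[..8],
        _ => false,
    }
}

/// Parse the leading run of digits of a version string such as "126.0.1".
/// Returns None when the string does not start with a digit.
pub fn major_version(version: &str) -> Option<u32> {
    let digits: String = version
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}
```

Treating a family change (IPv4 to IPv6) as a mismatch is a design choice in this sketch; dual-stack clients can flip families legitimately, so a deployment may prefer to classify that case as within tolerance and log it.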

The chapter Cookies, fingerprinting, hijack detection covers the configuration knobs and the trade-offs in detail.

The Tower layer

The SessionLayer is the Tower middleware that threads the session through every request. The layer's call method is the sequencing centre of the session lifecycle.

The pseudocode of one request:

async fn call(&self, mut req: Request) -> Result<Response, Error> {
    // 1. Extract the cookie (or skip if absent → Guest).
    let cookie = extract_session_cookie(&req);

    // 2. Verify the HMAC, decode the session id.
    let session_id = verify_cookie(&cookie, &self.signing_key)
        .map_err(|_| ());  // fall through to a guest session

    // 3. Load the row from the session store.
    let row = self.store.load(&session_id).await;

    // 4. Decrypt the envelope, deserialise the data.
    let data = decrypt_and_deserialize(&row, &self.crypto)?;

    // 5. Verify the fingerprint binding.
    enforce_fingerprint(&data, &req, &self.fingerprint_policy)?;

    // 6. Wrap into a SessionHandle, insert into request extensions.
    //    The id is cloned because step 8 still needs it.
    let handle = SessionHandle::new(session_id.clone(), data);
    req.extensions_mut().insert(handle.clone());

    // 7. Run the handler.
    let mut response = self.inner.call(req).await?;

    // 8. If the handle is dirty, write back.
    if handle.is_dirty() {
        let new_data = handle.into_data();
        // The session id is the AAD that binds the envelope to its row.
        let new_envelope = encrypt(&new_data, &self.crypto, &session_id);
        self.store.save(&session_id, &new_envelope).await?;
        // Reissue the cookie (with a fresh id if rotation was triggered).
        response.headers_mut().append("Set-Cookie", construct_cookie(...));
    }

    Ok(response)
}

Three of the eight steps are worth dwelling on.

Step 5 (the fingerprint check) is the gate that catches replay. Under a strict policy, a mismatched fingerprint means the handler does not run at all; the session layer returns a 401 (or the configured response). The choice of response depends on the policy: warn-only deployments log and continue; strict deployments deny.

Step 7 is where the handler actually runs. The handler receives a SessionHandle via AuthSession (the extractor), reads or mutates it, and the mutations are tracked via the dirty flag.

Step 8 is the write-back. The session is saved only when it is dirty, which means a read-only request (the dashboard, a metric endpoint, an idle-page poll) does not write to the session store. The store sees writes proportional to the rate of state changes, not the rate of requests, which is the difference between a manageable database load and a saturated one.

The dirty flag

The dirty flag is the optimisation that makes the session store viable at the read rates a real application produces. The flag is on SessionHandle and is set by any method that mutates the session: set_authenticated, clear, set_custom, and so on.

The flag is checked at step 8 in the lifecycle above. A clean handle is dropped silently; a dirty handle triggers the serialisation, encryption, store-write, and cookie-reissue path.

The trade-off is that state read through an immutable borrow never marks the handle dirty. That is the intended pattern: read-only code uses the typed accessors (is_authenticated, current_user_id, custom_get), which do not need a mutable borrow. Mutating accessors (clear, set_custom, the orchestrated begin_login and verify_factor paths) all set the flag.
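The borrow rules do most of the work here, which a stripped-down model makes visible. SketchHandle and write_back below are hypothetical stand-ins, not the axess SessionHandle; the custom map is reduced to strings and the store write is reduced to a counter.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a session handle; not the axess type.
#[derive(Debug, Default)]
pub struct SketchHandle {
    custom: HashMap<String, String>,
    dirty: bool,
}

impl SketchHandle {
    /// Read accessor: takes `&self`, so it cannot mark the handle dirty.
    pub fn custom_get(&self, key: &str) -> Option<&str> {
        self.custom.get(key).map(String::as_str)
    }

    /// Mutating accessor: takes `&mut self` and always sets the flag.
    pub fn set_custom(&mut self, key: &str, value: &str) {
        self.custom.insert(key.to_owned(), value.to_owned());
        self.dirty = true;
    }

    pub fn is_dirty(&self) -> bool {
        self.dirty
    }
}

/// Write-back step: increments the counter only for a dirty handle.
pub fn write_back(handle: &SketchHandle, store_writes: &mut u32) {
    if handle.is_dirty() {
        *store_writes += 1;
    }
}
```

A request that only calls custom_get leaves the counter untouched; the first set_custom flips the flag and the next write-back issues exactly one store write.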

The cookie is reissued only when the session id rotates, not on every write. Identifier rotation happens at two automatic moments (Guest → authenticated to defeat fixation; logout so the new Guest session doesn't share an id with the old) plus explicit re-issuance through AuthSession::regenerate. The routine read-write-read cycle does not rotate.

regenerate exists for the cases the library can't infer on its own: any handler that crosses a privilege boundary should call it before responding. The canonical list (drawn from OWASP ASVS V3, the OWASP Session Management Cheat Sheet, and NIST SP 800-63B on AAL transitions):

Boundary                                              Rotate session id?          Also revoke sibling sessions?
----------------------------------------------------  --------------------------  -----------------------------
Primary login                                         automatic                   optional
Logout                                                automatic (id invalidated)  depends
MFA factor added (TOTP, WebAuthn, recovery codes, …)  yes                         optional
MFA factor removed or disabled (AAL drops)            yes                         recommended
Password / primary credential change                  yes                         strongly recommended
Step-up to a higher assurance level                   yes                         -
Account recovery flow completion                      yes                         yes
Impersonation start / stop                            yes                         -
Role grant / revoke, scope change                     yes                         depends on direction
Tenant switch in a multi-tenant deployment            yes                         -
Profile edit, theme change, factor config tuning      no                          -

Rotating does two things at once: it defeats fixation (any pre-existing id, including one an attacker planted before the boundary, becomes useless), and it caps the blast radius of a captured pre-elevation cookie (a cookie stolen at AAL1 cannot ride the new AAL2 binding). Sibling-session revocation (SessionRegistry::revoke_user_sessions) is a strictly stronger statement that matters most on credential changes, where any other device holding a stale password-derived session must be cut off.

A library hook on FactorStore::save_factor would catch some of the rows above and miss the rest (un-enrolment, password change, role grants), and would misfire on factor-config tuning that is not a privilege change. The boundary decision is necessarily app-level. Call regenerate at the handler that knows.

When the session expires

The session has two expiry mechanisms. The first is the cookie's own Max-Age attribute, which the browser enforces: after the configured TTL, the browser stops sending the cookie. The second is the session store's expiry timestamp, which the server enforces: after the timestamp passes, the store returns the row as expired (or the cleanup sweep removes it altogether).

Both are needed. The cookie expiry handles the browser-side case (the user closes the browser, the cookie is forgotten); the server-side expiry handles the case where the cookie outlives the session's intended lifetime (an attacker captures a cookie and replays it after the user's session would have expired).

The expiry is sliding by default: every dirty write updates the expiry timestamp, so an actively-used session keeps refreshing. The maximum lifetime is the configured TTL from the most recent write. A session that goes idle for the TTL expires; a session that gets a single dirty write per TTL window never expires (through ordinary use).

Some deployments want a hard cap: a session expires absolutely at a fixed time after creation, regardless of activity. The SessionLayer::with_absolute_ttl option enables this; the absolute expiry is stored at session creation and is not refreshed. The two TTLs (sliding and absolute) compose: the session expires at the earlier of the two.
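The composition of the two TTLs is a single min. The sketch below works in unix seconds; effective_expiry and is_expired are hypothetical helper names for illustration, not library functions.

```rust
/// Compute when a session expires, in unix seconds.
/// `absolute_ttl` is None when no hard cap is configured.
pub fn effective_expiry(
    created_at: u64,
    last_write_at: u64,
    sliding_ttl: u64,
    absolute_ttl: Option<u64>,
) -> u64 {
    // Sliding expiry moves forward with every dirty write.
    let sliding = last_write_at.saturating_add(sliding_ttl);
    match absolute_ttl {
        // The hard cap is fixed at creation; the earlier of the two wins.
        Some(cap) => sliding.min(created_at.saturating_add(cap)),
        None => sliding,
    }
}

/// A session is expired once `now` reaches the expiry timestamp.
pub fn is_expired(now: u64, expiry: u64) -> bool {
    now >= expiry
}
```

For a session created at t=1000 and last written at t=5000 with a one-hour sliding TTL, the expiry is 8600 without a cap and 7000 with an absolute TTL of 6000 seconds.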

Session cleanup

Expired sessions need to be removed from the store. The cleanup is the application's responsibility (axess does not run a background task on its own), but the patterns are uniform across backends.

The SQL backends expose a cleanup_expired method that deletes rows whose expiry timestamp has passed. The examples/sqlite/ reference application runs this on a tokio::interval once per hour; the interval is tunable.

The Valkey backend uses Valkey's native TTL: each row is written with an expiry, and Valkey removes it automatically. There is no cleanup task to write because the database does the work.

For deployments with millions of sessions, the cleanup pattern matters operationally. A daily delete-by-range is fine for tens of thousands; for millions, the delete needs to be incremental (a limit clause, looping through batches) to avoid long-running transactions that lock the table.
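The incremental pattern is a loop that deletes a bounded batch and stops when a batch comes back short. The sketch below models the store as a vector of expiry timestamps so the loop can be read in isolation; delete_expired_batch stands in for a bounded delete statement against the real backend, and both function names are hypothetical.

```rust
/// Delete at most `limit` expired entries; returns how many were removed.
/// Stand-in for one bounded delete against the session table.
pub fn delete_expired_batch(expiries: &mut Vec<u64>, now: u64, limit: usize) -> usize {
    let mut removed = 0;
    expiries.retain(|&expires_at| {
        if removed < limit && expires_at <= now {
            removed += 1;
            false // drop this row
        } else {
            true // keep this row
        }
    });
    removed
}

/// Run batches until one removes fewer rows than the limit.
/// Returns (total rows removed, number of batches issued).
pub fn cleanup_in_batches(expiries: &mut Vec<u64>, now: u64, limit: usize) -> (usize, usize) {
    if limit == 0 {
        return (0, 0); // a zero limit would never make progress
    }
    let mut total = 0;
    let mut batches = 0;
    loop {
        let removed = delete_expired_batch(expiries, now, limit);
        total += removed;
        batches += 1;
        if removed < limit {
            break;
        }
        // A real task would yield or sleep here to let other queries through.
    }
    (total, batches)
}
```

Each batch is its own short transaction in a real backend, so locks are held briefly and the sweep can be interrupted and resumed without losing work.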

What this enables

The lifecycle as designed makes session handling invisible to application code. The handler reads AuthSession, mutates it (or does not), and the framework handles the cookie, the serialisation, the encryption, the fingerprint check, the write-back, and the expiry. The application's surface area for session bugs is small: most session-related issues are policy choices (rotate too aggressively, lockout too strict, fingerprint tolerance too tight), not bugs in the lifecycle itself.

The chapter Backends covers the storage backends in detail; the chapter Cookies, fingerprinting, hijack detection covers the fingerprint binding in detail; the chapter Schema migration covers the SessionData::schema_version field and what happens when the data shape changes between deployments.

Further reading

Backends: SQLite, Postgres, MySQL, Valkey covers the four first-party session stores and their feature-flag and dialect notes. Cookies, fingerprinting, hijack detection covers the configuration knobs for the fingerprint and the trusted-proxy configuration that determines how IP is read. Schema migration covers the SessionData::schema_version field. Operations runbook covers signing-key and envelope-key rotation.

Backends: SQLite, Postgres, MySQL, Valkey

Axess ships four first-party session storage backends. The choice between them is the operational decision the deployment makes when it picks a database, not a technical decision the application code needs to revisit. This chapter covers the capability matrix, the configuration shape per backend, and the operational notes that have caught real deployments by surprise.

The feature flags are sqlite, postgres, mysql, and valkey, all off by default. Enable the one your deployment uses.

What the backends actually do

A session storage backend implements the SessionStore trait. The trait is small on purpose: it offers a key-value-with-TTL surface plus a handful of session-specific verbs the typical application needs.

#[async_trait]
pub trait SessionStore: Send + Sync {
    async fn load(&self, id: &SessionId) -> Result<Option<SessionRow>, StoreError>;
    async fn save(&self, row: &SessionRow) -> Result<(), StoreError>;
    async fn delete(&self, id: &SessionId) -> Result<(), StoreError>;
    async fn cycle(&self, old: &SessionId, new: &SessionId) -> Result<(), StoreError>;
    async fn cleanup_expired(&self) -> Result<usize, StoreError>;
    async fn find_sessions_for_user(
        &self,
        user_id: &UserId,
        tenant_id: &TenantId,
    ) -> Result<Vec<SessionId>, StoreError>;
}

The verbs map to operations the lifecycle in the previous chapter exercises. load retrieves a session by id. save writes a dirty session. delete removes a session on logout. cycle atomically rotates the session id (used at the Guest to Authenticated transition, and at sensitive step-up points). cleanup_expired removes rows whose expiry has passed. find_sessions_for_user is the verb behind "log this user out of all sessions" admin operations.

The implementations differ in how they store the rows and how they implement the verbs, but the surface is the same.
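To make the verbs concrete, here is a minimal synchronous sketch of their semantics over a HashMap. It is not the axess in-memory backend: MemoryStore, the simplified SessionId and SessionRow, the explicit now parameter, and the u64 timestamps are all assumptions made for illustration, and the real trait is async and returns Result.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the axess types; the real ones carry more fields.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SessionId(pub String);

#[derive(Clone, Debug, PartialEq)]
pub struct SessionRow {
    pub id: SessionId,
    pub user_id: String,
    pub expires_at: u64, // seconds since the epoch
    pub data: Vec<u8>,
}

#[derive(Default)]
pub struct MemoryStore {
    rows: HashMap<SessionId, SessionRow>,
}

impl MemoryStore {
    /// Returns the row only if it has not expired at `now`.
    pub fn load(&self, id: &SessionId, now: u64) -> Option<SessionRow> {
        self.rows.get(id).filter(|r| r.expires_at > now).cloned()
    }

    pub fn save(&mut self, row: SessionRow) {
        self.rows.insert(row.id.clone(), row);
    }

    pub fn delete(&mut self, id: &SessionId) {
        self.rows.remove(id);
    }

    /// Moves the row from `old` to `new` in one step, so no caller can
    /// observe both ids (or neither) as valid. Returns false if `old` is absent.
    pub fn cycle(&mut self, old: &SessionId, new: &SessionId) -> bool {
        match self.rows.remove(old) {
            Some(mut row) => {
                row.id = new.clone();
                self.rows.insert(new.clone(), row);
                true
            }
            None => false,
        }
    }

    /// Removes every expired row and reports how many were removed.
    pub fn cleanup_expired(&mut self, now: u64) -> usize {
        let before = self.rows.len();
        self.rows.retain(|_, r| r.expires_at > now);
        before - self.rows.len()
    }

    pub fn find_sessions_for_user(&self, user_id: &str) -> Vec<SessionId> {
        self.rows
            .values()
            .filter(|r| r.user_id == user_id)
            .map(|r| r.id.clone())
            .collect()
    }
}
```

A SQL backend would express cycle as a single UPDATE inside a transaction, and cleanup_expired as a DELETE filtered on the expiry column; the observable behaviour is the same.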

Capability matrix

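Capability               | Memory    | SQLite             | Postgres           | MySQL              | Valkey
-------------------------|-----------|--------------------|--------------------|--------------------|-------------------
Required feature         | always-on | sqlite             | postgres           | mysql              | valkey
Encryption at rest       | none      | optional (AES-GCM) | optional (AES-GCM) | optional (AES-GCM) | optional (AES-GCM)
Cluster-safe             | no        | with care          | yes                | yes                | yes
Native TTL               | n/a       | manual sweep       | manual sweep       | manual sweep       | yes
Session registry support | yes       | adopter            | adopter            | adopter            | yes
Schema migration         | n/a       | sqlx-migrate       | sqlx-migrate       | sqlx-migrate       | none needed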

The encryption-at-rest column is the AES-256-GCM envelope from the previous chapter. The application configures it with a 32-byte key; the backend wraps the envelope around the serialised session data before writing. The envelope is optional because some deployments accept the unencrypted at-rest store (when the database is itself encrypted, when the threat model does not require it), and decrypting on every read costs a few microseconds per session. The recommendation for production is to enable encryption unless the deployment has a specific reason not to.

The cluster-safe column says whether multiple application instances can share the same backend without coordination issues. SQLite is single-writer; a deployment with one application instance behind a load balancer is fine, but multiple instances need to share the SQLite file over a filesystem the database supports (which is operating-system-dependent and risky). Postgres, MySQL, and Valkey are cluster-safe out of the box.

The native TTL column says whether the database has a native mechanism for removing expired rows. SQLite, Postgres, and MySQL do not; the application runs a periodic cleanup task. Valkey expires keys automatically as they age past their TTL, which means the cleanup task is unnecessary.

SQLite

The SQLite backend is right for development, for tests, for single-instance production deployments, and for embedded-style applications where the database lives on the same machine as the application.

Configuration:

use axess::backends::sqlite::SessionStore;
use axess::session::SessionCrypto;
use sqlx::sqlite::SqlitePoolOptions;

let pool = SqlitePoolOptions::new()
    .max_connections(5)
    .connect("sqlite:axess.db")
    .await?;

let crypto = SessionCrypto::new(envelope_key);  // optional encryption
let store = SessionStore::new(pool.clone(), crypto);
store.init_schema().await?;

init_schema creates the sessions table and the indexes the backend needs. It is idempotent; calling it on a database that already has the table is a no-op.

The cleanup pattern is a background task that runs store.cleanup_expired on an interval, typically once per hour. The examples/sqlite/ reference application demonstrates this in main.rs.

The operational notes:

  • SQLite locks on writes. It admits one writer at a time regardless of pool size; the max_connections setting on the pool bounds how many connections are open, and additional writers wait for the lock. WAL mode (set through the connection options; sqlx exposes it as SqliteConnectOptions::journal_mode) is what enables concurrent reads alongside a write. Use WAL mode for any deployment that has more than one request at a time.

  • The schema migration story is sqlx::migrate!: the migrations directory under the application is the source of truth, and the application runs them against the pool at startup. Axess does not ship migrations of its own; init_schema is enough.

  • Backups: a SQLite session store can be backed up with the standard sqlite3 .backup command, which works against a live database. The session data is encrypted at rest if the envelope is configured, so a backup carries the same security posture as the live data.

Postgres

Postgres is the right backend for most production deployments. It is cluster-safe, has good concurrency, supports JSONB if a deployment wants to index into the session's custom map, and is the most-tested backend in axess after SQLite.

Configuration:

use axess::backends::postgres::SessionStore;
use axess::session::SessionCrypto;
use sqlx::postgres::PgPoolOptions;

let pool = PgPoolOptions::new()
    .max_connections(20)
    .connect("postgres://app@db:5432/axess")
    .await?;

let store = SessionStore::new(pool.clone(), SessionCrypto::new(envelope_key));
store.init_schema().await?;

The pool sizing depends on the application's request rate; twenty is a reasonable starting point for a single application instance, multiplied by the number of instances and tuned against the database's max_connections setting.
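The arithmetic behind that tuning can be sketched as a small helper. The function name and the idea of reserving a fixed number of connections for migrations, admin sessions, and monitoring are assumptions for illustration, not part of axess or sqlx.

```rust
/// Largest per-instance pool size that keeps the whole fleet within the
/// database's connection limit after setting `reserved` connections aside.
/// Returns None when no connections are left for the application.
pub fn max_pool_size(db_max_connections: u32, reserved: u32, instances: u32) -> Option<u32> {
    if instances == 0 {
        return None;
    }
    // checked_sub yields None if the reservation exceeds the limit.
    let available = db_max_connections.checked_sub(reserved)?;
    let per_instance = available / instances;
    if per_instance == 0 {
        None
    } else {
        Some(per_instance)
    }
}
```

With a stock Postgres limit of 100 connections and 10 held in reserve, four instances can each take a pool of 22, while five instances have to drop to 18, which is below the starting point of twenty.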

The operational notes:

  • The init_schema call creates the sessions table with an index on the expiry timestamp (for cleanup_expired) and on the user id plus tenant id (for find_sessions_for_user). The indexes are essential at any meaningful scale; do not remove them.

  • CockroachDB is wire-compatible with Postgres and works against this backend with one caveat: Cockroach's lock semantics differ in edge cases (a SELECT ... FOR UPDATE pattern that works on Postgres can produce different behaviour on Cockroach). The axess CI runs the Postgres integration suite against Cockroach to catch divergence; the failures that have surfaced are noted in this chapter when they affect adopter code.

  • Postgres extensions: pgcrypto can be used as an alternative to the AES-GCM envelope, but the axess envelope is faster (the encryption happens in the application before the network write, not on the database side) and uses the same key as other axess encryption. Stick with the envelope unless a specific deployment reason argues for pgcrypto.

MySQL

The MySQL backend is right for deployments where MySQL is the already-deployed database. The capability surface is the same as Postgres, with a handful of dialect differences that affect the implementation but not the application.

Configuration:

use axess::backends::mysql::SessionStore;
use axess::session::SessionCrypto;
use sqlx::mysql::MySqlPoolOptions;

let pool = MySqlPoolOptions::new()
    .max_connections(20)
    .connect("mysql://app@db:3306/axess")
    .await?;

let store = SessionStore::new(pool.clone(), SessionCrypto::new(envelope_key));
store.init_schema().await?;

The operational notes:

  • The dialect differences from Postgres are mostly invisible: ON CONFLICT DO UPDATE becomes ON DUPLICATE KEY UPDATE, the placeholder syntax shifts from $1 to ?, datetime precision defaults to seconds rather than microseconds. Axess handles all three internally; the application code is identical.

  • MariaDB 10.x and later versions are compatible with the same schema and the same SQL. The CI runs against both MySQL 8.x and MariaDB 10.x.

  • Timezone handling differs. MySQL stores DATETIME values as naive timestamps with no time zone attached, so a stored value means whatever zone the writer assumed; the backend serialises expiries as UTC and reads them back as UTC, sidestepping the implicit-conversion trap.

  • Connection options: pool sizing is the same as Postgres. MySQL has a default wait_timeout of eight hours, after which idle connections are closed; the sqlx pool handles reconnection automatically, but be aware of the setting if connection-state matters to your application.

Valkey

The Valkey backend is right for deployments where a Redis-style key-value store is already present in the architecture, or for deployments where the session-store load is high enough that the overhead of a relational database is undesirable. Valkey's TTL mechanic makes session expiry automatic: the cleanup task is not needed.

Configuration:

use axess::backends::valkey::SessionStore;

let client = redis::Client::open("redis://valkey:6379")?;
let store = SessionStore::new(client, SessionCrypto::new(envelope_key));

The Valkey backend does not need a schema initialisation; the keys are written directly with TTLs.

The operational notes:

  • Cluster mode: the Valkey client supports cluster mode through the cluster feature of the underlying redis crate. The keys axess writes are prefixed (axess:session:, axess:registry:, ...) so cluster sharding by key works without conflict.

  • Persistence: Valkey can be configured for in-memory only, for RDB snapshots, or for AOF (append-only file) durability. The axess session store is fine on any of the three; the choice trades latency against durability. For sessions specifically, in-memory is acceptable if the deployment tolerates losing all active sessions on a Valkey restart; AOF is the standard choice when sessions matter.

  • The session registry: Valkey is the only one of the four persistent backends with a built-in session registry (the verb behind "list all sessions for a user", which the SQL backends require an adopter to wire up). The registry uses a sorted set keyed by user id, with the session ids as members and the expiry timestamp as the score. Operations on the registry are O(log n) in the number of sessions per user.

  • Eviction policy: Valkey under memory pressure can evict keys. If the eviction policy is allkeys-lru, the session store can lose sessions before their TTL fires. The recommendation is to configure Valkey with volatile-lru (only TTL'd keys are candidates for eviction, which protects keys that never expire), and to monitor the eviction rate. Session keys carry TTLs and so remain candidates under volatile-lru, which is why the monitoring matters: if evictions happen at all, the Valkey instance is undersized; scale up before the user-visible behaviour becomes painful.
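The sorted-set shape of the registry can be modelled in memory with an ordered set of (expiry, session id) pairs per user. This is a sketch of the data structure only, not the axess implementation: Registry, add, and live_sessions are hypothetical names, and the pruning step plays the role that a remove-by-score-range command would play against the real sorted set.

```rust
use std::collections::{BTreeSet, HashMap};

/// Per-user registry: an ordered set of (expiry, session id) pairs, which
/// mirrors a sorted set whose score is the expiry timestamp.
/// Unlike a real sorted set, re-adding a session with a new expiry adds a
/// second pair; a fuller model would remove the old pair first.
#[derive(Default)]
pub struct Registry {
    by_user: HashMap<String, BTreeSet<(u64, String)>>,
}

impl Registry {
    /// Inserting into the ordered set is O(log n) in the user's session count.
    pub fn add(&mut self, user_id: &str, session_id: &str, expires_at: u64) {
        self.by_user
            .entry(user_id.to_string())
            .or_default()
            .insert((expires_at, session_id.to_string()));
    }

    /// Drops members whose expiry is at or before `now`, then lists the
    /// remaining session ids in expiry order.
    pub fn live_sessions(&mut self, user_id: &str, now: u64) -> Vec<String> {
        let Some(set) = self.by_user.get_mut(user_id) else {
            return Vec::new();
        };
        // split_off keeps every pair >= the pivot. The empty string sorts
        // before any real id, so the pivot admits all pairs with expiry > now.
        let live = set.split_off(&(now.saturating_add(1), String::new()));
        *set = live;
        set.iter().map(|(_, id)| id.clone()).collect()
    }
}
```

Because the score is the expiry, pruning is a range operation on one end of the set rather than a scan, which is what keeps "log this user out everywhere" cheap even for users with many sessions.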

Choosing between them

The decision tree is short.

If the deployment already has a database, use the matching backend. Postgres for Postgres, MySQL for MySQL, Valkey for Redis or Valkey.

If the deployment is starting fresh and the application is single-instance, SQLite is the simplest choice and works fine for small-to-medium scale.

If the deployment is multi-instance and starting fresh, Postgres is the conservative default. The operational tooling for Postgres is mature, the backup story is well understood, and the schema flexibility leaves room for future extensions.

If the deployment expects very high session throughput (tens of thousands of writes per second, or pathologically high read rates), Valkey is the choice. The latency is the lowest of the four, the TTL mechanic removes the cleanup task, and the cluster scaling is proven.

The choice is not irreversible. The session storage backend is behind a trait, the data shape is uniform across backends, and migrating between backends is a matter of reading from the old store and writing to the new one (during a deploy window where both are active, or with a one-time migration script that runs against a paused application). No data shape changes; the migration is purely operational.

Cross-backend Store<K, V> access

All four backends also implement the generic axess_core::store::Store<SessionId, SessionData> trait. This matters for adopters who want backend-agnostic access (test doubles, ops endpoints that work against any deployment, code that needs to switch backends at runtime).

use axess::store::Store;
use std::sync::Arc;

async fn dump_session(
    store: Arc<dyn Store<SessionId, SessionData>>,
    id: &SessionId,
) -> Option<SessionData> {
    store.get(id).await.ok().flatten()
}

The generic trait omits session-domain operations (cycle, find_sessions_for_user). Code that needs those operations uses the concrete SessionStore trait directly; code that only needs key-value-with-TTL semantics uses Store.

The duplication is deliberate. The generic trait is the common denominator across backends; the specific trait carries the session vocabulary. Offering both lets each call site use the narrowest surface it needs.

Further reading

Session lifecycle and crypto envelope covers the full lifecycle that exercises these backends. Cookies, fingerprinting, hijack detection covers the cookie attributes and the fingerprint binding the backends store. Schema migration covers what happens when the session data shape changes between deployments. Operations runbook covers signing-key and envelope-key rotation across all four backends.

Cookies, fingerprinting, hijack detection

The session cookie is the credential a browser presents on every request. If an attacker captures it, they can act as the user until the session expires or is revoked. The defences are layered: cookie attributes constrain how the browser handles the cookie, HMAC signing detects tampering, fingerprint binding catches replay from a different browser, and trusted-proxy configuration controls how the application reads the request's IP. This chapter covers each layer.

The session cookie carries five attributes the deployment cares about. Most have defaults that are right for production; one (Secure) needs explicit attention, because it has to be relaxed for local development over plain HTTP.

Path=/ makes the cookie apply to the whole application. The alternative (a narrower path) is occasionally useful for embedded deployments where the application lives under a sub-path of a larger site; for most deployments, the root path is right.

HttpOnly prevents client-side JavaScript from reading the cookie. The attribute defeats one class of cross-site scripting attack: an attacker who injects JavaScript into the page cannot read the session cookie through document.cookie and exfiltrate it. The attribute is on by default and there is rarely a reason to turn it off.

SameSite controls when the browser sends the cookie on cross-site requests. There are three values:

  • Strict means the cookie is sent only on same-site requests. A link from an external site to your application produces a guest-state request even if the user is logged in; the user must navigate from within your site for the session to be recognised.
  • Lax (the default) means the cookie is sent on top-level cross-site navigations (a link click) but not on cross-site sub-requests (an embedded image, an XHR). The combination defeats most CSRF attacks while preserving the user experience of "click an external link, arrive logged in."
  • None means the cookie is sent on every cross-site request. This is the right setting when the application is embedded in iframes on third-party sites; it is the wrong setting otherwise.

The recommendation is Lax for most deployments. Switch to Strict for the highest-sensitivity applications; the cost is the user-experience friction of cross-site link arrivals not being logged in.

Secure requires HTTPS. The cookie is sent only on TLS-protected connections; a misconfigured load balancer that accepts cleartext HTTP does not see the cookie. The attribute is non-negotiable for production but breaks localhost development against http://, which is why SessionLayer::with_secure(false) exists as a development concession.

The Max-Age (the cookie's lifetime in seconds) matches the session TTL from SessionLayer::with_ttl. The browser stops sending the cookie after the lifetime expires; the server-side session has its own expiry that the lifecycle layer also enforces.
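Put together, the five attributes produce a Set-Cookie header along the following lines. This is an illustrative sketch rather than the code the session layer runs: CookieAttrs, set_cookie_header, and the cookie name used in the example are hypothetical.

```rust
#[derive(Clone, Copy)]
pub enum SameSite {
    Strict,
    Lax,
    None,
}

pub struct CookieAttrs {
    pub secure: bool,
    pub same_site: SameSite,
    pub max_age_secs: u64,
}

/// Renders the header value with Path=/ and HttpOnly always present.
pub fn set_cookie_header(name: &str, value: &str, attrs: &CookieAttrs) -> String {
    let same_site = match attrs.same_site {
        SameSite::Strict => "Strict",
        SameSite::Lax => "Lax",
        SameSite::None => "None",
    };
    let mut header = format!(
        "{name}={value}; Path=/; HttpOnly; SameSite={same_site}; Max-Age={}",
        attrs.max_age_secs
    );
    // Browsers reject SameSite=None without Secure, so force it on in that case.
    if attrs.secure || matches!(attrs.same_site, SameSite::None) {
        header.push_str("; Secure");
    }
    header
}
```

For a one-hour session named sid, the production configuration renders as "sid=<value>; Path=/; HttpOnly; SameSite=Lax; Max-Age=3600; Secure". Dropping Secure for local development removes only the final attribute.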

HMAC signing

The cookie carries an HMAC signature computed from the session id and the deployment's signing key. The format is:

<base64(session_id)>.<base64(hmac_sha256(signing_key, session_id))>

The signature defeats forgery and tampering. An attacker who guesses a session id (or who tries to mutate an existing cookie) cannot produce a valid signature without the signing key. The server rejects any cookie whose signature does not validate; the session is not loaded and the request proceeds as Guest.

The HMAC verification is constant-time. The constant-time comparison defeats a timing attack in which an attacker measures response latency to learn how many leading bytes of a guessed signature are correct, and so recovers a valid signature one byte at a time.
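The two mechanical steps around the HMAC itself, splitting the cookie and comparing signatures without an early exit, can be sketched with the standard library alone. The function names are hypothetical, and the HMAC computation is deliberately left out. Production code should prefer a vetted constant-time primitive (the subtle crate is one option), because a compiler is free to optimise a hand-written loop.

```rust
/// Compares two byte strings without returning early at the first mismatch.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // The length of a MAC is public, so an early return on length leaks nothing.
    if a.len() != b.len() {
        return false;
    }
    // Accumulate the differences so every byte is visited regardless of content.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Splits "<base64 id>.<base64 signature>" at the last dot.
/// Returns None when either half is missing.
pub fn split_cookie(value: &str) -> Option<(&str, &str)> {
    value
        .rsplit_once('.')
        .filter(|(id, sig)| !id.is_empty() && !sig.is_empty())
}
```

The dot is a safe separator because it is not part of the base64 alphabet, so neither half can contain one.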

Signing-key rotation is the operational lever for replacing the signing key without invalidating active sessions. The pattern is covered in Operations runbook. The short version: SessionLayer::with_previous_key accepts the old key, cookies signed with the old key continue to validate, and the layer signs every new cookie with the new key. After enough time has passed for all old cookies to expire, the previous key is removed.

The fingerprint binding

The fingerprint is the additional signal that catches session replay from a different browser. The mechanism takes a few coarse features of the request (the user agent, the IP address, sometimes the accept-language), HMACs them together with a deployment-level pepper, and stores the result alongside the session.

let fingerprint = hmac_sha256(
    fingerprint_pepper,
    format!("{}|{}|{}",
        user_agent,
        client_ip,
        accept_language,
    ),
);

The choice of features is deliberate. They are coarse enough that the legitimate user's browser produces the same fingerprint across ordinary requests (the user agent does not change between requests, the IP is within the same prefix, the accept-language is stable), and specific enough that an attacker replaying the cookie from a different machine produces a different fingerprint.

The tolerance is the operational lever. Strict matching produces too many false positives (a user switching from wifi to cellular sees their IP change, a browser auto-update changes the user agent string). Coarse matching produces too few signals to detect replay. The default tolerance:

  • IP: same /24 for IPv4, same /64 for IPv6.
  • User agent: same major version of the same browser.
  • Accept-language: same primary language.

A request that matches within the tolerance passes. A request that diverges beyond it produces an event the policy decides what to do with.
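A keyed hash either matches or it does not, so one plausible way to get the tolerance described above is to coarsen each feature before hashing it. The sketch below shows that normalisation step under that assumption; it is not confirmed to be how axess implements the tolerance. The function names are hypothetical, user-agent parsing is assumed to have happened already (the browser family and major version arrive as inputs), and the HMAC step is omitted.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Truncates an address to the tolerance prefix: /24 for IPv4, /64 for IPv6.
pub fn coarse_ip(ip: IpAddr) -> IpAddr {
    match ip {
        IpAddr::V4(v4) => {
            let [a, b, c, _] = v4.octets();
            IpAddr::V4(Ipv4Addr::new(a, b, c, 0))
        }
        IpAddr::V6(v6) => {
            let s = v6.segments();
            IpAddr::V6(Ipv6Addr::new(s[0], s[1], s[2], s[3], 0, 0, 0, 0))
        }
    }
}

/// Keeps the primary language subtag of the first Accept-Language entry,
/// so "en-GB,en;q=0.9" and "en-US" both become "en".
pub fn primary_language(accept_language: &str) -> String {
    accept_language
        .split(',')
        .next()
        .unwrap_or("")
        .split(|c: char| c == '-' || c == ';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase()
}

/// Builds the string that would be fed to the keyed hash.
pub fn fingerprint_input(
    browser_family: &str,
    major_version: u32,
    ip: IpAddr,
    accept_language: &str,
) -> String {
    format!(
        "{}/{}|{}|{}",
        browser_family,
        major_version,
        coarse_ip(ip),
        primary_language(accept_language)
    )
}
```

Two requests from 203.0.113.7 and 203.0.113.200 with the same browser major version then hash the same input, while a replay from a different network does not.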

The policy has three options:

  • FingerprintPolicy::Warn logs the mismatch and lets the request proceed. This is the right setting during initial rollout when the tolerance is being calibrated; the logs show how often legitimate users trigger mismatches, and the tolerance can be adjusted.

  • FingerprintPolicy::Reauth returns 401 and clears the session. The user has to log in again. This is the right setting for high-sensitivity actions; the user accepts the friction of re-authentication in exchange for the assurance that a captured cookie cannot be used to take over the session.

  • FingerprintPolicy::Revoke deletes the session entirely. The user is logged out, and their other sessions remain. This is the right setting when fingerprint mismatch is a strong signal of compromise; the deployment treats it as the user being hijacked and ends the session immediately.

The default is Warn. Move to Reauth once the rate of Warn events on legitimate traffic is low enough to accept.
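The three options reduce to a small decision function. The sketch below restates the behaviour described above; the Outcome type and on_request are hypothetical, and only the FingerprintPolicy variant names come from the text.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FingerprintPolicy {
    Warn,
    Reauth,
    Revoke,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The request continues; a mismatch is logged when the flag is set.
    Proceed { log_mismatch: bool },
    /// HTTP 401, session cleared, user must log in again.
    Unauthorized,
    /// The session is deleted from the store; other sessions are untouched.
    SessionDeleted,
}

/// Maps a fingerprint check result and the configured policy to an outcome.
pub fn on_request(policy: FingerprintPolicy, fingerprint_matches: bool) -> Outcome {
    if fingerprint_matches {
        return Outcome::Proceed { log_mismatch: false };
    }
    match policy {
        FingerprintPolicy::Warn => Outcome::Proceed { log_mismatch: true },
        FingerprintPolicy::Reauth => Outcome::Unauthorized,
        FingerprintPolicy::Revoke => Outcome::SessionDeleted,
    }
}
```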

The pepper is a deployment-level secret stored alongside the session signing key. It defeats fingerprint synthesis: an attacker who knows the features (the user's IP, their user agent) cannot construct the fingerprint without the pepper, so they cannot adjust their replay to match.

Trusted-proxy configuration

The fingerprint depends on the request's IP being accurate. In many deployments the application sits behind one or more proxies (a load balancer, a CDN, a WAF), and the request's source IP is the proxy's IP, not the user's. The user's IP is in a forwarded header like X-Forwarded-For.

Reading the forwarded header is necessary but dangerous. A deployment that trusts the header without checking the source can be spoofed: a request directly to the application with a forged X-Forwarded-For header will be treated as if it came through the proxy.

The defence is the trusted-proxy configuration. The application configures which source IPs are trusted to set the header; the session layer reads the header only when the immediate request came from one of those IPs.

let layer = SessionLayer::new(store, signing_key)
    .with_trusted_proxies(vec![
        "10.0.0.0/8".parse().unwrap(),    // internal load balancer
        "172.16.0.0/12".parse().unwrap(), // VPN range
    ])
    .with_forwarded_header(ForwardedHeader::XForwardedFor);

The configuration accepts a list of CIDR ranges that the deployment trusts. Requests from inside any range have their X-Forwarded-For read; requests from outside any range use the immediate connection's IP.

The forwarded-header choice is the application's. The de facto standard is X-Forwarded-For (a comma-separated list of IPs, the first being the original client), but some deployments use Forwarded (RFC 7239, the standardised form) or a proxy-specific header. Axess supports all three; configure the one the deployment uses.

For multi-hop proxy chains, the rule is the same. Suppose the request travelled client → lb → cdn → application: the application's immediate peer is the CDN, and the X-Forwarded-For header the CDN forwards lists the original client followed by the LB. If both the CDN and the LB are in trusted ranges, the application takes the leftmost IP from the header (the original client). If only the immediate peer is trusted, the application takes the rightmost IP from the header (the next hop back, here the LB).
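One common way to implement this rule is to walk the header from the right and stop at the first address that is not a trusted proxy. The sketch below takes that approach as an assumption; it is not confirmed to be the axess algorithm, Cidr4 and client_ip are hypothetical names, and IPv6 ranges are left out to keep it short.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// An IPv4 CIDR range such as 10.0.0.0/8.
#[derive(Clone, Copy)]
pub struct Cidr4 {
    pub base: Ipv4Addr,
    pub prefix: u8,
}

impl Cidr4 {
    pub fn contains(&self, ip: IpAddr) -> bool {
        let IpAddr::V4(v4) = ip else { return false };
        // A /0 mask must be special-cased: shifting a u32 by 32 overflows.
        let mask = if self.prefix == 0 {
            0
        } else {
            u32::MAX << (32 - u32::from(self.prefix.min(32)))
        };
        (u32::from(v4) & mask) == (u32::from(self.base) & mask)
    }
}

/// Picks the address to fingerprint: the rightmost header entry that is not
/// a trusted proxy, or the peer itself when the header cannot be trusted.
pub fn client_ip(peer: IpAddr, x_forwarded_for: Option<&str>, trusted: &[Cidr4]) -> IpAddr {
    let is_trusted = |ip: IpAddr| trusted.iter().any(|c| c.contains(ip));
    // An untrusted peer may have forged the header, so ignore it entirely.
    if !is_trusted(peer) {
        return peer;
    }
    let Some(header) = x_forwarded_for else { return peer };
    let mut candidate = peer;
    // Walk from the hop closest to the application back towards the client.
    for entry in header.rsplit(',') {
        let Ok(ip) = entry.trim().parse::<IpAddr>() else { break };
        candidate = ip;
        if !is_trusted(ip) {
            break;
        }
    }
    candidate
}
```

When every hop is trusted and the header contains only proxy-appended entries, this yields the leftmost address; when only the immediate peer is trusted, it yields the rightmost. It also ignores anything a client prepends to the header itself, which a plain "take the leftmost" rule would accept.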

The configuration is deployment-specific. Get it wrong in either direction (trust too much, get spoofed; trust too little, see only proxy IPs) and the fingerprint binding becomes either fragile or useless.

Defending against XSS

The cookie's HttpOnly attribute defeats one class of XSS attack (reading the cookie). It does not defeat all of them.

An attacker with JavaScript execution in the page can:

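  • Submit requests on the user's behalf (the browser sends the cookie automatically). A CSRF token on state-changing requests stops forgery from other sites, but script running inside the page's own origin can usually read a token embedded in the page, so the token is not a complete defence here. The durable defence is keeping injected script out in the first place, through output encoding and the Content Security Policy described in the next item.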

  • Manipulate the page the user sees to phish credentials or to trick the user into actions. The defence is Content Security Policy (CSP) headers, which constrain what JavaScript the page can load and execute. CSP is an application-side concern, not a session-layer concern, but it composes with the session layer's defences.

The session layer's role is to constrain the cookie. The application's role is to constrain what JavaScript can do in the page. Both layers are needed; the session layer alone does not defend against XSS.

CSRF defences

The session cookie is sent on cross-site top-level navigations because SameSite=Lax allows it. An attacker can craft a link that, when clicked from an external site, triggers a state change in the user's session (the classic CSRF attack).

The SameSite=Lax default narrows the attack: the cookie rides along only on top-level navigations that use a safe method (a clicked link, or a form submitted with GET), not on cross-site POSTs, XHR, or fetch calls. The defences against the remaining surface:

  • Use POST (or PUT, DELETE, PATCH) for state-changing requests. GET requests should be safe.
  • Add a CSRF token to state-changing forms. The token is set in the session and read from a hidden form field; the server checks that they match. Axess does not include a CSRF middleware out of the box; the convention is to use a CSRF middleware from the tower ecosystem or to write a small one.
  • For applications that need cross-origin embedded use, SameSite=None plus a strict CSRF token check is the combination. SameSite=None requires Secure, so the combination is only deployable on HTTPS.

What goes wrong, and how to tell

Three failure modes recur during initial deployment.

The first is a cookie that the browser refuses to send. The symptom is sessions that disappear between requests; the cause is almost always either Secure=true on an http:// connection (the browser refuses to send), SameSite=Strict on a cross-site navigation that should have been recognised, or a Path that does not match the request URL. Inspect the cookie's attributes in the browser's dev tools.

The second is a fingerprint that diverges for the legitimate user. The symptom is a Warn log every few sessions or a Reauth that fires on every wifi-to-cellular switch. The cause is usually the tolerance being too strict; widen the IP prefix or relax the user-agent match. The right tolerance is the smallest one that does not produce noise on legitimate traffic.

The third is the trusted-proxy configuration getting the wrong IP. The symptom is a fingerprint that matches when it should not (an attacker successfully replaying a cookie), or that diverges when it should match (a legitimate user being asked to re-authenticate). The cause is either an unintentionally trusted source (a debug endpoint left open, a VPN allowed to spoof the header) or an unintentionally untrusted proxy (the deployment forgot to add a new proxy's IP to the trusted list).

The pattern across all three: turn on the diagnostic logs, let the deployment run for a week, look at the warning rate, calibrate.

Further reading

Session lifecycle and crypto envelope covers the cookie shape and the orchestration that issues it. Backends covers the storage backends that persist the fingerprint alongside the session. Security posture covers the production crypto requirements that apply to the session layer, including the signing-key length and the FIPS-routing notes. Operations runbook covers signing-key, envelope-key, and fingerprint-pepper rotation.

Schema migration

The SessionData struct can change between axess versions. New fields get added, old fields get renamed or removed, the auth state machine gains a new variant. Existing sessions in the store carry the old shape; new code reads them and needs to produce the new shape. The mechanism that bridges the two is the schema migration on read.

This is a short chapter because the mechanism is small. The mechanism is small because the design pushes the version field into the data itself rather than into the store.

The version field

SessionData::schema_version is a u32 field set at construction and serialised with the rest of the data. At read time the deserialiser inspects the version, dispatches to the appropriate migration function for that version, and produces a current-shape SessionData.

pub struct SessionData {
    pub schema_version: u32,
    pub auth_state: AuthState,
    pub principal_hint: Option<PrincipalHint>,
    pub custom: HashMap<String, serde_json::Value>,
}

impl SessionData {
    const CURRENT_VERSION: u32 = 2;

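    fn migrate(self) -> Self {
        // Each step bumps schema_version by one, so re-entering migrate
        // chains the steps until the data reaches the current shape.
        match self.schema_version {
            0 => migrate_from_v0(self).migrate(),
            1 => migrate_from_v1(self).migrate(),
            _ => self,  // current, no migration needed
        }
    }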
}

The migration functions are pure transformations. They take the old shape (which serde has parsed against an older SessionData definition, possibly with the version-bumped fields defaulted) and produce the new shape. Each migration handles one version step; chained migrations are run in sequence to bridge multiple version gaps.

The version is bumped every time the shape changes in a way that older code would not handle correctly. Adding an optional field with a Default impl typically does not bump the version (new code reading an older record sees None, which is fine). Removing or renaming a field does. Changing the meaning of a field does.

What migrations cannot do

A migration is a pure function of the deserialised session data. It cannot talk to a database, cannot consult the user store, cannot make network calls. Its output is determined entirely by what is in the stored session record at the moment of read.

The implication: if a new shape needs information that the old shape did not carry, the migration cannot synthesise it. The options are to default the field (set it to None, or to a known placeholder), to discard the session (the migration returns an error, the layer treats the session as invalid and starts a fresh one), or to defer the population (the field is set later in the request lifecycle from the application's stores).

The first option is the standard pattern. New fields get sensible defaults, the session continues to work with the new shape, and the application populates the real value on the next dirty write.

When the session is invalidated

Sometimes the shape change is breaking in a way that no migration can bridge. The session's data refers to a user who has been deleted, the auth state references a tenant that no longer exists, the factor list contains a kind that the new version has removed. The migration's right response is to error, and the layer's right response is to treat the session as invalid.

The mechanism is the SessionData::deserialize path returning Err. The session layer catches the error, deletes the session row (or marks it expired), and treats the request as a fresh Guest. The cookie's signature still validates, but the session it names is gone; the next request sets a new session, and the user logs in again.

The pattern is the right one because the alternative (the layer falling through to a degraded state, leaving the session in an inconsistent shape) lets bugs persist for the lifetime of the session. Invalidating eagerly converts the bug into a one-time user-facing event (re-login) that is fixable in one round-trip, rather than a long-tail bug that surfaces sporadically.
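The whole read path (step-by-step migration, an error when a step cannot bridge the gap, and the fall-back to Guest) fits in a short self-contained sketch. ToySession, the two step functions, and the removed "sms" factor are invented for illustration; the real SessionData and its migrations differ.

```rust
#[derive(Debug, PartialEq)]
pub struct ToySession {
    pub schema_version: u32,
    pub user: String,
    pub locale: Option<String>, // added in v1
    pub factors: Vec<String>,   // v2 removed the "sms" factor kind
}

pub const CURRENT_VERSION: u32 = 2;

#[derive(Debug, PartialEq)]
pub enum MigrationError {
    RemovedFactor(String),
    UnknownVersion(u32),
}

fn v0_to_v1(mut s: ToySession) -> Result<ToySession, MigrationError> {
    s.locale = None; // new field: default it, populate later
    s.schema_version = 1;
    Ok(s)
}

fn v1_to_v2(mut s: ToySession) -> Result<ToySession, MigrationError> {
    // No sensible default exists for a removed factor kind, so refuse.
    if let Some(bad) = s.factors.iter().find(|f| f.as_str() == "sms") {
        return Err(MigrationError::RemovedFactor(bad.clone()));
    }
    s.schema_version = 2;
    Ok(s)
}

/// Runs one step per loop iteration until the data is at the current version.
pub fn migrate(mut s: ToySession) -> Result<ToySession, MigrationError> {
    while s.schema_version < CURRENT_VERSION {
        let version = s.schema_version;
        s = match version {
            0 => v0_to_v1(s)?,
            1 => v1_to_v2(s)?,
            v => return Err(MigrationError::UnknownVersion(v)),
        };
    }
    Ok(s)
}

#[derive(Debug, PartialEq)]
pub enum Loaded {
    Session(ToySession),
    Guest,
}

/// What the layer does with a stored record: migrate it, or start fresh.
pub fn load(stored: ToySession) -> Loaded {
    match migrate(stored) {
        Ok(s) => Loaded::Session(s),
        Err(_) => Loaded::Guest, // a real layer would also delete the row
    }
}
```

A v0 record with only a TOTP factor arrives at version 2 with its locale defaulted; a v1 record that still lists "sms" comes back as Guest, and the user logs in again.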

Adding a custom field

Adopters who add their own fields to SessionData::custom follow the same pattern at the application layer. The custom map is JSON-shaped; each application-owned key is independently versioned by the application.

The common pattern is to wrap the custom value in a small struct with its own version field:

// Default is required by unwrap_or_default below; UserPreferences must implement it too.
#[derive(Default, Serialize, Deserialize)]
struct MyAppSessionData {
    schema_version: u32,
    preferences: UserPreferences,
    feature_flags: Vec<String>,
    draft_form_state: Option<DraftForm>,
}

fn read_app_data(session: &SessionData) -> MyAppSessionData {
    session
        .custom
        .get("my_app")
        .and_then(|v| serde_json::from_value::<MyAppSessionData>(v.clone()).ok())
        .map(|d| d.migrate_if_needed())
        .unwrap_or_default()
}

The application's schema_version is independent of axess's. The two evolve on different cadences and the application's version field captures the application's own changes.

When to reach for a different mechanism

The schema migration is the right tool for evolutions of the session data shape. It is the wrong tool for migrations between storage backends (use the cross-backend Store<K, V> trait or a one-off copy script) or for changes to the encryption envelope (the key-rotation mechanism, covered in Operations runbook).

It is also the wrong tool for application-level data migrations that touch the database. A migration that says "every user gains a new field on their user record" runs against the user store (via sqlx::migrate! or the application's migration tool), not against the session store. The session machinery does not interact with the user table.

The mechanism's scope is narrow on purpose. Each piece of state has its own evolution mechanism, and conflating them produces migrations that have to consider too many cases at once.

Further reading

Session lifecycle and crypto envelope covers the lifecycle that the migration runs as part of. Backends covers the storage backends and their own (database-level) migration mechanisms. Migration guide in Part VIII covers the cross-axess-version migrations that bump the SessionData::CURRENT_VERSION constant.

The principal model

A Principal in axess is the answer to "who is making this request?" The unusual choice, and the one this chapter explains, is that the same type answers the question for human users and for service-to-service workloads. A signed-in employee opening a page and a CI job calling an API are both principals, with different variants but the same trait surface, the same authorisation contract, and the same place in the audit trail.

This chapter covers the type, where each variant comes from, how the unified shape lets a Cedar policy treat humans and workloads with one set of rules, and why the alternative (two parallel authentication stacks) was rejected.

The type

Principal lives in axess-identity:

pub enum Principal {
    Human(HumanPrincipal),
    Workload(WorkloadPrincipal),
}

pub struct HumanPrincipal {
    pub user_id: UserId,
    pub tenant_id: TenantId,
    pub session_id: Option<SessionId>,
    pub attributes: BTreeMap<String, serde_json::Value>,
}

pub struct WorkloadPrincipal {
    pub workload_id: WorkloadId,
    pub trust_domain: TrustDomain,
    pub issuer: Issuer,
    pub tenant_id: TenantId,
    pub tenant_slug: String,
    pub service_name: String,
    pub attributes: BTreeMap<String, serde_json::Value>,
}

The two variants are intentionally not symmetric. They carry the data each principal kind actually has. A human has a user_id and is optionally inside a session (some flows act on behalf of a user without a live HTTP session, which is why the field is Option). A workload has a workload_id (a SPIFFE-format URI), a trust domain, and an issuer that says how the principal was authenticated (which OIDC provider, which JWKS, which SPIFFE control plane).

Both variants carry a tenant_id (because every request happens in the context of a tenant, whether the caller is human or not) and an open attributes map (because policies need to ask questions that the fixed fields cannot answer). The attribute map is JSON-valued so that custom attributes (a hardware-key serial, a CI build hash, a regulator classification) can be carried without changing the type.

Where each variant comes from

The two variants are constructed by two different resolvers. The split is what keeps the human and workload sides from contaminating each other.

A HumanPrincipal is constructed by a SessionResolver from an AuthSession. The resolver reads the session's AuthState, returns None if the state is not Authenticated, and otherwise reads user_id, tenant_id, and the session id off the variant. The attributes map is populated from the resolved user's stored profile data (which fields depend on the application's identity store). Construction is synchronous and cheap because everything the resolver needs is already on the session.
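The rule that only a finished login yields a principal can be sketched in a few lines. This is a minimal illustration, not the axess implementation: the types are simplified stand-ins with string fields, only three of the five AuthState variants are shown, the field layout of each variant is an assumption, and resolve_human is a hypothetical function name.

```rust
// Simplified stand-ins for illustration; the variant fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum AuthState {
    Guest,
    Authenticating { user_id: String, tenant_id: String },
    Authenticated { user_id: String, tenant_id: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct HumanPrincipal {
    pub user_id: String,
    pub tenant_id: String,
    pub session_id: Option<String>,
}

/// Hypothetical resolver function: synchronous, reads only what is on the session.
pub fn resolve_human(session_id: &str, state: &AuthState) -> Option<HumanPrincipal> {
    match state {
        AuthState::Authenticated { user_id, tenant_id } => Some(HumanPrincipal {
            user_id: user_id.clone(),
            tenant_id: tenant_id.clone(),
            session_id: Some(session_id.to_string()),
        }),
        // A guest or a partially completed login is not a principal.
        AuthState::Guest | AuthState::Authenticating { .. } => None,
    }
}
```

Because the match is exhaustive, adding a new state later forces the resolver to decide explicitly whether that state counts as a principal.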

A WorkloadPrincipal is constructed by a PrincipalResolver from an inbound credential (a bearer JWT, an mTLS client certificate, a projected Kubernetes service-account token, a GitHub Actions OIDC token). The resolver does the verification work (signature, audience, expiry, sometimes a token-exchange against a control plane) and on success returns a WorkloadPrincipal with the validated identity. The work is async because verifying tokens typically involves a JWKS fetch or an STS round-trip. The chapter Workload identity overview covers the resolver landscape end-to-end.

The two resolvers are independent. An application that has no workloads (a customer-facing SaaS, say) never wires a PrincipalResolver and never sees a Workload variant. An application that has only workloads (an internal data-pipeline API, say) never wires a SessionResolver and never sees a Human variant. An application that mixes both wires both resolvers and a small piece of glue that decides which to consult given the incoming request shape.

Why one type

The natural alternative is two types and two stacks: a User for humans, a Service for workloads, a different middleware for each, a different authorisation contract for each, two parallel audit trails. That shape is what most libraries ship, and it is what axess deliberately rejects.

The argument for one type is straightforward when you start to write the authorisation policy. A request to a billing endpoint might be made by a finance staff member during office hours, or by a scheduled job running the monthly invoicing batch. The policy that decides whether the request is allowed is the same in both cases: this caller, in this tenant, has the right to read this resource. With one Principal type, the policy is one rule. With two types, the policy either duplicates the rule (and the duplicates drift) or branches on the caller kind (and the branches obscure the intent).

The same applies to the audit trail. A regulatory audit log that records "principal X performed action Y against resource Z at time T" works uniformly across human and workload callers when the principal type is unified. The downstream SIEM rules ("alert on any principal making more than N requests per minute to the high-sensitivity endpoint") fire on both human attacks and runaway workloads, without separate detection logic.

The unification has a cost. The Principal enum must accommodate both variants, which makes its memory footprint larger than either variant alone, and pattern-matching code has to handle both arms even when the application only uses one. The cost is paid mostly in code that loads the principal (one match per request), and not in policy evaluation or audit emission (which see the trait surface). On balance, the unification pays for itself by simplifying the policy layer.

SPIFFE shape for workloads

The WorkloadPrincipal is shaped after SPIFFE because SPIFFE is the right shape for workload identity even when the underlying credential is not literally a SVID.

A SPIFFE identity is a URI of the form spiffe://<trust_domain>/<path>. The trust domain is the federation's namespace (prod.example.com, say), and the path identifies a specific workload within that domain (/svc/billing/tenant-acme). The combination uniquely names the workload, the trust domain parameterises the verification (each domain has its own signing keys), and the path is structured enough for policies to match on patterns ("any workload under /svc/billing/*") without inventing parallel identity stacks.
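To make the shape concrete, the following sketch splits a SPIFFE URI into its trust domain and path and checks a path against a prefix. It is an illustration under stated assumptions: SpiffeId, parse_spiffe_id, and path_is_under are hypothetical names, the parser skips the character-set validation a real one would perform, and it requires a non-empty path.

```rust
/// Hypothetical parsed form of a SPIFFE ID, borrowing from the input string.
#[derive(Debug, PartialEq)]
pub struct SpiffeId<'a> {
    pub trust_domain: &'a str,
    pub path: &'a str, // keeps its leading '/'
}

/// Splits "spiffe://<trust_domain>/<path>"; returns None for anything else.
pub fn parse_spiffe_id(uri: &str) -> Option<SpiffeId<'_>> {
    let rest = uri.strip_prefix("spiffe://")?;
    let slash = rest.find('/')?;
    let (trust_domain, path) = rest.split_at(slash);
    if trust_domain.is_empty() || path.len() < 2 {
        return None;
    }
    Some(SpiffeId { trust_domain, path })
}

/// True when the path sits strictly below `prefix`, compared on whole segments,
/// so "/svc/billing" does not match "/svc/billing-old/x".
pub fn path_is_under(id: &SpiffeId<'_>, prefix: &str) -> bool {
    id.path
        .strip_prefix(prefix)
        .map_or(false, |tail| tail.starts_with('/'))
}
```

Matching on whole segments is the design choice that matters here: a plain string-prefix test would let a workload named billing-old inherit rules written for billing.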

Axess's workload identity layer uses this shape even when the inbound credential is a Kubernetes service-account token (which is an OIDC token, not a SVID) or a GitHub Actions OIDC token (which is also not a SVID). The relevant resolver constructs a SPIFFE-format WorkloadId from the inbound claims; downstream code sees a uniform identity. Workload identity overview covers the construction rules for each resolver.

The trust domain and issuer fields on WorkloadPrincipal are the part that policies can use to discriminate between identity sources. A policy that says "only workloads issued by our production control plane may write to the production database" reads the issuer and matches against a fixed list. A policy that says "any workload in the finance trust domain may read the audit log" reads the trust domain.

The Cedar bridge

Cedar policies take principals as entities. Axess implements ToCedarEntity for both HumanPrincipal and WorkloadPrincipal, producing entities with the canonical shape Cedar expects.

A HumanPrincipal becomes a Cedar entity with UID User::"<user_id>", attributes including tenant_id, factors_completed, and authn_time, and parent entities for the tenant and any groups the user belongs to (which the application provides through AuthzEntityProvider, covered in Entity providers and request context).

A WorkloadPrincipal becomes a Cedar entity with UID Workload::"<spiffe-uri>", attributes including trust_domain, issuer, and tenant_id, and parent entities for the trust domain and the tenant. Policies that want to match all workloads in a trust domain write principal in TrustDomain::"prod.example.com"; policies that want to match a specific workload pattern write principal.workload_id like "spiffe://prod.example.com/svc/billing/*".

The bridge is what makes one type into one policy. A Cedar policy that says

permit (
  principal,
  action == Action::"read",
  resource in TenantData::"acme"
) when {
  principal.tenant_id == "acme"
};

allows both a human user in tenant acme and a workload bound to tenant acme. The principal type does not appear in the rule because it does not need to. If the policy later needs to discriminate (say, to require MFA for humans but not for workloads), the rule that expresses the discrimination is local and readable.

When the type is empty

Some flows operate without a principal: a health check, a metrics endpoint, the login page itself. Axess models this by representing the request's caller as Option<Principal>. The resolver returns None, and the authorisation layer either short-circuits (for unauthenticated endpoints) or evaluates the request with no principal present (for endpoints that take a deny-by-default position toward unauthenticated callers).

The pattern matters for one specific reason. A misconfigured resolver that returns a stub principal for unauthenticated requests, instead of None, silently widens the authorisation surface. The Cedar policy evaluates against the stub and may allow actions that should require authentication. Treating "no principal" as the absence of a value, rather than as a kind of value, makes the policy author's life harder in the short term and easier in the long term: a policy that does not explicitly admit None denies it by default.
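The difference between "no principal" and "a stub principal" shows up clearly in a small gate function. The sketch below is hypothetical (authorise, Decision, and the public_endpoint flag are illustrative names, and the tenant comparison stands in for a real policy evaluation); it shows only that None has to be admitted explicitly.

```rust
// Simplified stand-in for the Principal enum; fields reduced to strings.
#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    Human { user_id: String, tenant_id: String },
    Workload { workload_id: String, tenant_id: String },
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Hypothetical gate: a missing principal is denied unless the endpoint is public.
pub fn authorise(
    principal: Option<&Principal>,
    resource_tenant: &str,
    public_endpoint: bool,
) -> Decision {
    match principal {
        // The only path that admits an absent principal is the explicit one.
        None if public_endpoint => Decision::Allow,
        None => Decision::Deny,
        Some(Principal::Human { tenant_id, .. })
        | Some(Principal::Workload { tenant_id, .. }) => {
            if tenant_id.as_str() == resource_tenant {
                Decision::Allow
            } else {
                Decision::Deny
            }
        }
    }
}
```

A resolver that invented a stub principal would land in the last arm and be judged like any real caller, which is exactly the widening the Option shape prevents.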

What this enables

The unified principal type is what makes the rest of the workload identity story (Part VII) and the Cedar authorisation story (Part IV) short. A handler reads Principal, the authorisation layer evaluates policies against it, and the audit pipeline emits events keyed by it. None of these layers need to know whether the caller is a human or a workload, because the type carries both possibilities and the policy author resolves the discrimination where it actually matters.

Further reading

Workload identity overview covers the resolvers that produce WorkloadPrincipal values: SPIFFE JWT-SVID, SPIFFE mTLS, Kubernetes ServiceAccount tokens, GitHub Actions OIDC, generic OAuth-RS, and cloud STS exchange. Cedar policy fundamentals covers the AuthzSession::require and AuthzSession::decide calls that take a Principal and return an AuthzDecision. Audit events covers the log emitted for each authentication and authorisation decision, including the principal serialisation.

Device identity

A device in axess is a typed aggregate, not a string in a column. A user has zero or more devices; each device has a stable identifier, a fingerprint that the session layer can match against, an assurance level on a three-stage ladder, and a relationship to the refresh tokens issued against it. The combination is the machinery behind "this device was lost, revoke its access" and "this is a new device, require step-up before we trust it." The mechanism is optional but on by default in the axess facade because most adopters benefit from it without specifically asking.

The feature flag is device (on by default).

The three-stage ladder

A device occupies one of four states. The first three form an assurance ladder; the fourth is terminal.

Unknown is the default for a new device. The session layer has seen this fingerprint for the first time, the user has not yet confirmed it, and no commitment has been made about trust. An unknown device can still authenticate (the user enters their password and second factor as usual), but step-up policies may require additional friction (a second confirmation email, a recovery code) before high-sensitivity actions become available.

Seen is the second state. The device has authenticated successfully at least once; the user has implicitly accepted it by continuing through the login. A seen device retains the fingerprint binding from the session layer but does not yet carry explicit trust. It is the right state for a device that the user might log in from again but has not explicitly registered.

Trusted is the third state and the steady state for primary devices. The user (or the application's administrative flow) explicitly trusted this device. The device's fingerprint binding applies; the device is the bound carrier for refresh tokens; the device can perform high-sensitivity actions without additional step-up.

Revoked is the terminal state. The device was lost, the user removed it, the security team forced a revocation, or the system detected compromise. Tokens bound to the device are revoked, sessions bound to it are deleted, and further authentication attempts from the fingerprint are blocked until the user explicitly re-establishes the device.

The transitions move strictly forward through the ladder. Unknown becomes Seen on first successful login. Seen becomes Trusted on explicit user action or after an application-configurable trust period. Any state becomes Revoked on revocation. There is no path back from Revoked; a device that was revoked and is later re-encountered registers as a new Unknown device.
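The ladder can be written down as a small transition function. This is a sketch of the rules as described here, not the library's code: DeviceEvent and next_level are hypothetical names, the trust-period promotion is folded into ExplicitTrust, and events that the text does not describe (for example an explicit trust request on an Unknown device) are assumed to leave the level unchanged.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceTrustLevel {
    Unknown,
    Seen,
    Trusted,
    Revoked,
}

/// Hypothetical events that can move a device along the ladder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceEvent {
    SuccessfulLogin,
    ExplicitTrust,
    Revoke,
}

pub fn next_level(current: DeviceTrustLevel, event: DeviceEvent) -> DeviceTrustLevel {
    use DeviceEvent::*;
    use DeviceTrustLevel::*;
    match (current, event) {
        // Revoked is terminal: nothing moves a device out of it.
        (Revoked, _) => Revoked,
        // Any other state can be revoked directly.
        (_, Revoke) => Revoked,
        (Unknown, SuccessfulLogin) => Seen,
        (Seen, ExplicitTrust) => Trusted,
        // Assumption: every other combination leaves the level unchanged.
        (level, _) => level,
    }
}
```

Ordering the Revoked arm first is what encodes "no path back": every later arm is unreachable for a revoked device.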

The device record

The Device struct carries the per-device state:

pub struct Device {
    pub device_id: DeviceId,
    pub user_id: UserId,
    pub tenant_id: TenantId,
    pub trust_level: DeviceTrustLevel,  // Unknown | Seen | Trusted | Revoked
    pub fingerprint_hash: String,        // HMAC against the per-tenant pepper
    pub display_name: Option<String>,   // user-set ("My laptop")
    pub first_seen_at: DateTime<Utc>,
    pub last_seen_at: DateTime<Utc>,
    pub trusted_at: Option<DateTime<Utc>>,
    pub revoked_at: Option<DateTime<Utc>>,
}

The device_id is a stable identifier minted at first sight. It is what refresh tokens bind to (see Refresh tokens and session continuity), what Cedar policies can reference, and what the admin UI lists when the user inspects their registered devices.

The fingerprint_hash is the HMAC of the device's fingerprint features against a per-tenant pepper. The hash, not the raw fingerprint, lives in the database; the raw features are computed per request and matched constant-time. Storing the hash defends against database breach: an attacker who reads every row of the device store does not learn the underlying fingerprint features of any user.

The display_name is for the user. When the device transitions from Seen to Trusted the application typically asks the user to name it ("My laptop", "iPhone 15 Pro"); the name appears in the user's device-management UI. It is not used for authentication.

The per-tenant pepper

The fingerprint pepper is the secret the HMAC uses. Two design choices matter.

The pepper is per-tenant, not global. Each tenant has its own pepper, stored alongside the tenant record. The choice means that a fingerprint hash from tenant A cannot be matched against tenant B's hashes; a breach that leaks one tenant's pepper compromises only that tenant's fingerprint hashes.

The pepper is rotated when the tenant is suspended or when the deployment chooses to invalidate all device records. Rotation invalidates every device record under the tenant (their fingerprint hashes no longer match the new pepper); existing sessions remain valid (they do not depend on the device record), but new logins re-register devices from scratch.

The chapter Operations runbook covers the rotation sequence and the staged rollout.

How devices interact with refresh tokens

The cascade between devices and refresh tokens is bidirectional and is what makes "revoke this device" actually mean "revoke every session this device can refresh."

In one direction: when a device is revoked, every refresh token that carries device_id = revoked_device is invalidated. The next attempt to use any of those tokens fails. The application's session layer detects this on the next refresh and treats the session as expired.

In the other direction: when a refresh token family is invalidated through reuse detection (the family-revoke mechanism covered in Refresh tokens), the cascade marks the bound devices as compromised. The compromise is the shortcut from Trusted (or Seen) to Revoked without an intermediate state.

The cascade is what makes the system robust against both operator-initiated revocation ("the device was lost") and attack-driven revocation ("a token was stolen"). The two cases converge on the same revocation primitive; both directions of cascade fire from the same code path.
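A toy in-memory model makes the two directions easier to see. Everything in it is illustrative: Registry, revoke_device, and revoke_family are hypothetical names, identifiers are plain integers, and a real store would perform these updates inside a transaction.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    Unknown,
    Seen,
    Trusted,
    Revoked,
}

#[derive(Debug)]
pub struct RefreshToken {
    pub family_id: u32,
    pub device_id: u32,
    pub revoked: bool,
}

/// Hypothetical in-memory registry of devices and the tokens bound to them.
#[derive(Default)]
pub struct Registry {
    pub devices: HashMap<u32, TrustLevel>,
    pub tokens: Vec<RefreshToken>,
}

impl Registry {
    /// Device to tokens: the shared revocation primitive.
    pub fn revoke_device(&mut self, device_id: u32) {
        self.devices.insert(device_id, TrustLevel::Revoked);
        for token in self.tokens.iter_mut().filter(|t| t.device_id == device_id) {
            token.revoked = true;
        }
    }

    /// Tokens to device: reuse detection on a family revokes every device bound to it,
    /// going through the same primitive as an operator-initiated revocation.
    pub fn revoke_family(&mut self, family_id: u32) {
        let mut bound: Vec<u32> = self
            .tokens
            .iter()
            .filter(|t| t.family_id == family_id)
            .map(|t| t.device_id)
            .collect();
        bound.sort_unstable();
        bound.dedup();
        for device_id in bound {
            self.revoke_device(device_id);
        }
    }
}
```

Note that revoke_family ends up revoking sibling token families on the same device as well, which is the behaviour the phrase "every session this device can refresh" implies.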

Step-up policies

The trust level becomes interesting at the Cedar policy layer. A policy that wants to require a Trusted device for sensitive actions reads principal.device.trust_level == "Trusted":

forbid (
    principal,
    action == Action::"transfer-funds",
    resource
) when {
    principal.device.trust_level != "Trusted"
};

The rule denies fund transfers from any device that is not Trusted. A user on a new (Unknown or Seen) device is prompted to trust the device first, typically by completing an additional verification step (a second-factor challenge, a confirmation email, a step-up to FIDO2).

The pattern composes with the other authorisation styles. A policy that requires both FIDO2 and a Trusted device is the two constraints together; a policy that allows any of three different ways to clear the bar is the disjunction in one rule.

Identifying a device

Each request needs to be associated with a device. The mapping runs through the DeviceResolver trait:

#[async_trait]
pub trait DeviceResolver: Send + Sync {
    async fn resolve(
        &self,
        request: &Request,
        user_id: &UserId,
        tenant_id: &TenantId,
    ) -> Result<DeviceMatch, DeviceResolverError>;
}

pub enum DeviceMatch {
    Existing(DeviceId),
    NewDevice(DeviceId),  // freshly minted, written to store
}

The default implementation computes the fingerprint from the request features (user agent, IP, accept-language) and matches it against existing devices for the user. A match returns the existing device id; a miss writes a new device row with trust_level = Unknown and returns the new id.
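The match-or-mint behaviour can be sketched without any storage or async machinery. The version below is a simplified, synchronous stand-in: InMemoryResolver is a hypothetical type, device ids are integers, and the fingerprint hash is assumed to have been computed already (the HMAC step is out of scope for the sketch).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceMatch {
    Existing(u64),
    NewDevice(u64),
}

/// Hypothetical in-memory index: (tenant, user, fingerprint hash) -> device id.
#[derive(Default)]
pub struct InMemoryResolver {
    by_fingerprint: HashMap<(String, String, String), u64>,
    next_id: u64,
}

impl InMemoryResolver {
    pub fn resolve(&mut self, tenant_id: &str, user_id: &str, fingerprint_hash: &str) -> DeviceMatch {
        let key = (
            tenant_id.to_string(),
            user_id.to_string(),
            fingerprint_hash.to_string(),
        );
        // A hit returns the existing device id.
        if let Some(&id) = self.by_fingerprint.get(&key) {
            return DeviceMatch::Existing(id);
        }
        // A miss mints a new id; a real store would also write a row at Unknown trust.
        self.next_id += 1;
        let id = self.next_id;
        self.by_fingerprint.insert(key, id);
        DeviceMatch::NewDevice(id)
    }
}
```

Keying the index on the tenant as well as the hash means the same fingerprint seen under two tenants yields two unrelated devices.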

The default works for most deployments. Applications with stronger device-identity signals (a long-lived hardware key, a mobile app's persistent installation id, a device certificate) can provide their own DeviceResolver that consults the stronger signal first and falls back to the fingerprint match.

Caching

The device record is read on most requests (every authenticated request that involves a Cedar evaluation reads the device). A naive lookup against the device store would be the hottest read in the application.

The CachedDeviceStore decorator wraps any DeviceStore with an LRU+TTL cache. The cache key is (tenant_id, device_id); the cache value is the Device record. The TTL is short (a few seconds) so revocations propagate quickly; the LRU bound constrains memory under fan-out scenarios.

The cache is invalidated explicitly on revocation. The DeviceStore::revoke call clears the relevant cache entry and writes the revocation. Subsequent reads see the revoked state without waiting for the TTL.

The pattern is the same one Entity providers and request context covers for the Cedar entity cache. Cache the data, not the decision; invalidate eagerly on mutation; let TTLs catch the cases the invalidation missed.
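As a rough illustration of "invalidate eagerly, let the TTL catch the rest", here is a TTL-only cache keyed by (tenant_id, device_id). It is not CachedDeviceStore: TtlCache is a hypothetical type, the LRU bound is omitted for brevity, and time is passed in as plain seconds so the example stays deterministic.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Device {
    pub device_id: u64,
    pub revoked: bool,
}

/// Hypothetical TTL cache; each entry stores the record and the time it was cached.
pub struct TtlCache {
    ttl_secs: u64,
    entries: HashMap<(String, u64), (Device, u64)>,
}

impl TtlCache {
    pub fn new(ttl_secs: u64) -> Self {
        Self { ttl_secs, entries: HashMap::new() }
    }

    pub fn put(&mut self, tenant: &str, device: Device, now: u64) {
        self.entries
            .insert((tenant.to_string(), device.device_id), (device, now));
    }

    /// Returns the cached record only while it is younger than the TTL.
    pub fn get(&mut self, tenant: &str, device_id: u64, now: u64) -> Option<Device> {
        let key = (tenant.to_string(), device_id);
        let fresh = match self.entries.get(&key) {
            Some((device, stored_at)) if now.saturating_sub(*stored_at) < self.ttl_secs => {
                Some(device.clone())
            }
            _ => None,
        };
        if fresh.is_none() {
            // Expired or absent: drop any stale entry so the caller reloads from the store.
            self.entries.remove(&key);
        }
        fresh
    }

    /// Eager invalidation, called on revocation so readers do not wait for the TTL.
    pub fn invalidate(&mut self, tenant: &str, device_id: u64) {
        self.entries.remove(&(tenant.to_string(), device_id));
    }
}
```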

PII tokenisation and GDPR

The device record carries personally-identifiable information. The fingerprint features include the IP address (which is PII under GDPR), the user agent (which can carry identifying details about the user's setup), and the timestamps (which together can identify the user's working patterns).

The defence is twofold.

The first is that the device store holds hashes, not the raw features. The fingerprint hash is the HMAC against the per-tenant pepper; an attacker who reads the store sees the hash, not the IP or user agent.

The second is the retention sweep. The DeviceStore::retention_sweep verb removes device records older than a configured threshold, along with the refresh tokens that bound to them. The sweep is the GDPR-shaped lever: data the deployment no longer needs is removed within a bounded period, and the retention is documentable.

The retention period is per-tenant. The Tenant::device_retention_days field carries it; the default is ninety days. Tenants with stricter requirements set it lower (say, thirty days for an EU tenant subject to strict GDPR interpretation); tenants with looser ones set it higher (say, three hundred and sixty-five days for a US tenant where session continuity matters more).
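The per-tenant sweep reduces to a filter over device rows. The sketch below is hypothetical in its details: DeviceRow is a cut-down record, days are plain integers, and age is measured from the last sighting, which is an assumption (the text does not say which timestamp the threshold applies to). Removal of the bound refresh tokens is left out.

```rust
/// Cut-down device row for illustration.
pub struct DeviceRow {
    pub tenant_id: String,
    pub device_id: u64,
    pub last_seen_day: u64,
}

/// Removes the tenant's rows whose last sighting is older than `retention_days`
/// and returns how many rows were removed. Other tenants' rows are untouched.
pub fn retention_sweep(
    rows: &mut Vec<DeviceRow>,
    tenant_id: &str,
    retention_days: u64,
    today: u64,
) -> usize {
    let before = rows.len();
    rows.retain(|row| {
        row.tenant_id.as_str() != tenant_id
            || today.saturating_sub(row.last_seen_day) <= retention_days
    });
    before - rows.len()
}
```

Returning the count gives the deployment something to log, which helps when the retention behaviour has to be documented for an auditor.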

The chapter Multi-tenancy covers the per-tenant configuration mechanism. Security posture covers the GDPR and SOC2 touch-points.

Storage backends and writing your own

axess ships five DeviceStore implementations:

| Backend | Feature | Notes |
| --- | --- | --- |
| MemoryDeviceStore | memory | DashMap + clock-driven sweep. Dev and tests. |
| SqliteDeviceStore | sqlite | SQLx pool, INSERT … ON CONFLICT, schema in init_schema(). |
| PostgresDeviceStore | postgres | SQLx pool, same surface as the sqlite backend with the Postgres dialect. |
| MysqlDeviceStore | mysql | SQLx pool, MySQL dialect (? binds, ON DUPLICATE KEY UPDATE, VARBINARY(32)). Compatible with MySQL 8.x and MariaDB 10.5+. |
| ValkeyDeviceStore | valkey | Hash-per-device + per-tenant fingerprint index. Server-side EXPIRE handles purge. |

All five backends share the same trait surface; switching between the SQL backends requires only the init_schema call against the new pool and a different constructor at startup.

Writing an adopter-supplied store

Any storage technology can back devices as long as it can answer the ten methods on axess_core::device::DeviceStore. The shipped backends (memory, sqlite, postgres, mysql, valkey) are the reference implementations to read alongside the trait docstring; the recipe below names the contracts that aren't obvious from method signatures.

Type and Error. Implement the trait on a Clone + Send + Sync + 'static struct (typically Arc<...> around your connection pool / client). Pick a single type Error: std::error::Error + Send + Sync + 'static; the existing backends use a thiserror enum that wraps their driver error + a "missing row" variant. Don't conflate driver errors with domain errors (a NotFound returned by your driver should not surface as an Err from load; map it to Ok(None)).

Tenant scoping is mandatory. Every method that takes a TenantId must filter on it in the query. The peppered FingerprintHash is already keyed per-tenant, but the trait contract documents the scoping requirement explicitly to prevent cross-tenant leakage on a backend whose primary index might otherwise be only by hash. Read the docstring on find_by_fingerprint for the rationale.

save must be atomic. save is documented as an idempotent upsert. Implementations that perform a racy SELECT-then-INSERT check must wrap it in a transaction or use the dialect's native upsert (ON CONFLICT, ON DUPLICATE KEY UPDATE, MERGE, or SETNX for KV stores). A non-atomic save produces lost updates under concurrent device-promotion calls.

record_sighting is hot-path. Every authenticated request touches this. Implement it as a single UPDATE … SET last_seen_at = ? rather than a load-modify-save round trip. The shipped backends are a guide. The CachedDeviceStore decorator (see caching, above) shields the underlying store from read pressure but the write path runs through every request.

sweep is required, not defaulted. A backend that doesn't implement sweep cannot age devices through the three-stage ladder, and the documented retention posture (90d trusted / 30d seen / 7d revoked grace) silently breaks. The trait deliberately omits a default impl so backends must answer the question, even if the answer is Err(_) with a "sweep not yet implemented" sentinel during initial development.

Sighting timestamps come from a Clock. Methods that need "now" (record_sighting, set_trust_level, sweep) accept now: DateTime<Utc> as a parameter. Callers thread clock.now() through; backends never call Utc::now() themselves. This keeps adopter integration tests deterministic, in line with the deterministic simulation testing (DST) discipline the rest of the library follows.

Mirror the per-backend test layout. Each shipped backend has its own test module exercising the trait surface end-to-end (load round-trip, fingerprint lookup, refresh-family fan-out, retention sweep); device/storage/sqlite/tests.rs is the most complete template. Copy that suite, adapt the harness setup to your backend, and run it to catch the non-obvious contract violations (tenant-scoping leaks, non-atomic save races, sweep counts off-by-one).

Reach for CachedDeviceStore over reinventing. If your gap is "my backend is slow on load", wrap your store in CachedDeviceStore before optimising the implementation. The decorator gives you bounded-size LRU + clock-driven TTL eviction for free, with revocation propagating through set_trust_level.
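The clock-threading contract from the recipe above can be sketched as follows. The names are stand-ins: this Clock trait returns plain seconds rather than DateTime<Utc>, FixedClock is a hypothetical test clock, and record_sighting is a free function over a cut-down row rather than the trait method.

```rust
use std::cell::Cell;

/// Stand-in for the Clock abstraction; returns seconds for simplicity.
pub trait Clock {
    fn now(&self) -> u64;
}

/// Hypothetical test clock that only moves when told to.
pub struct FixedClock {
    current: Cell<u64>,
}

impl FixedClock {
    pub fn new(start: u64) -> Self {
        Self { current: Cell::new(start) }
    }

    pub fn advance(&self, secs: u64) {
        self.current.set(self.current.get() + secs);
    }
}

impl Clock for FixedClock {
    fn now(&self) -> u64 {
        self.current.get()
    }
}

pub struct DeviceRow {
    pub last_seen_at: u64,
}

/// Backend side: receives "now" as a parameter and never reads the system clock.
pub fn record_sighting(row: &mut DeviceRow, now: u64) {
    row.last_seen_at = now;
}

/// Caller side: the one place where the clock is consulted.
pub fn on_request(clock: &dyn Clock, row: &mut DeviceRow) {
    record_sighting(row, clock.now());
}
```

With this split a test can advance the clock by exactly thirty seconds and assert on the stored timestamp, with no sleeps and no tolerance windows.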

What this enables

Device identity is the connective tissue between the user, the sessions they hold, the refresh tokens those sessions issue, and the authorisation decisions the application makes about them. A user with a known device gets a smoother experience: the fingerprint binding holds, the refresh tokens roll, the policies default to trust. A user with an unknown device gets friction exactly when it makes sense: a step-up before sensitive actions, a confirmation before high-trust operations. A user with a revoked device gets nothing, immediately.

The mechanism is small (a handful of types, one ladder, one cascade) but its reach is wide (every refresh, every policy evaluation, every audit event). Once you have the device aggregate in mind, the rest of the security model falls into place around it.

Further reading

Refresh tokens and session continuity covers the binding between devices and tokens, including the cascade in both directions. Cedar policy fundamentals covers how policies read the device's trust level. Multi-tenancy covers the per-tenant fingerprint pepper and retention configuration. Security posture covers the GDPR and SOC2 implications of device data.

Multi-tenancy

A tenant in axess is the unit of isolation. Users, factor configurations, sessions, devices, policies, and audit events all carry a TenantId, and the library refuses to leak data across tenants by construction. This chapter covers the model, the atomic provisioning pattern that ensures every tenant starts in a sound state, the three-lever lockout, and the operational patterns for tenant suspension and deletion.

The mechanism is on by default. There is no feature flag to toggle tenancy; the TenantId field is present on every relevant record. A single-tenant deployment uses one well-known TenantId ("default" is the convention) and effectively gets the multi-tenant machinery for free, ready to expand when a second tenant is added.

The tenant record

The Tenant struct lives in axess-identity and carries the configuration that applies to every user under the tenant:

pub struct Tenant {
    pub tenant_id: TenantId,
    pub status: TenantStatus,                    // Active | Suspended | Deleted
    pub display_name: String,
    pub fingerprint_pepper: ZeroizedString,      // per-tenant device pepper
    pub lockout_policy: LockoutPolicy,           // tenant-scoped lockout
    pub device_retention_days: u32,              // GDPR-shaped retention
    pub created_at: DateTime<Utc>,
    pub suspended_at: Option<DateTime<Utc>>,
}

The TenantId is a typed UUID (the convention in axess-identity). The status carries the tenant's lifecycle state, covered below. The fingerprint_pepper is the per-tenant device pepper from Device identity. The lockout_policy is the tenant-scoped override of the global lockout configuration, covered in the Three-lever lockout section below. The device_retention_days is the per-tenant GDPR-shaped retention period for device records.

Cross-tenant refusal as a structural rule

Every operation in axess that touches a user, a session, a device, a factor, or an event carries a tenant scope. The library checks the scope before performing the operation, and refuses any operation where the scopes do not align.

The pattern is uniform across the API. A begin_login call takes a tenant id; the user lookup is scoped to that tenant; a user with the same username in a different tenant is not returned. A verify_factor call works against the session's tenant id; a factor configuration registered in a different tenant is not consulted. A find_sessions_for_user call takes both user id and tenant id; sessions in other tenants are not returned.
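One way to picture the scoping rule is a store whose only lookup key includes the tenant. The sketch is illustrative and does not show how axess stores users: UserStore and its methods are hypothetical, and identifiers are plain strings.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct User {
    pub tenant_id: String,
    pub username: String,
}

/// Hypothetical store keyed by (tenant, username); no lookup exists without a tenant.
#[derive(Default)]
pub struct UserStore {
    users: HashMap<(String, String), User>,
}

impl UserStore {
    pub fn insert(&mut self, user: User) {
        self.users
            .insert((user.tenant_id.clone(), user.username.clone()), user);
    }

    /// The tenant is part of the key, so a same-named user elsewhere cannot be returned.
    pub fn find(&self, tenant_id: &str, username: &str) -> Option<&User> {
        self.users
            .get(&(tenant_id.to_string(), username.to_string()))
    }
}
```

Because the API offers no tenant-free lookup, a caller cannot forget the scope; the omission would not compile.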

The structural defence is what lets a multi-tenant deployment make the strongest possible authorisation claim: not only does the application not leak across tenants, the library underneath cannot. The Cedar policy layer can then add a top-level forbid rule that catches the rare case of an application bug that tries to authorise across tenants:

forbid (
    principal,
    action,
    resource
) when {
    principal.tenant_id != resource.tenant_id
};

The rule applies to every action on every resource, and the combination of "library refuses cross-tenant lookups" and "policy denies cross-tenant decisions" produces a deployment where a cross-tenant access is structurally impossible.

Atomic provisioning

A tenant comes into existence through AuthnService::create_tenant, which is the verb behind any "sign up a new organisation" or "administrator provisions a new tenant" flow. The call is atomic by design.

let tenant = service.create_tenant(TenantBootstrap {
    display_name: "Acme Inc.".into(),
    initial_admin: AdminUser {
        identifier: "admin@acme.example".into(),
        initial_password: Some(initial_password.into()),
    },
    initial_method: Method {
        name: "password-then-totp".into(),
        steps: vec![
            FactorStep::Required(FactorKind::Password),
            FactorStep::Required(FactorKind::Totp),
        ],
    },
    fingerprint_pepper: SecureRng::random_bytes(32),
    lockout_policy: LockoutPolicy::default(),
    device_retention_days: 90,
}).await?;

The atomicity matters because a partially-provisioned tenant is a landmine. A tenant that exists in the tenant table but has no configured method admits any user with the global default method, which may not be what the new tenant wants. A tenant with a method but no factor configurations for the admin user produces an immediate lockout. A tenant with an admin user but no factor secret for them is worse: the user record exists, the admin cannot log in, and there is no path to recovery without an out-of-band intervention.

The bootstrap struct is the contract that says "a tenant exists only after every one of these has succeeded." The implementation runs the create-tenant, create-user, create-factor-config, create-method, set-fingerprint-pepper, set-lockout-policy operations in a single transaction. On any failure the transaction rolls back; nothing is persisted; the call returns an error.

A subtler invariant in the bootstrap: every tenant must have at least one factor and one enabled method, and the admin user must have a factor configuration for every factor the method requires. The bootstrap checks both at construction; a misshapen bootstrap fails before the transaction starts.
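The pre-transaction check might look like the following. It is a sketch of the invariant only: validate_bootstrap and BootstrapError are hypothetical names, FactorKind is reduced to two variants, and the method is represented by the list of factors it requires.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
}

#[derive(Debug, PartialEq)]
pub enum BootstrapError {
    /// The method requires no factors at all.
    NoRequiredFactors,
    /// The admin has no configuration for a factor the method requires.
    AdminMissingFactor(FactorKind),
}

/// Runs before any transaction starts; a misshapen bootstrap never reaches the store.
pub fn validate_bootstrap(
    required: &[FactorKind],
    admin_configured: &[FactorKind],
) -> Result<(), BootstrapError> {
    if required.is_empty() {
        return Err(BootstrapError::NoRequiredFactors);
    }
    for kind in required {
        if !admin_configured.contains(kind) {
            return Err(BootstrapError::AdminMissingFactor(*kind));
        }
    }
    Ok(())
}
```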

The three-lever lockout

Lockout is the mechanism that prevents an attacker from brute-forcing credentials. Axess has three levers, applied at three scopes, that compose.

The first lever is per-user lockout. After a configurable number of failed factor verifications against the same user account, that account is locked for a configurable interval. The default is three failed attempts followed by a fifteen-minute lockout with exponential backoff on repeated failure.

The second lever is per-tenant lockout. After a configurable number of failed factor verifications across any user in the tenant within a short window, the tenant's login surface as a whole is throttled. The default is high enough that legitimate traffic does not trigger it; the lever exists to catch distributed brute-forcing across many accounts in the same tenant.

The third lever is per-IP lockout. After a configurable number of failed verifications from the same source IP within a short window, that IP is throttled or blocked outright. The default is ten attempts per minute, beyond which the requests are rejected without engaging the factor verifier. The lever catches a single attacker source attempting many accounts.

The three levers compose multiplicatively. A successful attack needs to dodge all three: stay below the per-user threshold, stay below the per-tenant threshold, and either spread across many source IPs or stay below the per-IP threshold. The cost of the attack grows as a product of the three.

The lockout configuration is in LockoutPolicy:

pub struct LockoutPolicy {
    pub per_user: LockoutScale,
    pub per_tenant: LockoutScale,
    pub per_ip: LockoutScale,
}

pub struct LockoutScale {
    pub failures_before_lockout: u32,
    pub window: Duration,
    pub backoff: BackoffPolicy,  // fixed | exponential
    pub max_lockout: Duration,
}

The policy is per-tenant by default (loaded from the tenant record's lockout_policy field). The global default applies if the tenant did not override.
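
How a LockoutScale might turn into a concrete lockout duration can be sketched as follows. The BackoffPolicy variants (and the idea that they carry the base duration) are assumptions made for the example, as are the helper names lockout_duration and effective_scale; the real enum may be shaped differently.

```rust
use std::time::Duration;

/// Hypothetical shape for the backoff policy; the real enum may differ.
#[derive(Debug, Clone, Copy)]
pub enum BackoffPolicy {
    Fixed(Duration),
    Exponential { base: Duration },
}

#[derive(Debug, Clone, Copy)]
pub struct LockoutScale {
    pub failures_before_lockout: u32,
    pub window: Duration,
    pub backoff: BackoffPolicy,
    pub max_lockout: Duration,
}

/// Returns how long to lock for, given the failures seen inside the window
/// and the number of lockouts that have already fired. `None` means the
/// threshold has not been reached yet.
pub fn lockout_duration(
    scale: &LockoutScale,
    failures_in_window: u32,
    prior_lockouts: u32,
) -> Option<Duration> {
    if failures_in_window < scale.failures_before_lockout {
        return None;
    }
    let raw = match scale.backoff {
        BackoffPolicy::Fixed(d) => d,
        BackoffPolicy::Exponential { base } => {
            // Double per prior lockout; saturate instead of overflowing.
            let factor = 2u32.checked_pow(prior_lockouts).unwrap_or(u32::MAX);
            base.checked_mul(factor).unwrap_or(Duration::MAX)
        }
    };
    // The cap always wins over the computed backoff.
    Some(raw.min(scale.max_lockout))
}

/// The tenant override wins; otherwise the global default applies.
pub fn effective_scale<'a>(
    tenant_override: Option<&'a LockoutScale>,
    global_default: &'a LockoutScale,
) -> &'a LockoutScale {
    tenant_override.unwrap_or(global_default)
}
```

With a fifteen-minute base and a one-day cap, the third consecutive lockout in this sketch lasts an hour, and the cap takes over after a handful of repeats.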

Tenant suspension

A suspended tenant is still in the database but cannot authenticate. The state is reached through AuthnService::suspend_tenant, which is the operational verb behind "this tenant has not paid" or "this tenant has been flagged for compliance review."

The transition does five things atomically: it sets the tenant's status to Suspended, it sets the suspended_at timestamp, it invalidates every active session under the tenant (deletes the session rows, the user's next request comes through as Guest), it revokes every refresh token under the tenant (sets revoked = true on each), and it emits a TenantSuspended audit event.

A suspended tenant's users hit TenantSuspended on every login attempt instead of proceeding to factor verification. The error is distinct from UserNotFound because the application typically wants to render a specific page for it (a "your organisation is suspended, contact support" message), not the generic invalid-credentials flow.

Unsuspending is the inverse: unsuspend_tenant flips the status back to Active, clears suspended_at, and emits a TenantReactivated event. Sessions are not restored; users have to log in again, which is the right behaviour because their device records may have aged or rotated during the suspension.

Tenant deletion

A deleted tenant is the irreversible end of the lifecycle. The state is reached through AuthnService::delete_tenant, typically in response to a customer exit or a GDPR erasure request.

The deletion runs as a cascade. All sessions, refresh tokens, devices, factor configurations, audit events, and the tenant record itself are removed. The deletion is two-phase: the first phase marks the tenant as Deleted and stops accepting new operations on it; the second phase runs the cascade asynchronously (typically as a background task) and removes the underlying rows.

The two-phase pattern matters for two reasons. First, the cascade is potentially expensive on large tenants; running it synchronously blocks the operator's request. Second, the two-phase approach gives a recovery window: if the deletion was accidental, the first phase is reversible by flipping the status back to Suspended before the cascade runs. After the cascade, recovery requires a backup restore.
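
The lifecycle described so far can be read as a small state machine. The sketch below is illustrative only: the TenantOp enum, the Purged marker, and the choice to allow deletion directly from Active are assumptions, not the library's types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantStatus {
    Active,
    Suspended,
    /// Phase one of deletion: marked, cascade not yet run.
    Deleted,
    /// Phase two complete: rows are gone (hypothetical marker for the sketch).
    Purged,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantOp {
    Suspend,
    Unsuspend,
    MarkDeleted,
    UndoDelete,
    RunCascade,
}

#[derive(Debug, PartialEq, Eq)]
pub struct InvalidTransition {
    pub from: TenantStatus,
    pub op: TenantOp,
}

/// Pure transition function: no I/O, so it is trivially testable.
pub fn transition(
    from: TenantStatus,
    op: TenantOp,
) -> Result<TenantStatus, InvalidTransition> {
    use TenantOp::*;
    use TenantStatus::*;
    match (from, op) {
        (Active, Suspend) => Ok(Suspended),
        (Suspended, Unsuspend) => Ok(Active),
        (Active, MarkDeleted) | (Suspended, MarkDeleted) => Ok(Deleted),
        // Recovery window: only available before the cascade has run.
        (Deleted, UndoDelete) => Ok(Suspended),
        (Deleted, RunCascade) => Ok(Purged),
        // Everything else, including any operation on a purged tenant.
        _ => Err(InvalidTransition { from, op }),
    }
}
```

Keeping the transition table in one function makes the irreversibility explicit: no arm leads out of Purged.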

The audit events emitted during the cascade are preserved (in a separate axess.audit.tenant_deletion log) so the deletion is defensible against later inquiry. The events name the operator who initiated, the timestamp, and the counts (how many users, how many sessions, how many tokens).

Per-tenant configuration storage

The per-tenant fields (fingerprint pepper, lockout policy, device retention, methods) live in dedicated tables keyed by tenant id. The application's tenant store is one of the adopter-implemented surfaces; axess provides traits, the implementation is yours. The pattern is uniform across the surfaces:

#[async_trait]
pub trait TenantStore: Send + Sync {
    async fn get(&self, id: &TenantId) -> Result<Tenant, TenantStoreError>;
    async fn create(&self, bootstrap: TenantBootstrap) -> Result<Tenant, ...>;
    async fn suspend(&self, id: &TenantId, at: DateTime<Utc>) -> Result<(), ...>;
    async fn unsuspend(&self, id: &TenantId) -> Result<(), ...>;
    async fn delete(&self, id: &TenantId, mode: DeleteMode) -> Result<(), ...>;
    async fn update_lockout_policy(&self, id: &TenantId, policy: LockoutPolicy) -> Result<(), ...>;
    async fn rotate_fingerprint_pepper(&self, id: &TenantId, new: ZeroizedString) -> Result<(), ...>;
}

The trait surface is the tenant lifecycle in code. An adopter implements it against their own tenant table; axess calls into it on each lifecycle event.

Reserved principals

A handful of principals are reserved across all tenants. The system() principal is the one axess uses for its own internal operations (retention sweeps, scheduled rotations, audit pipeline ingestion). The principal carries no TenantId; its actions are attributed to the system itself, not to any tenant or user.

The reservation prevents an application from creating a user named "system" and inadvertently granting that user the permissions axess reserves for its background work. The UserId::is_reserved check fires at user-creation time; attempting to provision a reserved principal returns an error.
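
A reservation check of this kind could look like the sketch below. The reserved list, the case-insensitive comparison, and the for_new_user constructor are assumptions for illustration; only the name UserId::is_reserved comes from the text, and its real signature may differ.

```rust
/// Hypothetical reserved list; the real set is documented elsewhere.
const RESERVED_PRINCIPALS: &[&str] = &["system"];

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UserId(String);

#[derive(Debug, PartialEq, Eq)]
pub enum ProvisionError {
    Empty,
    Reserved(String),
}

impl UserId {
    /// Assumption: reservation is checked case-insensitively after trimming.
    pub fn is_reserved(candidate: &str) -> bool {
        let normalised = candidate.trim().to_ascii_lowercase();
        RESERVED_PRINCIPALS.contains(&normalised.as_str())
    }

    /// Constructor used at user-creation time; refuses reserved names.
    pub fn for_new_user(raw: &str) -> Result<Self, ProvisionError> {
        let trimmed = raw.trim();
        if trimmed.is_empty() {
            return Err(ProvisionError::Empty);
        }
        if Self::is_reserved(trimmed) {
            return Err(ProvisionError::Reserved(trimmed.to_string()));
        }
        Ok(UserId(trimmed.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```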

The set of reserved principals is small and stable. The chapter Audit events lists them.

What this enables

Multi-tenancy in axess is what lets a SaaS application provision new organisations without restructuring the data model, suspend problematic ones without affecting the rest, and delete departed ones cleanly with an audit trail. The fingerprint pepper rotates per-tenant; the lockout policy varies per-tenant; the device retention complies per-tenant; the policies scope per-tenant. The multi-tenant deployment is the single-tenant deployment with N>1.

Further reading

Scope hierarchy covers the three-tier (Global, Tenant, User) resolution mechanism that determines which configuration applies to which user. Device identity covers the per-tenant fingerprint pepper and the GDPR-shaped retention sweep. Identity store implementation covers the storage layer for the tenant record and the user records under it. Cedar policy fundamentals covers the cross-tenant forbid rule and the policy-scoping pattern.

Identity store implementation

Most of axess works against traits, and the identity store is the most consequential of them. The library does not prescribe a user schema, a tenant schema, or a factor schema; it prescribes a set of trait methods that the application implements against whatever schema it already has. This chapter walks through the three-tier trait split, the verbs each tier carries, the patterns for implementing them against a SQL backend, and the read-replica-and-fixtures variant that the NoopAuthnLog adapter enables.

The three tiers

The identity store is split into three trait tiers, in order of increasing privilege. An adopter that needs only read access implements the narrowest tier; an adopter that needs write access for audit purposes implements the middle tier; an adopter that needs full administrative control implements the widest tier.

// Tier 1: read-only.
#[async_trait]
pub trait IdentityLookup: Send + Sync {
    async fn get_user(&self, user_id: &UserId) -> Result<User, StoreError>;
    async fn find_user(
        &self,
        identifier: &str,
        tenant_id: &TenantId,
    ) -> Result<Option<User>, StoreError>;
    // ... eight more verbs
}

// Tier 2: read + per-attempt audit writes.
#[async_trait]
pub trait IdentityAuthnLog: IdentityLookup {
    async fn record_attempt(
        &self,
        attempt: AttemptRecord,
    ) -> Result<(), StoreError>;
    async fn record_lockout(
        &self,
        lockout: LockoutRecord,
    ) -> Result<(), StoreError>;
    async fn clear_lockout(
        &self,
        user_id: &UserId,
        tenant_id: &TenantId,
    ) -> Result<(), StoreError>;
    async fn last_attempts(
        &self,
        user_id: &UserId,
        tenant_id: &TenantId,
        limit: usize,
    ) -> Result<Vec<AttemptRecord>, StoreError>;
}

// Tier 3: read + audit + administrative writes.
#[async_trait]
pub trait IdentityAdmin: IdentityAuthnLog {
    async fn create_user(&self, user: NewUser) -> Result<User, StoreError>;
    async fn suspend_user(&self, user_id: &UserId, at: DateTime<Utc>) -> Result<(), StoreError>;
    async fn erase_user(&self, user_id: &UserId, gdpr_reason: &str) -> Result<(), StoreError>;
    // ... six more verbs covering admin lifecycle
}

// The umbrella for production: all three tiers.
pub trait IdentityStore: IdentityAdmin {}
impl<T: IdentityAdmin> IdentityStore for T {}

The hierarchy reads from narrowest to widest. An IdentityAuthnLog is an IdentityLookup plus the audit writes. An IdentityAdmin is an IdentityAuthnLog plus the administrative writes. The umbrella IdentityStore is the all-three-tiers shape that production backends implement.

Why three tiers

The split is the answer to three adopter situations the library has seen often enough to model explicitly.

The first situation is a read-replica deployment. A high-traffic application runs the login flow against a read-replica of the user database for latency reasons. The replica cannot accept writes; the application needs the read verbs without the write verbs. The IdentityLookup tier covers this. The application implements IdentityLookup against the replica and IdentityAuthnLog (which needs writes) against the primary.

The second situation is a fixture deployment. A test or an embedded usage of axess does not have a real database; the application uses an in-memory backend for the read verbs and does not care about the audit writes. The NoopAuthnLog adapter wraps an IdentityLookup and provides no-op implementations of the IdentityAuthnLog write verbs. The fixture has the trait surface it needs without writing an audit-table mock.

The third situation, less common, is a deployment with a separation between the application code that handles login and the administrative code that creates users. The application implements IdentityAuthnLog; the admin code separately implements IdentityAdmin. The split prevents the application code from accidentally calling erase_user or suspend_user because it never has the trait method in scope.

What the verbs actually do

The verbs split cleanly across the tiers.

IdentityLookup is reads. get_user is a primary-key lookup by UserId. find_user is a credentials-side lookup by identifier and tenant: the user typed alice@example.com, the application needs to know if this is a real user in this tenant. Other read verbs cover the variants: looking up a user by email when email is separately indexed, looking up a user by a federated identity key when the application supports federated login, listing the users in a tenant for admin tooling.

IdentityAuthnLog is the audit writes the lockout system depends on. record_attempt is called by verify_factor after every factor check; the record carries the user id, the tenant id, the factor kind, the outcome (success, failure, locked), the timestamp, the IP. record_lockout is called when the lockout policy fires; the record marks the user as locked until a specific moment. clear_lockout is called when the lockout window expires or when an administrator manually clears the state. last_attempts is the read verb the policy consults to make the next lockout decision.

IdentityAdmin is the privileged writes. create_user is the verb behind signup or admin provisioning. suspend_user is the verb behind administrative suspension (compliance, fraud investigation). erase_user is the GDPR-shaped verb: the user has invoked their right to be forgotten, and the verb cascades through every record that references them. Other admin verbs cover password reset (administrative, not user-initiated), identifier changes, and the per-user method override.
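
To make the relationship between last_attempts and the lockout decision concrete, here is a minimal sketch of counting the current failure streak from a newest-first list of attempts. The Attempt and Outcome types are simplified stand-ins, and the rules chosen (a success resets the streak, Locked outcomes are not counted, future-dated rows are skipped) are assumptions rather than the library's documented policy.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failure,
    Locked,
}

#[derive(Debug, Clone, Copy)]
pub struct Attempt {
    pub outcome: Outcome,
    pub ts: SystemTime,
}

/// Counts failures in the current streak. `attempts` must be newest-first.
/// `now` is passed in rather than read from the system clock, so the
/// function stays deterministic under test.
pub fn failures_in_window(attempts: &[Attempt], now: SystemTime, window: Duration) -> u32 {
    let mut count = 0;
    for attempt in attempts {
        match now.duration_since(attempt.ts) {
            // Inside the window: fall through to the outcome check.
            Ok(age) if age <= window => {}
            // Older than the window: everything after it is older still.
            Ok(_) => break,
            // Timestamp is in the future (clock skew): skip the row.
            Err(_) => continue,
        }
        match attempt.outcome {
            Outcome::Success => break, // a success resets the streak
            Outcome::Failure => count += 1,
            Outcome::Locked => {} // rejected while locked; not counted here
        }
    }
    count
}
```

The caller compares the returned count against failures_before_lockout to decide whether record_lockout should fire.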

Implementing against SQL

The typical implementation against a SQL database is verbose but mechanical. The pattern is to implement each verb as one query (or one transaction), with the right indexes on the user table to keep the reads fast.

A reference implementation against SQLite is in examples/sqlite/; the same pattern carries over to other SQL backends. The shape:

struct OurBackend {
    pool: SqlitePool,
}

#[async_trait]
impl IdentityLookup for OurBackend {
    async fn get_user(&self, user_id: &UserId) -> Result<User, StoreError> {
        let row = sqlx::query_as::<_, UserRow>(
            "SELECT id, tenant_id, identifier, display_name, status, created_at
             FROM users
             WHERE id = ?1"
        )
        .bind(user_id.to_string())
        .fetch_one(&self.pool)
        .await?;
        Ok(row.into())
    }

    async fn find_user(
        &self,
        identifier: &str,
        tenant_id: &TenantId,
    ) -> Result<Option<User>, StoreError> {
        let row = sqlx::query_as::<_, UserRow>(
            "SELECT id, tenant_id, identifier, display_name, status, created_at
             FROM users
             WHERE identifier = ?1 AND tenant_id = ?2"
        )
        .bind(identifier)
        .bind(tenant_id.to_string())
        .fetch_optional(&self.pool)
        .await?;
        Ok(row.map(Into::into))
    }
    // ... eight more verbs
}

The patterns to note:

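The tenant scope is on every identifier-based query. find_user filters by both identifier and tenant id; the same identifier in a different tenant is not returned. get_user, as shown, is keyed by UserId alone and selects tenant_id so the caller can check it. The discipline is what enforces cross-tenant refusal at the storage layer.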

The identifier comparison is whatever the deployment chose. The example treats the identifier as case-sensitive; deployments that want case-insensitive matching apply LOWER() to both sides (and index on LOWER(identifier)). The trait does not opinionate; the implementation decides.

The error type is the implementation's. The trait returns StoreError; the implementation maps sqlx::Error into it. The mapping preserves the kind of failure (connection error, query error, constraint violation) so the upstream callers can act on specific cases.
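
The error mapping can be sketched without a database driver. In the block below, DriverError is a stand-in for sqlx::Error and StoreErrorSketch is a stand-in for the trait's StoreError; the variant names on both are invented for the example and should not be read as either library's actual definition.

```rust
use std::fmt;

/// Stand-in for the driver's error type (sqlx::Error in the SQLite example).
#[derive(Debug)]
pub enum DriverError {
    PoolTimedOut,
    RowNotFound,
    UniqueViolation(String),
    Other(String),
}

/// Stand-in for the trait's error type; these variants are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum StoreErrorSketch {
    Unavailable,
    NotFound,
    Conflict(String),
    Query(String),
}

impl fmt::Display for StoreErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Unavailable => write!(f, "identity store unavailable"),
            Self::NotFound => write!(f, "record not found"),
            Self::Conflict(c) => write!(f, "constraint violated: {}", c),
            Self::Query(m) => write!(f, "query failed: {}", m),
        }
    }
}

impl std::error::Error for StoreErrorSketch {}

/// The mapping keeps the kind of failure so callers can branch on it.
impl From<DriverError> for StoreErrorSketch {
    fn from(e: DriverError) -> Self {
        match e {
            DriverError::PoolTimedOut => Self::Unavailable,
            DriverError::RowNotFound => Self::NotFound,
            DriverError::UniqueViolation(name) => Self::Conflict(name),
            DriverError::Other(msg) => Self::Query(msg),
        }
    }
}

/// With the From impl in place, `?` converts driver errors automatically.
pub fn user_id_from_row(row: Result<u64, DriverError>) -> Result<u64, StoreErrorSketch> {
    Ok(row?)
}
```

This is what lets the trait implementations above end each query with a bare `?`.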

Implementing the audit writes

IdentityAuthnLog is the layer that requires care. The verbs fire on every login attempt; a slow implementation is the bottleneck of the entire authentication flow.

The pattern is to batch where possible and to keep each write small. The attempts table (authn_attempts in the example) is append-only and indexed on (user_id, tenant_id, ts) for the last_attempts query. The lockout state lives in a separate table keyed by user; the record_lockout and clear_lockout verbs are upserts.

#[async_trait]
impl IdentityAuthnLog for OurBackend {
    async fn record_attempt(&self, attempt: AttemptRecord) -> Result<(), StoreError> {
        sqlx::query(
            "INSERT INTO authn_attempts (user_id, tenant_id, factor_kind, outcome, ip, ts)
             VALUES (?1, ?2, ?3, ?4, ?5, ?6)"
        )
        .bind(attempt.user_id.to_string())
        .bind(attempt.tenant_id.to_string())
        .bind(attempt.factor_kind.as_str())
        .bind(attempt.outcome.as_str())
        .bind(attempt.ip.map(|ip| ip.to_string()))
        .bind(attempt.ts)
        .execute(&self.pool)
        .await?;
        Ok(())
    }

    async fn last_attempts(
        &self,
        user_id: &UserId,
        tenant_id: &TenantId,
        limit: usize,
    ) -> Result<Vec<AttemptRecord>, StoreError> {
        let rows = sqlx::query_as::<_, AttemptRow>(
            "SELECT user_id, tenant_id, factor_kind, outcome, ip, ts
             FROM authn_attempts
             WHERE user_id = ?1 AND tenant_id = ?2
             ORDER BY ts DESC
             LIMIT ?3"
        )
        .bind(user_id.to_string())
        .bind(tenant_id.to_string())
        .bind(limit as i64)
        .fetch_all(&self.pool)
        .await?;
        Ok(rows.into_iter().map(Into::into).collect())
    }
    // ... record_lockout, clear_lockout
}

The last_attempts query is the hottest read in the audit layer. The index (user_id, tenant_id, ts DESC) makes it cheap; without the index, the query degrades to a table scan and the login flow slows under load.

The append-only attempts table grows. The retention story for it is in Audit pipeline: typically a hot/cold split where recent attempts (the ones the lockout policy consults) stay in the attempts table and older attempts archive to a cold store.

The NoopAuthnLog adapter

NoopAuthnLog<L> wraps an IdentityLookup and provides no-op implementations of the IdentityAuthnLog write verbs. The wrapper exists for two cases.

The first is fixtures. A test uses MockIdentityStore (implementing IdentityLookup), and verify_factor needs IdentityAuthnLog. The test wraps the mock in NoopAuthnLog, satisfies the trait, and runs without recording anything.

The second is read-replica deployments where the audit writes go through a different code path (an out-of-band log shipper, a Kafka topic, an external SIEM). The application implements IdentityLookup against the replica, wraps in NoopAuthnLog, and routes the audit writes through the side channel.

The trade-off is that the lockout policy will not function correctly under NoopAuthnLog. The policy consults last_attempts, which depends on the audit writes the noop silently discarded. Deployments that use NoopAuthnLog for the read-replica case must accept that the lockout policy is degraded unless they implement an alternative.

The library warns about this in the docstring of NoopAuthnLog; the warning is worth repeating: do not use NoopAuthnLog in production without an alternative lockout source.
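
The adapter pattern, and the degradation it causes, can be shown with a synchronous, std-only sketch. The traits Lookup and AuthnLog, the wrapper Noop, and the helper would_lock are simplified stand-ins invented for the example; the real traits are async and carry richer record types.

```rust
/// Simplified, synchronous stand-in for the read tier.
pub trait Lookup {
    fn find_user(&self, identifier: &str, tenant: &str) -> Option<String>;
}

/// Simplified stand-in for the audit tier: one write verb, one read verb.
pub trait AuthnLog: Lookup {
    fn record_attempt(&self, user: &str, success: bool);
    /// Newest-first outcomes; `true` means the attempt succeeded.
    fn last_attempts(&self, user: &str, limit: usize) -> Vec<bool>;
}

/// Wraps any Lookup and satisfies AuthnLog by discarding the writes.
pub struct Noop<L>(pub L);

impl<L: Lookup> Lookup for Noop<L> {
    fn find_user(&self, identifier: &str, tenant: &str) -> Option<String> {
        self.0.find_user(identifier, tenant)
    }
}

impl<L: Lookup> AuthnLog for Noop<L> {
    fn record_attempt(&self, _user: &str, _success: bool) {
        // Intentionally discarded.
    }
    fn last_attempts(&self, _user: &str, _limit: usize) -> Vec<bool> {
        // Nothing was recorded, so there is nothing to return.
        Vec::new()
    }
}

/// In-memory fixture with a single known user.
pub struct Fixture;

impl Lookup for Fixture {
    fn find_user(&self, identifier: &str, tenant: &str) -> Option<String> {
        if identifier == "alice@example.com" && tenant == "acme" {
            Some("user-1".to_string())
        } else {
            None
        }
    }
}

/// A lockout check that depends on the audit history.
pub fn would_lock<S: AuthnLog>(store: &S, user: &str, threshold: usize) -> bool {
    let failures = store
        .last_attempts(user, threshold)
        .iter()
        .filter(|ok| !**ok)
        .count();
    failures >= threshold
}
```

Recording any number of failures through Noop leaves would_lock returning false, which is exactly the degraded behaviour the warning describes.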

What about workload identities

Workloads have their own identity surface, not the same one humans use. The IdentityStore traits do not cover workloads; the workload identity resolvers (Workload identity overview) have their own machinery.

The split is deliberate. Humans live in a user table; workloads live in a workload table (or do not live anywhere durable, when they are short-lived service-to-service callers). The audit events for workloads route differently from human events. The lockout policy does not apply to workloads at all. Trying to unify the two would produce a trait that does too many jobs.

The same is true for the principal model: the Principal enum has two variants, and the read paths for the two variants go through two different stores. The application implements both stores and the resolver code routes appropriately.

Schema migration

The identity store is the part of the application most likely to need migrations over time: a new factor adds a column to the factor configurations table, a regulatory change requires a new field on the audit-attempts table, a refactor renames a column.

The migration mechanism is the application's, not axess's. sqlx::migrate! is the standard pattern; alternative migration tools (Diesel migrations, Atlas, custom SQL) work the same way. Axess does not need to know about the migrations; the implementation just needs to keep satisfying the trait against the new schema.

The pattern in examples/sqlite/ is the reference. The migrations/ directory carries the SQL files; the main.rs runs them at startup; the implementation queries against the latest schema.

What this enables

The trait split is what lets axess fit into existing applications without forcing a schema rewrite. The library knows nothing about the user table; it knows only that there is a trait it can call to look up users. The application's data model is the source of truth, and the trait surface is the bridge.

The three tiers and the noop adapter give the application enough flexibility to fit the awkward shapes (read replicas, fixtures, split admin) without forcing every adopter to implement the full set of verbs.

Further reading

Multi-tenancy covers the per-tenant configuration that the identity store reads and writes. Audit events covers the AuthEvent variants the audit-log verbs emit. Audit pipeline covers the hot/cold retention story for the attempts table. Migration guide covers the cross-version migrations that affect the user table.

Workload identity overview

A workload in axess is a non-human caller: a service in your service mesh, a Kubernetes pod, a CI/CD runner, a batch job, a serverless function. Workloads need to authenticate against your application the same way users do, but the credentials, the lifetimes, and the operational characteristics are different. This part of the book covers how axess models workload identity, how it resolves credentials into a typed Principal::Workload, and how the Cedar policy layer authorises workloads through the same rules it uses for human users.

The unifying claim is the one The principal model in Part II already made: humans and workloads are the same type. A Cedar policy that says resource.tenant_id == principal.tenant_id works for a logged-in user and for a SPIFFE-identified payment service without branching. The chapters in this part cover the specific resolvers that turn each credential kind into a Principal::Workload.

The cookbook chapters are siblings of this overview. Read them in the order that matches your deployment: SPIFFE-based deployments read Inbound: JWT-SVID and Inbound: mTLS-SVID; cloud-platform deployments read Inbound: federation and Cloud STS exchange; applications that call downstream services on a workload's behalf read Outbound: OAuth and Outbound: mTLS.

The resolver model

Every inbound request that carries a workload credential runs through a PrincipalResolver. The resolver inspects the credential (a bearer JWT in a header, a client certificate from the TLS handshake, a projected service-account token), validates it, and returns a Principal::Workload if the validation succeeds. The same trait is implemented for every credential kind axess supports, and applications wire only the resolvers their deployment needs.

        ┌──────────────────────┐
        │  Inbound request     │
        └──────────┬───────────┘
                   │
                   ▼
        ┌──────────────────────┐
        │ PrincipalResolver    │
        │ (per-feature impls)  │
        └──────────┬───────────┘
                   │
      ┌────────────┴───────────┐
      │                        │
      ▼                        ▼
Principal::Human         Principal::Workload
(session + factors)      (with Issuer + WorkloadId)
                                │
      ┌─────────────────────────┘
      │
      ▼
ToCedarEntity bridge
      │
      ▼
Cedar evaluation

Two resolvers ship today plus a generic third for everything else: JwtSvidResolver (SPIFFE JWT-SVID, spec-bound; mandatory spiffe:// URI in sub); MtlsResolver (SPIFFE X.509-SVID over mTLS); and WorkloadResolver, the generic JWT-bearer resolver that covers every non-SPIFFE workload-identity flow (Kubernetes projected service-account tokens, GitHub Actions OIDC, GitLab CI OIDC, Okta, Azure AD, Auth0, axess's own LocalIdP, custom internal JWT formats). The adopter supplies a small claim parser + mapping closure per issuer they care about; see examples/workload-identity/ for ready-made recipes (GitHub Actions, Kubernetes SA). The human side has its own SessionResolver covered in Part II. A MockResolver is available for DST tests. Each resolver implements the same trait and produces the same Principal shape.
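
As an illustration of what a per-issuer mapping closure might look like, the sketch below maps GitHub Actions OIDC claims to a SPIFFE-shaped identifier. The repository claim is part of GitHub's OIDC token; everything else (the Claims alias, WorkloadIdSketch, github_actions_mapper, and the /github/<owner>/<repo> path layout) is hypothetical and is not the closure signature WorkloadResolver actually expects. The recipes in examples/workload-identity/ are the reference.

```rust
use std::collections::HashMap;

/// Flattened string claims, standing in for a parsed JWT payload.
pub type Claims = HashMap<String, String>;

/// Stand-in for the library's WorkloadId type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WorkloadIdSketch(pub String);

/// Builds a mapping closure for one issuer. The closure returns None when
/// the claims do not have the expected shape, which the caller would treat
/// as a rejected credential.
pub fn github_actions_mapper(
    trust_domain: String,
) -> impl Fn(&Claims) -> Option<WorkloadIdSketch> {
    move |claims: &Claims| {
        // GitHub's `repository` claim has the form "owner/name".
        let repository = claims.get("repository")?;
        let (owner, name) = repository.split_once('/')?;
        if owner.is_empty() || name.is_empty() || name.contains('/') {
            return None;
        }
        Some(WorkloadIdSketch(format!(
            "spiffe://{}/github/{}/{}",
            trust_domain, owner, name
        )))
    }
}
```

The closure only runs after the token's signature and standard claims have been validated; it is a mapping step, not a verification step.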

Why one type covers both

A traditional auth library treats human and workload identity as two independent stacks. The session layer handles users; a separate JWT-validation middleware handles services. Neither composes with the other, and policies that need to apply to both ("only callers in the finance tenant may read this resource") end up duplicated: one rule for users in code that knows about sessions, another rule for workloads in code that knows about tokens, and the two drift apart over time as the application evolves.

Unifying on Principal removes the duplication. The Cedar policy quoted above works for a human and a workload because the policy matches on a tenant id, which both variants carry. If the policy later needs to discriminate between the two (a rule that demands human-completed MFA for an action, but admits any workload), the discrimination is expressed in one rule:

permit (
    principal,
    action == Action::"transfer-funds",
    resource
) when {
    resource.tenant_id == principal.tenant_id
    && (
        principal is Workload  // workloads bypass MFA requirement
        || (
            principal is Human
            && principal.factors_completed.contains("Fido2")
        )
    )
};

The discrimination is local, readable, and lives in the policy file rather than scattered across handlers.

SPIFFE and SVIDs

SPIFFE is the industry-standard model for workload identity, and the chapters that follow assume the vocabulary. The two terms worth knowing up front:

A SPIFFE ID is a URI of the form spiffe://<trust_domain>/<path>. The trust domain is the federation namespace (prod.example.com, say); the path identifies a specific workload within that domain (/svc/billing/tenant-acme). The combination uniquely names the workload across the federation.

An SVID (SPIFFE Verifiable Identity Document) is the credential that carries the SPIFFE ID. SVIDs come in two formats: JWT-SVID (a JWT signed by the trust domain's issuing authority) and X.509-SVID (a leaf certificate with the SPIFFE ID in a Subject Alternative Name URI). Both are covered in their own cookbook chapters.
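
The SPIFFE ID format is simple enough to parse by hand. The sketch below splits an identifier into trust domain and path and applies a simplified subset of the SPIFFE character rules; the type and function names are invented for the example, and a production validator should follow the SPIFFE ID specification in full.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpiffeId {
    pub trust_domain: String,
    pub path: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SpiffeIdError {
    MissingScheme,
    InvalidTrustDomain,
    InvalidPath,
}

/// Letters, digits, dots, dashes, and underscores.
fn is_segment_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '.' | '-' | '_')
}

pub fn parse_spiffe_id(input: &str) -> Result<SpiffeId, SpiffeIdError> {
    let rest = input
        .strip_prefix("spiffe://")
        .ok_or(SpiffeIdError::MissingScheme)?;

    // Everything up to the first '/' is the trust domain; the path keeps
    // its leading '/'.
    let (domain, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };

    // Trust domains are lowercase only.
    let domain_ok = !domain.is_empty()
        && domain
            .chars()
            .all(|c| is_segment_char(c) && !c.is_ascii_uppercase());
    if !domain_ok {
        return Err(SpiffeIdError::InvalidTrustDomain);
    }

    // No empty segments (so no trailing slash), no dot segments, and no
    // characters outside the allowed set (which also rules out '?' and '#').
    let path_ok = path.is_empty()
        || path.split('/').skip(1).all(|seg| {
            !seg.is_empty() && seg != "." && seg != ".." && seg.chars().all(is_segment_char)
        });
    if !path_ok {
        return Err(SpiffeIdError::InvalidPath);
    }

    Ok(SpiffeId {
        trust_domain: domain.to_string(),
        path: path.to_string(),
    })
}
```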

SPIRE is the reference SPIFFE implementation. It handles workload attestation (verifying that a process running on a host is the workload it claims to be), SVID issuance, key rotation, and trust-domain federation. Axess does not replace SPIRE; SPIRE issues, axess validates. The two are designed to compose.

A future SpireWorkloadApiResolver (tracked in the ROADMAP) will talk to a local SPIRE agent socket directly, fetching fresh SVIDs on demand rather than relying on adopters to mount them into the filesystem. For now, adopters mount short-lived SVIDs into pod filesystems and configure axess against them.

Federation

A trust domain is a unit of issuance. A workload in trust domain A is identified by an A-issued SVID, validated against A's signing keys. When a workload in domain A needs to call a service in domain B, federation is the mechanism that lets B accept A's identity.

Three federation patterns appear in axess.

Same-domain is the simple case. The resolver validates the SVID against the local trust-domain bundle (the JWKS for JWT-SVIDs, the CA bundle for X.509-SVIDs). The SVID carries the local trust domain; the resolver knows where to fetch the keys.

Federated is the cross-domain case. The resolver validates the SVID against a remote trust-domain bundle, then runs the resulting identity through a TrustDomainFederation policy that maps the foreign identity (which trust domains are accepted, which path prefixes within each are admitted, how the identity is rewritten into the local namespace if at all). The federation policy is deployment configuration; axess validates, the deployment decides the rules.

External-issuer is the non-SPIFFE case. The credential is not an SVID at all (a Kubernetes service-account token, a GitHub Actions OIDC token, an Azure AD workload token). All of these go through the single generic WorkloadResolver: the adopter supplies a claim parser + mapping closure that synthesises a SPIFFE-shape WorkloadId from whichever claims the issuer's JWT carries. The synthesis is what lets the rest of the system (Cedar policies, audit events, the principal type) work uniformly: the external workload looks like any other workload by the time the policy evaluator sees it.
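
For the federated pattern, the decision a federation policy makes can be sketched as a lookup from trust domain to admitted path prefixes, followed by an optional rewrite. FederationPolicySketch, FederationDecision, and the /federated/<domain> rewrite are all invented for illustration; the real TrustDomainFederation type may be configured quite differently.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
pub enum FederationDecision {
    /// Accepted, with the path rewritten into the local namespace.
    Admit { local_path: String },
    UnknownTrustDomain,
    PathNotAdmitted,
}

#[derive(Debug, Default)]
pub struct FederationPolicySketch {
    /// Foreign trust domain -> admitted path prefixes.
    admitted: BTreeMap<String, Vec<String>>,
}

impl FederationPolicySketch {
    pub fn admit(mut self, trust_domain: &str, path_prefix: &str) -> Self {
        self.admitted
            .entry(trust_domain.to_string())
            .or_default()
            .push(path_prefix.to_string());
        self
    }

    pub fn evaluate(&self, trust_domain: &str, path: &str) -> FederationDecision {
        let prefixes = match self.admitted.get(trust_domain) {
            Some(p) => p,
            None => return FederationDecision::UnknownTrustDomain,
        };
        // Match on whole segments, so "/svc/billing" does not admit
        // "/svc/billing-evil".
        let admitted = prefixes.iter().any(|prefix| {
            path == prefix.as_str()
                || (path.starts_with(prefix.as_str())
                    && path[prefix.len()..].starts_with('/'))
        });
        if !admitted {
            return FederationDecision::PathNotAdmitted;
        }
        // Hypothetical rewrite: nest the foreign identity under a prefix.
        FederationDecision::Admit {
            local_path: format!("/federated/{}{}", trust_domain, path),
        }
    }
}
```

The policy runs only after the SVID has been validated against the remote bundle; an unknown trust domain is rejected before any path rule is consulted.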

Cloud STS exchange

A workload that needs to call AWS, GCP, or Azure APIs can exchange its workload identity for short-lived cloud credentials. The mechanism is implemented by all three cloud providers under different names (AWS STS AssumeRoleWithWebIdentity, GCP Workload Identity Federation, Azure Federated Identity Credentials), and axess provides adapters that bridge a validated workload identity to each of them.

The chapter Cloud STS exchange covers the configuration and the credential lifecycle. The benefit is that no long-lived cloud keys ever live on the workload's filesystem; the credentials are minted on demand from the workload identity, used briefly, and discarded.

Outbound

Axess is not only an inbound authenticator. When a service authenticates to a downstream service, it uses the same identity shape it would accept inbound. The chapters Outbound: OAuth and Outbound: mTLS cover the two ways this works: the workload presents an mTLS client certificate to the downstream's TLS server, or the workload exchanges its identity for a bearer token through an OAuth flow.

The pattern matters because it lets one identity (the workload's SVID, or its federated equivalent) carry through an entire chain of service calls. The audit trail records the same identity at every hop; revocation at the issuing authority propagates to every call that was about to use the identity.

Feature flags

The resolvers are individually feature-gated so a deployment only pays the compile cost for the credential kinds it actually uses.

Feature | Resolver | Purpose
--- | --- | ---
jwt-svid | JwtSvidResolver | Inbound SPIFFE JWT-SVID (spec-bound)
mtls | MtlsResolver | Inbound SPIFFE X.509-SVID via mTLS
jwt (auto-pulled by jwt-svid etc.) | WorkloadResolver | Generic JWT-bearer workload identity for every non-SPIFFE issuer (GitHub Actions, k8s SA, GitLab CI, Okta, Azure AD, Auth0, LocalIdP, …) via adopter-supplied claim parser + mapping closure. No per-company features; see examples/workload-identity/
outbound-mtls | (client side) | Outbound mTLS with workload SVID
outbound-oauth | (client side) | Outbound OAuth client
aws-sts, gcp-wif, azure-fic | (cloud STS) | Exchange workload identity for cloud credentials
workload-id | umbrella | SPIFFE adapters + outbound + mTLS bundle

What this part does not cover

Three concerns are intentionally outside scope.

The SPIRE Agent and Server implementations are not part of axess. Axess validates SVIDs; SPIRE issues them. The two are designed to be independent so deployments can use any SPIFFE-compliant issuer (SPIRE, an in-house implementation, a managed service like AWS IAM Roles Anywhere) without changing the axess side.

Trust domain bootstrap and root-of-trust ceremonies are out of scope. Operators manage the trust-domain bundle through SPIRE's federation API or an equivalent mechanism. Axess consumes the bundle; it does not establish it.

Service mesh integration is out of scope. Istio, Linkerd, and Consul handle mesh-level identity at the proxy layer. Axess works at the application layer above the mesh. When the mesh terminates mTLS and forwards a verified identity in a header, a custom PrincipalResolver can pick it up and produce a Principal::Workload the rest of the system understands.

Further reading

The cookbook chapters in this part each cover one resolver in detail. Start with the one matching the credential kind your deployment uses; the others are useful background for the federation and outbound scenarios. Cedar policy fundamentals covers how the policy engine handles workload principals. The principal model in Part II covers the unified Principal type that all of this resolves to.

Inbound: SPIFFE JWT-SVID

A JWT-SVID is a JWT carrying a SPIFFE identity. It is the right credential for service-to-service authentication where mTLS is impractical (the network path crosses a load balancer that does not preserve client certificates, the calling service speaks a protocol that does not support TLS client auth, the deployment favours the simplicity of bearer tokens). The JwtSvidResolver is the axess resolver that validates these tokens and produces a Principal::Workload.

The feature flag is jwt-svid (off by default).

The credential shape

A SPIFFE JWT-SVID is an ordinary JWT with two specific claim requirements. The subject (sub) claim is the SPIFFE ID, formatted as spiffe://<trust_domain>/<path>. The audience (aud) claim names the intended recipient: when your application validates the token, the audience must match a configured value.

{
  "iss": "https://spire.prod.example.com",
  "sub": "spiffe://prod.example.com/svc/billing",
  "aud": ["https://api.example.com"],
  "exp": 1735689600,
  "iat": 1735686000,
  "jti": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

The signature covers the JWT header and payload (the standard JWS signing input), using keys published by the trust domain's issuing authority through a JWKS endpoint. Production deployments typically sign with RS256 or ES256; SPIFFE does not mandate a single algorithm, and the keys advertised in the JWKS indicate which one applies.
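The sub format described above can be split into its two components with a few lines of string handling. The following is a minimal sketch, not the axess parser: the function name split_spiffe_id is hypothetical, and it assumes a workload ID always carries a non-empty path.

```rust
/// Splits a SPIFFE ID into (trust_domain, path).
/// Returns None when the input is not of the form spiffe://<trust_domain>/<path>.
/// Hypothetical helper for illustration; not part of the axess API.
pub fn split_spiffe_id(id: &str) -> Option<(&str, &str)> {
    let rest = id.strip_prefix("spiffe://")?;
    // The trust domain runs up to the first '/'; the path keeps its leading '/'.
    let slash = rest.find('/')?;
    let (trust_domain, path) = rest.split_at(slash);
    // Reject an empty trust domain and a path that is only "/".
    if trust_domain.is_empty() || path.len() < 2 {
        return None;
    }
    // SPIFFE trust domains use lowercase letters, digits, '.', '-' and '_'.
    let valid = trust_domain
        .bytes()
        .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || matches!(b, b'.' | b'-' | b'_'));
    if !valid {
        return None;
    }
    Some((trust_domain, path))
}
```

Applied to the sub claim in the example token, the function yields the trust domain prod.example.com and the path /svc/billing.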

Configuration

JwtSvidResolverConfig carries the validation parameters:

pub struct JwtSvidResolverConfig {
    pub trust_domain: TrustDomain,
    pub jwks_url: Url,
    pub expected_audiences: Vec<String>,
    pub clock_skew: Duration,
    pub max_token_age: Duration,
}

let resolver = JwtSvidResolver::new(JwtSvidResolverConfig {
    trust_domain: "prod.example.com".parse().unwrap(),
    jwks_url: "https://spire.prod.example.com/keys".parse().unwrap(),
    expected_audiences: vec!["https://api.example.com".into()],
    clock_skew: Duration::from_secs(30),
    max_token_age: Duration::from_secs(3600),
});

trust_domain is the trust domain the resolver accepts SVIDs from. A token whose sub SPIFFE ID names a different trust domain is rejected. This check enforces the trust-domain isolation that SPIFFE is built around.

jwks_url is where the resolver fetches signing keys. The fetch runs through the axess-cache machinery: a single-flight cache that dedupes concurrent fetches, with debouncing to prevent denial-of-service through key-rotation thrash. The cache TTL defaults to one hour, which matches the typical SPIRE rotation schedule.

expected_audiences is the allowlist of audience values the resolver accepts. A token whose aud does not contain at least one of the expected values is rejected. Most deployments configure a single audience (the application's URL); deployments that serve multiple identities behind one resolver list each.

clock_skew is the tolerance applied to the exp and iat checks. Thirty seconds is generous; production deployments that synchronise clocks tightly through NTP can lower it.

max_token_age is the upper bound on how far in the past the token's iat claim can be. The check defeats replay of stale tokens: even if a token has not expired, a token issued more than the configured age ago is rejected. The default is one hour, which is generous; deployments with stricter posture set it lower.
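The interplay of clock_skew and max_token_age is easier to see as arithmetic. The sketch below is an illustration under stated assumptions rather than the resolver's actual code: the names TimeError and check_token_times are hypothetical, all values are Unix seconds, and the skew tolerance is assumed to apply to the age bound as well as to exp and iat.

```rust
/// Hypothetical error type for this sketch; the real resolver has its own variants.
#[derive(Debug, PartialEq)]
pub enum TimeError {
    Expired,
    IssuedInFuture,
    TooOld,
}

/// Checks exp, iat, and the maximum token age. All arguments are in seconds.
pub fn check_token_times(
    now: u64,
    iat: u64,
    exp: u64,
    clock_skew: u64,
    max_token_age: u64,
) -> Result<(), TimeError> {
    // The token stays acceptable until exp plus the skew tolerance.
    if now > exp.saturating_add(clock_skew) {
        return Err(TimeError::Expired);
    }
    // An iat further in the future than the skew tolerance is suspicious.
    if iat > now.saturating_add(clock_skew) {
        return Err(TimeError::IssuedInFuture);
    }
    // Even an unexpired token is rejected once it is older than the age bound.
    // Applying the skew here is an assumption of this sketch.
    if now.saturating_sub(iat) > max_token_age.saturating_add(clock_skew) {
        return Err(TimeError::TooOld);
    }
    Ok(())
}
```

With the example token (iat 1735686000, exp 1735689600), a 30-second skew, and a one-hour maximum age, the token is accepted up to and including 1735689630 and rejected as expired one second later.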

Wiring the resolver

The resolver is wired as a Tower middleware that runs before the handler. The middleware reads a bearer token from the Authorization header (or wherever the deployment puts it), calls into the resolver, and on success inserts the resulting Principal into the request extensions.

use axess::workload::{JwtSvidResolver, JwtSvidLayer};

let resolver = JwtSvidResolver::new(/* ... */);
let layer = JwtSvidLayer::new(resolver);

let app = Router::new()
    .route("/api/data", get(handler))
    .layer(layer);

The handler reads the principal through an extractor:

use axess::Principal;
use axum::{http::StatusCode, Extension};

async fn handler(
    Extension(principal): Extension<Principal>,
) -> Result<&'static str, StatusCode> {
    match principal {
        Principal::Workload(w) => {
            tracing::info!(workload = %w.workload_id, "request from workload");
            Ok("ok")
        }
        // The route is workload-only; reject the human request with a
        // status code instead of panicking inside the handler.
        // (Or route differently. Choice is the application's.)
        Principal::Human(_) => Err(StatusCode::FORBIDDEN),
    }
}

The middleware can be composed with other authentication paths. An application that accepts both human sessions and workload tokens wires the session layer and the JWT-SVID layer side by side; the first one to produce a principal wins.
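The "first one to produce a principal wins" rule can be sketched independently of Tower. Everything below is a stand-in for illustration: the Principal enum, the Credentials struct, and the two resolver functions are simplified placeholders that skip validation entirely, and none of them are the axess types.

```rust
/// Simplified stand-in for the unified principal type.
#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    Human(String),
    Workload(String),
}

/// Stand-in for the parts of a request the two layers look at.
pub struct Credentials {
    pub session_cookie: Option<String>,
    pub bearer_token: Option<String>,
}

/// A resolver either produces a principal or declines.
pub type Resolver = fn(&Credentials) -> Option<Principal>;

/// Placeholder: a real session layer would validate the cookie.
pub fn session_resolver(c: &Credentials) -> Option<Principal> {
    c.session_cookie.as_ref().map(|s| Principal::Human(s.clone()))
}

/// Placeholder: a real JWT-SVID layer would validate the token.
pub fn workload_resolver(c: &Credentials) -> Option<Principal> {
    c.bearer_token.as_ref().map(|t| Principal::Workload(t.clone()))
}

/// Runs the resolvers in order and returns the first principal produced.
pub fn first_principal(chain: &[Resolver], c: &Credentials) -> Option<Principal> {
    chain.iter().find_map(|resolve| resolve(c))
}
```

The order of the chain is the policy: a request that carries both a session cookie and a bearer token resolves to whichever layer is listed first, so the ordering deserves a deliberate choice.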

Validation details

The validation runs through six checks in order. The order matters because cheaper checks come first: a malformed token fails parsing without ever fetching JWKS keys; an expired token is rejected without engaging the signature check.

The first check is parsing. The token must be a well-formed JWT with header, payload, and signature segments. Malformed input produces JwtSvidError::Malformed without further work.

The second check is the header. The alg field must be one of the configured allowed algorithms (RS256 or ES256 by default; deployments that need others configure them explicitly). The kid field must be present so the resolver can look up the right key.
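A header check of this kind reduces to an allowlist lookup plus a presence test. The sketch below assumes the header has already been parsed into its alg and kid fields; HeaderError and check_header are hypothetical names, not the resolver's own.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    AlgNotAllowed,
    MissingKid,
}

/// Validates the parsed JWT header fields and returns the key id to look up.
pub fn check_header<'a>(
    alg: &str,
    kid: Option<&'a str>,
    allowed_algs: &[&str],
) -> Result<&'a str, HeaderError> {
    // Exact, case-sensitive match against the allowlist. Because the
    // list is explicit, "none" is rejected unless someone configures it.
    if !allowed_algs.contains(&alg) {
        return Err(HeaderError::AlgNotAllowed);
    }
    // Without a kid the resolver cannot pick a key from the JWKS.
    match kid {
        Some(k) if !k.is_empty() => Ok(k),
        _ => Err(HeaderError::MissingKid),
    }
}
```

Checking the algorithm against an allowlist, rather than trusting whatever the token declares, is what closes off algorithm-substitution attacks.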

The third check is the claims. The sub claim must be a valid SPIFFE URI under the configured trust domain. The aud claim must contain at least one of the configured expected audiences. The exp and iat claims must be present and within the clock skew and max age bounds. Missing or malformed claims produce specific error variants so the operational signal is clear.

The fourth check is the signature. The resolver looks up the key matching the token's kid in the cached JWKS, verifies the signature, and falls through on success. A signature failure triggers a JWKS cache refresh (subject to the debouncing) and a retry against the fresh keys; a failure after refresh is final.
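The refresh-and-retry behaviour can be sketched with a small in-memory cache. This is an illustration only: KeyCache, its method names, and the closure-based fetch and verify parameters are assumptions of the sketch, key material is reduced to raw bytes, and time is passed in as Unix seconds rather than read from a clock.

```rust
use std::collections::HashMap;

/// Hypothetical key cache: kid -> key bytes, plus debounce bookkeeping.
pub struct KeyCache {
    keys: HashMap<String, Vec<u8>>,
    last_refresh: Option<u64>,
    min_refresh_interval: u64,
}

impl KeyCache {
    pub fn new(min_refresh_interval: u64) -> Self {
        KeyCache { keys: HashMap::new(), last_refresh: None, min_refresh_interval }
    }

    /// Refetches the key set unless a refresh already ran inside the
    /// debounce window. Returns true when a fetch was performed.
    pub fn refresh_debounced<F>(&mut self, now: u64, fetch: F) -> bool
    where
        F: FnOnce() -> HashMap<String, Vec<u8>>,
    {
        if let Some(last) = self.last_refresh {
            if now.saturating_sub(last) < self.min_refresh_interval {
                return false;
            }
        }
        self.keys = fetch();
        self.last_refresh = Some(now);
        true
    }

    /// Tries the cached key first; on failure refreshes once (subject to
    /// the debounce) and retries. A failure after the refresh is final.
    pub fn verify_with_retry<F, V>(&mut self, kid: &str, now: u64, fetch: F, verify: V) -> bool
    where
        F: FnOnce() -> HashMap<String, Vec<u8>>,
        V: Fn(&[u8]) -> bool,
    {
        if self.keys.get(kid).map_or(false, |k| verify(k.as_slice())) {
            return true;
        }
        if !self.refresh_debounced(now, fetch) {
            return false;
        }
        self.keys.get(kid).map_or(false, |k| verify(k.as_slice()))
    }
}
```

The debounce is what keeps a stream of tokens with bogus kid values from turning every request into a JWKS fetch.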

The fifth check is the nbf (not-before) claim when present. SPIRE typically issues tokens with nbf slightly in the past to allow for clock skew on the receiver side. The check uses the same clock-skew tolerance.

The sixth check is the duplicate-jti check, when configured. SPIFFE recommends a JTI on each token to allow receivers to detect replay; an axess deployment that wants this protection configures a JTI store (typically a small Valkey cache with the configured max_token_age TTL), and the resolver checks for duplicates before admitting the token.
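The duplicate-jti check amounts to a set with expiry. The sketch below keeps the set in process memory for illustration; a real deployment would use the shared store mentioned above so that all replicas see the same identifiers. JtiStore and admit are hypothetical names, and time is passed in as Unix seconds.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a JTI store: jti -> time at which the entry expires.
pub struct JtiStore {
    seen: HashMap<String, u64>,
    ttl: u64,
}

impl JtiStore {
    /// `ttl` would normally match the configured max_token_age, in seconds.
    pub fn new(ttl: u64) -> Self {
        JtiStore { seen: HashMap::new(), ttl }
    }

    /// Returns true the first time a jti is seen inside the TTL window
    /// and false when the same jti is presented again (a replay).
    pub fn admit(&mut self, jti: &str, now: u64) -> bool {
        // Drop entries whose window has passed so the map stays small.
        self.seen.retain(|_, expires_at| *expires_at > now);
        if self.seen.contains_key(jti) {
            return false;
        }
        self.seen.insert(jti.to_string(), now.saturating_add(self.ttl));
        true
    }
}
```

Tying the TTL to max_token_age is sufficient because a token older than that bound is already rejected by the claims check, so its jti no longer needs to be remembered.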

What the principal looks like

A successful validation produces a Principal::Workload:

Principal::Workload(WorkloadPrincipal {
    workload_id: WorkloadId::new("spiffe://prod.example.com/svc/billing"),
    trust_domain: TrustDomain::new("prod.example.com"),
    issuer: Issuer::JwtSvid {
        jwks_url: "https://spire.prod.example.com/keys".parse().unwrap(),
    },
    tenant_id: derive_tenant_from_path(...),
    tenant_slug: derive_slug_from_path(...),
    service_name: derive_service_from_path(...),
    attributes: {
        "exp": 1735689600,
        "iat": 1735686000,
        "jti": "f47ac10b-...",
    },
})

The workload_id is the parsed SPIFFE URI. The trust_domain mirrors the configured trust domain. The issuer records that the principal came through the JWT-SVID path with the specific JWKS URL. The tenant and service derivation depends on the deployment's SPIFFE path convention (the example above expects paths like /svc/<service>/<tenant>); the resolver's path-parsing logic is configurable, and examples/local_idp/ demonstrates the pattern.
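A path convention such as /svc/<service>/<tenant> can be parsed with plain string splitting. The sketch below is not the resolver's configurable parser: PathParts and parse_svc_path are hypothetical names, and treating the tenant segment as optional (so that a path like /svc/billing still parses) is an assumption of this example.

```rust
/// Hypothetical result of parsing a SPIFFE path under the /svc/ convention.
#[derive(Debug, PartialEq)]
pub struct PathParts<'a> {
    pub service: &'a str,
    pub tenant: Option<&'a str>,
}

/// Parses "/svc/<service>" or "/svc/<service>/<tenant>".
/// Returns None for any other shape.
pub fn parse_svc_path(path: &str) -> Option<PathParts<'_>> {
    let mut segments = path.strip_prefix("/svc/")?.split('/');
    // The service segment is mandatory and must not be empty.
    let service = segments.next().filter(|s| !s.is_empty())?;
    // The tenant segment is optional, but an empty one (trailing slash) is an error.
    let tenant = match segments.next() {
        None => None,
        Some("") => return None,
        Some(t) => Some(t),
    };
    // Anything beyond two segments does not match the convention.
    if segments.next().is_some() {
        return None;
    }
    Some(PathParts { service, tenant })
}
```

Rejecting unexpected shapes outright, rather than guessing, keeps a misissued SVID from being silently mapped onto the wrong tenant.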

The attributes map carries the rest of the token's claims, so Cedar policies can match on them if needed (a policy that demands a specific issuer, for instance, reads principal.attributes.iss).

Threat model

The JWT-SVID flow is robust against the standard attacks when the validation is complete.

Against token forgery: the signature check defeats it. An attacker without the issuing authority's signing key cannot mint a valid SVID.

Against token theft: the audience check defeats most of it. A token stolen from one service cannot be used against another service whose audience does not match.

Against token replay: the iat + max_token_age bound shrinks the replay window. With the optional JTI cache, replay is detected explicitly.

Against trust-domain confusion: the trust-domain match defeats cross-domain attacks. A token from a different trust domain is rejected without further consideration.

The remaining attack surface is the issuing authority itself. A compromised SPIRE control plane can mint compromised SVIDs, and no client-side check catches that. The defence is operational: secure the SPIRE control plane, monitor its audit log, rotate keys on a schedule.

Troubleshooting

If the resolver returns KeyNotFound consistently, the JWKS URL is wrong or the key advertised in the token is not yet published at the URL. The latter is common during SPIRE rotation; the caching layer's debounce can hide the rotation briefly. Force a cache refresh (or wait for the TTL) and retry.

If the resolver returns AudienceMismatch for tokens that should work, the issuing service is minting tokens with a different audience than the application expects. Either the issuer's configuration is wrong, or the application's expected_audiences list is missing the relevant value. Inspect the token (the payload is base64url-encoded, not encrypted, so it is readable) to see what aud it carries.
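Inspecting the payload needs nothing more than a base64url decoder. The following is a debugging sketch using only the standard library; base64url_decode and jwt_payload_json are hypothetical helper names, and the code performs no signature verification, so its output must never be used for an access decision.

```rust
/// Decodes base64url (RFC 4648 section 5); padding is optional.
pub fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut buf: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits in `buf`
    for b in input.bytes() {
        let v = match b {
            b'A'..=b'Z' => b - b'A',
            b'a'..=b'z' => b - b'a' + 26,
            b'0'..=b'9' => b - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break, // tolerate trailing padding
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the leftover bits
        }
    }
    Some(out)
}

/// Returns the decoded middle segment of a three-part JWT as text.
/// For inspection only: nothing here checks the signature.
pub fn jwt_payload_json(token: &str) -> Option<String> {
    let mut parts = token.split('.');
    let (_header, payload, _signature) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() {
        return None;
    }
    String::from_utf8(base64url_decode(payload)?).ok()
}
```

Printing the result of jwt_payload_json for a rejected token shows the aud array as the issuer wrote it, which can then be compared against expected_audiences.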

If the resolver returns TrustDomainMismatch, a workload from a different domain is calling your service. If this is intentional, configure federation (the next chapter, Inbound: federation, covers the mechanism). If it is not intentional, the workload is misconfigured.

Fetching SVIDs from a local SPIRE agent

JwtSvidResolver is the verifying side: it consumes an SVID presented in an HTTP request and validates it against the trust domain's JWKS. The issuing side (fetching fresh SVIDs from a local SPIRE agent socket for outbound calls) is a separate concern.

For deployments that need to fetch SVIDs at runtime, two adopter-direct options exist on crates.io today:

  • spire-workload: higher-level wrapper around the SPIRE Workload API gRPC, including JWT-SVID fetch with auto-rotation. Most adopters reach for this first.
  • spire-api: lower-level generated gRPC client when finer control is needed.

axess does not currently wrap either crate; the SpireWorkloadApiResolver ROADMAP item lands when an adopter needs an axess-shaped surface (e.g. integration with axess-clock for rotation timing, axess-rng for ceremony nonces, or the Principal::Workload shape on the fetch result for symmetry with the verifier). Until then, the recommended path is:

  1. Use spire-workload directly in your application to fetch JWT-SVIDs against a configured audience.
  2. Present the fetched SVID on outbound calls via your HTTP client.
  3. On the receiving service, validate the SVID with JwtSvidResolver as documented above. The presenting and verifying sides interoperate without axess wrapping the fetch side.

If your deployment forces the issue (e.g. fetch-side rotation needs to drive axess-clock-pinned tests), open a tracking issue; that's exactly the adopter-demand signal the ROADMAP entry waits for.

Further reading

Workload identity overview covers the SPIFFE model and the unified Principal type this resolver produces. Inbound: mTLS-SVID covers the X.509 variant for deployments where mTLS is practical. Inbound: federation covers the cross-trust-domain patterns. Cedar policy fundamentals covers how policies match on the workload's claims through principal.attributes.

Inbound: SPIFFE X.509-SVID via mTLS

A workload authenticates over mTLS by presenting a leaf X.509 certificate that carries its SPIFFE identity in a Subject Alternative Name URI. The TLS handshake validates the certificate against the trust-domain CA bundle, the application reads the SPIFFE URI from the SAN, and the resulting identity becomes a Principal::Workload. The mechanism is the right choice for service-to-service traffic where mTLS is already in place (a service mesh, a load balancer that preserves client certs, a direct VPC peering).

The feature flag is mtls (off by default).

The credential shape

An X.509-SVID is an ordinary X.509 leaf certificate with one specific requirement: the Subject Alternative Name extension contains a URI of the form spiffe://<trust_domain>/<path>. The certificate is otherwise standard; deployments may put additional information in the subject DN, the other SAN entries, or X.509 extensions, but the SPIFFE URI is the identity the resolver reads.

The certificate chain is signed by the trust domain's CA. The chain validates the certificate's authenticity; the SAN URI identifies the workload within the trust domain.

Where the certificate comes from

Axess does not handle the TLS handshake. The handshake happens where TLS terminates (rustls in the application process, a sidecar proxy in a service mesh, a load balancer in front of the application). The terminator validates the certificate chain against the configured CA bundle, accepts or rejects the connection, and on acceptance makes the certificate available to the application.

The mechanism for making the certificate available depends on the terminator. For rustls in process, the certificate is available through axum_server::tls_rustls::RustlsConnectInfo or an equivalent connector callback, which the resolver wires through directly. For a sidecar proxy (Istio, Linkerd, Envoy in a service mesh), the proxy forwards the peer's identity in a header (Istio uses X-Forwarded-Client-Cert; Linkerd forwards the verified client identity, rather than the certificate itself, in l5d-client-id), and the resolver wires through a small adapter that parses the header into the form the resolver expects. For a load balancer in passthrough TLS mode, rustls handles the validation in-process; for a load balancer in mTLS-terminating mode (AWS ALB with mTLS, Cloudflare with client-cert auth, nginx with ssl_verify_client), the load balancer forwards the certificate in a header whose name and format depend on the product.

The application's job is to extract the certificate chain from wherever the terminator put it, wrap it in PeerCertChain, and insert it into the request extensions before the resolver runs.

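use axess::workload::PeerCertChain;
use axum::{http::Request, middleware::Next, response::Response};

// extract_cert_from_terminator is deployment-specific: it reads the chain
// from wherever the TLS terminator put it (connect info or a header).
async fn mtls_middleware<B>(
    mut req: Request<B>,
    next: Next<B>,
) -> Response {
    if let Some(chain) = extract_cert_from_terminator(&req) {
        req.extensions_mut().insert(PeerCertChain::from(chain));
    }
    next.run(req).await
}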

The critical detail: the extraction must trust only sources the deployment trusts. A request that arrives directly to the application with a forged X-Forwarded-Client-Cert header must not be accepted. Either run the application on a socket the terminator owns and reject direct connections at the network layer, or gate the header on a token the terminator injects alongside the certificate.
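The second option, gating the header on a token the terminator injects, can be sketched as follows. The header name x-terminator-token, the function names, and the plain HashMap of lowercased header names are all assumptions of this example; a deployment would use whatever its terminator can inject and would load the expected value from its secret store.

```rust
use std::collections::HashMap;

/// Compares two byte strings without returning early on the first
/// mismatching byte, so timing reveals less about the expected value.
fn bytes_equal_no_early_exit(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Returns the forwarded certificate header only when the request also
/// carries the shared token that the terminator injects. Header names
/// are assumed to be lowercased already.
pub fn trusted_client_cert<'a>(
    headers: &'a HashMap<String, String>,
    expected_token: &str,
) -> Option<&'a str> {
    // "x-terminator-token" is a hypothetical header name for this sketch.
    let presented = headers.get("x-terminator-token")?;
    if !bytes_equal_no_early_exit(presented.as_bytes(), expected_token.as_bytes()) {
        return None;
    }
    headers.get("x-forwarded-client-cert").map(String::as_str)
}
```

The terminator must also strip any client-supplied copy of both headers before adding its own; otherwise a caller who learns the token can forge the pair.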

The resolver

MtlsResolver is the resolver that reads the chain from the extensions, extracts the SPIFFE URI, validates against the configured trust domain, and produces a Principal::Workload.

use axess::workload::{MtlsResolver, MtlsResolverConfig};

let resolver = MtlsResolver::new(MtlsResolverConfig {
    trust_domain: "prod.example.com".parse().unwrap(),
    tenant_resolver: Box::new(MyTenantResolver::new(/* ... */)),
});

The configuration is small because most of the validation work has already happened. The terminator validated the certificate chain; the resolver only needs to read the SAN URI, parse it as a SPIFFE ID, and check that the trust domain matches the configured one.

tenant_resolver is the adopter-supplied piece that maps the SPIFFE path to a TenantId. The path typically follows a convention like /svc/<service>/<tenant_slug>, and the resolver looks up the tenant id from the slug. The convention is the deployment's; axess just provides the trait surface.

The validation flow

The resolver's resolve method runs five steps.

The first step is reading the peer certificate chain from request extensions. Absence here is a configuration error (the extraction middleware did not run), and the resolver returns MtlsError::NoPeerCert.

The second step is parsing the leaf certificate. The chain may contain intermediate certificates; the leaf is the first one. The resolver extracts the SAN extension and looks for a URI value matching the SPIFFE format. Absence of a SPIFFE URI in the SAN produces MtlsError::NoSpiffeId.

The third step is parsing the SPIFFE URI. The URI must be well-formed (a spiffe:// scheme, a trust domain, a path). A malformed URI produces MtlsError::MalformedSpiffeId.

The fourth step is the trust-domain match. The parsed trust domain must equal the configured one. A mismatch produces MtlsError::TrustDomainMismatch.

The fifth step is the tenant resolution. The path is fed to the configured TenantResolver, which returns a TenantId. The resolver assembles the WorkloadPrincipal with the SPIFFE id, the trust domain, the issuer (Issuer::Mtls), and the tenant id, and returns it.
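The five steps can be traced in a compact sketch. The types here are stand-ins: Cert reduces a certificate to its SAN URI entries, Workload is a trimmed-down principal, the tenant lookup is a closure, and the UnknownTenant variant is an assumption of the example (the other error names follow the text above). X.509 parsing itself is out of scope.

```rust
#[derive(Debug, PartialEq)]
pub enum MtlsError {
    NoPeerCert,
    NoSpiffeId,
    MalformedSpiffeId,
    TrustDomainMismatch,
    UnknownTenant, // assumed variant for a failed tenant lookup
}

/// Stand-in for a parsed certificate: only the SAN URI entries matter here.
pub struct Cert {
    pub san_uris: Vec<String>,
}

/// Trimmed-down stand-in for the workload principal.
#[derive(Debug, PartialEq)]
pub struct Workload {
    pub spiffe_id: String,
    pub trust_domain: String,
    pub tenant_id: String,
}

pub fn resolve(
    chain: Option<&[Cert]>,
    trust_domain: &str,
    tenant_for_path: impl Fn(&str) -> Option<String>,
) -> Result<Workload, MtlsError> {
    // Step 1: the chain must be present; the leaf is its first entry.
    let leaf = chain.and_then(|c| c.first()).ok_or(MtlsError::NoPeerCert)?;
    // Step 2: find a SAN URI that uses the spiffe scheme.
    let uri = leaf
        .san_uris
        .iter()
        .find(|u| u.starts_with("spiffe://"))
        .ok_or(MtlsError::NoSpiffeId)?;
    // Step 3: split into a non-empty trust domain and a non-empty path.
    let rest = &uri["spiffe://".len()..];
    let (domain, path) = match rest.find('/') {
        Some(i) if i > 0 && i + 1 < rest.len() => rest.split_at(i),
        _ => return Err(MtlsError::MalformedSpiffeId),
    };
    // Step 4: the trust domain must equal the configured one.
    if domain != trust_domain {
        return Err(MtlsError::TrustDomainMismatch);
    }
    // Step 5: map the path to a tenant and assemble the principal.
    let tenant_id = tenant_for_path(path).ok_or(MtlsError::UnknownTenant)?;
    Ok(Workload { spiffe_id: uri.clone(), trust_domain: domain.to_string(), tenant_id })
}
```

Each early return corresponds to one named error, which is what makes the troubleshooting entries for this resolver map one-to-one onto the steps.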

What the principal looks like

A successful validation produces:

Principal::Workload(WorkloadPrincipal {
    workload_id: WorkloadId::new("spiffe://prod.example.com/svc/billing/tenant-acme"),
    trust_domain: TrustDomain::new("prod.example.com"),
    issuer: Issuer::Mtls,
    tenant_id: TenantId::parse("acme").unwrap(),
    tenant_slug: "acme".into(),
    service_name: "billing".into(),
    attributes: { /* X.509 fields the deployment exposes */ },
})

The attributes map carries any X.509 fields the deployment chooses to surface (the certificate's serial number for audit, the certificate's expiry for short-lived-cert tracking, custom extensions). The choice is the deployment's; the resolver exposes the chain so the adopter can read what they need.

Combining with other resolvers

A common shape is mTLS as the transport-level proof of identity plus a session cookie or a JWT as the application-level proof of who the user behind the workload is. The two layers compose: the mTLS resolver runs first and establishes the workload's identity; the session or JWT layer runs second and establishes the human's identity inside the workload. Cedar policies can match on both.

The composition is what gives a deployment "the calling service is authenticated AND the user inside the call is authenticated", which is the right shape for delegated workflows. Delegated and OBO access covers the pattern from the OBO side.

Threat model

mTLS is robust against the standard attacks when the issuing CA is secure.

Against token theft: there is no token. The credential is a private key the workload holds; an attacker without the key cannot present the certificate.

Against in-flight tampering: the TLS layer protects against it. The certificate is bound to the TLS session; an attacker on the wire cannot substitute a different certificate without breaking the handshake.

Against replay: the certificate is short-lived (SPIRE typically rotates SVIDs every few hours) and bound to a TLS session. Replay across sessions requires the private key, which the attacker does not have.

The remaining attack surface is the issuing CA. A compromised CA can issue compromised certificates, and the validation cannot detect it. The defence is operational: secure the issuing CA, monitor the issuance log, rotate the CA's signing key on a schedule.

The other remaining surface is the workload's private-key storage. A workload that stores its key in a file on disk is vulnerable to file-system compromise; a workload that stores its key in a hardware enclave (TPM, HSM, KMS) is much harder to compromise. SPIRE supports both shapes through its workload-API attestation; the choice is the deployment's.

Troubleshooting

If the resolver returns NoPeerCert for connections that should work, the extraction middleware is not running, or the terminator is not forwarding the certificate. Inspect the request extensions before the resolver runs.

If the resolver returns NoSpiffeId, the certificate does not carry a SPIFFE URI in the SAN. Inspect the certificate (openssl x509 -in cert.pem -text) to see what SAN entries are present. The issuer's configuration may need to be updated to include the SPIFFE URI.

If the resolver returns TrustDomainMismatch, a workload from a different trust domain has connected. If this is intentional, configure federation (covered in Inbound: federation).

If the resolver succeeds but the tenant resolution fails, the path convention is not matching the workload's actual SPIFFE path. Inspect the path and update the tenant resolver to handle the actual format.

Further reading

Workload identity overview covers the SPIFFE model and the unified Principal type. Inbound: JWT-SVID covers the bearer token variant for deployments where mTLS is impractical. Inbound: federation covers cross-trust-domain patterns. mTLS-based authentication in Part III covers mTLS for human authentication; the validation mechanics are the same, but the interpretation of the certificate differs.

Inbound: federation

Federation is the pattern where workloads authenticate against your application using credentials issued by a third party your deployment trusts. The federating issuer typically lives outside the trust domain your own services use: Kubernetes issues service-account tokens for pods, GitHub issues OIDC tokens for Actions runs, an enterprise IdP issues tokens for cross-organisation service calls. None of these are SPIFFE issuers, but axess provides a generic resolver that bridges any JWT-bearer issuer into the unified workload-principal shape.

This chapter covers WorkloadResolver, the single resolver that handles every non-SPIFFE federation. It is gated on the jwt feature (transitively enabled by jwt-svid and the rest of the workload-identity bundle).

What federation means here

The unifying claim of federation in axess is that an external issuer's token, after validation, produces a Principal::Workload with the same shape as a SPIFFE workload. The trust domain and the SPIFFE-style path are synthesised from the issuer's claims; the issuer field on the principal records which federation produced it (Issuer::OAuth for the generic case, or one of Issuer::custom("github_actions") / Issuer::custom("kubernetes") / Issuer::custom("gitlab_ci") when audit logs need finer granularity).

The synthesis matters because the rest of the system stays uniform. A Cedar policy that says "any workload in the finance tenant may read this resource" works for a SPIFFE-identified service and for a Kubernetes pod and for a GitHub Actions run, without branching. The audit pipeline logs the same principal shape for all three. The application's code does not need to know which federation produced the request.

One resolver, many issuers

axess deliberately ships no per-issuer adapters. Each IdP's JWT claim shape is small (~20 lines for a #[derive(Deserialize)] struct, ~30 lines for a mapping closure) and adopters care about their specific IdP's exact claim semantics, not a generic average. Hard-coding wif-github, wif-k8s, wif-gitlab features in the library invites endless additions without reuse benefit.

Instead, one WorkloadResolver<C, F, R> is generic over:

  • C: the adopter's #[derive(Deserialize)] claim struct
  • F: the closure mapping verified claims to WorkloadMapping
  • R: the JTI replay-store type (defaults to NoReplay)

The library handles JWT verification (signature against JWKS, iss/aud/exp/nbf/alg checks), trust-domain pinning, and Principal construction. The closure handles claim → identity-components.

Ready-made recipes

examples/workload-identity/ ships claim parsers + mappers for two common issuers. Adopters copy the recipe that matches their IdP into their codebase (recommended for production) or depend on the crate directly (useful for prototypes and tests).

Kubernetes service accounts

Kubernetes mints OIDC-style tokens for pods through the TokenRequest API. A pod requests a token bound to a specific audience (the URL of your application, say), and the cluster's control plane returns a signed JWT carrying the pod's service-account identity. The token's iss is the cluster's OIDC issuer URL; the kubernetes.io.{namespace,serviceaccount.name} custom claim block carries the pod's identity.

use axess_example_workload_identity::kubernetes::{
    k8s_sa_mapper, K8sCustomClaims,
};
use axess_factors::federation::workload::WorkloadResolver;
use axess_factors::jwt::verifier::JwtVerifier;
use axess_identity::{Issuer, TrustDomain};
use std::sync::Arc;

// Startup wiring (cache the verifier; reuse across requests):
let verifier = Arc::new(
    JwtVerifier::new(cluster_jwks_handle)
        .with_issuer("https://kubernetes.default.svc.cluster.local")
        .with_audience("axess-platform"),
);
let trust_domain = TrustDomain::new("cluster.local").unwrap();

// Per request: adopter middleware peeks at the token to look up
// tenant_id from the namespace, then constructs the resolver.
let resolver = WorkloadResolver::<K8sCustomClaims, _, _>::new(
    verifier.clone(),
    trust_domain.clone(),
    tenant_id,
    Issuer::custom("kubernetes").unwrap(),
    bearer_token,
    k8s_sa_mapper(trust_domain),
);
let principal = resolver.resolve().await?;

The recipe synthesises a SPIFFE-shape workload id of the form spiffe://cluster.local/<sa_name>/<namespace>. Adjust the recipe's path layout if your trust-domain convention differs.

GitHub Actions OIDC

GitHub Actions can issue OIDC tokens for workflow runs. The token carries claims naming the repository, the workflow, the branch, the run id, and the actor. Combined with a trust-domain mapping, the token authenticates a specific workflow run from your organisation against your application.

use axess_example_workload_identity::github_actions::{
    github_actions_mapper, GitHubActionsClaims,
};
use axess_factors::federation::workload::WorkloadResolver;
use axess_factors::jwt::verifier::JwtVerifier;
use axess_identity::{Issuer, TrustDomain};
use std::sync::Arc;

let verifier = Arc::new(
    JwtVerifier::new(github_jwks_handle)
        .with_issuer("https://token.actions.githubusercontent.com")
        .with_audience("axess-platform"),
);
let trust_domain = TrustDomain::new("github.actions").unwrap();

let resolver = WorkloadResolver::<GitHubActionsClaims, _, _>::new(
    verifier.clone(),
    trust_domain.clone(),
    tenant_id,
    Issuer::custom("github_actions").unwrap(),
    bearer_token,
    github_actions_mapper(trust_domain),
);
let principal = resolver.resolve().await?;

The recipe synthesises spiffe://github.actions/<repo>/<owner> and preserves actor, workflow, ref, sha, event_name as Cedar attributes for policy use (allow only deploys from the default branch, require a specific workflow file, etc.).

Other issuers (GitLab CI, Okta, Azure AD, Auth0, …)

Write your own recipe. For any new IdP:

  1. Decode a sample JWT to identify which claims carry the workload identity (project_path? namespace_id? a custom service?).
  2. Define a #[derive(Deserialize)] struct YourClaims { ... } with only the fields you care about. JwtVerifier ignores unknown claims, so you don't have to enumerate everything the issuer sends.
  3. Write a mapper closure Fn(&VerifiedClaims<YourClaims>) -> Result<WorkloadMapping, IdentityError> that produces the (workload_id, service_name, tenant_slug, attributes) shape.
  4. Wire as above, with Issuer::custom("your_idp_label").unwrap() for audit-log attribution (the constructor validates the label format: [a-z0-9_]{1,32}).

The two shipped recipes are the templates; read their source, adapt as needed.
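The shape of such a recipe can be sketched with stand-in types. Everything below is hypothetical: GitLabClaims models only two claims of a GitLab-CI-style token, WorkloadMapping here is a local struct rather than the axess type, the error is a plain String, and the serde derive is omitted so the example depends on the standard library alone. The path layout follows the <service>/<tenant> order the shipped recipes use.

```rust
use std::collections::BTreeMap;

/// Hypothetical claim struct; a real recipe would derive Deserialize.
pub struct GitLabClaims {
    pub project_path: String, // expected form: "<group>/<project>"
    pub ref_name: String,
}

/// Local stand-in for the mapping a claim mapper produces.
#[derive(Debug, PartialEq)]
pub struct WorkloadMapping {
    pub workload_id: String,
    pub service_name: String,
    pub tenant_slug: String,
    pub attributes: BTreeMap<String, String>,
}

/// Builds a mapper closure that captures the pinned trust domain.
pub fn gitlab_ci_mapper(
    trust_domain: String,
) -> impl Fn(&GitLabClaims) -> Result<WorkloadMapping, String> {
    move |claims: &GitLabClaims| {
        // Admit only the exact "<group>/<project>" shape; nested groups
        // and empty segments are rejected rather than guessed at.
        let (group, project) = claims
            .project_path
            .split_once('/')
            .filter(|(g, p)| !g.is_empty() && !p.is_empty() && !p.contains('/'))
            .ok_or_else(|| format!("unexpected project_path: {}", claims.project_path))?;
        let mut attributes = BTreeMap::new();
        attributes.insert("ref".to_string(), claims.ref_name.clone());
        Ok(WorkloadMapping {
            // The project plays the service role and the group the tenant role.
            workload_id: format!("spiffe://{}/{}/{}", trust_domain, project, group),
            service_name: project.to_string(),
            tenant_slug: group.to_string(),
            attributes,
        })
    }
}
```

Because the closure decides which claim shapes are admitted, being strict here (rejecting instead of normalising) is part of the security boundary, not just input hygiene.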

When federation does and does not fit

Federation is the right answer when the deployment cannot or does not want to issue its own workload identities. A Kubernetes-based deployment that wants to use the pods' service-account tokens directly fits cleanly; an open-source CI integration that accepts tokens from any GitHub Actions run fits cleanly; an enterprise deployment that integrates with a partner's Okta tenant fits cleanly.

Federation is the wrong answer when the deployment runs SPIRE (or another SPIFFE issuer) and can mint its own SVIDs. In that case the SPIFFE-native resolvers (Inbound: JWT-SVID, Inbound: mTLS-SVID) are simpler, the trust model is tighter, and the federation indirection adds nothing.

Multi-resolver deployments are common. The same application typically accepts SPIFFE-native traffic from its own services and federated traffic from external collaborators; the resolvers wire side by side, each with its own router or middleware path, and the unified principal shape lets the policies stay the same across the sources.

Threat model

The federation flows share the threat model of the underlying issuer. A Kubernetes-issued token is as secure as the cluster's OIDC issuer; a GitHub Actions token is as secure as GitHub's issuance pipeline; an OIDC IdP-issued token is as secure as the IdP.

The defences that axess adds are the standard ones: signature verification against the issuer's JWKS, iss match, aud match, expiry check, optional clock-skew and max-age bounds, trust-domain pinning at the resolver layer, and the adopter's claim-mapper closure (which decides which subject paths the application admits).

The remaining attack surfaces are the issuer-specific ones. A compromised Kubernetes control plane mints compromised tokens. A misconfigured GitHub Actions workflow leaks the OIDC token. A compromised OIDC IdP issues tokens for arbitrary identities. The defences are operational: secure each issuer, monitor for unusual issuance patterns, rotate keys on a schedule.

The audit pipeline (covered in Audit pipeline) emits an event on every successful workload authentication, recording the issuer label (Issuer::OAuth / Issuer::Custom(...)) and the synthesised identity. The events feed the SIEM rules that catch issuer-level anomalies.

Troubleshooting

If resolve() returns NotAuthenticated, the JWT failed verification: wrong issuer, wrong audience, expired, bad signature, or a custom-claim deserialisation failure. Enable debug-level tracing for the axess_factors::federation::workload target to see which step rejected the token.

If resolve() returns InvalidSpiffeId, the resolver verified the token but the trust domain extracted from the synthesised WorkloadId did not match the resolver's pinned trust domain. Typically a mapper bug: the closure synthesised the id under the wrong trust domain. Check the recipe's trust_domain capture.

If resolve() returns InvalidComponent(...), the claim mapper rejected the verified claims. The error message names which claim was missing or malformed. Decode the JWT payload (base64url-decode the middle segment; plain base64 -d may reject the URL-safe alphabet and the missing padding) to compare claims against the mapper's expectations.

Further reading

Workload identity overview covers the SPIFFE model the federation resolver maps into. Cloud STS exchange covers the next step for many federated tokens: exchanging a workload identity for short-lived cloud credentials. OAuth 2.0 and OIDC in Part III covers the underlying OIDC machinery that the JwtVerifier builds on.

Cloud STS exchange

An application that has authenticated a workload through one of the inbound resolvers may need to call AWS, GCP, or Azure APIs on that workload's behalf. The cloud-native pattern for this is to exchange the workload's identity for short-lived cloud credentials through the cloud provider's Security Token Service. The mechanism is supported by all three major clouds under similar names (AWS STS AssumeRoleWithWebIdentity, GCP Workload Identity Federation, Azure Federated Identity Credentials), and axess provides adapters for each.

The feature flags are aws-sts, gcp-wif, and azure-fic, plus an umbrella cloud-sts that enables all three. All are off by default.

The pattern

The pattern is uniform across clouds. The application has a validated workload identity (a JWT-SVID, a federated OIDC token, a GitHub Actions OIDC token). The application wants to call a cloud API on the workload's behalf. Instead of giving the workload a long-lived cloud key, the application exchanges the workload's identity at the cloud's STS endpoint for a short-lived credential bound to a specific cloud role.

   workload identity      STS exchange       short-lived cloud credential
        token        ───>      ───>          (15 minutes, role-scoped)
                                                    │
                                                    ▼
                                              cloud API call

The exchange happens at the application layer, server-side. The workload's identity token never leaves the application; the short-lived cloud credential is what makes the actual cloud API call. The benefit is that no long-lived cloud key ever sits on the workload's filesystem, and revocation of the workload's identity (at the issuer) propagates to the cloud access without any cloud-side action.

AWS STS

The AWS adapter calls AssumeRoleWithWebIdentity, the STS API for identity federation. The configuration:

use std::time::Duration;
use axess::workload::cloud_sts::{AwsStsExchanger, AwsStsConfig};

let exchanger = AwsStsExchanger::new(AwsStsConfig {
    role_arn: "arn:aws:iam::123456789012:role/billing-api-prod".into(),
    region: "eu-west-1".into(),
    session_duration: Duration::from_secs(900),  // 15 minutes
    role_session_name_strategy: SessionNameStrategy::WorkloadId,
});

The role_arn is the AWS role the credential will assume. The role's trust policy specifies which web-identity tokens may assume it; the policy is configured on the AWS side, and the application's workload-identity issuer must match what the policy allows.

The session_duration is the lifetime of the resulting credential. AWS allows between 15 minutes and 12 hours (the upper bound is configurable per role). Fifteen minutes is the recommended default; a longer duration reduces the overhead of re-exchanging at the cost of a wider window in which a stolen credential stays usable.

The role_session_name_strategy controls how the resulting session is named in CloudTrail and AWS audit logs. Naming the session after the workload identity (WorkloadId) makes the audit trail readable; alternative strategies are available for deployments with specific compliance requirements.
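
How axess derives the name under WorkloadId is not spelled out here, so the following is only a sketch of the constraint any such strategy has to satisfy. AWS accepts a RoleSessionName of 2 to 64 characters drawn from ASCII alphanumerics plus `_+=,.@-`, which rules out the `:` and `/` of a SPIFFE ID. The helper name `session_name_from_workload_id` is hypothetical.

```rust
/// Characters AWS accepts in a RoleSessionName besides ASCII alphanumerics.
const EXTRA_ALLOWED: &[char] = &['_', '+', '=', ',', '.', '@', '-'];

/// Hypothetical helper (not an axess API): turn a workload identifier such as
/// a SPIFFE ID into a string that satisfies the RoleSessionName constraints.
pub fn session_name_from_workload_id(workload_id: &str) -> String {
    // Drop the URI scheme so the name starts with the trust domain.
    let trimmed = workload_id.strip_prefix("spiffe://").unwrap_or(workload_id);

    let mut name: String = trimmed
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || EXTRA_ALLOWED.contains(&c) {
                c
            } else {
                '-' // replace '/', ':' and anything else AWS would reject
            }
        })
        .take(64) // AWS caps the name at 64 characters
        .collect();

    // AWS requires at least 2 characters; pad degenerate inputs.
    while name.len() < 2 {
        name.push('-');
    }
    name
}
```

With this sketch, `spiffe://example.org/billing/api` becomes `example.org-billing-api`, which stays recognisable in CloudTrail.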

async fn call_aws(
    exchanger: &AwsStsExchanger,
    principal: &Principal,
) -> Result<(), Error> {
    let creds = exchanger
        .exchange(principal_to_token(principal))
        .await?;

    let s3_client = aws_sdk_s3::Client::from_conf(
        aws_sdk_s3::Config::builder()
            .credentials_provider(creds)
            .build()
    );
    s3_client.list_buckets().send().await?;
    Ok(())
}

GCP Workload Identity Federation

The GCP adapter calls Google Cloud's federated-credentials endpoint, which exchanges a token from an external identity provider for a Google Cloud access token. The configuration:

use axess::workload::cloud_sts::{GcpWifExchanger, GcpWifConfig};

let exchanger = GcpWifExchanger::new(GcpWifConfig {
    workload_identity_pool: "projects/123/locations/global/workloadIdentityPools/axess".into(),
    workload_identity_provider: "external-oidc".into(),
    target_principal: "billing-api@project.iam.gserviceaccount.com".into(),
    scopes: vec!["https://www.googleapis.com/auth/cloud-platform".into()],
});

The workload_identity_pool and workload_identity_provider name the GCP-side configuration that maps external identities to GCP identities. The pool and provider are configured on the GCP side through the gcloud CLI or Terraform; the application's adapter references them by name.

The target_principal is the GCP service account the exchange impersonates. The service account's IAM bindings determine which GCP resources the resulting credential can access.

The scopes list bounds what the credential can be used for. The narrowest possible scope is the recommendation; cloud-platform is the broadest and should be used only when the application genuinely needs unrestricted access.

Azure Federated Identity Credentials

The Azure adapter exchanges an external identity for an Azure AD access token through the FIC (Federated Identity Credential) mechanism. The configuration:

use axess::workload::cloud_sts::{AzureFicExchanger, AzureFicConfig};

let exchanger = AzureFicExchanger::new(AzureFicConfig {
    tenant_id: "00000000-0000-0000-0000-000000000000".into(),
    client_id: "11111111-1111-1111-1111-111111111111".into(),
    scope: "https://storage.azure.com/.default".into(),
});

The tenant_id is the Azure AD tenant. The client_id is the managed identity or application registration in that tenant that the exchange will authenticate as; the FIC binding on the managed identity determines which external tokens may exchange for it.

The scope is the Azure AD resource the resulting token is bound to. Azure tokens are audience-scoped; a token for storage cannot be used against Key Vault. List the scopes the application needs; use the .default suffix to inherit the managed identity's configured permissions.

Credential lifecycle

The short-lived credentials returned by all three STS endpoints carry an explicit expiry, and the application's call path needs to respect it. Two shapes are common.

The simple shape is one exchange per cloud call. The application exchanges, makes the call, discards the credential. The latency overhead is one STS round-trip per call (typically 50 to 200 ms depending on the cloud), which is acceptable for one-off operations.

The optimised shape is to cache the exchanged credential for the duration of its validity. The application exchanges once, caches the credential, uses it for subsequent calls until it nears expiry, then re-exchanges. The cache key is the workload identity plus the target role; the cache value is the credential plus its expiry.

The right shape depends on the call rate. Below a few calls per minute, the simple shape is fine. Above that, the optimised shape with a per-workload cache (a ClockTtlCache from axess-cache) eliminates the per-call STS round-trip.

The expiry handling needs care. A credential that expires mid-call produces an authentication error from the cloud SDK, which the application catches and translates into a re-exchange. The cache wraps the expiry check; calls that get a near-expired credential refresh proactively.
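
The optimised shape can be sketched with the standard library alone. This is not the ClockTtlCache from axess-cache; `CredentialCache`, `CachedCredential`, and `get_or_exchange` are hypothetical names, and time is passed in explicitly so the behaviour is deterministic under test.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for a short-lived cloud credential.
#[derive(Clone, Debug, PartialEq)]
pub struct CachedCredential {
    pub secret: String,
    /// Expiry as an offset from whatever epoch the caller's clock uses.
    pub expires_at: Duration,
}

/// Cache keyed by (workload identity, target role).
pub struct CredentialCache {
    entries: HashMap<(String, String), CachedCredential>,
    /// Credentials closer than this to expiry are treated as stale.
    refresh_margin: Duration,
}

impl CredentialCache {
    pub fn new(refresh_margin: Duration) -> Self {
        Self { entries: HashMap::new(), refresh_margin }
    }

    /// Returns the cached credential if it is still comfortably valid at
    /// `now`; otherwise calls `exchange` (the STS round-trip) and caches
    /// the result.
    pub fn get_or_exchange<F, E>(
        &mut self,
        workload: &str,
        role: &str,
        now: Duration,
        exchange: F,
    ) -> Result<CachedCredential, E>
    where
        F: FnOnce() -> Result<CachedCredential, E>,
    {
        let key = (workload.to_string(), role.to_string());
        if let Some(hit) = self.entries.get(&key) {
            if now + self.refresh_margin < hit.expires_at {
                return Ok(hit.clone());
            }
        }
        // Missing or near expiry: refresh proactively.
        let fresh = exchange()?;
        self.entries.insert(key, fresh.clone());
        Ok(fresh)
    }
}
```

The margin is what keeps a credential from expiring mid-call: an entry within the margin of its expiry is replaced before it is handed out.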

Multi-cloud deployments

A deployment that uses multiple clouds (a workload that calls both AWS and GCP, say) configures one exchanger per cloud. The two are independent; they share the workload identity as input but produce cloud-specific credentials as output.

The pattern composes cleanly. The application has a workload principal; it has one AwsStsExchanger and one GcpWifExchanger in scope; calls to AWS go through the AWS exchanger, calls to GCP go through the GCP exchanger. No cross-cloud coupling.

Threat model

Cloud STS exchange is robust against credential theft because the short-lived credentials it produces are time-bounded. A stolen credential expires within minutes regardless of the attacker's actions.

The remaining attack surfaces:

The first is the workload identity itself. A compromised workload identity can be exchanged for fresh cloud credentials at any time. The defence is to keep the workload identity short-lived (SPIRE rotates SVIDs every few hours, GitHub Actions OIDC tokens are issued per job and expire within minutes), so a compromised identity has a bounded lifetime.

The second is the STS endpoint. A compromised STS issues compromised credentials. The defence is operational: the cloud provider secures their STS; the application validates the returned credentials by their structure (signature, format) but cannot independently verify that the STS itself is honest.

The third is the role's trust policy. A misconfigured trust policy allows any workload to assume the role, defeating the identity-based restriction. The defence is to review trust policies carefully at deployment time; the principle of least privilege applies.

Audit

Each exchange produces an audit event (a DelegatedTokenExchanged event in the axess audit pipeline) and a cloud-side audit event (CloudTrail for AWS, Cloud Audit Logs for GCP, Activity Log for Azure). The two together give a complete picture: what identity was exchanged, when, for what role, and what cloud actions the resulting credential performed.

The retention configuration is in Audit pipeline. The recommendation is longer retention for STS-exchange events than for ordinary authentication events, because these events are the evidence a future compliance review of cross-cloud actions will ask for.

Troubleshooting

If the exchange returns AccessDenied from AWS STS, the role's trust policy does not admit the token. Check the policy's Principal.Federated and Condition blocks; the most common issues are a wrong issuer URL, a wrong audience, or a missing required claim.

If the exchange returns INVALID_ARGUMENT from GCP, the workload identity pool or provider name is wrong, or the token's shape does not match what the provider expects. Inspect the provider configuration through gcloud iam workload-identity-pools providers describe.

If the exchange returns AADSTS70021 from Azure, the FIC binding on the managed identity does not match the token's subject claim. Update the FIC configuration to match what the workload identity emits.

Further reading

Inbound: JWT-SVID and Inbound: federation cover the resolvers that produce the workload identity that gets exchanged here. Outbound: OAuth covers OAuth-based outbound credentials, which are an alternative to cloud STS for some non-cloud downstreams. Audit pipeline covers the retention configuration for cross-cloud audit events.

Outbound: OAuth

This chapter covers the case where the application authenticates itself as a workload against a downstream OAuth-protected service. The application is the OAuth client; the downstream is the resource server. The credential is an access token the application acquires through one of the OAuth client flows (client credentials, token exchange, or refresh of a stored token).

The chapter pairs with Inbound: federation, which covers the inbound case where the application accepts workload tokens, and with Cloud STS exchange, which covers outbound credentials for cloud APIs. This chapter covers the outbound case where the application presents OAuth tokens to other downstreams.

The feature flag is outbound-oauth (off by default).

When to use it

Three patterns lead to outbound OAuth.

The first is a service-to-service call between two services your deployment owns, where the receiving service authenticates inbound OAuth (typically through the generic WorkloadResolver from Inbound: federation). The application's outbound configuration mints a fresh token through the client-credentials grant, sends it on the request, and the receiving service validates it.

The second is a call to a SaaS service that requires OAuth (Slack, Stripe, Twilio, an enterprise CRM). The application is registered as an OAuth client at the SaaS, holds a client id and secret, and mints tokens to call the SaaS's API.

The third is a call to a downstream service on a user's behalf, where the credential is a token exchanged from the user's session or from a stored refresh token. This is the OBO case, covered in Delegated and OBO access; the outbound-oauth machinery in this chapter is what delegated-stored and delegated-exchange use under the hood.

Configuration

OutboundOAuthClient is the type that mints tokens. The configuration:

use axess::workload::outbound::{OutboundOAuthClient, OutboundOAuthConfig};

let client = OutboundOAuthClient::new(OutboundOAuthConfig {
    token_endpoint: "https://idp.example.com/oauth/token".parse().unwrap(),
    client_id: "billing-api-prod".into(),
    client_credential: ClientCredential::Secret("...".into()),
    scopes: vec!["https://api.downstream.example/.default".into()],
    audience: Some("https://api.downstream.example".into()),
});

token_endpoint is the OAuth server's token endpoint. The endpoint typically comes from the OAuth server's discovery document; the configuration is the resolved URL.

client_credential carries how the application authenticates to the token endpoint. The variants are:

pub enum ClientCredential {
    Secret(ZeroizedString),
    JwtAssertion { signing_key: SigningKey, kid: String },
    Mtls,                         // client cert from outbound TLS
    SignedJwt { /* ... */ },
}

The Secret variant is the classic OAuth client secret. The JwtAssertion variant is RFC 7523 (private_key_jwt authentication), which is what FAPI-grade integrations use; the application signs a short-lived assertion JWT with its private key, and the token endpoint validates it against the registered public key. The Mtls variant uses the outbound TLS connection's client certificate as the authentication. The SignedJwt variant covers cases where the JWT structure differs from RFC 7523.

scopes is the list of scopes requested. The narrowest possible list is the recommendation; over-broad scopes leak privilege if the resulting token is compromised.

audience is the optional audience parameter, used by some token endpoints (Azure AD, Auth0, others that follow the same pattern) to bind the resulting token to a specific resource.

Minting tokens

The simple shape calls mint_token directly:

async fn call_downstream(
    client: &OutboundOAuthClient,
    http_client: &reqwest::Client,  // assumed here to be a reqwest client
) -> Result<(), Error> {
    // One token-endpoint round-trip per call.
    let token = client.mint_token().await?;

    let _response = http_client
        .get("https://api.downstream.example/data")
        .header("Authorization", format!("Bearer {}", token.access_token))
        .send()
        .await?;
    Ok(())
}

Each mint_token call hits the token endpoint, exchanges the client credentials, and returns the access token. The cost is one round-trip per call.

The optimised shape caches the token for the duration of its validity:

async fn call_downstream_cached(
    client: &CachedOutboundOAuthClient,
) -> Result<(), Error> {
    let token = client.get_cached().await?;
    // token is fresh or freshly-minted; cache handles the expiry.
    // ... use it
    Ok(())
}

CachedOutboundOAuthClient is the cache wrapper. The cache uses the same ClockTtlCache machinery the rest of axess uses; the TTL is the token's expires_in value, minus a small buffer so a token that expires mid-call is refreshed proactively.
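
The TTL arithmetic is small enough to show on its own. The sketch below is not the CachedOutboundOAuthClient implementation; `cache_ttl` is a hypothetical helper and the buffer value is the caller's choice.

```rust
use std::time::Duration;

/// Hypothetical helper: how long to cache a token whose token-endpoint
/// response carried `expires_in_secs`, keeping `buffer` in reserve.
pub fn cache_ttl(expires_in_secs: u64, buffer: Duration) -> Duration {
    // saturating_sub avoids underflow when the token lifetime is shorter
    // than the buffer; a zero TTL then means "do not cache, mint again".
    Duration::from_secs(expires_in_secs).saturating_sub(buffer)
}
```

A one-hour token with a 30-second buffer is cached for 3570 seconds; a 10-second token with the same buffer is not cached at all.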

The right shape depends on the call rate. Below a few calls per minute, the simple shape works. Above that, the cache is worth the complexity.

Token exchange (RFC 8693)

The token-exchange flow is the alternative to client-credentials when the outbound call is on behalf of an inbound principal (human or workload). The application presents the inbound credential to a token-exchange-capable IdP and receives a token bound to the downstream audience.

use axess::workload::outbound::{TokenExchanger, ExchangeRequest};

let exchanger = TokenExchanger::new(/* ... */);

let token = exchanger.exchange(ExchangeRequest {
    subject_token: inbound_token,
    subject_token_type: "urn:ietf:params:oauth:token-type:jwt".into(),
    audience: "https://api.downstream.example".into(),
    scopes: vec!["read:data".into()],
}).await?;

The exchange runs through the IdP's token endpoint with the RFC 8693 parameters; the IdP validates the subject token, applies whatever exchange policy it has, and returns a token for the requested audience. The pattern is what most enterprise IdPs support today (Azure AD, Okta, Auth0); the OBO chapter covers it in detail from the application's side.
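
On the wire, the RFC 8693 parameters travel as a form-encoded POST body to the token endpoint. The sketch below builds that body with the standard library only, to show which parameters are involved; `token_exchange_body` and `form_encode` are hypothetical helpers, not part of TokenExchanger, and client authentication is left out.

```rust
/// Percent-encode a value for an application/x-www-form-urlencoded body.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'*' => {
                out.push(byte as char)
            }
            b' ' => out.push('+'),
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Hypothetical helper: the request body for an RFC 8693 token exchange.
pub fn token_exchange_body(
    subject_token: &str,
    subject_token_type: &str,
    audience: &str,
    scopes: &[&str],
) -> String {
    let mut pairs = vec![
        // The grant type is fixed by RFC 8693.
        ("grant_type", "urn:ietf:params:oauth:grant-type:token-exchange".to_string()),
        ("subject_token", subject_token.to_string()),
        ("subject_token_type", subject_token_type.to_string()),
        ("audience", audience.to_string()),
    ];
    if !scopes.is_empty() {
        // Scopes are a single space-delimited parameter.
        pairs.push(("scope", scopes.join(" ")));
    }
    pairs
        .iter()
        .map(|(k, v)| format!("{}={}", k, form_encode(v)))
        .collect::<Vec<_>>()
        .join("&")
}
```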

DPoP and sender-constrained tokens

The FAPI 2.0 chapter covers DPoP as a way to bind access tokens to a key the client controls. The outbound-oauth machinery supports DPoP through an opt-in configuration:

let config = OutboundOAuthConfig {
    // ... standard configuration ...
    sender_constraint: Some(SenderConstraint::DPoP {
        key_provider: Box::new(my_dpop_key_provider()),
    }),
};

When sender_constraint is set, the client generates a DPoP proof on each call, signed with the configured key, and attaches it to the request along with the access token. The downstream validates the proof, matches the key thumbprint against the token's binding, and serves the request.

The cost is one extra HTTP header per call plus a signature. The benefit is that a stolen access token is unusable without the DPoP key, which the client never transmits.
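
To make the proof concrete, the sketch below shows the request-binding part of what a downstream checks against a DPoP proof's claims under RFC 9449: the htm claim must equal the HTTP method, the htu claim must equal the request URI without query and fragment, and iat must be recent. It deliberately omits signature verification, jti replay tracking, and the access-token hash, and the type and function names are hypothetical rather than axess APIs.

```rust
/// Claims carried in a DPoP proof JWT (RFC 9449), already decoded.
pub struct DpopClaims<'a> {
    pub htm: &'a str, // HTTP method the proof was made for
    pub htu: &'a str, // target URI, without query and fragment
    pub iat: u64,     // issued-at, seconds since the Unix epoch
    pub jti: &'a str, // unique identifier, used for replay detection
}

/// Strip the query and fragment, which htu does not cover.
fn strip_query_fragment(uri: &str) -> &str {
    uri.split(|c| c == '?' || c == '#').next().unwrap_or(uri)
}

/// Hypothetical check: does the proof belong to this request, and is it fresh?
pub fn claims_match_request(
    claims: &DpopClaims,
    method: &str,
    url: &str,
    now_unix: u64,
    max_age_secs: u64,
) -> bool {
    let method_ok = claims.htm == method;
    let uri_ok = strip_query_fragment(claims.htu) == strip_query_fragment(url);
    // Accept a small window on either side of `now` to tolerate clock skew.
    let fresh = now_unix.abs_diff(claims.iat) <= max_age_secs;
    method_ok && uri_ok && fresh
}
```

Because the proof is bound to one method and one URI and goes stale quickly, capturing a proof alongside a token does not help an attacker call a different endpoint later.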

Threat model

The outbound OAuth flows have a smaller threat surface than the inbound flows because the application controls both ends of the trust relationship.

Against client credential theft: the credential lives in the application's secrets store. Theft requires application-level compromise, which is a bigger problem than the OAuth credential alone.

Against access token theft in transit: TLS protects the wire. Stealing a token from a TLS-protected call requires breaking TLS, which is outside what the OAuth client can defend against.

Against access token theft at rest: tokens are short-lived (typically minutes) and held in process memory. A long-lived refresh token (in the stored OBO case) is what carries longer exposure; the encrypted credential store decorator covers that.

Against scope creep: the scopes parameter restricts what the token can do. The discipline is to request the narrowest scopes the application needs, so a compromised token has limited blast radius.

Troubleshooting

If the token endpoint returns invalid_client, the client credentials are not what the IdP expects. The most common cause is using Secret against an endpoint that requires JwtAssertion, or vice versa.

If the token endpoint returns invalid_scope, the requested scopes are not authorised for this client. Check the client's registration at the IdP to see which scopes are permitted.

If the downstream returns 401 on the apparently-fresh token, the audience does not match the downstream's expected audience. Some IdPs default the audience to the client id rather than to a resource URL; set the audience parameter explicitly.

Further reading

OAuth 2.0 and OIDC covers the inbound OAuth machinery and the shared OIDC primitives. FAPI 2.0 covers DPoP and the sender-constrained-token pattern. Delegated and OBO access covers the higher-level OBO machinery that uses outbound OAuth under the hood. Operations runbook covers client-credential rotation and the DPoP key lifecycle.

Outbound: mTLS

This chapter covers the case where the application presents an X.509 client certificate during the outbound TLS handshake to a downstream service that requires mTLS. The credential is the application's workload identity in X.509 form, typically an X.509-SVID issued by SPIRE or an equivalent. The downstream validates the certificate against its trust anchor and accepts or rejects the connection.

The feature flag is outbound-mtls (off by default).

When to use it

Outbound mTLS is the right pattern for service-to-service traffic within a federation that uses mTLS as the standard authentication mechanism (a SPIFFE-based service mesh, an intra-organisation network where everything speaks mTLS, a partner integration where both sides have agreed to mTLS). The application's certificate identifies it as a workload to the downstream; no bearer token needs to ride the request.

The pattern is operationally simpler than outbound OAuth because the authentication happens once at connection setup rather than per request. A long-lived TLS connection handles many requests without re-authenticating; a short-lived connection re-authenticates on the next request. The cost is the TLS handshake's CPU and round-trip; the benefit is no per-request authentication state.

Configuration

OutboundMtlsClient is the type that holds the certificate and key, and provides them to the outbound TLS handshake. The configuration:

use std::time::Duration;
use axess::workload::outbound::{OutboundMtlsClient, OutboundMtlsConfig};

let client = OutboundMtlsClient::new(OutboundMtlsConfig {
    client_cert_path: "/var/lib/axess/svid/cert.pem".into(),
    client_key_path: "/var/lib/axess/svid/key.pem".into(),
    ca_bundle_path: Some("/var/lib/axess/svid/ca.pem".into()),
    reload_interval: Some(Duration::from_secs(300)),
});

client_cert_path and client_key_path are filesystem paths to the certificate and the private key. The conventional location is where SPIRE writes them: SPIRE rotates the certificate on a configurable schedule (typically every few hours), writes the new files atomically, and the client picks them up on next read.

ca_bundle_path is the optional path to the trust anchor for the downstream's server certificate. When set, the client validates the downstream's server cert against this bundle; when unset, the client uses the system trust store.

reload_interval controls how often the client checks the certificate files for changes. The check is a stat call; an unchanged file is a no-op, a changed file triggers a re-read. The default (every five minutes) matches typical SPIRE rotation schedules; deployments with faster rotation lower this.
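
The stat-based check can be sketched with the standard library. This is an illustration of the idea, not the OutboundMtlsClient internals; `FileStamp`, `stamp`, and `changed_since` are hypothetical names, and comparing modification time plus length is one reasonable heuristic among several.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// What a stat call tells us about a file, without reading its contents.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FileStamp {
    modified: SystemTime,
    len: u64,
}

/// Take a stamp of the file's current metadata.
pub fn stamp(path: &Path) -> io::Result<FileStamp> {
    let meta = fs::metadata(path)?;
    Ok(FileStamp { modified: meta.modified()?, len: meta.len() })
}

/// Returns Some(new_stamp) when the file changed since `last`, None when the
/// check is a no-op. Only a Some result warrants re-reading the file.
pub fn changed_since(path: &Path, last: FileStamp) -> io::Result<Option<FileStamp>> {
    let current = stamp(path)?;
    Ok(if current != last { Some(current) } else { None })
}
```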

The TLS handshake

The client integrates with the application's HTTP client (typically reqwest, but the pattern generalises) by supplying a preconfigured TLS configuration:

use axess::workload::outbound::OutboundMtlsClient;
use reqwest::Client;

let mtls = OutboundMtlsClient::new(/* ... */);

let http_client = Client::builder()
    .use_preconfigured_tls(mtls.rustls_client_config())
    .build()?;

let response = http_client
    .get("https://downstream.example/data")
    .send()
    .await?;

rustls_client_config returns a rustls ClientConfig with the certificate, key, and trust anchor configured. The use_preconfigured_tls integration on reqwest accepts this directly; other HTTP clients have similar integration points.

The handshake validates the downstream's server certificate against the configured trust anchor (or the system store), then presents the client certificate. If the downstream requires the client certificate and the application's certificate is missing or invalid, the handshake fails. If the downstream does not require the certificate, the handshake succeeds and the certificate is ignored.

Certificate rotation

The certificate rotation is what makes outbound mTLS sustainable in production. A static certificate provisioned at deployment time expires; the deployment has to redeploy to refresh it. A rotated certificate refreshes itself; the deployment runs indefinitely.

SPIRE rotates X.509-SVIDs on a schedule the operator configures (typically every few hours). The new certificate is written atomically to the filesystem (a temporary file plus a rename, so in-progress reads see either the old file or the new one, never a truncated file). The application's OutboundMtlsClient reads the files at construction and on its reload interval.

The reload-interval choice matters. Too short, and the client spends CPU on stat calls. Too long, and the client keeps presenting a certificate past its expiry, producing handshake failures. The recommendation is to set the interval to no more than about a third of the certificate's lifetime, so a typical rotation leaves enough time for the next reload to pick up the new files before expiry.
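
As a worked example of that rule of thumb, the helper below derives an interval from a certificate lifetime. The name `suggested_reload_interval` and the floor parameter are assumptions for illustration; axess does not necessarily expose anything like it.

```rust
use std::time::Duration;

/// Hypothetical helper: about a third of the certificate lifetime, but never
/// below `floor`, so very short-lived certificates do not cause a stat storm.
pub fn suggested_reload_interval(cert_lifetime: Duration, floor: Duration) -> Duration {
    (cert_lifetime / 3).max(floor)
}
```

A three-hour certificate gives a one-hour interval; a one-minute certificate with a 30-second floor gives 30 seconds.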

A reload that finds a malformed certificate logs the error and keeps the previous certificate in memory. The client continues to function until the previous certificate expires, by which point either the malformed state is fixed or the handshake fails. The graceful-degradation pattern is the right shape: a botched rotation should not bring down the application immediately.
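
The keep-the-previous-certificate behaviour comes down to parsing before swapping. The sketch below shows that ordering; `CertSlot` and `LoadedCert` are hypothetical types, and `parse_cert` only checks PEM framing as a stand-in for real X.509 parsing.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct LoadedCert {
    pub pem: String,
}

/// Stand-in for real PEM/X.509 parsing: only checks the PEM framing.
fn parse_cert(pem: &str) -> Result<LoadedCert, String> {
    let t = pem.trim();
    if t.starts_with("-----BEGIN CERTIFICATE-----") && t.ends_with("-----END CERTIFICATE-----") {
        Ok(LoadedCert { pem: t.to_string() })
    } else {
        Err("failed to read certificate: missing PEM markers".to_string())
    }
}

/// Holds the certificate currently presented in handshakes.
pub struct CertSlot {
    current: LoadedCert,
}

impl CertSlot {
    pub fn new(initial_pem: &str) -> Result<Self, String> {
        Ok(Self { current: parse_cert(initial_pem)? })
    }

    pub fn current(&self) -> &LoadedCert {
        &self.current
    }

    /// Try to swap in new material. On failure the old certificate stays in
    /// place and the error is returned so the caller can log it.
    pub fn reload(&mut self, new_pem: &str) -> Result<(), String> {
        let parsed = parse_cert(new_pem)?; // early return leaves `current` untouched
        self.current = parsed;
        Ok(())
    }
}
```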

When the downstream is also axess

A common shape is two axess-instrumented services calling each other over mTLS. The calling side presents its X.509-SVID through the outbound-mtls machinery; the receiving side validates it through the mtls resolver from Inbound: mTLS-SVID. The two sides compose without any further integration: the same SPIFFE identity flows through the TLS handshake, the receiving resolver extracts it, the resulting principal is the calling service's identity.

The pattern is what gives a SPIFFE-based deployment a fully identity-aware service mesh at the application layer, without requiring a sidecar proxy. The mesh's identity is the application's identity; the audit trail records the same identity at every hop.

Threat model

Outbound mTLS shares the threat model of the X.509-SVID inbound case from Inbound: mTLS-SVID. The key-storage problem is the biggest concern: a workload whose private key is on disk is vulnerable to filesystem compromise; a workload whose key lives in a TPM, HSM, or KMS is much harder to compromise.

The additional concern for outbound is the downstream's trust configuration. A misconfigured downstream that accepts any client certificate from any CA (or that does not require client certificates at all) defeats the authentication. The defence is operational: ensure the downstream's trust configuration is correct, monitor for unexpected accepted connections, audit the configuration on a schedule.

Troubleshooting

If the handshake fails with a certificate-validation error, the downstream does not trust the application's CA. The downstream's trust bundle needs to include the application's CA; this is the downstream's configuration, not the client's.

If the handshake succeeds but the downstream returns 401 on every request, the downstream is performing authorisation against the certificate's identity rather than just authentication. Check the downstream's authorisation policy: it may require a specific SPIFFE path, a specific issuer, or a specific X.509 extension that the application's certificate does not have.

If the reload fails silently and the application uses an expired certificate, check the reload-interval configuration and the application's log output. The reload errors are logged at warn level; a missed reload typically surfaces as a "failed to read certificate" message.

Further reading

Inbound: mTLS-SVID covers the receiving side of the same machinery. Workload identity overview covers the SPIFFE model both sides use. Outbound: OAuth covers the alternative pattern for downstreams that require bearer tokens rather than mTLS. Operations runbook covers the certificate rotation and the key-storage choices for production deployments.

Delegated and OBO access

The scenario is common: your application needs to act on behalf of the user against a downstream service. A user signs in, grants your application the right to read their inbox or post on their behalf, and from that moment forward the application can make calls to the downstream service that the downstream sees as coming from the user. The mechanism is on-behalf-of (OBO) access, and axess covers two shapes through the delegated/ module under axess-core.

The feature flag is delegated (off by default), with two narrower variants (delegated-stored, delegated-exchange) that turn on each shape independently. The module lives inside axess-core rather than as a separate crate because the encryption envelope it needs already ships with the SQL session backends, so the isolation benefit a separate crate would have provided was illusory. Adopters who do not turn on the feature pay zero compile cost.

The two shapes

OBO comes in two architectural shapes. The shape matters because the operational characteristics differ: where credentials live, how often they refresh, what happens when the user revokes consent.

The first shape is stored OBO. The user grants consent once through an OAuth flow; the application receives a refresh token along with the initial access token; the application persists the refresh token; future calls to the downstream service use the refresh token to mint a fresh access token, then use the access token to make the actual call. The pattern is what most "connect your Google account" or "connect your Slack account" flows do.

The second shape is token exchange (RFC 8693). The user's session in the application carries a credential (a session cookie, a JWT session, a workload identity token). When the application needs to call a downstream service on the user's behalf, it presents the credential to a Security Token Service (STS) and receives a short-lived access token bound to the call. There is no persistent storage of credentials for the downstream; the exchange happens per call (or per a short cache window).

The two shapes solve different problems. Stored OBO is right when the application needs to act on the user's behalf when the user is not actively present (a scheduled report that pulls from Gmail at 6am, a background sync that runs while the user is offline). Token exchange is right when the application needs to act on the user's behalf only while the user has an active session, and where the user's session credential can be exchanged for a downstream credential at low cost.

Stored OBO

The stored OBO shape uses the delegated-stored feature. The machinery has three moving parts: an OAuth flow that grants initial consent, a credential store that persists the refresh token, and a refresh path that mints fresh access tokens for calls.

The initial grant is an OAuth authorization code flow where the scopes include the downstream's access scope (https://mail.google.com/, channels:read, whatever the downstream's vocabulary is) and the flow includes offline_access (the OAuth scope that asks for a refresh token). The flow's success returns both an access token (usable immediately) and a refresh token (storable for later use).

The persistence runs through the DelegatedCredentialStore trait:

use async_trait::async_trait;
use chrono::{DateTime, Utc};

#[async_trait]
pub trait DelegatedCredentialStore: Send + Sync {
    async fn save_credential(
        &self,
        owner: &CredentialOwner,
        credential: StoredCredential,
    ) -> Result<(), StoreError>;

    async fn load_credential(
        &self,
        owner: &CredentialOwner,
        downstream: &str,
    ) -> Result<Option<StoredCredential>, StoreError>;

    async fn revoke_credential(
        &self,
        owner: &CredentialOwner,
        downstream: &str,
    ) -> Result<(), StoreError>;
}

pub struct StoredCredential {
    pub access_token: ZeroizedString,
    pub refresh_token: ZeroizedString,
    pub expires_at: DateTime<Utc>,
    pub scopes: Vec<String>,
    pub downstream: String,
}

The owner is typically the user, identified by UserId and TenantId. The downstream is named by a string ("google.com", "slack", "github"), letting one user have multiple stored credentials for different downstreams.

The encrypted variant is EncryptedDelegatedCredentialStore<S, K>, a decorator that wraps any store with AES-256-GCM at-rest encryption using a key the deployment provides. The trait surface is the same; the encryption happens transparently inside the decorator. Production deployments use the encrypted variant.

The refresh path runs on demand. When the application needs to call the downstream, it loads the stored credential, checks whether the access token is still valid, and either uses it directly or runs the refresh exchange to mint a fresh access token. The fresh token replaces the stored one if rotation is configured (most downstreams rotate the refresh token on each refresh, which is the same defence the session refresh-token mechanism uses; Refresh tokens and session continuity covers the family-based theft detection in detail).
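
The rotation rule can be sketched as a pure function over the stored credential and the refresh response. Under OAuth 2.0 the refresh response may or may not carry a new refresh token; when it does, the old one must be discarded. The types below are simplified stand-ins (plain strings, Unix seconds) rather than StoredCredential itself, and `apply_refresh` is a hypothetical name.

```rust
/// Simplified stand-in for the stored credential.
#[derive(Clone, Debug, PartialEq)]
pub struct Stored {
    pub access_token: String,
    pub refresh_token: String,
    pub expires_at_unix: u64,
}

/// The fields of a refresh response that matter here.
pub struct RefreshResponse {
    pub access_token: String,
    pub expires_in: u64,
    /// Present only when the downstream rotates refresh tokens.
    pub refresh_token: Option<String>,
}

/// Hypothetical helper: compute the credential to persist after a refresh.
pub fn apply_refresh(stored: &Stored, resp: RefreshResponse, now_unix: u64) -> Stored {
    Stored {
        access_token: resp.access_token,
        // Rotation: a new refresh token replaces the old one; otherwise keep it.
        refresh_token: resp
            .refresh_token
            .unwrap_or_else(|| stored.refresh_token.clone()),
        expires_at_unix: now_unix.saturating_add(resp.expires_in),
    }
}
```

Persisting the result before using the new access token matters when rotation is on: losing a rotated refresh token means the user has to grant consent again.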

Token exchange

The token exchange shape uses the delegated-exchange feature. The machinery is much smaller because there is no persistent storage: the exchange runs per call.

The exchange is an RFC 8693 token exchange. The application presents:

  • A subject token: the credential identifying the user. This might be the user's session ID, a JWT session token, or a workload identity token that names the user.
  • A subject token type: an identifier for the kind of subject token (urn:ietf:params:oauth:token-type:access_token, urn:ietf:params:oauth:token-type:jwt, an application-specific string).
  • The audience: the downstream service the token will be used against.
  • Optional: the scope of the requested token (defaults to "all scopes the user has").

The STS validates the subject token, determines the user's identity, applies whatever policy decisions the deployment has configured (Cedar policies that govern the exchange, the user's allowed downstreams), and returns an access token bound to the audience.

use axess::delegated::{ExchangeRequest, TokenExchanger};

let exchange = TokenExchanger::new(sts_config);

let downstream_token = exchange
    .exchange(ExchangeRequest {
        subject_token: session_credential,
        // Tells the STS how to interpret the subject token.
        subject_token_type: "urn:ietf:params:oauth:token-type:jwt".into(),
        audience: "https://api.downstream.example".into(),
        scopes: vec!["read:data".into()],
    })
    .await?;

let response = http_client
    .get("https://api.downstream.example/data")
    .header("Authorization", format!("Bearer {}", downstream_token.access_token))
    .send()
    .await?;

The exchange runs in the request path. The latency cost is one round-trip to the STS plus the actual downstream call. The exchanged token is short-lived (typically minutes), so the application either re-exchanges per call (the simple shape) or caches the exchanged token for the duration of its validity (the optimisation, which is worth the complexity only at high call rates).

Which to use

The decision tree is short.

If the application needs to act on the user's behalf while the user is offline (a background job, a scheduled report, a notification that runs hours after the user has gone home), use stored OBO. Token exchange does not work because the user's session does not exist when the call needs to happen.

If the application calls the downstream only while the user is actively signed in, and the downstream service supports token exchange (Azure AD, Google Cloud, most enterprise SaaS that supports RFC 8693), use token exchange. The credential never hits your database, so the breach impact is smaller.

If the application needs both shapes, both work side by side. The two crates compose without conflict; turn on both feature flags.

The most common shape in practice is hybrid: token exchange for the foreground synchronous calls (the user clicks "fetch latest data from Gmail"), stored OBO for the background asynchronous calls (the nightly sync that pulls all new mail since the last run). The two flows handle the two needs.

Both shapes need an audit trail. The user granted consent at a specific moment; that moment is what defends against later disputes ("the application made calls I did not authorise").

The stored OBO shape emits a DelegatedConsentGranted audit event at the initial OAuth flow and a DelegatedCredentialUsed event on each refresh. The first event records what the user agreed to (which scopes, which downstream); the second records each use (when, against which downstream, for which operation if the application surfaces that).

The token exchange shape emits a DelegatedTokenExchanged event on each exchange. The event records the subject token's source, the audience, the scopes, and the timestamp.

The audit retention for delegated events is typically longer than for ordinary authentication events because the events defend against future disputes that may surface months or years later. The retention configuration is in Audit pipeline.

Revocation

Both shapes need a revocation path. The user (or an administrator) decides the application should no longer act on their behalf; the next call should fail.

Stored OBO revocation runs through DelegatedCredentialStore::revoke_credential. The credential is removed from the store (or marked revoked, if the store retains for audit). Subsequent loads return None; the application's call path either treats this as "user has not granted consent" or as "consent was revoked, ask again."
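The difference between "never granted" and "revoked" can be sketched with an in-memory store that keeps a tombstone. This is an illustration only: SketchCredentialStore and Consent are hypothetical, and the real DelegatedCredentialStore::revoke_credential signature is not shown here.

```rust
use std::collections::HashMap;

/// What the application's call path learns about a user's consent.
#[derive(Debug, PartialEq)]
pub enum Consent<'a> {
    Granted(&'a str),
    NeverGranted,
    Revoked,
}

enum Entry {
    Active(String),
    Revoked,
}

/// Hypothetical in-memory store that retains a tombstone for audit.
#[derive(Default)]
pub struct SketchCredentialStore {
    entries: HashMap<String, Entry>,
}

impl SketchCredentialStore {
    pub fn store(&mut self, user_id: &str, credential: &str) {
        self.entries
            .insert(user_id.to_owned(), Entry::Active(credential.to_owned()));
    }

    /// Marks the credential revoked instead of deleting the row.
    pub fn revoke_credential(&mut self, user_id: &str) {
        if let Some(entry) = self.entries.get_mut(user_id) {
            *entry = Entry::Revoked;
        }
    }

    /// After revocation, loads return None.
    pub fn load(&self, user_id: &str) -> Option<&str> {
        match self.entries.get(user_id) {
            Some(Entry::Active(c)) => Some(c.as_str()),
            _ => None,
        }
    }

    /// Lets the caller tell "ask for consent" apart from "ask again".
    pub fn consent(&self, user_id: &str) -> Consent<'_> {
        match self.entries.get(user_id) {
            Some(Entry::Active(c)) => Consent::Granted(c.as_str()),
            Some(Entry::Revoked) => Consent::Revoked,
            None => Consent::NeverGranted,
        }
    }
}
```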

Token exchange revocation runs through the user's session revocation. Logging the user out invalidates the session credential, which means subsequent exchanges fail; in-flight calls that have already exchanged the token continue until the exchanged token expires (typically minutes). The granularity is coarser than stored OBO but the operational simplicity is the trade-off.

Either shape benefits from the downstream's own revocation mechanism. Most OAuth providers support RFC 7009 token revocation; calling it on logout invalidates the access and refresh tokens at the provider, so a stolen copy of the credential stops working. Stored OBO combined with downstream revocation gives the strongest revocation guarantee of the options described here.

Threat model

The threat surface for OBO is unusual. The application acts as the user, which means a compromise of the application is a compromise of the user's downstream account. The defences:

The first is to minimise the scope of the OAuth grant. Request the narrowest scopes the application needs (channels:read not channels:*, the specific calendar not "all calendars"). The attacker who compromises the application can act only within the granted scopes.

The second is to encrypt the stored credentials at rest. The EncryptedDelegatedCredentialStore decorator covers this. An attacker who breaches the database without the encryption key cannot use the stored credentials.

The third is to monitor the audit events. A spike in DelegatedCredentialUsed events for a user, especially for operations the user does not typically perform, is a strong signal of compromise. The SIEM rules in Audit events name the patterns.

The fourth is to time-bound consent. Some downstreams support explicit consent expiry; for those that do not, the application can require the user to re-consent on a schedule (every ninety days, every year). The friction is real; the defence against long-lived stale grants is also real.
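The scheduled re-consent check is plain date arithmetic. A minimal sketch, assuming the application records when consent was granted; needs_reconsent is a hypothetical helper, not an axess function.

```rust
use std::time::{Duration, SystemTime};

/// Returns true when the grant is at least `max_age` old and the user should
/// be asked to consent again.
pub fn needs_reconsent(granted_at: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    match now.duration_since(granted_at) {
        Ok(age) => age >= max_age,
        // The grant is dated in the future (clock skew or bad data):
        // fail closed and ask again.
        Err(_) => true,
    }
}
```

With a ninety-day policy, a grant that is eighty-nine days old passes and one that is ninety days old triggers the prompt.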

What this enables

OBO is what lets axess fit into the kind of application that does more than authenticate users for itself: a unified inbox that pulls from Gmail and Outlook, a CI pipeline that posts to Slack on the user's behalf, a calendar integration that books meetings. The mechanism is opt-in (the feature flag), the two shapes cover the architectural choices, and the encryption-at-rest plus the audit trail let the deployment defend its decisions.

Further reading

Refresh tokens and session continuity covers the refresh-token family-detection mechanism that also applies to stored OBO credentials. OAuth 2.0 and OIDC covers the OAuth flow that grants the initial consent. Workload identity overview covers the subject-token side of token exchange when the subject is a workload rather than a human. Audit pipeline covers the retention configuration for Delegated* audit events.

Local IdP

axess::local_idp is an in-process workload-identity issuer. It mints JWTs against a signing key it holds locally, exposes the matching JWKS, and serves the RFC 8414 discovery document. The crate exposes this surface in two layers, both built on the same primitives:

  • Production LocalIdp. Adopter wires a [LocalIdpKeyStore] implementation (file system, Vault, KMS, ...) and the [LocalIdp] reads the current + historical keys, mints, and rotates atomically on operator request.

  • Testing LocalIdpFixture. In-process value that mints JWTs with a generated keypair and exposes a JwkSet handle that a [JwtVerifier] can read. No HTTP endpoints, no key store; just mint() + jwks_handle().

Both layers share [MintClaims], [LocalIdpSigningKey], and the issuance pipeline that lives in axess::local_idp::primitives. A token minted by either layer verifies against the same JWKS shape, which is the property that lets adopters run the same downstream verifier in tests and in production.

What both layers do NOT do

Neither layer is a full OAuth 2.0 Authorization Server. There is:

  • no authorization-code flow, no PKCE handshake;
  • no end-session endpoint;
  • no refresh-token rotation;
  • no consent UX;
  • no user store.

Use a real Authorization Server (Keycloak, Ory Hydra, Okta, Auth0, Azure AD, etc.) when you need any of those. LocalIdp exists for direct workload-identity issuance: a process mints short-lived JWTs for service-to-service flows it controls.

The feature flag is local-idp (off by default), enabled with features = ["local-idp"] on the axess facade. It pulls in oauth, oidc, and jwt as transitive features.


Production: LocalIdp

When to use

  • A service needs to mint workload-identity JWTs for its own internal flows (e.g. signing tokens that downstream services will verify via the published JWKS).

  • A development or staging deployment needs a self-contained IdP without standing up Keycloak. The same code path runs in production; only the [LocalIdpKeyStore] backend changes.

  • An air-gapped or single-tenant deployment wants on-host token issuance with no external dependency.

When not to use

If you need a user-facing IdP with login UI, OIDC authorization code flow, refresh tokens, or federation, reach for Keycloak / Ory Hydra / similar. LocalIdp deliberately stops at issuance.

The LocalIdpKeyStore trait

Adopters implement persistence against their own key material:

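pub trait LocalIdpKeyStore: Send + Sync + 'static {
    type Error: std::error::Error + Send + Sync + 'static;

    /// Returns the current key and every historical key from one consistent read.
    async fn load_all(&self) -> Result<LoadedKeys, Self::Error>;

    /// Persists `new_current` and demotes the previous current key to
    /// historical, as a single atomic step.
    async fn rotate(&self, new_current: LocalIdpSigningKey)
        -> Result<(), Self::Error>;
}

pub struct LoadedKeys {
    /// Key that signs newly minted tokens.
    pub current: LocalIdpSigningKey,
    /// Rotated-out keys that stay published so in-flight tokens still verify.
    pub historical: Vec<LocalIdpSigningKey>,
}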

load_all returns current + historical keys from a single consistent read. The JWKS published at /jwks.json includes all of them so tokens already in flight under a rotated-out historical key continue to verify until the operator removes that key from the store.

rotate persists a new current key, demoting the previous current to historical, atomically. Adopters typically expose this through their own admin endpoint or out-of-band tooling.
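The demote-on-rotate semantics can be sketched with simplified stand-ins. Key and Keys below are hypothetical and synchronous; the real trait is async and persists to a backend, and a real store must also make the persistence step atomic, which an in-memory swap gets for free.

```rust
/// Stand-in for LocalIdpSigningKey; only the key id matters for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Key {
    pub kid: String,
}

/// Stand-in for the loaded key set: one current key plus historical keys.
pub struct Keys {
    pub current: Key,
    pub historical: Vec<Key>,
}

impl Keys {
    /// Installs `new_current` and demotes the previous current key to historical.
    pub fn rotate(&mut self, new_current: Key) {
        let previous = std::mem::replace(&mut self.current, new_current);
        self.historical.push(previous);
    }

    /// Key ids that would appear in the published JWKS: current first, then historical.
    pub fn published_kids(&self) -> Vec<&str> {
        std::iter::once(&self.current)
            .chain(self.historical.iter())
            .map(|k| k.kid.as_str())
            .collect()
    }
}
```

After rotating from v1 to v2, published_kids() yields both ids, which is why a token signed under v1 still finds its key.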

MemoryLocalIdpKeyStore for prototyping

A MemoryLocalIdpKeyStore ships with the crate for dev and test deployments where keys can live in process memory:

use axess::local_idp::{LocalIdp, LocalIdpSigningKey, MemoryLocalIdpKeyStore};

let key = LocalIdpSigningKey::generate_es256().with_key_id("v1");
let store = MemoryLocalIdpKeyStore::with_current(key);
let idp = LocalIdp::from_key_store("https://idp.example.com", store)
    .await
    .expect("load keys");

Memory storage is not for production: a restart loses the keys, and every restart publishes a fresh JWKS that breaks tokens already in flight. The examples/local_idp/ directory implements a file-backed [LocalIdpKeyStore] with atomic rotation that a production implementation should follow; the same shape adapts to Vault, AWS KMS, GCP KMS, or any other key management backend.

Minting

use axess::local_idp::MintClaims;
use chrono::{Duration, Utc};

let token = idp
    .mint(
        &MintClaims::new("worker-1", Utc::now() + Duration::minutes(5))
            .with_audience("https://api.example.com")
            .with_issued_at(Utc::now()),
    )
    .await?;

[MintClaims] is a builder: new(subject, exp) is the minimum; with_audience, with_audiences (multi-aud), with_issued_at, with_not_before, with_jwt_id, and with_custom_claim cover the standard JWT fields. mint_with_header accepts a caller-supplied jsonwebtoken::Header for cases that need custom header fields (typ, cty, etc.).

The clock is injectable via .with_clock(...). Production wires SystemClock; deterministic simulation tests (DST) wire MockClock for reproducible issuance.
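A minimal sketch of what clock injection buys, using only the standard library. The trait shape here is an assumption: the names Clock, SystemClock, and MockClock come from the text, but their real signatures are not shown in this chapter, and issue_window is a hypothetical helper.

```rust
use std::cell::Cell;
use std::time::{Duration, SystemTime};

/// Assumed trait shape: a single method returning the current time.
pub trait Clock {
    fn now(&self) -> SystemTime;
}

/// Production clock: delegates to the operating system.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now(&self) -> SystemTime {
        SystemTime::now()
    }
}

/// Test clock: returns a controlled time that only moves when told to.
pub struct MockClock {
    now: Cell<SystemTime>,
}

impl MockClock {
    pub fn at(start: SystemTime) -> Self {
        Self { now: Cell::new(start) }
    }

    pub fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Clock for MockClock {
    fn now(&self) -> SystemTime {
        self.now.get()
    }
}

/// Hypothetical issuance helper: derives (iat, exp) from the injected clock.
pub fn issue_window(clock: &dyn Clock, ttl: Duration) -> (SystemTime, SystemTime) {
    let iat = clock.now();
    (iat, iat + ttl)
}
```

With MockClock, a test can assert exact iat and exp values and then advance past expiry without sleeping.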

Rotation

let new_key = LocalIdpSigningKey::generate_es256().with_key_id("v2");
idp.rotate_signing_key(new_key).await?;

The call atomically:

  1. Persists the new current via [LocalIdpKeyStore::rotate].
  2. Demotes the previous current to historical.
  3. Rebuilds the JWKS snapshot so subsequent /jwks.json reads include both keys.

In-flight verifications using the old kid continue to succeed because the historical entry stays in the published JWKS.

Discovery + JWKS endpoints

LocalIdp::router() returns a ready-to-mount Axum router that serves the two standard endpoints:

let app = axum::Router::new()
    // merge mounts the IdP routes at the root path.
    .merge(idp.router())
    .route("/issue", axum::routing::post(issue));

Routes:

  • GET /.well-known/openid-configuration: discovery metadata in the RFC 8414 format, served at the OpenID Connect Discovery path.
  • GET /jwks.json: current + historical public JWKs.

with_base_url(...) overrides the URL the discovery document advertises for jwks_uri when the IdP sits behind a reverse proxy. with_metadata_field(name, value) appends adopter-extension fields to the discovery document (scopes_supported, claims_supported, FAPI fields, etc.).

For full control, the lower-level handlers in axess::local_idp::discovery expose openid_configuration and jwks as standalone axum handlers.

Production-pattern example

The examples/local_idp/ crate is the reference implementation:

  • File-backed LocalIdpKeyStore (FileLocalIdpKeyStore) with the directory layout pattern historical/{kid}.pem + atomic current.kid pointer file.
  • POST /admin/rotate operator endpoint.
  • POST /issue mint endpoint.
  • A curl walkthrough of the full discover-mint-rotate cycle.
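One common way to get an atomic pointer file is write-then-rename. The sketch below assumes that technique; it is not taken from the example crate, and set_current_kid and read_current_kid are hypothetical helpers. A production version would also fsync the file and its directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Points `current.kid` at `kid` by writing a temporary file and renaming it
/// over the pointer. A rename within one directory replaces the target in a
/// single step, so readers see either the old kid or the new one.
pub fn set_current_kid(dir: &Path, kid: &str) -> io::Result<()> {
    let tmp = dir.join("current.kid.tmp");
    fs::write(&tmp, kid)?;
    fs::rename(&tmp, dir.join("current.kid"))
}

/// Reads the kid the pointer file currently names.
pub fn read_current_kid(dir: &Path) -> io::Result<String> {
    Ok(fs::read_to_string(dir.join("current.kid"))?.trim().to_owned())
}
```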

Testing: LocalIdpFixture

When to use

Integration tests that exercise:

  • The inbound JWT-SVID resolver (axess::authn::jwt::svid::JwtSvidResolver).
  • The OAuth Resource Server resolver path.
  • Any of the cloud STS adapters.
  • The JwtVerifier shape generally.

The fixture mints tokens that verify against its own JWKS, so a test can produce a token with mint() and pass it to the resolver under test without involving an external IdP.

What it is NOT

The fixture is not an HTTP service. It is a value with mint(), jwks_handle(), and a handful of accessors. Tests use it by:

  1. Constructing the fixture.
  2. Calling idp.mint(&MintClaims::...) to obtain a JWT.
  3. Wiring a JwtVerifier to idp.jwks_handle() so verification reads the same JWKS the fixture signed against.

There is no authorize endpoint, no token endpoint, no Tower service wrapping; the fixture just produces signed tokens and exposes the verification key set.

The feature flag is testing plus local-idp. The fixture lives under axess::testing::local_idp::LocalIdpFixture.

Construction

use axess::testing::local_idp::LocalIdpFixture;

let idp = LocalIdpFixture::new("https://test.idp.local");

new(issuer) generates a fresh RSA-2048 keypair per call. Other constructors:

  • LocalIdpFixture::with_algorithm(issuer, Algorithm::ES256): generate with a specific signing algorithm. Supported: RS256, RS384, RS512, ES256.
  • LocalIdpFixture::with_signing_key(issuer, key): explicit key (use when the test needs a stable signature across runs).

Builder methods (chained on the constructed fixture):

  • .with_historical_signing_key(key): add a key to the JWKS without rotating to it. Drives JWKS-cache-refresh tests.
  • .with_extra_public_jwk(jwk): add an externally-supplied public JWK to the published set.
  • .rotate_signing_key(new_key): swap the signing key; the old key moves to historical and remains in the JWKS.
  • .with_max_ttl(duration): cap minted token lifetime. Over-cap mints panic (test-time misuse).
  • .with_issuance_listener(arc): install an [IssuanceListener] for assertion-side recording.
  • .with_key_id(kid): override the auto-generated kid.

Minting

use axess::testing::local_idp::{LocalIdpFixture, MintClaims};
use chrono::{Duration, Utc};

let idp = LocalIdpFixture::new("https://test.idp.local");

// Standard JWT.
let token = idp.mint(
    &MintClaims::new("alice", Utc::now() + Duration::hours(1))
        .with_audience("https://api.example.com"),
);

// SPIFFE JWT-SVID shape (subject = SPIFFE ID, audience required).
let svid = idp.mint_jwt_svid(
    "test.gnomes",                  // trust domain
    "worker",                       // workload path
    "acme",                         // namespace (optional positional)
    "sts.amazonaws.com",            // audience
    Duration::minutes(5),
);

mint_with_header accepts a caller-supplied header for cases that need custom fields.

Sharing the JWKS with JwtVerifier

use axess::authn::jwt::verifier::JwtVerifier;

let verifier = JwtVerifier::new(idp.jwks_handle())
    .with_algorithms(idp.verifier_algorithms());

let claims = verifier
    .verify::<MyClaims>(&token, "https://api.example.com")
    .await?;

jwks_handle() returns an Arc<RwLock<JwkSet>> that the verifier shares with the fixture. Calls to rotate_signing_key on the fixture update the shared JWKS in place, so the verifier sees the rotation without any explicit refresh.
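The shared-handle behaviour can be sketched with standard-library types. KeySet, Issuer, and Verifier are hypothetical stand-ins for JwkSet, the fixture, and JwtVerifier; only the Arc<RwLock<...>> sharing pattern is the point.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for JwkSet: just the key ids.
#[derive(Default)]
pub struct KeySet {
    pub kids: Vec<String>,
}

/// Stand-in for the fixture: owns the key set and hands out shared handles.
pub struct Issuer {
    jwks: Arc<RwLock<KeySet>>,
}

impl Issuer {
    pub fn new(kid: &str) -> Self {
        let set = KeySet { kids: vec![kid.to_owned()] };
        Self { jwks: Arc::new(RwLock::new(set)) }
    }

    pub fn jwks_handle(&self) -> Arc<RwLock<KeySet>> {
        Arc::clone(&self.jwks)
    }

    /// Adds the new key in place; the old key stays in the set.
    pub fn rotate(&self, new_kid: &str) {
        self.jwks.write().unwrap().kids.insert(0, new_kid.to_owned());
    }
}

/// Stand-in for the verifier: reads the same key set through the handle.
pub struct Verifier {
    jwks: Arc<RwLock<KeySet>>,
}

impl Verifier {
    pub fn new(jwks: Arc<RwLock<KeySet>>) -> Self {
        Self { jwks }
    }

    pub fn knows(&self, kid: &str) -> bool {
        self.jwks.read().unwrap().kids.iter().any(|k| k == kid)
    }
}
```

Because both sides hold the same allocation, a rotation on the issuer is visible to the verifier on its next read.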

Feeding a cloud STS adapter

The fixture's mint_jwt_svid produces SPIFFE-shaped tokens suitable for cloud STS exchange tests:

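use axess::testing::local_idp::LocalIdpFixture;
use axess::workload::outbound::cloud_sts::aws::AwsStsClient;
use chrono::Duration;

let idp = LocalIdpFixture::new("https://oidc.test.local");
let token = idp.mint_jwt_svid(
    "test.gnomes", "worker", "acme",   // trust domain, workload path, namespace
    "sts.amazonaws.com",               // audience
    Duration::minutes(5),              // token lifetime
);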

// Hand the token to a mocked AWS STS endpoint to exercise the
// AssumeRoleWithWebIdentity flow without hitting real AWS.

Why both shapes coexist

Production LocalIdp and the test LocalIdpFixture share the same primitives module (axess::local_idp::primitives). The primitives define LocalIdpSigningKey, MintClaims, IssuanceEvent, IssuanceListener, and the internal JWT-encode pipeline. Both layers route their mint() calls through these primitives.

The consequence: a token minted by the fixture in a test verifies identically against a JwtVerifier configured with production LocalIdp's published JWKS, given the same signing key. Tests that pin a specific JWT signature exercise the same code paths that sign in production.

The split exists for what each layer adds on top:

  • Production carries the [LocalIdpKeyStore] abstraction so keys survive process restarts and can rotate without code changes.
  • Testing carries the in-memory key generation, the MockIssuanceListener, and ergonomic builders that match what test code typically wants to assert.

Neither subsumes the other; the production type is not the right fit for a unit test (no key store means no mint), and the fixture is not the right fit for production (in-memory keys are lost on restart). The shared primitives are what lets both shapes claim "this is the same JWT issuer" without code duplication.

Audit events

Every authentication and authorisation decision axess makes produces an audit event. The events are typed (AuthEvent, with one variant per kind of decision), they carry every field a compliance review needs, and they are dispatched to sinks asynchronously so the authentication hot path does not block on the audit dispatch. This chapter covers the event catalogue, the fields each variant carries, the SOC alert thresholds that map events to operational signals, and the SIEM query patterns the events are designed to feed.

The chapter pairs with Audit pipeline, which covers how the events get from the application into the regulatory store and the analytics path.

What the events are for

Authentication is a security-sensitive operation, and security-sensitive operations need a defensible audit trail. Three audiences read the trail.

The first is the compliance auditor. A regulator (or an external auditor verifying compliance with a regulator's requirements) needs to verify that the application enforced the controls the regulation requires: that MFA was demanded where MFA was required, that lockouts fired when configured, that no cross-tenant access happened. The audit trail is what answers these questions.

The second is the incident responder. When something goes wrong (a user reports unauthorised access, a SIEM rule fires on an anomalous pattern, a breach is suspected), the responder needs to reconstruct what happened: which sessions were active, what authentications succeeded, what authorisations were granted. The audit trail is what supports the reconstruction.

The third is the operational dashboard. The application's running state is visible through the audit trail: how many logins succeed per hour, what fraction trigger lockouts, which tenants are active. The trail feeds the SIEM rules and the operational metrics.

The three audiences want different things from the same data, which is what drives the dual-stream design: a regulatory stream optimised for completeness and immutability, an analytics stream optimised for query latency and aggregation. Audit pipeline covers the streams; this chapter covers the events themselves.

The event catalogue

AuthEvent is the enum. The variants are stable across axess versions; new variants can be added (variants are appended, not renumbered), but existing variants do not change shape.
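For consumers, append-only variants mean a match should carry a fallback arm. The sketch below uses a hypothetical three-variant enum; whether the real AuthEvent is marked #[non_exhaustive] is an assumption, and the attribute only takes effect across crate boundaries.

```rust
/// Hypothetical cut-down event enum, not the real AuthEvent.
#[non_exhaustive]
#[derive(Debug)]
pub enum AuthEventSketch {
    LoginStarted { user_id: String },
    LoginCompleted { user_id: String, session_id: String },
    Logout { session_id: String },
}

/// Buckets events for a dashboard. The wildcard arm keeps this function
/// compiling and behaving sensibly when a later version appends variants.
pub fn dashboard_bucket(event: &AuthEventSketch) -> &'static str {
    match event {
        AuthEventSketch::LoginStarted { .. } | AuthEventSketch::LoginCompleted { .. } => "login",
        _ => "other",
    }
}
```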

The catalogue is grouped by the operation that produces each event. The grouping below is for readability; the wire format is flat.

Authentication lifecycle

LoginStarted records the beginning of a login attempt. Fields: user_id, tenant_id, method_name, client_ip, user_agent, device_id (optional), timestamp.

FactorVerified records each successful factor verification. Fields: user_id, tenant_id, factor_kind, attempt_index, timestamp.

FactorFailed records each failed factor verification. Fields: user_id, tenant_id, factor_kind, attempt_index, failure_reason (string), timestamp.

LoginCompleted records a successful end-to-end login. Fields: user_id, tenant_id, method_name, factors_completed, session_id, device_id (optional), timestamp.

LoginFailed records a failed end-to-end login (any factor failed beyond retry, or a configuration error). Fields: user_id (optional, may be unknown), tenant_id, method_name (optional), failure_reason, timestamp.

Logout records a session ending. Fields: user_id, tenant_id, session_id, reason (user-initiated, admin-revoked, session-expired, cookie-fingerprint-mismatch), timestamp.

Lockout

LockoutTriggered records a per-user, per-tenant, or per-IP lockout firing. Fields: scope (User, Tenant, IP), scope_value, until (optional unlock time), triggered_by_event (the preceding FactorFailed id), timestamp.

LockoutCleared records the lockout ending (either timing out or being administratively cleared). Fields: scope, scope_value, cleared_by (Timeout, Admin), timestamp.

Device identity

The six device events fire on the device-identity lifecycle covered in Device identity.

DeviceFirstSeen records a previously-unknown device fingerprint appearing. Fields: device_id (newly minted), user_id, tenant_id, fingerprint_features (the redacted feature set), timestamp.

DeviceTrustGranted records a device transitioning to Trusted. Fields: device_id, user_id, tenant_id, granted_by (User, Admin), timestamp.

DeviceRevoked records a device transitioning to Revoked. Fields: device_id, user_id, tenant_id, revoked_by (User, Admin, System), reason, timestamp.

DevicePurged records a device record being deleted (retention sweep). Fields: device_id, user_id, tenant_id, purged_at_age_days, timestamp.

DeviceBindingAdded records a new refresh-token binding to the device. Fields: device_id, user_id, tenant_id, refresh_token_id, timestamp.

DeviceFingerprintMismatch records a session presenting a fingerprint that does not match its bound device. Fields: device_id, user_id, tenant_id, session_id, policy_action (Warn, Reauth, Revoke), timestamp.

Authorisation

AuthzAllow records a successful Cedar evaluation. Fields: principal_uid, tenant_id, action, resource_uid, matched_policy_ids, timestamp.

AuthzDeny records a denied evaluation. Fields: principal_uid, tenant_id, action, resource_uid, matched_policy_ids (the policies that produced the deny, if any), denial_reason, timestamp.

AuthzEntityNotFound records an evaluation failure where the entity provider could not produce an entity the policies needed. Fields: principal_uid, missing_entity_uid, policy_id_referencing, timestamp.

Workload identity

The workload events fire on the workload-identity resolvers' decisions.

WorkloadAuthenticated records a successful workload authentication. Fields: workload_id, trust_domain, issuer, tenant_id, client_ip, timestamp.

WorkloadRejected records a failed workload authentication. Fields: attempted_workload_id (optional), attempted_trust_domain, failure_reason, client_ip, timestamp.

Tenant lifecycle

TenantCreated, TenantSuspended, TenantUnsuspended, and TenantDeleted record the tenant lifecycle from Multi-tenancy. Fields are uniform: tenant_id, operator_principal, reason, timestamp.

Delegated access

DelegatedConsentGranted records a user granting an application OBO access (the moment of OAuth consent). Fields: user_id, tenant_id, downstream, scopes, expires_at (optional), timestamp.

DelegatedCredentialUsed records a refresh of a stored OBO credential. Fields: user_id, tenant_id, downstream, timestamp.

DelegatedConsentRevoked records a user (or admin) revoking consent. Fields: user_id, tenant_id, downstream, revoked_by, timestamp.

DelegatedTokenExchanged records an RFC 8693 token exchange. Fields: subject_principal_uid, tenant_id, audience, scopes, timestamp.

Administrative

UserSuspended, UserUnsuspended, UserDeleted, PasswordReset (by admin), FactorReset (by admin), SessionInvalidated (by admin) cover the administrative operations. Fields name the target user, the operator principal, the reason, and the timestamp.

Emit cadence and fields

The events fire synchronously from the operation that produces them (the factor verification, the policy evaluation, the administrative action). The synchronous emit guarantees that every operation that completes, whether it succeeds or fails, has an event, and that no event exists without a corresponding operation.

The events are then handed to the audit pipeline (a sink trait the application configures), which dispatches them asynchronously. The pipeline's failure modes do not block the operation that produced the event; a sink that is slow or unavailable does not slow the application's hot path. The trade-off is that an event the sink fails to receive is lost, which is why the pipeline has a durable buffer in front of network-bound sinks. Audit pipeline covers the pipeline's reliability story.
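The hand-off between the synchronous emit and the asynchronous dispatch can be sketched with a channel and a background thread. This is an illustration using the standard library, not the axess pipeline: Pipeline is hypothetical, events are plain strings, and the in-memory channel has none of the durability the text describes.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical pipeline: the hot path enqueues, a worker thread delivers.
pub struct Pipeline {
    tx: Sender<String>,
}

impl Pipeline {
    /// Spawns the background dispatcher that forwards events to `sink`.
    pub fn start(sink: Arc<Mutex<Vec<String>>>) -> (Self, JoinHandle<()>) {
        let (tx, rx) = channel::<String>();
        let worker = thread::spawn(move || {
            // The loop ends when every sender has been dropped.
            for event in rx {
                sink.lock().unwrap().push(event);
            }
        });
        (Self { tx }, worker)
    }

    /// Called on the hot path: enqueues and returns without waiting for the sink.
    pub fn emit(&self, event: &str) {
        // A send error means the worker is gone; this sketch drops the event.
        let _ = self.tx.send(event.to_owned());
    }
}
```

The dropped-event branch in emit is exactly the loss the durable buffer exists to prevent.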

Every event carries: a stable wire shape (so external systems can parse against a single schema), a per-event id (for deduplication and cross-referencing), a tenant id (so multi-tenant deployments can route events to per-tenant sinks), and a timestamp (in UTC, generated through the injected Clock trait so that deterministic simulation tests can control it).

The fields specific to each variant are listed in the catalogue above. The complete schema lives in the axess-events crate documentation.

SOC alert thresholds

The events are designed to feed SOC (Security Operations Center) alerting. The thresholds below are starting points; tune to the specific deployment.

FactorFailed events from a single source IP at a rate above one per second indicate brute-forcing. The per-IP lockout (covered in Multi-tenancy §"Three-lever lockout") catches the worst cases at the application layer; the SIEM alert covers the rate even when individual events fall below the lockout threshold.
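The rate rule can be sketched as a sliding window per source IP. FailureRateMonitor is a hypothetical helper for illustration; a SIEM would normally express the same rule as a query. A ten-second window with a threshold of ten approximates "above one per second" while tolerating short bursts.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical sliding-window counter of FactorFailed events per source IP.
pub struct FailureRateMonitor {
    window: Duration,
    threshold: usize,
    recent: HashMap<String, VecDeque<Instant>>,
}

impl FailureRateMonitor {
    pub fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, recent: HashMap::new() }
    }

    /// Records one failure and returns true when the IP has more than
    /// `threshold` failures inside the trailing `window`.
    pub fn record(&mut self, client_ip: &str, at: Instant) -> bool {
        let hits = self.recent.entry(client_ip.to_owned()).or_default();
        hits.push_back(at);
        // Drop failures that have aged out of the window.
        while let Some(&oldest) = hits.front() {
            if at.duration_since(oldest) > self.window {
                hits.pop_front();
            } else {
                break;
            }
        }
        hits.len() > self.threshold
    }
}
```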

LockoutTriggered events at any rate are worth reviewing. A legitimate user occasionally mistypes their password and triggers a lockout; the rate should be a handful per day across a deployment. A spike indicates either a credential-stuffing attack or a configuration problem.

DeviceFingerprintMismatch events fire on every mismatch beyond the configured tolerance. Above a few per hour, either the tolerance is too tight (a legitimate-traffic issue, calibrate the tolerance) or there is a real attack (a stolen cookie being replayed). Investigate.

AuthzDeny events fire on every policy deny. The rate should correlate with legitimate user error (trying to access something they cannot); a spike correlates with either a policy misconfiguration (the rule is denying things it should permit) or with an attempt to find privilege-escalation holes.

WorkloadRejected events fire on every failed workload authentication. A clean deployment should see almost none of these; even a small rate indicates either a configuration problem or an attempted attack against the workload-identity surface.

DelegatedConsentGranted events fire on every user consent moment. The rate is informational; the events are most useful at the per-user level for "show me everything this user has granted the application permission to do."

SIEM query patterns

The events are designed for cheap aggregation in a SIEM. A sketch of useful queries:

-- Brute-force detection: top failing source IPs per minute.
SELECT
    client_ip,
    DATE_TRUNC('minute', timestamp) AS minute,
    COUNT(*) AS failures
FROM auth_events
WHERE event_type = 'FactorFailed'
  AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY 1, 2
ORDER BY failures DESC
LIMIT 20;

-- Lockout activity: users locked out in the last day.
SELECT
    scope_value AS user_id,
    COUNT(*) AS lockouts,
    MAX(timestamp) AS last_lockout
FROM auth_events
WHERE event_type = 'LockoutTriggered'
  AND scope = 'User'
  AND timestamp > NOW() - INTERVAL '1 day'
GROUP BY scope_value
ORDER BY lockouts DESC;

-- Step-up gaps: users hitting AuthzDeny on actions requiring stronger factors.
SELECT
    principal_uid,
    action,
    COUNT(*) AS denials
FROM auth_events
WHERE event_type = 'AuthzDeny'
  AND denial_reason LIKE '%factors_completed%'
  AND timestamp > NOW() - INTERVAL '1 day'
GROUP BY 1, 2
ORDER BY denials DESC;

-- Suspicious devices: fingerprint mismatches followed by trust grant.
-- The second alias is "granted" because GRANT is a reserved word in SQL.
SELECT
    mismatch.user_id,
    mismatch.device_id,
    mismatch.timestamp AS mismatch_at,
    granted.timestamp AS granted_at
FROM auth_events mismatch
JOIN auth_events granted
    ON mismatch.device_id = granted.device_id
WHERE mismatch.event_type = 'DeviceFingerprintMismatch'
  AND granted.event_type = 'DeviceTrustGranted'
  AND granted.timestamp > mismatch.timestamp
  AND granted.timestamp < mismatch.timestamp + INTERVAL '1 hour';

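The queries are written in PostgreSQL-flavoured SQL and carry over to SQL-shaped stores such as ClickHouse with minor dialect changes. SIEMs with their own query languages (Splunk SPL, Sumo Logic) need the same logic translated. Adapt to the deployment's chosen tool.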

Extending the catalogue

Applications occasionally need to record events that axess does not emit: a domain-specific operation that should appear in the same trail (a fund transfer, a configuration change, a sensitive read). The pattern is to extend the audit pipeline with a custom event type that rides the same machinery.

The mechanism is the AuthEventEnvelope wrapper: any Serialize + Deserialize type can be carried alongside the built-in AuthEvent variants as long as it implements AuditPayload. The application's domain events go into the envelope; the audit pipeline routes them to the same sinks as the axess events.
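The envelope idea can be sketched as an enum that carries either a built-in event or an application payload. Everything below is hypothetical: the real AuthEventEnvelope and AuditPayload definitions are not shown in this chapter, serialisation is left out, and FundTransfer is an example domain event.

```rust
/// Assumed trait shape: a payload names its own event type for routing.
pub trait AuditPayload {
    fn event_type(&self) -> &'static str;
}

/// Stand-in for the built-in AuthEvent variants.
pub enum BuiltIn {
    LoginCompleted { user_id: String },
}

/// Stand-in for the envelope: built-in and custom events share one type,
/// so they can travel through the same sinks.
pub enum Envelope<P: AuditPayload> {
    BuiltIn(BuiltIn),
    Custom(P),
}

impl<P: AuditPayload> Envelope<P> {
    pub fn event_type(&self) -> &'static str {
        match self {
            Envelope::BuiltIn(BuiltIn::LoginCompleted { .. }) => "LoginCompleted",
            Envelope::Custom(payload) => payload.event_type(),
        }
    }
}

/// Example domain event supplied by the application.
pub struct FundTransfer {
    pub amount_cents: u64,
}

impl AuditPayload for FundTransfer {
    fn event_type(&self) -> &'static str {
        "FundTransfer"
    }
}
```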

The trade-off is schema. A custom event type that the SIEM does not know about does not feed the dashboards or alerts. The typical pattern is to coordinate the schema with the SIEM team before adding new event types, so the dashboards and alerts can update in parallel.

What this enables

The audit catalogue is what makes axess defensible against regulatory and incident-response scrutiny. The events are typed, complete, dispatched asynchronously, and built for the SIEM tooling that production deployments already run. The catalogue itself is the reference; the Audit pipeline chapter covers how the events flow from the application to the storage layer.

Further reading

Audit pipeline covers the dual-stream architecture (regulatory plus analytics), the hot/cold retention tiering, and the reliability story for the asynchronous dispatch. Multi-tenancy covers the tenant-scoped routing of events. Cedar policy fundamentals covers the policy evaluator that produces the Authz* events. Security posture covers the GDPR and PCI-DSS posture for audit-event PII.

Audit pipeline

The audit pipeline is what moves events from the authentication hot path to the storage layers that compliance, incident response, and operations consume. The pipeline has two streams (regulatory and analytics), three retention tiers (hot, archived, deleted), and a small number of trait surfaces that adopters implement against their own storage. This chapter covers the architecture, the configuration, and the operational patterns that make the pipeline trustworthy under load.

The chapter pairs with Audit events, which catalogues what flows through the pipeline; this chapter covers how the flow itself works.

The dual stream

Two audiences read the audit trail. Compliance auditors want completeness, immutability, and unambiguous provenance; they will accept slow queries and rigid schemas in exchange. SOC and operations teams want low query latency, flexible aggregation, and enrichment with operational context (geo lookups, ASN data, parsed user-agent strings); they accept some loss of fidelity and some divergence from the wire format in exchange.

The two requirements conflict. A single store optimised for one audience disserves the other. The pipeline's answer is to fan out: the same event flows into two streams, each shaped for its audience.

The regulatory stream uses AuthEvent directly. The shape is exactly what the catalogue in Audit events describes: stable fields, no enrichment, byte-for-byte uniform across deployments. The stream feeds the regulatory store, which is typically a database or a log archive with strong durability and immutability guarantees.

The analytics stream uses RichAuthnEvent, a denormalised wrapper that adds optional enrichment fields (device trust level, geo lookup, parsed user-agent, ASN, configurable tags). The fields are populated by an EventEnrichment closure the application provides; the closure runs once per event, populates whatever data the deployment wants, and returns the enriched event. The stream feeds the analytics store, which is typically a columnar database (ClickHouse, DuckDB) or a streaming platform (Apache Iggy with rkyv).
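The enrichment step can be sketched as a closure applied once per event. The types below are hypothetical simplifications: Event stands in for AuthEvent, RichEvent for RichAuthnEvent, and the closure for EventEnrichment; the real field names may differ.

```rust
/// Stand-in for the regulatory wire event.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub event_type: String,
    pub client_ip: String,
}

/// Optional fields an enrichment closure may fill in.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Enrichment {
    pub geo_country: Option<String>,
    pub tags: Vec<String>,
}

/// Stand-in for the analytics event: the original event plus enrichment.
#[derive(Debug, Clone, PartialEq)]
pub struct RichEvent {
    pub event: Event,
    pub enrichment: Enrichment,
}

/// Runs the application-supplied closure once and wraps the result.
/// The original event is carried unchanged.
pub fn enrich(event: Event, enricher: impl Fn(&Event) -> Enrichment) -> RichEvent {
    let enrichment = enricher(&event);
    RichEvent { event, enrichment }
}
```

Keeping the wire event intact inside the wrapper means the analytics stream can always be traced back to the regulatory record.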

                  AuthEvent (regulatory wire)
                               │
                               ▼
                        ┌──────────────┐
                        │  AuditPipe   │
                        └──────┬───────┘
                               │ fan-out
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
  IdentityAuthnLog    AuthnAnalyticsSink      AuditArchiver
  (lockout depends)    (enriched stream)       (cold tier)
          │                    │                    │
          ▼                    ▼                    ▼
    primary store       analytics store       archive store

The fan-out runs once per event. The performance cost is small because each sink is fire-and-forget; a slow sink does not slow the authentication hot path, but it can lose events under pressure, which is the next concern.
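The fan-out and the enrichment step can be sketched with the standard library alone. Everything in the block is an illustrative stand-in rather than axess's actual API: the fields on AuthEvent and RichAuthnEvent, the synchronous Sink trait, BoundedSink, and fan_out are all assumptions made for the sketch, and the real sinks are asynchronous. What it shows is the shape: the regulatory stream receives the event unchanged, the analytics stream receives the enriched copy, and a sink that is full hands the event back instead of making the caller wait.

```rust
// Illustrative stand-ins; the real AuthEvent and RichAuthnEvent carry more fields.
#[derive(Clone, Debug, PartialEq)]
pub struct AuthEvent {
    pub user_id: String,
    pub client_ip: String,
}

#[derive(Clone, Debug, PartialEq)]
pub struct RichAuthnEvent {
    pub base: AuthEvent,
    pub country: Option<String>, // optional enrichment, None when unknown
}

/// Hypothetical synchronous sink: a full sink returns the event instead of blocking.
pub trait Sink<E> {
    fn try_accept(&mut self, event: E) -> Result<(), E>;
}

/// Bounded in-memory sink, used here to show the "lose events under pressure" case.
pub struct BoundedSink<E> {
    pub capacity: usize,
    pub received: Vec<E>,
}

impl<E> Sink<E> for BoundedSink<E> {
    fn try_accept(&mut self, event: E) -> Result<(), E> {
        if self.received.len() >= self.capacity {
            return Err(event); // full: hand the event back rather than wait
        }
        self.received.push(event);
        Ok(())
    }
}

/// Fan one event out to both streams and return how many sinks rejected it.
/// The enrichment closure runs once per event, on the analytics copy only.
pub fn fan_out(
    event: AuthEvent,
    enrich: &dyn Fn(AuthEvent) -> RichAuthnEvent,
    regulatory: &mut dyn Sink<AuthEvent>,
    analytics: &mut dyn Sink<RichAuthnEvent>,
) -> usize {
    let mut rejected = 0;
    if regulatory.try_accept(event.clone()).is_err() {
        rejected += 1;
    }
    if analytics.try_accept(enrich(event)).is_err() {
        rejected += 1;
    }
    rejected
}
```

In this sketch a rejection is only counted; what the pipeline does with a rejected event (retry, drop, block) is the subject of the buffer configuration.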

Reliability and fire-and-forget

The pipeline's emit path is synchronous (the event is constructed on the authentication hot path and handed to the pipeline before the operation returns), but the dispatch to each sink is asynchronous. The split answers a trade-off every audit-pipeline design has to make, between login latency and delivery guarantees.

A fully-synchronous pipeline blocks the authentication operation until every sink acknowledges the event. The latency cost is the sum of every sink's latency; one slow sink slows every login. The pattern is a non-starter for production.

A fully-asynchronous pipeline with no durability lets the events fan out to sinks in the background. The added latency is effectively zero (the operation returns before the sinks see the event). The cost is that an event lost between emit and the sink is genuinely lost; there is no retry, no acknowledgement, no delivery guarantee.

Axess takes a middle position. The synchronous emit produces an event handed to the pipeline; the pipeline buffers the event in memory or in a durable queue (the choice is configuration); a background task dispatches from the buffer to each sink with retry. The buffer absorbs sink latency without blocking the authentication operation; the buffer's durability determines whether events survive an application crash.

The configuration shape:

pub struct AuditPipeConfig {
    pub regulatory_sink: Arc<dyn IdentityAuthnLog>,
    pub analytics_sink: Option<Arc<dyn AuthnAnalyticsSink>>,
    pub buffer: BufferStrategy,    // InMemory | FsBacked { path }
    pub max_buffer_size: usize,
    pub on_buffer_full: BufferFullPolicy,  // DropOldest | Block | ShutdownAuthn
    pub enrichment: Option<Arc<dyn EventEnrichment>>,
}

buffer controls where the in-flight events live. InMemory is the simple choice: a bounded VecDeque that holds events between emit and dispatch. Events in the buffer are lost on application crash; for most deployments, the regulatory sink's own durability (the database transaction that records the event) is what matters, and the in-memory buffer is just for absorbing latency spikes.

FsBacked { path } writes the buffer to disk so events survive a crash. The cost is one local-disk write per event; the benefit is that the audit trail does not lose events to short network outages or process restarts. Deployments in regulated environments use the file-backed buffer; everyone else uses the in-memory one.

max_buffer_size is the cap. Above it, the on_buffer_full policy fires.

on_buffer_full is the choice for what happens when the buffer fills. DropOldest is the high-throughput default: the oldest buffered events are evicted so the newest fit. Block is the strict choice: the authentication operation that produced the event blocks until the buffer has room; the latency cost can be substantial but no events are lost. ShutdownAuthn is the fail-shut choice: the authentication subsystem stops accepting new logins until the buffer drains. Regulated deployments typically choose Block or ShutdownAuthn; permissive deployments choose DropOldest.
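The three policies are easiest to compare side by side. The following is a minimal sketch of a bounded in-memory buffer, assuming a synchronous push that reports what happened instead of actually blocking or shutting anything down; EventBuffer and PushOutcome are hypothetical names for the sketch, and only BufferFullPolicy and its variants come from the configuration above.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BufferFullPolicy {
    DropOldest,
    Block,
    ShutdownAuthn,
}

/// What a push did. The sketch reports the decision; a real pipeline would
/// act on it (park the caller, flip the subsystem into a refusing state).
#[derive(Debug, PartialEq)]
pub enum PushOutcome<T> {
    Stored,
    StoredAfterEvicting(T), // DropOldest: the evicted event is returned
    WouldBlock(T),          // Block: the caller must wait for room
    AuthnShutDown(T),       // ShutdownAuthn: stop accepting logins
}

pub struct EventBuffer<T> {
    queue: VecDeque<T>,
    max_buffer_size: usize,
    on_buffer_full: BufferFullPolicy,
}

impl<T> EventBuffer<T> {
    pub fn new(max_buffer_size: usize, on_buffer_full: BufferFullPolicy) -> Self {
        assert!(max_buffer_size > 0, "a zero-capacity buffer cannot hold events");
        Self {
            queue: VecDeque::with_capacity(max_buffer_size),
            max_buffer_size,
            on_buffer_full,
        }
    }

    pub fn push(&mut self, event: T) -> PushOutcome<T> {
        if self.queue.len() < self.max_buffer_size {
            self.queue.push_back(event);
            return PushOutcome::Stored;
        }
        match self.on_buffer_full {
            BufferFullPolicy::DropOldest => {
                let evicted = self.queue.pop_front().expect("a full buffer is non-empty");
                self.queue.push_back(event);
                PushOutcome::StoredAfterEvicting(evicted)
            }
            BufferFullPolicy::Block => PushOutcome::WouldBlock(event),
            BufferFullPolicy::ShutdownAuthn => PushOutcome::AuthnShutDown(event),
        }
    }

    /// The dispatcher side: take the oldest buffered event.
    pub fn pop(&mut self) -> Option<T> {
        self.queue.pop_front()
    }
}
```

Returning the evicted event under DropOldest is a choice made for the sketch: it gives the caller a place to count or log the loss, which is the minimum a deployment wants to know about a dropped audit event.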

The IdentityAuthnLog sink

The regulatory sink is the IdentityAuthnLog implementation the application already provides for the lockout policy (covered in Identity store implementation). The pipeline writes events to this sink as the canonical record. The sink's storage backend is the application's choice; the typical pattern is a Postgres or MySQL table with append-only writes and an index on (user_id, tenant_id, timestamp) for the lockout-policy queries.

The pattern means the regulatory store is what the application already needs for lockout. The pipeline does not add a second database; it just uses what is already there.

The AuthnAnalyticsSink

The analytics sink is the optional stream for the SIEM and analytics consumers. The trait:

#[async_trait]
pub trait AuthnAnalyticsSink: Send + Sync {
    async fn dispatch(&self, event: RichAuthnEvent) -> Result<(), SinkError>;
}

The sink is fire-and-forget from the authentication path's point of view: nothing on the hot path waits for dispatch to return. A dispatch that fails transiently is retried from the buffer; a dispatch that keeps failing is logged and the event is dropped. The implementations the audit-archive-fs feature provides cover the filesystem case; for streaming or columnar stores, the implementation is the application's.

A typical Apache Iggy implementation:

struct IggyAnalyticsSink {
    client: IggyClient,
    topic: String,
}

#[async_trait]
impl AuthnAnalyticsSink for IggyAnalyticsSink {
    async fn dispatch(&self, event: RichAuthnEvent) -> Result<(), SinkError> {
        let bytes = rkyv::to_bytes::<_, 256>(&event).map_err(SinkError::serialize)?;
        self.client.send(self.topic.clone(), bytes.to_vec()).await
            .map_err(SinkError::transport)?;
        Ok(())
    }
}

The rkyv serialisation is the recommendation. RichAuthnEvent derives rkyv::Archive, rkyv::Serialize, and rkyv::Deserialize, which produces a wire format that is significantly more compact than JSON, much faster to serialise, and zero-copy on the deserialise side. For a stream that pumps millions of events per day, the difference is operationally meaningful.

A ClickHouse implementation is the equivalent for batch shipping: the sink accumulates events in memory until a threshold (batch size or time interval), then issues a bulk insert. The pattern matches ClickHouse's preferred ingestion shape.
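The batching logic itself does not depend on ClickHouse. Below is a minimal sketch of the accumulate-then-flush pattern, assuming a synchronous sink, an injected clock value, and a flush closure standing in for the bulk insert; BatchingSink and its method names are hypothetical, and a real implementation would sit behind the async AuthnAnalyticsSink trait.

```rust
use std::time::{Duration, Instant};

/// Accumulates events and hands them to `flush_fn` in batches. The closure
/// stands in for the bulk insert against the columnar store.
pub struct BatchingSink<E, F: FnMut(Vec<E>)> {
    pending: Vec<E>,
    max_batch: usize,
    max_age: Duration,
    oldest: Option<Instant>, // arrival time of the oldest pending event
    flush_fn: F,
}

impl<E, F: FnMut(Vec<E>)> BatchingSink<E, F> {
    pub fn new(max_batch: usize, max_age: Duration, flush_fn: F) -> Self {
        Self { pending: Vec::new(), max_batch, max_age, oldest: None, flush_fn }
    }

    /// Add one event. `now` is passed in so the time threshold is testable.
    pub fn dispatch(&mut self, event: E, now: Instant) {
        if self.pending.is_empty() {
            self.oldest = Some(now);
        }
        self.pending.push(event);
        self.flush_if_due(now);
    }

    /// Call periodically so a quiet stream still flushes on the time threshold.
    pub fn tick(&mut self, now: Instant) {
        self.flush_if_due(now);
    }

    fn flush_if_due(&mut self, now: Instant) {
        let full = self.pending.len() >= self.max_batch;
        let stale = self
            .oldest
            .map_or(false, |t| now.duration_since(t) >= self.max_age);
        if !self.pending.is_empty() && (full || stale) {
            let batch = std::mem::take(&mut self.pending);
            self.oldest = None;
            (self.flush_fn)(batch);
        }
    }
}
```

The periodic tick matters as much as the size threshold: without it, the last few events of a quiet period would sit in memory until the next burst arrived.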

The three-tier retention

The regulatory stream's events grow without bound by default. A deployment with millions of users produces hundreds of millions of events per year; the storage cost and the query cost both trend up unless the deployment manages the retention.

The retention story has three tiers, with explicit transitions between them.

The hot tier is the live authn_attempts table (or whatever the regulatory sink writes to). Events stay in the hot tier for as long as they are operationally useful: the lockout policy's last_attempts query, the SIEM's recent-events dashboards, the incident-response window. The recommended hot retention is between 7 and 90 days, with 30 days as a sensible default for most deployments.

The archived tier is a cheaper, slower store that holds events for the compliance retention period. The data is the same; the access pattern is different. Queries against the archive are slower (typically minutes rather than milliseconds) and less flexible (no indexed lookup; full-scan reads against a known date range). The archive is the answer to "show me everything that happened to this user three years ago." The retention here is set by the regulatory regime: PCI-DSS asks for one year; banking regulations ask for seven years; HIPAA asks for six years. Configure to match.

The deleted tier is what comes after the archive expires. The events are removed entirely; the deletion is auditable (a DeletionEvent itself, recording the date range and the count) but the underlying data is gone. Some deployments never reach this tier (an indefinite archive is a defensible choice for small-volume deployments); others rotate through it on the regulatory schedule.

AuditArchiver

The transition from hot to archived runs through the AuditArchiver trait:

#[async_trait]
pub trait AuditArchiver: Send + Sync {
    async fn archive_batch(&self, events: Vec<AuthEvent>) -> Result<(), ArchiveError>;
    async fn purge_batch(&self, range: ArchiveDateRange) -> Result<usize, ArchiveError>;
}

The trait has two methods. archive_batch writes a batch of events to the cold store. purge_batch removes a date range from the archive (for the deleted-tier transition).

The pipeline runs an AuditRetentionLoop<S, A> (S is the source IdentityAuthnLog, A is the archiver) that drives the transitions on a configurable schedule:

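use std::{sync::Arc, time::Duration};

let retention_policy = AuditRetentionPolicy {
    archive_after: Duration::from_secs(30 * 86400),           // archive events older than 30 days
    purge_hot_after_archive: Duration::from_secs(7 * 86400),  // drop hot rows 7 days after their archive copy
    delete_archive_after: None,                               // never purge the archive
};

// `identity_authn_log` and `my_archiver` are the application's own implementations.
let loop_handle = AuditRetentionLoop::new(
    identity_authn_log.clone(),
    Arc::new(my_archiver),
    retention_policy,
).run();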

The loop runs once per configured interval (typically daily). Each run does three things: it reads the events from the hot tier that have aged past archive_after, it batches them into the archiver, and it purges the hot tier of events whose archive copy was made more than purge_hot_after_archive ago.
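The decision each run makes can be written as a pure function over the hot tier's rows. This is a sketch under assumptions: HotRow, RetentionThresholds, RetentionPlan, and plan_retention_pass are hypothetical names, and the real loop works against the IdentityAuthnLog and AuditArchiver traits rather than an in-memory slice. Only the two threshold names mirror the policy fields above.

```rust
use std::time::Duration;

/// One hot-tier row as the retention pass sees it (hypothetical shape).
#[derive(Clone, Debug, PartialEq)]
pub struct HotRow {
    pub id: u64,
    pub age: Duration,                  // time since the event was recorded
    pub archived_for: Option<Duration>, // time since its archive copy was made
}

/// The two thresholds that govern the hot tier.
pub struct RetentionThresholds {
    pub archive_after: Duration,
    pub purge_hot_after_archive: Duration,
}

#[derive(Debug, Default, PartialEq)]
pub struct RetentionPlan {
    pub to_archive: Vec<u64>,
    pub to_purge: Vec<u64>,
}

/// Decide, for one pass, which rows to copy to the archive and which rows
/// to remove from the hot tier. A row is never purged before it is archived.
pub fn plan_retention_pass(rows: &[HotRow], policy: &RetentionThresholds) -> RetentionPlan {
    let mut plan = RetentionPlan::default();
    for row in rows {
        match row.archived_for {
            // Not archived yet and old enough: archive it in this pass.
            None if row.age >= policy.archive_after => plan.to_archive.push(row.id),
            // Archived long enough ago: safe to drop from the hot tier.
            Some(since) if since >= policy.purge_hot_after_archive => {
                plan.to_purge.push(row.id)
            }
            _ => {}
        }
    }
    plan
}
```

With thresholds of 30 and 7 days, a row in this sketch leaves the hot tier no earlier than 37 days after it was recorded, since the purge clock starts only when the archive copy exists.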

The delete_archive_after field is the optional final transition. None means the archive grows indefinitely; a configured duration means the archive itself is purged at that age.

The defaults (30 days hot, 7 days hot retention after archive, no archive deletion) are conservative for finance. PCI-DSS asks for one year of audit retention, which the defaults satisfy by keeping events in the archive indefinitely. Other regulatory regimes have different requirements; tune to match.

Filesystem archive

The audit-archive-fs feature ships FilesystemAuditArchiver, a reference implementation that writes archived events to a day-partitioned JSONL directory:

/var/lib/axess/audit/
    YYYY-MM-DD.jsonl
    YYYY-MM-DD.jsonl
    YYYY-MM-DD.jsonl
    ...

Each file is append-only, fsynced per batch, and contains newline-delimited JSON-encoded events. The format is readable by standard tools (grep, jq, awk), survives forensic investigation, and lifts cleanly into cloud object storage when the deployment moves the archive there.
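A day-partitioned layout needs two pieces: a mapping from an event timestamp to a file name, and an append that syncs once per batch. The sketch below uses only the standard library and assumes UTC partitioning, Unix-second timestamps, and events that the caller has already serialised to single-line JSON strings; partition_file_name and append_batch are hypothetical helpers, not the FilesystemAuditArchiver's actual code.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Days since 1970-01-01 to (year, month, day) in the proleptic Gregorian
/// calendar, using the standard civil-from-days arithmetic.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of 400-year era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// File name of the partition an event belongs to (UTC day, assumed).
pub fn partition_file_name(unix_seconds: i64) -> String {
    let (y, m, d) = civil_from_days(unix_seconds.div_euclid(86_400));
    format!("{y:04}-{m:02}-{d:02}.jsonl")
}

/// Append one batch to its day file and fsync once at the end.
/// Each entry in `json_lines` must be a single line of JSON without a newline.
pub fn append_batch(dir: &Path, unix_seconds: i64, json_lines: &[String]) -> std::io::Result<()> {
    let path = dir.join(partition_file_name(unix_seconds));
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    for line in json_lines {
        file.write_all(line.as_bytes())?;
        file.write_all(b"\n")?;
    }
    file.sync_all() // one fsync per batch, not one per event
}
```

Syncing per batch rather than per event is what keeps the archive write affordable: the durability point moves to the end of the batch, which is acceptable because the hot tier still holds the events until the purge threshold passes.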

The reference implementation is for deployments with straightforward audit-storage needs. Larger deployments typically use S3 (with object-lock for immutability), GCS (with retention policies), or a dedicated audit-log service (Splunk, Datadog, SumoLogic). The trait surface is the same; the implementation is the deployment's.

Backpressure and tenant isolation

In a multi-tenant deployment, one tenant's audit load can overwhelm the pipeline if the buffer is shared. The pattern that works is per-tenant pipelines: each tenant has its own AuditPipe with its own buffer and its own retention configuration. The configuration matches what the tenant has agreed to (high-throughput tenants get larger buffers; regulated tenants get file-backed buffers). One tenant's spike does not affect another's.

The cost is operational complexity: one configuration per tenant. The benefit is isolation; the SLA you offer a tenant is genuinely a per-tenant SLA, not a deployment-wide average.

For most deployments, a single shared pipeline with conservative defaults is fine. The per-tenant shape is for deployments with strict per-tenant guarantees.

What this enables

The pipeline is what turns axess's audit events into a defensible production audit trail. The dual stream serves the two audiences; the buffer absorbs latency without blocking the hot path; the retention tiers balance storage cost against query needs and regulatory requirements. The mechanism is small (a handful of traits, one fan-out, one retention loop), and the configuration is the deployment's lever for tuning to specific requirements.

Further reading

Audit events catalogues what flows through the pipeline. Identity store implementation covers the regulatory sink (the IdentityAuthnLog trait). Multi-tenancy covers the per-tenant configuration patterns. Security posture covers the GDPR posture for archived audit data and the PII fields that may need scrubbing before archive.

Rate limiting

A rate limiter is the layer that caps how many requests an identified caller may make per unit time. For an authentication surface, the rate limiter is one of the most consequential pieces of operational defence in depth: the lockout policy catches the specific case of repeated failed credentials against one identifier, but the rate limiter catches the broader case of brute-force and credential-stuffing attacks distributed across many identifiers. This chapter covers the RateLimitLayer Tower middleware, the key-extraction strategies that determine what is rate-limited, the tuning patterns for different endpoints, and the SLI signal the layer produces.

Why rate limiting matters

The lockout policy in Multi-tenancy catches one specific pattern: many failures against one identifier. A rate limiter catches a wider pattern: a high volume of requests against an endpoint, regardless of identifier, regardless of success.

The shapes of attack the rate limiter catches:

Credential stuffing. An attacker with a list of credentials tries each one against the login endpoint. Each individual attempt fails on its own credentials (no lockout against any single user), but the aggregate rate is far above legitimate traffic. The rate limiter on the login endpoint, keyed by source IP, drops the attack to a trickle.

Account-existence enumeration. An attacker probes the signup endpoint to find which usernames are taken. Each request might succeed (the username is unique) or fail (the username is taken), and the response leaks the information. The rate limiter caps the enumeration rate; combined with response-shaping (return the same shape for both cases), the attack becomes impractical.

Token-replay forwarding. An attacker who has captured a valid session cookie forwards it through many connections to evade fingerprint detection. Each request looks legitimate on its own; the aggregate volume is the giveaway. The rate limiter keyed by session id catches the pattern.

Workload misbehaviour. A workload enters a tight loop calling the application's API, for whatever reason. The authentication side validates the workload token on each request; the rate limiter catches the runaway pattern before it overwhelms the service.

The layer

RateLimitLayer is a Tower layer with a small configuration:

use axess::{RateLimitLayer, RateLimitConfig, KeyExtractor};
use std::time::Duration;

let layer = RateLimitLayer::new(
    RateLimitConfig::builder()
        .max_requests(10)
        .window(Duration::from_secs(60))
        .key(KeyExtractor::PeerIp)
        .build(),
);

The configuration says "no more than ten requests per minute, keyed by the peer IP." The layer counts requests against each distinct peer IP; when a key has hit the limit within the window, subsequent requests get a 429 (Too Many Requests) with a Retry-After header.

The window is enforced with a token bucket rather than a fixed counter that resets at window boundaries. The math: each key has a bucket of max_requests tokens; each request consumes one; tokens regenerate continuously at a rate of max_requests per window, up to the bucket's capacity. A burst of max_requests requests within a short interval empties the bucket; subsequent requests are rejected until a token has regenerated.
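The bucket arithmetic fits in a few lines. The following is a minimal sketch of a single key's bucket, assuming time is passed in as a Duration measured from an arbitrary start so the behaviour is deterministic; TokenBucket, Decision, and the method names are hypothetical, and the layer's real store keeps one such bucket per key.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Reject { retry_after_secs: u64 }, // value for the Retry-After header
}

pub struct TokenBucket {
    capacity: f64,        // max_requests
    secs_per_token: f64,  // window / max_requests
    tokens: f64,          // tokens currently available
    last_update: Duration,
}

impl TokenBucket {
    /// A new bucket starts full. `max_requests` and `window` must be non-zero.
    pub fn new(max_requests: u32, window: Duration, now: Duration) -> Self {
        assert!(max_requests > 0 && !window.is_zero());
        let capacity = f64::from(max_requests);
        Self {
            capacity,
            secs_per_token: window.as_secs_f64() / capacity,
            tokens: capacity,
            last_update: now,
        }
    }

    pub fn check(&mut self, now: Duration) -> Decision {
        // Refill for the time elapsed since the last call, capped at capacity.
        let elapsed = now.saturating_sub(self.last_update).as_secs_f64();
        self.tokens = (self.tokens + elapsed / self.secs_per_token).min(self.capacity);
        self.last_update = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Decision::Allow
        } else {
            // Time until one whole token is available again, rounded up.
            let wait = (1.0 - self.tokens) * self.secs_per_token;
            Decision::Reject { retry_after_secs: wait.ceil() as u64 }
        }
    }
}
```

With max_requests of 10 and a 60-second window, a token regenerates every six seconds, so a caller that has emptied the bucket is told to retry after six seconds and is then allowed exactly one more request.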

The state of the buckets lives in memory by default (BucketStore::InMemory). For multi-instance deployments where the same caller can reach any instance, the rate limit needs to be aggregated across instances; BucketStore::Valkey { client } shifts the state to a shared Valkey instance.

Key extraction

The key is what the rate limiter counts against. The KeyExtractor enum carries the choices:

pub enum KeyExtractor {
    PeerIp,                              // request source IP (read through trusted-proxy)
    SessionId,                           // present session id
    UserId,                              // authenticated user
    TenantId,                            // authenticated tenant
    WorkloadId,                          // authenticated workload
    Custom(Arc<dyn KeyExtractorFn>),     // application-supplied
    Composite(Vec<KeyExtractor>),        // multi-key (one bucket per combination)
}

The choice of key determines which attack the limiter catches. PeerIp catches single-source attacks; SessionId catches session-replay attacks; UserId catches per-user runaway loops; TenantId catches per-tenant runaway (which can be a noisy neighbour rather than an attack).

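The Composite choice creates one bucket per combination of the named keys. A rate limit keyed by (PeerIp, UserId) gives every (ip, user) pair its own bucket, so a legitimate user does their normal work without sharing a budget with everyone else behind the same NAT, and the attempts any one IP can make against any one user stay bounded. The composite key does not bound an attacker's aggregate rate: an attacker IP that rotates through many users gets a fresh bucket for each one, so the total scales with the number of users tried. To catch that pattern, pair the composite limit with a PeerIp-keyed limit on the same routes.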
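How a composite key maps requests to buckets is worth seeing concretely. The sketch below derives a bucket key from a list of key parts; KeyPart, RequestFacts, and bucket_key are hypothetical stand-ins for the KeyExtractor machinery, and returning None when a part is missing (an unauthenticated request has no user id) is an assumption about how such a case might be handled.

```rust
/// A simplified subset of the key choices, for illustration only.
#[derive(Clone, Debug)]
pub enum KeyPart {
    PeerIp,
    UserId,
    TenantId,
}

/// The facts about a request that key extraction can draw on (hypothetical).
pub struct RequestFacts<'a> {
    pub peer_ip: &'a str,
    pub user_id: Option<&'a str>,
    pub tenant_id: Option<&'a str>,
}

/// Build the bucket key for a composite of `parts`. Returns None when any
/// part is unavailable for this request; the caller decides what to do then.
pub fn bucket_key(parts: &[KeyPart], req: &RequestFacts<'_>) -> Option<String> {
    let mut pieces = Vec::with_capacity(parts.len());
    for part in parts {
        let value = match part {
            KeyPart::PeerIp => Some(req.peer_ip),
            KeyPart::UserId => req.user_id,
            KeyPart::TenantId => req.tenant_id,
        }?;
        // Length-prefix each piece so "a|b" + "c" and "a" + "b|c" cannot collide.
        pieces.push(format!("{}:{}", value.len(), value));
    }
    Some(pieces.join("|"))
}
```

Two requests from the same IP for different users produce different keys and therefore draw on different buckets, which is exactly the per-combination behaviour the Composite variant describes.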

The Custom choice is the escape hatch for keys axess does not know about: the OAuth client id, a custom request header, the authenticated session's tenant slug. The application provides the extraction function; the layer uses it to derive the key.

Per-endpoint rate limits

Different endpoints have different sensitivities. A login endpoint needs only a few requests per minute per IP because real users do not log in often; a data or search endpoint has to accept hundreds per minute per session because real users browse. The configuration shape is typically per-endpoint:

let auth_routes = Router::new()
    .route("/login", post(login))
    .route("/signup", post(signup))
    .route("/reset-password", post(reset_password))
    .layer(RateLimitLayer::new(
        RateLimitConfig::builder()
            .max_requests(10)
            .window(Duration::from_secs(60))
            .key(KeyExtractor::PeerIp)
            .build(),
    ));

let api_routes = Router::new()
    .route("/data", get(get_data))
    .layer(RateLimitLayer::new(
        RateLimitConfig::builder()
            .max_requests(300)
            .window(Duration::from_secs(60))
            .key(KeyExtractor::SessionId)
            .build(),
    ));

let app = Router::new()
    .merge(auth_routes)
    .merge(api_routes)
    .layer(session_layer);

The pattern is to layer the rate limit on the specific routes it applies to, with the most restrictive limits on the most sensitive endpoints. A login endpoint with a tight per-IP limit is the canonical case; a token-refresh endpoint with a per-session limit is the second canonical case.

The trusted-proxy configuration covered in Cookies, fingerprinting, hijack detection applies to the PeerIp extractor here as well. Read the IP from the forwarded header only when the immediate peer is a trusted proxy; otherwise the rate limiter can be spoofed.
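The rule "trust the forwarded header only when the immediate peer is a trusted proxy" can be sketched as a pure function. This is an illustration under assumptions: client_ip is a hypothetical helper, the header is taken to be X-Forwarded-For, and trusted proxies are matched as exact addresses where a real deployment would usually match CIDR ranges.

```rust
use std::net::IpAddr;

/// Decide which address the rate limiter should key on.
pub fn client_ip(
    peer: IpAddr,
    x_forwarded_for: Option<&str>,
    trusted_proxies: &[IpAddr],
) -> IpAddr {
    // An untrusted peer can put anything in the header, so ignore it.
    if !trusted_proxies.contains(&peer) {
        return peer;
    }
    let Some(header) = x_forwarded_for else {
        return peer;
    };
    // Walk right to left: the rightmost entries were appended by the proxies
    // closest to us. The first address that is not a trusted proxy is the client.
    for hop in header.rsplit(',') {
        match hop.trim().parse::<IpAddr>() {
            Ok(ip) if trusted_proxies.contains(&ip) => continue,
            Ok(ip) => return ip,
            Err(_) => return peer, // malformed header: fall back to the peer
        }
    }
    peer
}
```

Reading from the right matters: the leftmost entries of the header are whatever the original sender claimed, so taking the first entry would let an attacker choose their own rate-limit key even behind a trusted proxy.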

Tuning the windows

Tuning the rate limit is more art than science, but a few guidelines hold up.

For login endpoints: 10 requests per minute per IP is the conservative starting point. Real users log in at most a few times a day from any one IP. Credential-stuffing attacks need hundreds per minute to be efficient; 10 is well below that. Tune up only if the reject rate is too high on legitimate traffic (many users behind a corporate NAT, for instance).

For signup endpoints: 5 requests per minute per IP. Signup is even less frequent for legitimate users than login; account enumeration is best stopped tight.

For password reset: 3 requests per hour per IP. A reset is a once-in-a-while operation. Attackers spam reset to exhaust the victim's inbox; the tight limit is the defence.

For token refresh: matched to the session TTL. A session that refreshes every hour should have a rate limit of a few refreshes per hour per session id; an attacker who steals a session cannot extract value through rapid refresh.

For data endpoints: matched to the application's expected use pattern. An API for human-driven dashboards sees a few requests per minute per session; an API for programmatic clients sees hundreds per second per workload. The pattern is deployment-specific.

The safest starting point is to measure first. The AuthnMetrics::rate_limit_rejected counter (covered below) gives you the real reject rate; calibration is then a matter of setting the limit just above the legitimate-traffic envelope.

What happens at the limit

A request that hits the rate limit gets:

A 429 status code. The standard HTTP response for "Too Many Requests."

A Retry-After header. The value is the number of seconds the client should wait before retrying. Well-behaved clients honour the header; attackers ignore it.

A short JSON body explaining the limit. The body is generic ("rate limit exceeded") rather than specific (no "you have 0 of 10 requests remaining"); the latter leaks the limit configuration, which lets an attacker calibrate their attack to just under the limit.

The application's metrics record the rejection. The AuthnMetrics::rate_limit_rejected method is the metric; applications wire it to their Prometheus or OpenTelemetry counter.

Distinguishing attack from misconfiguration

A high rate of 429s is operationally interesting. The cause is either an attack (real attacker getting throttled) or a misconfiguration (legitimate traffic hitting a limit that was set too low).

The signals that distinguish them:

A rate of 429s heavily concentrated on a small set of source IPs, with the IPs not matching legitimate user patterns (datacenter IPs, VPN exit nodes, residential ASNs from countries the application does not typically serve) suggests attack.

A rate of 429s spread across many IPs, matching legitimate user patterns (residential ASNs from served countries, mixed mobile and home connections), suggests misconfiguration.

The audit events the rate limiter produces (a RateLimitRejected event per drop) carry the source IP, the endpoint, and the timestamp; SIEM queries against these distinguish the patterns quickly.
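One simple way to quantify "concentrated" versus "spread" is the share of rejections that comes from the busiest few source IPs. The sketch below computes that share from a list of rejected source addresses; top_n_share is a hypothetical helper, and the threshold at which a share counts as an attack is left to the deployment rather than asserted here.

```rust
use std::collections::HashMap;

/// Fraction of all rejections attributable to the `top_n` busiest source IPs.
/// Returns 0.0 when there were no rejections.
pub fn top_n_share(rejected_ips: &[&str], top_n: usize) -> f64 {
    if rejected_ips.is_empty() {
        return 0.0;
    }
    // Count rejections per source IP.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for ip in rejected_ips {
        *counts.entry(*ip).or_insert(0) += 1;
    }
    // Sort the per-IP counts in descending order and sum the largest `top_n`.
    let mut per_ip: Vec<usize> = counts.into_values().collect();
    per_ip.sort_unstable_by(|a, b| b.cmp(a));
    let top: usize = per_ip.iter().take(top_n).sum();
    top as f64 / rejected_ips.len() as f64
}
```

A share close to 1.0 for a small top_n points at a handful of sources being throttled; a share close to top_n divided by the number of distinct IPs points at the limit biting evenly across the user base.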

Per-tenant rate limits

For multi-tenant deployments, the rate limit configuration can be per-tenant. A tenant with a higher SLA gets a higher rate limit; a tenant with a lower SLA gets a tighter one. The mechanism is the same RateLimitLayer, with a Custom key extractor that composes the standard key (typically PeerIp) with the tenant id, and with separate RateLimitConfigs per tenant tier.

The pattern is operationally complex (one configuration per tenant tier), so most deployments use a single shared limit and calibrate to the deployment-wide envelope. The per-tenant shape is for deployments where the SLA differences are explicit and the operational overhead is justified.

Metrics

The layer emits two metrics through the AuthnMetrics trait:

rate_limit_rejected is incremented on each 429. The metric is the primary signal for tuning and for attack detection.

rate_limit_evaluated (optional, off by default) is incremented on every request the layer sees, regardless of outcome. The ratio of rejected to evaluated is the reject rate; below 0.1% typically means the limit is set well, above 1% suggests either attack or misconfiguration.

The AuthnMetrics implementation is the application's; it typically routes to Prometheus, OpenTelemetry, or whatever metrics system the deployment uses. The examples/sqlite/ reference application shows a simple AtomicU64-based implementation suitable for adapting to a real metrics system.
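An AtomicU64-based counter pair of the kind described can be sketched as follows. The struct is a standalone illustration and does not implement the AuthnMetrics trait, whose exact method signatures are not shown in this chapter; CounterMetrics, RejectRateSignal, and classify are hypothetical names, and the Watch label for the band between the two thresholds is an addition of the sketch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Two lock-free counters, cheap enough to bump on every request.
#[derive(Default)]
pub struct CounterMetrics {
    rejected: AtomicU64,
    evaluated: AtomicU64,
}

impl CounterMetrics {
    pub fn rate_limit_evaluated(&self) {
        self.evaluated.fetch_add(1, Ordering::Relaxed);
    }

    pub fn rate_limit_rejected(&self) {
        self.rejected.fetch_add(1, Ordering::Relaxed);
    }

    /// Rejected divided by evaluated; None until something has been evaluated.
    pub fn reject_rate(&self) -> Option<f64> {
        let evaluated = self.evaluated.load(Ordering::Relaxed);
        if evaluated == 0 {
            return None;
        }
        Some(self.rejected.load(Ordering::Relaxed) as f64 / evaluated as f64)
    }
}

#[derive(Debug, PartialEq)]
pub enum RejectRateSignal {
    Healthy,     // below 0.1%
    Watch,       // between the two thresholds (label is the sketch's own)
    Investigate, // above 1%: attack or misconfiguration
}

/// Apply the two rule-of-thumb thresholds to a reject rate in the range 0..=1.
pub fn classify(rate: f64) -> RejectRateSignal {
    if rate < 0.001 {
        RejectRateSignal::Healthy
    } else if rate > 0.01 {
        RejectRateSignal::Investigate
    } else {
        RejectRateSignal::Watch
    }
}
```

Relaxed ordering is sufficient here because the counters are independent statistics; nothing else in the program synchronises on their values.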

Composing with the lockout policy

The rate limiter and the lockout policy are different defences that compose. The rate limiter catches volume; the lockout policy catches credential pattern. Both fire on attacks, in different shapes.

The pattern that emerges: the rate limiter is the first line of defence against credential stuffing. It drops the attack to a trickle before any individual user's lockout policy can fire. The lockout policy then catches the few attempts that get through, marking the targeted user accounts as locked.

A deployment that has rate limiting but no lockout policy is vulnerable to slow attacks that stay below the rate limit. A deployment that has lockout but no rate limiting is vulnerable to high-volume attacks that distribute across many users. Both together cover both attack shapes.

What this enables

The rate limiter is the operational layer that sits between "the request was sent" and "the authentication logic runs." A deployment without it is vulnerable to a class of attack that the authentication logic alone cannot prevent; a deployment with it has the broader defence against volume-based attacks that complements the credential-pattern defence of lockout.

Further reading

Multi-tenancy covers the lockout policy that pairs with the rate limit. Audit events catalogues the RateLimitRejected event the layer emits. Cookies, fingerprinting, hijack detection covers the trusted-proxy configuration that determines how PeerIp reads the source IP. Operations runbook covers the metrics dashboards and the SIEM rules that turn the rate-limit signal into alerts.

Security posture

This chapter is the production-readiness chapter. It covers the crypto choices axess makes by default, the production integration requirements an adopter has to meet before launch, the compliance touch-points (GDPR, SOC 2, PCI-DSS, HIPAA) the deployment will face, and the disclosure protocol for handling the inevitable vulnerability report.

The chapter has two halves. The first half is axess-specific and covers the crypto backends, the FIPS-routing notes, and the PII classification. The second half is the canonical SECURITY.md from the repo root, included verbatim so the production checklist lives in one place rather than two.

Crypto backends

Axess uses three crypto backends, chosen per operation:

RustCrypto is the default for most cryptographic primitives. The implementations are pure Rust, with no system-library dependency, and the project's audit history is good. Axess uses RustCrypto for AES-256-GCM (the session envelope), HMAC-SHA256 (cookie signing, fingerprint binding), Argon2id (password hashing), TOTP and HOTP (the RFC 6238 and RFC 4226 implementations), and SHA-256 (refresh token hashing).

aws-lc-rs is an alternative for deployments that need FIPS 140-3 validated crypto. The backend wraps the FIPS-validated aws-lc library; selecting it through a Cargo feature redirects the relevant primitives to the validated implementations. The trade-off is binary size (the FIPS module adds a few megabytes) and platform support (aws-lc does not build on every target).

ring is a third option, used historically for TLS-adjacent primitives. The project is mature but the maintenance cadence has slowed; axess uses ring in a few legacy spots and is migrating away. New code uses RustCrypto by default and aws-lc-rs when FIPS is required.

The selection is a Cargo feature, configured per crate:

[dependencies]
axess = { version = "0.2", features = ["crypto-aws-lc"] }

The default is crypto-rust (which is the same as not specifying a backend); crypto-aws-lc is the FIPS variant. The crates that depend on a specific backend gate their implementations on the feature; the build refuses if the application requests incompatible backends (a deployment cannot simultaneously enable RustCrypto and aws-lc-rs for the same operation).

FIPS targeting

A FIPS 140-3 validated deployment requires three things to be true.

The first is that every cryptographic operation runs through a validated module. Axess's crypto-aws-lc feature routes the relevant operations through aws-lc-rs. The choice satisfies the "validated module" requirement.

The second is that the deployment's compile and link chain does not introduce non-validated crypto. Cargo's dependency graph is the source of truth here; running cargo tree and inspecting for crypto that does not come from aws-lc-rs (ring, RustCrypto crates pulled in transitively, a rustls build configured with a non-validated provider) shows what the deployment actually pulls in. Anything that introduces non-validated crypto needs to be replaced or compiled out.

The third is that the validation certificate covers the platform the deployment runs on. NIST publishes FIPS validation certificates per platform-binary combination; a certificate for Linux x86-64 does not cover macOS ARM. The deployment's compliance evidence must include the certificate matching the production platform.

The deployment's compliance team owns the end-to-end FIPS validation; axess provides the crypto-backend lever. The chapters that depend on specific crypto choices (session envelope, refresh-token hashing, HMAC fingerprint) all use the configured backend automatically.

PII classification

The application records PII across several stores. The classification matters for GDPR (the data subject's rights), for SOC 2 (the control objectives), and for the retention sweep (Device identity's device_retention_days). The classification:

Primary PII includes the user's identifier (email, username, or similar), their hashed password, their TOTP secret, their FIDO2 credentials, their IP address as seen during authentication, and their device fingerprint. This data lives in the identity store and the device store; the retention is the application's choice within whatever regulatory bounds apply.

Secondary PII includes the audit-event log (which references the primary PII through user_id, tenant_id, device_id, and client_ip). The audit retention covered in Audit pipeline applies here; for GDPR the typical pattern is to retain audit data longer than the primary PII but to scrub or hash the IP addresses after the operational hot window.

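Pseudonymous data includes the session id, the refresh token hash, and the device id itself (a UUID that does not name the user directly). These identify a person only when joined to the primary data, and the join requires access to the identity store, so they carry lower risk than the primary PII. Under GDPR, pseudonymised data still counts as personal data for as long as that join remains possible, so retaining these fields longer than the primary PII needs its own justification.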

The GDPR right-to-erasure verb (IdentityAdmin::erase_user) cascades through every store: the user's primary PII is removed from the identity store, the user's device records are removed from the device store, the user's sessions are removed from the session store, and the user's refresh tokens are removed from the refresh-token store. The audit-event entries that reference the user are not removed (the audit trail is load-bearing for compliance); the user's identifier in the events is hashed to a pseudonymous token, which removes the direct identifier from the events without losing the ability to correlate them.

Compliance touch-points

The deployment will face one or more of these regulatory frames. Axess does not provide compliance on its own; it provides the controls each framework requires. The touch-points:

GDPR (EU data protection): the right-to-erasure verb (above), the audit trail's retention configuration, the IP-address scrubbing in the cold-tier archive, the per-tenant device_retention_days. The deployment owns the data subject notices, the privacy policy, and the legal basis for processing; axess provides the technical mechanisms.

SOC 2 (operational controls): the audit catalogue (every authentication and authorisation decision produces an event), the lockout policy (defends against repeated credential guessing against a single account), the session and refresh-token security (covered in earlier chapters), the operational metrics (covered in Operations runbook). The deployment owns the policy and procedure documentation; axess provides the operational evidence.

PCI-DSS (payment card data, if applicable): the strong authentication for administrative access, the audit retention of at least one year, the cryptographic protection of session data at rest. The deployment owns the cardholder data environment; axess covers the authentication boundary into it.

HIPAA (US healthcare data, if applicable): the strong authentication for protected health information access, the audit retention of at least six years, the encryption of session data at rest and in transit. The deployment owns the HIPAA-covered systems; axess covers the authentication boundary.

The chapters that cover the relevant mechanisms are the place to look up specific controls: Session lifecycle and crypto envelope for the at-rest encryption, Audit pipeline for the retention, Refresh tokens and session continuity for the refresh-token hygiene, Multi-tenancy for the lockout policy. The compliance documentation maps the framework's requirements to the relevant chapters.

Disclosure protocol

The vulnerability disclosure protocol lives in the canonical SECURITY.md at the repo root. The summary:

Vulnerability reports go through the private channel described in SECURITY.md (typically a security email or GitHub Security Advisories). Do not file vulnerabilities on the public issue tracker.

The maintainers acknowledge reports and triage them to a severity level within the best-effort response targets listed in SECURITY.md (reproduced below). Critical and high-severity issues get a private fix in a security branch, a coordinated disclosure window, and a CVE if the issue warrants one. Lower-severity issues are fixed in the normal development cycle.

Adopters are expected to keep their axess dependency current. Vulnerability fixes ship in the next patch release; the changelog notes which fixes are security-relevant. Deployments behind on patches accept the risk of the unfixed vulnerabilities.

Canonical SECURITY.md

The rest of this chapter is the canonical SECURITY.md from the repo root, included so the production checklist is in one place.

Security Policy

Reporting a Vulnerability

If you discover a security issue in Axess, please report it through GitHub's private vulnerability reporting (the Report a vulnerability button under the repository's Security tab) or by emailing security@gnomes.ch. Do not open a public issue.

Response targets (best-effort while the project is pre-1.0):

  • Acknowledgement: within 48 hours of report
  • Triage and severity assessment: within 7 calendar days
  • Critical / High fix: patch release within 7 calendar days of confirmation
  • Medium fix: patch in the next scheduled release (typically within 30 days)
  • Advisory: published via GitHub Security Advisory once a fix is available

Only the latest 0.x minor receives security patches. If you are on an older version, upgrade to receive fixes.

Using Axess Securely

Axess is a library for authentication and authorization. Its security depends on correct integration and configuration in your application.

Production integration checklist

Transport and cookies

  • Terminate TLS before Axess sees requests. All session cookies default to Secure; HttpOnly; SameSite=Lax.
  • Set an HSTS header (Strict-Transport-Security: max-age=63072000; includeSubDomains) at the reverse-proxy or application layer so browsers never downgrade to HTTP.
  • Use a cryptographically random 32-byte signing key loaded from a secrets manager. Never hard-code or re-use the all-zero example key.

CSRF

  • Mount CsrfLayer on state-changing routes. The shipped middleware implements signed double-submit cookie protection; CsrfConfig::new(signing_key) is the entry point.
  • SameSite=Lax (the cookie default) mitigates the most common vectors, but is not sufficient on older browsers or cross-site GET-triggered mutations; keep CsrfLayer engaged.
  • For API-only endpoints, validate Origin / Referer headers or use a custom request header as a CSRF defence in addition.

Session binding and hijacking

  • Enable session binding (e.g. UserAgentBinding) to detect cookie theft from a different browser/client.
  • Understand the trade-off: session binding raises the bar for opportunistic theft but does not protect against an attacker who copies the User-Agent string along with the cookie.
  • Consider combining with IP-subnet or TLS channel binding for higher-security environments.

Session registry and forced logout

  • If using a session registry for forced logout, guard all authenticated routes with registry validity checks (not just require_authn!) so suspended or force-logged-out users cannot continue using stale sessions.
  • Call suspend_user (which automatically invalidates registry entries) rather than updating store status manually.
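The combined guard can be sketched as a pure decision function. This is an illustration only: Admission, RegistryUnavailable, and admit are hypothetical names, and treating a registry error as a denial is an assumption modelled on the fail-closed behaviour the feature inventory describes for SessionRegistry::invalidate_user.

```rust
/// Hypothetical error for a registry backend that cannot be reached.
#[derive(Debug, PartialEq)]
pub struct RegistryUnavailable;

/// Outcome of the combined guard.
#[derive(Debug, PartialEq)]
pub enum Admission {
    Allow,
    Deny,
}

/// Admit a request only when the session is authenticated AND the registry
/// positively confirms that the session is still valid. Every other
/// combination, including a registry error, denies the request.
pub fn admit(
    session_authenticated: bool,
    registry_says_valid: Result<bool, RegistryUnavailable>,
) -> Admission {
    match (session_authenticated, registry_says_valid) {
        (true, Ok(true)) => Admission::Allow,
        _ => Admission::Deny,
    }
}
```

Writing the allow case as the only explicit arm keeps the default at deny, so a new failure mode added later cannot silently admit traffic.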

Rate limiting

  • Apply per-IP rate limiting on login, factor verification, and OAuth callback endpoints using the built-in RateLimitLayer. Axess enforces per-user lockout, but distributed brute-force across many usernames requires IP-level throttling.

Recommended configuration for authentication endpoints:

use axess::{RateLimitLayer, RateLimitConfig, KeyExtractor};
use axum::{routing::post, Router};
use std::time::Duration;

// Tight limit for login / factor verification (5 attempts per 60 s per IP).
let auth_rate_limit = RateLimitLayer::new(
    RateLimitConfig::builder()
        .max_requests(5)
        .window(Duration::from_secs(60))
        .key(KeyExtractor::ForwardedIp)
        .build(),
);

// Separate, tighter limit for OTP verification (3 attempts per 60 s).
let otp_rate_limit = RateLimitLayer::new(
    RateLimitConfig::builder()
        .max_requests(3)
        .window(Duration::from_secs(60))
        .key(KeyExtractor::ForwardedIp)
        .build(),
);

// Router::layer and Router::route_layer wrap every route added before the call,
// so each limit gets its own router and the two routers are merged afterwards.
let auth_routes = Router::new()
    .route("/login", post(login_handler))
    .route("/verify-totp", post(totp_handler))
    .route_layer(auth_rate_limit);

let otp_routes = Router::new()
    .route("/verify-email-otp", post(otp_handler))
    .route_layer(otp_rate_limit);

let app = Router::new().merge(auth_routes).merge(otp_routes);
  • Rate-limit OTP verification endpoints separately; 8-digit email OTPs have 10^8 possibilities but a tighter window reduces feasibility further.
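The arithmetic behind the last bullet can be made explicit. The sketch below is a back-of-the-envelope bound, not part of axess: the function names are hypothetical, the code lifetime used in the usage note (600 seconds) is an assumed value, and the bound ignores per-user lockout and single-use invalidation, both of which reduce the attacker's chances further.

```rust
/// Upper bound on the guesses one rate-limit key can submit while a code is
/// live, assuming a limiter that admits `max_requests` per `window_secs`.
/// `window_secs` must be greater than zero.
pub fn max_guesses(max_requests: u64, window_secs: u64, code_ttl_secs: u64) -> u64 {
    // Round up: a partially elapsed window still admits a full burst.
    let windows = (code_ttl_secs + window_secs - 1) / window_secs;
    max_requests * windows
}

/// Probability that at least one of `guesses` distinct guesses hits a
/// uniformly random code with `digits` decimal digits.
pub fn success_probability(guesses: u64, digits: u32) -> f64 {
    let space = 10u64.pow(digits) as f64;
    (guesses as f64 / space).min(1.0)
}
```

With 3 attempts per 60 seconds and an assumed 600-second code lifetime, max_guesses(3, 60, 600) gives 30 guesses, and success_probability(30, 8) is 30 / 10^8, or about 3 in 10 million per source IP.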

Trusted proxy and IP extraction

  • If you rely on X-Real-IP or X-Forwarded-For for audit trails or rate limiting, ensure your reverse proxy strips these headers from untrusted client requests before forwarding. Axess trusts the first entry in X-Forwarded-For.

Session store selection

  • In-memory stores (MemorySessionStore, MemoryRefreshTokenStore) are for testing only. They use non-constant-time lookups and do not persist across restarts.
  • SQL stores (SqliteSessionStore, PostgresSessionStore, MysqlSessionStore) support optional AES-256-GCM encryption at rest via SqliteSessionStore::new(pool, SessionCrypto::new(key)); opt out only via the explicit ::plaintext(pool) constructor (dev/test only).
  • Valkey store supports AES-256-GCM encryption via ValkeySessionStore::new(client, key). Plaintext available via ::plaintext(client) for dev/test.
  • All encryption-capable stores support key rotation via SessionCrypto::with_previous_key(old_key); sessions encrypted with the previous key are transparently re-encrypted on the next access.

Content Security Policy

  • Set a Content-Security-Policy header on all HTML responses to mitigate XSS impact. At minimum: default-src 'self'; script-src 'self'; style-src 'self'.
  • Avoid unsafe-inline and unsafe-eval in CSP directives.

OAuth / OIDC

  • Register only HTTPS issuer URLs. Axess rejects http:// issuer URLs in discovery (localhost / 127.0.0.1 / [::1] exemption for dev).
  • Request the minimum scopes needed; avoid offline_access unless refresh tokens are required.
  • Validate that the OAuth redirect URI matches exactly; do not use wildcard patterns.

Social login (plain OAuth 2.0)

  • Prefer OIDC whenever the provider supports it. Reach for SocialProvider only for IdPs that explicitly don't (GitHub user login, Twitter/X, Discord, Reddit, Spotify, …).
  • Understand the weaker security model: identity comes from a userinfo HTTPS GET, not from a signed assertion. A compromised IdP can impersonate any of its users to your service; you accept that blast radius when you adopt the provider.
  • Keep PKCE on (the default). A handful of providers reject the extra parameter; SocialProvider::without_pkce is the opt-out and should be used sparingly.
  • Verify csrf_state echo on the callback before calling exchange_code; SocialProvider::mint_csrf_state produces a fresh value routed through the same injectable RNG as PKCE.

Workload identity

  • Pin the trust domain at resolver construction. Every shipped resolver (JwtSvidResolver, MtlsResolver, WorkloadResolver) accepts an expected TrustDomain and rejects tokens / certs whose synthesised WorkloadId lives under a different one. This is defense in depth against a confused-deputy scenario in which the JWKS or CA happens to be shared across trust domains.
  • For the generic WorkloadResolver, keep adopter-supplied claim mappers strict about which subject paths the application admits. The recipes in examples/workload-identity/ are templates, not policy.
  • When fetching SVIDs from a local SPIRE agent, use the spire-workload crate today; see docs/workload-identity/jwt-svid.md for the fetch-side recipe.

Dependencies

  • Regularly update Axess and its dependencies (cargo update).
  • Run cargo audit in CI to catch known vulnerabilities in the dependency tree.

Trusted proxy configuration (detailed)

Axess extracts client IP addresses from the X-Real-IP and X-Forwarded-For headers for audit logging and rate limiting. These headers are only trustworthy if your reverse proxy strips or overwrites them before forwarding.

If you don't run behind a trusted reverse proxy, these headers are user-controlled and any IP-based security decision (rate limiting, geo-blocking, audit trails) can be spoofed.

Configure your reverse proxy to:

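  1. Strip incoming X-Forwarded-For and X-Real-IP from client requests.
  2. Set X-Real-IP to the immediate client address (TCP peer).
  3. Append the client address to X-Forwarded-For (for multi-hop chains).

Example for nginx:

proxy_set_header X-Real-IP $remote_addr;
# $proxy_add_x_forwarded_for appends $remote_addr to whatever X-Forwarded-For
# value arrived with the request. That is right for an inner hop behind another
# trusted proxy. On the edge proxy that faces untrusted clients, overwrite the
# header with $remote_addr instead so the client-supplied value is discarded.
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;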

Axess reads X-Real-IP first; if absent, it takes the first entry from X-Forwarded-For. It does not walk the forwarded chain or maintain a trusted-proxy allowlist; that is the reverse proxy's responsibility.
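The precedence described above can be sketched as a small pure function. This is an illustration of the lookup order only, not axess's internal code: client_ip is a hypothetical name, and treating an unparseable X-Real-IP value as absent is an assumption.

```rust
use std::net::IpAddr;

/// Resolve the client address from the two proxy headers in the order the
/// text describes: `X-Real-IP` first, otherwise the first entry of
/// `X-Forwarded-For`. Returns `None` when neither yields a parseable address.
pub fn client_ip(x_real_ip: Option<&str>, x_forwarded_for: Option<&str>) -> Option<IpAddr> {
    // Assumption: a malformed X-Real-IP is treated the same as a missing one.
    if let Some(ip) = x_real_ip.and_then(|v| v.trim().parse::<IpAddr>().ok()) {
        return Some(ip);
    }
    // Only the first entry is consulted; the rest of the chain is ignored.
    x_forwarded_for
        .and_then(|v| v.split(',').next())
        .and_then(|first| first.trim().parse::<IpAddr>().ok())
}
```

Because only the first X-Forwarded-For entry is read, a value the client injected ahead of the proxy's own entry would win, which is why the proxy has to strip or overwrite the header.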

Feature inventory

The shipped security surface, grouped by area. Caveats live in the Notes column.

Authentication factors

| Feature | Notes |
| --- | --- |
| Password | Argon2id hashing with workspace-pinned parameters, per-user lockout, password-reuse history, plaintext zeroized after hash. |
| TOTP (RFC 6238) | Constant-time comparison, last-step replay guard, SHA-1 / SHA-256 / SHA-512, 6–8 digit codes. |
| HOTP (RFC 4226) | Counter advancement, zeroized secrets, same algorithm options as TOTP. |
| Email OTP | 8-digit default, Argon2-hashed codes at rest, TTL-bound, single-use. |
| FIDO2 / WebAuthn | Registration + authentication + clone detection + discoverable / passwordless. Per-ceremony UV / attestation policy waits on webauthn-rs 0.6 stable; see ROADMAP.md. |
| LDAP bind | Verifier over ldap3 with TLS via rustls. Bind only; schema mapping is the application's responsibility. |
| mTLS factor | X.509 certificate verification against a configured trust anchor; SAN URI extraction for SPIFFE / regular identity binding. |
| JWT bearer | Generic JWT verifier with JWKS rotation, iss / aud / exp / nbf / alg allowlist, clock-injected for DST. |
| Multi-factor chains | Ordered factor pipeline (FactorStep::AnyOf for choice steps), session state machine enforces sequencing at compile time. |

Sessions

| Feature | Notes |
| --- | --- |
| Session cookies | HMAC-SHA256 signed, Secure; HttpOnly; SameSite=Lax by default. Configurable via SessionLayer::with_secure / with_same_site. |
| Session binding | HMAC-keyed fingerprint (not a plain hash); UserAgentBinding + extension points for IP / TLS channel binding. |
| Session registry + forced logout | SessionRegistry::invalidate_user is error-observable and fail-closed; cooperates with suspend_user for combined identity + session revocation. |
| ID cycling | Automatic on Guest→Authenticated transition (fixation defense) and on logout; explicit AuthSession::regenerate for app-defined privilege boundaries (MFA enrollment, password change, role grant). See docs/sessions/lifecycle.md. |
| Refresh tokens | Rotation with family revocation on reuse; integration with device-binding cascade. |
| In-memory store | Testing only; no persistence, no encryption, non-constant-time lookups. |
| SQLite session store | Optional AES-256-GCM via SqliteSessionStore::new(pool, SessionCrypto::new(key)). Key rotation via with_previous_key. |
| Postgres session store | Same encryption model. Recommended for multi-instance deployments. Validated against CockroachDB via the cockroach_compat CI job. |
| MySQL / MariaDB session store | Same encryption model. Tested against MySQL 8.x and MariaDB 10.5+. |
| Valkey session store | AES-256-GCM, key rotation, TTL-managed eviction. |
| Cross-backend Store<K, V> trait | All five backends implement it for adopters that want backend-agnostic dispatch via Arc<dyn Store<…>>. |

Device identity

| Feature | Notes |
| --- | --- |
| Three-stage trust ladder | Unknown → Seen → Trusted (plus terminal Revoked); retention sweep demotes idle devices and purges revoked rows past the grace window. |
| Per-tenant fingerprint pepper | Stops cross-tenant fingerprint correlation; TenantPepperResolver is adopter-provided. |
| Cascade revocation | Refresh-token family compromise revokes every device that carried that family's binding. |
| CachedDeviceStore decorator | LRU + clock-driven TTL eviction; revocation propagates through set_trust_level. |
| Five DeviceStore backends | Memory, SQLite, Postgres, MySQL / MariaDB, Valkey; surface-equivalent across SQL dialects + Valkey hash storage, optional AES-256-GCM envelope on the bindings blob (SQL backends). |
| Adopter-supplied store recipe | Documented contracts (tenant scoping, atomic save, hot-path sighting, required sweep) in docs/identity/device.md for adopters with non-shipped backends (DynamoDB, MongoDB, …). |
| PII tokenisation | MemoryDevicePiiStore reference impl + adopter trait for the GDPR-scoped fields (label, last-seen IP). |
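The three-stage trust ladder with its terminal Revoked state, listed in the table above, reads naturally as a small state machine. The sketch below is an assumption-laden illustration rather than the shipped type: TrustLevel and its methods are hypothetical names, and moving exactly one rung per call is an assumed rule; the source only states the levels, that Revoked is terminal, and that the retention sweep demotes idle devices.

```rust
/// Device trust levels named in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    Unknown,
    Seen,
    Trusted,
    Revoked,
}

impl TrustLevel {
    /// Move one rung up the ladder. `Trusted` stays put; `Revoked` is terminal.
    pub fn promote(self) -> TrustLevel {
        match self {
            TrustLevel::Unknown => TrustLevel::Seen,
            TrustLevel::Seen | TrustLevel::Trusted => TrustLevel::Trusted,
            TrustLevel::Revoked => TrustLevel::Revoked,
        }
    }

    /// Move one rung down, as a retention sweep might for an idle device.
    pub fn demote(self) -> TrustLevel {
        match self {
            TrustLevel::Trusted => TrustLevel::Seen,
            TrustLevel::Seen | TrustLevel::Unknown => TrustLevel::Unknown,
            TrustLevel::Revoked => TrustLevel::Revoked,
        }
    }

    /// Revocation is reachable from every level and cannot be undone.
    pub fn revoke(self) -> TrustLevel {
        TrustLevel::Revoked
    }
}
```

Keeping Revoked as an absorbing state in every transition is what makes it terminal: no sequence of promote or demote calls leads back out of it.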

Workload identity

| Feature | Notes |
| --- | --- |
| Principal::{Human, Workload} unified abstraction | Same ToCedarEntity bridge for both shapes; Cedar policies authorise across without branching. |
| JwtSvidResolver | SPIFFE JWT-SVID spec adherence; mandatory spiffe:// URI in sub, trust-domain extracted and pinned. |
| MtlsResolver | SPIFFE X.509-SVID over mTLS via leaf-cert SAN URI extraction. |
| WorkloadResolver<C, F, R> | Generic JWT-bearer workload resolver covering GitHub Actions OIDC, Kubernetes service accounts, GitLab CI, Okta, Azure AD, Auth0, LocalIdP, and any other JWT-issuer via an adopter-supplied claim parser + mapping closure. Ready-made recipes for GitHub Actions + k8s SA ship in examples/workload-identity/. |
| Cloud STS exchange | aws-sts, gcp-wif, azure-fic adapters for exchanging federated workload identity for short-lived cloud credentials. |
| Outbound identity | outbound-oauth (axess as OAuth client with client_credentials / private_key_jwt) and outbound-mtls (axess presenting an mTLS identity to downstream services). |

OAuth / OIDC ceremonies

| Feature | Notes |
| --- | --- |
| Authorization Code + PKCE | Discovery, token exchange, nonce validation; HTTPS-enforced (localhost / 127.0.0.1 / [::1] exemption for dev). |
| Client Credentials | Real HTTP token exchange via OAuthProviderConfig. |
| Device Code (RFC 8628) | Real HTTP; device endpoint configured via with_device_authorization_endpoint. Nonce-bindable per RFC. |
| Token refresh | Provider-delegated refresh with audit logging. |
| FAPI 2.0 Baseline Profile | Pushed Authorization Requests (PAR, RFC 9126), DPoP (RFC 9449), JARM, RP-initiated logout. Strict nbf enforcement on ID tokens with clock-injected validation. |
| Back-Channel Logout | JWT signature verified via cached JWKS; sid-based session invalidation. |
| Front-Channel Logout | GET handler with sid query parameter; shared SidMap with back-channel. |
| Plain-OAuth-2.0 social login | SocialProvider (gated on social, off by default) for IdPs that don't support OIDC (GitHub user login, Twitter/X, Discord, Reddit, Spotify, …). Weaker security model than OIDC; identity comes from a TLS-trusted userinfo endpoint, not from a signed assertion. Parallel types (SocialClaims vs IdTokenClaims) keep the distinction visible at the call site. PKCE on by default. |
| LocalIdpFixture | In-process IdP minting workload JWTs against an in-memory RSA-2048 keypair + matching JWKS endpoint. RS256 + ES256, RFC 8414 discovery, multi-key rotation. |

On-behalf-of (OBO)

| Feature | Notes |
| --- | --- |
| delegated-stored | RFC 6749 §4.1 Authorization Code + PKCE with persisted refresh token for long-lived offline access. |
| delegated-exchange | RFC 8693 Token Exchange for short-lived per-request exchange. |
| delegated-stored-encrypted | EncryptedDelegatedCredentialStore<S, K> decorator wraps any delegated-credential backend with AES-256-GCM at rest. |

Authorization

| Feature | Notes |
| --- | --- |
| Cedar Policy engine | RBAC + ABAC + ReBAC. AuthzStore orchestrates evaluation; ToCedarEntity bridges principals, resources, and contexts. |
| Layered policy bundle | Base + overlay; adopters drop additional .cedar and .schema.cedar.json files into an overlay/ directory. |
| Procedural macros | require_authn!, require_partial_authn!, require_authz! guard handler functions at compile time. |
| Entity caching | EntityCache (in-process, default), MokaEntityCache, ValkeyEntityCache (cross-node). Asymmetric defaults: cache authz, not authn. |

Middleware

| Feature | Notes |
| --- | --- |
| CSRF | Signed double-submit cookie; required for state-changing form posts. |
| Rate limiting | Token-bucket via RateLimitLayer. KeyExtractor::{PeerIp, ForwardedIp} for direct vs trusted-proxy deployments. |
| Request ID | X-Request-Id extraction + generation. |
| Trace ID | W3C Trace Context (traceparent) propagation. |
| WebSocket | Revocation-aware wrapper that closes connections on session invalidation. |
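The table above names token-bucket as the rate-limiting algorithm. For readers unfamiliar with it, here is a minimal single-key sketch with the clock passed in explicitly, in keeping with the library's injected-clock discipline. It is not the RateLimitLayer implementation: TokenBucket and its methods are hypothetical, and mapping max_requests and window onto capacity and refill rate this way is an assumption.

```rust
use std::time::Duration;

/// One bucket for one rate-limit key. Time is supplied by the caller as an
/// offset from an arbitrary epoch, so tests can drive it deterministically.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Duration,
}

impl TokenBucket {
    /// A full bucket of `max_requests` tokens that refills evenly over
    /// `window`. `window` must be non-zero.
    pub fn new(max_requests: u32, window: Duration, now: Duration) -> Self {
        let capacity = f64::from(max_requests);
        TokenBucket {
            capacity,
            tokens: capacity,
            refill_per_sec: capacity / window.as_secs_f64(),
            last: now,
        }
    }

    /// Refill for the time elapsed since the last call, then try to take one
    /// token. Returns `false` when the request should be rejected.
    pub fn try_acquire(&mut self, now: Duration) -> bool {
        if now > self.last {
            let elapsed = (now - self.last).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A real limiter keeps one such bucket per extracted key (for example per client IP) and needs shared storage once more than one instance serves traffic.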

Audit and observability

| Feature | Notes |
| --- | --- |
| AuthEvent regulatory audit trail | Six device-identity event variants + the full authn event surface. |
| AuthnMetrics | 17-method trait (counters + timers) with no-op defaults. |
| AuditArchiver + AuditRetentionPolicy | Hot / cold tiering with three-stage retention (90d / 7d / never defaults). FilesystemAuditArchiver reference impl behind audit-archive-fs. |
| AuthnAnalyticsSink + RichAuthnEvent | Denormalised analytics path parallel to the regulatory AuthEvent. serde + rkyv derives for Apache Iggy / ClickHouse / DuckDB / Snowflake. |
| TracingCapture | Test subscriber for asserting on emitted tracing events. |

PII classification

Axess processes personal data as part of authentication. This section documents what the library logs, what it stores, and what it never touches, which is useful input for GDPR Data Protection Impact Assessments and SOC2 evidence packages.

What axess logs (via tracing and AuthEvent)

| Field | Where | Purpose | PII? |
| --- | --- | --- | --- |
| user_id | Structured log spans, AuthEvent | Correlate events to accounts | Pseudonymous; opaque ID, not directly identifying |
| tenant_id | Structured log spans, AuthEvent | Multi-tenant correlation | No |
| session_id | AuthEvent, tracing spans | Session correlation | No (random UUID) |
| IP address | AuditContext (extracted from headers) | Geo/fraud detection, compliance | Yes; personal data under GDPR |
| User-Agent | AuditContext, session binding | Client identification, hijack detection | Indirect; device fingerprint |
| event_type | AuthEvent | Audit trail (login, factor verified, logout) | No |
| factor_kind | AuthEvent | Which factor was attempted | No |
| success/failure | AuthEvent | Security monitoring | No |
| request_id | AuditContext | Request tracing | No |

What axess stores in session data

| Field | Storage | PII? |
| --- | --- | --- |
| user_id / tenant_id | Session store (Memory / SQLite / Postgres / MySQL / Valkey) | Pseudonymous |
| auth_state | Session store | No |
| fingerprint | Session store (HMAC hash) | No (one-way hash) |
| custom | Session store (application-defined) | Depends on application |

What axess NEVER logs or stores

  • Plaintext passwords (only Argon2id hashes are stored; input is zeroized after hashing)
  • TOTP/HOTP secrets in logs (stored encrypted in FactorConfig, zeroized on drop)
  • Session cookie values
  • OAuth tokens (access, refresh, ID tokens); these stay in memory only during the exchange
  • PKCE verifiers, CSRF state tokens (cleared from session after use)

Recommendations

  • Encrypt at rest: pass a SessionCrypto::new(key) to the SQL session-store constructors (SqliteSessionStore::new(pool, crypto), same for Postgres / MySQL) or use ValkeySessionStore::new(client, key) so session data (which contains user_id) is AES-256-GCM protected. The explicit ::plaintext(pool) constructor opts out and is dev/test only.
  • Log retention: configure your log aggregator to retain auth events per your compliance requirements (MiFID II: 5 years; GDPR: minimize).
  • Right to erasure: deleting a user's sessions (SessionRegistry::invalidate_user) and database records satisfies GDPR erasure for axess-managed data. The custom session field is the application's responsibility.

Compliance framework mapping

GDPR

| Requirement | How Axess addresses it |
| --- | --- |
| Lawful basis for processing | Application's responsibility. Axess processes only what the app sends. |
| Data minimization | Sessions store only user_id, tenant_id, auth_state, and fingerprint (HMAC hash). |
| Right to erasure | SessionRegistry::invalidate_user() + database record deletion. |
| Data protection by design | AES-256-GCM encryption at rest, zeroization of secrets in memory. |
| Breach notification | Application responsibility. Axess provides audit trail via AuthEvent. |
| DPA (Data Processing Agreement) | Not applicable; Axess is a library, not a service. |

SOC2

| Trust service criteria | How Axess addresses it |
| --- | --- |
| CC6.1: Logical access security | MFA, session binding, Cedar policy authorization |
| CC6.3: Access revocation | SessionRegistry::invalidate_user(), session TTL |
| CC7.2: Monitoring | AuthnMetrics trait (17 hooks), AuthEvent audit trail, tracing |
| CC8.1: Change management | Application responsibility (CI/CD, version pinning) |

PCI-DSS

| Requirement | How Axess addresses it |
| --- | --- |
| 8.3: MFA for admin access | Multi-factor chain support (password + TOTP/FIDO2) |
| 8.6: Session management | Signed cookies, TTL, session binding, forced logout |
| 3.4: Encryption of cardholder data | AES-256-GCM session encryption (session store, not card data) |
| 10.2: Audit trails | AuthEvent records login attempts, factor verifications, logouts |

HIPAA

| Safeguard | How Axess addresses it |
| --- | --- |
| Access control (§164.312(a)) | MFA, Cedar RBAC/ABAC, session state machine |
| Audit controls (§164.312(b)) | AuthEvent audit trail with timestamps |
| Integrity controls (§164.312(c)) | HMAC-signed session cookies, AES-GCM encryption |
| Transmission security (§164.312(e)) | Application must terminate TLS; Axess sets Secure cookie flag |

These mappings are informational. Compliance certification requires assessment of the complete application stack, not just the authentication library.

Supported Versions

We recommend using the latest release of Axess and actively maintained branches.

Disclaimer

Axess is provided as a library. While we strive for secure defaults, the overall security of your application depends on your usage and integration.

Further reading

Operations runbook covers the production-launch checklist (key rotation, multi-instance considerations, graceful shutdown). Audit events and Audit pipeline cover the audit mechanisms the compliance frames depend on. Migration guide covers cross-version upgrade paths, including security-relevant breaking changes.

Operations runbook

This chapter is the operator-facing runbook. It covers the pre-launch checklist, the routine rotations the deployment needs to schedule, the multi-instance considerations that catch deployments off-guard, the graceful-shutdown sequence, the health-check and metrics surfaces, and the emergency procedures for the categories of incident that recur.

The chapter has two halves. The first half is operational guidance specific to axess. The second half is the canonical OPERATIONS.md from the repo root, included so the deployment's runbook checklist is in one place.

Pre-launch checklist

The list below is the minimum an axess-instrumented deployment should clear before serving real traffic. Each item is covered in detail in another chapter; the list here is the inventory.

The session signing key is loaded from the deployment's secrets manager. The key is 32 bytes of cryptographic randomness, stable across process restarts. The development placeholder ([0; 32] from Getting started) is replaced.

The session envelope key is loaded the same way. The two keys are independent; one is for HMAC signing the cookie, the other is for AES-256-GCM encrypting the session payload at rest. Session lifecycle and crypto envelope covers the distinction.

The fingerprint pepper is loaded for the fingerprint binding. Each tenant has its own pepper, stored alongside the tenant record; Multi-tenancy and Cookies, fingerprinting, hijack detection cover the mechanism.

The session cookie has Secure=true set. TLS terminates at the edge; the application sees only HTTPS traffic; the cookie is only sent on HTTPS.

The trusted-proxy list is configured. The application reads the forwarded header (X-Forwarded-For or Forwarded) only when the immediate peer is in the trusted list. Without this, the fingerprint and the rate-limit keys can be spoofed.

The rate limit is configured on the login, signup, password-reset, and any other authentication-adjacent endpoints. The defaults from Rate limiting are starting points; calibrate to the deployment's legitimate-traffic envelope.

The lockout policy is configured (or the global default is accepted). The three levers (per-user, per-tenant, per-IP) all have explicit thresholds suited to the deployment's risk posture. Multi-tenancy §"Three-lever lockout" covers the configuration.

The audit pipeline is wired. The regulatory sink is the IdentityAuthnLog the lockout policy already uses; the analytics sink (if configured) is the deployment's SIEM connector. The retention loop is configured with the deployment's required retention period. Audit pipeline covers the full pipeline configuration.

The health check is wired. /healthz (or whatever the deployment chooses) queries the session store, the identity store, and the device store; the response is a JSON document that aggregates the per-component states. The Health checks and metrics section later in this chapter covers the deployment expectations.

The metrics are exported. The AuthnMetrics trait is implemented; the metric values flow into Prometheus or OpenTelemetry; the dashboards cover the auth-attempt rate, the failure rate, the rate-limit rejection rate, and the lockout trigger rate. Operations runbook below covers the production-dashboard expectations.

The Cedar policy set is loaded and validated against the schema. The startup path refuses if the validation fails; a production launch with a misconfigured policy set never gets to serve traffic. Cedar policy fundamentals covers the validation flow.

The cleanup tasks are scheduled. The session cleanup, the device retention sweep, the audit retention loop, the OAuth JWKS cache refresh: all of these run on intervals; the scheduler is the application's responsibility. Backends §"SQLite" and similar sections cover the per-backend cleanup patterns.

Key rotation

The deployment has three keys to rotate on a schedule: the session signing key, the session envelope key, and the per-tenant fingerprint pepper. The mechanism is the same shape for all three: provide the new key alongside the old one for a transition window, let in-flight sessions and devices roll over, then remove the old key.

Session signing key

The signing key is what HMAC-protects the session cookie. Rotating it without invalidating sessions requires keeping the old key available for verification during the transition.

let session_layer = SessionLayer::new(store, new_signing_key)
    .with_previous_key(old_signing_key)
    .with_ttl(session_ttl);

with_previous_key accepts the old key. Cookies signed with the old key continue to validate; new cookies sign with the new key. After enough time for all old cookies to expire (one session TTL plus a safety margin), the previous key can be removed.

The rotation sequence:

  1. Deploy the application with new_signing_key = old_key and previous_key = old_key. Nothing has changed; this is the baseline.
  2. Generate a fresh 32-byte signing key. Store it in the secrets manager alongside the existing one.
  3. Deploy the application with new_signing_key = fresh_key and previous_key = old_key. New cookies sign with the fresh key; existing cookies continue to validate against the old.
  4. Wait one session TTL plus a safety margin. By the end of this window, every existing session has either expired or been refreshed (which re-signs the cookie with the fresh key).
  5. Deploy the application with previous_key = None (or absent). The old key is now unused.
  6. Remove the old key from the secrets manager.
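The verification behaviour during the transition window (steps 3 and 4) can be sketched as follows. This is a dependency-free illustration, not the SessionLayer internals: SigningKeys, MacFn, ct_eq, and toy_mac are hypothetical names, and toy_mac is deliberately NOT a real MAC. It stands in for HMAC-SHA256 from a vetted crate so the example compiles with the standard library alone.

```rust
/// A MAC function: (key, message) -> tag. Injected so the sketch stays
/// dependency-free; production code would pass HMAC-SHA256 here.
pub type MacFn = fn(&[u8; 32], &[u8]) -> Vec<u8>;

pub struct SigningKeys {
    pub current: [u8; 32],
    pub previous: Option<[u8; 32]>,
}

impl SigningKeys {
    /// New cookies are always signed with the current key.
    pub fn sign(&self, mac: MacFn, payload: &[u8]) -> Vec<u8> {
        mac(&self.current, payload)
    }

    /// A tag verifies if it matches under the current key or, while the
    /// transition window is open, under the previous key.
    pub fn verify(&self, mac: MacFn, payload: &[u8], tag: &[u8]) -> bool {
        let ok_current = ct_eq(&mac(&self.current, payload), tag);
        let ok_previous = match &self.previous {
            Some(key) => ct_eq(&mac(key, payload), tag),
            None => false,
        };
        ok_current | ok_previous
    }
}

/// Comparison without an early exit on the first differing byte.
/// Illustrative only; use a vetted constant-time primitive in practice.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// NOT a real MAC: a toy XOR stand-in so the example runs without crates.
pub fn toy_mac(key: &[u8; 32], msg: &[u8]) -> Vec<u8> {
    msg.iter().enumerate().map(|(i, b)| b ^ key[i % 32]).collect()
}
```

Constructing SigningKeys with previous set to None models step 5: from that deployment onward, a cookie signed with the old key no longer verifies.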

Session envelope key

The envelope key is what protects the session payload at rest, using AES-256-GCM. Rotating it without invalidating sessions is similar to the signing-key rotation, with one additional consideration: sessions stored before the rotation remain readable through the previous key, while new writes use the new key.

let crypto = SessionCrypto::new(new_envelope_key)
    .with_previous_key(old_envelope_key);
// The SQL session stores share this constructor shape; Postgres is shown here.
let store = PostgresSessionStore::new(pool, crypto);

The rotation sequence is the same as the signing key. The transition window covers one session TTL; after that, every stored session has been rewritten with the new key.

For deployments with long session TTLs (a week or a month), rotating the envelope key on the deployment's compliance cycle (quarterly, semiannually) requires a transition window at least as long as the TTL. An alternative is a background scan that proactively rewrites stored sessions with the new key, which finishes the rotation sooner than waiting out the TTL.

Per-tenant fingerprint pepper

The fingerprint pepper rotates per-tenant rather than globally. The mechanism is on the tenant record:

service.rotate_fingerprint_pepper(
    &tenant_id,
    new_pepper,
).await?;

The rotation invalidates every device record under the tenant. Existing sessions remain valid (they do not depend on the device record), but the next request from each user re-registers their device from scratch (transitioning the device to Unknown and walking the assurance ladder again). Users see no break; the device store sees a churn.

The pepper rotates on tenant suspension and on demand. The default cadence is annual; tighter cadences are appropriate for high-sensitivity deployments.

Multi-instance considerations

A deployment that runs multiple application instances behind a load balancer has a handful of considerations the single-instance deployment does not.

Shared session store. The session backend must be cluster-safe: Postgres, MySQL, or Valkey. SQLite is single-writer and works only for single-instance deployments. Backends covers the choices.

Shared signing and envelope keys. Every instance must use the same keys; otherwise an instance that issued a cookie cannot have the cookie validated by a different instance that receives the next request. The secrets manager is the source of truth; each instance pulls the keys at startup.

Shared rate-limit state. If the rate limiter's buckets live in memory per instance, each instance enforces the limit independently, so an attacker whose requests are spread across N instances gets up to N times the configured limit. The fix is BucketStore::Valkey { client }, which moves the state to a shared Valkey instance; every application instance sees the same buckets.

Session affinity (sticky sessions). Optional, not required. The session is stored server-side; any instance can serve any session. Some deployments prefer sticky sessions to improve local cache hit rates; the trade-off is reduced resilience to instance failure.

Load-balancer-level fingerprint handling. The load balancer must forward the real client IP through X-Forwarded-For (or the load balancer's specific header). The application's trusted-proxy list must include the load balancer's IP range. Without this, every request looks like it came from the load balancer, and the fingerprint and rate-limit keys are useless.

Graceful shutdown

A graceful shutdown drains in-flight requests before stopping the process. The pattern in axess:

The process receives a SIGTERM (from Kubernetes, systemd, or whatever orchestrator). The application's shutdown handler sets a flag that tells the HTTP server to stop accepting new connections.

In-flight requests continue. The HTTP server is in draining mode; new connections get refused (which the load balancer treats as the signal to route elsewhere), existing connections complete their request.

The shutdown handler waits for the in-flight requests to complete, with a timeout (typically 30 seconds; long enough for real requests, short enough that a stuck request does not block shutdown forever).

The audit pipeline drains. The shutdown handler triggers the pipeline to flush its buffer to all sinks. The wait is bounded (typically 10 seconds); buffered events that do not flush in time are written to a local recovery log for the next process start to pick up.

The session store closes. The connection pool drains; in-flight queries complete; the pool releases its connections.

The process exits.

The pattern is what Axum's with_graceful_shutdown enables; the application wires the shutdown signal through the standard shutdown handler. No axess-specific code is needed beyond the audit-pipeline drain.
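The bounded audit-pipeline drain can be sketched as a loop with a deadline. This is a minimal illustration, not the axess pipeline: `drain_with_deadline` is a hypothetical helper, and the clock is injected as a closure so the sketch stays deterministic.

```rust
use std::collections::VecDeque;

/// Flushes buffered events to `sink` until the buffer is empty or the
/// deadline passes. Returns the events that did not flush in time, which the
/// caller would write to a local recovery log.
///
/// `now_ms` is injected so the behaviour can be tested without a real clock.
fn drain_with_deadline<E>(
    buffer: &mut VecDeque<E>,
    mut sink: impl FnMut(&E),
    mut now_ms: impl FnMut() -> u64,
    deadline_ms: u64,
) -> Vec<E> {
    while now_ms() < deadline_ms {
        match buffer.pop_front() {
            Some(event) => sink(&event),
            None => break,
        }
    }
    // Whatever is left missed the deadline.
    buffer.drain(..).collect()
}
```

The deadline is checked before each event rather than once up front, so a slow sink cannot hold shutdown open for longer than one flush past the bound.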

Health checks and metrics

A production deployment exposes /healthz and /metrics endpoints. The health check confirms the application's backends are reachable; the metrics expose the operational counters.

The health check pattern:

let health = Arc::new(
    CompositeHealthCheck::new()
        .add("session_store", session_store.clone())
        .add("identity_store", identity_store.clone())
        .add("device_store", device_store.clone())
);

async fn healthz(State(state): State<AppState>) -> impl IntoResponse {
    let status = state.health.check_all().await;
    let code = if status.is_healthy() {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };
    let body = serde_json::json!({
        "status": if status.is_healthy() { "healthy" } else { "unhealthy" },
        "components": status.components,
    });
    (code, axum::Json(body))
}

Each backend that implements HealthCheck provides its own probe (typically a bounded SELECT 1 for SQL backends or a PING for Valkey). The composite aggregates the results; the endpoint returns 200 on all-healthy or 503 on any-unhealthy.
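The aggregation rule (200 only when every component is healthy, 503 otherwise) is small enough to sketch on its own. The `Probe` enum and `aggregate` function below are hypothetical stand-ins, not the axess types.

```rust
/// Hypothetical per-component probe result.
#[derive(Debug, Clone, PartialEq)]
enum Probe {
    Healthy,
    Unhealthy(String),
}

/// Aggregates named probe results into an HTTP status code and the list of
/// failing components: 200 only when every component is healthy, else 503.
fn aggregate(results: &[(&str, Probe)]) -> (u16, Vec<String>) {
    let failing: Vec<String> = results
        .iter()
        .filter_map(|(name, probe)| match probe {
            Probe::Healthy => None,
            Probe::Unhealthy(reason) => Some(format!("{name}: {reason}")),
        })
        .collect();
    let code = if failing.is_empty() { 200 } else { 503 };
    (code, failing)
}
```

Returning the failing component names alongside the code keeps the response useful to an operator while the load balancer only looks at the status.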

The metrics pattern:

async fn metrics_endpoint(State(state): State<AppState>) -> impl IntoResponse {
    let m = &state.metrics;
    axum::Json(serde_json::json!({
        "auth_attempts": m.auth_attempts.load(Ordering::Relaxed),
        "auth_successes": m.auth_successes.load(Ordering::Relaxed),
        "auth_failures": m.auth_failures.load(Ordering::Relaxed),
        "rate_limit_rejections": m.rate_limit_rejections.load(Ordering::Relaxed),
    }))
}

The metrics implementation (covered in AuthnMetrics trait) exposes the counters; the endpoint serialises them in whatever format the deployment's metrics system expects (Prometheus text format, JSON, OpenMetrics).
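For a Prometheus scraper, the same counters could be rendered in the text exposition format instead of JSON. The sketch below assumes a hypothetical `Counters` struct mirroring the fields above; the metric names other than `axess_auth_success_total` and `axess_auth_failure_total` are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counter holder mirroring the fields used in the endpoint above.
#[derive(Default)]
struct Counters {
    auth_attempts: AtomicU64,
    auth_successes: AtomicU64,
    auth_failures: AtomicU64,
    rate_limit_rejections: AtomicU64,
}

/// Renders the counters in the Prometheus text exposition format:
/// a `# TYPE` line followed by a `name value` line for each metric.
fn render_prometheus(c: &Counters) -> String {
    let metrics = [
        ("axess_auth_attempt_total", &c.auth_attempts),
        ("axess_auth_success_total", &c.auth_successes),
        ("axess_auth_failure_total", &c.auth_failures),
        ("axess_rate_limit_rejected_total", &c.rate_limit_rejections),
    ];
    let mut out = String::new();
    for (name, counter) in metrics {
        let value = counter.load(Ordering::Relaxed);
        out.push_str(&format!("# TYPE {name} counter\n{name} {value}\n"));
    }
    out
}
```

A handler would return this string with the `text/plain` content type; in practice a metrics client library usually does the rendering.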

The dashboards the operational team uses combine these counters with the audit-event volumes from the SIEM. Audit events §"SIEM query patterns" covers the SIEM-side queries.

Common failures and remedies

The categories of failure that recur in production deployments, and the standard responses.

Spike in auth_failures: typically a credential-stuffing attack or a credential leak elsewhere. The rate limiter should be absorbing the bulk; the lockout policy catches the rest. Investigate the source IPs in the failure events; if the spike is concentrated on a small set of IPs, block them at the WAF; if it is spread broadly, the leak is the larger concern.

Spike in rate_limit_rejections: either an attack (real attacker getting throttled) or a misconfiguration (legitimate traffic hitting a limit too tight). Rate limiting §"Distinguishing attack from misconfiguration" covers the signals.

Health check failing on session store: the session backend is unreachable. Investigate the database. Until the backend is back, the application cannot serve authenticated traffic; the load balancer treats the 503 as a signal to route around the instance.

Session cookie validation failing for known-good sessions: the signing key has changed without the previous-key transition. Add the previous key to the configuration; sessions will start validating again as soon as the deployment picks up the change.

Spike in DeviceFingerprintMismatch events: typically the fingerprint tolerance is too tight. Calibrate against the warn rate; widen the IP-prefix tolerance or the user-agent matching. Cookies, fingerprinting, hijack detection covers the tolerance configuration.

Audit pipeline buffer filling: the analytics sink is slow or down. Inspect the sink's metrics; if it is the SIEM under maintenance, the buffer fills until the policy fires (DropOldest, Block, or ShutdownAuthn). Plan for the maintenance window through the deployment's standard notification process.
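How the three overflow policies differ once the buffer is full can be sketched with a bounded queue. The policy names come from the text above; `AuditBuffer`, `PushOutcome`, and the exact semantics of Block and ShutdownAuthn are assumptions made for illustration.

```rust
use std::collections::VecDeque;

/// The three overflow policies named above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OverflowPolicy {
    DropOldest,
    Block,
    ShutdownAuthn,
}

/// What the producer should do after offering an event (hypothetical).
#[derive(Debug, PartialEq)]
enum PushOutcome {
    Accepted,
    DroppedOldest, // accepted, but the oldest buffered event was discarded
    MustWait,      // producer has to wait for the sink to catch up
    StopAuthn,     // stop serving authentication rather than lose events
}

struct AuditBuffer<E> {
    events: VecDeque<E>,
    capacity: usize, // assumed to be at least 1
    policy: OverflowPolicy,
}

impl<E> AuditBuffer<E> {
    fn new(capacity: usize, policy: OverflowPolicy) -> Self {
        Self { events: VecDeque::with_capacity(capacity), capacity, policy }
    }

    fn push(&mut self, event: E) -> PushOutcome {
        if self.events.len() < self.capacity {
            self.events.push_back(event);
            return PushOutcome::Accepted;
        }
        match self.policy {
            OverflowPolicy::DropOldest => {
                self.events.pop_front();
                self.events.push_back(event);
                PushOutcome::DroppedOldest
            }
            OverflowPolicy::Block => PushOutcome::MustWait,
            OverflowPolicy::ShutdownAuthn => PushOutcome::StopAuthn,
        }
    }
}
```

The choice is a trade between availability (DropOldest keeps logins working and loses events) and completeness (the other two keep every event and slow or stop logins).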

Canonical OPERATIONS.md

The rest of this chapter is the canonical OPERATIONS.md from the repo root.

Axess: Operations Guide

Deployment, key management, and operational procedures for production environments.

Key rotation (zero-downtime)

Encryption keys can be rotated without invalidating active sessions. Signing keys cannot: rotating the signing key invalidates every active session.

Signing key rotation

The signing key authenticates session cookies via HMAC-SHA256. SessionLayer does not support a previous signing key, so deploying a new signing key invalidates all active sessions.

Procedure:

  1. Schedule the rotation for a low-traffic window.
  2. Generate a new 32-byte signing key in your secrets manager.
  3. Deploy the new key. All active sessions become invalid (users must re-authenticate).

Encryption key rotation

SessionCrypto supports transparent key rotation via with_previous_key():

let crypto = SessionCrypto::new(new_key)
    .with_previous_key(old_key);

Procedure:

  1. Generate a new 32-byte encryption key in your secrets manager.
  2. Deploy with both keys: new as current, old as previous.
  3. Sessions encrypted with the old key are transparently re-encrypted with the new key on next access.
  4. After all sessions have been accessed (or after the session TTL expires), remove the previous key from the deployment.
  5. Monitor the "session decrypted with previous (rotated) key" log message to track migration progress.
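The fallback-and-re-encrypt flow in the procedure can be modelled without any real cryptography. The sketch below is a toy: the XOR "cipher" is NOT secure and only stands in for an authenticated cipher, and `Key`, `Sealed`, and `KeyRing` are hypothetical names, not the SessionCrypto internals.

```rust
/// Toy key: `pad` stands in for real key material. NOT secure.
#[derive(Clone, Copy)]
struct Key {
    id: u32,
    pad: u8,
}

/// A sealed session blob that records which key sealed it.
struct Sealed {
    key_id: u32,
    bytes: Vec<u8>,
}

fn xor(pad: u8, data: &[u8]) -> Vec<u8> {
    data.iter().map(|b| b ^ pad).collect()
}

fn seal(key: Key, plain: &[u8]) -> Sealed {
    Sealed { key_id: key.id, bytes: xor(key.pad, plain) }
}

/// Opens the blob only if it was sealed with `key` (a real cipher would
/// detect a wrong key through its authentication tag instead).
fn open_with(key: Key, sealed: &Sealed) -> Option<Vec<u8>> {
    (key.id == sealed.key_id).then(|| xor(key.pad, &sealed.bytes))
}

struct KeyRing {
    current: Key,
    previous: Option<Key>,
}

impl KeyRing {
    /// Returns the plaintext and, when the previous key was needed, a fresh
    /// blob sealed with the current key for the caller to write back.
    fn open(&self, sealed: &Sealed) -> Option<(Vec<u8>, Option<Sealed>)> {
        if let Some(plain) = open_with(self.current, sealed) {
            return Some((plain, None));
        }
        let plain = open_with(self.previous?, sealed)?;
        let resealed = seal(self.current, &plain);
        Some((plain, Some(resealed)))
    }
}
```

Once the previous key is removed from the ring, blobs still sealed with it fail to open, which is why step 4 waits for the session TTL.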

Multi-instance deployment

Shared state requirements

| Component | Sharing requirement |
| --- | --- |
| Signing key | Must be identical across all instances |
| Encryption key | Must be identical across all instances |
| Session store | Valkey, PostgreSQL, or MySQL (shared). SQLite is single-instance only. |
| Session registry | Valkey-backed (ValkeySessionRegistry). In-memory is single-instance only. |
| OIDC sid_map | In-memory per instance. Back-channel logout works when the IdP sends to the instance that handled the login. Use sticky sessions or a shared store for full coverage. |
| Rate limit buckets | In-memory per instance. For distributed rate limiting, use an external solution (e.g. Valkey-based sliding window at the reverse proxy). |

Health checks

Implement a /healthz endpoint using the CompositeHealthCheck type:

use axess::{CompositeHealthCheck, HealthCheck, HealthStatus};
use axum::{extract::State, http::StatusCode, response::IntoResponse};

async fn healthz(State(health): State<CompositeHealthCheck>) -> impl IntoResponse {
    match health.check().await {
        HealthStatus::Healthy => StatusCode::OK,
        HealthStatus::Degraded(_) => StatusCode::OK, // still serving
        HealthStatus::Unhealthy(_) => StatusCode::SERVICE_UNAVAILABLE,
    }
}

All session store implementations (SqliteSessionStore, PostgresSessionStore, MysqlSessionStore, ValkeySessionStore) implement HealthCheck.

Session store migration

To migrate from one session store to another (e.g. SQLite to Valkey):

  1. Dual-write phase: deploy a wrapper that writes to both stores, reads from the new store first with fallback to the old store.
  2. Cutover: once the old store's TTL has expired (default 24h), switch reads to the new store only.
  3. Cleanup: remove the old store configuration.

There is no built-in migration tool. Sessions are short-lived (default 24h TTL), so a simpler approach is:

  1. Deploy the new store.
  2. Accept that active sessions on the old store will expire naturally.
  3. New sessions are created on the new store.
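The dual-write wrapper from the first approach might look like the following. This is a synchronous, in-memory sketch under stated assumptions: `Store`, `MemStore`, and `DualWrite` are hypothetical and much simpler than the real (async) session store trait.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a session store; not the axess trait.
trait Store {
    fn put(&mut self, id: &str, data: &str);
    fn get(&self, id: &str) -> Option<String>;
}

#[derive(Default)]
struct MemStore(HashMap<String, String>);

impl Store for MemStore {
    fn put(&mut self, id: &str, data: &str) {
        self.0.insert(id.to_string(), data.to_string());
    }
    fn get(&self, id: &str) -> Option<String> {
        self.0.get(id).cloned()
    }
}

/// Dual-write phase: writes go to both stores; reads prefer the new store
/// and fall back to the old one.
struct DualWrite<O: Store, N: Store> {
    old: O,
    new: N,
}

impl<O: Store, N: Store> Store for DualWrite<O, N> {
    fn put(&mut self, id: &str, data: &str) {
        self.new.put(id, data);
        self.old.put(id, data);
    }
    fn get(&self, id: &str) -> Option<String> {
        self.new.get(id).or_else(|| self.old.get(id))
    }
}
```

Because the wrapper implements the same trait as the stores it wraps, the cutover is a configuration change: swap the wrapper for the new store once the old store's TTL has elapsed.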

Session cleanup

SQLite, PostgreSQL, and MySQL stores accumulate expired sessions. Use the built-in helper:

let store = SqliteSessionStore::new(pool, crypto);
store.init_schema().await?;
let _cleanup = store.spawn_cleanup_task(Duration::from_secs(3600));

PostgresSessionStore::spawn_cleanup_task and MysqlSessionStore::spawn_cleanup_task work the same way. The returned JoinHandle aborts the loop when dropped; store it for the lifetime of the application (or pass it through to graceful shutdown, see below).

Valkey manages expiration natively via TTL; no cleanup needed.

Graceful shutdown

Axess spawns long-lived background tasks for everything that needs to run on a wall-clock cadence: session cleanup, JWKS refresh, back-channel-logout sid_map aging. None of these survive SIGTERM unless the application drains them; tokio::spawn tasks are unconditionally aborted when the runtime stops.

The standard pattern is Axum's with_graceful_shutdown plus explicit abort/await of every JoinHandle axess returns:

use axum::serve;
use tokio::signal;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // `pool`, `crypto`, `oauth_provider`, and `app` are assumed to be
    // constructed earlier in main; their setup is omitted here.

    // ── Build stores and spawn axess background tasks ─────────────
    let session_store = SqliteSessionStore::new(pool.clone(), crypto);
    session_store.init_schema().await?;

    let cleanup_handle = session_store.spawn_cleanup_task(
        std::time::Duration::from_secs(3600),
    );

    let jwks_handle = oauth_provider.spawn_jwks_refresh(
        std::time::Duration::from_secs(3600),
    );

    // ── Shared shutdown signal ────────────────────────────────────
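    let shutdown = async {
        let ctrl_c = async { signal::ctrl_c().await.ok(); };
        let term = async {
            #[cfg(unix)]
            {
                use signal::unix::{SignalKind, signal};
                if let Ok(mut s) = signal(SignalKind::terminate()) {
                    s.recv().await;
                }
            }
            // Without SIGTERM support this branch must never complete;
            // otherwise the select! below would fire immediately.
            #[cfg(not(unix))]
            std::future::pending::<()>().await;
        };
        tokio::select! { _ = ctrl_c => {}, _ = term => {} }
    };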

    // ── Serve until SIGTERM/SIGINT ────────────────────────────────
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    serve(listener, app)
        .with_graceful_shutdown(shutdown)
        .await?;

    // ── Drain background tasks ────────────────────────────────────
    // Aborting is safe: a killed cleanup tick at most leaves expired
    // rows for the next scheduled run, and a killed JWKS tick loses
    // nothing, because the next process re-fetches the JWKS on demand.
    cleanup_handle.abort();
    jwks_handle.abort();
    let _ = cleanup_handle.await;
    let _ = jwks_handle.await;

    Ok(())
}

What survives shutdown vs what is lost

| State | Survives? | Notes |
| --- | --- | --- |
| Persisted sessions (SQL / Valkey) | Yes | Stored in DB; new process re-reads. |
| MemorySessionStore contents | No | In-process only; everyone is logged out. |
| MemorySessionRegistry contents | No | Same; fresh registry on restart. |
| Refresh tokens (SQL / Valkey) | Yes | Hash + family in DB; rotation continues seamlessly. |
| JWKS cache | No (re-fetched) | First post-restart OAuth callback warms it. |
| sid_map (back-channel logout) | No | OIDC sid → local session mapping is in-process. Sessions remain valid; only the sid-keyed lookup is lost, so a back-channel logout that arrives before re-login will silently no-op. Acceptable; the session still expires on its TTL. |
| In-flight HTTP request being served | Yes (via with_graceful_shutdown) | Axum waits for active connections to close before returning from serve. |
| In-flight cleanup_expired query | Aborted | The next scheduled cleanup picks up the slack. |
| In-flight refresh_jwks HTTP call | Aborted | The next request triggers a fresh fetch on demand. |

Why drain the handles after serve returns

with_graceful_shutdown only drains in-flight HTTP requests. The tokio::spawn'd cleanup and JWKS refresh tasks are independent of the HTTP server and continue running until the runtime is dropped. Without an explicit abort() followed by awaiting the handle, they hold a reference to the store clone and the runtime keeps them alive, at minimum delaying shutdown to the next tick and at worst (with tokio::main(flavor = "current_thread")) deadlocking, because the abort signal cannot be processed while the runtime is also waiting for the task to yield.

Monitoring and alerting

The thresholds below are starting points for a single-region deployment serving thousands to low-millions of users. Tune to your traffic shape; a free-tier app with no MFA will see very different baselines than a banking dashboard with mandatory FIDO2. The general rule: alert on ratios and rates, not absolute counts, so an alert that fires at 1k DAU still fires at 100k DAU without re-tuning.

Critical (page on-call)

| Signal | Threshold | Why it matters |
| --- | --- | --- |
| auth_failure / (auth_success + auth_failure) | > 50% for 5 min | Either a brute-force campaign is in progress or the IdP is down. Either way, real users are locked out. |
| account_locked rate | > 10 / minute for 5 min | Sustained password-spray; tens of accounts being locked per minute is well above any realistic legitimate spike. |
| session_binding_mismatch rate | > 1 / minute per tenant for 5 min | Either a stolen session cookie is being replayed across user agents, or a buggy client is rotating UAs mid-session. Investigate immediately. |
| Health check returns Unhealthy | for 2 consecutive checks | Session store / database is unreachable; users cannot log in. |
| JWKS RwLock was poisoned log | any occurrence | A panic happened while holding the JWKS lock; OAuth verification may be silently degraded. |

Warning (alert in chat / ticket queue)

| Signal | Threshold | Why it matters |
| --- | --- | --- |
| factor_failure / factor_attempt (per factor kind) | > 30% for 15 min | Targeted factor probe (e.g. TOTP guessing) or a regression in the factor verification code. |
| rate_limit_rejected / (rate_limit_allowed + rate_limit_rejected) | > 5% for 10 min | Either the rate limit is mis-tuned for legitimate traffic or an attacker is sustained-firing requests. |
| sid_map capacity reached; evicted oldest mapping log | > 1 / minute | OAuth login throughput exceeds the 10,000-entry sid_map cap; back-channel logout precision degrades (some sid lookups will miss). Increase MAX_SID_MAP_ENTRIES or shorten the TTL. |
| session decrypted with previous (rotated) key log | persists > 7 days after rotation | Long-lived sessions are still on the old key. The next rotation will invalidate them; communicate the cutover. |
| account_locked rate | > 1 / minute for 5 min | Background brute force or aggressive credential stuffing. Below paging threshold but worth watching. |
| session custom data exceeds size limit log | any occurrence | Application is writing too much to the session; investigate before users hit it in production. |

Info (dashboard only, no alert)

auth_attempt, auth_success, factor_attempt, factor_success, session_created, session_invalidated, and rate_limit_allowed are useful for trend dashboards, capacity planning, and as denominators for the ratio-based alerts above. Avoid alerting on absolute counts; they swing wildly with traffic.

Computing rates from counters

AuthnMetrics exposes counters; alerts live in your monitoring system (Prometheus / Datadog / Grafana / CloudWatch). The standard pattern in Prometheus terms:

# Auth failure rate over 5 minutes
rate(axess_auth_failure_total[5m])
  / (rate(axess_auth_success_total[5m]) + rate(axess_auth_failure_total[5m]))
> 0.5

Implement the AuthnMetrics trait against your metrics client and emit _total-suffixed counters for the rate queries above to compose cleanly.
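The ratio behind that alert can be worked through by hand from two counter samples. The sketch below is an illustration of the arithmetic only; `Sample`, `failure_ratio`, and `should_page` are hypothetical names, and a counter reset is simply treated as zero increase, which is cruder than what Prometheus's rate() does.

```rust
/// A pair of monotonically increasing counters read at one instant.
#[derive(Clone, Copy)]
struct Sample {
    successes: u64,
    failures: u64,
}

/// Failure ratio over the interval between two samples, or `None` when
/// there were no attempts in the interval (a ratio would be meaningless).
fn failure_ratio(earlier: Sample, later: Sample) -> Option<f64> {
    let ok = later.successes.saturating_sub(earlier.successes);
    let failed = later.failures.saturating_sub(earlier.failures);
    let total = ok + failed;
    if total == 0 {
        None
    } else {
        Some(failed as f64 / total as f64)
    }
}

/// The critical threshold from the table above: more than 50% failures.
fn should_page(earlier: Sample, later: Sample) -> bool {
    matches!(failure_ratio(earlier, later), Some(ratio) if ratio > 0.5)
}
```

For example, going from (1000 successes, 100 failures) to (1040, 160) is 60 failures out of 100 attempts, a ratio of 0.6, which pages. The same ratio results at ten times the traffic, which is why the thresholds do not need re-tuning as the user base grows.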

Key log messages

| Message | Severity | Action |
| --- | --- | --- |
| "session decrypted with previous (rotated) key" | Info | Key rotation in progress; monitor until gone |
| "JWKS RwLock was poisoned" | Warn | Investigate what panicked while holding the lock |
| "sid_map capacity reached" | Warn | Many OAuth logins; consider increasing capacity |
| "session custom data exceeds size limit" | Warn | Application is writing too much to session |
| "login rejected by tenant IP policy" | Warn | Legitimate user from blocked IP, or attack |

Emergency procedures

Force-logout all users

// Via the session registry (if configured):
registry.invalidate_user(&user_id).await;

// Nuclear option: clear the session store. Note that cleanup_expired
// only removes sessions that have already expired:
store.cleanup_expired().await;
// For an immediate full clear, truncate the sessions table or flush Valkey.

Encryption key compromise

  1. Generate a new encryption key immediately.
  2. Deploy with new key only (no previous key); this invalidates all active sessions.
  3. Rotate the signing key as well (the attacker may have decrypted session data containing the HMAC tag).
  4. Review audit logs for suspicious session activity during the compromise window.

Further reading

Security posture covers the production-readiness posture and the compliance touch-points. Audit pipeline covers the audit retention and the buffer-overflow policies. Migration guide covers cross-version upgrades and the security-relevant breaking changes. Backends covers the per-backend operational notes (CockroachDB caveats, MySQL timezone handling, Valkey eviction policies).

Migration guide

This chapter is the cross-version migration reference. Each axess release that ships a breaking change documents the change here, with the symptom (what the compiler or the runtime will tell you), the rationale (why the change happened), and the fix (what to update in adopter code). Releases are ordered by version, with the most recent first.

Within a release, the entries are sorted by what you will see, not by what we changed. A breaking change manifests as either a compile error (the type system rejected something it accepted before), a runtime error (a deserialization fails, a config rejects), or a behaviour change (the same code does something subtly different). The sections below group by symptom; finding your case is faster than reading the full changelog.

Upcoming: 0.1.x to 0.2.0

The first crates.io publish is the 0.2.0 release. The accumulated changes since the previous stable line are catalogued exhaustively in CHANGELOG.md; this chapter covers the breaking ones an adopter has to act on.

Compile errors you will see

use axess::PolicyStore becomes use axess::AuthzStore. The authorisation entry point was renamed for consistency with the Authz* prefix convention. The new name better describes what the type is (an immutable store of policies plus schema, not just a policy collection).

use axess::AxessSession becomes use axess::AuthSession. The session extractor was renamed; the new prefix is the shared Auth* prefix from the naming conventions (Architecture at a glance).

use axess::backends::SqliteStore becomes use axess::backends::sqlite::SessionStore. The backend module layout was reorganised so the same trait name (SessionStore) appears under each backend's namespace; the previous flat SqliteStore symbol no longer exists.

AuthnService::new(backend) becomes AuthnService::new(identity_store, factor_store). The service now takes the two stores separately so adopters can wire different implementations (for instance, a read-replica identity store and a write-only factor store). When the two stores are the same type (the common case), pass it twice.

SessionLayer::with_secret becomes SessionLayer::with_signing_key. The previous name was ambiguous; the new name names what the bytes are used for (HMAC signing the cookie).

AuthState::Logged becomes AuthState::Authenticated. The state was renamed for clarity; nothing else changed about the variant.

Configuration changes

The axess_factors_default_password_hasher config function is gone. Argon2id is now the default; deployments that need a different hasher (PBKDF2, legacy bcrypt) implement a custom factor and register it. Factors and methods covers the extension pattern.

The AuditPipeConfig shape changed. The sinks: Vec<Box<dyn Sink>> field was replaced with explicit regulatory_sink: Arc<...> and analytics_sink: Option<Arc<...>> fields, reflecting the dual-stream architecture from Audit pipeline. The change makes the wire-stable vs. enriched stream distinction explicit in the config.

The RateLimitConfig no longer accepts a key_fn field directly; use KeyExtractor::Custom(Arc<dyn KeyExtractorFn>) to provide a custom extractor, or use one of the built-in variants (PeerIp, SessionId, UserId, TenantId, WorkloadId, Composite). The change is to make the common cases discoverable without losing the escape hatch.

Behaviour changes

The Authenticating state now carries a Vec<FactorKind> for remaining rather than the previous Option<FactorKind>. The change is what enables multi-factor methods longer than two factors. Code that pattern-matched on Some(kind) needs to adapt to remaining.first() or to iterate over the list.
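The adaptation is usually a one-line change. A minimal sketch, assuming a hypothetical `FactorKind` enum (the variant names here are illustrative, not the axess ones) and that the first element is the next factor to prompt for:

```rust
/// Hypothetical factor kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FactorKind {
    Password,
    Totp,
    WebAuthn,
}

/// 0.1.x code matched on `Some(kind)` from an `Option<FactorKind>`.
/// With a list, the next factor to prompt for is the first element.
fn next_factor(remaining: &[FactorKind]) -> Option<FactorKind> {
    remaining.first().copied()
}

/// True when completing the next factor would finish the login.
fn is_final_step(remaining: &[FactorKind]) -> bool {
    remaining.len() == 1
}
```

Code that only ever handled two-factor methods keeps working through `next_factor`; code that renders progress ("step 2 of 3") can now iterate over the whole list.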

The lockout policy now defaults to per-IP in addition to per-user and per-tenant. The previous default only locked the user; the new default also throttles the source IP. Deployments that explicitly want only per-user lockout configure LockoutPolicy::per_user_only().

The session cookie's SameSite attribute now defaults to Lax rather than Strict. The change is to match modern browser defaults and to admit cross-site link-to-app navigations as legitimate. Deployments that need Strict configure it explicitly.

The fingerprint binding now defaults to FingerprintPolicy::Warn rather than FingerprintPolicy::Reauth. The new default is quieter during initial rollout. Production deployments that want stricter posture lift to Reauth or Revoke after calibrating the warn rate (Cookies, fingerprinting, hijack detection covers the calibration).

Schema migrations

The users table gained a tenant_status field for the tenant-suspension support. The migration is a single ALTER TABLE that adds the column with a default value. The examples/sqlite/migrations/ directory shows the SQL.

The devices table gained a fingerprint_hash field and lost the previous fingerprint_raw field. The migration is destructive: the fingerprint_raw field carried PII that the new design hashes before storage (Device identity covers the rationale). Adopters who want to preserve the audit trail of past fingerprints write the migration accordingly; adopters who do not, just drop the column.

The authn_attempts table gained an event_kind enum field that distinguishes between attempt outcomes, rather than relying on a separate outcome string. The migration is non-destructive; the outcome field stays for backward compatibility and is populated from event_kind automatically.

The session-data schema version bumped from 1 to 2. The new version adds a device_id field on Authenticated (for the device-binding work covered in Device identity). The schema-migration code (Schema migration) handles existing sessions transparently; no manual data migration is needed.
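What "handles existing sessions transparently" means in practice is an upgrade-on-read step. The following is a hedged sketch of that idea only: the `Stored` enum, its fields, and the choice to make `device_id` optional are assumptions, not the actual axess session-data layout.

```rust
/// Hypothetical shape of stored session data at each schema version.
#[derive(Debug, PartialEq)]
enum Stored {
    V1 { user_id: String },
    V2 { user_id: String, device_id: Option<String> },
}

/// Upgrades any stored version to the current one. A version-1 session has
/// no device binding, so the new field starts out empty.
fn upgrade(stored: Stored) -> Stored {
    match stored {
        Stored::V1 { user_id } => Stored::V2 { user_id, device_id: None },
        current @ Stored::V2 { .. } => current,
    }
}
```

The upgrade is idempotent, so it can run on every read; sessions are rewritten in the new shape the next time they are saved.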

Workspace structure changes

The axess-delegated crate folded back into axess-core. The adopter import paths stay the same (axess::delegated::* continues to work through facade re-export); the Cargo.toml no longer needs an explicit axess-delegated dependency, just the delegated feature on axess. The workspace dropped from 11 to 10 library crates.

For deployments running on the 0.1.x line:

The first step is to read this chapter end-to-end. Make a checklist of every change that applies to your code.

The second step is a parallel-deploy approach. Stand up a 0.2.0 build alongside the production 0.1.x; route a small fraction of traffic to it; observe behaviour. The session cookies between the two versions are not compatible (the schema-migration mechanism handles cookie reads but not writes across major versions), so the parallel deploy needs to be on isolated session storage.

The third step is the cutover. Once the 0.2.0 build has been green for at least the session TTL on the production-like sample, route 100% of traffic to it. The 0.1.x build can be decommissioned after a roll-back window has passed without incident.

The roll-back path: if 0.2.0 surfaces problems, route traffic back to 0.1.x; the sessions that started under 0.2.0 will be invalid against 0.1.x and will land as Guest, prompting re-login. The user-visible impact is one re-login; the behavioural impact is bounded.

Future migrations

The pattern from 0.1.x to 0.2.0 is the pattern future migrations will follow. Each migration documents itself here, sorted by release. The pattern:

Symptom: what the compiler or the runtime will tell you.

Rationale: why the change happened. Most changes happen because the previous shape was wrong in a specific way (a footgun, a performance bug, a security gap, an inconsistency with the rest of the library). The rationale gives the explanation; the next section gives the action.

Action: what to update in adopter code. The action is the shortest possible change that satisfies the new shape; longer restructurings are flagged as optional improvements.

A typical migration entry runs five to ten lines for a small change, a few paragraphs for a larger one. The chapter grows additively; older migrations are not removed.

What does not migrate

Some adopter changes do not produce a migration entry. The patterns:

Behaviour that was bug-fixed. A previous version's incorrect behaviour might have been load-bearing for an adopter who built around it; the fix is still the right thing to do, and the adopter has to adapt. The fix appears in the changelog as a bug fix; if the bug-fix is large enough to warrant a migration entry, it lands here, but not all of them do.

Internal refactors that do not change the public API. The internal split between axess-core modules is free to reorganise without producing a migration entry, as long as the public re-exports stay stable.

Configuration defaults that change but are configurable. A default that flipped is a behaviour change, captured above. A default that is configurable in both directions and the configuration is the source of truth does not produce a migration entry; the adopter's existing configuration continues to apply.

Further reading

Schema migration covers the per-session schema migration mechanism that handles session-data shape changes. The CHANGELOG.md covers the exhaustive list of changes per release; this chapter is the curated migration subset. Security posture covers the security-relevant breaking changes specifically, with the disclosure protocol for security fixes.

Contributing

This chapter is the contributor reference. It covers what we expect of pull requests, the testing requirements (including the non-negotiable DST discipline), the AX-NNN tracking convention, and the naming and visibility conventions that show up at code review.

The chapter has two halves. The first half is contributor-facing guidance specific to working on axess. The second half is the canonical CONTRIBUTING.md from the repo root, included so the workflow checklist is in one place.

Before you open a PR

Three things to do before you open a PR.

The first is to read or skim Architecture at a glance. The verifier-versus-orchestrator boundary, the three state slices, the DST discipline, and the naming conventions are the four architectural decisions that the review process holds new code against. A PR that violates one of them is harder to land; a PR written with them in mind sails through.

The second is to find or create an AX-NNN tracking entry. The ROADMAP is the source of truth for "what is being worked on" and "what is committed." A PR that lands a feature should reference an AX-NNN. A PR that lands a bug fix can do without (though one is often associated even with fixes). The number lives in the PR description and in the commit messages; the format is AX-NNN (no #, no space).

The third is to discuss substantial changes before writing them. The review cycle is faster when the maintainers have agreed to the shape ahead of time. A drive-by PR that rewrites a module is usually rejected even when the rewrite is well-thought-out; the cost of integration is higher than the value of the rewrite. A discussion (an issue, a draft PR description, a comment in an existing thread) before the work starts is the shape that lands.

Testing requirements

Every change passes its tests under both the production and the mock implementations of Clock, SecureRng, and the backend traits. The DST discipline is the testing non-negotiable; it is not aspirational.

A test that fails on the production implementation but passes on the mock is detecting a real bug in the production code (or in the test). A test that fails on the mock but passes on production is detecting either a real timing-dependent bug or an over-strict test; either way it is worth investigating before landing.

The pattern in the test code is to build everything from a suite of mocks:

#[tokio::test]
async fn login_succeeds_with_correct_password() {
    let suite = TestSuite::default();  // sets up the mocks
    let outcome = suite.service
        .verify_factor(
            &suite.session(),
            FactorCredential::Password("Gnomes2+".into()),
        )
        .await
        .unwrap();
    assert!(matches!(outcome, FactorOutcome::Authenticated));
}

TestSuite::default() wires MockClock, MockRng, MockBackend, MockRegistry, the in-memory session store, and the in-memory device store. The test runs entirely in process, deterministically, against a known initial state.

For tests that need a real database (integration tests that verify SQL adapters), the pattern is to mark them #[ignore] and run them in CI under a service container:

#[tokio::test]
#[ignore = "requires Postgres"]
async fn postgres_session_round_trip() -> Result<(), Box<dyn std::error::Error>> {
    // The connection string is supplied by the CI service container.
    let url = std::env::var("TEST_POSTGRES_URL")?;
    let pool = sqlx::PgPool::connect(&url).await?;
    // ... full integration test
    Ok(())
}

The #[ignore] attribute keeps the test out of the default cargo test run; the CI runs them explicitly with cargo test --features integration -- --ignored. The pattern keeps the inner loop fast (default cargo test is in-process) while still exercising the integration tests in CI.

What good PR descriptions look like

The PR description is what reviewers read first. The goal is to explain what the PR does, why, and what to look for. The shape:

A one-sentence summary at the top. "Add the BearerToken factor for inbound API authentication." Not "Misc fixes." The summary is what shows up in the PR list and in the commit history.

A "Why" paragraph. What problem does the change solve. The problem might be a documented bug, a missing capability, an operational signal that needs response. The reviewer's first question after "what" is always "why now"; answer it in the description rather than the comments.

A "How" section. The shape of the change. Which modules touched, which traits added or modified, which tests added. The reviewer's first question after "why" is "where to look"; the section is the map.

A "Testing" section. What tests cover the change. The default expectation is unit tests against the mocks; integration tests where the change crosses an integration boundary; manual testing notes for changes that are hard to automate (typically migrations or operational tooling).

A "Migration" section if the change is breaking. What downstream code has to update. The section is what feeds the Migration guide chapter; the maintainers add the entry there as part of the merge, but the PR author drafts the wording.

A reference to the AX-NNN tracking number. If the work is substantial, the AX entry has the larger context; the PR description summarises the slice this PR delivers.

Naming and visibility

The naming conventions from Architecture at a glance are enforced at review. The shapes:

A type that is shared across authentication and authorisation uses the Auth* prefix. A type used only for authentication uses Authn*. A type used only for authorisation uses Authz*. A type that does not fit any of the three either picks one (typically the broader one) or argues in the PR description why the convention does not apply.

A type's suffix carries its role. *Store, *Registry, *Provider, *Resolver, *Config, *Error, *Outcome, *Decision. A new type that does not fit any of these picks the closest match or argues in the PR description; the conventions are tight, but they are not exhaustive, and the rare exception is acceptable when documented.

A method's verb carries its complexity. get_* is O(1) by primary key. find_* may scan. load_* and save_* are serialisation pairs. begin_* and complete_* are ceremony starts and finishes. verify_* is a credential check. A method that does not fit any of these picks the closest match.

Visibility defaults to pub(crate). A type is promoted to pub only when an external consumer needs it; the default is to not export, and the burden is on the PR to justify the promotion. The convention catches the common case where an internal helper accidentally becomes public surface that has to be maintained forever.
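
The conventions above can be sketched in a few lines. Every name in this sketch (AuthnCredential, AuthnCredentialStore, AuthnCredentialError, and the methods) is hypothetical and exists only to show the prefix, suffix, verb, and visibility rules side by side; none of it is taken from the axess source.

```rust
use std::collections::HashMap;

// All names below are hypothetical; they only illustrate the conventions.

/// Authentication-only type, so it takes the Authn* prefix.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuthnCredential {
    pub user_id: u64,
    pub label: String,
}

/// *Error suffix: the failure type for the store below.
#[derive(Debug, PartialEq, Eq)]
pub enum AuthnCredentialError {
    NotFound,
}

/// *Store suffix: owns the data. Visibility stays pub(crate) until an
/// external consumer needs the type.
pub(crate) struct AuthnCredentialStore {
    by_id: HashMap<u64, AuthnCredential>,
}

impl AuthnCredentialStore {
    pub(crate) fn from_entries(entries: Vec<(u64, AuthnCredential)>) -> Self {
        Self { by_id: entries.into_iter().collect() }
    }

    /// get_*: a single hash lookup by primary key.
    pub(crate) fn get_credential(
        &self,
        id: u64,
    ) -> Result<&AuthnCredential, AuthnCredentialError> {
        self.by_id.get(&id).ok_or(AuthnCredentialError::NotFound)
    }

    /// find_*: may scan every entry.
    pub(crate) fn find_by_user(&self, user_id: u64) -> Vec<&AuthnCredential> {
        self.by_id.values().filter(|c| c.user_id == user_id).collect()
    }
}
```

A reviewer reading a call site can tell from the verb alone whether a lookup is cheap (get_credential) or potentially a scan (find_by_user).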

The no-#[non_exhaustive] policy

Axess does not use #[non_exhaustive] on its public enums and structs. The attribute trades exhaustiveness checking (the downstream compiler does not catch missing match arms) for backward compatibility (the upstream can add variants without breaking downstream). For axess, the trade is the wrong way around: missing match arms in the downstream are bugs we want to catch, and the backward-compatibility cost of adding variants is manageable through deprecation cycles and the migration guide.

A PR that adds #[non_exhaustive] to a public type is rejected unless the reasoning in the PR description argues a specific case. The default is to bump the semver major version when a variant is added, document the change in the migration guide, and let the downstream's compiler catch the missing arm.
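
A minimal sketch of what the policy buys. FactorKind and requires_user_presence are hypothetical names, not axess types. The match below has no wildcard arm, so adding a fourth variant to the enum makes the function fail to compile until the new case is handled. Had the enum been defined in an upstream crate with #[non_exhaustive], the downstream match would be forced to carry a `_ =>` arm, and the new variant would fall into it silently.

```rust
/// Hypothetical enum for illustration; not an axess type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Passkey,
}

/// Exhaustive match with no wildcard arm: a new FactorKind variant
/// turns this function into a compile error rather than a silent default.
pub fn requires_user_presence(kind: FactorKind) -> bool {
    match kind {
        FactorKind::Password => false,
        FactorKind::Totp => false,
        FactorKind::Passkey => true,
    }
}
```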

The DST non-negotiable

The DST discipline is reproduced from Architecture at a glance as a contributor reminder:

Every code path that reads wall time goes through the Clock trait. Every code path that sources entropy goes through the SecureRng trait. Every backend trait has a mock implementation that the tests use. A PR that introduces a chrono::Utc::now() call, a getrandom() call, or a direct database read outside the trait surface is rejected.

The exceptions are extremely narrow: the axess-cache crate's moka-cache feature uses wall-clock-driven eviction (opt-in, documented as DST-breaking), and the production SystemClock and SystemRng implementations delegate to the OS (these are the only places where the OS calls happen). New code introduces neither another exception nor a workaround that hides the same problem.

The discipline is what lets the test suite be reproducible. A contributor who finds the discipline frustrating is usually about to introduce a bug; the friction is the point.
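
A sketch of the pattern, under stated assumptions: the trait and type names (Clock, SystemClock, MockClock) come from this chapter, but the method signatures, the Duration-since-epoch representation, and the lockout_expired helper are assumptions made for illustration; the real axess-clock API may differ.

```rust
use std::cell::Cell;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Assumed shape of the clock abstraction; the real trait may differ.
pub trait Clock {
    /// Time elapsed since the Unix epoch.
    fn now(&self) -> Duration;
}

/// Production clock: the one place allowed to ask the OS for the time.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now(&self) -> Duration {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is set before the Unix epoch")
    }
}

/// Test clock: time moves only when the test advances it.
pub struct MockClock {
    now: Cell<Duration>,
}

impl MockClock {
    pub fn new(start: Duration) -> Self {
        Self { now: Cell::new(start) }
    }

    pub fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Clock for MockClock {
    fn now(&self) -> Duration {
        self.now.get()
    }
}

/// Hypothetical lockout check, generic over the clock it is given.
pub fn lockout_expired<C: Clock>(clock: &C, locked_at: Duration, lockout: Duration) -> bool {
    clock.now() >= locked_at + lockout
}
```

Because lockout_expired never touches the OS clock, a test can step across the expiry boundary one second at a time and get the same answer on every run.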

Canonical CONTRIBUTING.md

The rest of this chapter is the canonical CONTRIBUTING.md from the repo root.

Contributing to Axess

Thanks for your interest! Axess accepts bug reports, feature requests, documentation improvements, and code contributions.

Before opening a PR for non-trivial work, please file an issue first; this lets us flag overlap with in-flight work in ROADMAP.md and confirm the change fits the library's direction (see docs/intro/architecture.md) before you invest time.

Before you submit

  1. Fork the repository and create a topic branch from main.
  2. Tests: add or update tests for every behaviour change. The library uses deterministic simulation testing (DST); inject MockClock / MockRng rather than calling SystemTime::now() or rand::rng() directly.
  3. Run the full check locally:
    cargo fmt --all
    cargo clippy --workspace --all-features --lib --tests -- -D warnings
    cargo test --workspace --all-features
    
  4. Update CHANGELOG.md; add an entry under the [unreleased] section describing the change. Behaviour-changing entries belong under ### Changed (breaking) if they alter a public API.
  5. Open a PR with a description that covers the why: link the issue, summarise the design choice, and call out any deliberate trade-offs.
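
For step 2, a sketch of what injecting a mock entropy source can look like. SecureRng and MockRng are names from this document, but the fill_bytes signature, the replay-a-fixed-sequence behaviour, and the new_session_id helper are assumptions for illustration only; check the axess-rng crate for the real API.

```rust
/// Assumed shape of the entropy abstraction; the real trait may differ.
pub trait SecureRng {
    fn fill_bytes(&mut self, dest: &mut [u8]);
}

/// Deterministic test RNG: replays a fixed byte sequence, cycling when it
/// runs out. Not secure; for tests only.
pub struct MockRng {
    bytes: Vec<u8>,
    pos: usize,
}

impl MockRng {
    pub fn new(bytes: Vec<u8>) -> Self {
        assert!(!bytes.is_empty(), "MockRng needs at least one byte");
        Self { bytes, pos: 0 }
    }
}

impl SecureRng for MockRng {
    fn fill_bytes(&mut self, dest: &mut [u8]) {
        for b in dest.iter_mut() {
            *b = self.bytes[self.pos % self.bytes.len()];
            self.pos += 1;
        }
    }
}

/// Hypothetical session-id generator, generic over the entropy source.
/// Draws 8 bytes and renders them as lowercase hex.
pub fn new_session_id<R: SecureRng>(rng: &mut R) -> String {
    let mut raw = [0u8; 8];
    rng.fill_bytes(&mut raw);
    raw.iter().map(|b| format!("{b:02x}")).collect()
}
```

Two MockRng values built from the same bytes produce the same identifiers, which is what lets a test assert on an exact session id.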

Coding conventions

  • Idiomatic Rust, async/await for IO, thiserror for error types, tracing for logs.
  • Prefer traits + generics on hot paths; vtable dispatch (Box<dyn …>) only where it earns its keep.
  • Public APIs need rustdoc, including at least one usage example for newly-introduced traits or builders.
  • All time + randomness goes through the Clock / SecureRng traits. This is non-negotiable; it's what makes the test suite deterministic.

See .github/copilot-instructions.md for the full house style.

Workspace layout

Crate            Role
axess            Public facade: middleware builder, re-exports, feature gates
axess-core       Core types, session orchestrator, Cedar authz integration, on-behalf-of credential storage + token exchange
axess-cache      Generic clock-aware TTL cache
axess-clock      Clock / MockClock traits for DST
axess-events     rkyv-serialisable audit event types
axess-factors    Authentication factor implementations
axess-identity   Newtype ID macros + impls
axess-macros     Procedural macros for route guards
axess-rng        SecureRng / MockRng traits
axess-strings    Short hot-path string primitive
examples/*       Reference example applications

Repository conventions

A few rules that aren't obvious from reading the code but affect every PR. Most exist because the cost of not following them showed up somewhere.

Module layout

axess uses the modern Rust convention: foo.rs + a sibling foo/ directory holding submodules. No mod.rs files in new code. Every directory module declares its submodules in the foo.rs file next to (not inside) the directory.

Test-sideways-pull

When #[cfg(test)] tests crowd a production file enough to make scrolling expensive, pull them into a sibling tests.rs:

axess-core/src/path/file.rs        ; production code +
                                   ;   #[cfg(test)] mod tests;
axess-core/src/path/file/tests.rs  ; the actual tests, gated by
                                   ;   #![cfg(test)]

Applied so far across several files where the tests-to-production ratio exceeded ~40%.

pub(crate) for state-machine internals

AuthSession carries identity / session-state accessors as pub. State mutation methods (set_authenticated, begin_authenticating, advance_factor, record_attempt_at) are state-machine transitions that the factor pipeline drives; they are pub(crate) so handler code cannot corrupt the state machine. Adopters drive flow through AuthnService; the session is read-only-ish from outside axess-core.

Per-app workflow mutations (set_identifying, set_pending_workflow, clear, regenerate) remain pub; apps build their own two-step identify / workflow-step / logout flows on top.
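
The split can be sketched as follows. DemoSession and DemoState are hypothetical stand-ins, deliberately smaller than the real AuthSession and its five-variant state; only the method names begin_authenticating, advance_factor, and clear are borrowed from the text above, and their signatures here are assumptions.

```rust
/// Hypothetical, simplified state; the real enum has five variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DemoState {
    Anonymous,
    Authenticating { factors_done: u8, factors_required: u8 },
    Authenticated { user_id: u64 },
}

pub struct DemoSession {
    state: DemoState,
}

impl DemoSession {
    pub fn new() -> Self {
        Self { state: DemoState::Anonymous }
    }

    // Read accessors are pub: handlers may inspect the session.
    pub fn state(&self) -> &DemoState {
        &self.state
    }

    pub fn is_authenticated(&self) -> bool {
        matches!(self.state, DemoState::Authenticated { .. })
    }

    // Per-app mutation stays pub: apps build their own logout flow on it.
    pub fn clear(&mut self) {
        self.state = DemoState::Anonymous;
    }

    // State-machine transitions are pub(crate): only the factor pipeline
    // inside the crate can drive them.
    pub(crate) fn begin_authenticating(&mut self, factors_required: u8) {
        self.state = DemoState::Authenticating { factors_done: 0, factors_required };
    }

    pub(crate) fn advance_factor(&mut self, user_id: u64) {
        if let DemoState::Authenticating { factors_done, factors_required } = self.state {
            let done = factors_done + 1;
            self.state = if done >= factors_required {
                DemoState::Authenticated { user_id }
            } else {
                DemoState::Authenticating { factors_done: done, factors_required }
            };
        }
    }
}
```

From another crate, a call to advance_factor does not compile, so a handler has no way to skip a factor; it can only read the state or clear it.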

No #[deprecated] pre-v0.1.0

Breaking changes happen freely in the unreleased [0.2.0] window; adopters get one coordinated migration window, not a long #[deprecated] trail. CHANGELOG documents each break under ### Changed (breaking).

MSRV bumps are breaking changes

The workspace pins rust-version = "1.87" in [workspace.package]. A bump to a higher MSRV requires a minor-version bump on every published crate (0.x → 0.x+1 for 0.x; 1.x → 1.x+1 once stable). The reasoning: adopters pin Rust toolchains in CI; jumping the floor without warning silently breaks their builds.

Procedure for an MSRV bump:

  1. Justify in the PR description (which compiler feature, why it earns the bump).
  2. Update rust-version in [workspace.package] AND the MSRV job's toolchain pin in .github/workflows/ci.yml.
  3. Add an entry under ### Changed (breaking) in CHANGELOG.md naming the new floor.
  4. Bump the workspace version (in [workspace.package]) accordingly.
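
The version arithmetic in step 4 is small enough to sketch. bump_minor is a hypothetical helper, not part of the repository's tooling, and it assumes plain MAJOR.MINOR.PATCH versions with no pre-release or build suffix. Resetting the patch component to zero follows the usual semver convention rather than anything this document states.

```rust
/// Hypothetical helper: returns the next minor version, with the patch
/// component reset to 0. Returns None for anything that is not a plain
/// MAJOR.MINOR.PATCH triple.
pub fn bump_minor(version: &str) -> Option<String> {
    let mut parts = version.split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let _patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(format!("{major}.{}.0", minor + 1))
}
```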

No #[non_exhaustive] on first-party enums

#[non_exhaustive] trades one breakage class (adding variants) for another (every downstream match needs a wildcard arm forever, even when the caller wants compile-time exhaustiveness on a closed set). Project policy is to bump the version and let downstream match failures be loud. CI enforces this; the ban_non_exhaustive workflow job rejects any PR that introduces the attribute.

No ticket-meta date stamps pre-v0.1.0

Source-code comments do not carry // AX-NNN (YYYY-MM-DD): markers. The CHANGELOG is the authoritative timeline; in-source stamps add noise without information a future reader can use. ROADMAP + CHANGELOG retain their AX-NNN references unchanged.

Closed AX-NNN references get stripped

Once an AX-NNN case closes, every reference in source / doc-strings / test names is stripped, preserving the rationale comment but dropping the case number. Open + deferred cases stay referenced.

Promoting a module out of axess-core

axess-core has accumulated significant surface. When proposing a new crate carve-out, check:

  1. No reverse dep from axess-core onto the carved module. If the module's types appear in AuthnService method signatures or in any axess-core trait surface, the carve isn't yet feasible; invert the dependency first.
  2. Module has its own external dep blast. Carving delegated/ into axess-delegated won because it pulls aes-gcm only when adopters opt in. A carve that pulls no extra deps is just churn.
  3. Module is consumable in isolation. A consumer who wants only the carved module should not transitively recompile axess-core's protocol surface.
  4. Re-export via the facade preserves the import path. Adopters write axess::middleware::ratelimit::*, not axess_middleware::ratelimit::*. The facade decides the shape.

Security

Do not open public issues for security vulnerabilities. Report them privately per SECURITY.md.

Licensing

By contributing, you agree your contribution will be dual-licensed under MIT and Apache-2.0, matching the project licence.

Community

Be respectful and constructive. See CODE_OF_CONDUCT.md.

Maintainer time is volunteer-funded; review turnaround is best-effort.

Further reading

Architecture at a glance covers the architectural decisions that review enforces. Publishing runbook covers the maintainer-only release process. The CHANGELOG.md catalogues what each release has shipped, which is useful context for understanding what the next PR is meant to do.

Publishing runbook

This chapter is the maintainer-only reference. It covers the publish-to-crates.io procedure: the pre-flight checklist, the dependency topological order, the dry-run, the actual publish, and the rollback procedure if something goes wrong.

The audience is the maintainer cutting a release. An adopter does not need this chapter; the chapter is here so the maintainer has a written reference and so the procedure can be followed by a different maintainer if needed.

Pre-flight

The pre-flight checklist runs before the first dry-run. Each item is a binary pass-or-fail; one failure blocks the release.

The CI is green on the release branch. Every leg of the test matrix (default features, all-features, per-backend isolation, FIPS backend) passes. A red CI does not publish.

The version bumps are consistent across the workspace. Every member of the workspace gets the same version bump (this is the versioning policy: the workspace ships as one unit). The new version is set once, in [workspace.package] in the root Cargo.toml; each crate's Cargo.toml inherits it through version.workspace = true rather than carrying its own number.

The Cargo.lock is up to date. Run cargo update --workspace, review the changes, commit if needed.

The migration guide is complete. The Migration guide chapter in the book carries an entry for every breaking change in the release. The entry covers the symptom, the rationale, and the fix.

The CHANGELOG.md has a current entry for the release. The entry covers the new features, the breaking changes (referencing the migration guide), and the bug fixes. The entry is the short-form version of the migration guide; both exist because they serve different audiences (the CHANGELOG is the per-release overview, the migration guide is the per-change reference).

The docs.rs configuration is present and correct on every crate that publishes to crates.io. The shape:

[package.metadata.docs.rs]
all-features = true
rustdoc-args = ["--cfg", "docsrs"]

Verify by running cargo doc --all-features locally; the build must succeed without warnings. A docs build that fails on docs.rs after publish is a maintenance problem that surfaces once the release is out.

The description, license, repository, keywords, and categories fields are populated on every published crate. The crates.io page renders these; a missing field is a missing detail in the listing.

The publish = false flag is removed from every crate that should publish. This is the deliberate gate that keeps accidental publishes from happening; flipping the flag is what makes the release possible.

The release branch (a fresh release/0.2.0 branch from main) exists. The branch is the source of truth for the release; any fixes during the publish window land on the branch and merge back to main after.

Topological dependency order

The workspace's library crates publish in dependency order: a crate must be published before any crate that depends on it. The order:

  1. axess-strings (no axess deps)
  2. axess-clock (no axess deps)
  3. axess-rng (no axess deps)
  4. axess-identity (no axess deps)
  5. axess-cache (depends on axess-clock)
  6. axess-events (depends on axess-identity)
  7. axess-factors (depends on axess-identity, axess-clock, axess-rng)
  8. axess-macros (no axess deps; procedural macros stand alone)
  9. axess-core (depends on everything in the previous tier)
  10. axess (the facade, depends on axess-core and axess-factors and axess-macros)

The order can be confirmed by running cargo publish --dry-run against each crate in turn, but the maintainer should also know it by heart so a publish that fails partway through can be resumed at the right position.
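
For anyone who prefers to derive the order rather than memorise it, a sketch: the function below repeatedly emits every crate whose dependencies have already been emitted. The dependency table copies the list above, with one loudly-labelled assumption: axess-core is treated as depending on every library crate listed before it, because the list only says "everything in the previous tier". The function names are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Direct dependencies as listed in the runbook.
/// ASSUMPTION: axess-core depends on every crate listed before it.
pub fn workspace_deps() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("axess-strings", vec![]),
        ("axess-clock", vec![]),
        ("axess-rng", vec![]),
        ("axess-identity", vec![]),
        ("axess-cache", vec!["axess-clock"]),
        ("axess-events", vec!["axess-identity"]),
        ("axess-factors", vec!["axess-identity", "axess-clock", "axess-rng"]),
        ("axess-macros", vec![]),
        (
            "axess-core",
            vec![
                "axess-strings", "axess-clock", "axess-rng", "axess-identity",
                "axess-cache", "axess-events", "axess-factors", "axess-macros",
            ],
        ),
        ("axess", vec!["axess-core", "axess-factors", "axess-macros"]),
    ])
}

/// Emits crates layer by layer: a crate is ready once all of its
/// dependencies have been emitted. Returns None on a cycle or on a
/// dependency that is missing from the table.
pub fn publish_order(
    deps: &BTreeMap<&'static str, Vec<&'static str>>,
) -> Option<Vec<&'static str>> {
    let mut done: BTreeSet<&str> = BTreeSet::new();
    let mut order = Vec::new();
    while order.len() < deps.len() {
        let ready: Vec<&'static str> = deps
            .iter()
            .filter(|(name, ds)| !done.contains(*name) && ds.iter().all(|d| done.contains(d)))
            .map(|(name, _)| *name)
            .collect();
        if ready.is_empty() {
            return None; // nothing can make progress: cycle or unknown dep
        }
        for name in ready {
            done.insert(name);
            order.push(name);
        }
    }
    Some(order)
}
```

The sketch may emit the crates in a different sequence from the numbered list, since crates within a layer have no required order among themselves; any sequence that places each crate after its dependencies is a valid publish order.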

The dry-run

Before the actual publish, run a dry-run for each crate in topological order. The shape:

cargo publish --dry-run -p axess-strings
cargo publish --dry-run -p axess-clock
cargo publish --dry-run -p axess-rng
cargo publish --dry-run -p axess-identity
cargo publish --dry-run -p axess-cache
cargo publish --dry-run -p axess-events
cargo publish --dry-run -p axess-factors
cargo publish --dry-run -p axess-macros
cargo publish --dry-run -p axess-core
cargo publish --dry-run -p axess

The dry-run does everything the publish does except the upload. It builds the package, verifies the manifest, runs the publish-time checks, and prints the path of the .crate file it would have uploaded. A failure here is an opportunity to fix without having to yank a half-published release.

A failure on a downstream crate (say axess-core) typically means an upstream crate (say axess-factors) needs an update that the dry-run does not yet reflect. The fix is to update the upstream first; the downstream dry-run picks up the change.

The publish

After the dry-runs pass, run the actual publish in the same topological order. The shape:

cargo publish -p axess-strings
# wait ~30s for crates.io to index, then:
cargo publish -p axess-clock
# wait, then:
cargo publish -p axess-rng
# ... and so on

The wait between publishes is necessary because each subsequent publish needs the previous one to be available on crates.io's index. Without the wait, the downstream publish fails with "crate not found"; with the wait, the index has propagated by the time the next publish queries it.

The wait time is short (30 seconds is generous; sometimes 15 seconds works). For automation, the wait can be scripted with a retry loop that polls the crates.io API until the expected version is listed.
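
A sketch of the retry loop, kept independent of any HTTP client. wait_until_listed is a hypothetical helper; the is_listed closure stands in for whatever check the automation uses to ask crates.io whether the expected version is visible, and that check is not shown here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `is_listed` until it returns true or the attempts run out,
/// sleeping `delay` between attempts. Returns the attempt number that
/// succeeded, or None if the version never appeared.
pub fn wait_until_listed<F>(mut is_listed: F, max_attempts: u32, delay: Duration) -> Option<u32>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=max_attempts {
        if is_listed() {
            return Some(attempt);
        }
        if attempt < max_attempts {
            sleep(delay); // no point sleeping after the final attempt
        }
    }
    None
}
```

A release script would call this between publishes and stop if it returns None, rather than continuing with a downstream crate whose dependency is not yet visible.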

After the final publish (cargo publish -p axess), wait a few minutes and verify on crates.io that all the crates are listed at the new version.

The smoke test

After the publish, run a smoke test against a fresh dependency on the published version. The shape:

mkdir /tmp/axess-smoke
cd /tmp/axess-smoke
cargo init --name axess-smoke
echo 'axess = "0.2"' >> Cargo.toml
cargo build

The build pulls the published crates from crates.io (not from the workspace) and verifies they assemble. A failure here indicates an issue that the dry-run did not catch (typically a crates.io-specific issue like a missing file in the package manifest); the rollback procedure is the response.

For a more thorough smoke test, copy examples/sqlite/ to a fresh directory, point its Cargo.toml at the published versions (replacing path = "../../axess" with version = "0.2"), and verify it builds and runs.

The smoke test is the last gate before announcing the release. A successful smoke test means the publish is real.

The announcement

After the smoke test passes:

Tag the release in git (git tag v0.2.0 && git push --tags). The tag is the canonical reference point.

Update the status banner near the top of README.md from "pre-release" to "0.2.0 released."

Open a 0.3.0-pending section in CHANGELOG.md. Future PRs land entries under that section until the next release cuts.

Post to the project's announcement channels (the GitHub releases page is the canonical one; the project's Discord, Slack, mailing list, or other channels mirror as appropriate).

Update the docs.rs links anywhere they hardcode a version. The canonical version is now the released one, not the development branch.

Rollback

If the publish goes wrong (a critical bug surfaces, a crate is broken on crates.io, the release was premature), the rollback procedure:

cargo yank --version 0.2.0 axess (and every other crate) withdraws the version from crates.io. Yanked versions remain available to existing consumers (so Cargo.lock references continue to work), but new resolves do not pick them up.

A yank can be reversed with cargo yank --undo, but a version number can never be reused: crates.io rejects a new publish under the same version, yanked or not. The next release picks a new version (0.2.1); the fix lands there.

In practice a rollback is needed less often than this section might suggest; the dry-run and topological-publish discipline catches most issues before the upload. A yank is the genuine emergency response, typically for a security issue that warrants withdrawing a specific version.

Post-release maintenance

After the release lands, the maintenance window is the period where the maintainer watches for issues. The shape:

The first 24 hours: monitor crates.io for download numbers (a quick check that the publish reached an audience); monitor the GitHub issue tracker for new bug reports; monitor the dashboards of any deployments that follow main closely.

The first week: triage the issues that come in; assess whether any warrant a patch release (0.2.1). The criteria for a patch release: a critical bug, a security issue, a regression from 0.1.x that the migration guide did not catch.

The first month: roll up the lessons learned. The patches shipped, the issues that surfaced, the documentation gaps the release exposed. The roll-up feeds the next release's planning.

Further reading

Migration guide covers the cross-version compatibility surface that the release-management decisions depend on. Contributing covers the development workflow that produces the changes a release ships. The CHANGELOG.md covers the exhaustive list of changes per release.