Welcome
Axess is a library for authenticating users (and non-human callers) in Axum applications. Most of what it offers is not unusual: a session layer, factor verification, cookies, policy evaluation. What makes it interesting is what those pieces refuse to do, and what the refusals add up to.
The first refusal is the most consequential. A session in axess is never
"half logged in". The authentication state is a typed enum with five
variants, and a partially-completed login is one of those variants
(Authenticating) rather than an Authenticated session with a missing
flag. Handlers that receive a session cannot mistake a one-factor login
for a finished one because the types do not permit it. Code reviewers
reading a handler do not need to model which fields are populated when;
the variant they match on tells them what is in scope and what is not.
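As a rough illustration of why the variant carries the guarantee, the sketch below models a five-variant state enum with the variant names used in this book. The payload fields (user_id, factors_remaining, and so on) and the authenticated_user helper are assumptions made for the example, not the shape of the real AuthState.

```rust
/// Hypothetical stand-in for the session state machine. The variant names
/// follow the book; the payload fields are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum AuthState {
    Guest,
    Identifying { identifier: String },
    Authenticating { user_id: u64, factors_remaining: u8 },
    PendingWorkflow { user_id: u64, workflow: String },
    Authenticated { user_id: u64 },
}

/// Returns the user id only for a finished login. A partially completed
/// login is a different variant, so it cannot slip through as "authenticated".
pub fn authenticated_user(state: &AuthState) -> Option<u64> {
    match state {
        AuthState::Authenticated { user_id } => Some(*user_id),
        // Listing the other variants explicitly means adding a sixth variant
        // later forces this function to be revisited at compile time.
        AuthState::Guest
        | AuthState::Identifying { .. }
        | AuthState::Authenticating { .. }
        | AuthState::PendingWorkflow { .. } => None,
    }
}
```

Because the match is exhaustive, a reviewer can see from the arms alone which states grant access; there is no flag to forget.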
The second refusal concerns time and entropy. Production authentication
code is full of both: session identifiers come from the operating
system's random source, one-time-password windows are measured against
the wall clock, account lockout clears on a schedule. None of that goes
through the operating system directly in axess. Every wall-clock read,
every random byte, is sourced from a Clock or SecureRng trait whose
production implementation delegates to the system and whose test
implementation reproduces a controlled sequence. The same login flow,
including all its side effects on the session registry and the audit
log, runs in a unit test without infrastructure and without flakes. The
discipline is called deterministic simulation testing, and it is the
reason a race condition between token issuance and use can become a
failing test rather than a postmortem.
The third refusal is to ship scattered if user.role == "admin" checks
as the authorisation story. Cedar Policy, the policy language axess uses
for authorisation, is declarative, schema-validated, deny-by-default,
and one language for what most codebases split between role checks,
ownership checks, and contextual rules. Policies live in policy files,
not in handlers. The Rust code asks the question and receives an
AuthzDecision. Reviewing the policy set is then a single artifact
review, not a hunt across handlers.
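To make "deny-by-default" concrete, here is a toy evaluator written in plain Rust. It is not Cedar and does not use the Cedar or axess APIs; the Rule, Request, and Decision types are invented for the example. It only mirrors two semantics that Cedar documents: a request is denied unless some permit rule matches, and any matching forbid rule overrides permits.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Effect {
    Permit,
    Forbid,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Hypothetical request shape; a real policy engine works on entities.
pub struct Request {
    pub principal_role: &'static str,
    pub action: &'static str,
    pub owner_matches: bool,
}

pub struct Rule {
    pub effect: Effect,
    pub applies: fn(&Request) -> bool,
}

/// Deny unless at least one permit matches and no forbid matches.
pub fn evaluate(rules: &[Rule], req: &Request) -> Decision {
    let mut permitted = false;
    for rule in rules.iter().filter(|r| (r.applies)(req)) {
        match rule.effect {
            Effect::Forbid => return Decision::Deny, // forbid always wins
            Effect::Permit => permitted = true,
        }
    }
    if permitted { Decision::Allow } else { Decision::Deny }
}

/// One rule set covering an ownership check, a role check, and a contextual rule.
pub fn sample_rules() -> Vec<Rule> {
    vec![
        Rule { effect: Effect::Permit, applies: |r| r.action == "read" && r.owner_matches },
        Rule { effect: Effect::Permit, applies: |r| r.principal_role == "admin" },
        Rule { effect: Effect::Forbid, applies: |r| r.action == "delete" && !r.owner_matches },
    ]
}
```

The point of the sketch is the review property: every rule sits in one list, and a request no rule mentions is denied without any handler code taking part.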
Most of the rest of axess follows from these three decisions, in
combination with one structural choice: the library is split across
ten small crates so adopters who do not need (say) a given
federation adapter do not compile its dependencies. The split is
also the verifier-versus-orchestrator boundary in code. Per-credential
algorithms live on one side (axess-factors), the state machine,
composition, and federation machinery live on the other (axess-core).
The split is the most important line in the workspace, and it is
described in detail in the next chapter.
What axess is not
It helps to know the boundaries. Axess is not a SaaS, has no hosted
control plane, and does not own your user database. It is a library
your application depends on, and your application keeps owning its data.
It is not an Identity Provider in its primary use. In OAuth/OIDC
terms axess is the Relying Party (the application that delegates
identity to an external IdP and runs a session on the resulting
tokens), not the OpenID Provider (the IdP itself, with login UI,
consent screens, and user database). For the OP role, point axess at
Keycloak, Ory Hydra, Okta, Azure AD, or whatever SSO your
organisation already runs. The local-idp feature does mint workload
JWTs in-process, but that is on-host service-to-service issuance, not
a user-facing OP; the Local IdP chapter covers that surface. It is not an HTTP
server. Axum is the HTTP server; axess plugs into it as a Tower layer
plus a set of extractors, and your code is what owns the lifecycle. It
is not a general-purpose session library. The session machinery is in
service of the authentication state machine, not the other way around;
if all you need is HTTP sessions without authentication or
authorisation, smaller libraries do that better.
The workspace, in one table
The crate split is structural, and the table below is a fair approximation of which one you reach for in any given situation. The chapter Architecture at a glance expands on the dependency direction and the rules that keep leaf crates from depending on the orchestrator.
| Crate | Role |
|---|---|
| axess | Facade. Re-exports the public API. Application code depends on this. |
| axess-core | Session state machine, AuthnService, AuthzStore, federation adapters (OAuth, OIDC, LDAP, mTLS, FIDO2, JWT, K8s SA, GitHub OIDC), device identity, OBO/delegated access, middleware, storage backends. The orchestrator. |
| axess-factors | Per-credential verifier primitives: Argon2id, TOTP, HOTP. Composable on their own. |
| axess-identity | Typed IDs (UserId, TenantId, WorkloadId) and the Principal { Human, Workload } enum. |
| axess-events | Audit event payloads and async sinks. |
| axess-cache | TTL+LRU cache with single-flight. Used by the Cedar entity cache and the OIDC JWKS cache. |
| axess-clock | Clock trait, SystemClock, MockClock. The DST time foundation. |
| axess-rng | SecureRng trait, SystemRng, MockRng. The DST entropy foundation. |
| axess-strings | Shared string newtypes (Arc<str> interning). |
| axess-macros | require_authn!, require_partial_authn!, require_authz! procedural macros. |
When to reach for axess
Axess fits when at least two of the following are true. Multi-factor authentication that varies per user or per tenant is the most common driver, because composing factors and threading their result through a typed state machine is the value over a single-factor session library. Policy-driven authorisation in one language, across roles, relationships, and contextual conditions, is the second. Multi-tenancy is the third, since axess scopes factors, methods, and policies at three tiers (Global, Tenant, User) by default. Device identity, workload identity, or delegated access are the fourth, fifth, and sixth, in the order most adopters need them. Regulated industries land here for the audit pipeline and FAPI 2.0 conformance work, because the trail axess emits is already shaped as evidence.
It does not fit when a single-factor session is all you need. It does not fit when you want a hosted IdP. It does not fit when your protocol is not HTTP, because the state machine is shaped against Axum extractors and middleware. Each of these has better answers elsewhere.
Where to read next
If you are evaluating axess, the next chapter is the one to read. Architecture at a glance covers the verifier-versus-orchestrator line, the dependency direction, the three independent state slices that make up a request, and the DST mechanics that ride underneath. Twenty minutes there will save an hour in every chapter after.
If you are starting an integration, jump to Getting started. It walks
through a minimal Axum application end-to-end and points at
examples/sqlite/ for the production-shaped version with a real
database, encrypted sessions at rest, two-factor login, rate limiting,
health checks, and metrics.
If you have inherited an existing axess integration, the left-hand navigation is grouped by concern. Parts II and V (Authentication and Sessions) carry most of the day-to-day surface; the rest reads as reference and can wait until you need it.
If you are responsible for the production deployment, read Security posture and Operations runbook before launch. The defaults in code are conservative for development; production has specific knobs that must be set explicitly, and both chapters name them.
Status
Axess is at v0.2.0, pre-publication. The API is stabilising, with
the crates.io publish named as the next milestone. The breaking
changes accumulated against the previous stable line are catalogued in
Migration guide. Until the first crates.io release, minor versions
may break source compatibility; the goal post-publish is to maintain
the SemVer discipline that Rust libraries are held to elsewhere.
Vulnerability reports go through the private channel described in
SECURITY.md.
Please do not file security issues on the public GitHub tracker.
Architecture at a glance
This chapter describes the shape of the axess workspace: which crate owns what, how the pieces compose, what stays put under which kind of change, and where adopters plug in. The goal is to make the rest of the book pre-cached. Once you have the four architectural decisions below in mind (the verifier-versus-orchestrator line, the three state slices, the DST foundation, and the naming conventions), every later chapter slots into place without further explanation.
If you are evaluating axess, read this chapter end-to-end. If you are already mid-integration, you can skim and come back when something feels surprising.
Workspace shape
Axess is ten library crates plus a set of example applications. The split is not cosmetic. It enforces a structural invariant (leaf crates do not depend on the orchestrator), it gates compile cost for features adopters do not use, and it makes the verifier-versus-orchestrator line explicit in the dependency graph.
flowchart TD
facade["axess<br/><i>facade</i>"]
core["axess-core<br/><i>orchestrator</i>"]
factors["axess-factors<br/><i>verifiers</i>"]
macros["axess-macros<br/><i>guard macros</i>"]
identity["axess-identity<br/><i>typed IDs</i>"]
events["axess-events<br/><i>audit payloads</i>"]
cache["axess-cache<br/><i>TTL cache</i>"]
clock["axess-clock<br/><i>Clock trait</i>"]
rng["axess-rng<br/><i>SecureRng trait</i>"]
strings["axess-strings<br/><i>Arc<str></i>"]
facade --> core
facade --> factors
facade --> macros
core --> factors
core --> identity
core --> events
core --> cache
core --> clock
core --> rng
core --> strings
factors --> identity
factors --> clock
factors --> rng
cache --> clock
events --> identity
The axess crate is a thin facade that re-exports the curated public
API from axess-core and axess-factors. Application code depends on
this crate and only this crate. The internal split is free to
reorganise without breaking adopters, provided the types surfaced at
the facade level stay compatible.
axess-core is the orchestrator. It owns the session state machine,
AuthnService, AuthzStore, the Axum middleware stack (CSRF, rate
limit, request id, trace id), session storage backends, the
device-identity ladder, the workload identity resolvers, and the audit
dispatch. If a type drives a transition or owns persistent state, it
lives here.
axess-factors holds the per-credential verifiers. The list is long
because the credential surface authentication actually has is long:
Argon2id, TOTP, HOTP, email OTP, FIDO2, LDAP bind, mTLS, OAuth and OIDC
(with discovery, JWKS cache, and logout-token claim validation), JWT
validation, federation adapters for Kubernetes service accounts and
GitHub Actions and generic OAuth resource servers, a bearer-token
extractor, an outbound OAuth client, and the PKCE helpers. The crate is
composable on its own and is the obvious extension point when you need
a custom factor: implement the verifier trait, register it with the
service, the rest stays the same.
Everything else in the workspace is a leaf. Each leaf crate owns one
concept (typed IDs, TTL cache, the Clock trait), and depends only on
other leaves on its own row of the dependency graph. The structural
invariant under review is straightforward: no leaf crate may depend on
axess-core. Flipping any of these to depend on the orchestrator would
create a cycle through the facade and is rejected at review.
The verifier-versus-orchestrator line
The most important line in the workspace runs between axess-factors
and axess-core. Per-credential algorithms and their data shapes live
on the verifier side. The sum types and the composition machinery that
combine them live on the orchestrator side.
This is concrete. The Fido2Config struct, the Fido2Verifier trait,
and the WebAuthn ceremony itself live in axess-factors. The
FactorKind::Fido2 variant, the FactorConfig::Fido2(Fido2Config)
wrapping, and the FactorStep::factor(FactorKind::Fido2) composition
helper live in axess-core. The same pattern applies to LDAP, to OAuth,
to every factor: the algorithm and its config are verifier-side, the
enum variant and the composition are orchestrator-side.
The reason for the line is the kind of change each side absorbs. The
verifier is the thing you might want to swap (an alternative WebAuthn
library, a custom OTP scheme, an LDAP binding that reads from a sidecar
rather than directly). The orchestrator is the thing you do not swap
(the state machine, the audit dispatch, the storage interface) but do
want to extend (add a factor, add a workflow, add a backend). Keeping
the two in separate crates makes the swap and the extension into
independent operations. A change in axess-factors does not invalidate
orchestrator code; a change in axess-core does not touch the verifier
crate.
The line also shows up in the dependency direction. axess-core
depends on axess-factors, never the reverse.
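The split can be pictured with a small std-only sketch. Everything in it is hypothetical: the FactorVerifier trait, PinVerifier, the two-variant FactorKind, and the Orchestrator registry are invented to show the shape of the boundary and are not the axess types. The verifier half knows how to check one credential; the orchestrator half only knows which kinds exist and how to dispatch to them.

```rust
use std::collections::HashMap;

// ---- verifier side (stand-in for a factors crate) ----

/// Hypothetical verifier trait: one credential algorithm per implementation.
pub trait FactorVerifier: Send + Sync {
    fn verify(&self, presented: &[u8]) -> bool;
}

pub struct PinVerifier {
    pub pin: Vec<u8>,
}

impl FactorVerifier for PinVerifier {
    fn verify(&self, presented: &[u8]) -> bool {
        // Accumulate differences instead of returning at the first mismatch.
        presented.len() == self.pin.len()
            && presented.iter().zip(&self.pin).fold(0u8, |acc, (a, b)| acc | (a ^ b)) == 0
    }
}

// ---- orchestrator side (stand-in for a core crate) ----

/// Hypothetical sum type naming the factors the orchestrator can compose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FactorKind {
    Pin,
    Custom,
}

pub struct Orchestrator {
    verifiers: HashMap<FactorKind, Box<dyn FactorVerifier>>,
}

impl Orchestrator {
    pub fn new() -> Self {
        Self { verifiers: HashMap::new() }
    }

    /// Swapping a verifier is a registration change, not an orchestrator change.
    pub fn register(mut self, kind: FactorKind, verifier: Box<dyn FactorVerifier>) -> Self {
        self.verifiers.insert(kind, verifier);
        self
    }

    /// Unregistered kinds fail closed.
    pub fn verify_factor(&self, kind: FactorKind, presented: &[u8]) -> bool {
        self.verifiers.get(&kind).map_or(false, |v| v.verify(presented))
    }
}
```

Note that the verifier half never mentions FactorKind or Orchestrator, which is the one-way dependency described above.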
The one exception that proves the rule
axess-core hosts one piece of code that does not fit the
RP-side-orchestrator framing: the in-process IdP under
crate::local_idp (feature local-idp, off by default). LocalIdp
mints workload-identity JWTs on-host, which is OP-side issuance, not
verifier composition. It lives in axess-core deliberately, not by
oversight. The choice is between two costs: carve LocalIdp into a
sibling crate that mirrors the verifier/issuer split at workspace
shape, or accept one feature-gated OP-side module inside the
orchestrator crate. The carve-out has been considered (see the
ROADMAP) and rejected on the same reasoning that retired the earlier
axess-delegated crate: the structural benefit is real but small,
the maintenance overhead of an additional workspace member is real,
and no adopter is asking for LocalIdp as a separate dependency.
Adopters who do not enable local-idp pay nothing for it; adopters
who do enable it find it through axess::local_idp::* regardless of
which crate hosts the implementation.
The internal layout reflects the boundary even when the crate
boundary does not. Primitives shared between the production
LocalIdp and the test LocalIdpFixture live in
axess-core/src/local_idp/primitives.rs, outside the testing/
tree, so production code does not have to import from a test module.
The fixture itself stays under crate::testing::local_idp and
imports the primitives, which is the dependency direction the prior
arrangement got backwards.
The three state slices
Most authentication libraries conflate three independent state machines into one bag of fields and call the result a "session". Axess keeps them separate. This is not a stylistic choice; the slices answer different questions, change on different cadences, and are owned by different concerns.
flowchart LR
subgraph auth["Authentication state"]
direction TB
s1["Guest"] --> s2["Identifying"]
s2 --> s3["Authenticating"]
s3 --> s4["Authenticated"]
s3 --> s5["PendingWorkflow"]
s5 --> s4
end
subgraph authz["Authorisation state"]
direction TB
a1["AuthzStore<br/><i>policies + schema<br/>(loaded once)</i>"]
a2["AuthzSession<br/><i>per-request facade</i>"]
a3["AuthzEntityProvider<br/><i>app-supplied graph</i>"]
a1 --> a2
a3 --> a2
end
subgraph principal["Principal state"]
direction TB
p1["Principal::Human"]
p2["Principal::Workload"]
end
Authentication state is AuthState, the session state machine
covered in Part II. It transitions through factor verification, lives
inside SessionData behind a cookie, and is what AuthnService::verify_factor
mutates. It answers the question "is this caller authenticated, and to
what tier?" It changes on factor verification, which is rare in
absolute terms.
Authorisation state is AuthzStore, holding the Cedar policy set and
its schema, loaded once at startup. A per-request AuthzSession then
evaluates those policies against an entity graph that the application
supplies through an AuthzEntityProvider. The AuthzSession does not
live in the session; it is rebuilt fresh per request. This slice
answers the question "is this principal allowed to perform this action
against this resource?" The store itself changes only when policies are
redeployed, which is even rarer.
Principal state is Principal { Human | Workload }. A human
principal carries a UserId and TenantId; a workload principal
carries a WorkloadId. The principal is extracted from the
authentication state for humans and from a workload-identity resolver
(bearer JWT, mTLS, K8s service account, and so on) for non-humans. It
changes on every single request.
The slices are independent because they answer different questions and change on different cadences. Treating them as one bag conflates the questions and the cadences. Keeping them apart lets each evolve without disturbing the others.
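A minimal sketch of the principal slice, under loud assumptions: the AuthState here is reduced to two variants, the IDs are plain integers and strings rather than the typed UserId, TenantId, and WorkloadId, and the "svc:" prefix check is a placeholder for real workload-token validation (bearer JWT, mTLS, and so on). The order of precedence between the two sources is also an assumption of the example.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Principal {
    Human { user_id: u64, tenant_id: u64 },
    Workload { workload_id: String },
}

/// Reduced, hypothetical authentication state for the example.
pub enum AuthState {
    Guest,
    Authenticated { user_id: u64, tenant_id: u64 },
}

/// Resolves the caller for one request. Humans come from the authentication
/// state; workloads come from a (placeholder) bearer credential.
pub fn resolve_principal(state: &AuthState, bearer: Option<&str>) -> Option<Principal> {
    if let AuthState::Authenticated { user_id, tenant_id } = state {
        return Some(Principal::Human { user_id: *user_id, tenant_id: *tenant_id });
    }
    bearer
        // Placeholder for token validation: accept only "svc:<id>".
        .and_then(|token| token.strip_prefix("svc:"))
        .filter(|id| !id.is_empty())
        .map(|id| Principal::Workload { workload_id: id.to_string() })
}
```

The function takes the authentication state as input and produces a separate value, which is the sense in which the slices stay independent: neither type contains the other.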
Deterministic simulation testing
Every place in axess that reads wall time or sources entropy on the hot path goes through an injected trait. This is the discipline that lets the test suite be reproducible and that lets subtle timing or ordering bugs become failing tests rather than rare incidents.
Two traits carry the foundation. The first is Clock:
pub trait Clock: Send + Sync {
fn now(&self) -> chrono::DateTime<chrono::Utc>;
}
pub struct SystemClock; // delegates to chrono::Utc::now()
pub struct MockClock { /* ... */ } // advances under test control
The second is SecureRng:
pub trait SecureRng: Send + Sync {
fn fill_bytes(&self, dest: &mut [u8]);
}
pub struct SystemRng; // delegates to getrandom
pub struct MockRng { /* ... */ } // seeded; reproduces byte sequence
A small detail matters here. SecureRng::fill_bytes takes &self, not
&mut self, so a single Arc<dyn SecureRng> can be shared across tasks
without wrapping it in a lock or plumbing a mutable borrow through
every call site. The mock implementation guards its internal counter
with a Mutex, which gives it interior mutability, keeps it Send + Sync,
and serialises concurrent use. The trade is a single locked critical
section per random fill, which is irrelevant on the authentication hot
path.
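The following sketch shows the &self-plus-Mutex arrangement with only the standard library. The trait signature is copied from the chapter; CountingRng, shared_rng, and new_session_id are hypothetical names for this example, and the generator is a plain byte counter that is deterministic and not remotely secure.

```rust
use std::sync::{Arc, Mutex};

pub trait SecureRng: Send + Sync {
    fn fill_bytes(&self, dest: &mut [u8]);
}

/// Test-only stand-in: emits seed, seed+1, seed+2, ... as bytes.
pub struct CountingRng {
    counter: Mutex<u64>,
}

impl CountingRng {
    pub fn new(seed: u64) -> Self {
        Self { counter: Mutex::new(seed) }
    }
}

impl SecureRng for CountingRng {
    fn fill_bytes(&self, dest: &mut [u8]) {
        // One critical section per fill; the lock provides the interior
        // mutability that a &self method needs.
        let mut counter = self.counter.lock().expect("rng mutex poisoned");
        for byte in dest.iter_mut() {
            *byte = (*counter & 0xff) as u8;
            *counter = counter.wrapping_add(1);
        }
    }
}

/// Type-erased handle, as a service field would hold it.
pub fn shared_rng(seed: u64) -> Arc<dyn SecureRng> {
    Arc::new(CountingRng::new(seed))
}

/// Call sites take a shared reference; no mutable borrow is threaded through.
pub fn new_session_id(rng: &dyn SecureRng) -> [u8; 16] {
    let mut id = [0u8; 16];
    rng.fill_bytes(&mut id);
    id
}
```

Two generators built from the same seed produce the same identifiers in the same order, which is what lets a test assert on exact session IDs.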
The wiring matches. AuthnService<I, F> holds Arc<dyn SecureRng> and
Arc<dyn Clock> as construction-time fields. The service is generic
over the identity store (I) and factor store (F) but type-erased
over clock and RNG, so swapping in MockRng or MockClock does not
change the service's type signature. Tests do this with
.with_rng(MockRng::new(seed)) and .with_clock(MockClock::default());
production wires SystemRng and SystemClock.
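The effect of holding the clock as a type-erased field can be sketched as follows. To stay std-only, this Clock returns a Duration since an arbitrary epoch rather than the chrono::DateTime<chrono::Utc> shown above, and the mock's at and advance methods as well as the OtpWindow type are hypothetical. The point is only that time-dependent logic becomes testable by advancing the clock by hand.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Simplified clock for the example: time elapsed since an arbitrary epoch.
pub trait Clock: Send + Sync {
    fn now(&self) -> Duration;
}

pub struct MockClock {
    now: Mutex<Duration>,
}

impl MockClock {
    pub fn at(start: Duration) -> Self {
        Self { now: Mutex::new(start) }
    }

    /// Moves time forward under test control.
    pub fn advance(&self, by: Duration) {
        let mut now = self.now.lock().expect("clock mutex poisoned");
        *now += by;
    }
}

impl Clock for MockClock {
    fn now(&self) -> Duration {
        *self.now.lock().expect("clock mutex poisoned")
    }
}

/// Hypothetical time-bounded credential, e.g. a one-time-password window.
pub struct OtpWindow {
    clock: Arc<dyn Clock>,
    issued_at: Duration,
    ttl: Duration,
}

impl OtpWindow {
    pub fn open(clock: Arc<dyn Clock>, ttl: Duration) -> Self {
        let issued_at = clock.now();
        Self { clock, issued_at, ttl }
    }

    /// Valid while strictly less than `ttl` has elapsed since issuance.
    pub fn is_valid(&self) -> bool {
        self.clock.now().saturating_sub(self.issued_at) < self.ttl
    }
}
```

OtpWindow has no type parameter for the clock, so swapping the mock for a system-backed implementation would not change its signature, which mirrors the design described for AuthnService.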
The same discipline extends to backends. The pattern is uniform: the
production implementation talks to a real database or external service,
and a Mock* implementation does the same thing in memory under test
control.
| Trait | Production implementation | Test mock |
|---|---|---|
| AuthnBackend | real database | MockBackend |
| SessionRegistry | Valkey or memory | MemorySessionRegistry |
| OAuthProvider | HTTP plus JWKS cache | MockOAuthProvider |
| Fido2Provider | WebAuthn ceremony | MockFido2Provider |
| LdapProvider | LDAP directory | MockLdapProvider |
| DeviceStore | SQL or Valkey | MemoryDeviceStore |
| DeviceResolver | header or IP | RedactedResolver, NoopDeviceResolver |
A complete login including session-registry interactions, factor
verification, refresh-token rotation, and audit emission can be
exercised in a #[tokio::test] with no database, no Valkey, no
network. The same test that detects a regression on a development
laptop detects it in CI without further configuration.
One carve-out is worth naming. The axess-cache crate has an opt-in
moka-cache feature that runs Moka's wall-clock-driven background
eviction. That feature breaks DST and is documented as breaking it.
The default ClockTtlCache takes a Clock trait and is DST-clean. If
your test suite runs against the default configuration, you are inside
the determinism envelope.
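What "takes a Clock trait and is DST-clean" means in practice can be sketched with a small cache whose only notion of time is the injected clock. This is not the ClockTtlCache implementation; TtlCache, ManualClock, and the Duration-based Clock are assumptions for the example, and LRU eviction and single-flight are left out entirely.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Simplified clock for the example: time elapsed since an arbitrary epoch.
pub trait Clock: Send + Sync {
    fn now(&self) -> Duration;
}

pub struct ManualClock(Mutex<Duration>);

impl ManualClock {
    pub fn new() -> Self {
        Self(Mutex::new(Duration::ZERO))
    }

    pub fn advance(&self, by: Duration) {
        *self.0.lock().expect("clock mutex poisoned") += by;
    }
}

impl Clock for ManualClock {
    fn now(&self) -> Duration {
        *self.0.lock().expect("clock mutex poisoned")
    }
}

/// Entries expire lazily on read; there is no background eviction thread,
/// so nothing happens unless the injected clock says time has passed.
pub struct TtlCache<V> {
    clock: Arc<dyn Clock>,
    ttl: Duration,
    entries: HashMap<String, (Duration, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(clock: Arc<dyn Clock>, ttl: Duration) -> Self {
        Self { clock, ttl, entries: HashMap::new() }
    }

    pub fn insert(&mut self, key: &str, value: V) {
        let expires_at = self.clock.now() + self.ttl;
        self.entries.insert(key.to_string(), (expires_at, value));
    }

    pub fn get(&mut self, key: &str) -> Option<V> {
        let now = self.clock.now();
        let fresh = self
            .entries
            .get(key)
            .filter(|(expires_at, _)| now < *expires_at)
            .map(|(_, value)| value.clone());
        if fresh.is_none() {
            // Drop the expired (or absent) entry.
            self.entries.remove(key);
        }
        fresh
    }
}
```

A cache with its own wall-clock-driven eviction would expire entries at moments the test cannot control, which is the reason such a feature falls outside the determinism envelope.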
Storage backends
Identity persistence is adopter-owned. Axess does not prescribe a user or tenant or factor schema, because every application already has one and the schemas do not agree on much. What axess does prescribe is the trait surface you implement, split into three tiers so that adopters can narrow what they have to write.
The narrowest tier is IdentityLookup, with ten read verbs. It is
enough to support a read-replica path or a test fixture. The middle
tier, IdentityAuthnLog: IdentityLookup, adds four per-attempt audit
writes; it is required for production because lockout decisions depend
on the audit log. The widest tier, IdentityAdmin: IdentityAuthnLog,
adds nine verbs covering privileged provisioning, suspension, and GDPR
erasure, and is required for any control-plane surface.
The umbrella alias IdentityStore: IdentityAdmin preserves the
all-three-tiers shape for production backends. NoopAuthnLog is an
adapter that wraps an IdentityLookup and satisfies the
IdentityAuthnLog signature with a no-op, suitable for fixtures and
read-replica contexts. Production must implement IdentityAuthnLog
directly, however; the noop disables lockout, which is a security
posture you do not want by accident.
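The tiering and the no-op adapter can be sketched with supertraits. The trait names follow the chapter, but each trait is cut down to one or two invented methods (find_user with a single argument, record_failure, recent_failures), and MemoryIdentity and is_locked_out are hypothetical; the real traits carry the verb counts listed above.

```rust
use std::collections::HashMap;

/// Narrowest tier: reads only.
pub trait IdentityLookup {
    fn find_user(&self, identifier: &str) -> Option<u64>;
}

/// Middle tier: adds per-attempt audit writes on top of the read tier.
pub trait IdentityAuthnLog: IdentityLookup {
    fn record_failure(&mut self, user_id: u64);
    fn recent_failures(&self, user_id: u64) -> u32;
}

pub struct MemoryIdentity {
    users: HashMap<String, u64>,
    failures: HashMap<u64, u32>,
}

impl MemoryIdentity {
    pub fn with_user(name: &str, id: u64) -> Self {
        let mut users = HashMap::new();
        users.insert(name.to_string(), id);
        Self { users, failures: HashMap::new() }
    }
}

impl IdentityLookup for MemoryIdentity {
    fn find_user(&self, identifier: &str) -> Option<u64> {
        self.users.get(identifier).copied()
    }
}

impl IdentityAuthnLog for MemoryIdentity {
    fn record_failure(&mut self, user_id: u64) {
        *self.failures.entry(user_id).or_insert(0) += 1;
    }
    fn recent_failures(&self, user_id: u64) -> u32 {
        self.failures.get(&user_id).copied().unwrap_or(0)
    }
}

/// Adapter: satisfies the middle tier for any read-only lookup by discarding writes.
pub struct NoopAuthnLog<L>(pub L);

impl<L: IdentityLookup> IdentityLookup for NoopAuthnLog<L> {
    fn find_user(&self, identifier: &str) -> Option<u64> {
        self.0.find_user(identifier)
    }
}

impl<L: IdentityLookup> IdentityAuthnLog for NoopAuthnLog<L> {
    fn record_failure(&mut self, _user_id: u64) {}
    fn recent_failures(&self, _user_id: u64) -> u32 {
        0
    }
}

/// Lockout reads the attempt log, so it is only as good as the log behind it.
pub fn is_locked_out(store: &impl IdentityAuthnLog, user_id: u64, max_failures: u32) -> bool {
    store.recent_failures(user_id) >= max_failures
}
```

With the no-op adapter, is_locked_out can never return true for a non-zero threshold, however many failures are reported, which is the accidental posture the paragraph above warns about.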
Session, refresh-token, and device storage have first-party backends for the obvious targets:
| Trait | Memory | SQLite | Postgres | MySQL | Valkey |
|---|---|---|---|---|---|
| SessionStore | always-on | sqlite | postgres | mysql | valkey |
| SessionRegistry | always-on | (adopter) | (adopter) | (adopter) | valkey |
| RefreshTokenStore | always-on | adopter | adopter | adopter | adopter |
| DeviceStore | device | device, sqlite | device, postgres | (adopter) | device, valkey |
| DelegatedCredentialStore | always-on | adopter | adopter | adopter | adopter |
The word "adopter" means axess defines the trait and provides a memory
implementation; the SQL or Valkey-backed implementation is yours. The
chapter Identity store implementation walks through the pattern, and
examples/sqlite/ ships a complete one.
Session backends are also re-exported through the facade under the
axess::backends::{sqlite, postgres, mysql, valkey, memory} namespace.
Application code writes use axess::backends::sqlite::{SessionStore, DeviceStore}
rather than stitching together flat SqliteSessionStore,
SqlDeviceStoreError, and similar symbols. The grouping is a facade
detail; backend module paths inside axess-core are internal.
The generic Store<K, V> surface
All five session backends also implement the generic
axess_core::store::Store<SessionId, SessionData> trait. Adopters who
want a backend-agnostic key/value-with-TTL surface (test doubles,
generic operations endpoints, multi-backend deployments) can hold an
Arc<dyn Store<…>> or a generic S: Store<…> and dispatch uniformly.
SessionStore stays the primary surface for session-domain operations
(cycle, find_sessions_for_user) because those carry primitives the
generic Store deliberately omits.
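As a rough picture of backend-agnostic dispatch, the sketch below defines a generic key/value trait and holds one implementation behind Arc<dyn Store<…>>. The method names (put, get, delete) are invented, the trait is synchronous, and TTL handling is omitted; the real axess_core::store::Store signature may differ on all three points.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified key/value surface (no TTL, no async).
pub trait Store<K, V>: Send + Sync {
    fn put(&self, key: K, value: V);
    fn get(&self, key: &K) -> Option<V>;
    fn delete(&self, key: &K) -> bool;
}

pub struct MemoryStore<K, V> {
    inner: Mutex<HashMap<K, V>>,
}

impl<K, V> MemoryStore<K, V> {
    pub fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }
}

impl<K: Eq + Hash + Send, V: Clone + Send> Store<K, V> for MemoryStore<K, V> {
    fn put(&self, key: K, value: V) {
        self.inner.lock().expect("store mutex poisoned").insert(key, value);
    }
    fn get(&self, key: &K) -> Option<V> {
        self.inner.lock().expect("store mutex poisoned").get(key).cloned()
    }
    fn delete(&self, key: &K) -> bool {
        self.inner.lock().expect("store mutex poisoned").remove(key).is_some()
    }
}

/// A generic operations helper that never learns which backend it talks to.
pub fn purge(store: &dyn Store<String, String>, keys: &[&str]) -> usize {
    keys.iter().filter(|k| store.delete(&k.to_string())).count()
}

pub fn memory_backend() -> Arc<dyn Store<String, String>> {
    Arc::new(MemoryStore::<String, String>::new())
}
```

Session-domain operations such as cycling an identifier have no place in a surface this small, which is consistent with SessionStore remaining the primary trait.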
A fully codec-parameterised SqlStore<K, V, C: Codec<V>> was evaluated
and rejected. The dialect-specific SQL bodies are too thin to justify
the sqlx::Database bound noise: only ON CONFLICT versus
ON DUPLICATE KEY UPDATE plus three placeholder styles differ. The
slice that does dedupe cleanly lives in
session/storage/sql_helpers.rs.
Naming conventions
A reviewer reading axess code can predict a type's responsibility from its prefix and suffix. The conventions are tight on purpose; they let you scan a module index without reading any function bodies.
Type prefixes
| Prefix | Scope | Examples |
|---|---|---|
| Auth* | Shared across authentication and authorisation | AuthSession, AuthState, AuthEvent, AuthMethod, AuthPrincipal |
| Authn* | Authentication only | AuthnService, AuthnError, AuthnScope, AuthnBackend |
| Authz* | Authorisation only | AuthzStore, AuthzSession, AuthzDecision, AuthzError |
Auth* is shared infrastructure. Authn* is what you reach for when
handling a login attempt. Authz* is what you reach for when deciding
whether a request may proceed. If you see a function that takes
AuthSession and returns AuthzDecision, you know without opening it
that it is bridging authentication state into authorisation evaluation.
Type suffixes
| Suffix | Meaning |
|---|---|
| *Outcome | Multi-variant result from an authentication operation (LoginOutcome, FactorOutcome, SignupOutcome) |
| *Decision | Binary allow/deny verdict (AuthzDecision) |
| *Config | Configuration or parameters (SessionConfig, TotpConfig, RateLimitConfig) |
| *Store | Persistence trait or implementation (SessionStore, IdentityStore, DeviceStore) |
| *Registry | Session validity tracking (SessionRegistry, MemorySessionRegistry) |
| *Provider | External integration trait (OAuthProvider, Fido2Provider, LdapProvider) |
| *Resolver | Extract typed value from a request (DeviceResolver, PrincipalResolver) |
| *Error | Error type (AuthnError, OAuthError, CryptoError) |
| *Builder | Builder pattern (SessionConfigBuilder, AuthEventBuilder) |
The conventions are not retroactive style guides. They are how the public surface is built today. New types adopt them; PR review catches violations.
Method verb conventions
| Verb | Semantics | Examples |
|---|---|---|
| get_* | Lookup by primary key, deterministic, O(1) | get_user(id) |
| find_* | Search by business criteria, may scan | find_user(identifier, tenant) |
| load_* / save_* | Deserialise / serialise persisted state | load_factor(scope, kind) |
| begin_* / complete_* | Multi-step ceremony start / finish | begin_login(), complete_oauth_login() |
| verify_* | Check a credential or assertion | verify_factor() |
If you read find_user_by_email, you know it may be O(n) and may miss.
If you read get_user, you know the id was already validated and the
call should succeed unless the user was deleted.
Visibility
Internal types for cross-module use within axess-core
(SessionHandle, SessionInner, LoadOutcome, FinalizeOutcome)
are pub(crate). The public API surface is defined by the re-exports
in axess-core's lib.rs and the facade in axess's lib.rs. The
default for new types is pub(crate); promotion to pub requires
concrete demand.
Security invariants
Three invariants run through every part of the workspace. They are not advice; they are enforced by lints, by review, and in some cases by the type system.
The first is #![forbid(unsafe_code)], declared at the root of every
crate. There is no unsafe code in axess. There never will be unsafe
code in axess. If a future change needs it, the change goes elsewhere.
The second is constant-time comparison for any byte-level secret
check. HMAC cookie verification, TOTP code verification, OAuth CSRF
state, refresh-token device binding, session fingerprint: all of these
compare bytes through subtle::ConstantTimeEq. The alternative, ==
on bytes, leaks timing information and is rejected at review.
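The difference between the two comparisons is easiest to see side by side. The sketch below is a hand-rolled illustration only: axess is described as using subtle::ConstantTimeEq, which also guards against compiler optimisations that this plain loop does not, so ct_eq_sketch is a teaching aid rather than something to ship.

```rust
/// Slice equality as `==` performs it: free to stop at the first mismatch,
/// so the running time can depend on how many leading bytes match.
pub fn early_exit_eq(a: &[u8], b: &[u8]) -> bool {
    a == b
}

/// Illustrative constant-time-style comparison: always walks every byte and
/// folds the differences into one accumulator. Length is treated as public.
pub fn ct_eq_sketch(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // stays 0 only if every pair matched
    }
    diff == 0
}
```

Both functions return the same answers; only the second avoids a data-dependent exit, and in production code that guarantee should come from a vetted crate.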
The third is secret zeroization on drop. Password hashes are wrapped
in ZeroizedString. TOTP and HOTP shared secrets use Zeroizing. The
session signing key zeroes its bytes in its Drop impl. The
discipline is not perfect (an attacker with sufficient memory access
can still win), but the surface is reduced.
The full production posture, including integration requirements and compliance touch-points, is in Security posture.
What lives where, in one paragraph
If you read nothing else from this chapter: state machines, storage,
middleware, federation adapters, device identity, and OBO/delegated
access live in axess-core. Factor algorithm primitives (Argon2id,
TOTP, HOTP) live in axess-factors. Typed IDs and the principal
enum live in axess-identity. Anything that delegates to time or
randomness goes through axess-clock or axess-rng. Adopters
depend on the axess facade; the internal split is free to
reorganise behind that boundary.
Everything else is detail. The rest of the book is detail.
Further reading
- The session state machine covers the five-state machine in full, including PendingWorkflow.
- Factors and methods covers verifier composition, method authoring, and the scope hierarchy.
- Cedar policy fundamentals covers policy loading, the evaluator, and the entity provider contract.
- Session lifecycle and crypto envelope covers the cookie shape, the AES-256-GCM envelope, and fingerprint binding.
- Contributing covers the AX-NNN policy, the DST non-negotiable, and the naming conventions tied back to this chapter.
Getting started
This chapter is the on-ramp. It assumes you can read Rust and have seen Axum, but it does not assume you know axess. The goal by the end is a small running Axum application that logs a user in with a password, holds the session in a signed cookie, and rejects requests to a protected route until the login is complete.
We will skip the database for as long as possible. Replacing the
in-memory backend with a real SQLite backend is a one-trait swap,
covered at the end and walked through in detail in Identity store
implementation and the working examples/sqlite/ reference application.
If you already have an Axum application and want the punch list: add
the dependencies in Dependencies, drop in the SessionLayer and
AuthnService from The minimum viable wiring, and wire the login
handler from Adding password login. The rest of the chapter is
rationale and a tour of the production-shaped example.
Prerequisites
You need Rust 1.87 or later on the stable channel (the workspace MSRV),
Axum 0.8.x, and a Tokio runtime in your binary (#[tokio::main] is
fine). Axess does not depend on system libraries, message brokers, or
external IdPs by default. The defaults are deliberately zero-infra: the
in-memory session store, the in-memory backend, and the password, TOTP,
HOTP, and email-OTP factors all work out of the box for development and
tests.
Dependencies
The shortest functional Cargo.toml looks like this.
[dependencies]
axess = "0.2" # facade -- depend on this, never on the internal crates
axum = "0.8"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
tower = "0.5" # transitively from axum, but listed for clarity
The defaults of the axess facade enable authz and device.
Everything else is opt-in via features. For this chapter we will also
turn on memory, the in-memory session store used for development and
tests.
axess = { version = "0.2", features = ["memory"] }
The complete feature reference lives in the crate-level docs on docs.rs
and is surveyed in the project's README.
Per-feature chapters in this book (Backends, OAuth, and so on)
state their required feature at the top.
The minimum viable wiring
A minimum axess setup has four moving pieces, used in the same order
they are wired. The backend looks up users and verifies their factors.
The session store persists session data across requests. A signing key
HMAC-signs the session cookie so it cannot be tampered with. The
AuthnService is what handlers reach for to drive the state machine.
On top of those four pieces sits one Tower layer, SessionLayer,
which reads the cookie at the start of every request, hydrates the
session, and writes it back on response.
Here is the whole thing in one file. We will walk through each line right after.
use axess::{
    AuthnService, InMemoryBackend, InMemorySessionStore, SessionLayer, AuthSession,
};
use axum::{Router, routing::get, response::IntoResponse, http::StatusCode};
use std::{sync::Arc, time::Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Backend -- one type implements both IdentityStore and FactorStore.
    let backend = InMemoryBackend::new()
        .with_user_password("alice", "default", "Gnomes2+");

    // 2. Session store + 3. signing key.
    let session_store = InMemorySessionStore::new();
    let signing_key: [u8; 32] = [0; 32]; // PLACEHOLDER, see "Signing keys" below.

    // 4. AuthnService -- type-erased over clock and RNG; production wires
    //    SystemClock + SystemRng.
    let service = Arc::new(AuthnService::new(backend.clone(), backend));

    // 5. SessionLayer threads the session through each request.
    let session_layer = SessionLayer::new(session_store, signing_key)
        .with_ttl(Duration::from_secs(86_400))
        .with_secure(false); // dev only -- see "Cookie security" below.

    let app = Router::new()
        .route("/", get(public_page))
        .route("/dashboard", get(protected_page))
        .with_state(service)
        .layer(session_layer);

    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}

async fn public_page() -> &'static str {
    "everyone can see this"
}

async fn protected_page(session: AuthSession) -> impl IntoResponse {
    if session.is_authenticated() {
        (StatusCode::OK, "welcome").into_response()
    } else {
        (StatusCode::UNAUTHORIZED, "log in first").into_response()
    }
}
This compiles and runs. Visiting http://127.0.0.1:3000/ returns
"everyone can see this". Visiting /dashboard returns 401, because no
session is authenticated yet. Adding the login flow is the next
section.
What each line is doing
InMemoryBackend::new() constructs a backend that holds users, factor
configurations, and authentication-attempt logs in memory. The
convenience method with_user_password seeds one user (alice, in
tenant default, with the Argon2id-hashed password Gnomes2+).
Production replaces this with a real backend that implements
IdentityStore and FactorStore against your database. The trait
surface is identical.
InMemorySessionStore::new() is the trivial session backend. Session
data lives in a HashMap behind an RwLock, and disappears on
process exit. The first replacement is
axess::backends::sqlite::SessionStore (with the sqlite feature),
covered in Backends.
AuthnService::new(backend.clone(), backend) takes two arguments
because the identity store and the factor store can be different
types. In the in-memory case they are the same object, hence the
clone. In production they typically remain the same struct (a single
backend implementing both traits), again with a clone.
SessionLayer::new(store, key) constructs the Tower layer. The
chained .with_ttl(Duration::from_secs(86_400)) sets a one-day session
lifetime, and .with_secure(false) permits HTTP cookies for local
development. See Cookie security below for the production setting.
AuthSession is an Axum extractor. Receiving it as a handler argument
hydrates the session for the current request, and is_authenticated()
returns true only when the state is AuthState::Authenticated. There
are also is_guest(), is_authenticating(), and a typed .state()
accessor if you want to match on the enum directly.
Adding password login
The with_user_password convenience configures a single-factor method
called password. A login involves two kinds of HTTP request. The
first is POST /login with a JSON body carrying the username and
password. Axess transitions the session from Guest to
Authenticating, verifies the password, and on success transitions to
Authenticated. Every request after that carries the cookie that
identifies the session, and AuthSession reads Authenticated.
use axess::{AuthnService, AuthSession, FactorCredential, InMemoryBackend, LoginOutcome};
use axum::{extract::State, response::IntoResponse, http::StatusCode, Json};
use serde::Deserialize;
use std::sync::Arc;
#[derive(Deserialize)]
struct LoginForm {
    username: String,
    password: String,
}
async fn login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<InMemoryBackend, InMemoryBackend>>>,
    Json(form): Json<LoginForm>,
) -> impl IntoResponse {
    // 1. Begin the login. Transitions Guest -> Authenticating.
    if let Err(e) = service.begin_login(&session, &form.username, "default").await {
        return (StatusCode::UNAUTHORIZED, format!("{e}")).into_response();
    }
    // 2. Verify the password factor.
    match service
        .verify_factor(
            &session,
            FactorCredential::Password(form.password.clone()),
        )
        .await
    {
        Ok(LoginOutcome::Authenticated { .. }) => {
            (StatusCode::OK, "logged in").into_response()
        }
        Ok(LoginOutcome::AwaitingFactor { remaining }) => {
            // Unreachable for a password-only method, but the branch matters
            // when chaining factors (password + TOTP, etc).
            (StatusCode::OK, format!("need more factors: {remaining:?}")).into_response()
        }
        Err(e) => (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }
}
The call to begin_login is what transitions the session from
Guest to Authenticating. The transition records the user id, the
tenant, and the list of factors still required (just Password for a
single-factor method). Then verify_factor consumes one factor and
returns a LoginOutcome. The successful terminal case is
LoginOutcome::Authenticated, meaning every required factor has
passed. The intermediate case is LoginOutcome::AwaitingFactor,
meaning the factor verified but more are required; the state stays
Authenticating and remaining lists what is still needed.
The branching is the whole point of the explicit state machine. There
is no version of "logged in" that means "we believe one factor, you
can let them in". is_authenticated() returns true only when every
required factor has passed.
Wiring the login route
The minimum-viable router picks up the new handler:
let app = Router::new()
.route("/", get(public_page))
.route("/login", axum::routing::post(login))
.route("/dashboard", get(protected_page))
.with_state(service)
.layer(session_layer);
A login flow now works end-to-end. Start the server, curl once to
log in, hold the cookie, curl again to reach /dashboard.
$ curl -c jar -X POST http://127.0.0.1:3000/login \
-H 'content-type: application/json' \
-d '{"username":"alice","password":"Gnomes2+"}'
logged in
$ curl -b jar http://127.0.0.1:3000/dashboard
welcome
What just happened
A full request walks the following path. The numbers correspond to the wiring steps from The minimum viable wiring.
The browser sends the request with a Cookie: header carrying the
session id. SessionLayer (5) extracts the cookie, verifies its HMAC
signature against the signing key, looks up the session in the
InMemorySessionStore (2), and rebuilds the AuthState. Axum
invokes the handler with the hydrated AuthSession extractor. The
handler reads or mutates the session through AuthnService (4), and
mutations flag the session dirty. On response, SessionLayer
re-serialises the session if it is dirty, re-signs the cookie, and
sets it on the response.
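The hydrate-then-write-back loop described above can be sketched with standard-library types only. Everything in this block (ToySession, ToyStore, handle_request) is a hypothetical stand-in rather than the axess API, and cookie signing is left out so that only the dirty-flag behaviour is visible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the session payload; not the axess type.
#[derive(Clone, Debug, PartialEq)]
pub struct ToySession {
    pub authenticated: bool,
    pub dirty: bool,
}

/// Hypothetical in-memory store keyed by session id.
#[derive(Default)]
pub struct ToyStore {
    pub rows: HashMap<String, ToySession>,
    pub writes: usize,
}

/// Hydrate the session for `session_id`, run the handler, and persist the
/// session only if the handler marked it dirty.
pub fn handle_request<F>(store: &mut ToyStore, session_id: &str, handler: F)
where
    F: FnOnce(&mut ToySession),
{
    // Unknown ids hydrate as a fresh, unauthenticated session.
    let mut session = store
        .rows
        .get(session_id)
        .cloned()
        .unwrap_or(ToySession { authenticated: false, dirty: false });

    handler(&mut session);

    // Write back only when something changed.
    if session.dirty {
        session.dirty = false;
        store.rows.insert(session_id.to_string(), session);
        store.writes += 1;
    }
}
```

A handler that only reads the session costs no store write in this sketch, which is the property the dirty flag exists to provide.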
The state machine, the backend, the session store, and the layer are independent moving parts. Swapping the in-memory backend for a SQLite-backed one does not touch the state machine or the session store. Swapping the session store for Postgres does not touch the state machine or the backend.
Signing keys
The example uses [0; 32] as the signing key. That is fine for a
five-minute demonstration. It is not fine for anything else.
In production the signing key is a 32-byte random value loaded from a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, sealed Kubernetes secrets, or your platform's equivalent). The key must be stable across process restarts; the HMAC of an existing session cookie is computed with this key, and if the key changes underneath, every existing session becomes invalid on the next request.
Rotating the signing key is supported via
SessionLayer::with_previous_key, which keeps the old key available
for a transitional period so that sessions signed with the previous
key continue to validate while new sessions sign with the new one.
The Operations runbook walks through the rotation sequence in
detail.
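The rotation behaviour can be illustrated with a small sketch: sign with the current key, accept tags made with either the current or the previous key. ToySigner and toy_tag are hypothetical names, and toy_tag is deliberately not a real MAC (it uses the standard library's hasher so the block needs no crates). A real implementation would use HMAC and a constant-time comparison.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy keyed tag. NOT cryptographic: a stand-in for HMAC so the sketch
/// compiles with the standard library alone.
fn toy_tag(key: &[u8; 32], message: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(key);
    h.write(message.as_bytes());
    h.finish()
}

/// Hypothetical signer holding the current key and an optional previous one.
pub struct ToySigner {
    pub current: [u8; 32],
    pub previous: Option<[u8; 32]>,
}

impl ToySigner {
    /// New cookies are always signed with the current key.
    pub fn sign(&self, session_id: &str) -> u64 {
        toy_tag(&self.current, session_id)
    }

    /// Accept a tag made with the current key, or with the previous key
    /// while one is still configured.
    pub fn verify(&self, session_id: &str, tag: u64) -> bool {
        if toy_tag(&self.current, session_id) == tag {
            return true;
        }
        match &self.previous {
            Some(old) => toy_tag(old, session_id) == tag,
            None => false,
        }
    }
}
```

Once the previous key is dropped from the configuration, cookies signed with it stop validating, which is what ends the transitional period.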
Cookie security
Setting .with_secure(false) in the example permits the cookie to be
sent over HTTP, which is necessary for localhost development. In
production, you terminate TLS at the edge and call .with_secure(true).
The cookie will then only be sent over HTTPS. The other defaults are
already production-shaped: HttpOnly is on, SameSite=Lax is set,
and the cookie path is the application root.
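As a rough illustration of what those attributes look like on the wire, the helper below assembles a Set-Cookie header value. The function name, the cookie name, the use of Path=/ for "application root", and the Max-Age attribute derived from the TTL are all assumptions made for the sketch; axess builds its own header.

```rust
/// Build a Set-Cookie header value with the attributes described above.
/// Hypothetical helper: the real layer constructs this internally.
pub fn set_cookie_value(name: &str, value: &str, secure: bool, max_age_secs: u64) -> String {
    // HttpOnly and SameSite=Lax are always present in this sketch.
    let mut out = format!(
        "{name}={value}; Path=/; HttpOnly; SameSite=Lax; Max-Age={max_age_secs}"
    );
    // Secure is the only attribute that differs between dev and production.
    if secure {
        out.push_str("; Secure");
    }
    out
}
```

With secure set to false the header omits the Secure attribute, so a browser will send the cookie over plain HTTP on localhost.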
The Cookies, fingerprinting, hijack detection chapter covers the
rest of the surface: the HMAC fingerprint binding that detects when a
session cookie is replayed from a different user agent, the
trusted-proxy configuration that controls how X-Forwarded-For is
interpreted, and the SameSite=Strict trade-off.
Going further
This chapter is deliberately the minimum. The real
examples/sqlite/
extends the same shape with everything you will actually want in
production: a real SQLite backend (OurBackend implements
IdentityStore and FactorStore over a sqlx::SqlitePool), a
SQLite-backed session store with AES-256-GCM encryption at rest, a
password + TOTP two-factor login for a second user, self-service
signup and TOTP enrollment, a password-reset flow with email-OTP,
rate limiting on the auth routes, a health check on the session
store, atomic auth-attempt counters exposed at /metrics, and a
background interval task that purges expired sessions. Read the
example, run it, compare its app.rs to the snippet in this chapter.
The shape is the same; there are simply more pieces wired in.
After that, the order in which you read the rest of the book depends on your goal.
| Goal | Next chapter |
|---|---|
| Add a second factor (TOTP, FIDO2, OAuth) | Factors and methods |
| Replace InMemoryBackend with your database | Identity store implementation |
| Switch the session store to Postgres, MySQL, or Valkey | Backends: SQLite, Postgres, MySQL, Valkey |
| Add authorisation policies | Cedar policy fundamentals |
| Run multiple tenants | Multi-tenancy |
| Federated login (Google, Okta, Azure AD) | OAuth 2.0 and OIDC |
| Workload identity for non-human callers | Workload identity overview |
| Production deployment | Operations runbook |
Common stumbling points
A handful of failures bite first-time integrators. They are worth naming up front so the chapter that solves them is easy to find.
If your handler cannot see AuthSession, the extractor needs the
layer to populate request extensions. Add use axess::AuthSession;
and check that SessionLayer is in .layer(...) on the router.
If begin_login returns UserNotFound, the tenant probably does
not match. The example seeds alice in tenant default; passing a
different tenant returns UserNotFound deliberately, not
"user exists in a different tenant". Axess never leaks tenant
membership across tenant boundaries.
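The tenant-scoped lookup can be pictured as a map keyed by the pair (tenant, username), so that a wrong tenant and a wrong username are indistinguishable to the caller. ToyDirectory and LookupError below are hypothetical names for this sketch, not the IdentityStore trait.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum LookupError {
    UserNotFound,
}

/// Hypothetical directory keyed by (tenant, username).
#[derive(Default)]
pub struct ToyDirectory {
    users: HashMap<(String, String), u64>,
}

impl ToyDirectory {
    pub fn insert(&mut self, tenant: &str, username: &str, user_id: u64) {
        self.users
            .insert((tenant.to_string(), username.to_string()), user_id);
    }

    /// A miss returns the same error whether the tenant or the username
    /// was wrong, so the caller learns nothing about other tenants.
    pub fn find(&self, tenant: &str, username: &str) -> Result<u64, LookupError> {
        self.users
            .get(&(tenant.to_string(), username.to_string()))
            .copied()
            .ok_or(LookupError::UserNotFound)
    }
}
```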
If sessions disappear on process restart, that is correct for
InMemorySessionStore. Use SqliteSessionStore,
PostgresSessionStore, or ValkeySessionStore (with their
respective features) for persistence. See Backends.
If you need to attach application data to a session, SessionData has
a custom field for that. The size cap is 64 KiB to keep oversize
cookies from becoming a DoS surface. See Session lifecycle and
crypto envelope §"Custom session data".
If you need to log a user out, call AuthSession::clear() or
service.logout(&session).await. Either one resets the state to Guest,
rotates the session id (defeating fixation), and clears the cookie on
response.
Each of these has a dedicated chapter or section later in the book. The goal here was to get you running, not to be complete. You are running. The rest is detail.
The session state machine
AuthState is the most important type in axess. Everything that matters
about an authenticated session, both for the type system and for a
reviewer reading a handler, is captured by which of its five variants
you are looking at. The whole library is built around the
representational claim that authentication is not a boolean, not a
flag column, and not a row in a sessions table that the handler reads
and then trusts. It is an enum, transitions on the enum are methods on
the enum, and a partial login is a distinct variant rather than a
"finished" session with one field missing.
This chapter walks through the variants, the transition method, the
outcome enum that dispatches transitions, the PendingWorkflow escape
hatch for signup and password reset, and the orchestration-versus-pure
split that keeps the state machine independently testable.
The five variants
The enum lives at
axess-core/src/session/data.rs
in the workspace. Each variant carries exactly the data its phase needs.
There is no field on Authenticated for "current factor being verified"
because at that point no factor is in progress, and there is no field
on Guest for "tenant" because no user has been identified yet. The
absence is the point.
pub enum AuthState {
Guest,
Identifying {
user_id: UserId,
tenant_id: TenantId,
},
Authenticating {
user_id: UserId,
tenant_id: TenantId,
method_name: Arc<str>,
remaining: Vec<FactorKind>,
completed: Vec<FactorKind>,
attempt_count: u32,
last_attempt: Option<DateTime<Utc>>,
},
Authenticated {
user_id: UserId,
tenant_id: TenantId,
authn_time: DateTime<Utc>,
factors_completed: Vec<FactorKind>,
},
PendingWorkflow {
user_id: UserId,
tenant_id: TenantId,
workflow: WorkflowState,
},
}
Guest is the default. A request with no cookie, or a cookie whose
session has been logged out or expired, arrives at the handler with an
AuthSession whose state is Guest. There is no user identity in
scope.
Identifying is the brief intermediate state for flows that prompt for
a username before asking for any credential. Most applications skip it
and go straight from Guest to Authenticating. It exists for the
two-page login pattern where step one collects the identifier and step
two collects the password, possibly with the identifier carried over a
hidden form field or a short-lived intermediate token. The variant
records who is being identified but says nothing about credentials.
Authenticating is where most of the action happens. The session knows
who it is trying to authenticate, which method is in progress (because
a tenant might have multiple methods, and the choice is locked in
before any factor runs), what factors are still required, what factors
have already been verified this attempt, how many credential attempts
have been made, and when the last attempt landed. The last two fields
exist because lockout decisions depend on them. A method that allows
three attempts before locking the user out for fifteen minutes needs
exactly this information, and putting it in the variant rather than in
a side table keeps the decision local and reviewable.
Authenticated is the terminal success state. It carries the user id,
the tenant id, the moment of successful authentication (for the audit
trail), and the list of factors that were used. The factor list is
load-bearing. A tenant policy that requires Fido2 for certain routes
can check factors_completed.contains(&FactorKind::Fido2) directly,
without consulting an external store.
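A route guard built on that idea might look like the following sketch. It uses a cut-down ToyAuthState with three variants and a three-member FactorKind, both simplified stand-ins for the real types, and allows_route is a hypothetical helper name.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
}

/// Simplified stand-in for AuthState with only the fields the guard needs.
#[derive(Clone, Debug, PartialEq)]
pub enum ToyAuthState {
    Guest,
    Authenticating { remaining: Vec<FactorKind> },
    Authenticated { factors_completed: Vec<FactorKind> },
}

/// True only for a finished login that included the required factor.
pub fn allows_route(state: &ToyAuthState, required: &FactorKind) -> bool {
    match state {
        ToyAuthState::Authenticated { factors_completed } => {
            factors_completed.contains(required)
        }
        // A partial login never passes, even if the factor is still pending.
        ToyAuthState::Guest | ToyAuthState::Authenticating { .. } => false,
    }
}
```

Because the match is exhaustive, adding a variant to the enum forces the guard to be revisited at compile time.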
PendingWorkflow is the variant most adopters do not initially expect
and end up reaching for once they ship a real signup flow. It models
the state where a user has authenticated enough to identify themselves
but is in the middle of a multi-step ceremony (signup, password reset,
email verification, or a custom workflow) and should not be treated as
fully logged in until the ceremony completes. The variant wraps a
WorkflowState that records which workflow is in progress, which step
the user is on, and when the workflow started.
pub struct WorkflowState {
pub kind: WorkflowKind,
pub current_step: u32,
pub total_steps: u32,
pub initiated_at: DateTime<Utc>,
}
pub enum WorkflowKind {
Signup,
PasswordReset,
EmailVerification,
Custom(Arc<str>),
}
Custom(Arc<str>) is the extension point. If your application has a
KYC flow, a hardware-key registration flow, or a multi-step recovery
ceremony, you name it as a custom workflow and the session machinery
treats it like the built-in kinds. The string is held as an
Arc<str> because workflow names recur, and cloning an Arc<str> shares
one allocation instead of copying the string on a busy login surface.
The transition method
Factor verification is the only mutation that the state machine
exposes. The transition is AuthState::advance_factor, which takes a
FactorKind and a timestamp, and returns an AdvanceOutcome that
tells the caller what just happened.
impl AuthState {
pub(crate) fn advance_factor(
&mut self,
kind: &FactorKind,
authn_time: DateTime<Utc>,
) -> AdvanceOutcome { ... }
}
pub enum AdvanceOutcome {
NotApplicable,
StillAuthenticating,
Completed,
}
The visibility on the method is pub(crate), which is the choice that
keeps the orchestration honest. The pure state mutation is reachable
only from within axess-core. Application code never calls it
directly. Instead, application code calls AuthnService::verify_factor,
which is the orchestrator method that locks the session, performs the
factor's cryptographic verification through axess-factors, calls
advance_factor on the typed state, and dispatches on the returned
outcome.
The three outcomes are exhaustive. NotApplicable means the call was
made against a state that does not accept factor verification (you
cannot verify a factor against a Guest session, for instance).
StillAuthenticating means the factor verified and more factors are
required to complete the method. Completed means the final required
factor for this method just passed, and the session should transition
to Authenticated. The orchestration layer translates Completed
into a typed Authenticated variant with the right authn_time and
factors_completed, applies session id rotation to defeat fixation,
and writes the session back to the store.
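The division of labour between the pure transition and the orchestrator can be sketched as follows. This is a simplified model under stated assumptions, not the axess source: ToyAuthState has three variants, timestamps are plain u64 seconds instead of DateTime<Utc>, the toy advance_factor takes no timestamp, and treating an unexpected factor kind as NotApplicable is a choice made for the sketch. The promote function stands in for the orchestration step.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum FactorKind {
    Password,
    Totp,
}

#[derive(Debug, PartialEq)]
pub enum AdvanceOutcome {
    NotApplicable,
    StillAuthenticating,
    Completed,
}

/// Simplified stand-in for AuthState.
#[derive(Debug, PartialEq)]
pub enum ToyAuthState {
    Guest,
    Authenticating { remaining: Vec<FactorKind>, completed: Vec<FactorKind> },
    Authenticated { authn_time: u64, factors_completed: Vec<FactorKind> },
}

impl ToyAuthState {
    /// Pure transition: move `kind` from `remaining` to `completed`.
    pub fn advance_factor(&mut self, kind: &FactorKind) -> AdvanceOutcome {
        match self {
            ToyAuthState::Authenticating { remaining, completed } => {
                // Assumption: a kind that is not pending is NotApplicable.
                let Some(pos) = remaining.iter().position(|k| k == kind) else {
                    return AdvanceOutcome::NotApplicable;
                };
                completed.push(remaining.remove(pos));
                if remaining.is_empty() {
                    AdvanceOutcome::Completed
                } else {
                    AdvanceOutcome::StillAuthenticating
                }
            }
            _ => AdvanceOutcome::NotApplicable,
        }
    }
}

/// Orchestrator-side promotion after a Completed outcome.
pub fn promote(state: &mut ToyAuthState, authn_time: u64) {
    if let ToyAuthState::Authenticating { remaining, completed } = state {
        if remaining.is_empty() {
            let factors_completed = std::mem::take(completed);
            *state = ToyAuthState::Authenticated { authn_time, factors_completed };
        }
    }
}
```

Nothing in the block needs an async runtime or a lock, which is the property that makes the pure half cheap to unit-test.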
The orchestration split
The split between AuthState (the pure data and pure transition
methods) and AuthSession (the Axum extractor with its RwLock,
dirty flag, and side-effect dispatch) is a deliberate choice with two
payoffs.
The first payoff is testability. Unit tests on the state machine do not
need tokio, do not need RwLock, do not need a fake session store,
and do not need an extractor harness. They construct an AuthState
directly, call advance_factor (or one of the other pub(crate)
transition methods), and assert on the resulting variant. A regression
in the transition logic surfaces as a one-line test against the enum,
not as an integration test against a contrived HTTP request.
The second payoff is auditability. Every orchestration side effect (id
rotation, fingerprint binding, dirty-flag handling, store write-back)
lives in one place (the SessionService::call() method, walked through
in Session lifecycle and crypto envelope) rather than scattered
across transition methods. A code review of the orchestration is
self-contained, and so is a code review of the state machine;
neither has to mentally reconstruct the other.
The pattern is worth naming because it shows up again in the runtime. Pure state machines compose cleanly with async orchestrators that hold the locks and dispatch side effects, and the two halves get reviewed and tested independently.
When Authenticated is and is not the right shape
The natural temptation when integrating axess for the first time is to
treat Authenticated as the "done" state and Guest as the "not done"
state, and to ignore the intermediate variants. Resist it. The
intermediate variants are how axess represents real-world flows that
do not fit a binary, and reaching into them lets your application
behave correctly without inventing parallel state on the side.
A signup flow that captures a username and password, mints a session,
and then asks the user to verify their email before granting any
access should sit in PendingWorkflow with workflow.kind set to
WorkflowKind::EmailVerification, not in Authenticated. A handler that
protects the dashboard checks is_authenticated(), which returns true
only for the Authenticated variant, so the user sees the
email-verification page until the ceremony completes. On completion
the session transitions to Authenticated, the same handler lets the
user in, and the application does not need a "needs to verify email"
column on the users table.
A password-reset flow follows the same pattern with
WorkflowKind::PasswordReset. The user proves identity (with an
email-OTP, say), the session enters PendingWorkflow, the
password-reset page becomes accessible, the user submits a new
password, and the session transitions back to Guest (forcing them to
log in fresh with the new password). The reset page is unreachable
from Guest and unreachable from Authenticated, which is correct in
both directions: a not-logged-in user should not see it, and a fully
logged-in user does not need it.
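The reachability rule for the reset page reduces to a single pattern match. The sketch below uses a flattened ToyAuthState (the real PendingWorkflow variant wraps a WorkflowState and carries user and tenant ids), and can_open_reset_page and finish_reset are hypothetical helper names.

```rust
#[derive(Debug, PartialEq)]
pub enum WorkflowKind {
    Signup,
    PasswordReset,
}

/// Flattened stand-in for AuthState; the workflow kind sits directly on
/// the variant to keep the sketch short.
#[derive(Debug, PartialEq)]
pub enum ToyAuthState {
    Guest,
    Authenticated,
    PendingWorkflow { kind: WorkflowKind },
}

/// The reset page is reachable only mid-reset, never from Guest or
/// Authenticated.
pub fn can_open_reset_page(state: &ToyAuthState) -> bool {
    matches!(
        state,
        ToyAuthState::PendingWorkflow { kind: WorkflowKind::PasswordReset }
    )
}

/// Completing the reset sends the session back to Guest so the user must
/// log in with the new password. Returns false if no reset was in progress.
pub fn finish_reset(state: &mut ToyAuthState) -> bool {
    if can_open_reset_page(state) {
        *state = ToyAuthState::Guest;
        true
    } else {
        false
    }
}
```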
The pattern generalises to any post-identification ceremony. The
typical question to ask is "should the user be considered fully logged
in during this step?" If the answer is no, PendingWorkflow is the
right variant. If the answer is yes, and you simply want the user to
do something next, then Authenticated plus a flag on the user record
fits better.
Logging out and identifier rotation
AuthnService::logout (and AuthSession::clear, which calls into it)
transitions any state to Guest. The transition is more than a state
change. The session identifier is rotated, the cookie is cleared on
the response, the session row is deleted from the session store, and
an audit event is emitted. The combination defeats session fixation.
Even if an attacker knew the session id before logout, the id changes
on the next login.
The orchestration layer also rotates the session id at the transition
to Authenticated, for the same reason. A user who logs in receives a
new session id, distinct from any id observed while they were a guest.
The cookie is reissued; the old id is unreachable on subsequent
requests. The rotation is invisible to application code and lives in
the orchestration; the state machine just sees the variant change.
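Identifier rotation amounts to re-keying the session row under a fresh id. The sketch below is a hypothetical model: ToyRegistry is not the axess session store, ids are u64 for brevity, and the next_id closure stands in for whatever random source the orchestration draws from.

```rust
use std::collections::HashMap;

/// Hypothetical registry mapping session id to a state label.
#[derive(Default)]
pub struct ToyRegistry {
    pub rows: HashMap<u64, String>,
}

impl ToyRegistry {
    /// Move the row stored under `old_id` to a freshly generated id.
    /// Returns the new id, or None if `old_id` is unknown.
    pub fn rotate(&mut self, old_id: u64, mut next_id: impl FnMut() -> u64) -> Option<u64> {
        let row = self.rows.remove(&old_id)?;
        let new_id = next_id();
        self.rows.insert(new_id, row);
        Some(new_id)
    }
}
```

After rotation the old id no longer resolves, so a cookie captured before login cannot be replayed against the authenticated session.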
Custom session data
Real applications need to attach data to a session that axess does not
model: a preference, a feature-flag selection, a partial form draft.
SessionData has a custom field for this, and the size cap is 64 KiB.
The cap exists because the session is round-tripped through a cookie
(or its server-side analogue), and a session that grows without bound
becomes a DoS surface. That is enough for almost any sensible use;
anything larger probably belongs in the database keyed by user id
rather than in the session.
Adding a custom field is purely additive. The SessionData struct
exposes custom: HashMap<String, serde_json::Value> (the
implementation may evolve, but the field-with-cap shape is stable),
and you write through accessor methods on the session handle. The
state-machine variants do not change. The schema-migration story
covered in Schema migration handles upgrade paths without breaking
existing sessions.
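A capped accessor could be sketched along these lines. The names (ToySessionData, set_custom, CapExceeded) are hypothetical, values are plain strings rather than serde_json::Value, and measuring size as the sum of key and value byte lengths is an assumption; the chapter does not say how axess measures the cap.

```rust
use std::collections::HashMap;

/// 64 KiB cap, as described in the text.
pub const CUSTOM_CAP_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
pub struct CapExceeded {
    pub attempted: usize,
}

/// Hypothetical stand-in for the custom-data part of SessionData.
#[derive(Default)]
pub struct ToySessionData {
    custom: HashMap<String, String>,
}

impl ToySessionData {
    /// Assumed size measure: total bytes of all keys and values.
    fn custom_size(&self) -> usize {
        self.custom.iter().map(|(k, v)| k.len() + v.len()).sum()
    }

    /// Insert or replace a value, rejecting writes that would exceed the cap.
    pub fn set_custom(&mut self, key: &str, value: String) -> Result<(), CapExceeded> {
        // Bytes freed if this key already exists and is being replaced.
        let replaced = self.custom.get(key).map_or(0, |old| key.len() + old.len());
        let attempted = self.custom_size() - replaced + key.len() + value.len();
        if attempted > CUSTOM_CAP_BYTES {
            return Err(CapExceeded { attempted });
        }
        self.custom.insert(key.to_string(), value);
        Ok(())
    }
}
```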
What this enables
The state machine is the foundation that lets the rest of the book be
shorter. Factor composition (Factors and methods) works because
Authenticating::remaining is a list, not a single field. Step-up
authentication works because the orchestrator can transition from
Authenticated to Authenticating with a non-empty remaining list
when a sensitive route demands a stronger factor. Cedar authorisation
works because Authenticated carries factors_completed, which the
entity provider can serialise into a Cedar attribute the policy can
match on. Audit events work because every transition produces a
distinct AuthEvent variant with the right fields populated.
None of these features required a different enum; they all read out of the state machine that was already there. The enum carries the authentication question, and the rest of the library asks it.
Further reading
The chapters that build directly on this one are Factors and methods
(which factors fit into the variants, and how methods compose), Scope
hierarchy (how begin_login picks the right method given Global,
Tenant, and User overrides), and Refresh tokens and session continuity
(how a login is kept alive over long periods, how key rotation is
handled, and how token theft is detected). Session lifecycle and
crypto envelope in Part V covers the cookie, the encryption envelope,
and the orchestration's dirty-flag handling.
Factors and methods
A factor is a single credential check: a password, a TOTP code, a WebAuthn assertion, an LDAP bind, an OAuth token exchange. A method is a sequence of factors that together count as a successful login. Composing factors into methods, and scoping methods to apply per-user or per-tenant rather than globally, is the day-to-day surface adopters work with. This chapter explains the vocabulary, the types that carry it, and the pattern for adding a factor that axess does not ship.
Vocabulary
The four words that recur are factor, step, method, and scope. They sound interchangeable in casual writing, and they are not in the code.
A factor is one credential verifier, identified by a FactorKind
variant: Password, Totp, Hotp, EmailOtp, Fido2, LdapBind,
or Federated(FederatedProvider). Each factor has a config struct
(PasswordConfig, TotpConfig, and so on) that the adopter seeds at
provisioning time and the service reads at verification time.
A step is one node in a method. A step is either a Required(kind)
demand for a specific factor, or an AnyOf(vec![kind1, kind2, ...])
disjunction that lets the user choose among several factors at that
position. The step is the unit of authoring; a method is a sequence
of steps.
A method is an ordered sequence of steps with a stable name. Examples
in the wild: "password-only" (one step, Required(Password)),
"password-then-TOTP" (two steps, Required(Password) then
Required(Totp)), "password-then-second-factor" (two steps,
Required(Password) then AnyOf(vec![Totp, Fido2, EmailOtp])). The name
matters because the session records which method is in progress, and
the audit trail names the method when recording success or failure.
A scope is the tier at which a method is configured. There are three tiers (Global, Tenant, and User), covered in detail in Scope hierarchy. The short version: a global default applies everywhere; a tenant can override it; a user can override the tenant. Resolution is the simple inversion of authority: user override beats tenant override beats global default.
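The resolution order is short enough to write down directly. The function below is a hypothetical sketch that represents each tier's configured method by its name; the real resolver works on full method definitions.

```rust
/// Pick the method name that applies: user override first, then tenant
/// override, then the global default.
pub fn resolve_method<'a>(
    user: Option<&'a str>,
    tenant: Option<&'a str>,
    global: &'a str,
) -> &'a str {
    user.or(tenant).unwrap_or(global)
}
```

A user-level override wins even when the tenant also sets one, and the global default is used only when neither narrower tier has an opinion.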
The factor list
The current FactorKind enum and its companion config sum-type live in
axess-core/src/authn/factor.rs.
pub enum FactorKind {
Password,
Totp,
Hotp,
EmailOtp,
Fido2,
LdapBind,
Federated(FederatedProvider),
}
pub enum FederatedProvider {
Github,
Google,
Microsoft,
Custom(String),
}
pub enum FactorConfig {
Password(PasswordConfig),
Totp(TotpConfig),
Hotp(HotpConfig),
EmailOtp(EmailOtpConfig),
Fido2(Fido2Config),
LdapBind(LdapBindFactorConfig),
// Federated configs live with their provider's verifier crate.
}
FactorKind is the discriminator the state machine carries.
FactorConfig is the data the verifier needs. They mirror each other
because the verifier-versus-orchestrator split (see Architecture at
a glance) puts the algorithm and its config in axess-factors and
puts the discriminator and the composition machinery in axess-core.
A new factor lands as a new FactorKind variant, a new FactorConfig
variant, and a new verifier crate (or module) under axess-factors.
The federated case is intentionally a parameterised variant rather than a flat list. Each federated provider has its own configuration shape (Google's audience claim differs from GitHub's; Microsoft adds tenant directory parameters), and the wire formats are different enough that flattening them into one enum would require a discriminator inside the config. Parameterising the kind itself makes the config sum-type smaller and the type system honest about the variation.
Custom(String) is the extension point for IdPs the upstream library
does not name explicitly. Adopters who federate against Okta, Auth0,
Azure AD as a generic OIDC provider, or an in-house IdP plug in with
the OAuth-RS resolver and a custom string identifier; the workload
identity chapter (Workload identity overview) describes the same
pattern from the inbound-resolver side.
How factors compose
The composition primitives are FactorStep and Method. A FactorStep
is one node in a method. A Method is a vector of steps plus a name.
pub enum FactorStep {
Required(FactorKind),
AnyOf(Vec<FactorKind>),
}
pub struct Method {
pub name: Arc<str>,
pub steps: Vec<FactorStep>,
}
The two-step Required(Password) then AnyOf(vec![Totp, Fido2])
method handles a common shape: the user must enter their password,
then must complete one of two second factors, and the choice of second
factor is theirs (perhaps because they have not registered a passkey
yet, or perhaps because their phone is at home and they only have
their hardware key with them). The state machine's
Authenticating::remaining field carries the residue of steps yet to
complete: after the password step, remaining looks like
[AnyOf(vec![Totp, Fido2])] and the application's login page renders
the choice between them.
Required(kind) is shorthand for a one-element AnyOf(vec![kind]),
but the distinction matters for audit clarity. A successful login that
went password + totp reads cleanly when the audit log records
"completed Required(Totp)"; the same login through an AnyOf step
records "completed AnyOf::Totp" and a reviewer asks why the choice
was offered at all. Use Required when there is no choice.
The orchestrator does not support arbitrary expression trees of factors (you cannot say "two of these three" with a single step). The omission is on purpose. Real authentication methods are short sequences with at most one decision point per step, and admitting arbitrary expressions would invite policies that pass formal review but defeat operational understanding.
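Walking a method one step at a time can be sketched as below. The FactorKind here is a three-member subset, and accepts and advance are hypothetical helpers written for the illustration; they are not part of the published surface.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
}

#[derive(Clone, Debug, PartialEq)]
pub enum FactorStep {
    Required(FactorKind),
    AnyOf(Vec<FactorKind>),
}

impl FactorStep {
    /// Does a factor of this kind satisfy the step?
    pub fn accepts(&self, kind: FactorKind) -> bool {
        match self {
            FactorStep::Required(k) => *k == kind,
            FactorStep::AnyOf(kinds) => kinds.contains(&kind),
        }
    }
}

/// Consume the front step if `kind` satisfies it. Returns whether the
/// method advanced; steps are strictly ordered, so only the front counts.
pub fn advance(remaining: &mut Vec<FactorStep>, kind: FactorKind) -> bool {
    let ok = remaining.first().map_or(false, |step| step.accepts(kind));
    if ok {
        remaining.remove(0);
    }
    ok
}
```

Presenting a TOTP code before the password does not advance the method in this sketch, because the front step is still Required(Password).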
The verify_factor path
Application code drives factor verification through
AuthnService::verify_factor. The signature is
pub async fn verify_factor(
&self,
credential: &FactorCredential,
session: &AuthSession,
) -> Result<FactorOutcome, AuthnError<I::Error>>;
with FactorCredential the runtime credential value:
pub enum FactorCredential {
Password(ZeroizedString),
OtpCode(Arc<str>),
Fido2Assertion(serde_json::Value),
}
and FactorOutcome the result of the call:
pub enum FactorOutcome {
Authenticated,
FactorRequired(FactorKind),
InvalidCredential,
Locked { until: Option<DateTime<Utc>> },
}
The handler in your application takes the credential off the request
(form body, JSON, header, whatever), wraps it in the right
FactorCredential variant, and calls verify_factor. Three things
then happen inside the service.
First, the service acquires the session's write lock and reads its
current state. If the state is not Authenticating, the call returns
an AuthnError. If the state is Authenticating, the service inspects
remaining to determine which factor is expected next. A mismatch
between the credential the client supplied and the factor the method
expects returns FactorOutcome::InvalidCredential without engaging
the verifier, which keeps the cryptographic cost of failed attempts
predictable.
Second, the service dispatches to the appropriate verifier in
axess-factors. The password case calls Argon2id. The TOTP case calls
the RFC 6238 verifier with the user's stored secret and the current
window. The FIDO2 case calls the WebAuthn ceremony, which is itself
stateful and threads through the session's challenge field. Federated
cases dispatch to their respective OAuth or OIDC handlers.
Third, the service translates the verifier's result into a
FactorOutcome and an AdvanceOutcome from the state machine. A
successful verification calls AuthState::advance_factor, which
returns Completed if no factors remain (the orchestrator promotes
the session to Authenticated) or StillAuthenticating if more
factors are required (the orchestrator leaves the state in
Authenticating and returns FactorOutcome::FactorRequired(kind)
with the next expected kind). A failed verification increments
attempt_count, updates last_attempt, and returns
FactorOutcome::InvalidCredential or Locked depending on the
attempt policy.
The Locked outcome is the lockout decision in band. The
until: Option<DateTime<Utc>> field carries the unlock time when one
is scheduled (a five-minute exponential backoff after three attempts,
for instance) or None when the lockout requires administrative
intervention. The application surfaces this to the user with the right
copy; the audit log records the lockout regardless.
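The failed-attempt branch can be modelled as a small pure function. AttemptPolicy and record_failure are hypothetical names, the FactorOutcome here carries only the two failure variants, timestamps are u64 seconds rather than DateTime<Utc>, and the fixed lockout duration is a simplification of whatever backoff schedule a real policy uses.

```rust
#[derive(Debug, PartialEq)]
pub enum FactorOutcome {
    InvalidCredential,
    Locked { until: Option<u64> },
}

/// Hypothetical attempt policy. `lockout_secs: None` models a lockout
/// that only an administrator can clear.
pub struct AttemptPolicy {
    pub max_attempts: u32,
    pub lockout_secs: Option<u64>,
}

/// Record one failed verification and decide the outcome.
pub fn record_failure(attempt_count: &mut u32, now: u64, policy: &AttemptPolicy) -> FactorOutcome {
    *attempt_count += 1;
    if *attempt_count < policy.max_attempts {
        FactorOutcome::InvalidCredential
    } else {
        // The unlock time is present only when the policy schedules one.
        FactorOutcome::Locked {
            until: policy.lockout_secs.map(|secs| now + secs),
        }
    }
}
```

With max_attempts set to 3 and a 900-second lockout, the third consecutive failure returns Locked with an unlock time fifteen minutes out.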
Begin and complete
verify_factor is the verb that drives a method forward, but a login
also has a start and an end. The start is AuthnService::begin_login,
which transitions a Guest session into Authenticating. The end is
the orchestrator's promotion of Authenticating to Authenticated
when the last factor completes (or to PendingWorkflow when a workflow
is in progress).
begin_login does three things that are worth naming explicitly.
First, it looks up the user in the configured identity store, in the
tenant that the caller named, and returns UserNotFound if no user
matches. Second, it loads the method that applies to that user under
the scope hierarchy (covered in Scope hierarchy); the result is the
specific sequence of steps the user will walk. Third, it transitions
the session into Authenticating, with remaining set to the loaded
method's full step list.
complete_signup is the corresponding verb for the PendingWorkflow
case. After a signup ceremony completes (email verified, KYC checks
passed, terms accepted), the orchestrator transitions the session
from PendingWorkflow { kind: Signup, ... } to Authenticated. The
factor list on the resulting Authenticated variant is the list that
was used during the signup, which is what the audit trail wants and
what subsequent policy evaluation reads.
Adding a custom factor
The pattern for adding a factor that axess does not ship is the same pattern that produced the factors that axess does ship. There are four moving parts.
The first part is the verifier itself. It lives in axess-factors
(or in a separate crate that depends on axess-factors) and exposes
a function or trait that takes the stored config plus the runtime
credential and returns a verifier-side result. For a hash-based
factor this is straightforward (compute the hash, constant-time
compare); for a ceremony-based factor (FIDO2, OAuth) the verifier
threads through the session-side challenge and the response.
The second part is the FactorKind variant. Adding a variant is a
breaking change to the public surface, which is what you want: any
match on FactorKind in adopter code now flags a missing arm, and the
adopter chooses to handle the new factor or to reject it with an
explicit pattern. There is no "add a variant silently" mechanism in
axess, and that omission is intentional.
The third part is the FactorConfig variant and the storage adapter
that loads it. The factor config goes into the configured factor
store; the load path resolves the scope (Global, Tenant, User) and
returns the right config for the user being authenticated. Adopters
implement the factor store, so the storage decision is theirs.
The fourth part is the credential type. A new factor that requires a
new shape of input adds a variant to FactorCredential. A factor
that maps to one of the existing variants (a password-like factor
reuses Password, a code-based factor reuses OtpCode) avoids the
addition.
The work is small. The factors that axess ships today each take fewer than a thousand lines of Rust including tests. The reason the work stays small is that the orchestration and the state machine do not change; the verifier is doing one job, behind a fixed contract.
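The four parts can be lined up in one sketch. Everything here is hypothetical: a "recovery code" factor is invented for illustration, the enums only mirror the names used in the text, and the digest function is injected so that the example needs no external crate. Production code would normally use a vetted constant-time comparison (the `subtle` crate, for instance) rather than a hand-written loop.

```rust
// Hypothetical sketch: none of these definitions are the real axess types.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    RecoveryCode, // part two: the new variant
}

pub enum FactorConfig {
    // Part three: what the factor store holds for this factor.
    RecoveryCode { stored_digest: Vec<u8> },
}

pub enum FactorCredential {
    Password(String),
    OtpCode(String), // part four: a code-based factor reuses this variant
}

/// Compare without an early exit on content, so timing does not reveal how
/// many leading bytes matched. Lengths are assumed to be public (fixed-size
/// digests), so the length check may return early.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Part one: the verifier. `digest` stands in for a cryptographic hash.
pub fn verify_recovery_code<D>(
    config: &FactorConfig,
    credential: &FactorCredential,
    digest: D,
) -> bool
where
    D: Fn(&[u8]) -> Vec<u8>,
{
    let FactorConfig::RecoveryCode { stored_digest } = config;
    match credential {
        FactorCredential::OtpCode(code) => {
            constant_time_eq(&digest(code.as_bytes()), stored_digest)
        }
        // A credential of the wrong shape never verifies.
        FactorCredential::Password(_) => false,
    }
}

/// Adopter code: once `RecoveryCode` exists, this match does not compile
/// until the new arm is written.
pub fn prompt_for(kind: FactorKind) -> &'static str {
    match kind {
        FactorKind::Password => "Enter your password",
        FactorKind::Totp => "Enter the code from your authenticator app",
        FactorKind::RecoveryCode => "Enter one of your recovery codes",
    }
}
```

The orchestration and the state machine do not appear in the sketch at all, which is the claim the paragraph above makes.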
Step-up authentication
Step-up is the pattern where an already-Authenticated session is
asked to re-prove identity (or to prove with a stronger factor) before
performing a sensitive action. Axess models this by transitioning the
state from Authenticated back to Authenticating with a non-empty
remaining list. The orchestrator method that drives this is
AuthnService::require_step_up, which takes the session and the
factor or factors the caller demands.
The state-machine view is uniform. The session is Authenticating
again; the remaining list contains the stepped-up factors; the session
remembers (in completed) which factors it already cleared.
verify_factor works the same way it did during the original login,
and on the final Completed outcome the session transitions back to
Authenticated with a fresh authn_time and an updated
factors_completed.
The application controls when step-up is required. The Cedar policy
engine can express "this action requires Fido2 in
factors_completed" (see Cedar policy fundamentals), or the
handler can demand it directly. The state machine does not impose a
policy; it provides the shape that lets the policy be enforced.
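The round trip can be sketched as two transitions. This is an illustration under assumptions: the real AuthState has five variants and richer fields, `complete_factor` is a hypothetical helper, and `authn_time` is a plain `u64` timestamp here instead of a DateTime.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
}

// Two-variant stand-in for the real five-variant state.
#[derive(Debug, PartialEq, Eq)]
pub enum AuthState {
    Authenticated { factors_completed: Vec<FactorKind>, authn_time: u64 },
    Authenticating { remaining: Vec<FactorKind>, completed: Vec<FactorKind> },
}

impl AuthState {
    /// Authenticated -> Authenticating, carrying the cleared factors along.
    pub fn require_step_up(self, demanded: Vec<FactorKind>) -> AuthState {
        match self {
            AuthState::Authenticated { factors_completed, authn_time } => {
                if demanded.is_empty() {
                    // Nothing demanded: the session stays as it was.
                    AuthState::Authenticated { factors_completed, authn_time }
                } else {
                    AuthState::Authenticating {
                        remaining: demanded,
                        completed: factors_completed,
                    }
                }
            }
            other => other,
        }
    }

    /// Record the head factor as verified; promote when none remain.
    pub fn complete_factor(self, now: u64) -> AuthState {
        match self {
            AuthState::Authenticating { mut remaining, mut completed } => {
                if !remaining.is_empty() {
                    let done = remaining.remove(0);
                    if !completed.contains(&done) {
                        completed.push(done);
                    }
                }
                if remaining.is_empty() {
                    // Fresh authn_time, updated factors_completed.
                    AuthState::Authenticated { factors_completed: completed, authn_time: now }
                } else {
                    AuthState::Authenticating { remaining, completed }
                }
            }
            other => other,
        }
    }
}
```

Because the step-up state is the same variant as the login state, the verification path needs no special case for it.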
What this enables
A method composed of a Required(Password) followed by an
AnyOf(vec![Totp, Fido2, EmailOtp]) covers an enormous share of real
deployments without any further structure. A per-tenant override for
a specific tenant that requires Required(Fido2) instead of the
disjunction covers the rare case where one tenant must be stricter.
A per-user override that adds Required(EmailOtp) for a flagged user
covers the regulatory case where one user is on a watch list.
None of these require new code beyond an entry in the method store. The state machine, the verifier dispatch, and the audit pipeline all read the method out of the configured scope and execute it. The next chapter, Scope hierarchy, covers the configuration tier in detail.
Further reading
Scope hierarchy covers Global, Tenant, and User configuration tiers
and how begin_login resolves them at runtime. Cedar policy
fundamentals covers how the policy engine reads
factors_completed and authorises against it. Part III, Factor
cookbooks, has a chapter per real-world factor (Password and TOTP,
FIDO2 and WebAuthn passkeys, OAuth 2.0 and OIDC, and so on) that
walks through the integration details one factor at a time.
Scope hierarchy
Methods and factor configurations live at three tiers: Global, Tenant, and User. The mechanism is simple; the consequences are not. Done well, the three-tier hierarchy makes multi-tenant SaaS deployment feel like one configuration with two override surfaces. Done badly, it becomes a maze where nobody can answer "what method is this user actually using?" without running a query. This chapter walks through the mechanism and the patterns that keep it operationally clear.
The three tiers
AuthnScope lives in
axess-core/src/authn/types.rs.
It is a three-variant enum, ordered from broadest to narrowest:
pub enum AuthnScope {
    Global,
    Tenant(TenantId),
    User { tenant_id: TenantId, user_id: UserId },
}
Global is the workspace-wide default. A method or factor configured
at global scope applies to every user in every tenant unless something
overrides it.
Tenant(TenantId) is a per-tenant override. A method configured at
tenant scope applies to every user in that tenant, overriding the
global default for that tenant.
User { tenant_id, user_id } is a per-user override. A method
configured at user scope applies to that one user, overriding both the
tenant and global defaults for that user.
The ordering is the ordering of authority. Narrower beats broader.
How resolution works
At begin_login time the service needs to know which method this user
should authenticate against. The resolution walks the scope chain from
narrowest to broadest, returning the first match it finds.
The chain helper AuthnScope::lookup_chain produces the ordered
sequence of scopes to query. For a user with tenant_id = T and
user_id = U, the chain is
[User { T, U }, Tenant(T), Global]. The store walks this list and
returns the first configuration it finds. For a factor configuration
the walk looks like this:
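async fn load_factor_with_fallback(
    user_scope: &AuthnScope,
    tenant_id: &TenantId,
    kind: &FactorKind,
) -> Result<Option<FactorConfig>, FactorStoreError> {
    // `factor_store` is the configured factor store, taken to be in scope here.
    // Walk the chain narrowest-first: User, then Tenant, then Global.
    for scope in user_scope.lookup_chain() {
        // The first scope that holds a config for this factor kind wins.
        if let Some(config) = factor_store.load_factor(&scope, kind).await? {
            return Ok(Some(config));
        }
    }
    // No scope in the chain has a config for this factor kind.
    Ok(None)
}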
The same chain is used for each factor in the method. A method that chains password and TOTP looks up the password config first (which might be a user-scoped override) and then the TOTP config (which might be a tenant default). Each factor's configuration is resolved independently, which is the right shape for the common case where the user has chosen their own TOTP device but the tenant has standardised the password policy.
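A synchronous, in-memory sketch makes the independent resolution visible. The types below are simplified assumptions (integer ids, a HashMap in place of the factor store, strings in place of FactorConfig), and `resolve` is a hypothetical helper; only the chain order is taken from the text.

```rust
use std::collections::HashMap;

pub type TenantId = u32;
pub type UserId = u32;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum AuthnScope {
    Global,
    Tenant(TenantId),
    User { tenant_id: TenantId, user_id: UserId },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FactorKind {
    Password,
    Totp,
}

impl AuthnScope {
    /// Scopes to query, narrowest first.
    pub fn lookup_chain(&self) -> Vec<AuthnScope> {
        match self {
            AuthnScope::Global => vec![AuthnScope::Global],
            AuthnScope::Tenant(t) => vec![AuthnScope::Tenant(*t), AuthnScope::Global],
            AuthnScope::User { tenant_id, .. } => vec![
                self.clone(),
                AuthnScope::Tenant(*tenant_id),
                AuthnScope::Global,
            ],
        }
    }
}

/// Stand-in for the factor store: (scope, kind) -> stored config.
pub type Store = HashMap<(AuthnScope, FactorKind), String>;

/// Return the config from the narrowest scope that has one.
pub fn resolve<'a>(store: &'a Store, scope: &AuthnScope, kind: FactorKind) -> Option<&'a String> {
    scope
        .lookup_chain()
        .into_iter()
        .find_map(|s| store.get(&(s, kind)))
}
```

With a tenant-scoped password row and a user-scoped TOTP row in the store, the same user resolves each factor from a different tier, and a user in another tenant falls through to the global row.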
The storage convention matches the tier model. The factor store schema
typically has tenant_id and user_id columns that are nullable,
with the following semantics:
| tenant_id | user_id | Scope |
|---|---|---|
| NULL | NULL | Global |
| set | NULL | Tenant(tenant_id) |
| set | set | User { tenant_id, user_id } |
ScopeColumns is the in-code representation of this pair; it lives
next to AuthnScope and is what the SQL adapters use when building
queries.
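The mapping in the table is small enough to write out. The field names and the conversion functions below are assumptions made for illustration; the text only says that ScopeColumns represents the column pair.

```rust
pub type TenantId = u32;
pub type UserId = u32;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AuthnScope {
    Global,
    Tenant(TenantId),
    User { tenant_id: TenantId, user_id: UserId },
}

/// Assumed shape: one Option per nullable column.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScopeColumns {
    pub tenant_id: Option<TenantId>,
    pub user_id: Option<UserId>,
}

impl From<&AuthnScope> for ScopeColumns {
    fn from(scope: &AuthnScope) -> Self {
        match scope {
            AuthnScope::Global => ScopeColumns { tenant_id: None, user_id: None },
            AuthnScope::Tenant(t) => ScopeColumns { tenant_id: Some(*t), user_id: None },
            AuthnScope::User { tenant_id, user_id } => ScopeColumns {
                tenant_id: Some(*tenant_id),
                user_id: Some(*user_id),
            },
        }
    }
}

impl ScopeColumns {
    /// Inverse mapping. A row with a user but no tenant matches no row of
    /// the table above, so it is rejected instead of guessed at.
    pub fn to_scope(&self) -> Option<AuthnScope> {
        match (self.tenant_id, self.user_id) {
            (None, None) => Some(AuthnScope::Global),
            (Some(t), None) => Some(AuthnScope::Tenant(t)),
            (Some(t), Some(u)) => Some(AuthnScope::User { tenant_id: t, user_id: u }),
            (None, Some(_)) => None,
        }
    }
}
```

The fourth combination (user set, tenant NULL) is the one a schema constraint would ideally forbid; the sketch surfaces it as `None`.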
What gets scoped
The hierarchy applies to three kinds of object: factor configurations, methods, and lockout policies. Each plays the same game, with the same chain-walking resolution.
Factor configurations are the per-factor stored data: the password hash for a user, the TOTP secret for a user, the FIDO2 credential public keys for a user, the LDAP bind parameters for a tenant. Most factor configurations are user-scoped because they belong to a specific user (a password hash is intrinsically per-user). A few are tenant-scoped because they belong to a tenant configuration (LDAP bind parameters, OIDC discovery URLs). A very few are global (the system default Argon2id parameters, the system default TOTP drift window).
Methods are the ordered sequences of factor steps. A tenant typically configures a single default method (password-plus-TOTP, say), and a small minority of tenants override it (a regulated tenant requires FIDO2 instead of TOTP). Individual users very rarely have a custom method; when they do, it is because policy demands a stronger factor for a flagged user.
Lockout policies are the rate and threshold for locking out a user after repeated failed attempts. Defaults are global. Tenants with stricter risk postures override at tenant scope. Per-user lockout policies exist but are rare; they usually mean "this user is on a watch list and gets locked out faster than the rest".
The pattern across all three is identical. Configure a sensible global default. Let tenants override when they have a real reason. Reach for the user-scoped override only when policy demands per-individual differentiation. The more configuration you do at the narrowest scope, the more state you have to reason about during incidents.
Migration patterns
The scope hierarchy is the right tool for rolling out factor changes in a controlled way. The pattern is to introduce the change at the narrowest scope, verify it on a small population, and broaden the scope as confidence accumulates.
A worked example. A SaaS deployment wants to require FIDO2 for all users, replacing the existing password-plus-TOTP method. The cautious roll-out has three phases.
Phase one is a user-scoped pilot. The operations team configures the
new method (Required(Password) then Required(Fido2)) at user scope
for a small set of internal users. These users go through the new
flow first, surface any UX problems, and validate that the FIDO2
ceremony works end-to-end against the application's relying-party
configuration.
Phase two is a tenant-scoped pilot. The team configures the new method at tenant scope for a single early-adopter tenant. Their users transition next, and the pilot widens to a population that includes real customer traffic. The user-scoped overrides from phase one are removed (they no longer differ from the tenant default).
Phase three is the global rollout. With confidence from both pilot phases, the team configures the new method at global scope. The tenant-scoped override for the early-adopter tenant is removed at the same time, since it no longer differs from the global default. The roll-out is complete; the method store has one row (the global default) instead of many.
The pattern works in reverse for emergency revocation. If the new method has a bug that surfaces after global rollout, the team can override at tenant scope or user scope for the affected population without redeploying the application or reverting the global config. The narrower scope wins; the affected users walk the old method while the bug is fixed.
How Cedar policy interacts
The scope hierarchy answers "what method does this user authenticate with?" Cedar answers "what is this user allowed to do once authenticated?" The two surfaces are distinct, and confusing them leads to authorization-as-authentication mistakes.
A common pattern is to use Cedar to require a method outcome rather
than to choose one. A policy might require that
factors_completed.contains("Fido2") for an action against a sensitive
resource. The method itself remains the resolved one from the scope
hierarchy. If the method does not include FIDO2, the user reaches the
sensitive route and gets a deny; the application then offers step-up
to add FIDO2 (covered in Factors and methods §"Step-up
authentication"), the user completes it, and the policy now passes.
The split between choice (scope hierarchy) and demand (Cedar policy) is what makes this work. The hierarchy decides what factors are available; the policy decides which of them are required for which actions. A user can have a stronger method than the policy minimum and satisfy the policy without effort; a user with a weaker method gets prompted for step-up.
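On the handler side the demand reduces to a membership check and a fallback. The sketch below uses a plain Rust function where a real deployment would ask Cedar; `policy_allows`, `guard_sensitive_route`, and `RouteDecision` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RouteDecision {
    Proceed,
    /// The policy denied; the application can offer step-up with this factor.
    OfferStepUp(FactorKind),
}

/// Stand-in for the policy decision "the action requires `demanded` in
/// factors_completed".
pub fn policy_allows(factors_completed: &[FactorKind], demanded: FactorKind) -> bool {
    factors_completed.contains(&demanded)
}

/// The method is not consulted here: only what the session has already proved.
pub fn guard_sensitive_route(
    factors_completed: &[FactorKind],
    demanded: FactorKind,
) -> RouteDecision {
    if policy_allows(factors_completed, demanded) {
        RouteDecision::Proceed
    } else {
        RouteDecision::OfferStepUp(demanded)
    }
}
```

A session that logged in with password and TOTP is offered step-up; once FIDO2 is in factors_completed the same call proceeds.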
Anti-patterns
The hierarchy invites a few mistakes that are worth naming explicitly.
The first is overusing user-scoped configuration. Every user-scoped row in the factor store is a piece of state that an operator has to maintain. If a tenant decides to change its method, the tenant-scoped row updates; the user-scoped overrides do not. After a few months of incremental changes, the user-scoped rows are out of sync with the intended policy, and nobody remembers why each row exists. The fix is to use user scope only when policy genuinely requires per-individual differentiation, and to document the reason in a separate field next to the row.
The second is using the hierarchy as a feature flag. The temptation is to roll out a new factor by user-scoping it to internal users, then forget about the user-scoped rows after the global rollout. The hierarchy is a good migration tool but a bad permanent home for temporary state. After a rollout completes, remove the narrower-scope overrides that no longer differ from the broader-scope default. The audit trail still records the historical use; the live configuration is clean.
The third is conflating method scope with tenant identity. The hierarchy says nothing about which tenants exist; it says only how to resolve a configuration for a given (tenant, user) pair. Tenant provisioning, tenant suspension, and tenant deletion are covered in Multi-tenancy.
What this enables
The hierarchy is the reason an axess deployment scales from "one
company with one method" to "a SaaS with hundreds of tenants, each
with its own posture, and a few high-risk users on stricter policies"
without restructuring the application. The same code path
(begin_login, verify_factor, Authenticated) handles the
single-tenant case and the hundred-tenant case. The only difference is
which scope holds the configuration.
The pattern is not unique to axess. Cedar policies, audit retention policies, and rate-limit thresholds all follow the same three-tier pattern. The vocabulary is consistent across the library so a reviewer who has internalised the resolution rule does not have to re-learn it for each subsystem.
Further reading
Multi-tenancy covers tenant provisioning, the TenantId lifecycle,
cross-tenant refusal, and the three-lever lockout. Cedar policy
fundamentals covers how authorisation policy reads the session's
factors_completed field. Identity store implementation
walks through the storage adapter that resolves the scope chain
against a relational schema.
Refresh tokens and session continuity
A session cookie keeps a user logged in until it expires or is cleared. A refresh token is the mechanism that extends that lifetime past the cookie's short window, without exposing a long-lived bearer credential to the client. The shape of the mechanism matters more than most adopters initially realise, because the choice between "long cookie" and "short cookie plus refresh token" is the choice between "stolen cookie is valid for a day" and "stolen cookie is valid for an hour and then detectable as theft when the legitimate user next refreshes".
This chapter covers the refresh token shape in axess: hash-only
storage, token families for reuse detection, device binding and
cascade revocation, and the configuration surface adopters tune. The
relevant code lives in
axess-core/src/session/refresh.rs.
Why refresh tokens at all
A naive long-lived session is one cookie that lives for a month. If the cookie is stolen, the attacker has a month of access. The legitimate user has no way to know the cookie was stolen unless they notice the attacker's actions in their account.
A short-lived session with a refresh token is two credentials. The session cookie lives for an hour and grants access. The refresh token lives for a month and grants only the right to mint a new session cookie. The refresh exchange happens server-side, typically when the session cookie expires; the client sends the refresh token, the server checks it, and the server issues a fresh session cookie (and optionally a fresh refresh token).
The cost is one extra round-trip per hour. The benefit is twofold. First, a stolen session cookie expires within the hour. Second, and more importantly, a stolen refresh token gets caught the next time either the attacker or the legitimate user attempts to refresh, because the system detects that a token has been used twice and revokes the entire token family.
The stored shape
RefreshToken is the row that lives in the refresh token store:
pub struct RefreshToken {
    pub id: RefreshTokenId,
    pub user_id: UserId,
    pub tenant_id: TenantId,
    pub token_hash: String,
    pub issued_at: DateTime<Utc>,
    pub expires_at: DateTime<Utc>,
    pub revoked: bool,
    pub device_info: Option<String>,
    pub family_id: Option<TokenFamilyId>,
    pub device_id: Option<DeviceId>,
}
Three fields are worth dwelling on.
token_hash is the SHA-256 hash of the token string, not the string
itself. The plaintext token is generated when the token is issued
(through SecureRng, so that deterministic simulation tests can
reproduce it), returned to the client once, and never
stored. The hash is what lives in the database. A database breach that
leaks every row of the refresh token store does not leak any usable
token, because the hash is one-way. The verification path hashes the
client-supplied plaintext and compares it constant-time against the
stored hash.
The hashing uses an optional pepper, configured through
RefreshTokenConfig::hash_pepper. When set, the hash is
HMAC-SHA256(pepper, plaintext); when unset, the hash is plain
SHA-256(plaintext). The pepper is a deployment-level secret stored
outside the database (in the secrets manager that holds the session
signing key, typically) and adds defence in depth: an attacker who
breaches the database alone cannot mount an offline brute-force attack
against the hashes.
family_id is the link to the token's lineage. Every refresh token
issued in a single authentication chain shares a TokenFamilyId. The
first token issued at login starts a family; each subsequent token
issued by rotation extends the same family. When the system detects
that a token from a family has been used after rotation (which is
what theft looks like), it revokes the entire family.
device_id is the link to the device identity ladder. When a refresh
token is bound to a device, revoking the token can cascade to revoke
the device, and revoking the device cascades to revoke every token
bound to it. The cascade is bidirectional and is the mechanism that
makes "log out everywhere on this device" work in practice. Device
identity covers the device ladder in detail.
How families catch theft
The interesting part of the design is the family. The mechanism is worth walking through with a concrete sequence.
Alice logs in. The server issues refresh token A, in family F. A is
delivered to her browser; the hash of A is stored in the database
with family_id = F.
An hour later, Alice's session cookie expires. Her browser sends A back to refresh. The server hashes the plaintext, finds the row, verifies it is not revoked, marks A as revoked (rotation), and issues a new refresh token B in the same family F. B is delivered to the browser.
Meanwhile, an attacker has stolen the cookie and copied A. The attacker now sends A to refresh. The server hashes the plaintext, finds the row, and sees that A is already marked revoked.
The clean refresh-after-rotation invariant says that a revoked token should never be presented again. If it is, either Alice's browser is broken (unlikely), or the network retried (rare and recoverable), or the token has been stolen and the attacker is racing the legitimate user. The conservative response is to assume the worst: revoke the entire family F. Token B (which Alice's browser holds and has not yet used) is now revoked. The next time Alice's browser refreshes, it fails. The user has to log in again, but during the brief window between detection and re-login the attacker has no access either.
The detection-and-revoke pattern is implemented in the
refresh_session function: when a revoked token is presented, the
function calls revoke_family(user_id, family_id) and emits an
audit event noting the suspected compromise. The application can
also wire an on_token_compromise callback to receive the event
synchronously and take application-specific action (logging Alice
out of related sessions, alerting her by email, escalating to
fraud review).
The pattern catches a class of attacks that long-lived sessions cannot detect at all. Even a sophisticated attacker who avoids generating alerts cannot avoid the family revoke, because the legitimate user's next refresh inevitably triggers it. The trade-off is one re-login per detected compromise; given the alternative is silent access, the trade-off is worth it.
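The Alice sequence can be replayed against a small in-memory store. This is a sketch under stated simplifications: rows are keyed by the plaintext token (the real store keys by hash), token strings come from a counter instead of a secure random source, and `Store`, `Row`, and `RefreshOutcome` are hypothetical types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct Row {
    pub family_id: u32,
    pub revoked: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RefreshOutcome {
    Rotated { new_token: String },
    ReuseDetected { family_id: u32 },
    Unknown,
}

#[derive(Default)]
pub struct Store {
    rows: HashMap<String, Row>,
    next: u32,
}

impl Store {
    /// Issue a token in `family_id`. A counter keeps the sketch deterministic.
    pub fn issue(&mut self, family_id: u32) -> String {
        self.next += 1;
        let token = format!("token-{}", self.next);
        self.rows.insert(token.clone(), Row { family_id, revoked: false });
        token
    }

    pub fn revoke_family(&mut self, family_id: u32) {
        for row in self.rows.values_mut().filter(|r| r.family_id == family_id) {
            row.revoked = true;
        }
    }

    pub fn is_active(&self, token: &str) -> bool {
        self.rows.get(token).map_or(false, |r| !r.revoked)
    }

    /// Rotate on first use; treat any later use of the same token as theft.
    pub fn refresh(&mut self, presented: &str) -> RefreshOutcome {
        let (family_id, revoked) = match self.rows.get(presented) {
            Some(row) => (row.family_id, row.revoked),
            None => return RefreshOutcome::Unknown,
        };
        if revoked {
            // A rotated token came back: revoke every token in the family.
            self.revoke_family(family_id);
            return RefreshOutcome::ReuseDetected { family_id };
        }
        if let Some(row) = self.rows.get_mut(presented) {
            row.revoked = true;
        }
        let new_token = self.issue(family_id);
        RefreshOutcome::Rotated { new_token }
    }
}
```

Replaying the walkthrough: A rotates to B, the second presentation of A reports reuse, and B stops working even though it was never used.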
Device-binding cascade
When the device feature is enabled, refresh tokens are bound to the
device that received them. A refresh token issued from a browser on
Alice's laptop carries device_id = Some(laptop). A refresh token
issued from her phone carries device_id = Some(phone). Family
revocation cascades to the device store, marking the relevant device
as Revoked; device revocation cascades back to the token store,
revoking every token bound to the device.
The cascade is the mechanism behind "log out everywhere on this device" and "this device was lost, revoke all access from it". The operator marks the device revoked in the device store; the cascade revokes every refresh token bound to it; the next refresh from that device fails. The user is logged out of every session that ran through the device, including any session that was idle but still holding a refresh token.
The opposite direction matters too. When a family-revoke triggers
from a token-reuse detection, the cascade marks the relevant device
as compromised. The device's three-stage trust ladder
(Unknown to Seen to Trusted, covered in Device identity) is
short-circuited to the terminal Revoked state. Subsequent logins
from the same device fingerprint surface as a fresh Unknown device,
which the user re-establishes trust on with whatever step-up the
application requires.
The collect_family_device_targets helper gathers
(TenantId, DeviceId) pairs from a family for the cascade. The
helper exists because the device store and the refresh-token store
are independent persistence layers, and the cascade is the place
where they coordinate. The application's on_token_compromise
callback receives the list and decides which cascade to apply (some
applications mark devices Revoked directly; others write an
intermediate audit event and let an operator confirm).
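What the helper gathers can be approximated as follows. The function below is a stand-in written for this chapter, not the library's collect_family_device_targets; its name, signature, and the trimmed-down RefreshToken are assumptions.

```rust
use std::collections::BTreeSet;

pub type TenantId = u32;
pub type DeviceId = u32;
pub type TokenFamilyId = u32;

/// Only the fields the cascade needs; the real row has more.
pub struct RefreshToken {
    pub tenant_id: TenantId,
    pub family_id: Option<TokenFamilyId>,
    pub device_id: Option<DeviceId>,
}

/// Distinct (tenant, device) pairs bound to tokens of one family.
/// Tokens without a device binding contribute nothing.
pub fn family_device_targets(
    tokens: &[RefreshToken],
    family: TokenFamilyId,
) -> Vec<(TenantId, DeviceId)> {
    let unique: BTreeSet<(TenantId, DeviceId)> = tokens
        .iter()
        .filter(|t| t.family_id == Some(family))
        .filter_map(|t| t.device_id.map(|d| (t.tenant_id, d)))
        .collect();
    // BTreeSet iteration is sorted, so the result is stable across runs.
    unique.into_iter().collect()
}
```

Deduplicating before the cascade matters because a family accumulates one row per rotation, and every one of those rows names the same device.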
Configuration
RefreshTokenConfig is the operator's tuning surface:
pub struct RefreshTokenConfig {
    pub ttl: Duration,
    pub max_per_user: usize,
    pub rotation: bool,
    pub hash_pepper: Option<Vec<u8>>,
}
The defaults are conservative for most applications: a thirty-day TTL, ten concurrent tokens per user, rotation enabled, and no pepper. Each field is worth a few words of guidance.
ttl is how long a refresh token is valid before it expires without
being used. Thirty days is enough that most users do not feel the
expiry in normal use, and short enough that an abandoned device's
tokens become unusable in a bounded time. Applications with stricter
posture set this lower; applications with weak step-up at re-login
set this higher.
max_per_user is the cap on how many refresh tokens a user can have
active at once. The cap exists to prevent a runaway "log in from
every device the user owns" pattern from filling the token store.
Issuing a new token past the cap evicts the oldest one. Ten is
generous for most users (a phone, a laptop, a tablet, plus a few
spares); applications with operators who routinely log in from
ephemeral machines push this higher.
rotation controls whether a refresh issues a new token (true) or
extends the existing one (false). Rotation enabled is the default and
is what makes family-based theft detection work. Rotation disabled
is faster (one less write per refresh) but defeats the family
detection mechanism, because a token never moves to revoked under
normal use. The recommendation is to leave rotation on; the
performance cost is negligible.
hash_pepper is the optional secret used as the HMAC key when the
token is hashed. Adding a pepper is a defence-in-depth measure that helps when
the database is breached but the secrets manager is not. The pepper
must be stable across the deployment (otherwise existing tokens
become unverifiable); rotation is supported through the same pattern
as the session signing key, covered in Operations runbook.
Atomicity contracts
The RefreshTokenStore trait documents that production backends
must implement three methods atomically. The atomicity is what makes
the family-based theft detection sound; a non-atomic implementation
opens a TOCTOU window where an attacker could race the legitimate
user past the detection.
rotate_token must atomically mark the current token revoked and
issue a new token in the same family. Two requests racing each other
must result in one rotation and one detected reuse, not two
rotations.
issue_with_eviction must atomically issue a new token and evict the
oldest if the user is at the max_per_user cap. A non-atomic
implementation can leave a user with eleven active tokens
momentarily, which is harmless, or evict the wrong token under
contention, which can log a legitimate session out for no reason.
revoke_family must atomically revoke every token in a family.
Partial revocation defeats the detection mechanism: an attacker
holding a token from a half-revoked family can still refresh.
The first-party SQL adapters use transactions to satisfy these contracts. Custom adapters need to do the same; the contract is documented on the trait so reviewers can check it explicitly.
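For an in-memory backend the same contract can be met by doing the whole operation under one lock. The sketch below illustrates issue_with_eviction only; `InMemoryStore`, `Token`, and the return value (the evicted token, if any) are assumptions, and a SQL adapter would use a transaction where this uses a Mutex.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub id: u64,
    pub user_id: u32,
    pub issued_at: u64,
}

pub struct InMemoryStore {
    inner: Mutex<Vec<Token>>,
    max_per_user: usize,
}

impl InMemoryStore {
    pub fn new(max_per_user: usize) -> Self {
        InMemoryStore { inner: Mutex::new(Vec::new()), max_per_user }
    }

    /// Count, evict, and insert while holding the lock, so no other caller
    /// can observe or act on the intermediate state.
    pub fn issue_with_eviction(&self, token: Token) -> Option<Token> {
        let mut tokens = self.inner.lock().expect("store mutex poisoned");
        let active = tokens.iter().filter(|t| t.user_id == token.user_id).count();
        let mut evicted = None;
        if active >= self.max_per_user {
            // Position of this user's oldest token.
            let oldest = tokens
                .iter()
                .enumerate()
                .filter(|(_, t)| t.user_id == token.user_id)
                .min_by_key(|(_, t)| t.issued_at)
                .map(|(i, _)| i);
            if let Some(pos) = oldest {
                evicted = Some(tokens.remove(pos));
            }
        }
        tokens.push(token);
        evicted
    }

    pub fn active_for(&self, user_id: u32) -> usize {
        let tokens = self.inner.lock().expect("store mutex poisoned");
        tokens.iter().filter(|t| t.user_id == user_id).count()
    }
}
```

Splitting the count and the insert into two separately locked steps would reintroduce exactly the window the contract forbids: two callers could both see the user at the cap and both evict.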
What this enables
Refresh tokens and session cookies are the two ends of a continuum between "convenience" and "security". A session cookie alone is the convenience end. A refresh token with family-based theft detection and device-binding cascade is what lets axess sit much closer to the security end without compromising user experience: sessions feel permanent because they refresh transparently, and theft gets caught the next time anyone attempts a refresh.
The mechanism is the same one that lets axess support "log out of everything" at the user-account level and "this device was lost" at the device level, because the cascade between tokens, families, and devices is the same in both directions. A session that has lived its whole life behind axess can be revoked through any of the three handles, and the others follow.
Further reading
Device identity covers the three-stage device assurance ladder, the per-tenant fingerprint pepper, and the retention sweep. Session lifecycle and crypto envelope covers the session cookie itself, the AES-256-GCM envelope, and the orchestration that issues and reads cookies. Operations runbook covers key rotation for the session signing key, the refresh-token pepper, and the device fingerprint pepper.
Password and TOTP
The four factors axess-factors ships by default (password, totp,
hotp, email_otp) are the ones most adopters reach for first. They
require no external IdP, no specialised hardware, no extra
infrastructure. This chapter walks through password (Argon2id) and
TOTP (RFC 6238), the most common combination in practice, with
references to HOTP and email OTP at the end. The pattern these
factors illustrate generalises to every other factor in the library.
The feature flags password, totp, hotp, email_otp are all on
by default in axess-factors. No Cargo.toml change is needed to
use them.
Password (Argon2id)
The password factor verifies a user-supplied secret against a stored Argon2id hash. The choice of Argon2id rather than bcrypt or PBKDF2 is the standard one for new systems built today; the parameter tuning is the operational lever you reach for first.
The configuration struct is PasswordConfig. It carries the Argon2id
parameters (memory cost, time cost, parallelism), the minimum and
maximum password length, and the optional pepper. Defaults are
calibrated for a server class that can spare about fifty milliseconds
of CPU per verification, which is what current guidance considers an
appropriate cost ceiling for an interactive login.
pub struct PasswordConfig {
    pub argon: Argon2idParams,   // memory, time, parallelism
    pub min_length: usize,       // default 8
    pub max_length: usize,       // default 128
    pub pepper: Option<Vec<u8>>, // optional, see below
}
The pepper is an optional secret stored outside the database (in the secrets manager that holds the session signing key). When set, the password is first passed through HMAC-SHA256(pepper, password), and Argon2id hashes the result. The defence-in-depth benefit is the same one refresh-token peppers provide: a database breach alone does not enable an offline brute-force attack against the password hashes, because the pepper is not in the database.
The maximum password length matters for DoS protection. Argon2id is deliberately expensive; an attacker who can submit a megabyte of password text per request can wedge the server with a handful of concurrent attempts. The cap is one hundred and twenty-eight characters by default, which is generous for legitimate users (well beyond what password managers generate by default) and bounded enough that the worst case per request stays under a hundred milliseconds.
The minimum is eight characters, which is below the modern recommendation but matches what most users encounter elsewhere. A deployment serious about password quality lifts this to twelve or fourteen, alongside a length-and-character-class meter on the signup form. Axess does not enforce password complexity rules beyond the length range; complexity meters live in the registration UI, where they can produce real feedback.
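The length gate runs before any hashing. The sketch below assumes that length is counted in Unicode scalar values (the text says "characters" without defining the unit), and `LengthPolicy` and `LengthError` are hypothetical names carrying the two defaults quoted above.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum LengthError {
    TooShort,
    TooLong,
}

pub struct LengthPolicy {
    pub min_length: usize,
    pub max_length: usize,
}

impl Default for LengthPolicy {
    fn default() -> Self {
        // The defaults quoted in the text.
        LengthPolicy { min_length: 8, max_length: 128 }
    }
}

impl LengthPolicy {
    /// Reject out-of-range input before the expensive hash ever runs.
    pub fn check(&self, password: &str) -> Result<(), LengthError> {
        // A char is at most 4 bytes in UTF-8, so an input longer than
        // max_length * 4 bytes is too long without scanning it.
        if password.len() > self.max_length.saturating_mul(4) {
            return Err(LengthError::TooLong);
        }
        let chars = password.chars().count();
        if chars < self.min_length {
            Err(LengthError::TooShort)
        } else if chars > self.max_length {
            Err(LengthError::TooLong)
        } else {
            Ok(())
        }
    }
}
```

The byte-length shortcut keeps the cost of rejecting a megabyte of input constant, which is the DoS property the cap exists for.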
TOTP (RFC 6238)
The TOTP factor verifies a six-digit code derived from a shared secret and the current time window. The shared secret is twenty bytes of cryptographic randomness, generated at enrolment time and stored alongside the user's other factor configurations.
The configuration struct is TotpConfig:
pub struct TotpConfig {
    pub secret: ZeroizedString,   // base32-encoded 20-byte secret
    pub digits: u32,              // default 6
    pub period: Duration,         // default 30 s
    pub algorithm: HmacAlgorithm, // default Sha1 (RFC 6238)
    pub drift_window: u32,        // default 1
}
secret is zeroized in memory on drop. The string is base32-encoded
because that is what TOTP apps expect when scanning a QR code or
pasting a manual key; the bytes underneath are twenty cryptographically
random bytes from SecureRng. Adopters serialise the secret to and
from their factor store however the store's encryption envelope
prefers.
digits is six, in line with the default of the authenticator apps in
common use. RFC 6238 admits up to eight, but eight-digit codes are
rare in consumer authenticator apps, so the field exists for symmetry
rather than for variability.
period is the time window each code is valid for. Thirty seconds is
the RFC default and what every authenticator app expects. Increasing
the period (to sixty seconds, say) reduces the chance that a user
typing slowly enters a code that has just expired, at the cost of
doubling the window an intercepted code remains valid. The
recommendation is to keep this at thirty unless you have a specific
reason to change it.
algorithm is the HMAC primitive used to derive the code. SHA-1 is
the RFC 6238 default and remains universally compatible. SHA-256 is
the more conservative choice, although the known collision attacks on
SHA-1 do not carry over to its use inside HMAC; some authenticator
apps do not yet support it. Stay on SHA-1 unless you have control
over the authenticator app the users will use.
drift_window is the number of adjacent time windows the verifier
accepts on each side of the current one. A drift window of one means
the verifier accepts codes from the current window plus one window on
either side, covering a ninety-second total acceptance range against a
thirty-second period. The drift accommodates clock skew between server
and authenticator, plus the time the user takes to type the code.
Lifting it to two or three reduces user friction at the cost of
enlarging the brute-force attack surface (three, five, or seven codes
are valid at any instant for a drift of one, two, or three); the
default of one is the right trade for most deployments.
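The window arithmetic can be made concrete with a minimal sketch. The function names here are hypothetical and are not part of the axess API; the current time is passed in as a plain number of seconds where axess would read it from the injected Clock.

```rust
/// Time-step counters a verifier would accept at `now_unix` (seconds since
/// the Unix epoch). Hypothetical helper, not an axess function.
pub fn accepted_counters(now_unix: u64, period_secs: u64, drift_window: u32) -> Vec<u64> {
    assert!(period_secs > 0, "period must be non-zero");
    // RFC 6238: the current step is floor(unix_time / period).
    let current = now_unix / period_secs;
    let drift = u64::from(drift_window);
    // saturating_sub keeps the range valid for instants close to the epoch.
    let first = current.saturating_sub(drift);
    let last = current + drift;
    (first..=last).collect()
}

/// Total acceptance range in seconds: (2 * drift_window + 1) * period.
pub fn acceptance_range_secs(period_secs: u64, drift_window: u32) -> u64 {
    (2 * u64::from(drift_window) + 1) * period_secs
}
```

With the defaults (a thirty-second period and a drift window of one), a clock reading of 59 seconds falls in step 1, so steps 0, 1, and 2 are accepted, and the acceptance range is ninety seconds.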
Composing password and TOTP
A method that combines password and TOTP is two FactorSteps:
use axess::{FactorKind, FactorStep, Method};
let password_plus_totp = Method {
    name: "password-then-totp".into(),
    steps: vec![
        FactorStep::Required(FactorKind::Password),
        FactorStep::Required(FactorKind::Totp),
    ],
};
The method is stored at whatever scope the deployment wants (Global
default, Tenant override, User override; see Scope hierarchy). At
begin_login time the resolver loads the method, the session
transitions to Authenticating with remaining = [Password, Totp],
and the login flow walks the two factors in order.
The application's login page renders the password prompt while the
session is in Authenticating with remaining[0] == Password, and
the TOTP prompt while in Authenticating with remaining[0] == Totp.
A successful TOTP verification calls advance_factor, which returns
Completed, and the orchestrator transitions the session to
Authenticated. The user is logged in.
A common variant offers TOTP plus another second factor as a choice:
let password_plus_2fa_choice = Method {
    name: "password-then-2fa-choice".into(),
    steps: vec![
        FactorStep::Required(FactorKind::Password),
        FactorStep::AnyOf(vec![
            FactorKind::Totp,
            FactorKind::Fido2,
            FactorKind::EmailOtp,
        ]),
    ],
};
The login page after the password step shows three options. The user
picks one; the application calls verify_factor with the appropriate
credential; on success, the session is authenticated.
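The way a session walks its remaining steps can be sketched with local stand-in types. The enums below borrow the names used in this chapter but are assumptions for illustration, not the axess definitions, and the real advance_factor may have a different signature and return type.

```rust
/// Local stand-ins that mirror the names used in this chapter; they are
/// not the axess types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FactorKind {
    Password,
    Totp,
    Fido2,
    EmailOtp,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum FactorStep {
    Required(FactorKind),
    AnyOf(Vec<FactorKind>),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Advance {
    /// More steps remain; the payload is how many.
    Pending(usize),
    /// Every step is satisfied; the caller may mark the session authenticated.
    Completed,
    /// The verified factor does not satisfy the step at the front.
    Rejected,
}

impl FactorStep {
    fn accepts(&self, kind: FactorKind) -> bool {
        match self {
            FactorStep::Required(required) => *required == kind,
            FactorStep::AnyOf(options) => options.contains(&kind),
        }
    }
}

/// Consume the front step if `verified` satisfies it, leaving the list
/// untouched otherwise.
pub fn advance_factor(remaining: &mut Vec<FactorStep>, verified: FactorKind) -> Advance {
    let accepted = remaining.first().map_or(false, |step| step.accepts(verified));
    if !accepted {
        return Advance::Rejected;
    }
    remaining.remove(0);
    if remaining.is_empty() {
        Advance::Completed
    } else {
        Advance::Pending(remaining.len())
    }
}
```

The ordering is what the sketch is meant to show: a TOTP code offered before the password step is rejected, and any one member of the AnyOf step completes the method.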
TOTP enrolment
Enrolment is a separate ceremony from login. The user is already authenticated (often immediately after signup), and the application walks them through registering a TOTP device. The shape is uniform across deployments.
The server generates a new TOTP secret through SecureRng. It
serialises the secret as a base32 string and as an
otpauth://totp/<issuer>:<account>?secret=<base32>&issuer=<issuer>
URI suitable for embedding in a QR code. The UI displays the QR code
(scanned by the user's TOTP app) and offers a copy of the base32
secret for users whose apps prefer manual entry.
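A minimal sketch of the serialisation step follows. It is standard-library Rust written for illustration and is not the axess implementation: base32_encode, percent_encode, and otpauth_uri are hypothetical helpers, and the algorithm, digits, and period query parameters come from the widely used Key URI format rather than from anything this chapter specifies.

```rust
const ALPHABET: &[u8; 32] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

/// RFC 4648 base32 without padding, the form authenticator apps accept.
pub fn base32_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() * 8 + 4) / 5);
    let mut buffer: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid low bits in `buffer`
    for &byte in bytes {
        buffer = (buffer << 8) | u32::from(byte);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
        }
    }
    if bits > 0 {
        // Left-align the leftover bits in a final 5-bit group.
        out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
}

/// Percent-encode everything outside the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            other => out.push_str(&format!("%{other:02X}")),
        }
    }
    out
}

/// Build the otpauth URI for a QR code, spelling out the defaults so that
/// the app and the server cannot disagree about them.
pub fn otpauth_uri(issuer: &str, account: &str, secret: &[u8]) -> String {
    let issuer = percent_encode(issuer);
    let account = percent_encode(account);
    let secret = base32_encode(secret);
    format!(
        "otpauth://totp/{issuer}:{account}?secret={secret}&issuer={issuer}&algorithm=SHA1&digits=6&period=30"
    )
}
```

Stating algorithm, digits, and period explicitly costs a few bytes in the QR code and removes any dependence on what a particular authenticator app assumes when they are absent.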
The user enters a six-digit code from their app, the server verifies it against the same TOTP algorithm that login uses, and on success the server persists the secret to the factor store under the user's scope. The user is now enrolled. Their next login that demands TOTP will succeed.
Two operational details matter at enrolment.
The first is that the verification at enrolment must succeed before
the secret is persisted. A user who scans the QR code but mistypes
the verification code (or scans into the wrong app) should not be
left with a stored secret that they cannot reproduce. The standard
pattern is: generate the secret in memory, display the QR code, hold
the secret in a short-lived enrolment record (in the session
custom field, for example), verify the user's code, persist on
success, discard on failure.
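The verify-before-persist pattern can be sketched as follows. PendingEnrolment and finish_enrolment are hypothetical names, the verification and persistence steps are passed in as closures so the sketch stays independent of any real factor store, and the expiry field is an assumption about what "short-lived" means in practice.

```rust
/// A secret that has been shown to the user but not yet proven.
/// Hypothetical type, not part of axess.
pub struct PendingEnrolment {
    pub secret: Vec<u8>,
    /// Unix seconds after which the record is no longer usable.
    pub expires_at: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnrolmentOutcome {
    Persisted,
    Discarded,
}

/// Verify the user's first code, then persist the secret or drop it.
pub fn finish_enrolment<V, P>(
    pending: &mut Option<PendingEnrolment>,
    now: u64,
    code_is_valid: V,
    persist: P,
) -> EnrolmentOutcome
where
    V: FnOnce(&[u8]) -> bool,
    P: FnOnce(Vec<u8>),
{
    // take() empties the slot first, so the record is gone whatever happens next.
    let Some(record) = pending.take() else {
        return EnrolmentOutcome::Discarded;
    };
    if now >= record.expires_at || !code_is_valid(record.secret.as_slice()) {
        return EnrolmentOutcome::Discarded;
    }
    persist(record.secret);
    EnrolmentOutcome::Persisted
}
```

Because the factor store is only touched inside persist, a failed or expired attempt leaves no stored secret behind and the user restarts with a fresh QR code.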
The second is recovery. A user who loses access to their TOTP device cannot log in with a method that requires TOTP. The deployment must offer a recovery path: a recovery code printed at enrolment time (a long random string the user stores in a password manager), an email-OTP fallback factor, or an administrative reset flow with identity verification. Axess takes no position on which path to use; the choice depends on the deployment's risk profile. The common pattern is to generate a recovery code at enrolment, treat it as a one-shot factor stored under the user's scope, and offer it as an alternative second factor.
HOTP and email OTP, briefly
The HOTP factor is the counter-based variant of TOTP. Instead of
deriving the code from the current time window, the verifier derives
it from a monotonically-increasing counter that advances on every
successful verification. HOTP is the right choice for hardware tokens
that have no clock (some YubiKey configurations, for instance). The
configuration mirrors TotpConfig with a counter field instead of a
period.
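The counter handling can be sketched as below. The look-ahead window is an assumption borrowed from the resynchronisation advice in RFC 4226 and is not something this chapter says axess implements; verify_hotp is a hypothetical name, and code_at stands in for the HMAC-based code derivation so the sketch needs no cryptography crate.

```rust
/// Check `submitted` against the stored counter and up to `look_ahead`
/// counters beyond it. On a match, return the counter value to store next.
/// Hypothetical helper, not an axess function.
pub fn verify_hotp<F>(stored_counter: u64, look_ahead: u64, submitted: u32, code_at: F) -> Option<u64>
where
    F: Fn(u64) -> u32,
{
    (stored_counter..=stored_counter.saturating_add(look_ahead))
        .find(|&counter| code_at(counter) == submitted)
        // Moving past the matched counter is what makes each code one-shot.
        .map(|matched| matched + 1)
}
```

A caller persists the returned counter before reporting success; a code that has already been accepted then falls behind the stored counter and no longer verifies.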
The email OTP factor verifies a six-digit code delivered to the user's email address, out of band from the login page. The configuration carries the code length, the validity window (default fifteen minutes), and the count of allowed attempts before the code is revoked. The delivery is the application's responsibility; axess provides the verification side, the application provides the email send. The chapter Audit events covers the events emitted at email-OTP issuance and verification.
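How the validity window and the attempt budget interact can be sketched as follows. IssuedCode, OtpCheck, and check_email_otp are hypothetical names for illustration; a production verifier would also compare in constant time and would store a hash of the code instead of the code itself.

```rust
/// Default validity window from this chapter: fifteen minutes.
pub const DEFAULT_VALIDITY_SECS: u64 = 15 * 60;

/// Server-side record of an issued code. Hypothetical type, not part of axess.
pub struct IssuedCode {
    pub code: String,
    pub issued_at: u64,
    pub attempts_left: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum OtpCheck {
    Accepted,
    Rejected,
    Expired,
    Revoked,
}

pub fn check_email_otp(
    issued: &mut IssuedCode,
    submitted: &str,
    now: u64,
    validity_secs: u64,
) -> OtpCheck {
    if issued.attempts_left == 0 {
        return OtpCheck::Revoked;
    }
    if now.saturating_sub(issued.issued_at) > validity_secs {
        return OtpCheck::Expired;
    }
    if submitted == issued.code {
        // A code is one-shot: accepting it spends the remaining attempts.
        issued.attempts_left = 0;
        return OtpCheck::Accepted;
    }
    issued.attempts_left -= 1;
    if issued.attempts_left == 0 {
        OtpCheck::Revoked
    } else {
        OtpCheck::Rejected
    }
}
```

Revocation is checked before expiry and before the comparison, so a correct guess that arrives after the budget is spent is still refused.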
Threat model
A password-plus-TOTP login is robust against three common attacks and weak against one.
It is robust against a password leak (an attacker with the password alone cannot complete login without the TOTP code), against a TOTP secret leak (an attacker with the TOTP secret alone cannot complete login without the password), and against credential stuffing (an attacker reusing leaked credentials from another service is unlikely to also have the user's TOTP secret).
It is weak against a real-time phishing attack: a fake login page that prompts the user for their password, forwards it to the real server, prompts the user for their TOTP code, forwards that to the real server, and steals the resulting session. FIDO2 (covered in FIDO2 and WebAuthn passkeys) is the standard defence against this class of attack, because the WebAuthn ceremony binds the authentication to the origin and cannot be replayed against a different origin.
For applications where real-time phishing is a credible threat (financial services, healthcare, anything that handles regulated data), the recommendation is to offer FIDO2 as the second factor and treat TOTP as a fallback for users who do not yet have a passkey. The combination is what most regulators are asking for today.
Troubleshooting
A few failures recur often enough to be worth naming.
If TOTP verification fails consistently, the most likely cause is
clock skew between the server and the authenticator app. The
drift_window config accommodates a few seconds; larger drift
points to a misconfigured NTP setup on either side. Logging the
generated and accepted windows at debug level surfaces the offset
quickly.
If TOTP verification fails for some users but not others, the likely cause is that the affected users scanned the QR code into an app that defaults to SHA-256 (some less common authenticators do), while the server defaults to SHA-1. The fix is to either align the server to SHA-256 (and re-enrol users), or to ensure the QR code URI explicitly specifies SHA-1.
If password verification is slow under load, the Argon2id parameters are probably set higher than the server class can support at the offered concurrency. The fix is to either lower the memory cost or add CPU. Lower the memory cost first, but stop at sixty-four megabytes: that is the floor current guidance suggests, and going below it leaves the modern recommendation.
If password verification is fast but logins occasionally take multiple seconds, the bottleneck is somewhere else (the factor store, the session store, an outbound network call in the login handler). Inspect the trace.
Further reading
Factors and methods covers the composition machinery this chapter exercises. FIDO2 and WebAuthn passkeys covers the WebAuthn second factor that supplants TOTP for the highest-assurance deployments. Identity store implementation covers how the password hash and TOTP secret are persisted alongside the user. Audit events covers the events emitted at every step of the password and TOTP flow.
FIDO2 and WebAuthn passkeys
FIDO2 is the answer to real-time phishing. Every other second-factor
mechanism in this book (TOTP, HOTP, email OTP, SMS) is vulnerable to
an attacker who proxies the user's input through a fake login page to
the real server in real time. WebAuthn, the browser-side standard
that FIDO2 builds on, binds each authentication to the origin where
the credential was registered, and a credential registered against
accounts.example.com cannot be exercised against
accounts-example-com.attacker.example. The defence is structural,
not behavioural: the browser refuses to use the credential at the
wrong origin, regardless of what the user clicks.
This chapter walks through the integration: the two-ceremony model, relying-party configuration, storage, the resident-key choice, and the rollout patterns for shipping passkeys alongside an existing password-and-TOTP flow.
The feature flag is fido2 (off by default), enabled with
features = ["fido2"] on the axess facade.
The two ceremonies
WebAuthn has two ceremonies, and an integration touches both. The first is registration: the user has authenticated to your application by some other means (signup with email verification, an already-logged-in session, an OAuth callback) and is registering a new authenticator. The second is authentication: the user already has a registered credential, is logging in, and the WebAuthn ceremony proves possession.
Registration is the more involved of the two because it is where the
relying-party configuration matters. The server starts the ceremony
by calling Fido2Provider::begin_registration, which returns a
CreationChallengeResponse. The handler serialises that to JSON and
returns it to the browser, which calls
navigator.credentials.create() with the JSON deserialised. The
browser produces a RegisterPublicKeyCredential, which the page
posts back to the server. The server calls
Fido2Provider::finish_registration, which verifies the response
against the challenge stored in the session, and on success returns
the credential public key and metadata. The application persists this
to the factor store under the user's scope, indexed by the credential
id.
Authentication mirrors registration. The server calls
Fido2Provider::begin_authentication, which returns a
RequestChallengeResponse listing the credential ids the user has
registered. The browser calls navigator.credentials.get() with the
serialised challenge. The browser produces a
PublicKeyCredential, the page posts it back, and the server calls
Fido2Provider::finish_authentication, which verifies the signature
against the stored public key.
The challenge in both ceremonies is the part the session machinery
threads through. The server generates the challenge from SecureRng
at begin, stores it in the session (in a typed field on
SessionData::custom or a dedicated extension), and reads it back at
finish. The challenge is one-shot; it is consumed at finish,
regardless of success, to prevent replay. The whole ceremony lives
inside the typed state machine: a begin without a subsequent
finish leaves the session in a state where the next call expects
the finish, and any other call returns an error.
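The one-shot property is the part worth seeing in code. The sketch below uses a hypothetical CeremonySlot type in place of the session field, and it compares the challenge bytes directly, which is a simplification: a real finish step checks the challenge embedded in the signed client data.

```rust
/// Holds the challenge between begin and finish.
/// Hypothetical stand-in for the session field, not an axess type.
#[derive(Default)]
pub struct CeremonySlot {
    challenge: Option<Vec<u8>>,
}

impl CeremonySlot {
    /// Store a fresh challenge, replacing any stale one.
    pub fn begin(&mut self, challenge: Vec<u8>) {
        self.challenge = Some(challenge);
    }

    /// Compare the presented challenge with the stored one. take() clears
    /// the slot before the comparison, so the challenge is consumed whether
    /// or not the comparison succeeds.
    pub fn finish(&mut self, presented: &[u8]) -> bool {
        match self.challenge.take() {
            Some(expected) => expected.as_slice() == presented,
            None => false,
        }
    }
}
```

A failed finish therefore cannot be retried against the same challenge; the client has to start a new ceremony.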
Relying-party configuration
The relying party is the server that owns the credentials. WebAuthn identifies the relying party by an origin (the host plus scheme plus port) and by a relying-party id (a registrable domain suffix of the origin's host). The configuration that matters at registration time is:
pub struct Fido2Config {
    pub rp_id: String,      // e.g. "example.com"
    pub rp_name: String,    // human-readable, "Example Inc."
    pub rp_origin: Url,     // e.g. "https://accounts.example.com"
    pub user_verification: UserVerificationPolicy,
    pub attestation: AttestationConveyancePreference,
    pub resident_key: ResidentKeyRequirement,
}
rp_id is the relying-party id, a string equal to or a suffix of the
host part of the origin. Setting it to the apex domain
(example.com rather than accounts.example.com) lets credentials
registered on the accounts subdomain be used across other subdomains
of the same apex (app.example.com, admin.example.com), which is
usually what you want. Setting it to the full hostname scopes
credentials to that hostname alone, which is appropriate when other
subdomains belong to other applications you do not trust.
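The suffix rule can be sketched as a small check that is handy in configuration validation. rp_id_covers_host is a hypothetical helper, and it is a simplification: browsers additionally refuse relying-party ids that are public suffixes (such as a bare "com"), which this sketch does not model.

```rust
/// True when `rp_id` equals `host` or is a dot-delimited suffix of it.
/// Hypothetical helper; it does not consult the public suffix list.
pub fn rp_id_covers_host(rp_id: &str, host: &str) -> bool {
    let rp_id = rp_id.to_ascii_lowercase();
    let host = host.to_ascii_lowercase();
    if rp_id.is_empty() {
        return false;
    }
    host == rp_id
        || host
            .strip_suffix(rp_id.as_str())
            // Require a label boundary so "badexample.com" does not match "example.com".
            .map_or(false, |prefix| prefix.ends_with('.'))
}
```

The label-boundary check is the detail that is easy to miss: a plain string suffix test would let an unrelated domain that merely ends in the same characters pass.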
rp_origin is the full origin where the registration happens. The
browser cross-checks this against the page's actual origin and
refuses the registration if they do not match. Wildcards are not
allowed; multi-region deployments register credentials under each
region's specific origin.
user_verification controls whether the authenticator must verify
the user's identity (a fingerprint, a PIN, a face scan) at
authentication time, in addition to proving possession of the
authenticator and the user's presence. Required is the right setting
for high-assurance deployments; Preferred is the right setting for
usability when some authenticators do not support verification.
attestation controls how much detail the authenticator reports
about itself at registration. None is the right default unless you
have a specific reason to track which authenticator models your
users register (some regulatory frameworks require this for
hardware-key deployments). Direct records the attestation
statement; the trade-off is privacy (the authenticator vendor is
identifiable from the attestation).
resident_key controls whether the authenticator stores the
credential on-device as a discoverable credential (a resident key, or
"passkey"), or whether the server must supply the credential id
before the authenticator can use the key. Required is what makes a
passkey: the user does not have to type a username, because the
authenticator holds the credential and surfaces it directly to the
browser. Preferred allows either, with the device's preference
deciding. Discouraged is the legacy mode where the server provides
the credential id list.
The passkey-or-not choice is the most consequential for usability. Passkeys are what users mean today when they say "biometric login": the user clicks a button, taps their fingerprint, and they are in. Non-resident credentials require the user to enter a username first, which is the older WebAuthn flow and what most existing TOTP-style second factors look like. New deployments should default to passkeys; older deployments adopting WebAuthn alongside existing flows often start with non-resident and migrate.
Storage
Each registered credential is one row in the factor store. The row carries:
- The credential id (a byte string, base64-encoded for storage).
- The public key (the bytes WebAuthn returns at registration).
- The signature counter (used to detect credential cloning).
- The attestation statement (if attestation was set above None).
- The user verification flag from registration.
- The authenticator transports list (USB, NFC, internal, hybrid).
The signature counter is the part that catches credential cloning. WebAuthn authenticators increment the counter on each successful signature. The server stores the counter at registration and updates it at each authentication. A counter that fails to increase between authentications indicates the credential has been cloned (the clone's counter started from the same value as the original and diverged on first use). The defence is conservative: revoke the credential and require re-registration.
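The counter comparison can be sketched as a three-way verdict. The names are hypothetical and not the axess API; the rule that a counter of zero on both sides means "not implemented" follows the WebAuthn specification's treatment of authenticators without a counter.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum CounterVerdict {
    /// The counter moved forward; store the new value.
    Advance(u32),
    /// Both values are zero: the authenticator does not implement a counter.
    Unsupported,
    /// The counter failed to increase; the credential may have been cloned.
    PossibleClone,
}

/// Compare the counter stored for a credential with the one just received.
/// Hypothetical helper, not an axess function.
pub fn check_signature_counter(stored: u32, received: u32) -> CounterVerdict {
    if stored == 0 && received == 0 {
        CounterVerdict::Unsupported
    } else if received > stored {
        CounterVerdict::Advance(received)
    } else {
        CounterVerdict::PossibleClone
    }
}
```

Keeping Unsupported separate from PossibleClone lets the application apply the conservative revoke-and-re-register response only where the counter carries information.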
The credential is scoped per user (each user has zero or more registered credentials), which is the natural per-user scope from the chapter Scope hierarchy. A user with multiple authenticators (a phone passkey plus a hardware key, say) has multiple credentials under their scope; the authentication ceremony enumerates them and the user's authenticator (or the browser, in the passkey case) picks one.
Adding passkeys to an existing flow
A common rollout is to keep an existing password-and-TOTP method and to offer passkey enrolment as an opt-in. The method shape:
let passkey_or_password = Method {
    name: "passkey-or-password".into(),
    steps: vec![
        FactorStep::AnyOf(vec![
            FactorKind::Fido2,
            FactorKind::Password,
        ]),
        // When Password is chosen, demand a second factor.
        // Implementing this conditional path takes a richer state
        // machine; the common simplification is two methods.
    ],
};
The conditional in the second step (require TOTP only if the user
took the password branch) is the part axess does not natively
support, because FactorStep does not nest. Two parallel methods
handle the case more cleanly: one method
(passkey-only, single-step FIDO2) for users with a registered
passkey, and another method (password-then-totp, two-step
password-and-TOTP) for users without one. The scope hierarchy
chooses the right method per user. When a user enrols a passkey,
the application updates their user-scoped method to
passkey-only; if the passkey is later revoked, the application
reverts to password-then-totp.
The pattern keeps both flows in production simultaneously, lets each user transition independently, and avoids the conditional in the state machine. The audit log records which method ran for which user, so the rollout is observable.
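The per-user selection can be sketched as a lookup through the scope hierarchy, most specific scope first. Both functions are hypothetical helpers for illustration; the real resolver and its storage are covered in Scope hierarchy, and the sketch assumes password-then-totp is the inherited default.

```rust
/// Pick the method name for a login: the user override wins, then the
/// tenant override, then the global default. Hypothetical helper.
pub fn resolve_method<'a>(
    user: Option<&'a str>,
    tenant: Option<&'a str>,
    global: &'a str,
) -> &'a str {
    user.or(tenant).unwrap_or(global)
}

/// User-scoped override to store after a passkey is enrolled or revoked.
/// None removes the override, so the user falls back to the inherited method.
pub fn user_override_for(passkey_count: usize) -> Option<&'static str> {
    if passkey_count > 0 {
        Some("passkey-only")
    } else {
        None
    }
}
```

Clearing the override on revocation, instead of writing password-then-totp into the user scope, keeps the user in step with any later change to the tenant or global default.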
Threat model
A passkey login is robust against the classes of attack that password-and-TOTP is weak against. Real-time phishing is defeated because the credential is origin-bound at registration. Credential stuffing is defeated because the credential is unique to the relying party. Server-side breach is defeated because what is stored is a public key, not a secret.
The remaining attack surface has three parts.
The first is a compromised endpoint. An attacker with full control
of the user's device can ask the authenticator to perform any
authentication the device permits. The defence here is user
verification: the authenticator must verify the user (biometric or
PIN) before it signs, and a mere touch is not enough. For a
deployment where this matters,
user_verification: Required is non-negotiable.
The second is account recovery. A user who loses their passkey needs to recover access; the recovery path becomes the weakest link in the chain. The recommendation is to enrol at least two passkeys (a primary on the phone, a backup on a hardware key, say), and to offer a step-up administrative recovery flow with strong identity verification rather than a password-reset email. The recovery flow gets attacked because the primary login is robust; make sure the recovery is at least as strong.
The third is sync-fabric credentials. A passkey synced through Apple iCloud Keychain, Google Password Manager, or 1Password is available on every device the user has signed into that sync fabric. This is what makes passkeys usable; it also means a breach of the sync fabric compromises the credential. The implication is operational, not architectural: deployments that must defend against sync-fabric compromise pair the passkey with a device-bound credential (a hardware key, an attestation-bound device passkey), and require the device-bound credential for the highest-sensitivity actions through Cedar policy.
Troubleshooting
A few failures recur during initial integration.
If the browser rejects the registration with "The relying party ID
is not a registrable suffix of the page origin", the rp_id does
not match the page origin. Setting rp_id to example.com while
the registration page is on accounts.example.com works; setting
it to attacker.com does not. Check the host part of the actual
URL the browser is on.
If authentication succeeds on one device and fails on another, the
likely cause is that the credential is a passkey on one device but
not synced to the other. Some authenticators register
non-syncable credentials by default; check the
resident_key: Required setting and the device's documentation.
If the signature counter check fails for legitimate users, the authenticator may not implement the counter (some legacy hardware keys do not). The fix is to log the counter mismatch and let the authentication proceed, sacrificing clone detection for usability on those specific authenticators. The decision is policy, and the application surfaces it explicitly.
Further reading
Factors and methods covers the composition machinery this chapter
exercises. Device identity covers the device-bound trust ladder
that complements passkeys for high-assurance deployments. Cedar
policy fundamentals covers how a policy demands FIDO2 for specific
actions (the factors_completed.contains("Fido2") check).
OAuth 2.0 and OIDC
Federated login through an Identity Provider you do not control is the most common reason adopters reach for OAuth. The user has a Google account, an Okta account, a corporate Azure AD account, and the application accepts a login from any of them rather than asking the user to invent and remember another password. The mechanism is OAuth 2.0 for the authorisation flow and OpenID Connect for the identity assertion layered on top. This chapter walks through what axess wires up automatically, what the integration code has to do, and the failure modes that have specific defences.
The feature flag is oauth (off by default), enabled with
features = ["oauth"] on the axess facade. The feature transitively
enables oidc (the discovery and JWKS-cache machinery) and jwt (the
ID token validator).
Axess supports generic OIDC-based external login and SSO, including standard providers such as Google and Microsoft Entra ID when configured with the appropriate issuer metadata and client credentials. SAML / Shibboleth federation is not currently supported out of the box.
The shape of the flow
A federated login involves the user, the application (the client, in OAuth language, which is where axess runs), and the Identity Provider (the authorization server, which is the third-party IdP). The flow is the authorisation code grant with PKCE, which is what every modern OIDC deployment uses.
sequenceDiagram
    actor User
    participant App as Application (axess client)
    participant IdP as Identity Provider
    User->>App: GET /auth/login/google
    App->>App: build auth URL with PKCE + state + nonce
    App->>User: 302 to IdP authorize endpoint
    User->>IdP: GET /authorize?...
    IdP->>User: login + consent
    User->>App: GET /auth/callback?code=...&state=...
    App->>IdP: POST /token (code + pkce_verifier)
    IdP->>App: { id_token, access_token, refresh_token }
    App->>App: validate ID token (issuer, audience, nonce, signature)
    App->>App: optionally fetch /userinfo
    App->>App: transition session to Authenticated
    App->>User: 302 to /dashboard
The flow has six pieces axess does for you and three pieces the
integration code is responsible for. The six axess-owned pieces are:
generating the PKCE verifier and challenge, generating and binding
the CSRF state, generating and binding the OIDC nonce, the discovery
of the IdP's endpoints and signing keys, the token exchange itself,
and the ID token validation including signature, audience, nonce, and
the azp check when the audience is multi-valued. The three pieces
the integration owns are: the redirect to the IdP authorize URL, the
callback handler that picks up the code, and the application-specific
mapping from the validated claims to the user record in the local
identity store.
The provider
OAuthProvider is the trait that represents an IdP. The trait is
asynchronous because every method may need to fetch JWKS, perform
discovery, or hit the token endpoint. Adopters do not implement this
trait themselves under normal circumstances; the
OAuthProviderConfig constructor in axess-factors produces a
provider from a discovery URL plus client credentials, and the
returned provider implements the trait.
let provider = OAuthProviderConfig::discover(
    "https://accounts.google.com/.well-known/openid-configuration",
    client_id,
    client_secret,
    "https://your-app.example.com/auth/callback/google".parse()?,
)
.await?;
discover fetches the IdP's discovery document, validates it
contains the endpoints axess needs (authorization, token, JWKS,
userinfo, sometimes end-session), constructs a Discovery value, and
sets up the JWKS cache against the IdP's signing-key endpoint. The
cache is single-flight (concurrent JWKS misses dedupe to one request)
and debounced (the cache refuses to refresh more often than once
every few seconds, defeating a denial-of-service that triggers
constant JWKS fetches).
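The debounce half of that behaviour can be sketched as a small gate. RefreshGate is a hypothetical type, not the axess cache; it takes the current time as a parameter where the real cache would read the injected Clock, and it leaves out the single-flight coordination entirely.

```rust
/// Allows at most one refresh per `min_interval_secs`.
/// Hypothetical type, not part of axess.
pub struct RefreshGate {
    min_interval_secs: u64,
    last_refresh: Option<u64>,
}

impl RefreshGate {
    pub fn new(min_interval_secs: u64) -> Self {
        Self { min_interval_secs, last_refresh: None }
    }

    /// Returns true and records `now` when a refresh may proceed; returns
    /// false when the previous refresh was too recent.
    pub fn try_acquire(&mut self, now: u64) -> bool {
        let allowed = match self.last_refresh {
            None => true,
            Some(last) => now.saturating_sub(last) >= self.min_interval_secs,
        };
        if allowed {
            self.last_refresh = Some(now);
        }
        allowed
    }
}
```

A token with an unknown kid that arrives while the gate is closed is rejected against the keys already cached, which is what stops a stream of forged kid values from turning into a stream of outbound fetches.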
The configuration record carries the client id and secret (both
provisioned at the IdP), the redirect URI (where the IdP sends the
user after authentication), the ceremony timeout (how long the
intermediate state on the session may live before the flow has to
restart), and the list of scopes to request (openid and profile
at minimum; email if the application needs the user's email
address; offline_access if the application needs a refresh token
to continue acting as the user after the initial session expires).
Begin the login
The handler that starts the federated login transitions the session into a state that holds the PKCE verifier, the CSRF state, and the nonce, and returns a redirect to the IdP's authorize URL with those values bound in.
use std::sync::Arc;
use axess::{AuthnService, AuthSession, OAuthLoginOptions};
use axum::extract::{Path, State};
use axum::http::StatusCode;
use axum::response::{IntoResponse, Redirect};
async fn begin_oauth_login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<...>>>,
    Path(provider_name): Path<String>,
) -> impl IntoResponse {
    match service
        .begin_oauth_login(&session, &provider_name, OAuthLoginOptions::default())
        .await
    {
        // Send the browser to the IdP's authorize endpoint.
        Ok(auth_url) => Redirect::to(auth_url.as_str()).into_response(),
        Err(e) => (StatusCode::BAD_REQUEST, format!("{e}")).into_response(),
    }
}
begin_oauth_login does three things internally. First, it generates
the PKCE verifier through SecureRng and derives the S256 challenge
that travels in the authorize URL. Second, it generates the CSRF
state and the OIDC nonce, also through SecureRng, and stores all
three values (verifier, state, nonce) in the session's intermediate
state. Third, it composes the authorize URL with the client id, the
redirect URI, the requested scopes, the PKCE challenge, the state,
and the nonce, and returns it.
The redirect URI passed at this step must exactly match the one registered with the IdP at provisioning time. A mismatch is the single most common reason a federated login fails out of the box.
Handle the callback
The IdP, on successful user authentication and consent, redirects the
user to the registered redirect URI with a code and a state query
parameter. The application's callback handler picks these up and
passes them to axess, which verifies the state matches what was
stored on the session (defeating CSRF) and performs the token
exchange.
use std::sync::Arc;
use axess::{AuthnService, AuthSession};
use axum::extract::{Path, Query, State};
use axum::http::StatusCode;
use axum::response::{IntoResponse, Redirect};
async fn finish_oauth_login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<...>>>,
    Path(provider_name): Path<String>,
    Query(callback): Query<CallbackQuery>,
) -> impl IntoResponse {
    match service
        .finish_oauth_login(&session, &provider_name, &callback.code, &callback.state)
        .await
    {
        Ok(_authenticated) => Redirect::to("/dashboard").into_response(),
        Err(e) => (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }
}
// Query parameters the IdP appends to the redirect URI.
#[derive(serde::Deserialize)]
struct CallbackQuery {
    code: String,
    state: String,
}
finish_oauth_login does seven things internally. First, it reads
the PKCE verifier, the CSRF state, and the nonce from the session's
intermediate state. Second, it cross-checks the supplied state
against the stored state, returning OAuthError::CsrfMismatch if
they disagree. Third, it constructs the POST to the IdP's token
endpoint, including the code, the PKCE verifier, the client id, and
the client secret. Fourth, it parses the response and extracts the
ID token, access token, and (optional) refresh token. Fifth, it
validates the ID token: signature against the cached JWKS, issuer
match, audience match, nonce match, expiry, azp check when the
audience is multi-valued. Sixth, it optionally fetches the userinfo
endpoint with the access token to supplement the ID token claims.
Seventh, it transitions the session to Authenticated (or to
PendingWorkflow if the federated flow is part of a multi-step
ceremony like signup).
If any of the seven steps fails, the function returns an
OAuthError variant naming what failed. The session does not
transition; the intermediate state is cleared (to prevent replay);
the callback handler can render an error.
ID token validation
The ID token validation is where most of the security of an OIDC integration lives. Axess performs the full set of checks that RFC 7519 and OpenID Connect Core 1.0 require; the integration code does not have to write them. The checks are:
The first is signature verification against the IdP's JWKS. The
cache holds the current signing keys; if the ID token's kid header
does not match a cached key, the cache refreshes (subject to the
single-flight and debounce protections). A signature that fails
against the refreshed keys produces OAuthError::SignatureInvalid.
The second is the issuer check. The ID token's iss claim must
exactly match the discovery document's issuer field. A mismatch
indicates either a misconfigured IdP, a discovery-document substitution
attack, or an attempt to replay an ID token from a different issuer;
all three produce OAuthError::IssuerMismatch.
The third is the audience check. The ID token's aud claim must
contain the client's registered client id. If aud is a single
value, the check is straightforward. If aud is an array (which
happens when the IdP issues tokens valid for multiple clients), the
check ensures the client id is in the array, and additionally
enforces the azp (authorized party) check: the azp claim must
exist and equal the client id, regardless of the array's contents.
The azp check defeats a class of attacks where an ID token issued
for one client is replayed against a different client whose id is
also in the audience array.
The fourth is the nonce check. The ID token's nonce claim must
exactly match the nonce that was generated at begin_oauth_login
time and stored in the session. The nonce defeats ID token replay:
an attacker who captures an ID token cannot reuse it against the
same client because the session-bound nonce will not match on a
later login.
The fifth is the expiry check. The ID token's exp claim must be
in the future at the moment of validation, with a small clock-skew
allowance. The clock comes from the injected Clock trait, so
deterministic simulation tests can exercise expiry handling
without waiting on the wall clock.
The sixth is iat (issued-at) bounds. The token must have been
issued within the last few minutes; tokens older than that indicate
replay. The bound is configurable but defaults to five minutes,
which matches what RFC 7519 implementations typically use.
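The claim-level checks (everything except the signature) can be sketched as one function. The struct and error names below are hypothetical and are not the axess types; the sketch assumes the claims have already been parsed from a token whose signature was verified, and that the skew allowance applies to exp only.

```rust
/// Parsed ID token claims. Hypothetical type, not part of axess.
#[derive(Clone)]
pub struct IdClaims<'a> {
    pub iss: &'a str,
    pub aud: Vec<&'a str>,
    pub azp: Option<&'a str>,
    pub nonce: &'a str,
    pub exp: u64,
    pub iat: u64,
}

/// What the client expects, plus the current time from the injected clock.
pub struct Expected<'a> {
    pub issuer: &'a str,
    pub client_id: &'a str,
    pub nonce: &'a str,
    pub now: u64,
    pub skew_secs: u64,
    pub max_age_secs: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ClaimError {
    IssuerMismatch,
    AudienceMismatch,
    AzpMismatch,
    NonceMismatch,
    Expired,
    TooOld,
}

pub fn validate_claims(claims: &IdClaims, expected: &Expected) -> Result<(), ClaimError> {
    if claims.iss != expected.issuer {
        return Err(ClaimError::IssuerMismatch);
    }
    if !claims.aud.contains(&expected.client_id) {
        return Err(ClaimError::AudienceMismatch);
    }
    // With several audiences, azp must be present and name this client.
    if claims.aud.len() > 1 && claims.azp != Some(expected.client_id) {
        return Err(ClaimError::AzpMismatch);
    }
    if claims.nonce != expected.nonce {
        return Err(ClaimError::NonceMismatch);
    }
    // exp must still be in the future, allowing for clock skew.
    if claims.exp.saturating_add(expected.skew_secs) <= expected.now {
        return Err(ClaimError::Expired);
    }
    // iat must fall within the configured age bound.
    if expected.now.saturating_sub(claims.iat) > expected.max_age_secs {
        return Err(ClaimError::TooOld);
    }
    Ok(())
}
```

Passing now in as data instead of reading the system clock is what lets a test pin the instant of validation and probe the boundaries of the expiry and age checks exactly.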
Back-channel logout
When the IdP supports OIDC back-channel logout, the IdP sends a POST
to a registered logout endpoint at the application with a
logout_token. The application validates the token and, on success,
revokes the user's session.
The validation follows ID token validation with one addition: the
audience and issuer checks apply, the azp check applies when the
audience is multi-valued, and an additional check on the events
claim verifies the token is a back-channel logout token (the URI
http://schemas.openid.net/event/backchannel-logout must be
present). Axess implements this through OAuthProvider::verify_logout_jwt,
which returns the claims on success.
The size cap on the logout token is eight kilobytes, the iat bound
is five minutes, and the clock-skew tolerance is sixty seconds. The
caps protect against denial-of-service through oversize tokens; the
bounds defeat replay of a captured logout token after a meaningful
delay.
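The caps and bounds can be sketched as a pre-check that runs before any signature work. The constants mirror the numbers above; the function, the error names, and the way the skew tolerance is applied (to future-dated iat values only) are assumptions for illustration and do not describe verify_logout_jwt itself.

```rust
pub const MAX_LOGOUT_TOKEN_BYTES: usize = 8 * 1024;
pub const MAX_IAT_AGE_SECS: u64 = 5 * 60;
pub const CLOCK_SKEW_SECS: u64 = 60;
pub const BACKCHANNEL_EVENT: &str = "http://schemas.openid.net/event/backchannel-logout";

#[derive(Debug, PartialEq, Eq)]
pub enum LogoutError {
    Oversize,
    NotLogoutEvent,
    IssuedInFuture,
    TooOld,
}

/// Cheap checks on a logout token. `raw` is the encoded token as received,
/// `event_keys` are the member names of its events claim, `iat` its
/// issued-at time. Hypothetical helper, not an axess function.
pub fn precheck_logout_token(
    raw: &str,
    event_keys: &[&str],
    iat: u64,
    now: u64,
) -> Result<(), LogoutError> {
    // The size cap comes first so oversize input is refused before parsing.
    if raw.len() > MAX_LOGOUT_TOKEN_BYTES {
        return Err(LogoutError::Oversize);
    }
    if !event_keys.contains(&BACKCHANNEL_EVENT) {
        return Err(LogoutError::NotLogoutEvent);
    }
    if iat > now.saturating_add(CLOCK_SKEW_SECS) {
        return Err(LogoutError::IssuedInFuture);
    }
    if now.saturating_sub(iat) > MAX_IAT_AGE_SECS {
        return Err(LogoutError::TooOld);
    }
    Ok(())
}
```

In a real pipeline the length check would run on the raw request body, before the events and iat values have been decoded at all.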
RP-Initiated Logout
The opposite direction is RP-Initiated Logout: the application
initiates a logout that propagates to the IdP, so the user is logged
out of the IdP session as well as the application session. Axess
constructs the end-session URL through
OAuthProvider::build_end_session_url, which takes the ID token
hint (the user's last issued ID token, signed by the IdP), an
optional post_logout_redirect_uri (where to send the user after
logout), and an optional state value.
The post_logout_redirect_uri must be on an allowlist that the
application configures. The allowlist exists to defeat open-redirect
attacks: an attacker who can manipulate the redirect URI could send
the user to an arbitrary external site after logout, which is the
shape of a phishing setup. The allowlist is a small explicit list of
allowed URIs; anything else is rejected at build_end_session_url
time.
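The allowlist check and the URL construction can be sketched as follows. This is not the source of build_end_session_url: the function name, the error type, the exact-string comparison, and the assumption that the end-session endpoint carries no query string of its own are illustrative. The query parameter names (id_token_hint, post_logout_redirect_uri, state) are the ones the OIDC RP-Initiated Logout specification defines.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum EndSessionError {
    RedirectNotAllowed,
}

/// Percent-encodes every byte except the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Builds an end-session URL, refusing any redirect URI not on the allowlist.
/// Assumes `end_session_endpoint` has no query string of its own.
pub fn sketch_end_session_url(
    end_session_endpoint: &str,
    id_token_hint: &str,
    post_logout_redirect_uri: Option<&str>,
    state: Option<&str>,
    allowed_redirects: &[&str],
) -> Result<String, EndSessionError> {
    let mut url = format!(
        "{}?id_token_hint={}",
        end_session_endpoint,
        percent_encode(id_token_hint)
    );
    if let Some(uri) = post_logout_redirect_uri {
        // Exact match only: no prefix or pattern matching.
        if !allowed_redirects.contains(&uri) {
            return Err(EndSessionError::RedirectNotAllowed);
        }
        url.push_str("&post_logout_redirect_uri=");
        url.push_str(&percent_encode(uri));
    }
    if let Some(state) = state {
        url.push_str("&state=");
        url.push_str(&percent_encode(state));
    }
    Ok(url)
}
```

Exact matching is the conservative choice here: prefix or wildcard matching tends to reintroduce the open redirect the allowlist exists to prevent.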
Multiple providers
A common shape is to offer login with several IdPs side by side
(Google, GitHub, Microsoft). Each provider is its own
OAuthProvider instance constructed at startup; the application
registers them under a provider_name key. The login URL carries
the provider name (GET /auth/login/google); the callback URL also
carries the name (GET /auth/callback/google). Axess dispatches to
the right provider per request.
A per-tenant variation is also common: each tenant's users federate against the tenant's own IdP (an Okta workspace, an Azure AD directory). The provider name in this case is the tenant slug; the provider is constructed at tenant provisioning time (or lazily, on first use) and cached. The scope hierarchy chapter covers the pattern for storing per-tenant configurations.
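The dispatch and the lazy per-tenant construction can be pictured with a small registry. This is a sketch under assumptions: ProviderStub stands in for the real OAuthProvider, the registry type and its methods are hypothetical, and a production version would need shared, thread-safe state rather than a plain mutable map.

```rust
use std::collections::HashMap;

/// Stand-in for a constructed provider; the real type is OAuthProvider.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProviderStub {
    pub issuer: String,
}

/// Hypothetical registry keyed by provider name (or tenant slug).
#[derive(Default)]
pub struct ProviderRegistry {
    providers: HashMap<String, ProviderStub>,
}

impl ProviderRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a provider built at startup.
    pub fn register(&mut self, name: &str, provider: ProviderStub) {
        self.providers.insert(name.to_string(), provider);
    }

    pub fn get(&self, name: &str) -> Option<&ProviderStub> {
        self.providers.get(name)
    }

    /// Returns the cached provider, building and caching it on first use.
    /// `build` returns None when no configuration exists for the name.
    pub fn get_or_build<F>(&mut self, name: &str, build: F) -> Option<&ProviderStub>
    where
        F: FnOnce(&str) -> Option<ProviderStub>,
    {
        if !self.providers.contains_key(name) {
            let built = build(name)?;
            self.providers.insert(name.to_string(), built);
        }
        self.providers.get(name)
    }
}

/// Extracts the provider name from /auth/login/<name> or /auth/callback/<name>.
pub fn provider_name_from_path(path: &str) -> Option<&str> {
    let name = path
        .strip_prefix("/auth/login/")
        .or_else(|| path.strip_prefix("/auth/callback/"))?;
    if name.is_empty() || name.contains('/') {
        None
    } else {
        Some(name)
    }
}
```

An unknown name yields None in both lookups, so a request for an unregistered provider fails closed rather than falling through to a default.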
Threat model
OAuth and OIDC together are robust against a handful of attacks when the implementation does the validations above correctly.
Against CSRF on the callback: the state parameter binds the callback to the session that started the login. An attacker who tricks a user into hitting the callback URL with a stolen code cannot complete the login because the state will not match.
Against ID token replay: the nonce binds the ID token to the session's login attempt. An ID token captured by an attacker cannot be replayed against a different session.
Against ID token forgery: signature validation against the JWKS catches an attacker who synthesises an ID token without the IdP's signing key.
Against audience confusion (an ID token issued for one client used
against another): the audience check plus the azp check on
multi-element audiences catch this.
Against authorization code interception: PKCE binds the code to the verifier the application generated. An attacker who intercepts the code cannot exchange it without the verifier.
Against open-redirect phishing on logout: the
allowed_post_logout_redirect_uris allowlist catches an attacker
who tries to manipulate the redirect URI.
The attacks OAuth and OIDC do not defend against are the ones FIDO2 defends against (real-time phishing of the IdP login page itself) and the ones that depend on the IdP's own security posture (a compromised IdP issues compromised tokens, and no client-side check catches that). The defence for the latter is operational: monitor which IdPs the application accepts, audit periodically, and rotate the registered client secret if the IdP suffers a breach.
Troubleshooting
A few failure modes recur during initial integration.
If the callback returns an error about state mismatch, the most likely cause is that the user took longer than the ceremony timeout to complete the IdP login. The intermediate state on the session has expired and the state value is no longer recoverable. Increasing the ceremony timeout (a generous fifteen minutes is reasonable) is the fix.
If the token exchange returns an invalid-client error, the client id
or secret in OAuthProviderConfig does not match what the IdP has
registered. The most common variant is using a public-client id at
the IdP while configuring axess with a confidential-client expectation
(or vice versa). Check the IdP's client registration page.
If the ID token validation returns an audience mismatch on an IdP
that supports multiple clients, the aud claim is probably an
array and the azp claim is missing. Some IdPs do not emit azp
when they should; configuring the IdP to issue azp is the fix.
Axess deliberately refuses to bypass the azp check because doing
so would open the audience-confusion attack.
If the userinfo endpoint returns a 401 after a successful token
exchange, the most likely cause is that the access token's scopes do
not include the ones the userinfo endpoint requires. The fix is to
add the required scopes (typically profile and email) to the scopes
configuration.
Further reading
FAPI 2.0 covers the financial-grade extensions (PAR, DPoP, JARM)
that layer on top of the OAuth provider for regulated deployments.
Workload identity overview covers the inbound resolver side of
the same machinery, where the application is the OAuth server
accepting tokens issued by federated workload-identity systems.
Local IdP covers the in-process IdP, both production LocalIdp
for workload-identity issuance and the LocalIdpFixture that mints
test tokens against a controllable JWKS for integration tests.
FAPI 2.0
FAPI is the OpenID Foundation's Financial-grade API profile, a set of additional requirements on top of OAuth 2.0 and OIDC that address the threat model of regulated financial APIs. The headline differences from baseline OAuth are mandatory Pushed Authorization Requests (PAR), mandatory sender-constrained tokens through DPoP or mTLS, optional JWT Authorization Response Mode (JARM), and stricter ID token lifetime bounds. This chapter walks through what FAPI adds, how axess exposes it, and when to reach for it.
The feature flag is fapi (off by default), which implies oauth.
The base OAuth chapter (OAuth 2.0 and OIDC) covers everything that
remains true under FAPI; this chapter covers only what changes.
Axess is the Relying Party, not the OP
A FAPI deployment has two parties. The OpenID Provider (OP, also called the IdP) owns user identity, runs the login UI, and issues tokens; in open-banking this is typically the bank's own SSO or a hosted Keycloak / Ory Hydra / Curity instance. The Relying Party (RP) is the application that delegates identity to the OP, accepts the resulting tokens, and runs a session on top. Axess fills the RP role. PAR, DPoP, JARM, and RP-Initiated Logout are all RP-side protocols that exist to talk to an external OP; without an OP to talk to, none of them make sense.
This is a deliberate architectural choice. Building a FAPI-conformant
OP is a multi-year project (Keycloak, Hydra, Curity, and the
commercial vendors are the established options) and is largely
disjoint from the RP-side machinery axess provides (sessions, MFA
verifiers, Cedar authorization). The verifier-vs-orchestrator split
in the workspace (covered in Architecture at a glance) is the
internal expression of the same boundary; axess does not become the
OP, and adopters are expected to point at one. The examples/fapi/
crate ships a pre-configured Keycloak realm in a podman container as
a quick way to get an OP locally for the demo, but in production the
issuer URL would point at whatever OP your organisation already runs.
The local-idp feature is the one place axess does issuance, but
that is on-host workload-identity issuance (service-to-service flows
where a sidecar mints JWTs for its own workloads), not a
user-facing OP. Local IdP covers that surface.
What FAPI changes
The four headline mechanisms address four specific gaps in baseline OAuth.
Pushed Authorization Requests (PAR, RFC 9126) move the authorization
parameters off the redirect URL. Instead of the application
constructing a query-string-laden authorize URL and redirecting the
user to it, the application makes a direct POST to the IdP's PAR
endpoint containing the parameters, receives an opaque request_uri
in return, and constructs a much shorter authorize URL containing
only the client id and the request URI. The defence is twofold: the
authorization parameters never appear in browser history or referer
headers, and the parameters cannot be tampered with in transit
because the user only carries a reference to them.
DPoP (Demonstration of Proof of Possession, RFC 9449) binds the access token to a key pair the client controls. Each request the client makes to a protected resource carries a JWT signed with the client's DPoP key, and the token validator at the resource server checks that the access token was issued for a thumbprint of that key. The defence is against bearer-token theft: an attacker who captures the access token (from logs, a misconfigured proxy, a debugging surface) cannot use it without also having the DPoP private key, which never leaves the client.
JARM (JWT Authorization Response Mode) has the IdP return the authorization response as a signed JWT instead of as query parameters. The defence is integrity: the response cannot be tampered with after the IdP issues it. JARM is optional in FAPI 2.0; some implementations use it, others do not.
Stricter ID token bounds: FAPI 2.0 requires the ID token's nbf
(not-before) claim to be enforced and the lifetime to be no longer
than a short window (axess defaults to five minutes, and refuses
ID tokens with nbf in the future or exp more than five minutes
out). The defence is against replay through stale tokens.
When to reach for FAPI
The honest answer is: when a regulator requires it. FAPI 2.0 was designed for the open-banking ecosystem and similar regulated financial APIs, and adopting it imposes operational complexity (every client needs DPoP key management, every authorize call goes through PAR, every IdP must support the PAR endpoint) that is substantial relative to the security benefit for non-regulated deployments. A consumer-facing SaaS that takes credit card payments through Stripe does not need FAPI; an open-banking application that acts as an account-information service provider does.
The decision is binary: either you need FAPI because someone is asking you for compliance evidence, or you do not. If you do, the mechanisms below are non-negotiable, and axess implements them. If you do not, the baseline OAuth chapter covers what you need.
Configuration
FAPI is enabled per-provider by attaching a FapiConfig to an
OAuthProviderConfig:
use axess::factors::oauth::{FapiConfig, SenderConstraint, OAuthProviderConfig};
let fapi_config = FapiConfig {
    sender_constraint: SenderConstraint::DPoP,
    require_jarm: false,
    max_id_token_lifetime_secs: 300,
};
let provider = OAuthProviderConfig::discover(
    "https://idp.example.com/.well-known/openid-configuration",
    client_id,
    client_secret,
    redirect_uri,
)
.await?
.with_fapi(fapi_config);
sender_constraint chooses between DPoP and mTLS for the
sender-constrained-tokens requirement. DPoP is the right choice for
applications that already manage HTTPS in software; mTLS is the
right choice for applications that already manage X.509 certificates
for service-to-service authentication. The two cannot be combined
on a single provider, but different providers in the same
application can use different constraints.
require_jarm toggles JARM enforcement. When true, the authorization
response from the IdP must arrive as a signed JWT; the
configuration's oidc.discovery.jwks_uri is used to verify the
signature. When false, the IdP may return the response as query
parameters as in baseline OAuth.
max_id_token_lifetime_secs is the upper bound on ID token validity.
The FAPI default is three hundred seconds (five minutes), which is
short enough that a captured token expires before most replay attacks
can succeed and long enough that clock skew does not cause spurious
rejections.
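The bound can be read as three comparisons. The sketch below is an illustration rather than axess's code: the function and error names are hypothetical, timestamps are Unix seconds, and clock-skew handling is left out to keep the rule itself visible.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum IdTokenBoundError {
    NotYetValid,
    Expired,
    LifetimeTooLong,
}

/// Applies the nbf and lifetime rules to an ID token.
/// `max_lifetime_secs` plays the role of max_id_token_lifetime_secs.
pub fn check_fapi_bounds(
    now: u64,
    nbf: u64,
    exp: u64,
    max_lifetime_secs: u64,
) -> Result<(), IdTokenBoundError> {
    // nbf in the future: the token is not valid yet.
    if nbf > now {
        return Err(IdTokenBoundError::NotYetValid);
    }
    // exp at or before now: the token has expired.
    if exp <= now {
        return Err(IdTokenBoundError::Expired);
    }
    // exp further out than the configured bound: the lifetime is too long.
    if exp - now > max_lifetime_secs {
        return Err(IdTokenBoundError::LifetimeTooLong);
    }
    Ok(())
}
```

With the default of 300, a token that expires exactly five minutes from now is accepted and one that expires a second later is refused.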
The PAR flow
With FAPI enabled, the application starts a federated login through the PAR-enhanced auth URL rather than the query-parameter auth URL:
let auth_url = service
    .begin_oauth_login(&session, "fapi-provider", OAuthLoginOptions::default())
    .await?;
// auth_url looks like:
// https://idp.example.com/authorize?client_id=...&request_uri=urn:ietf:params:oauth:request_uri:...
Internally, begin_oauth_login detects the FAPI configuration and
takes the PAR branch. The branch performs a POST to the IdP's PAR
endpoint with the full set of authorization parameters (client id,
redirect URI, scopes, PKCE challenge, CSRF state, nonce), receives
the request_uri and its expires_in, and constructs the
shorter authorize URL the user is redirected to.
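The two artefacts of the PAR branch, the form body that is POSTed and the short authorize URL that results, can be sketched without an HTTP client. The helper names are hypothetical and the HTTP exchange itself is omitted; the parameter names in the usage are the standard OAuth ones. Note that this sketch percent-encodes the colons in the request URI, which is equivalent on the wire to the unencoded form shown in the comment above.

```rust
/// Percent-encodes every byte except the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Serialises authorization parameters as an
/// application/x-www-form-urlencoded body for the PAR endpoint.
pub fn par_form_body(params: &[(&str, &str)]) -> String {
    params
        .iter()
        .map(|(key, value)| format!("{}={}", percent_encode(key), percent_encode(value)))
        .collect::<Vec<_>>()
        .join("&")
}

/// Builds the short authorize URL from the request_uri the IdP returned.
/// Assumes `authorization_endpoint` has no query string of its own.
pub fn short_authorize_url(
    authorization_endpoint: &str,
    client_id: &str,
    request_uri: &str,
) -> String {
    format!(
        "{}?client_id={}&request_uri={}",
        authorization_endpoint,
        percent_encode(client_id),
        percent_encode(request_uri)
    )
}
```

Everything sensitive (redirect URI, scopes, PKCE challenge, state, nonce) travels in the body of the server-to-server POST; the browser only ever sees the client id and the opaque reference.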
The PAR exchange happens server-to-server and is authenticated. The
authentication is whatever the IdP requires (client secret POST,
client secret basic, mTLS, or signed JWT assertion); axess passes
through the credential that OAuthProviderConfig was constructed
with.
The callback flow on the application side is unchanged. The IdP
redirects the user back to the application's callback URL with a
code; the application calls finish_oauth_login with the code and
state; axess performs the token exchange and ID token validation.
DPoP key management
DPoP binds each access token to a public key the client controls. The application generates a key pair at session start (or at application start, for some deployments), uses the private key to sign a DPoP proof JWT on each request to a protected resource, and the resource server verifies the proof and matches the JWK thumbprint against the access token's binding.
Axess exposes the proof-generation primitive through
OAuthProvider::generate_dpop_proof:
let proof: DpopProof = provider.generate_dpop_proof(
    "GET",                               // HTTP method
    "https://resource.example.com/data", // target URL
    Some(&access_token),                 // bind to this access token
    &dpop_key,                           // the application's key
)?;
let response = http_client
    .get("https://resource.example.com/data")
    .header("Authorization", format!("DPoP {}", access_token))
    .header("DPoP", &proof.proof_jwt)
    .send()
    .await?;
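The proof JWT contains the HTTP method, the target URL, a timestamp,
a unique identifier (plus a server-supplied nonce when the server
requires one), a hash of the access token, and, in its header, the
public half of the binding key. The resource server verifies the
proof's signature with that key and checks the key's thumbprint
against the access token's cnf (confirmation) claim, which has
carried the thumbprint since token issuance.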
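The resource-server side of that comparison can be sketched with plain strings. This is an illustration only: the struct and function are hypothetical, signature verification and thumbprint computation are assumed to have happened already, and the URL comparison is a simple equality where a real verifier would normalise first.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DpopError {
    MethodMismatch,
    UrlMismatch,
    OutsideTimeWindow,
    KeyMismatch,
}

/// Hypothetical view of a proof whose signature has already been verified.
pub struct VerifiedProof<'a> {
    /// HTTP method claimed by the proof.
    pub htm: &'a str,
    /// Target URL claimed by the proof.
    pub htu: &'a str,
    /// Proof creation time, in Unix seconds.
    pub iat: u64,
    /// Thumbprint of the key that signed the proof.
    pub key_thumbprint: &'a str,
}

/// Compares a verified proof against the actual request and against the
/// thumbprint bound into the access token's cnf claim.
pub fn check_proof(
    proof: &VerifiedProof<'_>,
    method: &str,
    url: &str,
    now: u64,
    window_secs: u64,
    cnf_thumbprint: &str,
) -> Result<(), DpopError> {
    if proof.htm != method {
        return Err(DpopError::MethodMismatch);
    }
    if proof.htu != url {
        return Err(DpopError::UrlMismatch);
    }
    // The timestamp may be slightly ahead of or behind the server clock.
    if now.abs_diff(proof.iat) > window_secs {
        return Err(DpopError::OutsideTimeWindow);
    }
    if proof.key_thumbprint != cnf_thumbprint {
        return Err(DpopError::KeyMismatch);
    }
    Ok(())
}
```

The last comparison is the one that defeats bearer-token theft: a stolen access token arrives with a proof signed by the wrong key, or with no proof at all.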
Key lifecycle is the operational concern. A DPoP key pair generated per session is the safest choice (a compromised session is bounded to one key); a key pair generated per application instance is the easiest choice (one key to manage). The trade-off is between blast radius and operational complexity. Most deployments choose per-session keys for high-sensitivity flows and per-instance keys for routine flows.
Token revocation
FAPI 2.0 expects that compromised tokens can be revoked through the
IdP's revocation endpoint (RFC 7009). The application calls
revocation when the user logs out, when a session is administratively
ended, or when token theft is detected. Axess exposes revocation
through OAuthProvider::revoke_token:
provider.revoke_token(&access_token, Some(TokenTypeHint::AccessToken)).await?;
provider.revoke_token(&refresh_token, Some(TokenTypeHint::RefreshToken)).await?;
The revocation endpoint, when present in the discovery document, is called with the token to revoke and an optional type hint. The IdP responds with a 200 regardless of whether the token was actually revoked (intentionally, to defeat token-existence enumeration).
Revoking the refresh token is the more important call. The access token typically has a short lifetime (matching the FAPI ID token bound) and expires on its own; the refresh token has a longer life and an unrevoked one allows continued access through new access tokens. A logout that revokes only the access token leaves the refresh token active, which is rarely what the application wants.
Testing FAPI flows
There are three useful test modes, picked by what you want to exercise.
For Rust unit and integration tests, the FAPI feature pairs with the
local-idp feature. The LocalIdpFixture in axess_core::testing::local_idp
mints FAPI-grade tokens with the right nbf/exp bounds and exposes
a shared JwkSet handle that a JwtVerifier borrows for signature
verification. The fixture is an in-process value, not an HTTP service:
PAR and discovery endpoints are not part of its surface. For FAPI
flows that need a real PAR exchange, use Keycloak or another OP (see
the end-to-end walkthrough below). The pattern for unit tests is to
write against an OAuthProvider trait object, parameterise it over
fixture and live, and run both in CI. Local IdP covers the fixture
in detail.
For an end-to-end browser walkthrough, the examples/fapi/
crate ships with a pre-configured Keycloak realm under
examples/fapi/keycloak/. One podman compose up -d brings up
Keycloak with PAR required, PKCE S256 required, DPoP-bound tokens
enabled, the axess-fapi-client client registered, and a seeded
user (alice/alice) ready to log in. The example's
OAuthProviderConfig::discover(...) call points at the local
Keycloak issuer through env vars, and the same code talks to a real
production IdP when those env vars point elsewhere. Docker users can
substitute docker compose for podman compose; podman is the
documented path.
For compliance certification, the OpenID Foundation runs a free
hosted conformance suite at https://www.certification.openid.net/.
It acts as a scripted OP that drives an RP through the full FAPI 2.0
test matrix including adversarial cases (missing PAR, bad DPoP,
replay, wrong audience). Point it at the example's /auth/callback
to produce a certifiable artifact; use Keycloak for everyday
development.
Threat model
FAPI 2.0 closes the attacks baseline OAuth leaves open in regulated contexts.
Against authorization-parameter tampering: PAR moves the parameters off the URL, so they cannot be modified by an intermediary.
Against bearer-token theft: DPoP (or mTLS) binds tokens to keys the attacker does not have, so a captured token is unusable.
Against ID token replay through stale tokens: the strict lifetime bound shrinks the replay window to minutes.
The attacks FAPI does not close are the same ones baseline OAuth does not close: a compromised IdP issues compromised tokens regardless of the profile, and a compromised client device gives the attacker access to the DPoP private key alongside everything else.
Troubleshooting
If the PAR exchange fails with invalid_client, the application's
PAR endpoint authentication does not match what the IdP expects.
Some IdPs require mTLS authentication on PAR even when the rest of
the flow uses client secrets; check the IdP's PAR documentation.
If DPoP verification fails at the resource server, the most common cause is a clock-skew issue between the client and the resource server. The DPoP proof's timestamp is checked within a small window (a few seconds typically); larger skew triggers spurious failures. Synchronise both sides against the same NTP source.
If JARM verification fails, the signing key the IdP uses for JARM may differ from the key used for ID token signing. Some IdPs publish separate JWKS for the two; the discovery document should indicate this, but configurations occasionally miss it. Inspect the discovery document.
Further reading
OAuth 2.0 and OIDC covers the base OAuth machinery this chapter extends. Workload identity overview covers the resolver side of OAuth, where axess is the resource server rather than the client. Local IdP covers the test fixture for FAPI-grade integration testing.
LDAP bind
LDAP bind is the right factor for enterprise deployments where the authoritative user store is Active Directory, OpenLDAP, or a similar directory server. The application does not own user passwords; the directory does. The verification mechanism is a simple bind against the directory with the user's distinguished name and password; if the bind succeeds, the user has authenticated.
The feature flag is ldap (off by default), enabled with
features = ["ldap"] on the axess facade.
When LDAP fits
LDAP fits when three conditions hold. The first is that the authoritative user identities live in an LDAP directory the application can reach. The second is that the directory administrators have agreed to allow simple binds from the application's deployment network. The third is that the directory speaks LDAP, not some other protocol that wraps LDAP semantics (SAML, OIDC) which would route through the OAuth factor instead.
When those conditions hold, LDAP gives the application authentication-as-a-service from the directory without the application ever storing a user password. New employees added to the directory can log into the application immediately; departed employees removed from the directory lose access immediately. The directory is the source of truth.
When those conditions do not hold (a SaaS deployment where users come from many organisations, a directory the application cannot reach over a stable network, an authoritative store that is not LDAP), the right answer is OAuth or OIDC against an IdP that the organisation does support.
Configuration
LdapProviderConfig carries the connection details:
pub struct LdapProviderConfig {
    pub url: String,                   // ldaps://ad.example.com:636
    pub bind_dn_template: String,      // "uid={user},ou=people,dc=example,dc=com"
    pub starttls: bool,                // upgrade ldap:// to TLS via STARTTLS
    pub connection_timeout: Duration,  // typical 5-10 seconds
    pub group_search: Option<LdapGroupSearch>,
}
url is the directory's URL. The ldaps:// scheme means TLS is
established at the transport layer (port 636 by default); the
ldap:// scheme means cleartext, possibly upgraded to TLS via
STARTTLS. Cleartext without STARTTLS is acceptable only on a private
network where the directory traffic does not leave a trusted segment;
production deployments use one of the encrypted forms.
bind_dn_template is the pattern axess uses to construct a user's
distinguished name from their login identifier. The string {user}
in the template is replaced with the identifier the user typed. The
example above turns the username alice into the DN
uid=alice,ou=people,dc=example,dc=com, which is then used in the
bind request.
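Template expansion is a string substitution, but the substituted value is user input, so it should be escaped as a DN attribute value (RFC 4514) before it lands in the DN. The sketch below illustrates that idea; the function names are hypothetical, and the text above does not say how axess itself escapes the identifier.

```rust
/// Escapes a string for use as an attribute value inside a DN (RFC 4514).
pub fn escape_dn_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    let last = value.chars().count().saturating_sub(1);
    for (i, ch) in value.chars().enumerate() {
        match ch {
            // Characters that are always escaped with a backslash.
            '"' | '+' | ',' | ';' | '<' | '>' | '\\' => {
                out.push('\\');
                out.push(ch);
            }
            // NUL is written as a hex pair.
            '\0' => out.push_str("\\00"),
            // A leading '#' or space is escaped.
            '#' | ' ' if i == 0 => {
                out.push('\\');
                out.push(ch);
            }
            // A trailing space is escaped.
            ' ' if i == last => {
                out.push('\\');
                out.push(ch);
            }
            _ => out.push(ch),
        }
    }
    out
}

/// Replaces `{user}` in the template with the escaped identifier.
pub fn expand_bind_dn(template: &str, user: &str) -> String {
    template.replace("{user}", &escape_dn_value(user))
}
```

Without the escaping step, an identifier such as alice,ou=admins would change the structure of the DN rather than just its value.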
starttls triggers a STARTTLS upgrade after the initial cleartext
connection establishes. The mechanism is widely supported and is the
right choice when the directory accepts both cleartext and TLS on
the same port (usually 389). When the directory exposes a separate
TLS port (usually 636), use ldaps:// instead and leave this false.
connection_timeout bounds how long a bind attempt may take. Five
to ten seconds is typical. Longer timeouts admit slow failure modes
into the login path; shorter timeouts produce spurious failures
when the directory is briefly slow. Tune to match the directory's
observed latency.
group_search is optional. When set, after a successful bind axess
performs an additional search to enumerate the user's group
memberships. The result is returned alongside the bind outcome and
can be used by the application to populate the user's authorisation
attributes.
pub struct LdapGroupSearch {
    pub base_dn: String,          // "ou=groups,dc=example,dc=com"
    pub filter_template: String,  // "(member={dn})"
    pub group_attr: String,       // "cn" -- attribute identifying the group
}
filter_template interpolates {dn} (the bound user's DN) or
{user} (the original identifier) into an LDAP filter. The example
filter (member={dn}) matches groups that list the user's DN in
their member attribute, which is the OpenLDAP convention. Active
Directory typically uses memberOf on the user record itself
instead, in which case the group search is unnecessary because the
groups are already attributes of the user.
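Filter interpolation has the same injection concern as DN expansion, with a different escaping rule: RFC 4515 requires `*`, `(`, `)`, `\`, and NUL in a filter value to be written as a backslash followed by two hex digits. The following is a sketch with hypothetical function names; whether axess applies this escaping internally is not stated above.

```rust
/// Escapes a string for use as an assertion value in an LDAP filter (RFC 4515).
pub fn escape_filter_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '*' => out.push_str("\\2a"),
            '(' => out.push_str("\\28"),
            ')' => out.push_str("\\29"),
            '\\' => out.push_str("\\5c"),
            '\0' => out.push_str("\\00"),
            _ => out.push(ch),
        }
    }
    out
}

/// Expands `{dn}` and `{user}` in a single pass, so that text substituted
/// for one placeholder is never rescanned for the other.
pub fn expand_group_filter(template: &str, dn: &str, user: &str) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find('{') {
        out.push_str(&rest[..start]);
        let tail = &rest[start..];
        if let Some(after) = tail.strip_prefix("{dn}") {
            out.push_str(&escape_filter_value(dn));
            rest = after;
        } else if let Some(after) = tail.strip_prefix("{user}") {
            out.push_str(&escape_filter_value(user));
            rest = after;
        } else {
            // Not a known placeholder: keep the brace literally.
            out.push('{');
            rest = &tail[1..];
        }
    }
    out.push_str(rest);
    out
}
```

The single-pass loop is a deliberate choice over two chained replace calls, which would let a value containing the literal text {user} be expanded a second time.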
The verification flow
The verification flow is straightforward. The user submits a username
and password to the application. The application calls
AuthnService::verify_factor with the LDAP bind credential; axess
expands the bind DN template with the username, opens a TLS
connection to the directory, performs a simple bind with the
constructed DN and the user's password, optionally searches for
groups, and unbinds.
A successful bind transitions the session as any factor would: the
state machine calls advance_factor, which returns Completed if
LDAP was the last required factor or StillAuthenticating if more
factors are required. A failed bind returns
FactorOutcome::InvalidCredential, and the user sees the standard
failed-login message.
The connection model is per-attempt. Each bind opens a fresh TLS connection, performs the bind, and closes. There is no connection pooling. The trade-off is operational simplicity (no pool to size, no idle-connection management) against per-attempt latency (a TLS handshake on each login). For most deployments the latency is acceptable; busy directories with thousands of binds per second benefit from a connection pool at the network layer (HAProxy, nginx) rather than inside the application.
Mixing LDAP with other factors
LDAP can be the only factor in a method (the directory's bind is the entire authentication), or it can be one factor in a chain.
A common shape in enterprise deployments is LDAP followed by TOTP. The user enters their LDAP credentials, the directory verifies them, and then axess prompts for the user's TOTP code. The TOTP secret is stored in axess's own factor store (not in LDAP), under the user's scope. The combination gives directory-managed passwords with an application-managed second factor; the directory does not need to know about TOTP and the application does not need to know about the password.
A variation is LDAP followed by AnyOf(vec![Totp, Fido2]),
allowing the user to register a passkey alongside or instead of
TOTP. The flow is otherwise unchanged.
Threat model
LDAP bind is robust against the same attacks any method that delegates password verification to an external authority is robust against: credential reuse from other services, local password lists, offline brute-forcing of a stolen hash (the hash never leaves the directory).
It is weak against attacks the directory itself is weak against. A directory that allows anonymous binds is vulnerable to attribute enumeration. A directory whose bind path is misconfigured to accept empty passwords for any DN is catastrophically vulnerable. The defence is operational: configure the directory correctly, audit periodically, and treat the LDAP factor's security as a function of the directory's security posture.
The application also has to be careful about what it logs. The bind password should never appear in application logs at any level, including trace. Axess does not log it; adopters' own login handlers need to make the same guarantee. The standard pattern is to mark the password field as zeroized and to route it directly into the verifier without touching it again.
Troubleshooting
If binds fail consistently with "invalid credentials" for known-good
passwords, the bind DN template is most likely wrong. Active
Directory typically expects userPrincipalName (the user's email
address) or sAMAccountName (a short login name) in the bind, not
a constructed DN. The template might need to be {user}@example.com
rather than uid={user},ou=people,dc=example,dc=com.
If the connection succeeds but the bind times out, the directory is under load or the connection is being inspected by a middlebox that buffers slowly. The connection timeout fires; the user sees a generic failure. Inspect the network path.
If the group search returns nothing, the filter template might be wrong or the bound user might not have permission to read group membership. OpenLDAP often requires explicit ACLs for the bound user to enumerate groups they are members of; Active Directory usually grants this by default. Run the same search through a known-good LDAP client to verify.
If TLS fails with a certificate-validation error, the directory's
certificate is probably signed by a private CA that the application's
trust store does not include. Add the CA to the rustls trust store
via the standard SSL_CERT_FILE or SSL_CERT_DIR environment
variables.
Further reading
Factors and methods covers the composition machinery this chapter exercises. Identity store implementation covers how user records referenced by LDAP get provisioned in the application's identity store (typically just-in-time on first successful LDAP login). Multi-tenancy covers the case where different tenants federate to different directories.
mTLS-based authentication
Mutual TLS authenticates the client to the server at the transport layer, before the application sees the request. The client presents an X.509 certificate during the TLS handshake, the server validates the certificate against a trust anchor, and the resulting connection carries a known identity. For service-to-service traffic between parties that own both sides of the connection, mTLS is the strongest practical authentication: there is no credential to phish, no token to leak, no replay window after the handshake.
This chapter covers using mTLS as a factor for human or human-adjacent flows (a kiosk machine, an internal admin host). The other use of mTLS in axess, where the certificate identifies a workload rather than a human, is covered in Workload identity overview and specifically in Inbound: mTLS-SVID. The mechanism is the same; the interpretation of the certificate differs.
The feature flag is mtls (off by default), enabled with
features = ["mtls"] on the axess facade.
Where the certificate comes from
The most important detail about an mTLS integration is that axess does not handle the TLS handshake. Axum sits behind a TLS terminator (rustls in process, or nginx, HAProxy, AWS NLB, or Cloudflare in front), and the certificate validation happens at the terminator. Axess receives the validated certificate as part of the request, extracts an identity from it, and proceeds.
The extraction is a Tower middleware the adopter wires in. The middleware reads the certificate from wherever the terminator put it:
- For rustls in process, the certificate is in axum_server::tls_rustls::RustlsConnectInfo or an equivalent connector callback.
- For nginx, the certificate is passed through as the X-SSL-Client-Cert header (the exact header is the deployment's choice).
- For HAProxy, the convention is X-Client-Cert or similar.
- For AWS NLB with TLS passthrough, rustls handles the validation; for AWS ALB with mTLS, the certificate is in X-Amzn-Mtls-Clientcert.
The middleware reads the certificate, validates that it came from a
trusted source (the certificate must be present, the header must
have arrived only from the trusted terminator, the deployment must
not allow clients to inject the header directly), wraps the
certificate chain in a PeerCertChain, and inserts it into the
Axum request extensions:
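use axess::factors::mtls::PeerCertChain;
use axum::{http::Request, middleware::Next, response::Response};
async fn mtls_middleware<B>(
    mut req: Request<B>,
    next: Next<B>,
) -> Response {
    // extract_cert_from_terminator is the deployment-specific helper. It must
    // return a chain only when the request came through the trusted terminator.
    if let Some(chain) = extract_cert_from_terminator(&req) {
        req.extensions_mut().insert(PeerCertChain::from(chain));
    }
    next.run(req).await
}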
The trusted-terminator check is the critical line. If the deployment accepts the certificate header from anywhere, an attacker who can reach the application directly (bypassing the terminator) can spoof any identity by setting the header themselves. The defence is to either configure the application to listen only on a socket the terminator owns, or to gate the extraction on a token the terminator injects alongside the certificate.
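The second defence, gating extraction on a token the terminator injects, can be sketched over a plain header map. Everything here is illustrative: the x-terminator-auth header name and the function names are hypothetical, header names are assumed to be lowercased already, and the shared token is assumed to be provisioned out of band.

```rust
use std::collections::HashMap;

/// Compares two byte strings without stopping at the first difference,
/// so timing does not reveal how much of the token matched.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Returns the client-certificate header only when the request also carries
/// the token the terminator injects. Both header names are assumptions.
pub fn trusted_client_cert<'a>(
    headers: &'a HashMap<String, String>,
    expected_token: &str,
) -> Option<&'a str> {
    let presented = headers.get("x-terminator-auth")?;
    if !constant_time_eq(presented.as_bytes(), expected_token.as_bytes()) {
        return None;
    }
    headers.get("x-ssl-client-cert").map(String::as_str)
}
```

For this to hold, the terminator must also strip any client-supplied copies of both headers before adding its own.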
The trust anchor
The certificate validation that the TLS terminator performs uses a trust anchor: a set of CA certificates the terminator considers authoritative. A client certificate is accepted only if it chains back to one of those CAs.
For service-to-service mTLS within an organisation, the trust anchor is typically the organisation's own internal CA. The CA issues certificates to known clients, the terminator trusts the CA, and the validation works on the closed set of certificates the organisation has signed.
For broader deployments (a partner integration where the partner runs their own CA), the trust anchor is the partner's CA or a short list of CAs, and the validation accepts clients signed by any of them.
For consumer-facing deployments where clients might use any certificate, mTLS is the wrong factor. Use OAuth or another flow where the client does not need to provision a certificate.
From certificate to user
After the middleware inserts the PeerCertChain into the
extensions, the application's login handler reads it back and maps
the certificate to a user identity. The mapping depends on the
deployment's conventions.
The simplest mapping is from the certificate's Subject Common Name (CN) to a username. The CA issues certificates with CNs that match the deployment's usernames, the login handler reads the CN, and the application looks up the user under that CN.
use axess::factors::mtls::PeerCertChain;
use axess::FactorCredential;
use axum::Extension;
async fn mtls_login(
    session: AuthSession,
    State(service): State<Arc<AuthnService<...>>>,
    Extension(chain): Extension<PeerCertChain>,
) -> impl IntoResponse {
    // A chain without a leaf, or a leaf without a CN, is a failed login,
    // not a panic. (Assumes both helpers return Option.)
    let Some(leaf) = chain.leaf() else {
        return (StatusCode::UNAUTHORIZED, "no client certificate").into_response();
    };
    let Some(cn) = extract_common_name(leaf) else {
        return (StatusCode::UNAUTHORIZED, "certificate has no CN").into_response();
    };
    // Start the login for the user named by the CN.
    if let Err(e) = service.begin_login(&session, &cn, "default-tenant").await {
        return (StatusCode::UNAUTHORIZED, format!("{e}")).into_response();
    }
    // Present the chain itself as the factor credential.
    match service
        .verify_factor(&session, FactorCredential::Mtls { chain })
        .await
    {
        Ok(_) => Redirect::to("/dashboard").into_response(),
        Err(e) => (StatusCode::UNAUTHORIZED, format!("{e}")).into_response(),
    }
}
A more structured mapping uses a SPIFFE URI in the certificate's
Subject Alternative Name (SAN). The CA issues certificates with
SAN URIs of the form spiffe://<trust_domain>/<path>, and the
application's login handler parses the URI to extract the trust
domain, the path, and any embedded identifiers. This shape is what
the workload identity chapter covers, and it remains the right
shape even for human-adjacent flows because it is more structured
than a CN.
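Splitting a SPIFFE URI into its parts might look like the sketch below. The SpiffeId struct and parse_spiffe_id function are hypothetical names chosen for illustration, and the checks are deliberately minimal: this is not a full validator for the SPIFFE ID format, only the shape of the parsing step.

```rust
/// The two parts of a SPIFFE ID of the form `spiffe://<trust_domain>/<path>`.
#[derive(Debug, PartialEq)]
struct SpiffeId {
    trust_domain: String,
    path: String,
}

/// Parses a SAN URI into its trust domain and path. Returns `None` for
/// anything that is not a `spiffe://` URI with a non-empty trust domain
/// and a non-empty path.
fn parse_spiffe_id(uri: &str) -> Option<SpiffeId> {
    let rest = uri.strip_prefix("spiffe://")?;
    // SPIFFE IDs carry no query string or fragment.
    if rest.contains('?') || rest.contains('#') {
        return None;
    }
    let (trust_domain, path) = rest.split_once('/')?;
    if trust_domain.is_empty() || path.is_empty() {
        return None;
    }
    Some(SpiffeId {
        trust_domain: trust_domain.to_string(),
        // Keep the leading slash so the path reads as it does in the URI.
        path: format!("/{path}"),
    })
}

fn main() {
    let id = parse_spiffe_id("spiffe://example.org/ns/prod/sa/billing").unwrap();
    assert_eq!(id.trust_domain, "example.org");
    assert_eq!(id.path, "/ns/prod/sa/billing");
    // A plain HTTPS URI or a SPIFFE URI without a path is rejected.
    assert!(parse_spiffe_id("https://example.org/ns/prod").is_none());
    assert!(parse_spiffe_id("spiffe://example.org").is_none());
}
```

The login handler would then map the trust domain to a tenant and the path to a user or workload identifier according to the deployment's conventions.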
The verification on the axess side is straightforward. The
FactorCredential::Mtls { chain } variant carries the cert chain
through verify_factor. The verifier checks that the chain is
present (a sanity check, since the middleware put it there), that
the leaf certificate has not expired, and that the certificate
matches the user's stored mTLS configuration (which CA it should
be signed by, which CN or SAN it should have). Verification
success advances the state machine; failure returns
InvalidCredential.
Composing mTLS with other factors
mTLS as a sole factor is appropriate for service-to-service traffic where the certificate's possession is itself the authentication event. For human flows, mTLS pairs with another factor in a method.
A common shape for a high-assurance admin interface is mTLS followed by FIDO2. The user's machine presents a certificate issued by the organisation's CA (so only employees whose machines have been provisioned can even reach the login page); the user then authenticates with a passkey (so a stolen machine is not enough, the user themselves must be present). The combination is strong against both remote attackers (they have no certificate) and local attackers (they have no passkey).
A variation uses mTLS as a tenant-scoping factor and OAuth as the user-identification factor. The certificate identifies which tenant the request is for (a partner integration's certificate maps to the partner's tenant); the OAuth flow identifies which user within that tenant. The method composes the two.
Threat model
mTLS is robust against the standard authentication attacks: credential reuse, credential stuffing, password phishing, replay. The certificate is hard to steal without compromising the device that holds the private key, and a compromised private key is no easier to use than a compromised password (both require some attacker action and both can be revoked).
It is weak against three specific attacks.
The first is private-key theft from a compromised device. An attacker with full filesystem access to a client can copy the private key, install it on their own machine, and use the certificate. The defence is to store the private key in hardware that holds the key and performs the signing itself (a TPM, a hardware security module, a smartcard) rather than in a file. Hardware-backed keys cannot be exported, so a full filesystem compromise does not yield a copy of the key; the attacker can use it only for as long as they retain access to the device.
The second is CA compromise. An attacker who can issue certificates from a CA the application trusts can authenticate as anyone. The defence is operational: keep the issuing CA offline, use short-lived certificates so revocation is automatic, and monitor the CA's audit log. For service-to-service mTLS, a SPIFFE control plane handles this with rotating, short-lived certificates backed by an attested root.
The third is missing revocation. When a certificate is revoked (employee leaves, machine is lost), the application needs to know. The TLS terminator checks revocation through OCSP or CRL or a short-lived-certificate strategy; an unchecked revocation lets the old certificate continue to work. The defence is to wire revocation checking at the terminator and to monitor the revocation lifecycle.
Troubleshooting
If the middleware never sees a certificate, the most likely cause is that the TLS terminator is not requiring client certificates. Some terminators require explicit configuration to request the client certificate at handshake time; others accept the handshake without a certificate and silently let the request through. Check the terminator's configuration.
If certificates are present but the CN extraction returns nothing,
the certificate may use a SAN URI instead of a CN. Inspect the
certificate (openssl x509 -in cert.pem -text) to see what fields
are present. Updating the extraction to read the SAN URI is the
fix; the structured-mapping pattern above is the right shape.
If the trust-anchor configuration accepts a certificate the application does not expect, the terminator's trust store may include a CA the deployment did not intend to trust. Check the terminator's CA-bundle configuration and remove anything that should not be there. Use a dedicated trust store for client certificates rather than reusing the server's general CA bundle.
Further reading
Workload identity overview covers the workload-side use of mTLS, where the certificate identifies a service rather than a human. Inbound: mTLS-SVID covers the SPIFFE X.509-SVID profile that is the standard shape for service-to-service mTLS today. Security posture covers the production crypto requirements that apply to mTLS deployments, including FIPS-routing notes for regulated contexts.
Cedar policy fundamentals
Most application authorisation is the if user.role == "admin" style:
a check scattered across handlers, expressed in code, written by
whoever happened to be in the file at the time, with no shared
schema and no way to review the policy as a whole. The pattern works
for small applications and fails for everything else, because the
authorisation logic is the part of the application that needs the
most review and is also the part most likely to drift.
Cedar is a policy language designed for this exact problem. It is declarative, deny-by-default, statically checkable against a schema, and built to express RBAC, ReBAC, and ABAC in one set of rules. Axess loads a Cedar policy set at startup, validates it against a schema, and exposes per-request evaluation through a small typed interface. This chapter covers the lifecycle: loading, validation, the per-request evaluator, the contract with the application's data layer, and the error modes.
The feature flag is authz (on by default in the axess facade).
The lifecycle
Cedar in axess has three lifecycle phases: load, evaluate, redeploy. Each phase has a specific failure mode, and the design is built so the failures land at the right place.
The load phase happens once at application startup. The application
constructs a PolicyStore from one or more policy files, validates
the parsed policies against a schema, and produces an AuthzStore
that holds the result. A load failure (a malformed policy, a type
mismatch against the schema, an action that references an
undefined entity) is a startup failure: the application refuses to
start. The defence is structural: there is no path to production
with a broken policy file because the application refuses to come
up.
The evaluate phase happens once per authorisation check. The
application constructs an AuthzSession from the AuthzStore, a
Principal (typically extracted from the session or from a
workload-identity resolver), an AuthzEntityProvider that supplies
the application's entity graph for this request, and a context
(MFA status, IP address, the application's custom attributes). The
session offers two verbs: require (allow or deny, returning an
error on deny) and decide (a typed AuthzDecision). The
evaluation is cheap, predictable, and deterministic.
The redeploy phase happens when policies change. The application
loads a new PolicyStore from the new policy files, swaps it in
behind the AuthzStore's Arc, and from the next request onward
new evaluations use the new policies. A hot reload of policies is
supported; the trade-off is that decisions in flight at swap time
see the old policies and decisions started after see the new
policies. There is no decision-caching layer in axess for this
reason: a cached decision from before a redeploy would survive into
the new policy regime and produce wrong answers. The chapter
Entity providers and request context expands on what does and
does not get cached.
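The swap semantics can be illustrated with the standard library alone. The sketch below is not the axess implementation: PolicySet and Store are hypothetical stand-ins, and the only point is that an evaluation that cloned the Arc before the swap keeps the old policies, while one that starts afterwards sees the new ones.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for a parsed, validated policy set (hypothetical type).
struct PolicySet {
    version: u32,
}

/// Holds the current policy set behind a swappable `Arc`.
struct Store {
    current: RwLock<Arc<PolicySet>>,
}

impl Store {
    fn new(initial: PolicySet) -> Self {
        Store { current: RwLock::new(Arc::new(initial)) }
    }

    /// Called at the start of an evaluation: clones the `Arc`, so the
    /// evaluation keeps this snapshot even if a swap happens midway.
    fn snapshot(&self) -> Arc<PolicySet> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Called on redeploy: snapshots taken after this see the new set.
    fn swap(&self, next: PolicySet) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}

fn main() {
    let store = Store::new(PolicySet { version: 1 });
    let in_flight = store.snapshot(); // a decision already in progress
    store.swap(PolicySet { version: 2 }); // redeploy
    assert_eq!(in_flight.version, 1); // the in-flight decision keeps the old policies
    assert_eq!(store.snapshot().version, 2); // new decisions see the new policies
}
```

The old policy set is dropped once the last in-flight evaluation releases its Arc, which is why nothing needs to wait for the swap.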
Loading policies
The minimal load is a directory of .cedar files plus a
schema.cedarschema file:
use axess::authz::{AuthzStore, PolicyStore};
let policy_store = PolicyStore::load_directory("./policies")?;
let schema = std::fs::read_to_string("./policies/schema.cedarschema")?;
policy_store.validate_against(&schema)?;
let authz_store = AuthzStore::new(policy_store);
The load is recursive: every .cedar file under the directory is
parsed and added to the policy set. Cedar policies have no import
or namespace mechanism beyond the entity-type namespace; the
collection of all files is the policy set, evaluated as one.
validate_against is the call that catches malformed policies
before they reach production. The validator checks that every
entity type the policies reference is defined in the schema, that
every attribute access is on an attribute the schema declares, and
that the types align (a policy that asks principal.age > "old"
gets caught because the schema declares age as a number and the
literal is a string).
The schema is its own discipline. Writing a schema that accurately
describes the application's entities is the hardest part of a
Cedar integration. The schema names the principal types (User,
Workload, Role, Group), the action types (read, write,
administer), the resource types (the application's domain
objects), and the parent relationships (a User is in Groups,
which are in Roles). Which actions a role permits is decided by
the policies, not by the schema. The Cedar
documentation covers schema authoring in detail; the chapter here
focuses on what axess does with a schema once it has one.
The per-request evaluator
The AuthzSession is constructed per request and lives only as
long as the request:
let session: AuthzSession = authz_store.session()
.with_principal(principal)
.with_entity_provider(&app.entity_provider)
.with_context(StandardRequestContext::from_request(&request))
.build();
match session.decide(
Action::View,
ResourceUid::new("Document", "doc-123"),
).await {
Ok(AuthzDecision::Allow) => proceed(),
Ok(AuthzDecision::Deny) => render_forbidden(),
Err(e) => render_error(e),
}
The with_principal call binds the caller. The principal carries
the user id, the tenant id, the factors completed, and the
authentication time. Cedar policies can match on any of these.
The with_entity_provider call binds the application's data
layer. The entity provider is the application-specific code that
loads the relevant entities (the user record, their group
memberships, the resource being accessed, its parents) for the
evaluation. The provider returns a Cedar entity graph; the session
holds it for the duration of the evaluation. The next chapter,
Entity providers and request context, covers the provider
contract in detail.
The with_context call binds the contextual attributes. The
StandardRequestContext covers the common cases: MFA status, IP
address, the time of the request. Applications can extend it with
custom keys (a custom-headers map, a tenant-feature-flag set, a
geographical location).
The decide verb evaluates the policies and returns
AuthzDecision::Allow or AuthzDecision::Deny. The verb is async
because the entity provider may need to fetch entity data from a
database. The require verb is a thin wrapper that returns an
error on Deny, suitable for handlers that want to short-circuit
on a denied request.
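The relationship between the two verbs can be sketched as a plain function over the outcome of decide. The types here are simplified stand-ins rather than the axess definitions: Decision mirrors AuthzDecision, the error is reduced to a String, and CheckError is a hypothetical name for whatever error require returns on a deny.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

#[derive(Debug, PartialEq)]
enum CheckError {
    /// The policies evaluated cleanly and the answer was no.
    Denied,
    /// The evaluation itself failed; treated as a refusal as well.
    Evaluation(String),
}

/// Turns the three-way outcome of a decide-style call into a two-way
/// result that a handler can propagate with `?`.
fn require(outcome: Result<Decision, String>) -> Result<(), CheckError> {
    match outcome {
        Ok(Decision::Allow) => Ok(()),
        Ok(Decision::Deny) => Err(CheckError::Denied),
        Err(reason) => Err(CheckError::Evaluation(reason)),
    }
}

fn main() {
    assert_eq!(require(Ok(Decision::Allow)), Ok(()));
    assert_eq!(require(Ok(Decision::Deny)), Err(CheckError::Denied));
    // An evaluation error never becomes an allow.
    assert!(require(Err("entity not found".to_string())).is_err());
}
```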
What policies look like
A Cedar policy is a permit or forbid statement against a
principal, action, and resource, with optional when conditions.
The simplest possible policy:
permit (
principal,
action == Action::"read",
resource
);
This is the "everyone can read everything" policy. It permits any principal to perform the read action against any resource. It is useful for nothing in production but illustrates the shape.
A real RBAC policy:
permit (
principal in Role::"finance-viewer",
action == Action::"read",
resource in TenantData::"acme"
) when {
principal.tenant_id == "acme"
};
This permits any principal in the finance-viewer role to read any
resource in the acme tenant's data, but only when the principal
is also in the acme tenant. The in operator is hierarchy
membership against the entity graph: it holds when the principal is
the named entity or has it among its ancestors. The policy is in
effect asking the entity provider "is this principal in this role?",
which the provider answers from the application's data by listing
the role among the principal's parents.
A ReBAC policy:
permit (
principal,
action == Action::"edit",
resource
) when {
resource.owner == principal
};
This permits a principal to edit a resource only when the resource's
owner attribute equals the principal. Ownership is the ReBAC
relationship; the schema declares Document has an owner
attribute of type User, and the entity provider populates it from
the document's row.
An ABAC policy:
permit (
principal,
action == Action::"write",
resource in TenantData::"acme"
) when {
principal.tenant_id == "acme"
&& context.mfa == true
&& context.ip like "10.*"
};
This permits writes to the acme tenant's data when the principal
is in the tenant, has completed MFA, and is connecting from an
internal IP range. Context attributes come from the
StandardRequestContext (or custom extensions); the schema
declares them so the validator can type-check the policy.
The three styles compose freely, even in one rule. A real production policy is typically a mix: roles establish broad permissions, relationships restrict to ownership, attributes restrict to high-assurance contexts. Cedar's deny-by-default behaviour means the rules above accumulate as positive grants: none of them denies, and the absence of a permitting rule is itself a deny. When an explicit denial is needed, a forbid rule overrides any permit that matches the same request.
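How Cedar combines matching rules into one answer is simple enough to state as code. The sketch below models only the combination step, with hypothetical Effect and Decision types: given the effects of every policy whose scope and conditions matched the request, the answer is allow only if at least one permit matched and no forbid did.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Effect {
    Permit,
    Forbid,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

/// `satisfied` holds the effect of every policy that matched the request.
fn combine(satisfied: &[Effect]) -> Decision {
    let any_permit = satisfied.contains(&Effect::Permit);
    let any_forbid = satisfied.contains(&Effect::Forbid);
    if any_permit && !any_forbid {
        Decision::Allow
    } else {
        Decision::Deny
    }
}

fn main() {
    // No matching rule at all: deny by default.
    assert_eq!(combine(&[]), Decision::Deny);
    // One or more permits and no forbid: allow.
    assert_eq!(combine(&[Effect::Permit, Effect::Permit]), Decision::Allow);
    // A single forbid overrides any number of permits.
    assert_eq!(combine(&[Effect::Permit, Effect::Forbid]), Decision::Deny);
}
```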
Errors
The AuthzError enum has variants for the cases that go wrong:
pub enum AuthzError {
PolicySetInvalid(String), // load-time, should never reach prod
SchemaValidationFailed(String), // load-time
EntityNotFound { uid: String }, // evaluator could not load an entity
ContextMissing(String), // policy needed a context key not provided
EvaluationFailed(String), // Cedar internal error (rare)
Cancelled, // request cancelled during evaluation
}
The load-time variants should never reach production because the
PolicyStore::validate_against call catches them at startup.
The runtime variants are recoverable but specific. EntityNotFound
means the entity provider returned no entity for a UID a policy
referenced; the deployment may have a stale Cedar reference or a
race between policy and data. ContextMissing means a policy
referenced a context key the request did not provide; the schema
should have caught this at load time but did not (a context key
the schema declared as optional, used in a policy as if required).
EvaluationFailed is the catch-all for Cedar's own errors, which
are rare in well-formed policy sets.
Every variant produces a deny. There is no path where an evaluation error produces an allow. The defence is structural and is one of the reasons Cedar was chosen.
When to use require versus decide
The two verbs differ in their failure handling. require returns
an error on Deny (so the handler short-circuits with an error
without needing an explicit match); decide returns the typed
decision (so the handler can branch).
The recommendation is to use require in handlers (the most
common case: deny gives a 403, allow proceeds), and decide in
code that needs to express a non-binary outcome (a UI that hides
buttons rather than displaying them and denying on click, an
admin panel that shows what the current user could do).
// require version: handler short-circuits on deny
async fn delete_document(
    session: AuthzSession,
    Path(doc_id): Path<String>,
) -> Result<Json<()>, AppError> {
    session
        .require(Action::Delete, ResourceUid::new("Document", &doc_id))
        .await?;
    // ... proceed with delete
    Ok(Json(()))
}
// decide version: branch on the decision
async fn dashboard(
session: AuthzSession,
) -> impl IntoResponse {
let can_create_doc = matches!(
session.decide(Action::Create, ResourceUid::new("Document", "*")).await,
Ok(AuthzDecision::Allow)
);
render_dashboard(can_create_doc)
}
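The "*" resource id in the second example is an application convention for "is the principal allowed to perform this action at all?", not a Cedar feature. Cedar treats "*" as an ordinary entity id, so the check gives a meaningful answer only if the policy set and the entity provider are written with that sentinel in mind.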
What policies cannot do
Cedar is the right tool for asking "is this allowed?". It is not the right tool for everything that pattern-matches like authorisation but is actually something else.
It is not for rate limiting. Rate limits are stateful (they depend
on the rate of past requests, not the content of the current
request), expensive to express in declarative terms, and not what
Cedar is built for. Use the RateLimitLayer middleware (covered in
Rate limiting).
It is not for input validation. A request with an invalid body fails at deserialisation, not at authorisation. Cedar policies that try to enforce body-shape constraints duplicate validation logic and run after the body has already been parsed.
It is not for state transitions. A workflow that allows a
transition from Pending to Approved but not from Pending to
Closed is a state machine, not a policy. Implement the state
machine in code (or in an axess-style typed state machine for the
workflow); use Cedar to gate access to the transition operations.
It is not for caching decisions across requests. Policies and entity graphs are mutable; cached decisions are stale by construction. Axess deliberately caches entity graphs (which are much more stable) and not decisions.
The next chapter, Entity providers and request context, covers the entity-graph caching mechanism and the contract between Cedar and the application's data layer.
Further reading
Entity providers and request context covers the
AuthzEntityProvider trait, the StandardRequestContext extension
points, and the caching posture. RBAC, ReBAC, and ABAC patterns
walks through worked examples of each style and how they compose
in one policy set. The principal model covers the principal types
the evaluator binds to.
Entity providers and request context
A Cedar policy evaluates against a request (a principal, an
action, a resource, and a context) plus an entity graph that gives
the policies the data they need to reason about (which roles the
principal is in, which group owns the resource, what the principal's
MFA status is).
The policy set is loaded once at startup. The principal and action
come from the request. The entity graph and the request context
come from the application, per request, through two interfaces this
chapter covers: the AuthzEntityProvider trait and the
StandardRequestContext extension surface.
Doing both of these well determines whether the Cedar integration holds up under load. A naive entity provider that loads a user's entire group membership on every request will be the slowest part of the request lifecycle. A request context that omits an attribute a policy expects produces denies that are hard to debug. The shapes below avoid both failure modes.
The entity provider contract
AuthzEntityProvider is the trait the application implements. The
job is to take a request's principal and resource UIDs, and return
a Cedar entity graph rich enough that the evaluator can answer the
policy questions:
#[async_trait]
pub trait AuthzEntityProvider: Send + Sync {
async fn entities(
&self,
principal: &Principal,
resources: &[ResourceUid],
) -> Result<EntitySet, AuthzProviderError>;
}
The provider receives the principal (so it can load the
principal's groups, roles, and any attributes the policies need)
and the list of resource UIDs the request is touching (so it can
load the resources, their parents, and their attributes). It
returns an EntitySet, which is Cedar's typed entity graph: each
entity has a UID, a set of attributes, and a list of parent
entities.
The contract is "return enough to answer the policies, no more."
An entity set that omits an entity a policy references produces an
EntityNotFound error at evaluation time. An entity set that
includes hundreds of entities the policies never touch wastes
database time. The right shape is the minimum set the policies
need for this request.
What "enough" means
The policies that the evaluator runs against the entity set typically need a few categories of data.
The principal's parents. Every role the principal is in, every
group they belong to. A policy that says
principal in Role::"finance-viewer" needs the principal's
parents list to include Role::"finance-viewer" if the principal
is in that role. The provider populates this from the application's
role-and-group store.
The principal's attributes. The user's tenant id, MFA status,
factors completed, custom attributes the policies use. Many of
these are already on the Principal value; the provider attaches
them as Cedar attributes on the principal entity.
The resource's parents. The tenant that owns it, the project it
belongs to, any logical grouping the policies might match against.
A policy that says resource in TenantData::"acme" needs the
resource's parents list to include TenantData::"acme" if the
resource belongs to that tenant.
The resource's attributes. The owner, the visibility setting, the classification level, anything the policies need. The provider populates these from the resource's row.
The principal's relationships to the resource. A ReBAC policy that
matches resource.owner == principal needs the resource's owner
attribute to equal the principal's UID. If the resource is shared
with the principal through a separate sharing record, the provider
either expresses it as an attribute on the resource (a shared_with
list) or as a parent (the principal is in a "viewers" group
attached to the resource).
The application's data model is the source of truth for all of this; the provider's job is to shape the data into Cedar's vocabulary.
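Why the parents lists matter becomes clearer with a model of how membership is resolved. The sketch below is a simplified stand-in for Cedar's hierarchy check, using a hypothetical Parents map from each entity UID to its direct parents: an entity is "in" another when it is that entity or reaches it by following parent links.

```rust
use std::collections::{HashMap, HashSet};

/// Maps each entity UID to the UIDs of its direct parents (hypothetical shape).
type Parents = HashMap<&'static str, Vec<&'static str>>;

/// Returns true when `entity` is `ancestor` itself or reaches it by
/// following parent links through the graph.
fn is_in(graph: &Parents, entity: &str, ancestor: &str) -> bool {
    let mut stack = vec![entity];
    let mut seen = HashSet::new();
    while let Some(current) = stack.pop() {
        if current == ancestor {
            return true;
        }
        // Skip entities already visited so a cyclic graph cannot loop forever.
        if !seen.insert(current) {
            continue;
        }
        if let Some(parents) = graph.get(current) {
            stack.extend(parents.iter().copied());
        }
    }
    false
}

fn main() {
    let mut graph: Parents = HashMap::new();
    graph.insert("User::alice", vec!["Group::finance"]);
    graph.insert("Group::finance", vec!["Role::finance-viewer"]);

    // Membership is transitive: alice reaches the role through her group.
    assert!(is_in(&graph, "User::alice", "Role::finance-viewer"));
    // A role the provider never attached is simply not reachable.
    assert!(!is_in(&graph, "User::alice", "Role::admin"));
}
```

If the provider leaves the group-to-role edge out of the entity set, the first check fails and the policy does not match, which is the practical meaning of "enough".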
A worked provider
A typical provider for a document-management application looks like this:
struct AppEntityProvider {
db: PgPool,
}
#[async_trait]
impl AuthzEntityProvider for AppEntityProvider {
async fn entities(
&self,
principal: &Principal,
resources: &[ResourceUid],
) -> Result<EntitySet, AuthzProviderError> {
let mut set = EntitySet::new();
// Principal: load roles and groups, attach as parents.
let user_id = principal.user_id().ok_or(AuthzProviderError::NotHuman)?;
let memberships = sqlx::query_as::<_, (String,)>(
"SELECT role_uid FROM user_roles WHERE user_id = $1"
)
.bind(user_id.to_string())
.fetch_all(&self.db)
.await?;
let principal_uid = ResourceUid::new("User", &user_id.to_string());
set.insert(Entity {
uid: principal_uid.clone(),
attrs: principal_attrs(principal),
parents: memberships
.into_iter()
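// NOTE: unwrap panics on a malformed role_uid row; production code
// should map the parse failure to an AuthzProviderError instead.
.map(|(uid,)| ResourceUid::parse(&uid).unwrap())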
.collect(),
});
// Resources: load each resource's row + tenant parent.
for resource in resources {
if resource.entity_type() == "Document" {
let row: DocumentRow = sqlx::query_as("SELECT * FROM documents WHERE id = $1")
.bind(resource.id())
.fetch_one(&self.db)
.await?;
set.insert(Entity {
uid: resource.clone(),
attrs: document_attrs(&row),
parents: vec![ResourceUid::new("TenantData", &row.tenant_id)],
});
}
}
Ok(set)
}
}
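The shape is uniform: one principal entity (with parents from the role-and-group store), one or more resource entities (each with parents from the tenant model and attributes from the resource's row). The provider uses Postgres in this example; the choice is the application's. The key shape is that the loads are scoped to the request, not to each policy: one query for memberships and one per resource. The per-resource loop is acceptable for the one or two resources a typical request touches; a request that touches many should fetch them in a single query (for example WHERE id = ANY($1)) so the provider does not turn into an N+1 pattern.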
Caching entities, not decisions
The single most important performance choice in a Cedar integration is what to cache. Axess takes the conservative line: entity graphs are cached aggressively, decisions are never cached.
Decisions cannot be cached because they are functions of the entity graph, the policy set, and the context. Any of the three can change between the cache write and the cache read: the entity graph because the database has updated (a role granted, a relationship added), the policy set because a redeploy has happened, the context because the request is different. A cached decision that survives any of these changes produces a wrong answer. The defence is to not cache decisions at all.
Entity graphs can be cached because they are functions of the database state at a known point in time. The cache key is the tenant id, the principal UID, and the resource UIDs; the cache value is the entity set; the cache TTL reflects how much staleness the application is willing to tolerate.
Axess provides an AuthzSessionCache decorator that wraps an
AuthzSession. The decorator caches the entity graph for a
configurable TTL (default sixty seconds for low-sensitivity
deployments, one second or less for high-sensitivity deployments,
or even off for the highest-sensitivity ones). The cache is keyed
by (tenant_id, principal_uid, resource_uids).
The TTL is the lever. Sixty seconds is fine for a deployment where
a role change can take a minute to propagate (most internal admin
panels). Anything tighter requires the cache to be invalidated on
role changes, which means the application's role-mutation code
calls into the cache to flush the affected entries. The
CacheInvalidator trait on EntityCache is the surface for this;
applications that need stricter consistency wire the invalidations
explicitly.
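The TTL-plus-invalidation idea can be sketched with the standard library. This is an illustration of the technique, not the axess cache: EntityCache here is a hypothetical generic struct, cache_key and invalidate_principal are made-up names, and the current time is passed in explicitly so the behaviour is deterministic under test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cache key: tenant id, principal UID, and the sorted resource UIDs.
type Key = (String, String, Vec<String>);

/// Builds a key; sorting makes the key independent of resource order.
fn cache_key(tenant: &str, principal: &str, resources: &[&str]) -> Key {
    let mut sorted: Vec<String> = resources.iter().map(|r| r.to_string()).collect();
    sorted.sort();
    (tenant.to_string(), principal.to_string(), sorted)
}

struct EntityCache<V> {
    ttl: Duration,
    entries: HashMap<Key, (Instant, V)>,
}

impl<V: Clone> EntityCache<V> {
    fn new(ttl: Duration) -> Self {
        EntityCache { ttl, entries: HashMap::new() }
    }

    /// Returns the cached value only while it is younger than the TTL.
    fn get(&self, key: &Key, now: Instant) -> Option<V> {
        match self.entries.get(key) {
            Some((stored_at, value)) if now.duration_since(*stored_at) < self.ttl => {
                Some(value.clone())
            }
            _ => None,
        }
    }

    fn put(&mut self, key: Key, value: V, now: Instant) {
        self.entries.insert(key, (now, value));
    }

    /// Drops every entry for one principal, for example after a role change.
    fn invalidate_principal(&mut self, tenant: &str, principal: &str) {
        self.entries
            .retain(|(t, p, _), _| !(t.as_str() == tenant && p.as_str() == principal));
    }
}

fn main() {
    let mut cache = EntityCache::new(Duration::from_secs(60));
    let key = cache_key("acme", "User::alice", &["Document::b", "Document::a"]);
    let t0 = Instant::now();

    cache.put(key.clone(), 7, t0);
    assert_eq!(cache.get(&key, t0 + Duration::from_secs(59)), Some(7));
    assert_eq!(cache.get(&key, t0 + Duration::from_secs(60)), None); // expired

    cache.put(key.clone(), 7, t0);
    cache.invalidate_principal("acme", "User::alice"); // role change
    assert_eq!(cache.get(&key, t0), None);
}
```

The explicit invalidation call is what lets a deployment keep a long TTL for throughput while still reacting immediately to the role changes it cares about.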
The chapter Session lifecycle and crypto envelope covers the
generic axess-cache machinery the entity cache uses. Operations
runbook covers the operational signals for the cache (hit rate,
eviction rate, invalidation rate).
The standard request context
The context is the last per-request input to a policy evaluation. It carries the per-request attributes that are not on the principal or the resource: the MFA status, the IP address, the time of the request, the custom keys the application wants to expose to policies.
StandardRequestContext is the built-in implementation:
pub struct StandardRequestContext {
pub mfa: bool,
pub ip: Option<IpAddr>,
pub now: DateTime<Utc>,
pub custom: BTreeMap<String, serde_json::Value>,
}
impl StandardRequestContext {
pub fn from_request(req: &Request) -> Self { /* ... */ }
pub fn with_custom(mut self, k: impl Into<String>, v: serde_json::Value) -> Self {
self.custom.insert(k.into(), v);
self
}
}
The from_request constructor pulls what it can from the request:
the IP from the trusted-proxy chain, the MFA status from the
session's factors_completed, the time from the clock. The
with_custom builder adds application-specific keys.
Policies can match on any of these:
permit (
principal,
action == Action::"write",
resource
) when {
context.mfa == true
&& context.ip like "10.*"
&& context.custom.region == "eu"
};
The schema declares the context shape:
type Context = {
mfa: Bool,
ip: String,
custom: {
region?: String,
...
}
};
Required fields are checked at policy load time; optional fields
are checked at evaluation time. A policy that uses a required
field the request omits produces a startup error (good, caught
early). A policy that uses an optional field the request omits
produces a deny at runtime with ContextMissing (acceptable, deny
is the conservative answer).
When to extend the context
The custom keys exist to bridge application state that does not fit on the principal or the resource. Common cases:
The first is a tenant feature flag. A policy that gates a beta
feature on "this tenant has opted in" reads context.custom.beta,
which the application sets from the tenant's feature-flag state.
The second is the request's geographical context. A policy that
restricts certain actions to certain regions reads
context.custom.region, which the application populates from the
load balancer's geo-IP information or from an explicit header.
The third is a stepped-up factor that is not in factors_completed
because it was completed for a different reason. A policy that
wants to know "did the user complete a fresh password challenge in
the last five minutes" reads
context.custom.password_challenge_at, which the application
populates from a sidecar store of recent challenges.
The pattern across all three: the application owns the data, the context is the carrier, the policy sees a typed attribute it can match on.
Failure modes and visibility
The two failure modes worth knowing are EntityNotFound and
ContextMissing, both of which surface as Deny from the
evaluator. The right response is the same in both cases: log the
failure with enough detail to diagnose, surface a generic deny to
the user, and keep the audit trail.
EntityNotFound typically means the entity provider should have
loaded an entity but did not. The fix is in the provider: load the
missing entity, or update the policy to not reference it.
ContextMissing typically means a policy was written against a
context key the application does not provide. The fix is in the
schema: declare the key as optional and update the policy to handle
its absence, or update the application to provide it.
Axess emits an AuthzEvent for every evaluation, regardless of
outcome. The chapter Audit events covers the event surface; the
relevant variants here are AuthzEvent::EntityNotFound and
AuthzEvent::ContextMissing, both of which name the missing key
and the policy that referenced it. A spike in either suggests a
mismatch between the policy set and the rest of the deployment;
operational dashboards should alert on it.
What this enables
The provider-and-context contract is what makes Cedar usable against an arbitrary application data model. The schema names the shape; the policies match on the shape; the provider populates the shape from whatever the application's storage actually looks like. The three layers are independent, which means a database migration that changes how roles are stored does not break the policies (the provider updates; the rest stays), and a policy change does not touch the database (the policy file updates; the rest stays).
The chapter RBAC, ReBAC, and ABAC patterns covers worked examples that show the three styles composed in real policies.
Further reading
Cedar policy fundamentals covers the policy lifecycle and the
evaluator surface this chapter feeds. RBAC, ReBAC, and ABAC
patterns covers the policy authoring style with concrete examples
for each pattern. Identity store implementation covers how the
provider's principal-loading queries fit into the application's
identity-store implementation. Audit events covers the
AuthzEvent variants the evaluator emits.
RBAC, ReBAC, and ABAC patterns
The alphabet-soup acronyms RBAC, ReBAC, and ABAC name the three
standard styles of authorisation. Cedar is one of the few policy
languages that admits all three in the same set of rules. This
chapter walks through each style with worked examples, then shows
how to compose them in a single policy set without the rules
fighting each other. The examples are concrete enough that you
should be able to paste them into a .cedar file and have them
type-check against a corresponding schema.
RBAC: roles as groups
Role-based access control assigns users to roles and assigns permissions to roles. The model has been the workhorse of enterprise authorisation since the 1990s and remains the right starting point for most applications.
The schema declares the entity types, lets a User be a member of a Role (without the in [Role] clause the validator has no way for principal in Role::"viewer" to hold), and declares the actions:
entity User in [Role] {
    tenant_id: String,
};
entity Role;
entity Document {
    tenant_id: String,
    owner: User,
};
action read appliesTo {
    principal: [User],
    resource: [Document],
};
action edit appliesTo {
    principal: [User],
    resource: [Document],
};
The policy grants the role-action mappings:
permit (
principal in Role::"viewer",
action == Action::"read",
resource
);
permit (
principal in Role::"editor",
action in [Action::"read", Action::"edit"],
resource
);
The entity provider, on each request, attaches the user's role
memberships as parent entities. A user in Role::"viewer" has
that role in their parents list; a user in Role::"editor" has
that role in their parents list and inherits read permission
through the second policy's action set.
The shape works for most applications until two situations arise. The first is when permissions need to depend on the relationship between the principal and the resource (a user can edit their own documents but not others'), which is the ReBAC case below. The second is when permissions need to depend on the request context (MFA must be present for sensitive actions), which is the ABAC case below.
ReBAC: relationships as paths
Relationship-based access control assigns permissions based on the relationship between the principal and the resource, not on a role label. The classic example is ownership: a user can edit a document they own.
The schema does not change much; the relationship is already on the entity:
entity Document {
tenant_id: String,
owner: User,
shared_with: Set<User>,
};
The policy expresses the relationship:
permit (
principal,
action == Action::"edit",
resource
) when {
resource.owner == principal
};
permit (
principal,
action == Action::"read",
resource
) when {
resource.owner == principal
|| principal in resource.shared_with
};
The first rule grants edit to the owner. The second rule grants
read to the owner or to anyone in the resource's shared_with
set. The set membership principal in resource.shared_with is the
ReBAC primitive: the principal is in some set on the resource, and
the policy matches on that.
More elaborate relationships involve multi-hop paths. Consider a "team" model where a user belongs to a team, the team owns projects, and the projects contain documents. The schema:
entity Team;
entity Project {
owner_team: Team,
};
entity Document {
project: Project,
};
entity User in [Team];
The policy that says "anyone in the team that owns the project that contains this document can read the document":
permit (
principal,
action == Action::"read",
resource
) when {
principal in resource.project.owner_team
};
The in operator follows the entity graph: resource.project
yields a Project entity, .owner_team yields a Team entity,
and principal in that Team checks the principal's ancestors (its
parents, their parents, and so on). The entity provider populates
the graph: the document with its project attribute, the project
with its owner_team attribute, the user with their team
memberships as parents. Cedar walks the graph at evaluation time.
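The graph walk behind in can be pictured with a small model. The sketch below is an illustration only, not Cedar's evaluator: the Parents map, the is_in function, and the entity id strings are all hypothetical names chosen for the example. It shows the two properties the policy above relies on: membership is transitive across parent edges, and an entity is always in itself.

```rust
use std::collections::{HashMap, HashSet};

/// Parent edges of a toy entity graph: entity id -> direct parents.
/// (Hypothetical model; Cedar keeps its own entity store.)
pub type Parents<'a> = HashMap<&'a str, Vec<&'a str>>;

/// True if `entity` is `target` itself or has `target` among its ancestors.
pub fn is_in(parents: &Parents<'_>, entity: &str, target: &str) -> bool {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut stack = vec![entity];
    while let Some(current) = stack.pop() {
        if current == target {
            return true;
        }
        // Skip entities already visited so cycles cannot loop forever.
        if !seen.insert(current) {
            continue;
        }
        if let Some(direct) = parents.get(current) {
            stack.extend(direct.iter().copied());
        }
    }
    false
}
```

In this model a user whose parents list contains a team, where the team's parents list contains an organisation, is in both the team and the organisation.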
The pattern generalises to any depth, though policies that walk more than two or three hops start to feel hard to review. When the depth gets uncomfortable, extract the relationship into an intermediate entity (a "can_view" set on the document that the application's data layer computes ahead of time) and let the policy match on the simpler shape.
ABAC: attributes as conditions
Attribute-based access control adds context to the decision. The attributes might be on the principal (MFA status, last authentication time), on the resource (sensitivity level), or on the request (IP address, time of day). A policy applies only when the attributes match.
The schema declares the attribute shapes:
entity User {
tenant_id: String,
mfa_completed: Bool,
last_authn_at: Long, // unix seconds
};
entity Document {
tenant_id: String,
classification: String, // "public" | "internal" | "secret"
};
type Context = {
ip: String,
now: Long,
};
The policy combines attribute conditions:
permit (
principal,
action == Action::"read",
resource
) when {
principal.tenant_id == resource.tenant_id
&& (
resource.classification == "public"
|| (
resource.classification == "internal"
&& principal.mfa_completed
)
|| (
resource.classification == "secret"
&& principal.mfa_completed
&& context.now - principal.last_authn_at < 900 // last 15 min
)
)
};
The rule grants read access in three tiers: public documents to anyone in the tenant, internal documents to anyone in the tenant with MFA completed, secret documents to anyone in the tenant with MFA completed who authenticated within the last fifteen minutes. The attributes drive the gradations; the policy expresses them in one statement.
ABAC is the right tool for time-sensitive, location-sensitive, and context-sensitive policies. It is the wrong tool for static permissions (use RBAC) or for relationship checks (use ReBAC). When in doubt, write the policy and read it back: if the rule says "users in X role can perform Y," it is RBAC; if it says "users with relationship Z to this resource can perform Y," it is ReBAC; if it says "users can perform Y when condition W," it is ABAC.
Composing the three styles
A real production policy set mixes the three. A user who has the
editor role (RBAC) can edit any document, but a user who owns a
document (ReBAC) can edit it regardless of role, and a user trying
to edit a secret document must have MFA completed (ABAC).
// RBAC layer: editors get full access.
permit (
principal in Role::"editor",
action in [Action::"read", Action::"edit", Action::"delete"],
resource
);
// ReBAC layer: owners get full access to their own.
permit (
principal,
action in [Action::"read", Action::"edit", Action::"delete"],
resource
) when {
resource.owner == principal
};
// ReBAC layer: shared-with users get read access.
permit (
principal,
action == Action::"read",
resource
) when {
principal in resource.shared_with
};
// ABAC layer: secret documents require fresh MFA, forbid otherwise.
forbid (
principal,
action,
resource
) when {
resource.classification == "secret"
&& (
!principal.mfa_completed
|| context.now - principal.last_authn_at >= 900 // not within the last 15 min
)
};
The forbid rule overrides any permit that would otherwise
match. The structure works because Cedar evaluates all rules: if
any permit matches and no forbid matches, the decision is
Allow; if any forbid matches, the decision is Deny
regardless of what permits also match.
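The combination rule is small enough to state as code. The sketch below is a model of the decision logic described above, not the Cedar evaluator's API; Effect, Decision, and decide are hypothetical names. It also makes the deny-by-default case explicit: when no permit matches, the result is Deny even though no forbid fired.

```rust
/// Effect of a single policy whose scope and conditions matched the request.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Effect {
    Permit,
    Forbid,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Combine the effects of every matching policy into one decision.
/// Order is irrelevant: only the presence of each effect matters.
pub fn decide(matched: &[Effect]) -> Decision {
    let any_forbid = matched.iter().any(|e| *e == Effect::Forbid);
    let any_permit = matched.iter().any(|e| *e == Effect::Permit);
    if any_permit && !any_forbid {
        Decision::Allow
    } else {
        // Either a forbid matched, or nothing permitted the request.
        Decision::Deny
    }
}
```

For the policy set above, an editor with stale MFA editing a secret document matches the RBAC permit and the ABAC forbid, so the model returns Deny.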
The pattern is to express the broad grants through permit rules
in increasing specificity (role, relationship, context), then to
express the absolute constraints through forbid rules. The
forbid rules are typically about high-sensitivity resources or
about high-risk principal states; they are the small set of cases
where a positive grant is not enough.
Tenant isolation as a structural rule
Multi-tenant applications need a structural rule that no policy
should ever leak data across tenants. The right shape is a single
top-level forbid:
forbid (
principal,
action,
resource
) when {
principal.tenant_id != resource.tenant_id
};
The rule applies to every action on every resource. Any permit
that would have allowed a cross-tenant access is overridden,
wherever it sits in the policy set, because Cedar policies are
unordered and a matching forbid always wins. The rule is the
structural defence against the worst class of authorisation bug a
multi-tenant application can have: an operator from tenant A
accessing tenant B's data because of a mistake in another policy.
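The rule also depends on the principal having a tenant id at all.
A workload principal might be in a global trust domain (no
tenant). If tenant_id can be absent, the schema validator flags
the unguarded comparison, and at evaluation time Cedar skips a
policy whose condition errors rather than treating it as a match,
so an unguarded forbid would not deny. The policy authoring style
is to treat tenant id as a required attribute on every
multi-tenant entity, or to guard the comparison with a has check
(!(principal has tenant_id) || ...), so that drift is denied
rather than silently skipped.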
Step-up as a policy concern
Step-up authentication is the pattern where a user is asked to re-prove identity (or to prove with a stronger factor) before performing a sensitive action. The mechanism is in the state machine (see Factors and methods §"Step-up authentication"); the policy expresses when step-up is required.
The shape:
forbid (
principal,
action == Action::"delete-account",
resource
) when {
!principal.factors_completed.contains("Fido2")
};
The rule denies the account-deletion action unless FIDO2 is in the
user's completed factors. The user reaches the action with a
password-and-TOTP session, gets denied, and the application offers
step-up: the user completes the FIDO2 ceremony, the session's
factors_completed now includes Fido2, the next request to the
delete-account action passes the policy.
The pattern composes with the other styles. A permit rule says
who can delete an account (RBAC: the user themselves, ReBAC: the
admin who owns the user). The forbid rule adds the contextual
requirement (ABAC: FIDO2 in factors_completed). The three rules
together produce a policy that says "the user themselves can
delete their own account, but only after completing FIDO2 in this
session."
Anti-patterns
The two patterns most likely to mislead are worth naming.
The first is duplicating ReBAC as RBAC. The temptation is to
materialise the ownership relationship as a per-resource role
("owner of document 123"), then write an RBAC policy that grants
edit to the role. The shape works but produces an explosion of
roles (one per resource), is hard to invalidate when ownership
changes, and obscures the relationship that the policy is actually
expressing. The right shape is to express ownership as an
attribute (resource.owner == principal) and write the ReBAC
policy directly.
The second is encoding state machines in policies. A workflow that
allows transitions only from certain states is a state machine,
not a policy. Cedar accepts such a rule (permit ... when { resource.state == "draft" && action == Action::"submit" }), but
writing it makes the policy set the source of truth for what the
state machine allows. The right shape is to put the state machine
in code (or in a typed state machine in the application), and to
use Cedar only for "who can invoke this transition" rather than
"which transition is valid right now."
Schema discipline
The most consequential decision in any Cedar integration is the schema. The schema names every entity type, every attribute on every entity, every action that applies to every principal-resource pair, every required and optional context key. Getting the schema right is most of the work; getting the policies right is what follows naturally from a good schema.
Three rules help:
The first is to name entities by their domain meaning, not by the
table they live in. User is the right name; usersRow is the
wrong name. The policies that read like English are the ones that
let reviewers do their job.
The second is to declare attributes as required only when every
production deployment guarantees the attribute is present. An
attribute declared as required forces the entity provider to
return it on every load, which often forces the application to
add an INSERT default. Optional attributes are the right default;
require only when the policy logically depends on it.
The third is to update the schema whenever a policy expression needs an attribute that is not yet declared. The validator catches the inconsistency at load time; the alternative is a runtime deny that is hard to debug. The schema is not optional; treat it as part of the policy set.
Further reading
Cedar policy fundamentals covers the policy lifecycle and the
evaluator surface. Entity providers and request context covers
the data-loading contract the policies in this chapter depend on.
Audit events covers the AuthzEvent variants the evaluator
emits, including the policy id that produced each decision. The
Cedar documentation covers the
language in full detail and is the authoritative reference for
syntax and semantics.
Session lifecycle and crypto envelope
A session in axess is a server-side record that holds the authentication state, the bound principal, and any application data the session carries. The cookie that travels between the browser and the server identifies the session, but the cookie itself does not contain the session data. This separation is what lets the session shape evolve across deployments without invalidating existing cookies, and what lets the data be encrypted at rest with keys the client never sees.
This chapter walks through the lifecycle of a single session from its creation through its expiry, the cookie shape and signing, the AES-256-GCM envelope that encrypts the data at rest, the fingerprint binding that catches cookie replay, and the dirty-flag and write-back machinery that makes the lifecycle invisible to application code.
The cookie
The session cookie is small. By default it carries an opaque
session id (the SessionId newtype, sixteen bytes of cryptographic
randomness from SecureRng, base64-encoded for transport) plus an
HMAC signature computed from the id and the deployment's signing
key. The whole cookie is well under two hundred bytes.
session=<base64(session_id)>.<base64(hmac_sha256(signing_key, session_id))>
The signature defeats forgery. An attacker who guesses or brute-forces a session id cannot use it without also producing the HMAC, which requires the signing key. The signing key is the operational secret covered in the Getting started chapter: a 32-byte value loaded from a secrets manager, stable across process restarts, rotated on a schedule.
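The check on the server side can be sketched as a split followed by a comparison. The code below is a minimal illustration, not the library's implementation: verify_cookie_value and constant_time_eq are hypothetical names, and the sign closure stands in for the real base64(HMAC-SHA256(signing_key, id)) computation, which needs a crypto crate and is deliberately left out. Production code should rely on a vetted constant-time comparison, since a hand-written loop gives no guarantee against compiler optimisation.

```rust
/// Compare two byte strings without returning early on the first difference.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Split `<id>.<sig>` and return the id only if `sig` equals what `sign`
/// produces for that id. `sign` is an assumption of this sketch: it stands
/// in for the encoded HMAC over the id.
pub fn verify_cookie_value<'a>(
    value: &'a str,
    sign: impl Fn(&str) -> String,
) -> Option<&'a str> {
    // The standard base64 alphabet has no '.', so the first dot is the separator.
    let (id, sig) = value.split_once('.')?;
    if id.is_empty() {
        return None;
    }
    let expected = sign(id);
    constant_time_eq(sig.as_bytes(), expected.as_bytes()).then_some(id)
}
```

A cookie with a missing separator, an empty id, or a signature that does not match is rejected the same way, which is what lets the caller fall through to a guest session without distinguishing the cases.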
The cookie attributes are conservative by default: HttpOnly
(client-side JavaScript cannot read it), SameSite=Lax (it is sent
on top-level cross-site navigations but not on cross-site
sub-requests), Path=/ (it applies to the whole application), and
Secure when configured (it is only sent over HTTPS). The session
lifetime is set through SessionLayer::with_ttl; the constructor's
default is twenty-four hours.
The cookie is opaque. The session id maps to a row in the session store, and the row carries the actual data. A user who copies the cookie has the session id and the signature, both of which the server already has; nothing on the cookie carries the user's identity, the factors completed, or any other session state.
The session store
The session store is the persistence layer for the data the cookie identifies. Each row in the store carries:
- The session id (the primary key).
- The serialised SessionData (covered below).
- The created-at and updated-at timestamps.
- The expiry timestamp.
- The optional fingerprint binding (covered below).
SessionData is the application's view of the session:
pub struct SessionData {
pub auth_state: AuthState, // see Part II
pub principal_hint: Option<PrincipalHint>, // cache of recent extractor outputs
pub custom: HashMap<String, serde_json::Value>, // application data
pub schema_version: u32, // see Schema migration
}
The auth_state carries the state-machine variant (Guest,
Authenticating, Authenticated, PendingWorkflow,
Identifying). The principal_hint is an optional cache of the
principal extracted during this session's authentication, kept on
the session so the PrincipalResolver does not have to recompute
it on every request. The custom map carries application-defined
data with a sixty-four kilobyte cap. The schema_version is the
field that lets the data shape evolve.
The serialisation format is MessagePack: faster than JSON, more
compact, and stable across versions of serde. Backends that
support binary blobs persist the bytes directly; backends that
require text (some configurations of MySQL, for instance) encode
the bytes as base64 first. The format is the same across all
backends; switching backends does not require re-serialisation.
The AES-256-GCM envelope
The serialised session bytes are encrypted before storage. The envelope is AES-256-GCM, a standard authenticated-encryption scheme that produces a ciphertext, a tag, and a nonce. The encryption key is a 32-byte value loaded from a secrets manager at process start.
The shape of one envelope:
nonce (12 bytes) | ciphertext (variable) | tag (16 bytes)
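Reading that layout back is a matter of slicing fixed-size pieces off both ends. The sketch below assumes only the byte layout shown above; EnvelopeParts and split_envelope are hypothetical names, and the actual decryption step is out of scope here.

```rust
pub const NONCE_LEN: usize = 12;
pub const TAG_LEN: usize = 16;

/// Borrowed views into one stored envelope.
pub struct EnvelopeParts<'a> {
    pub nonce: &'a [u8],
    pub ciphertext: &'a [u8],
    pub tag: &'a [u8],
}

/// Split `nonce | ciphertext | tag`. Returns None when the blob is too
/// short to contain both fixed-size fields.
pub fn split_envelope(bytes: &[u8]) -> Option<EnvelopeParts<'_>> {
    if bytes.len() < NONCE_LEN + TAG_LEN {
        return None;
    }
    let (nonce, rest) = bytes.split_at(NONCE_LEN);
    let (ciphertext, tag) = rest.split_at(rest.len() - TAG_LEN);
    Some(EnvelopeParts { nonce, ciphertext, tag })
}
```

The fixed overhead is 28 bytes per row, so a blob of exactly 28 bytes is a valid envelope around an empty plaintext.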
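The nonce is generated fresh per write through SecureRng. AES-GCM
is sensitive to nonce reuse (a reused nonce against the same key
catastrophically compromises confidentiality and authenticity). With
a twelve-byte (96-bit) random nonce, the chance that any two of n
encryptions under one key share a nonce is about n^2 / 2^97 (the
birthday bound), so a collision only becomes likely near 2^48
encryptions. NIST SP 800-38D caps random-nonce GCM at 2^32
encryptions per key, which keeps the probability below 2^-32;
realistic session volumes stay under that cap when the envelope
key is rotated on a schedule.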
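The nonce-reuse risk can be estimated with the standard birthday approximation: for n encryptions with uniformly random 96-bit nonces under one key, the probability that any two nonces coincide is roughly n^2 / 2^97. The helper below is a back-of-the-envelope calculator under that assumption (the function name is hypothetical), useful for checking a deployment's write volume against a key-rotation schedule.

```rust
/// Birthday-bound estimate of the probability that at least two of `n`
/// uniformly random 96-bit nonces collide: p ≈ n^2 / 2^97.
/// The approximation is only meaningful while the result is far below 1.
pub fn nonce_collision_probability(n: u64) -> f64 {
    let n = n as f64;
    (n * n) / 2f64.powi(97)
}
```

At 2^32 encryptions under one key the estimate is 2^-33, and a collision only becomes likely as n approaches 2^48.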
The additional authenticated data (AAD) carries the session id. The binding means that an encrypted blob from one session cannot be swapped into another session's row even if an attacker can write to the database. The session id is plaintext in the cookie, so this adds no confidentiality, but it adds integrity: the database is not the source of truth for "which session is this blob from."
Key rotation is the operational lever. SessionCrypto::new(key)
constructs an envelope with one current key. .with_previous_key(old_key)
keeps the old key available for reads, so sessions encrypted with
the old key continue to decrypt while new writes use the new key.
After a transition window long enough for every existing session
to be rewritten (which happens naturally over the next session
write, or can be forced through a background scan), the previous
key can be removed.
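The read path during a rotation window amounts to "try the current key, then the previous one". The sketch below shows that control flow only and is not the SessionCrypto implementation: open_with_rotation and OpenedWith are hypothetical names, and try_open stands in for the real AES-256-GCM open, which returns nothing when the tag does not verify under the given key.

```rust
/// Which key opened the envelope.
#[derive(Debug, PartialEq, Eq)]
pub enum OpenedWith {
    Current,
    Previous,
}

/// Try the current key first, then the previous key if one is configured.
/// `try_open` is an assumption of this sketch: it returns Some(plaintext)
/// only when authenticated decryption succeeds under that key.
pub fn open_with_rotation<K, T>(
    current: &K,
    previous: Option<&K>,
    try_open: impl Fn(&K) -> Option<T>,
) -> Option<(T, OpenedWith)> {
    if let Some(plain) = try_open(current) {
        return Some((plain, OpenedWith::Current));
    }
    // No previous key configured, or neither key verifies: give up.
    let plain = try_open(previous?)?;
    Some((plain, OpenedWith::Previous))
}
```

A caller that sees OpenedWith::Previous knows the row still needs rewriting under the current key, which is the signal a background scan would use to decide when the previous key can be retired.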
The chapter Operations runbook covers the rotation sequence and the staged rollout for both the signing key and the envelope key.
The fingerprint binding
A session id alone is not enough to defend against cookie theft. An attacker who captures a session cookie can replay it from a different browser, IP, and operating system, and the session machinery on the server cannot tell the difference without additional signal.
The fingerprint binding is the additional signal. At session creation (typically at first login), the server computes a fingerprint from the user agent header, the IP address (read through the trusted-proxy configuration), and any other coarse features the deployment chooses to include. The fingerprint is HMAC-signed and stored alongside the session id. On every subsequent request, the server recomputes the fingerprint from the incoming request and compares it (constant-time) against the stored value.
The match has three outcomes:
- Match exactly: the session is allowed to proceed.
- Match within a tolerance: the session is allowed to proceed, but the divergence is logged.
- Mismatch beyond tolerance: the session is treated as compromised and one of three responses fires (warn, re-authenticate, full logout), depending on the configured policy.
The tolerance accommodates legitimate change: a user's IP can change when they switch from wifi to cellular; their user agent can update overnight when the browser auto-updates. Strict matching on either signal produces too many false positives. The default is coarse: the IP must remain within the same /24 (for IPv4) or /64 (for IPv6), and the user agent must share its major version.
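The IP half of that tolerance is a prefix comparison. The sketch below uses only std::net and illustrates the /24 and /64 rule described above; same_coarse_network is a hypothetical name, and the sketch does not normalise IPv4-mapped IPv6 addresses or cover the user-agent comparison.

```rust
use std::net::IpAddr;

/// True when both addresses fall in the same /24 (IPv4) or /64 (IPv6).
/// Addresses from different families never match.
pub fn same_coarse_network(stored: IpAddr, seen: IpAddr) -> bool {
    match (stored, seen) {
        // /24: the first three octets must agree.
        (IpAddr::V4(a), IpAddr::V4(b)) => a.octets()[..3] == b.octets()[..3],
        // /64: the first eight octets must agree.
        (IpAddr::V6(a), IpAddr::V6(b)) => a.octets()[..8] == b.octets()[..8],
        _ => false,
    }
}
```

A move from 203.0.113.7 to 203.0.113.200 stays within tolerance; a move to 203.0.114.7 does not.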
The chapter Cookies, fingerprinting, hijack detection covers the configuration knobs and the trade-offs in detail.
The Tower layer
The SessionLayer is the Tower middleware that threads the
session through every request. The layer's call method is the
sequencing centre of the session lifecycle.
The pseudocode of one request:
async fn call(&self, mut req: Request) -> Result<Response, Error> {
    // 1. Extract the cookie (or skip if absent → Guest).
    let cookie = extract_session_cookie(&req);
    // 2. Verify the HMAC, decode the session id.
    //    (A failed check falls through to a guest session; elided here.)
    let session_id = verify_cookie(&cookie, &self.signing_key)?;
    // 3. Load the row from the session store.
    let row = self.store.load(&session_id).await?;
    // 4. Decrypt the envelope, deserialise the data.
    let data = decrypt_and_deserialize(&row, &self.crypto)?;
    // 5. Verify the fingerprint binding.
    enforce_fingerprint(&data, &req, &self.fingerprint_policy)?;
    // 6. Wrap into a SessionHandle, insert into request extensions.
    let handle = SessionHandle::new(session_id.clone(), data);
    req.extensions_mut().insert(handle.clone());
    // 7. Run the handler.
    let mut response = self.inner.call(req).await?;
    // 8. If the handle is dirty, write back.
    if handle.is_dirty() {
        let new_data = handle.into_data();
        // The session id is the AAD, so encrypt under the id the row is saved as.
        let new_envelope = encrypt(&new_data, &self.crypto, &session_id);
        self.store.save(&session_id, &new_envelope).await?;
        // Reissue the cookie (with a fresh id if rotation was triggered).
        response.headers_mut().append("Set-Cookie", construct_cookie(...));
    }
    Ok(response)
}
Three of the eight steps are worth dwelling on.
Step 5 (the fingerprint check) is the gate that catches replay. What happens on a mismatch depends on the configured policy: warn-only deployments log and continue; strict deployments stop before the handler runs, and the session layer returns a 401 (or the configured response).
Step 7 is where the handler actually runs. The handler receives a
SessionHandle via AuthSession (the extractor), reads or
mutates it, and the mutations are tracked via the dirty flag.
Step 8 is the write-back. The session is saved only when it is dirty, which means a read-only request (the dashboard, a metric endpoint, an idle-page poll) does not write to the session store. The store sees writes proportional to the rate of state changes, not the rate of requests, which is the difference between a manageable database load and a saturated one.
The dirty flag
The dirty flag is the optimisation that makes the session store
viable at the read rates a real application produces. The flag is
on SessionHandle and is set by any method that mutates the
session: set_authenticated, clear, set_custom, and so on.
The flag is checked at step 8 in the lifecycle above. A clean handle is dropped silently; a dirty handle triggers the serialisation, encryption, and store-write path, plus a cookie reissue when the session id has rotated.
The trade-off is that a read of mutable state through an immutable
borrow does not mark dirty, but the application's pattern for that
case is to use the typed accessors (is_authenticated,
current_user_id, custom_get) that do not need a mutable
borrow. Mutating accessors (clear, set_custom, the orchestrated
begin_login and verify_factor paths) all set the flag.
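The split between accessors that take &self and mutators that take &mut self is what lets the flag be maintained without the caller thinking about it. The sketch below is a simplified stand-in, not the real SessionHandle: TrackedSession is a hypothetical type, the custom map holds plain strings instead of serde_json::Value, and only the method names custom_get, set_custom, clear, and is_dirty are taken from the text above.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a session handle that tracks whether it changed.
#[derive(Default)]
pub struct TrackedSession {
    custom: HashMap<String, String>,
    dirty: bool,
}

impl TrackedSession {
    /// Read-only accessor: takes &self, so it cannot touch the flag.
    pub fn custom_get(&self, key: &str) -> Option<&str> {
        self.custom.get(key).map(String::as_str)
    }

    /// Mutating accessor: every write marks the session dirty.
    pub fn set_custom(&mut self, key: &str, value: &str) {
        self.custom.insert(key.to_owned(), value.to_owned());
        self.dirty = true;
    }

    /// Clearing is a mutation too, even if the map was already empty.
    pub fn clear(&mut self) {
        self.custom.clear();
        self.dirty = true;
    }

    /// Checked once per request to decide whether to write back.
    pub fn is_dirty(&self) -> bool {
        self.dirty
    }
}
```

Because the borrow checker already distinguishes the two kinds of access, a handler that only reads can never cause a store write by accident.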
The cookie is reissued only when the session id rotates, not on
every write. Identifier rotation happens at two automatic moments
(Guest → authenticated to defeat fixation; logout so the new
Guest session doesn't share an id with the old) plus explicit
re-issuance through AuthSession::regenerate. The routine
read-write-read cycle does not rotate.
regenerate exists for the cases the library can't infer on its
own: any handler that crosses a privilege boundary should call
it before responding. The canonical list (drawn from OWASP ASVS
V3, the OWASP Session Management Cheat Sheet, and NIST SP 800-63B
on AAL transitions):
| Boundary | Rotate session id? | Also revoke sibling sessions? |
|---|---|---|
| Primary login | automatic | optional |
| Logout | automatic (id invalidated) | depends |
| MFA factor added (TOTP, WebAuthn, recovery codes, …) | yes | optional |
| MFA factor removed or disabled (AAL drops) | yes | recommended |
| Password / primary credential change | yes | strongly recommended |
| Step-up to a higher assurance level | yes | n/a |
| Account recovery flow completion | yes | yes |
| Impersonation start / stop | yes | n/a |
| Role grant / revoke, scope change | yes | depends on direction |
| Tenant switch in a multi-tenant deployment | yes | n/a |
| Profile edit, theme change, factor config tuning | no | n/a |
Rotating does two things at once: it defeats fixation (any
pre-existing id, including one an attacker planted before the
boundary, becomes useless), and it caps the blast radius of a
captured pre-elevation cookie (a cookie stolen at AAL1 cannot ride
the new AAL2 binding). Sibling-session revocation
(SessionRegistry::revoke_user_sessions) is a strictly stronger
statement that matters most on credential changes, where any other
device holding a stale password-derived session must be cut off.
A library hook on FactorStore::save_factor would catch some of
the rows above and miss the rest (un-enrolment, password change,
role grants), and would misfire on factor-config tuning that is
not a privilege change. The boundary decision is necessarily
app-level. Call regenerate at the handler that knows.
When the session expires
The session has two expiry mechanisms. The first is the cookie's
own Max-Age attribute, which the browser enforces: after the
configured TTL, the browser stops sending the cookie. The second
is the session store's expiry timestamp, which the server
enforces: after the timestamp passes, the store returns the row
as expired (or the cleanup sweep removes it altogether).
Both are needed. The cookie expiry handles the browser-side case (the user closes the browser, the cookie is forgotten); the server-side expiry handles the case where the cookie outlives the session's intended lifetime (an attacker captures a cookie and replays it after the user's session would have expired).
The expiry is sliding by default: every dirty write updates the expiry timestamp, so an actively-used session keeps refreshing. The maximum lifetime is the configured TTL from the most recent write. A session that goes idle for the TTL expires; a session that gets a single dirty write per TTL window never expires (through ordinary use).
Some deployments want a hard cap: a session expires absolutely at
a fixed time after creation, regardless of activity. The
SessionLayer::with_absolute_ttl option enables this; the
absolute expiry is stored at session creation and is not refreshed.
The two TTLs (sliding and absolute) compose: the session expires
at the earlier of the two.
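The composition of the two TTLs reduces to taking a minimum. The sketch below works in plain Unix seconds and is an illustration of the rule, not the library's code: effective_expiry and is_expired are hypothetical names, and the absolute TTL is modelled as an Option because it is opt-in.

```rust
/// Expiry timestamp (Unix seconds) for a session, given when it was created,
/// when it was last written, and the configured TTLs (in seconds).
pub fn effective_expiry(
    created_at: u64,
    last_write_at: u64,
    sliding_ttl: u64,
    absolute_ttl: Option<u64>,
) -> u64 {
    // Sliding expiry moves forward with every dirty write.
    let sliding = last_write_at.saturating_add(sliding_ttl);
    match absolute_ttl {
        // The absolute cap is fixed at creation and never refreshed.
        Some(cap) => sliding.min(created_at.saturating_add(cap)),
        None => sliding,
    }
}

/// A session is expired once the clock reaches its expiry timestamp.
pub fn is_expired(now: u64, expiry: u64) -> bool {
    now >= expiry
}
```

With a one-day sliding TTL and a seven-day absolute TTL, a write late on day six would slide the expiry past day seven, but the cap holds it at exactly seven days after creation.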
Session cleanup
Expired sessions need to be removed from the store. The cleanup is the application's responsibility (axess does not run a background task on its own), but the patterns are uniform across backends.
The SQL backends expose a cleanup_expired method that deletes
rows whose expiry timestamp has passed. The
examples/sqlite/
reference application runs this on a tokio::interval once per
hour; the interval is tunable.
The Valkey backend uses Valkey's native TTL: each row is written with an expiry, and Valkey removes it automatically. There is no cleanup task to write because the database does the work.
For deployments with millions of sessions, the cleanup pattern matters operationally. A daily delete-by-range is fine for tens of thousands; for millions, the delete needs to be incremental (a limit clause, looping through batches) to avoid long-running transactions that lock the table.
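The incremental delete can be sketched as a loop that stops at the first short batch. The code below is backend-agnostic and hypothetical: cleanup_in_batches is not an axess API, and delete_batch stands in for whatever statement the deployment runs to delete at most limit expired rows and report how many it removed.

```rust
/// Repeatedly delete up to `batch_size` expired rows until a batch comes
/// back short, and return the total number of rows removed.
/// `delete_batch(limit)` is an assumption of this sketch: it deletes at most
/// `limit` expired rows in its own short transaction and returns the count.
pub fn cleanup_in_batches(
    batch_size: usize,
    mut delete_batch: impl FnMut(usize) -> usize,
) -> usize {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut total = 0;
    loop {
        let deleted = delete_batch(batch_size);
        total += deleted;
        // A short batch means nothing expired is left to delete.
        if deleted < batch_size {
            return total;
        }
    }
}
```

Each batch commits on its own, so locks are held only for the duration of one bounded delete; a real task would usually also sleep briefly between batches to leave room for foreground traffic.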
What this enables
The lifecycle as designed makes session handling invisible to
application code. The handler reads AuthSession, mutates it (or
does not), and the framework handles the cookie, the
serialisation, the encryption, the fingerprint check, the
write-back, and the expiry. The application's surface area for
session bugs is small: most session-related issues are policy
choices (rotate too aggressively, lockout too strict, fingerprint
tolerance too tight), not bugs in the lifecycle itself.
The chapter Backends covers the storage backends in detail; the
chapter Cookies, fingerprinting, hijack detection covers the
fingerprint binding in detail; the chapter Schema migration
covers the SessionData::schema_version field and what happens
when the data shape changes between deployments.
Further reading
Backends: SQLite, Postgres, MySQL, Valkey covers the four
first-party session stores and their feature-flag and dialect
notes. Cookies, fingerprinting, hijack detection covers the
configuration knobs for the fingerprint and the trusted-proxy
configuration that determines how IP is read. Schema migration
covers the SessionData::schema_version field. Operations
runbook covers signing-key and envelope-key rotation.
Backends: SQLite, Postgres, MySQL, Valkey
Axess ships four first-party session storage backends. The choice between them is the operational decision the deployment makes when it picks a database, not a technical decision the application code needs to revisit. This chapter covers the capability matrix, the configuration shape per backend, and the operational notes that have caught real deployments by surprise.
The feature flags are sqlite, postgres, mysql, and valkey,
all off by default. Enable the one your deployment uses.
What the backends actually do
A session storage backend implements the SessionStore trait. The
trait is small on purpose: it offers a key-value-with-TTL
surface plus a handful of session-specific verbs the typical
application needs.
#[async_trait]
pub trait SessionStore: Send + Sync {
async fn load(&self, id: &SessionId) -> Result<Option<SessionRow>, StoreError>;
async fn save(&self, row: &SessionRow) -> Result<(), StoreError>;
async fn delete(&self, id: &SessionId) -> Result<(), StoreError>;
async fn cycle(&self, old: &SessionId, new: &SessionId) -> Result<(), StoreError>;
async fn cleanup_expired(&self) -> Result<usize, StoreError>;
async fn find_sessions_for_user(
&self,
user_id: &UserId,
tenant_id: &TenantId,
) -> Result<Vec<SessionId>, StoreError>;
}
The verbs map to operations the lifecycle in the previous chapter
exercises. load retrieves a session by id. save writes a
dirty session. delete removes a session on logout. cycle
atomically rotates the session id (used at the Guest to
Authenticated transition, and at sensitive step-up points).
cleanup_expired removes rows whose expiry has passed.
find_sessions_for_user is the verb behind
"log this user out of all sessions" admin operations.
The implementations differ in how they store the rows and how they implement the verbs, but the surface is the same.
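The two session-specific verbs are easiest to see against an in-memory map. The sketch below is a synchronous, std-only toy and not the library's Memory backend: MemoryStore and Row are hypothetical types, ids are plain strings, and the async, error-returning signatures of the real trait are dropped to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Toy row: just enough fields to show the verbs.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Row {
    pub user_id: String,
    pub tenant_id: String,
    pub blob: Vec<u8>,
}

#[derive(Default)]
pub struct MemoryStore {
    rows: HashMap<String, Row>,
}

impl MemoryStore {
    pub fn save(&mut self, id: &str, row: Row) {
        self.rows.insert(id.to_owned(), row);
    }

    pub fn load(&self, id: &str) -> Option<&Row> {
        self.rows.get(id)
    }

    /// Rotate the id: the row moves from `old` to `new` in one step, so the
    /// old id stops resolving the moment the new one starts.
    pub fn cycle(&mut self, old: &str, new: &str) -> bool {
        match self.rows.remove(old) {
            Some(row) => {
                self.rows.insert(new.to_owned(), row);
                true
            }
            None => false,
        }
    }

    /// All session ids for one user within one tenant, sorted for stability.
    pub fn find_sessions_for_user(&self, user_id: &str, tenant_id: &str) -> Vec<&str> {
        let mut ids: Vec<&str> = self
            .rows
            .iter()
            .filter(|(_, r)| r.user_id == user_id && r.tenant_id == tenant_id)
            .map(|(id, _)| id.as_str())
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

A SQL backend gets the same effect from an UPDATE on the primary key inside a transaction and from the index on user id plus tenant id; the map only makes the intended semantics visible.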
Capability matrix
| Capability | Memory | SQLite | Postgres | MySQL | Valkey |
|---|---|---|---|---|---|
| Required feature | always-on | sqlite | postgres | mysql | valkey |
| Encryption at rest | none | optional (AES-GCM) | optional (AES-GCM) | optional (AES-GCM) | optional (AES-GCM) |
| Cluster-safe | no | with care | yes | yes | yes |
| Native TTL | n/a | manual sweep | manual sweep | manual sweep | yes |
| Session registry support | yes | adopter | adopter | adopter | yes |
| Schema migration | n/a | sqlx-migrate | sqlx-migrate | sqlx-migrate | none needed |
The encryption-at-rest column is the AES-256-GCM envelope from the previous chapter. The application configures it with a 32-byte key; the backend wraps the envelope around the serialised session data before writing. The envelope is optional because some deployments accept the unencrypted at-rest store (when the database is itself encrypted, when the threat model does not require it), and decrypting on every read costs a few microseconds per session. The recommendation for production is to enable encryption unless the deployment has a specific reason not to.
The cluster-safe column says whether multiple application instances can share the same backend without coordination issues. SQLite is single-writer; a deployment with one application instance behind a load balancer is fine, but multiple instances need to share the SQLite file over a filesystem the database supports (which is operating-system-dependent and risky). Postgres, MySQL, and Valkey are cluster-safe out of the box.
The native TTL column says whether the database has a native mechanism for removing expired rows. SQLite, Postgres, and MySQL do not; the application runs a periodic cleanup task. Valkey expires keys automatically as they age past their TTL, which means the cleanup task is unnecessary.
SQLite
The SQLite backend is right for development, for tests, for single-instance production deployments, and for embedded-style applications where the database lives on the same machine as the application.
Configuration:
use axess::backends::sqlite::SessionStore;
use axess::session::SessionCrypto;
let pool = sqlx::sqlite::SqlitePoolOptions::new()
.max_connections(5)
.connect("sqlite:axess.db")
.await?;
let crypto = SessionCrypto::new(envelope_key); // optional encryption
let store = SessionStore::new(pool.clone(), crypto);
store.init_schema().await?;
init_schema creates the sessions table and the indexes the
backend needs. It is idempotent; calling it on a database that
already has the table is a no-op.
The cleanup pattern is a background task that runs store.cleanup_expired on an interval, typically once per hour. The examples/sqlite/ reference application demonstrates this in main.rs.
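To make the sweep's semantics concrete, here is a minimal in-memory sketch. InMemorySessions is a hypothetical stand-in for the sessions table, not the axess backend, and passing `now` explicitly is an assumption made so the example stays deterministic; the real cleanup_expired signature is not shown in this chapter.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the sessions table: session id -> expiry (Unix seconds).
#[derive(Default)]
pub struct InMemorySessions {
    expiries: HashMap<String, u64>,
}

impl InMemorySessions {
    pub fn insert(&mut self, id: &str, expires_at: u64) {
        self.expiries.insert(id.to_string(), expires_at);
    }

    /// Removes every session whose expiry is at or before `now`
    /// and returns how many rows were removed.
    pub fn cleanup_expired(&mut self, now: u64) -> usize {
        let before = self.expiries.len();
        self.expiries.retain(|_, expires_at| *expires_at > now);
        before - self.expiries.len()
    }

    pub fn len(&self) -> usize {
        self.expiries.len()
    }
}
```

Running the sweep twice with the same `now` removes nothing the second time, which is the property that makes an hourly interval safe to overlap or retry.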
The operational notes:
- SQLite locks on writes. It admits one writer at a time regardless of pool size; the max_connections setting on the pool determines how many connections can be open at once, and WAL mode (configured in the connection string) is what enables concurrent reads alongside a write. Use WAL mode for any deployment that has more than one request at a time.
- The schema migration story is sqlx::migrate!: the migrations directory under the application is the source of truth, and the pool runs them at startup. Axess does not include its own migrations; init_schema is enough.
- Backups: a SQLite session store can be backed up with the standard sqlite3 .backup command, which works against a live database. The session data is encrypted at rest if the envelope is configured, so a backup carries the same security posture as the live data.
Postgres
Postgres is the right backend for most production deployments. It is cluster-safe, has good concurrency, supports JSONB if a deployment wants to index into the session's custom map, and is the most-tested backend in axess after SQLite.
Configuration:
use axess::backends::postgres::SessionStore;
use axess::session::SessionCrypto;
use sqlx::postgres::PgPoolOptions;

let pool = PgPoolOptions::new()
    .max_connections(20)
    .connect("postgres://app@db:5432/axess")
    .await?;
let store = SessionStore::new(pool.clone(), SessionCrypto::new(envelope_key));
store.init_schema().await?;
The pool sizing depends on the application's request rate; twenty
is a reasonable starting point for a single application instance,
multiplied by the number of instances and tuned against the
database's max_connections setting.
The operational notes:
- The init_schema call creates the sessions table with an index on the expiry timestamp (for cleanup_expired) and on the user id plus tenant id (for find_sessions_for_user). The indexes are essential at any meaningful scale; do not remove them.
- CockroachDB is wire-compatible with Postgres and works against this backend with one caveat: Cockroach's lock semantics differ in edge cases (a SELECT ... FOR UPDATE pattern that works on Postgres can produce different behaviour on Cockroach). The axess CI runs the Postgres integration suite against Cockroach to catch divergence; the failures that have surfaced are noted in this chapter when they affect adopter code.
- Postgres extensions: pgcrypto can be used as an alternative to the AES-GCM envelope, but the axess envelope is faster (the encryption happens in the application before the network write, not on the database side) and uses the same key as other axess encryption. Stick with the envelope unless a specific deployment reason argues for pgcrypto.
MySQL
The MySQL backend is right for deployments where MySQL is the already-deployed database. The capability surface is the same as Postgres, with a handful of dialect differences that affect the implementation but not the application.
Configuration:
use axess::backends::mysql::SessionStore;
use axess::session::SessionCrypto;
use sqlx::mysql::MySqlPoolOptions;

let pool = MySqlPoolOptions::new()
    .max_connections(20)
    .connect("mysql://app@db:3306/axess")
    .await?;
let store = SessionStore::new(pool.clone(), SessionCrypto::new(envelope_key));
store.init_schema().await?;
The operational notes:
- The dialect differences from Postgres are mostly invisible: ON CONFLICT DO UPDATE becomes ON DUPLICATE KEY UPDATE, the placeholder syntax shifts from $1 to ?, and datetime precision defaults to seconds rather than microseconds. Axess handles all three internally; the application code is identical.
- MariaDB 10.x and later versions are compatible with the same schema and the same SQL. The CI runs against both MySQL 8.x and MariaDB 10.x.
- Timezone handling differs. MySQL stores DATETIME values as naive timestamps in the server's timezone; the backend serialises expiries as UTC and reads them back as UTC, sidestepping the implicit-conversion trap.
- Connection options: pool sizing is the same as Postgres. MySQL has a default wait_timeout of eight hours, after which idle connections are closed; the sqlx pool handles reconnection automatically, but be aware of the setting if connection-state matters to your application.
Valkey
The Valkey backend is right for deployments where a Redis-style key-value store is already present in the architecture, or for deployments where the session-store load is high enough that the overhead of a relational database is undesirable. Valkey's TTL mechanic makes session expiry automatic: the cleanup task is not needed.
Configuration:
use axess::backends::valkey::SessionStore;
let client = redis::Client::open("redis://valkey:6379")?;
let store = SessionStore::new(client, SessionCrypto::new(envelope_key));
The Valkey backend does not need a schema initialisation; the keys are written directly with TTLs.
The operational notes:
- Cluster mode: the Valkey client supports cluster mode through the cluster feature of the underlying redis crate. The keys axess writes are prefixed (axess:session:, axess:registry:, ...) so cluster sharding by key works without conflict.
- Persistence: Valkey can be configured for in-memory only, for RDB snapshots, or for AOF (append-only file) durability. The axess session store is fine on any of the three; the choice trades latency against durability. For sessions specifically, in-memory is acceptable if the deployment tolerates losing all active sessions on a Valkey restart; AOF is the standard choice when sessions matter.
- The session registry: Valkey is the only first-party backend with a session registry (the verb behind "list all sessions for a user", which the SQL backends require an adopter to wire up). The registry uses a sorted set keyed by user id, with the session ids as members and the expiry timestamp as the score. Operations on the registry are O(log n) in the number of sessions per user.
- Eviction policy: Valkey under memory pressure can evict keys. If the eviction policy is allkeys-lru, the session store can lose sessions before their TTL fires. The recommendation is to configure Valkey with volatile-lru (only TTL'd keys are candidates for eviction), and to monitor the eviction rate. Session keys carry TTLs, so they remain eviction candidates under volatile-lru as well; the policy narrows what can be evicted rather than ruling eviction out. If evictions happen at all, the Valkey instance is undersized; scale up before the user-visible behaviour becomes painful.
Choosing between them
The decision tree is short.
If the deployment already has a database, use the matching backend. Postgres for Postgres, MySQL for MySQL, Valkey for Redis or Valkey.
If the deployment is starting fresh and the application is single-instance, SQLite is the simplest choice and works fine for small-to-medium scale.
If the deployment is multi-instance and starting fresh, Postgres is the conservative default. The operational tooling for Postgres is mature, the backup story is well understood, and the schema flexibility leaves room for future extensions.
If the deployment expects very high session throughput (tens of thousands of writes per second, or pathologically high read rates), Valkey is the choice. The latency is the lowest of the four, the TTL mechanic removes the cleanup task, and the cluster scaling is proven.
The choice is not irreversible. The session storage backend is behind a trait, the data shape is uniform across backends, and migrating between backends is a matter of reading from the old store and writing to the new one (during a deploy window where both are active, or with a one-time migration script that runs against a paused application). No data shape changes; the migration is purely operational.
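The read-from-old, write-to-new copy can be sketched as follows. KvStore and MemStore are hypothetical, synchronous stand-ins invented for this illustration (the real storage trait is async and carries TTLs, which a real copy would also need to preserve); the point is only that the record bytes move unchanged.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous stand-in for a session storage backend.
pub trait KvStore {
    fn keys(&self) -> Vec<String>;
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: &str, value: Vec<u8>);
}

/// In-memory implementation used to exercise the copy.
#[derive(Default)]
pub struct MemStore(pub HashMap<String, Vec<u8>>);

impl KvStore for MemStore {
    fn keys(&self) -> Vec<String> {
        self.0.keys().cloned().collect()
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.0.insert(key.to_string(), value);
    }
}

/// Copies every record from `old` to `new` without touching the payload.
/// Returns the number of records copied.
pub fn copy_all(old: &dyn KvStore, new: &mut dyn KvStore) -> usize {
    let mut copied = 0;
    for key in old.keys() {
        // A record can expire between listing and reading; skip it if so.
        if let Some(value) = old.get(&key) {
            new.put(&key, value);
            copied += 1;
        }
    }
    copied
}
```

Because the payload is copied as opaque bytes, an encrypted envelope survives the move as long as both stores are configured with the same envelope key.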
Cross-backend Store<K, V> access
All four backends also implement the generic
axess_core::store::Store<SessionId, SessionData> trait. This
matters for adopters who want backend-agnostic access (test
doubles, ops endpoints that work against any deployment, code that
needs to switch backends at runtime).
use axess::store::Store;
use std::sync::Arc;

async fn dump_session(
    store: Arc<dyn Store<SessionId, SessionData>>,
    id: &SessionId,
) -> Option<SessionData> {
    // Store errors and missing sessions both collapse to None here.
    store.get(id).await.ok().flatten()
}
The generic trait omits session-domain operations (cycle,
find_sessions_for_user). Code that needs those operations uses
the session-specific SessionStore trait directly; code that only needs
key-value-with-TTL semantics uses Store.
The duplication is deliberate. The generic trait is the common denominator across backends; the specific trait carries the session vocabulary. Mixing them lets each callsite use the narrowest surface it needs.
Further reading
Session lifecycle and crypto envelope covers the full lifecycle that exercises these backends. Cookies, fingerprinting, hijack detection covers the cookie attributes and the fingerprint binding the backends store. Schema migration covers what happens when the session data shape changes between deployments. Operations runbook covers signing-key and envelope-key rotation across all four backends.
Cookies, fingerprinting, hijack detection
The session cookie is the credential a browser presents on every request. If an attacker captures it, they can act as the user until the session expires or is revoked. The defences are layered: cookie attributes constrain how the browser handles the cookie, HMAC signing detects tampering, fingerprint binding catches replay from a different browser, and trusted-proxy configuration controls how the application reads the request's IP. This chapter covers each layer.
Cookie attributes
The session cookie carries five attributes the deployment cares
about. Most have defaults that are right for production; one
(Secure) needs to be set explicitly.
Path=/ makes the cookie apply to the whole application. The
alternative (a narrower path) is occasionally useful for embedded
deployments where the application lives under a sub-path of a
larger site; for most deployments, the root path is right.
HttpOnly prevents client-side JavaScript from reading the
cookie. The attribute defeats one class of cross-site scripting
attack: an attacker who injects JavaScript into the page cannot
read the session cookie through document.cookie and exfiltrate
it. The attribute is on by default and there is rarely a reason to
turn it off.
SameSite controls when the browser sends the cookie on
cross-site requests. There are three values:
- Strict means the cookie is sent only on same-site requests. A link from an external site to your application produces a guest-state request even if the user is logged in; the user must navigate from within your site for the session to be recognised.
- Lax (the default) means the cookie is sent on top-level cross-site navigations (a link click) but not on cross-site sub-requests (an embedded image, an XHR). The combination defeats most CSRF attacks while preserving the user experience of "click an external link, arrive logged in."
- None means the cookie is sent on every cross-site request. This is the right setting when the application is embedded in iframes on third-party sites; it is the wrong setting otherwise.
The recommendation is Lax for most deployments. Switch to
Strict for the highest-sensitivity actions; the cost is the
user-experience friction of cross-site link arrivals not being
logged in.
Secure requires HTTPS. The cookie is sent only on TLS-protected
connections; a misconfigured load balancer that accepts cleartext
HTTP does not see the cookie. The attribute is non-negotiable for
production but breaks localhost development against http://,
which is why SessionLayer::with_secure(false) exists as a
development concession.
The Max-Age (the cookie's lifetime in seconds) matches the
session TTL from SessionLayer::with_ttl. The browser stops
sending the cookie after the lifetime expires; the server-side
session has its own expiry that the lifecycle layer also enforces.
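The five attributes end up in a single Set-Cookie header. The sketch below assembles one by hand to show the shape; the function, the SameSite enum, and the cookie name used in the usage note are hypothetical, and the session layer builds the real header itself.

```rust
/// Hypothetical mirror of the three SameSite values.
#[derive(Clone, Copy)]
pub enum SameSite {
    Strict,
    Lax,
    None,
}

/// Builds a Set-Cookie header value carrying Path, HttpOnly, SameSite,
/// Max-Age and (optionally) Secure.
pub fn set_cookie_header(
    name: &str,
    value: &str,
    max_age_secs: u64,
    secure: bool,
    same_site: SameSite,
) -> String {
    let same_site = match same_site {
        SameSite::Strict => "Strict",
        SameSite::Lax => "Lax",
        SameSite::None => "None",
    };
    let mut header = format!(
        "{name}={value}; Path=/; HttpOnly; SameSite={same_site}; Max-Age={max_age_secs}"
    );
    // Secure is appended only when requested, which mirrors the
    // development concession for plain-http localhost.
    if secure {
        header.push_str("; Secure");
    }
    header
}
```

For a one-hour session, `set_cookie_header("axess_session", "abc.def", 3600, true, SameSite::Lax)` yields `axess_session=abc.def; Path=/; HttpOnly; SameSite=Lax; Max-Age=3600; Secure`.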
HMAC signing
The cookie carries an HMAC signature computed from the session id and the deployment's signing key. The format is:
<base64(session_id)>.<base64(hmac_sha256(signing_key, session_id))>
The signature defeats forgery and tampering. An attacker who
guesses a session id (or who tries to mutate an existing cookie)
cannot produce a valid signature without the signing key. The
server rejects any cookie whose signature does not validate; the
session is not loaded and the request proceeds as Guest.
The HMAC verification is constant-time. A comparison that returns at the first mismatching byte leaks, through response latency, how much of a guessed signature was correct; an attacker can use that to recover a valid signature one byte at a time. Comparing every byte regardless of where the first mismatch occurs removes the signal.
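A minimal sketch of the two mechanical steps, splitting the cookie at the dot and comparing signatures without an early exit. Both function names are hypothetical. The comparison illustrates the idea only: an optimising compiler is free to rewrite it, so production code would normally rely on a vetted constant-time primitive (the subtle crate is one commonly used option) rather than a hand-rolled loop.

```rust
/// Compares two byte strings without returning at the first mismatch.
/// The length check is not secret-dependent: HMAC-SHA256 tags have a fixed length.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        // Accumulate differences instead of branching on them.
        diff |= x ^ y;
    }
    diff == 0
}

/// Splits "<id>.<signature>" into its two base64 parts.
/// Returns None when either part is empty or a second dot is present.
pub fn split_cookie(cookie: &str) -> Option<(&str, &str)> {
    let (id, sig) = cookie.split_once('.')?;
    if id.is_empty() || sig.is_empty() || sig.contains('.') {
        return None;
    }
    Some((id, sig))
}
```

A caller would decode both parts, recompute the HMAC over the session id, and pass the recomputed and presented tags to `constant_time_eq`; any None or false result is handled as Guest.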
Signing-key rotation is the operational lever for replacing the
signing key without invalidating active sessions. The pattern is
covered in Operations runbook. The short version:
SessionLayer::with_previous_key accepts the old key; cookies
signed with the old key continue to validate, while the layer
signs every new cookie with the new key. Once enough time has
passed for all old cookies to expire, the previous key is removed.
The fingerprint binding
The fingerprint is the additional signal that catches session replay from a different browser. The mechanism takes a few coarse features of the request (the user agent, the IP address, sometimes the accept-language), HMACs them together with a deployment-level pepper, and stores the result alongside the session.
let fingerprint = hmac_sha256(
    fingerprint_pepper,
    format!("{}|{}|{}", user_agent, client_ip, accept_language),
);
The choice of features is deliberate. They are coarse enough that the legitimate user's browser produces the same fingerprint across ordinary requests (the user agent does not change between requests, the IP is within the same prefix, the accept-language is stable), and specific enough that an attacker replaying the cookie from a different machine produces a different fingerprint.
The tolerance is the operational lever. Strict matching produces too many false positives (a user switching from wifi to cellular sees their IP change, a browser auto-update changes the user agent string). Coarse matching produces too few signals to detect replay. The default tolerance:
- IP: same /24 for IPv4, same /64 for IPv6.
- User agent: same major version of the same browser.
- Accept-language: same primary language.
A request that matches within the tolerance passes. A request that diverges beyond it produces an event the policy decides what to do with.
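An HMAC either matches or it does not, so one way to get tolerance is to coarsen each feature before it is hashed: two requests inside the tolerance then produce identical input. The sketch below shows that normalisation under the default tolerance. It is an assumption about how the tolerance could be realised, not a description of axess internals; all three function names are hypothetical, and the user agent is assumed to be reduced to "family/major" by the caller.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Reduces an address to its /24 (IPv4) or /64 (IPv6) network.
pub fn coarsen_ip(ip: IpAddr) -> String {
    match ip {
        IpAddr::V4(v4) => {
            let o = v4.octets();
            format!("{}/24", Ipv4Addr::new(o[0], o[1], o[2], 0))
        }
        IpAddr::V6(v6) => {
            let s = v6.segments();
            format!("{}/64", Ipv6Addr::new(s[0], s[1], s[2], s[3], 0, 0, 0, 0))
        }
    }
}

/// Extracts the primary language from an Accept-Language header,
/// e.g. "en-GB,en;q=0.9" becomes "en".
pub fn primary_language(accept_language: &str) -> String {
    let first = accept_language.split(',').next().unwrap_or("");
    let tag = first.split(';').next().unwrap_or("");
    let primary = tag.split('-').next().unwrap_or("");
    primary.trim().to_ascii_lowercase()
}

/// Joins the coarsened features into the string that would be HMACed
/// with the pepper. `ua_family_major` is assumed to be pre-parsed.
pub fn fingerprint_input(ua_family_major: &str, ip: IpAddr, accept_language: &str) -> String {
    format!(
        "{}|{}|{}",
        ua_family_major,
        coarsen_ip(ip),
        primary_language(accept_language)
    )
}
```

With this shape, a user who moves from 203.0.113.57 to 203.0.113.200 keeps the same fingerprint, while a move to a different /24 changes it and triggers the policy.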
The policy has three options:
- FingerprintPolicy::Warn logs the mismatch and lets the request proceed. This is the right setting during initial rollout when the tolerance is being calibrated; the logs show how often legitimate users trigger mismatches, and the tolerance can be adjusted.
- FingerprintPolicy::Reauth returns 401 and clears the session. The user has to log in again. This is the right setting for high-sensitivity actions; the user accepts the friction of re-authentication in exchange for the assurance that a captured cookie does not get away with the session.
- FingerprintPolicy::Revoke deletes the session entirely. The user is logged out, and their other sessions remain. This is the right setting when fingerprint mismatch is a strong signal of compromise; the deployment treats it as the user being hijacked and ends the session immediately.
The default is Warn. Lift to Reauth once the warn rate is
below your tolerance.
The pepper is a deployment-level secret stored alongside the session signing key. It defeats fingerprint synthesis: an attacker who knows the features (the user's IP, their user agent) cannot construct the fingerprint without the pepper, so they cannot adjust their replay to match.
Trusted-proxy configuration
The fingerprint depends on the request's IP being accurate. In
many deployments the application sits behind one or more proxies
(a load balancer, a CDN, a WAF), and the request's source IP is
the proxy's IP, not the user's. The user's IP is in a forwarded
header like X-Forwarded-For.
Reading the forwarded header is necessary but dangerous. A
deployment that trusts the header without checking the source can
be spoofed: a request directly to the application with a forged
X-Forwarded-For header will be treated as if it came through
the proxy.
The defence is the trusted-proxy configuration. The application configures which source IPs are trusted to set the header; the session layer reads the header only when the immediate request came from one of those IPs.
let layer = SessionLayer::new(store, signing_key)
    .with_trusted_proxies(vec![
        "10.0.0.0/8".parse().unwrap(),    // internal load balancer
        "172.16.0.0/12".parse().unwrap(), // VPN range
    ])
    .with_forwarded_header(ForwardedHeader::XForwardedFor);
The configuration accepts a list of CIDR ranges that the
deployment trusts. Requests from inside any range have their
X-Forwarded-For read; requests from outside any range use the
immediate connection's IP.
The forwarded-header choice is the application's. The standard is
X-Forwarded-For (a comma-separated list of IPs, the first being
the original client), but some deployments use Forwarded (RFC
7239) or a proxy-specific header. Axess supports all three;
configure the one the deployment uses.
For multi-hop proxy chains, the rule is the same. The request
came through lb → cdn → application; the application's immediate
peer is the CDN, the CDN's X-Forwarded-For lists lb and the
original client. If both the CDN and the LB are in trusted ranges,
the application takes the leftmost IP from the header (the
original client). If only the immediate peer is trusted, the
application takes the rightmost IP from the header (the next hop
back). The leftmost entry is only as trustworthy as the outermost
proxy: a proxy that appends to a client-supplied header instead of
overwriting it lets the client choose that entry.
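One common way to implement the rule is to walk the header from right to left, skip entries that fall in trusted ranges, and stop at the first untrusted address. The sketch below does that with the standard library only. Cidr4 and client_ip are hypothetical names, the CIDR type handles IPv4 ranges only, and nothing here is the axess implementation.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Minimal IPv4 CIDR range (hypothetical helper; IPv6 peers never match).
#[derive(Clone, Copy)]
pub struct Cidr4 {
    base: u32,
    mask: u32,
}

impl Cidr4 {
    pub fn parse(s: &str) -> Option<Cidr4> {
        let (addr, len) = s.split_once('/')?;
        let addr: Ipv4Addr = addr.parse().ok()?;
        let len: u32 = len.parse().ok()?;
        if len > 32 {
            return None;
        }
        let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
        Some(Cidr4 { base: u32::from(addr) & mask, mask })
    }

    pub fn contains(&self, ip: IpAddr) -> bool {
        match ip {
            IpAddr::V4(v4) => (u32::from(v4) & self.mask) == self.base,
            IpAddr::V6(_) => false,
        }
    }
}

/// Resolves the client address from the immediate peer and an optional
/// X-Forwarded-For value. The header is consulted only for trusted peers.
pub fn client_ip(peer: IpAddr, xff: Option<&str>, trusted: &[Cidr4]) -> IpAddr {
    let is_trusted = |ip: IpAddr| trusted.iter().any(|c| c.contains(ip));
    if !is_trusted(peer) {
        return peer;
    }
    let Some(header) = xff else { return peer };
    let mut candidate = peer;
    // Walk from the hop nearest the application back towards the client.
    for entry in header.rsplit(',') {
        let Ok(ip) = entry.trim().parse::<IpAddr>() else { break };
        candidate = ip;
        if !is_trusted(ip) {
            break; // first untrusted hop: this is the address to use
        }
    }
    candidate
}
```

When every listed hop is trusted the walk ends on the leftmost entry, and when only the immediate peer is trusted it stops on the rightmost one, matching the two cases above. Entries to the left of the first untrusted hop are never read, so a client cannot influence the result by prepending forged addresses.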
The configuration is deployment-specific. Get it wrong in either direction (trust too much, get spoofed; trust too little, see only proxy IPs) and the fingerprint binding becomes either fragile or useless.
Defending against XSS
The cookie's HttpOnly attribute defeats one class of XSS attack
(reading the cookie). It does not defeat all of them.
An attacker with JavaScript execution in the page can:
- Submit requests on the user's behalf (the browser sends the cookie automatically). The defence is CSRF protection: most axess deployments use tower-http's CSRF middleware, which requires a CSRF token on state-changing requests, and the token is not readable from JavaScript.
- Manipulate the page the user sees to phish credentials or to trick the user into actions. The defence is Content Security Policy (CSP) headers, which constrain what JavaScript the page can load and execute. CSP is an application-side concern, not a session-layer concern, but it composes with the session layer's defences.
The session layer's role is to constrain the cookie. The application's role is to constrain what JavaScript can do in the page. Both layers are needed; the session layer alone does not defend against XSS.
CSRF defences
The session cookie is sent on cross-site top-level navigations
because SameSite=Lax allows it. An attacker can craft a link
that, when clicked from an external site, triggers a state change
in the user's session (the classic CSRF attack).
The SameSite=Lax default narrows the attack: it works only on
top-level navigations that use a safe method (a link click, a GET
form submission), not on cross-site POSTs, XHR, or fetch
calls. The defences against the remaining surface:
- Use POST (or PUT, DELETE, PATCH) for state-changing requests. GET requests should be safe.
- Add a CSRF token to state-changing forms. The token is set in the session and read from a hidden form field; the server checks that they match. Axess does not include a CSRF middleware out of the box; the convention is to use tower-http's middleware or to write a small one.
- For applications that need cross-origin embedded use, SameSite=None plus a strict CSRF token check is the combination. SameSite=None requires Secure, so the combination is only deployable on HTTPS.
What goes wrong, and how to tell
Three failure modes recur during initial deployment.
The first is a cookie that the browser refuses to send. The
symptom is sessions that disappear between requests; the cause is
almost always either Secure=true on an http:// connection
(the browser refuses to send), SameSite=Strict on a cross-site
navigation that should have been recognised, or a Path that
does not match the request URL. Inspect the cookie's attributes
in the browser's dev tools.
The second is a fingerprint that diverges for the legitimate user.
The symptom is a Warn log every few sessions or a Reauth that
fires on every wifi-to-cellular switch. The cause is usually the
tolerance being too strict; widen the IP prefix or relax the
user-agent match. The right tolerance is the smallest one that
does not produce noise on legitimate traffic.
The third is the trusted-proxy configuration getting the wrong IP. The symptom is a fingerprint that matches when it should not (an attacker successfully replaying a cookie), or that diverges when it should match (a legitimate user being asked to re-authenticate). The cause is either an unintentionally trusted source (a debug endpoint left open, a VPN allowed to spoof the header) or an unintentionally untrusted proxy (the deployment forgot to add a new proxy's IP to the trusted list).
The pattern across all three: turn on the diagnostic logs, let the deployment run for a week, look at the warning rate, calibrate.
Further reading
Session lifecycle and crypto envelope covers the cookie shape and the orchestration that issues it. Backends covers the storage backends that persist the fingerprint alongside the session. Security posture covers the production crypto requirements that apply to the session layer, including the signing-key length and the FIPS-routing notes. Operations runbook covers signing-key, envelope-key, and fingerprint-pepper rotation.
Schema migration
The SessionData struct can change between axess versions. New
fields get added, old fields get renamed or removed, the auth state
machine gains a new variant. Existing sessions in the store carry
the old shape; new code reads them and needs to produce the new
shape. The mechanism that bridges the two is the schema migration
on read.
This is a short chapter because the mechanism is small. The mechanism is small because the design pushes the version field into the data itself rather than into the store.
The version field
SessionData::schema_version is a u32 field set at construction
and serialised with the rest of the data. At read time the
deserialiser inspects the version, dispatches to the appropriate
migration function for that version, and produces a current-shape
SessionData.
pub struct SessionData {
    pub schema_version: u32,
    pub auth_state: AuthState,
    pub principal_hint: Option<PrincipalHint>,
    pub custom: HashMap<String, serde_json::Value>,
}

impl SessionData {
    const CURRENT_VERSION: u32 = 2;

    fn migrate(self) -> Self {
        match self.schema_version {
            // Each function bridges one step, so a v0 record passes through both.
            0 => migrate_from_v1(migrate_from_v0(self)),
            1 => migrate_from_v1(self),
            _ => self, // current, no migration needed
        }
    }
}
The migration functions are pure transformations. They take the
old shape (which serde has parsed against an older SessionData
definition, possibly with the version-bumped fields defaulted)
and produce the new shape. Each migration handles one version
step; chained migrations are run in sequence to bridge multiple
version gaps.
The version is bumped every time the shape changes in a way that
older code would not handle correctly. Adding an optional field
with a Default impl typically does not bump the version (a record
written before the field existed deserialises with None, which is
fine). Removing or renaming a field does. Changing the meaning of
a field does.
What migrations cannot do
A migration is a pure function on the deserialised session record. It cannot talk to a database, cannot consult the user store, cannot make network calls. The version of the data is determined entirely by what is in the stored session record at the moment of read.
The implication: if a new shape needs information that the old
shape did not carry, the migration cannot synthesise it. The
options are to default the field (set it to None, or to a known
placeholder), to discard the session (the migration returns an
error, the layer treats the session as invalid and starts a fresh
one), or to defer the population (the field is set later in the
request lifecycle from the application's stores).
The first option is the standard pattern. New fields get sensible defaults, the session continues to work with the new shape, and the application populates the real value on the next dirty write.
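The options above can be seen side by side in a small, dependency-free sketch. Record, its fields, and the two step functions are hypothetical stand-ins for SessionData and its migrations; the sketch assumes each step bumps the version by one, defaults a new field to None, and rejects a record written by a newer version than the code understands.

```rust
/// Hypothetical, simplified stand-in for SessionData.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub schema_version: u32,
    pub user: String,
    pub locale: Option<String>, // imagined field introduced in v2
}

pub const CURRENT_VERSION: u32 = 2;

/// v0 -> v1: imagined change that normalises the user name.
fn v0_to_v1(mut r: Record) -> Record {
    r.user = r.user.to_lowercase();
    r.schema_version = 1;
    r
}

/// v1 -> v2: the new field cannot be synthesised, so it is defaulted.
fn v1_to_v2(mut r: Record) -> Record {
    r.locale = None;
    r.schema_version = 2;
    r
}

/// Runs single-step migrations in sequence until the record is current.
/// A record from a newer version is an error; the caller discards the session.
pub fn migrate(mut r: Record) -> Result<Record, String> {
    if r.schema_version > CURRENT_VERSION {
        return Err(format!("unknown schema version {}", r.schema_version));
    }
    while r.schema_version < CURRENT_VERSION {
        r = match r.schema_version {
            0 => v0_to_v1(r),
            _ => v1_to_v2(r), // only version 1 reaches this arm
        };
    }
    Ok(r)
}
```

Because every step is a pure function of the record, the whole chain can be unit-tested with literal inputs and no store.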
When the session is invalidated
Sometimes the shape change is breaking in a way that no migration can bridge. The session's data refers to a user who has been deleted, the auth state references a tenant that no longer exists, the factor list contains a kind that the new version has removed. The migration's right response is to error, and the layer's right response is to treat the session as invalid.
The mechanism is the SessionData::deserialize path returning
Err. The session layer catches the error, deletes the session
row (or marks it expired), and treats the request as a fresh
Guest. The user's cookie is still valid; the next request sets
a new session, the user logs in again.
The pattern is the right one because the alternative (the layer falling through to a degraded state, leaving the session in an inconsistent shape) lets bugs persist for the lifetime of the session. Invalidating eagerly converts the bug into a one-time user-facing event (re-login) that is fixable in one round-trip, rather than a long-tail bug that surfaces sporadically.
Adding a custom field
Adopters who add their own fields to SessionData::custom follow
the same pattern at the application layer. The custom map is
JSON-shaped; each application-owned key is independently
versioned by the application.
The common pattern is to wrap the custom value in a small struct with its own version field:
// Default is derived so that unwrap_or_default below compiles; it requires
// UserPreferences to implement Default as well.
#[derive(Default, Serialize, Deserialize)]
struct MyAppSessionData {
    schema_version: u32,
    preferences: UserPreferences,
    feature_flags: Vec<String>,
    draft_form_state: Option<DraftForm>,
}

fn read_app_data(session: &SessionData) -> MyAppSessionData {
    session
        .custom
        .get("my_app")
        .and_then(|v| serde_json::from_value::<MyAppSessionData>(v.clone()).ok())
        // migrate_if_needed is the application's own version dispatch.
        .map(|d| d.migrate_if_needed())
        .unwrap_or_default()
}
The application's schema_version is independent of axess's. The
two evolve on different cadences and the application's version
field captures the application's own changes.
When to reach for a different mechanism
The schema migration is the right tool for evolutions of the
session data shape. It is the wrong tool for migrations between
storage backends (use the cross-backend Store<K, V> trait or a
one-off copy script) or for changes to the encryption envelope
(the key-rotation mechanism, covered in Operations runbook).
It is also the wrong tool for application-level data migrations
that touch the database. A migration that says "every user gains
a new field on their user record" runs against the user store
(via sqlx::migrate! or the application's migration tool), not
against the session store. The session machinery does not interact
with the user table.
The mechanism's scope is narrow on purpose. Each piece of state has its own evolution mechanism, and conflating them produces migrations that have to consider too many cases at once.
Further reading
Session lifecycle and crypto envelope covers the lifecycle that
the migration runs as part of. Backends covers the storage
backends and their own (database-level) migration mechanisms.
Migration guide in Part VIII covers the cross-axess-version
migrations that bump the SessionData::schema_version constant.
The principal model
A Principal in axess is the answer to "who is making this request?"
The unusual choice, and the one this chapter explains, is that the
same type answers the question for human users and for service-to-service
workloads. A signed-in employee opening a page and a CI job calling
an API are both principals, with different variants but the same trait
surface, the same authorisation contract, and the same place in the
audit trail.
This chapter covers the type, where each variant comes from, how the unified shape lets a Cedar policy treat humans and workloads with one set of rules, and why the alternative (two parallel authentication stacks) was rejected.
The type
Principal lives in axess-identity:
pub enum Principal {
    Human(HumanPrincipal),
    Workload(WorkloadPrincipal),
}

pub struct HumanPrincipal {
    pub user_id: UserId,
    pub tenant_id: TenantId,
    pub session_id: Option<SessionId>,
    pub attributes: BTreeMap<String, serde_json::Value>,
}

pub struct WorkloadPrincipal {
    pub workload_id: WorkloadId,
    pub trust_domain: TrustDomain,
    pub issuer: Issuer,
    pub tenant_id: TenantId,
    pub tenant_slug: String,
    pub service_name: String,
    pub attributes: BTreeMap<String, serde_json::Value>,
}
The two variants are intentionally not symmetric. They carry the data
each principal kind actually has. A human has a user_id and is
optionally inside a session (some flows act on behalf of a user without
a live HTTP session, which is why the field is Option). A workload
has a workload_id (a SPIFFE-format URI), a trust domain, and an
issuer that says how the principal was authenticated (which OIDC
provider, which JWKS, which SPIFFE control plane).
Both variants carry a tenant_id (because every request happens in
the context of a tenant, whether the caller is human or not) and an
open attributes map (because policies need to ask questions that the
fixed fields cannot answer). The attribute map is JSON-valued so that
custom attributes (a hardware-key serial, a CI build hash, a regulator
classification) can be carried without changing the type.
Where each variant comes from
The two variants are constructed by two different resolvers. The split is what keeps the human and workload sides from contaminating each other.
A HumanPrincipal is constructed by a SessionResolver from an
AuthSession. The resolver reads the session's AuthState, returns
None if the state is not Authenticated, and otherwise reads
user_id, tenant_id, and the session id off the variant. The
attributes map is populated from the resolved user's stored profile
data (which fields depend on the application's identity store).
Construction is synchronous and cheap because everything the resolver
needs is already on the session.
A WorkloadPrincipal is constructed by a PrincipalResolver from
an inbound credential (a bearer JWT, an mTLS client certificate, a
projected Kubernetes service-account token, a GitHub Actions OIDC
token). The resolver does the verification work (signature, audience,
expiry, sometimes a token-exchange against a control plane) and on
success returns a WorkloadPrincipal with the validated identity. The
work is async because verifying tokens typically involves a JWKS
fetch or an STS round-trip. The chapter Workload identity overview
covers the resolver landscape end-to-end.
The two resolvers are independent. An application that has no
workloads (a customer-facing SaaS, say) never wires a
PrincipalResolver and never sees a Workload variant. An
application that has only workloads (an internal data-pipeline API,
say) never wires a SessionResolver and never sees a Human variant.
An application that mixes both wires both resolvers and a small piece
of glue that decides which to consult given the incoming request shape.
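As an illustration of what that glue might look like, the sketch below picks a resolver from the raw header values. The Route enum, the function, and the rule that a bearer token takes precedence over a session cookie are all assumptions made for the example; an application is free to decide differently.

```rust
/// Which resolver the request should be handed to (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Route<'a> {
    /// Bearer token for the workload-side resolver.
    Workload(&'a str),
    /// Session cookie value for the session-side resolver.
    Human(&'a str),
    /// No credential present.
    Anonymous,
}

/// Chooses a resolver from the Authorization and Cookie header values.
pub fn route<'a>(
    authorization: Option<&'a str>,
    cookie: Option<&'a str>,
    cookie_name: &str,
) -> Route<'a> {
    // Assumption: an explicit bearer credential wins over a cookie.
    if let Some(token) = authorization.and_then(|h| h.strip_prefix("Bearer ")) {
        return Route::Workload(token.trim());
    }
    if let Some(header) = cookie {
        // A Cookie header is a ';'-separated list of name=value pairs.
        for pair in header.split(';') {
            if let Some((name, value)) = pair.trim().split_once('=') {
                if name == cookie_name {
                    return Route::Human(value);
                }
            }
        }
    }
    Route::Anonymous
}
```

The function only classifies the request; verification of the token or the cookie signature stays with the resolver it is routed to.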
Why one type
The natural alternative is two types and two stacks: a User for
humans, a Service for workloads, a different middleware for each,
a different authorisation contract for each, two parallel audit trails.
That shape is what most libraries ship, and it is what axess
deliberately rejects.
The argument for one type is straightforward when you start to write
the authorisation policy. A request to a billing endpoint might be
made by a finance staff member during office hours, or by a scheduled
job running the monthly invoicing batch. The policy that decides
whether the request is allowed is the same in both cases: this caller,
in this tenant, has the right to read this resource. With one
Principal type, the policy is one rule. With two types, the policy
either duplicates the rule (and the duplicates drift) or branches on
the caller kind (and the branches obscure the intent).
The same applies to the audit trail. A regulatory audit log that records "principal X performed action Y against resource Z at time T" works uniformly across human and workload callers when the principal type is unified. The downstream SIEM rules ("alert on any principal making more than N requests per minute to the high-sensitivity endpoint") fire on both human attacks and runaway workloads, without separate detection logic.
The unification has a cost. The Principal enum must accommodate
both variants, which makes its memory footprint larger than either
variant alone, and pattern-matching code has to handle both arms even
when the application only uses one. The cost is paid mostly in code
that loads the principal (one match per request), and not in policy
evaluation or audit emission (which see only the shared trait surface). On
balance, the unification pays for itself by simplifying the policy
layer.
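The trade-off can be made concrete with a minimal sketch. The struct fields, the audit_key helper, and the may_read function below are simplified assumptions for illustration, not the shapes axess actually ships; the point is only that the match happens once, inside the accessors, and the rule itself never mentions the caller kind.

```rust
// Hypothetical, simplified shapes; the real axess types carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct HumanPrincipal {
    pub user_id: String,
    pub tenant_id: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct WorkloadPrincipal {
    pub workload_id: String, // SPIFFE-format URI
    pub tenant_id: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    Human(HumanPrincipal),
    Workload(WorkloadPrincipal),
}

impl Principal {
    /// Tenant the caller is bound to, regardless of caller kind.
    pub fn tenant_id(&self) -> &str {
        match self {
            Principal::Human(h) => &h.tenant_id,
            Principal::Workload(w) => &w.tenant_id,
        }
    }

    /// Identifier an audit record could be keyed by (assumed format).
    pub fn audit_key(&self) -> String {
        match self {
            Principal::Human(h) => format!("User::\"{}\"", h.user_id),
            Principal::Workload(w) => format!("Workload::\"{}\"", w.workload_id),
        }
    }
}

/// One rule for both caller kinds: the caller's tenant must own the resource.
pub fn may_read(principal: &Principal, resource_tenant: &str) -> bool {
    principal.tenant_id() == resource_tenant
}
```

A finance staff member and the monthly invoicing job, both bound to the same tenant, pass through may_read identically; neither the rule nor the audit key needs a second code path.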
SPIFFE shape for workloads
The WorkloadPrincipal is shaped after SPIFFE because SPIFFE is the
right shape for workload identity even when the underlying credential
is not literally an SVID.
A SPIFFE identity is a URI of the form
spiffe://<trust_domain>/<path>. The trust domain is the federation's
namespace (prod.example.com, say), and the path identifies a
specific workload within that domain (/svc/billing/tenant-acme).
The combination uniquely names the workload, the trust domain
parameterises the verification (each domain has its own signing keys),
and the path is structured enough for policies to match on patterns
("any workload under /svc/billing/*") without inventing parallel
identity stacks.
Axess's workload identity layer uses this shape even when the inbound
credential is a Kubernetes service-account token (which is an OIDC
token, not an SVID) or a GitHub Actions OIDC token (which is also not
an SVID). The relevant resolver constructs a SPIFFE-format WorkloadId
from the inbound claims; downstream code sees a uniform identity.
Workload identity overview covers the construction rules for each
resolver.
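To make the URI shape tangible, here is a small sketch that splits a SPIFFE-format identifier into its trust domain and path and checks whether the path sits under a prefix. The SpiffeId struct, parse_spiffe_id, and is_under are hypothetical names for illustration, and the parser deliberately skips the full validation rules of the SPIFFE specification (character sets, length limits); it is not the WorkloadId type axess provides.

```rust
/// Hypothetical parsed form of a SPIFFE-format identifier.
#[derive(Debug, PartialEq)]
pub struct SpiffeId {
    pub trust_domain: String,
    pub path: String,
}

/// Splits `spiffe://<trust_domain>/<path>` into its two parts.
/// Returns None when the scheme is wrong or the trust domain is empty.
pub fn parse_spiffe_id(uri: &str) -> Option<SpiffeId> {
    let rest = uri.strip_prefix("spiffe://")?;
    // Everything up to the first '/' is the trust domain; the rest is the path.
    let (trust_domain, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    if trust_domain.is_empty() {
        return None;
    }
    Some(SpiffeId {
        trust_domain: trust_domain.to_string(),
        path: path.to_string(),
    })
}

impl SpiffeId {
    /// True when the path sits strictly under `prefix`, matching on whole
    /// segments so that "/svc/bill" does not match "/svc/billing/...".
    pub fn is_under(&self, prefix: &str) -> bool {
        let prefix = prefix.trim_end_matches('/');
        match self.path.strip_prefix(prefix) {
            Some(rest) => rest.starts_with('/'),
            None => false,
        }
    }
}
```

Matching on whole path segments is the design choice worth noting: a plain string-prefix test would let a workload named /svc/billing-shadow satisfy a rule written for /svc/billing.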
The trust domain and issuer fields on WorkloadPrincipal are the part
that policies can use to discriminate between identity sources. A
policy that says "only workloads issued by our production control
plane may write to the production database" reads the issuer and
matches against a fixed list. A policy that says "any workload in the
finance trust domain may read the audit log" reads the trust domain.
The Cedar bridge
Cedar policies take principals as entities. Axess implements
ToCedarEntity for both HumanPrincipal and WorkloadPrincipal,
producing entities with the canonical shape Cedar expects.
A HumanPrincipal becomes a Cedar entity with UID
User::"<user_id>", attributes including tenant_id,
factors_completed, and authn_time, and parent entities for the
tenant and any groups the user belongs to (which the application
provides through AuthzEntityProvider, covered in Entity providers
and request context).
A WorkloadPrincipal becomes a Cedar entity with UID
Workload::"<spiffe-uri>", attributes including trust_domain,
issuer, and tenant_id, and parent entities for the trust domain
and the tenant. Policies that want to match all workloads in a trust
domain write principal in TrustDomain::"prod.example.com";
policies that want to match a specific workload pattern write
principal.workload_id like "spiffe://prod.example.com/svc/billing/*".
The bridge is what makes one type into one policy. A Cedar policy that says
permit (
principal,
action == Action::"read",
resource in TenantData::"acme"
) when {
principal.tenant_id == "acme"
};
allows both a human user in tenant acme and a workload bound to tenant acme. The principal type does not appear in the rule because it does not need to. If the policy later needs to discriminate (say, to require MFA for humans but not for workloads), the rule that expresses the discrimination is local and readable.
When the type is empty
Some flows operate without a principal: a health check, a metrics
endpoint, the login page itself. Axess models this by representing
the request as Option<Principal>. The resolver returns None, and the
authorisation layer either short-circuits (for unauthenticated
endpoints) or evaluates the policy with no principal at all (for
endpoints that take a deny-by-default position toward unauthenticated
callers).
The pattern matters for one specific reason. A misconfigured resolver
that returns a stub principal for unauthenticated requests, instead
of None, silently widens the authorisation surface. The Cedar policy
evaluates against the stub and may allow actions that should require
authentication. Treating "no principal" as the absence of a value,
rather than as a kind of value, makes the policy author's life harder
in the short term and easier in the long term: a policy that does not
explicitly admit None denies it by default.
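A minimal sketch of that behaviour, under simplifying assumptions: the Route enum, the Decision enum, and the authorize function are hypothetical stand-ins for the authorisation layer, and the tenant comparison stands in for a real Cedar evaluation. What it shows is that None falls through to Deny unless a branch explicitly admits it.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Principal {
    pub tenant_id: String,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Hypothetical route classification used only for this sketch.
pub enum Route {
    /// Health checks, metrics, the login page.
    Public,
    /// Anything that needs an authenticated caller in a given tenant.
    Protected { tenant_id: String },
}

pub fn authorize(principal: Option<&Principal>, route: &Route) -> Decision {
    match (route, principal) {
        // Unauthenticated endpoints short-circuit; no principal is consulted.
        (Route::Public, _) => Decision::Allow,
        // A principal is present and bound to the right tenant.
        (Route::Protected { tenant_id }, Some(p)) if p.tenant_id == *tenant_id => Decision::Allow,
        // Everything else, including None, is denied by default.
        (Route::Protected { .. }, _) => Decision::Deny,
    }
}
```

Had the resolver returned a stub principal instead of None, the second arm would have been reachable for an unauthenticated caller, which is exactly the widening the text warns about.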
What this enables
The unified principal type is what makes the rest of the workload
identity story (Part VII) and the Cedar authorisation story (Part IV)
short. A handler reads Principal, the authorisation layer evaluates
policies against it, and the audit pipeline emits events keyed by
it. None of these layers need to know whether the caller is a human
or a workload, because the type carries both possibilities and the
policy author resolves the discrimination where it actually matters.
Further reading
Workload identity overview covers the resolvers that produce
WorkloadPrincipal values: SPIFFE JWT-SVID, SPIFFE mTLS, Kubernetes
ServiceAccount tokens, GitHub Actions OIDC, generic OAuth-RS, and
cloud STS exchange. Cedar policy fundamentals covers the
AuthzSession::require and AuthzSession::decide calls that take a
Principal and return an AuthzDecision. Audit events covers the
log emitted for each authentication and authorisation decision,
including the principal serialisation.
Device identity
A device in axess is a typed aggregate, not a string in a column. A
user has zero or more devices; each device has a stable identifier,
a fingerprint that the session layer can match against, an
assurance level on a three-stage ladder, and a relationship to the
refresh tokens issued against it. The combination is the
machinery behind "this device was lost, revoke its access" and
"this is a new device, require step-up before we trust it." The
mechanism is feature-gated but enabled by default in the axess facade,
because most adopters benefit from it without specifically asking.
The feature flag is device.
The three-stage ladder
A device occupies one of four states. The first three form an assurance ladder; the fourth is terminal.
Unknown is the default for a new device. The session layer has
seen this fingerprint for the first time, the user has not yet
confirmed it, and no commitment has been made about trust. An
unknown device can still authenticate (the user enters their
password and second factor as usual), but step-up policies may
require additional friction (a second confirmation email, a
recovery code) before high-sensitivity actions become available.
Seen is the second state. The device has authenticated
successfully at least once; the user has implicitly accepted it
by continuing through the login. A seen device retains the
fingerprint binding from the session layer but does not yet carry
explicit trust. It is the right state for a device that the user
might log in from again but has not explicitly registered.
Trusted is the third state and the steady state for primary
devices. The user (or the application's administrative flow)
explicitly trusted this device. The device's fingerprint binding
applies; the device is the bound carrier for refresh tokens; the
device can perform high-sensitivity actions without additional
step-up.
Revoked is the terminal state. The device was lost, the user
removed it, the security team forced a revocation, or the system
detected compromise. Tokens bound to the device are revoked,
sessions bound to it are deleted, and further authentication
attempts from the fingerprint are blocked until the user
explicitly re-establishes the device.
The transitions move strictly forward through the ladder.
Unknown becomes Seen on first successful login. Seen becomes
Trusted on explicit user action or after an
application-configurable trust period. Any state becomes Revoked
on revocation. There is no path back from Revoked; a device
that was revoked and is later re-encountered registers as a new
Unknown device.
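The transition rules read naturally as a small state machine. The sketch below is an illustration under assumptions: the DeviceEvent enum and the apply method are hypothetical names, and events that the text does not describe (for example, an explicit trust action on an Unknown device) are treated here as no-ops.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceTrustLevel {
    Unknown,
    Seen,
    Trusted,
    Revoked,
}

/// Hypothetical events that drive the ladder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceEvent {
    LoginSucceeded,
    ExplicitTrust,
    TrustPeriodElapsed,
    Revoke,
}

impl DeviceTrustLevel {
    /// Returns the next state. Transitions only move forward, and
    /// Revoked is terminal.
    pub fn apply(self, event: DeviceEvent) -> DeviceTrustLevel {
        match (self, event) {
            // No path back from Revoked.
            (Self::Revoked, _) => Self::Revoked,
            // Any other state becomes Revoked on revocation.
            (_, DeviceEvent::Revoke) => Self::Revoked,
            (Self::Unknown, DeviceEvent::LoginSucceeded) => Self::Seen,
            (Self::Seen, DeviceEvent::ExplicitTrust | DeviceEvent::TrustPeriodElapsed) => {
                Self::Trusted
            }
            // Assumption: events not described in the text leave the state unchanged.
            (state, _) => state,
        }
    }
}
```

Because apply returns a new value instead of mutating in place, a store can compute the next state and persist it in one write, and a re-encountered revoked fingerprint simply starts a fresh record at Unknown.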
The device record
The Device struct carries the per-device state:
pub struct Device {
pub device_id: DeviceId,
pub user_id: UserId,
pub tenant_id: TenantId,
pub trust_level: DeviceTrustLevel, // Unknown | Seen | Trusted | Revoked
pub fingerprint_hash: String, // HMAC against the per-tenant pepper
pub display_name: Option<String>, // user-set ("My laptop")
pub first_seen_at: DateTime<Utc>,
pub last_seen_at: DateTime<Utc>,
pub trusted_at: Option<DateTime<Utc>>,
pub revoked_at: Option<DateTime<Utc>>,
}
The device_id is a stable identifier minted at first sight. It
is what refresh tokens bind to (see Refresh tokens and session
continuity), what Cedar policies can reference, and what the
admin UI lists when the user inspects their registered devices.
The fingerprint_hash is the HMAC of the device's fingerprint
features against a per-tenant pepper. The hash, not the raw
fingerprint, lives in the database; the raw features are computed
per request and matched in constant time. Storing the hash defends
against database breach: an attacker who reads every row of the
device store does not learn the underlying fingerprint features
of any user.
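The constant-time match is the part that is easy to get wrong, so a sketch may help. The function below is a hand-rolled illustration of the idea (accumulate differences, decide once at the end), not the comparison axess uses; the optimiser is free to undo hand-written constant-time code, so production code would normally rely on a vetted crate such as subtle. The HMAC computation itself is out of scope here.

```rust
/// Compares two byte strings without exiting at the first mismatch,
/// so the running time does not reveal how many leading bytes matched.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // The length of a fixed-size MAC output is not secret, so an early
    // return on a length mismatch leaks nothing useful.
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // OR together the XOR of every byte pair; any mismatch sets a bit.
        diff |= x ^ y;
    }
    diff == 0
}
```

A plain == on slices may return as soon as it finds a differing byte, which is the timing signal this pattern avoids.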
The display_name is for the user. When the device transitions
from Seen to Trusted the application typically asks the user
to name it ("My laptop", "iPhone 15 Pro"); the name appears in the
user's device-management UI. It is not used for authentication.
The per-tenant pepper
The fingerprint pepper is the secret the HMAC uses. Two design choices matter.
The pepper is per-tenant, not global. Each tenant has its own pepper, stored alongside the tenant record. The choice means that a fingerprint hash from tenant A cannot be matched against tenant B's hashes; a breach that leaks one tenant's pepper compromises only that tenant's fingerprint hashes.
The pepper is rotated when the tenant is suspended or when the deployment chooses to invalidate all device records. Rotation invalidates every device record under the tenant (their fingerprint hashes no longer match the new pepper); existing sessions remain valid (they do not depend on the device record), but new logins re-register devices from scratch.
The chapter Operations runbook covers the rotation sequence and the staged rollout.
How devices interact with refresh tokens
The cascade between devices and refresh tokens is bidirectional and is what makes "revoke this device" actually mean "revoke every session this device can refresh."
In one direction: when a device is revoked, every refresh token
that carries device_id = revoked_device is invalidated. The next
attempt to use any of those tokens fails. The application's
session layer detects this on the next refresh and treats the
session as expired.
In the other direction: when a refresh token family is invalidated
through reuse detection (the family-revoke mechanism covered in
Refresh tokens), the cascade marks the bound devices as
compromised. The compromise is the shortcut from Trusted (or
Seen) to Revoked without an intermediate state.
The cascade is what makes the system robust against both operator-initiated revocation ("the device was lost") and attack-driven revocation ("a token was stolen"). The two cases converge on the same revocation primitive; both directions of cascade fire from the same code path.
Step-up policies
The trust level becomes interesting at the Cedar policy layer. A
policy that requires a Trusted device for sensitive actions tests
principal.device.trust_level and forbids anything other than "Trusted":
forbid (
principal,
action == Action::"transfer-funds",
resource
) when {
principal.device.trust_level != "Trusted"
};
The rule denies fund transfers from any device that is not
Trusted. A user on a new (Unknown or Seen) device is
prompted to trust the device first, typically by completing an
additional verification step (a second-factor challenge, a
confirmation email, a step-up to FIDO2).
The pattern composes with the other authorisation styles. A policy that requires both FIDO2 and a Trusted device is the two constraints together; a policy that allows any of three different ways to clear the bar is the disjunction in one rule.
Identifying a device
Each request needs to be associated with a device. The mapping
runs through the DeviceResolver trait:
#[async_trait]
pub trait DeviceResolver: Send + Sync {
async fn resolve(
&self,
request: &Request,
user_id: &UserId,
tenant_id: &TenantId,
) -> Result<DeviceMatch, DeviceResolverError>;
}
pub enum DeviceMatch {
Existing(DeviceId),
NewDevice(DeviceId), // freshly minted, written to store
}
The default implementation computes the fingerprint from the
request features (user agent, IP, accept-language) and matches it
against existing devices for the user. A match returns the
existing device id; a miss writes a new device row with
trust_level = Unknown and returns the new id.
The default works for most deployments. Applications with
stronger device-identity signals (a long-lived hardware key, a
mobile app's persistent installation id, a device certificate)
can provide their own DeviceResolver that consults the stronger
signal first and falls back to the fingerprint match.
Caching
The device record is read on most requests (every authenticated request that involves a Cedar evaluation reads the device). A naive lookup against the device store would be the hottest read in the application.
The CachedDeviceStore decorator wraps any DeviceStore with an
LRU+TTL cache. The cache key is (tenant_id, device_id); the
cache value is the Device record. The TTL is short (a few
seconds) so revocations propagate quickly; the LRU bound
constrains memory under fan-out scenarios.
The cache is invalidated explicitly on revocation. The
DeviceStore::revoke call clears the relevant cache entry and
writes the revocation. Subsequent reads see the revoked state
without waiting for the TTL.
The pattern is the same one Entity providers and request context covers for the Cedar entity cache. Cache the data, not the decision; invalidate eagerly on mutation; let TTLs catch the cases the invalidation missed.
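A stripped-down sketch of the TTL half of that pattern follows. TtlDeviceCache, its methods, and the simplified Device struct are hypothetical; the LRU bound is omitted for brevity, and time is passed in as a millisecond counter on the assumption that the caller reads it from the injected Clock. It is not the CachedDeviceStore implementation.

```rust
use std::collections::HashMap;

// Simplified stand-in for the Device record.
#[derive(Debug, Clone, PartialEq)]
pub struct Device {
    pub device_id: String,
    pub trust_level: String,
}

pub struct TtlDeviceCache {
    ttl_ms: u64,
    // (tenant_id, device_id) -> (record, time it was cached)
    entries: HashMap<(String, String), (Device, u64)>,
}

impl TtlDeviceCache {
    pub fn new(ttl_ms: u64) -> Self {
        Self { ttl_ms, entries: HashMap::new() }
    }

    pub fn put(&mut self, tenant_id: &str, device: Device, now_ms: u64) {
        let key = (tenant_id.to_string(), device.device_id.clone());
        self.entries.insert(key, (device, now_ms));
    }

    /// Returns the cached record if it is younger than the TTL;
    /// expired entries are dropped on access.
    pub fn get(&mut self, tenant_id: &str, device_id: &str, now_ms: u64) -> Option<Device> {
        let key = (tenant_id.to_string(), device_id.to_string());
        let fresh = match self.entries.get(&key) {
            Some((device, stored_at)) if now_ms.saturating_sub(*stored_at) < self.ttl_ms => {
                Some(device.clone())
            }
            Some(_) => None,
            None => return None,
        };
        if fresh.is_none() {
            self.entries.remove(&key);
        }
        fresh
    }

    /// Eager invalidation, called on the revocation path.
    pub fn invalidate(&mut self, tenant_id: &str, device_id: &str) {
        self.entries.remove(&(tenant_id.to_string(), device_id.to_string()));
    }
}
```

Keying on the (tenant_id, device_id) pair means a device id that happens to collide across tenants can never serve the wrong record, and the explicit invalidate call is what lets a revocation take effect before the TTL runs out.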
PII tokenisation and GDPR
The device record carries personally-identifiable information. The fingerprint features include the IP address (which is PII under GDPR), the user agent (which can carry identifying details about the user's setup), and the timestamps (which together can identify the user's working patterns).
The defence is twofold.
The first is that the device store holds hashes, not the raw features. The fingerprint hash is the HMAC against the per-tenant pepper; an attacker who reads the store sees the hash, not the IP or user agent.
The second is the retention sweep. The DeviceStore::retention_sweep
verb removes device records older than a configured threshold,
along with the refresh tokens that bound to them. The sweep is
the GDPR-shaped lever: data the deployment no longer needs is
removed within a bounded period, and the retention is
documentable.
The retention period is per-tenant. The
Tenant::device_retention_days field carries it; the default is
ninety days. Tenants with stricter requirements set it lower
(say, thirty days for an EU tenant subject to strict GDPR
interpretation); tenants with looser ones set it higher (say,
three hundred and sixty-five days for a US tenant where session
continuity matters more).
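As an illustration of the sweep, the sketch below removes records whose last sighting is older than the tenant's retention period. The DeviceRow struct, the free function, and the choice of last_seen_at as the age reference are assumptions made for the example; the actual DeviceStore verb and the timestamp it measures against may differ. Time is passed in as Unix seconds so the caller can supply it from the injected Clock.

```rust
/// Simplified stand-in for a stored device row.
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceRow {
    pub device_id: String,
    pub last_seen_at: u64, // Unix seconds
}

const SECS_PER_DAY: u64 = 86_400;

/// Drops rows last seen before `now - retention_days` and returns how
/// many were removed. `now` is supplied by the caller, never read here.
pub fn retention_sweep(devices: &mut Vec<DeviceRow>, retention_days: u32, now: u64) -> usize {
    let cutoff = now.saturating_sub(u64::from(retention_days) * SECS_PER_DAY);
    let before = devices.len();
    devices.retain(|d| d.last_seen_at >= cutoff);
    before - devices.len()
}
```

Returning the removed count gives the operator something to log, which is part of what makes the retention posture documentable.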
The chapter Multi-tenancy covers the per-tenant configuration mechanism. Security posture covers the GDPR and SOC2 touch-points.
Storage backends and writing your own
axess ships five DeviceStore implementations:
| Backend | Feature | Notes |
|---|---|---|
| MemoryDeviceStore | memory | DashMap + clock-driven sweep. Dev and tests. |
| SqliteDeviceStore | sqlite | SQLx pool, INSERT … ON CONFLICT, schema in init_schema(). |
| PostgresDeviceStore | postgres | SQLx pool, same surface as the sqlite backend with the Postgres dialect. |
| MysqlDeviceStore | mysql | SQLx pool, MySQL dialect (? binds, ON DUPLICATE KEY UPDATE, VARBINARY(32)). Compatible with MySQL 8.x and MariaDB 10.5+. |
| ValkeyDeviceStore | valkey | Hash-per-device + per-tenant fingerprint index. Server-side EXPIRE handles purge. |
All five backends share the same trait surface; switching
between them requires only the init_schema call against the new
pool and a different constructor at startup.
Writing an adopter-supplied store
Any storage technology can back devices as long as it can answer the
ten methods on axess_core::device::DeviceStore. The shipped
backends (memory, sqlite, postgres, mysql, valkey) are the reference
implementations to read alongside the trait docstring; the recipe
below names the contracts that aren't obvious from method
signatures.
Type and Error. Implement the trait on a Clone + Send + Sync + 'static
struct (typically Arc<...> around your connection pool / client). Pick
a single type Error: std::error::Error + Send + Sync + 'static; the
existing backends use a thiserror enum that wraps their driver error
plus a "missing row" variant. Don't conflate driver errors with domain
errors (a NotFound returned by your driver should not surface as an
error from load; map it to Ok(None)).
Tenant scoping is mandatory. Every method that takes a
TenantId must filter on it in the query. The peppered
FingerprintHash is already keyed per-tenant, but the trait
contract documents the scoping requirement explicitly to prevent
cross-tenant leakage on a backend whose primary index might
otherwise be only by hash. Read the docstring on
find_by_fingerprint for the rationale.
save must be atomic. save is documented as an idempotent
upsert. Implementations that check with a SELECT and then INSERT
are racy; they must wrap the pair in a transaction or use the
dialect's native upsert (ON CONFLICT, ON DUPLICATE KEY UPDATE, MERGE, or
SETNX for KV stores). A non-atomic save produces lost updates
under concurrent device-promotion calls.
record_sighting is hot-path. Every authenticated request
touches this. Implement it as a single UPDATE … SET last_seen_at = ? rather than a load-modify-save round trip. The shipped
backends are a guide. The CachedDeviceStore decorator (see
caching, above) shields the underlying store from read pressure
but the write path runs through every request.
sweep is required, not defaulted. A backend that doesn't
implement sweep cannot age devices through the three-stage ladder,
and the documented retention posture (90d trusted / 30d seen / 7d
revoked grace) silently breaks. The trait deliberately omits a
default impl so backends must answer the question, even if the
answer is Err(_) with a "sweep not yet implemented" sentinel
during initial development.
Sighting timestamps come from a Clock. Methods that need
"now" (record_sighting, set_trust_level, sweep) accept
now: DateTime<Utc> as a parameter. Callers thread clock.now()
through; backends never call Utc::now() themselves. This keeps
adopter integration tests compatible with deterministic simulation
testing.
Mirror the per-backend test layout. Each shipped backend has
its own test module exercising the trait surface end-to-end (load
round-trip, fingerprint lookup, refresh-family fan-out, retention
sweep); device/storage/sqlite/tests.rs is the most complete
template. Copy that suite, adapt the harness setup to your
backend, and run it to catch the non-obvious contract violations
(tenant-scoping leaks, non-atomic save races, sweep counts
off-by-one).
Reach for CachedDeviceStore over reinventing. If your gap is
"my backend is slow on load", wrap your store in
CachedDeviceStore before optimising the implementation. The
decorator gives you bounded-size LRU + clock-driven TTL eviction
for free, with revocation propagating through set_trust_level.
What this enables
Device identity is the connective tissue between the user, the sessions they hold, the refresh tokens those sessions issue, and the authorisation decisions the application makes about them. A user with a known device gets a smoother experience: the fingerprint binding holds, the refresh tokens roll, the policies default to trust. A user with an unknown device gets friction exactly when it makes sense: a step-up before sensitive actions, a confirmation before high-trust operations. A user with a revoked device gets nothing, immediately.
The mechanism is small (a handful of types, one ladder, one cascade) but its reach is wide (every refresh, every policy evaluation, every audit event). Once you have the device aggregate in mind, the rest of the security model falls into place around it.
Further reading
Refresh tokens and session continuity covers the binding between devices and tokens, including the cascade in both directions. Cedar policy fundamentals covers how policies read the device's trust level. Multi-tenancy covers the per-tenant fingerprint pepper and retention configuration. Security posture covers the GDPR and SOC2 implications of device data.
Multi-tenancy
A tenant in axess is the unit of isolation. Users, factor
configurations, sessions, devices, policies, and audit events all
carry a TenantId, and the library refuses to leak data across
tenants by construction. This chapter covers the model, the
atomic provisioning pattern that ensures every tenant starts in a
sound state, the three-lever lockout, and the operational
patterns for tenant suspension and deletion.
The mechanism is on by default. There is no feature flag to
toggle tenancy; the TenantId field is present on every relevant
record. A single-tenant deployment uses one well-known
TenantId ("default" is the convention) and effectively gets
the multi-tenant machinery for free, ready to expand when a
second tenant is added.
The tenant record
The Tenant struct lives in axess-identity and carries the
configuration that applies to every user under the tenant:
pub struct Tenant {
pub tenant_id: TenantId,
pub status: TenantStatus, // Active | Suspended | Deleted
pub display_name: String,
pub fingerprint_pepper: ZeroizedString, // per-tenant device pepper
pub lockout_policy: LockoutPolicy, // tenant-scoped lockout
pub device_retention_days: u32, // GDPR-shaped retention
pub created_at: DateTime<Utc>,
pub suspended_at: Option<DateTime<Utc>>,
}
The TenantId is a typed UUID (the convention in axess-identity).
The status carries the tenant's lifecycle state, covered below.
The fingerprint_pepper is the per-tenant device pepper from
Device identity. The lockout_policy is the tenant-scoped
override of the global lockout configuration, covered in the
Three-lever lockout section below. The device_retention_days
is the per-tenant GDPR-shaped retention period for device records.
Cross-tenant refusal as a structural rule
Every operation in axess that touches a user, a session, a device, a factor, or an event carries a tenant scope. The library checks the scope before performing the operation, and refuses any operation where the scopes do not align.
The pattern is uniform across the API. A begin_login call
takes a tenant id; the user lookup is scoped to that tenant; a
user with the same username in a different tenant is not
returned. A verify_factor call works against the session's
tenant id; a factor configuration registered in a different
tenant is not consulted. A find_sessions_for_user call takes
both user id and tenant id; sessions in other tenants are not
returned.
The structural defence is what lets a multi-tenant deployment
make the strongest possible authorisation claim: not only does
the application not leak across tenants, the library underneath
cannot. The Cedar policy layer can then add a top-level forbid
rule that catches the rare case of an application bug that tries
to authorise across tenants:
forbid (
principal,
action,
resource
) when {
principal.tenant_id != resource.tenant_id
};
The rule applies to every action on every resource, and the combination of "library refuses cross-tenant lookups" and "policy denies cross-tenant decisions" produces a deployment where a cross-tenant access is structurally impossible.
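The library-side half of that defence comes down to making the tenant part of every lookup key. The sketch below is an in-memory illustration with hypothetical names (UserDirectory, insert, find); it is not the axess identity store, but it shows why a query scoped this way has no way to return another tenant's user.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a tenant-scoped user lookup.
#[derive(Default)]
pub struct UserDirectory {
    // (tenant_id, username) -> user_id
    by_tenant_and_name: HashMap<(String, String), String>,
}

impl UserDirectory {
    pub fn insert(&mut self, tenant_id: &str, username: &str, user_id: &str) {
        self.by_tenant_and_name.insert(
            (tenant_id.to_string(), username.to_string()),
            user_id.to_string(),
        );
    }

    /// The tenant id is part of the key, so a matching username in a
    /// different tenant is simply not found.
    pub fn find(&self, tenant_id: &str, username: &str) -> Option<&str> {
        self.by_tenant_and_name
            .get(&(tenant_id.to_string(), username.to_string()))
            .map(String::as_str)
    }
}
```

In a SQL backend the equivalent is a composite key or a WHERE tenant_id = ? clause on every query; the important property is that no method exists that looks a user up by name alone.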
Atomic provisioning
A tenant comes into existence through AuthnService::create_tenant,
which is the verb behind any "sign up a new organisation" or
"administrator provisions a new tenant" flow. The call is atomic
by design.
let tenant = service.create_tenant(TenantBootstrap {
display_name: "Acme Inc.".into(),
initial_admin: AdminUser {
identifier: "admin@acme.example".into(),
initial_password: Some(initial_password.into()),
},
initial_method: Method {
name: "password-then-totp".into(),
steps: vec![
FactorStep::Required(FactorKind::Password),
FactorStep::Required(FactorKind::Totp),
],
},
fingerprint_pepper: SecureRng::random_bytes(32),
lockout_policy: LockoutPolicy::default(),
device_retention_days: 90,
}).await?;
The atomicity matters because a partially-provisioned tenant is a landmine. A tenant that exists in the tenant table but has no configured method admits any user with the global default method, which may not be what the new tenant wants. A tenant with a method but no factor configurations for the admin user produces an immediate lockout. A tenant with an admin user but no factor secret for them is worse: the user record exists, the admin cannot log in, and there is no path to recovery without an out-of-band intervention.
The bootstrap struct is the contract that says "a tenant exists only after every one of these has succeeded." The implementation runs the create-tenant, create-user, create-factor-config, create-method, set-fingerprint-pepper, and set-lockout-policy operations in a single transaction. On any failure the transaction rolls back; nothing is persisted; the call returns an error.
A subtler invariant in the bootstrap: every tenant must have at least one factor and one enabled method, and the admin user must have a factor configuration for every factor the method requires. The bootstrap checks both at construction; a misshapen bootstrap fails before the transaction starts.
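The pre-transaction check can be sketched as a pure function over the method's required factors and the factors the admin has configured. The names validate_bootstrap and BootstrapError, and the two-variant FactorKind, are assumptions for illustration; the real bootstrap validation may check more than this.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FactorKind {
    Password,
    Totp,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum BootstrapError {
    /// The method requires no factors at all.
    NoRequiredFactors,
    /// The admin lacks a configuration for a factor the method requires.
    AdminMissingFactor(FactorKind),
}

/// Runs before any transaction starts: the method must require at least
/// one factor, and the admin must have a configuration for each of them.
pub fn validate_bootstrap(
    method_steps: &[FactorKind],
    admin_factors: &HashSet<FactorKind>,
) -> Result<(), BootstrapError> {
    if method_steps.is_empty() {
        return Err(BootstrapError::NoRequiredFactors);
    }
    for kind in method_steps {
        if !admin_factors.contains(kind) {
            return Err(BootstrapError::AdminMissingFactor(*kind));
        }
    }
    Ok(())
}
```

Failing here, before the transaction opens, means a misshapen bootstrap costs nothing to reject and never leaves rows to roll back.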
The three-lever lockout
Lockout is the mechanism that prevents an attacker from brute-forcing credentials. Axess has three levers, applied at three scopes, that compose.
The first lever is per-user lockout. After a configurable number of failed factor verifications against the same user account, that account is locked for a configurable interval. The default is three failed attempts followed by a fifteen-minute lockout with exponential backoff on repeated failure.
The second lever is per-tenant lockout. After a configurable number of failed factor verifications across any user in the tenant within a short window, the tenant's login surface as a whole is throttled. The default is high enough that legitimate traffic does not trigger it; the lever exists to catch distributed brute-forcing across many accounts in the same tenant.
The third lever is per-IP lockout. After a configurable number of failed verifications from the same source IP within a short window, that IP is throttled or blocked outright. The default is ten attempts per minute, beyond which the requests are rejected without engaging the factor verifier. The lever catches a single attacker source attempting many accounts.
The three levers compose multiplicatively. A successful attack needs to dodge all three: stay below the per-user threshold, stay below the per-tenant threshold, and either spread across many source IPs or stay below the per-IP threshold. The cost of the attack grows as a product of the three.
The lockout configuration is in LockoutPolicy:
pub struct LockoutPolicy {
pub per_user: LockoutScale,
pub per_tenant: LockoutScale,
pub per_ip: LockoutScale,
}
pub struct LockoutScale {
pub failures_before_lockout: u32,
pub window: Duration,
pub backoff: BackoffPolicy, // fixed | exponential
pub max_lockout: Duration,
}
The policy is per-tenant by default (loaded from the tenant
record's lockout_policy field). The global default applies if
the tenant did not override.
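How backoff and max_lockout interact is easiest to see in code. The sketch below assumes a doubling schedule and takes the base lockout interval as a separate, hypothetical parameter, since the struct above does not name one; the real BackoffPolicy may use a different growth rule.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BackoffPolicy {
    Fixed,
    Exponential,
}

pub struct LockoutScale {
    pub failures_before_lockout: u32,
    pub window: Duration,
    pub backoff: BackoffPolicy,
    pub max_lockout: Duration,
}

/// True once the failures counted inside the window reach the threshold.
pub fn should_lock(scale: &LockoutScale, failures_in_window: u32) -> bool {
    failures_in_window >= scale.failures_before_lockout
}

/// Length of the next lockout. `base` and `prior_lockouts` are
/// assumptions of this sketch: the interval doubles per prior lockout
/// under Exponential and is always capped at `max_lockout`.
pub fn lockout_duration(scale: &LockoutScale, base: Duration, prior_lockouts: u32) -> Duration {
    let raw = match scale.backoff {
        BackoffPolicy::Fixed => base,
        BackoffPolicy::Exponential => {
            // Saturate instead of overflowing on very large counts.
            let factor = 2u32.checked_pow(prior_lockouts).unwrap_or(u32::MAX);
            base.checked_mul(factor).unwrap_or(Duration::MAX)
        }
    };
    raw.min(scale.max_lockout)
}
```

With a fifteen-minute base and a one-hour cap, the sequence runs fifteen minutes, thirty minutes, one hour, and then stays at one hour however many further lockouts occur.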
Tenant suspension
A suspended tenant is still in the database but cannot
authenticate. The state is reached through AuthnService::suspend_tenant,
which is the operational verb behind "this tenant has not paid"
or "this tenant has been flagged for compliance review."
The transition does five things atomically: it sets the tenant's
status to Suspended, it sets the suspended_at timestamp, it
invalidates every active session under the tenant (deletes the
session rows, so each user's next request comes through as Guest),
it revokes every refresh token under the tenant (sets revoked = true on each), and it emits a TenantSuspended audit event.
A suspended tenant's users hit TenantSuspended on every login
attempt instead of proceeding to factor verification. The error
is distinct from UserNotFound because the application typically
wants to render a specific page for it (a "your organisation is
suspended, contact support" message), not the generic invalid-credentials
flow.
Unsuspending is the inverse: unsuspend_tenant flips the status
back to Active, clears suspended_at, and emits a
TenantReactivated event. Sessions are not restored; users have
to log in again, which is the right behaviour because their
device records may have aged or rotated during the suspension.
Tenant deletion
A deleted tenant is the irreversible end of the lifecycle. The
state is reached through AuthnService::delete_tenant, typically
in response to a customer exit or a GDPR erasure request.
The deletion runs as a cascade. All sessions, refresh tokens,
devices, factor configurations, audit events, and the tenant
record itself are removed. The deletion is two-phase: the first
phase marks the tenant as Deleted and stops accepting new
operations on it; the second phase runs the cascade asynchronously
(typically as a background task) and removes the underlying
rows.
The two-phase pattern matters for two reasons. First, the
cascade is potentially expensive on large tenants; running it
synchronously blocks the operator's request. Second, the
two-phase approach gives a recovery window: if the deletion was
accidental, the first phase is reversible by flipping the status
back to Suspended before the cascade runs. After the cascade,
recovery requires a backup restore.
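The recovery window can be modelled with two pieces of state: the status and whether the cascade has run. The TenantLifecycle struct, its methods, and the DeleteError enum below are hypothetical and exist only to illustrate the two-phase rule; they are not the AuthnService or TenantStore surface.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantStatus {
    Active,
    Suspended,
    Deleted,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum DeleteError {
    NotDeleted,
    CascadeAlreadyRan,
}

pub struct TenantLifecycle {
    pub status: TenantStatus,
    pub cascade_done: bool,
}

impl TenantLifecycle {
    pub fn new() -> Self {
        Self { status: TenantStatus::Active, cascade_done: false }
    }

    /// Phase one: cheap, synchronous, and still reversible.
    pub fn mark_deleted(&mut self) {
        self.status = TenantStatus::Deleted;
    }

    /// Phase two: the background cascade that removes the rows.
    pub fn run_cascade(&mut self) -> Result<(), DeleteError> {
        if self.status != TenantStatus::Deleted {
            return Err(DeleteError::NotDeleted);
        }
        self.cascade_done = true;
        Ok(())
    }

    /// Recovery window: flips back to Suspended only if phase two has not run.
    pub fn undo_delete(&mut self) -> Result<(), DeleteError> {
        if self.status != TenantStatus::Deleted {
            return Err(DeleteError::NotDeleted);
        }
        if self.cascade_done {
            return Err(DeleteError::CascadeAlreadyRan);
        }
        self.status = TenantStatus::Suspended;
        Ok(())
    }
}
```

Landing on Suspended instead of Active after an undo matches the text: the operator gets the tenant back in a state where nothing can authenticate until someone deliberately reactivates it.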
The audit events emitted during the cascade are preserved (in a
separate axess.audit.tenant_deletion log) so the deletion is
defensible against later inquiry. The events name the
operator who initiated, the timestamp, and the counts (how many
users, how many sessions, how many tokens).
Per-tenant configuration storage
The per-tenant fields (fingerprint pepper, lockout policy, device retention, methods) live in dedicated tables keyed by tenant id. The application's tenant store is one of the adopter-implemented surfaces: axess provides the traits, and the implementation is yours. The pattern is uniform across the surfaces:
#[async_trait]
pub trait TenantStore: Send + Sync {
async fn get(&self, id: &TenantId) -> Result<Tenant, TenantStoreError>;
async fn create(&self, bootstrap: TenantBootstrap) -> Result<Tenant, ...>;
async fn suspend(&self, id: &TenantId, at: DateTime<Utc>) -> Result<(), ...>;
async fn unsuspend(&self, id: &TenantId) -> Result<(), ...>;
async fn delete(&self, id: &TenantId, mode: DeleteMode) -> Result<(), ...>;
async fn update_lockout_policy(&self, id: &TenantId, policy: LockoutPolicy) -> Result<(), ...>;
async fn rotate_fingerprint_pepper(&self, id: &TenantId, new: ZeroizedString) -> Result<(), ...>;
}
The trait surface is the tenant lifecycle in code. An adopter implements it against their own tenant table; axess calls into it on each lifecycle event.
Reserved principals
A handful of principals are reserved across all tenants. The
system() principal is the one axess uses for its own internal
operations (retention sweeps, scheduled rotations, audit pipeline
ingestion). The principal carries no TenantId; its actions are
attributed to the system itself, not to any tenant or user.
The reservation prevents an application from creating a user
named "system" and inadvertently granting that user the
permissions axess reserves for its background work. The
UserId::is_reserved check fires at user-creation time;
attempting to provision a reserved principal returns an error.
The set of reserved principals is small and stable. The chapter Audit events lists them.
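A minimal sketch of the reservation check at user-creation time follows. Only the name "system" comes from the text; the RESERVED_NAMES list, the ProvisionError type, the provision_user function, and the case-insensitive comparison are assumptions made for illustration.

```rust
/// Assumption: only "system" is named in this chapter; the real set may differ.
const RESERVED_NAMES: &[&str] = &["system"];

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UserId(String);

impl UserId {
    pub fn new(raw: &str) -> Self {
        UserId(raw.to_string())
    }

    /// True when the id collides with a reserved principal.
    /// Case-insensitive matching is an assumption of this sketch.
    pub fn is_reserved(&self) -> bool {
        RESERVED_NAMES
            .iter()
            .any(|name| name.eq_ignore_ascii_case(&self.0))
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProvisionError {
    ReservedPrincipal(String),
}

/// Hypothetical provisioning entry point: refuse reserved ids up front.
pub fn provision_user(raw: &str) -> Result<UserId, ProvisionError> {
    let id = UserId::new(raw);
    if id.is_reserved() {
        return Err(ProvisionError::ReservedPrincipal(raw.to_string()));
    }
    Ok(id)
}
```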
What this enables
Multi-tenancy in axess is what lets a SaaS application provision new organisations without restructuring the data model, suspend problematic ones without affecting the rest, and delete departed ones cleanly with an audit trail. The fingerprint pepper rotates per-tenant; the lockout policy varies per-tenant; device retention is enforced per-tenant; the policies scope per-tenant. The multi-tenant deployment is the single-tenant deployment with N>1.
Further reading
Scope hierarchy covers the three-tier (Global, Tenant, User)
resolution mechanism that determines which configuration applies
to which user. Device identity covers the per-tenant
fingerprint pepper and the GDPR-shaped retention sweep.
Identity store implementation covers the storage layer for
the tenant record and the user records under it. Cedar policy
fundamentals covers the cross-tenant forbid rule and the
policy-scoping pattern.
Identity store implementation
Most of axess works against traits, and the identity store is the
most consequential of them. The library does not prescribe a user
schema, a tenant schema, or a factor schema; it prescribes a set
of trait methods that the application implements against whatever
schema it already has. This chapter walks through the three-tier
trait split, the verbs each tier carries, the patterns for
implementing them against a SQL backend, and the
read-replica-and-fixtures variant that the NoopAuthnLog adapter
enables.
The three tiers
The identity store is split into three trait tiers, in order of increasing privilege. An adopter that needs only read access implements the narrowest tier; an adopter that needs write access for audit purposes implements the middle tier; an adopter that needs full administrative control implements the widest tier.
// Tier 1: read-only.
#[async_trait]
pub trait IdentityLookup: Send + Sync {
async fn get_user(&self, user_id: &UserId) -> Result<User, StoreError>;
async fn find_user(
&self,
identifier: &str,
tenant_id: &TenantId,
) -> Result<Option<User>, StoreError>;
// ... eight more verbs
}
// Tier 2: read + per-attempt audit writes.
#[async_trait]
pub trait IdentityAuthnLog: IdentityLookup {
async fn record_attempt(
&self,
attempt: AttemptRecord,
) -> Result<(), StoreError>;
async fn record_lockout(
&self,
lockout: LockoutRecord,
) -> Result<(), StoreError>;
async fn clear_lockout(
&self,
user_id: &UserId,
tenant_id: &TenantId,
) -> Result<(), StoreError>;
async fn last_attempts(
&self,
user_id: &UserId,
tenant_id: &TenantId,
limit: usize,
) -> Result<Vec<AttemptRecord>, StoreError>;
}
// Tier 3: read + audit + administrative writes.
#[async_trait]
pub trait IdentityAdmin: IdentityAuthnLog {
async fn create_user(&self, user: NewUser) -> Result<User, StoreError>;
async fn suspend_user(&self, user_id: &UserId, at: DateTime<Utc>) -> Result<(), StoreError>;
async fn erase_user(&self, user_id: &UserId, gdpr_reason: &str) -> Result<(), StoreError>;
// ... six more verbs covering admin lifecycle
}
// The umbrella for production: all three tiers.
pub trait IdentityStore: IdentityAdmin {}
impl<T: IdentityAdmin> IdentityStore for T {}
The hierarchy reads from narrowest to widest. An
IdentityAuthnLog is an IdentityLookup plus the audit writes.
An IdentityAdmin is an IdentityAuthnLog plus the
administrative writes. The umbrella IdentityStore is the
all-three-tiers shape that production backends implement.
Why three tiers
The split is the answer to three adopter situations the library has seen often enough to model explicitly.
The first situation is a read-replica deployment. A high-traffic
application runs the login flow against a read-replica of the
user database for latency reasons. The replica cannot accept
writes; the application needs the read verbs without the write
verbs. The IdentityLookup tier covers this. The application
implements IdentityLookup against the replica and
IdentityAuthnLog (which needs writes) against the primary.
The second situation is a fixture deployment. A test or an
embedded usage of axess does not have a real database; the
application uses an in-memory backend for the read verbs and does
not care about the audit writes. The NoopAuthnLog adapter
wraps an IdentityLookup and provides no-op implementations of
the IdentityAuthnLog write verbs. The fixture has the trait
surface it needs without writing an audit-table mock.
The third situation, less common, is a deployment with a
separation between the application code that handles login and
the administrative code that creates users. The application
implements IdentityAuthnLog; the admin code separately
implements IdentityAdmin. The split prevents the application
code from accidentally calling erase_user or suspend_user
because it never has the trait method in scope.
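The read-replica and split-privilege situations can be sketched with simplified, synchronous stand-ins for the traits. Everything here (Lookup, AuthnLog, Replica, Primary, SplitStore, and the two functions) is a hypothetical reduction for illustration; the real traits are async and carry more verbs.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Simplified stand-in for the read tier.
pub trait Lookup {
    fn find_user(&self, identifier: &str, tenant: &str) -> Option<String>;
}

/// Simplified stand-in for the audit-write tier.
pub trait AuthnLog: Lookup {
    fn record_attempt(&self, user_id: &str, success: bool);
}

/// Read-only replica: (identifier, tenant) -> user id.
pub struct Replica {
    pub users: HashMap<(String, String), String>,
}

/// Primary: the only place that accepts writes.
pub struct Primary {
    pub attempts: RefCell<Vec<(String, bool)>>,
}

/// One possible arrangement: reads go to the replica, writes to the primary.
pub struct SplitStore<'a> {
    pub replica: &'a Replica,
    pub primary: &'a Primary,
}

impl Lookup for SplitStore<'_> {
    fn find_user(&self, identifier: &str, tenant: &str) -> Option<String> {
        let key = (identifier.to_string(), tenant.to_string());
        self.replica.users.get(&key).cloned()
    }
}

impl AuthnLog for SplitStore<'_> {
    fn record_attempt(&self, user_id: &str, success: bool) {
        self.primary
            .attempts
            .borrow_mut()
            .push((user_id.to_string(), success));
    }
}

/// Bounded by Lookup only: no write verb is nameable in this function.
pub fn user_exists(store: &impl Lookup, identifier: &str, tenant: &str) -> bool {
    store.find_user(identifier, tenant).is_some()
}

/// Bounded by AuthnLog: can read and record, but has no admin verbs.
pub fn note_login(store: &impl AuthnLog, identifier: &str, tenant: &str, factor_ok: bool) -> bool {
    match store.find_user(identifier, tenant) {
        Some(user_id) => {
            store.record_attempt(&user_id, factor_ok);
            factor_ok
        }
        None => false,
    }
}
```

The trait bound on each function is what limits the verbs it can reach; the compiler rejects a call to a write verb from user_exists.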
What the verbs actually do
The verbs split cleanly across the tiers.
IdentityLookup is reads. get_user is a primary-key lookup by
UserId. find_user is a credentials-side lookup by identifier
and tenant: the user typed alice@example.com, the application
needs to know if this is a real user in this tenant. Other read
verbs cover the variants: looking up a user by email when email
is separately indexed, looking up a user by a federated identity
key when the application supports federated login, listing the
users in a tenant for admin tooling.
IdentityAuthnLog is the audit writes the lockout system
depends on. record_attempt is called by verify_factor after
every factor check; the record carries the user id, the tenant
id, the factor kind, the outcome (success, failure, locked), the
timestamp, the IP. record_lockout is called when the lockout
policy fires; the record marks the user as locked until a
specific moment. clear_lockout is called when the lockout
window expires or when an administrator manually clears the
state. last_attempts is the read verb the policy consults to
make the next lockout decision.
IdentityAdmin is the privileged writes. create_user is the
verb behind signup or admin provisioning. suspend_user is the
verb behind administrative suspension (compliance, fraud
investigation). erase_user is the GDPR-shaped verb: the user
has invoked their right to be forgotten, and the verb cascades
through every record that references them. Other admin verbs
cover password reset (administrative, not user-initiated),
identifier changes, and the per-user method override.
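To make the relationship between last_attempts and the lockout decision concrete, here is a minimal sketch of a policy that counts consecutive recent failures. The Attempt and LockoutRule types, the field names, and the counting rule itself are assumptions for illustration; the actual axess lockout policy may decide differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failure,
    Locked,
}

#[derive(Debug, Clone, Copy)]
pub struct Attempt {
    pub outcome: Outcome,
    /// Seconds since the Unix epoch.
    pub ts: u64,
}

/// Hypothetical policy parameters.
pub struct LockoutRule {
    pub max_failures: usize,
    pub window_secs: u64,
    pub lock_secs: u64,
}

/// `recent` is newest-first, the order a last_attempts query would return.
/// Returns the moment the lock expires, or None if no lock is due.
pub fn locked_until(rule: &LockoutRule, recent: &[Attempt], now: u64) -> Option<u64> {
    let cutoff = now.saturating_sub(rule.window_secs);
    let failures = recent
        .iter()
        // A success resets the streak: stop at the most recent one.
        .take_while(|a| a.outcome != Outcome::Success)
        // Count only real failures inside the window.
        .filter(|a| a.outcome == Outcome::Failure && a.ts >= cutoff)
        .count();
    if failures >= rule.max_failures {
        Some(now.saturating_add(rule.lock_secs))
    } else {
        None
    }
}
```

A caller in this sketch would pass the result to record_lockout when it is Some, and leave the user unlocked otherwise.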
Implementing against SQL
The typical implementation against a SQL database is verbose but mechanical. The pattern is to implement each verb as one query (or one transaction), with the right indexes on the user table to keep the reads fast.
A reference implementation against SQLite is in
examples/sqlite/. The shape:
struct OurBackend {
pool: SqlitePool,
}
#[async_trait]
impl IdentityLookup for OurBackend {
async fn get_user(&self, user_id: &UserId) -> Result<User, StoreError> {
let row = sqlx::query_as::<_, UserRow>(
"SELECT id, tenant_id, identifier, display_name, status, created_at
FROM users
WHERE id = ?1"
)
.bind(user_id.to_string())
.fetch_one(&self.pool)
.await?;
Ok(row.into())
}
async fn find_user(
&self,
identifier: &str,
tenant_id: &TenantId,
) -> Result<Option<User>, StoreError> {
let row = sqlx::query_as::<_, UserRow>(
"SELECT id, tenant_id, identifier, display_name, status, created_at
FROM users
WHERE identifier = ?1 AND tenant_id = ?2"
)
.bind(identifier)
.bind(tenant_id.to_string())
.fetch_optional(&self.pool)
.await?;
Ok(row.map(Into::into))
}
// ... eight more verbs
}
The patterns to note:
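The tenant scope is on every query that starts from caller-supplied
input. find_user filters by both identifier and tenant id; the same
identifier in a different tenant is not returned. get_user carries
no tenant filter because it is a primary-key lookup by UserId, which
already names a single user. The discipline is what enforces
cross-tenant refusal at the storage layer.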
The identifier comparison is whatever the deployment chose. The
example treats the identifier as case-sensitive; deployments that
want case-insensitive matching apply LOWER() to both sides (and
index on LOWER(identifier)). The trait does not opinionate; the
implementation decides.
The error type is the implementation's. The trait returns
StoreError; the implementation maps sqlx::Error into it. The
mapping preserves the kind of failure (connection error, query
error, constraint violation) so the upstream callers can act on
specific cases.
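The error-mapping pattern can be sketched without a database. BackendError stands in for a driver error such as sqlx::Error, and the StoreError variants shown here are hypothetical; the real StoreError may be shaped differently.

```rust
/// Simplified stand-in for a database driver error (hypothetical).
#[derive(Debug)]
pub enum BackendError {
    ConnectionRefused,
    UniqueViolation { constraint: String },
    Syntax(String),
}

/// Hypothetical trait-level error that preserves the kind of failure.
#[derive(Debug, PartialEq, Eq)]
pub enum StoreError {
    Unavailable,
    Conflict(String),
    Query(String),
}

impl From<BackendError> for StoreError {
    fn from(err: BackendError) -> Self {
        match err {
            BackendError::ConnectionRefused => StoreError::Unavailable,
            BackendError::UniqueViolation { constraint } => StoreError::Conflict(constraint),
            BackendError::Syntax(msg) => StoreError::Query(msg),
        }
    }
}

/// Pretend insert: fails with a constraint violation on a duplicate identifier.
fn insert_user(existing: &[&str], identifier: &str) -> Result<(), BackendError> {
    if existing.contains(&identifier) {
        return Err(BackendError::UniqueViolation {
            constraint: "users_identifier_key".to_string(),
        });
    }
    Ok(())
}

/// The `?` operator applies the From impl, so callers see StoreError only.
pub fn create_user(existing: &[&str], identifier: &str) -> Result<(), StoreError> {
    insert_user(existing, identifier)?;
    Ok(())
}
```

Because the mapping keeps the failure kind, a caller can retry on Unavailable and surface Conflict to the user as a duplicate-identifier message.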
Implementing the audit writes
IdentityAuthnLog is the layer that requires care. The verbs
fire on every login attempt; a slow implementation is the
bottleneck of the entire authentication flow.
The pattern is to batch where possible and to keep each write
small. The attempts table (authn_attempts in the example) is
append-only and indexed on (user_id, tenant_id, ts) for the
last_attempts query. The lockout state lives in a separate table
keyed by user; the record_lockout and clear_lockout verbs are
upserts.
#[async_trait]
impl IdentityAuthnLog for OurBackend {
async fn record_attempt(&self, attempt: AttemptRecord) -> Result<(), StoreError> {
sqlx::query(
"INSERT INTO authn_attempts (user_id, tenant_id, factor_kind, outcome, ip, ts)
VALUES (?1, ?2, ?3, ?4, ?5, ?6)"
)
.bind(attempt.user_id.to_string())
.bind(attempt.tenant_id.to_string())
.bind(attempt.factor_kind.as_str())
.bind(attempt.outcome.as_str())
.bind(attempt.ip.map(|ip| ip.to_string()))
.bind(attempt.ts)
.execute(&self.pool)
.await?;
Ok(())
}
async fn last_attempts(
&self,
user_id: &UserId,
tenant_id: &TenantId,
limit: usize,
) -> Result<Vec<AttemptRecord>, StoreError> {
let rows = sqlx::query_as::<_, AttemptRow>(
"SELECT user_id, tenant_id, factor_kind, outcome, ip, ts
FROM authn_attempts
WHERE user_id = ?1 AND tenant_id = ?2
ORDER BY ts DESC
LIMIT ?3"
)
.bind(user_id.to_string())
.bind(tenant_id.to_string())
.bind(limit as i64)
.fetch_all(&self.pool)
.await?;
Ok(rows.into_iter().map(Into::into).collect())
}
// ... record_lockout, clear_lockout
}
The last_attempts query is the hottest read in the audit layer.
The index (user_id, tenant_id, ts DESC) makes it cheap; without
the index, the query degrades to a table scan and the login flow
slows under load.
The append-only attempts table grows. The retention story for it is in Audit pipeline: typically a hot/cold split where recent attempts (the ones the lockout policy consults) stay in the attempts table and older attempts archive to a cold store.
The NoopAuthnLog adapter
NoopAuthnLog<L> wraps an IdentityLookup and provides no-op
implementations of the IdentityAuthnLog write verbs. The
wrapper exists for two cases.
The first is fixtures. A test uses MockIdentityStore
(implementing IdentityLookup), and verify_factor needs
IdentityAuthnLog. The test wraps the mock in NoopAuthnLog,
satisfies the trait, and runs without recording anything.
The second is read-replica deployments where the audit writes go
through a different code path (an out-of-band log shipper, a
Kafka topic, an external SIEM). The application implements
IdentityLookup against the replica, wraps in NoopAuthnLog,
and routes the audit writes through the side channel.
The trade-off is that the lockout policy will not function
correctly under NoopAuthnLog. The policy consults
last_attempts, which depends on the audit writes the noop
silently discarded. Deployments that use NoopAuthnLog for the
read-replica case must accept that the lockout policy is degraded
unless they implement an alternative.
The docstring of NoopAuthnLog warns about this, and the warning
is worth repeating here: do not use NoopAuthnLog in production
without an alternative lockout source.
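The degradation is easy to see in a reduced model. The sketch below uses synchronous stand-in traits and a hypothetical Noop wrapper; it assumes that the read verb last_attempts returns an empty list under the no-op adapter, which the text implies but does not spell out.

```rust
/// Simplified stand-in for the read tier.
pub trait Lookup {
    fn user_exists(&self, user_id: &str) -> bool;
}

/// Simplified stand-in for the audit tier.
pub trait AuthnLog: Lookup {
    fn record_attempt(&self, user_id: &str, success: bool);
    fn last_attempts(&self, user_id: &str, limit: usize) -> Vec<bool>;
}

/// Hypothetical no-op adapter: forwards reads, discards writes.
pub struct Noop<L>(pub L);

impl<L: Lookup> Lookup for Noop<L> {
    fn user_exists(&self, user_id: &str) -> bool {
        self.0.user_exists(user_id)
    }
}

impl<L: Lookup> AuthnLog for Noop<L> {
    fn record_attempt(&self, _user_id: &str, _success: bool) {
        // Intentionally empty: the attempt is dropped.
    }

    fn last_attempts(&self, _user_id: &str, _limit: usize) -> Vec<bool> {
        // Nothing was stored, so there is nothing to report.
        Vec::new()
    }
}

/// In-memory fixture implementing only the read tier.
pub struct FixtureUsers(pub Vec<String>);

impl Lookup for FixtureUsers {
    fn user_exists(&self, user_id: &str) -> bool {
        self.0.iter().any(|u| u == user_id)
    }
}

/// What a lockout policy would see when it consults the log.
pub fn failures_seen(store: &impl AuthnLog, user_id: &str) -> usize {
    store
        .last_attempts(user_id, 10)
        .into_iter()
        .filter(|ok| !*ok)
        .count()
}
```

Five failed attempts recorded through the wrapper leave failures_seen at zero, so a failure-count policy never fires.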
What about workload identities
Workloads have their own identity surface, not the same one
humans use. The IdentityStore traits do not cover workloads;
the workload identity resolvers (Workload identity overview)
have their own machinery.
The split is deliberate. Humans live in a user table; workloads live in a workload table (or do not live anywhere durable, when they are short-lived service-to-service callers). The audit events for workloads route differently from human events. The lockout policy does not apply to workloads at all. Trying to unify the two would produce a trait that does too many jobs.
The same is true for the principal model: the Principal enum
has two variants, and the read paths for the two variants go
through two different stores. The application implements both
stores and the resolver code routes appropriately.
Schema migration
The identity store is the part of the application most likely to need migrations over time: a new factor adds a column to the factor configurations table, a regulatory change requires a new field on the audit-attempts table, a refactor renames a column.
The migration mechanism is the application's, not axess's.
sqlx::migrate! is the standard pattern; alternative migration
tools (Diesel migrations, Atlas, custom SQL) work the same way.
Axess does not need to know about the migrations; the
implementation just needs to keep satisfying the trait against
the new schema.
The pattern in examples/sqlite/ is the reference. The
migrations/ directory carries the SQL files; the main.rs
runs them at startup; the implementation queries against the
latest schema.
What this enables
The trait split is what lets axess fit into existing applications without forcing a schema rewrite. The library knows nothing about the user table; it knows only that there is a trait it can call to look up users. The application's data model is the source of truth, and the trait surface is the bridge.
The three tiers and the noop adapter give the application enough flexibility to fit the awkward shapes (read replicas, fixtures, split admin) without forcing every adopter to implement the full set of verbs.
Further reading
Multi-tenancy covers the per-tenant configuration that the
identity store reads and writes. Audit events covers the
AuthEvent variants the audit-log verbs emit. Audit pipeline
covers the hot/cold retention story for the attempts table.
Migration guide covers the cross-version migrations that affect
the user table.
Workload identity overview
A workload in axess is a non-human caller: a service in your service
mesh, a Kubernetes pod, a CI/CD runner, a batch job, a serverless
function. Workloads need to authenticate against your application
the same way users do, but the credentials, the lifetimes, and the
operational characteristics are different. This part of the book
covers how axess models workload identity, how it resolves credentials
into a typed Principal::Workload, and how the Cedar policy layer
authorises workloads through the same rules it uses for human users.
The unifying claim is the one The principal model in Part II
already made: humans and workloads are the same type. A Cedar policy
that says resource.tenant_id == principal.tenant_id works for a
logged-in user and for a SPIFFE-identified payment service without
branching. The chapters in this part cover the specific resolvers
that turn each credential kind into a Principal::Workload.
The cookbook chapters are siblings of this overview. Read them in the order that matches your deployment: SPIFFE-based deployments read Inbound: JWT-SVID and Inbound: mTLS-SVID; cloud-platform deployments read Inbound: federation and Cloud STS exchange; applications that call downstream services on a workload's behalf read Outbound: OAuth and Outbound: mTLS.
The resolver model
Every inbound request that carries a workload credential runs
through a PrincipalResolver. The resolver inspects the credential
(a bearer JWT in a header, a client certificate from the TLS
handshake, a projected service-account token), validates it, and
returns a Principal::Workload if the validation succeeds. The same
trait is implemented for every credential kind axess supports, and
applications wire only the resolvers their deployment needs.
              ┌──────────────────────┐
              │   Inbound request    │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  PrincipalResolver   │
              │ (per-feature impls)  │
              └──────────┬───────────┘
                         │
            ┌────────────┴───────────┐
            │                        │
            ▼                        ▼
    Principal::Human        Principal::Workload
   (session + factors)  (with Issuer + WorkloadId)
                                     │
           ┌─────────────────────────┘
           │
           ▼
 ToCedarEntity bridge
           │
           ▼
   Cedar evaluation
Two resolvers ship today plus a generic third for everything else:
JwtSvidResolver (SPIFFE JWT-SVID, spec-bound; mandatory
spiffe:// URI in sub); MtlsResolver (SPIFFE X.509-SVID over
mTLS); and WorkloadResolver, the generic JWT-bearer resolver that
covers every non-SPIFFE workload-identity flow (Kubernetes projected
service-account tokens, GitHub Actions OIDC, GitLab CI OIDC, Okta,
Azure AD, Auth0, axess's own LocalIdP, custom internal JWT
formats). The adopter supplies a small claim parser + mapping
closure per issuer they care about; see examples/workload-identity/
for ready-made recipes (GitHub Actions, Kubernetes SA). The human
side has its own SessionResolver covered in Part II. A
MockResolver is available for DST tests. Each resolver implements
the same trait and produces the same Principal shape.
Why one type covers both
A traditional auth library treats human and workload identity as two independent stacks. The session layer handles users; a separate JWT-validation middleware handles services. Neither composes with the other, and policies that need to apply to both ("only callers in the finance tenant may read this resource") end up duplicated: one rule for users in code that knows about sessions, another rule for workloads in code that knows about tokens, and the two drift apart over time as the application evolves.
Unifying on Principal removes the duplication. The Cedar policy
quoted above works for a human and a workload because the policy
matches on a tenant id, which both variants carry. If the policy
later needs to discriminate between the two (a rule that demands
human-completed MFA for an action, but admits any workload), the
discrimination is expressed in one rule:
permit (
    principal,
    action == Action::"transfer-funds",
    resource
) when {
    resource.tenant_id == principal.tenant_id
    && (
        principal has Workload // workloads bypass MFA requirement
        || (
            principal has Human
            && principal.factors_completed.contains("Fido2")
        )
    )
};
The discrimination is local, readable, and lives in the policy file rather than scattered across handlers.
SPIFFE and SVIDs
SPIFFE is the industry-standard model for workload identity, and the chapters that follow assume the vocabulary. The two terms worth knowing up front:
A SPIFFE ID is a URI of the form
spiffe://<trust_domain>/<path>. The trust domain is the federation
namespace (prod.example.com, say); the path identifies a specific
workload within that domain (/svc/billing/tenant-acme). The
combination uniquely names the workload across the federation.
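A minimal sketch of splitting a SPIFFE ID into its two parts follows. The SpiffeId type, the error enum, and parse_spiffe_id are hypothetical names, and the validation covers only a simplified subset of the SPIFFE ID rules (lowercase trust-domain characters, no query or fragment).

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct SpiffeId {
    pub trust_domain: String,
    /// Includes the leading slash; empty when the ID names only the domain.
    pub path: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SpiffeIdError {
    MissingScheme,
    EmptyTrustDomain,
    InvalidTrustDomain,
    InvalidPath,
}

pub fn parse_spiffe_id(input: &str) -> Result<SpiffeId, SpiffeIdError> {
    let rest = input
        .strip_prefix("spiffe://")
        .ok_or(SpiffeIdError::MissingScheme)?;

    // Everything up to the first slash is the trust domain.
    let (trust_domain, path) = match rest.find('/') {
        Some(idx) => (&rest[..idx], &rest[idx..]),
        None => (rest, ""),
    };

    if trust_domain.is_empty() {
        return Err(SpiffeIdError::EmptyTrustDomain);
    }
    let domain_ok = trust_domain
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '.' | '-' | '_'));
    if !domain_ok {
        return Err(SpiffeIdError::InvalidTrustDomain);
    }
    // Simplified path check: no query string or fragment.
    if path.contains('?') || path.contains('#') {
        return Err(SpiffeIdError::InvalidPath);
    }

    Ok(SpiffeId {
        trust_domain: trust_domain.to_string(),
        path: path.to_string(),
    })
}
```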
An SVID (SPIFFE Verifiable Identity Document) is the credential that carries the SPIFFE ID. SVIDs come in two formats: JWT-SVID (a JWT signed by the trust domain's issuing authority) and X.509-SVID (a leaf certificate with the SPIFFE ID in a Subject Alternative Name URI). Both are covered in their own cookbook chapters.
SPIRE is the reference SPIFFE implementation. It handles workload attestation (verifying that a process running on a host is the workload it claims to be), SVID issuance, key rotation, and trust-domain federation. Axess does not replace SPIRE; SPIRE issues, axess validates. The two are designed to compose.
A future SpireWorkloadApiResolver (tracked in the ROADMAP) will
talk to a local SPIRE agent socket directly, fetching fresh SVIDs
on demand rather than relying on adopters to mount them into the
filesystem. For now, adopters mount short-lived SVIDs into pod
filesystems and configure axess against them.
Federation
A trust domain is a unit of issuance. A workload in trust domain A is identified by an A-issued SVID, validated against A's signing keys. When a workload in domain A needs to call a service in domain B, federation is the mechanism that lets B accept A's identity.
Three federation patterns appear in axess.
Same-domain is the simple case. The resolver validates the SVID against the local trust-domain bundle (the JWKS for JWT-SVIDs, the CA bundle for X.509-SVIDs). The SVID carries the local trust domain; the resolver knows where to fetch the keys.
Federated is the cross-domain case. The resolver validates the
SVID against a remote trust-domain bundle, then runs the resulting
identity through a TrustDomainFederation policy that decides how
the foreign identity is admitted: which trust domains are accepted,
which path prefixes within each are admitted, and how the identity
is rewritten into the local namespace, if at all. The federation
policy is deployment configuration; axess validates, the deployment
decides the rules.
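A federation policy of this kind can be sketched as a list of rules. FederationRule, FederationPolicy, and admit are hypothetical names for illustration and do not describe the actual TrustDomainFederation type; the input is assumed to be an identity whose signature has already been validated.

```rust
/// One accepted foreign trust domain (hypothetical shape).
pub struct FederationRule {
    pub trust_domain: String,
    /// Path prefixes admitted within that domain, without trailing slash.
    pub path_prefixes: Vec<String>,
    /// When set, the identity is rewritten into this local trust domain.
    pub rewrite_to: Option<String>,
}

pub struct FederationPolicy {
    pub rules: Vec<FederationRule>,
}

/// Prefix match on whole path segments, so "/svc/pay" does not admit "/svc/payroll".
fn under_prefix(path: &str, prefix: &str) -> bool {
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

impl FederationPolicy {
    /// Returns the identity to use locally, or None when the policy refuses it.
    pub fn admit(&self, trust_domain: &str, path: &str) -> Option<String> {
        let rule = self.rules.iter().find(|r| r.trust_domain == trust_domain)?;
        if !rule.path_prefixes.iter().any(|p| under_prefix(path, p)) {
            return None;
        }
        let domain = rule.rewrite_to.as_deref().unwrap_or(trust_domain);
        Some(format!("spiffe://{domain}{path}"))
    }
}
```

Matching on segment boundaries rather than raw string prefixes is a design choice of this sketch; it keeps a rule for one service from admitting a similarly named one.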
External-issuer is the non-SPIFFE case. The credential is not an
SVID at all (a Kubernetes service-account token, a GitHub Actions
OIDC token, an Azure AD workload token). All of these go through
the single generic WorkloadResolver: the adopter supplies a
claim parser + mapping closure that synthesises a SPIFFE-shape
WorkloadId from whichever claims the issuer's JWT carries. The
synthesis is what lets the rest of the system (Cedar policies,
audit events, the principal type) work uniformly: the external
workload looks like any other workload by the time the policy
evaluator sees it.
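The mapping step can be sketched as a plain function over already-validated claims. The Claims alias, the function names, and the synthetic trust domain github-actions.example are assumptions for illustration; the repository claim is the one GitHub Actions OIDC tokens carry, but the real WorkloadResolver closure signature is not shown in this chapter.

```rust
use std::collections::HashMap;

/// Already-validated string claims from the issuer's JWT (simplified).
pub type Claims = HashMap<String, String>;

/// Hypothetical mapping for GitHub Actions: build a SPIFFE-shaped id
/// from the `repository` claim (for example "acme/billing").
pub fn github_actions_mapping(claims: &Claims) -> Option<String> {
    let repo = claims.get("repository")?;
    let chars_ok = repo
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || matches!(c, '/' | '-' | '_' | '.'));
    // Refuse empty values, odd characters, and path traversal segments.
    if repo.is_empty() || !chars_ok || repo.split('/').any(|s| s.is_empty() || s == "." || s == "..") {
        return None;
    }
    // The trust domain is synthetic; any stable name the deployment picks works.
    Some(format!("spiffe://github-actions.example/repo/{repo}"))
}

/// The resolver side: apply whichever mapping the adopter supplied.
pub fn synthesise_workload_id<F>(claims: &Claims, mapping: F) -> Option<String>
where
    F: Fn(&Claims) -> Option<String>,
{
    mapping(claims)
}
```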
Cloud STS exchange
A workload that needs to call AWS, GCP, or Azure APIs can exchange its workload identity for short-lived cloud credentials. The mechanism is implemented by all three cloud providers under similar names (AWS STS AssumeRoleWithWebIdentity, GCP Workload Identity Federation, Azure Federated Identity Credentials), and axess provides adapters that bridge a validated workload identity to each of them.
The chapter Cloud STS exchange covers the configuration and the credential lifecycle. The benefit is that no long-lived cloud keys ever live on the workload's filesystem; the credentials are minted on demand from the workload identity, used briefly, and discarded.
Outbound
Axess is not only an inbound authenticator. When a service authenticates to a downstream service, it uses the same identity shape it would accept inbound. The chapters Outbound: OAuth and Outbound: mTLS cover the two ways this works: the workload presents an mTLS client certificate to the downstream's TLS server, or the workload exchanges its identity for a bearer token through an OAuth flow.
The pattern matters because it lets one identity (the workload's SVID, or its federated equivalent) carry through an entire chain of service calls. The audit trail records the same identity at every hop; revocation at the issuing authority propagates to every call that was about to use the identity.
Feature flags
The resolvers are individually feature-gated so a deployment only pays the compile cost for the credential kinds it actually uses.
| Feature | Resolver | Purpose |
|---|---|---|
| jwt-svid | JwtSvidResolver | Inbound SPIFFE JWT-SVID (spec-bound) |
| mtls | MtlsResolver | Inbound SPIFFE X.509-SVID via mTLS |
| jwt (auto-pulled by jwt-svid etc.) | WorkloadResolver | Generic JWT-bearer workload identity for every non-SPIFFE issuer (GitHub Actions, k8s SA, GitLab CI, Okta, Azure AD, Auth0, LocalIdP, …) via adopter-supplied claim parser + mapping closure. No per-company features; see examples/workload-identity/ |
| outbound-mtls | (client side) | Outbound mTLS with workload SVID |
| outbound-oauth | (client side) | Outbound OAuth client |
| aws-sts, gcp-wif, azure-fic | (cloud STS) | Exchange workload identity for cloud credentials |
| workload-id | umbrella | SPIFFE adapters + outbound + mTLS bundle |
What this part does not cover
Three concerns are intentionally outside scope.
The SPIRE Agent and Server implementations are not part of axess. Axess validates SVIDs; SPIRE issues them. The two are designed to be independent so deployments can use any SPIFFE-compliant issuer (SPIRE, an in-house implementation, a managed service like AWS IAM Roles Anywhere) without changing the axess side.
Trust domain bootstrap and root-of-trust ceremonies are out of scope. Operators manage the trust-domain bundle through SPIRE's federation API or an equivalent mechanism. Axess consumes the bundle; it does not establish it.
Service mesh integration is out of scope. Istio, Linkerd, and
Consul handle mesh-level identity at the proxy layer. Axess works
at the application layer above the mesh. When the mesh terminates
mTLS and forwards a verified identity in a header, a custom
PrincipalResolver can pick it up and produce a Principal::Workload
the rest of the system understands.
Further reading
The cookbook chapters in this part each cover one resolver in
detail. Start with the one matching the credential kind your
deployment uses; the others are useful background for the
federation and outbound scenarios. Cedar policy fundamentals
covers how the policy engine handles workload principals. The
principal model in Part II covers the unified Principal type
that all of this resolves to.
Inbound: SPIFFE JWT-SVID
A JWT-SVID is a JWT carrying a SPIFFE identity. It is the
right credential for service-to-service authentication where mTLS
is impractical (the network path crosses a load balancer that does
not preserve client certificates, the calling service speaks a
protocol that does not support TLS client auth, the deployment
favours the simplicity of bearer tokens). The JwtSvidResolver is
the axess resolver that validates these tokens and produces a
Principal::Workload.
The feature flag is jwt-svid (off by default).
The credential shape
A SPIFFE JWT-SVID is an ordinary JWT with two specific claim
requirements. The subject (sub) claim is the SPIFFE ID, formatted
as spiffe://<trust_domain>/<path>. The audience (aud) claim
names the intended recipient: when your application validates the
token, the audience must match a configured value.
{
"iss": "https://spire.prod.example.com",
"sub": "spiffe://prod.example.com/svc/billing",
"aud": ["https://api.example.com"],
"exp": 1735689600,
"iat": 1735686000,
"jti": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}
The signature covers the standard JWS signing input (the encoded header and payload), using keys published by the trust domain's issuing authority through a JWKS endpoint. The signing algorithm is RS256 or ES256 in production deployments; SPIFFE does not standardise the algorithm, but the keys advertised in the JWKS specify it.
Configuration
JwtSvidResolverConfig carries the validation parameters:
pub struct JwtSvidResolverConfig {
pub trust_domain: TrustDomain,
pub jwks_url: Url,
pub expected_audiences: Vec<String>,
pub clock_skew: Duration,
pub max_token_age: Duration,
}
let resolver = JwtSvidResolver::new(JwtSvidResolverConfig {
trust_domain: "prod.example.com".parse().unwrap(),
jwks_url: "https://spire.prod.example.com/keys".parse().unwrap(),
expected_audiences: vec!["https://api.example.com".into()],
clock_skew: Duration::from_secs(30),
max_token_age: Duration::from_secs(3600),
});
trust_domain is the trust domain the resolver accepts SVIDs
from. A token whose sub SPIFFE ID names a different trust domain
is rejected. The defence is the trust-domain isolation that SPIFFE
is built around.
jwks_url is where the resolver fetches signing keys. The fetch
runs through the axess-cache machinery: a single-flight cache
that dedupes concurrent fetches, with debouncing to prevent
denial-of-service through key-rotation thrash. The cache TTL
defaults to one hour, which matches the typical SPIRE rotation
schedule.
expected_audiences is the allowlist of audience values the
resolver accepts. A token whose aud does not contain at least one
of the expected values is rejected. Most deployments configure a
single audience (the application's URL); deployments that serve
multiple identities behind one resolver list each.
clock_skew is the tolerance applied to the exp and iat
checks. Thirty seconds is generous; production deployments that
synchronise clocks tightly through NTP can lower it.
max_token_age is the upper bound on how far in the past the
token's iat claim can be. The check defeats replay of stale
tokens: even if a token has not expired, a token issued more than
the configured age ago is rejected. The default is one hour, which
is generous; deployments with stricter posture set it lower.
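The interaction of clock_skew and max_token_age with the exp and iat claims can be sketched as a pure function over Unix timestamps. The function name, the error enum, and the choice to apply the skew tolerance to the age bound are assumptions of this sketch, not a description of the resolver's internals.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum TimeCheckError {
    Expired,
    IssuedInFuture,
    TooOld,
}

/// All values are seconds; `now`, `iat` and `exp` are Unix timestamps.
pub fn check_token_times(
    now: u64,
    iat: u64,
    exp: u64,
    clock_skew: u64,
    max_token_age: u64,
) -> Result<(), TimeCheckError> {
    // Expired, even after granting the sender's clock some slack.
    if now > exp.saturating_add(clock_skew) {
        return Err(TimeCheckError::Expired);
    }
    // Issued further in the future than skew can explain.
    if iat > now.saturating_add(clock_skew) {
        return Err(TimeCheckError::IssuedInFuture);
    }
    // Not expired, but older than the deployment is willing to accept.
    if now.saturating_sub(iat) > max_token_age.saturating_add(clock_skew) {
        return Err(TimeCheckError::TooOld);
    }
    Ok(())
}
```

With the example token (iat 1735686000, exp 1735689600) and a 30-second skew, the token is accepted up to and including 1735689630 and rejected one second later.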
Wiring the resolver
The resolver is wired as a Tower middleware that runs before the
handler. The middleware reads a bearer token from the
Authorization header (or wherever the deployment puts it),
calls into the resolver, and on success inserts the resulting
Principal into the request extensions.
use axess::workload::{JwtSvidLayer, JwtSvidResolver};
use axum::{routing::get, Router};

let resolver = JwtSvidResolver::new(/* ... */);
let layer = JwtSvidLayer::new(resolver);
let app = Router::new()
    .route("/api/data", get(handler))
    .layer(layer);
The handler reads the principal through an extractor:
use axess::Principal;
use axum::{http::StatusCode, Extension};

async fn handler(
    Extension(principal): Extension<Principal>,
) -> Result<&'static str, StatusCode> {
    match principal {
        Principal::Workload(w) => {
            tracing::info!(workload = %w.workload_id, "request from workload");
            Ok("ok")
        }
        Principal::Human(_) => {
            // The route is workload-only; reject the human request.
            // (Or route differently. Choice is the application's.)
            // Returning an error rather than panicking keeps the handler
            // safe when a session layer is wired alongside this one.
            Err(StatusCode::FORBIDDEN)
        }
    }
}
The middleware can be composed with other authentication paths. An application that accepts both human sessions and workload tokens wires the session layer and the JWT-SVID layer side by side; the first one to produce a principal wins.
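The "first one to produce a principal wins" rule can be sketched with a reduced resolver trait. The Request, Resolver, SessionStub, and BearerStub types are hypothetical stand-ins that perform no real validation; they exist only to show the ordering behaviour.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Principal {
    Human { user_id: String },
    Workload { workload_id: String },
}

/// Reduced request: just the two credential slots this sketch cares about.
pub struct Request {
    pub session_cookie: Option<String>,
    pub bearer_token: Option<String>,
}

/// Reduced resolver contract: Some(principal) on success, None otherwise.
pub trait Resolver {
    fn resolve(&self, req: &Request) -> Option<Principal>;
}

/// Stub: treats the cookie value as an already-verified user id.
pub struct SessionStub;

impl Resolver for SessionStub {
    fn resolve(&self, req: &Request) -> Option<Principal> {
        req.session_cookie
            .clone()
            .map(|user_id| Principal::Human { user_id })
    }
}

/// Stub: treats the bearer value as an already-verified workload id.
pub struct BearerStub;

impl Resolver for BearerStub {
    fn resolve(&self, req: &Request) -> Option<Principal> {
        req.bearer_token
            .clone()
            .map(|workload_id| Principal::Workload { workload_id })
    }
}

/// Try each resolver in order; the first principal produced wins.
pub fn first_principal(resolvers: &[&dyn Resolver], req: &Request) -> Option<Principal> {
    resolvers.iter().find_map(|r| r.resolve(req))
}
```

The order of the slice is the precedence: a request carrying both a cookie and a bearer token resolves to whichever resolver is listed first.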
Validation details
The validation runs through six checks in order. The order matters because cheaper checks come first: a malformed token fails parsing without ever fetching JWKS keys; an expired token is rejected without engaging the signature check.
The first check is parsing. The token must be a well-formed JWT
with header, payload, and signature segments. Malformed input
produces JwtSvidError::Malformed without further work.
The second check is the header. The alg field must be one of
the configured allowed algorithms (RS256 or ES256 by default;
deployments that need others configure them explicitly). The
kid field must be present so the resolver can look up the right
key.
The third check is the claims. The sub claim must be a valid
SPIFFE URI under the configured trust domain. The aud claim
must contain at least one of the configured expected audiences.
The exp and iat claims must be present and within the clock
skew and max age bounds. Missing or malformed claims produce
specific error variants so the operational signal is clear.
The fourth check is the signature. The resolver looks up the key
matching the token's kid in the cached JWKS, verifies the
signature, and falls through on success. A signature failure
triggers a JWKS cache refresh (subject to the debouncing) and a
retry against the fresh keys; a failure after refresh is final.
The fifth check is the nbf (not-before) claim when present.
Issuers that set nbf typically backdate it slightly so that clock
skew on the receiver side does not cause a fresh token to be
rejected. The check uses the same clock-skew tolerance.
The sixth check is the duplicate-jti check, when configured.
SPIFFE recommends a JTI on each token to allow receivers to
detect replay; an axess deployment that wants this protection
configures a JTI store (typically a small Valkey cache with the
configured max_token_age TTL), and the resolver checks for
duplicates before admitting the token.
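The duplicate-jti check can be sketched with an in-memory map standing in for the external cache. JtiStore and check_and_insert are hypothetical names; a production store would live in a shared cache so that every replica sees the same set.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a JTI cache: jti -> expiry (Unix seconds).
#[derive(Default)]
pub struct JtiStore {
    seen: HashMap<String, u64>,
}

impl JtiStore {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns true when the jti is new (token admitted), false on replay.
    /// Entries expire after `ttl` seconds, mirroring max_token_age.
    pub fn check_and_insert(&mut self, jti: &str, now: u64, ttl: u64) -> bool {
        // Drop entries whose TTL has passed so the map stays bounded.
        self.seen.retain(|_, expires_at| *expires_at > now);
        if self.seen.contains_key(jti) {
            return false;
        }
        self.seen.insert(jti.to_string(), now.saturating_add(ttl));
        true
    }
}
```

Tying the entry TTL to max_token_age is sufficient because a token older than that bound is already rejected by the age check.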
What the principal looks like
A successful validation produces a Principal::Workload:
Principal::Workload(WorkloadPrincipal {
    workload_id: WorkloadId::new("spiffe://prod.example.com/svc/billing"),
    trust_domain: TrustDomain::new("prod.example.com"),
    issuer: Issuer::JwtSvid {
        jwks_url: "https://spire.prod.example.com/keys".parse().unwrap(),
    },
    tenant_id: derive_tenant_from_path(...),
    tenant_slug: derive_slug_from_path(...),
    service_name: derive_service_from_path(...),
    attributes: {
        "exp": 1735689600,
        "iat": 1735686000,
        "jti": "f47ac10b-...",
    },
})
The workload_id is the parsed SPIFFE URI. The trust_domain
mirrors the configured trust domain. The issuer records that
the principal came through the JWT-SVID path with the specific
JWKS URL. The tenant and service derivation depends on the
deployment's SPIFFE path convention (the example above expects
paths like /svc/<service>/<tenant>); the resolver's path-parsing
logic is configurable, and examples/local_idp/ demonstrates the
pattern.
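As an illustration of the path convention, the sketch below derives the service and tenant from a path of the form /svc/<service>/<tenant>. `PathComponents` and `derive_from_path` are hypothetical names for this example; the resolver's configurable path parsing is not shown here and may behave differently.

```rust
/// Components derived from a SPIFFE path (illustrative stand-in type).
#[derive(Debug, PartialEq)]
pub struct PathComponents {
    pub service_name: String,
    pub tenant_slug: String,
}

/// Parse a path of the form `/svc/<service>/<tenant>`.
/// Returns `None` for any other shape.
pub fn derive_from_path(path: &str) -> Option<PathComponents> {
    let mut segments = path.strip_prefix('/')?.split('/');
    if segments.next()? != "svc" {
        return None;
    }
    let service = segments.next()?;
    let tenant = segments.next()?;
    // Reject empty segments and trailing extras.
    if service.is_empty() || tenant.is_empty() || segments.next().is_some() {
        return None;
    }
    Some(PathComponents {
        service_name: service.to_string(),
        tenant_slug: tenant.to_string(),
    })
}
```

Rejecting anything that does not match the convention exactly is the conservative choice: a path the deployment did not anticipate produces no tenant rather than a guessed one.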
The attributes map carries the rest of the token's claims, so
Cedar policies can match on them if needed (a policy that demands
a specific issuer, for instance, reads
principal.attributes.iss).
Threat model
The JWT-SVID flow is robust against the standard attacks when the validation is complete.
Against token forgery: the signature check defeats it. An attacker without the issuing authority's signing key cannot mint a valid SVID.
Against token theft: the audience check defeats most of it. A token stolen from one service cannot be used against another service whose audience does not match.
Against token replay: the iat + max_token_age bound shrinks
the replay window. With the optional JTI cache, replay is detected
explicitly.
Against trust-domain confusion: the trust-domain match defeats cross-domain attacks. A token from a different trust domain is rejected without further consideration.
The remaining attack surface is the issuing authority itself. A compromised SPIRE control plane can mint compromised SVIDs, and no client-side check catches that. The defence is operational: secure the SPIRE control plane, monitor its audit log, rotate keys on a schedule.
Troubleshooting
If the resolver returns KeyNotFound consistently, the JWKS URL
is wrong or the key advertised in the token is not yet published
at the URL. The latter is common during SPIRE rotation; the
caching layer's debounce can hide the rotation briefly. Force a
cache refresh (or wait for the TTL) and retry.
If the resolver returns AudienceMismatch for tokens that should
work, the issuing service is minting tokens with a different
audience than the application expects. Either the issuer's
configuration is wrong, or the application's expected_audiences
list is missing the relevant value. Inspect the token (the
payload is base64url-encoded, not encrypted, so it is readable
once decoded) to see what aud it carries.
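For inspecting a token's payload without extra tooling, a small decoder is enough. The sketch below is a hand-written base64url decoder using only the standard library; `base64url_decode` and `jwt_payload_json` are hypothetical helper names, and the code performs no signature verification, so it is suitable for debugging only.

```rust
/// Decode unpadded (or padded) base64url input into bytes.
pub fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut buffer: u32 = 0;
    let mut bits: u32 = 0;
    for byte in input.bytes() {
        let value = match byte {
            b'A'..=b'Z' => byte - b'A',
            b'a'..=b'z' => byte - b'a' + 26,
            b'0'..=b'9' => byte - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None,
        };
        buffer = (buffer << 6) | u32::from(value);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buffer >> bits) as u8);
            // Keep only the bits that have not been emitted yet.
            buffer &= (1 << bits) - 1;
        }
    }
    Some(out)
}

/// Return the decoded payload (middle segment) of a compact JWT as text.
/// No signature check happens here; this is for inspection only.
pub fn jwt_payload_json(token: &str) -> Option<String> {
    let mut parts = token.split('.');
    let _header = parts.next()?;
    let payload = parts.next()?;
    let _signature = parts.next()?;
    String::from_utf8(base64url_decode(payload)?).ok()
}
```

Note that the standard base64 alphabet differs from base64url in two characters and in padding, which is why a plain base64 tool can fail on some tokens.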
If the resolver returns TrustDomainMismatch, a workload from a
different domain is calling your service. If this is intentional,
configure federation (the next chapter, Inbound: federation,
covers the mechanism). If it is not intentional, the workload is
misconfigured.
Fetching SVIDs from a local SPIRE agent
JwtSvidResolver is the verifying side; it consumes an SVID
presented in an HTTP request and validates it against the trust
domain's JWKS. The issuing side (fetching fresh SVIDs from a
local SPIRE agent socket for outbound calls) is a separate
concern.
For deployments that need to fetch SVIDs at runtime, two adopter-direct options exist on crates.io today:
- spire-workload: higher-level wrapper around the SPIRE Workload API gRPC, including JWT-SVID fetch with auto-rotation. Most adopters reach for this first.
- spire-api: lower-level generated gRPC client when finer control is needed.
axess does not currently wrap either crate; the
SpireWorkloadApiResolver ROADMAP item lands when an adopter
needs an axess-shaped surface (e.g. integration with axess-clock
for rotation timing, axess-rng for ceremony nonces, or the
Principal::Workload shape on the fetch result for symmetry with
the verifier). Until then, the recommended path is:
- Use spire-workload directly in your application to fetch JWT-SVIDs against a configured audience.
- Present the fetched SVID on outbound calls via your HTTP client.
- On the receiving service, validate the SVID with JwtSvidResolver as documented above. The presenting and verifying sides interoperate without axess wrapping the fetch side.
If your deployment forces the issue (e.g. fetch-side rotation needs to drive axess-clock-pinned tests), open a tracking issue; that's exactly the adopter-demand signal the ROADMAP entry waits for.
Further reading
Workload identity overview covers the SPIFFE model and the
unified Principal type this resolver produces. Inbound:
mTLS-SVID covers the X.509 variant for deployments where mTLS is
practical. Inbound: federation covers the cross-trust-domain
patterns. Cedar policy fundamentals covers how policies match on
the workload's claims through principal.attributes.
Inbound: SPIFFE X.509-SVID via mTLS
A workload authenticates over mTLS by presenting a leaf X.509
certificate that carries its SPIFFE identity in a Subject
Alternative Name URI. The TLS handshake validates the certificate
against the trust-domain CA bundle, the application reads the
SPIFFE URI from the SAN, and the resulting identity becomes a
Principal::Workload. The mechanism is the right choice for
service-to-service traffic where mTLS is already in place (a
service mesh, a load balancer that preserves client certs, a
direct VPC peering).
The feature flag is mtls (off by default).
The credential shape
An X.509-SVID is an ordinary X.509 leaf certificate with one
specific requirement: the Subject Alternative Name extension
contains a URI of the form spiffe://<trust_domain>/<path>. The
certificate is otherwise standard; deployments may put additional
information in the subject DN, the other SAN entries, or X.509
extensions, but the SPIFFE URI is the identity the resolver reads.
The certificate chain is signed by the trust domain's CA. The chain validates the certificate's authenticity; the SAN URI identifies the workload within the trust domain.
Where the certificate comes from
Axess does not handle the TLS handshake. The handshake happens where TLS terminates (rustls in the application process, a sidecar proxy in a service mesh, a load balancer in front of the application). The terminator validates the certificate chain against the configured CA bundle, accepts or rejects the connection, and on acceptance makes the certificate available to the application.
The mechanism for making the certificate available depends on the
terminator. For rustls in process, the certificate is available
through axum_server::tls_rustls::RustlsConnectInfo or an
equivalent connector callback, which the resolver wires through
directly. For a sidecar proxy (Istio, Linkerd, Envoy in a service
mesh), the proxy forwards the certificate as a header (Istio uses
X-Forwarded-Client-Cert, Linkerd uses l5d-client-id), and the
resolver wires through a small adapter that parses the header
into a certificate. For a load balancer in passthrough TLS mode,
rustls handles the validation in-process; for a load balancer in
mTLS-terminating mode (AWS ALB with mTLS, Cloudflare with
client-cert auth, nginx with ssl_verify_client), the load
balancer forwards the certificate in a header whose name and
format depend on the product.
The application's job is to extract the certificate chain from
wherever the terminator put it, wrap it in PeerCertChain, and
insert it into the request extensions before the resolver runs.
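use axess::workload::PeerCertChain;
// Import paths assume the axum 0.6-style generic Next<B> used below.
use axum::{http::Request, middleware::Next, response::Response};

// Copy the terminator-provided certificate chain into the request
// extensions so the resolver can find it.
async fn mtls_middleware<B>(
    mut req: Request<B>,
    next: Next<B>,
) -> Response {
    // extract_cert_from_terminator is the adopter's own helper; it must
    // read only from a source the deployment trusts.
    if let Some(chain) = extract_cert_from_terminator(&req) {
        req.extensions_mut().insert(PeerCertChain::from(chain));
    }
    next.run(req).await
}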
The critical detail: the extraction must trust only sources the
deployment trusts. A request that arrives directly to the
application with a forged X-Forwarded-Client-Cert header must
not be accepted. Either run the application on a socket the
terminator owns and reject direct connections at the network
layer, or gate the header on a token the terminator injects
alongside the certificate.
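The second option, gating the header on a token the terminator injects, could be sketched as follows. This is an illustration under assumptions: the `x-terminator-token` header name is hypothetical, headers are modelled as a plain map with lowercase keys, and `trusted_forwarded_cert` is not a library function.

```rust
use std::collections::HashMap;

/// Compare two byte strings without returning early on the first mismatch.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Return the forwarded certificate header only when the request also
/// carries the shared token the terminator injects. Header names are
/// lowercase here; the token header name is hypothetical.
pub fn trusted_forwarded_cert<'a>(
    headers: &'a HashMap<String, String>,
    expected_token: &str,
) -> Option<&'a str> {
    let presented = headers.get("x-terminator-token")?;
    if !constant_time_eq(presented.as_bytes(), expected_token.as_bytes()) {
        return None;
    }
    headers.get("x-forwarded-client-cert").map(String::as_str)
}
```

A shared token only helps if the terminator strips any client-supplied copy of both headers before adding its own; otherwise a leaked token reopens the forgery path.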
The resolver
MtlsResolver is the resolver that reads the chain from the
extensions, extracts the SPIFFE URI, validates against the
configured trust domain, and produces a Principal::Workload.
use axess::workload::{MtlsResolver, MtlsResolverConfig};

let resolver = MtlsResolver::new(MtlsResolverConfig {
    trust_domain: "prod.example.com".parse().unwrap(),
    tenant_resolver: Box::new(MyTenantResolver::new(/* ... */)),
});
The configuration is small because most of the validation work has already happened. The terminator validated the certificate chain; the resolver only needs to read the SAN URI, parse it as a SPIFFE ID, and check that the trust domain matches the configured one.
tenant_resolver is the adopter-supplied piece that maps the
SPIFFE path to a TenantId. The path typically follows a
convention like /svc/<service>/<tenant_slug>, and the resolver
looks up the tenant id from the slug. The convention is the
deployment's; axess just provides the trait surface.
The validation flow
The resolver's resolve method runs five steps.
The first step is reading the peer certificate chain from request
extensions. Absence here is a configuration error (the extraction
middleware did not run), and the resolver returns
MtlsError::NoPeerCert.
The second step is parsing the leaf certificate. The chain may
contain intermediate certificates; the leaf is the first one. The
resolver extracts the SAN extension and looks for a URI value
matching the SPIFFE format. Absence of a SPIFFE URI in the SAN
produces MtlsError::NoSpiffeId.
The third step is parsing the SPIFFE URI. The URI must be
well-formed (a spiffe:// scheme, a trust domain, a path). A
malformed URI produces MtlsError::MalformedSpiffeId.
The fourth step is the trust-domain match. The parsed trust
domain must equal the configured one. A mismatch produces
MtlsError::TrustDomainMismatch.
The fifth step is the tenant resolution. The path is fed to the
configured TenantResolver, which returns a TenantId. The
resolver assembles the WorkloadPrincipal with the SPIFFE id, the
trust domain, the issuer (Issuer::Mtls), and the tenant id, and
returns it.
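Steps three and four can be sketched without any certificate handling. The code below parses a SPIFFE URI and compares its trust domain to a configured one; `SpiffeError`, `SpiffeId`, `parse_spiffe_id`, and `resolve_spiffe_id` are stand-ins whose variant names mirror the ones described above, and the parsing is deliberately simplified relative to the full SPIFFE ID rules.

```rust
/// Error names mirror the variants described above; the type itself is a
/// stand-in, not the library's `MtlsError`.
#[derive(Debug, PartialEq)]
pub enum SpiffeError {
    MalformedSpiffeId,
    TrustDomainMismatch,
}

/// A parsed SPIFFE ID: trust domain plus path (with its leading slash).
#[derive(Debug, PartialEq)]
pub struct SpiffeId {
    pub trust_domain: String,
    pub path: String,
}

/// Step three: require a `spiffe://` scheme, a trust domain, and a path.
pub fn parse_spiffe_id(uri: &str) -> Result<SpiffeId, SpiffeError> {
    let rest = uri
        .strip_prefix("spiffe://")
        .ok_or(SpiffeError::MalformedSpiffeId)?;
    // Split at the first '/': trust domain before it, path from it onward.
    let slash = rest.find('/').ok_or(SpiffeError::MalformedSpiffeId)?;
    let (trust_domain, path) = rest.split_at(slash);
    let domain_ok = !trust_domain.is_empty()
        && trust_domain.bytes().all(|b| {
            b.is_ascii_lowercase() || b.is_ascii_digit() || matches!(b, b'.' | b'-' | b'_')
        });
    // A bare "/" is not a usable workload path.
    if !domain_ok || path.len() < 2 {
        return Err(SpiffeError::MalformedSpiffeId);
    }
    Ok(SpiffeId {
        trust_domain: trust_domain.to_string(),
        path: path.to_string(),
    })
}

/// Step four: the parsed trust domain must equal the configured one.
pub fn resolve_spiffe_id(uri: &str, configured_domain: &str) -> Result<SpiffeId, SpiffeError> {
    let id = parse_spiffe_id(uri)?;
    if id.trust_domain != configured_domain {
        return Err(SpiffeError::TrustDomainMismatch);
    }
    Ok(id)
}
```

Keeping the malformed and mismatch cases as separate errors preserves the operational signal: one points at the issuer's certificate template, the other at an unexpected caller.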
What the principal looks like
A successful validation produces:
Principal::Workload(WorkloadPrincipal {
    workload_id: WorkloadId::new("spiffe://prod.example.com/svc/billing/tenant-acme"),
    trust_domain: TrustDomain::new("prod.example.com"),
    issuer: Issuer::Mtls,
    tenant_id: TenantId::parse("acme").unwrap(),
    tenant_slug: "acme".into(),
    service_name: "billing".into(),
    attributes: { /* X.509 fields the deployment exposes */ },
})
The attributes map carries any X.509 fields the deployment
chooses to surface (the certificate's serial number for audit, the
certificate's expiry for short-lived-cert tracking, custom
extensions). The choice is the deployment's; the resolver exposes
the chain so the adopter can read what they need.
Combining with other resolvers
A common shape is mTLS as the transport-level proof of identity plus a session cookie or a JWT as the application-level proof of who the user behind the workload is. The two layers compose: the mTLS resolver runs first and establishes the workload's identity; the session or JWT layer runs second and establishes the human's identity inside the workload. Cedar policies can match on both.
The composition is what gives a deployment "the calling service is authenticated AND the user inside the call is authenticated", which is the right shape for delegated workflows. Delegated and OBO access covers the pattern from the OBO side.
Threat model
mTLS is robust against the standard attacks when the issuing CA is secure.
Against token theft: there is no token. The credential is a private key the workload holds; an attacker without the key cannot present the certificate.
Against in-flight tampering: the TLS layer protects against it. The certificate is bound to the TLS session; an attacker on the wire cannot substitute a different certificate without breaking the handshake.
Against replay: the certificate is short-lived (SPIRE typically rotates SVIDs every few hours) and bound to a TLS session. Replay across sessions requires the private key, which the attacker does not have.
The remaining attack surface is the issuing CA. A compromised CA can issue compromised certificates, and the validation cannot detect it. The defence is operational: secure the issuing CA, monitor the issuance log, rotate the CA's signing key on a schedule.
The other remaining surface is the workload's private-key storage. A workload that stores its key in a file on disk is vulnerable to file-system compromise; a workload that stores its key in a hardware enclave (TPM, HSM, KMS) is much harder to compromise. SPIRE supports both shapes through its workload-API attestation; the choice is the deployment's.
Troubleshooting
If the resolver returns NoPeerCert for connections that should
work, the extraction middleware is not running, or the terminator
is not forwarding the certificate. Inspect the request extensions
before the resolver runs.
If the resolver returns NoSpiffeId, the certificate does not
carry a SPIFFE URI in the SAN. Inspect the certificate
(openssl x509 -in cert.pem -text) to see what SAN entries are
present. The issuer's configuration may need to be updated to
include the SPIFFE URI.
If the resolver returns TrustDomainMismatch, a workload from a
different trust domain has connected. If this is intentional,
configure federation (covered in Inbound: federation).
If the resolver succeeds but the tenant resolution fails, the path convention is not matching the workload's actual SPIFFE path. Inspect the path and update the tenant resolver to handle the actual format.
Further reading
Workload identity overview covers the SPIFFE model and the
unified Principal type. Inbound: JWT-SVID covers the bearer
token variant for deployments where mTLS is impractical. Inbound:
federation covers cross-trust-domain patterns. mTLS-based
authentication in Part III covers mTLS for human authentication;
the validation mechanics are the same, but the interpretation of
the certificate differs.
Inbound: federation
Federation is the pattern where workloads authenticate against your application using credentials issued by a third party your deployment trusts. The federating issuer typically lives outside the trust domain your own services use: Kubernetes issues service-account tokens for pods, GitHub issues OIDC tokens for Actions runs, an enterprise IdP issues tokens for cross-organisation service calls. None of these are SPIFFE issuers, but axess provides a generic resolver that bridges any JWT-bearer issuer into the unified workload-principal shape.
This chapter covers WorkloadResolver, the single resolver that
handles every non-SPIFFE federation. It is gated on the jwt
feature (transitively enabled by jwt-svid and the rest of the
workload-identity bundle).
What federation means here
The unifying claim of federation in axess is that an external
issuer's token, after validation, produces a Principal::Workload
with the same shape as a SPIFFE workload. The trust domain and the
SPIFFE-style path are synthesised from the issuer's claims; the
issuer field on the principal records which federation produced it
(Issuer::OAuth for the generic case, or one of
Issuer::custom("github_actions") / Issuer::custom("kubernetes") /
Issuer::custom("gitlab_ci") when audit logs need finer granularity).
The synthesis matters because the rest of the system stays uniform. A Cedar policy that says "any workload in the finance tenant may read this resource" works for a SPIFFE-identified service and for a Kubernetes pod and for a GitHub Actions run, without branching. The audit pipeline logs the same principal shape for all three. The application's code does not need to know which federation produced the request.
One resolver, many issuers
axess deliberately ships no per-issuer adapters. Each IdP's
JWT claim shape is small (~20 lines for a #[derive(Deserialize)]
struct, ~30 lines for a mapping closure) and adopters care about
their specific IdP's exact claim semantics, not a generic
average. Hard-coding wif-github, wif-k8s, wif-gitlab features
in the library invites endless additions without reuse benefit.
Instead, one WorkloadResolver<C, F, R> is generic over:
- C: the adopter's #[derive(Deserialize)] claim struct
- F: the closure mapping verified claims to WorkloadMapping
- R: the JTI replay-store type (defaults to NoReplay)
The library handles JWT verification (signature against JWKS,
iss/aud/exp/nbf/alg checks), trust-domain pinning, and
Principal construction. The closure handles the mapping from
verified claims to identity components.
Ready-made recipes
examples/workload-identity/ ships claim parsers + mappers for two
common issuers. Adopters copy the recipe that matches their IdP
into their codebase (recommended for production) or depend on the
crate directly (useful for prototypes and tests).
Kubernetes service accounts
Kubernetes mints OIDC-style tokens for pods through the
TokenRequest API. A pod requests a token bound to a specific
audience (the URL of your application, say), and the cluster's
control plane returns a signed JWT carrying the pod's
service-account identity. The token's iss is the cluster's OIDC
issuer URL; the kubernetes.io.{namespace,serviceaccount.name}
custom claim block carries the pod's identity.
use axess_example_workload_identity::kubernetes::{
    k8s_sa_mapper, K8sCustomClaims,
};
use axess_factors::federation::workload::WorkloadResolver;
use axess_factors::jwt::verifier::JwtVerifier;
use axess_identity::{Issuer, TrustDomain};
use std::sync::Arc;

// Startup wiring (cache the verifier; reuse across requests):
let verifier = Arc::new(
    JwtVerifier::new(cluster_jwks_handle)
        .with_issuer("https://kubernetes.default.svc.cluster.local")
        .with_audience("axess-platform"),
);
let trust_domain = TrustDomain::new("cluster.local").unwrap();

// Per request: adopter middleware peeks at the token to look up
// tenant_id from the namespace, then constructs the resolver.
let resolver = WorkloadResolver::<K8sCustomClaims, _, _>::new(
    verifier.clone(),
    trust_domain.clone(),
    tenant_id,
    Issuer::custom("kubernetes").unwrap(),
    bearer_token,
    k8s_sa_mapper(trust_domain),
);
let principal = resolver.resolve().await?;
The recipe synthesises a SPIFFE-shape workload id of the form
spiffe://cluster.local/<sa_name>/<namespace>. Adjust the
recipe's path layout if your trust-domain convention differs.
GitHub Actions OIDC
GitHub Actions can issue OIDC tokens for workflow runs. The token carries claims naming the repository, the workflow, the branch, the run id, and the actor. Combined with a trust-domain mapping, the token authenticates a specific workflow run from your organisation against your application.
use axess_example_workload_identity::github_actions::{
    github_actions_mapper, GitHubActionsClaims,
};
use axess_factors::federation::workload::WorkloadResolver;
use axess_factors::jwt::verifier::JwtVerifier;
use axess_identity::{Issuer, TrustDomain};
use std::sync::Arc;

let verifier = Arc::new(
    JwtVerifier::new(github_jwks_handle)
        .with_issuer("https://token.actions.githubusercontent.com")
        .with_audience("axess-platform"),
);
let trust_domain = TrustDomain::new("github.actions").unwrap();

let resolver = WorkloadResolver::<GitHubActionsClaims, _, _>::new(
    verifier.clone(),
    trust_domain.clone(),
    tenant_id,
    Issuer::custom("github_actions").unwrap(),
    bearer_token,
    github_actions_mapper(trust_domain),
);
let principal = resolver.resolve().await?;
The recipe synthesises spiffe://github.actions/<repo>/<owner>
and preserves actor, workflow, ref, sha, event_name as
Cedar attributes for policy use (allow only deploys from the
default branch, require a specific workflow file, etc.).
Other issuers (GitLab CI, Okta, Azure AD, Auth0, …)
Write your own recipe. For any new IdP:
- Decode a sample JWT to identify which claims carry the workload identity (project_path? namespace_id? a custom service?).
- Define a #[derive(Deserialize)] struct YourClaims { ... } with only the fields you care about. JwtVerifier ignores unknown claims, so you don't have to enumerate everything the issuer sends.
- Write a mapper closure Fn(&VerifiedClaims<YourClaims>) -> Result<WorkloadMapping, IdentityError> that produces the (workload_id, service_name, tenant_slug, attributes) shape.
- Wire as above, with Issuer::custom("your_idp_label").unwrap() for audit-log attribution (the constructor validates the label format: [a-z0-9_]{1,32}).
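The mapper step in the list above might look like the following sketch for a hypothetical IdP whose project_path claim has the form <tenant>/<service>. Everything here is a stand-in: `YourClaims`, `Mapping`, `MapError`, `map_claims`, and `is_valid_issuer_label` are illustrative names, not the library's `VerifiedClaims`, `WorkloadMapping`, or `IdentityError`, and the claim layout is an assumption.

```rust
use std::collections::BTreeMap;

/// Stand-in for the verified custom claims of a hypothetical IdP.
pub struct YourClaims {
    pub project_path: Option<String>, // assumed form: "<tenant>/<service>"
    pub ref_name: Option<String>,
}

/// Stand-in for the mapping shape; not the library's `WorkloadMapping`.
#[derive(Debug, PartialEq)]
pub struct Mapping {
    pub workload_id: String,
    pub service_name: String,
    pub tenant_slug: String,
    pub attributes: BTreeMap<String, String>,
}

#[derive(Debug, PartialEq)]
pub enum MapError {
    MissingClaim(&'static str),
    MalformedClaim(&'static str),
}

/// Check a label against `[a-z0-9_]{1,32}` before handing it to a
/// constructor that would otherwise reject it.
pub fn is_valid_issuer_label(label: &str) -> bool {
    (1..=32).contains(&label.len())
        && label
            .bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'_')
}

/// Map claims to identity components: `<tenant>/<service>` in
/// `project_path` becomes `spiffe://<trust_domain>/<service>/<tenant>`.
pub fn map_claims(trust_domain: &str, claims: &YourClaims) -> Result<Mapping, MapError> {
    let project = claims
        .project_path
        .as_deref()
        .ok_or(MapError::MissingClaim("project_path"))?;
    let (tenant, service) = project
        .split_once('/')
        .filter(|(t, s)| !t.is_empty() && !s.is_empty() && !s.contains('/'))
        .ok_or(MapError::MalformedClaim("project_path"))?;
    // Preserve selected claims as attributes for policy use.
    let mut attributes = BTreeMap::new();
    if let Some(r) = &claims.ref_name {
        attributes.insert("ref".to_string(), r.clone());
    }
    Ok(Mapping {
        workload_id: format!("spiffe://{trust_domain}/{service}/{tenant}"),
        service_name: service.to_string(),
        tenant_slug: tenant.to_string(),
        attributes,
    })
}
```

Returning a named error for a missing or malformed claim is what lets the troubleshooting path say which claim the mapper rejected.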
The two shipped recipes are the templates; read their source, adapt as needed.
When federation does and does not fit
Federation is the right answer when the deployment cannot or does not want to issue its own workload identities. A Kubernetes-based deployment that wants to use the pods' service-account tokens directly fits cleanly; an open-source CI integration that accepts tokens from any GitHub Actions run fits cleanly; an enterprise deployment that integrates with a partner's Okta tenant fits cleanly.
Federation is the wrong answer when the deployment runs SPIRE (or another SPIFFE issuer) and can mint its own SVIDs. In that case the SPIFFE-native resolvers (Inbound: JWT-SVID, Inbound: mTLS-SVID) are simpler, the trust model is tighter, and the federation indirection adds nothing.
Multi-resolver deployments are common. The same application typically accepts SPIFFE-native traffic from its own services and federated traffic from external collaborators; the resolvers wire side by side, each with its own router or middleware path, and the unified principal shape lets the policies stay the same across the sources.
Threat model
The federation flows share the threat model of the underlying issuer. A Kubernetes-issued token is as secure as the cluster's OIDC issuer; a GitHub Actions token is as secure as GitHub's issuance pipeline; an OIDC IdP-issued token is as secure as the IdP.
The defences that axess adds are the standard ones: signature
verification against the issuer's JWKS, iss match, aud match,
expiry check, optional clock-skew and max-age bounds, trust-domain
pinning at the resolver layer, and the adopter's claim-mapper
closure (which decides which subject paths the application admits).
The remaining attack surfaces are the issuer-specific ones. A compromised Kubernetes control plane mints compromised tokens. A misconfigured GitHub Actions workflow leaks the OIDC token. A compromised OIDC IdP issues tokens for arbitrary identities. The defences are operational: secure each issuer, monitor for unusual issuance patterns, rotate keys on a schedule.
The audit pipeline (covered in Audit pipeline) emits an event on
every successful workload authentication, recording the issuer
label (Issuer::OAuth / Issuer::Custom(...)) and the
synthesised identity. The events feed the SIEM rules that catch
issuer-level anomalies.
Troubleshooting
If resolve() returns NotAuthenticated, the JWT failed
verification: wrong issuer, wrong audience, expired, bad
signature, or a custom-claim deserialisation failure. Enable
debug-level tracing for axess_factors::federation::workload to see
which step rejected the token.
If resolve() returns InvalidSpiffeId, the resolver verified
the token but the trust domain extracted from the synthesised
WorkloadId did not match the resolver's pinned trust domain.
Typically a mapper bug: the closure synthesised the id under the
wrong trust domain. Check the recipe's trust_domain capture.
If resolve() returns InvalidComponent(...), the claim mapper
rejected the verified claims. The error message names which
claim was missing or malformed. Decode the JWT payload
(base64url-decode the middle segment) to compare claims against the
mapper's expectations.
Further reading
Workload identity overview covers the SPIFFE model the
federation resolver maps into. Cloud STS exchange covers the
next step for many federated tokens: exchanging a workload
identity for short-lived cloud credentials. OAuth 2.0 and OIDC
in Part III covers the underlying OIDC machinery that the
JwtVerifier builds on.
Cloud STS exchange
An application that has authenticated a workload through one of the
inbound resolvers may need to call AWS, GCP, or Azure APIs on the
workload's behalf. The cloud-native pattern for this is to
exchange the workload's identity for short-lived cloud credentials
through the cloud provider's Security Token Service. The mechanism
is supported by all three major clouds under similar names (AWS
STS AssumeRoleWithWebIdentity, GCP Workload Identity Federation,
Azure Federated Identity Credentials), and axess provides adapters
for each.
The feature flags are aws-sts, gcp-wif, and azure-fic, plus
an umbrella cloud-sts that enables all three. All are off by
default.
The pattern
The pattern is uniform across clouds. The application has a validated workload identity (a JWT-SVID, a federated OIDC token, a GitHub Actions OIDC token). The application wants to call a cloud API on the workload's behalf. Instead of giving the workload a long-lived cloud key, the application exchanges the workload's identity at the cloud's STS endpoint for a short-lived credential bound to a specific cloud role.
workload identity        STS exchange        short-lived cloud credential
token             ───>                ───>   (15 minutes, role-scoped)
                                                          │
                                                          ▼
                                                   cloud API call
The exchange happens at the application layer, server-side. The workload's identity token is sent only to the cloud's STS endpoint; the short-lived cloud credential is what makes the actual cloud API call. The benefit is that no long-lived cloud key ever sits on the workload's filesystem, and revocation of the workload's identity (at the issuer) propagates to the cloud access without any cloud-side action.
AWS STS
The AWS adapter calls AssumeRoleWithWebIdentity, the STS API for
identity federation. The configuration:
use std::time::Duration;
use axess::workload::cloud_sts::{AwsStsExchanger, AwsStsConfig};

let exchanger = AwsStsExchanger::new(AwsStsConfig {
    role_arn: "arn:aws:iam::123456789012:role/billing-api-prod".into(),
    region: "eu-west-1".into(),
    session_duration: Duration::from_secs(900), // 15 minutes
    role_session_name_strategy: SessionNameStrategy::WorkloadId,
});
The role_arn is the AWS role the credential will assume. The
role's trust policy specifies which web-identity tokens may
assume it; the policy is configured on the AWS side, and the
application's workload-identity issuer must match what the policy
allows.
The session_duration is the lifetime of the resulting
credential. AWS allows between 15 minutes and 12 hours (configurable
per role). Fifteen minutes is the recommended default; a longer
duration reduces the overhead of re-exchanging but widens the
window in which a stolen credential remains usable.
The role_session_name_strategy controls how the resulting
session is named in CloudTrail and AWS audit logs. Naming the
session after the workload identity (WorkloadId) makes the
audit trail readable; alternative strategies are available for
deployments with specific compliance requirements.
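async fn call_aws(
    exchanger: &AwsStsExchanger,
    principal: &Principal,
) -> Result<(), Error> {
    // principal_to_token is the adopter's helper that recovers the
    // workload's identity token for this principal.
    let creds = exchanger
        .exchange(principal_to_token(principal))
        .await?;

    // Build an S3 client that uses only the short-lived credential.
    let s3_client = aws_sdk_s3::Client::from_conf(
        aws_sdk_s3::Config::builder()
            .credentials_provider(creds)
            .build()
    );
    s3_client.list_buckets().send().await?;
    Ok(())
}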
GCP Workload Identity Federation
The GCP adapter calls Google Cloud's federated-credentials endpoint, which exchanges a token from an external identity provider for a Google Cloud access token. The configuration:
use axess::workload::cloud_sts::{GcpWifExchanger, GcpWifConfig};

let exchanger = GcpWifExchanger::new(GcpWifConfig {
    workload_identity_pool: "projects/123/locations/global/workloadIdentityPools/axess".into(),
    workload_identity_provider: "external-oidc".into(),
    target_principal: "billing-api@project.iam.gserviceaccount.com".into(),
    scopes: vec!["https://www.googleapis.com/auth/cloud-platform".into()],
});
The workload_identity_pool and workload_identity_provider name
the GCP-side configuration that maps external identities to GCP
identities. The pool and provider are configured on the GCP side
through the gcloud CLI or Terraform; the application's adapter
references them by name.
The target_principal is the GCP service account the exchange
impersonates. The service account's IAM bindings determine which
GCP resources the resulting credential can access.
The scopes list bounds what the credential can be used for. The
narrowest possible scope is the recommendation; cloud-platform
is the broadest and should be used only when the application
genuinely needs unrestricted access.
Azure Federated Identity Credentials
The Azure adapter exchanges an external identity for an Azure AD access token through the FIC (Federated Identity Credential) mechanism. The configuration:
use axess::workload::cloud_sts::{AzureFicExchanger, AzureFicConfig};

let exchanger = AzureFicExchanger::new(AzureFicConfig {
    tenant_id: "00000000-0000-0000-0000-000000000000".into(),
    client_id: "11111111-1111-1111-1111-111111111111".into(),
    scope: "https://storage.azure.com/.default".into(),
});
The tenant_id is the Azure AD tenant. The client_id is the
managed identity or application registration in that tenant that
the exchange will authenticate as; the FIC binding on the managed
identity determines which external tokens may exchange for it.
The scope is the Azure AD resource the resulting token is bound
to. Azure tokens are audience-scoped; a token for storage cannot
be used against Key Vault. List the scopes the application needs;
use the .default suffix to inherit the managed identity's
configured permissions.
Credential lifecycle
The short-lived credentials returned by all three STS endpoints have explicit expiry. The application's call path needs to respect the expiry:
The simple shape is one exchange per cloud call. The application exchanges, makes the call, discards the credential. The latency overhead is one STS round-trip per call (typically 50 to 200 ms depending on the cloud), which is acceptable for one-off operations.
The optimised shape is to cache the exchanged credential for the duration of its validity. The application exchanges once, caches the credential, uses it for subsequent calls until it nears expiry, then re-exchanges. The cache key is the workload identity plus the target role; the cache value is the credential plus its expiry.
The right shape depends on the call rate. Below a few calls per
minute, the simple shape is fine. Above that, the optimised
shape with a per-workload cache (a ClockTtlCache from
axess-cache) eliminates the per-call STS round-trip.
The expiry handling needs care. A credential that expires mid-call produces an authentication error from the cloud SDK, which the application catches and translates into a re-exchange. The cache wraps the expiry check; calls that get a near-expired credential refresh proactively.
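The optimised shape can be sketched with a plain map keyed by workload identity and target role. This is a simplified stand-in, not the ClockTtlCache mentioned above: `CachedCredential` and `CredentialCache` are hypothetical types, the exchange is passed in as a closure, and time is supplied by the caller so the behaviour is deterministic.

```rust
use std::collections::HashMap;

/// A cached credential with its expiry (Unix seconds). The `secret` field
/// stands in for whatever the cloud SDK returns.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedCredential {
    pub secret: String,
    pub expires_at: u64,
}

/// Cache keyed by (workload identity, target role).
pub struct CredentialCache {
    refresh_margin: u64, // refresh this many seconds before expiry
    entries: HashMap<(String, String), CachedCredential>,
}

impl CredentialCache {
    pub fn new(refresh_margin: u64) -> Self {
        Self { refresh_margin, entries: HashMap::new() }
    }

    /// Return a credential that is not near expiry, calling `exchange`
    /// only when the cache has nothing usable for this key.
    pub fn get_or_exchange<E>(
        &mut self,
        workload_id: &str,
        role: &str,
        now: u64,
        exchange: impl FnOnce() -> Result<CachedCredential, E>,
    ) -> Result<CachedCredential, E> {
        let key = (workload_id.to_string(), role.to_string());
        if let Some(cred) = self.entries.get(&key) {
            // Still valid with margin to spare: reuse it.
            if cred.expires_at > now.saturating_add(self.refresh_margin) {
                return Ok(cred.clone());
            }
        }
        // Missing or near expiry: exchange proactively and cache the result.
        let fresh = exchange()?;
        self.entries.insert(key, fresh.clone());
        Ok(fresh)
    }
}
```

The refresh margin should exceed the longest expected cloud call, so that a credential handed out by the cache does not expire while the call is in flight.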
Multi-cloud deployments
A deployment that uses multiple clouds (a workload that calls both AWS and GCP, say) configures one exchanger per cloud. The two are independent; they share the workload identity as input but produce cloud-specific credentials as output.
The pattern composes cleanly. The application has a workload
principal; it has one AwsStsExchanger and one GcpWifExchanger
in scope; calls to AWS go through the AWS exchanger, calls to
GCP go through the GCP exchanger. No cross-cloud coupling.
Threat model
Cloud STS exchange is robust against credential theft because the short-lived credentials it produces are time-bounded. A stolen credential expires within minutes regardless of the attacker's actions.
The remaining attack surfaces:
The first is the workload identity itself. A compromised workload identity can be exchanged for fresh cloud credentials at any time. The defence is to keep the workload identity short-lived (SPIRE rotates SVIDs every few hours, GitHub OIDC tokens are short-lived), so a compromised identity has a bounded lifetime.
The second is the STS endpoint. A compromised STS issues compromised credentials. The defence is operational: the cloud provider secures their STS; the application validates the returned credentials by their structure (signature, format) but cannot independently verify that the STS itself is honest.
The third is the role's trust policy. A misconfigured trust policy allows any workload to assume the role, defeating the identity-based restriction. The defence is to review trust policies carefully at deployment time; the principle of least privilege applies.
Audit
Each exchange produces an audit event (a
DelegatedTokenExchanged event in the axess audit pipeline) and
a cloud-side audit event (CloudTrail for AWS, Cloud Audit Logs
for GCP, Activity Log for Azure). The two together give a
complete picture: what identity was exchanged, when, for what
role, and what cloud actions the resulting credential performed.
The retention configuration is in Audit pipeline. The recommendation is longer retention for STS-exchange events than for ordinary authentication events, because these events are the evidence a future compliance review of cross-cloud actions will ask for.
Troubleshooting
If the exchange returns AccessDenied from AWS STS, the role's
trust policy does not admit the token. Check the policy's
Principal.Federated and Condition blocks; the most common
issues are a wrong issuer URL, a wrong audience, or a missing
required claim.
If the exchange returns INVALID_ARGUMENT from GCP, the
workload identity pool or provider name is wrong, or the token's
shape does not match what the provider expects. Inspect the
provider configuration through gcloud iam workload-identity-pools providers describe.
If the exchange returns AADSTS70021 from Azure, the FIC binding
on the managed identity does not match the token's subject claim.
Update the FIC configuration to match what the workload identity
emits.
Further reading
Inbound: JWT-SVID and Inbound: federation cover the resolvers that produce the workload identity that gets exchanged here. Outbound: OAuth covers OAuth-based outbound credentials, which are an alternative to cloud STS for some non-cloud downstreams. Audit pipeline covers the retention configuration for cross-cloud audit events.
Outbound: OAuth
This chapter covers the case where the application authenticates itself as a workload against a downstream OAuth-protected service. The application is the OAuth client; the downstream is the resource server. The credential is an access token the application acquires through one of the OAuth client flows (client credentials, token exchange, or refresh of a stored token).
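The chapter pairs with Inbound: federation and Cloud STS exchange. Inbound: federation covers the inbound case where the application accepts workload tokens; Cloud STS exchange covers outbound calls to cloud provider APIs; this chapter covers outbound calls to OAuth-protected services, where the application presents a token it has acquired.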
The feature flag is outbound-oauth (off by default).
When to use it
Three patterns lead to outbound OAuth.
The first is a service-to-service call between two services your
deployment owns, where the receiving service authenticates
inbound OAuth (typically through the generic WorkloadResolver
from Inbound: federation). The application's outbound
configuration mints a fresh token through the client-credentials
grant, sends it on the request, and the receiving service
validates it.
The second is a call to a SaaS service that requires OAuth (Slack, Stripe, Twilio, an enterprise CRM). The application is registered as an OAuth client at the SaaS, holds a client id and secret, and mints tokens to call the SaaS's API.
The third is a call to a downstream service on a user's behalf,
where the credential is a token exchanged from the user's session
or from a stored refresh token. This is the OBO case, covered in
Delegated and OBO access; the outbound-oauth machinery in this
chapter is what delegated-stored and delegated-exchange use
under the hood.
Configuration
OutboundOAuthClient is the type that mints tokens. The
configuration:
// ClientCredential is assumed to be exported from the same module.
use axess::workload::outbound::{ClientCredential, OutboundOAuthClient, OutboundOAuthConfig};
let client = OutboundOAuthClient::new(OutboundOAuthConfig {
token_endpoint: "https://idp.example.com/oauth/token".parse().unwrap(),
client_id: "billing-api-prod".into(),
client_credential: ClientCredential::Secret("...".into()),
scopes: vec!["https://api.downstream.example/.default".into()],
audience: Some("https://api.downstream.example".into()),
});
token_endpoint is the OAuth server's token endpoint. The
endpoint typically comes from the OAuth server's discovery
document; the configuration is the resolved URL.
client_credential carries how the application authenticates to
the token endpoint. The variants are:
pub enum ClientCredential {
Secret(ZeroizedString),
JwtAssertion { signing_key: SigningKey, kid: String },
Mtls, // client cert from outbound TLS
SignedJwt { /* ... */ },
}
The Secret variant is the classic OAuth client secret. The
JwtAssertion variant is RFC 7523 (private_key_jwt
authentication), which is what FAPI-grade integrations use; the
application signs a short-lived assertion JWT with its private
key, and the token endpoint validates it against the registered
public key. The Mtls variant uses the outbound TLS connection's
client certificate as the authentication. The SignedJwt variant
covers cases where the JWT structure differs from RFC 7523.
scopes is the list of scopes requested. The narrowest possible
list is the recommendation; over-broad scopes leak privilege if
the resulting token is compromised.
audience is the optional audience parameter, used by some token
endpoints (Azure AD, Auth0, others that follow the same pattern)
to bind the resulting token to a specific resource.
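To make the roles of scopes and audience concrete, here is a sketch of the form body a client-credentials request carries on the wire. It is an illustration under stated assumptions, not what OutboundOAuthClient does internally: `client_credentials_body` and `encode` are hypothetical helpers, client authentication is left out (it travels separately, depending on the ClientCredential variant), and the `audience` parameter is a vendor extension that only some token endpoints accept.

```rust
/// Percent-encodes every byte except the RFC 3986 unreserved characters.
fn encode(value: &str) -> String {
    let mut out = String::new();
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Builds the request body for the client-credentials grant.
/// Scopes are joined with a single space, as RFC 6749 requires.
pub fn client_credentials_body(scopes: &[&str], audience: Option<&str>) -> String {
    let mut pairs = vec![("grant_type", "client_credentials".to_string())];
    if !scopes.is_empty() {
        pairs.push(("scope", scopes.join(" ")));
    }
    if let Some(aud) = audience {
        // Non-standard parameter; only sent when configured.
        pairs.push(("audience", aud.to_string()));
    }
    pairs
        .iter()
        .map(|(key, value)| format!("{}={}", key, encode(value)))
        .collect::<Vec<_>>()
        .join("&")
}
```

Keeping the scope list short shows up directly in this body: every entry in `scopes` is a capability the resulting token carries.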
Minting tokens
The simple shape calls mint_token directly:
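async fn call_downstream(
    client: &OutboundOAuthClient,
    // The HTTP client is passed in; reqwest is assumed here.
    http_client: &reqwest::Client,
) -> Result<(), Error> {
    // Mint a fresh token: one round-trip to the token endpoint.
    let token = client.mint_token().await?;
    // Present the token as a bearer credential on the downstream request.
    let _response = http_client
        .get("https://api.downstream.example/data")
        .header("Authorization", format!("Bearer {}", token.access_token))
        .send()
        .await?;
    Ok(())
}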
Each mint_token call hits the token endpoint, exchanges the
client credentials, and returns the access token. The cost is one
round-trip per call.
The optimised shape caches the token for the duration of its validity:
async fn call_downstream_cached(
    client: &CachedOutboundOAuthClient,
) -> Result<(), Error> {
    // The token is either still fresh or freshly minted; the cache handles the expiry.
    let token = client.get_cached().await?;
    // ... use token.access_token on the downstream request
    Ok(())
}
CachedOutboundOAuthClient is the cache wrapper. The cache uses
the same ClockTtlCache machinery the rest of axess uses; the
TTL is the token's expires_in value, minus a small buffer so a
token that expires mid-call is refreshed proactively.
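The TTL arithmetic is small enough to show directly. This is a sketch, not the wrapper's actual code: `cache_ttl` is a hypothetical helper, and the size of the buffer is an assumption the deployment would choose.

```rust
use std::time::Duration;

/// Cache lifetime for a token: its `expires_in` minus a safety buffer.
/// Saturates at zero, so a token that lives for less than the buffer
/// is effectively not cached at all.
pub fn cache_ttl(expires_in: Duration, buffer: Duration) -> Duration {
    expires_in.saturating_sub(buffer)
}
```

With a five-minute token and a thirty-second buffer, the entry lives for four and a half minutes; a ten-second token with the same buffer gets a zero TTL and is minted afresh on every call.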
The right shape depends on the call rate. Below a few calls per minute, the simple shape works. Above that, the cache is worth the complexity.
Token exchange (RFC 8693)
The token-exchange flow is the alternative to client-credentials when the outbound call is on behalf of an inbound principal (human or workload). The application presents the inbound credential to a token-exchange-capable IdP and receives a token bound to the downstream audience.
use axess::workload::outbound::{TokenExchanger, ExchangeRequest};
let exchanger = TokenExchanger::new(/* ... */);
let token = exchanger.exchange(ExchangeRequest {
subject_token: inbound_token,
subject_token_type: "urn:ietf:params:oauth:token-type:jwt".into(),
audience: "https://api.downstream.example".into(),
scopes: vec!["read:data".into()],
}).await?;
The exchange runs through the IdP's token endpoint with the RFC 8693 parameters; the IdP validates the subject token, applies whatever exchange policy it has, and returns a token for the requested audience. The pattern is what most enterprise IdPs support today (Azure AD, Okta, Auth0); the OBO chapter covers it in detail from the application's side.
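For readers who want to see what "the RFC 8693 parameters" are, the sketch below lists the form fields a token-exchange request carries. The parameter names and the grant-type URN come from RFC 8693; `ExchangeParams` and `token_exchange_form` are hypothetical names for illustration and are not the TokenExchanger API.

```rust
/// Hypothetical plain-data view of an exchange request.
pub struct ExchangeParams<'a> {
    pub subject_token: &'a str,
    pub subject_token_type: &'a str,
    pub audience: &'a str,
    pub scopes: &'a [&'a str],
}

/// Returns the form fields of an RFC 8693 token-exchange request,
/// before URL encoding and without client authentication.
pub fn token_exchange_form(params: &ExchangeParams) -> Vec<(&'static str, String)> {
    let mut form = vec![
        (
            "grant_type",
            "urn:ietf:params:oauth:grant-type:token-exchange".to_string(),
        ),
        ("subject_token", params.subject_token.to_string()),
        ("subject_token_type", params.subject_token_type.to_string()),
        ("audience", params.audience.to_string()),
    ];
    if !params.scopes.is_empty() {
        // Scopes are a single space-delimited value.
        form.push(("scope", params.scopes.join(" ")));
    }
    form
}
```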
DPoP and sender-constrained tokens
The FAPI 2.0 chapter (FAPI 2.0) covers DPoP as a way to bind access tokens to a key the client controls. The outbound-oauth machinery supports DPoP through an opt-in configuration:
let config = OutboundOAuthConfig {
// ... standard configuration ...
sender_constraint: Some(SenderConstraint::DPoP {
key_provider: Box::new(my_dpop_key_provider()),
}),
};
When sender_constraint is set, the client generates a DPoP
proof on each call, signed with the configured key, and attaches
it to the request along with the access token. The downstream
validates the proof, matches the key thumbprint against the
token's binding, and serves the request.
The cost is one extra HTTP header per call plus a signature. The benefit is that a stolen access token is unusable without the DPoP key, which the client never transmits.
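The proof attached to each call is a small JWT whose claims tie it to one request. The sketch below builds those claims only; it is an illustration of the RFC 9449 claim set, not the axess implementation. `DpopClaims`, `htu_for`, and `dpop_claims` are hypothetical names, signing is omitted, and so is the `ath` claim (a hash of the access token), because the standard library has no SHA-256.

```rust
/// Claims of a DPoP proof JWT (RFC 9449), minus `ath` and the signature.
#[derive(Debug, PartialEq)]
pub struct DpopClaims {
    /// Unique identifier for this proof, so the server can reject replays.
    pub jti: String,
    /// HTTP method of the request the proof covers.
    pub htm: String,
    /// Target URI of the request, without query and fragment.
    pub htu: String,
    /// Issued-at time in Unix seconds.
    pub iat: u64,
}

/// Strips the query and fragment, as the `htu` claim requires.
pub fn htu_for(url: &str) -> &str {
    let end = url
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(url.len());
    &url[..end]
}

pub fn dpop_claims(method: &str, url: &str, jti: String, iat: u64) -> DpopClaims {
    DpopClaims {
        jti,
        htm: method.to_ascii_uppercase(),
        htu: htu_for(url).to_string(),
        iat,
    }
}
```

Because `htm` and `htu` name a specific request, a proof captured from one call cannot be replayed against a different endpoint, which is the property that makes the stolen access token unusable on its own.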
Threat model
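The outbound OAuth flows have a smaller threat surface than the inbound flows because the application initiates every exchange: it chooses which token endpoint to trust and which credential to present, rather than evaluating tokens that arrive from arbitrary callers.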
Against client credential theft: the credential lives in the application's secrets store. Theft requires an application-level compromise, at which point the OAuth credential is only one of several problems.
Against access token theft in transit: TLS protects the wire. Stealing a token from a TLS-protected call requires breaking TLS, and that defence belongs to the transport layer rather than to the OAuth client.
Against access token theft at rest: tokens are short-lived (typically minutes) and held in process memory. A long-lived refresh token (in the stored OBO case) is what carries longer exposure; the encrypted credential store decorator covers that.
Against scope creep: the scopes parameter restricts what the token can do. The discipline is to request the narrowest scopes the application needs, so a compromised token has limited blast radius.
Troubleshooting
If the token endpoint returns invalid_client, the client
credentials are not what the IdP expects. The most common cause
is using Secret against an endpoint that requires
JwtAssertion, or vice versa.
If the token endpoint returns invalid_scope, the requested
scopes are not authorised for this client. Check the client's
registration at the IdP to see which scopes are permitted.
If the downstream returns 401 on an apparently fresh token, the
most likely cause is that the token's audience does not match the
audience the downstream expects. Some IdPs default the audience to
the client id rather than to a resource URL; set the audience
parameter explicitly.
Further reading
OAuth 2.0 and OIDC covers the inbound OAuth machinery and the shared OIDC primitives. FAPI 2.0 covers DPoP and the sender-constrained-token pattern. Delegated and OBO access covers the higher-level OBO machinery that uses outbound OAuth under the hood. Operations runbook covers client-credential rotation and the DPoP key lifecycle.
Outbound: mTLS
This chapter covers the case where the application presents an X.509 client certificate during the outbound TLS handshake to a downstream service that requires mTLS. The credential is the application's workload identity in X.509 form, typically an X.509-SVID issued by SPIRE or an equivalent. The downstream validates the certificate against its trust anchor and accepts or rejects the connection.
The feature flag is outbound-mtls (off by default).
When to use it
Outbound mTLS is the right pattern for service-to-service traffic within a federation that uses mTLS as the standard authentication mechanism (a SPIFFE-based service mesh, an intra-organisation network where everything speaks mTLS, a partner integration where both sides have agreed to mTLS). The application's certificate identifies it as a workload to the downstream; no bearer token needs to ride the request.
The pattern is operationally simpler than outbound OAuth because the authentication happens once at connection setup rather than per request. A long-lived TLS connection handles many requests without re-authenticating; a short-lived connection re-authenticates on the next request. The cost is the TLS handshake's CPU and round-trip; the benefit is no per-request authentication state.
Configuration
OutboundMtlsClient is the type that holds the certificate and
key, and provides them to the outbound TLS handshake. The
configuration:
use axess::workload::outbound::{OutboundMtlsClient, OutboundMtlsConfig};
let client = OutboundMtlsClient::new(OutboundMtlsConfig {
client_cert_path: "/var/lib/axess/svid/cert.pem".into(),
client_key_path: "/var/lib/axess/svid/key.pem".into(),
ca_bundle_path: Some("/var/lib/axess/svid/ca.pem".into()),
reload_interval: Some(Duration::from_secs(300)),
});
client_cert_path and client_key_path are filesystem paths to
the certificate and the private key. The conventional location is
where SPIRE writes them: SPIRE rotates the certificate on a
configurable schedule (typically every few hours), writes the new
files atomically, and the client picks them up on next read.
ca_bundle_path is the optional path to the trust anchor for the
downstream's server certificate. When set, the client validates
the downstream's server cert against this bundle; when unset, the
client uses the system trust store.
reload_interval controls how often the client checks the
certificate files for changes. The check is a stat call; an
unchanged file is a no-op, a changed file triggers a re-read. The
default (every five minutes) sits well inside typical SPIRE
rotation schedules of a few hours; deployments with faster
rotation lower this.
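The stat-based check can be sketched with the standard library alone. This is an assumption about the general technique, not the OutboundMtlsClient internals: `FileWatch`, `observe`, and `check` are hypothetical names, and the comparison uses the file's modification time, which on some filesystems has coarse granularity.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Remembers the last modification time seen for one file.
pub struct FileWatch {
    last_seen: Option<SystemTime>,
}

impl FileWatch {
    pub fn new() -> Self {
        Self { last_seen: None }
    }

    /// Records `modified` and reports whether it differs from the
    /// previously recorded value. The first observation always counts
    /// as a change, so the initial read happens.
    pub fn observe(&mut self, modified: SystemTime) -> bool {
        let changed = self.last_seen != Some(modified);
        self.last_seen = Some(modified);
        changed
    }

    /// Stats the file and reports whether a re-read is needed.
    pub fn check(&mut self, path: &Path) -> io::Result<bool> {
        let modified = fs::metadata(path)?.modified()?;
        Ok(self.observe(modified))
    }
}
```

A reload loop would call `check` once per interval and re-read the certificate and key only when it returns `true`.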
The TLS handshake
The client integrates with the application's HTTP client (typically
reqwest, but the pattern generalises) by handing it a preconfigured
rustls TLS configuration:
use axess::workload::outbound::OutboundMtlsClient;
use reqwest::Client;
let mtls = OutboundMtlsClient::new(/* ... */);
let http_client = Client::builder()
.use_preconfigured_tls(mtls.rustls_client_config())
.build()?;
let response = http_client
.get("https://downstream.example/data")
.send()
.await?;
rustls_client_config returns a rustls ClientConfig with the
certificate, key, and trust anchor configured. The
use_preconfigured_tls integration on reqwest accepts this
directly; other HTTP clients have similar integration points.
The handshake validates the downstream's server certificate against the configured trust anchor (or the system store), then presents the client certificate. If the downstream requires the client certificate and the application's certificate is missing or invalid, the handshake fails. If the downstream does not require the certificate, the handshake succeeds and the certificate is ignored.
Certificate rotation
The certificate rotation is what makes outbound mTLS sustainable in production. A static certificate provisioned at deployment time expires; the deployment has to redeploy to refresh it. A rotated certificate refreshes itself; the deployment runs indefinitely.
SPIRE rotates X.509-SVIDs on a schedule the operator configures
(typically every few hours). The new certificate is written
atomically to the filesystem (a temporary file plus a rename, so
the in-progress reads see either the old or the new, never a
truncated file). The application's OutboundMtlsClient reads the
files at construction and on its reload interval.
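The write-then-rename pattern is worth seeing once, because it is what makes the reader side safe. The sketch below shows the general technique with the standard library; it is not SPIRE's code. `write_atomically` is a hypothetical helper, the ".tmp" suffix is an arbitrary choice, and a production writer would also flush the file to disk before renaming.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Writes `contents` to a sibling temporary file, then renames it over
/// `target`. Readers see either the old file or the new one, never a
/// partially written file, because the rename replaces the target in
/// one step on the same filesystem.
pub fn write_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    fs::write(&tmp_path, contents)?;
    fs::rename(&tmp_path, target)
}
```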
The reload-interval choice matters. Too short, and the client spends cycles on stat calls that almost never find a change. Too long, and the client keeps presenting a certificate after it has expired, producing handshake failures. The recommendation is to set the interval to no more than about a third of the certificate's lifetime, so that after a typical rotation at least one reload picks up the new files before the old certificate expires.
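As a worked example of the one-third guideline, a hypothetical helper (`reload_interval_for` is not part of the library) could derive the interval from the certificate lifetime:

```rust
use std::time::Duration;

/// One third of the certificate lifetime, per the guideline above.
pub fn reload_interval_for(cert_lifetime: Duration) -> Duration {
    cert_lifetime / 3
}
```

A certificate with a six-hour lifetime gives a two-hour reload interval.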
A reload that finds a malformed certificate logs the error and keeps the previous certificate in memory. The client continues to function until the previous certificate expires, by which point either the malformed state is fixed or the handshake fails. The graceful-degradation pattern is the right shape: a botched rotation should not bring down the application immediately.
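The keep-the-previous-certificate behaviour can be sketched as a slot that only swaps on a successful parse. This illustrates the pattern under stated assumptions: `CertSlot` and `looks_like_certificate` are hypothetical, and the PEM marker check stands in for real certificate parsing.

```rust
/// Holds the certificate currently in use.
pub struct CertSlot {
    current_pem: String,
}

impl CertSlot {
    pub fn new(initial_pem: String) -> Self {
        Self { current_pem: initial_pem }
    }

    pub fn current(&self) -> &str {
        &self.current_pem
    }

    /// Replaces the certificate only if the candidate parses. On failure
    /// the previous certificate stays in place and the error is returned
    /// for the caller to log.
    pub fn reload(&mut self, candidate_pem: &str) -> Result<(), String> {
        if !looks_like_certificate(candidate_pem) {
            return Err("failed to read certificate: not a PEM certificate".to_string());
        }
        self.current_pem = candidate_pem.to_string();
        Ok(())
    }
}

/// Placeholder validation: checks the PEM markers only.
fn looks_like_certificate(pem: &str) -> bool {
    let trimmed = pem.trim();
    trimmed.starts_with("-----BEGIN CERTIFICATE-----")
        && trimmed.ends_with("-----END CERTIFICATE-----")
}
```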
When the downstream is also axess
A common shape is two axess-instrumented services calling each
other over mTLS. The calling side presents its X.509-SVID through
the outbound-mtls machinery; the receiving side validates it
through the mtls resolver from Inbound: mTLS-SVID. The two
sides compose without any further integration: the same SPIFFE
identity flows through the TLS handshake, the receiving resolver
extracts it, the resulting principal is the calling service's
identity.
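The identity that flows through the handshake is a SPIFFE ID carried in the certificate's URI subject alternative name. A simplified sketch of splitting one into its parts follows; `SpiffeId` and `parse_spiffe_id` are hypothetical names, and the parser skips the character-set and length validation a real resolver performs.

```rust
/// The two parts of a SPIFFE ID: spiffe://<trust domain><path>.
#[derive(Debug, PartialEq)]
pub struct SpiffeId {
    pub trust_domain: String,
    pub path: String,
}

/// Splits a SPIFFE ID into trust domain and path. Returns None when
/// the scheme is wrong or the trust domain is empty.
pub fn parse_spiffe_id(uri: &str) -> Option<SpiffeId> {
    let rest = uri.strip_prefix("spiffe://")?;
    let (trust_domain, path) = match rest.find('/') {
        Some(index) => (&rest[..index], &rest[index..]),
        None => (rest, ""),
    };
    if trust_domain.is_empty() {
        return None;
    }
    Some(SpiffeId {
        trust_domain: trust_domain.to_string(),
        path: path.to_string(),
    })
}
```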
The pattern is what gives a SPIFFE-based deployment a fully identity-aware service mesh at the application layer, without requiring a sidecar proxy. The mesh's identity is the application's identity; the audit trail records the same identity at every hop.
Threat model
Outbound mTLS shares the threat model of the X.509-SVID inbound case from Inbound: mTLS-SVID. The key-storage problem is the biggest concern: a workload whose private key is on disk is vulnerable to filesystem compromise; a workload whose key lives in a TPM, HSM, or KMS is much harder to compromise.
The additional concern for outbound is the downstream's trust configuration. A misconfigured downstream that accepts any client certificate from any CA (or that does not require client certificates at all) defeats the authentication. The defence is operational: ensure the downstream's trust configuration is correct, monitor for unexpected accepted connections, audit the configuration on a schedule.
Troubleshooting
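If the handshake fails with a certificate-validation error, first establish which side rejected which certificate. If the downstream rejected the client certificate, the downstream does not trust the application's CA; its trust bundle needs to include that CA, which is the downstream's configuration, not the client's. If the client rejected the downstream's server certificate, check ca_bundle_path (or the system trust store).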
If the handshake succeeds but the downstream returns 401 or 403 on every request, the downstream is most likely performing authorisation against the certificate's identity rather than just authentication. Check the downstream's authorisation policy: it may require a specific SPIFFE path, a specific issuer, or a specific X.509 extension that the application's certificate does not have.
If the application ends up presenting an expired certificate, a reload has probably been failing unnoticed. Check the reload-interval configuration and the application's log output: reload errors are logged at warn level, and a missed reload typically surfaces as a "failed to read certificate" message.
Further reading
Inbound: mTLS-SVID covers the receiving side of the same machinery. Workload identity overview covers the SPIFFE model both sides use. Cloud STS exchange covers the alternative pattern for downstreams that require bearer tokens rather than mTLS. Operations runbook covers the certificate rotation and the key-storage choices for production deployments.
Delegated and OBO access
The scenario is common: your application needs to act on behalf
of the user against a downstream service. A user signs in, grants
your application the right to read their inbox or post on their
behalf, and from that moment forward the application can make
calls to the downstream service that the downstream sees as
coming from the user. The mechanism is on-behalf-of (OBO) access,
and axess covers two shapes through the delegated/ module under
axess-core.
The feature flag is delegated (off by default), with two
narrower variants (delegated-stored, delegated-exchange) that
turn on each shape independently. The module lives inside
axess-core rather than as a separate crate because the
encryption envelope it needs already ships with the SQL session
backends, so the isolation benefit a separate crate would have
provided was illusory. Adopters who do not turn on the feature
pay zero compile cost.
The two shapes
OBO comes in two architectural shapes. The shape matters because the operational characteristics differ: where credentials live, how often they refresh, what happens when the user revokes consent.
The first shape is stored OBO. The user grants consent once through an OAuth flow; the application receives a refresh token along with the initial access token; the application persists the refresh token; future calls to the downstream service use the refresh token to mint a fresh access token, then use the access token to make the actual call. The pattern is what most "connect your Google account" or "connect your Slack account" flows do.
The second shape is token exchange (RFC 8693). The user's session in the application carries a credential (a session cookie, a JWT session, a workload identity token). When the application needs to call a downstream service on the user's behalf, it presents the credential to a Security Token Service (STS) and receives a short-lived access token bound to the call. There is no persistent storage of credentials for the downstream; the exchange happens per call (or per a short cache window).
The two shapes solve different problems. Stored OBO is right when the application needs to act on the user's behalf when the user is not actively present (a scheduled report that pulls from Gmail at 6am, a background sync that runs while the user is offline). Token exchange is right when the application needs to act on the user's behalf only while the user has an active session, and where the user's session credential can be exchanged for a downstream credential at low cost.
Stored OBO
The stored OBO shape uses the delegated-stored feature. The
machinery has three moving parts: an OAuth flow that grants
initial consent, a credential store that persists the refresh
token, and a refresh path that mints fresh access tokens for
calls.
The initial grant is an OAuth authorization code flow where the
scopes include the downstream's access scope (https://mail.google.com/,
channels:read, whatever the downstream's vocabulary is) and the
flow includes offline_access (the OpenID Connect scope that asks
for a refresh token; some providers use their own mechanism
instead, such as Google's access_type=offline parameter). The
flow's success returns both an access token (usable immediately)
and a refresh token (storable for later use).
The persistence runs through the DelegatedCredentialStore
trait:
#[async_trait]
pub trait DelegatedCredentialStore: Send + Sync {
async fn save_credential(
&self,
owner: &CredentialOwner,
credential: StoredCredential,
) -> Result<(), StoreError>;
async fn load_credential(
&self,
owner: &CredentialOwner,
downstream: &str,
) -> Result<Option<StoredCredential>, StoreError>;
async fn revoke_credential(
&self,
owner: &CredentialOwner,
downstream: &str,
) -> Result<(), StoreError>;
}
pub struct StoredCredential {
pub access_token: ZeroizedString,
pub refresh_token: ZeroizedString,
pub expires_at: DateTime<Utc>,
pub scopes: Vec<String>,
pub downstream: String,
}
The owner is typically the user, identified by UserId and
TenantId. The downstream is named by a string ("google.com",
"slack", "github"), letting one user have multiple stored
credentials for different downstreams.
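The keying scheme (one credential per owner per downstream) is easiest to see in an in-memory store. The sketch below is synchronous and deliberately simplified; it does not implement the async DelegatedCredentialStore trait, and `Owner`, `Credential`, and `MemoryStore` are hypothetical stand-ins for the real types.

```rust
use std::collections::HashMap;

/// Stand-in for CredentialOwner: a user within a tenant.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Owner {
    pub tenant_id: String,
    pub user_id: String,
}

/// Stand-in for StoredCredential, reduced to the fields the sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Credential {
    pub refresh_token: String,
    pub downstream: String,
}

/// Keyed by (owner, downstream), so one user can hold several credentials.
#[derive(Default)]
pub struct MemoryStore {
    entries: HashMap<(Owner, String), Credential>,
}

impl MemoryStore {
    pub fn save(&mut self, owner: &Owner, credential: Credential) {
        let key = (owner.clone(), credential.downstream.clone());
        self.entries.insert(key, credential);
    }

    pub fn load(&self, owner: &Owner, downstream: &str) -> Option<&Credential> {
        self.entries.get(&(owner.clone(), downstream.to_string()))
    }

    /// Removes the credential; later loads return None.
    pub fn revoke(&mut self, owner: &Owner, downstream: &str) {
        self.entries.remove(&(owner.clone(), downstream.to_string()));
    }
}
```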
The encrypted variant is EncryptedDelegatedCredentialStore<S, K>,
a decorator that wraps any store with AES-256-GCM at-rest
encryption using a key the deployment provides. The trait surface
is the same; the encryption happens transparently inside the
decorator. Production deployments use the encrypted variant.
The refresh path runs on demand. When the application needs to call the downstream, it loads the stored credential, checks whether the access token is still valid, and either uses it directly or runs the refresh exchange to mint a fresh access token. The fresh access token replaces the stored one. If the downstream rotates the refresh token on each refresh (many do), the new refresh token replaces the stored one as well. This is the same defence the session refresh-token mechanism uses; Refresh tokens and session continuity covers the family-based theft detection in detail.
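The on-demand refresh path reduces to a short function. This is a sketch under stated assumptions rather than the library's code: `Stored`, `Refreshed`, and `access_token_for` are hypothetical names, time is Unix seconds passed in by the caller, and the refresh call to the downstream is a closure. Persisting the updated credential back to the store is left to the caller.

```rust
/// Reduced view of a stored credential.
#[derive(Clone, Debug, PartialEq)]
pub struct Stored {
    pub access_token: String,
    pub refresh_token: String,
    pub expires_at: u64,
}

/// What a hypothetical refresh call returns.
pub struct Refreshed {
    pub access_token: String,
    pub expires_in: u64,
    /// Present when the downstream rotates the refresh token.
    pub rotated_refresh_token: Option<String>,
}

/// Returns a usable access token, refreshing and updating `stored`
/// only when the current access token has expired.
pub fn access_token_for<E>(
    stored: &mut Stored,
    now: u64,
    refresh: impl FnOnce(&str) -> Result<Refreshed, E>,
) -> Result<String, E> {
    if now < stored.expires_at {
        return Ok(stored.access_token.clone());
    }
    let fresh = refresh(&stored.refresh_token)?;
    stored.access_token = fresh.access_token;
    stored.expires_at = now + fresh.expires_in;
    if let Some(rotated) = fresh.rotated_refresh_token {
        // The old refresh token is no longer valid after rotation.
        stored.refresh_token = rotated;
    }
    Ok(stored.access_token.clone())
}
```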
Token exchange
The token exchange shape uses the delegated-exchange feature.
The machinery is much smaller because there is no persistent
storage: the exchange runs per call.
The exchange is an RFC 8693 token exchange. The application presents:
- A subject token: the credential identifying the user. This might be the user's session ID, a JWT session token, or a workload identity token that names the user.
- A subject token type: an identifier for the kind of subject token (urn:ietf:params:oauth:token-type:access_token, urn:ietf:params:oauth:token-type:jwt, or an application-specific string).
- The audience: the downstream service the token will be used against.
- Optional: the scope of the requested token (defaults to "all scopes the user has").
The STS validates the subject token, determines the user's identity, applies whatever policy decisions the deployment has configured (Cedar policies that govern the exchange, the user's allowed downstreams), and returns an access token bound to the audience.
use axess::delegated::{ExchangeRequest, TokenExchanger};
let exchange = TokenExchanger::new(sts_config);
let downstream_token = exchange
.exchange(ExchangeRequest {
subject_token: session_credential,
audience: "https://api.downstream.example",
scopes: vec!["read:data".to_string()],
})
.await?;
let response = http_client
.get("https://api.downstream.example/data")
.header("Authorization", format!("Bearer {}", downstream_token.access_token))
.send()
.await?;
The exchange runs in the request path. The latency cost is one round-trip to the STS plus the actual downstream call. The exchanged token is short-lived (typically minutes), so the application either re-exchanges per call (the simple shape) or caches the exchanged token for the duration of its validity (the optimisation, which is worth the complexity only at high call rates).
Which to use
The decision tree is short.
If the application needs to act on the user's behalf while the user is offline (a background job, a scheduled report, a notification that runs hours after the user has gone home), use stored OBO. Token exchange does not work because the user's session does not exist when the call needs to happen.
If the application calls the downstream only while the user is actively signed in, and the downstream service supports token exchange (Azure AD, Google Cloud, and enterprise SaaS that implements RFC 8693), use token exchange. The credential never hits your database, so the breach impact is smaller.
If the application needs both shapes, both work side by side. The two features compose without conflict; turn on both feature flags.
The most common shape in practice is hybrid: token exchange for the foreground synchronous calls (the user clicks "fetch latest data from Gmail"), stored OBO for the background asynchronous calls (the nightly sync that pulls all new mail since the last run). The two flows handle the two needs.
Audit and consent
Both shapes need an audit trail. The user granted consent at a specific moment; that moment is what defends against later disputes ("the application made calls I did not authorise").
The stored OBO shape emits a DelegatedConsentGranted audit
event at the initial OAuth flow and a DelegatedCredentialUsed
event on each refresh. The first event records what the user
agreed to (which scopes, which downstream); the second records
each use (when, against which downstream, for which operation if
the application surfaces that).
The token exchange shape emits a DelegatedTokenExchanged event
on each exchange. The event records the subject token's source,
the audience, the scopes, and the timestamp.
The audit retention for delegated events is typically longer than for ordinary authentication events because the events defend against future disputes that may surface months or years later. The retention configuration is in Audit pipeline.
Revocation
Both shapes need a revocation path. The user (or an administrator) decides the application should no longer act on their behalf; the next call should fail.
Stored OBO revocation runs through DelegatedCredentialStore::revoke_credential.
The credential is removed from the store (or marked revoked, if
the store retains for audit). Subsequent loads return None;
the application's call path either treats this as "user has not
granted consent" or as "consent was revoked, ask again."
Token exchange revocation runs through the user's session revocation. Logging the user out invalidates the session credential, which means subsequent exchanges fail; in-flight calls that have already exchanged the token continue until the exchanged token expires (typically minutes). The granularity is coarser than stored OBO but the operational simplicity is the trade-off.
Either shape benefits from the downstream's own revocation mechanism. Most OAuth providers support RFC 7009 token revocation; calling it on logout invalidates the access and refresh tokens at the IdP, so even a stolen credential cannot be used. Stored OBO with downstream revocation gives the strongest possible revocation guarantee.
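An RFC 7009 revocation request is a small form post to the provider's revocation endpoint. The sketch below lists its fields; the parameter names come from the RFC, while `TokenTypeHint` and `revocation_form` are hypothetical names, and client authentication and URL encoding are omitted.

```rust
/// The two hint values RFC 7009 defines.
pub enum TokenTypeHint {
    AccessToken,
    RefreshToken,
}

/// Form fields for a revocation request. The hint is optional; it only
/// helps the server find the token faster.
pub fn revocation_form(
    token: &str,
    hint: Option<TokenTypeHint>,
) -> Vec<(&'static str, String)> {
    let mut form = vec![("token", token.to_string())];
    if let Some(hint) = hint {
        let value = match hint {
            TokenTypeHint::AccessToken => "access_token",
            TokenTypeHint::RefreshToken => "refresh_token",
        };
        form.push(("token_type_hint", value.to_string()));
    }
    form
}
```

For stored OBO, revoking the refresh token at the provider and then calling revoke_credential on the store covers both sides.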
Threat model
The threat surface for OBO is unusual. The application acts as the user, which means a compromise of the application is a compromise of the user's downstream account. The defences:
The first is to minimise the scope of the OAuth grant. Request
the narrowest scopes the application needs (channels:read not
channels:*, the specific calendar not "all calendars"). The
attacker who compromises the application can act only within the
granted scopes.
The second is to encrypt the stored credentials at rest. The
EncryptedDelegatedCredentialStore decorator covers this. An
attacker who breaches the database without the encryption key
cannot use the stored credentials.
The third is to monitor the audit events. A spike in
DelegatedCredentialUsed events for a user, especially for
operations the user does not typically perform, is a strong
signal of compromise. The SIEM rules in Audit events name the
patterns.
The fourth is to time-bound consent. Some downstreams support explicit consent expiry; for those that do not, the application can require the user to re-consent on a schedule (every ninety days, every year). The friction is real; the defence against long-lived stale grants is also real.
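An application-side re-consent schedule comes down to one comparison. A minimal sketch, assuming the grant time is recorded in Unix seconds (for example from the DelegatedConsentGranted event); `needs_reconsent` is a hypothetical helper, not part of the library.

```rust
const SECONDS_PER_DAY: u64 = 86_400;

/// True once the consent is at least `max_age_days` old.
pub fn needs_reconsent(granted_at: u64, now: u64, max_age_days: u64) -> bool {
    now.saturating_sub(granted_at) >= max_age_days * SECONDS_PER_DAY
}
```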
What this enables
OBO is what lets axess fit into the kind of application that does more than authenticate users for itself: a unified inbox that pulls from Gmail and Outlook, a CI pipeline that posts to Slack on the user's behalf, a calendar integration that books meetings. The mechanism is opt-in (the feature flag), the two shapes cover the architectural choices, and the encryption-at-rest plus the audit trail let the deployment defend its decisions.
Further reading
Refresh tokens and session continuity covers the refresh-token
family-detection mechanism that also applies to stored OBO
credentials. OAuth 2.0 and OIDC covers the OAuth flow that
grants the initial consent. Workload identity overview covers
the subject-token side of token exchange when the subject is a
workload rather than a human. Audit pipeline covers the
retention configuration for Delegated* audit events.
Local IdP
axess::local_idp is an in-process workload-identity issuer. It mints
JWTs against a signing key it holds locally, exposes the matching
JWKS, and serves the RFC 8414 discovery document. The crate exposes
this surface in two layers, both built on the same primitives:
- Production: LocalIdp. The adopter wires a LocalIdpKeyStore implementation (file system, Vault, KMS, ...) and the LocalIdp reads the current + historical keys, mints, and rotates atomically on operator request.
- Testing: LocalIdpFixture. An in-process value that mints JWTs with a generated keypair and exposes a JwkSet handle that a JwtVerifier can read. No HTTP endpoints, no key store; just mint() + jwks_handle().
Both layers share MintClaims, LocalIdpSigningKey, and the
issuance pipeline that lives in axess::local_idp::primitives.
A token minted by either layer verifies against the same JWKS shape,
which is the property that lets adopters run the same downstream
verifier in tests and in production.
What both layers do NOT do
Neither layer is a full OAuth 2.0 Authorization Server. There is:
- no authorization-code flow, no PKCE handshake;
- no end-session endpoint;
- no refresh-token rotation;
- no consent UX;
- no user store.
Use a real Authorization Server (Keycloak, Ory Hydra, Okta, Auth0,
Azure AD, etc.) when you need any of those. LocalIdp exists for
direct workload-identity issuance: a process mints short-lived
JWTs for service-to-service flows it controls.
The feature flag is local-idp (off by default), enabled with
features = ["local-idp"] on the axess facade. It pulls in
oauth, oidc, and jwt as transitive features.
Production: LocalIdp
When to use
- A service needs to mint workload-identity JWTs for its own internal flows (e.g. signing tokens that downstream services will verify via the published JWKS).
- A development or staging deployment needs a self-contained IdP without standing up Keycloak. The same code path runs in production; only the [LocalIdpKeyStore] backend changes.
- An air-gapped or single-tenant deployment wants on-host token issuance with no external dependency.
When not to use
If you need a user-facing IdP with login UI, OIDC authorization code
flow, refresh tokens, or federation, reach for Keycloak / Ory Hydra
/ similar. LocalIdp deliberately stops at issuance.
The LocalIdpKeyStore trait
Adopters implement persistence against their own key material:
pub trait LocalIdpKeyStore: Send + Sync + 'static {
    type Error: std::error::Error + Send + Sync + 'static;
    async fn load_all(&self) -> Result<LoadedKeys, Self::Error>;
    async fn rotate(&self, new_current: LocalIdpSigningKey)
        -> Result<(), Self::Error>;
}
pub struct LoadedKeys {
    pub current: LocalIdpSigningKey,
    pub historical: Vec<LocalIdpSigningKey>,
}
load_all returns the current and historical keys from a single
consistent read. The JWKS published at /jwks.json includes all of
them, so tokens already in flight under a rotated-out historical key
continue to verify until the operator removes that key from the
store.
rotate persists a new current key, demoting the previous current
to historical, atomically. Adopters typically expose this through
their own admin endpoint or out-of-band tooling.
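The two contract points above (one consistent read, and a rotation that demotes the previous current key in the same step) can be sketched without any key material. This is a synchronous, in-memory stand-in rather than the crate's trait: SketchKey, SketchLoadedKeys, SketchKeyStore, and published_kids are hypothetical names, and a key is reduced to its kid.

```rust
use std::sync::Mutex;

/// Stand-in for a signing key; only the key id matters in this sketch.
#[derive(Clone, Debug, PartialEq)]
pub struct SketchKey {
    pub kid: String,
}

/// Mirrors the current + historical split described in the text.
#[derive(Clone, Debug)]
pub struct SketchLoadedKeys {
    pub current: SketchKey,
    pub historical: Vec<SketchKey>,
}

/// Hypothetical in-memory store: a single lock guards both fields.
pub struct SketchKeyStore {
    inner: Mutex<SketchLoadedKeys>,
}

impl SketchKeyStore {
    pub fn with_current(current: SketchKey) -> Self {
        Self {
            inner: Mutex::new(SketchLoadedKeys { current, historical: Vec::new() }),
        }
    }

    /// One consistent read of every key the JWKS should publish.
    pub fn load_all(&self) -> SketchLoadedKeys {
        self.inner.lock().unwrap().clone()
    }

    /// Installs the new key and demotes the previous one inside the same
    /// critical section, so no reader sees a half-rotated state.
    pub fn rotate(&self, new_current: SketchKey) {
        let mut keys = self.inner.lock().unwrap();
        let previous = std::mem::replace(&mut keys.current, new_current);
        keys.historical.push(previous);
    }
}

/// Key ids a JWKS built from the store would advertise, current first.
pub fn published_kids(store: &SketchKeyStore) -> Vec<String> {
    let keys = store.load_all();
    std::iter::once(keys.current)
        .chain(keys.historical)
        .map(|k| k.kid)
        .collect()
}
```

A file-backed or KMS-backed store has to provide the same property by other means (for example a pointer file swapped with an atomic rename), since a process-local lock does not cover other processes.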
MemoryLocalIdpKeyStore for prototyping
A MemoryLocalIdpKeyStore ships with the crate for dev and test
deployments where keys can live in process memory:
use axess::local_idp::{LocalIdp, LocalIdpSigningKey, MemoryLocalIdpKeyStore};
let key = LocalIdpSigningKey::generate_es256().with_key_id("v1");
let store = MemoryLocalIdpKeyStore::with_current(key);
let idp = LocalIdp::from_key_store("https://idp.example.com", store)
    .await
    .expect("load keys");
Memory storage is not for production: a restart loses the keys, and
the fresh JWKS generated afterwards cannot verify tokens already in
flight. The examples/local_idp/ directory implements a file-backed
[LocalIdpKeyStore] with atomic rotation that production
implementations should follow as a pattern; the same shape adapts to
Vault, AWS KMS, GCP KMS, or any other key management backend.
Minting
use axess::local_idp::MintClaims;
use chrono::{Duration, Utc};
let token = idp
    .mint(
        &MintClaims::new("worker-1", Utc::now() + Duration::minutes(5))
            .with_audience("https://api.example.com")
            .with_issued_at(Utc::now()),
    )
    .await?;
[MintClaims] is a builder: new(subject, exp) is the minimum;
with_audience, with_audiences (multi-aud), with_issued_at,
with_not_before, with_jwt_id, and with_custom_claim cover the
standard JWT fields. mint_with_header accepts a caller-supplied
jsonwebtoken::Header for cases that need custom header fields
(typ, cty, etc.).
The clock is injectable via .with_clock(...). Production wires
SystemClock; DST tests wire MockClock for reproducible
issuance.
Rotation
let new_key = LocalIdpSigningKey::generate_es256().with_key_id("v2");
idp.rotate_signing_key(new_key).await?;
The call atomically:
- Persists the new current via [LocalIdpKeyStore::rotate].
- Demotes the previous current to historical.
- Rebuilds the JWKS snapshot so subsequent /jwks.json reads include both keys.
In-flight verifications using the old kid continue to succeed
because the historical entry stays in the published JWKS.
Discovery + JWKS endpoints
LocalIdp::router() returns a ready-to-mount Axum router that
serves the two standard endpoints:
let app = axum::Router::new()
    // merge mounts the IdP routes at the root; axum 0.8 no longer accepts nest("/", ...).
    .merge(idp.router())
    .route("/issue", axum::routing::post(issue));
Routes:
- GET /.well-known/openid-configuration: RFC 8414 metadata.
- GET /jwks.json: current + historical public JWKs.
with_base_url(...) overrides the URL the discovery document
advertises for jwks_uri when the IdP sits behind a reverse proxy.
with_metadata_field(name, value) appends adopter-extension fields
to the discovery document (scopes_supported, claims_supported,
FAPI fields, etc.).
For full control, the lower-level handlers in
axess::local_idp::discovery
expose openid_configuration and jwks as standalone axum
handlers.
Production-pattern example
The examples/local_idp/ crate is the reference implementation:
- File-backed LocalIdpKeyStore (FileLocalIdpKeyStore) with the directory layout pattern historical/{kid}.pem + atomic current.kid pointer file.
- POST /admin/rotate operator endpoint.
- POST /issue mint endpoint.
- A curl walkthrough of the full discover-mint-rotate cycle.
Testing: LocalIdpFixture
When to use
Integration tests that exercise:
- The inbound JWT-SVID resolver (axess::authn::jwt::svid::JwtSvidResolver).
- The OAuth Resource Server resolver path.
- Any of the cloud STS adapters.
- The JwtVerifier shape generally.
The fixture mints tokens that verify against its own JWKS, so a
test can produce a token with mint() and pass it to the resolver
under test without involving an external IdP.
What it is NOT
The fixture is not an HTTP service. It is a value with mint(),
jwks_handle(), and a handful of accessors. Tests use it by:
- Constructing the fixture.
- Calling idp.mint(&MintClaims::...) to obtain a JWT.
- Wiring a JwtVerifier to idp.jwks_handle() so verification reads the same JWKS the fixture signed against.
There is no authorize endpoint, no token endpoint, no Tower service wrapping; the fixture just produces signed tokens and exposes the verification key set.
The feature flag is testing plus local-idp. The fixture lives
under axess::testing::local_idp::LocalIdpFixture.
Construction
use axess::testing::local_idp::LocalIdpFixture;
let idp = LocalIdpFixture::new("https://test.idp.local");
new(issuer) generates a fresh RSA-2048 keypair per call. Other
constructors:
- LocalIdpFixture::with_algorithm(issuer, Algorithm::ES256): generate with a specific signing algorithm. Supported: RS256, RS384, RS512, ES256.
- LocalIdpFixture::with_signing_key(issuer, key): explicit key (use when the test needs a stable signature across runs).
Builder methods (chained on the constructed fixture):
- .with_historical_signing_key(key): add a key to the JWKS without rotating to it. Drives JWKS-cache-refresh tests.
- .with_extra_public_jwk(jwk): add an externally-supplied public JWK to the published set.
- .rotate_signing_key(new_key): swap the signing key; the old key moves to historical and remains in the JWKS.
- .with_max_ttl(duration): cap minted token lifetime. Over-cap mints panic (test-time misuse).
- .with_issuance_listener(arc): install an [IssuanceListener] for assertion-side recording.
- .with_key_id(kid): override the auto-generated kid.
Minting
use axess::testing::local_idp::{LocalIdpFixture, MintClaims};
use chrono::{Duration, Utc};
let idp = LocalIdpFixture::new("https://test.idp.local");
// Standard JWT.
let token = idp.mint(
    &MintClaims::new("alice", Utc::now() + Duration::hours(1))
        .with_audience("https://api.example.com"),
);
// SPIFFE JWT-SVID shape (subject = SPIFFE ID, audience required).
let svid = idp.mint_jwt_svid(
    "test.gnomes",        // trust domain
    "worker",             // workload path
    "acme",               // namespace (optional positional)
    "sts.amazonaws.com",  // audience
    Duration::minutes(5),
);
mint_with_header accepts a caller-supplied header for cases that
need custom fields.
Sharing the JWKS with JwtVerifier
use axess::authn::jwt::verifier::JwtVerifier;
let verifier = JwtVerifier::new(idp.jwks_handle())
    .with_algorithms(idp.verifier_algorithms());
let claims = verifier
    .verify::<MyClaims>(&token, "https://api.example.com")
    .await?;
jwks_handle() returns an Arc<RwLock<JwkSet>> that the verifier
borrows. Calls to rotate_signing_key on the fixture update the
shared JWKS in place, so the verifier sees the rotation without any
explicit refresh.
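Why the verifier needs no refresh follows from both sides holding the same allocation. A minimal sketch, assuming the key set is reduced to a list of kids; SketchIssuer, SketchVerifier, and SketchJwkSet are hypothetical stand-ins, not the fixture's or the verifier's real types:

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for a JWK set: just the key ids it contains.
pub type SketchJwkSet = Vec<String>;

pub struct SketchIssuer {
    jwks: Arc<RwLock<SketchJwkSet>>,
}

impl SketchIssuer {
    pub fn new(kid: &str) -> Self {
        Self { jwks: Arc::new(RwLock::new(vec![kid.to_string()])) }
    }

    /// Hands out another reference to the same set, not a copy of it.
    pub fn jwks_handle(&self) -> Arc<RwLock<SketchJwkSet>> {
        Arc::clone(&self.jwks)
    }

    /// Rotation mutates the shared set in place; the old kid stays published.
    pub fn rotate(&self, new_kid: &str) {
        self.jwks.write().unwrap().push(new_kid.to_string());
    }
}

pub struct SketchVerifier {
    jwks: Arc<RwLock<SketchJwkSet>>,
}

impl SketchVerifier {
    pub fn new(jwks: Arc<RwLock<SketchJwkSet>>) -> Self {
        Self { jwks }
    }

    /// A token can only verify if its kid is in the shared set at read time.
    pub fn knows_kid(&self, kid: &str) -> bool {
        self.jwks.read().unwrap().iter().any(|k| k == kid)
    }
}
```

Had jwks_handle returned a clone of the set instead of a clone of the Arc, the verifier would keep seeing the pre-rotation keys.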
Feeding a cloud STS adapter
The fixture's mint_jwt_svid produces SPIFFE-shaped tokens suitable
for cloud STS exchange tests:
use axess::workload::outbound::cloud_sts::aws::AwsStsClient;
let idp = LocalIdpFixture::new("https://oidc.test.local");
let token = idp.mint_jwt_svid(
    "test.gnomes", "worker", "acme",
    "sts.amazonaws.com",
    Duration::minutes(5),
);
// Hand the token to a mocked AWS STS endpoint to exercise the
// AssumeRoleWithWebIdentity flow without hitting real AWS.
Why both shapes coexist
Production LocalIdp and the test LocalIdpFixture share the same
primitives module (axess::local_idp::primitives). The primitives
define LocalIdpSigningKey, MintClaims, IssuanceEvent,
IssuanceListener, and the internal JWT-encode pipeline. Both
layers route their mint() calls through these primitives.
The consequence: a token minted by the fixture in a test verifies
identically against a JwtVerifier configured with production
LocalIdp's published JWKS, given the same signing key. Tests
that pin a specific JWT signature exercise the same code paths that
sign in production.
The split exists for what each layer adds on top:
- Production carries the [LocalIdpKeyStore] abstraction so keys survive process restarts and can rotate without code changes.
- Testing carries the in-memory key generation, the MockIssuanceListener, and ergonomic builders that match what test code typically wants to assert.
Neither subsumes the other: the production class is not the right fit for a unit test (no key store means no mint), and the fixture is not the right fit for production (in-memory keys are lost on restart). The shared primitives are what let both shapes claim "this is the same JWT issuer" without code duplication.
Audit events
Every authentication and authorisation decision axess makes
produces an audit event. The events are typed
(AuthEvent, with one variant per kind of decision), they
carry every field a compliance review needs, and they are dispatched
asynchronously so the authentication hot path does not block on
the audit sinks. This chapter covers the event catalogue, the
fields each variant carries, the SOC alert thresholds that map
events to operational signals, and the SIEM query patterns the
events are designed to feed.
The chapter pairs with Audit pipeline, which covers how the events get from the application into the regulatory store and the analytics path.
What the events are for
Authentication is a security-sensitive operation, and security-sensitive operations need a defensible audit trail. Three audiences read the trail.
The first is the compliance auditor. A regulator (or an external auditor verifying compliance with a regulator's requirements) needs to verify that the application enforced the controls the regulation requires: that MFA was demanded where MFA was required, that lockouts fired when configured, that no cross-tenant access happened. The audit trail is what answers these questions.
The second is the incident responder. When something goes wrong (a user reports unauthorised access, a SIEM rule fires on an anomalous pattern, a breach is suspected), the responder needs to reconstruct what happened: which sessions were active, what authentications succeeded, what authorisations were granted. The audit trail is what supports the reconstruction.
The third is the operational dashboard. The application's running state is visible through the audit trail: how many logins succeed per hour, what fraction trigger lockouts, which tenants are active. The trail feeds the SIEM rules and the operational metrics.
The three audiences want different things from the same data, which is what drives the dual-stream design: a regulatory stream optimised for completeness and immutability, an analytics stream optimised for query latency and aggregation. Audit pipeline covers the streams; this chapter covers the events themselves.
The event catalogue
AuthEvent is the enum. The variants are stable across axess
versions; new variants can be added (variants are appended, not
renumbered), but existing variants do not change shape.
The catalogue is grouped by the operation that produces each event. The grouping below is for readability; the wire format is flat.
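Because variants are only ever appended, consumer code stays forward-compatible when its match carries a wildcard arm. A sketch under assumptions: SketchAuthEvent is a three-variant illustration, its field types are guesses, and the use of #[non_exhaustive] is an assumption about how such an enum could be declared rather than a statement about the real AuthEvent.

```rust
/// Illustrative subset only; the real enum and its field types may differ.
#[non_exhaustive]
#[derive(Debug, Clone)]
pub enum SketchAuthEvent {
    FactorFailed {
        user_id: String,
        tenant_id: String,
        factor_kind: String,
        attempt_index: u32,
        failure_reason: String,
    },
    LoginCompleted {
        user_id: String,
        tenant_id: String,
        session_id: String,
    },
    LockoutTriggered {
        scope_value: String,
    },
}

/// Maps an event to a coarse triage label (the labels are hypothetical).
pub fn triage_label(event: &SketchAuthEvent) -> &'static str {
    match event {
        SketchAuthEvent::LockoutTriggered { .. } => "review",
        SketchAuthEvent::FactorFailed { attempt_index, .. } if *attempt_index >= 3 => "warn",
        // Wildcard arm: variants appended in a later version land here
        // instead of breaking the build or being silently mishandled.
        _ => "info",
    }
}
```

The same idea applies to a consumer parsing the flat wire format: unknown event types should be tolerated and routed to a default path rather than rejected.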
Authentication lifecycle
LoginStarted records the beginning of a login attempt. Fields:
user_id, tenant_id, method_name, client_ip, user_agent,
device_id (optional), timestamp.
FactorVerified records each successful factor verification.
Fields: user_id, tenant_id, factor_kind, attempt_index,
timestamp.
FactorFailed records each failed factor verification. Fields:
user_id, tenant_id, factor_kind, attempt_index,
failure_reason (string), timestamp.
LoginCompleted records a successful end-to-end login. Fields:
user_id, tenant_id, method_name, factors_completed,
session_id, device_id (optional), timestamp.
LoginFailed records a failed end-to-end login (any factor
failed beyond retry, or a configuration error). Fields:
user_id (optional, may be unknown), tenant_id,
method_name (optional), failure_reason, timestamp.
Logout records a session ending. Fields: user_id,
tenant_id, session_id, reason (user-initiated,
admin-revoked, session-expired, cookie-fingerprint-mismatch),
timestamp.
Lockout
LockoutTriggered records a per-user, per-tenant, or per-IP
lockout firing. Fields: scope (User, Tenant, IP), scope_value,
until (optional unlock time), triggered_by_event (the
preceding FactorFailed id), timestamp.
LockoutCleared records the lockout ending (either timing out or
being administratively cleared). Fields: scope, scope_value,
cleared_by (Timeout, Admin), timestamp.
Device identity
The six device events fire on the device-identity lifecycle covered in Device identity.
DeviceFirstSeen records a previously-unknown device fingerprint
appearing. Fields: device_id (newly minted), user_id,
tenant_id, fingerprint_features (the redacted feature set),
timestamp.
DeviceTrustGranted records a device transitioning to Trusted.
Fields: device_id, user_id, tenant_id, granted_by (User,
Admin), timestamp.
DeviceRevoked records a device transitioning to Revoked.
Fields: device_id, user_id, tenant_id, revoked_by (User,
Admin, System), reason, timestamp.
DevicePurged records a device record being deleted (retention
sweep). Fields: device_id, user_id, tenant_id,
purged_at_age_days, timestamp.
DeviceBindingAdded records a new refresh-token binding to the
device. Fields: device_id, user_id, tenant_id,
refresh_token_id, timestamp.
DeviceFingerprintMismatch records a session presenting a
fingerprint that does not match its bound device. Fields:
device_id, user_id, tenant_id, session_id, policy_action
(Warn, Reauth, Revoke), timestamp.
Authorisation
AuthzAllow records a successful Cedar evaluation. Fields:
principal_uid, tenant_id, action, resource_uid,
matched_policy_ids, timestamp.
AuthzDeny records a denied evaluation. Fields: principal_uid,
tenant_id, action, resource_uid, matched_policy_ids (the
policies that produced the deny, if any), denial_reason,
timestamp.
AuthzEntityNotFound records an evaluation failure where the
entity provider could not produce an entity the policies needed.
Fields: principal_uid, missing_entity_uid, policy_id_referencing,
timestamp.
Workload identity
The workload events fire on the workload-identity resolvers' decisions.
WorkloadAuthenticated records a successful workload
authentication. Fields: workload_id, trust_domain, issuer,
tenant_id, client_ip, timestamp.
WorkloadRejected records a failed workload authentication.
Fields: attempted_workload_id (optional), attempted_trust_domain,
failure_reason, client_ip, timestamp.
Tenant lifecycle
TenantCreated, TenantSuspended, TenantUnsuspended, and
TenantDeleted record the tenant lifecycle from Multi-tenancy.
Fields are uniform: tenant_id, operator_principal, reason,
timestamp.
Delegated access
DelegatedConsentGranted records a user granting an application
OBO access (the moment of OAuth consent). Fields: user_id,
tenant_id, downstream, scopes, expires_at (optional),
timestamp.
DelegatedCredentialUsed records a refresh of a stored OBO
credential. Fields: user_id, tenant_id, downstream,
timestamp.
DelegatedConsentRevoked records a user (or admin) revoking
consent. Fields: user_id, tenant_id, downstream,
revoked_by, timestamp.
DelegatedTokenExchanged records an RFC 8693 token exchange.
Fields: subject_principal_uid, tenant_id, audience,
scopes, timestamp.
Administrative
UserSuspended, UserUnsuspended, UserDeleted, PasswordReset
(by admin), FactorReset (by admin), SessionInvalidated (by
admin) cover the administrative operations. Fields name the
target user, the operator principal, the reason, and the
timestamp.
Emit cadence and fields
The events fire synchronously from the operation that produces them (the factor verification, the policy evaluation, the administrative action). The synchronous emit guarantees that every operation that reaches a decision, success or failure, has an event, and that an event without a corresponding operation is impossible.
The events are then handed to the audit pipeline (a sink trait the application configures), which dispatches them asynchronously. The pipeline's failure modes do not block the operation that produced the event; a sink that is slow or unavailable does not slow the application's hot path. The trade-off is that an event the sink fails to receive is lost, which is why the pipeline has a durable buffer in front of network-bound sinks. Audit pipeline covers the pipeline's reliability story.
Every event carries: a stable wire shape (so external systems can
parse against a single schema), a per-event id (for deduplication
and cross-referencing), a tenant id (so multi-tenant deployments
can route events to per-tenant sinks), and a timestamp (in UTC,
generated through the injected Clock trait for DST).
The fields specific to each variant are listed in the catalogue
above. The complete schema lives in the
axess-events crate documentation.
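The common fields are easiest to see as an envelope stamped through an injected clock. A sketch with hypothetical names throughout (SketchClock, SketchSystemClock, SketchMockClock, SketchStampedEvent, stamp); the crate's own Clock trait is not reproduced here and may look different:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical clock abstraction returning UTC milliseconds since the epoch.
pub trait SketchClock {
    fn now_unix_millis(&self) -> u64;
}

/// Production-style implementation: delegates to the system clock.
pub struct SketchSystemClock;

impl SketchClock for SketchSystemClock {
    fn now_unix_millis(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis() as u64)
            .unwrap_or(0)
    }
}

/// Test implementation: time moves only when the test advances it.
pub struct SketchMockClock {
    now: AtomicU64,
}

impl SketchMockClock {
    pub fn starting_at(unix_millis: u64) -> Self {
        Self { now: AtomicU64::new(unix_millis) }
    }

    pub fn advance(&self, millis: u64) {
        self.now.fetch_add(millis, Ordering::SeqCst);
    }
}

impl SketchClock for SketchMockClock {
    fn now_unix_millis(&self) -> u64 {
        self.now.load(Ordering::SeqCst)
    }
}

/// The fields every event carries, per the paragraph above.
#[derive(Debug, PartialEq)]
pub struct SketchStampedEvent {
    pub event_id: u64,
    pub tenant_id: String,
    pub timestamp_ms: u64,
    pub event_type: &'static str,
}

/// Builds the common fields; the timestamp never bypasses the injected clock.
pub fn stamp(
    clock: &dyn SketchClock,
    event_id: u64,
    tenant_id: &str,
    event_type: &'static str,
) -> SketchStampedEvent {
    SketchStampedEvent {
        event_id,
        tenant_id: tenant_id.to_string(),
        timestamp_ms: clock.now_unix_millis(),
        event_type,
    }
}
```

With the mock clock, a test can assert on exact timestamps and on the ordering of events, which is what makes audit output reproducible under deterministic simulation testing.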
SOC alert thresholds
The events are designed to feed SOC (Security Operations Center) alerting. The thresholds below are starting points; tune to the specific deployment.
FactorFailed events from a single source IP at a rate above
one per second indicate brute-forcing. The per-IP lockout (covered
in Multi-tenancy §"Three-lever lockout") catches the worst
cases at the application layer; the SIEM alert covers the rate
even when individual events fall below the lockout threshold.
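The rate rule can be prototyped as a sliding window before it is written as a SIEM alert. A sketch, assuming each failure arrives with a source IP and a millisecond timestamp; FailureRateDetector and its parameters are hypothetical and not part of axess:

```rust
use std::collections::{HashMap, VecDeque};

/// Flags a source IP once it exceeds `max_in_window` failures inside `window_ms`.
pub struct FailureRateDetector {
    window_ms: u64,
    max_in_window: usize,
    recent: HashMap<String, VecDeque<u64>>,
}

impl FailureRateDetector {
    pub fn new(window_ms: u64, max_in_window: usize) -> Self {
        Self { window_ms, max_in_window, recent: HashMap::new() }
    }

    /// Records one failure and returns true when the IP is over the threshold.
    /// Timestamps are expected in non-decreasing order per IP.
    pub fn record(&mut self, client_ip: &str, timestamp_ms: u64) -> bool {
        let times = self.recent.entry(client_ip.to_string()).or_default();
        // Drop failures that have fallen out of the window.
        while let Some(&oldest) = times.front() {
            if timestamp_ms.saturating_sub(oldest) >= self.window_ms {
                times.pop_front();
            } else {
                break;
            }
        }
        times.push_back(timestamp_ms);
        times.len() > self.max_in_window
    }
}
```

A window of 10 seconds with a limit of 10 approximates the "one per second" starting point while tolerating short bursts; both numbers are tuning knobs.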
LockoutTriggered events at any rate are worth reviewing. A
legitimate user occasionally mistypes their password and triggers
a lockout; the rate should be a handful per day across a
deployment. A spike indicates either a credential-stuffing attack
or a configuration problem.
DeviceFingerprintMismatch events fire on every mismatch beyond
the configured tolerance. Above a few per hour, either the
tolerance is too tight (a legitimate-traffic issue, calibrate the
tolerance) or there is a real attack (a stolen cookie being
replayed). Investigate.
AuthzDeny events fire on every policy deny. The rate should
correlate with legitimate user error (trying to access something
they cannot); a spike correlates with either a policy misconfiguration
(the rule is denying things it should permit) or with an attempt
to find privilege-escalation holes.
WorkloadRejected events fire on every failed workload
authentication. A clean deployment should see almost none of
these; even a small rate indicates either a configuration problem
or an attempted attack against the workload-identity surface.
DelegatedConsentGranted events fire on every user consent
moment. The rate is informational; the events are most useful at
the per-user level for "show me everything this user has granted
the application permission to do."
SIEM query patterns
The events are designed for cheap aggregation in a SIEM. A sketch of useful queries:
-- Brute-force detection: top failing source IPs per minute.
SELECT
    client_ip,
    DATE_TRUNC('minute', timestamp) AS minute,
    COUNT(*) AS failures
FROM auth_events
WHERE event_type = 'FactorFailed'
  AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY 1, 2
ORDER BY failures DESC
LIMIT 20;
-- Lockout activity: users locked out in the last day.
SELECT
    scope_value AS user_id,
    COUNT(*) AS lockouts,
    MAX(timestamp) AS last_lockout
FROM auth_events
WHERE event_type = 'LockoutTriggered'
  AND scope = 'User'
  AND timestamp > NOW() - INTERVAL '1 day'
GROUP BY scope_value
ORDER BY lockouts DESC;
-- Step-up gaps: users hitting AuthzDeny on actions requiring stronger factors.
SELECT
    principal_uid,
    action,
    COUNT(*) AS denials
FROM auth_events
WHERE event_type = 'AuthzDeny'
  AND denial_reason LIKE '%factors_completed%'
  AND timestamp > NOW() - INTERVAL '1 day'
GROUP BY 1, 2
ORDER BY denials DESC;
-- Suspicious devices: fingerprint mismatches followed by trust grant.
-- The second alias is trust_grant because GRANT is a reserved word in most SQL dialects.
SELECT
    mismatch.user_id,
    mismatch.device_id,
    mismatch.timestamp AS mismatch_at,
    trust_grant.timestamp AS granted_at
FROM auth_events mismatch
JOIN auth_events trust_grant
  ON mismatch.device_id = trust_grant.device_id
WHERE mismatch.event_type = 'DeviceFingerprintMismatch'
  AND trust_grant.event_type = 'DeviceTrustGranted'
  AND trust_grant.timestamp > mismatch.timestamp
  AND trust_grant.timestamp < mismatch.timestamp + INTERVAL '1 hour';
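The queries are written in PostgreSQL-flavoured SQL and run with little change on SQL-native stores such as ClickHouse. SIEMs with their own query languages (Splunk SPL, the Sumo Logic query language) need a translation. Adapt to the deployment's chosen tool.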
Extending the catalogue
Applications occasionally need to record events that axess does not emit: a domain-specific operation that should appear in the same trail (a fund transfer, a configuration change, a sensitive read). The pattern is to extend the audit pipeline with a custom event type that rides the same machinery.
The mechanism is the AuthEventEnvelope wrapper: any
Serialize + Deserialize type can be carried alongside the
built-in AuthEvent variants as long as it implements
AuditPayload. The application's domain events go into the
envelope; the audit pipeline routes them to the same sinks as
the axess events.
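The shape of the idea, reduced to its routing behaviour, looks roughly like the following. Everything here is hypothetical: SketchAuditPayload stands in for AuditPayload (whose real methods and serde bounds are not shown in this chapter), and SketchEnvelope and SketchSink are illustrations, not the crate's types.

```rust
/// Hypothetical stand-in for the payload trait named in the text.
pub trait SketchAuditPayload {
    fn event_type(&self) -> &'static str;
    fn tenant_id(&self) -> &str;
}

/// A built-in event, reduced to two fields for the sketch.
pub struct SketchLoginCompleted {
    pub tenant_id: String,
    pub user_id: String,
}

impl SketchAuditPayload for SketchLoginCompleted {
    fn event_type(&self) -> &'static str {
        "LoginCompleted"
    }
    fn tenant_id(&self) -> &str {
        &self.tenant_id
    }
}

/// An application-defined domain event riding the same machinery.
pub struct FundTransfer {
    pub tenant_id: String,
    pub amount_minor_units: u64,
}

impl SketchAuditPayload for FundTransfer {
    fn event_type(&self) -> &'static str {
        "FundTransfer"
    }
    fn tenant_id(&self) -> &str {
        &self.tenant_id
    }
}

/// Envelope: common metadata plus whichever payload the producer supplied.
pub struct SketchEnvelope {
    pub event_id: u64,
    pub payload: Box<dyn SketchAuditPayload>,
}

/// Toy sink that records one routing line per envelope.
#[derive(Default)]
pub struct SketchSink {
    pub lines: Vec<String>,
}

impl SketchSink {
    /// Built-in and custom payloads take the same path through the sink.
    pub fn dispatch(&mut self, envelope: &SketchEnvelope) {
        self.lines.push(format!(
            "{}/{}#{}",
            envelope.payload.tenant_id(),
            envelope.payload.event_type(),
            envelope.event_id
        ));
    }
}
```

The sink never branches on whether a payload is built-in or custom, which is the property that lets domain events share the trail.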
The trade-off is schema. A custom event type that the SIEM does not know about does not feed the dashboards or alerts. The typical pattern is to coordinate the schema with the SIEM team before adding new event types, so the dashboards and alerts can update in parallel.
What this enables
The audit catalogue is what makes axess defensible against regulatory and incident-response scrutiny. The events are typed, complete, asynchronous, and built for the SIEM tooling that production deployments already run. The catalogue itself is the reference; the Audit pipeline chapter covers how the events flow from the application to the storage layer.
Further reading
Audit pipeline covers the dual-stream architecture (regulatory
plus analytics), the hot/cold retention tiering, and the
reliability story for the asynchronous dispatch. Multi-tenancy
covers the tenant-scoped routing of events. Cedar policy
fundamentals covers the policy evaluator that produces the
Authz* events. Security posture covers the GDPR and PCI-DSS
posture for audit-event PII.
Audit pipeline
The audit pipeline is what moves events from the authentication hot path to the storage layers that compliance, incident response, and operations consume. The pipeline has two streams (regulatory and analytics), three retention tiers (hot, archived, deleted), and a small number of trait surfaces that adopters implement against their own storage. This chapter covers the architecture, the configuration, and the operational patterns that make the pipeline trustworthy under load.
The chapter pairs with Audit events, which catalogues what flows through the pipeline; this chapter covers how the flow itself works.
The dual stream
Two audiences read the audit trail. Compliance auditors want completeness, immutability, and unambiguous provenance; they will accept slow queries and rigid schemas in exchange. SOC and operations teams want low query latency, flexible aggregation, and enrichment with operational context (geo lookups, ASN data, parsed user-agent strings); they accept some loss of fidelity and some divergence from the wire format in exchange.
The two requirements conflict. A single store optimised for one audience disserves the other. The pipeline's answer is to fan out: the same event flows into two streams, each shaped for its audience.
The regulatory stream uses AuthEvent directly. The shape is
exactly what the catalogue in Audit events describes: stable
fields, no enrichment, byte-for-byte uniform across deployments.
The stream feeds the regulatory store, which is typically a
database or a log archive with strong durability and immutability
guarantees.
The analytics stream uses RichAuthnEvent, a denormalised wrapper
that adds optional enrichment fields (device trust level, geo
lookup, parsed user-agent, ASN, configurable tags). The fields
are populated by an EventEnrichment closure the application
provides; the closure runs once per event, populates whatever
data the deployment wants, and returns the enriched event. The
stream feeds the analytics store, which is typically a columnar
database (ClickHouse, DuckDB) or a streaming platform (Apache
Iggy with rkyv).
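An enrichment step of this kind can be sketched as a closure from a bare event to an enriched one. The names below (SketchRichEvent, SketchEnrichment, to_analytics, example_enrichment) and the hard-coded lookup are assumptions for illustration; the real RichAuthnEvent fields and the EventEnrichment signature may differ.

```rust
/// Hypothetical, cut-down analytics event with optional enrichment fields.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SketchRichEvent {
    pub event_type: String,
    pub client_ip: Option<String>,
    pub geo_country: Option<String>,
    pub tags: Vec<String>,
}

/// The enrichment hook: runs once per event and returns the enriched event.
pub type SketchEnrichment = Box<dyn Fn(SketchRichEvent) -> SketchRichEvent + Send + Sync>;

/// Builds the analytics-stream event, applying enrichment when configured.
pub fn to_analytics(
    event_type: &str,
    client_ip: Option<&str>,
    enrich: Option<&SketchEnrichment>,
) -> SketchRichEvent {
    let bare = SketchRichEvent {
        event_type: event_type.to_string(),
        client_ip: client_ip.map(str::to_string),
        ..Default::default()
    };
    match enrich {
        Some(f) => f(bare),
        None => bare,
    }
}

/// Example closure; a real deployment would consult a GeoIP database here.
pub fn example_enrichment() -> SketchEnrichment {
    Box::new(|mut event: SketchRichEvent| {
        if event.client_ip.as_deref() == Some("203.0.113.7") {
            event.geo_country = Some("ZZ".to_string()); // placeholder country code
        }
        event.tags.push("env:staging".to_string());
        event
    })
}
```

Keeping enrichment out of the regulatory stream is what lets that stream stay byte-for-byte uniform while the analytics side evolves freely.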
             AuthEvent (regulatory wire)
                          │
                          ▼
                  ┌──────────────┐
                  │  AuditPipe   │
                  └───────┬──────┘
                          │ fan-out
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
IdentityAuthnLog  AuthnAnalyticsSink  AuditArchiver
(lockout depends) (enriched stream)    (cold tier)
        │                 │                 │
        ▼                 ▼                 ▼
  primary store    analytics store    archive store
The fan-out runs once per event. The performance cost is small because each sink is fire-and-forget; a slow sink does not slow the authentication hot path, but it can lose events under pressure, which is the next concern.
Reliability and fire-and-forget
The pipeline's emit path is synchronous (the event is constructed on the authentication hot path and handed to the pipeline before the operation returns), but the dispatch to each sink is asynchronous. Every audit-pipeline design has to make this trade-off between latency and delivery guarantees.
A fully-synchronous pipeline blocks the authentication operation until every sink acknowledges the event. The latency cost is the sum of every sink's latency; one slow sink slows every login. The pattern is a non-starter for production.
A fully-asynchronous pipeline with no durability lets the events fan out to sinks in the background. The latency cost is zero (the operation returns before the sinks see the event). The trade-off is that an event lost between emit and the sink is genuinely lost; there is no retry, no acknowledgement, no delivery guarantee.
Axess takes a middle position. The synchronous emit produces an event handed to the pipeline; the pipeline buffers the event in memory or in a durable queue (the choice is configuration); a background task dispatches from the buffer to each sink with retry. The buffer absorbs sink latency without blocking the authentication operation; the buffer's durability determines whether events survive an application crash.
The configuration shape:
pub struct AuditPipeConfig {
    pub regulatory_sink: Arc<dyn IdentityAuthnLog>,
    pub analytics_sink: Option<Arc<dyn AuthnAnalyticsSink>>,
    pub buffer: BufferStrategy,            // InMemory | FsBacked { path }
    pub max_buffer_size: usize,
    pub on_buffer_full: BufferFullPolicy,  // DropOldest | Block | ShutdownAuthn
    pub enrichment: Option<Arc<dyn EventEnrichment>>,
}
buffer controls where the in-flight events live. InMemory is
the simple choice: a bounded VecDeque that holds events between
emit and dispatch. Events in the buffer are lost on application
crash; for most deployments, the regulatory sink's own durability
(the database transaction that records the event) is what
matters, and the in-memory buffer is just for absorbing latency
spikes.
FsBacked { path } writes the buffer to disk so events survive
a crash. The cost is one local-disk write per event; the benefit
is that the audit trail does not lose events to short network
outages or process restarts. Deployments in regulated
environments typically use the file-backed buffer; most others
use the in-memory one.
max_buffer_size is the cap. Above it, the on_buffer_full
policy fires.
on_buffer_full is the choice for what happens when the buffer
fills. DropOldest is the high-throughput default: the oldest
buffered events are evicted so the newest fit. Block is the
strict choice: the authentication operation that produced the
event blocks until the buffer has room; the latency cost can be
substantial but no events are lost. ShutdownAuthn is the
fail-shut choice: the authentication subsystem stops accepting
new logins until the buffer drains. Regulated deployments
typically choose Block or ShutdownAuthn; permissive
deployments choose DropOldest.
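The three policies differ only in what a push does once the cap is reached. A sketch with hypothetical types (SketchBufferFullPolicy, PushOutcome, SketchBuffer); events are reduced to ids, and blocking or suspending is reported to the caller as an outcome rather than actually performed:

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchBufferFullPolicy {
    DropOldest,
    Block,
    ShutdownAuthn,
}

#[derive(Debug, PartialEq)]
pub enum PushOutcome {
    Queued,
    /// Queued, but the event with this id was evicted to make room.
    QueuedAfterEvicting(u64),
    /// The producer has to wait until the dispatcher drains an entry.
    MustWait,
    /// New logins should be refused until the buffer drains.
    AuthnSuspended,
}

pub struct SketchBuffer {
    events: VecDeque<u64>,
    max_buffer_size: usize,
    policy: SketchBufferFullPolicy,
}

impl SketchBuffer {
    pub fn new(max_buffer_size: usize, policy: SketchBufferFullPolicy) -> Self {
        assert!(max_buffer_size > 0, "the cap must allow at least one event");
        Self { events: VecDeque::new(), max_buffer_size, policy }
    }

    /// Emit side: called on the authentication hot path.
    pub fn push(&mut self, event_id: u64) -> PushOutcome {
        if self.events.len() < self.max_buffer_size {
            self.events.push_back(event_id);
            return PushOutcome::Queued;
        }
        match self.policy {
            SketchBufferFullPolicy::DropOldest => {
                let evicted = self.events.pop_front();
                self.events.push_back(event_id);
                match evicted {
                    Some(id) => PushOutcome::QueuedAfterEvicting(id),
                    None => PushOutcome::Queued,
                }
            }
            SketchBufferFullPolicy::Block => PushOutcome::MustWait,
            SketchBufferFullPolicy::ShutdownAuthn => PushOutcome::AuthnSuspended,
        }
    }

    /// Dispatch side: the background task takes the oldest buffered event.
    pub fn pop(&mut self) -> Option<u64> {
        self.events.pop_front()
    }
}
```

Only DropOldest ever loses an event; the other two convert buffer pressure into latency or unavailability, which is the choice regulated deployments usually prefer.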
The IdentityAuthnLog sink
The regulatory sink is the IdentityAuthnLog implementation the
application already provides for the lockout policy (covered in
Identity store implementation). The pipeline writes events to
this sink as the canonical record. The sink's storage backend is
the application's choice; the typical pattern is a Postgres or
MySQL table with append-only writes and an index on
(user_id, tenant_id, timestamp) for the lockout-policy queries.
The pattern means the regulatory store is what the application already needs for lockout. The pipeline does not add a second database; it just uses what is already there.
The AuthnAnalyticsSink
The analytics sink is the optional stream for the SIEM and analytics consumers. The trait:
#[async_trait]
pub trait AuthnAnalyticsSink: Send + Sync {
    async fn dispatch(&self, event: RichAuthnEvent) -> Result<(), SinkError>;
}
The sink is a fire-and-forget dispatcher. The buffer's retry
semantics handle transient failures; a dispatch that still fails
after those retries is logged and dropped. The implementations the
audit-archive-fs feature provides cover the filesystem case;
for streaming or columnar stores, the implementation is the
application's.
A typical Apache Iggy implementation:
struct IggyAnalyticsSink {
    client: IggyClient,
    topic: String,
}
#[async_trait]
impl AuthnAnalyticsSink for IggyAnalyticsSink {
    async fn dispatch(&self, event: RichAuthnEvent) -> Result<(), SinkError> {
        let bytes = rkyv::to_bytes::<_, 256>(&event).map_err(SinkError::serialize)?;
        self.client.send(self.topic.clone(), bytes.to_vec()).await
            .map_err(SinkError::transport)?;
        Ok(())
    }
}
The rkyv serialisation is the recommendation. RichAuthnEvent
derives rkyv::Archive, rkyv::Serialize, and
rkyv::Deserialize, which produces a wire format that is
significantly more compact than JSON, much faster to serialise,
and zero-copy on the deserialise side. For a stream that pumps
millions of events per day, the difference is operationally
meaningful.
A ClickHouse implementation is the equivalent for batch shipping: the sink accumulates events in memory until a threshold (batch size or time interval), then issues a bulk insert. The pattern matches ClickHouse's preferred ingestion shape.
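The accumulate-then-flush behaviour is independent of the store and can be sketched on its own. SketchBatcher is a hypothetical helper, not part of axess; the caller passes the current Instant explicitly so the time threshold stays testable, and the returned batch is what a real sink would hand to its bulk insert.

```rust
use std::time::{Duration, Instant};

/// Accumulates events and releases them as one batch when either the
/// size threshold or the age threshold is reached.
pub struct SketchBatcher<T> {
    pending: Vec<T>,
    max_batch: usize,
    max_age: Duration,
    oldest: Option<Instant>,
}

impl<T> SketchBatcher<T> {
    pub fn new(max_batch: usize, max_age: Duration) -> Self {
        Self { pending: Vec::new(), max_batch, max_age, oldest: None }
    }

    /// Adds one event; returns a batch if a threshold was crossed.
    pub fn push(&mut self, event: T, now: Instant) -> Option<Vec<T>> {
        if self.pending.is_empty() {
            // The age threshold is measured from the first event in the batch.
            self.oldest = Some(now);
        }
        self.pending.push(event);
        self.flush_if_due(now)
    }

    /// Called on push and from a periodic timer so quiet periods still flush.
    pub fn flush_if_due(&mut self, now: Instant) -> Option<Vec<T>> {
        let aged = self
            .oldest
            .map_or(false, |first| now.duration_since(first) >= self.max_age);
        if self.pending.len() >= self.max_batch || aged {
            self.oldest = None;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

The periodic flush_if_due call matters as much as the size check: without it, a low-traffic tenant's events could sit in memory indefinitely.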
The three-tier retention
The regulatory stream's events grow without bound by default. A deployment with millions of users produces hundreds of millions of events per year; the storage cost and the query cost both trend up unless the deployment manages the retention.
The retention story has three tiers, with explicit transitions between them.
The hot tier is the live authn_attempts table (or whatever the
regulatory sink writes to). Events stay in the hot tier for as
long as they are operationally useful: the lockout policy's
last_attempts query, the SIEM's recent-events dashboards, the
incident-response window. The recommended hot retention is
between 7 and 90 days, with 30 days as a sensible default for
most deployments.
The archived tier is a cheaper, slower store that holds events for the compliance retention period. The data is the same; the access pattern is different. Queries against the archive are slower (typically minutes rather than milliseconds) and less flexible (no indexed lookup; full-scan reads against a known date range). The archive is the answer to "show me everything that happened to this user three years ago." The retention here is set by the regulatory regime: PCI-DSS asks for one year; banking regulations ask for seven years; HIPAA asks for six years. Configure to match.
The deleted tier is what comes after the archive expires. The
events are removed entirely; the deletion is auditable (a
DeletionEvent itself, recording the date range and the count)
but the underlying data is gone. Some deployments never reach
this tier (an indefinite archive is a defensible choice for
small-volume deployments); others rotate through it on the
regulatory schedule.
AuditArchiver
The transition from hot to archived runs through the
AuditArchiver trait:
#[async_trait]
pub trait AuditArchiver: Send + Sync {
    async fn archive_batch(&self, events: Vec<AuthEvent>) -> Result<(), ArchiveError>;
    async fn purge_batch(&self, range: ArchiveDateRange) -> Result<usize, ArchiveError>;
}
The trait has two methods. archive_batch writes a batch of
events to the cold store. purge_batch removes a date range from
the archive (for the deleted-tier transition).
The pipeline runs an AuditRetentionLoop<S, A> (S is the source
IdentityAuthnLog, A is the archiver) that drives the
transitions on a configurable schedule:
use std::{sync::Arc, time::Duration};
let retention_policy = AuditRetentionPolicy {
    archive_after: Duration::from_secs(30 * 86400), // 30 days in the hot tier
    purge_hot_after_archive: Duration::from_secs(7 * 86400), // 7 days after the archive copy
    delete_archive_after: None, // never purge archive
};
let loop_handle = AuditRetentionLoop::new(
    identity_authn_log.clone(),
    Arc::new(my_archiver),
    retention_policy,
).run();
The loop runs once per configured interval (typically daily).
Each run does three things: it reads the events from the hot
tier that have aged past archive_after, it batches them into
the archiver, and it purges the hot tier of events whose
archive copy was made more than purge_hot_after_archive ago.
The delete_archive_after field is the optional final
transition. None means the archive grows indefinitely; a
configured duration means the archive itself is purged at that
age.
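The per-event decision the loop makes on each run can be written down as a small pure function. This is a sketch of the rules as the paragraphs above state them, not the AuditRetentionLoop source: RetentionPolicy mirrors the three fields shown earlier, while Location, Step, and next_step are hypothetical names introduced for illustration.

```rust
use std::time::Duration;

/// Mirror of the three policy fields shown above.
#[derive(Clone, Copy)]
pub struct RetentionPolicy {
    pub archive_after: Duration,
    pub purge_hot_after_archive: Duration,
    pub delete_archive_after: Option<Duration>,
}

/// Where the copies of one event currently live (hypothetical bookkeeping).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Location {
    HotOnly,
    /// In both tiers; the archive copy was written `archived_for` ago.
    HotAndArchive { archived_for: Duration },
    ArchiveOnly,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    Keep,
    CopyToArchive,
    PurgeFromHot,
    DeleteFromArchive,
}

/// Decide what one run of the loop does with an event of the given age.
pub fn next_step(age: Duration, at: Location, policy: &RetentionPolicy) -> Step {
    match at {
        // Aged past the hot window and not yet archived.
        Location::HotOnly if age >= policy.archive_after => Step::CopyToArchive,
        // Archived long enough ago that the hot copy can go.
        Location::HotAndArchive { archived_for }
            if archived_for >= policy.purge_hot_after_archive => Step::PurgeFromHot,
        // The final transition only exists when a duration is configured.
        Location::ArchiveOnly => match policy.delete_archive_after {
            Some(limit) if age >= limit => Step::DeleteFromArchive,
            _ => Step::Keep,
        },
        _ => Step::Keep,
    }
}
```

Keeping the decision separate from the I/O means the tier boundaries can be tested with plain durations, with no store or archiver involved.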
The defaults (30 days hot, 7 days hot retention after archive, no archive deletion) are conservative for finance. PCI-DSS asks for one year of audit retention, which the defaults satisfy by keeping events in the archive indefinitely. Other regulatory regimes have different requirements; tune to match.
Filesystem archive
The audit-archive-fs feature ships
FilesystemAuditArchiver, a reference implementation that
writes archived events to a day-partitioned JSONL directory:
/var/lib/axess/audit/
    YYYY-MM-DD.jsonl
    YYYY-MM-DD.jsonl
    YYYY-MM-DD.jsonl
    ...
Each file is append-only, fsynced per batch, and contains
newline-delimited JSON-encoded events. The format is readable by
standard tools (grep, jq, awk), survives forensic
investigation, and lifts cleanly into cloud object storage when
the deployment moves the archive there.
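To make the layout concrete, here is a sketch of how an event timestamp could map to its day partition and how a batch could be appended with one fsync. It is not the FilesystemAuditArchiver source. The function names are hypothetical, the events are assumed to arrive already JSON-encoded, and partitioning by UTC date is an assumption (the text above does not say which time zone the real implementation uses).

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Convert seconds since the Unix epoch into a UTC (year, month, day),
/// using the usual days-to-civil-date arithmetic for the Gregorian calendar.
pub fn utc_date(unix_secs: u64) -> (u64, u64, u64) {
    let z = unix_secs / 86_400 + 719_468; // days since 0000-03-01
    let era = z / 146_097; // completed 400-year cycles
    let doe = z % 146_097; // day within the cycle
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153; // month index with March = 0
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    // January and February belong to the following civil year.
    let year = yoe + era * 400 + u64::from(month <= 2);
    (year, month, day)
}

/// Path of the day partition an event with this timestamp belongs to.
pub fn partition_path(root: &Path, unix_secs: u64) -> PathBuf {
    let (y, m, d) = utc_date(unix_secs);
    root.join(format!("{y:04}-{m:02}-{d:02}.jsonl"))
}

/// Append already-encoded JSON lines to one partition, then fsync once.
pub fn append_batch(path: &Path, json_lines: &[String]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    for line in json_lines {
        file.write_all(line.as_bytes())?;
        file.write_all(b"\n")?;
    }
    file.sync_all()
}
```

Because each partition is named by date, purging a date range from the archive reduces to removing the files whose names fall inside the range.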
The reference implementation is for deployments with straightforward audit-storage needs. Larger deployments typically use S3 (with object-lock for immutability), GCS (with retention policies), or a dedicated audit-log service (Splunk, Datadog, SumoLogic). The trait surface is the same; the implementation is the deployment's.
Backpressure and tenant isolation
In a multi-tenant deployment, one tenant's audit load can
overwhelm the pipeline if the buffer is shared. The pattern that
works is per-tenant pipelines: each tenant has its own
AuditPipe with its own buffer and its own retention
configuration. The configuration matches what the tenant has
agreed to (high-throughput tenants get larger buffers; regulated
tenants get file-backed buffers). One tenant's spike does not
affect another's.
The cost is operational complexity: one configuration per tenant. The benefit is isolation; the SLA you offer a tenant is genuinely a per-tenant SLA, not a deployment-wide average.
For most deployments, a single shared pipeline with conservative defaults is fine. The per-tenant shape is for deployments with strict per-tenant guarantees.
What this enables
The pipeline is what turns axess's audit events into a defensible production audit trail. The dual stream serves the two audiences; the buffer absorbs latency without blocking the hot path; the retention tiers balance storage cost against query needs and regulatory requirements. The mechanism is small (a handful of traits, one fan-out, one retention loop), and the configuration is the deployment's lever for tuning to specific requirements.
Further reading
Audit events catalogues what flows through the pipeline.
Identity store implementation covers the regulatory sink
(the IdentityAuthnLog trait). Multi-tenancy covers the
per-tenant configuration patterns. Security posture covers
the GDPR posture for archived audit data and the PII fields
that may need scrubbing before archive.
Rate limiting
A rate limiter is the layer that caps how many requests an
identified caller may make per unit time. For an authentication
surface, the rate limiter is one of the most consequential pieces
of operational defence in depth: the lockout policy catches the
specific case of repeated failed credentials against one account,
but the rate limiter catches the broader case of distributed
brute-force and credential-stuffing traffic. This chapter covers
the RateLimitLayer Tower
middleware, the key-extraction strategies that determine what is
rate-limited, the tuning patterns for different endpoints, and the
SLI signal the layer produces.
Why rate limiting matters
The lockout policy in Multi-tenancy catches one specific pattern: many failures against one identifier. A rate limiter catches a wider pattern: a high volume of requests against an endpoint, regardless of identifier, regardless of success.
The shapes of attack the rate limiter catches:
Credential stuffing. An attacker with a list of credentials tries each one against the login endpoint. Each individual attempt fails on its own credentials (no lockout against any single user), but the aggregate rate is far above legitimate traffic. The rate limiter on the login endpoint, keyed by source IP, drops the attack to a trickle.
Account-existence enumeration. An attacker probes the signup endpoint to find which usernames are taken. Each request might succeed (the username is unique) or fail (the username is taken), and the response leaks the information. The rate limiter caps the enumeration rate; combined with response-shaping (return the same shape for both cases), the attack becomes impractical.
Token-replay forwarding. An attacker who has captured a valid session cookie forwards it through many connections to evade fingerprint detection. Each request looks legitimate on its own; the aggregate volume is the giveaway. The rate limiter keyed by session id catches the pattern.
Workload misbehaviour. A workload that has, for some reason, entered a tight loop calls the application's API continuously. The authentication side validates the workload token on each request; the rate limiter catches the runaway pattern before it overwhelms the service.
The layer
RateLimitLayer is a Tower layer with a small configuration:
use axess::{RateLimitLayer, RateLimitConfig, KeyExtractor};
use std::time::Duration;
let layer = RateLimitLayer::new(
    RateLimitConfig::builder()
        .max_requests(10)
        .window(Duration::from_secs(60))
        .key(KeyExtractor::PeerIp)
        .build(),
);
The configuration says "no more than ten requests per minute,
keyed by the peer IP." The layer counts requests against each
distinct peer IP; when a key has hit the limit within the window,
subsequent requests get a 429 (Too Many Requests) with a
Retry-After header.
The window is a sliding token bucket. The math: each key has a
bucket of max_requests tokens; each request consumes one;
tokens regenerate at a rate of max_requests per window. A
burst of more than max_requests requests within a short
interval consumes all the tokens; subsequent requests are
rejected until enough tokens have regenerated.
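The arithmetic in the paragraph above fits in a few lines. The sketch below is a standalone model of one key's bucket, not the layer's internal type: TokenBucket and try_acquire are hypothetical names, time is passed in as an offset from an arbitrary starting point, and the limit and window are assumed to be non-zero.

```rust
use std::time::Duration;

/// One key's bucket. Hypothetical type for illustration.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Duration,
}

impl TokenBucket {
    pub fn new(max_requests: u32, window: Duration, now: Duration) -> Self {
        assert!(max_requests > 0 && !window.is_zero(), "limit and window must be non-zero");
        let capacity = f64::from(max_requests);
        Self {
            capacity,
            tokens: capacity, // a fresh key starts with a full bucket
            refill_per_sec: capacity / window.as_secs_f64(),
            last: now,
        }
    }

    /// Credit the tokens regenerated since the last call, capped at capacity.
    fn refill(&mut self, now: Duration) {
        let elapsed = now.saturating_sub(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = self.last.max(now);
    }

    /// Ok(()) admits the request. Err(wait) reports how long until one whole
    /// token has regenerated, which is a natural source for Retry-After.
    pub fn try_acquire(&mut self, now: Duration) -> Result<(), Duration> {
        self.refill(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            Err(Duration::from_secs_f64((1.0 - self.tokens) / self.refill_per_sec))
        }
    }
}
```

With ten requests per sixty seconds, an emptied bucket regains one token roughly every six seconds, so a caller that burns the whole burst at once is held to that steady rate afterwards.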
The state of the buckets lives in memory by default
(BucketStore::InMemory). For multi-instance deployments where
the same caller can reach any instance, the rate limit needs to
be aggregated across instances; BucketStore::Valkey { client }
shifts the state to a shared Valkey instance.
Key extraction
The key is what the rate limiter counts against. The
KeyExtractor enum carries the choices:
pub enum KeyExtractor {
    PeerIp,                           // request source IP (read through trusted-proxy)
    SessionId,                        // present session id
    UserId,                           // authenticated user
    TenantId,                         // authenticated tenant
    WorkloadId,                       // authenticated workload
    Custom(Arc<dyn KeyExtractorFn>),  // application-supplied
    Composite(Vec<KeyExtractor>),     // multi-key (one bucket per combination)
}
The choice of key determines which attack the limiter catches.
PeerIp catches single-source attacks; SessionId catches
session-replay attacks; UserId catches per-user runaway loops;
TenantId catches per-tenant runaway (which can be a noisy
neighbour rather than an attack).
The Composite choice creates one bucket per combination of
the named keys. A rate limit keyed by (PeerIp, UserId) lets each
legitimate user behind a shared IP do their normal work without
competing for a single bucket. Note what it does not do: an
attacker on one IP who rotates through many users gets a fresh
bucket for every (ip, user) pair, so the composite limit bounds
the rate per pair, not the attacker's total rate. Bounding the
total takes a PeerIp-keyed limit alongside the composite one.
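How a composite key maps requests to buckets is easiest to see by counting. The following is an illustrative model rather than the layer's code: Caller, Part, bucket_key, and buckets_touched are hypothetical names, and joining the parts with a "|" separator is an assumption that only holds if no part can contain that character.

```rust
use std::collections::HashMap;

/// The parts of a request a key can be built from (stand-in for request context).
pub struct Caller<'a> {
    pub peer_ip: &'a str,
    pub user_id: Option<&'a str>,
}

#[derive(Clone, Copy)]
pub enum Part {
    PeerIp,
    UserId,
}

/// Join the selected parts into one bucket key; None if a part is missing.
pub fn bucket_key(parts: &[Part], caller: &Caller) -> Option<String> {
    let mut fields = Vec::with_capacity(parts.len());
    for part in parts {
        fields.push(match part {
            Part::PeerIp => caller.peer_ip,
            Part::UserId => caller.user_id?, // unauthenticated: no key
        });
    }
    Some(fields.join("|"))
}

/// Count requests per derived key, one map entry per bucket touched.
pub fn buckets_touched(parts: &[Part], callers: &[Caller]) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for caller in callers {
        if let Some(key) = bucket_key(parts, caller) {
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}
```

Three requests from one address, two as alice and one as bob, land in a single bucket when keyed by address alone and in two buckets when keyed by the pair. The number of buckets an address can open is therefore the number of users it presents.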
The Custom choice is the escape hatch for keys axess does not
know about: the OAuth client id, a custom request header, the
authenticated session's tenant slug. The application provides
the extraction function; the layer uses it to derive the key.
Per-endpoint rate limits
Different endpoints have different sensitivities. A login endpoint can be held to a few requests per minute per IP because real users do not log in fast; a data or search endpoint has to accept hundreds per minute because real users browse. The configuration shape is typically per-endpoint:
use axum::{routing::{get, post}, Router};
// RateLimitLayer, RateLimitConfig, KeyExtractor, and Duration are imported as in the previous example.
let auth_routes = Router::new()
    .route("/login", post(login))
    .route("/signup", post(signup))
    .route("/reset-password", post(reset_password))
    // Tight per-IP limit on the sensitive authentication routes.
    .layer(RateLimitLayer::new(
        RateLimitConfig::builder()
            .max_requests(10)
            .window(Duration::from_secs(60))
            .key(KeyExtractor::PeerIp)
            .build(),
    ));
let api_routes = Router::new()
    .route("/data", get(get_data))
    // Looser per-session limit on the data routes.
    .layer(RateLimitLayer::new(
        RateLimitConfig::builder()
            .max_requests(300)
            .window(Duration::from_secs(60))
            .key(KeyExtractor::SessionId)
            .build(),
    ));
let app = Router::new()
    .merge(auth_routes)
    .merge(api_routes)
    .layer(session_layer);
The pattern is to layer the rate limit on the specific routes it applies to, with the most restrictive limits on the most sensitive endpoints. A login endpoint with a tight per-IP limit is the canonical case; a token-refresh endpoint with a per-session limit is the second canonical case.
The trusted-proxy configuration covered in Cookies, fingerprinting,
hijack detection applies to the PeerIp extractor here as well.
Read the IP from the forwarded header only when the immediate
peer is a trusted proxy; otherwise the rate limiter can be
spoofed.
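The rule in the paragraph above is short enough to state as code. This is an illustration of the rule only, not a description of how axess resolves addresses: rate_limit_ip is a hypothetical function, the explicit list of proxy addresses is an assumption, and taking the first entry of the forwarded header is only safe when the proxy strips or overwrites whatever the client sent.

```rust
use std::net::IpAddr;

/// Pick the address to rate-limit on. The forwarded header is honoured only
/// when the TCP peer is one of the proxies the deployment controls.
pub fn rate_limit_ip(
    peer: IpAddr,
    forwarded_for: Option<&str>,
    trusted_proxies: &[IpAddr],
) -> IpAddr {
    if !trusted_proxies.contains(&peer) {
        // The header is client-controlled on this path, so ignore it.
        return peer;
    }
    forwarded_for
        .and_then(|value| value.split(',').next())
        .and_then(|first| first.trim().parse().ok())
        // Missing or unparseable header: fall back to the peer address.
        .unwrap_or(peer)
}
```

A request that arrives directly from the internet with a forged header is counted against its real peer address, which is exactly the case the spoofing warning is about.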
Tuning the windows
Tuning the rate limit is more art than science, but a few guidelines hold up.
For login endpoints: 10 requests per minute per IP is the conservative starting point. Real users log in at most a few times a day from any one IP. Credential-stuffing attacks need hundreds per minute to be efficient; 10 is well below that. Tune up only if the reject rate is too high on legitimate traffic (many users behind a corporate NAT, for instance).
For signup endpoints: 5 requests per minute per IP. Signup is even less frequent for legitimate users than login; account enumeration is best stopped tight.
For password reset: 3 requests per hour per IP. A reset is a once-in-a-while operation. Attackers spam reset to exhaust the victim's inbox; the tight limit is the defence.
For token refresh: matched to the session TTL. A session that refreshes every hour should have a rate limit of a few refreshes per hour per session id; an attacker who steals a session cannot extract value through rapid refresh.
For data endpoints: matched to the application's expected use pattern. An API for human-driven dashboards sees a few requests per minute per session; an API for programmatic clients sees hundreds per second per workload. The pattern is deployment-specific.
The best starting point is to measure first. The metrics from
AuthnMetrics::rate_limit_rejected (covered below) tell you the
real reject rate; the calibration is then to set the limit just
above the legitimate-traffic envelope.
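One way to turn "just above the legitimate-traffic envelope" into a number is to take a high percentile of observed per-key request counts and add headroom. The sketch below is a calibration aid under stated assumptions, not part of axess: suggest_limit is a hypothetical helper, the counts are assumed to come from one window of known-legitimate traffic, and both the percentile and the headroom factor are knobs the deployment has to choose.

```rust
/// Suggest a per-window limit from per-key request counts observed over one
/// window of known-legitimate traffic. Returns None with no observations.
pub fn suggest_limit(mut per_key_counts: Vec<u32>, percentile: f64, headroom: f64) -> Option<u32> {
    if per_key_counts.is_empty() {
        return None;
    }
    per_key_counts.sort_unstable();
    let last = per_key_counts.len() - 1;
    // Index of the requested percentile in the sorted data, rounded and clamped.
    let rank = ((percentile.clamp(0.0, 100.0) / 100.0) * last as f64).round() as usize;
    let envelope = per_key_counts[rank.min(last)];
    // Round up so the suggested limit never cuts into the envelope itself.
    Some((f64::from(envelope) * headroom).ceil() as u32)
}
```

For example, if the 99th-percentile key made 99 requests in the window and the headroom factor is 1.25, the suggestion is 124. The result is a starting point to check against the reject-rate metric, not a substitute for it.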
What happens at the limit
A request that hits the rate limit gets:
A 429 status code. The standard HTTP response for "Too Many Requests."
A Retry-After header. The value is the number of seconds the
client should wait before retrying. The header is read by
browsers and well-behaved clients; attackers ignore it.
A short JSON body explaining the limit. The body is generic ("rate limit exceeded") rather than specific (no "you have 0 of 10 requests remaining"); the latter leaks the limit configuration, which lets an attacker calibrate their attack to just under the limit.
The application's metrics record the rejection. The
AuthnMetrics::rate_limit_rejected method is the metric;
applications wire it to their Prometheus or OpenTelemetry
counter.
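The response pieces listed above can be modelled independently of any HTTP framework. This sketch is illustrative: Rejection and rejection are hypothetical names, the exact JSON body is an assumption (the text only says it is short and generic), and rounding the wait up to whole seconds is a design choice made here so a client that honours the header never retries early.

```rust
use std::time::Duration;

/// The pieces of a throttled response, independent of any HTTP framework.
#[derive(Debug, PartialEq, Eq)]
pub struct Rejection {
    pub status: u16,
    pub retry_after_secs: u64,
    pub body: &'static str,
}

/// Build the response for a caller that must wait `wait` before retrying.
pub fn rejection(wait: Duration) -> Rejection {
    // Round any fractional second up, and never advertise a zero wait.
    let secs = wait.as_secs() + u64::from(wait.subsec_nanos() > 0);
    Rejection {
        status: 429,
        retry_after_secs: secs.max(1),
        // Deliberately generic: no configured limit, no remaining count.
        body: r#"{"error":"rate limit exceeded"}"#,
    }
}
```

The body carries no numbers on purpose, which keeps the configured limit out of the attacker's view as the list above requires.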
Distinguishing attack from misconfiguration
A high rate of 429s is operationally interesting. The cause is either an attack (real attacker getting throttled) or a misconfiguration (legitimate traffic hitting a limit that was set too low).
The signals that distinguish them:
A rate of 429s heavily concentrated on a small set of source IPs, with the IPs not matching legitimate user patterns (datacenter IPs, VPN exit nodes, residential ASNs from countries the application does not typically serve) suggests attack.
A rate of 429s spread across many IPs, matching legitimate user patterns (residential ASNs from served countries, mixed mobile and home connections), suggests misconfiguration.
The audit events the rate limiter produces (a RateLimitRejected
event per drop) carry the source IP, the endpoint, and the
timestamp; SIEM queries against these distinguish the patterns
quickly.
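The "concentrated versus spread" distinction can be given a first-pass number: the share of all rejections that came from the busiest few source IPs. The helper below is a sketch, not a shipped query. top_k_share is a hypothetical name, the input is assumed to be the source IPs pulled from RateLimitRejected events over some period, and what share counts as "concentrated" is left to the deployment.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Fraction of all rejections that came from the `k` busiest source IPs.
/// Returns 0.0 for an empty input.
pub fn top_k_share(rejected_ips: &[IpAddr], k: usize) -> f64 {
    if rejected_ips.is_empty() {
        return 0.0;
    }
    let mut per_ip: HashMap<IpAddr, usize> = HashMap::new();
    for ip in rejected_ips {
        *per_ip.entry(*ip).or_insert(0) += 1;
    }
    let mut counts: Vec<usize> = per_ip.into_values().collect();
    counts.sort_unstable_by(|a, b| b.cmp(a)); // busiest first
    let top: usize = counts.iter().take(k).sum();
    top as f64 / rejected_ips.len() as f64
}
```

A share near 1.0 for a small k points at a handful of sources; a share near k divided by the number of distinct IPs points at rejections spread evenly across callers. The ASN and geography checks described above still need the SIEM.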
Per-tenant rate limits
For multi-tenant deployments, the rate limit configuration can
be per-tenant. A tenant with a higher SLA gets a higher rate
limit; a tenant with a lower SLA gets a tighter one. The
mechanism is the same RateLimitLayer, with a Custom key
extractor that composes the standard key (typically PeerIp)
with the tenant id, and with separate RateLimitConfigs per
tenant tier.
The pattern is operationally complex (one configuration per tenant tier), so most deployments use a single shared limit and calibrate to the deployment-wide envelope. The per-tenant shape is for deployments where the SLA differences are explicit and the operational overhead is justified.
Metrics
The layer emits two metrics through the AuthnMetrics trait:
rate_limit_rejected is incremented on each 429. The metric is
the primary signal for tuning and for attack detection.
rate_limit_evaluated (optional, off by default) is incremented
on every request the layer sees, regardless of outcome. The
ratio of rejected to evaluated is the reject rate; below 0.1%
typically means the limit is set well, above 1% suggests either
attack or misconfiguration.
The AuthnMetrics implementation is the application's; it
typically routes to Prometheus, OpenTelemetry, or whatever
metrics system the deployment uses. The
examples/sqlite/
reference application shows a simple AtomicU64-based
implementation suitable for adapting to a real metrics system.
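As a rough picture of what an AtomicU64-based implementation might look like, the sketch below keeps the two counters and derives the reject rate from them. It is not the code in the reference application: RateLimitCounters is a hypothetical type, and the wiring into the AuthnMetrics trait is left out because the trait's method signatures are not shown in this chapter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Two counters matching the metrics named above (hypothetical type).
#[derive(Default)]
pub struct RateLimitCounters {
    evaluated: AtomicU64,
    rejected: AtomicU64,
}

impl RateLimitCounters {
    /// Record one request the layer saw and whether it was rejected.
    pub fn record(&self, rejected: bool) {
        // Relaxed is enough: the counters are independent statistics.
        self.evaluated.fetch_add(1, Ordering::Relaxed);
        if rejected {
            self.rejected.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Rejected divided by evaluated, or None before any request was seen.
    pub fn reject_rate(&self) -> Option<f64> {
        let evaluated = self.evaluated.load(Ordering::Relaxed);
        if evaluated == 0 {
            return None;
        }
        Some(self.rejected.load(Ordering::Relaxed) as f64 / evaluated as f64)
    }
}
```

Five rejections in a thousand requests gives a rate of 0.005, which sits between the 0.1% and 1% markers mentioned above and is worth a look rather than an alarm.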
Composing with the lockout policy
The rate limiter and the lockout policy are different defences that compose. The rate limiter catches volume; the lockout policy catches credential pattern. Both fire on attacks, in different shapes.
The pattern that emerges: the rate limiter is the first line of defence against credential stuffing. It drops the attack to a trickle before any individual user's lockout policy can fire. The lockout policy then catches the few attempts that get through, marking the targeted user accounts as locked.
A deployment that has rate limiting but no lockout policy is vulnerable to slow attacks that stay below the rate limit. A deployment that has lockout but no rate limiting is vulnerable to high-volume attacks that distribute across many users. Both together cover both attack shapes.
What this enables
The rate limiter is the operational layer that sits between "the request was sent" and "the authentication logic runs." A deployment without it is vulnerable to a class of attack that the authentication logic alone cannot prevent; a deployment with it has the broader defence against volume-based attacks that complements the credential-pattern defence of lockout.
Further reading
Multi-tenancy covers the lockout policy that pairs with the
rate limit. Audit events catalogues the RateLimitRejected
event the layer emits. Cookies, fingerprinting, hijack
detection covers the trusted-proxy configuration that
determines how PeerIp reads the source IP. Operations
runbook covers the metrics dashboards and the SIEM rules that
turn the rate-limit signal into alerts.
Security posture
This chapter is the production-readiness chapter. It covers the crypto choices axess makes by default, the production integration requirements an adopter has to meet before launch, the compliance touch-points (GDPR, SOC 2, PCI-DSS, HIPAA) the deployment will face, and the disclosure protocol for handling the inevitable vulnerability report.
The chapter has two halves. The first half is axess-specific and
covers the crypto backends, the FIPS-routing notes, and the PII
classification. The second half is the canonical SECURITY.md
from the repo root, included verbatim so the production
checklist lives in one place rather than two.
Crypto backends
Axess uses three crypto backends, chosen per operation:
RustCrypto is the default for most cryptographic primitives. The implementations are pure Rust, with no system-library dependency, and the project's audit history is good. Axess uses RustCrypto for AES-256-GCM (the session envelope), HMAC-SHA256 (cookie signing, fingerprint binding), Argon2id (password hashing), TOTP and HOTP (the RFC 6238 and RFC 4226 implementations), and SHA-256 (refresh token hashing).
aws-lc-rs is an alternative
for deployments that need FIPS 140-3 validated crypto. The
backend wraps the FIPS-validated aws-lc library; selecting it
through a Cargo feature redirects the relevant primitives to
the validated implementations. The trade-off is binary size
(the FIPS module adds a few megabytes) and platform support
(aws-lc does not build on every target).
ring is a third option, used historically for TLS-adjacent primitives. The project is mature but the maintenance cadence has slowed; axess uses ring in a few legacy spots and is migrating away. New code uses RustCrypto by default and aws-lc-rs when FIPS is required.
The selection is a Cargo feature, configured per crate:
[dependencies]
axess = { version = "0.2", features = ["crypto-aws-lc"] }
The default is crypto-rust (which is the same as not specifying
a backend); crypto-aws-lc is the FIPS variant. The crates that
depend on a specific backend gate their implementations on the
feature; the build refuses if the application requests
incompatible backends (a deployment cannot simultaneously enable
RustCrypto and aws-lc-rs for the same operation).
FIPS targeting
A FIPS 140-3 validated deployment requires three things to be true.
The first is that every cryptographic operation runs through a
validated module. Axess's crypto-aws-lc feature routes the
relevant operations through aws-lc-rs. The choice satisfies the
"validated module" requirement.
The second is that the deployment's compile and link chain does
not introduce non-validated crypto. Cargo's dependency graph is
the source of truth here; running cargo tree and inspecting
for non-aws-lc crypto crates (rustls, ring, the older
RustCrypto crates) shows what the deployment actually pulls in.
Anything that introduces non-validated crypto needs to be
replaced or compiled out.
The third is that the validation certificate covers the platform the deployment runs on. NIST publishes FIPS validation certificates per platform-binary combination; a certificate for Linux x86-64 does not cover macOS ARM. The deployment's compliance evidence must include the certificate matching the production platform.
The deployment's compliance team owns the end-to-end FIPS validation; axess provides the crypto-backend lever. The chapters that depend on specific crypto choices (session envelope, refresh-token hashing, HMAC fingerprint) all use the configured backend automatically.
PII classification
The application records PII across several stores. The
classification matters for GDPR (the data subject's rights), for
SOC 2 (the control objectives), and for the retention sweep
(Device identity's device_retention_days). The classification:
Primary PII includes the user's identifier (email, username, or similar), their hashed password, their TOTP secret, their FIDO2 credentials, their IP address as seen during authentication, and their device fingerprint. This data lives in the identity store and the device store; the retention is the application's choice within whatever regulatory bounds apply.
Secondary PII includes the audit-event log (which references the
primary PII through user_id, tenant_id, device_id, and
client_ip). The audit retention covered in Audit pipeline
applies here; for GDPR the typical pattern is to retain audit
data longer than the primary PII but to scrub or hash the IP
addresses after the operational hot window.
Pseudonymous data includes the session id, the refresh token hash, and the device id itself (a UUID that does not name the user directly). These identify a person only when joined to the primary data, and the join requires access to the identity store, so they can usually be retained longer than the primary PII. GDPR still treats pseudonymised data as personal data for as long as that join is possible; the exposure is reduced, not removed.
The GDPR right-to-erasure verb (IdentityAdmin::erase_user)
cascades through every store: the user's primary PII is removed
from the identity store, the user's device records are
removed from the device store, the user's sessions are removed
from the session store, and the user's refresh tokens are
removed from the refresh-token store. The audit-event entries
that reference the user are not removed (the audit trail is
load-bearing for compliance); the user's identifier in the
events is hashed to a pseudonymous token, which makes the events
non-PII without losing the ability to correlate them.
Compliance touch-points
The deployment will face one or more of these regulatory frames. Axess does not provide compliance on its own; it provides the controls each framework requires. The touch-points:
GDPR (EU data protection): the right-to-erasure verb (above),
the audit trail's retention configuration, the IP-address
scrubbing in the cold-tier archive, the per-tenant
device_retention_days. The deployment owns the data subject
notices, the privacy policy, and the legal basis for processing;
axess provides the technical mechanisms.
SOC 2 (operational controls): the audit catalogue (every authentication and authorisation decision produces an event), the lockout policy (defends against repeated password guessing against a single account), the session and refresh-token security (covered in earlier chapters), the operational metrics (covered in Operations runbook). The deployment owns the policy and procedure documentation; axess provides the operational evidence.
PCI-DSS (payment card data, if applicable): the strong authentication for administrative access, the audit retention of at least one year, the cryptographic protection of session data at rest. The deployment owns the cardholder data environment; axess covers the authentication boundary into it.
HIPAA (US healthcare data, if applicable): the strong authentication for protected health information access, the audit retention of at least six years, the encryption of session data at rest and in transit. The deployment owns the HIPAA-covered systems; axess covers the authentication boundary.
The chapters that cover the relevant mechanisms are the place to look up specific controls: Session lifecycle and crypto envelope for the at-rest encryption, Audit pipeline for the retention, Refresh tokens and session continuity for the refresh-token hygiene, Multi-tenancy for the lockout policy. The compliance documentation maps the framework's requirements to the relevant chapters.
Disclosure protocol
The vulnerability disclosure protocol lives in the canonical
SECURITY.md
at the repo root. The summary:
Vulnerability reports go through the private channel described
in SECURITY.md (typically a security email or GitHub Security
Advisories). Do not file vulnerabilities on the public issue
tracker.
The maintainers aim to acknowledge reports within 48 hours and to triage to a severity level within 7 calendar days (best-effort targets while the project is pre-1.0). Critical and high-severity issues get a private fix in a security branch, a coordinated disclosure window, and a CVE if the issue warrants one. Lower-severity issues are fixed in the normal development cycle.
Adopters are expected to keep their axess dependency current.
Vulnerability fixes ship in the next patch release; the changelog
notes which fixes are security-relevant. Deployments behind on
patches accept the risk of the unfixed vulnerabilities.
Canonical SECURITY.md
The rest of this chapter is the canonical SECURITY.md from the
repo root, included so the production checklist is in one place.
Security Policy
Reporting a Vulnerability
If you discover a security issue in Axess, please report it through GitHub's private vulnerability reporting (the Report a vulnerability button under the repository's Security tab) or by emailing security@gnomes.ch. Do not open a public issue.
Response targets (best-effort while the project is pre-1.0):
- Acknowledgement: within 48 hours of report
- Triage and severity assessment: within 7 calendar days
- Critical / High fix: patch release within 7 calendar days of confirmation
- Medium fix: patch in the next scheduled release (typically within 30 days)
- Advisory: published via GitHub Security Advisory once a fix is available
Only the latest 0.x minor receives security patches. If you are on an older version, upgrade to receive fixes.
Using Axess Securely
Axess is a library for authentication and authorization. Its security depends on correct integration and configuration in your application.
Production integration checklist
Transport and cookies
- Terminate TLS before Axess sees requests. All session cookies default to Secure; HttpOnly; SameSite=Lax.
- Set an HSTS header (Strict-Transport-Security: max-age=63072000; includeSubDomains) at the reverse-proxy or application layer so browsers never downgrade to HTTP.
- Use a cryptographically random 32-byte signing key loaded from a secrets manager. Never hard-code or re-use the all-zero example key.
CSRF
- Mount CsrfLayer on state-changing routes. The shipped middleware implements signed double-submit cookie protection; CsrfConfig::new(signing_key) is the entry point.
- SameSite=Lax (the cookie default) mitigates the most common vectors, but is not sufficient on older browsers or cross-site GET-triggered mutations; keep CsrfLayer engaged.
- For API-only endpoints, validate Origin/Referer headers or use a custom request header as a CSRF defence in addition.
Session binding and hijacking
- Enable session binding (e.g. UserAgentBinding) to detect cookie theft from a different browser/client.
- Understand the trade-off: session binding raises the bar for opportunistic theft but does not protect against an attacker who copies the User-Agent string along with the cookie.
- Consider combining with IP-subnet or TLS channel binding for higher-security environments.
Session registry and forced logout
- If using a session registry for forced logout, guard all authenticated routes with registry validity checks (not just require_authn!) so suspended or force-logged-out users cannot continue using stale sessions.
- Call suspend_user (which automatically invalidates registry entries) rather than updating store status manually.
Rate limiting
- Apply per-IP rate limiting on login, factor verification, and OAuth callback endpoints using the built-in RateLimitLayer. Axess enforces per-user lockout, but distributed brute-force across many usernames requires IP-level throttling.
Recommended configuration for authentication endpoints:
use axess::{RateLimitLayer, RateLimitConfig, KeyExtractor};
use std::time::Duration;
// Tight limit for login / factor verification (5 attempts per 60 s per IP).
let auth_rate_limit = RateLimitLayer::new(
    RateLimitConfig::builder()
        .max_requests(5)
        .window(Duration::from_secs(60))
        .key(KeyExtractor::ForwardedIp)
        .build(),
);
// Separate, tighter limit for OTP verification (3 attempts per 60 s).
let otp_rate_limit = RateLimitLayer::new(
    RateLimitConfig::builder()
        .max_requests(3)
        .window(Duration::from_secs(60))
        .key(KeyExtractor::ForwardedIp)
        .build(),
);
let app = Router::new()
    .route("/login", post(login_handler))
    .route("/verify-totp", post(totp_handler))
    .layer(auth_rate_limit)
    // Or apply per-route:
    .route("/verify-email-otp", post(otp_handler))
    .route_layer(otp_rate_limit);
- Rate-limit OTP verification endpoints separately; 8-digit email OTPs have 10^8 possibilities but a tighter window reduces feasibility further.
Trusted proxy and IP extraction
- If you rely on X-Real-IP or X-Forwarded-For for audit trails or rate limiting, ensure your reverse proxy strips these headers from untrusted client requests before forwarding. Axess trusts the first entry in X-Forwarded-For.
Session store selection
- In-memory stores (MemorySessionStore, MemoryRefreshTokenStore) are for testing only. They use non-constant-time lookups and do not persist across restarts.
- SQL stores (SqliteSessionStore, PostgresSessionStore, MysqlSessionStore) support optional AES-256-GCM encryption at rest via SqliteSessionStore::new(pool, SessionCrypto::new(key)); opt out only via the explicit ::plaintext(pool) constructor (dev/test only).
- Valkey store supports AES-256-GCM encryption via ValkeySessionStore::new(client, key). Plaintext available via ::plaintext(client) for dev/test.
- All encryption-capable stores support key rotation via SessionCrypto::with_previous_key(old_key); sessions encrypted with the previous key are transparently re-encrypted on the next access.
Content Security Policy
- Set a Content-Security-Policy header on all HTML responses to mitigate XSS impact. At minimum: default-src 'self'; script-src 'self'; style-src 'self'.
- Avoid unsafe-inline and unsafe-eval in CSP directives.
OAuth / OIDC
- Register only HTTPS issuer URLs. Axess rejects http:// issuer URLs in discovery (localhost / 127.0.0.1 / [::1] exemption for dev).
- Request the minimum scopes needed; avoid offline_access unless refresh tokens are required.
- Validate that the OAuth redirect URI matches exactly; do not use wildcard patterns.
Social login (plain OAuth 2.0)
- Prefer OIDC whenever the provider supports it. Reach for SocialProvider only for IdPs that explicitly don't (GitHub user login, Twitter/X, Discord, Reddit, Spotify, …).
- Understand the weaker security model: identity comes from a userinfo HTTPS GET, not from a signed assertion. A compromised IdP can impersonate any of its users to your service; you accept that blast radius when you adopt the provider.
- Keep PKCE on (the default). A handful of providers reject the extra parameter; SocialProvider::without_pkce is the opt-out and should be used sparingly.
- Verify csrf_state echo on the callback before calling exchange_code; SocialProvider::mint_csrf_state produces a fresh value routed through the same injectable RNG as PKCE.
Workload identity
- Pin the trust domain at resolver construction. Every shipped resolver (JwtSvidResolver, MtlsResolver, WorkloadResolver) accepts an expected TrustDomain and rejects tokens / certs whose synthesised WorkloadId lives under a different one; defense in depth against a confused-deputy where the JWKS or CA happens to be shared across trust domains.
- For the generic WorkloadResolver, keep adopter-supplied claim mappers strict about which subject paths the application admits. The recipes in examples/workload-identity/ are templates, not policy.
- When fetching SVIDs from a local SPIRE agent, use the spire-workload crate today; see docs/workload-identity/jwt-svid.md for the fetch-side recipe.
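Trust-domain pinning comes down to extracting the authority part of the SPIFFE ID and comparing it, exactly, with the domain the resolver was built with. The sketch below illustrates that comparison with the standard library only; `trust_domain` and `admits` are hypothetical helpers and do not reflect the internals of the shipped resolvers.

```rust
/// Extracts the trust domain from a SPIFFE ID such as
/// `spiffe://prod.example.org/ns/payments/sa/api`.
/// Returns None when the scheme is wrong or the domain is empty or contains
/// characters outside lowercase letters, digits, '.', '-' and '_'.
pub fn trust_domain(spiffe_id: &str) -> Option<&str> {
    let rest = spiffe_id.strip_prefix("spiffe://")?;
    let domain = rest.split('/').next().unwrap_or("");
    let valid = !domain.is_empty()
        && domain.bytes().all(|b| {
            b.is_ascii_lowercase() || b.is_ascii_digit() || matches!(b, b'.' | b'-' | b'_')
        });
    if valid {
        Some(domain)
    } else {
        None
    }
}

/// Admits the identity only when its trust domain equals the pinned one.
/// Hypothetical helper; the comparison is exact, never a suffix match.
pub fn admits(spiffe_id: &str, expected_trust_domain: &str) -> bool {
    trust_domain(spiffe_id) == Some(expected_trust_domain)
}
```

An exact match is the point: a suffix or substring comparison would admit `spiffe://evil-prod.example.org/...` for a pinned `prod.example.org`.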
Dependencies
- Regularly update Axess and its dependencies (cargo update).
- Run cargo audit in CI to catch known vulnerabilities in the dependency tree.
Trusted proxy configuration (detailed)
Axess extracts client IP addresses from the X-Real-IP and X-Forwarded-For headers for audit logging and rate limiting. These headers are only trustworthy if your reverse proxy strips or overwrites them before forwarding.
If you don't run behind a trusted reverse proxy, these headers are user-controlled and any IP-based security decision (rate limiting, geo-blocking, audit trails) can be spoofed.
Configure your reverse proxy to:
- Strip incoming X-Forwarded-For and X-Real-IP from client requests.
- Set X-Real-IP to the immediate client address (TCP peer).
- Append the client address to X-Forwarded-For (for multi-hop chains).
Example for nginx:
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
Axess reads X-Real-IP first; if absent, it takes the first entry from X-Forwarded-For. It does not walk the forwarded chain or maintain a trusted-proxy allowlist; that is the reverse proxy's responsibility.
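The precedence described above can be sketched as a pure function over the two header values. This is an illustration, not the axess source: `resolve_client_ip` is a hypothetical name, and falling back to X-Forwarded-For when X-Real-IP is present but unparseable is an assumption of the sketch.

```rust
use std::net::IpAddr;

/// Mirrors the documented precedence: X-Real-IP first, then the first
/// (left-most) X-Forwarded-For entry. Hypothetical helper, not an axess API.
pub fn resolve_client_ip(
    x_real_ip: Option<&str>,
    x_forwarded_for: Option<&str>,
) -> Option<IpAddr> {
    // Prefer X-Real-IP when it parses as an address.
    if let Some(ip) = x_real_ip.and_then(|value| value.trim().parse::<IpAddr>().ok()) {
        return Some(ip);
    }
    // Otherwise take the first comma-separated entry of X-Forwarded-For.
    x_forwarded_for
        .and_then(|value| value.split(',').next())
        .and_then(|first| first.trim().parse::<IpAddr>().ok())
}
```

Because the first X-Forwarded-For entry is whatever the original sender wrote, the result is only meaningful when the proxy strips or overwrites client-supplied values as listed above.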
Feature inventory
The shipped security surface, grouped by area. Caveats live in the Notes column.
Authentication factors
| Feature | Notes |
|---|---|
| Password | Argon2id hashing with workspace-pinned parameters, per-user lockout, password-reuse history, plaintext zeroized after hash. |
| TOTP (RFC 6238) | Constant-time comparison, last-step replay guard, SHA-1 / SHA-256 / SHA-512, 6–8 digit codes. |
| HOTP (RFC 4226) | Counter advancement, zeroized secrets, same algorithm options as TOTP. |
| Email OTP | 8-digit default, Argon2-hashed codes at rest, TTL-bound, single-use. |
| FIDO2 / WebAuthn | Registration + authentication + clone detection + discoverable / passwordless. Per-ceremony UV / attestation policy waits on webauthn-rs 0.6 stable; see ROADMAP.md. |
| LDAP bind | Verifier over ldap3 with TLS via rustls. Bind only; schema mapping is the application's responsibility. |
| mTLS factor | X.509 certificate verification against a configured trust anchor; SAN URI extraction for SPIFFE / regular identity binding. |
| JWT bearer | Generic JWT verifier with JWKS rotation, iss / aud / exp / nbf / alg allowlist, clock-injected for DST. |
| Multi-factor chains | Ordered factor pipeline (FactorStep::AnyOf for choice steps), session state machine enforces sequencing at compile time. |
Sessions
| Feature | Notes |
|---|---|
| Session cookies | HMAC-SHA256 signed, Secure; HttpOnly; SameSite=Lax by default. Configurable via SessionLayer::with_secure / with_same_site. |
| Session binding | HMAC-keyed fingerprint (not a plain hash); UserAgentBinding + extension points for IP / TLS channel binding. |
| Session registry + forced logout | SessionRegistry::invalidate_user is error-observable and fail-closed; cooperates with suspend_user for combined identity + session revocation. |
| ID cycling | Automatic on Guest→Authenticated transition (fixation defense) and on logout; explicit AuthSession::regenerate for app-defined privilege boundaries (MFA enrollment, password change, role grant). See docs/sessions/lifecycle.md. |
| Refresh tokens | Rotation with family revocation on reuse; integration with device-binding cascade. |
| In-memory store | Testing only; no persistence, no encryption, non-constant-time lookups. |
| SQLite session store | Optional AES-256-GCM via SqliteSessionStore::new(pool, SessionCrypto::new(key)). Key rotation via with_previous_key. |
| Postgres session store | Same encryption model. Recommended for multi-instance deployments. Validated against CockroachDB via the cockroach_compat CI job. |
| MySQL / MariaDB session store | Same encryption model. Tested against MySQL 8.x and MariaDB 10.5+. |
| Valkey session store | AES-256-GCM, key rotation, TTL-managed eviction. |
| Cross-backend Store<K, V> trait | All five backends implement it for adopters that want backend-agnostic dispatch via Arc<dyn Store<…>>. |
Device identity
| Feature | Notes |
|---|---|
| Three-stage trust ladder | Unknown → Seen → Trusted (plus terminal Revoked); retention sweep demotes idle devices and purges revoked rows past the grace window. |
| Per-tenant fingerprint pepper | Stops cross-tenant fingerprint correlation; TenantPepperResolver is adopter-provided. |
| Cascade revocation | Refresh-token family compromise revokes every device that carried that family's binding. |
| CachedDeviceStore decorator | LRU + clock-driven TTL eviction; revocation propagates through set_trust_level. |
| Five DeviceStore backends | Memory, SQLite, Postgres, MySQL / MariaDB, Valkey; surface-equivalent across SQL dialects + Valkey hash storage, optional AES-256-GCM envelope on the bindings blob (SQL backends). |
| Adopter-supplied store recipe | Documented contracts (tenant scoping, atomic save, hot-path sighting, required sweep) in docs/identity/device.md for adopters with non-shipped backends (DynamoDB, MongoDB, …). |
| PII tokenisation | MemoryDevicePiiStore reference impl + adopter trait for the GDPR-scoped fields (label, last-seen IP). |
Workload identity
| Feature | Notes |
|---|---|
| Principal::{Human, Workload} unified abstraction | Same ToCedarEntity bridge for both shapes; Cedar policies authorise across without branching. |
| JwtSvidResolver | SPIFFE JWT-SVID spec adherence; mandatory spiffe:// URI in sub, trust-domain extracted and pinned. |
| MtlsResolver | SPIFFE X.509-SVID over mTLS via leaf-cert SAN URI extraction. |
| WorkloadResolver<C, F, R> | Generic JWT-bearer workload resolver covering GitHub Actions OIDC, Kubernetes service accounts, GitLab CI, Okta, Azure AD, Auth0, LocalIdP, and any other JWT-issuer via an adopter-supplied claim parser + mapping closure. Ready-made recipes for GitHub Actions + k8s SA ship in examples/workload-identity/. |
| Cloud STS exchange | aws-sts, gcp-wif, azure-fic adapters for exchanging federated workload identity for short-lived cloud credentials. |
| Outbound identity | outbound-oauth (axess as OAuth client with client_credentials / private_key_jwt) and outbound-mtls (axess presenting an mTLS identity to downstream services). |
OAuth / OIDC ceremonies
| Feature | Notes |
|---|---|
| Authorization Code + PKCE | Discovery, token exchange, nonce validation; HTTPS-enforced (localhost / 127.0.0.1 / [::1] exemption for dev). |
| Client Credentials | Real HTTP token exchange via OAuthProviderConfig. |
| Device Code (RFC 8628) | Real HTTP; device endpoint configured via with_device_authorization_endpoint. Nonce-bindable per RFC. |
| Token refresh | Provider-delegated refresh with audit logging. |
| FAPI 2.0 Baseline Profile | Pushed Authorization Requests (PAR, RFC 9126), DPoP (RFC 9449), JARM, RP-initiated logout. Strict nbf enforcement on ID tokens with clock-injected validation. |
| Back-Channel Logout | JWT signature verified via cached JWKS; sid-based session invalidation. |
| Front-Channel Logout | GET handler with sid query parameter; shared SidMap with back-channel. |
| Plain-OAuth-2.0 social login | SocialProvider (gated on social, off by default) for IdPs that don't support OIDC (GitHub user login, Twitter/X, Discord, Reddit, Spotify, …). Weaker security model than OIDC; identity comes from a TLS-trusted userinfo endpoint, not from a signed assertion. Parallel types (SocialClaims vs IdTokenClaims) keep the distinction visible at the call site. PKCE on by default. |
| LocalIdpFixture | In-process IdP minting workload JWTs against an in-memory RSA-2048 keypair + matching JWKS endpoint. RS256 + ES256, RFC 8414 discovery, multi-key rotation. |
On-behalf-of (OBO)
| Feature | Notes |
|---|---|
| delegated-stored | RFC 6749 §4.1 Authorization Code + PKCE with persisted refresh token for long-lived offline access. |
| delegated-exchange | RFC 8693 Token Exchange for short-lived per-request exchange. |
| delegated-stored-encrypted | EncryptedDelegatedCredentialStore<S, K> decorator wraps any delegated-credential backend with AES-256-GCM at rest. |
Authorization
| Feature | Notes |
|---|---|
| Cedar Policy engine | RBAC + ABAC + ReBAC. AuthzStore orchestrates evaluation; ToCedarEntity bridges principals, resources, and contexts. |
| Layered policy bundle | Base + overlay; adopters drop additional .cedar and .schema.cedar.json files into an overlay/ directory. |
| Procedural macros | require_authn!, require_partial_authn!, require_authz! guard handler functions at compile time. |
| Entity caching | EntityCache (in-process, default), MokaEntityCache, ValkeyEntityCache (cross-node). Asymmetric defaults: cache authz, not authn. |
Middleware
| Feature | Notes |
|---|---|
| CSRF | Signed double-submit cookie; required for state-changing form posts. |
| Rate limiting | Token-bucket via RateLimitLayer. KeyExtractor::{PeerIp, ForwardedIp} for direct vs trusted-proxy deployments. |
| Request ID | X-Request-Id extraction + generation. |
| Trace ID | W3C Trace Context (traceparent) propagation. |
| WebSocket | Revocation-aware wrapper that closes connections on session invalidation. |
Audit and observability
| Feature | Notes |
|---|---|
| AuthEvent regulatory audit trail | Six device-identity event variants + the full authn event surface. |
| AuthnMetrics | 17-method trait (counters + timers) with no-op defaults. |
| AuditArchiver + AuditRetentionPolicy | Hot / cold tiering with three-stage retention (90d / 7d / never defaults). FilesystemAuditArchiver reference impl behind audit-archive-fs. |
| AuthnAnalyticsSink + RichAuthnEvent | Denormalised analytics path parallel to the regulatory AuthEvent. serde + rkyv derives for Apache Iggy / ClickHouse / DuckDB / Snowflake. |
| TracingCapture | Test subscriber for asserting on emitted tracing events. |
PII classification
Axess processes personal data as part of authentication. This section documents what the library logs, stores, and never touches; useful for GDPR Data Protection Impact Assessments and SOC2 evidence packages.
What axess logs (via tracing and AuthEvent)
| Field | Where | Purpose | PII? |
|---|---|---|---|
| user_id | Structured log spans, AuthEvent | Correlate events to accounts | Pseudonymous; opaque ID, not directly identifying |
| tenant_id | Structured log spans, AuthEvent | Multi-tenant correlation | No |
| session_id | AuthEvent, tracing spans | Session correlation | No (random UUID) |
| IP address | AuditContext (extracted from headers) | Geo/fraud detection, compliance | Yes; personal data under GDPR |
| User-Agent | AuditContext, session binding | Client identification, hijack detection | Indirect; device fingerprint |
| event_type | AuthEvent | Audit trail (login, factor verified, logout) | No |
| factor_kind | AuthEvent | Which factor was attempted | No |
| success/failure | AuthEvent | Security monitoring | No |
| request_id | AuditContext | Request tracing | No |
What axess stores in session data
| Field | Storage | PII? |
|---|---|---|
| user_id / tenant_id | Session store (Memory / SQLite / Postgres / MySQL / Valkey) | Pseudonymous |
| auth_state | Session store | No |
| fingerprint | Session store (HMAC hash) | No (one-way hash) |
| custom | Session store (application-defined) | Depends on application |
What axess NEVER logs or stores
- Plaintext passwords (only Argon2id hashes are stored; input is zeroized after hashing)
- TOTP/HOTP secrets in logs (stored encrypted in FactorConfig, zeroized on drop)
- Session cookie values
- OAuth tokens (access, refresh, ID tokens); these stay in memory only during the exchange
- PKCE verifiers, CSRF state tokens (cleared from session after use)
Recommendations
- Encrypt at rest: pass a SessionCrypto::new(key) to the SQL session-store constructors (SqliteSessionStore::new(pool, crypto), same for Postgres / MySQL) or use ValkeySessionStore::new(client, key) so session data (which contains user_id) is AES-256-GCM protected. The explicit ::plaintext(pool) constructor opts out and is dev/test only.
- Log retention: configure your log aggregator to retain auth events per your compliance requirements (MiFID II: 5 years; GDPR: minimize).
- Right to erasure: deleting a user's sessions (SessionRegistry::invalidate_user) and database records satisfies GDPR erasure for axess-managed data. The custom session field is the application's responsibility.
Compliance framework mapping
GDPR
| Requirement | How Axess addresses it |
|---|---|
| Lawful basis for processing | Application's responsibility. Axess processes only what the app sends. |
| Data minimization | Sessions store only user_id, tenant_id, auth_state, and fingerprint (HMAC hash). |
| Right to erasure | SessionRegistry::invalidate_user() + database record deletion. |
| Data protection by design | AES-256-GCM encryption at rest, zeroization of secrets in memory. |
| Breach notification | Application responsibility. Axess provides audit trail via AuthEvent. |
| DPA (Data Processing Agreement) | Not applicable; Axess is a library, not a service. |
SOC2
| Trust service criteria | How Axess addresses it |
|---|---|
| CC6.1: Logical access security | MFA, session binding, Cedar policy authorization |
| CC6.3: Access revocation | SessionRegistry::invalidate_user(), session TTL |
| CC7.2: Monitoring | AuthnMetrics trait (17 hooks), AuthEvent audit trail, tracing |
| CC8.1: Change management | Application responsibility (CI/CD, version pinning) |
PCI-DSS
| Requirement | How Axess addresses it |
|---|---|
| 8.3: MFA for admin access | Multi-factor chain support (password + TOTP/FIDO2) |
| 8.6: Session management | Signed cookies, TTL, session binding, forced logout |
| 3.4: Encryption of cardholder data | AES-256-GCM session encryption (session store, not card data) |
| 10.2: Audit trails | AuthEvent records login attempts, factor verifications, logouts |
HIPAA
| Safeguard | How Axess addresses it |
|---|---|
| Access control (§164.312(a)) | MFA, Cedar RBAC/ABAC, session state machine |
| Audit controls (§164.312(b)) | AuthEvent audit trail with timestamps |
| Integrity controls (§164.312(c)) | HMAC-signed session cookies, AES-GCM encryption |
| Transmission security (§164.312(e)) | Application must terminate TLS; Axess sets Secure cookie flag |
These mappings are informational. Compliance certification requires assessment of the complete application stack, not just the authentication library.
Supported Versions
We recommend using the latest release of Axess and actively maintained branches.
Disclaimer
Axess is provided as a library. While we strive for secure defaults, the overall security of your application depends on your usage and integration.
Further reading
Operations runbook covers the production-launch checklist (key rotation, multi-instance considerations, graceful shutdown). Audit events and Audit pipeline cover the audit mechanisms the compliance frames depend on. Migration guide covers cross-version upgrade paths, including security-relevant breaking changes.
Operations runbook
This chapter is the operator-facing runbook. It covers the pre-launch checklist, the routine rotations the deployment needs to schedule, the multi-instance considerations that catch deployments off-guard, the graceful-shutdown sequence, the health-check and metrics surfaces, and the emergency procedures for the categories of incident that recur.
The chapter has two halves. The first half is operational
guidance specific to axess. The second half is the canonical
OPERATIONS.md
from the repo root, included so the deployment's runbook
checklist is in one place.
Pre-launch checklist
The list below is the minimum an axess-instrumented deployment should clear before serving real traffic. Each item is covered in detail in another chapter; the list here is the inventory.
The session signing key is loaded from the deployment's secrets
manager. The key is 32 bytes of cryptographic randomness, stable
across process restarts. The development placeholder
([0; 32] from Getting started) is replaced.
The session envelope key is loaded the same way. The two keys are independent; one is for HMAC signing the cookie, the other is for AES-256-GCM encrypting the session payload at rest. Session lifecycle and crypto envelope covers the distinction.
The fingerprint pepper is loaded for the fingerprint binding. Each tenant has its own pepper, stored alongside the tenant record; Multi-tenancy and Cookies, fingerprinting, hijack detection cover the mechanism.
The session cookie has Secure=true set. TLS terminates at the
edge; the application sees only HTTPS traffic; the cookie is
only sent on HTTPS.
The trusted-proxy list is configured. The application reads the
forwarded header (X-Forwarded-For or Forwarded) only when
the immediate peer is in the trusted list. Without this, the
fingerprint and the rate-limit keys can be spoofed.
The rate limit is configured on the login, signup, password-reset, and any other authentication-adjacent endpoints. The defaults from Rate limiting are starting points; calibrate to the deployment's legitimate-traffic envelope.
The lockout policy is configured (or the global default is accepted). The three levers (per-user, per-tenant, per-IP) all have explicit thresholds suited to the deployment's risk posture. Multi-tenancy §"Three-lever lockout" covers the configuration.
The audit pipeline is wired. The regulatory sink is the
IdentityAuthnLog the lockout policy already uses; the
analytics sink (if configured) is the deployment's SIEM
connector. The retention loop is configured with the
deployment's required retention period. Audit pipeline covers
the full pipeline configuration.
The health check is wired. /healthz (or whatever the
deployment chooses) queries the session store, the identity
store, and the device store; the response is a JSON document
that aggregates the per-component states. Operations runbook
in the canonical SECURITY/OPERATIONS section covers the
deployment expectations.
The metrics are exported. The AuthnMetrics trait is
implemented; the metric values flow into Prometheus or
OpenTelemetry; the dashboards cover the auth-attempt rate, the
failure rate, the rate-limit rejection rate, and the lockout
trigger rate. Operations runbook below covers the
production-dashboard expectations.
The Cedar policy set is loaded and validated against the schema. Startup aborts if validation fails, so a production launch with a misconfigured policy set never gets to serve traffic. Cedar policy fundamentals covers the validation flow.
The cleanup tasks are scheduled. The session cleanup, the device retention sweep, the audit retention loop, the OAuth JWKS cache refresh: all of these run on intervals; the scheduler is the application's responsibility. Backends §"SQLite" and similar sections cover the per-backend cleanup patterns.
Key rotation
The deployment has three keys to rotate on a schedule: the session signing key, the session envelope key, and the per-tenant fingerprint pepper. The mechanism is the same shape for all three: provide the new key alongside the old one for a transition window, let in-flight sessions and devices roll over, then remove the old key.
Session signing key
The signing key is what HMAC-protects the session cookie. Rotating it without invalidating sessions requires keeping the old key available for verification during the transition.
let session_layer = SessionLayer::new(store, new_signing_key)
    .with_previous_key(old_signing_key)
    .with_ttl(session_ttl);
with_previous_key accepts the old key. Cookies signed with the
old key continue to validate; new cookies sign with the new key.
After enough time for all old cookies to expire (one session TTL
plus a safety margin), the previous key can be removed.
The rotation sequence:
- Deploy the application with new_signing_key = old_key and previous_key = old_key. Nothing has changed; this is the baseline.
- Generate a fresh 32-byte signing key. Store it in the secrets manager alongside the existing one.
- Deploy the application with new_signing_key = fresh_key and previous_key = old_key. New cookies sign with the fresh key; existing cookies continue to validate against the old.
- Wait one session TTL. By the end of this window, every existing session has either expired or been refreshed (which re-signs the cookie with the fresh key).
- Deploy the application with previous_key = None (or absent). The old key is now unused.
- Remove the old key from the secrets manager.
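The transition window works because verification tries the current key first and falls back to the previous key only while one is configured. The sketch below shows that control flow in isolation. `KeyRing` and `Verified` are hypothetical types, and the `check` closure stands in for a real constant-time HMAC verification, which is out of scope here.

```rust
/// Which key accepted the cookie. Hypothetical type for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum Verified {
    Current,
    Previous,
}

/// Current key plus an optional previous key for the transition window.
pub struct KeyRing<K> {
    pub current: K,
    pub previous: Option<K>,
}

impl<K> KeyRing<K> {
    /// Tries the current key, then the previous key if one is configured.
    /// `check` is a stand-in for verifying the cookie's MAC under a key.
    pub fn verify<F>(&self, check: F) -> Option<Verified>
    where
        F: Fn(&K) -> bool,
    {
        if check(&self.current) {
            return Some(Verified::Current);
        }
        match &self.previous {
            Some(previous) if check(previous) => Some(Verified::Previous),
            _ => None,
        }
    }
}
```

A `Verified::Previous` result is the natural place to count cookies still on the old key; once that count stays at zero for a full TTL, removing the previous key rejects nothing that was still valid.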
Session envelope key
The envelope key is the AES-256-GCM key that protects the session payload at rest. Rotating it without invalidating sessions follows the same pattern as the signing-key rotation, with one difference: sessions stored before the rotation stay readable through the previous key, while new writes use the new key.
let crypto = SessionCrypto::new(new_envelope_key)
    .with_previous_key(old_envelope_key);
let store = SessionStore::new(pool, crypto);
The rotation sequence is the same as the signing key. The transition window covers one session TTL; after that, every stored session has been rewritten with the new key.
For deployments with long session TTLs (a week or a month), rotating the envelope key per the deployment's compliance cycle (quarterly, semiannually) requires the transition window to be at least the TTL. Alternative: a background scan that proactively rewrites stored sessions with the new key, finishing the rotation faster than the TTL would.
Per-tenant fingerprint pepper
The fingerprint pepper rotates per-tenant rather than globally. The mechanism is on the tenant record:
service.rotate_fingerprint_pepper(
    &tenant_id,
    new_pepper,
).await?;
The rotation invalidates every device record under the tenant.
Existing sessions remain valid (they do not depend on the device
record), but the next request from each user re-registers their
device from scratch (transitioning the device to Unknown and
walking the assurance ladder again). Users see no break; the
device store sees a churn.
The pepper rotates on tenant suspension and on demand. The default cadence is annual; tighter cadences are appropriate for high-sensitivity deployments.
Multi-instance considerations
A deployment that runs multiple application instances behind a load balancer has a handful of considerations the single-instance deployment does not.
Shared session store. The session backend must be cluster-safe: Postgres, MySQL, or Valkey. SQLite is single-writer and works only for single-instance deployments. Backends covers the choices.
Shared signing and envelope keys. Every instance must use the same keys; otherwise an instance that issued a cookie cannot have the cookie validated by a different instance that receives the next request. The secrets manager is the source of truth; each instance pulls the keys at startup.
Shared rate-limit state. If the rate limiter is keyed by
PeerIp and the buckets live in memory per instance, an
attacker hitting all instances in parallel evades the limit.
The fix is BucketStore::Valkey { client }, which moves the
state to a shared Valkey instance; every application instance
sees the same buckets.
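The evasion is easy to quantify with a toy token bucket. The sketch below is not axess's RateLimitLayer; `TokenBucket` and `admitted_across_instances` are hypothetical, refill is disabled to keep the arithmetic visible, and time is passed in explicitly so the behaviour is deterministic.

```rust
use std::time::Instant;

/// Minimal token bucket with an injected clock reading.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        let capacity = f64::from(capacity);
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: now }
    }

    /// Refills for the elapsed time, then spends one token if available.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Sends `attempts` requests round-robin across `instances` independent
/// in-memory buckets and counts how many are admitted.
pub fn admitted_across_instances(
    instances: usize,
    capacity: u32,
    attempts: usize,
    now: Instant,
) -> usize {
    if instances == 0 {
        return 0;
    }
    let mut buckets: Vec<TokenBucket> = (0..instances)
        .map(|_| TokenBucket::new(capacity, 0.0, now))
        .collect();
    let mut admitted = 0;
    for i in 0..attempts {
        if buckets[i % instances].try_acquire(now) {
            admitted += 1;
        }
    }
    admitted
}
```

With a capacity of 5, one instance admits 5 of 100 attempts while four independent instances admit 20: the effective limit scales with the instance count unless the bucket state is shared.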
Session affinity (sticky sessions). Optional, not required. The session is stored server-side; any instance can serve any session. Some deployments prefer sticky sessions to improve local cache hit rates; the trade-off is reduced resilience to instance failure.
Load-balancer-level fingerprint handling. The load balancer
must forward the real client IP through X-Forwarded-For (or
the load balancer's specific header). The application's
trusted-proxy list must include the load balancer's IP range.
Without this, every request looks like it came from the load
balancer, and the fingerprint and rate-limit keys are useless.
Graceful shutdown
A graceful shutdown drains in-flight requests before stopping the process. The pattern in axess:
The process receives a SIGTERM (from Kubernetes, systemd, or whatever orchestrator). The application's shutdown handler sets a flag that tells the HTTP server to stop accepting new connections.
In-flight requests continue. The HTTP server is in draining mode; new connections get refused (which the load balancer treats as the signal to route elsewhere), existing connections complete their request.
The shutdown handler waits for the in-flight requests to complete, with a timeout (typically 30 seconds; long enough for real requests, short enough that a stuck request does not block shutdown forever).
The audit pipeline drains. The shutdown handler triggers the pipeline to flush its buffer to all sinks. The wait is bounded (typically 10 seconds); buffered events that do not flush in time are written to a local recovery log for the next process start to pick up.
The session store closes. The connection pool drains; in-flight queries complete; the pool releases its connections.
The process exits.
The pattern is what Axum's with_graceful_shutdown enables; the
application wires the shutdown signal through the standard
shutdown handler. No axess-specific code is needed beyond the
audit-pipeline drain.
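The bounded audit drain described above (flush what fits in the window, hand the rest to a recovery log) can be sketched synchronously with the standard library. A real pipeline would be asynchronous, and `drain_with_deadline` is a hypothetical helper rather than an axess API; the `flush_one` closure stands in for delivery to a sink.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Flushes buffered events in order until the buffer is empty, a flush
/// fails, or the time budget is spent. Returns the events that were NOT
/// flushed so the caller can write them to a local recovery log.
pub fn drain_with_deadline<E, F>(
    buffer: &mut VecDeque<E>,
    mut flush_one: F,
    budget: Duration,
) -> Vec<E>
where
    F: FnMut(&E) -> bool,
{
    let deadline = Instant::now() + budget;
    while Instant::now() < deadline {
        match buffer.pop_front() {
            Some(event) => {
                if !flush_one(&event) {
                    // Keep ordering intact: put the failed event back first.
                    buffer.push_front(event);
                    break;
                }
            }
            None => break,
        }
    }
    buffer.drain(..).collect()
}
```

Returning the leftovers, rather than dropping them, keeps the decision about the recovery log with the shutdown handler.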
Health checks and metrics
A production deployment exposes /healthz and /metrics
endpoints. The health check confirms the application's
backends are reachable; the metrics expose the operational
counters.
The health check pattern:
let health = Arc::new(
    CompositeHealthCheck::new()
        .add("session_store", session_store.clone())
        .add("identity_store", identity_store.clone())
        .add("device_store", device_store.clone())
);

async fn healthz(State(state): State<AppState>) -> impl IntoResponse {
    let status = state.health.check_all().await;
    let code = if status.is_healthy() {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };
    let body = serde_json::json!({
        "status": if status.is_healthy() { "healthy" } else { "unhealthy" },
        "components": status.components,
    });
    (code, axum::Json(body))
}
Each backend that implements HealthCheck provides its own probe
(typically a bounded SELECT 1 for SQL backends or a PING for
Valkey). The composite aggregates the results; the endpoint
returns 200 on all-healthy or 503 on any-unhealthy.
The metrics pattern:
async fn metrics_endpoint(State(state): State<AppState>) -> impl IntoResponse {
    let m = &state.metrics;
    axum::Json(serde_json::json!({
        "auth_attempts": m.auth_attempts.load(Ordering::Relaxed),
        "auth_successes": m.auth_successes.load(Ordering::Relaxed),
        "auth_failures": m.auth_failures.load(Ordering::Relaxed),
        "rate_limit_rejections": m.rate_limit_rejections.load(Ordering::Relaxed),
    }))
}
The metrics implementation (covered in AuthnMetrics trait)
exposes the counters; the endpoint serialises them in whatever
format the deployment's metrics system expects (Prometheus
text format, JSON, OpenMetrics).
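For deployments that scrape with Prometheus, the same four counters can be rendered in the text exposition format instead of JSON. The sketch below assumes a plain struct of atomics; `Counters` and `render_prometheus` are hypothetical and stand in for whatever the application's AuthnMetrics implementation holds. The `_total` suffix follows the Prometheus naming convention for counters.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the application's metrics state (hypothetical).
#[derive(Default)]
pub struct Counters {
    pub auth_attempts: AtomicU64,
    pub auth_successes: AtomicU64,
    pub auth_failures: AtomicU64,
    pub rate_limit_rejections: AtomicU64,
}

/// Renders the counters as Prometheus text: a TYPE line, then the sample.
pub fn render_prometheus(counters: &Counters) -> String {
    let rows = [
        ("auth_attempts", &counters.auth_attempts),
        ("auth_successes", &counters.auth_successes),
        ("auth_failures", &counters.auth_failures),
        ("rate_limit_rejections", &counters.rate_limit_rejections),
    ];
    let mut out = String::new();
    for (name, counter) in rows.iter() {
        let value = counter.load(Ordering::Relaxed);
        // Writing into a String cannot fail, so the results are ignored.
        let _ = writeln!(out, "# TYPE {}_total counter", name);
        let _ = writeln!(out, "{}_total {}", name, value);
    }
    out
}
```

The handler would return this string with a `text/plain` content type in place of the JSON body.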
The dashboards the operational team uses combine these counters with the audit-event volumes from the SIEM. Audit events §"SIEM query patterns" covers the SIEM-side queries.
Common failures and remedies
The categories of failure that recur in production deployments, and the standard responses.
Spike in auth_failures: typically a credential-stuffing attack
or a credential leak elsewhere. The rate limiter should be
absorbing the bulk; the lockout policy catches the rest.
Investigate the source IPs in the failure events; if the spike
is concentrated on a small set of IPs, block them at the WAF; if
it is spread broadly, the leak is the larger concern.
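The "concentrated versus spread" question can be answered with one number: the share of failures that come from the busiest few source IPs. A minimal sketch, assuming the failure events have already been reduced to a list of source-IP strings; `top_k_share` is a hypothetical helper and any threshold applied to its result is the deployment's call.

```rust
use std::collections::HashMap;

/// Fraction of failure events (0.0 to 1.0) that originate from the `k`
/// most frequent source IPs. Returns 0.0 for an empty input.
pub fn top_k_share(source_ips: &[&str], k: usize) -> f64 {
    if source_ips.is_empty() {
        return 0.0;
    }
    // Count failures per source IP.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for ip in source_ips {
        *counts.entry(*ip).or_insert(0) += 1;
    }
    // Sort the per-IP counts in descending order and sum the top k.
    let mut sorted: Vec<usize> = counts.values().copied().collect();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    let top: usize = sorted.iter().take(k).sum();
    top as f64 / source_ips.len() as f64
}
```

A share close to 1.0 for a small `k` points at a handful of sources that a WAF rule can block; a low share across many sources points toward the broader credential-leak case.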
Spike in rate_limit_rejections: either an attack (real
attacker getting throttled) or a misconfiguration (legitimate
traffic hitting a limit too tight). Rate limiting
§"Distinguishing attack from misconfiguration" covers the
signals.
Health check failing on session store: the session backend is unreachable. Investigate the database. Until the backend is back, the application cannot serve authenticated traffic; the load balancer treats the 503 as a signal to route around the instance.
Session cookie validation failing for known-good sessions: the signing key has changed without the previous-key transition. Add the previous key to the configuration; sessions will start validating again as soon as the deployment picks up the change.
Spike in DeviceFingerprintMismatch events: typically the
fingerprint tolerance is too tight. Calibrate against the warn
rate; widen the IP-prefix tolerance or the user-agent matching.
Cookies, fingerprinting, hijack detection covers the tolerance
configuration.
Audit pipeline buffer filling: the analytics sink is slow or
down. Inspect the sink's metrics; if it is the SIEM under
maintenance, the buffer fills until the policy fires
(DropOldest, Block, or ShutdownAuthn). Plan for the
maintenance window through the deployment's standard
notification process.
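The three policy names above suggest three different reactions to a full buffer. The sketch below shows one plausible reading of them on a bounded queue; the semantics are assumed from the names, and `BoundedBuffer`, `OverflowPolicy` and `PushOutcome` are hypothetical types, not the axess pipeline.

```rust
use std::collections::VecDeque;

/// Policy names taken from the text; behaviour here is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverflowPolicy {
    DropOldest,
    Block,
    ShutdownAuthn,
}

/// What happened to a pushed event.
#[derive(Debug, PartialEq, Eq)]
pub enum PushOutcome<E> {
    Accepted,
    /// Accepted after evicting the oldest buffered event (returned here).
    AcceptedAfterDropping(E),
    /// Buffer full: the caller should wait and retry with this event.
    WouldBlock(E),
    /// Buffer full: the caller should stop accepting authentication.
    Shutdown(E),
}

pub struct BoundedBuffer<E> {
    queue: VecDeque<E>,
    capacity: usize,
    policy: OverflowPolicy,
}

impl<E> BoundedBuffer<E> {
    pub fn new(capacity: usize, policy: OverflowPolicy) -> Self {
        // A zero capacity is bumped to one so DropOldest always has a slot.
        let capacity = capacity.max(1);
        Self { queue: VecDeque::with_capacity(capacity), capacity, policy }
    }

    pub fn len(&self) -> usize {
        self.queue.len()
    }

    pub fn push(&mut self, event: E) -> PushOutcome<E> {
        if self.queue.len() < self.capacity {
            self.queue.push_back(event);
            return PushOutcome::Accepted;
        }
        match self.policy {
            OverflowPolicy::DropOldest => match self.queue.pop_front() {
                Some(oldest) => {
                    self.queue.push_back(event);
                    PushOutcome::AcceptedAfterDropping(oldest)
                }
                None => PushOutcome::WouldBlock(event),
            },
            OverflowPolicy::Block => PushOutcome::WouldBlock(event),
            OverflowPolicy::ShutdownAuthn => PushOutcome::Shutdown(event),
        }
    }
}
```

The trade-off is visible in the outcomes: DropOldest keeps authentication available at the cost of losing audit events, while the other two preserve every event at the cost of latency or availability.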
Canonical OPERATIONS.md
The rest of this chapter is the canonical OPERATIONS.md from
the repo root.
Axess: Operations Guide
Deployment, key management, and operational procedures for production environments.
Key rotation (zero-downtime)
Session signing keys and encryption keys can be rotated without invalidating active sessions.
Signing key rotation
The signing key authenticates session cookies via HMAC-SHA256. Rotation requires a code change (new key), but SessionLayer does not support a previous signing key; rotating the signing key invalidates all active sessions.
Procedure:
- Generate a new 32-byte signing key in your secrets manager.
- Deploy the new key. All active sessions become invalid (users must re-authenticate).
- Schedule signing key rotation during low-traffic windows.
Encryption key rotation
SessionCrypto supports transparent key rotation via with_previous_key():
let crypto = SessionCrypto::new(new_key)
    .with_previous_key(old_key);
Procedure:
- Generate a new 32-byte encryption key in your secrets manager.
- Deploy with both keys: new as current, old as previous.
- Sessions encrypted with the old key are transparently re-encrypted with the new key on next access.
- After all sessions have been accessed (or after the session TTL expires), remove the previous key from the deployment.
- Monitor the "session decrypted with previous (rotated) key" log message to track migration progress.
Multi-instance deployment
Shared state requirements
| Component | Sharing requirement |
|---|---|
| Signing key | Must be identical across all instances |
| Encryption key | Must be identical across all instances |
| Session store | Valkey, PostgreSQL, or MySQL (shared). SQLite is single-instance only. |
| Session registry | Valkey-backed (ValkeySessionRegistry). In-memory is single-instance only. |
| OIDC sid_map | In-memory per instance. Back-channel logout works when the IdP sends to the instance that handled the login. Use sticky sessions or a shared store for full coverage. |
| Rate limit buckets | In-memory per instance. For distributed rate limiting, use an external solution (e.g. Valkey-based sliding window at the reverse proxy). |
Health checks
Implement a /healthz endpoint using the CompositeHealthCheck trait:
use axess::{CompositeHealthCheck, HealthCheck, HealthStatus};
use axum::{extract::State, http::StatusCode, response::IntoResponse};

async fn healthz(State(health): State<CompositeHealthCheck>) -> impl IntoResponse {
    match health.check().await {
        HealthStatus::Healthy => StatusCode::OK,
        HealthStatus::Degraded(_) => StatusCode::OK, // still serving
        HealthStatus::Unhealthy(_) => StatusCode::SERVICE_UNAVAILABLE,
    }
}
All session store implementations (SqliteSessionStore, PostgresSessionStore, MysqlSessionStore, ValkeySessionStore) implement HealthCheck.
Session store migration
To migrate from one session store to another (e.g. SQLite to Valkey):
- Dual-write phase: deploy a wrapper that writes to both stores and reads from the new store first, falling back to the old store.
- Cutover: once the old store's TTL has expired (default 24h), switch reads to the new store only.
- Cleanup: remove the old store configuration.
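The dual-write wrapper from the first step can be sketched against a toy key-value trait. Everything here is illustrative: `KvStore`, `MemStore` and `DualWriteStore` are hypothetical, synchronous and in-memory, whereas the real session stores are asynchronous and carry TTLs.

```rust
use std::collections::HashMap;

/// Toy store interface standing in for a session-store trait (hypothetical).
pub trait KvStore {
    fn put(&mut self, id: &str, payload: Vec<u8>);
    fn get(&self, id: &str) -> Option<Vec<u8>>;
}

/// In-memory store used to exercise the wrapper.
#[derive(Default)]
pub struct MemStore(HashMap<String, Vec<u8>>);

impl KvStore for MemStore {
    fn put(&mut self, id: &str, payload: Vec<u8>) {
        self.0.insert(id.to_owned(), payload);
    }

    fn get(&self, id: &str) -> Option<Vec<u8>> {
        self.0.get(id).cloned()
    }
}

/// Writes go to both stores; reads prefer the new store and fall back to
/// the old one for sessions created before the migration started.
pub struct DualWriteStore<N, O> {
    pub new_store: N,
    pub old_store: O,
}

impl<N: KvStore, O: KvStore> KvStore for DualWriteStore<N, O> {
    fn put(&mut self, id: &str, payload: Vec<u8>) {
        self.old_store.put(id, payload.clone());
        self.new_store.put(id, payload);
    }

    fn get(&self, id: &str) -> Option<Vec<u8>> {
        self.new_store.get(id).or_else(|| self.old_store.get(id))
    }
}
```

Because the wrapper implements the same trait as the stores it wraps, the cutover step is a one-line change from `DualWriteStore` back to the new store alone.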
There is no built-in migration tool. Sessions are short-lived (default 24h TTL), so a simpler approach is:
- Deploy the new store.
- Accept that active sessions on the old store will expire naturally.
- New sessions are created on the new store.
Session cleanup
SQLite, PostgreSQL, and MySQL stores accumulate expired sessions. Use the built-in helper:
let store = SqliteSessionStore::new(pool, crypto);
store.init_schema().await?;
let _cleanup = store.spawn_cleanup_task(Duration::from_secs(3600));
PostgresSessionStore::spawn_cleanup_task and MysqlSessionStore::spawn_cleanup_task work the same way. The returned JoinHandle aborts the loop when dropped; store it for the lifetime of the application (or pass it through to graceful shutdown, see below).
Valkey manages expiration natively via TTL; no cleanup needed.
Graceful shutdown
Axess spawns long-lived background tasks for everything that needs to run on a wall-clock cadence: session cleanup, JWKS refresh, back-channel-logout sid_map aging. None of these survive SIGTERM unless the application drains them; tokio::spawn tasks are unconditionally aborted when the runtime stops.
The standard pattern is Axum's with_graceful_shutdown plus explicit abort/await of every JoinHandle axess returns:
use axum::serve;
use tokio::signal;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // `pool`, `crypto`, `oauth_provider`, and `app` are built earlier (elided).

    // ── Build stores and spawn axess background tasks ─────────────
    let session_store = SqliteSessionStore::new(pool.clone(), crypto);
    session_store.init_schema().await?;
    let cleanup_handle = session_store.spawn_cleanup_task(
        std::time::Duration::from_secs(3600),
    );
    let jwks_handle = oauth_provider.spawn_jwks_refresh(
        std::time::Duration::from_secs(3600),
    );

    // ── Shared shutdown signal ────────────────────────────────────
    let shutdown = async {
        let ctrl_c = async {
            signal::ctrl_c().await.ok();
        };
        #[cfg(unix)]
        let term = async {
            use tokio::signal::unix::{signal, SignalKind};
            match signal(SignalKind::terminate()) {
                Ok(mut s) => {
                    s.recv().await;
                }
                // Handler could not be installed: wait on Ctrl-C only.
                Err(_) => std::future::pending::<()>().await,
            }
        };
        // No SIGTERM off Unix: never resolve, so only Ctrl-C triggers shutdown.
        #[cfg(not(unix))]
        let term = std::future::pending::<()>();
        tokio::select! { _ = ctrl_c => {}, _ = term => {} }
    };

    // ── Serve until SIGTERM/SIGINT ────────────────────────────────
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    serve(listener, app)
        .with_graceful_shutdown(shutdown)
        .await?;

    // ── Drain background tasks ────────────────────────────────────
    // Aborting is safe; both loops persist via the database, so a
    // killed cleanup tick at most leaves expired rows for the next
    // scheduled run, and a killed JWKS tick leaves the cached JWKS
    // intact until the next process serves a request.
    cleanup_handle.abort();
    jwks_handle.abort();
    let _ = cleanup_handle.await;
    let _ = jwks_handle.await;
    Ok(())
}
What survives shutdown vs what is lost
| State | Survives? | Notes |
|---|---|---|
| Persisted sessions (SQL / Valkey) | Yes | Stored in DB; new process re-reads. |
| MemorySessionStore contents | No | In-process only; everyone is logged out. |
| MemorySessionRegistry contents | No | Same; fresh registry on restart. |
| Refresh tokens (SQL / Valkey) | Yes | Hash + family in DB; rotation continues seamlessly. |
| JWKS cache | No (re-fetched) | First post-restart OAuth callback warms it. |
| sid_map (back-channel logout) | No | OIDC sid → local session mapping is in-process. Sessions remain valid; only the sid-keyed lookup is lost, so a back-channel logout that arrives before re-login will silently no-op. Acceptable; the session still expires on its TTL. |
| In-flight HTTP request being served | Yes (via with_graceful_shutdown) | Axum waits for active connections to close before returning from serve. |
| In-flight cleanup_expired query | Aborted | The next scheduled cleanup picks up the slack. |
| In-flight refresh_jwks HTTP call | Aborted | The next request triggers a fresh fetch on demand. |
Why drain the handles after serve returns
with_graceful_shutdown only drains in-flight HTTP requests. The tokio::spawn'd cleanup and JWKS refresh tasks are independent of the HTTP server and keep running until the runtime is dropped. Without an explicit abort() followed by awaiting the handle, each task holds a reference to its store clone and the runtime keeps it alive. At minimum this delays shutdown to the next tick; at worst (with tokio::main(flavor = "current_thread")) it deadlocks, because the abort signal can't be processed while the runtime is also waiting for the task to yield.
Monitoring and alerting
Recommended SLOs and alert rules
The thresholds below are starting points for a single-region deployment serving thousands to low-millions of users. Tune to your traffic shape; a free-tier app with no MFA will see very different baselines than a banking dashboard with mandatory FIDO2. The general rule: alert on ratios and rates, not absolute counts, so an alert that fires at 1k DAU still fires at 100k DAU without re-tuning.
Critical (page on-call)
| Signal | Threshold | Why it matters |
|---|---|---|
| auth_failure / (auth_success + auth_failure) | > 50% for 5 min | Either a brute-force campaign is in progress or the IdP is down. Either way, real users are locked out. |
| account_locked rate | > 10 / minute for 5 min | Sustained password-spray; tens of accounts being locked per minute is well above any realistic legitimate spike. |
| session_binding_mismatch rate | > 1 / minute per tenant for 5 min | Either a stolen session cookie is being replayed across user agents, or a buggy client is rotating UAs mid-session. Investigate immediately. |
| Health check returns Unhealthy | for 2 consecutive checks | Session store / database is unreachable; users cannot log in. |
| JWKS RwLock was poisoned log | any occurrence | A panic happened while holding the JWKS lock; OAuth verification may be silently degraded. |
Warning (alert in chat / ticket queue)
| Signal | Threshold | Why it matters |
|---|---|---|
| factor_failure / factor_attempt (per factor kind) | > 30% for 15 min | Targeted factor probe (e.g. TOTP guessing) or a regression in the factor verification code. |
| rate_limit_rejected / (rate_limit_allowed + rate_limit_rejected) | > 5% for 10 min | Either the rate limit is mis-tuned for legitimate traffic or an attacker is sustained-firing requests. |
| sid_map capacity reached; evicted oldest mapping log | > 1 / minute | OAuth login throughput exceeds the 10 K-entry sid_map cap; back-channel logout precision degrades (some sid lookups will miss). Increase MAX_SID_MAP_ENTRIES or shorten the TTL. |
| session decrypted with previous (rotated) key log | persists > 7 days after rotation | Long-lived sessions are still on the old key. The next rotation will invalidate them; communicate the cutover. |
| account_locked rate | > 1 / minute for 5 min | Background brute force or aggressive credential stuffing. Below paging threshold but worth watching. |
| session custom data exceeds size limit log | any occurrence | Application is writing too much to the session; investigate before users hit it in production. |
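The sid_map capacity warning in the table above comes from a bounded map that drops its oldest mapping when full. The sketch below shows one way such a structure can behave. `BoundedSidMap` is a hypothetical type written for illustration; the real sid_map's internals, its TTL handling, and its exact eviction bookkeeping are not described in this chapter.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bounded sid -> session-id map that evicts the oldest entry when full.
struct BoundedSidMap {
    cap: usize,
    order: VecDeque<String>,          // insertion order, oldest at the front
    entries: HashMap<String, String>, // OIDC sid -> local session id
    evictions: u64,                   // how often the capacity warning would fire
}

impl BoundedSidMap {
    fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be non-zero");
        Self { cap, order: VecDeque::new(), entries: HashMap::new(), evictions: 0 }
    }

    fn insert(&mut self, sid: &str, session_id: &str) {
        if self.entries.insert(sid.to_string(), session_id.to_string()).is_some() {
            return; // existing sid updated in place; its age is unchanged
        }
        self.order.push_back(sid.to_string());
        if self.entries.len() > self.cap {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
                self.evictions += 1; // the point where the warning would be logged
            }
        }
    }

    fn lookup(&self, sid: &str) -> Option<&str> {
        self.entries.get(sid).map(String::as_str)
    }
}

fn main() {
    let mut map = BoundedSidMap::new(2);
    map.insert("sid-a", "session-1");
    map.insert("sid-b", "session-2");
    map.insert("sid-c", "session-3"); // evicts sid-a
    println!("sid-a: {:?}, evictions: {}", map.lookup("sid-a"), map.evictions);
}
```

A back-channel logout for an evicted sid finds nothing to invalidate, which is why a sustained eviction rate is worth a warning even though no request fails.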
Info (dashboard only, no alert)
auth_attempt, auth_success, factor_attempt, factor_success, session_created, session_invalidated, rate_limit_allowed: useful for trend dashboards, capacity planning, and as denominators for the ratio-based alerts above. Avoid alerting on absolute counts; they swing wildly with traffic.
Computing rates from counters
AuthnMetrics exposes counters; alerts live in your monitoring system (Prometheus / Datadog / Grafana / CloudWatch). The standard pattern in Prometheus terms:
# Auth failure rate over 5 minutes
rate(axess_auth_failure_total[5m])
/ (rate(axess_auth_success_total[5m]) + rate(axess_auth_failure_total[5m]))
> 0.5
Implement the AuthnMetrics trait against your metrics client and emit _total-suffixed counters so that the rate queries above compose cleanly.
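For readers without a Prometheus-style query language to hand, the same ratio can be computed from two counter readings. This is a sketch, not part of axess: `Sample`, `failure_ratio`, and `should_page` are hypothetical names, and the reset handling is a simplification (Prometheus' own `rate()` compensates for counter resets differently).

```rust
/// Two readings of the monotonically increasing counters, taken one window apart.
#[derive(Debug, Clone, Copy)]
struct Sample {
    auth_success_total: u64,
    auth_failure_total: u64,
}

/// Failure ratio over the window between two samples.
/// Returns None when there was no traffic, so a quiet window never divides by zero.
fn failure_ratio(earlier: Sample, later: Sample) -> Option<f64> {
    // saturating_sub treats a counter reset (process restart) as zero progress.
    let ok = later.auth_success_total.saturating_sub(earlier.auth_success_total);
    let failed = later.auth_failure_total.saturating_sub(earlier.auth_failure_total);
    let total = ok + failed;
    if total == 0 {
        None
    } else {
        Some(failed as f64 / total as f64)
    }
}

/// True when the ratio exceeds the threshold (0.5 for the critical rule above).
fn should_page(earlier: Sample, later: Sample, threshold: f64) -> bool {
    failure_ratio(earlier, later).map_or(false, |r| r > threshold)
}

fn main() {
    let t0 = Sample { auth_success_total: 100, auth_failure_total: 10 };
    let t1 = Sample { auth_success_total: 140, auth_failure_total: 70 };
    println!("ratio = {:?}, page = {}", failure_ratio(t0, t1), should_page(t0, t1, 0.5));
}
```

Because the result is a ratio of deltas, the same threshold applies unchanged whether the window saw a hundred logins or a hundred thousand.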
Key log messages
| Message | Severity | Action |
|---|---|---|
"session decrypted with previous (rotated) key" | Info | Key rotation in progress; monitor until gone |
"JWKS RwLock was poisoned" | Warn | Investigate what panicked while holding the lock |
"sid_map capacity reached" | Warn | Many OAuth logins; consider increasing capacity |
"session custom data exceeds size limit" | Warn | Application is writing too much to session |
"login rejected by tenant IP policy" | Warn | Legitimate user from blocked IP, or attack |
Emergency procedures
Force-logout all users
// Via session registry (if configured):
registry.invalidate_user(&user_id).await;

// Nuclear option: clear the session store.
store.cleanup_expired().await; // only clears expired
// For immediate full clear: truncate the sessions table or flush Valkey.
Encryption key compromise
- Generate a new encryption key immediately.
- Deploy with new key only (no previous key); this invalidates all active sessions.
- Rotate the signing key as well (the attacker may have decrypted session data containing the HMAC tag).
- Review audit logs for suspicious session activity during the compromise window.
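The difference between a routine rotation and a compromise response comes down to whether the previous key stays in the key ring. The sketch below uses a toy stand-in for the cipher (a blob simply records which key sealed it), so only the key-selection logic is meaningful. `Key`, `Sealed`, `KeyRing`, and `Opened` are hypothetical names, not the axess crypto API.

```rust
/// Toy key: real code would hold key bytes for an authenticated cipher.
#[derive(Debug, Clone, PartialEq)]
struct Key(u64);

/// Toy sealed blob: remembers which key sealed it instead of encrypting.
struct Sealed {
    key_id: u64,
    payload: String,
}

fn seal(key: &Key, payload: &str) -> Sealed {
    Sealed { key_id: key.0, payload: payload.to_string() }
}

/// Succeeds only with the key that sealed the blob.
fn open_with(key: &Key, blob: &Sealed) -> Option<String> {
    if blob.key_id == key.0 { Some(blob.payload.clone()) } else { None }
}

struct KeyRing {
    current: Key,
    previous: Option<Key>,
}

#[derive(Debug, PartialEq)]
enum Opened {
    Current(String),
    Previous(String), // where the "previous (rotated) key" message would be logged
    Rejected,
}

impl KeyRing {
    fn open(&self, blob: &Sealed) -> Opened {
        if let Some(p) = open_with(&self.current, blob) {
            return Opened::Current(p);
        }
        if let Some(prev) = &self.previous {
            if let Some(p) = open_with(prev, blob) {
                return Opened::Previous(p);
            }
        }
        Opened::Rejected
    }
}

fn main() {
    let old = Key(1);
    let session = seal(&old, "user=alice");
    let routine = KeyRing { current: Key(2), previous: Some(old.clone()) };
    let compromise = KeyRing { current: Key(2), previous: None };
    println!("{:?} / {:?}", routine.open(&session), compromise.open(&session));
}
```

Deploying with no previous key is what turns every existing session into a rejected one, which is the intended effect after a compromise.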
Further reading
Security posture covers the production-readiness posture and the compliance touch-points. Audit pipeline covers the audit retention and the buffer-overflow policies. Migration guide covers cross-version upgrades and the security-relevant breaking changes. Backends covers the per-backend operational notes (CockroachDB caveats, MySQL timezone handling, Valkey eviction policies).
Migration guide
This chapter is the cross-version migration reference. Each axess release that ships a breaking change documents the change here, with the symptom (what the compiler or the runtime will tell you), the rationale (why the change happened), and the fix (what to update in adopter code). Releases are ordered by version, with the most recent breaks first.
Within a release, entries are sorted by what you will see, not by what we changed. A breaking change manifests as a compile error (the type system rejected something it accepted before), a runtime error (a deserialization fails, a config is rejected), or a behaviour change (the same code does something subtly different). The sections below group by symptom; finding your case is faster than reading the full changelog.
Upcoming: 0.1.x to 0.2.0
The first crates.io publish is the 0.2.0 release. The accumulated
changes since the previous stable line are catalogued
exhaustively in
CHANGELOG.md;
this chapter covers the breaking ones an adopter has to act on.
Compile errors you will see
use axess::PolicyStore becomes use axess::AuthzStore. The
authorisation entry point was renamed for consistency with the
Authz* prefix convention. The new name better describes what
the type is (an immutable store of policies plus schema, not
just a policy collection).
use axess::AxessSession becomes use axess::AuthSession. The
session extractor was renamed; the new prefix is the shared
Auth* prefix from the naming conventions (Architecture at a
glance).
use axess::backends::SqliteStore becomes
use axess::backends::sqlite::SessionStore. The backend module
layout was reorganised so the same trait name (SessionStore)
appears under each backend's namespace; the previous flat
SqliteStore symbol no longer exists.
AuthnService::new(backend) becomes
AuthnService::new(identity_store, factor_store). The service
now takes the two stores separately so adopters can wire
different implementations (for instance, a read-replica
identity store and a write-only factor store). When the two
stores are the same type (the common case), pass it twice.
SessionLayer::with_secret becomes
SessionLayer::with_signing_key. The previous name was
ambiguous; the new name says what the bytes are used for
(HMAC signing the cookie).
AuthState::Logged becomes AuthState::Authenticated. The
state was renamed for clarity; nothing else changed about the
variant.
Configuration changes
The axess_factors_default_password_hasher config function is
gone. Argon2id is now the default; deployments that need a
different hasher (PBKDF2, legacy bcrypt) implement a custom
factor and register it. Factors and methods covers the
extension pattern.
The AuditPipeConfig shape changed. The sinks: Vec<Box<dyn Sink>>
field was replaced with explicit regulatory_sink: Arc<...>
and analytics_sink: Option<Arc<...>> fields, reflecting the
dual-stream architecture from Audit pipeline. The change
makes the wire-stable vs. enriched stream distinction explicit
in the config.
The RateLimitConfig no longer accepts a key_fn field
directly; use KeyExtractor::Custom(Arc<dyn KeyExtractorFn>)
to provide a custom extractor, or use one of the built-in
variants (PeerIp, SessionId, UserId, TenantId,
WorkloadId, Composite). The change is to make the common
cases discoverable without losing the escape hatch.
Behaviour changes
The Authenticating state now carries a Vec<FactorKind> for
remaining rather than the previous Option<FactorKind>. The
change is what enables multi-factor methods longer than two
factors. Code that pattern-matched on Some(kind) needs to
adapt to remaining.first() or to iterate over the list.
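A sketch of what the adapted pattern match can look like. The enums below are trimmed, hypothetical mirrors written for this example (the real AuthState has five variants, and the FactorKind variants shown are assumptions); only the `remaining: Vec<FactorKind>` shape is taken from the text above.

```rust
/// Hypothetical factor kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FactorKind {
    Password,
    Totp,
    Passkey,
}

/// Trimmed, hypothetical mirror of the session state.
enum AuthState {
    Guest,
    Authenticating { remaining: Vec<FactorKind> },
    Authenticated,
}

/// Replacement for code that used to match on `Some(kind)`:
/// the next factor to prompt for is the head of the list.
fn next_factor(state: &AuthState) -> Option<FactorKind> {
    match state {
        AuthState::Authenticating { remaining } => remaining.first().copied(),
        AuthState::Guest | AuthState::Authenticated => None,
    }
}

/// Code that needs the whole plan can count or iterate over the list instead.
fn factors_left(state: &AuthState) -> usize {
    match state {
        AuthState::Authenticating { remaining } => remaining.len(),
        AuthState::Guest | AuthState::Authenticated => 0,
    }
}

fn main() {
    let state = AuthState::Authenticating {
        remaining: vec![FactorKind::Totp, FactorKind::Passkey],
    };
    println!("next = {:?}, left = {}", next_factor(&state), factors_left(&state));
    println!("guest next = {:?}", next_factor(&AuthState::Guest));
    println!("done next = {:?}", next_factor(&AuthState::Authenticated));
    println!("first factor in a typical flow: {:?}", FactorKind::Password);
}
```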
The lockout policy now defaults to per-IP in addition to
per-user and per-tenant. The previous default only locked the
user; the new default also throttles the source IP. Deployments
that explicitly want only per-user lockout configure
LockoutPolicy::per_user_only().
The session cookie's SameSite attribute now defaults to Lax
rather than Strict. The change is to match modern browser
defaults and to admit cross-site link-to-app navigations as
legitimate. Deployments that need Strict configure it
explicitly.
The fingerprint binding now defaults to FingerprintPolicy::Warn
rather than FingerprintPolicy::Reauth. The new default is
quieter during initial rollout. Production deployments that
want stricter posture lift to Reauth or Revoke after
calibrating the warn rate (Cookies, fingerprinting, hijack
detection covers the calibration).
Schema migrations
The users table gained a tenant_status field for the
tenant-suspension support. The migration is a single ALTER TABLE
that adds the column with a default value. The
examples/sqlite/migrations/
shows the SQL.
The devices table gained a fingerprint_hash field and lost
the previous fingerprint_raw field. The migration is destructive:
the fingerprint_raw field carried PII that the new design
hashes before storage (Device identity covers the rationale).
Adopters who want to preserve the audit trail of past fingerprints
write the migration accordingly; adopters who do not, just
drop the column.
The authn_attempts table gained an event_kind enum field
that distinguishes between attempt outcomes, rather than relying
on a separate outcome string. The migration is non-destructive;
the outcome field stays for backward compatibility and is
populated from event_kind automatically.
The session-data schema version bumped from 1 to 2. The new
version adds a device_id field on Authenticated (for the
device-binding work covered in Device identity). The
schema-migration code (Schema migration) handles existing
sessions transparently; no manual data migration is needed.
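To make "handles existing sessions transparently" concrete, here is a minimal sketch of a read-time upgrade from version 1 to version 2. The struct names and the choice to represent a missing device as `None` are assumptions made for the example; the actual session-data layout is covered in Schema migration.

```rust
/// Hypothetical version-1 payload: no device binding.
#[derive(Debug, PartialEq)]
struct SessionV1 {
    user_id: String,
}

/// Hypothetical version-2 payload: adds the device binding.
#[derive(Debug, PartialEq)]
struct SessionV2 {
    user_id: String,
    device_id: Option<String>, // None for sessions created before the upgrade (assumed)
}

/// What the store might hand back, tagged by schema version.
enum Stored {
    V1(SessionV1),
    V2(SessionV2),
}

/// Upgrade on read: old payloads are lifted to the current shape,
/// current payloads pass through untouched.
fn upgrade(stored: Stored) -> SessionV2 {
    match stored {
        Stored::V1(v1) => SessionV2 { user_id: v1.user_id, device_id: None },
        Stored::V2(v2) => v2,
    }
}

fn main() {
    let old = Stored::V1(SessionV1 { user_id: "alice".to_string() });
    println!("{:?}", upgrade(old));
}
```

Because the upgrade happens when a session is read, no batch job has to touch the stored rows.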
Workspace structure changes
The axess-delegated crate folded back into axess-core. The
adopter import paths stay the same (axess::delegated::*
continues to work through facade re-export); the
Cargo.toml no longer needs an explicit axess-delegated
dependency, just the delegated feature on axess. The
workspace dropped from 11 to 10 library crates.
Recommended migration sequence
For deployments running on the 0.1.x line:
The first step is to read this chapter end-to-end. Make a checklist of every change that applies to your code.
The second step is a parallel-deploy approach. Stand up a 0.2.0 build alongside the production 0.1.x; route a small fraction of traffic to it; observe behaviour. The session cookies between the two versions are not compatible (the schema-migration mechanism handles cookie reads but not writes across major versions), so the parallel deploy needs to be on isolated session storage.
The third step is the cutover. Once the 0.2.0 build has been green for at least the session TTL on the production-like sample, route 100% of traffic to it. The 0.1.x build can be decommissioned after a roll-back window has passed without incident.
The roll-back path: if 0.2.0 surfaces problems, route traffic
back to 0.1.x; the sessions that started under 0.2.0 will be
invalid against 0.1.x and will land as Guest, prompting
re-login. The user-visible impact is one re-login; the
behavioural impact is bounded.
Future migrations
The pattern from 0.1.x to 0.2.0 is the pattern future migrations will follow. Each migration documents itself here, sorted by release. The pattern:
Symptom: what the compiler or the runtime will tell you.
Rationale: why the change happened. Most changes happen because the previous shape was wrong in a specific way (a footgun, a performance bug, a security gap, an inconsistency with the rest of the library). The rationale gives the explanation; the next section gives the action.
Action: what to update in adopter code. The action is the shortest possible change that satisfies the new shape; longer restructurings are flagged as optional improvements.
A typical migration entry runs five to ten lines for a small change, a few paragraphs for a larger one. The chapter grows additively; older migrations are not removed.
What does not migrate
Some adopter changes do not produce a migration entry. The patterns:
Behaviour that was bug-fixed. A previous version's incorrect behaviour might have been load-bearing for an adopter who built around it; the fix is still the right thing to do, and the adopter has to adapt. The fix appears in the changelog as a bug fix; if the bug-fix is large enough to warrant a migration entry, it lands here, but not all of them do.
Internal refactors that do not change the public API. The
internal split between axess-core modules is free to
reorganise without producing a migration entry, as long as the
public re-exports stay stable.
Configuration defaults that change but are configurable. A default that flipped is a behaviour change, captured above. A default that is configurable in both directions and the configuration is the source of truth does not produce a migration entry; the adopter's existing configuration continues to apply.
Further reading
Schema migration covers the per-session schema migration
mechanism that handles session-data shape changes. The
CHANGELOG.md
covers the exhaustive list of changes per release; this chapter
is the curated migration subset. Security posture covers the
security-relevant breaking changes specifically, with the
disclosure protocol for security fixes.
Contributing
This chapter is the contributor reference. It covers what we expect of pull requests, the testing requirements (including the non-negotiable DST discipline), the AX-NNN tracking convention, and the naming and visibility conventions that show up at code review.
The chapter has two halves. The first half is contributor-facing
guidance specific to working on axess. The second half is the
canonical CONTRIBUTING.md
from the repo root, included so the workflow checklist is in one
place.
Before you open a PR
Three things to do before you open a PR.
The first is to read or skim Architecture at a glance. The verifier-versus-orchestrator boundary, the three state slices, the DST discipline, and the naming conventions are the four architectural decisions that the review process holds new code against. A PR that violates one of them is harder to land; a PR written with them in mind sails through.
The second is to find or create an AX-NNN tracking entry. The
ROADMAP is the source of truth for "what is being worked on" and
"what is committed." A PR that lands a feature should reference
an AX-NNN. A PR that lands a bug fix can do without (though one
is often associated even with fixes). The number lives in the
PR description and in the commit messages; the format is
AX-NNN (no #, no space).
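The reference format is simple enough to check mechanically, for instance in a commit hook. The function below is a hypothetical helper, not something the repository is known to ship, and it assumes NNN means one or more ASCII digits.

```rust
/// Returns true for references shaped like `AX-123`:
/// the literal prefix `AX-` followed by one or more ASCII digits (assumed).
fn is_tracking_ref(s: &str) -> bool {
    match s.strip_prefix("AX-") {
        Some(digits) => !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

fn main() {
    for candidate in ["AX-123", "#AX-123", "AX 123", "AX-"] {
        println!("{:8} -> {}", candidate, is_tracking_ref(candidate));
    }
}
```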
The third is to discuss substantial changes before writing them. The review cycle is faster when the maintainers have agreed to the shape ahead of time. A drive-by PR that rewrites a module is usually rejected even when the rewrite is well-thought-out; the cost of integration is higher than the value of the rewrite. A discussion (an issue, a draft PR description, a comment in an existing thread) before the work starts is the shape that lands.
Testing requirements
Every change passes its tests under both the production and the
mock implementations of Clock, SecureRng, and the backend
traits. The DST discipline is the testing non-negotiable; it is
not aspirational.
A test that fails on the production implementation but passes on the mock is detecting a real bug in the production code (or in the test). A test that fails on the mock but passes on production is detecting either a real timing-dependent bug or an over-strict test; either way it is worth investigating before landing.
The pattern in the test code is to parameterise:
#[tokio::test]
async fn login_succeeds_with_correct_password() {
    let suite = TestSuite::default(); // sets up the mocks
    let outcome = suite.service
        .verify_factor(
            &suite.session(),
            FactorCredential::Password("Gnomes2+".into()),
        )
        .await
        .unwrap();
    assert!(matches!(outcome, FactorOutcome::Authenticated));
}
TestSuite::default() wires MockClock, MockRng,
MockBackend, MockRegistry, the in-memory session store, and
the in-memory device store. The test runs entirely in process,
deterministically, against a known initial state.
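What makes such a test deterministic is that the code under test never asks the operating system for the time. The sketch below shows the idea with a hand-rolled trait. The method name `now_unix` and the `is_expired` helper are hypothetical; the real Clock trait in axess-clock may have a different shape.

```rust
use std::cell::Cell;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical clock abstraction: seconds since the Unix epoch.
trait Clock {
    fn now_unix(&self) -> u64;
}

/// Production implementation: the only place that touches the OS clock.
struct SystemClock;

impl Clock for SystemClock {
    fn now_unix(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Test implementation: time moves only when the test says so.
struct MockClock {
    now: Cell<u64>,
}

impl MockClock {
    fn at(start: u64) -> Self {
        Self { now: Cell::new(start) }
    }
    fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by.as_secs());
    }
}

impl Clock for MockClock {
    fn now_unix(&self) -> u64 {
        self.now.get()
    }
}

/// Code under test reads time only through the trait.
fn is_expired(clock: &dyn Clock, issued_at: u64, ttl: Duration) -> bool {
    clock.now_unix() >= issued_at + ttl.as_secs()
}

fn main() {
    let clock = MockClock::at(1_000);
    let ttl = Duration::from_secs(60);
    println!("fresh: expired = {}", is_expired(&clock, 1_000, ttl));
    clock.advance(Duration::from_secs(60));
    println!("after ttl: expired = {}", is_expired(&clock, 1_000, ttl));
    println!("system clock reads {}", SystemClock.now_unix());
}
```

The expiry boundary can be tested to the second without sleeping, which is the property the DST discipline relies on.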
For tests that need a real database (integration tests that verify SQL adapters), the pattern is to feature-gate them and run them in CI under a service container:
#[tokio::test]
#[ignore = "requires Postgres"]
async fn postgres_session_round_trip() -> anyhow::Result<()> {
    let url = std::env::var("TEST_POSTGRES_URL")?;
    let pool = sqlx::PgPool::connect(&url).await?;
    // ... full integration test
    Ok(())
}
The #[ignore] attribute keeps the test out of the default
cargo test run; the CI runs them explicitly with
cargo test --features integration -- --ignored. The pattern
keeps the inner loop fast (default cargo test is in-process)
while still exercising the integration tests in CI.
What good PR descriptions look like
The PR description is what reviewers read first. The goal is to explain what the PR does, why, and what to look for. The shape:
A one-sentence summary at the top. "Add the BearerToken
factor for inbound API authentication." Not "Misc fixes." The
summary is what shows up in the PR list and in the commit
history.
A "Why" paragraph. What problem does the change solve. The problem might be a documented bug, a missing capability, an operational signal that needs response. The reviewer's first question after "what" is always "why now"; answer it in the description rather than the comments.
A "How" section. The shape of the change. Which modules touched, which traits added or modified, which tests added. The reviewer's first question after "why" is "where to look"; the section is the map.
A "Testing" section. What tests cover the change. The default expectation is unit tests against the mocks; integration tests where the change crosses an integration boundary; manual testing notes for changes that are hard to automate (typically migrations or operational tooling).
A "Migration" section if the change is breaking. What downstream code has to update. The section is what feeds the Migration guide chapter; the maintainers add the entry there as part of the merge, but the PR author drafts the wording.
A reference to the AX-NNN tracking number. If the work is substantial, the AX entry has the larger context; the PR description summarises the slice this PR delivers.
Naming and visibility
The naming conventions from Architecture at a glance are enforced at review. The shapes:
A type that is shared across authentication and authorisation
uses the Auth* prefix. A type used only for authentication
uses Authn*. A type used only for authorisation uses Authz*.
A type that does not fit any of the three either picks one
(typically the broader one) or argues in the PR description
why the convention does not apply.
A type's suffix carries its role. *Store, *Registry,
*Provider, *Resolver, *Config, *Error, *Outcome,
*Decision. A new type that does not fit any of these picks
the closest match or argues in the PR description; the
conventions are tight, but they are not exhaustive, and the
rare exception is acceptable when documented.
A method's verb carries its complexity. get_* is O(1) by
primary key. find_* may scan. load_* and save_* are
serialisation pairs. begin_* and complete_* are ceremony
starts and finishes. verify_* is a credential check. A
method that does not fit any of these picks the closest match.
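The verb convention can be read off a small store. `User`, `UserStore`, and both methods are hypothetical examples written for this section; they only illustrate how the `get_*` and `find_*` prefixes signal cost to the caller.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct User {
    id: u32,
    email: String,
}

/// Hypothetical store keyed by primary key.
struct UserStore {
    by_id: HashMap<u32, User>,
}

impl UserStore {
    fn new(users: Vec<User>) -> Self {
        Self { by_id: users.into_iter().map(|u| (u.id, u)).collect() }
    }

    /// get_*: constant-time (expected) lookup by primary key.
    fn get_user(&self, id: u32) -> Option<&User> {
        self.by_id.get(&id)
    }

    /// find_*: may scan; here it is a linear pass over every row.
    fn find_user_by_email(&self, email: &str) -> Option<&User> {
        self.by_id.values().find(|u| u.email == email)
    }
}

fn main() {
    let store = UserStore::new(vec![
        User { id: 1, email: "a@example.com".to_string() },
        User { id: 2, email: "b@example.com".to_string() },
    ]);
    println!("{:?}", store.get_user(2));
    println!("{:?}", store.find_user_by_email("a@example.com"));
}
```

A reviewer who sees `find_*` in a hot path knows to ask about indexes without opening the implementation.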
Visibility defaults to pub(crate). A type is promoted to
pub only when an external consumer needs it; the default is to
not export, and the burden is on the PR to justify the
promotion. The convention catches the common case where an
internal helper accidentally becomes public surface that has to
be maintained forever.
The no-#[non_exhaustive] policy
Axess does not use #[non_exhaustive] on its public enums and
structs. The attribute trades exhaustiveness checking (the
downstream compiler does not catch missing match arms) for
backward compatibility (the upstream can add variants without
breaking downstream). For axess, the trade is the wrong way
around: missing match arms in the downstream are bugs we want to
catch, and the backward-compatibility cost of adding variants is
manageable through deprecation cycles and the migration guide.
A PR that adds #[non_exhaustive] to a public type is rejected
unless the reasoning in the PR description argues a specific
case. The default is to bump the semver major version when a
variant is added, document the change in the migration guide,
and let the downstream's compiler catch the missing arm.
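The behaviour the policy protects is ordinary exhaustive matching. In the sketch below, `Decision` is a hypothetical enum defined in the same file, so it shows the downstream side of the trade: a match with no wildcard arm, which stops compiling the moment a variant is added.

```rust
/// Hypothetical public enum, deliberately not marked #[non_exhaustive].
#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    Allow,
    Deny,
    StepUp,
}

/// Downstream-style code: every variant is handled by name.
fn status_code(decision: Decision) -> u16 {
    match decision {
        Decision::Allow => 200,
        Decision::Deny => 403,
        Decision::StepUp => 401,
        // No `_ =>` arm: adding a fourth variant to `Decision` turns this
        // match into a compile error instead of a silently wrong default.
    }
}

fn main() {
    for d in [Decision::Allow, Decision::Deny, Decision::StepUp] {
        println!("{:?} -> {}", d, status_code(d));
    }
}
```

With `#[non_exhaustive]` on the enum, a downstream crate would be forced to add the wildcard arm, and the new variant would fall into it unnoticed.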
The DST non-negotiable
The DST discipline is reproduced from Architecture at a glance as a contributor reminder:
Every code path that reads wall time goes through the Clock
trait. Every code path that sources entropy goes through the
SecureRng trait. Every backend trait has a mock implementation
that the tests use. A PR that introduces a chrono::Utc::now()
call, a getrandom() call, or a direct database read outside
the trait surface is rejected.
The exceptions are extremely narrow: the axess-cache crate's
moka-cache feature uses wall-clock-driven eviction (opt-in,
documented as DST-breaking), and the production SystemClock
and SystemRng implementations delegate to the OS (these are
the only places where the OS calls happen). New code introduces
neither another exception nor a workaround that hides the same
problem.
The discipline is what lets the test suite be reproducible. A contributor who finds the discipline frustrating is usually about to introduce a bug; the friction is the point.
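The entropy half of the discipline follows the same shape as the clock half. The sketch below is an illustration only: the `fill` method and the `new_session_id` helper are hypothetical, and the mock uses a SplitMix64 step, which is deterministic and explicitly not cryptographically secure.

```rust
/// Hypothetical entropy abstraction; the real SecureRng trait may differ.
trait SecureRng {
    fn fill(&mut self, buf: &mut [u8]);
}

/// Test implementation: same seed, same bytes, every run.
struct MockRng {
    state: u64,
}

impl MockRng {
    fn seeded(seed: u64) -> Self {
        Self { state: seed }
    }
}

impl SecureRng for MockRng {
    fn fill(&mut self, buf: &mut [u8]) {
        for b in buf.iter_mut() {
            // SplitMix64 step: fine for reproducible tests, never for production.
            self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = self.state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            *b = (z ^ (z >> 31)) as u8;
        }
    }
}

/// Code under test draws randomness only through the trait.
fn new_session_id(rng: &mut dyn SecureRng) -> String {
    let mut bytes = [0u8; 16];
    rng.fill(&mut bytes);
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    let mut first = MockRng::seeded(42);
    let mut second = MockRng::seeded(42);
    let a = new_session_id(&mut first);
    let b = new_session_id(&mut second);
    println!("{} (reproducible: {})", a, a == b);
}
```

A test that asserts on a specific session identifier is only possible because the identifier no longer depends on the operating system's random source.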
Canonical CONTRIBUTING.md
The rest of this chapter is the canonical CONTRIBUTING.md from
the repo root.
Contributing to Axess
Thanks for your interest! Axess accepts bug reports, feature requests, documentation improvements, and code contributions.
Before opening a PR for non-trivial work, please file an issue first; this lets us flag overlap with in-flight work in ROADMAP.md and confirm the change fits the library's direction (see docs/intro/architecture.md) before you invest time.
Before you submit
- Fork the repository and create a topic branch from main.
- Tests: add or update tests for every behaviour change. The library uses deterministic simulation testing (DST); inject MockClock / MockRng rather than calling SystemTime::now() or rand::rng() directly.
- Run the full check locally:
    cargo fmt --all
    cargo clippy --workspace --all-features --lib --tests -- -D warnings
    cargo test --workspace --all-features
- Update CHANGELOG.md: add an entry under the [unreleased] section describing the change. Behaviour-changing entries belong under ### Changed (breaking) if they alter a public API.
- Open a PR with a description that covers the why: link the issue, summarise the design choice, and call out any deliberate trade-offs.
Coding conventions
- Idiomatic Rust, async/await for IO, thiserror for error types, tracing for logs.
- Prefer traits + generics on hot paths; vtable dispatch (Box<dyn …>) only where it earns its keep.
- Public APIs need rustdoc, including at least one usage example for newly-introduced traits or builders.
- All time + randomness goes through the Clock / SecureRng traits. This is non-negotiable; it's what makes the test suite deterministic.
See .github/copilot-instructions.md for the full house style.
Workspace layout
| Crate | Role |
|---|---|
| axess | Public facade: middleware builder, re-exports, feature gates |
| axess-core | Core types, session orchestrator, Cedar authz integration, on-behalf-of credential storage + token exchange |
| axess-cache | Generic clock-aware TTL cache |
| axess-clock | Clock / MockClock traits for DST |
| axess-events | rkyv-serialisable audit event types |
| axess-factors | Authentication factor implementations |
| axess-identity | Newtype ID macros + impls |
| axess-macros | Procedural macros for route guards |
| axess-rng | SecureRng / MockRng traits |
| axess-strings | Short hot-path string primitive |
| examples/* | Reference example applications |
Repository conventions
A few rules that aren't obvious from reading the code but affect every PR. Most exist because the cost of not following them showed up somewhere.
Module layout
axess uses the modern Rust convention: foo.rs + a sibling foo/ directory holding submodules. No mod.rs files in new code. Every directory module declares its submodules in the foo.rs file next to (not inside) the directory.
Test-sideways-pull
When #[cfg(test)] tests crowd a production file enough to make scrolling expensive, pull them into a sibling tests.rs:
axess-core/src/path/file.rs        production code + #[cfg(test)] mod tests;
axess-core/src/path/file/tests.rs  the actual tests, gated by #![cfg(test)]
Applied so far across several files where the tests-to-production ratio exceeded ~40%.
pub(crate) for state-machine internals
AuthSession carries identity / session-state accessors as pub. State mutation methods (set_authenticated, begin_authenticating, advance_factor, record_attempt_at) are state-machine transitions that the factor pipeline drives; they are pub(crate) so handler code cannot corrupt the state machine. Adopters drive flow through AuthnService; the session is read-only-ish from outside axess-core.
Per-app workflow mutations (set_identifying, set_pending_workflow, clear, regenerate) remain pub; apps build their own two-step identify / workflow-step / logout flows on top.
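The visibility split can be sketched in a single file. `Session` and `Phase` below are simplified, hypothetical stand-ins for AuthSession and its state; the method names follow the ones listed above, but the bodies and return types are invented for the example. Note that inside one crate `pub(crate)` items remain callable, so the restriction only bites for code in a downstream crate.

```rust
mod session {
    /// Simplified, hypothetical stand-in for the session state.
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum Phase {
        Guest,
        Authenticating,
        Authenticated,
    }

    pub struct Session {
        phase: Phase,
    }

    impl Session {
        pub fn new() -> Self {
            Self { phase: Phase::Guest }
        }

        /// Read accessor: public, safe for handlers.
        pub fn phase(&self) -> Phase {
            self.phase
        }

        /// App-level mutation: public, so apps can build logout flows.
        pub fn clear(&mut self) {
            self.phase = Phase::Guest;
        }

        /// State-machine transition: crate-internal only.
        pub(crate) fn begin_authenticating(&mut self) {
            self.phase = Phase::Authenticating;
        }

        /// State-machine transition: refuses to skip the Authenticating phase.
        pub(crate) fn set_authenticated(&mut self) -> bool {
            if self.phase == Phase::Authenticating {
                self.phase = Phase::Authenticated;
                true
            } else {
                false
            }
        }
    }
}

fn main() {
    // This file plays the role of code inside the crate (the factor pipeline),
    // which is why the pub(crate) transitions are reachable here.
    let mut s = session::Session::new();
    println!("skip attempt accepted: {}", s.set_authenticated());
    s.begin_authenticating();
    println!("after factors: {}", s.set_authenticated());
    println!("phase = {:?}", s.phase());
    s.clear();
    println!("after clear = {:?}", s.phase());
}
```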
No #[deprecated] pre-v0.1.0
Breaking changes happen freely in the unreleased [0.2.0] window; adopters get one coordinated migration window, not a long #[deprecated] trail. CHANGELOG documents each break under ### Changed (breaking).
MSRV bumps are breaking changes
The workspace pins rust-version = "1.87" in [workspace.package]. A bump to a higher MSRV requires a minor-version bump on every published crate (0.x → 0.x+1 for 0.x; 1.x → 1.x+1 once stable). The reasoning: adopters pin Rust toolchains in CI; jumping the floor without warning silently breaks their builds.
Procedure for an MSRV bump:
- Justify in the PR description (which compiler feature, why it earns the bump).
- Update rust-version in [workspace.package] AND the MSRV job's toolchain pin in .github/workflows/ci.yml.
- Add an entry under ### Changed (breaking) in CHANGELOG.md naming the new floor.
- Bump the workspace version (in [workspace.package]) accordingly.
No #[non_exhaustive] on first-party enums
#[non_exhaustive] trades one breakage class (adding variants) for another (every downstream match needs a wildcard arm forever, even when the caller wants compile-time exhaustiveness on a closed set). Project policy is to bump the version and let downstream match failures be loud. CI enforces this; the ban_non_exhaustive workflow job rejects any PR that introduces the attribute.
No ticket-meta date stamps pre-v0.1.0
Source-code comments do not carry // AX-NNN (YYYY-MM-DD): markers. The CHANGELOG is the authoritative timeline; in-source stamps add noise without information a future reader can use. ROADMAP + CHANGELOG retain their AX-NNN references unchanged.
Closed AX-NNN references get stripped
Once an AX-NNN case closes, every reference in source / doc-strings / test names is stripped, preserving the rationale comment but dropping the case number. Open + deferred cases stay referenced.
Promoting a module out of axess-core
axess-core has accumulated significant surface. When proposing a new crate carve-out, check:
- No reverse dep from axess-core onto the carved module. If the module's types appear in AuthnService method signatures or in any axess-core trait surface, the carve isn't yet feasible; invert the dependency first.
- Module has its own external dep blast. Carving delegated/ into axess-delegated won because it pulls aes-gcm only when adopters opt in. A carve that pulls no extra deps is just churn.
- Module is consumable in isolation. A consumer who wants only the carved module should not transitively recompile axess-core's protocol surface.
- Re-export via the facade preserves the import path. Adopters write axess::middleware::ratelimit::*, not axess_middleware::ratelimit::*. The facade decides the shape.
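The last item in the checklist above, the facade re-export, can be sketched in a single file by letting modules stand in for crates. The Quota type and the default_quota function are hypothetical; only the two import paths come from the text.

```rust
// Modules stand in for crates so the sketch fits in one file.
// `Quota` and `default_quota` are hypothetical, not real axess APIs.

/// Stand-in for the carved-out crate `axess_middleware`.
pub mod axess_middleware {
    pub mod ratelimit {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub struct Quota {
            pub per_minute: u32,
        }
    }
}

/// Stand-in for the `axess` facade crate.
pub mod axess {
    pub mod middleware {
        // Re-export: the facade path resolves to the carved crate's module,
        // so the adopter-facing path does not change when the code moves.
        pub use super::super::axess_middleware::ratelimit;
    }
}

/// Adopter code names the type through the facade path only.
pub fn default_quota() -> axess::middleware::ratelimit::Quota {
    axess::middleware::ratelimit::Quota { per_minute: 60 }
}
```

Both paths name the same type, so a module can move between crates without adopters touching their imports.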
Security
Do not open public issues for security vulnerabilities. Report them privately per SECURITY.md.
Licensing
By contributing, you agree your contribution will be dual-licensed under MIT and Apache-2.0, matching the project licence.
Community
Be respectful and constructive. See CODE_OF_CONDUCT.md.
Maintainer time is volunteer-funded; review turnaround is best-effort.
Further reading
Architecture at a glance covers the architectural decisions
that review enforces. Publishing runbook covers the
maintainer-only release process. The
CHANGELOG.md
catalogues what each release has shipped, which is useful
context for understanding what the next PR is meant to do.
Publishing runbook
This chapter is the maintainer-only reference. It covers the publish-to-crates.io procedure: the pre-flight checklist, the dependency topological order, the dry-run, the actual publish, and the rollback procedure if something goes wrong.
The audience is the maintainer cutting a release. An adopter does not need this chapter; the chapter is here so the maintainer has a written reference and so the procedure can be followed by a different maintainer if needed.
Pre-flight
The pre-flight checklist runs before the first dry-run. Each item is a binary pass-or-fail; one failure blocks the release.
The CI is green on the release branch. Every configuration in the full test matrix (default features, all-features, per-backend isolation, FIPS backend) passes. A red CI does not publish.
The version bumps are consistent across the workspace. Every
member of the workspace gets the same version bump (this is the
versioning policy: the workspace ships as one unit). The
Cargo.toml in each crate carries the new version; the
version.workspace = true shape inherits from the root.
The Cargo.lock is up to date. Run cargo update --workspace,
review the changes, commit if needed.
The migration guide is complete. The Migration guide chapter in the book carries an entry for every breaking change in the release. The entry covers the symptom, the rationale, and the fix.
The CHANGELOG.md has a current entry for the release. The entry covers the new features, the breaking changes (referencing the migration guide), and the bug fixes. The entry is the short-form version of the migration guide; both exist because they serve different audiences (the CHANGELOG is the per-release overview, the migration guide is the per-change reference).
The docs.rs configuration is present and correct on every crate that publishes to crates.io. The shape:
[package.metadata.docs.rs]
all-features = true
rustdoc-args = ["--cfg", "docsrs"]
Verify by running cargo doc --all-features locally; the build
must succeed without warnings. A docs build that fails on
docs.rs after publish is a maintenance problem that surfaces
once the release is out.
The description, license, repository, keywords, and
categories fields are populated on every published crate. The
crates.io page renders these; a missing field is a missing
detail in the listing.
The publish = false flag is removed from every crate that
should publish. This is the deliberate gate that keeps
accidental publishes from happening; flipping the flag is what
makes the release possible.
The version branch (a fresh release/0.2.0 branch from main)
exists. The branch is the source of truth for the release; any
fixes during the publish window land on the branch and merge
back to main after.
Topological dependency order
The workspace's library crates publish in dependency order: a crate must be published before any crate that depends on it. The order:
- axess-strings (no axess deps)
- axess-clock (no axess deps)
- axess-rng (no axess deps)
- axess-identity (no axess deps)
- axess-cache (depends on axess-clock)
- axess-events (depends on axess-identity)
- axess-factors (depends on axess-identity, axess-clock, axess-rng)
- axess-macros (no axess deps; procedural macros stand alone)
- axess-core (depends on everything in the previous tier)
- axess (the facade, depends on axess-core and axess-factors and axess-macros)
Running cargo publish --dry-run against each crate in turn
confirms the order, but the maintainer should also know it by
heart so a publish that fails partway through can be resumed at
the right position.
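For a maintainer who wants to double-check the order rather than memorise it, a minimal sketch of a topological sort over the dependency list above follows. The function names are hypothetical, and the edges for axess-core are an assumption: "everything in the previous tier" is read here as every crate listed before it.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Dependency edges as listed in the text (crate -> crates it depends on).
/// ASSUMPTION: `axess-core` depends on every crate listed before it.
pub fn workspace_graph() -> BTreeMap<&'static str, Vec<&'static str>> {
    let mut g = BTreeMap::new();
    g.insert("axess-strings", vec![]);
    g.insert("axess-clock", vec![]);
    g.insert("axess-rng", vec![]);
    g.insert("axess-identity", vec![]);
    g.insert("axess-cache", vec!["axess-clock"]);
    g.insert("axess-events", vec!["axess-identity"]);
    g.insert("axess-factors", vec!["axess-identity", "axess-clock", "axess-rng"]);
    g.insert("axess-macros", vec![]);
    g.insert(
        "axess-core",
        vec![
            "axess-strings", "axess-clock", "axess-rng", "axess-identity",
            "axess-cache", "axess-events", "axess-factors", "axess-macros",
        ],
    );
    g.insert("axess", vec!["axess-core", "axess-factors", "axess-macros"]);
    g
}

/// Kahn-style ordering: in each round, emit every crate whose
/// dependencies have all been emitted already. Returns `None` if a
/// round makes no progress (a cycle, or a dependency missing from the graph).
pub fn publish_order(
    graph: &BTreeMap<&'static str, Vec<&'static str>>,
) -> Option<Vec<&'static str>> {
    let mut done: BTreeSet<&'static str> = BTreeSet::new();
    let mut order: Vec<&'static str> = Vec::new();
    while order.len() < graph.len() {
        let ready: Vec<&'static str> = graph
            .iter()
            .filter(|(name, deps)| {
                !done.contains(*name) && deps.iter().all(|d| done.contains(d))
            })
            .map(|(name, _)| *name)
            .collect();
        if ready.is_empty() {
            return None;
        }
        for name in ready {
            done.insert(name);
            order.push(name);
        }
    }
    Some(order)
}
```

Any order this produces is a valid publish order; it may differ from the listing above within a tier, because crates with no dependency between them can publish in either sequence.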
The dry-run
Before the actual publish, run a dry-run for each crate in topological order. The shape:
cargo publish --dry-run -p axess-strings
cargo publish --dry-run -p axess-clock
cargo publish --dry-run -p axess-rng
cargo publish --dry-run -p axess-identity
cargo publish --dry-run -p axess-cache
cargo publish --dry-run -p axess-events
cargo publish --dry-run -p axess-factors
cargo publish --dry-run -p axess-macros
cargo publish --dry-run -p axess-core
cargo publish --dry-run -p axess
The dry-run does everything the publish does except the upload.
It builds the package, verifies the manifest, runs the publish-time
checks, and prints the path of the .crate file it would have
uploaded. A failure here is an opportunity to fix without having
to yank a half-published release.
A failure on a downstream crate (say axess-core) typically
means an upstream crate (say axess-factors) needs an update
that the dry-run does not yet reflect. The fix is to update the
upstream first; the downstream dry-run picks up the change.
The publish
After the dry-runs pass, run the actual publish in the same topological order. The shape:
cargo publish -p axess-strings
# wait ~30s for crates.io to index, then:
cargo publish -p axess-clock
# wait, then:
cargo publish -p axess-rng
# ... and so on
The wait between publishes is necessary because each subsequent publish needs the previous one to be available on crates.io's index. Without the wait, the downstream publish fails with "crate not found"; with the wait, the index has propagated by the time the next publish queries it.
The wait time is short (30 seconds is generous; sometimes 15 seconds works). For automation, the wait can be scripted with a retry loop that polls the crates.io API until the expected version is listed.
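The retry loop can be sketched as a small helper. This is an illustration under assumptions, not part of the project's tooling: wait_until_listed is a hypothetical name, and the actual crates.io query is left to the caller as a closure, because the exact API call is outside what this chapter specifies.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls `is_listed` until it reports true or `max_attempts` is used up.
/// Returns the 1-based attempt number that succeeded, or `None` on timeout.
///
/// `is_listed` is supplied by the caller; in a real script it would ask
/// the registry whether the expected version is visible yet.
pub fn wait_until_listed<F>(
    mut is_listed: F,
    max_attempts: u32,
    delay: Duration,
) -> Option<u32>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=max_attempts {
        if is_listed() {
            return Some(attempt);
        }
        // No point sleeping after the final failed attempt.
        if attempt < max_attempts {
            sleep(delay);
        }
    }
    None
}
```

A caller might pass a 5-second delay and a dozen attempts, which covers the 15 to 30 second window described above with some margin, and abort the publish sequence if the helper returns None.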
After the final publish (cargo publish -p axess), wait a few
minutes and verify on crates.io that all the crates are listed
at the new version.
The smoke test
After the publish, run a smoke test against a fresh dependency on the published version. The shape:
mkdir /tmp/axess-smoke
cd /tmp/axess-smoke
cargo init --name axess-smoke .
echo 'axess = "0.2"' >> Cargo.toml
cargo build
The build pulls the published crates from crates.io (not from the workspace) and verifies they assemble. A failure here indicates an issue that the dry-run did not catch (typically a crates.io-specific issue like a missing file in the package manifest); the rollback procedure is the response.
For a more thorough smoke test, copy examples/sqlite/ to a
fresh directory, point its Cargo.toml at the published
versions (replacing path = "../../axess" with
version = "0.2"), and verify it builds and runs.
The smoke test is the last gate before announcing the release. A successful smoke test means the publish is real.
The announcement
After the smoke test passes:
Tag the release in git (git tag v0.2.0 && git push --tags).
The tag is the canonical reference point.
Update the status banner near the top of
README.md
from "pre-release" to "0.2.0 released."
Open a 0.3.0-pending section in CHANGELOG.md. Future PRs
land entries under that section until the next release cuts.
Post to the project's announcement channels (the GitHub releases page is the canonical one; the project's Discord, Slack, mailing list, or other channels mirror as appropriate).
Update the docs.rs links anywhere they hardcode a version. The canonical version is now the released one, not the development branch.
Rollback
If the publish goes wrong (a critical bug surfaces, a crate is broken on crates.io, the release was premature), the rollback procedure:
cargo yank --version 0.2.0 axess (and every other crate)
withdraws the version from crates.io. Yanked versions remain
available to existing consumers (so Cargo.lock references
continue to work), but new resolves do not pick them up.
A yank can be reversed with cargo yank --undo, but a published
version cannot be deleted or overwritten, so a new publish with
the same version number is not possible. The next release picks a
new version (0.2.1); the fix lands there.
In practice a rollback is rarely needed; the dry-run and the topological-publish discipline catch most issues before the upload. A yank is the genuine emergency response, typically for a security issue that warrants withdrawing a specific version.
Post-release maintenance
After the release lands, the maintenance window is the period where the maintainer watches for issues. The shape:
The first 24 hours: monitor crates.io for download numbers (a quick check that the publish reached an audience); monitor the GitHub issue tracker for new bug reports; monitor the dashboards of any deployments that follow main closely.
The first week: triage the issues that come in; assess whether any warrant a patch release (0.2.1). The criteria for a patch release: a critical bug, a security issue, a regression from 0.1.x that the migration guide did not catch.
The first month: roll up the lessons learned. The patches shipped, the issues that surfaced, the documentation gaps the release exposed. The roll-up feeds the next release's planning.
Further reading
Migration guide covers the cross-version compatibility surface
that the release-management decisions depend on. Contributing
covers the development workflow that produces the changes a
release ships. The
CHANGELOG.md
covers the exhaustive list of changes per release.