Persistent 403 On All API Calls — Auth Token Not Accepted After Login (Android)
Metadata
- Date: 2026-04-18
- Status: root-cause-identified
- Severity: critical
- Related issue/ticket: N/A
- Owner: N/A
About
Overview:
- On Android, API calls return 403 even after logging out and back in. The memory feed fails at app startup, and the session disappears mid-run (the auth-gate kicks the user to login from Settings while the app appeared fully authenticated).
- The 403 blocks all protected API access. The user can't view memories or interact with their data.
Technical Questions:
- All 403s originate from the API Gateway Cognito authorizer, before any Lambda is invoked. Lambda AuthError returns status_code=401; an API Gateway authorizer rejection returns 403. This means the token itself is being rejected at the Gateway level.
- auth_retry_requested fires (not auth_retry_aborted), so the Cognito token refresh via refreshSession succeeds and returns a new idToken. Yet the retried request with the fresh token also gets 403. This is the key anomaly.
- The session disappears mid-run: logs show auth-gate redirect_to_login {"first_segment": "settings"} while the app had just successfully captured audio. The only code paths that call setSession(null) are signOut() and the bootstrap catch/fallthrough. One of these ran unexpectedly.
- The app's .env has EXPO_PUBLIC_COGNITO_USER_POOL_ID=us-east-1_WvhXc4lRH and EXPO_PUBLIC_COGNITO_APP_CLIENT_ID=2f6n62s81ic3gm084mgi71b6hn. If AWS SSM /encache/auth/user_pool_id or /encache/auth/app_client_id no longer match these values (e.g., the Cognito pool was recreated by Terraform), the Gateway would reject all mobile-issued tokens.
- The 4-key SecureStore split (commit 719be108) changed how sessions are persisted. A partial write failure (one key missing) would cause getStoredSession to return null on next load, but this should surface as has_stored_session: false in bootstrap logs.
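The partial-write concern in the last question can be illustrated with a small all-or-nothing read over four keys. This is only a sketch: the key names, the `session.` prefix, and the synchronous in-memory store are assumptions made for illustration, and the real secure-store.ts (which sits on an asynchronous store) may be organized differently.

```typescript
// Hypothetical key names; the real 4-key split may use different ones.
const SESSION_KEYS = ["idToken", "accessToken", "refreshToken", "meta"] as const;
type SessionKey = (typeof SESSION_KEYS)[number];
type StoredSession = Record<SessionKey, string>;

// Synchronous stand-in for a key-value store (an assumption for brevity).
interface KeyValueStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
}

function createMemoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    get: (key) => {
      const value = data.get(key);
      return value === undefined ? null : value;
    },
    set: (key, value) => {
      data.set(key, value);
    },
  };
}

// Returns the session only when all four keys are present. Otherwise it
// returns null together with the missing keys, so the caller can log
// key-presence detail instead of a bare "no stored session".
function readSplitSession(
  store: KeyValueStore
): { session: StoredSession | null; missing: SessionKey[] } {
  const partial: Partial<StoredSession> = {};
  const missing: SessionKey[] = [];
  for (const key of SESSION_KEYS) {
    const value = store.get(`session.${key}`);
    if (value === null) {
      missing.push(key);
    } else {
      partial[key] = value;
    }
  }
  if (missing.length > 0) {
    return { session: null, missing };
  }
  return { session: partial as StoredSession, missing };
}
```

Reporting the missing keys alongside the null result is what would let a bootstrap log distinguish a partial write from a store that was never written.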
Resources:
- main/app/lib/api/getApi.ts: response interceptor, attemptTokenRefresh, getOrStartRefresh
- main/app/lib/api/auth/session.ts: getValidIdToken, persistSession, clearSession
- main/app/lib/api/auth/secure-store.ts: getStoredSession, setStoredSession (4-key split)
- main/app/lib/api/auth/auth-provider.tsx: bootstrap flow, setSession(null) paths
- main/app/lib/api/auth/cognito.ts: EXPO_PUBLIC_COGNITO_USER_POOL_ID, EXPO_PUBLIC_COGNITO_APP_CLIENT_ID
- main/server/template.yaml:68: API Gateway authorizer uses {{resolve:ssm:/encache/auth/user_pool_id}}
- main/server/layers/shared/python/shared/lambda_helpers.py:119: _verify_jwt (returns 401, not 403, which confirms 403 = Gateway)
- main/devops/main.tf:386: SSM parameters /encache/auth/user_pool_id and /encache/auth/app_client_id
Steps to cause failure
flowchart LR
AppStart --> FeedFires["Memory feed fires\n(bootstrap still loading)"]
FeedFires --> NoToken["getValidIdToken() → null\n(sessionCache cold, SecureStore async)"]
NoToken --> NoHeader["Request sent\nwithout Authorization header"]
NoHeader --> GW403["API Gateway → 403\n(no auth header = Cognito rejects)"]
GW403 --> Retry["Response interceptor:\nrefreshSession() → succeeds\nnew idToken obtained"]
Retry --> Retry403["Retry with fresh token\n→ still 403\n(WHY? — open question)"]
Retry403 --> Exhausted["auth_retry_exhausted"]
flowchart LR
MidRun["App running normally\n(capture worked)"] --> NavSettings["User navigates to Settings"]
NavSettings --> SessionNull["session React state = null\nhas_session: false"]
SessionNull --> Redirect["Auth-gate kicks to login"]
System
flowchart TD
MobileApp -->|EXPO_PUBLIC_COGNITO_USER_POOL_ID| CognitoPool["Cognito User Pool\nus-east-1_WvhXc4lRH"]
CognitoPool -->|issues idToken| MobileApp
MobileApp -->|Authorization: Bearer idToken| APIGW["API Gateway"]
APIGW -->|validates token against| SSMPool["SSM /encache/auth/user_pool_id\n(resolved at deploy time)"]
SSMPool -->|pool mismatch? → 403| APIGW
APIGW -->|token valid → invokes| Lambda["Lambda\n(AuthError → 401, not 403)"]
Reproduction Details
- Cold-start the app on Android.
- Observe auth_retry_exhausted {"status": 403} for /memories/feed before capture starts.
- Navigate to Settings → observe redirect_to_login {"first_segment": "settings"} and has_session: false.
- Log out and back in via magic link.
- Observe 403 persists on protected endpoints.
Reproduction test: N/A — requires live AWS + Cognito; cannot be unit-tested without mocking API Gateway.
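Although the Gateway rejection itself needs live AWS, the client-side retry sequence seen in the logs can be modelled against a fake transport. The sketch below is an assumption about the general shape of that logic, not a copy of getApi.ts: `requestWithAuthRetry`, `Send`, and `Refresh` are hypothetical names, and only the event names (auth_retry_requested, auth_retry_aborted, auth_retry_exhausted) and the had_retry_token idea come from the investigation.

```typescript
// Hypothetical transport: resolves to the HTTP status for a given token.
type Send = (token: string | null) => Promise<number>;
// Hypothetical refresh: resolves to a fresh idToken, or null on failure.
type Refresh = () => Promise<string | null>;

interface RetryOutcome {
  status: number;
  events: string[];
  hadRetryToken: boolean;
}

function isAuthFailure(status: number): boolean {
  return status === 401 || status === 403;
}

// Sends once, and on an auth failure refreshes the token and retries
// exactly one time, recording the events emitted along the way.
async function requestWithAuthRetry(
  send: Send,
  refresh: Refresh,
  initialToken: string | null
): Promise<RetryOutcome> {
  const events: string[] = [];
  const first = await send(initialToken);
  if (!isAuthFailure(first)) {
    return { status: first, events, hadRetryToken: false };
  }
  events.push("auth_retry_requested");
  const fresh = await refresh();
  if (fresh === null) {
    events.push("auth_retry_aborted");
    return { status: first, events, hadRetryToken: false };
  }
  const second = await send(fresh);
  if (isAuthFailure(second)) {
    events.push("auth_retry_exhausted");
  }
  return { status: second, events, hadRetryToken: true };
}
```

With a fake `send` that always resolves to 403 and a `refresh` that always succeeds, the outcome is auth_retry_requested followed by auth_retry_exhausted with `hadRetryToken` true, which is the sequence reported in this investigation.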
Notes for PR
Three distinct failure modes identified:
Failure 1 — Race condition at startup (frontend): getValidIdToken() in the request interceptor returns null when sessionCache is cold and SecureStore hasn't finished loading during bootstrap. The memory feed fires immediately on mount before bootstrap() completes. The fix would be to gate the feed load on loading === false from AuthProvider, or to await session load before making the first API call.
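The second option, awaiting session load before the first API call, could look roughly like the one-shot gate below. This is a minimal sketch under assumptions: `createAuthGate` and `buildAuthHeaders` are hypothetical helpers that do not exist in the codebase, and bootstrap is assumed to call `markReady` exactly once when it finishes, with the idToken or with null if no session exists.

```typescript
// Hypothetical one-shot gate that opens when bootstrap has finished.
interface AuthGate {
  markReady(token: string | null): void;
  whenReady(): Promise<string | null>;
}

function createAuthGate(): AuthGate {
  let resolveReady!: (token: string | null) => void;
  const ready = new Promise<string | null>((resolve) => {
    resolveReady = resolve;
  });
  return {
    // Later calls are ignored because a promise settles only once.
    markReady: (token) => resolveReady(token),
    whenReady: () => ready,
  };
}

// Builds request headers only after the gate has opened, so a request
// issued during bootstrap waits instead of going out with a cold cache.
async function buildAuthHeaders(
  gate: AuthGate
): Promise<Record<string, string>> {
  const token = await gate.whenReady();
  return token ? { Authorization: `Bearer ${token}` } : {};
}
```

Compared with gating each feed on the loading flag, a gate inside the request path covers every caller at once, at the cost of hiding the wait inside the HTTP layer.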
Failure 2 — Persistent 403 even with fresh token (root cause TBD): The refreshed token (from refreshSession() → Cognito) is ALSO rejected by API Gateway. The most likely cause is a Cognito config mismatch: EXPO_PUBLIC_COGNITO_USER_POOL_ID/EXPO_PUBLIC_COGNITO_APP_CLIENT_ID in .env don't match the SSM values the backend was deployed with. To verify:
# Run in your AWS-authenticated terminal:
aws ssm get-parameter --name /encache/auth/user_pool_id --query Parameter.Value --output text
aws ssm get-parameter --name /encache/auth/app_client_id --query Parameter.Value --output text
Compare against .env:
- EXPO_PUBLIC_COGNITO_USER_POOL_ID=us-east-1_WvhXc4lRH
- EXPO_PUBLIC_COGNITO_APP_CLIENT_ID=2f6n62s81ic3gm084mgi71b6hn
If they differ, update .env to match SSM and rebuild the app. If they match, check CloudWatch Logs for the API Gateway authorizer to see the exact rejection reason.
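A complementary check is to inspect the claims of the idToken the app actually sends. Cognito ID tokens carry an `iss` claim of the form `https://cognito-idp.<region>.amazonaws.com/<userPoolId>` and an `aud` claim equal to the app client ID, so both can be compared against the values the backend expects. The helper below is a debugging sketch only: `checkIdToken` and `base64UrlDecode` are hypothetical names, the signature is not verified, and which claims the Gateway authorizer enforces depends on how it is configured.

```typescript
const B64URL =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Decodes base64url to a byte-per-character string. Sufficient for the
// ASCII claims inspected here; non-ASCII values would need UTF-8 decoding.
function base64UrlDecode(input: string): string {
  const clean = input.replace(/=+$/, "");
  let bits = 0;
  let bitCount = 0;
  let out = "";
  for (let i = 0; i < clean.length; i++) {
    const value = B64URL.indexOf(clean.charAt(i));
    if (value < 0) throw new Error("invalid base64url character");
    bits = (bits << 6) | value;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      out += String.fromCharCode((bits >> bitCount) & 0xff);
      bits &= (1 << bitCount) - 1; // keep only the leftover low bits
    }
  }
  return out;
}

interface IdTokenCheck {
  iss: string | undefined;
  aud: string | undefined;
  issuerMatches: boolean;
  audienceMatches: boolean;
}

// Compares the token's iss/aud claims with the expected pool and client.
// This reads the payload only; it does NOT validate the signature.
function checkIdToken(
  idToken: string,
  region: string,
  userPoolId: string,
  appClientId: string
): IdTokenCheck {
  const parts = idToken.split(".");
  const payload = parts[1];
  if (parts.length !== 3 || payload === undefined) {
    throw new Error("expected a three-part JWT");
  }
  const claims = JSON.parse(base64UrlDecode(payload)) as {
    iss?: string;
    aud?: string;
  };
  const expectedIssuer = `https://cognito-idp.${region}.amazonaws.com/${userPoolId}`;
  return {
    iss: claims.iss,
    aud: claims.aud,
    issuerMatches: claims.iss === expectedIssuer,
    audienceMatches: claims.aud === appClientId,
  };
}
```

Calling `checkIdToken(idToken, "us-east-1", poolIdFromSsm, clientIdFromSsm)` with the two SSM values would show whether the token the app holds was issued by the pool the backend is configured for.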
Failure 3 — Session disappearing mid-run: The auth-gate fires redirect_to_login while the app had a valid session. The only setSession(null) paths outside of explicit signout are: bootstrap error catch, and bootstrap fallthrough (stored session missing + refresh fails). Bootstrap re-runs only if logger reference changes. Needs log evidence of bootstrap_started or token_refresh_failed firing mid-session.
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize bug investigation | issue created |
| 2 | Trace auth flow | 403 comes from API Gateway (not Lambda) — AuthError in lambda_helpers.py returns 401; API GW Cognito authorizer returns 403. Response interceptor retry also 403 despite fresh token from refresh | code trace |
| 3 | Check Cognito config | .env has pool us-east-1_WvhXc4lRH and client 2f6n62s81ic3gm084mgi71b6hn. Backend resolves from SSM at deploy time. Mismatch between these two is the #1 suspect for persistent 403 | code trace |
| 4 | Identify session disappear path | has_session: false mid-run → React state session became null. Only paths: signOut(), bootstrap error/fallthrough. clearSession() in getValidIdToken catch does NOT call setSession(null) → potential divergence between sessionCache and React state | code trace |
| 5 | Identify race condition | Memory feed fires before bootstrap completes → sessionCache cold → getValidIdToken() returns null → no Authorization header → 403. This is distinct from the persistent 403 after fresh login | code trace |
| 6 | Fix race condition | Added enabled param to useMemoriesFeed and useChatsFeed; index.tsx passes !authLoading so both feeds wait for bootstrap to complete before firing | useMemoryApi.ts, useChatsApi.ts, index.tsx |
| 7 | Fix session state divergence | Added registerOnSessionCleared callback in session.ts; clearSession() now fires it; AuthProvider registers setSession(null) on mount — any token refresh failure now correctly clears React state and triggers auth-gate redirect | session.ts, auth-provider.tsx |
| 8 | Identify has_stored_session: false root cause (hypothesis) | Login succeeds and user sees main screen → memory feed fires → 403 → response interceptor → attemptTokenRefresh() in getApi.ts → if Cognito refreshSession() throws → clearSession() is called → WIPES the session that was just stored → next launch finds nothing. The persistent 403 causing the Cognito refresh attempt is likely from the API Gateway Cognito authorizer using a pool ID baked in at SAM deploy time that may no longer match the mobile app's pool. CloudFormation {{resolve:ssm:...}} resolves ONCE at deploy — SSM update does not update the deployed Gateway. | code trace |
| 9 | Add diagnostic logging | Added auth-store flow logs to secure-store.ts (write started/succeeded, read succeeded/null with key presence detail, token lengths). Added will_clear_session: true to token_refresh_failed in getApi.ts. Added had_retry_token to auth_retry_exhausted. | secure-store.ts, getApi.ts |
| 10 | Root cause confirmed via logs | Full login flow captured. store_session_write_succeeded confirmed (tokens 1071/1091/1788 bytes — well under 2048 limit). token_refresh_succeeded confirmed — Cognito pool IDs are correct on mobile. After Cognito refresh, STILL gets 403 on retry (auth_retry_exhausted {"had_retry_token": true}). Session is never cleared by token_refresh_failed. Conclusion: API Gateway Cognito authorizer has a stale pool ID baked in from its last CloudFormation deployment — {{resolve:ssm:...}} resolves ONCE at deploy, not at runtime. Fix: redeploy the SAM stack so API Gateway picks up current SSM value. The has_stored_session: false on fresh launches is a development-build reinstall artifact (each expo run:android wipes app data including SecureStore). | logs |
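
Entry 7 in the table above describes keeping the module-level session cache and React state in step through a registered callback. A minimal sketch of that pattern follows, assuming a single listener and an in-memory cache; `registerOnSessionCleared` and `clearSession` are named after the entry, but their bodies here are illustrative, and `setCachedSession` and `hasCachedSession` are hypothetical helpers added so the sketch is self-contained.

```typescript
type SessionClearedListener = () => void;

// Module-level cache standing in for sessionCache in session.ts.
let cachedIdToken: string | null = null;
let onSessionCleared: SessionClearedListener | null = null;

function setCachedSession(idToken: string): void {
  cachedIdToken = idToken;
}

function hasCachedSession(): boolean {
  return cachedIdToken !== null;
}

// Registers the listener and returns an unregister function, which a
// provider component could call when it unmounts.
function registerOnSessionCleared(
  listener: SessionClearedListener
): () => void {
  onSessionCleared = listener;
  return () => {
    if (onSessionCleared === listener) {
      onSessionCleared = null;
    }
  };
}

// Clears the cache and notifies the listener, so UI state cannot keep
// reporting a session that the cache no longer holds.
function clearSession(): void {
  cachedIdToken = null;
  if (onSessionCleared !== null) {
    onSessionCleared();
  }
}
```

In this arrangement the provider would register a listener that calls setSession(null) on mount, so any code path that clears the cache also drives the auth-gate redirect.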
Verification
- [ ] Reproduced failure before fix
- [ ] Reproduction test fails before fix
- [ ] Root cause identified with evidence
- [ ] Fix applied at source (no workaround-only patch)
- [ ] Reproduction test passes after fix
- [ ] Reproduction path now passes
- [ ] Regression test added/updated (or N/A with reason)
- [ ] Verified no duplicate solved-bug log exists for same root cause