Troubleshooting

Symptom → diagnostic → fix. Runbook for the most common failure modes.

| Symptom | First check | Fix |
| --- | --- | --- |
| `helm install soctalk-system` fails in pre-install hook | `kubectl logs -n soctalk-system job/<release>-preinstall-check` | Install the missing cluster prereq (CNI, cert-manager, StorageClass) per the Install guide |
| API pod `CrashLoopBackOff` on startup | `kubectl logs -n soctalk-system deploy/soctalk-system-api` | Most often: bad `DATABASE_URL` Secret, Postgres not ready yet, or Alembic migration failure. Check the Postgres pod first |
| `helm install` succeeds but MSSP UI returns 502 | Ingress controller logs; verify ingress Service endpoints populated | OIDC proxy not deployed or not injecting trusted headers. Check trusted-proxy CIDR |
| Tenant create returns 500 | API logs show `ProvisionError` | Usually `helm install tenant-*` failed. Check `helm status tenant-<slug>`. Namespace and resource-quota issues are most common |
| Tenant stuck `provisioning` > 15 min | `kubectl -n tenant-<slug> get events --sort-by=.lastTimestamp` | See Tenant stuck in provisioning in operations |
| Tenant goes `degraded` | Adapter logs in the tenant namespace | NetworkPolicy egress, adapter pod crash, or DNS misresolved |
| Cross-tenant data visible | Run isolation test suite | P1 incident. RLS is the last line of defense; a failure indicates an application bug or Postgres role misconfiguration |
| LLM calls failing for one tenant | Worker logs: look for 401/403 from the LLM provider | `tenant-<id>-llm` Secret `api_key` is empty or wrong. Rotate via the UI |
| Wazuh agent can't connect | Tenant's LB IP (or edge HAProxy IP+port) reachable from the agent host; DNS for `<slug>.soc.mssp.*` resolves to it; 1514/1515 open through any intermediate firewall | See Wazuh Ingress. 1514 is Wazuh's proprietary protocol, so there is no SNI to inspect; routing is by destination address or port. Verify the tenant's Service (`type: LoadBalancer` or the HAProxy port) is the address the agent is targeting |
| Postgres StatefulSet won't start (PVC Pending) | `kubectl describe pvc -n soctalk-system` | No default StorageClass, the class doesn't support RWO, or the cluster is out of disk |
| `PolicyViolation` messages from ingress controller | NetworkPolicy allow rules | Make sure the ingress namespace is labeled `kubernetes.io/metadata.name=ingress-system` |
| Cilium Hubble shows `DROPPED` flows between tenant and `soctalk-system` | NetworkPolicies + Cilium identities | Adapter egress policy missing or wrong `namespaceSelector` |
| Customer user login returns 403 on `/api/tenant/*` | JWT claims | Ensure the user row has `tenant_id` set and `role=customer_viewer` |
| MSSP user impersonation not showing in customer audit | Audit query | Verify `acting_as` column populated on write; the customer audit view joins on `tenant_id = own AND acting_as IS NOT NULL` |
| Isolation test fails in CI (FORCE RLS admin can see rows) | Migration applied? | Re-run `alembic upgrade head`; ensure `FORCE ROW LEVEL SECURITY` applied to every tenant-scoped table |
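
A few of the first checks above can be scripted. The helpers below are a minimal sketch, not part of SocTalk: the function names are hypothetical, the port probe covers TCP only, and the RLS query assumes that tenant-scoped tables live in the `public` schema and carry a `tenant_id` column. It also assumes the connection string passed to `psql` is a plain libpq URI, which may differ from the value stored in the `DATABASE_URL` Secret.

```bash
# Hypothetical triage helpers; the names are illustrative, not a SocTalk API.

# Succeeds if kubectl lists at least one StorageClass marked "(default)".
has_default_storageclass() {
  kubectl get storageclass --no-headers 2>/dev/null | grep -q '(default)'
}

# Succeeds if a TCP connection to host:port opens within 3 seconds.
# Uses bash's /dev/tcp redirection, so it needs only bash and coreutils
# on the agent host. A UDP listener would not be detected by this probe.
tcp_port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Prints tables that have a tenant_id column but do NOT have both
# ROW LEVEL SECURITY and FORCE ROW LEVEL SECURITY enabled.
# Assumption: tenant-scoped tables are in schema "public" and are
# identified by a tenant_id column.
tables_without_forced_rls() {
  local dsn="$1"
  psql "$dsn" -At -c "
    SELECT c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    JOIN pg_attribute a ON a.attrelid = c.oid
    WHERE n.nspname = 'public'
      AND c.relkind IN ('r', 'p')
      AND a.attname = 'tenant_id'
      AND NOT a.attisdropped
      AND NOT (c.relrowsecurity AND c.relforcerowsecurity)
    ORDER BY c.relname;"
}
```

For example, `tcp_port_open <tenant-lb-ip> 1514 && echo reachable` run from the agent host checks the path through any intermediate firewall, and an empty result from `tables_without_forced_rls` suggests every table matching the assumption has forced RLS.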

Collecting diagnostic bundles

When escalating to support, collect:

```bash
# Placeholder: replace with the slug of the affected tenant.
# (An unquoted <slug> on the command line would be parsed by bash as a redirection.)
SLUG="<slug>"

# SocTalk system-level state
kubectl get all,events,networkpolicies,resourcequotas \
  -n soctalk-system -o yaml > soctalk-system.yaml
kubectl -n soctalk-system logs deploy/soctalk-system-api --tail=500 > api.log
kubectl -n soctalk-system logs deploy/soctalk-system-orchestrator --tail=500 > orch.log

# Specific tenant
kubectl get all,events,networkpolicies,resourcequotas,limitranges \
  -n "tenant-${SLUG}" -o yaml > tenant.yaml
kubectl -n "tenant-${SLUG}" logs deploy/soctalk-adapter --tail=500 > adapter.log

# Helm state
helm status -n soctalk-system soctalk-system > helm-system.txt
helm status -n "tenant-${SLUG}" "tenant-${SLUG}" > helm-tenant.txt

# SocTalk version + lifecycle events for the tenant
soctalk-cli debug-bundle --tenant "${SLUG}" > bundle.json

# Package everything collected above into one timestamped archive
tar czf "soctalk-debug-$(date +%s).tgz" *.yaml *.log *.txt bundle.json
```

Review the tarball for customer data before sharing externally. Logs may contain alert excerpts.
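
As a first pass at that review, something like the following can flag lines worth a closer look. It is only a sketch: `scan_bundle_for_sensitive` is a hypothetical helper, and the two patterns (email-like and IPv4-like strings) are deliberately loose, so they will produce false positives such as version numbers and will miss other kinds of customer data. It does not replace reading the files.

```bash
# Hypothetical helper: print file:line:content for every line that looks like
# it contains an email address or an IPv4 address. Exits non-zero if nothing matches.
scan_bundle_for_sensitive() {
  grep -HnE \
    -e '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' \
    -e '([0-9]{1,3}\.){3}[0-9]{1,3}' \
    -- "$@"
}
```

Run it against the collected files before creating or sharing the archive, for example `scan_bundle_for_sensitive *.log *.yaml bundle.json`.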

Released under the Apache 2.0 License.