Langfuse Troubleshooting Guide: Fix ClickHouse, Postgres, and Ingestion Issues
If you are running a self-hosted AI stack and your traces have stopped appearing, your services will not start, or something broke after an update, this Langfuse troubleshooting guide is for you.
Langfuse runs several moving parts, including a web container, a worker container, PostgreSQL, ClickHouse, Redis, and MinIO (or S3). When one of them misbehaves, the whole pipeline breaks. This guide covers each layer so you know exactly where to look.
If you haven’t set up Langfuse yet, you can start with the self-hosted Langfuse on VPS installation guide.
Quick Health Check Before You Start
Before going deep into any single service, you can run these commands to see what is alive and what is not.
Check all container statuses:
docker compose ps
Stream live logs from all containers:
docker compose logs --tail=50 -f
Check resource usage per container:
docker stats
Look for containers that keep restarting or show an Exit status, which tells you where the problem is. The steps below go through each service one by one.
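If you would rather filter the container list than read it by eye, the check can be scripted. The sketch below assumes you capture the output of `docker compose ps -a --format json` and that each entry carries `Name`, `State`, and `Health` fields, which may differ between Compose versions. The function name `find_unhealthy` is made up for this example.

```python
import json


def find_unhealthy(ps_output: str) -> list[str]:
    """Return names of containers that are not running or report unhealthy."""
    text = ps_output.strip()
    if not text:
        return []
    # Some Compose versions print a JSON array, others one JSON object per line.
    if text.startswith("["):
        rows = json.loads(text)
    else:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    bad = []
    for row in rows:
        state = row.get("State", "")    # assumed field name
        health = row.get("Health", "")  # assumed field name, empty without a healthcheck
        if state != "running" or health == "unhealthy":
            bad.append(row.get("Name", "<unknown>"))
    return bad
```

Feed it the captured command output, for example from `subprocess.run(..., capture_output=True, text=True).stdout`, and start with whichever service it names first.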
Langfuse Troubleshooting: ClickHouse Issues
ClickHouse is the core OLAP database that stores every trace, observation, and score. Most Langfuse troubleshooting cases involve ClickHouse in some way.
ClickHouse Won’t Start or Fails Migration
The most common startup error looks like this:
error: failed to open database: database driver: unknown driver "clickhouse (forgotten import?)"
Or you may see:
error: driver: bad connection in line 0
Both errors mean the web or worker container cannot reach ClickHouse. Check these things in order:
1. Make sure both URLs are set correctly
ClickHouse uses two different URLs for two different purposes, one for HTTP API calls and one for TCP migrations:
CLICKHOUSE_URL=http://clickhouse:8123
CLICKHOUSE_MIGRATION_URL=clickhouse://clickhouse:9000
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=YourClickHousePassword
A very common mistake is using http:// in both variables. The migration URL must use clickhouse:// and port 9000, not 8123.
2. Set CLICKHOUSE_CLUSTER_ENABLED to false for single-node setups
If you are running one ClickHouse container, especially for VPS deployments, set this variable to false. Otherwise, the system tries to run cluster commands that fail on a standalone instance:
CLICKHOUSE_CLUSTER_ENABLED=false
3. Verify ClickHouse timezone is UTC
ClickHouse must always run in UTC. A different timezone will cause queries to return empty or wrong results. Check it by connecting to the container:
docker exec -it langfuse-clickhouse-1 clickhouse-client --query "SELECT timezone()"
It must return UTC. If it does not, remove any custom timezone setting from your ClickHouse server config and restart.
4. Check user permissions
The ClickHouse user must have SELECT, ALTER, INSERT, CREATE, and DELETE grants on the database. If you recently changed the username or password without updating the environment variables, migrations will fail silently.
-- Run inside ClickHouse client
GRANT SELECT, INSERT, ALTER, CREATE, DELETE ON default.* TO 'your_user';
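The URL and cluster checks from items 1 and 2 above are easy to automate before you start the stack. This is a minimal sketch that only inspects the variable values as strings. It does not contact ClickHouse, and the function name `check_clickhouse_env` is hypothetical.

```python
from urllib.parse import urlsplit


def check_clickhouse_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in the ClickHouse-related variables."""
    problems = []

    http_url = urlsplit(env.get("CLICKHOUSE_URL", ""))
    if http_url.scheme not in ("http", "https"):
        problems.append("CLICKHOUSE_URL should use http:// (or https://)")

    migration_url = urlsplit(env.get("CLICKHOUSE_MIGRATION_URL", ""))
    if migration_url.scheme != "clickhouse":
        problems.append("CLICKHOUSE_MIGRATION_URL should use clickhouse://")
    elif migration_url.port == 8123:
        problems.append("CLICKHOUSE_MIGRATION_URL uses HTTP port 8123, expected 9000")

    # Single-node assumption: the guide expects an explicit "false" here.
    if env.get("CLICKHOUSE_CLUSTER_ENABLED", "").lower() != "false":
        problems.append("CLICKHOUSE_CLUSTER_ENABLED is not set to false")

    return problems
```

Passing `dict(os.environ)` from inside the web or worker container is one way to run it against the values the service actually sees.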
ClickHouse Running Out of Disk Space
If you see NOT_ENOUGH_SPACE errors in the logs, ClickHouse has filled its volume. Check disk usage with:
df -h
docker exec -it langfuse-clickhouse-1 clickhouse-client --query "SELECT name, total_space, free_space FROM system.disks"
The fastest short-term fix is to enable data retention inside the Langfuse UI under Settings → Data Retention. This runs a nightly cleanup across ClickHouse and blob storage. For a long-term fix, look at system log tables (trace_log, text_log), which have no TTL by default and grow without limit.
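To catch a filling disk before NOT_ENOUGH_SPACE appears, you could turn the `system.disks` query above into a small alert. The sketch assumes the query output is tab-separated with the columns in the order name, total_space, free_space. The 10% threshold and the function name `low_disks` are arbitrary choices for illustration.

```python
def low_disks(tsv: str, min_free_ratio: float = 0.10) -> list[tuple[str, float]]:
    """Return (disk name, free ratio) for disks below the free-space threshold."""
    result = []
    for line in tsv.strip().splitlines():
        name, total, free = line.split("\t")
        total_bytes, free_bytes = int(total), int(free)
        if total_bytes == 0:
            continue  # skip disks that report no capacity
        ratio = free_bytes / total_bytes
        if ratio < min_free_ratio:
            result.append((name, round(ratio, 3)))
    return result
```

An empty list means every disk has more free space than the threshold.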
Langfuse Troubleshooting: Postgres Issues
PostgreSQL stores Langfuse’s state, including user accounts, organizations, projects, API keys, and settings. Without a healthy Postgres connection, the web container cannot start at all.
Applying Database Migrations Failed
This error almost always appears in one of two situations:
- Postgres has not finished starting before the web container tried to connect.
- The DATABASE_URL is wrong.
The error looks like:
Applying database migrations failed. This is mostly caused by the database being unavailable.
Exiting...
Fix 1: Add a proper health check to your Postgres service
The web and worker containers must wait for Postgres to be truly ready, not just started:
postgres:
  image: postgres:17-alpine
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U langfuse -d langfuse"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s
And in the web and worker services, use condition: service_healthy instead of service_started:
depends_on:
  postgres:
    condition: service_healthy
Fix 2: Check the DATABASE_URL format
The connection string must follow this exact pattern:
DATABASE_URL=postgresql://username:password@hostname:5432/dbname
A typo in the username, password, or database name will cause the connection to fail. Also, make sure the hostname is the Docker service name, not localhost.
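A quick structural check of the connection string can rule out the most obvious mistakes. The following sketch only parses the URL and does not open a database connection. The function name `check_database_url` is made up here. Note that a password containing characters such as `@` or `/` has to be percent-encoded, otherwise any URL parser will split the string in the wrong place.

```python
from urllib.parse import urlsplit


def check_database_url(url: str) -> list[str]:
    """Return a list of structural problems in a Postgres connection string."""
    problems = []
    parts = urlsplit(url)

    if parts.scheme not in ("postgresql", "postgres"):
        problems.append("scheme should be postgresql://")
    if not parts.username or parts.password is None:
        problems.append("username or password is missing")
    if not parts.hostname:
        problems.append("hostname is missing")
    elif parts.hostname in ("localhost", "127.0.0.1"):
        problems.append("hostname is localhost; inside Docker use the service name")
    if not parts.path.strip("/"):
        problems.append("database name is missing")

    return problems
```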
Postgres Migration Fails with Permission Errors
If you changed the Postgres user or if migrations were previously run by a different user, you might see table ownership errors:
ERROR: must be owner of table ...
The fastest fix is to transfer ownership of all tables back to the active user. Connect to Postgres and run:
-- Run for each table, or use a script to loop all tables
ALTER TABLE table_name OWNER TO new_user;
Alternatively, set the DIRECT_URL environment variable to a superuser account specifically for migrations, while keeping DATABASE_URL for regular app operations.
ENCRYPTION_KEY Causing Segfault
If the web container crashes with a segmentation fault right after migrations seem to succeed, the most likely cause is an invalid ENCRYPTION_KEY. It must be exactly 64 hex characters (32 bytes). Generate a valid one with:
openssl rand -hex 32
Note: Never use placeholder zeros like 000…000 in production. That is a security risk and can also cause unexpected behavior.
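If you prefer to validate the key in a deployment script, the two rules above (64 hex characters, no all-zero placeholder) fit in a few lines. This is a sketch with hypothetical helper names. `secrets.token_hex(32)` produces the same kind of value as the openssl command.

```python
import re
import secrets


def is_valid_encryption_key(key: str) -> bool:
    """True if the key is exactly 64 hex characters and not the all-zero placeholder."""
    if not re.fullmatch(r"[0-9a-fA-F]{64}", key):
        return False
    return set(key) != {"0"}


def new_encryption_key() -> str:
    """Generate 32 random bytes encoded as 64 hex characters."""
    return secrets.token_hex(32)
```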
Langfuse Troubleshooting: Traces Not Appearing
This is one of the most common problems with self-hosted Langfuse. You send a trace, the API returns a 207 status, but nothing shows up in the UI.
Why Traces Are Delayed or Missing
Langfuse processes all traces asynchronously. The pipeline works like this:
- Your app sends events to /api/public/ingestion.
- The Web container accepts the event, writes it to S3/MinIO, and queues a job in Redis.
- The Worker container picks up the job within 0 to 60 seconds.
- The worker writes the processed data into ClickHouse.
- The UI reads from ClickHouse and shows the trace.
If any step in this chain breaks, traces will not appear. Here is how to check each step:
Step 1: Check Web container logs
Check the Web logs with:
docker compose logs langfuse-web --tail=100
Look for errors around the time you sent the trace. If you see errors mentioning Redis or S3, events are being rejected before they even hit the queue. Also, a non-207 status code from the ingestion endpoint confirms this.
Step 2: Check your S3/MinIO bucket
Events that the Web container accepts should appear in MinIO at:
/<projectId>/<type>/<eventId>/<randomId>.json
Connect to MinIO at http://your-server:9001 and browse the langfuse bucket. If the file is not there, your S3 config is broken.
Step 3: Check Worker container logs
Check the Worker logs with:
docker compose logs langfuse-worker --tail=100
If no events are being processed at all, this usually points to a Redis or S3 configuration problem. If you see the worker picking up events but failing, the error message will tell you which service is the problem.
Step 4: Query ClickHouse directly
Use the command below to query ClickHouse directly:
docker exec -it langfuse-clickhouse-1 clickhouse-client
Then run:
SELECT * FROM default.traces WHERE project_id = 'your-project-id' LIMIT 10;
SELECT * FROM default.observations WHERE project_id = 'your-project-id' LIMIT 10;
If the trace is in ClickHouse but not in the UI, that is a bug worth reporting. If it is not in ClickHouse, the worker never wrote it.
API Key and SDK Issues
Traces will silently fail if the SDK is misconfigured. Check these things:
- LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY must match what is in your Langfuse project settings exactly.
- LANGFUSE_BASE_URL must point to your self-hosted URL, not to the cloud.
- For short-lived scripts or Jupyter notebooks, you must call langfuse.flush() or langfuse.shutdown() before the script exits, otherwise queued events are dropped.
- If tracing_enabled is set to False or sample_rate is 0.0, no traces will be sent at all.
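The first two items in the list above can be checked without the SDK at all. The sketch below inspects only the three variables named in this section. The function name `check_sdk_env` is hypothetical, and the cloud check assumes Langfuse Cloud is served from hosts ending in `cloud.langfuse.com`.

```python
from urllib.parse import urlsplit


def check_sdk_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems in the SDK-related environment variables."""
    problems = []
    for name in ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_BASE_URL"):
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")

    host = urlsplit(env.get("LANGFUSE_BASE_URL", "")).hostname or ""
    # Assumption: Langfuse Cloud hosts end in cloud.langfuse.com.
    if host == "cloud.langfuse.com" or host.endswith(".cloud.langfuse.com"):
        problems.append("LANGFUSE_BASE_URL points to Langfuse Cloud, not your server")

    return problems
```

Running it with `dict(os.environ)` at the top of a script shows what the SDK will see before any trace is sent.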
Enable debug mode to see what the SDK is actually doing:
# Python
from langfuse import Langfuse

langfuse = Langfuse(debug=True)
# Environment variable
LANGFUSE_DEBUG=true
Langfuse Troubleshooting: Redis Problems
Redis is the queue layer that sits between the Web and Worker containers. If Redis goes down or runs out of memory, event processing stops completely.
Redis Out-of-Memory Errors
When Redis fills up, BullMQ, the queue library Langfuse uses, starts rejecting new jobs. You will see errors like:
OOM command not allowed when used memory > maxmemory
When this happens, the web container starts returning 5xx codes for new trace ingestion. You can fix it by:
- Setting the Redis eviction policy to noeviction so it does not silently drop data.
- Scaling the worker container so jobs are processed faster and the queue does not build up.
redis:
  image: redis:7-alpine
  command: redis-server --maxmemory-policy noeviction
Monitor queue health by checking the BullMQ Admin API that Langfuse exposes.
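You can also watch memory pressure directly in Redis. The sketch below parses the text returned by `redis-cli INFO memory` and flags the two conditions discussed in this section: an eviction policy other than noeviction, and memory use close to the limit. The 90% threshold and the function names are assumptions made for this example.

```python
def parse_redis_info(info: str) -> dict[str, str]:
    """Parse 'key:value' lines from INFO output, skipping section headers."""
    fields = {}
    for line in info.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def memory_warnings(info: str, threshold: float = 0.9) -> list[str]:
    """Return warnings about eviction policy and memory headroom."""
    fields = parse_redis_info(info)
    warnings = []
    if fields.get("maxmemory_policy") != "noeviction":
        warnings.append("maxmemory_policy is not noeviction")
    used = int(fields.get("used_memory", 0))
    limit = int(fields.get("maxmemory", 0))
    # A maxmemory of 0 means no limit is configured.
    if limit > 0 and used / limit >= threshold:
        warnings.append(f"used_memory is at {used / limit:.0%} of maxmemory")
    return warnings
```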
Redis Connection String
Make sure the connection string format is correct in both the web and worker containers:
REDIS_CONNECTION_STRING=redis://redis:6379/0
If you added a Redis password, the format changes to:
REDIS_CONNECTION_STRING=redis://:your_password@redis:6379/0
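To confirm that a connection string parses the way you expect, you can break it into its parts. This is a sketch using only the standard library. The function name `describe_redis_url` is hypothetical, and treating `rediss://` as the TLS variant is an assumption about your setup.

```python
from urllib.parse import urlsplit


def describe_redis_url(url: str) -> dict:
    """Split a Redis connection string into host, port, db, and auth flags."""
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError("connection string should start with redis:// or rediss://")
    db = parts.path.lstrip("/") or "0"
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,
        "db": int(db),
        "has_password": bool(parts.password),
        "tls": parts.scheme == "rediss",
    }
```

If `has_password` is False for a string you believe contains a password, the colon before the password is probably missing.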
Environment Variable Mistakes in Langfuse
Wrong environment variables are among the most common root causes of problems in self-hosted Langfuse. Here is a table of the most commonly broken variables and how to fix them:
| Variable | Common Mistake | Correct Format |
|---|---|---|
| CLICKHOUSE_URL | Using wrong port | http://clickhouse:8123 |
| CLICKHOUSE_MIGRATION_URL | Using http:// | clickhouse://clickhouse:9000 |
| DATABASE_URL | Typo in hostname | postgresql://user:pass@postgres:5432/db |
| ENCRYPTION_KEY | Placeholder zeros | 64-char hex from openssl rand -hex 32 |
| NEXTAUTH_SECRET | Left blank | Any random long string |
| NEXTAUTH_URL | Set to localhost on remote server | Your real domain, for example, https://ai.yourdomain.com |
| LANGFUSE_BASE_URL (SDK) | Pointing to cloud | Your self-hosted URL |
| CLICKHOUSE_CLUSTER_ENABLED | true on single node | false for single-container setups |
Important note: Both the web and worker containers need the same ClickHouse, Redis, and S3 environment variables. A common issue is setting them in one container but not the other.
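One way to catch this drift is to compare the two environments directly. The sketch below assumes you have each container's variables as a dictionary, and that the shared settings can be recognized by the prefixes `CLICKHOUSE_`, `REDIS_`, and `LANGFUSE_S3_`. The S3 prefix is an assumption, so adjust it to the names your deployment actually uses.

```python
# Assumed prefixes for settings that web and worker must share.
SHARED_PREFIXES = ("CLICKHOUSE_", "REDIS_", "LANGFUSE_S3_")


def shared_env_mismatches(web: dict[str, str], worker: dict[str, str]) -> list[str]:
    """Return shared variable names that are missing or different in one container."""
    keys = {k for k in set(web) | set(worker) if k.startswith(SHARED_PREFIXES)}
    return sorted(k for k in keys if web.get(k) != worker.get(k))
```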
LANGFUSE_INIT Variables Not Working
If you are trying to auto-create an organization, project, and user on first boot using the LANGFUSE_INIT_* variables and nothing gets created, make sure:
- Postgres is fully healthy before the web container starts. Use the service_healthy condition.
- All required LANGFUSE_INIT_* variables are set together; leaving out one, like LANGFUSE_INIT_ORG_ID, causes the whole block to be skipped.
- The LANGFUSE_INIT_PROJECT_PUBLIC_KEY and LANGFUSE_INIT_PROJECT_SECRET_KEY do not already exist in the database from a previous run.
A working minimal set looks like this:
LANGFUSE_INIT_ORG_ID=my-org
LANGFUSE_INIT_ORG_NAME=My Organization
LANGFUSE_INIT_PROJECT_ID=my-project
LANGFUSE_INIT_PROJECT_NAME=My Project
LANGFUSE_INIT_PROJECT_PUBLIC_KEY=lf_pk_your_key
LANGFUSE_INIT_PROJECT_SECRET_KEY=lf_sk_your_key
LANGFUSE_INIT_USER_EMAIL=ad***@********in.com
LANGFUSE_INIT_USER_NAME=Admin
LANGFUSE_INIT_USER_PASSWORD=your-secure-password
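Because a single missing variable can cause the block to be skipped, it helps to check the set as a whole. The sketch below treats the nine names from the example above as the required set, which is an assumption; the function name `missing_init_vars` is made up for illustration.

```python
# Assumption: the names from the minimal example are the ones that must be set together.
REQUIRED_INIT_VARS = (
    "LANGFUSE_INIT_ORG_ID",
    "LANGFUSE_INIT_ORG_NAME",
    "LANGFUSE_INIT_PROJECT_ID",
    "LANGFUSE_INIT_PROJECT_NAME",
    "LANGFUSE_INIT_PROJECT_PUBLIC_KEY",
    "LANGFUSE_INIT_PROJECT_SECRET_KEY",
    "LANGFUSE_INIT_USER_EMAIL",
    "LANGFUSE_INIT_USER_NAME",
    "LANGFUSE_INIT_USER_PASSWORD",
)


def missing_init_vars(env: dict[str, str]) -> list[str]:
    """Return the LANGFUSE_INIT_* names that are unset or empty."""
    return [name for name in REQUIRED_INIT_VARS if not env.get(name, "").strip()]
```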
Langfuse Troubleshooting After Updates
When you pull a new image version, the web container runs database migrations automatically on startup. If anything is wrong, migrations fail, and the container exits.
Before updating, always:
- Back up your Postgres database:
pg_dump -U langfuse langfuse > backup_$(date +%Y%m%d).sql
- Back up your ClickHouse data or take a snapshot of the volume.
- Read the Langfuse release notes for breaking changes before pulling.
If migration fails after an update, the error will be in the web container logs. Common post-update issues include:
- A new migration tries to modify a table owned by a different user. The fix is to transfer ownership as described above.
- A new version requires a new environment variable that was not present before. Fix it by adding it to your .env file.
- ClickHouse version mismatch. Langfuse requires ClickHouse 24.3 or higher.
To roll back, stop the containers, restore your database backup, and pin the previous Docker image tag in your compose file.
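The version requirement mentioned above can be checked before you pull a new image. The sketch compares the first two components of a version string, such as the one returned by `SELECT version()` in clickhouse-client, against the 24.3 minimum. The function name `meets_minimum` is hypothetical.

```python
def meets_minimum(version: str, minimum: tuple[int, int] = (24, 3)) -> bool:
    """True if the major.minor part of a version string is at least the minimum."""
    parts = tuple(int(p) for p in version.strip().split(".")[:2])
    return parts >= minimum
```

A result of False before an upgrade is a signal to update ClickHouse first.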
Langfuse Pipeline Verification
When you want to confirm the full data pipeline is working after a fix, follow this checklist:
# 1. All containers healthy?
docker compose ps
# 2. Send a test trace via curl
curl -X POST https://your-langfuse-url/api/public/ingestion \
  -H "Authorization: Basic $(echo -n 'pk_xxx:sk_xxx' | base64)" \
  -H "Content-Type: application/json" \
  -d '{"batch":[{"id":"test-123","type":"trace-create","timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","body":{"id":"test-123","name":"test-trace"}}]}'
# 3. Check web logs for acceptance (look for 207 response)
docker compose logs langfuse-web --tail=20
# 4. Wait 30–60 seconds, then check worker
docker compose logs langfuse-worker --tail=20
# 5. Query ClickHouse for the trace
docker exec -it langfuse-clickhouse-1 clickhouse-client \
  --query "SELECT id, name, project_id FROM default.traces WHERE id='test-123'"
If step 5 returns a row, the full pipeline is healthy. If the trace appears in ClickHouse but not in the UI, clear your browser cache or check if the API key used in the curl matches the project you are viewing.
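If you want to run step 2 from a script instead of curl, the same request can be built with the Python standard library. This sketch mirrors the curl payload above. The helper names are made up for illustration, and `send_batch` assumes your server is reachable at the base URL you pass in.

```python
import base64
import json
import urllib.request
from datetime import datetime, timezone


def basic_auth_header(public_key: str, secret_key: str) -> str:
    """Build the Basic auth header value from the project key pair."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return f"Basic {token}"


def build_test_batch(trace_id: str, name: str = "test-trace") -> dict:
    """Build the same single-event batch as the curl example."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    event = {
        "id": trace_id,
        "type": "trace-create",
        "timestamp": timestamp,
        "body": {"id": trace_id, "name": name},
    }
    return {"batch": [event]}


def send_batch(base_url: str, public_key: str, secret_key: str, batch: dict) -> int:
    """POST the batch to the ingestion endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/api/public/ingestion",
        data=json.dumps(batch).encode(),
        headers={
            "Authorization": basic_auth_header(public_key, secret_key),
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # urlopen raises HTTPError for 4xx/5xx responses; a 207 is returned normally.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_batch("https://your-langfuse-url", "pk_xxx", "sk_xxx", build_test_batch("test-123"))` should return 207 when the web container accepts the event.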
When to Upgrade Your Infrastructure for Self-hosted Langfuse
Langfuse troubleshooting sometimes reveals that the real problem is not a misconfiguration; it is that your server does not have enough resources.
Signs that you need more hardware include:
- ClickHouse or the worker container is constantly at 100% of memory or CPU usage.
- Traces are being dropped during traffic spikes.
- Disk fills up within days, even with retention policies enabled.
- Response times from the UI keep growing.
For teams that rely heavily on observability, losing traces means losing insight into production AI systems, so weak infrastructure becomes a serious risk. Upgrading to a reliable dedicated server gives ClickHouse the dedicated RAM it needs for fast analytical queries, removes resource contention between containers, and provides stable disk I/O that prevents write bottlenecks in both Postgres and MinIO.
At a minimum, production Langfuse deployments need 4 CPU cores, 16 GB RAM, and 100 GB of disk. For teams processing thousands of traces per hour, 8+ cores and 32+ GB RAM is a safer baseline. If you are evaluating the right AI hosting option, PerLod AI Hosting Environment is built for exactly these kinds of inference and observability workloads.
Conclusion
Most Langfuse issues come down to a small set of root causes: wrong ClickHouse URLs, a Postgres container that was not fully ready, a broken S3 connection, or an invalid ENCRYPTION_KEY. The pipeline is built in layers, so trace the problem from the Web container down to ClickHouse.
We hope you found this guide useful. For more information, you can check the official troubleshooting page.
FAQs
Why are my Langfuse traces not showing up even though the API returns 207?
A 207 means the event was accepted, but processing is async. Wait 60 seconds, then check the worker logs and query ClickHouse directly for the event.
Why does the ClickHouse container start but the web container crashes in Langfuse?
Check that CLICKHOUSE_MIGRATION_URL uses clickhouse:// with port 9000, not the HTTP URL. Also, confirm CLICKHOUSE_CLUSTER_ENABLED=false if you run a single node.
How do I know if ClickHouse is receiving data in Langfuse?
Connect with clickhouse-client and query SELECT * FROM default.traces LIMIT 5. If results appear, ClickHouse is working fine.