Langfuse Troubleshooting Guide: Fix ClickHouse, Postgres, and Ingestion Issues

If you are running a self-hosted AI stack and your traces have stopped appearing, your services will not start, or something broke after an update, this Langfuse troubleshooting guide is for you.

Langfuse runs several moving parts, including a web container, a worker container, PostgreSQL, ClickHouse, Redis, and MinIO (or S3). When one of them misbehaves, the whole pipeline breaks. This guide covers each layer so you know exactly where to look.

If you haven’t set up Langfuse yet, you can start with the self-hosted Langfuse on VPS installation guide.

Quick Health Check Before You Start

Before going deep into any single service, you can run these commands to see what is alive and what is not.

Check all container statuses:

docker compose ps

Stream live logs from all containers:

docker compose logs --tail=50 -f

Check resource usage per container:

docker stats

Look for containers that keep restarting or show an Exit status, which tells you where the problem is. The steps below go through each service one by one.
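If you would rather not scan the status table by eye, the same check can be sketched in Python. This is a minimal example, not part of Langfuse: it assumes you pass it the text printed by docker compose ps -a --format json, and that each entry carries Service, State, and Health fields, which can differ between Compose versions. The function name unhealthy_services is made up for illustration.

```python
import json


def unhealthy_services(ps_output: str) -> list[str]:
    """Return services that are not running or whose health check is failing.

    Assumes JSON from `docker compose ps -a --format json` with the fields
    Service, State, and Health (an assumption; verify on your Compose version).
    """
    text = ps_output.strip()
    if not text:
        return []
    # Some Compose releases print a single JSON array, others one object per line.
    if text.startswith("["):
        rows = json.loads(text)
    else:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    bad = []
    for row in rows:
        state = row.get("State", "")
        health = row.get("Health", "")
        if state != "running" or health == "unhealthy":
            bad.append(row.get("Service", row.get("Name", "?")))
    return bad
```

Any name this returns is the service to inspect first in the sections that follow.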

Langfuse Troubleshooting: ClickHouse Issues

ClickHouse is the core OLAP database that stores every trace, observation, and score. Most Langfuse troubleshooting cases involve ClickHouse in some way.

ClickHouse Won’t Start or Fails Migration

The most common startup error looks like this:

error: failed to open database: database driver: unknown driver "clickhouse (forgotten import?)"

Or you may see:

error: driver: bad connection in line 0

Both errors mean the web or worker container cannot reach ClickHouse. Check these things in order:

1. Make sure both URLs are set correctly

ClickHouse uses two different URLs for two different purposes, one for HTTP API calls and one for TCP migrations:

CLICKHOUSE_URL=http://clickhouse:8123
CLICKHOUSE_MIGRATION_URL=clickhouse://clickhouse:9000
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=YourClickHousePassword

A very common mistake is using http:// in both variables. The migration URL must use clickhouse:// and port 9000, not 8123.
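The scheme and port rule above is easy to check mechanically. The sketch below uses only the Python standard library; check_clickhouse_urls is a hypothetical helper, and accepting https:// for CLICKHOUSE_URL is an assumption for setups that put ClickHouse behind TLS.

```python
from urllib.parse import urlsplit


def check_clickhouse_urls(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in the two ClickHouse URL variables."""
    problems = []
    http_url = urlsplit(env.get("CLICKHOUSE_URL", ""))
    migration_url = urlsplit(env.get("CLICKHOUSE_MIGRATION_URL", ""))
    # The HTTP API URL is used for queries (https is assumed valid for TLS setups).
    if http_url.scheme not in ("http", "https"):
        problems.append("CLICKHOUSE_URL should start with http:// or https://")
    # The migration URL speaks the native TCP protocol.
    if migration_url.scheme != "clickhouse":
        problems.append("CLICKHOUSE_MIGRATION_URL should start with clickhouse://")
    if migration_url.port == 8123:
        problems.append("CLICKHOUSE_MIGRATION_URL uses the HTTP port 8123; use 9000")
    return problems
```

Calling it with dict(os.environ) inside a container, or with the values parsed from your .env file, returns an empty list when both variables follow the pattern shown above.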

2. Set CLICKHOUSE_CLUSTER_ENABLED to false for single-node setups

If you are running one ClickHouse container, especially for VPS deployments, set this variable to false. Otherwise, the system tries to run cluster commands that fail on a standalone instance:

CLICKHOUSE_CLUSTER_ENABLED=false

3. Verify ClickHouse timezone is UTC

ClickHouse must always run in UTC. A different timezone will cause queries to return empty or wrong results. Check it by connecting to the container:

docker exec -it langfuse-clickhouse-1 clickhouse-client --query "SELECT timezone()"

It must return UTC. If it does not, remove any custom timezone setting from your ClickHouse server config and restart.

4. Check user permissions

The ClickHouse user must have SELECT, ALTER, INSERT, CREATE, and DELETE grants on the database. If you recently changed the username or password without updating the environment variables, migrations will fail silently.

-- Run inside ClickHouse client
GRANT SELECT, INSERT, ALTER, CREATE, DELETE ON default.* TO 'your_user';

ClickHouse Running Out of Disk Space

If you see NOT_ENOUGH_SPACE errors in the logs, ClickHouse has filled its volume. Check disk usage with:

df -h
docker exec -it langfuse-clickhouse-1 clickhouse-client --query "SELECT name, total_space, free_space FROM system.disks"

The fastest short-term fix is to enable data retention inside the Langfuse UI under Settings → Data Retention. This runs a nightly cleanup across ClickHouse and blob storage. For a long-term fix, look at system log tables (trace_log, text_log), which have no TTL by default and grow without limit.
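The system.disks query prints raw byte counts, which are hard to read at a glance. As a rough sketch, assuming the tab-separated output that clickhouse-client produces for --query in non-interactive mode, the hypothetical helper below converts each row into a free-space percentage.

```python
def disk_free_percent(tsv_output: str) -> dict[str, float]:
    """Map each disk name to its free space as a percentage of total space.

    Expects rows of "name<TAB>total_space<TAB>free_space", the column order
    used in the SELECT above.
    """
    result = {}
    for line in tsv_output.strip().splitlines():
        name, total, free = line.split("\t")
        total_bytes = int(total)
        if total_bytes > 0:  # skip disks that report no capacity
            result[name] = round(100.0 * int(free) / total_bytes, 1)
    return result
```

A value that keeps shrinking between runs is a sign that retention is not keeping up with ingestion.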

Langfuse Troubleshooting: Postgres Issues

PostgreSQL stores Langfuse’s state, including user accounts, organizations, projects, API keys, and settings. Without a healthy Postgres connection, the web container cannot start at all.

Applying Database Migrations Failed

This error almost always appears in one of two situations:

  • Postgres has not finished starting before the web container tried to connect.
  • The DATABASE_URL is wrong.

The error looks like:

Applying database migrations failed. This is mostly caused by the database being unavailable.
Exiting...

Fix 1: Add a proper health check to your Postgres service

The web and worker containers must wait for Postgres to be truly ready, not just started:

postgres:
  image: postgres:17-alpine
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U langfuse -d langfuse"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s

And in the web and worker services, use condition: service_healthy instead of service_started:

depends_on:
  postgres:
    condition: service_healthy

Fix 2: Check the DATABASE_URL format

The connection string must follow this exact pattern:

DATABASE_URL=postgresql://username:password@hostname:5432/dbname

A typo in the username, password, or database name will cause a silent connection failure. Also, make sure the hostname matches the Docker service name, not localhost.
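One more source of broken connection strings is a password containing characters such as @, :, or /, which must be percent-encoded inside a URL. The sketch below builds and checks a DATABASE_URL with the standard library. Both function names are hypothetical, and accepting the shorter postgres:// scheme is an assumption.

```python
from urllib.parse import quote, urlsplit


def build_database_url(user, password, host, dbname, port=5432):
    """Assemble a connection string, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )


def check_database_url(url):
    """Return a list of problems found in a DATABASE_URL value."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append("scheme should be postgresql://")
    if not parts.username or parts.password is None:
        problems.append("username or password is missing")
    if parts.hostname in ("localhost", "127.0.0.1"):
        problems.append("hostname is localhost; use the Docker service name")
    if not parts.path.strip("/"):
        problems.append("database name is missing")
    return problems
```

For example, build_database_url("langfuse", "p@ss:w/rd", "postgres", "langfuse") yields postgresql://langfuse:p%40ss%3Aw%2Frd@postgres:5432/langfuse.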

Postgres Migration Fails with Permission Errors

If you changed the Postgres user or if migrations were previously run by a different user, you might see table ownership errors:

ERROR: must be owner of table ...

The fastest fix is to transfer ownership of all tables back to the active user. Connect to Postgres and run:

-- Run for each table, or use a script to loop all tables
ALTER TABLE table_name OWNER TO new_user;

Alternatively, set the DIRECT_URL environment variable to a superuser account specifically for migrations, while keeping DATABASE_URL for regular app operations.
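If you take the ownership route, typing one ALTER TABLE per table gets tedious. The sketch below only generates the statements so you can review them before running anything. It assumes you already have the table names, for example from SELECT tablename FROM pg_tables WHERE schemaname = 'public'. The table names in the usage note are placeholders, not the real Langfuse schema.

```python
def ownership_statements(tables: list[str], new_owner: str) -> list[str]:
    """Build one ALTER TABLE ... OWNER TO statement per table name."""

    def quote_ident(name: str) -> str:
        # Double-quote identifiers and escape embedded quotes, as Postgres expects.
        return '"' + name.replace('"', '""') + '"'

    return [
        f"ALTER TABLE {quote_ident(table)} OWNER TO {quote_ident(new_owner)};"
        for table in tables
    ]
```

Printing ownership_statements(["users", "api_keys"], "langfuse") line by line gives a script you can paste into psql.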

ENCRYPTION_KEY Causing Segfault

If the web container crashes with a segmentation fault right after migrations seem to succeed, the most likely cause is an invalid ENCRYPTION_KEY. It must be exactly 64 hex characters (32 bytes). Generate a valid one with:

openssl rand -hex 32

Note: Never use placeholder zeros like 000…000 in production. That is a security risk and can also cause unexpected behavior.
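A quick way to catch a bad key before the container does is to validate it yourself. This is a minimal sketch using the Python standard library; the helper names are made up, and rejecting an all-zero key is a policy choice taken from the note above rather than a rule enforced by Langfuse.

```python
import re
import secrets


def is_valid_encryption_key(key: str) -> bool:
    """True if key is exactly 64 hex characters and not the all-zero placeholder."""
    is_hex_64 = re.fullmatch(r"[0-9a-fA-F]{64}", key) is not None
    return is_hex_64 and set(key) != {"0"}


def new_encryption_key() -> str:
    """Generate 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)
```

Keys with stray whitespace, quotes, or a trailing newline copied from a terminal will fail this check, which is a common way a correct-looking key ends up invalid.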

Langfuse Troubleshooting: Traces Not Appearing

This is one of the most common reasons people end up troubleshooting Langfuse. You send a trace, the API returns a 207 status, but nothing shows up in the UI.

Why Traces Are Delayed or Missing

Langfuse processes all traces asynchronously. The pipeline works like this:

  • Your app sends events to /api/public/ingestion.
  • The Web container accepts the event and writes it to S3/MinIO and queues the job in Redis.
  • The Worker container picks up the job within 0 to 60 seconds.
  • The worker writes the processed data into ClickHouse.
  • The UI reads from ClickHouse and shows the trace.

If any step in this chain breaks, traces will not appear. Here is how to check each step:

Step 1: Check Web container logs

Check the Web logs with:

docker compose logs langfuse-web --tail=100

Look for errors around the time you sent the trace. If you see errors mentioning Redis or S3, events are being rejected before they even hit the queue. Also, a non-207 status code from the ingestion endpoint confirms this.

Step 2: Check your S3/MinIO bucket

Events that the Web container accepts should appear in MinIO at:

/<projectId>/<type>/<eventId>/<randomId>.json

Connect to MinIO at http://your-server:9001 and browse the langfuse bucket. If the file is not there, your S3 config is broken.

Step 3: Check Worker container logs

Check the Worker logs with:

docker compose logs langfuse-worker --tail=100

If no events are being processed at all, this usually points to a Redis or S3 configuration problem. If you see the worker picking up events but failing, the error message will tell you which service is the problem.

Step 4: Query ClickHouse directly

Use the command below to query ClickHouse directly:

docker exec -it langfuse-clickhouse-1 clickhouse-client

Then run:

SELECT * FROM default.traces WHERE project_id = 'your-project-id' LIMIT 10;
SELECT * FROM default.observations WHERE project_id = 'your-project-id' LIMIT 10;

If the trace is in ClickHouse but not in the UI, that is a bug worth reporting. If it is not in ClickHouse, the worker never wrote it.
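The four steps above form a simple decision chain, which can be written down as a sketch. The function and its return labels are hypothetical; the inputs are the observations you collected manually in each step.

```python
def locate_failure(ingestion_status: int, in_bucket: bool,
                   worker_errors: bool, in_clickhouse: bool) -> str:
    """Name the pipeline stage that most likely broke, following steps 1 to 4."""
    if ingestion_status != 207:
        return "web"      # event rejected before it reached the queue
    if not in_bucket:
        return "s3"       # accepted, but never written to S3/MinIO
    if in_clickhouse:
        return "ui"       # data was stored; the problem is on the read side
    if worker_errors:
        return "worker"   # worker picked the job up and failed
    return "queue"        # nothing processed: check Redis and worker startup
```

For instance, locate_failure(207, True, False, False) returns "queue", which points at Redis or a worker that never started rather than at ClickHouse.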

API Key and SDK Issues

Traces will silently fail if the SDK is misconfigured. Check these things:

  • LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY must match what is in your Langfuse project settings exactly.
  • LANGFUSE_BASE_URL must point to your self-hosted URL, not to the cloud.
  • For short-lived scripts or Jupyter notebooks, you must call langfuse.flush() or langfuse.shutdown() before the script exits, otherwise queued events are dropped.
  • If tracing_enabled is set to False or sample_rate is 0.0, no traces will be sent at all.

Enable debug mode to see what the SDK is actually doing:

# Python
langfuse = Langfuse(debug=True)

# Environment variable
LANGFUSE_DEBUG=true
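Before turning on debug logging, it can help to rule out the basic configuration mistakes listed above. The sketch below checks only the three variables named in this section and treats any hostname ending in cloud.langfuse.com as the cloud service. Both the helper name and that hostname rule are assumptions for illustration.

```python
from urllib.parse import urlsplit


def check_sdk_env(env: dict[str, str]) -> list[str]:
    """Return a list of likely SDK configuration problems."""
    problems = []
    for name in ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"):
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")
    host = urlsplit(env.get("LANGFUSE_BASE_URL", "")).hostname or ""
    if not host:
        problems.append("LANGFUSE_BASE_URL is not set or is not a URL")
    elif host.endswith("cloud.langfuse.com"):
        # Assumed pattern for Langfuse Cloud hosts; adjust if yours differs.
        problems.append("LANGFUSE_BASE_URL points to Langfuse Cloud")
    return problems
```

Run it with dict(os.environ) in the same process that creates the Langfuse client, so it sees exactly what the SDK sees.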

Langfuse Troubleshooting: Redis Problems

Redis is the queue layer that sits between the Web and Worker containers. If Redis goes down or runs out of memory, event processing stops completely.

Redis Out-of-Memory Errors

When Redis fills up, BullMQ, the queue library Langfuse uses, starts rejecting new jobs. You will see errors like:

OOM command not allowed when used memory > maxmemory

When this happens, the web container starts returning 5xx codes for new trace ingestion. You can fix it by:

  1. Setting the Redis eviction policy to noeviction so it does not silently drop data.
  2. Scaling the worker container so jobs are processed faster and the queue does not build up.

redis:
  image: redis:7-alpine
  command: redis-server --maxmemory-policy noeviction

Monitor queue health by checking the BullMQ Admin API that Langfuse exposes.

Redis Connection String

Make sure the connection string format is correct in both the web and worker containers:

REDIS_CONNECTION_STRING=redis://redis:6379/0

If you added a Redis password, the format changes to:

REDIS_CONNECTION_STRING=redis://:your_password@redis:6379/0
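A frequent slip with the password form is dropping the colon before the password, which turns it into a username. The sketch below catches that and a wrong scheme. check_redis_url is a hypothetical helper, and allowing rediss:// for TLS connections is an assumption.

```python
from urllib.parse import urlsplit


def check_redis_url(url: str) -> list[str]:
    """Return a list of problems found in a REDIS_CONNECTION_STRING value."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("redis", "rediss"):
        problems.append("scheme should be redis:// (or rediss:// for TLS)")
    if not parts.hostname:
        problems.append("hostname is missing")
    # "redis://secret@host" parses "secret" as a username with no password.
    if parts.username and parts.password is None:
        problems.append("password needs a leading colon: redis://:password@host")
    return problems
```

Both formats shown above pass with an empty list, while redis://your_password@redis:6379/0 is flagged.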

Environment Variable Mistakes in Langfuse

Wrong environment variables are one of the most frequent root causes in self-hosted Langfuse troubleshooting sessions. The table below lists the most commonly broken variables and how to fix them:

Variable | Common Mistake | Correct Format
CLICKHOUSE_URL | Using wrong port | http://clickhouse:8123
CLICKHOUSE_MIGRATION_URL | Using http:// | clickhouse://clickhouse:9000
DATABASE_URL | Typo in hostname | postgresql://user:pass@postgres:5432/db
ENCRYPTION_KEY | Placeholder zeros | 64-char hex from openssl rand -hex 32
NEXTAUTH_SECRET | Left blank | Any random long string
NEXTAUTH_URL | Set to localhost on remote server | Your real domain, for example, https://ai.yourdomain.com
LANGFUSE_BASE_URL (SDK) | Pointing to cloud | Your self-hosted URL
CLICKHOUSE_CLUSTER_ENABLED | true on single node | false for single-container setups

Important note: Both the web and worker containers need the same ClickHouse, Redis, and S3 environment variables. A common issue is setting them in one container but not the other.
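Drift between the two containers is easy to detect once you have each container's environment as a dictionary, for example parsed from docker compose exec output of env. The sketch below is illustrative only: env_drift is a made-up name, and the prefix list, in particular LANGFUSE_S3_ for the blob storage settings, is an assumption you should adapt to your compose file.

```python
# Prefixes of variables both containers are expected to share (assumed list).
SHARED_PREFIXES = ("CLICKHOUSE_", "REDIS_", "LANGFUSE_S3_")


def env_drift(web: dict[str, str], worker: dict[str, str]) -> list[str]:
    """Return shared variable names that are missing or different in one container."""
    names = {
        name
        for name in set(web) | set(worker)
        if name.startswith(SHARED_PREFIXES)
    }
    return sorted(name for name in names if web.get(name) != worker.get(name))
```

An empty result means both containers agree on every variable with one of those prefixes; variables such as NEXTAUTH_URL that only the web container needs are ignored.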

LANGFUSE_INIT Variables Not Working

If you are trying to auto-create an organization, project, and user on first boot using the LANGFUSE_INIT_* variables and nothing gets created, make sure:

  • Postgres is fully healthy before the web container starts. Use the service_healthy condition.
  • All required LANGFUSE_INIT_* variables are set together; leaving out one, like LANGFUSE_INIT_ORG_ID, causes the whole block to be skipped.
  • The LANGFUSE_INIT_PROJECT_PUBLIC_KEY and LANGFUSE_INIT_PROJECT_SECRET_KEY do not already exist in the database from a previous run.

A working minimal set looks like this:

LANGFUSE_INIT_ORG_ID=my-org
LANGFUSE_INIT_ORG_NAME=My Organization
LANGFUSE_INIT_PROJECT_ID=my-project
LANGFUSE_INIT_PROJECT_NAME=My Project
LANGFUSE_INIT_PROJECT_PUBLIC_KEY=lf_pk_your_key
LANGFUSE_INIT_PROJECT_SECRET_KEY=lf_sk_your_key
LANGFUSE_INIT_USER_EMAIL=ad***@********in.com
LANGFUSE_INIT_USER_NAME=Admin
LANGFUSE_INIT_USER_PASSWORD=your-secure-password
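Because one missing variable is enough for the whole block to be skipped, a completeness check is worth running before first boot. The sketch below treats the nine variables from the example above as the required set, which is an assumption based on this guide rather than an official list; missing_init_vars is a hypothetical helper.

```python
# The set used in the example above (assumed to be the required minimum).
REQUIRED_INIT_VARS = (
    "LANGFUSE_INIT_ORG_ID",
    "LANGFUSE_INIT_ORG_NAME",
    "LANGFUSE_INIT_PROJECT_ID",
    "LANGFUSE_INIT_PROJECT_NAME",
    "LANGFUSE_INIT_PROJECT_PUBLIC_KEY",
    "LANGFUSE_INIT_PROJECT_SECRET_KEY",
    "LANGFUSE_INIT_USER_EMAIL",
    "LANGFUSE_INIT_USER_NAME",
    "LANGFUSE_INIT_USER_PASSWORD",
)


def missing_init_vars(env: dict[str, str]) -> list[str]:
    """Return the LANGFUSE_INIT_* variables that are unset or empty."""
    return [name for name in REQUIRED_INIT_VARS if not env.get(name)]
```

An empty list means the set is complete; anything else names the variables to add before restarting the web container.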

Langfuse Troubleshooting After Updates

When you pull a new image version, the web container runs database migrations automatically on startup. If anything is wrong, migrations fail, and the container exits.

Before updating, always:

  • Back up your Postgres database:
pg_dump -U langfuse langfuse > backup_$(date +%Y%m%d).sql
  • Back up your ClickHouse data or take a snapshot of the volume.
  • Read the Langfuse release notes for breaking changes before pulling.

If migration fails after an update, the error will be in the web container logs. Common post-update issues include:

  • A new migration tries to modify a table owned by a different user. The fix is to transfer ownership as described above.
  • A new version requires a new environment variable that was not present before. Fix it by adding it to your .env file.
  • ClickHouse version mismatch. Langfuse requires ClickHouse 24.3 or higher.
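The ClickHouse version requirement can be checked against the string returned by SELECT version(). A minimal sketch, assuming the usual dotted format such as 24.3.1.2672; clickhouse_version_ok is a made-up helper name.

```python
def clickhouse_version_ok(version: str, minimum: tuple = (24, 3)) -> bool:
    """True if the major.minor part of a ClickHouse version meets the minimum."""
    major, minor = (int(part) for part in version.strip().split(".")[:2])
    # Tuple comparison handles cases like 24.12 > 24.3 correctly,
    # which a plain string or float comparison would get wrong.
    return (major, minor) >= minimum
```

Comparing tuples of integers rather than strings matters here, since "24.12" sorts before "24.3" as text.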

To roll back, stop the containers, restore your database backup, and pin the previous Docker image tag in your compose file.

Langfuse Pipeline Verification

When you want to confirm the full data pipeline is working after a fix, follow this checklist:

# 1. All containers healthy?
docker compose ps

# 2. Send a test trace via curl
curl -X POST https://your-langfuse-url/api/public/ingestion \
  -u 'pk_xxx:sk_xxx' \
  -H "Content-Type: application/json" \
  -d '{"batch":[{"id":"test-123","type":"trace-create","timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","body":{"id":"test-123","name":"test-trace"}}]}'

# 3. Check web logs for acceptance (look for 207 response)
docker compose logs langfuse-web --tail=20

# 4. Wait 30–60 seconds, then check worker
docker compose logs langfuse-worker --tail=20

# 5. Query ClickHouse for the trace
docker exec -it langfuse-clickhouse-1 clickhouse-client \
  --query "SELECT id, name, project_id FROM default.traces WHERE id='test-123'"

If step 5 returns a row, the full pipeline is healthy. If the trace appears in ClickHouse but not in the UI, clear your browser cache and check that the API key used in the curl command belongs to the project you are viewing.
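If you prefer to send the test trace from Python, the request in step 2 can be assembled with the standard library alone. The sketch below only builds the headers and JSON body; it does not send anything. The function name is hypothetical, and the payload mirrors the curl example rather than the full ingestion schema.

```python
import base64
import json
from datetime import datetime, timezone


def build_test_request(public_key: str, secret_key: str, trace_id: str = "test-123"):
    """Return (headers, json_body) for a single trace-create ingestion event."""
    # Basic auth is base64("public_key:secret_key").
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
    }
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    body = {
        "batch": [
            {
                "id": trace_id,
                "type": "trace-create",
                "timestamp": timestamp,
                "body": {"id": trace_id, "name": "test-trace"},
            }
        ]
    }
    return headers, json.dumps(body)
```

You can pass the result to any HTTP client, for example urllib.request.Request(url, data=payload.encode(), headers=headers, method="POST"), with url set to your own /api/public/ingestion endpoint.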

When to Upgrade Your Infrastructure for Self-hosted Langfuse

Langfuse troubleshooting sometimes reveals that the real problem is not a misconfiguration; it is that your server does not have enough resources.

Signs that you need more hardware include:

  • ClickHouse or the worker container is constantly at 100% of memory or CPU usage.
  • Traces are being dropped during traffic spikes.
  • Disk fills up within days, even with retention policies enabled.
  • Response times from the UI keep growing.

For teams that rely heavily on observability, losing traces means losing insight into production AI systems, so weak infrastructure becomes a serious risk. Upgrading to a reliable dedicated server gives ClickHouse the dedicated RAM it needs for fast analytical queries, removes resource contention between containers, and provides stable disk I/O that prevents write bottlenecks in both Postgres and MinIO.

At a minimum, production Langfuse deployments need 4 CPU cores, 16 GB RAM, and 100 GB of disk. For teams processing thousands of traces per hour, 8+ cores and 32+ GB RAM is a safer baseline. If you are evaluating the right AI hosting option, PerLod AI Hosting Environment is built for exactly these kinds of inference and observability workloads.

Conclusion

Most Langfuse troubleshooting issues come down to a small set of root causes, including wrong ClickHouse URLs, a Postgres container that was not fully ready, a broken S3 connection, or an invalid ENCRYPTION_KEY. The pipeline is built in layers, so trace the problem from the Web container down to ClickHouse.

We hope you found this guide useful. For more information, you can check the official troubleshooting page.

FAQs

Why are my Langfuse traces not showing up even though the API returns 207?

A 207 means the event was accepted, but processing is async. Wait 60 seconds, then check the worker logs and query ClickHouse directly for the event.

Why does the ClickHouse container start but the web container crashes in Langfuse?

Check that CLICKHOUSE_MIGRATION_URL uses clickhouse:// with port 9000, not the HTTP URL. Also, confirm CLICKHOUSE_CLUSTER_ENABLED=false if you run a single node.

How do I know if ClickHouse is receiving data in Langfuse?

Connect with clickhouse-client and query SELECT * FROM default.traces LIMIT 5. If results appear, ClickHouse is working fine.
