You’ll Find Out From Your Customers. Or From Your Dashboard.
Your API has been returning 500 errors for 23 minutes. Three customers emailed support. Two posted on Twitter. Your team found out from the tweets.
This is what happens without monitoring. You’re flying blind, and your customers are your canaries.
The fix isn’t complicated. It isn’t expensive. But most growing companies skip monitoring because it feels like infrastructure work, not product work. Until the first outage costs them a client.
The Four Golden Signals: Start Here, Ignore Everything Else
Google’s Site Reliability Engineering book defined the four golden signals. They’ve held up for nearly a decade because they cover what actually matters.
Latency. How long do requests take? Track both the average and the 95th percentile (p95). The average can look fine while 5% of your users have a terrible experience. A spike in p95 latency is usually your first warning sign.
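A toy illustration of how the average hides the tail, with made-up numbers and plain Python:

```python
import statistics

# Synthetic traffic: 95 fast requests and 5 very slow ones.
latencies_ms = [50] * 95 + [4000] * 5

print(statistics.mean(latencies_ms))                  # 247.5 -- looks acceptable
print(statistics.quantiles(latencies_ms, n=100)[94])  # ~3800 -- the p95 the average hides
```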
Error rate. What percentage of requests fail? A baseline of 0.1% errors is normal. A jump to 2% means something broke. Track this by endpoint, because a healthy homepage and a broken checkout look identical when you average them.
Traffic. How many requests are you handling? This isn’t about growth metrics. It’s about knowing when traffic drops unexpectedly (your DNS could be down) or spikes unexpectedly (you’re getting attacked, or you just hit the front page of Reddit).
Saturation. How close to capacity are your resources? CPU, memory, disk, database connections. When saturation crosses 80%, start planning capacity. When it crosses 90%, you’re one spike away from downtime.
That’s it. Four numbers. If you track nothing else, track these.
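If your service happens to be in Python, wiring up all four signals with the prometheus_client library looks roughly like this. Treat it as a sketch: the metric names, labels, port, and the simulated request handler are placeholders for your own.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Latency: a histogram lets you query p95 later in Prometheus with
#   histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["endpoint"]
)
# Traffic and error rate: one counter labeled by endpoint and status code
# covers both (rate of all requests vs. rate of status="500").
REQUESTS = Counter("http_requests_total", "Total requests", ["endpoint", "status"])
# Saturation: for example, how much of the DB connection pool is in use.
DB_POOL_IN_USE = Gauge("db_pool_connections_in_use", "Open DB connections")

def handle_request(endpoint: str) -> None:
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.2))         # stand-in for real work
    status = "500" if random.random() < 0.01 else "200"
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)
    REQUESTS.labels(endpoint=endpoint, status=status).inc()
    DB_POOL_IN_USE.set(random.randint(0, 20))     # would come from your real pool

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```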
The Tools: Free, Open Source, Battle-Tested
Prometheus collects metrics. It’s a time-series database that scrapes numbers from your applications at regular intervals. Free. Open source. Used by everyone from startups to Fortune 500 companies.
Grafana visualizes those metrics. Dashboards with graphs, gauges, and tables. Also free and open source. Connect it to Prometheus and you have a monitoring stack.
For logs, Loki (from the Grafana team) stores and queries log data. Or use the ELK stack (Elasticsearch, Logstash, Kibana) if you need more power. Both free for self-hosting.
OpenTelemetry is the emerging standard for instrumentation. It lets you collect metrics, logs, and traces with a single set of libraries. More tools are adopting it every month, which means less vendor lock-in.
The total cost of this stack, self-hosted: zero for software, plus whatever your server costs. A basic monitoring server on Hetzner runs EUR 10-20/month.
What to Alert On (And What to Ignore)
The worst monitoring setup is one that sends so many alerts nobody reads them. Alert fatigue is real. Your team will start ignoring everything if everything triggers a notification.
Alert on symptoms, not causes. “Error rate above 1% for 5 minutes” is a good alert. “CPU above 70%” is a bad alert (CPU can spike temporarily during normal operation).
Use severity levels. Critical: pages someone at 3 AM (production is down). Warning: sends a Slack message during business hours (something needs attention but isn’t urgent). Info: logged in a dashboard for weekly review.
A good starting set of alerts:
Error rate above 1% for 5 minutes. Something is broken.
P95 latency above 2 seconds for 10 minutes. Users are having a bad experience.
Disk usage above 85%. You’ll run out of space soon.
No requests received in 5 minutes. Your application is probably down, or DNS is broken.
SSL certificate expires within 14 days. You don’t want to learn about this after it expires.
That’s five alerts. Start there. Add more only when you’ve had an incident that these five wouldn’t have caught.
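Most of these fall straight out of the golden-signal metrics. The certificate check is the odd one out, and it’s easy to script if nothing in your stack covers it. A minimal sketch using only Python’s standard library (the hostname is a placeholder):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(hostname: str, port: int = 443) -> float:
    """Connect over TLS and return how many days the certificate has left."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # cert["notAfter"] looks like "Jun  1 12:00:00 2026 GMT".
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    remaining = days_until_cert_expiry("example.com")
    if remaining < 14:
        print(f"ALERT: certificate expires in {remaining:.0f} days")
```

Run it daily from cron and route the output to Slack, and that’s one alert handled.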
Structured Logging: The Difference Between Minutes and Hours of Debugging
When something breaks, logs are where you find the answer. But only if the logs are useful.
Structured logging means your log entries are JSON objects with consistent fields, not free-form text strings. Instead of "User 123 failed to checkout", you log {"event": "checkout_failed", "user_id": 123, "error": "payment_declined", "cart_value": 89.50}.
The difference matters when you’re searching through millions of log entries at 2 AM. You can filter by user_id, or by error type, or by cart value. Try that with free-form text.
Every log entry should include: timestamp, severity level, request ID (for tracing), and relevant context. Make it a standard. Enforce it in code review.
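You don’t need a logging framework to start. Here’s a minimal sketch using Python’s standard logging module; the field names mirror the checkout example above, so adapt them to whatever standard you enforce:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured context passed via logging's `extra` argument.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits: {"timestamp": "...", "level": "ERROR", "event": "checkout_failed", ...}
log.error("checkout_failed", extra={"context": {
    "request_id": str(uuid.uuid4()),
    "user_id": 123,
    "error": "payment_declined",
    "cart_value": 89.50,
}})
```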
Uptime Monitoring: The Cheapest Insurance
Before you set up Prometheus and Grafana, set up uptime monitoring. It takes 5 minutes and catches the most basic failures.
Services like UptimeRobot (free for 50 monitors), Better Stack, or Checkly ping your application from external locations. If it doesn’t respond, you get a notification. This catches DNS failures, hosting outages, and certificate problems that internal monitoring would miss.
Set up checks for your homepage, your API health endpoint, and your most critical user-facing pages. Get alerts via Slack or SMS. Done.
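If you want to see what these services do under the hood, or need a stopgap before signing up, the core of an uptime check is a few lines of Python. Run it from a machine outside your own infrastructure, though; probing from the same box defeats the purpose. The URLs below are placeholders:

```python
import urllib.request

CHECKS = [
    "https://example.com/",            # homepage
    "https://example.com/api/health",  # API health endpoint
]

def is_up(url: str, timeout: float = 10.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False  # DNS failure, timeout, TLS error, or HTTP 4xx/5xx

for url in CHECKS:
    if not is_up(url):
        print(f"DOWN: {url}")  # replace with a Slack webhook or SMS call
```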
When to Level Up
The basic monitoring stack covers you until you have multiple services, complex user journeys, or strict SLA requirements. Once you need to answer “why was this specific user’s request slow?”, you need distributed tracing.
Tools like Jaeger or Grafana Tempo trace a request across multiple services. You can see that the API call took 200ms, but the database query inside it took 180ms. That’s where your bottleneck lives.
But don’t start here. Traces are noisy and expensive to store. Get the basics right first. Add tracing when the four golden signals tell you something is slow but don’t tell you where.
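When you do get there, instrumentation is less work than it sounds. A sketch with OpenTelemetry’s Python SDK, exporting spans to the console (in production you’d swap in an OTLP exporter pointed at Jaeger or Tempo; the service and span names here are placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for the sketch; use an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("user.id", 123)
    with tracer.start_as_current_span("db_query"):
        pass  # the 180ms query would show up as a nested span here
```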
For how monitoring fits into your broader DevOps setup, read our guide. And if you’re modernizing legacy systems alongside building monitoring, our digital transformation playbook covers the full picture.
Need help setting up monitoring for your applications? Let’s build the right observability stack for your team. We’ll set up the dashboards, configure the alerts, and make sure you find problems before your customers do.