Observability Ops
This page is the baseline for operational monitoring of the API, queues, and core business processes.
SLO summary (starter)
| Domain | Target | Alert threshold |
|---|---|---|
| API availability | >= 99.5% | < 99.0% over a rolling 1-hour window |
| P95 latency, critical endpoints | < 800 ms | > 1200 ms for 10 minutes |
| 5xx error rate | < 1% | > 3% for 5 minutes |
| Queue failure rate | < 0.5% | > 2% for 10 minutes |
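The alert thresholds in the table can be expressed as data so that one function decides which alerts fire. The following is a minimal sketch, not the project's actual alerting config: the metric keys (`api_availability_pct` and so on), `AlertRule`, and `firing_alerts` are hypothetical names, and each measurement is assumed to be already aggregated over the window stated in the table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    window_minutes: int                 # aggregation window from the table
    breached: Callable[[float], bool]   # True when the alert threshold is crossed


# Thresholds copied from the SLO table; the dictionary keys are hypothetical.
RULES = {
    "api_availability_pct": AlertRule("API availability", 60, lambda v: v < 99.0),
    "p95_latency_ms": AlertRule("P95 latency (critical endpoints)", 10, lambda v: v > 1200),
    "error_rate_5xx_pct": AlertRule("5xx error rate", 5, lambda v: v > 3.0),
    "queue_failure_rate_pct": AlertRule("Queue failure rate", 10, lambda v: v > 2.0),
}


def firing_alerts(measurements: dict[str, float]) -> list[str]:
    """Return the names of rules whose alert threshold is breached.

    Each value is assumed to be pre-aggregated over the rule's window.
    Metrics that are missing from `measurements` are skipped, not alerted on.
    """
    return [
        rule.name
        for key, rule in RULES.items()
        if key in measurements and rule.breached(measurements[key])
    ]
```

Note the gap between target and alert threshold: an availability of 99.2% misses the 99.5% target but does not page anyone, which keeps alerts reserved for clear breaches.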
Required signals
| Layer | Metric/Signal | Purpose |
|---|---|---|
| HTTP/API | request count, latency p50/p95, 4xx/5xx | detect API degradation |
| Auth | login fail rate, unauthorized spike | detect auth misconfiguration/abuse |
| Queue/Job | pending jobs, failed jobs, retry count | keep the async pipeline healthy |
| DB | slow query count, connection saturation | prevent DB bottlenecks |
| Business | reservation created/completed, coupon claimed, active members | validate business health, not just technical health |
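To make the HTTP/API row concrete, the sketch below derives request count, p50/p95 latency, and 4xx/5xx rates from raw request samples. It is illustrative only: in practice these numbers usually come from the metrics backend, and the `(status_code, latency_ms)` sample shape and both function names are assumptions. The percentile uses the nearest-rank method, one of several common definitions.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; `pct` is expected in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def http_summary(samples: list[tuple[int, float]]) -> dict[str, float]:
    """Summarise (status_code, latency_ms) samples into the HTTP/API signals."""
    latencies = [latency for _, latency in samples]
    total = len(samples)
    count_4xx = sum(1 for status, _ in samples if 400 <= status < 500)
    count_5xx = sum(1 for status, _ in samples if 500 <= status < 600)
    return {
        "request_count": total,
        "latency_p50_ms": percentile(latencies, 50),  # raises on empty input
        "latency_p95_ms": percentile(latencies, 95),
        "rate_4xx_pct": 100 * count_4xx / total,
        "rate_5xx_pct": 100 * count_5xx / total,
    }
```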
Triage runbook: first 15 minutes
| Minute | Focus | Output |
|---|---|---|
| 0–5 | confirm the alert is valid (not a false positive) | severity + affected area |
| 5–10 | isolate the failing layer (API/DB/Queue/integration) | primary hypothesis |
| 10–15 | quick mitigation (rollback, throttle, restart workers) | initial stabilization |
Frequently used operational queries/checks
# docs build health check
cd docs-site/docusaurus && npm run build
# basic health endpoint check
curl -i "http://localhost/ping"
# example application log check (adjust the path)
tail -n 200 storage/logs/laravel.log
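When the last 200 log lines are too noisy to read by eye, a per-level tally gives a quicker first impression. The sketch below assumes the log uses a `[timestamp] environment.LEVEL: message` line header, which is the default Laravel line format as far as I know; if the project customizes its log format, the regular expression needs adjusting. `count_levels` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter, deque
from typing import Iterable

# Assumed header shape: "[2024-05-01 10:00:00] production.ERROR: message"
LINE_RE = re.compile(r"^\[[^\]]+\]\s+\w+\.(?P<level>[A-Z]+):")


def count_levels(lines: Iterable[str], last_n: int = 200) -> Counter:
    """Tally log levels over the last `last_n` lines (like `tail -n` plus a count)."""
    tail = deque(lines, maxlen=last_n)  # keeps only the final `last_n` lines
    counts: Counter = Counter()
    for line in tail:
        match = LINE_RE.match(line)
        if match:  # stack-trace continuation lines have no header and are skipped
            counts[match.group("level")] += 1
    return counts
```

A possible usage, with the same path caveat as the `tail` command above: `with open("storage/logs/laravel.log") as f: print(count_levels(f))`.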
Alert routing
| Severity | Response target | First action |
|---|---|---|
| Sev-1 (total outage) | ≤ 5 minutes | announce the incident, mitigate immediately |
| Sev-2 (severe degradation) | ≤ 15 minutes | triage + workaround |
| Sev-3 (minor/isolated) | ≤ 60 minutes | schedule a planned fix in the backlog |
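The response targets can be checked mechanically, for example in a post-incident review. This is a small sketch under stated assumptions: the severity keys (`sev1`, `sev2`, `sev3`) and both function names are hypothetical, and the "≤" in the table is treated as inclusive.

```python
from datetime import datetime, timedelta

# Response targets from the routing table; the keys are hypothetical labels.
RESPONSE_TARGET = {
    "sev1": timedelta(minutes=5),
    "sev2": timedelta(minutes=15),
    "sev3": timedelta(minutes=60),
}


def response_deadline(severity: str, alerted_at: datetime) -> datetime:
    """Latest acknowledgement time that still meets the response target."""
    return alerted_at + RESPONSE_TARGET[severity]


def responded_in_time(severity: str, alerted_at: datetime, acknowledged_at: datetime) -> bool:
    """True when the acknowledgement landed on or before the deadline."""
    return acknowledged_at <= response_deadline(severity, alerted_at)
```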
Incident markers that must be recorded
- trace_id of a sample failed/successful request.
- time range of the incident (start-end).
- affected endpoint/feature.
- last change before the incident (release/migration/config).
- mitigation decision + its outcome.
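One way to make sure none of the markers listed above is forgotten is to capture them in a small record type and check for blanks before the incident is closed. The field names and `missing_markers` below are hypothetical; this is a sketch of the idea, not an existing incident tool.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentRecord:
    # One field per marker in the list above; all names are illustrative.
    trace_id: str = ""      # sample failed/successful request
    started_at: str = ""    # start of the incident time range
    ended_at: str = ""      # end of the incident time range
    affected: str = ""      # affected endpoint/feature
    last_change: str = ""   # release/migration/config before the incident
    mitigation: str = ""    # mitigation decision and its outcome


def missing_markers(record: IncidentRecord) -> list[str]:
    """Return the names of markers that are still empty or whitespace-only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]
```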
See also: