Taking Cloud Logging from Good to Better to Best

It's hard to imagine successful software development today without DevOps. But DevOps has become increasingly complex, burdened by a zoo of tools and process gaps (think "silent tech debt"). At CloudGeometry, we have years of experience working through DevOps challenges across hundreds of client engagements, delivering high-quality releases into production – all day, every day.

Today, the market has begun to use "Platform Engineering" to refer to more robust approaches to DevOps. Yet even before the term was in common use, we were already taking a rigorous, repeatable approach to resolving DevOps problems by leaning into proven open-source technologies. We call this integrated toolchain CGDevX. It's our reference implementation for Platform Engineering. You'll find our ongoing efforts in our repo at GitHub.

Having enough data can make good DevOps better. The difference between good, better and best? Logging. When you set up staged environments from development through to production, logging within and across components, workloads, and infrastructure services is almost guaranteed to save precious time for every stakeholder in the SDLC. From root-cause analysis to automation to optimizing designs and algorithms, platform engineering paves a path to better observability – built into your development and delivery workflow with CGDevX.

Video Highlights

In this video, we demonstrate how to align workload metrics with event data instrumented into applications. Using open-source tools like Loki and Grafana, we show how to set up better log management, data querying, and visualization while addressing the challenge of sensitive data obfuscation. Here's a breakdown of the key features covered in the video:

  • Introduction to Log Management: We introduce our log management solution, powered by Loki, which efficiently collects log data from Kubernetes clusters. We use Promtail to gather logs from all pods on each node, and we note that other clients such as Fluentd or Logstash can be integrated as well.
  • Log Collection and Discovery: We discuss how we collect critical logs from Kubernetes system components, core platform services, and workloads. Our approach includes automatic discovery and scraping of log data, akin to our Prometheus setup, with filters and labels to manage log collection efficiently.
  • Demo Application and Data Arrival: To demonstrate the value of this pattern, we show you our minimalistic demo app, implemented in Node.js, as it triggers API calls that generate logs. We showcase how these logs are collected in different environments like staging and production, while logs in the development environment are excluded.
  • Visualization and Interface Integration: We utilize Grafana for querying and visualizing log data, offering a unified interface for both metrics and logs. This integration allows for powerful debugging and easy isolation of specific nodes, pods, and time frames.
  • Access Management and Dashboard Customization: Our setup includes OIDC powered by Vault for secure access management in Grafana, with shared RBAC configurations. We provide various dashboards for different Kubernetes services and workloads, which can be customized or integrated with specific monitoring views.
  • Obfuscation of Sensitive Data: We highlight the importance of obfuscating sensitive data such as PII in logs, employing a set of rules configured as code. This supports compliance with regulations like GDPR, CCPA, or HIPAA.
  • Alerting Based on Log Data: We demonstrate how to configure alerts based on log data using Loki and Grafana, similar to metric-based alerts. This includes setting up notifications through various channels like email, Slack, or PagerDuty.
  • Real-Time Monitoring and Alerting: After initiating an application shutdown via an API request, we monitor the log stream and observe the triggered alert in Grafana, followed by a corresponding notification in Slack.
  • Exploring System Metrics and Alerts: We delve into the Promtail Loki dashboard, showcasing system metrics and alerts. This includes error logs, memory, and CPU usage metrics, vital for understanding system health.
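
To make the "rules configured as code" idea from the obfuscation step concrete, here is a minimal Python sketch. It is not the rule format used in the video, which this article does not show; the rule names, patterns, and replacement tokens are all hypothetical and would need tuning for real data.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactionRule:
    """One obfuscation rule: a name, a pattern to find, and its replacement."""
    name: str
    pattern: re.Pattern
    replacement: str


# Hypothetical rule set kept in version control alongside the platform config.
RULES = [
    RedactionRule(
        "email",
        re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "<email>",
    ),
    RedactionRule("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<ssn>"),
    RedactionRule(
        "card_number",
        re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
        "<card>",
    ),
]


def obfuscate(line: str, rules=RULES) -> str:
    """Apply every rule in order and return the redacted log line."""
    for rule in rules:
        line = rule.pattern.sub(rule.replacement, line)
    return line


if __name__ == "__main__":
    raw = "user=jane.doe@example.com ssn=123-45-6789 status=200"
    print(obfuscate(raw))
```

Keeping the rules as plain data means a change to what gets masked can be reviewed like any other code change, and the same list can be exercised in unit tests before it reaches the log pipeline.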

This integrated combination of log management, metrics collection, and visualization raises the value of metrics to a new level, providing a comprehensive and user-friendly approach to monitoring. Custom dashboards, backed by easy alert rule configuration and automated sensitive data obfuscation, simplify troubleshooting and compliance, as well as user and permission management. Taken as a whole, the CGDevX approach not only aids in immediate problem-solving but also informs changes for future releases, enhancing overall system reliability and performance.
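
For readers who want a feel for how a log-based alert rule behaves, the following Python sketch mimics the shutdown scenario from the demo: count matching log lines inside a time window and fire once a threshold is reached. It is a plain-Python stand-in, not Loki's or Grafana's alerting engine, where a rule like this would be written as a query rather than application code. The class name, keyword, threshold, and window are assumptions chosen for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


class LogAlertRule:
    """Fires when at least `threshold` matching lines fall inside `window`."""

    def __init__(self, keyword: str, threshold: int, window: timedelta):
        self.keyword = keyword
        self.threshold = threshold
        self.window = window
        self._hits = deque()  # timestamps of matching lines, oldest first

    def observe(self, timestamp: datetime, line: str) -> bool:
        """Record one log line and report whether the rule is firing."""
        if self.keyword in line:
            self._hits.append(timestamp)
        # Drop matches that have aged out of the evaluation window.
        cutoff = timestamp - self.window
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        return len(self._hits) >= self.threshold


if __name__ == "__main__":
    rule = LogAlertRule("shutdown", threshold=1, window=timedelta(minutes=1))
    start = datetime(2024, 1, 1, 12, 0, 0)
    stream = [
        (start, "GET /health 200"),
        (start + timedelta(seconds=10), "received shutdown request"),
        (start + timedelta(minutes=2), "GET /health 200"),
    ]
    for ts, line in stream:
        state = "FIRING" if rule.observe(ts, line) else "ok"
        print(ts.isoformat(), state, line)
```

In a real deployment the firing state would be routed to a notification channel such as Slack; here it is only printed so the window logic is easy to follow.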