
The objective of this project is to design and implement a cloud-based monitoring and observability platform for distributed systems. The system collects metrics, logs, and traces to provide real-time insights, helping students understand system health monitoring, performance analysis, and observability principles in cloud environments.
Study distributed system architecture and the challenges of monitoring performance and reliability.
Analyze key observability concepts: metrics, logs, traces, and alerting mechanisms.
Prepare a Software Requirements Specification (SRS) and observability workflow documentation.
Design the system architecture, including monitoring agents, data collection services, and a dashboard interface.
Create a database schema for system metrics, logs, trace data, alerts, and user roles.
Implement secure user authentication and role-based access control.
Develop monitoring agents to collect CPU, memory, network, and application-level metrics.
Implement log collection and parsing modules for distributed components.
Integrate distributed tracing for multi-service transaction analysis (simulated tracing for BCA; full tracing with Jaeger or Zipkin for MCA).
Store the collected data in a time-series database or log storage system.
Implement an alerting system to notify users of anomalies or threshold breaches.
Develop dashboards to visualize system health, performance trends, and incident history.
Maintain audit logs for monitoring and alerting activities.
Perform unit testing, integration testing, and load simulation for the distributed system.
Validate observability workflows by simulating node failures and performance degradation.
Prepare documentation, including ER diagrams, architecture diagrams, monitoring workflows, and test cases.
Deploy the system locally or in a simulated cloud environment for demonstration.
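
As an illustration of the log collection and parsing task listed above, the following is a minimal sketch only. It assumes a line format of "<ISO timestamp> <LEVEL> <service> <message>"; that format, the `LogRecord` type, and the `parse_line` function are hypothetical and not prescribed by the project.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed line format (not from the project brief):
# "<ISO timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


@dataclass
class LogRecord:
    timestamp: datetime
    level: str
    service: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None if it does not match the assumed format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.fromisoformat(match["ts"])
    except ValueError:
        # The first field is not a valid ISO timestamp, so skip the line.
        return None
    return LogRecord(ts, match["level"], match["service"], match["message"])


record = parse_line("2024-01-15T10:30:00 ERROR auth-service login failed for user 42")
```

Returning `None` for unparseable lines, rather than raising, lets a collector count and skip malformed input without stopping. A real deployment would likely need one pattern per component, since distributed services rarely share a single log format.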
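
The threshold-breach part of the alerting task listed above could be prototyped along the following lines. This is a sketch under stated assumptions: `ThresholdRule`, `evaluate`, the metric names, and the limit values are all hypothetical, and anomaly detection beyond fixed thresholds is not covered.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    metric: str    # hypothetical metric name, e.g. "cpu_percent"
    limit: float   # alert when the sampled value exceeds this limit
    severity: str  # free-form label such as "warning" or "critical"


def evaluate(rules: List[ThresholdRule], sample: Dict[str, float]) -> List[str]:
    """Return one alert message per rule whose metric exceeds its limit."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are ignored rather than treated as breaches.
        if value is not None and value > rule.limit:
            alerts.append(
                f"[{rule.severity}] {rule.metric}={value} exceeds {rule.limit}"
            )
    return alerts


# Example rules and sample; the values are illustrative only.
rules = [
    ThresholdRule("cpu_percent", 90.0, "critical"),
    ThresholdRule("memory_percent", 80.0, "warning"),
]
alerts = evaluate(rules, {"cpu_percent": 95.5, "memory_percent": 42.0})
```

Keeping the rules as plain data makes it straightforward to load them from the database later and to unit test the evaluation separately from notification delivery.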