How We Implemented Monitoring for a Small On-Premises Environment – Arnold

For many businesses, IT issues go unnoticed until they start affecting end users and the IT support inbox blows up. This was exactly the case for one of our clients, who had a small on-premises footprint of around 12 servers but no monitoring or observability in place. A server could run out of disk space, a service could fail to start, or a server could sit through long-running CPU spikes, and no one would know until users began reporting issues. This reactive approach was causing downtime and frustration.

We stepped in to change that by implementing a monitoring and observability system built on Zabbix, integrated with Microsoft Teams webhooks for real-time alerts.

The Problem: Zero Observability

Our client’s IT environment faced several challenges:

No proactive monitoring: There was no way to detect potential issues like low disk space or high CPU usage before they became critical.
End-user frustration: Problems were only discovered when employees reported them, leading to delays in resolution.
Operational inefficiency: Remote IT staff had to scramble to diagnose and fix issues without any data or alerts to guide them.

This lack of visibility was not only affecting productivity but also putting business operations at risk.

The Solution: Proactive Monitoring with Zabbix

To address these challenges, we implemented Zabbix, an open-source monitoring solution, tailored to their environment. Here’s what we did:

Automated certain responses, such as clearing temporary files when disk space thresholds were reached.
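As a rough illustration of what such an automated response can look like, here is a minimal Python sketch. It is not the script used in this engagement: the function names, the 7-day age limit, and the reuse of a 15% free-space threshold are all assumptions made for the example.

```python
import shutil
import time
from pathlib import Path


def free_percent(path: str) -> float:
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def clear_old_temp_files(temp_dir: str, max_age_days: float = 7) -> int:
    """Delete regular files under `temp_dir` older than `max_age_days`.

    Returns the number of files removed. The 7-day default is an assumption.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for entry in list(Path(temp_dir).rglob("*")):
        try:
            if entry.is_file() and not entry.is_symlink() and entry.stat().st_mtime < cutoff:
                entry.unlink()
                removed += 1
        except OSError:
            # File in use or permission denied: skip it rather than abort the run.
            continue
    return removed


def cleanup_if_low(volume: str, temp_dir: str, min_free_percent: float = 15.0) -> int:
    """Run the cleanup only when free space on `volume` is below the threshold."""
    if free_percent(volume) >= min_free_percent:
        return 0
    return clear_old_temp_files(temp_dir)
```

A monitoring system would typically call something like `cleanup_if_low` as a remote action when the disk-space alert fires, so the cleanup runs only when it is needed and only touches files past the age limit.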

System Setup and Configuration:

Deployed Zabbix to monitor all on-premises servers.
Configured key metrics such as disk usage, service health, CPU utilization, memory consumption, and network performance.
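To make "service health" concrete, the sketch below shows one simple form such a check can take: testing whether a TCP port accepts connections. This only illustrates the idea in Python. Zabbix has its own agent-side checks, and the host and port in the usage line are hypothetical.

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical target; replace with a real server and port.
    print(tcp_service_up("127.0.0.1", 8080, timeout=1.0))
```

A port check only proves that something is listening. A service can accept connections and still be unhealthy, so checks like this are usually paired with process or application-level checks.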

Custom Alerting System:

Set up thresholds for critical metrics (e.g., free disk space below 15%, CPU usage above 90% for extended periods).
Integrated Zabbix with Microsoft Teams to send real-time alerts directly to a dedicated IT Teams channel.
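The threshold logic and the Teams notification can be sketched in a few lines of Python. In Zabbix itself these conditions are expressed as trigger expressions and the notification goes through a media type, so the code below only illustrates the logic. The five-sample window for "extended periods", the function names, and the simple `{"text": ...}` payload are assumptions. The payload shape a Teams webhook accepts depends on how the webhook was created, so check it against your own setup.

```python
import json
import urllib.request
from typing import Sequence

DISK_FREE_MIN_PERCENT = 15.0   # alert when free space drops below this
CPU_MAX_PERCENT = 90.0         # alert when CPU stays above this
CPU_SUSTAINED_SAMPLES = 5      # assumption: 5 consecutive samples = "extended"


def disk_alert(free_percent: float) -> bool:
    """True when free disk space is below the threshold."""
    return free_percent < DISK_FREE_MIN_PERCENT


def cpu_alert(samples: Sequence[float]) -> bool:
    """True only when the last CPU_SUSTAINED_SAMPLES readings all exceed the limit.

    Requiring a full window of high readings avoids alerting on short spikes.
    """
    recent = samples[-CPU_SUSTAINED_SAMPLES:]
    return len(recent) == CPU_SUSTAINED_SAMPLES and min(recent) > CPU_MAX_PERCENT


def build_teams_payload(host: str, problem: str) -> dict:
    """Build a minimal text-only message body (assumed payload shape)."""
    return {"text": f"**{problem}** on `{host}`"}


def post_to_teams(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON to the webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `cpu_alert([95, 97, 50, 96, 98])` is false because one reading in the window dropped below 90%, while five consecutive readings above 90% would raise the alert. This is the difference between a brief spike and a sustained problem.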

Dashboard and Reporting:

Created a centralized dashboard for IT staff to view the health of all servers at a glance.
Configured periodic reports summarizing system performance and potential risks.

Proactive Maintenance:

Enabled alerts so that IT staff could address issues like disk space running low before they caused downtime.

The Results: Full Visibility and Proactive IT Management

The effect was immediate:

Proactive Issue Resolution: The IT team now receives alerts before issues escalate, allowing them to take action quickly.
Reduced Downtime: End users no longer experience delays caused by undetected server problems.
Operational Efficiency: With centralized monitoring, IT staff can focus on strategic tasks instead of constantly troubleshooting.
Improved Business Continuity: The risk of unexpected outages has been significantly reduced.

For example, within the first week of implementation, the system detected a VoIP service that had not come online. Previously, a problem like this would have gone unnoticed until it caused disruptions, in this case for the sales team answering calls from potential and existing customers. The alert allowed the IT team to resolve the issue before it significantly affected the business.