Data Intelligence and Analytics
Adopting AI and IoT for Predictive Operations
Mar 17, 2023 - Last updated on Apr 01, 2025

Predictive maintenance is one of the most compelling industrial AI use cases — and one of the hardest to actually ship. The gap between a vendor demo and a working system on a production floor is wider than most organisations expect. This engagement closed that gap for a manufacturer that had been exploring the opportunity for two years without moving beyond a proof-of-concept.

The outcome: 34% reduction in unplanned downtime on the pilot production lines. Pilot ROI positive within five months. Full rollout approved to all five facilities.

The Situation

The client operated five manufacturing facilities across two countries, producing precision industrial components for the automotive and aerospace sectors. Equipment uptime was critical: the facilities ran on tight margins, and a single unplanned stoppage on a key production line could cascade into penalty clauses with downstream customers.

The maintenance team estimated unplanned downtime was costing the business approximately $2.1M per year in lost production, emergency parts procurement, and overtime labour. All maintenance was reactive — equipment was repaired after it failed. There was no early-warning system, no sensor data being captured systematically, and no predictive model.

Leadership had identified predictive maintenance as a priority initiative eighteen months earlier. Two previous attempts — one internal, one with a large consulting firm — had produced a proof-of-concept on historical data but never progressed to a live environment. The gap was always the same: the IT and OT (operational technology) worlds weren’t integrated, the data infrastructure didn’t exist, and the change management challenge of getting maintenance technicians to trust and act on an algorithm’s output had never been addressed.

The Challenge

The technical challenges were substantial, but the organisational challenges were equally important.

On the technical side:

  • Equipment on the production floor ran proprietary PLCs (programmable logic controllers) with no standard data interface — each vendor had a different protocol
  • The plant network was air-gapped from corporate IT for security reasons, creating a physical barrier to data flow
  • Eighteen months of historical maintenance logs existed in a mix of paper records and three separate CMMS systems with inconsistent taxonomy
  • The ML team in corporate IT had Python skills but no OT domain knowledge; the maintenance team had deep domain knowledge but no data science exposure

On the organisational side:

  • Maintenance technicians were sceptical — previous technology initiatives had produced tools they didn’t use because the tools didn’t reflect how they actually worked
  • Plant managers were resistant to any system that might generate false alarms, which would disrupt workflow and erode trust in the alert system
  • The initiative had already failed twice; there was visible fatigue around another attempt

Our Approach

Rather than starting with the machine learning model — the instinct of most data teams — Nematix started with the data infrastructure and the people.

Weeks 1–4: Technology assessment and site survey

We conducted on-site assessments at two of the five facilities, mapping every major piece of equipment, its failure history, its maintenance cost, and the feasibility of instrumenting it with sensors. We produced a priority matrix: which machines had the highest failure cost, the most predictable failure signatures, and the clearest path to instrumentation.

Three machines across two production lines were selected for the pilot — chosen because they had the highest downtime cost and the clearest historical failure patterns, not because they were the easiest to instrument.
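The selection logic above can be sketched as a simple weighted score. This is an illustrative reconstruction, not the actual matrix: the machine names, normalised values, and weights are invented, but the weighting reflects the stated priority — failure cost and signature clarity dominate, ease of instrumentation is deliberately a minor factor.

```python
# Hypothetical sketch of the pilot-selection priority matrix.
# Machine names, scores, and weights are illustrative, not client data.

def priority_score(downtime_cost, signature_clarity, instrumentation_ease,
                   w_cost=0.5, w_clarity=0.35, w_ease=0.15):
    """Weighted score: failure cost and signature clarity dominate;
    ease of instrumentation is deliberately a minor factor."""
    return (w_cost * downtime_cost
            + w_clarity * signature_clarity
            + w_ease * instrumentation_ease)

machines = [
    # (name, normalised downtime cost, failure-signature clarity, ease of instrumentation)
    ("CNC lathe A", 0.9, 0.8, 0.3),
    ("Press line B", 0.7, 0.9, 0.4),
    ("Grinder C",   0.4, 0.3, 0.9),  # easiest to instrument, but lowest value
]

ranked = sorted(machines, key=lambda m: priority_score(m[1], m[2], m[3]),
                reverse=True)
```

Under these weights the hard-to-instrument, high-cost machines rank first — which is the point: a scoring scheme that rewarded ease of instrumentation would have reproduced the "easy pilot" trap.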

Weeks 5–12: Edge-to-cloud data pipeline

We designed and deployed an edge computing layer using industrial IoT gateways (Advantech) that could communicate with the existing PLCs using OPC-UA, translate their proprietary data formats, and forward normalised telemetry to Azure IoT Hub via a dedicated, firewalled connection that satisfied the plant security team’s requirements.

Vibration, temperature, current draw, and pressure readings were captured at configurable intervals — initially at 1 Hz, later tuned per machine based on which signals proved predictive.
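The gateway's translation step can be illustrated as a mapping from vendor-specific PLC tags onto one canonical telemetry schema. The tag names and schema fields below are assumptions for illustration; the real mapping was per-vendor and considerably larger.

```python
# Illustrative sketch of the gateway normalisation step: vendor-specific
# tag names are mapped onto a common telemetry schema before forwarding.
# Tag names, units, and the record layout are assumptions, not the real mapping.

TAG_MAP = {
    # vendor PLC tag    -> (canonical name, unit)
    "VIB_RMS_MM_S":  ("vibration", "mm/s"),
    "TEMP_BRG_C":    ("temperature", "degC"),
    "MOT_CURR_A":    ("current_draw", "A"),
    "HYD_PRESS_BAR": ("pressure", "bar"),
}

def normalise(machine_id, raw, timestamp):
    """Translate one raw PLC sample into a canonical telemetry record."""
    readings = {}
    for tag, value in raw.items():
        if tag in TAG_MAP:  # unknown tags are dropped, not guessed at
            name, unit = TAG_MAP[tag]
            readings[name] = {"value": value, "unit": unit}
    return {"machine": machine_id, "ts": timestamp, "readings": readings}

sample = normalise("line2-press",
                   {"VIB_RMS_MM_S": 4.2, "TEMP_BRG_C": 61.0},
                   "2023-06-01T08:00:00Z")
```

Normalising at the edge, before anything crosses the firewall, is what lets the cloud side treat every vendor's equipment identically.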

Weeks 13–20: Data preparation and model development

We combined the three CMMS systems’ historical maintenance records with the new sensor telemetry to build a labelled training dataset — mapping sensor state to subsequent failure events. This required significant data cleaning work; the maintenance taxonomy alone took two weeks to reconcile across systems.
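The core of the labelling step — mapping sensor state to subsequent failure events — can be sketched as follows. The 24-hour prediction horizon and the timestamps are illustrative assumptions, not the values used in the engagement.

```python
# Simplified sketch of the labelling step: a sensor window is marked positive
# if a recorded failure follows it within the prediction horizon.
# The 24-hour horizon and all timestamps below are illustrative assumptions.

from datetime import datetime, timedelta

HORIZON = timedelta(hours=24)

def label_windows(window_ends, failure_times, horizon=HORIZON):
    """Return 1 for each window that precedes a failure within `horizon`, else 0."""
    labels = []
    for end in window_ends:
        positive = any(end < f <= end + horizon for f in failure_times)
        labels.append(1 if positive else 0)
    return labels

windows = [datetime(2023, 6, 1, h) for h in (0, 6, 12, 18)]
failures = [datetime(2023, 6, 2, 4)]   # one failure, 28h after the first window
print(label_windows(windows, failures))  # [0, 1, 1, 1]
```

Getting failure timestamps trustworthy enough for this mapping is exactly why the taxonomy reconciliation across the three CMMS systems took two weeks.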

The anomaly detection model used a combination of LSTM autoencoders (for temporal pattern recognition) and gradient-boosted classifiers (for discrete fault classification). We deliberately chose explainable outputs — not just “anomaly detected” but “vibration signature on Bearing Unit 3 matches pre-failure pattern seen in 7 of 9 historical failures” — because we knew technician trust depended on the model showing its reasoning.
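The "show your reasoning" alert format can be illustrated with a toy example. The real system combined LSTM autoencoders and gradient-boosted classifiers; here a simple reconstruction-error stand-in is used, and the component name, threshold, and signature-match counts are invented.

```python
# Toy illustration of the explainable alert format. A reconstruction-error
# stand-in replaces the real LSTM autoencoder; all values are invented.

def explain_alert(component, error, threshold, matches, total):
    """Build the human-readable explanation attached to each alert,
    or return None when the error is within normal bounds."""
    if error <= threshold:
        return None  # no alert
    return (f"{component}: reconstruction error {error:.2f} exceeds threshold "
            f"{threshold:.2f}; signature matches pre-failure pattern seen in "
            f"{matches} of {total} historical failures")

msg = explain_alert("Bearing Unit 3", error=0.41, threshold=0.20,
                    matches=7, total=9)
```

The design choice is that the explanation is generated with the alert, not reconstructed afterwards — a technician never sees a score without the evidence behind it.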

Weeks 21–26: Pilot deployment and change management

The model was deployed into a maintenance dashboard built specifically for the technicians’ workflow — not a generic BI tool. Alerts were tiered by confidence and lead time, and each alert included the sensor readings that triggered it and the recommended inspection action.
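The tiering logic might look something like the sketch below. The tier names and the confidence and lead-time cutoffs are illustrative assumptions; the actual thresholds were tuned with the maintenance team.

```python
# Hedged sketch of alert tiering by confidence and lead time.
# Tier names and cutoffs are illustrative assumptions, not the deployed values.

def tier_alert(confidence, lead_time_hours):
    """Map an alert to a workflow tier the technicians can triage."""
    if confidence >= 0.9 and lead_time_hours <= 24:
        return "act-now"    # high confidence, failure is imminent
    if confidence >= 0.7:
        return "schedule"   # plan an inspection in the next shift window
    return "watch"          # log and monitor; no workflow interruption
```

Tiering is what kept low-confidence predictions from interrupting the floor — the failure mode the plant managers feared most.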

We ran a parallel operation period for eight weeks: the model generated alerts, technicians reviewed them, and we tracked whether they proved accurate. This built trust incrementally. When technicians started acting on alerts — and found the problems the model predicted — adoption became self-reinforcing.
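The bookkeeping for that parallel period reduces to tracking, per alert, whether a technician confirmed the predicted fault, and checking the running false-alert rate against the agreed cap. The 15% cap appears in the outcome table; the alert outcomes below are invented for illustration.

```python
# Sketch of parallel-run bookkeeping: each alert is resolved as confirmed or
# false, and the running false-alert rate is checked against the agreed cap.
# The outcomes below are invented; the 15% cap is from the pilot criteria.

def false_alert_rate(outcomes):
    """outcomes: list of booleans, True = technician confirmed the fault."""
    if not outcomes:
        return 0.0
    return sum(1 for confirmed in outcomes if not confirmed) / len(outcomes)

outcomes = [True] * 23 + [False] * 2   # 25 alerts, 2 false
rate = false_alert_rate(outcomes)      # 0.08
within_threshold = rate <= 0.15
```

Publishing this number to the technicians weekly, rather than keeping it as an internal metric, was part of what made trust-building incremental rather than a leap of faith.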

Outcome

At the end of the six-month pilot covering three machines across two production lines:

Metric                             Before pilot     After pilot
Unplanned downtime (pilot lines)   Baseline         −34%
Maintenance cost (pilot lines)     Baseline         −22%
False alert rate                   n/a              8% (well below the 15% threshold)
Mean time to detect anomaly        Post-failure     18 hours before failure (avg.)
Pilot ROI payback period           n/a              5 months

The $2.1M annual downtime cost was projected to reduce by approximately $680k per year across the pilot lines alone. The board approved full rollout to all five facilities, including retroactive instrumentation of the equipment not included in the pilot.

The maintenance team — initially the most sceptical stakeholder group — became the model’s strongest advocates. Two senior technicians were involved in validating alert accuracy and took visible ownership of the rollout internally.

Key Takeaways

Data infrastructure before models. The reason the previous two attempts stalled was that both started with the model before the data existed. A machine learning model is only as good as the data feeding it. Building the edge-to-cloud pipeline first — even though it produced no predictions — was the critical investment.

Explainability is a feature, not an afterthought. In industrial settings, a black-box alert system will never be trusted by the people who have to act on it. Designing the model to surface the specific sensor readings behind each alert was essential to technician adoption.

Pilot selection determines pilot success. We chose machines with high failure cost and clear historical failure signatures — not the easiest machines to instrument. A pilot that proves the concept on the easiest cases doesn’t build organisational confidence in the harder ones.

Change management is half the work. The eight-week parallel operation period — running the model alongside existing processes before replacing them — was the difference between a tool that gets deployed and a tool that gets used.


This engagement draws on our Data Intelligence & Analytics service. If your organisation is exploring AI or IoT adoption, start with a conversation.