February 2026

Team Spotlight: Traiano Welcome on running AI at scale

Breaking down the architectural decisions and design lessons behind building AI platforms that hold up in the real world.

We talked to Traiano Welcome, one of Firmus’s Senior Solutions Architects, to explore what production-ready AI infrastructure looks like, and what it takes to build platforms that are as reliable as they are capable. Learn more about:

  • why data movement matters more than GPU specs,
  • why automation is a prerequisite rather than a nice-to-have,
  • and what it takes to design systems that hold up under real-world pressure.
1. What lessons from operating large-scale systems most shape how you design AI platforms at Firmus today?

The biggest lesson is that systems fail at the boundaries, not the core. At scale, issues rarely come from a single component. They emerge from how compute, networking, storage, security, and operations interact under load. That experience drives me to design AI platforms holistically, with equal attention to performance, observability, failure modes, and day-2 operations, not just raw GPU capability.

2. How has your background in cloud operations and SRE influenced your approach to designing production AI infrastructure?

My SRE background shapes how we design AI infrastructure, particularly when designing systems that will be operated under pressure. That translates into clear operational ownership, strong automation, predictable recovery paths, and measurable reliability. Production AI systems aren’t experiments; they’re critical platforms, so we focus heavily on repeatability, monitoring, and safe change management from day one.

3. What does the Firmus AI Factory model change about how AI infrastructure is designed compared to traditional HPC or cloud deployments?

The AI Factory model shifts the focus from isolated clusters to end-to-end AI production systems. Instead of designing for peak theoretical performance, we design for throughput, utilisation, security, and lifecycle efficiency, from onboarding customers to scaling inference and managing costs over time. It’s infrastructure built for continuous delivery, not one-off jobs.

4. When customers move from AI pilots to production, what digital infrastructure gaps do you most commonly see?

The most common gaps are around networking, automation, and operational readiness. Pilots often work despite inefficient data paths, manual workflows, or limited monitoring. At production scale, those shortcuts quickly become bottlenecks. Customers also underestimate how important governance, cost controls, and security isolation become once real users and data are involved.

5. Which architectural decisions have the biggest impact on performance and cost in GPU-accelerated AI environments?

GPU selection matters, but data movement matters more. Network topology, storage latency, and GPU-to-GPU communication often have a bigger impact on real performance than raw compute specs. On the cost side, utilisation is everything. Architectures that enable sharing, automation, and right-sizing consistently outperform “maximum power” designs that sit idle.
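The utilisation point can be made concrete with a bit of arithmetic: what matters is not the sticker price per GPU-hour but the cost per hour of useful work. A minimal sketch, with purely hypothetical prices and utilisation figures (not Firmus data):

```python
# Illustrative only: effective cost per *useful* GPU-hour as a function of
# utilisation. All numbers below are hypothetical, not vendor or Firmus data.

def effective_cost_per_useful_hour(hourly_price: float, utilisation: float) -> float:
    """Cost of one hour of actual work when the GPU is busy only
    `utilisation` fraction of the time (0 < utilisation <= 1)."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_price / utilisation

# A cheaper-looking cluster that sits idle can cost more per unit of work:
shared = effective_cost_per_useful_hour(4.00, 0.85)  # well-utilised, shared design
idle = effective_cost_per_useful_hour(3.00, 0.40)    # "maximum power" design, mostly idle
# shared ≈ 4.71 per useful hour; idle = 7.50 per useful hour
```

Under these assumed numbers, the nominally cheaper but idle design costs roughly 60% more per unit of delivered work, which is the sense in which sharing, automation, and right-sizing outperform peak-spec builds.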

6. How does designing within an AI Factory model influence choices around GPUs, networking, and cluster topology?

It pushes us toward balanced, purpose-built designs. GPU choice is aligned to workload type, networking is designed to remove contention rather than just maximise bandwidth, and cluster topology reflects how customers actually train and serve models. The goal is predictable, repeatable performance, not bespoke engineering per customer.

7. What role do automation and infrastructure-as-code play in making AI infrastructure reliable at scale?

They’re foundational. Automation and infrastructure-as-code turn complex systems into manageable products. They reduce human error, enable faster recovery, and make scaling repeatable. Without them, reliability doesn’t scale — it degrades.

At AI Factory scale, automation isn’t an efficiency gain; it’s a prerequisite.
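The property that makes infrastructure-as-code safe to re-run is the declarative reconcile pattern used by tools like Terraform and Kubernetes: desired state is data, and a reconcile step computes only the diff against actual state, so applying it twice is harmless. A minimal sketch of that pattern (the node names and specs are invented for illustration, and this is not Firmus tooling):

```python
# Minimal sketch of the declarative, idempotent pattern behind
# infrastructure-as-code. Desired state is plain data; reconcile() computes
# the plan needed to converge actual state to it. Names are hypothetical.

from typing import Dict

def reconcile(desired: Dict[str, dict], actual: Dict[str, dict]) -> Dict[str, list]:
    """Return the plan (create/update/delete) to move `actual` to `desired`."""
    plan = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in actual:
            plan["create"].append(name)
        elif actual[name] != spec:
            plan["update"].append(name)
    for name in actual:
        if name not in desired:
            plan["delete"].append(name)
    return plan

desired = {"gpu-node-a": {"gpus": 8}, "gpu-node-b": {"gpus": 8}}
actual = {"gpu-node-a": {"gpus": 4}, "gpu-node-c": {"gpus": 8}}
plan = reconcile(desired, actual)
# plan == {"create": ["gpu-node-b"], "update": ["gpu-node-a"], "delete": ["gpu-node-c"]}

# Idempotency: once actual state matches desired state, the plan is empty,
# so re-running the automation cannot cause drift or damage.
assert reconcile(desired, desired) == {"create": [], "update": [], "delete": []}
```

Because the plan is derived rather than hand-written, recovery after a failure is the same operation as day-to-day change: re-run the reconcile and converge.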

8. What aspects of AI infrastructure design have become more important as models and workloads have grown in size and complexity?

Interconnect performance, memory hierarchy, and operational visibility have become far more critical. As models grow, small inefficiencies multiply quickly. At the same time, customers need clearer insight into performance, cost, and resource usage. Transparency and observability are now just as important as raw capability.

9. Having worked in large enterprises, what has surprised you most about building AI platforms at Firmus?

Speed. Firmus combines deep technical rigour with the ability to move decisively. Decisions are grounded in engineering reality, but there’s far less friction between design, execution, and customer delivery. That balance is rare, especially at the scale we’re operating.

10. As AI infrastructure continues to evolve rapidly, what skills or mindsets do you believe are essential for solutions architects working on AI Factory-scale systems?

Systems thinking is critical: understanding how components interact, not just how they perform individually. Equally important is operational empathy, designing with the people who will run, support, and scale the platform in mind. Finally, adaptability matters: the technology will change, but strong architectural fundamentals endure.