Datacenter Resiliency Architect - New College Grad 2025
NVIDIA
On-site 🏢 19 May
Tech & IT Services
California, United States 🇺🇸
Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how we can make a lasting impact on the world.
We are now seeking a Resiliency Architect to support the development and validation of GPU (graphics processing unit) hardware and software resiliency features. In this role, you will be a key member of a team of innovators, challenging the status quo and pushing beyond boundaries. You will have the opportunity to impact the industry's leading Datacenter GPUs and SoCs powering product lines for the growing field of artificial intelligence (AI) and high-performance computing (HPC).
What you'll be doing:
- Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter.
- Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements.
- Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements.
- Develop and implement comprehensive architecture verification testplans for resiliency features.
- Execute Architecture Testplan by developing test content, working with Software and Architecture teams to enable, run, and debug tests on Architecture models. Support test debug on RTL, emulation, and silicon.
- Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches.
- Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues.
- Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments.
What we need to see:
- Pursuing or recently completed a Master's or PhD degree in Computer Engineering, Electrical Engineering or closely related degree or equivalent experience.
- Familiarity with GPU and Networking Architectures, Computer Architecture basics (including caches, coherence, buses, direct memory access, etc.); Machine Learning/Deep Learning concepts.
- Proficiency in RAS concepts and in developing Architecture models.
- Scripting and automation with Python or similar.
- Proficiency in C/C++.
- Excellent interpersonal skills and ability to collaborate with on-site and remote teams.
- Strong debugging and analytical skills.
- Be self-driven and results oriented.
Ways to stand out from the crowd:
- Experience with resiliency and datacenter RAS.
- Proficiency in Verilog/SystemVerilog RTL simulations and debug. Ability to set up test benches and integrate various components.
- Programming with CUDA.
NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of AI factories, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as "the AI computing company". Do you love the challenge of crafting the highest-performance silicon possible? If so, we want to hear from you! Come, join our Accelerated and Resilient Compute Systems team and help build the resilient, highly available, cost-effective computing platform driving our success in this exciting and quickly growing field.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.