Datacenter Resiliency Architect - New College Grad 2025

NVIDIA

On-site 🏢 19 May
Tech & IT Services
California, United States 🇺🇸

Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how we can make a lasting impact on the world.

We are now seeking a Resiliency Architect to support the development and validation of GPU (graphics processing unit) hardware and software resiliency features. In this role, you will be a key member of a team of innovators, challenging the status quo and pushing beyond boundaries. You will have the opportunity to impact the industry's leading Datacenter GPUs and SoCs powering product lines for the growing field of artificial intelligence (AI) and high-performance computing (HPC).

What you'll be doing:

  • Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter.
  • Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements.
  • Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements.
  • Develop and implement comprehensive architecture verification testplans for resiliency features.
  • Execute Architecture Testplan by developing test content, working with Software and Architecture teams to enable, run, and debug tests on Architecture models. Support test debug on RTL, emulation, and silicon.
  • Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches.
  • Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues.
  • Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments.

What we need to see:

  • Pursuing or recently completed a Master’s or PhD degree in Computer Engineering, Electrical Engineering or closely related degree or equivalent experience.
  • Familiarity with GPU and Networking Architectures, Computer Architecture basics (including caches, coherence, buses, direct memory access, etc.); Machine Learning/Deep Learning concepts.
  • Proficiency in RAS concepts and in developing Architecture models.
  • Scripting and automation with Python or similar.
  • Proficiency in C/C++.
  • Excellent interpersonal skills and ability to collaborate with on-site and remote teams.
  • Strong debugging and analytical skills.
  • Be self-driven and results-oriented.

Ways to stand out from the crowd:

  • Experience with resiliency and datacenter RAS.
  • Proficiency in Verilog/System Verilog RTL simulations and debug. Ability to set up test benches and integrate various components.
  • Programming with CUDA.

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of AI factories, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. Do you love the challenge of crafting the highest-performance silicon possible? If so, we want to hear from you! Come, join our Accelerated and Resilient Compute Systems team and help build the resilient, highly available, cost-effective computing platform driving our success in this exciting and quickly growing field.

The base salary range is 120,000 USD - 235,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.