Datacenter Resiliency Architect - New College Grad 2025

NVIDIA

On-site 🏢 New 🔥🔥
Tech & IT Services
California, United States 🇺🇸

Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how we can make a lasting impact on the world.

We are now seeking a Resiliency Architect to support the development and validation of GPU (graphics processing unit) hardware and software resiliency features. In this role, you will be a key member of a team of innovators, challenging the status quo and pushing beyond boundaries. You will have the opportunity to impact the industry's leading Datacenter GPUs and SoCs powering product lines for the growing field of artificial intelligence (AI) and high-performance computing (HPC).

What you'll be doing:

  • Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter.
  • Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements.
  • Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements.
  • Develop and implement comprehensive architecture verification testplans for resiliency features.
  • Execute Architecture Testplan by developing test content, working with Software and Architecture teams to enable, run, and debug tests on Architecture models. Support test debug on RTL, emulation, and silicon.
  • Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches.
  • Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues.
  • Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments.

What we need to see:

  • Pursuing or recently completed a Master's or PhD degree in Computer Engineering, Electrical Engineering or closely related degree or equivalent experience.
  • Familiarity with GPU and Networking Architectures, Computer Architecture basics (including caches, coherence, buses, direct memory access, etc.); Machine Learning/Deep Learning concepts.
  • Proficiency in RAS concepts and in developing Architecture models.
  • Scripting and automation with Python or similar.
  • Proficiency in C/C++.
  • Excellent interpersonal skills and ability to collaborate with on-site and remote teams.
  • Strong debugging and analytical skills.
  • Be self-driven and results-oriented.

Ways to stand out from the crowd:

  • Experience with resiliency and datacenter RAS.
  • Proficiency in Verilog/SystemVerilog RTL simulations and debug. Ability to set up test benches and integrate various components.
  • Programming with CUDA.

NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of AI factories, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as "the AI computing company". Do you love the challenge of crafting the highest-performance silicon possible? If so, we want to hear from you! Come, join our Accelerated and Resilient Compute Systems team and help build the resilient, highly available, cost-effective computing platform driving our success in this exciting and quickly growing field.

The base salary range is 120,000 USD - 235,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.