Senior Azure Cloud/HPC Engineer

Illinois

Hybrid

Full Time

$120k - $200k

Job Title: High-Performance Computing (HPC) Operations Engineer

Location: Chicago, IL/Hybrid

Our client, a globally recognized safety science company deeply rooted in the Chicago area, is renowned for its prowess in product safety testing and certification. Playing a pivotal role in ensuring the safety and dependability of products across diverse industries, it stands at the forefront of technology and innovation. Collaborating closely with businesses, the company evaluates, tests, and certifies products, actively contributing to the enhancement of public safety standards. In the Chicago region, it serves as a cornerstone, championing the highest safety and quality standards across a wide spectrum of products.

Qualifications:
- Bachelor's degree in Computer Science, Computer Engineering, or a related field.
- 5+ years of robust experience managing and supporting HPC systems in a production environment.
- Proficiency in Linux/Unix system administration, shell scripting, and fundamental programming concepts.
- Familiarity with job scheduling systems (e.g., Slurm, Torque/PBS) and resource management.
- Knowledge of parallel programming concepts and optimization techniques.
- Exceptional problem-solving abilities and adeptness in efficiently diagnosing and resolving technical issues.
- Strong communication and interpersonal skills tailored to effectively engage both technical and non-technical audiences.
- Meticulous attention to detail with a knack for producing and maintaining accurate documentation.
- Capability to thrive in a collaborative, multidisciplinary research environment.

Responsibilities:

1. Infrastructure Collaboration:
- Work collaboratively with researchers, scientists, and technical teams to grasp requirements and implement infrastructure enhancements.
- Deploy and oversee job scheduling systems, optimizing resource allocation and usage.

2. User Support and Optimization:
- Offer technical support to HPC users, aiding in job submission, troubleshooting, and optimization.
- Swiftly address and resolve technical incidents and service requests within the HPC environment.

3. Troubleshooting and Resource Management:
- Diagnose and troubleshoot intricate issues, coordinating with pertinent teams for prompt resolution.
- Manage job queues, prioritize tasks, and allocate computing resources per user requirements.

4. Documentation and Training:
- Develop and maintain comprehensive documentation for system configurations, troubleshooting procedures, and best practices.
- Conduct training sessions for users on HPC utilization, software tools, and best practices.

5. Code Optimization and Performance Analysis:
- Provide guidance and assistance to users in code optimization, debugging, and parallelization techniques.
- Analyze system performance metrics, identify bottlenecks, and propose optimization strategies.

6. Technology Evaluation and Staying Current:
- Stay abreast of advancements in HPC hardware, software, and methodologies.
- Evaluate new technologies and tools to enhance the organization's HPC capabilities.

Benefits:
- Bonus compensation for all employees.
- Comprehensive medical, dental, vision, and life insurance plans.
- Generous 401(k) matching structure (up to 5% of eligible pay) and an additional 4% investment into the retirement savings fund after the first year of continuous employment.
- Flexible working arrangements based on the role.
- Paid time off, including vacation, holiday, sick, and volunteer time off.

Applicants must be authorized to work in the US on a full-time basis, now and in the future.

Posted by: Nicolas Butko

Specialization: System Administration