https://www.atr.com/search-jobs
Title: HPC Systems Engineer
Location: Onsite in Annapolis Junction, MD
Citizenship/Clearance Requirement: US Citizen with an active TS/SCI with Full Scope Polygraph required.
This position supports a government contract. We have multiple openings for Computer/Systems Engineers in Annapolis Junction, MD. We are looking for High Performance Computing (HPC) designers and developers to join a highly skilled, high-performing agile team supporting a nationally significant and fast-paced program. The focus is on developing a range of streamlined, collaborative applications for cybersecurity and analytics that share data across agencies within the Intelligence Community (IC). This highly collaborative program is focused on injecting new technologies and adding advanced capabilities to ongoing operational systems used in critical, day-to-day national security missions. The work is focused on building a data pipeline and artificial intelligence/machine learning (AI/ML)-based analytics capabilities, enabling data to be updated and shared in real time by modernizing visualization and presentation tools to help drive more informed and timely decisions across the IC. The right candidate will have experience implementing algorithms and codes for large-scale simulation of mathematical models in HPC environments. A background in Signals Intelligence (SIGINT) is preferred.
Responsibilities:
Requirements Gathering:
- Confer with other computer, systems, and software engineers to analyze complex requirements; use design software tools; provide support using formal specifications, data flow diagrams, and other accepted design techniques; and apply engineering principles to provide full systems lifecycle support for the growing HPC compute infrastructure
Software Development:
- Shape the design, development, and/or modification of HPC software solutions by analyzing system performance standards; conferring with users and with computer/systems or software engineers; analyzing systems flow, data usage, and work processes; and investigating problem areas
Algorithms:
- Develop or implement algorithms to address HPC system performance and functional standards
Documentation:
- Review HPC software and system documentation and recommend improvements to existing documentation and to software/system development process standards
Quality Control:
- Ensure quality control of all developed and modified HPC software and hardware
Required Qualifications:
- Active TS/SCI clearance with full scope polygraph
- Bachelor's degree in a STEM field or similar technical discipline
- Knowledge and experience with HPC concepts to include cluster architecture, parallel file systems, and high-speed networking
- Demonstrated ability to provision and configure HPC environments and components
- Solid understanding of accelerated computing, scheduling, and I/O stacks
- Broad and deep understanding of the issues that affect GPU performance, CPU performance, and scaling performance
Proficiency with:
- Agile/Scrum software development methodologies and team collaboration
- Linux (Red Hat/CentOS) including OS, CLI (Command Line Interface), system administration, networking, storage, and security
- Writing Linux based scripts to facilitate application integration
- Lightweight Directory Access Protocol (LDAP)
- TCP/IP fundamentals
- HPC workflows that use Message Passing Interface (MPI)
- Languages, libraries and tools used in HPC (C++, C, modern Fortran, HIP, CUDA, Python, MPI, OpenMP, etc.)
- Cluster configuration management tools such as Ansible, Puppet, and Salt
- Unix cluster and node monitoring tools, including Node Health Check (NHC), Nagios, Grafana and Prometheus
- Node.js and the NPM (Node Package Manager) ecosystem
- Continuous integration and software CM (Configuration Management) processes/tools
- Container technologies like Docker, Singularity, Shifter, Charliecloud
- Generating and reviewing software/technical documentation
- Understanding of Test Driven Development (TDD) and automation tools
Bonus Skills:
- A background in Signals Intelligence (SIGINT) is preferred
- Experience working with information security teams to ensure cybersecurity compliance of multi-user systems
- Knowledge of algorithms, methods, software libraries, and other tools commonly used in scientific computation
Experience with:
- Bright Computing platform
- Various MPI implementations (Intel MPI, Open MPI, MPICH)
- Fast, multivendor, distributed cluster storage systems like Lustre, GPFS (General Parallel File System), and XFS for HPC workloads
- Deep learning frameworks like PyTorch and TensorFlow
- Software Defined Networking
- NVIDIA CUDA libraries and GPUs
- Virtualization techniques, cloud platform solutions
- MLPerf benchmarking
- AI/ML coding
- Apache NiFi
- DevOps
- AWS, Azure, or GCP platforms
Other Details: Work is off site at a contractor facility; 16 hours of telework a week may be authorized (but this is NOT guaranteed); the project environment is made up of a highly skilled, collaborative team; flexible work schedules; compelling mission; excellent benefits package and attractive salary packages.