SRE (Site Reliability Engineer)

  • Israel
  • New Jersey
  • Texas

Cellwize Wireless is seeking a full-time SRE to join our team and build high-quality technology solutions that revolutionize cellular networks, powered by Artificial Intelligence in the customer cloud. Cellwize provides services through SaaS applications to several Fortune 100 and Fortune 500 customers. You will take ops projects from concept through to launch. You will be responsible for maintaining and improving the company’s production environment for rapid scaling and outstanding performance, and you will help us maintain stellar uptime and reliability. The improvements you implement will be felt by the entire organization.

As a Site Reliability Engineer at Cellwize, you will be responsible for keeping our cloud-based services, streaming frameworks, NoSQL/RDBMS databases, and distributed analytical platforms running in multi-cloud environments, delivering unprecedented IT automation and insight into user experiences driven by our AI services across our customers’ geographically distributed networks.

 

Responsibilities:

  • Build infrastructure as code using Terraform, Ansible, and Kubernetes.
  • Manage and tune the performance of either databases (NiFi, Elasticsearch) or streaming data pipelines (Kafka).
  • Manage CI/CD pipelines, configuration, and automation tools for infrastructure provisioning.
  • Write and maintain runbooks for knowledge-driven automated processes and bots.
  • Do capacity planning based on performance, usage, and utilization stats.
  • Partner with developers and quality engineering teams to automate the monitoring, alerting, availability and scalability of our applications and systems.
  • Ensure system availability and business continuity by implementing redundant servers/services.
  • Manage after-hours infrastructure updates and maintenance.
  • Proactively research and propose the use of new concepts, processes, technologies, and tools.
  • Proactively monitor, diagnose, and resolve issues in a 24×7 multi-cloud environment (OpenStack), including participation in the on-call rotation; analyze failures and support software engineers in debugging production issues across microservices and distributed platforms.
  • Follow SRE best practices and procedures.

 

Experience Required For You To Be Successful:

  • An extensive background in developing and operating large-scale cloud-based distributed applications
  • Direct experience developing/running applications on OpenStack and AWS
  • A laser focus on, and the ability to design, infrastructure solutions for scalability, reliability, high availability, performance, software maintainability, and operational excellence
  • The ability to “fix the plane while in flight” (not just support greenfield solutions)
  • The ability to prioritize existing technical and infrastructure debt, and the experience to build and execute a plan to pay it off

 

Required skills:

  • Delivering reliable operations for web-scale infrastructure for a global market at high release velocity
  • Solid experience with at least one of the following languages: Go, Python
  • Experience with Kafka, Mesos, NiFi, Elasticsearch, MySQL, Vertica, ZooKeeper, Nginx.
  • 10+ years of industry experience in managing infrastructure.
  • 5 years of Linux administration in a large-scale SaaS environment.
  • 5 years of maintaining production systems on AWS and/or OpenStack.
  • 3 years of experience managing Kubernetes in a large-scale production environment.
  • Strong familiarity with running and optimizing relational and NoSQL databases.
  • 3 years of using infrastructure-as-code software (e.g. Terraform, AWS CloudFormation, Google Cloud Deployment Manager).
  • 5 years of experience with continuous integration practices and tools (Jenkins).
  • Experience with monitoring solutions such as: Prometheus, Grafana, ELK.