Upwork

Contract: Site Reliability Engineer

🇺🇸 United States

Contract

Upwork currently hires full-time employees in the following states: Arizona, California, Florida, Georgia, Illinois, Maryland, Massachusetts, Michigan, Minnesota, Montana, Nebraska, Nevada, Ohio, Oregon, Pennsylvania, South Carolina, Tennessee, Texas, Virginia, and Washington. If you are not located in one of these states, you are welcome to explore our open contract engagements!

Upwork ($UPWK) is the world’s work marketplace. We serve everyone from one-person startups to over 30% of the Fortune 100 with a powerful, trust-driven platform that enables companies and talent to work together in new ways that unlock their potential.

Last year, more than $3.3 billion of work was done through Upwork by skilled professionals who are gaining more control by finding work they are passionate about and innovating their careers.

This is an engagement through Upwork’s Hybrid Workforce Solutions (HWS) Team. Our Hybrid Workforce Solutions Team is a global group of professionals that support Upwork’s business. Our HWS team members are located all over the world.

Work/Project Scope:
This role participates in our production on-call rotation during your daytime hours and on some weekends, with a focus on the areas below:

Incident Management: Play an active role in production on-call, responding swiftly to troubleshoot and resolve production issues. Size up a situation, assess the effectiveness of various tactics and strategies, and make rapid decisions on appropriate courses of action.
Ensure high availability by implementing and maintaining resilient cloud architectures; monitor system performance and proactively identify and resolve potential points of failure.
Develop and maintain automation scripts, tools, and processes to streamline system deployment and monitoring tasks, eliminate toil, and reduce operational overhead.
Create and maintain a comprehensive dashboard and playbooks for production on-call. Continuously improve the production on-call experience and system sustainability and effectiveness.
Identify areas to improve service resilience through techniques such as chaos engineering and performance/load testing.
Be an out-of-the-box, critical thinker who not only understands present challenges but also knows what to plan and do to improve in the future.

Must Haves (Required Skills):

3+ years of experience as a Site Reliability Engineer or in a DevOps role, with a primary focus on managing cloud-based services and infrastructure
Experience with AWS (EC2, S3, ECS, VPC, Elasticsearch, Lambda), Linux system administration, and monitoring tools (Prometheus, Grafana, CloudWatch, Datadog, Dynatrace)
Good working knowledge of load balancers, firewalls, and TCP/IP networking architecture
Strong programming skills in Python, shell scripting, and Terraform
Critical thinking and strong debugging and problem-solving skills
Automation advocate: you truly believe in removing operational load with software
Familiarity with microservices architecture and container orchestration with Kubernetes
Experience with scale testing, disaster recovery, and capacity planning
Excellent verbal and written communication skills (English)

Upwork is proudly committed to fostering a diverse and inclusive workforce. We never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical condition), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.

To learn more about how Upwork processes and protects your personal information as part of the application process, please review our Global Job Applicant Privacy Notice.
