Episodes

  • #63 - Does "Big Observability" Neglect Mobile?
    2024/11/12

    Andrew Tunall is a product engineering leader pushing the boundaries of reliability, with a current focus on mobile observability. Drawing on his experience at AWS and New Relic, he’s vocal about the need for more user-focused observability, especially on mobile, where traditional practices fall short.

    * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.

    * Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.

    * Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.

    * Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.

    * Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.

    * Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.

    * Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.

    * Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager’s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.

    * OpenTelemetry’s Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don’t align with traditional time-based observability (see the sketch after this list).

    * SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences—a critical factor in retaining app users.

    * Amazon’s Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon’s practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.

    * Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.
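
    As a hedged illustration of the OpenTelemetry point above (not something discussed in the episode), here is a minimal Kotlin sketch of manually instrumenting a screen load with the OpenTelemetry Java/Kotlin API; the span and attribute names are invented, and it shows how a non-time-based signal like memory ends up bolted onto a time-based span.

    ```kotlin
    import io.opentelemetry.api.GlobalOpenTelemetry

    // Hypothetical helper: wrap a screen load in a span so its duration is recorded.
    // Span and attribute names are illustrative, not from the episode.
    fun <T> trackScreenLoad(screenName: String, loadScreen: () -> T): T {
        val tracer = GlobalOpenTelemetry.getTracer("mobile-app")
        val span = tracer.spanBuilder("screen.load")
            .setAttribute("screen.name", screenName)
            .startSpan()
        try {
            return loadScreen()
        } finally {
            // Signals like memory pressure aren't naturally time-based, so they get
            // attached as span attributes, part of the mismatch described above.
            val runtime = Runtime.getRuntime()
            val usedMb = (runtime.totalMemory() - runtime.freeMemory()) / (1024 * 1024)
            span.setAttribute("memory.used_mb", usedMb)
            span.end()
        }
    }
    ```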



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    29 min
  • #62 - Early Youtube SRE shares Modern Reliability Strategy
    2024/11/05

    Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling the tough challenges that engineers face every day.

    Here’s a slightly deeper dive into the concepts we discussed:

    * Career and Evolution in Tech: Andrew shares his journey through various roles, from early SRE at YouTube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive infrastructure experience spanning three distinct eras of the internet. He emphasizes the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it.

    * Building Prodvana and the Future of SRE: As CEO of the startup Prodvana, Andrew is focused on an “intelligent delivery system” designed to simplify production management for engineers and reduce cognitive overload. He sees SRE as a field facing new demands due to AI, sharing insights from conversations with Niall Murphy and Corey Bertram about AI’s potential in the space, distinguishing it from “web three” hype, and affirming that while AI will transform SRE, it will not eliminate it.

    * Challenges of Migration and Integration: Reflecting on his experience at YouTube after the Google acquisition, Andrew discusses the challenges of migrating YouTube’s infrastructure onto Google’s proprietary, non-thread-safe systems. This required extensive adaptation and “glue code,” offering insights into the intricacies and sometimes rigid culture of Google’s engineering approach at that time.

    * SRE’s Shift Toward Reliability as a Core Feature: Andrew describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. He emphasizes that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices.

    * Organizational Culture and Leadership Influence: Leadership’s role in SRE success is crucial, with examples from Dropbox and Google showing that strong, supportive leadership can shape positive, reliability-centered cultures. Andrew advises engineers to gauge leadership attitudes toward SRE during job interviews to find environments where reliability is valued over mere incident response.

    * Outcome-Focused Work Over Titles: Assemble the right team based on skills, not titles, to solve technical problems effectively. Titles often distract from outcomes, and fostering a problem-solving culture over role-based thinking accelerates teamwork and results.

    * Engineers as Problem Solvers: Natural engineers tend to resist job boundaries and focus on solving problems rather than sticking rigidly to job descriptions. This echoes how iconic figures like Steve Jobs valued versatility over predefined roles.

    * Culture as Core Values: Organizational culture should be driven by core values like reliability, efficiency, and inclusivity rather than rigid processes or roles. For instance, Dropbox’s infrastructure culture emphasized being a “force multiplier” to sustain product velocity, an approach that ensured values were integrated into every decision.

    * Balancing SRE and Platform Priorities: The fundamental difference between SRE (Site Reliability Engineering) and platform engineering is their focus: SRE prioritizes reliability, while platform engineering is geared toward increasing velocity or reducing costs. Leaders must be cautious when assigning both roles simultaneously, as each requires a distinct focus and expertise.

    * Strategic Trade-Offs in Smaller Orgs: In smaller companies with limited resources, leaders often struggle to balance cost, reliability, and other objectives within single roles. It’s better to sequence these priorities than to burden one individual with conflicting objectives. Prioritizing platform stability, for example, can help improve reliability in the long term.

    * DevOps as a Philosophy: DevOps is viewed here as an operational philosophy rather than a separate role. The approach enhances both reliability and platform functions by fostering a collaborative, efficient work culture.

    * Focus Investments for Long-Term Gains: Strategic technology investments, even if they temporarily hinder short-term metrics (like reliability), can drive long-term efficiency and reliability improvements. For instance, Dropbox invested in a shared metadata system to enable active-active disaster recovery, viewing this ...
    36 min
  • #61 Scott Moore on SRE, Performance Engineering, and More
    2024/10/22



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    38 min
  • #60 How to NOT fail in Platform Engineering
    2024/10/01

    Here’s what we covered:

    Defining Platform Engineering

    * Platform engineering: Building compelling internal products to help teams reuse capabilities with less coordination.

    * Cloud computing connection: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas.

    Ankit’s career journey

    * Didn't choose platform engineering; it found him.

    * Early start in programming (since age 11).

    * Transitioned from a product engineer mindset to building internal tools and platforms.

    * Key experience across startups, the public sector, unicorn companies, and private cloud projects.

    Singapore Public Sector Experience

    * Public sector: Highly advanced digital services (e.g., identity services for tax, housing).

    * Exciting environment: Software development in Singapore’s public sector is fast-paced and digitally progressive.

    Platform Engineering Turf Wars

    * Turf wars: Debate among DevOps, SRE, and platform engineering.

    * DevOps: Collaboration between dev and ops to think systemically.

    * SRE: Operations done the software engineering way.

    * Platform engineering: Delivering operational services as internal, self-service products.

    Dysfunctional Team Interactions

    * Issue: Requiring tickets to get work done creates bottlenecks.

    * Ideal state: Teams should be able to work autonomously without raising tickets.

    * Spectrum of dysfunction: From one ticket for one service to multiple tickets across teams leading to delays and misconfigurations.

    Quadrant Model (Autonomy vs. Cognitive Load)

    * Challenge: Balancing user autonomy with managing cognitive load.

    * Goal: Enable product teams with autonomy while managing cognitive load.

    * Solution: Platforms should abstract unnecessary complexity while still giving teams the autonomy to operate independently.

    How it pans out

    * Low autonomy, low cognitive load: Dependent on platform teams but a simple process.

    * Low autonomy, high cognitive load: Requires interacting with multiple teams and understanding technical details (worst case).

    * High autonomy, high cognitive load: Teams have full access (e.g., AWS accounts) but face infrastructure burden and fragmentation.

    * High autonomy, low cognitive load: Ideal situation—teams get what they need quickly without detailed knowledge.

    Shift from Product Thinking to Cognitive Load

    * Cognitive load focus: More important than just product thinking—consider the human experience when using the system.

    * Team Topologies: Mentioned as a key reference on this concept of cognitive load management.

    Platform as a Product Mindset

    * Collaboration: Building the platform in close collaboration with initial users (pilot teams) is crucial for success.

    * Product Management: Essential to have a product manager or team dedicated to communication, user journeys, and internal marketing.

    Self-Service as a Platform Requirement

    * Definition: Users should easily discover, understand, and use platform capabilities without human intervention (a hypothetical sketch of such a self-service call follows this list).

    * User Testing: Watch how users interact with the platform to understand stumbling points and improve the self-service experience.
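
    To make “self-service without human intervention” concrete, here is a purely hypothetical Kotlin sketch contrasting the ticket-driven path with a direct platform API call; PlatformClient, provisionPostgres, and every other name in it are invented for illustration and don’t refer to any real product.

    ```kotlin
    // Hypothetical internal platform API; every name here is invented for illustration.
    data class DatabaseHandle(val connectionString: String, val dashboardUrl: String)

    interface PlatformClient {
        // Self-service path: a product team provisions what it needs directly,
        // with golden-path defaults hiding most of the underlying complexity.
        fun provisionPostgres(team: String, service: String, sizeGb: Int = 10): DatabaseHandle
    }

    // Stand-in implementation so the sketch runs; a real platform would call its
    // provisioning pipeline here instead of fabricating values.
    class FakePlatformClient : PlatformClient {
        override fun provisionPostgres(team: String, service: String, sizeGb: Int) =
            DatabaseHandle(
                connectionString = "postgres://$team-$service-db:5432/app",
                dashboardUrl = "https://dashboards.internal/d/$team-$service"
            )
    }

    fun main() {
        // No ticket, no waiting on another team: one call returns what the team needs.
        val db = FakePlatformClient().provisionPostgres(team = "payments", service = "checkout-api")
        println("Connect via ${db.connectionString}, observe at ${db.dashboardUrl}")
    }
    ```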

    Platform Team Cognitive Load

    * Burnout Prevention: Platform engineers need low cognitive load as well. Moving from a reactive (ticket-based) model to a proactive, self-service approach can reduce the strain.

    * Proactive Approach: Self-service models allow platform teams to prioritize development and avoid being overwhelmed by constant requests.



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    31 min
  • #59 Who handles monitoring in your team and how?
    2024/09/24

    Why many copy Google’s monitoring team setup

    * Google’s influence. Google played a key role in defining the concept of software reliability.

    * Success in reliability. Few can dispute Google’s ability to ensure high levels of reliability and its ability to share useful ways to improve it in other settings.

    But there’s a problem:

    * It’s not always replicable. While Google’s practices are admired, they may not be a perfect fit for every team.

    What is Google’s monitoring approach within teams?

    Here’s the thing that Google does:

    * Google assigns one or two people per team to manage monitoring.

    * Even with centralized infrastructure, a dedicated person handles monitoring.

    * Many organizations use a separate observability team, unlike Google’s integrated approach.

    If your org is large enough and prioritizes reliability highly enough, you might find it feasible to follow Google’s model to a tee. Otherwise, a centralized team with occasional “embedded x engineer” secondments might be more effective.

    Can your team mimic Google’s model?

    Here are a few things you should factor in:

    Size matters. Google’s model works because of its scale and technical complexity. Many organizations don’t have the size, resources, or technology to replicate this.

    What are the options for your team?

    * Dedicated monitoring team (very popular but $$$). If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget, so it’s not something that a startup or SME can easily justify.

    * Dedicate SREs to monitoring work (effective but difficult to manage). You might do this on rotation or make an SRE permanently “responsible for all monitoring matters”. Putting SREs on permanent tasks might lead to burnout as it might not suit their goals, and rotation work requires effective planning.

    * Internal monitoring experts (useful but a hard capability to build). One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team’s needs. This should be how we get monitoring work done, but it’s hard to get volunteers across a majority of teams.

    Transitioning monitoring from project work to maintenance

    There are 2 distinct phases:

    * Initial setup (the “project”). SREs may help set up the monitoring/observability infrastructure. Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.

    * Post-project phase (“keep the lights on”). Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?

    Who will maintain the monitoring system?

    Answer: usually not the same team. After the project phase, a new set of people, often different from the original team, typically handles maintenance.

    Options to consider (once again):

    * Spin up a monitoring/observability team. Create a dedicated team for observability infrastructure.

    * Take a decentralized approach. Engineers across various teams take on observability roles as part of their regular duties.

    * Internal monitoring/observability experts. They can take responsibility for monitoring and ensure best practices are followed.

    The key thing to remember here is: adapt to your organizational context.

    One size doesn’t fit all. Google’s model may not work for everyone. Tailor your approach based on your organization’s specific needs.

    The core principle to keep in mind: as long as people understand why monitoring/observability matters and pay attention to it, you’re on the right track.

    Work according to engineer awareness:

    * If engineers within product and other non-operations teams are aware of monitoring, you can attempt to decentralize the effort and involve more team members.

    * If awareness or interest is low, consider dedicated observability roles or an SRE team to ensure monitoring gets the attention it needs.

    In conclusion: there’s no universal solution. Whether you centralize or decentralize monitoring depends on your team’s structure, size, and expertise. The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.

    PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting, i.e. “executive short-hand”. Tell me what you think.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    8 min
  • #58 Fixing Monitoring's Bad Signal-to-Noise Ratio
    2024/09/17

    Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come.

    The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts.

    This interrupts workflows, affects personal time, and even disrupts sleep.

    Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise.

    When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.

    Sebastian proposes a fundamental fix for this data overload: be deliberate with the data you emit.

    When instrumenting your systems, be intentional about what data you collect and transport.

    Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.

    To combat this, focus on the following (a rough sketch of alert tiering follows this list):

    * Being Deliberate with Data. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.

    * Filtering Data Effectively. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.

    * Refining Alerts. Optimize alert rules such as creating tiered alerts to distinguish between critical issues and minor warnings.
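
    As a rough, hypothetical sketch of the tiered-alerts idea (the model and thresholds are invented, not from the episode): only critical, user-impacting alerts should page anyone; the rest become tickets or dashboard annotations.

    ```kotlin
    // Hypothetical alert model and tiering rule; names and thresholds are invented.
    enum class Tier { PAGE, TICKET, DASHBOARD_ONLY }

    data class IncomingAlert(
        val name: String,
        val severity: String,       // e.g. "critical", "warning", "info"
        val userImpacting: Boolean  // does it affect the user journey?
    )

    // Keep the pager for critical, user-impacting signals only;
    // everything else becomes a ticket or a dashboard annotation.
    fun tierOf(alert: IncomingAlert): Tier = when {
        alert.severity == "critical" && alert.userImpacting -> Tier.PAGE
        alert.severity == "critical" || alert.severity == "warning" -> Tier.TICKET
        else -> Tier.DASHBOARD_ONLY
    }

    fun main() {
        val alerts = listOf(
            IncomingAlert("checkout-error-rate-high", severity = "critical", userImpacting = true),
            IncomingAlert("disk-usage-70-percent", severity = "warning", userImpacting = false),
            IncomingAlert("cache-miss-ratio-up", severity = "info", userImpacting = false),
        )
        // Only the first alert should wake anyone up.
        alerts.forEach { println("${it.name} -> ${tierOf(it)}") }
    }
    ```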

    Dan Ravenstone, who leads platform at Top Hat, discussed “triaging alerts” recently.

    He shared that managing millions of alerts, often filled with noise, is a significant issue.

    His advice: scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don’t impact the user journey.

    According to Dan, the anatomy of a good alert includes:

    * A run book

    * A defined priority level

    * A corresponding dashboard

    * Consistent labels and tags

    * Clear escalation paths and ownership

    To elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.
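
    Putting Dan’s checklist together with the aggregation point, here is a minimal hypothetical Kotlin sketch; the fields mirror the anatomy above, and grouping by a shared service label stands in for the correlation techniques mentioned.

    ```kotlin
    // Hypothetical alert definition mirroring the "anatomy of a good alert" above.
    data class AlertDefinition(
        val name: String,
        val runbookUrl: String,          // a run book
        val priority: Int,               // a defined priority level (1 = highest)
        val dashboardUrl: String,        // a corresponding dashboard
        val labels: Map<String, String>, // consistent labels and tags
        val owner: String                // clear escalation path and ownership
    )

    // Illustrative correlation: group firing alerts that share a "service" label so
    // related symptoms surface as one pattern instead of many disconnected pages.
    fun correlateByService(firing: List<AlertDefinition>): Map<String, List<AlertDefinition>> =
        firing.groupBy { it.labels["service"] ?: "unknown" }
    ```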

    The learning point is simple: aim for quality over quantity.

    By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents.



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    8 min
  • #57 How Technical Leads Support Software Reliability
    2024/09/10

    The question then comes down to: can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years, even spending a few of them at the coveted consultancy Thoughtworks, and now coaches others.

    She and I discussed the link between this role and software reliability.



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    32 min
  • #56 Resolving DORA Metrics Mistakes
    2024/09/04
    We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas.Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations.Nathen Harvey is no stranger to this problem.He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018. His focus has been on questions like:How do we help teams get better at delivering and operating software? You and I can agree that this is an important question to ask. I’d listen to what he has to say about DORA because he’s got a wealth of experience behind him, having also run community engineering at Chef Software.Before we continue, let’s explore What is DORA? in Nathen’s (paraphrased) words:DORA is a software research program that's been running since 2015.This research program looks to figure out:How do teams get good at delivering, operating, building, and running software? The researchers were able to draw out the concept of the metrics based on correlating teams that have good technology practices with highly robust software delivery outcomes.They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction.Essentially, all those things that matter to the business.One of the challenges the researchers found over the last decade was working out: how do you measure something like software delivery? It's not the same as a factory system where you can go and count the widgets that we're delivering necessarily.The unfortunate problem is that the factory mindset I think still leaks in. I’ve personally noted some silly metrics over the years like lines of code.Imagine being asked constantly: “How many lines of code did you write this week?”You might not have to imagine. It might be a reality for you. DORA’s researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. They settled on and validated 4 key measures for software delivery performance.Nathen elaborated that 2 of these measures look at throughput:[Those] two [that] look at throughput really ask two questions:* How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, a change to go from the developer's workstation. right through to production?And then the second question on throughput is:* How frequently are you updating production?In plain English, these 2 metrics are:* Deployment Frequency. How often code is deployed to production? This metric reflects the team's ability to deliver new features or updates quickly.* Lead Time for Changes: Measures the time it takes from code being committed to being deployed to production. Nathen recounted his experience of working at organizations that differed in how often they update production from once every six months to multiple times a day. They're both very different types of organizations, so their perspective on throughput metrics will be wildly different. This has some implications for the speed of software delivery.Of course, everyone wants to move faster, but there’s this other thing that comes in and that's stability.And so, the other two stability-oriented metrics look at:What happens when you do update production and... something's gone horribly wrong. 
“Yeah, we need to roll that back quickly or push a hot fix.” In plain English, they are:* Change Failure Rate: Measures the percentage of deployments that cause a failure in production (e.g., outages, bugs). * Failed Deployment Recovery Time: Measures how long it takes to recover from a failure in production. You might be thinking the same thing as me. These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics.But keep in mind, it’s about balancing all 4 metrics. Nathen believes it’s fair to say today that across many organizations, they look at these concepts of throughput and stability as tradeoffs of one another. We can either be fast or we can be stable. But the interesting thing that the DORA researchers have learned from their decade of collecting data is that throughput and stability aren't trade-offs of one another.They tend to move together. They’ve seen organizations of every shape and size, in every industry, doing well across all four of those metrics. They are the best performers. The interesting thing is that the size of your organization doesn't matter the industry that you're in.Whether you’re working in a highly regulated or unregulated industry, it doesn't matter.The key insight that Nathen thinks we should be searching for is: how do you get there? To him, it's about shipping smaller changes. When you ship small changes, they're easier to move ...
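
    As referenced above, here is a rough Kotlin sketch (not from the episode) of how the four DORA measures could be computed from deployment records; the Deployment shape and its field names are assumptions made for the example.

    ```kotlin
    import java.time.Duration
    import java.time.Instant

    // Hypothetical deployment record; field names are illustrative only.
    data class Deployment(
        val committedAt: Instant,        // when the change was committed
        val deployedAt: Instant,         // when it reached production
        val failed: Boolean,             // did it cause a failure in production?
        val recoveredAt: Instant? = null // when service was restored, if it failed
    )

    // Computes the four DORA measures over a reporting window (e.g. 30 days).
    fun doraMetrics(deployments: List<Deployment>, window: Duration): String {
        // Deployment Frequency: deployments per day over the window.
        val perDay = deployments.size.toDouble() / window.toDays()

        // Lead Time for Changes: median commit-to-production time.
        val leadTimes = deployments
            .map { Duration.between(it.committedAt, it.deployedAt) }
            .sorted()
        val medianLeadTime = leadTimes.getOrNull(leadTimes.size / 2)

        // Change Failure Rate: share of deployments that caused a failure.
        val failureRate = deployments.count { it.failed }.toDouble() / deployments.size

        // Failed Deployment Recovery Time: median time to restore after a failed deployment.
        val recoveries = deployments
            .filter { it.failed }
            .mapNotNull { d -> d.recoveredAt?.let { Duration.between(d.deployedAt, it) } }
            .sorted()
        val medianRecovery = recoveries.getOrNull(recoveries.size / 2)

        return "deploys/day=$perDay, leadTime=$medianLeadTime, " +
            "changeFailureRate=$failureRate, recoveryTime=$medianRecovery"
    }
    ```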
    27 min