ainoya.dev

Day 0 Operations: Minimizing Risks in Manual Processes for New Products

It’s become standard practice to automate and codify product operations. When launching a new product, you’re often starting from scratch with these automation efforts.

While external services have made setting up environments significantly more efficient, achieving perfect automation from the outset can be challenging for many products. It’s more common to build the environment and continuously reduce toil as the product develops.

This inevitably involves manual processes. While it’s acceptable to perform tasks manually with the intention of automating them later, are we underestimating the risks associated with these manual operations? Consider deployments, data migrations, data changes during incident response, and data investigations.

Heinrich’s Law holds that behind every major accident lie many minor incidents and near misses, so small risks that go unaddressed tend to surface later as significant problems. Manual tasks therefore call for deliberate precautions. The absence of automation is no license to improvise risky operations, such as queries against production data, on the spot.

Key Principles for Manual Operations

Here are some key principles to keep in mind:

1. The Principle of Least Astonishment

  • Avoid surprising your team with unexpected information when performing tasks or communicating updates (happy surprises excluded!).
  • Surprises often indicate a lack of awareness beforehand, meaning someone who should have been informed wasn’t. Ensure transparency within the team so that all operations are predictable. Teamwork relies on shared understanding.

2. Document Everything

  • Before the Operation:
    • Create detailed procedures based on anticipated steps.
    • Conduct team reviews of these procedures to identify gaps or omissions.
    • On the day of the operation, simply follow the documented steps mechanically. This reduces psychological burden and promotes safe execution.
  • Rehearsal:
    • Depending on the importance of the operation, conduct a dry run without making actual changes.
    • Execute commands with dry-run options enabled or run read-only queries to simulate the process.
    • A dry run often reveals unexpected errors or missing steps, even in well-written procedures.
  • During the Operation:
    • Comment and record each step and its outcome. This helps prevent secondary damage caused by panicked, undocumented commands during incident response.
    • Communicate with your team before and during the operation. Avoid silent execution, especially when multiple team members are making changes, as the combined outcome can be unpredictable.
  • Verification:
    • For pull requests, document manual testing performed, especially for cases where unit tests are difficult to write due to limitations in the testing environment. For UI testing, capture videos or screenshots.
    • This helps with post-incident analysis, making it easier to identify testing gaps and determine necessary test cases or implementation improvements to prevent recurrence.
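
The rehearsal and recording steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed tool: the `Step` and `Runbook` names and the journal format are hypothetical. Each step is a reviewed description paired with an action, a dry run walks the procedure without executing anything, and every step's outcome is written to a journal as it happens.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Step:
    """One reviewed step of a manual procedure (hypothetical structure)."""
    description: str
    action: Callable[[], str]  # performs the change and returns a short outcome


@dataclass
class Runbook:
    name: str
    steps: List[Step]
    journal: List[str] = field(default_factory=list)

    def run(self, dry_run: bool = True) -> List[str]:
        """Walk the steps in order, recording each step and its outcome."""
        for number, step in enumerate(self.steps, start=1):
            if dry_run:
                # Rehearsal: the action is never called, only recorded.
                outcome = "skipped (dry run)"
            else:
                outcome = step.action()
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            self.journal.append(
                f"{stamp} [{self.name}] step {number}: {step.description} -> {outcome}"
            )
        return self.journal
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: making real changes requires an explicit `run(dry_run=False)`, and the journal can be pasted into the team channel or incident document afterwards.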

Benefits of Documentation

  • Short-term: Standardize common procedures through documented templates.
  • Long-term: Provide requirements for future automation efforts. Analyze and categorize documented tasks to prioritize automation initiatives.

3. Generate Meaningful Logs

  • Go beyond generic error messages. Strive for logs that provide insights into the issue.
  • Bad Log: “An error occurred.” This doesn’t offer any helpful information beyond the fact that an exception was thrown.
  • Good Log: “Data synchronization error with Service A: Service A API is temporarily unavailable. Failed values: value1, value2, value3.” This log clearly explains the problem, reducing the need to dig into the code. It also identifies the specific values that caused the failure.
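
As a rough sketch of how the good log above might be produced, the following Python uses the standard `logging` module. The function names and the `sync` logger name are assumptions made for illustration. The point is that the message is assembled from the context available at the failure site (which service, why, and which values) instead of a fixed generic string.

```python
import logging
from typing import Sequence

logger = logging.getLogger("sync")  # hypothetical logger name


def sync_failure_message(service: str, reason: str, failed_values: Sequence[str]) -> str:
    """Build a log message that names the service, the cause, and the failed values."""
    values = ", ".join(failed_values)
    return f"Data synchronization error with {service}: {reason}. Failed values: {values}."


def report_sync_failure(service: str, reason: str, failed_values: Sequence[str]) -> str:
    """Log the failure at ERROR level and return the message for reuse."""
    message = sync_failure_message(service, reason, failed_values)
    logger.error(message)
    return message
```

Calling `report_sync_failure("Service A", "Service A API is temporarily unavailable", ["value1", "value2", "value3"])` emits the good log shown above.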

Practical Examples

Release (Deployment)

  • If deployments occur on a fixed schedule, hold a release planning meeting with the team beforehand. Use this opportunity to align everyone on the upcoming activities. Consider using a tool like Notion to automatically create a release plan document.
  • Include the following in the template:
    • Release Date and Time
    • Release Engineer
    • Release Scope
    • Data Migration (Yes/No)
      • Details of planned queries and their potential impact on production system performance.
    • Post-Release Verification Steps
    • Rollback Procedures
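
If the release plan lives somewhere a script can read it, a small completeness check can flag blank items before the planning meeting. The sketch below is one possible shape, assuming the template fields listed above are held as plain strings. The `ReleasePlan` class and `missing_fields` helper are hypothetical names, and nothing here depends on Notion or any other tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ReleasePlan:
    """Fields mirror the template items listed above."""
    release_datetime: str = ""
    release_engineer: str = ""
    release_scope: str = ""
    data_migration: str = ""       # "Yes" or "No"
    migration_details: str = ""    # planned queries and expected production impact
    verification_steps: str = ""
    rollback_procedures: str = ""


def missing_fields(plan: ReleasePlan) -> List[str]:
    """Return the names of template fields that are still blank."""
    missing = []
    for f in fields(plan):
        # Migration details are only required when a migration is planned.
        if f.name == "migration_details" and plan.data_migration != "Yes":
            continue
        if not getattr(plan, f.name).strip():
            missing.append(f.name)
    return missing
```

A plan that declares `data_migration="Yes"` but leaves the query details empty is reported as incomplete, which matches the intent of the template: the migration's impact should be written down before release day.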

Data Modification

  • Treat production data changes with extreme caution. Require query reviews before anything runs against a production relational database.
  • Maintain a record of executed queries along with their results. While transaction logs and binary logs can provide this information retrospectively, having readily available query results simplifies root cause analysis in case of secondary issues.
  • Consider using a template like this:
    • Purpose and Summary of the Change
    • Planned Query
    • Pre-Execution Verification Query and Results
    • Post-Execution Verification Query and Results
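
The template above maps naturally onto a small wrapper around the change itself. The following is a sketch using Python's built-in `sqlite3` module as a stand-in for the production database. The `apply_reviewed_change` function and the `expected_rows` guard are assumptions added for illustration: the wrapper captures the verification query's results before and after the change, and rolls back if the change touches a different number of rows than the review anticipated.

```python
import sqlite3
from typing import Any, Dict


def apply_reviewed_change(
    conn: sqlite3.Connection,
    verify_sql: str,
    change_sql: str,
    expected_rows: int,
) -> Dict[str, Any]:
    """Run a reviewed change and record verification results before and after."""
    record: Dict[str, Any] = {"planned_query": change_sql}
    # Pre-execution verification query and results.
    record["pre"] = conn.execute(verify_sql).fetchall()
    # sqlite3 opens a transaction implicitly before the UPDATE/DELETE/INSERT.
    cursor = conn.execute(change_sql)
    record["rowcount"] = cursor.rowcount
    if cursor.rowcount != expected_rows:
        # The change affected an unexpected number of rows: undo it.
        conn.rollback()
        record["status"] = "rolled back"
    else:
        conn.commit()
        record["status"] = "committed"
    # Post-execution verification query and results.
    record["post"] = conn.execute(verify_sql).fetchall()
    return record
```

The returned record holds the planned query, both sets of verification results, and the final status, so it can be attached directly to the change request. Other databases handle transactions differently, so the commit and rollback logic would need adapting to the driver in use.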

Conclusion

This post explored strategies for managing manual operations in the early stages of a web service’s lifecycle, before full automation is achieved. The core principles to embrace are: 1. the Principle of Least Astonishment, which rests on shared understanding within the team, 2. comprehensive documentation, and 3. meaningful logs.

In the future, AI agents might handle these manual tasks. The principles discussed here can serve as guidelines for designing and training these agents, ensuring they operate reliably and predictably.