Operational Best Practices

Operational best practices help organizations operate Big Picture safely and reliably at scale. This section covers security, monitoring, backup and recovery, incident response, compliance, performance, documentation, and change management.

Security Practices

Authentication and Authorization

Use strong authentication: Require multi-factor authentication (MFA) for administrative accounts. Use service accounts with limited permissions for automated operations.

Principle of least privilege: Assign only the permissions necessary for each user’s responsibilities. Regularly review role assignments to ensure they remain appropriate.

Separate duties: Separate release creation from approval to prevent unauthorized releases. Use different approvers for different aspects (security, product, operations).
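The separation-of-duties rule above can be checked mechanically before a release is published. The following is a minimal sketch, not Big Picture's API: the `Release` type, the `REQUIRED_ASPECTS` tuple, and the `approval_problems` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical approval aspects, taken from the examples in the text above.
REQUIRED_ASPECTS = ("security", "product", "operations")


@dataclass
class Release:
    """Illustrative release record; not a Big Picture data type."""
    version: str
    created_by: str
    approvals: dict = field(default_factory=dict)  # aspect -> approver name


def approval_problems(release: Release) -> list:
    """Return human-readable separation-of-duties violations (empty if none)."""
    problems = []
    for aspect in REQUIRED_ASPECTS:
        approver = release.approvals.get(aspect)
        if approver is None:
            problems.append(f"missing {aspect} approval")
        elif approver == release.created_by:
            problems.append(f"{aspect} approved by release creator")
    # Each aspect should be signed off by a different person.
    approvers = [release.approvals[a] for a in REQUIRED_ASPECTS if a in release.approvals]
    if len(set(approvers)) != len(approvers):
        problems.append("same approver used for multiple aspects")
    return problems
```

A release passes only when the returned list is empty, which makes the check easy to wire into a pre-publish gate.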

Rotate credentials: Regularly rotate API keys and service account credentials. Use credential management systems to automate rotation.
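Where a credential management system is not yet in place, a scheduled check can at least flag overdue credentials. This sketch assumes a 90-day maximum age, which is an illustrative policy rather than a Big Picture default; the function name and credential names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy for illustration only: rotate anything 90 days old or older.
MAX_CREDENTIAL_AGE = timedelta(days=90)


def credentials_due_for_rotation(issued_at, now, max_age=MAX_CREDENTIAL_AGE):
    """Return the sorted names of credentials whose age has reached max_age.

    issued_at maps a credential name to the datetime it was issued.
    now is passed in explicitly so the check is easy to test.
    """
    return sorted(name for name, issued in issued_at.items() if now - issued >= max_age)
```

In a scheduled job, `now` would typically be `datetime.now(timezone.utc)`.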

Key Management

Protect signing keys: Store signing keys in hardware security modules (HSM) or managed key services. Never store private keys in code or configuration files.

Rotate keys regularly: Establish a key rotation schedule. Maintain old keys during grace periods to support clients that haven’t updated.
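The grace-period idea can be expressed as two separate questions: which key may sign new releases, and which keys clients should still accept when verifying. The sketch below assumes a 30-day grace period and uses hypothetical names (`SigningKey`, `usable_for_signing`, `accepted_for_verification`); it does not describe how Big Picture stores or evaluates keys.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed grace period for illustration; choose one that matches client update cadence.
GRACE_PERIOD = timedelta(days=30)


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    retired_at: Optional[datetime]  # None means the key is still current


def usable_for_signing(key: SigningKey) -> bool:
    """Only a current (non-retired) key should sign new releases."""
    return key.retired_at is None


def accepted_for_verification(key: SigningKey, now: datetime,
                              grace: timedelta = GRACE_PERIOD) -> bool:
    """Current keys are always accepted; retired keys only within the grace period."""
    if key.retired_at is None:
        return True
    return now < key.retired_at + grace
```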

Monitor key usage: Monitor signing key usage for anomalies that might indicate compromise.

Back up keys securely: Back up signing keys securely. Ensure backups are encrypted and stored separately from operational systems.

Network Security

Use TLS: Always use TLS for API communications. Enforce TLS 1.2 or higher.
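For clients written in Python, the TLS floor can be enforced with the standard library `ssl` module. This is a sketch of a client-side context only; the function name is illustrative, and server-side enforcement depends on how Big Picture is deployed.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The resulting context can be passed to standard library clients, for example `urllib.request.urlopen(url, context=make_client_context())`.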

Restrict network access: Use network policies to restrict access to Big Picture services. Allow only necessary inbound connections.

Monitor network traffic: Monitor network traffic for anomalies. Alert on unusual access patterns or failed authentication attempts.

Use a VPN or private network: For self-hosted deployments, place Big Picture services behind a VPN or on a private network to restrict access.

Monitoring and Observability

Health Monitoring

Monitor service health: Set up health checks for all Big Picture services. Alert on service failures or degraded performance.

Track SLIs and SLOs: Monitor service level indicators (SLIs) against their service level objectives (SLOs). Alert when an SLI approaches its SLO threshold.
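One common way to decide when an SLI is "approaching" its SLO is to track the remaining error budget. The sketch below assumes an availability-style SLI (successful requests over total requests) and a warning threshold of 25% of budget remaining; both the threshold and the function names are illustrative assumptions.

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no failures yet, 0.0 means the budget is exactly used up,
    and negative values mean the SLO has been violated for this window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_alert(remaining: float, warn_below: float = 0.25) -> bool:
    """Alert once less than warn_below of the budget remains (assumed threshold)."""
    return remaining < warn_below
```

For example, with a 99.9% target and 100,000 requests in the window, roughly 100 failures are allowed; 50 failures leave about half the budget.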

Monitor database: Monitor database health, connection pools, and query performance. Alert on database failures or performance degradation.

Monitor storage: Monitor artifact storage availability and performance. Alert on storage failures or capacity issues.

Logging

Centralize logs: Aggregate logs from all Big Picture services into a centralized logging system. Use structured logging for easier analysis.
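Structured logging usually means emitting one machine-parseable record per event. As a sketch for Python-based tooling around Big Picture, the formatter below writes each log record as a JSON object; the field names and the `context` attribute are assumptions for illustration, not a Big Picture log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and log with, for example, `logger.info("release approved", extra={"context": {"tenant": "acme"}})`.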

Retain logs appropriately: Retain logs according to compliance requirements. Ensure audit logs are retained for required periods.

Monitor log gaps: Alert on audit log gaps or failures to ensure continuous logging.
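A simple gap check compares consecutive audit log timestamps against the longest silence considered normal. The sketch below assumes the timestamps have already been exported from the logging system; the function name and the choice of threshold are illustrative.

```python
from datetime import datetime, timedelta


def find_audit_log_gaps(timestamps, max_silence):
    """Return (start, end) pairs where consecutive entries are more than max_silence apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_silence
    ]
```

This only detects gaps between existing entries. Detecting that logging has stopped entirely also requires comparing the newest entry against the current time.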

Review logs regularly: Periodically review logs to identify anomalies or unauthorized access.

Metrics and Alerting

Collect metrics: Collect metrics for API latency, error rates, license lease success rates, and resource usage.

Set up alerting: Configure alerts for critical issues:

Service failures
High error rates
Database issues
Storage failures
Audit log gaps
Unusual access patterns

Test alerts: Regularly test alerting to ensure notifications are received and actionable.

Review alert noise: Periodically review alerts to reduce noise and improve signal.

Backup and Recovery

Backup Strategy

Back up the database regularly: Perform database backups on a regular schedule. Test backup restoration procedures.

Back up artifacts: Back up artifact storage separately from database backups. Ensure artifacts are recoverable.

Back up configuration: Back up configuration files, policies, and role assignments. Store backups securely.

Back up signing keys: Back up signing keys securely. Ensure backups are encrypted and stored separately from operational systems.

Recovery Procedures

Document recovery procedures: Document step-by-step recovery procedures for common failure scenarios.

Test recovery regularly: Regularly test backup restoration to ensure procedures work correctly.
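One concrete restoration test is to confirm that a restored file is byte-for-byte identical to what was backed up. The sketch below assumes a SHA-256 digest was recorded when the backup was taken; the function names are illustrative, and a full restore test would also exercise the restored database or service.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_is_intact(restored: Path, expected_sha256: str) -> bool:
    """Compare a restored file against the digest recorded at backup time."""
    return sha256_of(restored) == expected_sha256
```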

Maintain recovery runbooks: Keep recovery runbooks up to date. Review and update procedures after incidents.

Plan for disasters: Develop disaster recovery plans. Test disaster recovery procedures regularly.

Incident Response

Preparation

Define incident response procedures: Document procedures for identifying, responding to, and resolving incidents.

Establish on-call rotation: Set up on-call rotation for critical issues. Ensure on-call engineers have necessary access and documentation.

Maintain incident runbooks: Keep incident runbooks up to date. Include common scenarios and resolution steps.

Practice incident response: Regularly practice incident response through drills or tabletop exercises.

Response

Identify incidents quickly: Monitor for incidents and alert on-call engineers promptly.

Assess impact: Quickly assess the impact of incidents. Determine affected tenants, products, or users.

Contain incidents: Take steps to contain incidents and prevent further damage.

Communicate clearly: Communicate incident status to stakeholders. Provide regular updates during incidents.

Document incidents: Document incidents, root causes, and resolution steps. Conduct post-mortems for significant incidents.

Compliance

Audit Readiness

Maintain audit logs: Ensure audit logging is enabled and functioning. Monitor for audit log gaps.

Review audit logs: Periodically review audit logs to identify anomalies or unauthorized access.

Export logs regularly: Export audit logs before retention expiration if long-term storage is required.

Document procedures: Document audit log access and export procedures for auditors.

Policy Compliance

Review policies regularly: Periodically review update policies to ensure they remain appropriate.

Monitor compliance: Generate compliance reports regularly to monitor adherence to policies.

Document exceptions: Document any policy exceptions and their rationale.

Update policies: Update policies as requirements change. Document policy changes in audit logs.

License Compliance

Track license usage: Monitor license usage to ensure compliance with entitlements.

Generate usage reports: Generate license usage reports regularly for vendor audits.

Review entitlements: Periodically review license entitlements to ensure they match purchases.

Document usage: Document license usage patterns and trends for capacity planning.

Performance and Scalability

Performance Optimization

Monitor performance: Track API latency, database query performance, and storage performance.

Optimize queries: Optimize database queries to reduce latency. Use indexes appropriately.

Cache appropriately: Use caching to reduce database load. Cache policy evaluations and release metadata.
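For data that changes rarely, such as policy evaluations or release metadata, a time-to-live (TTL) cache is often enough. The following is a minimal in-process sketch with hypothetical names; it is not thread-safe, and the right TTL depends on how quickly policy changes must take effect.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds, recomputing after expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or after expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A short TTL bounds how long a stale policy decision can be served, which is the main trade-off when caching authorization-related results.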

Scale horizontally: Scale services horizontally to handle increased load.

Capacity Planning

Monitor resource usage: Track resource usage (CPU, memory, storage, network) over time.

Plan for growth: Plan capacity increases based on usage trends.
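A usage trend can be turned into a rough planning number by fitting a straight line to recent samples. The sketch below uses ordinary least squares over one sample per day and assumes growth stays linear, which is a simplification; the function name is illustrative.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate days until usage reaches capacity from a linear trend.

    daily_usage holds one sample per day, oldest first. Returns None when
    there are fewer than two samples or usage is flat or shrinking.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # growth per day
    if slope <= 0:
        return None
    return max(0.0, (capacity - daily_usage[-1]) / slope)
```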

Set quotas: Set quotas for tenants to prevent resource exhaustion.

Monitor quotas: Alert when tenants approach quota limits.
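A quota alert reduces to comparing each tenant's usage with its limit. The sketch below assumes usage and quotas are available as simple mappings and uses an 80% warning threshold; the names and the threshold are illustrative, not Big Picture settings.

```python
def tenants_near_quota(usage, quotas, threshold=0.8):
    """Return sorted tenant names whose usage is at or above threshold of their quota.

    usage and quotas map a tenant name to a number in the same unit.
    Tenants with a non-positive quota are skipped rather than divided by zero.
    """
    near = []
    for tenant, quota in quotas.items():
        if quota <= 0:
            continue
        if usage.get(tenant, 0) / quota >= threshold:
            near.append(tenant)
    return sorted(near)
```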

Documentation

Operational Documentation

Document procedures: Document operational procedures, including deployment, configuration, and troubleshooting.

Keep documentation current: Regularly update documentation to reflect current procedures and configurations.

Document changes: Document configuration changes and their rationale.

Share knowledge: Share operational knowledge through documentation and runbooks.

Runbooks

Maintain runbooks: Keep runbooks for common operational tasks up to date.

Include troubleshooting: Include troubleshooting steps and common issues in runbooks.

Test runbooks: Regularly test runbooks to ensure they work correctly.

Review runbooks: Periodically review runbooks and update based on lessons learned.

Change Management

Configuration Changes

Review changes: Review configuration changes before applying them.

Test changes: Test changes in non-production environments before applying to production.

Document changes: Document configuration changes and their rationale in audit logs.

Plan rollbacks: Always have a tested rollback plan before applying configuration changes.

Release Management

Follow approval workflows: Require approval for releases according to configured workflows.

Test releases: Test releases in staging or beta channels before approving for production.

Monitor releases: Monitor release distribution and client adoption.

Document releases: Document release notes, known issues, and breaking changes.

Related Topics

Role-Based Access Control: Configure access control
Approval Workflows: Configure release approvals
Audit Readiness: Prepare for audits
Compliance Reporting: Generate compliance reports