Operational Best Practices
Operational best practices help organizations operate Big Picture safely and reliably at scale. These practices cover security, monitoring, backup, incident response, and compliance.
Security Practices
Authentication and Authorization
Use strong authentication: Require multi-factor authentication (MFA) for administrative accounts. Use service accounts with limited permissions for automated operations.
Principle of least privilege: Assign only the permissions necessary for each user’s responsibilities. Regularly review role assignments to ensure they remain appropriate.
Separate duties: Separate release creation from approval to prevent unauthorized releases. Use different approvers for different aspects (security, product, operations).
Rotate credentials: Regularly rotate API keys and service account credentials. Use credential management systems to automate rotation.
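The separation-of-duties practice above can be checked mechanically before a release ships. The following is a minimal sketch, assuming a hypothetical `Release` record and three approval aspects taken from the text; none of these names come from the Big Picture API.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """Hypothetical release record; field names are illustrative only."""
    creator: str
    # Maps an approval aspect (for example "security") to the approving user.
    approvals: dict[str, str] = field(default_factory=dict)


# Assumed aspects, mirroring the examples in the text.
REQUIRED_ASPECTS = ("security", "product", "operations")


def separation_of_duties_violations(release: Release) -> list[str]:
    """Return human-readable violations; an empty list means the release passes."""
    violations = []
    for aspect in REQUIRED_ASPECTS:
        approver = release.approvals.get(aspect)
        if approver is None:
            violations.append(f"missing {aspect} approval")
        elif approver == release.creator:
            violations.append(f"{aspect} approval was given by the release creator")
    approvers = [release.approvals[a] for a in REQUIRED_ASPECTS if a in release.approvals]
    if len(set(approvers)) < len(approvers):
        violations.append("one person approved more than one aspect")
    return violations
```

A check like this could run as a gate in the release pipeline, blocking publication while the returned list is non-empty.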
Key Management
Protect signing keys: Store signing keys in hardware security modules (HSM) or managed key services. Never store private keys in code or configuration files.
Rotate keys regularly: Establish a key rotation schedule. Maintain old keys during grace periods to support clients that haven’t updated.
Monitor key usage: Monitor signing key usage for anomalies that might indicate compromise.
Back up keys securely: Back up signing keys, and ensure the backups are encrypted and stored separately from operational systems.
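Rotation with a grace period can be expressed as two small checks: whether the active key is due for replacement, and whether a retired key should still be accepted for verification. This is a minimal sketch; the `SigningKey` record, the 90-day schedule, and the 30-day grace window are assumptions chosen for illustration, not Big Picture defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SigningKey:
    """Hypothetical key metadata; the private key itself stays in the HSM or key service."""
    key_id: str
    created_at: datetime
    retired_at: Optional[datetime] = None  # set when a newer key replaces this one


ROTATION_PERIOD = timedelta(days=90)  # assumed rotation schedule
GRACE_PERIOD = timedelta(days=30)     # assumed grace window for older clients


def needs_rotation(active: SigningKey, now: datetime) -> bool:
    """True once the active key has reached the rotation period."""
    return now - active.created_at >= ROTATION_PERIOD


def accepted_for_verification(key: SigningKey, now: datetime) -> bool:
    """Active keys are always accepted; retired keys only within the grace period."""
    if key.retired_at is None:
        return True
    return now - key.retired_at <= GRACE_PERIOD
```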
Network Security
Use TLS: Always use TLS for API communications. Enforce TLS 1.2 or higher.
Restrict network access: Use network policies to restrict access to Big Picture services. Allow only necessary inbound connections.
Monitor network traffic: Monitor network traffic for anomalies. Alert on unusual access patterns or failed authentication attempts.
Use VPN or private networks: For self-hosted deployments, use VPN or private networks to restrict access.
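On the client side, the TLS 1.2 minimum can be enforced with Python's standard `ssl` module. This is a sketch of one way a script that calls the API might do it; the helper name `make_client_context` is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

The resulting context can be passed to standard-library clients, for example through the `context` argument of `urllib.request.urlopen`.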
Monitoring and Observability
Health Monitoring
Monitor service health: Set up health checks for all Big Picture services. Alert on service failures or degraded performance.
Track SLIs and SLOs: Monitor service level indicators (SLIs) against service level objectives (SLOs). Alert when an SLI approaches its SLO threshold.
Monitor database: Monitor database health, connection pools, and query performance. Alert on database failures or performance degradation.
Monitor storage: Monitor artifact storage availability and performance. Alert on storage failures or capacity issues.
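One common way to tell that an SLI is approaching its SLO is to track the remaining error budget. The sketch below assumes an availability-style SLI measured as failed requests over total requests; the function names and the 25% warning level are illustrative choices.

```python
def error_budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = SLO breached)."""
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure is a breach.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def should_alert(remaining: float, warn_below: float = 0.25) -> bool:
    """Alert once less than `warn_below` of the budget is left (assumed threshold)."""
    return remaining < warn_below
```

With a 99.9% target over 100,000 requests, about 100 failures are allowed, so 50 failures leave roughly half the budget.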
Logging
Centralize logs: Aggregate logs from all Big Picture services into a centralized logging system. Use structured logging for easier analysis.
Retain logs appropriately: Retain logs according to compliance requirements. Ensure audit logs are retained for required periods.
Monitor log gaps: Alert on audit log gaps or failures to ensure continuous logging.
Review logs regularly: Periodically review logs to identify anomalies or unauthorized access.
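Log gap detection can be as simple as comparing consecutive entry timestamps. This sketch assumes each service writes at least one record (such as a heartbeat) within a known interval, so a longer silence suggests lost logs; `find_log_gaps` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def find_log_gaps(timestamps: list[datetime], max_gap: timedelta) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs where consecutive entries are further apart than max_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
```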
Metrics and Alerting
Collect metrics: Collect metrics for API latency, error rates, license lease success rates, and resource usage.
Set up alerting: Configure alerts for critical issues:
- Service failures
- High error rates
- Database issues
- Storage failures
- Audit log gaps
- Unusual access patterns
Test alerts: Regularly test alerting to ensure notifications are received and actionable.
Review alert noise: Periodically review alerts to reduce noise and improve signal.
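A high-error-rate alert is usually evaluated over a sliding window rather than on single failures, which also helps keep noise down. The class below is a minimal sketch with assumed defaults (window size, 5% threshold, minimum sample count); in practice a monitoring system would typically evaluate this rule.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed: bool) -> bool:
        """Record one request outcome and return True if the alert should fire."""
        self.outcomes.append(failed)
        if len(self.outcomes) < self.min_samples:
            return False  # too few samples to judge; avoids noisy alerts at startup
        return sum(self.outcomes) / len(self.outcomes) > self.threshold
```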
Backup and Recovery
Backup Strategy
Back up the database regularly: Perform regular database backups. Test backup restoration procedures.
Back up artifacts: Back up artifact storage separately from database backups. Ensure artifacts are recoverable.
Back up configuration: Back up configuration files, policies, and role assignments. Store backups securely.
Back up signing keys: Back up signing keys securely. Ensure backups are encrypted and stored separately from operational systems.
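Recoverability is easier to verify when each backup carries a checksum manifest. The following sketch records a SHA-256 digest per file at backup time and reports files that are missing or altered later; the directory layout and function names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to backup_dir) to its SHA-256 digest."""
    return {
        str(p.relative_to(backup_dir)): sha256_of(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose contents changed."""
    current = build_manifest(backup_dir)
    return [name for name, digest in manifest.items() if current.get(name) != digest]
```

A checksum match shows the backup is intact, not that it restores cleanly, so this complements restoration tests rather than replacing them.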
Recovery Procedures
Document recovery procedures: Document step-by-step recovery procedures for common failure scenarios.
Test recovery regularly: Regularly test backup restoration to ensure procedures work correctly.
Maintain recovery runbooks: Keep recovery runbooks up to date. Review and update procedures after incidents.
Plan for disasters: Develop disaster recovery plans. Test disaster recovery procedures regularly.
Incident Response
Preparation
Define incident response procedures: Document procedures for identifying, responding to, and resolving incidents.
Establish on-call rotation: Set up on-call rotation for critical issues. Ensure on-call engineers have necessary access and documentation.
Maintain incident runbooks: Keep incident runbooks up to date. Include common scenarios and resolution steps.
Practice incident response: Regularly practice incident response through drills or tabletop exercises.
Response
Identify incidents quickly: Monitor for incidents and alert on-call engineers promptly.
Assess impact: Quickly assess the impact of incidents. Determine affected tenants, products, or users.
Contain incidents: Take steps to contain incidents and prevent further damage.
Communicate clearly: Communicate incident status to stakeholders. Provide regular updates during incidents.
Document incidents: Document incidents, root causes, and resolution steps. Conduct post-mortems for significant incidents.
Compliance
Audit Readiness
Maintain audit logs: Ensure audit logging is enabled and functioning. Monitor for audit log gaps.
Review audit logs: Periodically review audit logs to identify anomalies or unauthorized access.
Export logs regularly: Export audit logs before retention expiration if long-term storage is required.
Document procedures: Document audit log access and export procedures for auditors.
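Exporting before retention expiry amounts to finding log days that have not been exported and will soon age out. This sketch assumes logs are partitioned by day and uses a 90-day retention period and 7-day lead time purely as placeholders; the actual retention period depends on the deployment.

```python
from datetime import date, timedelta


def days_due_for_export(
    log_days: list[date],
    exported: set[date],
    today: date,
    retention: timedelta = timedelta(days=90),  # assumed retention period
    lead_time: timedelta = timedelta(days=7),   # assumed safety margin
) -> list[date]:
    """Return unexported log days that will expire within `lead_time` of today."""
    cutoff = today + lead_time
    return sorted(d for d in log_days if d not in exported and d + retention <= cutoff)
```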
Policy Compliance
Review policies regularly: Periodically review update policies to ensure they remain appropriate.
Monitor compliance: Generate compliance reports regularly to monitor adherence to policies.
Document exceptions: Document any policy exceptions and their rationale.
Update policies: Update policies as requirements change. Document policy changes in audit logs.
License Compliance
Track license usage: Monitor license usage to ensure compliance with entitlements.
Generate usage reports: Generate license usage reports regularly for vendor audits.
Review entitlements: Periodically review license entitlements to ensure they match purchases.
Document usage: Document license usage patterns and trends for capacity planning.
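A basic license usage report compares active leases with entitlements per product. The sketch below assumes both are available as simple product-to-count mappings; the input shapes and the function name are hypothetical, not a Big Picture report format.

```python
def license_compliance_report(
    entitlements: dict[str, int], active_leases: dict[str, int]
) -> dict[str, dict]:
    """Compare used seats with entitled seats for every product seen in either input."""
    report = {}
    for product in sorted(set(entitlements) | set(active_leases)):
        entitled = entitlements.get(product, 0)
        used = active_leases.get(product, 0)
        report[product] = {
            "entitled": entitled,
            "used": used,
            "compliant": used <= entitled,
        }
    return report
```

Products that appear only in `active_leases` are reported with zero entitled seats, which surfaces usage that has no matching purchase.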
Performance and Scalability
Performance Optimization
Monitor performance: Track API latency, database query performance, and storage performance.
Optimize queries: Optimize database queries to reduce latency. Use indexes appropriately.
Cache appropriately: Use caching to reduce database load. Cache policy evaluations and release metadata.
Scale horizontally: Scale services horizontally to handle increased load.
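Caching policy evaluations or release metadata usually needs an expiry so that changes become visible within a bounded time. The following is a minimal in-process time-to-live cache, offered as a sketch; the class name, key format, and TTL value are assumptions, and a shared cache may suit horizontally scaled services better.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Small time-to-live cache; entries are recomputed once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() only on a miss or expiry."""
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._entries[key] = (now, value)
        return value
```

A shorter TTL reduces how long a stale policy decision can be served, at the cost of more database reads.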
Capacity Planning
Monitor resource usage: Track resource usage (CPU, memory, storage, network) over time.
Plan for growth: Plan capacity increases based on usage trends.
Set quotas: Set quotas for tenants to prevent resource exhaustion.
Monitor quotas: Alert when tenants approach quota limits.
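Quota alerts can be driven by a usage-to-quota ratio per tenant. This sketch assumes usage and quotas are expressed in the same unit and warns at 80% by default; both the threshold and the function name are illustrative.

```python
def tenants_near_quota(
    usage: dict[str, float], quotas: dict[str, float], warn_ratio: float = 0.8
) -> list[tuple[str, float]]:
    """Return (tenant, usage ratio) pairs at or above warn_ratio, highest ratio first."""
    near = []
    for tenant, quota in quotas.items():
        if quota <= 0:
            continue  # no meaningful quota configured for this tenant
        ratio = usage.get(tenant, 0.0) / quota
        if ratio >= warn_ratio:
            near.append((tenant, ratio))
    return sorted(near, key=lambda item: item[1], reverse=True)
```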
Documentation
Operational Documentation
Document procedures: Document operational procedures, including deployment, configuration, and troubleshooting.
Keep documentation current: Regularly update documentation to reflect current procedures and configurations.
Document changes: Document configuration changes and their rationale.
Share knowledge: Share operational knowledge through documentation and runbooks.
Runbooks
Maintain runbooks: Keep runbooks for common operational tasks up to date.
Include troubleshooting: Include troubleshooting steps and common issues in runbooks.
Test runbooks: Regularly test runbooks to ensure they work correctly.
Review runbooks: Periodically review runbooks and update based on lessons learned.
Change Management
Configuration Changes
Review changes: Review configuration changes before applying them.
Test changes: Test changes in non-production environments before applying to production.
Document changes: Document configuration changes and their rationale in audit logs.
Prepare a rollback plan: Have a rollback plan ready before applying any configuration change.
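For configuration held as in-memory data, a rollback can be as simple as snapshotting before the change and restoring if validation fails. The helper below is a sketch under that assumption; `apply_with_rollback` and the example keys are hypothetical, and real deployments would more likely roll back through version control or a deployment tool.

```python
import copy
from typing import Callable


def apply_with_rollback(
    config: dict, change: Callable[[dict], None], validate: Callable[[dict], bool]
) -> bool:
    """Apply `change` in place; restore the previous state if it raises or fails validation."""
    snapshot = copy.deepcopy(config)
    try:
        change(config)
        if validate(config):
            return True
    except Exception:
        pass  # fall through to rollback; a real system would also log the error
    config.clear()
    config.update(snapshot)
    return False
```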
Release Management
Follow approval workflows: Require approval for releases according to configured workflows.
Test releases: Test releases in staging or beta channels before approving for production.
Monitor releases: Monitor release distribution and client adoption.
Document releases: Document release notes, known issues, and breaking changes.
Related Documentation
- Role-Based Access Control: Configure access control
- Approval Workflows: Configure release approvals
- Audit Readiness: Prepare for audits
- Compliance Reporting: Generate compliance reports