Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but much of it actually runs in the backend: remote control, path finding, matching robots to customers, fleet health management, and interactions with customers and merchants. All of this needs to run 24×7 without interruptions and scale dynamically to match the workload.
SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDb is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging, Kafka is our platform of choice, and we use it for pretty much everything aside from shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CICD is handled by Jenkins.
A good portion of SRE time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption policies or optimizing Spot instance usage. Sometimes it is like laying bricks: simply installing a Helm chart to provide a particular piece of functionality. But often the “bricks” must be carefully picked and evaluated (is Loki good for log management, is a service mesh worth having and, if so, which one), and occasionally the functionality does not exist anywhere and has to be written from scratch. When this happens we usually turn to Python and Golang, but also Rust and C when needed.
Another big piece of infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDb, a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousand. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the current database infrastructure. Examples: adding MongoDb observability with a custom sidecar proxy that analyzes database traffic, enabling PITR support for databases, automating regular failover and recovery tests, collecting metrics for Kafka re-sharding, and enabling data retention.
Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. Although SRE is occasionally called out to deal with infrastructure outages, the more impactful work goes into preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!
A day in the life of an SRE
Arrive at work some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.
Find that MongoDb connection latencies spiked during the night. Digging into the Prometheus metrics with Grafana, find that this happens while backups are running. Why is this suddenly a problem, when we have run those backups for ages? It turns out that we compress the backups very aggressively to save on network and storage costs, and this consumes all of the available CPU. It looks like the load on the database has grown just enough to make this noticeable. This is happening on a standby node and does not affect production, but it would still be a problem should the primary fail. Add a Jira item to fix this.
In passing, change the MongoDb prober code (Golang) to add more histogram buckets, to get a better understanding of the latency distribution. Run a Jenkins pipeline to put the new prober into production.
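To illustrate why extra buckets help, here is a minimal sketch of a cumulative-bucket histogram in the style Prometheus uses. It is written in plain Python rather than the prober's Go, and the bucket bounds and latency samples are made up for illustration. With only two coarse buckets the median can be placed no more precisely than "at most 0.1 s", while finer buckets narrow it down considerably.

```python
from bisect import bisect_left


class Histogram:
    """Cumulative-bucket histogram, similar in spirit to a Prometheus histogram."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)
        # One counter per finite bucket plus a final +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value,
        # i.e. the smallest bucket whose upper bound covers the value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out

    def quantile_upper_bound(self, q):
        """Smallest bucket bound whose cumulative count covers quantile q."""
        target = q * self.total
        for bound, running in self.cumulative():
            if running >= target:
                return bound
        return float("inf")


# Hypothetical connection latencies in seconds (not real measurements).
latencies = [0.003, 0.004, 0.006, 0.008, 0.012, 0.02, 0.03, 0.04, 0.3, 0.7]

coarse = Histogram([0.1, 1.0])
fine = Histogram([0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
for sample in latencies:
    coarse.observe(sample)
    fine.observe(sample)

print("coarse median bound:", coarse.quantile_upper_bound(0.5))
print("fine median bound:  ", fine.quantile_upper_bound(0.5))
```

The trade-off is that every additional bucket is another time series to store, so bounds are best concentrated where the interesting latencies actually fall.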
At 10 a.m. there is a standup meeting. Share your updates with the team and learn what others have been up to: setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDb connectivity issues, piloting canary deployments with Flagger.
After the meeting, resume the planned work for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We run Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available now? No, not going there: too much magic, and I want more explicit control over my StatefulSets. Raw YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a config change. Generating the credentials for the applications required a small bash script to set up the accounts on Zookeeper. One bit that was left dangling was setting up Kafka Connect to capture database change log events: it turns out that the test databases are not running in ReplicaSet mode, so Debezium cannot get the oplog from them. Backlog this and move on.
Now it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of our systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I set up a load test to overload the microservice for route calculations. Deploy this as a Kubernetes job called “haymaker” and hide it well enough that it does not immediately show up in the Linkerd service mesh (yes, evil 😈). Later, run the “Wheel” exercise and take note of any gaps we have in playbooks, metrics, alerts and so on.
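For a rough idea of what such a load generator does, here is a minimal sketch using only the Python standard library. It is not the tool used for the actual exercise; the function names and the target URL in the usage note are hypothetical. It fires a fixed number of requests from a pool of worker threads and summarizes successes, failures and the worst latency.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=5.0):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def run_load(fetch, total_requests, concurrency):
    """Call fetch() total_requests times from `concurrency` worker threads."""

    def one_call(_):
        start = time.perf_counter()
        try:
            status = fetch()
        except Exception:
            # Connection errors, timeouts and HTTP error responses
            # (urlopen raises on 4xx/5xx) all count as failures.
            status = None
        return status, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    ok = sum(1 for status, _ in results if status is not None and 200 <= status < 300)
    return {
        "requests": len(results),
        "ok": ok,
        "failed": len(results) - ok,
        "max_latency_s": latencies[-1] if latencies else 0.0,
    }
```

A call such as `run_load(lambda: http_get("http://route-calculator.test.example/route"), total_requests=2000, concurrency=50)` would then hammer the (hypothetical) endpoint. Passing the request function in as an argument keeps the load loop easy to try out against a stub before pointing it at a real service.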
In the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as a streaming asynchronous one (Rust + Tokio) and want to find out how well it works with real data. It turns out there is a bug somewhere in the parser guts, and I need to add deep logging to track it down. Find a wonderful tracing library for Tokio and get carried away with it …
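The core difficulty in a streaming parser is that network reads do not line up with document boundaries. The sketch below shows the framing step only, in Python rather than Rust and without any of Mongoproxy's actual code. It relies on the fact that a BSON document starts with a little-endian int32 holding its total length and ends with a NUL byte. The class name, the helper `int32_doc` and the 16 MiB sanity limit are assumptions made for illustration.

```python
import struct


class BsonFramer:
    """Splits an incoming byte stream into complete BSON documents."""

    MAX_DOC = 16 * 1024 * 1024  # assumed sanity limit, matching MongoDB's document size cap

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Buffer a chunk and return every document that is now complete."""
        self._buf += chunk
        docs = []
        while len(self._buf) >= 4:
            # The first four bytes hold the document length, including themselves.
            (length,) = struct.unpack_from("<i", self._buf, 0)
            if length < 5 or length > self.MAX_DOC:
                raise ValueError(f"implausible BSON length {length}")
            if len(self._buf) < length:
                break  # wait for more bytes
            doc = bytes(self._buf[:length])
            if doc[-1] != 0:
                raise ValueError("BSON document is missing its trailing NUL")
            docs.append(doc)
            del self._buf[:length]
        return docs


def int32_doc(key, value):
    """Encode {key: value} with an int32 value; for demonstration only."""
    body = b"\x10" + key.encode() + b"\x00" + struct.pack("<i", value)
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


# Feed two documents in awkward 3-byte chunks, as a socket might deliver them.
stream = int32_doc("a", 1) + int32_doc("bc", 2)
framer = BsonFramer()
received = []
for i in range(0, len(stream), 3):
    received.extend(framer.feed(stream[i:i + 3]))
print(len(received), "documents reassembled")
```

An asynchronous implementation follows the same shape: the buffer persists between reads, and the parser yields whenever it needs more bytes than it currently has.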
Disclaimer: The events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with coworkers have been edited out. We are hiring.