A day at Voxxed Days Luxembourg 2024

The big boys have their Devoxxes and KubeCons; Luxembourg has the Voxxed Days Luxembourg conference which, as we like to think about a lot of events here in Luxembourg, is for sure smaller but maybe also a bit more refined… maybe.

Thursday, the 20th of June 2024, was the first day of Voxxed Days Luxembourg 2024 – the main local software conference, organized by the Java User Group of Luxembourg – and what follows are my notes from this day.

1.

Showing the importance of the space industry for the Luxembourg software community (or just confirming that space tech is cool), the conference was kick-started by Pierre Henriquet’s keynote “Les robots de l’espace” (space robots).

… It’s true that – to bring the audience back down to earth – software is mentioned twice as the thing that killed a robot in space: first on 13 November 1982, when the Viking 1 mission was ended by a faulty software update, and then in 1999, when a units mismatch (imperial vs metric) between the software pieces provided by different suppliers killed the Mars Climate Orbiter.

Still, the talk ends on a positive note for the viability of software in space: last week the Voyager 1 team managed to update the probe’s software to recover the data from all science instruments: an update deployed 24 billion kilometers away on hardware that has been traveling through space for 47 years.

2.

I stay in the main room where Henri Gomez presents “SRE, Mythes vs Réalités” (SRE, myths vs realities), a talk that (naturally) refers back heavily to Google’s “SRE book” – the book that kick-started the entire label.

For me, the talk confirms again that, for a lot of companies, SRE is either the old Ops with some new vocabulary or a label to put on everything they do.

My personal view on SRE is that (similar to the security “shift left” that brought us DevSecOps) it is about mindset, tools and techniques and not about a job position… unfortunately (like for DevSecOps), organisations find it easier to adopt the label as a role than to actually understand the mindset shift, new tools, processes and techniques required.

First mention of the day of the Accelerate book, the DORA Metrics and the 4 Software Delivery Performance metrics.

3.

Next, I went to a smaller conference room for the “How We Gained Observability Into Our CI/CD Pipeline” talk by Dotan Horovits.

This was the second mention of the DORA metrics of the day.

One of the main ideas of the talk was that CI/CD pipelines are part of the production environment and should be monitored, traced, and analysed for issues and performance just like production systems. This should not be a new idea: if what you produce is software, then your production pipeline is crucial; the reality, though, is that many companies still treat the CI/CD pipeline as something that is mainly the Dev team’s responsibility, with the freedoms and risks that this entails.

Dotan goes into some detail on how they instrumented their pipelines:

  1. Collect data about the CI/CD pipeline run into environment variables
    • create a summary step in the pipeline that gathers all the info from the run and stores it in Elasticsearch (see the sketch after this list)
  2. Visualise with Kibana/OpenSearch Dashboards:
    • define your measuring needs – what do you want to find/track?
    • where did the failure happen (branch/machine)?
    • what is the duration of each step?
    • are there steps that take more time than the baseline?
  3. Monitor the actual infrastructure (computers, CPUs, etc.) used to run your CI/CD pipelines using Telegraf and Prometheus. For more info check the guide.
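To make the “summary step” idea concrete, here is a minimal sketch (my own, not from the talk) of what such a step could push to Elasticsearch. Assumptions: a GitHub Actions-style runner exposing GITHUB_* environment variables, an Elasticsearch endpoint configured in an ES_URL variable, a made-up index name “ci-runs”, and hypothetical JOB_STATUS/JOB_DURATION variables that earlier pipeline steps would have to export.

    # summary_step.py – last step of the pipeline: collect run metadata and index it
    import json
    import os
    import time
    import urllib.request

    doc = {
        "@timestamp": int(time.time() * 1000),
        "pipeline":   os.environ.get("GITHUB_WORKFLOW"),
        "run_id":     os.environ.get("GITHUB_RUN_ID"),
        "branch":     os.environ.get("GITHUB_REF_NAME"),
        "commit":     os.environ.get("GITHUB_SHA"),
        "runner":     os.environ.get("RUNNER_NAME"),
        "status":     os.environ.get("JOB_STATUS", "unknown"),     # hypothetical, exported by earlier steps
        "duration_s": float(os.environ.get("JOB_DURATION", "0")),  # hypothetical, exported by earlier steps
    }

    req = urllib.request.Request(
        url=os.environ["ES_URL"].rstrip("/") + "/ci-runs/_doc",
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

Once such documents are in Elasticsearch, the Kibana/OpenSearch dashboards described above are just aggregations over these fields (failure rate per branch, duration per step vs baseline, and so on).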

The talk even includes a short but effective introduction to Distributed Tracing and OpenTelemetry, which are used to increase visibility at every layer (OpenTelemetry was another frequent theme in the talks at the conference – and Dotan is well positioned to talk about it as he runs the OpenObservability Talks podcast).

4.

I then checked out the “Démystifions le fonctionnement interne de Kubernetes” (demystifying the inner workings of Kubernetes) talk. It was actually a demo by Denis Germain of how you can deploy a Kubernetes cluster one control plane brick at a time.

As somebody who was thrown into using and working with Kubernetes without having time to thoroughly understand its internals I approached this session as a student and took the opportunity to revisit basic K8s concepts.

A note on the general structure of a K8s resource:

    apiVersion: <group>/v1          # e.g. apps/v1
    kind: Deployment                # Deployment/Service/Ingress/...
    metadata:
      name: <name>
      labels:
        key: value
    spec:
      replicas: x

A cool new (for me) tool for manipulating digital certificates: CFSSL. Also noted: Traefik was the preferred ingress controller.

The Kubernetes Control Plane components deployed during this short talk:

  • Kube API Server: the entry point into the infrastructure
  • Kube Controller Manager: the binary that runs the control loops for all the APIs that exist in K8s, reconciling the difference between current state and target state
  • Kube Scheduler: the planner – it watches for Pods that have not yet been assigned to a node and decides on which node each one should run, so that the target state can be reached with the available capacity
  • Kubelet: the agent on the node – the one that actually starts the workloads assigned to its node. One of the first things it does is register itself (and the node) with the Kube API Server -> containerd is the container runtime that actually launches the containers
  • CNI (Container Network Interface) plugin: the element that manages the communication between containers
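Not part of the demo (which assembled the control plane by hand), but a quick way to see two of these interactions – kubelet registration and scheduler placement – is to query the API server yourself. A minimal sketch, assuming a reachable cluster, a local kubeconfig and the official kubernetes Python client:

    # list the nodes that kubelets have registered, and the pods still waiting for the scheduler
    from kubernetes import client, config

    config.load_kube_config()        # use config.load_incluster_config() when running inside a pod
    v1 = client.CoreV1Api()          # every call below goes through the Kube API Server

    # each node listed here registered itself via its kubelet
    for node in v1.list_node().items:
        conditions = {c.type: c.status for c in node.status.conditions}
        print(node.metadata.name, "Ready=" + conditions.get("Ready", "Unknown"))

    # pods the scheduler has not yet placed have an empty spec.node_name
    pending = [p.metadata.name
               for p in v1.list_pod_for_all_namespaces().items
               if not p.spec.node_name]
    print("unscheduled pods:", pending)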

5.

Next: Réagir à temps aux menaces dans vos clusters Kubernetes avec Falco et son écosystème (reacting in time to threats in your Kubernetes clusters with Falco and its ecosystem), with Rachid Zarouali and Thomas Labarussias.


What I found:

  • Falco was created by Sysdig – and Sysdig was founded by Loris Degioanni. My first major professional achievement was porting a network stack (re-implementing a subset of the ARP, IP and UDP protocols) on top of WinPcap: using WinPcap both to “sniff” Ethernet packets and to transmit them with high throughput and precise timing (NDIS drivers… good times!). WinPcap was the base of Ethereal, which then became Wireshark, and that lineage now continues in Falco… I never expected to find echoes of my career beginnings here.
  • Falco is (now?) eBPF-based and supports older kernel versions (even pre-Kubernetes ones), which makes sense given a lineage that can be traced back to libpcap
  • Falco is a newly (2024) CNCF Graduated Project
  • Falco plugins are signed with cosign
  • falcosidekick – a companion daemon that forwards Falco events to additional outputs (adding extra functionality) – created by Thomas Labarussias.

Integrating the whole Falco ecosystem you get: detection (Falco) -> notification (falcosidekick) -> reaction (Falco Talon).

  • Falco Talon is a response engine (you could implement an alternative response engine with Argo CD, but Falco Talon has the potential for better integration): github.com/falco-talon/falco-talon
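To make the detection -> notification -> reaction chain a bit more tangible, here is a minimal sketch (my own, not from the talk) of a custom webhook that falcosidekick could forward alerts to. It assumes falcosidekick’s generic webhook output is pointed at this address and that the forwarded payload follows Falco’s standard JSON output format (fields like rule, priority, output); a real response engine such as Falco Talon would of course do much more than print.

    # a toy receiver for Falco alerts forwarded by falcosidekick
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class FalcoAlertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            event = json.loads(self.rfile.read(length) or b"{}")
            # decide on a reaction based on the rule/priority that fired
            if event.get("priority", "").lower() in ("critical", "emergency"):
                print(f"REACT: {event.get('rule')} -> {event.get('output')}")
            else:
                print(f"NOTIFY: {event.get('rule')}")
            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 2801), FalcoAlertHandler).serve_forever()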

Side Note: there is a cool K8s UI tool that was used by Thomas Labarussias during the demo: https://k9scli.io

6.

Next: Becoming a cloud-native physician: Using metrics and traces to diagnose our cloud-native applications (Grace Jansen – IBM)

The general spirit of the talk was that in order to be able to correctly diagnose the behavior of your distributed application(s) you always need: context/correlation + measurements.

The measurements are provided using:

  • instrumentation – instrument systems and apps to collect relevant data:
    • metrics – the talk emphasised the use of MicroProfile (microprofile.io): an open-source community specification for Enterprise Java – “Open cloud-native Java APIs”
    • traces,
    • logs
    • (+ end-user monitoring + profiling)

Storing the measurements is important: send the data to a separate, external system that can store and analyse it (e.g. Jaeger, Zipkin).

Visualize/analyze: provide visualisations and insights into systems as a whole (Grafana?)

Back to Distributed Tracing and OpenTelemetry:

A trace contains multiple spans. A span contains a span_id, a name, time-related data, log messages, and metadata describing what occurs during a transaction, plus the context – an immutable object carried in the span data that identifies the unique request each span is part of: trace_id, parent_id.
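Purely as an illustration of this span/context structure (the talk itself was Java/MicroProfile-centric), here is a minimal sketch using the OpenTelemetry Python SDK; the tracer name “checkout-demo” and the span names are made up:

    # two nested spans sharing the same trace_id; the console exporter prints the full span data
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("checkout-demo")

    with tracer.start_as_current_span("handle-request") as parent:
        with tracer.start_as_current_span("query-db") as child:
            ctx = child.get_span_context()
            # trace_id is shared by every span of the request; span_id identifies this span;
            # the parent/child relation is what lets Jaeger or Zipkin rebuild the whole tree
            print(f"trace_id={ctx.trace_id:032x} "
                  f"span_id={ctx.span_id:016x} "
                  f"parent_id={parent.get_span_context().span_id:016x}")

Swapping the ConsoleSpanExporter for an OTLP exporter is what ships the same data to the Jaeger/Zipkin backends mentioned earlier.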

The talk ended with a demo of how all these elements come together in OpenLiberty (OpenLiberty is the new, improved, stripped-down WebSphere).

Side Note: an interesting online demo/IDE tool – Theia IDE.

7.

Next: Au secours ! Mon manager me demande des KPIs ! (Help! My manager is asking me for KPIs!) by Geoffrey Graveaud

The third talk to mention the DORA metrics, the second to mention the Accelerate book, and the first (and last) to mention Team Topologies.


The talk adopted a “Star Wars inspired” story format (which, I guess, resonates with all the Unicorn Project, Phoenix Project, etc. readers interested in agile/lean processes). I usually find this style annoying, but this time it was well done and Geoffrey’s acting skills are really impressive.

The problem is a classic: the organisation (the top) asks for KPIs to measure software development performance that feel artificial when projected onto the real life of the team (even if, in this case, the top is enlightened enough to be aware of the DORA metrics).

The solution is new: instead of the usual “the top/business doesn’t understand, so explain it to them or ignore them”, what Geoffrey offers is: look into your real process, measure the elements that are available and part of your reality, and correlate them with the requested KPIs – even if they will not fully answer the request (or, in any case, this is my interpretation).
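As a hypothetical illustration of “measure what is part of your reality and correlate it with the requested KPIs”: two of the DORA metrics can be approximated from data most teams already have, for example deployment events paired with the commit they shipped. The data below is invented.

    # deployment frequency and lead time for changes from (commit time, deploy time) pairs
    from datetime import datetime
    from statistics import median

    deployments = [  # (commit authored, deployed to production)
        (datetime(2024, 6, 3, 10), datetime(2024, 6, 4, 9)),
        (datetime(2024, 6, 5, 15), datetime(2024, 6, 7, 11)),
        (datetime(2024, 6, 10, 9), datetime(2024, 6, 10, 17)),
    ]
    period_days = 30

    deploy_frequency = len(deployments) / period_days
    lead_times = [deployed - committed for committed, deployed in deployments]

    print(f"deployments per day: {deploy_frequency:.2f}")
    print(f"median lead time for changes: {median(lead_times)}")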

Side Note: the sandwich management method

Conclusion

From the talks I attended, a general pattern of renewed focus on measurement emerges: for processes (DORA) and for software (OpenTelemetry).

I’d like to think this focus is driven by the fact that Agile’s principle of Inspect and Adapt is becoming second nature in the industry (but it is hard to defend this theory in practice, knowing how few true Agile organisations are around…)

Of course, for me, Voxxed Days is more than talks and technology – it’s about the people, it’s about meeting familiar faces – people you worked with or just crossed at a meet-up in the city, people who share the same passion for technology, the software community and the society we are building through our work.

In summary: a very good first day – I learned useful things and I’m happy to see that the DORA metrics, the Accelerate book and even really new (2019!) ideas like Team Topologies are finally making inroads into the small corner of the global software industry that Luxembourg represents!

A Quick Look at EBIOS Risk Manager

What is EBIOS RM?

EBIOS Risk Manager (EBIOS RM) is the method published by the French National Cybersecurity Agency (ANSSI – “Agence nationale de la sécurité des systèmes d’information”) for assessing and managing digital risks (EBIOS – “Expression des Besoins et Identification des Objectifs de Sécurité” – can be translated as “Description of Requirements and Identification of Security Objectives”). It is developed and promoted with the support of Club EBIOS, a French non-profit organisation that focuses on risk management, drives the evolution of the method and proposes on its website a number of helpful resources for implementing it – some of them in English.

EBIOS RM defines a set of tools that can be selected, adapted and used depending on the objective of the project, and it is compatible with the reference standards in effect, both for risk management (ISO 31000:2018) and for cybersecurity (ISO/IEC 27000).

Why is it important?

Why use a formal method for your (cyber)security risk analysis and not just slap the usual cybersecurity technical solutions (LB + WAF + …) on your service?

On a (semi)philosophical note – because the first step to improvement is to start from a known best practice and then define and evolve your own specific process.

Beyond the (semi)philosophical reasons there are the very concrete regulations and certifications you may need to implement right now, plus the knowledge that in the future the CRA regulation will require cybersecurity risk analysis (and proof thereof) for all digital products and services offered on the EU market.

OK, so it is important; let’s go to the next step:

How is it used?

First, a few concepts

In general, the goal of any risk management/cybersecurity framework is to guide the organisation’s decisions and actions so that it can best defend and prepare itself.

While risk/failure analysis is something we all do naturally, any formal practice needs to start by defining the base concepts: risk, severity, likelihood, etc.

Risk and its sources:

ISC2 – CISSP provides these definitions:

  • Risk is the possibility or likelihood that a threat will exploit a vulnerability to cause harm to an asset, together with the severity of the damage that could result.
  • A threat is a potential occurrence that may cause an undesirable or unwanted outcome for an organization or for an asset.
  • An asset is anything used in a business process or task.

One of the first formal methods for dealing with risk was FMEA (Failure Modes, Effects and Criticality Analysis), which started to be defined and used in the 1940s (1950s?) in the US (see Wikipedia). It is one of the first places where broad severity categories (not relevant / very minor / minor / critical / catastrophic) and likelihood categories (extremely unlikely / remote / occasional / reasonably possible / frequent) were defined.

ANSSI defines 4 levels of severity in EBIOS RM:

G4 – CRITICAL – Incapacity for the company to ensure all or a portion of its activity, with possible serious impacts on the safety of persons and assets. The company will most likely not overcome the situation (its survival is threatened).

G3 – SERIOUS – High degradation in the performance of the activity, with possible significant impacts on the safety of persons and assets. The company will overcome the situation with serious difficulties (operation in a highly degraded mode).

G2 – SIGNIFICANT – Degradation in the performance of the activity with no impact on the safety of persons and assets. The company will overcome the situation despite a few difficulties (operation in degraded mode).

G1 – MINOR – No impact on operations or the performance of the activity or on the safety of persons and assets. The company will overcome the situation without too many difficulties (margins will be consumed).

ANSSI defines 4 levels of likelihood:

V4 – Nearly certain – The risk origin will certainly reach its target objective by one of the considered methods of attack. The likelihood of the scenario is very high.

V3 – Very likely – The risk origin will probably reach its target objective by one of the considered methods of attack. The likelihood of the scenario is high.

V2 – Likely – The risk origin could reach its target objective by one of the considered methods of attack. The likelihood of the scenario is significant.

V1 – Rather unlikely – The risk origin has little chance of reaching its objective by one of the considered methods of attack. The likelihood of the scenario is low.
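Putting the two scales together, EBIOS RM ends up placing each risk scenario on a severity x likelihood grid; where the acceptance thresholds sit is an organisational choice. A hypothetical sketch (the scoring and thresholds below are invented, not part of the method):

    # map an EBIOS RM (severity, likelihood) pair to an illustrative risk level
    SEVERITY   = {"G1": 1, "G2": 2, "G3": 3, "G4": 4}
    LIKELIHOOD = {"V1": 1, "V2": 2, "V3": 3, "V4": 4}

    def risk_level(severity: str, likelihood: str) -> str:
        score = SEVERITY[severity] * LIKELIHOOD[likelihood]
        if score >= 12:
            return "unacceptable - treat immediately"
        if score >= 6:
            return "to be treated (reduce / share)"
        return "acceptable - monitor"

    print(risk_level("G4", "V2"))   # critical impact, 'likely' scenario
    print(risk_level("G2", "V4"))   # significant impact, nearly certain scenario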

ANSSI defines some additional concepts:

Risk Origins (RO – similar to Threat Agents/Actors in ISC2 terminology): something that could potentially exploit one or more vulnerabilities.
Feared Events (FE – equivalent to Threats in ISC2 terminology).
Target Objectives (TO): the end results sought by a Risk Origin / Threat Actor.

A side note: quantitative analysis

ISC2 – CISSP recommends using quantitative analysis for risk qualification.


Getting there requires quantifying your asset value, or at least how much one realisation of the risk would cost you (the Single Loss Expectancy), and then computing an annualised loss so that you can compare rare-but-costly events with smaller-but-more-frequent ones.

I think the two approaches are compatible, as nothing stops you from afterwards defining thresholds that map the monetary values to a severity class (possibly depending not only on the ALE but also on your budget / risk appetite / risk aversion) – see the sketch below.
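A small sketch of the standard CISSP formulas (SLE = asset value x exposure factor, ALE = SLE x ARO); the asset values, rates and severity thresholds are invented for illustration:

    # quantitative risk: Single Loss Expectancy and Annualized Loss Expectancy
    def sle(asset_value: float, exposure_factor: float) -> float:
        """Single Loss Expectancy: cost of one realisation of the risk."""
        return asset_value * exposure_factor

    def ale(single_loss_expectancy: float, annual_rate_of_occurrence: float) -> float:
        """Annualized Loss Expectancy: expected loss per year."""
        return single_loss_expectancy * annual_rate_of_occurrence

    # rare-but-costly vs frequent-but-small become comparable once annualised
    ransomware = ale(sle(asset_value=2_000_000, exposure_factor=0.5), annual_rate_of_occurrence=0.1)
    phishing   = ale(sle(asset_value=20_000,    exposure_factor=1.0), annual_rate_of_occurrence=6.0)

    def severity_class(annual_loss: float) -> str:
        """Map the monetary value back to an EBIOS-style severity class (thresholds invented)."""
        if annual_loss > 1_000_000:
            return "G4"
        if annual_loss > 200_000:
            return "G3"
        if annual_loss > 50_000:
            return "G2"
        return "G1"

    print(f"ransomware: ALE={ransomware:,.0f} -> {severity_class(ransomware)}")
    print(f"phishing:   ALE={phishing:,.0f} -> {severity_class(phishing)}")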

A process of discovery

Risk management methods are at their core similar: they all contain a number of steps that help establish what it is you need to protect, what could happen to it, and what could be done to make sure the effects of whatever happens are managed (or at least accepted).

So the steps are, in general (with some variance in order and emphasis):

  • identify assets (data, processes, physical)
  • identify the vulnerabilities associated with your assets
  • identify the threats that exist in your operating environment (taking into account your security baseline)
  • identify the risks and prioritise the actions related to them based on their likelihood and severity
    … rinse and repeat.

To help with this process, EBIOS RM defines 5 workshops, each with expected inputs, outputs and participants:

Workshop 1: Scope and security baseline

Workshop 2: Risk origins

Workshop 3: Strategic scenarios


Strategic scenarios:

  • a potential attack considered with the system as a black box: how the attack would happen “from the exterior of the system”

Workshop 4: Operational scenarios


Operational/Operative Scenarios – identify and describe the potential attack scenarios corresponding to the strategic ones, possibly using tools like STRIDE, OWASP, MITRE ATT&CK, OCTAVE, Trike, etc.

Workshop 5: Risk treatment

Risk Treatment Strategy Options (ISO27005/27001):

  • Avoid (results in a residual risk = 0) – change the context that gives rise to the risk
  • Modify (results in a residual risk > 0): add/remove or change security measures in order to decrease/modify the risk (likelihood and/or severity)
  • Share or Transfer (results in a residual risk that can be zero or greater): involve an external partner/entity (e.g. insurance)
  • Accept (the residual risk stays the same as the original risk)
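A schematic sketch of how the four options relate initial and residual risk – not from the standard, with invented reduction factors, just to make the bullets above concrete:

    # residual risk after applying an ISO 27005-style treatment option (factors invented)
    def residual_risk(initial_risk: float, option: str) -> float:
        if option == "avoid":      # change the context: the risk disappears
            return 0.0
        if option == "modify":     # add/change security measures: risk reduced, not removed
            return initial_risk * 0.3
        if option == "share":      # insurance / external partner: part of the impact carried elsewhere
            return initial_risk * 0.5
        if option == "accept":     # documented decision to live with it
            return initial_risk
        raise ValueError(f"unknown option: {option}")

    for option in ("avoid", "modify", "share", "accept"):
        print(option, residual_risk(100.0, option))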


In Conclusion:

EBIOS RM is a useful tool for cybersecurity management, aligned with the main cybersecurity standards and frameworks.

There is also enough supporting open-access material (see ANSSI and Club EBIOS) to help conduct each step of the process and produce the required artefacts: templates, guides, etc. – which makes it a prime candidate for adoption in organisations without an already established cybersecurity risk management practice.

The DevOps adventure – my review of “The Unicorn Project”

#devops #agile #lean

Subtitle: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data
Author: Gene Kim
Bought: 30 July 2020


Another book from Gene Kim, The Unicorn Project continues to promote DevOps and agile methods as the solutions to the world’s IT (and corporate) ills, similar to the other books by the author. Besides The Phoenix Project, Gene Kim is known (at least) for The DevOps Handbook and Accelerate (written together with Nicole Forsgren and Jez Humble).

The book re-uses the novel format introduced by The Phoenix Project to trace the archetypal trajectory of a DevOps transformation—the enterprise being the fictional “Parts Unlimited” company (another reference to The Goal?)—a company that displays all the evils of a big “legacy company” in which words like ITIL and TOGAF have replaced the beating entrepreneurial heart and poisoned the spirit of the good IT people (/insert sad violin/).

The book is split into three parts:

  • Part 1 (Chapter 1 – Chapter 7) presents the state of crisis in which “Parts Unlimited” finds itself: losing market share, an IT culture dominated by bureaucracy in which the actual IT specialists (Dev, Ops, QAs) have become mindless zombies accepting the absurdity of corporate decay: infinite request chains, roadblocks, pervasive blame games, etc. In these dark times, there is a light of hope: a few engineers that have refused to sell out their dreams and that are organizing “the resistance” to defend IT ideals.
  • Part 2 (Chapter 8 – Chapter 13) shows the resistance becoming visible and being embraced by (some) management; the new hope spreads.
  • Part 3 (Chapter 14 – Chapter 19) depicts the final battle and the happy aftermath.

The DNA of the change (and of the perfect organization) consists of “The Five Ideals”:

  1. Locality and Simplicity
  2. Focus, Flow, and Joy
  3. Improvement of Daily Work
  4. Psychological Safety
  5. Customer Focus

What shocked me is the total dedication that the Devs, Ops, and QAs have to this failed company: Maxine and her co-conspirators spend almost 100% of their (diminished) waking hours working to transform Parts Unlimited, to take responsibility for production deployments, to set up (in rogue mode) servers and deployment processes alike. Maxine’s family is presented as a passive background actor: something that is happening around her while she toils at the corporate laptop to improve the corporation’s IT.

The story brushes off, without much consideration, the security and compliance implications of direct production deployment by the DevOps team. It minimizes the human and logistic cost of operating and supporting high availability services: in The Unicorn Project, the DevOps team is happy to take responsibility for maintaining and supporting the services themselves—no worry for the on-call requirements, for wearing a pager (without any compensation).

In the end, the “rogue” aspect of the corporate transformation and especially its dependency on the employees’ readiness for seemingly endless sacrifice of self and of family time is the most puzzling and self-defeating aspect of the (assumed) objective of promoting the DevOps transformation path.

On one side, it makes the whole story less relevant to anybody looking for ways to start on the DevOps transformation path: any change is viable if you have access to the best people and if these people are willing to provide you endless time and effort… but this is never the case. In my experience, in a company like “Parts Unlimited,” you’ll soon discover that the best people have already left; those that stay around are mainly there because they have concluded that the way things are is… acceptable for some reason: job security, predictability, etc.

On the other side (and in my view, this is the worst part of The Unicorn Project): this is not needed. The “DevOps transformation” is possible by setting clear intermediary steps—steps that have an immediate advantage for the people involved and the company (e.g., fewer incidents in production, thus less personal impact).

By clear communication of the thinking behind the decisions, transparent tracking of the results to grow confidence, and by showing the team a sustainable path towards balancing private life and professional excellence, you don’t need a war—you need diplomacy, leadership, and knowing what you are doing.

Exploring the origins of Lean – my review of “The Goal”

#lean #lean-through-manufacture #business-novel

Why I read it

I found references to it in Lean Enterprise (Jez Humble) and it was also mentioned by Jez in a talk I saw on YouTube (maybe this one).

What it is about

The road of discovery traveled by the manager of an embattled factory in the US, who finally arrives at the [[lean]] enterprise principles!

Similar to [[The Phoenix Project]] (which I guess was inspired by it), a hard-working family man is forced to discover the essence of lean in order to save his factory – in the process re-connecting with his team and family… (no kidding!)

What I took from it

Start with identifying the bottlenecks in your delivery process.

The bottlenecks are those activities/resources that are necessary for the end product but have a throughput lower than the expected throughput of the end product. Meaning: if you need to deliver 5 things per day and some part of your process can only do 3 sub-things per day – that’s a bottleneck. Everything else, with a capacity higher than your expected output, is a non-bottleneck (duh!)…

The goal would be to have the activity throughput equal to the expected delivery throughput – not higher, not lower. This requires increasing the throughput through the bottlenecks and, possibly, decreasing the throughput through the non-bottlenecks to limit the “work in progress/inventory” (see the toy example below).
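A toy illustration of the book’s arithmetic (the stages and numbers are invented): the deliverable throughput is capped by the slowest required step, and anything produced faster upstream just becomes inventory.

    # the system's throughput is the minimum over its required stages (and the market demand)
    stages = {            # units each step can process per day
        "cutting":   8,
        "machining": 3,   # <- the bottleneck
        "assembly":  6,
        "shipping":  9,
    }
    demand = 5            # units the market asks for per day

    bottleneck = min(stages, key=stages.get)
    throughput = min(min(stages.values()), demand)

    print(f"bottleneck: {bottleneck} ({stages[bottleneck]}/day)")
    print(f"deliverable throughput: {throughput}/day vs demand {demand}/day")
    # everything produced upstream beyond the bottleneck's rate is not sales, it is inventory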

After each system change/capacity re-balancing, the process needs to be re-evaluated to identify potentially new bottlenecks, and the cycle restarts. A lot of emphasis is put on the fact that accelerating the steps/processes that are not bottlenecks, or even allowing them to go at their “natural speed” and thus faster than the bottlenecks, is wasteful, as it doesn’t contribute to increasing the market throughput. So, in a way, the insight is that you need to block work that is not aligned with delivery… and for me an immediate, significant realisation was that Kanban is most of all a system for blocking the creation of wasteful work.

One way to do this re-balancing would be to flag the work that involves the bottlenecks in a way that gives it higher priority compared with the rest.

Another way is to lower the granularity of the work being done in the non-bottleneck processes so that they respond faster to the bottleneck’s requests.

A point made in the book is that this will increase waste, as smaller granularity implies more context switching (“set-ups”) – but as long as this waste doesn’t transform the process into a “bottleneck process”, it’s OK.

What I thought

Well, it does make clear (through all sorts of examples, stories, etc.) how important it is to align all your actions/outputs with the business goal… but I kind of hate this type of story with the guy who sacrifices his family and personal time to fight against a (broken) system imposed by “the corporation”.

Trivia

“garbage in, garbage out”: the expression appears in this book, which is from 1984 and about factory parts — after some research (Wikipedia) it seems this expression (just like “having a bug”) is another gift that software engineering gave to the world!

How Google <really> runs production systems

#articles #cloud

If there is a book that engineers working in operations have (started to) read in the past 8 years, chances are that its title is Site Reliability Engineering, with the subtitle: HOW GOOGLE RUNS PRODUCTION SYSTEMS… except that it doesn’t, as we will see below.

« During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. »

The customer in this case was UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members.

UniSuper had the nasty surprise, on the 2nd of May 2024, of seeing its multi-geography cloud infrastructure… disappear. It took almost two weeks (until the 15th of May) to be fully back online – and that mostly thanks to offline backups (see this article from Ars Technica).

Read the official account from Google here.