Experience-Based Solution
As mentioned in the Motivation section, Data First DevOps (DFD) is not a hypothesis; it is a set of approaches, backed by a few principles, for addressing the challenges of mid- to large-scale projects. Here is another confession, or rather a reality: when we sat down to solve the problem, we didn't come with a solution, and we didn't name it DFD on the very first day. We built solution.v1, evolved it to solution.v2, then v3, and so on. Eventually we understood that it is the data that makes it all work, and we also came up with something we call Smart Data, Dumb Pipelines, i.e. the pipeline behaves as per the data. It took quite a long time to arrive at the term DFD. Once again, it is just a set of approaches backed by a few principles, and you can be very innovative in implementing these principles or in coming up with newer and better approaches.
We first thought of naming it Data Driven DevOps, which sounds exactly like what this technique implements. On googling the term, we learned that there is already something called Data Driven DevOps, but it is principally far different from DFD. Hence we arrived at the term Data FIRST DevOps, and this name is very close to the principles used in this approach.
Everything has context, even DevOps processes. We need to understand the details of that context, which is generally ignored. We must understand the why, the what, the what-if, and the future what of everything. We need to create a contextual relationship and (if applicable) a life cycle for everything. Once we start connecting the dots, we get the context and the story of the process. Once you have a contextual understanding of the automation story, it gives you a lot of power.
Example: Think of a deployment. Starting from who did it and when, there can be a lot more to it. Is the deployment successful? Can it be taken to release, or will we have to revert it because it failed? How many deployments have failed? What version should we revert to, or should we leave it broken to be fixed? And so on. There is a lot of story/context around a deployment, and we need visibility into all of it.
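The deployment story above can be sketched as a plain data record. A minimal sketch follows; the field names and status values here are illustrative assumptions, not a prescribed schema — your context decides what is worth capturing.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DeploymentRecord:
    # Who/when/what of the deployment (all field names are illustrative)
    deployed_by: str
    deployed_at: str            # ISO-8601 timestamp
    version: str
    environment: str
    status: str = "in_progress" # e.g. in_progress, success, failed, reverted
    reverted_to: Optional[str] = None  # version we rolled back to, if any

# Capture the context as the deployment's story unfolds
record = DeploymentRecord(
    deployed_by="alice",
    deployed_at="2023-05-01T10:15:00Z",
    version="2.4.1",
    environment="staging",
)
record.status = "failed"
record.reverted_to = "2.4.0"
print(asdict(record))
```

Once such a record exists, questions like "how many deployments failed?" or "what version should we revert to?" become simple lookups over data instead of tribal knowledge.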
You cannot humanly see every aspect/story/context of a process; that comes either from experience or from a requirement. One must think of capturing whatever one can perceive at a given point. Ignoring this will never let you implement DFD.
The idea of Data First DevOps, aka DFD, is to consider every component and process as a data point, and then derive the whole of DevOps around these data points. In simple words: let's capture data for (almost) everything we do in the DevOps scope.
Example: we can have a data point for a build (a process) or a data point for a Git repo (a component). All such components should be data, stored somewhere it can be referred to for various purposes.
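One way to picture this is a small registry holding both a process (a build) and a component (a Git repo) as data points, persisted somewhere any pipeline can refer back to. The attribute names and the repo URL below are hypothetical, and a JSON file is just one possible store among many.

```python
import json

# A process (build) and a component (Git repo) treated as data points;
# all attribute names and values here are illustrative assumptions.
data_points = {
    "build": {
        "duration_sec": 142,
        "result": "SUCCESS",
        "built_by": "ci-bot",
        "repo": "github.com/example/app",  # hypothetical repo
        "branch": "main",
    },
    "git_repo": {
        "url": "github.com/example/app",
        "default_branch": "main",
        "protected": True,
    },
}

# Store the data somewhere referable; a JSON file is one simple choice.
with open("devops_data.json", "w") as f:
    json.dump(data_points, f, indent=2)

# Later, any pipeline or report can refer back to it.
with open("devops_data.json") as f:
    loaded = json.load(f)
print(loaded["build"]["result"])
```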
Just as when we develop an application we think of data, such as user data or order data, in exactly the same mindset we need to think of data for DevOps.
Polyglot Data Store
Speaking of processes and components, even if everything can be considered and treated as data, there are various cases where the type of the data is not the same.
Example: a build process can have data like build duration, build result, built by, built repo, built branch, etc. This can be a CSV row in a file, or maybe a single row in a related RDBMS table. Against this, a K8s deployment descriptor does not make sense kept as a CSV row or an RDBMS table row; it rather makes sense stored as a version-able file, e.g. a file in GitHub. We need to decide consciously how the data will be used and what the use cases of that data are, and choose a data store accordingly. Again, there is no hard-and-fast rule, just a decision based on common sense and farsightedness. Making a mistake in selecting the right data store is not a crime, as long as you can switch to a better choice with tolerable loss. The idea is that you need to be open to storing your data in more than one location, even if you feel: who does it like this? Just think carefully, without being biased toward the traditional approach.
Just as we develop applications using polyglot databases, the same is possible here, and you should not shy away if the situation calls for it.
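As a sketch of why store choice matters: build facts land in a queryable store (SQLite below stands in for "some RDBMS"), while the K8s descriptor stays a version-able file in Git. The table layout, row values, and file path are illustrative assumptions.

```python
import sqlite3

# Build/deployment facts go into a queryable store; SQLite is used
# here as a stand-in for any RDBMS. Column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE builds (
        build_id INTEGER PRIMARY KEY,
        repo TEXT, branch TEXT, result TEXT, duration_sec INTEGER
    )
""")
rows = [
    (1, "example/app", "main", "SUCCESS", 120),
    (2, "example/app", "main", "FAILURE", 95),
    (3, "example/app", "dev",  "SUCCESS", 130),
]
conn.executemany("INSERT INTO builds VALUES (?, ?, ?, ?, ?)", rows)

# Choosing an RDBMS pays off the moment you want questions answered:
failed = conn.execute(
    "SELECT COUNT(*) FROM builds WHERE result = 'FAILURE'"
).fetchone()[0]
print(failed)  # → 1

# Meanwhile the deployment descriptor lives as a file under version
# control, e.g. k8s/deployment.yaml in a Git repo (path is illustrative).
```

The same three builds kept as loose CSV lines would answer the "how many failed?" question only after you wrote parsing code; the store should match how the data will be used.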
Data coherence is very much related to the above point of a polyglot data store. Logically similar categories of data must go into similar types of data stores. This should be the default, unless there is a more meaningful reason in your context. Always remember: your context is more important than any default (or best practice or principle) that says otherwise.
Example: We can say that all such configuration files should be kept in GitHub, while build- and deployment-related data should be stored in a CSV file; but there is no need to keep that CSV in GitHub, or maybe you can just put it in an RDBMS, which will enable you to run queries on this data in the future.
Just as we take care of data coherence in application development, we need to take care of data coherence in DFD. Messing up coherence can cause you trouble, mostly when it comes to managing consistency in the implementation approach.
Think of situations where uniqueness matters, and think of how that uniqueness is generated. If the system that generates the uniqueness collapses, we should have ways to keep things working properly. And as the whole thing works on data, you need to ensure that your data is stable and your data source is not ephemeral.
Example: While implementing DFD, there might be cases where you populate some data based on an event or user input, e.g. using the Jenkins BUILD NUMBER as a data point. Be careful about such sources of data, because they are not very reliable. Let's understand this with an example. Say we are storing a Docker image tagged with the BUILD NUMBER of the Jenkins job. If, for any given reason, Jenkins goes down and is reconstructed, the BUILD NUMBER will reset and restart from 1. This can cause a tag conflict. So we can say that BUILD NUMBER is not a very stable data source to be used as a Docker tag, and we need to find a better, more stable data point for this purpose. I am keeping it open for you to think about what that can be.
While developing an application, we take care of the uniqueness of data, its generation, and its storage situation; we need to be careful about data in the same way in DFD. After all, the whole of DFD is based on DATA.
This is not the place to debate data safety with the software experts. Needless to say, when everything is data and a lot of DevOps activities are based on the data we have, it becomes critical to take care of the data we have or generate. This is not something that should haunt you, but the more you use data to do your work, the more important data safety becomes for you. Hence you need to be vigilant about how you plan to safeguard your data. Again, the implementation of data safety is open to you, depending upon your context.
Example: You can take regular backups of your data, or you can have data replication implemented, or maybe you can come up with other ways, as per your context, to ensure that your data remains safe.
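As one of the many possible safeguards, here is a minimal sketch of the backup option: copy the data file into a timestamped backup before anything risky touches it. The function name, backup directory, and sample CSV are all hypothetical.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def backup(data_file: str, backup_dir: str = "backups") -> Path:
    """Copy the data file into a timestamped backup file.

    One simple way (among many: replication, snapshots, ...) to
    protect the data that DFD depends on.
    """
    src = Path(data_file)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = dest_dir / f"{src.stem}.{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 preserves file metadata too
    return dest

# Usage with a throwaway data file:
Path("builds.csv").write_text("build_id,result\n1,SUCCESS\n")
saved = backup("builds.csv")
print(saved.exists())  # → True
```

In practice you would run something like this on a schedule (cron, a CI job) and prune old backups according to your retention needs.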
Just as in application development we care about the RPO (Recovery Point Objective), in the same way we need to take care of the RPO here. Be diligent, but do not over-engineer.
Smart Data, Dumb Pipelines
Last but not least! You must design the pipelines in such a way that the flow and behavior of the pipeline depend on the input data it is processing. This makes your pipeline flexible enough to grow. Try to design the pipeline in such a way that just adding more data adds more behavior. There might be cases where you need to change a bit of your pipeline code, along with adding more data, as per a new requirement; that is acceptable.
Example: Imagine the case described in Quick Peek.
In application development, different input data triggers different execution paths; think in the same manner when implementing DFD pipelines.
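The "smart data, dumb pipeline" shape above can be sketched as a runner that only dispatches on the data it is given. The step names, handler functions, and parameters are illustrative assumptions, not a prescribed pipeline format.

```python
# A "dumb" pipeline runner whose behavior comes entirely from data.
# Step names, handlers, and parameters are illustrative assumptions.

def run_build(params):
    return f"building {params['repo']}@{params['branch']}"

def run_deploy(params):
    return f"deploying {params['version']} to {params['env']}"

# The pipeline's vocabulary: one handler per known step type.
HANDLERS = {"build": run_build, "deploy": run_deploy}

def run_pipeline(steps):
    """The pipeline itself decides nothing; it just dispatches on data."""
    return [HANDLERS[step["step"]](step["params"]) for step in steps]

# Adding behavior = adding data (a new entry), not editing pipeline code.
pipeline_data = [
    {"step": "build",  "params": {"repo": "example/app", "branch": "main"}},
    {"step": "deploy", "params": {"version": "1.2.0", "env": "staging"}},
]
for line in run_pipeline(pipeline_data):
    print(line)
```

Appending another `deploy` entry for a new environment changes what the pipeline does without touching its code; only a genuinely new step type (a new handler) requires a small code change, which, as noted above, is acceptable.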