The Cost of Data
A typical read of a data engineer's responsibilities will contain some description of designing and constructing infrastructure that collects, organizes, and maintains an organization's data assets.
While all of the above are core data engineering skills, my years as a data engineer have taught me that one overlooked responsibility belongs at the top of the list: managing the usage and costs of this data infrastructure.
As the current data era progresses, the abundance and complexity of data continue to scale in parallel with the data needs of the business. This requires a deeper understanding of costs, and the invention of new, more economical ways of collecting, processing, and persisting data. Taking ownership by thinking long term when building data solutions, and communicating these costs transparently to the organization, are the hidden soft skills of data engineering.
Understanding and managing costs starts with the design document. The resources, services, and infrastructure required to construct the recommended data solution all have costs associated with them. Detailing these costs by some unit of measure, typically compute (serverless and provisioned) and storage, at the long-term scale of the desired state provides a picture of what the solution will cost over time. I’ve seen data infrastructure and resource costs escalate fast once scaling is applied. Think through alternative solutions and document their costs as well. Even when told “cost is not a factor,” it is, and it should be weighed alongside the functional design and scalability.
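To make that picture concrete in the design document, a back-of-the-envelope projection can help. Here is a minimal sketch; the unit prices, growth rate, and horizon below are hypothetical placeholders, not real quotes, so substitute your provider's actual rates and your own volume estimates.

```python
# Hypothetical cost projection for a design document. The unit prices
# and growth rate are illustrative assumptions, not real provider rates.

STORAGE_PRICE_PER_GB_MONTH = 0.023   # assumed object-storage rate
COMPUTE_PRICE_PER_HOUR = 0.40        # assumed provisioned-compute rate

def project_monthly_cost(storage_gb: float, compute_hours: float,
                         monthly_growth: float, months: int) -> list[float]:
    """Project total monthly cost given compound data growth."""
    costs = []
    for _ in range(months):
        costs.append(storage_gb * STORAGE_PRICE_PER_GB_MONTH
                     + compute_hours * COMPUTE_PRICE_PER_HOUR)
        storage_gb *= 1 + monthly_growth      # data volume compounds
        compute_hours *= 1 + monthly_growth   # processing scales with volume
    return costs

# 5 TB today, 200 compute hours/month, 8% monthly growth, 24-month horizon
projection = project_monthly_cost(5_000, 200, 0.08, 24)
print(f"month 1: ${projection[0]:,.2f}  month 24: ${projection[-1]:,.2f}")
```

Even a rough projection like this surfaces the compounding effect of growth, which is exactly what escalates costs once scaling is applied.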
Questions to ask in the design document
What common infrastructure or shared resources can the design reuse or be constructed on top of? Building on a common, standard orchestration solution to manage data pipelines, along with standard data formats, does save cost.
If a current-state data architecture exists, what are its resource and infrastructure costs? How will the desired state result in cost savings? Is this an opportunity to deprecate existing infrastructure?
How do the required ingestion and publishing cadence and the expected data volume influence the data architecture? If investing in batch processing, think about how smoothing the processing cadence over a timeframe can reduce costs (see the sketch after this list).
What about long-term data retention? Ask stakeholders this question directly. Consider both DevOps and infrastructure costs when planning for data longevity.
How do peak volume periods or high-volume events impact costs? These may fall outside normal planned scaling.
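On the cadence question above, here is a minimal sketch of the arithmetic; the cluster sizes, runtimes, and hourly rate are illustrative assumptions. The savings come from not sizing a provisioned cluster for a single nightly peak.

```python
# Hypothetical comparison: one nightly batch sized for peak load vs.
# the same daily volume smoothed into hourly micro-batches.
# Node counts, runtimes, and the hourly rate are illustrative assumptions.

NODE_PRICE_PER_HOUR = 1.50  # assumed price per provisioned node-hour

# Nightly: a 40-node cluster runs 3 hours to clear the whole day's data.
nightly_cost = 40 * 3 * NODE_PRICE_PER_HOUR        # 120 node-hours/day

# Smoothed: a 4-node cluster runs 1 hour, 24 times per day.
smoothed_cost = 4 * 1 * 24 * NODE_PRICE_PER_HOUR   # 96 node-hours/day

print(f"nightly: ${nightly_cost:.2f}/day, smoothed: ${smoothed_cost:.2f}/day")
```

The exact numbers will differ per workload; the point is that a smaller cluster kept at steady utilization often consumes fewer total node-hours than a large cluster sized to clear a daily peak.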
Managing for the long-term
I can’t emphasize this enough: make it a best practice to include both usage and cost tracking in the standard operating procedure (SOP) guide.
Develop a mechanism to capture and store infrastructure costs at a time-based grain (monthly, for example) that can be easily queried, reported on, and used for forecasting.
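On AWS, for example, a capture job might pull monthly, per-service costs from the Cost Explorer API via boto3. This is a minimal sketch; the date range is a placeholder, and in practice you would persist these rows to your own queryable cost table rather than print them.

```python
import boto3

# Minimal sketch: pull monthly cost per service from AWS Cost Explorer.
# The date range is illustrative; persist these rows to a queryable
# table for reporting and forecasting.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-01-01", "End": "2023-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for period in response["ResultsByTime"]:
    month = period["TimePeriod"]["Start"]
    for group in period["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{month}  {service:40s}  ${amount:,.2f}")
```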
Monitor and alert on spikes in usage and the related costs. Cloud providers offer standard usage and cost monitors, with the ability to configure thresholds to alert on.
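Even without a provider-native monitor, a simple check over the cost history captured above can flag spikes. In this sketch, the 25% threshold and the sample figures are assumptions to tune for your environment.

```python
# Minimal spike check over captured monthly costs. The 25% threshold
# and the sample data are assumptions; tune both to your environment.
SPIKE_THRESHOLD = 1.25  # alert when a month exceeds the trailing average by 25%

def check_for_spike(monthly_costs: list[float]) -> bool:
    """Compare the latest month against the trailing average."""
    *history, latest = monthly_costs
    trailing_avg = sum(history) / len(history)
    return latest > trailing_avg * SPIKE_THRESHOLD

costs = [1180.0, 1215.0, 1190.0, 1650.0]  # illustrative monthly totals
if check_for_spike(costs):
    print("ALERT: latest month's cost is well above the trailing average")
```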
Using cloud-based resources? Utilize key-value tags to maintain project metadata on the resources used, and maintain distinct accounts for beta and production environments.
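A consistent tag set is what lets cost reports be grouped by project, team, and environment. This sketch applies a hypothetical tagging convention to an EC2 instance with boto3; the tag keys, values, and instance ID are placeholders.

```python
import boto3

# Hypothetical tagging convention; the keys, values, and instance ID
# below are placeholders. Consistent tags let cost reports be grouped
# by project, team, and environment.
STANDARD_TAGS = [
    {"Key": "project", "Value": "customer-360"},
    {"Key": "team", "Value": "data-engineering"},
    {"Key": "environment", "Value": "production"},
    {"Key": "cost-center", "Value": "analytics"},
]

ec2 = boto3.client("ec2")
ec2.create_tags(Resources=["i-0123456789abcdef0"], Tags=STANDARD_TAGS)
```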
As a data engineering team, schedule a recurring infrastructure planning and review meeting. Use this time to establish infrastructure spending goals, and to identify patterns and trends in the actual cost data versus those goals. This should be a deep dive into the infrastructure and its usage patterns.
Collect and evaluate data set usage, along with stakeholder feedback. Data sets are built for a business purpose, and on many occasions outlive that purpose. Declining stakeholder access over a period of time is usually a good indicator that a data set is no longer of use. Think of this as good data stewardship.
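One way to spot candidates for retirement is to scan access counts per data set over recent months. In this sketch, the access-log structure, the data set names, and the decline threshold are all assumptions for illustration.

```python
# Minimal sketch: flag data sets whose monthly access counts are
# steadily declining. The log structure and thresholds are assumptions.
monthly_access = {
    # data set -> access counts for the last six months, oldest first
    "orders_daily_summary": [410, 395, 402, 388, 415, 407],
    "legacy_campaign_rollup": [120, 84, 51, 30, 9, 2],
}

def is_declining(counts: list[int], min_drop: float = 0.8) -> bool:
    """True when recent usage has fallen at least min_drop vs. the start."""
    start = sum(counts[:3]) / 3      # average of the oldest three months
    recent = sum(counts[-3:]) / 3    # average of the newest three months
    return start > 0 and (start - recent) / start >= min_drop

for name, counts in monthly_access.items():
    if is_declining(counts):
        print(f"{name}: candidate for deprecation review")
```

A flagged data set is a conversation starter with its stakeholders, not an automatic delete; the feedback loop is the stewardship part.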