How to build a data team when you’re not sure where to start


At Ad Hoc, we believe that data plays a key role in helping us build more satisfying and user-friendly digital experiences. And there’s certainly no lack of data in government! Whether it’s website traffic, customer surveys, grant application or healthcare enrollment databases, or any other form of information – there is a staggering amount of data in the public sector.

Using all that data to improve government services, however, is no easy task. Adding data practitioners to your team is a great way to start – but it’s also important to be strategic about which roles should be added and where they should sit, relative to the rest of the team.

Understanding key data roles

Before diving into different models for data team structures, we should first go over the roles that serve as their building blocks. There is endless variation in titles in the data field, including newer roles that specialize in particular tools or areas of the data pipeline. For our purposes, we’ll focus on four of the most foundational data roles, which are also the ones we’d most likely staff when building a new data team.

[Figure: Half circle with the data team in the middle, surrounded by four key data roles: data architects, data engineers, data analysts, and data scientists.]

Data architects identify an organization’s short- and long-term data needs by developing a strategy that ensures the data ecosystem will be built to meet those needs. They set the standards for how data and metadata are stored and processed. They are also responsible for issues of data governance and access. The metaphor of an architect is a good one, since these roles develop the blueprints that data engineers will build from.

Data engineers build the infrastructure to bring data into an ecosystem and transform it so that data analysts and data scientists can use it. Their work involves building pipelines, wrangling data, and making sure all processes are scalable and efficient. The metaphor of “data as plumbing” is often used when describing data engineers. But in the contract-heavy world of government tech, we’re rarely the first ones breaking ground, so civil engineers may be the better comparison; data engineers must strategize how important resources can make their way through partially developed landscapes.

Data analysts transform and analyze data once it has been brought into the ecosystem, identifying patterns and surfacing insights. They are also responsible for communicating their findings using data visualizations, storytelling, dashboards, and reports. Their work typically focuses on descriptive or diagnostic analytics, though in some cases it may include predictive analytics as well. Analysts often act as translators, taking the business logic of the product, mapping it onto the data, and bringing it back in a form that supports discussion and decision making.

Data scientists also transform data and analyze it, while using statistical modeling and machine learning techniques to tackle complex problems. Their work usually involves predictive or prescriptive analytics for large datasets, though they may also use advanced diagnostic analytics for trickier data. They should have a broad understanding of different statistical models, as they’re responsible for evaluating which model is most appropriate for their investigation. Like detectives, data scientists go deep on a case to interrogate relationships between variables, revealing patterns that might have been invisible to the naked eye.

In practice, of course, there can be some overlap or shared tasks between these roles. And not every team will necessarily have every role represented. But in general, understanding the sort of work each of these roles does can help us match them to our current needs.

Picking the right team structure

There is no one-size-fits-all approach to building a data team. In fact, each model below could be the perfect fit for a particular project! To maximize the impact of your data team, you’ll need to consider:

  1. The maturity of the data ecosystem that already exists
  2. The size and scope of the project you’re able to make structural decisions about
  3. Stakeholder buy-in for the value of data in decision-making

Centralized model

Here, all data practitioners work on one central team where they build pipelines and dashboards and provide analysis for several other teams. A centralized team might focus on building infrastructure and self-service tools that can easily scale to support many other teams at once. Alternatively, they might function more like an internal consulting team, swooping in to add their expertise and conduct bespoke analysis when needed for the product teams they serve. This model will usually include a combination of a data architect, a data engineer, and at least one data analyst or data scientist – though it may include several of any of these roles.

Pros
The great thing about this model is how easy it is for data staff to collaborate with each other across data roles. This is especially true for data analysts and data scientists who need to communicate business logic to the data engineers creating data models that support their analysis. A central team also has a bird’s-eye view, giving them insights into opportunities for new tools or research that may benefit multiple products and stakeholders. In addition, a centralized team has a greater likelihood of building a unified data ecosystem that will be an asset to the larger organization or contract as it continues to evolve.

Cons
By putting data in its own corner, staff have less visibility into the product teams they serve. This limits opportunities to discover new data sources or ideas for new analysis. Depending on the scope of the project and the number of data staff, it can also be easy for a centralized team to quickly become overburdened. A clear scope of work and prioritization framework from product owners is essential here, as the complexity of product-specific analysis may be limited by the need to serve many teams at once.

When it’s the right fit
Platform projects, where technology building blocks are developed as a product for other application teams, are a great fit for this model. Likewise, projects where lots of data exists that could benefit multiple teams are also good candidates, especially if there hasn’t been much previous work done to model the data. When product teams already have their baseline data needs met by existing tools and staff but may need help for deeper analysis, a centralized data team with a consulting focus may be the best approach.

Embedded model

Here, data practitioners are members of the product team whose data they are working with. This is most common for data analysts and data scientists, who are then able to focus their analysis on a more limited scope of products. An embedded team may also include a data architect and data engineer, especially if there has not been much work done previously to configure the data ecosystem. In a leaner staffing model, a data analyst or data scientist may need to rely on the work of previous data engineers or software engineers on the team who have some experience with data. Embedded practitioners may also need to take responsibility for a broader scope of the data stack themselves, making more senior staff a better match.

Pros
Because data practitioners have close, detailed knowledge of the product, they also have a greater chance of identifying additional data sources and opportunities for analysis. In addition, by focusing on the work of one product team, they can do deeper investigations of specific product questions, using more complex techniques.

Cons
The positive impact of this data work will likely be limited to the team these practitioners are placed on. There is also a greater risk of work being siloed, with data analysts and data scientists reinventing the wheel simply because they don’t know about the work of current, past, or future data staff on other embedded teams. Not only can this work be duplicative, but it can also lead to incompatible systems being developed and stakeholder confusion when similar metrics are computed differently.

When it’s the right fit
Despite these risks, teams working on data-intensive projects that need the full-time attention of data staff are great candidates for an embedded model! There are also some situations where the scope of the team that we’re able to make staffing decisions for is limited, so any data roles are embedded by default. This model can also be perfect when stakeholders at an organization aren’t quite sold on the value of data or how to get started. In these cases, an embedded data team can be a great proof of concept for how this work can grow.

Hybrid model

In this Goldilocks-style approach, a centralized data team of data architects, engineers, and analysts focuses on creating standardized data products, developing a library of resources and basic analytics tools to serve many teams within an organization or project. Meanwhile, a small number of data analysts and data scientists are placed on select product teams. These embedded roles depend on the foundation the centralized team built – and they also help inform what that team should develop next, providing insights from the work they’re doing on the front lines of their product teams.

Pros
This hub-and-spoke strategy captures many of the benefits of the previous two models while reducing the risks of each. The embedded data staff ensure that the more complicated data needs of product teams are met. Meanwhile, the centralized team develops a unified ecosystem to make sure work is compatible (not contradictory). It also reduces the need for distributed data engineers on product teams, making the overall headcount lower than it would be if centralized and embedded teams were built without syncing them up.

Cons
If this approach gives us the best of both worlds, isn’t it always the obvious choice? Not necessarily. A hybrid team can be a tough model to start from when needs and resources aren’t quite known. And realistically, the scope of some projects may be too small for this to be a smart use of resources. Relatedly, lack of stakeholder buy-in or low data fluency across product teams might make an embedded or centralized team a more efficient place to start.

When it’s the right fit
This model is a great target for teams starting from an embedded or centralized model looking to grow. Alternatively, large projects or several projects within an organization looking to maximize impact with limited data staff may choose to take this approach from the start, if they have the resources.

Adding value – even on a small scale

While there’s no silver-bullet model that will be correct for all data teams, this framework hopefully provides more tools for deciding which model is right for your project. And as mentioned above, that answer might change over time! It’s hard to maximize the impact of a data team that doesn’t exist yet – so don’t wait to be able to staff a full hybrid team tomorrow when a few well-placed data roles can start adding value today.
