Data Discovery Tools: Should You Build or Buy?

Every data team has data discovery on its mind, even more so as it begins to scale. Whether you already have a data discovery solution or are considering taking the first step toward implementing one, the growing pains of a scaling data organization mean looking for ways to empower both you and your team to search, find, and derive conclusions from data with minimal hand-holding.
This is where data discovery comes in, and why finding the right data discovery service or solution matters. After all, implementing one is a big transition, and moving from one tool to another because your first solution fell short is an even bigger one. Based on our experience with the dozens of data organizations we work with, we’ve put together a guide on finding the best data discovery solution for your organization. We’ll be covering:
Data discovery is the process of applying techniques such as data mining and interactive visualization to a company's data, with the goal of finding and understanding patterns in that data.
In the broader sense, it’s the process, tooling, and organization that data creators and gatekeepers (typically the data analysts or engineers in an organization) use to make data accessible to those who need it. Sometimes, they’re making it accessible to people within the data organization (e.g. their fellow analysts and engineers), whereas other times they’re ensuring that data is reliably accessible to people outside of the data organization (e.g. stakeholders in sales, marketing, engineering, etc.).
Data discovery tools help both data stewards and business users (non-technical users) access and analyze complex data sets within their organization. The tools provide visualizations and other pre-built analyses that allow business users to answer specific questions about the data. The key components of a data discovery tool include:
There are many reasons why organizations need to index on data discovery sooner rather than later. For most, there’s a tipping point, usually within the data organization, that causes the search for a tool.
Many teams choose to purchase data tools because they want to start using them as quickly as possible, without weighing the tradeoffs between speed and flexibility. Similarly, many teams choose to build their solutions off an open-source tool because they believe their use case is so unique that nothing that exists can fulfil their requirements. No matter which approach your team chooses, there will always be benefits and drawbacks to choosing one tool over the other (speed, customization, support, reliability).
When evaluating which data discovery solution you’d like to use, you should ask yourself the following questions:
Once it’s time to make a decision, we believe that it’s important to choose a solution based on the end goals your team has with the product as well as the dependency that other stakeholders will have on the product.
Suppose you’re looking to implement data discovery and start using the tool to align teams on what certain terms mean, how to access data and what data to trust. In that case, it could make more sense to buy from a vendor who has already built and manages a product that can achieve these functions.
However, suppose you are more interested in deep data governance and your data infrastructure requires unique features that are not covered by traditional vendors. In this case, you should consider building a tool for your specific use case.
As the data stack becomes more fragmented with tools like Reverse ETL, data quality, data observability, data catalogues and headless BI tools, teams will have to pick which of these tools they want to maintain internally vs. buy from a vendor. We believe that in the future, teams who make the right decisions about which products to manage vs. purchase will be able to leverage their data teams' core competency and provide much more value to the business.
In the case of data discovery, data teams should ask themselves if they are well equipped to build a user-friendly data discovery and governance tool, which requires a mix of user experience, product management and data engineering abilities.
At first glance, building off an open-source tool appears to be a good option because it allows you to create a tool that is the perfect fit for your specific business model. But teams who choose to build using open-source products can run into unique challenges that end up requiring even more data engineering effort, and the end result may be just as expensive as buying the solution. Teams should also consider that it’s highly unlikely they will get the resources to build their vision of the perfect tool internally. One of the reasons for this is that investing in a tool that is not part of your core differentiation might not be a great use of company resources.
This is especially true at the beginning, and with a tool that is used by a variety of stakeholders. When data teams decide to manage open source tools that are used by a variety of stakeholders, they risk having to meet the demands of those stakeholders for future iterations of the product. The management of an internal open-source tool can very easily start to consume a data team that is not prepared to manage a product that is built for the entire organization. That being said, there are a lot of open-source tools that can work well for data teams. Tools that are used internally by the data team and perform a very isolated role are a great use case for open source tooling.
Here are some of the pros and cons of building data discovery tools from an open-source library or scratch:
Pros:
Cons:
Over time, the volume, complexity and scope of the tools might change as the needs of your business and technical requirements change. When you’re planning your product, you need to think about how the tool may change as things become more complex, and be prepared to build support for that future iteration of the product.
The primary reason to purchase software instead of building it is to save time, money and resources. Additionally, teams should consider what building the tool internally adds to the organization's core competency. Building does let teams configure the data discovery tool to their exact data stack and specific needs, so that flexibility has to be weighed against the savings.
By buying from a vendor, you are guaranteed to see continuous changes and developments to the tool, regardless of your company's resources. If it takes time for your team to develop new features and for the open-source community to innovate on the product, it might make more sense to consider purchasing a solution from a vendor instead of building your own tool.
As with building, there are tradeoffs to purchasing your tool. Below are some of them.
Pros:
Cons:
Once you have a good understanding of the cost associated with building, you should try to understand what it takes to build your tool or manage one internally.
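To make the cost comparison concrete, here is a minimal sketch of a build-versus-buy estimate over a fixed time horizon. Everything in it is an assumption for illustration: the class names, the fields, the simple cost model (initial build effort plus a fraction of one engineer for upkeep, versus onboarding plus an annual license), and the example figures are hypothetical and should be replaced with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class BuildEstimate:
    # All fields are hypothetical inputs you would estimate yourself.
    engineers: float                  # full-time engineers during the initial build
    build_months: float               # months to reach a usable first version
    monthly_cost_per_engineer: float  # fully loaded monthly cost of one engineer
    maintenance_fraction: float       # share of one engineer spent on upkeep afterwards


@dataclass
class BuyEstimate:
    annual_license: float   # vendor subscription per year
    onboarding_cost: float  # one-off setup and integration effort


def build_cost(est: BuildEstimate, horizon_months: int) -> float:
    """Initial build cost plus ongoing maintenance for the rest of the horizon."""
    initial = est.engineers * est.build_months * est.monthly_cost_per_engineer
    upkeep_months = max(horizon_months - est.build_months, 0)
    upkeep = est.maintenance_fraction * est.monthly_cost_per_engineer * upkeep_months
    return initial + upkeep


def buy_cost(est: BuyEstimate, horizon_months: int) -> float:
    """One-off onboarding cost plus the license prorated over the horizon."""
    return est.onboarding_cost + est.annual_license * horizon_months / 12


# Illustrative figures only, not benchmarks.
build = BuildEstimate(engineers=2, build_months=6,
                      monthly_cost_per_engineer=12_000, maintenance_fraction=0.5)
buy = BuyEstimate(annual_license=40_000, onboarding_cost=10_000)

horizon = 36  # compare over three years
print(f"Build: {build_cost(build, horizon):,.0f}")
print(f"Buy:   {buy_cost(buy, horizon):,.0f}")
```

A model like this leaves out the things that are harder to price, such as stakeholder feature requests, vendor lock-in, and the opportunity cost of pulling engineers off core work, so treat the output as a starting point for the discussion rather than an answer.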
Whichever path your team chooses, we believe making a decision is far better than taking too long to evaluate alternatives. This is because the amount of data accumulated and the number of requests a data team receives only grow with time. The earlier teams adopt data governance and data discovery tools, the sooner they will be able to trust their data. Of course, there’s a lot at stake when making decisions about data and the impact it can have on your business.
If you are currently struggling to decide whether to build or buy a data discovery tool and would like a second opinion (we promise it won’t be biased), our team can help you define your goals and technical requirements so you avoid any serious roadblocks along the way.
Sign up for a free consultation here