10 Best Practices for Data Teams at Rapidly Growing Companies
Managing data at a company is never an easy feat. Doing so at a rapidly growing company is even more difficult: it often means adding tools and building processes while the amount of data you're working with increases exponentially. As the data and its documentation grow, so does the amount of knowledge that new team members must absorb.
Since we work with data teams of all sizes, here are some of the best practices we'd recommend for data teams serving startups and fast-growing companies.
Collecting data from different sources can be challenging, especially when you’re in a fast-paced environment. It’s important to gather all your data into a single source of truth so that you don’t have to access disparate systems to get a complete view of your business.
When it comes to building out your data stack, there are two types of tools you should consider: ETL and warehouse tools. ETL (extract, transform, load) tools pull data from API sources and allow for some transformation before loading the data into a warehouse or database. Examples include Fivetran and Stitch. Warehouse tools—such as Snowflake, BigQuery or Redshift—are where the transformed data is stored and queried using SQL.
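To make the extract, transform, load flow concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how Fivetran or Stitch work internally: the `RAW_ORDERS` records, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse such as Snowflake, BigQuery or Redshift.

```python
import sqlite3

# Hypothetical records standing in for what an API source might return.
RAW_ORDERS = [
    {"id": 1, "email": " Ada@Example.com ", "amount_cents": 1250},
    {"id": 2, "email": "grace@example.com", "amount_cents": 4000},
]


def extract():
    """Extract: return raw records (a real pipeline would call an API here)."""
    return RAW_ORDERS


def transform(records):
    """Transform: trim and lowercase emails, convert cents to dollars."""
    return [
        (r["id"], r["email"].strip().lower(), r["amount_cents"] / 100)
        for r in records
    ]


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, email TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# SQLite in memory is an assumption used only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# Once loaded, the data is queried with plain SQL.
total = conn.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0]
emails = [row[0] for row in conn.execute("SELECT email FROM orders ORDER BY id")]
```

Keeping the three steps as separate functions makes it easier to swap the source or the destination later without touching the cleaning logic.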
When you're working with a SQL database, it helps to have some standard naming conventions for tables and columns. The following is a list of queries that can help you build your data model:
As your company grows, the number of data sources available to you will also grow. It will be increasingly important for all users of your data to understand what variables mean across these different data sources. This way, a user like a finance analyst who may not be familiar with your web application can know what an event_id or session_id corresponds to in your database.
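One lightweight way to share this knowledge is a data dictionary that anyone can look up. The sketch below is only an illustration: the `DATA_DICTIONARY` structure, the `describe` helper, and the descriptions themselves are assumptions made for the example, not definitions taken from any real schema.

```python
# Hypothetical data dictionary: column name -> where it comes from and what it means.
DATA_DICTIONARY = {
    "event_id": {
        "source": "web application",
        "description": "Identifier for a single tracked event.",
    },
    "session_id": {
        "source": "web application",
        "description": "Identifier shared by all events in one visit.",
    },
}


def describe(column):
    """Return a one-line, human-readable explanation of a column."""
    entry = DATA_DICTIONARY.get(column)
    if entry is None:
        return f"{column}: no entry yet, ask the data team to document it"
    return f"{column} ({entry['source']}): {entry['description']}"


summary = describe("session_id")
missing = describe("gross_margin")
```

Returning a message for undocumented columns, rather than raising an error, gives non-technical users a clear next step.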
To help prevent confusion, ask everyone to always reference columns together with their table names (e.g., `user.gender`). This is a good practice because it removes the need for disambiguation and can prevent errors when querying multiple tables at once.
This matters most when common column names such as `date` or `email` appear in several tables: explicitly disambiguate them by prefixing them with their table name and/or by using more descriptive names such as `creation_date`. This will prevent ambiguity in SQL queries and ensure that everyone who is writing SQL knows where the column is coming from without having to look it up elsewhere.
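The difference between an unqualified and a qualified column reference can be seen in a small runnable sketch. The `users` and `orders` tables and their contents are hypothetical, and SQLite is used here only because it ships with Python; other databases report the same problem with their own error messages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, creation_date TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT, creation_date TEXT
    );
    INSERT INTO users VALUES (1, 'ada@example.com', '2020-01-05');
    INSERT INTO orders VALUES (10, 1, 'billing@example.com', '2020-02-01');
    """
)

# Unqualified: both tables have an `email` column, so the query is rejected.
try:
    conn.execute("SELECT email FROM users JOIN orders ON orders.user_id = users.id")
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

# Qualified: each column names its table, and aliases keep the output readable.
rows = conn.execute(
    """
    SELECT users.email          AS user_email,
           orders.email         AS order_email,
           orders.creation_date AS order_creation_date
    FROM users
    JOIN orders ON orders.user_id = users.id
    """
).fetchall()
```

The aliases carry the table name into the result set, so whoever reads the output downstream also knows where each value came from.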
For example, you could have a page dedicated to each of the projects below—or possibly even more granular pages within those projects:
Even the best data teams are always looking for ways to improve their tech stack and processes, so be sure to stay on top of the latest advancements in technology. If your team is working with a new or unfamiliar technology, ask teammates if they have any experience with it, or know anyone who does. If not, attend meetups. If you’re based in New York City like we are, there are tons of amazing meetups to attend! You can also keep an eye out for news about new technologies that may prove useful to you; these announcements often come out of conferences like Strata+Hadoop World or AWS re:Invent. It may sound obvious, but reading industry blogs is a great way to make sure you’re staying up-to-date on the latest trends in your field!
As you add engineers to your data team, you'll need a way to ensure that everyone is using the same transformation framework for their work:
As your company grows, so will the number of people who have access to your data. If you’ve done a good job securing and organizing your data warehouse, then you’ll have a lot more people using it. That’s why creating a naming taxonomy is key for keeping everything organized and understandable. What should you name things? Here are some ideas:
You’ll want a directory structure in which your model folders are named in a way that makes sense. This is important because it makes it easy for others to find the information they need when working on an analysis or building new models. For example, if you are working on the “Customer Retention” project and have a “Customer Lifetime Value” model, your folder could be called something like “customer_retention/lifetime_value.”
People who are not familiar with the data team will be able to navigate easily after some basic training (e.g., where all Git repositories live and how many projects currently exist), and you can also create search functionality for more advanced users.
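A naming convention is easier to follow when it is generated rather than typed by hand. The helper below is a sketch of one possible rule, chosen so that it reproduces the example above: names are lowercased and joined with underscores, and leading words in the model name that merely repeat the project name are dropped. The `model_path` function and this rule are assumptions for illustration, not an established standard.

```python
import re
from pathlib import PurePosixPath


def slug_words(name):
    """Split a display name into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())


def model_path(project, model):
    """Build a 'project/model' folder path from two display names."""
    project_words = slug_words(project)
    model_words = slug_words(model)
    # Drop leading model words already present in the project name,
    # but always keep at least one word for the model folder.
    while len(model_words) > 1 and model_words[0] in project_words:
        model_words.pop(0)
    return str(PurePosixPath("_".join(project_words)) / "_".join(model_words))


path = model_path("Customer Retention", "Customer Lifetime Value")
print(path)
```

With the project and model names from the example, this yields `customer_retention/lifetime_value`.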
When you’re collaborating with other teams and individuals, remember that not everyone learns in the same way as you do. Be patient when you explain your project to others: it may take a few rounds of questions and answers before they fully understand what you’re trying to accomplish.
If something isn’t clear to you when someone is explaining their data project, don’t be afraid to ask for more details. But don’t be surprised if the person on the other side of the conversation has trouble understanding why you don’t get how their project works! You each have a unique way of processing information.