Data munging is dirty work.

Data Wrangling & Data Munging Blocks Innovation

Efficient data workflows are crucial to being a data-driven organization

Thomas Spicer
Openbridge
6 min read · Jun 19, 2019


Data wrangling, or data munging, can impact your business’s bottom line. According to MIT, Tableau, Capgemini, and McKinsey, companies that have embraced a data-driven culture outperform their peers.

Over 80% of enterprises plan to deploy new data products this year, and 65% of small and medium-sized businesses follow that trend.

Given these business investments and priorities, data scientists and analysts will continue to be hot professions. These same analysts and data scientists are often tasked with wrangling data before they can glean valuable insights from it.

Data wrangling and roadblocks to creating a data-driven organization

What do companies hope to accomplish by instilling a data-driven culture within their organizations? A variety of business needs fuels the desire to be data-driven:

  • Confidence: Help people to make informed (better) business decisions.
  • Agility: Help business users to make decisions faster.
  • Consistency: Ensure planning and forecasting are more reliable and predictable.
  • Growth: Drive growth via customer acquisition and retention.
  • Product: Drive innovation in new services and products.

The data analysis journey can be rocky. Companies pursue data-driven cultures and then realize that getting the most value from those efforts means clearing an unexpected roadblock: data wrangling.

The wrangling or data munging work of making data productive is one of the “dirty jobs” of analytics and insights. Given the amount of data these data analysts or scientists have to process manually, a better job title may be “data wrangler.”

“What is Data Wrangling” or “What is Data Munging”?

Wrangling and munging are synonymous. What exactly is wrangling or munging? Here is a working definition:

“data wrangling process or munging is the work of manually aggregating, organizing, converting, routing, mapping and storing data, to ensure it is available and accessible for easy consumption. This includes pulling data sources, mapping data or data transformations, and data cleansing. This is done to support exploratory data analysis, data visualization, data science or machine learning”

The wrangling process also includes achieving data quality for complex data sets, processed data, or raw data. This can involve cleaning data, converting it from one format into another, or modeling complex data. This work often falls under an ETL or ELT process, especially if your workflow is extracting, loading, and transforming data.
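To make the cleaning and format conversion described above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: the sample export, its column names, and the `wrangle` function are all hypothetical. It normalizes headers, trims whitespace, drops rows with no email, removes repeats, fills a missing value, and converts the result from CSV to JSON.

```python
import csv
import io
import json

# Hypothetical raw export; the column names and values are made up for illustration.
RAW_CSV = """Email , Signup Date,Plan
 ALICE@example.com ,2019-06-01,pro
bob@example.com,2019-06-03,
bob@example.com,2019-06-03,
,2019-06-04,free
"""


def wrangle(raw_text):
    """Clean raw CSV text and return a list of normalized records."""
    reader = csv.DictReader(io.StringIO(raw_text))
    seen = set()
    records = []
    for row in reader:
        # Normalize header names: strip whitespace, lowercase, use underscores.
        clean = {
            key.strip().lower().replace(" ", "_"): (value or "").strip()
            for key, value in row.items()
        }
        email = clean["email"].lower()
        if not email or email in seen:
            continue  # drop rows with no email and rows that repeat an email
        seen.add(email)
        clean["email"] = email
        clean["plan"] = clean["plan"] or "unknown"  # fill a missing value
        records.append(clean)
    return records


records = wrangle(RAW_CSV)
as_json = json.dumps(records, indent=2)  # convert from CSV to JSON
print(as_json)
```

Even this toy case needs several decisions (what counts as a duplicate, how to fill gaps), which is why doing the same work by hand across many sources becomes so time-consuming.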

These are all part of the wrangling or munging efforts needed to deliver effective analytics processes, insights, and outcomes. It is data-wrangling tools and strategies that help facilitate easy consumption. Often, moving from manual processes to automated ones requires ETL tools.

So what does “easy consumption” mean? It reflects the removal of barriers to data. The user gets to choose the data tool they use. With agility and velocity, they can quickly pursue data processing, visualizations, transformations, routing, reporting, or statistical models.

For example, one person may use Excel to consume their data, while others prefer analysis tools like Tableau, Qlik, or MicroStrategy. All of them consume the same data, using the tools that best suit their preferences.

Realizing easy consumption requires time-consuming, expensive, and error-prone data preparation efforts. Those efforts often become the responsibility of those who wear an analyst hat. They may work in marketing or IT, or hold the sexiest job of the 21st century, “data scientist.”

Regardless of title or department, people who need to consume data sets suffer most when they cannot. They face significant bottlenecks in making data intuitive and efficient because they spend so much time munging it. The data wrangling step’s importance reflects the desire to solve this difficult consumption problem teams face.

Complicated data preparation steps for analytics

People give the impression that data is magically delivered and available. In theory, this data is available in a warehouse or data lake. In practice, it is scattered across APIs, exports, logs, and a host of other data sources. What is glossed over is the integration effort required to make data available.

What does making data available mean? A tenet of data wrangling methodology is to ensure data availability. This is defined as:

“the logistics of moving data from a place of low value to a place of high value.”

For example, you order a new laptop from Amazon. When you place the order, it is sitting in an Amazon warehouse. This is a low-value location (at least to you). Once that product arrives at your doorstep, it is in a high-value place (your home).

Email data residing in MailChimp is in a low-value location. It serves a purpose, but it’s isolated. Moving it to a place of high value means extracting and transporting all your MailChimp data to your local or cloud system (e.g., a database, storage array, or cloud drive). The same is true of any wrangling project.
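As a rough sketch of what moving data from a low-value place to a high-value place can look like, the Python example below extracts records from a source and loads them into a local SQLite database. Everything here is an assumption for illustration: `fetch_subscribers` is a hypothetical stand-in that returns hard-coded rows rather than calling any real vendor API, and the table layout is invented.

```python
import sqlite3


def fetch_subscribers():
    """Hypothetical stand-in for an export from an email platform.

    A real pipeline would call the vendor's API or read an export file here.
    """
    return [
        {"email": "alice@example.com", "status": "subscribed"},
        {"email": "bob@example.com", "status": "unsubscribed"},
    ]


def load_subscribers(conn, rows):
    """Load extracted rows into a local table so any tool can query them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subscribers "
        "(email TEXT PRIMARY KEY, status TEXT)"
    )
    # INSERT OR REPLACE keeps reruns idempotent: the latest status wins.
    conn.executemany(
        "INSERT OR REPLACE INTO subscribers (email, status) "
        "VALUES (:email, :status)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # swap for a file path to persist the data
load_subscribers(conn, fetch_subscribers())
load_subscribers(conn, fetch_subscribers())  # a second run does not duplicate rows
count = conn.execute("SELECT COUNT(*) FROM subscribers").fetchone()[0]
print(count)
```

The sketch skips the hard parts of real integration work, such as authentication, pagination, rate limits, and schema changes, which is where much of the effort tends to go.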

If you have ever done systems integration work, you understand that data access efforts are non-trivial. Analysts or data scientists are often left to solve this critical step on their own, relying on their wrangling skills or an ad hoc data-munging workflow.

Making data available is just one of many wrangling examples that become blockers for your business analysts.

Teams are overwhelmed with data wrangling projects

It is not uncommon to find analysts and data scientists spending large amounts of time wrangling. In fact, between 50% and 80% of a team’s time can be spent munging (NY Times). They expend significant energy to clean, integrate, format, route, compile, and cross-reference the data.

The data work is often manual, time-consuming, and error-prone. The result is an expensive and brittle patchwork of code, files, and folders. If 80% of a week is spent wrangling, this leaves little time for analysis for someone who might be making $200,000 to $300,000 a year.
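To put rough numbers on that, the short Python sketch below multiplies the salary range by the 50% to 80% wrangling share cited above. It is a back-of-the-envelope estimate only: the `wrangling_cost` helper is hypothetical, and it assumes cost scales directly with salary, ignoring benefits and overhead.

```python
def wrangling_cost(annual_salary, wrangling_percent):
    """Return the portion of a salary spent on wrangling, in whole dollars."""
    return annual_salary * wrangling_percent // 100


# Salary range and wrangling shares taken from the figures cited in the text.
for salary in (200_000, 300_000):
    for percent in (50, 80):
        cost = wrangling_cost(salary, percent)
        print(f"${salary:,} salary at {percent}% wrangling: ${cost:,} per year")
```

At the 80% mark, that works out to $160,000 to $240,000 a year per person spent on preparation rather than analysis.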

They can quickly become disenchanted and frustrated. Why? Most of their time is spent doing the opposite of what they were hired to do. They deliver what they can with the time remaining.

This leads to pressure from decision-makers who become disillusioned with the outcomes they are receiving. As a result, team morale decreases, and the risk of losing valuable analytic talent increases. Your team will see leaving their jobs as a way to escape a situation they have no control over.

Solving barriers to data access

Teams are deeply invested in solving an organization’s wrangling or munging data challenges. A fundamental element of being data-driven is that data must be available before it can be made accessible.

For example, when Amazon delivers the laptop from a low-value location (warehouse) to a high-value location (home), they solve availability. However, Amazon has given you a box. While your cat may enjoy the box, you care about what is in the box.

That means there is still work to be done. You need to open the box, unpack the laptop, power it up, and configure and install your software. While your computer is available, it is not accessible until you finish these steps. Data is no different.

Data accessibility is defined as:

“the endeavor to make data easy to consume. The outcome is data that is approachable, comprehensible and usable.”

The NY Times article mentioned earlier notes that data analysis and science are “step-by-step experimentation processes.” This means there will always be some “hands-on” work with data. However, hands-on work should not mean being technically responsible for solving data availability and accessibility. If your team focuses on this low-level wrangling work, they will not deliver the outcomes you hired them to deliver.

Success requires that data wrangling work happen in advance of an analyst or scientist’s efforts. Data must be treated as a first-class citizen. Don’t fall into the trap of having analysts lead manual or labor-intensive technical steps.

Activities such as manual data cleaning or delving into the raw form of data as part of a series of wrangling steps should not be their responsibility. They should be contributors and mentors to those focused on removing those data barriers and technical challenges. This will ensure better alignment on how work is accomplished and by whom.

Don’t lose sight of a data accessibility outcome. Getting data availability and accessibility right is critical to a data-driven culture. Working back from this truth will afford your team greater agility and velocity.

Get Started on Your Data Discovery Journey Today: 30-day Free Trial

Understanding your team’s current data munging or wrangling techniques is critical to identifying areas for improvement or automation. It also ensures analysts and scientists are well-positioned to deliver the insights they love to find.

Get started with code-free, automated data wrangling software from Openbridge. Large data sets, small data sets, or anything in between, our platform supports a broad collection of open source data connectors that prepare data for storing in your private warehouse or lake.

From data validation to data enrichment and data structuring, we offer a 30-day free trial so you can see how we can solve your most complex data munging process challenges.
