12 Things To Consider When Rolling Your Own Data Pipelines

Thomas Spicer
Published in Openbridge
6 min read · Apr 5, 2017


While there are services (Openbridge!) that can help with your data integration efforts, sometimes individuals and organizations decide to build rather than buy. If you choose to forgo the “buy” route with commercial off-the-shelf (COTS) software, then you are one of those unique souls who gets great joy from pursuing custom-made or bespoke solutions. You’re not alone!

Here are 12 things to consider as you think about your data pipeline efforts:

1. Authentication

You will need to integrate with each source system’s authentication mechanism: OAuth, API keys, and username/password combinations. This includes handling expirations, de-authorizations, and hierarchical permission schemes. For example, a user authorizes you to collect data from Google Analytics in December. However, the user severs that authorization within their Google account three months later. What should happen in your system (see 2)?
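As a minimal sketch (the token endpoint, client credentials, and field names are placeholders, not any specific provider’s API), a refresh routine might mark an authorization as revoked the moment the provider rejects the refresh token:

import requests

TOKEN_URL = "https://oauth.example.com/token"  # placeholder; every provider has its own endpoint

def refresh_access_token(auth_record, client_id, client_secret):
    """Exchange a stored refresh token for a new access token, or flag the grant as revoked."""
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "refresh_token",
        "refresh_token": auth_record["refresh_token"],
        "client_id": client_id,
        "client_secret": client_secret,
    }, timeout=30)
    if resp.status_code == 200:
        auth_record["access_token"] = resp.json()["access_token"]
        auth_record["status"] = "active"
    elif resp.status_code in (400, 401):
        # The user severed the authorization on their side; stop scheduling
        # jobs for it and notify the owner (see 2).
        auth_record["status"] = "revoked"
    else:
        resp.raise_for_status()
    return auth_record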

2. User Management

Manage the linkage of a user’s authorization to a specific resource. For example, user A provides authorization to system X. Related to (1), you need to handle notifications for failed credentials, account renewals, and auditing the status of user-supplied authorizations. For example, user A might provide, or be able to provide, authorizations to 50 Facebook pages. What happens if something changes in 1, or in all 50, of those authorizations?
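One way to keep that manageable, sketched below with hypothetical names, is to track each user-to-resource authorization as its own record so every one of them can be audited and flagged independently:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ResourceAuthorization:
    user_id: str          # e.g., user A
    resource_id: str      # e.g., one specific Facebook page
    status: str           # "active", "expired", or "revoked"
    last_verified: datetime

def needs_attention(authorizations):
    """Return the authorizations to notify the user about, rather than silently losing a feed."""
    return [a for a in authorizations if a.status != "active"]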

3. Versioning

Keeping up to date with API versioning for the source systems can be a challenge. This may include forking vendor SDKs to resolve bugs/errors. It also includes resolving undocumented features, functionality, or workarounds needed to generate desired outputs. Not all SDKs are kept current or work as expected. Misspellings in APIs are especially fun. Looking at you, Sizmek! Are you keeping track of upstream code and operationally managing for change?
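One small, defensive habit (sketched here with a made-up header name and version) is to record which API version your integration was built against and warn when the source reports something different:

import logging

SUPPORTED_API_VERSION = "v3.2"  # hypothetical: the version this integration was built and tested against

def check_api_version(response_headers):
    """Warn when the source reports a version this pipeline has not been tested against."""
    reported = response_headers.get("X-Api-Version")  # header name varies by vendor
    if reported and reported != SUPPORTED_API_VERSION:
        logging.warning("Source API reports %s; this pipeline was built for %s",
                        reported, SUPPORTED_API_VERSION)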

4. Requests

What can that API provide you? How can you bend it to your will to generate the desired output? How does someone invoke the API to get that output? How does one even know what output is available? Do not underestimate the upfront time to research, plan, and design your integration. Make sure you understand all the requests to the API(s), including any inclusivity/exclusivity rules, the state of a request, the methods by which calls are made, and any schema/request definitions that are required.
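For instance, the hypothetical check below encodes one inclusivity/exclusivity rule up front, so an invalid request fails in your own code with a clear message rather than with an opaque API error:

# Hypothetical rule: this source API will not accept "city" and
# "postal_code" dimensions in the same request.
MUTUALLY_EXCLUSIVE = [{"city", "postal_code"}]

def validate_request(dimensions):
    """Raise before making the call if the request breaks a known rule."""
    requested = set(dimensions)
    for group in MUTUALLY_EXCLUSIVE:
        overlap = requested & group
        if len(overlap) > 1:
            raise ValueError(f"These dimensions cannot be combined: {sorted(overlap)}")

validate_request(["country", "city"])        # fine
# validate_request(["city", "postal_code"])  # raises ValueError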

5. Throttling/Limits

Each data source will have Terms of Service (ToS) that enforce limits or throttling on its APIs. Also, many APIs place significant limits on the scope of data that can be requested at any one time. These limits are also subject to change more frequently than you would prefer. What happens if you hit the limits? Does everything fail?
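A common mitigation, sketched below on the assumption that the source signals throttling with an HTTP 429 response, is to back off and retry instead of failing the whole run:

import time
import requests

def get_with_backoff(url, params, max_retries=5):
    """Retry a throttled request with exponential backoff rather than failing the run."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code != 429:  # 429 Too Many Requests means we hit a limit
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After when the API provides it; otherwise back off exponentially.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"Still throttled after {max_retries} attempts: {url}")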

6. Errors

Failure is not an edge case. You have to assume failure for all your API calls. Your systems will need to handle API unavailability, bad responses, undocumented changes, performance woes, malformed outputs, and incomplete responses. Lastly, upstream SDKs can and do contain bugs, which can complicate your efforts, especially depending on the vendor’s responsiveness to change/bug requests. How are you testing, especially while in production?
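Since you have to assume failure, it pays to validate every response before it reaches the warehouse. The sketch below uses hypothetical field names; the point is to treat a malformed or incomplete payload as an error to log and retry, not data to load:

REQUIRED_FIELDS = {"date", "pageviews", "sessions"}  # hypothetical fields for this source

def validate_response(payload):
    """Reject malformed or incomplete responses before they reach the warehouse."""
    if not isinstance(payload, dict) or "rows" not in payload:
        raise ValueError("Malformed response: missing 'rows'")
    for row in payload["rows"]:
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"Incomplete row, missing fields: {sorted(missing)}")
    return payload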

7. Availability

You should have a sense of when data becomes available in a source system. In many systems, data is not “real-time,” so you need to know when the “truth” has settled in the source system. Why does this matter? If you are calling too frequently for data that is unavailable or incomplete, you could be wasting precious API calls, increasing error rates, and reducing the quality of the data in your system. For example, you call system X every 6 hours, and it accepts the request and returns a well-formed response. However, that output contains inconsistent results: several records have 0 or NULL values, and those values get stored in your warehouse. The reason is that the truth has not settled in the source system; it honors your API call but is not returning the expected values. If you are pushing those values to users and executives, this can lead to confusion and create issues around confidence.
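One hedge, sketched below with an assumed 48-hour settlement window, is to treat recent dates as provisional and hold them back (or re-pull them later) until the source has had time to settle:

from datetime import date, timedelta

SETTLEMENT_WINDOW_DAYS = 2  # assumption: this source's numbers stabilize after ~48 hours

def dates_safe_to_load(requested_dates, today=None):
    """Filter out reporting dates that are too recent to be trusted yet."""
    today = today or date.today()
    cutoff = today - timedelta(days=SETTLEMENT_WINDOW_DAYS)
    return [d for d in requested_dates if d <= cutoff]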

8. Scheduling

When should something run? Should a request be time-shifted, for example, making a request on March 16th for data from March 14th? This is related to (7): some systems require you to time-shift the period of data you are requesting. The Facebook Graph API works best with time-shifted requests to ensure consistent output. This means understanding how jobs are set up and how they will trigger calls to an API at specific intervals. Don’t forget to keep track of tasks, requests, and outcomes so you can remediate any issues or failures.
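A minimal sketch of a time-shifted daily job, assuming the two-day shift from the March 16th/March 14th example above and an in-memory run log standing in for real job tracking:

from datetime import date, timedelta

TIME_SHIFT_DAYS = 2  # e.g., on March 16th, request data for March 14th
run_log = []         # in practice this lives in a database, not in memory

def run_daily_job(fetch, run_date=None):
    """Request the time-shifted reporting date and record the outcome for later remediation."""
    run_date = run_date or date.today()
    target = run_date - timedelta(days=TIME_SHIFT_DAYS)
    try:
        rows = fetch(target)  # `fetch` is a placeholder for the real API client
        run_log.append({"run": run_date, "target": target, "status": "ok", "rows": len(rows)})
    except Exception as exc:
        run_log.append({"run": run_date, "target": target, "status": "failed", "error": str(exc)})
        raise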

9. Schemas

If you are calling for (or receiving) data, it will have some structure. Most APIs will want you to define your request, and those requests have structure. The responses may have less structure, but generally there is something that those invoking an API and those responding to the request agree upon. The requests define the “payload” schemas, and they will usually align closely with the API’s response output definitions as well. For example, this request to Google Analytics defines not only what you are asking for but the shape of the output as well:

ids=profile_id,
start_date=start_date,
end_date=end_date,
metrics='ga:pageviews',
dimensions='ga:visitorType,ga:sourceMedium,ga:networkLocation,ga:country,ga:region,ga:city'

If you are using an RDBMS, schemas are important. Plan accordingly.
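If you are landing that output in an RDBMS, it helps to write the target schema down explicitly. A hypothetical mapping for the request above (column names and types are illustrative, not a Google Analytics specification) might look like:

# Hypothetical warehouse schema for the Google Analytics request above.
GA_TABLE_SCHEMA = {
    "visitor_type":     "VARCHAR(32)",
    "source_medium":    "VARCHAR(255)",
    "network_location": "VARCHAR(255)",
    "country":          "VARCHAR(64)",
    "region":           "VARCHAR(64)",
    "city":             "VARCHAR(64)",
    "pageviews":        "INTEGER",
}

def create_table_sql(table_name, schema):
    """Render a CREATE TABLE statement from the column mapping."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in schema.items())
    return f"CREATE TABLE {table_name} (\n  {cols}\n);"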

10. Encoding

Depending on your source system, encoding can be the Wild West. Even systems that commit to providing a standard encoding sometimes do not do so in all cases. You will likely need to process data from source systems to properly encode it to UTF-8 to ensure consistency. Bad things happen if your encoding is messed up.
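A small sketch of forcing everything to UTF-8 on the way in (the fallback encoding is an assumption; adjust it to what your sources actually send):

def to_utf8(raw_bytes):
    """Decode incoming bytes, trying UTF-8 first, then a common fallback, then re-encode as UTF-8."""
    for encoding in ("utf-8", "cp1252"):  # assumption: cp1252 is the usual offender
        try:
            return raw_bytes.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes rather than fail the whole load.
    return raw_bytes.decode("utf-8", errors="replace").encode("utf-8")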

11. Processing

The data delivered by the source system may need further processing (i.e., see Encoding). In some cases this might be minor; in other cases, more expansive. You will need to plan for activities such as appending metadata, aligning output formats, de-duplication, validating that outputs meet API specs/schemas, and so forth. For example, your systems go a bit whacky, and the same request to Google Analytics runs 100 times. You can load that data 100 times, or you can make sure, prior to or during the load, that you are not “polluting” the table your analysts rely on. Likewise, many data sources do not provide unique IDs for each record; this is something you would want to append to your data.
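One common approach, sketched here, is to derive a deterministic ID from each record’s contents so repeated pulls of the same rows collapse into one (assuming the records are JSON-serializable dictionaries):

import hashlib
import json

def record_id(record):
    """Derive a stable ID from the record's contents, since the source does not supply one."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Collapse duplicate pulls of the same rows before loading the warehouse table."""
    unique = {}
    for record in records:
        unique[record_id(record)] = record
    return list(unique.values())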

12. Asynchronous/Synchronous Operations

Having a desired output does not mean there is a single, unified source API that can produce it. In some cases, you might need to chain multiple source APIs, called in sequence or in parallel, to generate the desired output. For example, one API call may generate raw transactional data, but that data has no labels for certain keys. You need to call a second API to look up each ID and determine the human-readable label. YouTube does this with its feeds: you may get video or playlist IDs from one API, but you need to call a second API to determine the labels/names for those IDs. Also, all those API calls are counted against your quotas!
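A sketch of that chaining pattern, with both fetch functions left as hypothetical placeholders for real API clients:

def enrich_with_labels(fetch_ids, fetch_labels):
    """Chain two calls: one returns rows with raw IDs, the second resolves those IDs to names.

    Both arguments are placeholder callables standing in for real API clients.
    Remember that the lookup calls count against your quota, too.
    """
    rows = fetch_ids()                                  # e.g., playlist items carrying video IDs
    ids = sorted({row["video_id"] for row in rows})
    labels = fetch_labels(ids)                          # e.g., a batch lookup of titles keyed by ID
    return [dict(row, title=labels.get(row["video_id"], "UNKNOWN")) for row in rows]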

Summary

Taking care of these areas will help ensure you are not getting calls from angry data scientists yelling at you that the data is missing, wrong, incomplete, inconsistent…! They will come to rely on your work like electricity and water utilities. It just needs to work.

When Your Team Can’t Rely On Your Solution

Good luck with building a bespoke solution! It can be a lot of fun (until it’s not 😳). If you would rather skip the fun and go straight to realizing value from your data, give us a call!
