Hmmm.. Where should I put my Data Lake?

With both my Infrastructure Technical Architect and CTO hats on, a question I have been asking myself lately (and that my clients have been asking me too) is “Where do I put my Data Lake?”

Much like in your favourite Sim game, positioning your data lake in the right place from the outset takes some careful consideration, and getting it wrong could have some nasty long-term consequences.

Why? Well, you have at least 10 competing Financial Services requirements looming on the horizon:

Cost, including possibly unpredictable Public Cloud “Get Charges” for retrieving your own data
Avoiding Vendor Lock-In and ensuring diversity
Avoiding duplication and multi-instancing – Prod, DR, Snapshots, Backups, Archive
Performant multi-cloud application accessibility (including Hybrid and Private)
Competing Data Retention Policy (GDPR => delete it) vs Regulatory (MiFID II => keep it) drivers
Security breach under GDPR = Big £££.
The Secret Sauce in the “Public Cloud”?
The need for Metadata tagging and searching ALL data types simultaneously
Growing capacity demands – e.g. MiFID II storing voice calls and ALL related trade history data
New functional demands – e.g. MiFID II searching voice call data
Varying roles requiring different carefully controlled access e.g. Compliance vs. Back Office vs. Algorithmic Models
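
To make the “delete it” vs “keep it” tension in the list above a bit more concrete, here is a minimal Python sketch of how a retention decision might be made per record. Everything in it is an assumption for illustration: the `Record` fields, the `retention_action` function and the action names are hypothetical, and the hold period is a parameter you would set from your own compliance advice rather than a figure taken from any regulation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hypothetical metadata attached to one stored object."""
    key: str
    created: date
    # Years the record must be kept for regulatory reasons (assumed field).
    # None means no regulatory retention duty applies.
    regulatory_hold_years: Optional[int] = None
    # True once a data subject has asked for erasure (assumed field).
    erasure_requested: bool = False


def add_years(d: date, years: int) -> date:
    """Add whole years to a date, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def retention_action(record: Record, today: date) -> str:
    """Return 'keep', 'delete' or 'defer_delete' for one record."""
    if record.regulatory_hold_years is not None:
        hold_until = add_years(record.created, record.regulatory_hold_years)
        if today < hold_until:
            # The regulatory hold wins for now; the erasure is queued.
            return "defer_delete" if record.erasure_requested else "keep"
    return "delete" if record.erasure_requested else "keep"
```

The point of the sketch is only that both drivers have to live in the same metadata, so wherever the data lake ends up it needs tagging rich enough to carry them.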

The list goes on.

Well, it turns out that (as is always the way) the smartest guys out there, many of whom reside in Silicon Valley, have seen this problem coming. They have recently released some very interesting tech products that can serve as building blocks when designing your data landscape in a Public Cloud S3-compatible manner.

Examples of the functional value these technologies can bring to the party are:

Very large scale (500 TB and up), very low cost S3-compatible “On-Premise” storage that can serve as your Tier 2 “offline” and “near-line” Data Archive and can:
1. Give in-house code S3-compatible access and metadata tagging
2. Easily handle the HUGE NFS/CIFS File Services structures that challenge certain organization processes (e.g. Insurance)
3. Guarantee synchronous “data-durability”: AUTOMATIC peer-to-peer replication globally (over a link with up to 100 ms latency)
4. Offer an unbeatable 3 year TCO vs. the unpredictable PAYG Public Cloud price model and traditional On-Prem SAN
Super fast, super-searchable Tier 1 storage that can:
1. “Cache” your Most Recently Used “on-line” production data
2. Search ANY data type at 35,000 IOPS
3. Enable direct and performant “mounting” of huge databases, possibly eliminating the need for a dedicated DR datastore
4. Push your least recently used data off to Public Storage OR the S3-compatible Tier 2 storage above
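
The Tier 1 / Tier 2 interplay described above (cache the most recently used data, push the least recently used data down) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular product works: `TieredStore` is a hypothetical class, both tiers are plain in-memory dictionaries standing in for real storage, and Tier 1 capacity is counted in objects rather than bytes.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast Tier 1 in front of a large Tier 2."""

    def __init__(self, tier1_capacity: int):
        self.tier1_capacity = tier1_capacity  # max number of objects in Tier 1
        self.tier1 = OrderedDict()  # least recently used first
        self.tier2 = {}  # stand-in for the S3-compatible archive

    def put(self, key, value):
        """Write to Tier 1, then demote whatever no longer fits."""
        self.tier2.pop(key, None)  # avoid keeping a stale copy in Tier 2
        self.tier1[key] = value
        self.tier1.move_to_end(key)  # mark as most recently used
        self._demote()

    def get(self, key):
        """Read an object, promoting it to Tier 1 if it was archived."""
        if key in self.tier1:
            self.tier1.move_to_end(key)
            return self.tier1[key]
        value = self.tier2.pop(key)  # raises KeyError if unknown
        self.put(key, value)
        return value

    def _demote(self):
        """Push least recently used objects down until Tier 1 fits."""
        while len(self.tier1) > self.tier1_capacity:
            old_key, old_value = self.tier1.popitem(last=False)
            self.tier2[old_key] = old_value
```

With a capacity of 2, writing `a` and `b`, reading `a`, then writing `c` leaves `a` and `c` in Tier 1 and moves `b` to Tier 2, because `b` was the least recently used object at that point.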

Basically, these tools provide some “Hybrid Storage” options to complement your overall Hybrid Cloud design.

We’re showcasing some of these tech products at our Shard event on 30th November. Read more here.

Ray Bricknell