Thoughts on Data Engineering

Lessons learned building production data platforms at scale

📊
Data Quality 6 min read

Why Data Quality Can't Be an Afterthought Anymore

After validating over a trillion records at HSBC, I've learned one thing: bad data quality doesn't just slow you down - it breaks everything downstream. Here's what nobody tells you about building DQ frameworks that actually work in production.

The moment your pipeline hits production, you realize that tests in dev mean nothing. Real data is messy, schemas drift, and that "one-off exception" happens every single day. I've seen pipelines that worked perfectly for months suddenly fail because a vendor changed a date format without warning.
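To make the fail-fast idea concrete, here's a minimal sketch: whitelist the date formats a feed is allowed to use and raise the moment anything else shows up. The column name and formats are invented for illustration - this isn't the HSBC framework itself.

```python
import pandas as pd

# Hypothetical feed: only these formats are accepted; anything else is a hard failure.
ACCEPTED_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]

def parse_dates_or_fail(df: pd.DataFrame, column: str = "trade_date") -> pd.Series:
    for fmt in ACCEPTED_FORMATS:
        parsed = pd.to_datetime(df[column], format=fmt, errors="coerce")
        if parsed.notna().all():
            return parsed
    # No single accepted format parsed every value -- fail loudly with examples.
    sample = df[column].head(5).tolist()
    raise ValueError(f"{column}: unexpected date format(s), sample values: {sample}")
```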

🤖
AI & Data 8 min read

Your AI is Only as Good as Your Data Quality

Everyone's rushing to build AI agents and LLM applications, but here's the uncomfortable truth: if your data quality is garbage, your AI will be worse. I learned this the hard way experimenting with LLM-powered data quality classification.

We tried using GPT to automatically classify data quality rules. It worked brilliantly - until it didn't. The model kept hallucinating patterns that didn't exist in our data because the training examples had inconsistencies we didn't catch. Garbage in, garbage out isn't just a saying; it's physics for data systems.

⚡
Performance 7 min read

How I Cut Spark Runtime by 60% (And You Can Too)

Processing 250 million rows shouldn't take 4 hours. Here's exactly how I optimized our heaviest Spark jobs at HSBC - no magic, just proper partitioning, broadcast joins, and understanding what's actually happening under the hood.

The breakthrough came when I stopped treating Spark like a black box. Execution plans don't lie - they'll show you exactly where you're shuffling 50GB of data because someone wrote a GROUP BY on an unpartitioned column. One repartition() in the right place saved us 2.5 hours per run.
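As a rough sketch of what that looks like in practice - paths, column names, and the partition count below are made up - read the plan first, then fix the shuffle:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Illustrative path and columns -- not the actual HSBC job.
txns = spark.read.parquet("/lake/bronze/transactions")

# Step 1: read the physical plan before guessing. Exchange nodes mark shuffles.
daily = txns.groupBy("account_id").agg(F.sum("amount").alias("total"))
daily.explain()

# Step 2: if several downstream joins/aggregations hash on the same key,
# repartition on it once (and cache) so the shuffle happens a single time.
txns_by_account = txns.repartition(200, "account_id").cache()

# Step 3: small reference tables get broadcast instead of shuffled.
accounts = F.broadcast(spark.read.parquet("/lake/ref/accounts"))
enriched = txns_by_account.join(accounts, "account_id")
```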

๐Ÿ—๏ธ
Architecture 9 min read

Delta Lake Isn't Just for Data Scientists

I was skeptical about Delta Lake at first. "Just another format," I thought. Then I had to debug why our consistency checks were failing across Bronze and Gold layers. Turns out, ACID transactions in data lakes aren't just nice to have - they're essential.

The game-changer was time travel. When stakeholders asked "what changed between yesterday and today that broke our report?" I could actually show them - row by row, with version history. Try doing that with plain Parquet files. Delta turned our data lake from a dumping ground into an actual reliable source of truth.
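Here's roughly what answering that question looks like with Delta time travel - a minimal sketch with an illustrative table path and version number, assuming the delta-spark package is installed:

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-time-travel")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

GOLD = "/lake/gold/revenue"  # illustrative table path

# "What changed since yesterday?" -- compare the current table against an
# earlier version instead of guessing.
old = spark.read.format("delta").option("versionAsOf", 3).load(GOLD)
new = spark.read.format("delta").load(GOLD)
new.subtract(old).show()  # rows added or modified since version 3
old.subtract(new).show()  # rows removed or modified since version 3

# Full audit trail: which operation wrote what, and when.
DeltaTable.forPath(spark, GOLD).history().show(truncate=False)
```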

🎯
Lessons Learned 5 min read

Things I Wish I Knew Before Building a DQ Framework

Building HSBC's Data Quality platform taught me more about production data engineering than any course ever could. Here are the hard lessons: metadata-driven beats config files, failing fast beats silent errors, and observability isn't optional.

The biggest mistake? Treating data quality like unit tests. It's not. DQ is about business rules, not code correctness. When an analyst says "this revenue number looks wrong," you need dashboards showing exactly which validation failed, when, and what the expected value was. Logs don't cut it.
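One way to picture the difference: every validation run should emit a structured result you can put on a dashboard, not a log line. The record shape below is illustrative, not the production schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Illustrative validation result record: enough context for a dashboard to
# answer "which rule failed, on what, when, and what was expected".
@dataclass
class ValidationResult:
    rule_id: str    # e.g. "revenue_not_negative" (hypothetical rule name)
    table: str
    column: str
    expected: str   # human-readable expectation, e.g. ">= 0"
    observed: str   # what was actually found
    passed: bool
    run_ts: str

result = ValidationResult(
    rule_id="revenue_not_negative",
    table="gold.daily_revenue",
    column="revenue",
    expected=">= 0",
    observed="3 rows < 0",
    passed=False,
    run_ts=datetime.now(timezone.utc).isoformat(),
)
print(asdict(result))  # ship this to a metrics store, not a log file
```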

โ˜๏ธ
Cloud Engineering 7 min read

GCP vs AWS for Data Pipelines: What Actually Matters

After building pipelines on both platforms, here's what the blog posts don't tell you: the choice matters less than you think, but when it matters, it really matters. BigQuery's nested data handling saved our lives. S3's simplicity did the same for others.

The real difference? BigQuery's separation of storage and compute means you're not paying for idle clusters. But AWS Glue's serverless Spark means you're not managing Dataproc clusters either. Both solve the same problem differently. Pick based on your team's skills, not the hype cycle.

🔄
Real-time Data 6 min read

Kafka for Data Engineers: Beyond the Marketing Hype

We added Kafka to stream real-time data quality metrics. It worked, but not for the reasons the documentation suggests. Here's what Kafka is actually good for in data engineering - and when you absolutely don't need it.

Real talk: most "real-time" requirements aren't actually real-time. Business users say they need live data, but what they really mean is "faster than the 6-hour batch we have now." Kafka is incredible when you genuinely need sub-second latency. For everything else, micro-batching with Spark Structured Streaming is simpler.
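For the "faster than the 6-hour batch" case, a micro-batch Structured Streaming job is usually plenty. A minimal sketch, assuming the spark-sql-kafka connector is on the classpath; the topic, broker, paths, and trigger interval are all illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-metrics-stream").getOrCreate()

# Read the (hypothetical) dq-metrics topic as a stream of key/value records.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "dq-metrics")
    .load()
)

# Land the payloads as Parquet once a minute -- "near real-time" in practice.
query = (
    events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("parquet")
    .option("path", "/lake/bronze/dq_metrics")
    .option("checkpointLocation", "/chk/dq_metrics")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```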

🧪
Best Practices 8 min read

Great Expectations: A Love-Hate Story

I championed Great Expectations at HSBC, built our entire DQ framework on it, and validated 1.36 trillion records with it. Would I do it again? Yes. Was it painful? Absolutely. Here's the unfiltered truth about production GE.

The framework is powerful but opinionated. Custom expectations are where the real value is, but the documentation assumes you already know what you're doing. We ended up writing our own SQL-based expectations because the built-in ones couldn't handle our complex cross-table validations. Once you embrace that, GE becomes incredibly powerful.
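To show the underlying pattern - not Great Expectations' actual custom-expectation API - here's a sketch of a cross-table SQL check wrapped in a pass/fail result. The tables, columns, and rule name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-table-check").getOrCreate()

# Hypothetical rule: every trade landed in Bronze must reach the enriched Gold table.
CHECK_SQL = """
SELECT COUNT(*) AS missing
FROM bronze.trades t
LEFT JOIN gold.trades_enriched g ON t.trade_id = g.trade_id
WHERE g.trade_id IS NULL
"""

def expect_no_missing_trades() -> dict:
    """Run the SQL check and wrap it in an expectation-style result."""
    missing = spark.sql(CHECK_SQL).collect()[0]["missing"]
    return {
        "expectation": "every bronze trade reaches gold",
        "observed_missing": missing,
        "success": missing == 0,
    }

print(expect_no_missing_trades())
```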