System Design Key Concepts in Data Architecture and Data Infra

Data Cat
6 min read · Jan 30, 2024

Hi everyone. In this post, I will summarize the concepts I recommend knowing and understanding for data architecture and data infra system design.

“Disclaimer: The views and opinions expressed in this blog post are solely my own and do not reflect those of any entity with which I have been, am now, or will be affiliated. This content was written during a period in which the author was not affiliated with, and did not belong to, any organization that could influence their perspectives. As such, these are the author’s personal insights, shared without any external bias or influence.”

1. Data Lake vs. Data Warehouse, and what kind of dataset to store in each of them.

In many cases, you will likely use both. First, you ingest unstructured or semi-structured data into the data lake. Parquet is the recommended file format there because of its efficient I/O. Then your data processing layer (e.g., Spark) reads from the data lake, processes the data, and either writes the result to another data lake or inserts it into the data warehouse. Finally, you run aggregation functions or queries in the data warehouse; columnar data warehouses in particular are optimized for aggregation. If your query involves complex joins, it is recommended to do the aggregation in the data lake rather than in the data warehouse, because join…
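To make the lake, processing layer, and warehouse flow more concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the sample events, the function names (`ingest_to_lake`, `process`, `load_and_aggregate`), JSON lines standing in for Parquet files in the lake, and an in-memory SQLite database standing in for the warehouse. SQLite is row-oriented, not columnar, so this shows only the shape of the flow, not the performance characteristics of a real warehouse.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events; in practice these would come from an ingestion job.
RAW_EVENTS = [
    {"user_id": "a", "amount": "10.5", "meta": {"country": "US"}},
    {"user_id": "b", "amount": "4.0", "meta": {"country": "JP"}},
    {"user_id": "a", "amount": "2.5", "meta": {"country": "US"}},
    {"user_id": "c", "amount": None, "meta": {}},  # malformed record
]


def ingest_to_lake(lake_dir: Path, events) -> Path:
    """Step 1: land semi-structured records in the 'lake' as-is.

    JSON lines are used here as a stand-in; the post recommends Parquet.
    """
    path = lake_dir / "events.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return path


def process(path: Path):
    """Step 2: processing layer (a stand-in for Spark).

    Parses each record, drops malformed rows, and flattens to fixed columns.
    """
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("amount") is None:
                continue  # skip records that cannot be aggregated
            country = event["meta"].get("country")
            rows.append((event["user_id"], country, float(event["amount"])))
    return rows


def load_and_aggregate(rows):
    """Step 3: insert into the 'warehouse' and run the aggregation query there."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
    conn.execute(
        "CREATE TABLE purchases (user_id TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT country, SUM(amount) FROM purchases "
        "GROUP BY country ORDER BY country"
    ).fetchall()
    conn.close()
    return result


def run_pipeline():
    with tempfile.TemporaryDirectory() as tmp:
        lake_file = ingest_to_lake(Path(tmp), RAW_EVENTS)
        rows = process(lake_file)
    return load_and_aggregate(rows)


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the split is that the lake keeps the raw, loosely structured records, while the warehouse only ever sees cleaned rows with a fixed schema, which is what makes the final `GROUP BY` simple.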
