analytics

I’ve Operated Petabyte-scale ClickHouse® Clusters for 5 Years

The author has operated ClickHouse® clusters for 5 years and shares lessons on architecture, storage, upgrades, config, costs, and ingestion. ClickHouse manages data with replicas and shards, though cloud storage is preferred for cost efficiency. Ingestion is the critical piece and a frequent source of data loss or corruption. Upgrades are complex but manageable with careful planning. Operating a cluster efficiently requires understanding the source code and using CI/CD for testing. Balancing cost, performance, and configuration is essential, and effective ingestion is key to stability.

https://www.tinybird.co/blog-posts/what-i-learned-operating-clickhouse

Mastering the Poisson Distribution: Intuition and Foundations

This piece explores the Poisson distribution for modeling count data in settings such as online marketplaces, sports, and queuing. It covers the formula, the assumptions, and when to use the distribution, along with limitations such as overdispersion. Extensions such as time-varying and mixed Poisson distributions address real-life deviations from those assumptions. For inference, the advice is to keep models simple or to use Bayesian models.
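As a rough illustration of the formula and the overdispersion caveat mentioned in the summary, here is a minimal Python sketch. It is not taken from the linked article: the function names `poisson_pmf` and `dispersion_ratio` and the sample counts are made up for this example. It relies on the standard facts that the Poisson probability of observing k events is λ^k · e^(−λ) / k!, and that a Poisson variable has variance equal to its mean, so a variance-to-mean ratio well above 1 hints at overdispersion.

```python
import math
from statistics import mean, variance


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def dispersion_ratio(counts: list[int]) -> float:
    """Sample variance divided by sample mean.

    A Poisson variable has variance equal to its mean, so a ratio
    well above 1 suggests overdispersion.
    """
    return variance(counts) / mean(counts)


# Hypothetical daily order counts, invented for illustration only.
daily_orders = [0, 0, 0, 12]

print(poisson_pmf(2, 3.0))             # chance of exactly 2 events when 3 are expected
print(dispersion_ratio(daily_orders))  # far above 1, so a plain Poisson fits poorly
```

A ratio near 1 is consistent with the Poisson assumption; a much larger one is the kind of deviation the mixed Poisson extensions are meant to handle.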

https://towardsdatascience.com/mastering-the-poisson-distribution-intuition-and-foundations/

Why DuckDB Is My First Choice for Data Processing

DuckDB is my preferred data processing tool for its simplicity, speed, and features. It's an open-source SQL engine that runs in-process and is optimized for analytics, making operations like joins and aggregations fast. DuckDB installs easily via Python with no dependencies, speeds up CI testing, and simplifies SQL writing. Its friendly SQL dialect, support for various file types, and full ACID compliance enhance its usability in data pipelines. Additionally, it has robust documentation and community support, and it allows building high-performance UDFs, making it a strong choice over other engines like Spark or Postgres.

https://www.robinlinacre.com/recommend_duckdb/

Clickbench Says Postgres Is a Great Analytics Database

Clickbench ranks Postgres highly for analytics once it is extended with pg_mooncake. Contrary to the traditional view of Postgres as an OLTP database, its extensibility allows it to perform comparably to specialized analytics systems. Key advancements include a columnstore format and vectorized execution through an embedded DuckDB for efficient data processing. This new capability retains Postgres's flexibility while streamlining the data stack.

https://www.mooncake.dev/blog/clickbench-v0.1
