Which file format is recommended for data files stored in S3 to enable efficient querying via a SQL endpoint?

Prepare for the Fabric Analytics Engineer Associate Test with comprehensive materials. Explore flashcards, multiple choice questions, and detailed explanations. Get ready for your success!

Explanation:
The recommended format is Parquet. For SQL-style analytics, the format should let the engine read only the data a query needs. Parquet is a columnar storage format: data is stored column by column rather than row by row. This enables column pruning (reading only the columns a query references) and predicate pushdown (applying filters while the data is read, so non-matching data can be skipped). Both reduce the amount of data scanned from S3 and speed up queries. The columnar layout also compresses well, because values of the same type are stored together.
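The effect of column pruning can be illustrated with a toy model in plain Python. This is only a sketch, not how Parquet actually encodes data: the table, the column names, and the `encode` helper are made up for illustration. It compares how many bytes a single-column query has to touch in a row layout versus a column layout, and shows how a low-cardinality column chunk compresses.

```python
import zlib

# Hypothetical table: 1,000 orders with three columns.
rows = [(i, "EU" if i % 2 == 0 else "US", round(i * 1.5, 2)) for i in range(1000)]

def encode(values):
    """Serialize a sequence of values as newline-separated text bytes."""
    return "\n".join(str(v) for v in values).encode("utf-8")

# Row-oriented layout: all fields of a row are stored together,
# so reading one column means scanning every row in full.
row_layout = encode(",".join(str(field) for field in row) for row in rows)

# Column-oriented layout: each column is stored as its own contiguous chunk.
column_layout = {
    "order_id": encode(r[0] for r in rows),
    "region": encode(r[1] for r in rows),
    "amount": encode(r[2] for r in rows),
}

# SELECT SUM(amount): the row layout forces a full scan,
# while the column layout only needs the "amount" chunk.
bytes_scanned_row = len(row_layout)
bytes_scanned_column = len(column_layout["amount"])

# Similar values sit next to each other in a column chunk,
# which tends to compress well (here, the low-cardinality "region" column).
region_raw = len(column_layout["region"])
region_compressed = len(zlib.compress(column_layout["region"]))

print(f"row layout scan:    {bytes_scanned_row} bytes")
print(f"column layout scan: {bytes_scanned_column} bytes")
print(f"region column:      {region_raw} -> {region_compressed} bytes compressed")
```

The gap between the two scan sizes grows with the number of columns the query does not touch, which is why wide analytical tables benefit most.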

In a data lake on S3, this efficiency matters for both cost and performance. Parquet files also store a schema and per-column statistics (such as minimum and maximum values) that help the SQL engine decide early which data can be skipped. The format also handles nested data well, which is common in analytics workloads.
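A minimal sketch of statistics-based skipping, assuming a simplified model in which each chunk of rows carries its own minimum and maximum. The `RowGroup` class and both helper functions are hypothetical and only mimic the idea; real engines read these statistics from the file metadata.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics stored alongside it."""
    values: list
    min_value: int
    max_value: int

def build_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups

def scan_greater_than(groups, threshold):
    """Return matching values and the number of groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        # The statistics prove no value in this group can match: skip it.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read

# Hypothetical sorted column of 1,000 values in 10 row groups of 100.
groups = build_row_groups(list(range(1000)), group_size=100)
matches, groups_read = scan_greater_than(groups, threshold=949)

print(f"read {groups_read} of {len(groups)} row groups, {len(matches)} matches")
# prints "read 1 of 10 row groups, 50 matches"
```

Skipping works this well here because the values are sorted. If matching values were scattered across every group, the min/max ranges would overlap the filter everywhere and nothing could be skipped.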

CSV is row-oriented text with no embedded schema or data types, so the engine has to read and parse entire files and cannot prune columns. JSON is flexible but text-based and not columnar, which leads to heavier parsing and larger scans. Avro is a binary format that is efficient for row-oriented access, but it does not offer Parquet's columnar benefits for queries that select only a few columns.
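The cost of the text formats can be seen with a small standard-library sketch. The records and field names below are made up for illustration. The same data is written as CSV and as JSON Lines, and a single column is then summed from the CSV text.

```python
import csv
import io
import json

# Hypothetical records; the field names are made up for illustration.
records = [{"order_id": i, "region": "EU", "amount": i * 2} for i in range(1000)]

# CSV: the header is written once, then one delimited line per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["order_id", "region", "amount"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON Lines: every record repeats its field names, so the text is larger.
json_text = "\n".join(json.dumps(record) for record in records)

# Summing one column from CSV still parses every field of every row,
# and each value arrives as a string that has to be converted.
fields_parsed = 0
total = 0
for row in csv.DictReader(io.StringIO(csv_text)):
    fields_parsed += len(row)
    total += int(row["amount"])

print(f"CSV size:  {len(csv_text)} characters")
print(f"JSON size: {len(json_text)} characters")
print(f"parsed {fields_parsed} fields to sum one column; total = {total}")
```

All 3,000 fields are parsed even though the query needs only the 1,000 `amount` values, which is the work a columnar format avoids.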
