
Types of Views in Apache Spark SQL

Apache Spark supports three types of views—temporary, global temporary, and persistent—each fitting different lifecycle and sharing needs. This guide focuses on persistent views: how to create them, manage them with SQL and PySpark, and understand when they’re preferable to tables or temp views.
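Before narrowing to persistent views, here is a minimal sketch of all three types, assuming a Hive-enabled SparkSession named spark and an existing orders table:

orders_df = spark.table("orders")

# Temporary view: visible only within this SparkSession
orders_df.createOrReplaceTempView("orders_temp")

# Global temporary view: shared across sessions of the same application,
# addressed through the reserved global_temp database
orders_df.createOrReplaceGlobalTempView("orders_global")
spark.sql("SELECT COUNT(*) FROM global_temp.orders_global")

# Persistent view: recorded in the metastore, survives restarts
spark.sql("CREATE OR REPLACE VIEW orders_view AS SELECT * FROM orders")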

If you use a Hive-compatible metastore (or Unity Catalog/Glue), persistent views can be shared across Spark applications and survive restarts. They’re ideal for governed analytics, BI consumption, and metadata-driven pipelines. This post includes ready-to-copy examples, management commands, and a full end-to-end demo you can run locally.

Basic Syntax

Basic DDL for creating persistent views:

-- Create a persistent view
CREATE OR REPLACE VIEW persistent_orders AS
SELECT * FROM orders WHERE status = 'COMPLETE';

-- Create view with specific database
CREATE OR REPLACE VIEW sales_db.completed_orders AS
SELECT order_id, customer_id, total_amount, order_date 
FROM orders 
WHERE status = 'COMPLETE';

Using Spark SQL in PySpark

Example: creating and querying a persistent view from Python.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PersistentViews") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("""
CREATE OR REPLACE VIEW persistent_orders AS
SELECT order_id, customer_id, total_amount, order_date 
FROM orders 
WHERE status = 'COMPLETE'
""")

result = spark.sql("SELECT * FROM persistent_orders")
result.show()

Programmatic Approach with DataFrame API

You can prototype the logic with the DataFrame API, but note a common pitfall: Spark does not allow a persistent view to reference a temporary view (it fails with an AnalysisException). To persist the logic, express the same filter in SQL against the base table:

orders_df = spark.table("orders")
completed_orders = orders_df.filter(orders_df.status == 'COMPLETE')

# A temp view is fine for session-scoped exploration
completed_orders.createOrReplaceTempView("temp_completed_orders")

# But a persistent view must reference persistent objects, so define it
# against the base table rather than the temp view
spark.sql("""
CREATE OR REPLACE VIEW persistent_orders AS
SELECT * FROM orders WHERE status = 'COMPLETE'
""")

View Management Commands

Handy SQL commands for working with views:

SHOW VIEWS;
SHOW VIEWS IN sales_db;
SHOW CREATE TABLE persistent_orders;  -- Spark has no SHOW CREATE VIEW; use SHOW CREATE TABLE on the view
DROP VIEW persistent_orders;
DROP VIEW IF EXISTS persistent_orders;
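
If you prefer the programmatic route, the PySpark Catalog API offers rough equivalents; a sketch using the public pyspark.sql.catalog methods (tableExists requires Spark 3.3+):

# List tables and views in the current database; tableType distinguishes
# VIEW entries from MANAGED/EXTERNAL tables
for t in spark.catalog.listTables():
    print(t.name, t.tableType)

# Check whether a view exists before querying it
if spark.catalog.tableExists("persistent_orders"):
    spark.table("persistent_orders").show()

# Temp views have a dedicated drop helper; persistent views are dropped in SQL
spark.catalog.dropTempView("temp_completed_orders")
spark.sql("DROP VIEW IF EXISTS persistent_orders")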

Advanced View Options

Views can include comments or aggregate logic:

-- With comment
CREATE OR REPLACE VIEW persistent_orders 
COMMENT 'View of completed orders'
AS
SELECT order_id, customer_id, total_amount, order_date 
FROM orders 
WHERE status = 'COMPLETE';

-- With aggregates
CREATE OR REPLACE VIEW order_summary AS
SELECT 
    customer_id,
    COUNT(*) as total_orders,
    SUM(total_amount) as total_spent,
    AVG(total_amount) as avg_order_value
FROM orders 
WHERE status = 'COMPLETE'
GROUP BY customer_id;

Persistence Characteristics

Persistent Views                      | Temporary Views
--------------------------------------|----------------------------------------
Survive Spark session restarts        | Last only for the Spark session
Stored in the Hive metastore          | Not stored in the metastore
Accessible across Spark applications  | Scoped to the current session
Require enableHiveSupport()           | Created via createOrReplaceTempView()
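
To make the lifecycle difference concrete, a sketch of what happens across a restart (the second half is shown as comments because it belongs to a new run):

# First run: create one of each
spark.table("orders").createOrReplaceTempView("orders_temp")
spark.sql("CREATE OR REPLACE VIEW orders_view AS SELECT * FROM orders")

# Second run (new application, Hive support enabled):
# spark.table("orders_temp")  -> AnalysisException: table or view not found
# spark.table("orders_view")  -> resolves; definition is read from the metastore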

Complete Example

An end-to-end demo with sample data and persistent view creation:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder \
    .appName("SQLViewsExample") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

orders_data = [
    (1, 101, 150.50, 'COMPLETE', '2024-01-15'),
    (2, 102, 89.99, 'PENDING', '2024-01-15'),
    (3, 101, 230.00, 'COMPLETE', '2024-01-16'),
    (4, 103, 45.25, 'COMPLETE', '2024-01-16'),
    (5, 102, 199.99, 'CANCELLED', '2024-01-17')
]

schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("total_amount", DoubleType(), True),
    StructField("status", StringType(), True),
    StructField("order_date", StringType(), True)
])

orders_df = spark.createDataFrame(orders_data, schema)
orders_df.write.mode("overwrite").saveAsTable("orders")

spark.sql("""
CREATE OR REPLACE VIEW persistent_orders AS
SELECT order_id, customer_id, total_amount, status,
       CAST(order_date AS DATE) as order_date
FROM orders 
WHERE status = 'COMPLETE'
""")

spark.sql("SELECT * FROM persistent_orders").show()
spark.sql("SHOW CREATE VIEW persistent_orders").show(truncate=False)

Key Points

  • Use CREATE OR REPLACE VIEW for idempotency.
  • Enable Hive support for persistence across sessions.
  • Views are read-only: they store the query definition, not data.
  • Underlying data changes are reflected automatically (see the sketch after this list).
  • Use descriptive names and comments for maintainability.
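
A quick illustration of the auto-refresh point, reusing the orders table and persistent_orders view from the demo (the inserted row is made up for this sketch):

# Views hold no data: count before adding a row...
spark.sql("SELECT COUNT(*) AS n FROM persistent_orders").show()  # n = 3

# ...append a new completed order to the underlying table...
spark.sql("INSERT INTO orders VALUES (6, 104, 75.00, 'COMPLETE', '2024-01-18')")

# ...and the view reflects it immediately, with no maintenance step
spark.sql("SELECT COUNT(*) AS n FROM persistent_orders").show()  # n = 4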

Conclusion

Persistent views are a lightweight way to centralize business logic in Spark without copying data. They work best when you have a metastore configured and multiple teams or jobs need a consistent, versioned definition of a dataset. Use comments and clear naming conventions, and keep the view definitions small and composable so they’re easy to audit and evolve.

For heavy transformations or performance-sensitive queries, consider materialized tables or incremental pipelines and use views as the semantic layer. Pair views with access control (Ranger/Unity Catalog), quality checks, and CI validation of SQL to make them reliable interfaces between producers and consumers across your Spark platform.

FAQ

Are views faster than tables?

No. A view is just a stored query; performance depends on underlying tables, partitioning, caching, and the query plan.
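
If the same view is scanned repeatedly within an application, one option is to cache its result set for the session; a sketch using standard Spark SQL caching (the cache itself is never persistent):

# Materialize the view's rows in memory for this application
spark.sql("CACHE TABLE persistent_orders")

# ... repeated queries now read from the cache ...
spark.sql("SELECT COUNT(*) FROM persistent_orders").show()

# Release the memory when done
spark.sql("UNCACHE TABLE persistent_orders")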

Do I need Hive support?

For persistent views, yes—enable Hive support or use a compatible catalog like AWS Glue or Unity Catalog.

Do views update automatically?

Yes. Because they store the definition, not data, they reflect the latest source data at query time.

How do I migrate from temp to persistent?

Recreate the same SELECT logic as a persistent view (a persistent view cannot reference the temp view directly), then point downstream consumers at it.
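
A minimal sketch of the migration (names are illustrative); note that the persistent view repeats the SELECT against the base table:

# Before: session-scoped definition
spark.sql("""
CREATE OR REPLACE TEMP VIEW completed_orders_tmp AS
SELECT * FROM orders WHERE status = 'COMPLETE'
""")

# After: the same SELECT, now stored in the metastore
spark.sql("""
CREATE OR REPLACE VIEW completed_orders AS
SELECT * FROM orders WHERE status = 'COMPLETE'
""")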