Working with data in Python often starts with tools like pandas. They are intuitive, powerful, and perfect for small to medium-sized datasets. But as soon as your data grows beyond what fits comfortably in memory, performance issues begin to surface. This is where PySpark comes in.
Note that in this article I’ll often use the terms Spark and PySpark interchangeably. For our purposes, the distinction doesn’t matter, but you should remember that they are different: Spark is the overarching distributed computing framework (written in Scala), while PySpark is its dedicated Python API.
What is PySpark?
PySpark is the Python API for Apache Spark, a distributed computing framework for efficiently processing large volumes of data. Instead of running all computations on a single machine, Spark spreads the work across multiple machines (a cluster), allowing you to process data at scale while writing code that still feels familiar to Python users.
One of the key advantages of PySpark is that it abstracts away much of the complexity of distributed systems. You do not need to manually manage threads, memory, or network communication. Spark handles these concerns for you, while you focus on describing what you want to do with the data rather than how it should be executed.
If you are a complete newcomer to Spark, there are three core ideas you should learn before using it. These are:
1. Clusters
When people hear that Spark runs on a “cluster,” it can sound intimidating. In practice, you do not need deep knowledge of distributed systems to get started. A cluster is simply a group of networked servers that can collaborate. In a Spark application running on a cluster, one machine acts as the driver, coordinating the work, while the others act as executors, performing computations on chunks of the data. When the executors have finished their work, they signal back to the driver, which can then do whatever is needed with the final result set.
            ┌───────────────────┐
            │      Driver       │
            │(your PySpark app) │
            └─────────┬─────────┘
                      │
                      │ The Driver farms out work
                      │ to one or more executors
     ┌────────────────┼────────────────┐
     │                │                │
┌────▼─────────┐ ┌────▼─────────┐ ┌────▼─────────┐
│  Executor 1  │ │  Executor 2  │ │  Executor N  │
│ processes a  │ │ processes a  │…│ processes a  │
│ part of the  │ │ part of the  │ │ part of the  │
│     data     │ │     data     │ │     data     │
└──────────────┘ └──────────────┘ └──────────────┘
Just remember, you do not need to run Spark on a physical compute cluster. When you run PySpark locally, Spark simulates a cluster on your laptop or PC using multiple cores. One of the strengths of PySpark is that the same code can later be deployed to a real cluster, whether in the cloud or on-premises, with only very minor changes.
This separation of coordination and execution enables Spark to scale. As datasets grow, more executors can be added to process data in parallel, reducing runtime without requiring changes to your code.
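To make that concrete, here is a minimal sketch of the idea (we’ll set up a working environment properly later in the article). The spark:// master URL in the comment is a hypothetical example; a real address would come from your own cluster.

from pyspark.sql import SparkSession

# "local[*]" asks Spark to simulate a cluster locally,
# using one worker thread per CPU core on this machine
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ClusterIntro") \
    .getOrCreate()

# Moving to a real cluster later usually means changing only the master URL,
# e.g. .master("spark://my-cluster-host:7077")  # hypothetical standalone cluster

print(spark.sparkContext.defaultParallelism)  # how many cores Spark will use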
2. The Spark DataFrame
At the heart of PySpark is the DataFrame API, which is the main way you work with data in Spark. A DataFrame is simply a table of data, made up of rows and columns — very similar to a table in a database or a DataFrame in pandas. If you have used SQL or pandas before, the basic ideas will feel familiar.
With Spark DataFrames, you can perform common data tasks such as filtering rows, selecting columns, grouping data, joining tables, and calculating summaries like counts or averages. These operations are easy to read and write, allowing you to focus on what you want to do with the data rather than the technical details of how it runs.
What makes Spark special is what happens behind the scenes. Spark automatically determines the most efficient way to run your DataFrame operations and then executes them in parallel across multiple computers in a cluster. You don’t need to manage this yourself — Spark handles things like splitting the data, coordinating the work, and recovering from failures if something goes wrong.
Because of this, Spark DataFrames can handle very large datasets, even those too large to fit in memory on a single machine. At the same time, they provide a simple and familiar interface, making PySpark a powerful yet approachable tool for working with big data.
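To give a flavour of the API, here is a small sketch of the everyday operations just described. The people data and column names are invented for illustration, and it assumes a running SparkSession named spark (created in the examples later in this article).

from pyspark.sql import functions as F

# A small, invented dataset for illustration
people = spark.createDataFrame(
    [("Alice", "London", 34), ("Bob", "London", 45), ("Carol", "Paris", 29)],
    ["name", "city", "age"],
)

# Filter rows, group them, and summarise: the everyday operations
adults_by_city = (
    people
    .filter(F.col("age") >= 30)          # keep rows where age >= 30
    .groupBy("city")                     # group the remaining rows by city
    .agg(F.count("*").alias("n"),        # count people per city
         F.avg("age").alias("avg_age"))  # average age per city
)

adults_by_city.show()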
3. Lazy vs eager evaluation
Another strength of PySpark worth knowing is its approach to lazy versus eager execution.
Most Python data libraries, like pandas, use eager execution. This means that when you run an operation, it executes immediately, followed by the next operation, and so on.
PySpark handles this differently, using a technique called lazy execution. When you write data transformations, such as selecting columns or filtering rows, Spark does not execute them immediately. Instead, it builds an optimised execution plan and runs the computation only when an action (such as showing results or writing data to disk) is triggered. This lets Spark optimise the whole workflow before execution, making your code more efficient without extra effort on your part.
Eager execution (e.g. pandas)

data ──filter──► result (computed immediately)

In pandas, each operation runs as soon as it is called. This is intuitive but can be inefficient for large datasets. PySpark, by contrast, uses lazy execution.

Lazy execution (PySpark)

data ──filter──► (no execution yet)
        │
        └──groupby──► (plan builds here)
               │
               └──agg──► (still no execution)
                    │
                    └──action──► executes here
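You can observe this for yourself. In the sketch below (again assuming a SparkSession named spark), the transformation lines return almost instantly because no data is touched; only the final show(), an action, triggers execution.

from pyspark.sql import functions as F

# A million synthetic rows; spark.range() is a quick way to generate test data
df = spark.range(1_000_000).withColumn("value", F.col("id") % 100)

# Transformations: these only build the plan, nothing runs yet
filtered = df.filter(F.col("value") > 50)
grouped = filtered.groupBy("value").count()

# Action: Spark now optimises the whole plan and executes it
grouped.show(5)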
To drive this point home, consider the following scenario. Let’s say we have a 10-million-record dataframe that we want to …
a) Add a new empty column to it called X
b) Filter the data in some way that causes us to remove 50% of the records
c) Perform an aggregation on the remaining records so that the new column X contains the MAX value of another column in that row
d) Print out the row with the highest value of X
On a system that uses eager execution, like pandas, every step is performed exactly as we’ve outlined above. For 10 million records, it would look like this:
- Add Column: The system creates a new version of the 10-million-row dataset in memory, adding column X.
- Filter: The system filters all 10 million rows, resulting in 5 million deletions, and writes a new 5-million-row dataset to memory.
- Aggregation: It calculates the MAX value for every row and updates the column.
- Print: It finds the top row and shows it to you.
The problem is that we have done a massive amount of “heavy lifting” (adding a column to 10 million rows) only to immediately throw away half of that work in the next step.
Spark, on the other hand, because of its lazy execution model, doesn’t do any work when you define steps (a), (b), or (c). Instead, it builds a Logical Plan (also called a DAG — Directed Acyclic Graph) to do the work.
When you finally trigger step (d) – the Action – Spark’s optimiser looks at the whole plan and realises it can work much smarter:
- Predicate Pushdown: Spark sees the filter (remove 50% of records). Instead of adding column X to 10 million rows, it moves the filtering to the very beginning.
- Optimisation: It only adds column X and aggregates the remaining 5 million rows.
- Result: It avoids processing 5 million records, saving 50% of memory and CPU time.
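You don’t have to take the optimiser on trust. DataFrame.explain() prints the plan Spark intends to run. In this sketch (synthetic data, invented column names, and again assuming a SparkSession named spark), the filter is written after the new column is added, yet the optimised plan will typically show the Filter pushed below the Project:

from pyspark.sql import functions as F

# Synthetic stand-in for the 10-million-row dataset
df = spark.range(10_000_000).withColumn("amount", (F.col("id") % 1000).cast("double"))

# Written order: (a) add empty column X, then (b) filter
plan = df.withColumn("X", F.lit(None).cast("double")) \
         .filter(F.col("amount") > 500.0)

# extended=True prints the parsed, analysed, optimised and physical plans;
# look for the Filter sitting beneath the Project in the optimised plan
plan.explain(extended=True)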
Setting up the dev environment
Ok, that’s enough theory. Let’s look at how to get PySpark installed on your system and run some example code snippets. For an introductory text like this, creating a real-world multi-node cluster is beyond our scope. But as I mentioned before, Spark can simulate a cluster on your PC or laptop as long as it has multiple cores, which it will unless your system is more than about ten years old.
The first thing we’ll do is set up a separate development environment for this work, ensuring our projects are siloed and do not interfere with each other. I’m using WSL2 Ubuntu for Windows and Conda for this part, but feel free to use whichever environment and method you’re accustomed to.
Install PySpark, etc.
# 1. Create a new environment with Python 3.11 (very stable for Spark)
conda create -n spark_env python=3.11 -y
# 2. Activate it
conda activate spark_env
# 3. Install PySpark and PyArrow (needed for Parquet files)
pip install pyspark pyarrow jupyter
To check that PySpark has been installed correctly, type the pyspark command into a terminal window.
$ pyspark
Python 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:46:25) [GCC 14.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/01/15 16:15:21 WARN Utils: Your hostname, tpr-desktop, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
26/01/15 16:15:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/15 16:15:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::arrayBaseOffset has been called by org.apache.spark.unsafe.Platform (file:/home/tom/miniconda3/envs/pandas_to_pyspark/lib/python3.11/site-packages/pyspark/jars/spark-unsafe_2.13-4.1.1.jar)
WARNING: Please consider reporting this to the maintainers of class org.apache.spark.unsafe.Platform
WARNING: sun.misc.Unsafe::arrayBaseOffset will be removed in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.1.1
      /_/

Using Python version 3.11.14 (main, Oct 22 2025 22:46:25)
Spark context Web UI available at http://10.255.255.254:4040
Spark context available as 'sc' (master = local[*], app id = local-1768493723158).
SparkSession available as 'spark'.
>>>
If you don’t see the Spark welcome banner, then something has gone wrong, and you should double-check your installation.
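As a lighter-weight alternative to launching the full shell, you can also check the installed version from plain Python:

# Quick sanity check from a Python prompt or script
import pyspark
print(pyspark.__version__)   # should print something like 4.1.1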
Example 1 — Creating a local cluster
This is actually pretty easy. Just type the following into your notebook.
from pyspark.sql import SparkSession

# Initialize the Spark Session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyLocalCluster") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Verify the cluster is running
print(f"Spark is running version: {spark.version}")
print(f"Master URL: {spark.sparkContext.master}")

#
# The output
#
Spark is running version: 4.1.1
Master URL: local[*]
The SparkSession concept is important. In the early days of Spark, users had to juggle multiple “entry points” (like SparkContext for core functions, SQLContext for dataframes, and HiveContext for databases). It was confusing for beginners.
The SparkSession was introduced in Spark 2.0 as the “one-stop shop” for everything. It is the single point of entry for interacting with Spark functionality.
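As a quick illustration of that single entry point (the data and view name below are invented for the example):

# DataFrames, SQL and data sources all hang off the one SparkSession
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

df.createOrReplaceTempView("labels")                          # register it for SQL
spark.sql("SELECT id FROM labels WHERE label = 'b'").show()   # query it with SQL

reader = spark.read   # and spark.read is the entry point for files and databases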
Example 2 — Creating a dataframe
Creating DataFrames and manipulating the data they contain is what you’ll spend most of your time doing in PySpark, and it’s pretty straightforward. Here, we define a dataframe containing three records and three named columns.
# 1. Define your data as a list of tuples
data = [
    ("Alice", 34, "New York"),
    ("Bob", 45, "London"),
    ("Catherine", 29, "Paris")
]

# 2. Define your column names
columns = ["Name", "Age", "City"]

# 3. Create the DataFrame
df = spark.createDataFrame(data, columns)

# 4. Show the result
df.show()

#
# The output
#
+---------+---+--------+
|     Name|Age|    City|
+---------+---+--------+
|    Alice| 34|New York|
|      Bob| 45|  London|
|Catherine| 29|   Paris|
+---------+---+--------+
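Letting Spark work out the column types is convenient, but you can also pin them down with an explicit schema. Here is a sketch that reuses the data list from above with fixed types:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare names and types up front instead of letting Spark infer them
schema = StructType([
    StructField("Name", StringType(), False),   # False = not nullable
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()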
More likely, the dataframes you use will initially be created by reading data from a file or database. Create a CSV file named sales_data.csv on your system with the following contents.
transaction_id,customer_name,net_amount,tax_amount, is_member
101,Alice,250.50,25.05,true
102,Bob,120.00,6.00, false
103,Charlie,450.75,25.07,true
104,David,89.99,5.73,false
Creating a dataframe from a file like this is simple:
# Load the CSV file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("sales_data.csv")

# Show the data
print("Dataframe Contents:")
df.show()

# Show the data types (Schema)
print("Data Schema:")
df.printSchema()

#
# The output
#
Dataframe Contents:
+--------------+-------------+----------+----------+----------+
|transaction_id|customer_name|net_amount|tax_amount| is_member|
+--------------+-------------+----------+----------+----------+
|           101|        Alice|     250.5|     25.05|      true|
|           102|          Bob|     120.0|       6.0|     false|
|           103|      Charlie|    450.75|     25.07|      true|
|           104|        David|     89.99|      5.73|     false|
+--------------+-------------+----------+----------+----------+

Data Schema:
root
 |-- transaction_id: integer (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- net_amount: double (nullable = true)
 |-- tax_amount: double (nullable = true)
 |-- is_member: string (nullable = true)
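One detail worth spotting: is_member was inferred as a string, not a boolean. The stray whitespace in the CSV (the space before is_member in the header row and before false on row 102) is the likely culprit. One possible cleanup, sketched below, is to trim the values and cast the column:

from pyspark.sql import functions as F

# If the header's leading space survived into the column name, rename it first:
# df = df.withColumnRenamed(" is_member", "is_member")

# Trim stray whitespace from the values, then cast to a proper boolean
df = df.withColumn("is_member", F.trim(F.col("is_member")).cast("boolean"))
df.printSchema()   # is_member should now show as boolean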
Example 3 — Processing data
Of course, once you have your input data in a dataframe, the next thing you’ll want to do is process or manipulate it in some way. That’s easy too. Referring to the sales_data we just loaded, let’s say we want to calculate the gross amount (net + tax) and the tax rate as a percentage of the gross amount for each record and add those to our initial dataframe.
from pyspark.sql import functions as F

# 1. Add 'gross_amount' by adding net and tax
# 2. Add 'tax_percentage' by dividing tax by the new gross amount
df_extended = df.withColumn("gross_amount", F.col("net_amount") + F.col("tax_amount")) \
    .withColumn("tax_percentage",
                (F.col("tax_amount") / (F.col("net_amount") + F.col("tax_amount"))) * 100)

# 3. Optional: Round the percentage to 2 decimal places for readability
df_extended = df_extended.withColumn("tax_percentage", F.round(F.col("tax_percentage"), 2))

# Show the new columns along with the old ones
df_extended.show()

#
# The output
#
+--------------+-------------+----------+----------+----------+------------+--------------+
|transaction_id|customer_name|net_amount|tax_amount| is_member|gross_amount|tax_percentage|
+--------------+-------------+----------+----------+----------+------------+--------------+
|           101|        Alice|     250.5|     25.05|      true|      275.55|          9.09|
|           102|          Bob|     120.0|       6.0|     false|       126.0|          4.76|
|           103|      Charlie|    450.75|     25.07|      true|      475.82|          5.27|
|           104|        David|     89.99|      5.73|     false|       95.72|          5.99|
+--------------+-------------+----------+----------+----------+------------+--------------+
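Grouping and aggregating follow the same pattern. As one final sketch (reusing the F alias imported above), here is one way you might summarise the extended sales data by membership status:

# Count, total and average per membership group
summary = df_extended.groupBy("is_member").agg(
    F.count("*").alias("transactions"),
    F.round(F.sum("gross_amount"), 2).alias("total_gross"),
    F.round(F.avg("tax_percentage"), 2).alias("avg_tax_pct"),
)
summary.show()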
Summary
That concludes our brief sojourn into the world of distributed computing with PySpark. I explained what PySpark is and why you should consider using it when the data you’re processing exceeds your system’s memory limits. In short, PySpark’s ability to scale to large multi-node clusters, its lazy execution model, and its DataFrame structure make it a data-processing powerhouse.
PySpark is widely used in data engineering, analytics, and machine learning pipelines. It integrates well with cloud platforms, supports a variety of data sources (such as CSV, Parquet, and databases), and scales from a laptop to large production clusters.
If you are comfortable with Python and want to work with large datasets without abandoning familiar syntax, PySpark is an excellent next step. It bridges the gap between simple data analysis and large-scale data processing, making it a valuable tool for anyone entering the world of big data.
Hopefully, you can use my simple coding examples and explanations to take the next step toward using PySpark in the real world, on a real cluster, and to perform proper big-data processing.

