When I started learning data science in 2020, Pandas was one of the most popular tools. Although newer tools focus on improving Pandas' weaknesses in handling very large datasets, I still use Pandas for many data cleaning, processing, and analysis tasks. Yes, Pandas gives me a hard time when working with billions of rows, but it is more than enough for anything below that.
I see Pandas being used not only for EDA in notebooks but also in production systems.
In this article, I’ll go over some data cleaning and processing operations to demonstrate how capable Pandas is.
Let's start with the dataset, which contains stock keeping units (SKUs) and the search API response for each SKU.
import pandas as pd
search_results = pd.read_csv("search_results.csv")
search_results.head()
The search result is a list of dictionaries and looks like this:
search_results.loc[0, "search_result"]
"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]
... and 5 entities remaining"
As we see in the output, it's not a proper list of dictionaries because of the last part ("... and 5 entities remaining"). Also, it's saved as a single string.
In order to make better use of it, we need to convert it to a proper list of dictionaries. The following line of code removes the last part by splitting the string at "..." and taking the first piece:
search_results.loc[0, "search_result"].split("...")[0].strip()
However, the output is still a single string. We can use the built-in ast module of Python to convert it to a list:
import ast
res = ast.literal_eval(search_results.loc[0, "search_result"].split("...")[0].strip())
res
[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]
We now have the search results as a proper list of dictionaries. This was only for a single row. We need to apply the same operation to every row (i.e. the entire search_result column).
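For clarity, the two per-row steps (trimming the trailing text, then parsing) can be wrapped in a small helper function. The function name and the abridged sample string below are illustrative, not part of the original code:

```python
import ast

def parse_search_result(raw: str) -> list:
    """Parse one search-result string into a list of dictionaries."""
    # Drop the trailing "... and N entities remaining" text, if present
    cleaned = raw.split("...")[0].strip()
    return ast.literal_eval(cleaned)

sample = "[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}] ... and 5 entities remaining"
parsed = parse_search_result(sample)
print(parsed[0]["my_id"])  # HBCV00007F5Y2B
```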
One option is to go over all the rows in a for loop and perform the same operation. However, this is not the best option. We should prefer vectorized operations when we can. A vectorized operation basically means executing the code on all rows at once.
On a single row, I used splitting to get rid of the last part of the string, but splitting did not translate well to a vectorized operation. A more robust option is a regex:
search_results.loc[:, "search_result"] = search_results["search_result"].str.replace(r"\.\.\..*", "", regex=True).str.strip()
This pattern matches three literal dots and everything that comes after them, and replaces the match with an empty string. In other words, it removes the "... and 5 entities remaining" part.
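A quick way to sanity-check the pattern is to run the same substitution on a standalone string. The abridged sample below is hypothetical:

```python
import re

sample = "[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}] ... and 5 entities remaining"
# Same pattern as the vectorized str.replace above
cleaned = re.sub(r"\.\.\..*", "", sample).strip()
print(cleaned)  # [{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}]
```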
We now have all the rows in the search results column as a proper list of dictionaries.
search_results.loc[10, "search_result"]
"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]"
The results are still stored as strings, but I can easily convert them to lists using the ast module, which I will do in the next step.
What I'm interested in is the SKUs returned in the search results. I'll create a new column by extracting the SKUs from the dictionaries. I can access them using the "my_id" key of each dictionary.
This operation has three parts:
- Convert the search result string to a list using the literal_eval function
- Extract the SKU from the "my_id" key of each dictionary
- Do this in a list comprehension to get the SKUs from all the dictionaries in the list
We can do all these operations by applying a lambda function to all rows as follows:
search_results.loc[:, "result_skus"] = \
search_results["search_result"].apply(lambda x: [item["my_id"] for item in ast.literal_eval(x)])
search_results.head()
Each row in the result_skus column contains a list of 10 SKUs. Let’s say I need to have these 10 SKUs in different rows. For each row in the sku column, there will be 10 rows created from the list in the result_skus column. There is a very simple way of doing this in Pandas, which is the explode function.
data = search_results[["sku", "result_skus"]].explode("result_skus", ignore_index=True)
data.head()
We created a new dataframe with the sku and result_skus columns. The drawing below demonstrates what the explode function does:
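The same behavior can be seen on a tiny, made-up example:

```python
import pandas as pd

# Toy data: each sku maps to a list of result SKUs
df = pd.DataFrame({"sku": ["A", "B"], "result_skus": [["x", "y"], ["z"]]})

# explode creates one row per (sku, result_sku) pair: A-x, A-y, B-z
exploded = df.explode("result_skus", ignore_index=True)
print(exploded)
```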
Consider the opposite. We have a dataframe as shown above but want to have all results for an sku in a single row.
We can use the groupby function to group the rows by sku and then apply the list function on the result_skus column:
new_data = data.groupby("sku", as_index=False)["result_skus"].apply(list)
new_data.head()
This will get us back to the previous step:
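On the same kind of toy data, grouping and collecting each group's values into a list (here via agg(list), which does the same collection) undoes the explode:

```python
import pandas as pd

# Toy long-format data, as produced by explode
long_df = pd.DataFrame({"sku": ["A", "A", "B"], "result_skus": ["x", "y", "z"]})

# Collect each sku's results back into a single list per row
regrouped = long_df.groupby("sku", as_index=False)["result_skus"].agg(list)
print(regrouped)
```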
Using the explode function, we created a dataframe with a separate row for each SKU in the result_skus column. What if we need them in separate columns instead of rows?
One option is to apply the pd.Series function to the result_skus column and concatenate the resulting columns to the original dataframe.
new_cols = new_data["result_skus"].apply(pd.Series)
new_data = pd.concat([new_data, new_cols], axis=1)
new_data.head()
Columns 0 through 9 contain the 10 SKUs from the result_skus column. This code is not vectorized, though, because apply runs pd.Series once per row.
We have another option, which is vectorized and much faster:
new_cols = pd.DataFrame(new_data["result_skus"].tolist())
new_data = pd.concat([new_data, new_cols], axis=1)
This code will give us the same dataframe as above but much faster.
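A quick check on made-up data confirms the two approaches produce the same dataframe:

```python
import pandas as pd

# Toy list column to expand into separate columns
df = pd.DataFrame({"result_skus": [["a", "b"], ["c", "d"]]})

via_apply = df["result_skus"].apply(pd.Series)         # row-by-row, slow
via_tolist = pd.DataFrame(df["result_skus"].tolist())  # vectorized, fast

print(via_apply.equals(via_tolist))  # True
```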
I demonstrated a typical data cleaning and processing task a data scientist or analyst may encounter in their job. I've been in the field for over 5 years, and Pandas has always been enough to do what I need, except when working with very large datasets (e.g. billions of rows).
The tools that are a better fit for such large datasets have similar syntax to Pandas. For example, PySpark feels like a mixture of Pandas and SQL, and Polars is very similar to Pandas in terms of syntax. Thus, learning and practicing Pandas is still a highly valuable skill for anyone working in the data science and AI domain.
Thank you for reading.
