Python in Excel: How to work with missing values - Stringfest Analytics (2024)

Data analysts should be proactive about missing values in their data, as these can significantly impact analysis outcomes. First of all, many algorithms such as linear and logistic regression as well as decision trees require complete datasets and cannot inherently process missing data.

But beyond statistical modeling and machine learning, missing values might also indicate issues in the data collection process, potentially introducing biases or flaws that skew results and lead to incorrect conclusions.

Excel lacks advanced handling of missing values, with no built-in null value system like SQL. Power Query in Excel offers improved management by recognizing missing values as null and providing tools to profile and identify their prevalence. However, it provides limited support for visually inspecting and correcting these missing values.

Integrating Python, especially with the Pandas library, into Excel enhances missing data management by offering advanced imputation methods and visualization tools, thereby improving Excel’s capability to handle incomplete datasets.

To see this in action, follow the included exercise file that uses a penguin dataset with missing values, demonstrating how Pandas can improve the analysis of missing data within Excel.

Once the penguins_df DataFrame is set up, a straightforward next step is to tally the missing values in each column. If you’ve dabbled in computer science, you might know that True and False values are evaluated as 1s and 0s. Thanks to this so-called coercion, we can sum up all instances where a value is flagged as NA, or missing (True, or 1). This provides a count of missing observations for each column:

[Figure 1: count of missing values in each column of penguins_df]
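As a minimal sketch of that tally, the snippet below counts missing values per column with pandas. The small penguins_df built here is a hypothetical stand-in for the exercise file's data, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_df; the exercise file supplies the real data.
penguins_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0],
    "sex": ["Male", None, None, "Female"],
})

# isna() flags each missing cell as True; sum() counts True as 1 and False as 0.
missing_counts = penguins_df.isna().sum()
print(missing_counts)
```

Because the result is a Series indexed by column name, it is easy to filter, sort, or plot in later steps.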

Knowing the raw counts of missing values is useful, but the percentage of missing values in each column provides more context, because it shows how prevalent missing values are relative to the size of the data. Dividing the number of missing values in each column by the total number of rows in the DataFrame, obtained with len(), gives these percentages. This analysis reveals that most columns have fewer than 1% missing values, with the ‘sex’ column being the only one that exceeds 1%.

[Figure 2: percentage of missing values in each column of penguins_df]
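The percentage calculation might look something like the sketch below. The DataFrame is again a hypothetical stand-in, and multiplying by 100 is an assumption made here so the result reads as a percentage rather than a proportion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_df; the exercise file supplies the real data.
penguins_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0],
    "sex": ["Male", None, None, "Female"],
})

# Missing count per column divided by the number of rows, scaled to a percentage.
missing_pct = penguins_df.isna().sum() / len(penguins_df) * 100
print(missing_pct.round(1))
```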

Let’s explore another method to evaluate missing values by setting up a visualization in Python.

I’ll create a bar chart that displays all columns in the DataFrame where the number of missing values is greater than zero. This visual representation provides a quick and clear comparison of the significance of missing values across different columns, offering an immediate understanding of their relative impact overall.

For a clearer view, I’ve placed the code in a separate Gist below along with the resulting chart. You can access the final results in Excel with the exercise file.

[Figure 3: bar chart of missing values by column]
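The Gist itself is not reproduced here, so the following is only a rough sketch of how such a chart could be built with pandas and Matplotlib, not the author's code. The DataFrame, title, and axis labels are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_df; the exercise file supplies the real data.
penguins_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0],
    "sex": ["Male", None, None, "Female"],
})

# Keep only the columns that have at least one missing value.
missing_counts = penguins_df.isna().sum()
missing_only = missing_counts[missing_counts > 0]

# One bar per column that has missing values.
fig, ax = plt.subplots()
missing_only.plot(kind="bar", ax=ax)
ax.set_title("Missing values by column")
ax.set_xlabel("Column")
ax.set_ylabel("Number of missing values")
fig.tight_layout()
```

Filtering before plotting keeps complete columns such as species off the chart, so the bars compare only the columns that need attention.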

Another interesting method to visualize missing values is through a heatmap. This approach is particularly useful if you’re looking for correlations or patterns among missing values in your data. For instance, if one variable tends to be missing alongside another, it could indicate a deeper issue in the data collection process. I’ll use Seaborn for this purpose.

With this plot, we can get a clear visual representation of where missing values are located within the overall grid of our data:

[Figure 4: heatmap showing where missing values fall in the dataset]
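A heatmap like this one can be sketched by passing the missing-value mask to Seaborn's heatmap() function. The DataFrame and the styling choices below (no color bar, no row labels) are assumptions, not necessarily what the exercise file uses.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for penguins_df; the exercise file supplies the real data.
penguins_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0],
    "sex": ["Male", None, None, "Female"],
})

# 1 where a cell is missing, 0 where it is present.
missing_mask = penguins_df.isna().astype(int)

# Each row of the heatmap is a row of data; each column is a variable.
fig, ax = plt.subplots()
sns.heatmap(missing_mask, cbar=False, yticklabels=False, ax=ax)
ax.set_title("Location of missing values")
fig.tight_layout()
```

Rows where several cells are highlighted at once are the ones to check for values that tend to go missing together.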

Now that we’ve explored some methods to summarize and visualize the data, let’s consider our next steps. Fortunately, none of the columns have significant amounts of missing values. I typically use 3-5% of the total as a guideline. Beyond this threshold, missing values across columns can lead to a substantial reduction in your dataset and could significantly skew your results if you need to drop or impute them. The best solution is always prevention. If possible, revisit your data collection source to correct and prevent future issues. However, I recognize that we often operate in the real world of data, where changes can be challenging and time constraints are common. So, let’s explore some quick fixes.

First, we’ll look at imputing the data. This involves using a summary statistic to fill in missing values. For a quantitative variable like bill_length_mm, I’ll use the median to fill the blanks. For a categorical variable like sex, I’ll use the mode, which is the most frequently occurring value, if it exists.

It’s important to note that imputing missing values is a delicate matter. There are many differing opinions on how best to handle it, and it can become quite complex. To that end, I’ll create new imputed columns so we can compare them with the original ones and determine if the adjustments are acceptable.
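As a hedged sketch of this approach, the snippet below fills bill_length_mm with its median and sex with its mode, writing the results to new columns. The `_imputed` column names and the sample data are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_df; the exercise file supplies the real data.
penguins_df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0, 40.3],
    "sex": ["Male", None, "Female", "Female", None],
})

# Quantitative variable: fill the blanks with the median of the observed values.
median_bill = penguins_df["bill_length_mm"].median()
penguins_df["bill_length_mm_imputed"] = penguins_df["bill_length_mm"].fillna(median_bill)

# Categorical variable: fill the blanks with the mode, if one exists.
# mode() returns a Series because ties are possible; take the first entry.
sex_modes = penguins_df["sex"].mode()
if not sex_modes.empty:
    penguins_df["sex_imputed"] = penguins_df["sex"].fillna(sex_modes.iloc[0])

print(penguins_df)
```

Keeping the original columns untouched makes it possible to compare distributions before and after imputation.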

Another option, if you prefer not to impute missing values—and this is particularly viable when you have few missing values but still need to exclude them for statistical or presentation purposes—is simply to drop them.

It’s important to check how many rows you actually lose by doing this. Keep in mind that if you drop a row based on one missing value in a column, you’re also discarding all other data in that row, which could be valuable. So, exercise caution with this approach.

You can accomplish this using the dropna() method:
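A minimal sketch of dropping incomplete rows and checking how many are lost, again on a hypothetical stand-in for penguins_df:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_df; the exercise file supplies the real data.
penguins_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0],
    "sex": ["Male", None, None, "Female"],
})

# Drop every row that has at least one missing value.
penguins_complete = penguins_df.dropna()
rows_lost = len(penguins_df) - len(penguins_complete)
print(f"Rows lost: {rows_lost} of {len(penguins_df)}")

# Drop rows only when a specific column is missing, which keeps more data.
penguins_with_bill = penguins_df.dropna(subset=["bill_length_mm"])
rows_lost_subset = len(penguins_df) - len(penguins_with_bill)
print(f"Rows lost when checking bill_length_mm only: {rows_lost_subset}")
```

The subset argument is one way to limit the damage when only a few columns matter for the analysis at hand.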

What questions do you have about missing values analysis or Python in Excel more broadly? I hope you’re discovering just how easy, flexible, and enjoyable it is to analyze, visualize, manipulate, and perform various operations on your data with Python, even if missing values aren’t a concern in your work.

If you’re just getting started with Python in Excel, it’s a good idea to understand how the Python language functions outside of the Excel environment. For that, you can check out my book, Advancing into Analytics:

Advancing into Analytics: From Excel to Python and R (O’Reilly)


FAQs

How do you deal with missing values in Excel data analysis?

There are multiple ways to handle missing data:
  1. Delete the data record (if the percentage of missing data is small).
  2. Replace it with the mean or median value if it's a quantitative feature, or with the mode if it's a categorical feature. ...
  3. Replace it with the mean of the nearest neighbours' records.

How do you fill a string with missing values in Python?

You can fill in missing values using the various methods available in pandas.
  1. Use the fillna() Method. The fillna() method replaces every missing value in your dataset with a specified value. ...
  2. The replace() Method. ...
  3. Fill Missing Data With interpolate()

What is the best way to handle missing values when analyzing the data?

How to handle missing data in a dataset:
  1. Delete Rows with Missing Values: This approach is straightforward but be cautious as it may lead to loss of valuable data.
  2. Impute with Averages or Midpoints: Fill missing values with mean, median, or mode.

What are the four methods of treating missing data?

Complete case (CC), mean substitution (MS), last observation carried forward (LOCF), and multiple imputation (MI) are the four most frequently used methods in practice. In a real-world data analysis, the missing data can be MCAR, MAR, or MNAR depending on the reasons that lead to data missing.

What is the function used to find whether there are missing values in Python?

pandas provides isnull(), an alias of isna(), to detect missing values across DataFrame or Series objects.

How to impute missing values in Python for categorical variables?

This can be achieved by computing the column's mode with mode() and passing that value to the fillna() method. Assigning the result back to the column applies the change to the dataset. Mode imputation is a straightforward and intuitive approach to handling missing values in categorical variables.

How do you count missing values in a dataset in Python?

We can use the isna() or isnull() method to detect missing values. Both return a DataFrame filled with Boolean values (True or False) indicating which values are missing. To count the missing values in each column separately, we use the sum() method together with isna() or isnull().

How do you fill NaN values with a string?

Replace NaN with Blank String using fillna()

The fillna() method can replace NaN values with an empty string in several columns at once. We can also call fillna() directly on the DataFrame, without specifying column names, to replace NaN values in every column.
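Both variants can be sketched as follows. The DataFrame and its column names are hypothetical, and the numeric column is deliberately complete so that filling with a string does not change its type.

```python
import pandas as pd

# Hypothetical data: two text columns with gaps and one complete numeric column.
df = pd.DataFrame({
    "island": ["Biscoe", None],
    "sex": [None, "Male"],
    "body_mass_g": [3750.0, 3800.0],
})

# Variant 1: fill NaN with an empty string in selected columns only.
text_cols = ["island", "sex"]
filled_cols = df.copy()
filled_cols[text_cols] = filled_cols[text_cols].fillna("")

# Variant 2: call fillna() on the whole DataFrame without naming columns.
filled_all = df.fillna("")

print(filled_all)
```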

How do you return a missing value in Python?

Checking for missing values using isnull()

To check for null values in a pandas DataFrame, we use the isnull() method. It returns a DataFrame of Boolean values that are True wherever the original value is NaN.

How to handle NA values in pandas?

Filling missing data

NA values can be replaced with the corresponding values from a Series or DataFrame whose index and columns align with those of the original object. DataFrame.where() can also be used to fill NA values, with the same result.
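To illustrate the alignment behavior, here is a small sketch in which a second DataFrame supplies the replacement values. Both DataFrames and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan],
    "body_mass_g": [np.nan, 3800.0],
})

# Hypothetical replacement values sharing the same index and columns.
fallback = pd.DataFrame({
    "bill_length_mm": [40.0, 41.0],
    "body_mass_g": [3700.0, 3900.0],
})

# fillna() takes each replacement from the matching row and column of fallback.
filled = df.fillna(fallback)

# where() keeps values where the condition is True and uses fallback elsewhere.
same = df.where(df.notna(), fallback)

print(filled)
```

Only the gaps are replaced; values already present in df are left as they are.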

What to do if there is no data analysis in Excel?

Click the File tab, click Options, and then click the Add-Ins category. In the Manage box, select Excel Add-ins and then click Go. If you're using Excel for Mac, in the file menu go to Tools > Excel Add-ins. In the Add-Ins box, check the Analysis ToolPak check box, and then click OK.

What should we do if there are missing values in the dataset?

When dealing with missing data, data scientists can use two primary methods to solve the error: imputation or data removal. The imputation method substitutes reasonable guesses for missing data. It's most useful when the percentage of missing data is low.

How do you fill down missing values in Excel?

  1. Select the range with empty cells.
  2. Press Ctrl + H to display the Find & Replace dialog box.
  3. Move to the Replace tab in the dialog.
  4. Leave the Find what field blank and enter the necessary value in the Replace with text box.
  5. Click Replace All.

How do you analyze data not visible in Excel?

In the Add-Ins dialog box, check the Analysis ToolPak check box. Click OK. Once you have enabled the Analysis ToolPak, you should see the Data Analysis button on the Data tab. If you don't see the button, it may be because the Data tab is not visible.
