Zomato EDA: Insights into Indian Food Industry Trends

Zomato is one of the most popular online food delivery platforms in India. But what can we learn from the data that Zomato collects and shares? In this blog post, I will explore some interesting insights from the Zomato dataset using exploratory data analysis (EDA). EDA is a process of summarizing, visualizing, and understanding the data before applying any machine learning or statistical techniques. It can help us discover patterns, trends, outliers, and anomalies in the data. Let's get started!

Dataset -- https://www.kaggle.com/datasets/himanshupoddar/zomato-bangalore-restaurants

Prerequisites

Before going forward, make sure you are familiar with pandas, NumPy, Matplotlib, and Seaborn. If not, you can check out my articles on those libraries.

Importing Libraries

Let's import all the libraries that we need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('dark')

Load the Data

First, let's load the data.

df = pd.read_csv("zomato.csv")
df

zomato dataset

Check the Shape of the Data

Now, let's check how many rows and columns there are in this dataset.

df.shape

#--------------------- OUTPUT------------------------#
(51717, 17)
#--------------------- OUTPUT------------------------#

List all the Columns

Now, let's list all the columns of this dataset.

df.columns

# ----------------------- OUTPUT -------------------- #
Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes', 'phone', 'location', 'rest_type', 'dish_liked', 'cuisines', 'approx_cost(for two people)', 'reviews_list', 'menu_item', 'listed_in(type)', 'listed_in(city)'], dtype='object')
# ----------------------- OUTPUT -------------------- #

Now, let's start data preprocessing.

Data Preprocessing

First, let's see all the columns, their data types, and how many non-null values each column has.

df.info()

zomato dataset info

You can see that there are a total of 51717 entries or rows, most columns have no or only a few null values, and only one column (votes) has the int datatype. The memory usage of the data frame is 6.7+ MB.

Now, we'll remove the columns that we don't need for this analysis. Feel free to keep them in your own analysis, but I won't be using them in this article, so I'll simply delete them.

Deleting Columns

# removing columns that are not needed for this analysis
df.drop(columns=["url", "address", "phone", "menu_item", "reviews_list", "listed_in(city)"], inplace=True)

The drop function is used whenever we want to delete rows or columns. Here it takes two parameters: columns specifies which columns to delete, and inplace=True modifies the original data frame df in place.

Let's take a look at the data after dropping the columns.

df

dropped columns from the dataset

Now, let's see the shape of the data.

df.shape

# ---------------------- OUTPUT ---------------- #
(51717, 11)
# ---------------------- OUTPUT ---------------- #

You can see that the changes are in place, i.e. the columns were deleted from the original data frame df.

Now, let's preprocess some columns.

Rate Column

In the data frame shown above, we want to remove the /5 from every rating. We also need to check whether there are null values, and there may be some weird or garbage values too. So, let's start cleaning the rate column.

First, we'll check all the unique values in the rate column. This will tell us whether there are any null or garbage values.

df["rate"].unique()

all the unique values in the rate column

You can see that there are garbage values ("NEW" and "-") as well as null values (nan).

So, let's start by replacing the garbage values with null values and removing the "/5" from all the valid ratings.

# remove NEW, -, and /5 from the rating
def rate_handler(value):
    if value == "NEW" or value == "-":
        return np.nan
    else:
        value = str(value).split('/')
        value = value[0]
    return float(value)

df["rate"] = df["rate"].apply(rate_handler)

df.head()

dataset after removing the garbage values of rate column

Now, let's check all the unique values again to confirm that the garbage values and the "/5" are completely removed.

df["rate"].unique()

all the unique values of rate column after removing garbage values

This confirms that there are no garbage values and no /5.

Now let's fill the null values with the average rating of all the restaurants. You can use different strategies but I'll go with this one.

# handle missing rate values by filling them with the column mean
def missing_value_handler(column):
    df[column] = df[column].fillna(df[column].mean())

missing_value_handler("rate")

Now, let's check if there are null values present or not in the rate column.

df["rate"].isnull().sum()

# --------------------- OUTPUT --------------------- #
0
# --------------------- OUTPUT --------------------- #

So, this confirms that there are no null values and we have cleaned the rate column.
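A different strategy (not used in this post) would be to fill each missing rating with the mean rating of the restaurant's location instead of the global mean. A minimal sketch, assuming the location column is populated for those rows:

# alternative: fill missing ratings with the mean rating of the restaurant's location
df["rate"] = df["rate"].fillna(df.groupby("location")["rate"].transform("mean"))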

approx_cost(for two people) Column

Similarly, let's check all the unique values to know if there are null and garbage values present in the column or not.

df["approx_cost(for two people)"].unique()

all the unique cost values

You can see that the 4-digit cost values contain a ",", and we don't want that, so let's remove it.

# remove the comma from the cost and convert the value to a float
def cost_handler(value):
    value = str(value)
    if ',' in value:
        value = value.replace(',', '')
    return float(value)

df["approx_cost(for two people)"] = df["approx_cost(for two people)"].apply(cost_handler)

df.head()
  • Here, the function cost_handler() takes an argument value which is the cost.

  • Then we are converting the cost to string so that we can do string manipulation.

  • Then, if the cost string contains a ",", we replace it with an empty string.

  • Finally, we are typecasting the string to float and returning the value.

The apply() function can be used to apply a function to every value in a pandas Series or data frame. Here, we are applying the cost_handler() function to every value of the approx_cost(for two people) column.
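As an aside (not part of the original post), the same cleanup can be done without a helper function by using pandas' vectorized string methods. A minimal sketch, assuming the column still holds the raw comma-containing strings:

# vectorized alternative: strip the commas and cast to float in one step
# (missing values become the string 'nan' and are converted back to NaN by astype(float))
df["approx_cost(for two people)"] = (
    df["approx_cost(for two people)"]
    .astype(str)
    .str.replace(',', '', regex=False)
    .astype(float)
)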

Now, let's take a look at all the unique values again just to confirm that we have removed all the commas.

df["approx_cost(for two people)"].unique()

You can see that all the commas in the 4-digit cost have been removed.

If you want, you can fill in the null values in the cost column, but I'm not going to do that for this analysis.
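If you do want to fill them, a minimal sketch (my own suggestion, using the median so that a few expensive restaurants don't skew the fill) could look like this:

# optional: fill missing cost values with the column median
cost_col = "approx_cost(for two people)"
df[cost_col] = df[cost_col].fillna(df[cost_col].median())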

EDA

Now, let's start to ask some questions and try to get the answers with visualization.

These are the 5 questions that we'll try to answer and visualize:

  1. Number of restaurants having online order facility

  2. Top 10 best-rated restaurants

  3. Top 10 cuisines served by restaurants

  4. Number of Restaurants in every location

  5. Top 10 liked dishes

Number of restaurants having online order facility

sns.countplot(df["online_order"])

number of restaurants having online order facility

Most restaurants provide online ordering; nearly 30,000 of them offer the service. To find the exact numbers, you can use pandas' value_counts to get the count for each option (Yes and No).
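For example, a quick sketch:

# exact number of restaurants with and without online ordering
print(df["online_order"].value_counts())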

Top 10 best-rated restaurants

# top 10 rated restaurants

df.groupby("name")["rate"].mean().sort_values(ascending=False).head(10)
  • Here, we are grouping the data by restaurant name.

  • Then we compute the mean of all the ratings for each restaurant.

  • Finally, we sort the result in descending order so the highest-rated restaurants come first and retrieve the top 10 using head(10).

top 10 best-rated restaurants
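One caveat: a restaurant with only a handful of perfect ratings can top this list. As a possible refinement (my own suggestion, not part of the original analysis), you could restrict the ranking to restaurants with a minimum number of votes, for example:

# hypothetical refinement: rank only restaurants with at least 500 total votes
popular = df.groupby("name").filter(lambda g: g["votes"].sum() >= 500)
popular.groupby("name")["rate"].mean().sort_values(ascending=False).head(10)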

Top 10 Cuisines Served

# top 10 cuisines served by restaurants
cuisines = {}

def cuisines_handler(value):
    # a restaurant can serve several cuisines separated by commas
    value = str(value).split(',')
    for cuisine in value:
        cuisine = cuisine.strip()
        if cuisine in cuisines:
            cuisines[cuisine] += 1
        else:
            cuisines[cuisine] = 1

for value in df["cuisines"]:
    cuisines_handler(value)

print(cuisines)

# convert the dictionary to a data frame
cuisines = pd.DataFrame(list(cuisines.items()), columns=["cuisines", "count"])
cuisines

# getting the top 10 cuisines
top_10_cuisines = cuisines.sort_values("count", ascending=False).head(10)

# plotting the top 10 cuisines
top_10_cuisines_bar_plot = sns.barplot(data=top_10_cuisines, x="cuisines", y="count")
top_10_cuisines_bar_plot.set_xticklabels(top_10_cuisines_bar_plot.get_xticklabels(), rotation=45, horizontalalignment='right')
  • Here, first, we are creating a dictionary that will hold the count for every cuisine.

  • Then, the function cuisines_handler() is used to fill the dictionary.

  • Next, we convert the dictionary into a data frame for convenience; you could also find the top 10 cuisines directly from the dictionary.

  • Then, we sort the data frame in decreasing order by the count column.

  • Finally, we plot a bar graph with the above data; the last line rotates the x-tick labels by 45 degrees so they remain readable.

Top 10 Cuisines Served

You can observe that the most served cuisine is North Indian followed by Chinese.
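As an aside, the manual dictionary works fine, but pandas can produce the same counts more compactly. A minimal sketch, assuming the cuisines column still holds the raw comma-separated strings:

# count cuisines by splitting the comma-separated strings and exploding them into rows
cuisine_counts = (
    df["cuisines"]
    .dropna()
    .str.split(',')
    .explode()
    .str.strip()
    .value_counts()
)
print(cuisine_counts.head(10))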

Top 10 Locations with Highest Restaurants

# top 10 locations with the highest number of restaurants
number_of_outlets = df.groupby("location")["name"].count()

# getting the top 10 locations with the most restaurants
top_10_location_with_highest_restaurants = number_of_outlets.sort_values(ascending=False).head(10)

# plotting the top 10 locations
top_10_location_with_highest_restaurants_bar_plot = sns.barplot(x=top_10_location_with_highest_restaurants.index, y=top_10_location_with_highest_restaurants.values)

# rotating the x-tick labels
top_10_location_with_highest_restaurants_bar_plot.set_xticklabels(top_10_location_with_highest_restaurants_bar_plot.get_xticklabels(), rotation=45, horizontalalignment='right')
  • Here, we group the data by location.

  • Then we count the number of restaurants in each location using the name column.

  • Next, we sort the locations in decreasing order by that count and retrieve the top 10 using head(10).

  • Finally, we plot the result using seaborn; this is very similar to the top 10 cuisines question.

Top 10 Locations with Highest Restaurants

From this, we can observe that BTM has the largest number of restaurants.
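As a side note (my own shortcut, not in the original post), value_counts gives the same ranking in one line:

# same ranking of locations with value_counts
df["location"].value_counts().head(10)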

Top 10 Liked Dishes

# top 10 liked dishes
liked_dishes = {}

def liked_dishes_handler(value):
    # a restaurant can have several liked dishes separated by commas
    value = str(value).split(',')
    for liked_dish in value:
        liked_dish = liked_dish.strip()
        if liked_dish in liked_dishes:
            liked_dishes[liked_dish] += 1
        else:
            liked_dishes[liked_dish] = 1

for value in df["dish_liked"]:
    liked_dishes_handler(value)

print(liked_dishes)

# convert the dictionary to a data frame
liked_dishes = pd.DataFrame(list(liked_dishes.items()), columns=["dishes", "count"])
liked_dishes

# getting the top 10 liked dishes
# [1:11] skips the first row (the "nan" entry from missing dish_liked values) and keeps the next 10
top_10_liked_dishes = liked_dishes.sort_values("count", ascending=False)[1:11]

# plotting the top 10 dishes
top_10_liked_dishes_bar_plot = sns.barplot(data=top_10_liked_dishes, x="dishes", y="count")

# setting the x-ticks
top_10_liked_dishes_bar_plot.set_xticklabels(top_10_liked_dishes_bar_plot.get_xticklabels(), rotation=45, horizontalalignment='right')

We are handling this almost the same way as we did the top 10 cuisines question:

  • We created a dictionary to store the count for each liked dish.

  • Then the function liked_dishes_handler() is used to fill the dictionary with the dish as key and count as value.

  • We converted the dictionary to a data frame.

  • Finally, we sorted the data frame in decreasing order by count and plotted it using a barplot.

top 10 liked dishes

From this, we can observe that Pasta is the most liked dish followed by Burgers.
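The explode-based shortcut shown for cuisines works here as well; dropping the missing values first means head(10) can be used directly instead of the [1:11] slice:

# count liked dishes, ignoring rows where dish_liked is missing
df["dish_liked"].dropna().str.split(',').explode().str.strip().value_counts().head(10)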

Conclusion

Based on the exploratory data analysis (EDA) conducted on the Zomato dataset, several insights have been uncovered about the food industry and consumer preferences in different cities across India.

The analysis revealed that online ordering and delivery have become increasingly popular, especially in urban areas. This trend has been further accelerated by the COVID-19 pandemic, which has led to a shift toward contactless and online ordering options.

One of the key findings of the analysis was the popularity of North Indian cuisine across most cities, followed by Chinese and South Indian cuisine. This highlights the importance of understanding regional food preferences to cater to the local market.

Another significant finding is that pasta is the most popular dish among consumers, followed by burgers. Food businesses can use these insights to better understand consumer preferences and cater to their needs, improving customer satisfaction and driving business growth.

Overall, the EDA on the Zomato dataset provides valuable insights into the food industry and consumer behavior in India. These insights can be leveraged by businesses and policymakers to make data-driven decisions and improve their understanding of the market.
