18/07/2022  •   6 min read  

How Is a pandas Script Used for Data Analysis?


Aspiring data analysts and scientists are likely aware that data wrangling is one of the most important and time-consuming phases in any data science or machine learning work.

Pandas is a robust and popular Python library built on top of NumPy. It supports a wide variety of data objects and operations for cleaning, manipulating, and analyzing data, and as one of the best-known data science tools, it has unquestionably changed the game.

In this blog, we will examine two of the most significant pandas data structures: Series and DataFrame.

We will also conduct hands-on data analysis on a dataset of movies. By examining actual data, we will work directly with some of the most useful operations and features that pandas offers.

Series

A Series can be thought of as a 1-D array, or as a single column of a two-dimensional array or matrix. It is comparable to one of the columns in an Excel sheet. A Series is a collection of data values, each associated with a label. Each row has an index value associated with it. When the Series is created, these index values are defined automatically, but they can also be defined explicitly.

Let's get started by writing code in a Jupyter notebook to construct and explore the Series.

Follow along by opening your Jupyter notebook.

You can obtain the Jupyter notebook containing the source code for this blog here:

the source code

Procedure to Create Series

A dictionary of key-value pairs, arrays of values, or a list of values can all be used to generate a series object.

The method used to construct Series is pd.Series(). It accepts as parameters a list, an array, or a dictionary.

1. Creating Series from a List

Create a Series by using a list of values
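As a minimal sketch of this step, the snippet below passes a plain Python list to pd.Series(). The values and the variable name s1 are made up for illustration; they are not taken from the original notebook.

```python
import pandas as pd

# Example values are made up for illustration.
s1 = pd.Series([10, 20, 30, 40])

print(s1)
print(s1.index)  # a default integer index (0, 1, 2, 3) is generated
```

Because no index is passed, pandas generates a default integer index starting at 0.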


Although the indexes in this case are generated by default, we can also provide custom indexes when creating a Series.

Take a list of "Marks" and a list of the related "Subjects": the Subjects list can be set as the row index.
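A small sketch of a Series with a custom index could look like this. The name S2 and the "Tamil" label appear in this blog; the other subjects and all the marks are assumptions made up for the example.

```python
import pandas as pd

# "Tamil" comes from the text; the other subjects and all marks are assumed.
subjects = ["Tamil", "English", "Maths", "Science"]
marks = [85, 78, 92, 88]

# The subjects list becomes the row index of the Series.
S2 = pd.Series(marks, index=subjects)
print(S2)
```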


Series Indexing and Slicing Operations

Data retrieval and modification are among the most crucial tasks in data analysis. Square brackets [] can be used to index and slice the data contained in a Series.

# Slicing by using string indexes
S2['Tamil']
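Beyond a single label, a few other common ways to index and slice a Series are sketched below. The data is the same assumed Marks/Subjects example; only S2 and "Tamil" come from the text.

```python
import pandas as pd

# Assumed sample data; only the name S2 and the "Tamil" label come from the text.
S2 = pd.Series([85, 78, 92, 88], index=["Tamil", "English", "Maths", "Science"])

print(S2["Tamil"])              # a single label
print(S2[["Tamil", "Maths"]])   # a list of labels
print(S2["English":"Science"])  # label slice: the end label is included
print(S2.iloc[0:2])             # positional slice: the end position is excluded
```

Note the difference between the two slices: label-based slicing includes the end label, while positional slicing follows the usual Python convention and excludes the end.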

2. Creating Series from a Dictionary

A dictionary is a basic data structure in Python that holds information as a collection of Key-Value pairs. A Series and a dictionary are comparable in that they both map specified indices to collections of values.

Suppose you store information on fruits and their prices in a dictionary. Read on to learn how to make a Series from this dictionary.


Converting ‘dict_fruits’ to a series
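The conversion can be sketched as follows. The name dict_fruits comes from the text; the fruits, the prices, and the name fruit_series are hypothetical.

```python
import pandas as pd

# Hypothetical fruits and prices; the original values are not shown in the text.
dict_fruits = {"Apple": 120, "Banana": 40, "Mango": 90}

# Dictionary keys become the index, dictionary values become the data.
fruit_series = pd.Series(dict_fruits)
print(fruit_series)
```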


This series' data may be accessed as follows:
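A short sketch of accessing the values, again with hypothetical fruits and prices:

```python
import pandas as pd

# Hypothetical data, rebuilt here so the snippet runs on its own.
fruit_series = pd.Series({"Apple": 120, "Banana": 40, "Mango": 90})

print(fruit_series["Apple"])             # by key, like a dictionary
print(fruit_series[["Apple", "Mango"]])  # several keys at once
print(fruit_series.iloc[0])              # by position
```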


DataFrame

The next significant data structure, and the most popular one in pandas, is the DataFrame.

A DataFrame can be compared to a two-dimensional table or an Excel worksheet. It is essentially a collection of Series organised into a table that shares one index. It stores tabular data, where each row denotes an observation and each column a variable.

The method used to build a dataframe is pd.DataFrame().

There are several techniques to generate a DataFrame. Let's examine each of them.

1. Creating DataFrame from Series Object

A Series (or several Series) can be passed to the DataFrame constructor to generate a DataFrame. The optional parameter 'columns' can be used to name the columns.

Let's build a DataFrame with the series we established in the previous step as the basis:
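Here is a minimal sketch, assuming the Series from the previous step is the fruit price Series. The data and the names fruit_series, df_fruits, and "Price" are hypothetical.

```python
import pandas as pd

# Hypothetical Series standing in for the one built in the previous step.
fruit_series = pd.Series({"Apple": 120, "Banana": 40, "Mango": 90})

# For an unnamed Series, `columns` supplies the column name.
df_fruits = pd.DataFrame(fruit_series, columns=["Price"])
print(df_fruits)
```

One thing to watch for: if the Series already has a name that differs from the label passed in columns, pandas looks for a column with that label and does not find it, so naming the Series or using a dictionary is the safer route in that case.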


2. Creating DataFrame from Dictionary Object

Let's imagine we want to combine two series of weights and heights of a group of people into a table.


We will first establish a dictionary using the "height" and "weight" Series, then use the pd.DataFrame() method to generate a DataFrame.
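This can be sketched as below. The "height" and "weight" names come from the text; the people and the measurements are made up.

```python
import pandas as pd

# Hypothetical people and measurements (cm and kg).
people = ["Asha", "Ben", "Chen"]
height = pd.Series([165, 180, 172], index=people)
weight = pd.Series([58, 82, 70], index=people)

# Dictionary keys become column names; the Series are aligned on their shared index.
df_people = pd.DataFrame({"height": height, "weight": weight})
print(df_people)
```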


3. Creating DataFrame by Importing Data from the File

Pandas is especially handy when you want to import data from different file types, such as CSV, Excel, and JSON.

Below are a few of the functions that read data from files and other sources into a DataFrame:

  • read_table()
  • read_csv()
  • read_html()
  • read_json()
  • read_pickle()

For this blog, we will only consider the data available in the CSV file.

Analyzing movie data from IMDB

Now that we have a fundamental knowledge of the various Pandas data structures, let's examine the entertaining and fascinating "IMDB-movies-dataset" and get our hands dirty by conducting real-world data analysis. You may obtain the open-source dataset from this URL.

What could be more enjoyable than doing actual data analysis? So put your Data Scientist/Analyst hats on and let's GET. SET. GO!

The following fundamental operations will be carried out on the movie data once we read it from the .csv file.

  • Data reading
  • Viewing the information
  • Recognizing the fundamentals of the data
  • Data Slicing and Indexing: Data Selection
  • Selecting data with conditional filtering
  • Groupby operations
  • Sorting Operations
  • Handling missing values
  • Null values and dropping columns
  • apply() operations

1. Reading Data

Loading the data from the CSV file.
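A sketch of this step is shown below. To keep it runnable without the real file, a tiny in-memory CSV stands in for the dataset: the column names are assumed to match the IMDB movies dataset, and the rows are made up. As described later in this blog, "Title" is used as the index.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file; column names are assumed, rows are made up.
csv_text = """Title,Genre,Director,Year,Rating,Revenue (Millions),Metascore
Movie A,"Action,Adventure",Director X,2012,5.8,292.3,52
Movie B,Drama,Director Y,2015,8.1,45.0,88
Movie C,Comedy,Director X,2009,6.4,,61
"""

# index_col="Title" makes the movie title the row index.
data = pd.read_csv(io.StringIO(csv_text), index_col="Title")
print(data)
```

With the real dataset, the first argument would be the path to the downloaded .csv file instead of the in-memory text.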


2. Viewing Data

Using the head() and tail() methods, let's quickly preview the data.

head()

  • Returns the dataset's first 5 rows by default.
  • Additionally, it accepts the number of rows as an optional parameter.

tail()

  • Returns the dataset's last 5 rows by default.
  • Additionally, it accepts the number of rows as an optional parameter.
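Both methods can be tried on a small made-up DataFrame. The data below is a stand-in, not the real movie data.

```python
import pandas as pd

# Made-up stand-in data with 8 rows, so head() and tail() return different rows.
titles = pd.Index([f"Movie {i}" for i in range(1, 9)], name="Title")
data = pd.DataFrame({"Rating": [7.1, 6.4, 8.0, 5.5, 6.9, 7.7, 4.8, 8.3]}, index=titles)

print(data.head())   # first 5 rows by default
print(data.tail(3))  # last 3 rows
```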

3. Understanding the Basic Information regarding the Data

Many functions are available in Pandas to grasp the dataframe's shape, number of columns, indexes, and other details.

One of my favorite methods, info(), provides all the essential details about the various columns in a DataFrame.

  • shape gives the dimensions of the DataFrame as a (rows, columns) tuple.
  • columns gives the list of columns available in the DataFrame.

For this dataset, shape reports 1001 rows and 11 columns.
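A quick sketch of info(), shape, and columns on a small made-up DataFrame (not the real dataset, so the reported shape differs):

```python
import pandas as pd

# Made-up stand-in data.
data = pd.DataFrame(
    {
        "Genre": ["Action", "Drama", "Comedy"],
        "Rating": [5.8, 8.1, 6.4],
        "Metascore": [52.0, 88.0, None],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

data.info()          # prints column names, non-null counts, and dtypes
print(data.shape)    # (rows, columns) tuple
print(data.columns)  # column labels
```

Note that info() is a method, while shape and columns are attributes and are accessed without parentheses.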

The describe() method provides basic statistical summaries of every numerical column in the DataFrame.

data.describe()

4. Selecting Data: Indexing and Slicing

Extracting data by columns

Extracting data from a DataFrame works much as it does for a Series. In this case, data is extracted from the columns using the column label.

Let's quickly extract the data for "Genre" from the DataFrame.

Indexing with a single column label returns all the values of the 'Genre' column as a Series. To obtain this data as a DataFrame instead, index with double square brackets, that is, with a list of column names.

To extract several columns from the data, simply add the column names to the list.
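The three cases can be sketched like this, using a small made-up DataFrame with assumed column names:

```python
import pandas as pd

# Made-up stand-in data.
data = pd.DataFrame(
    {
        "Genre": ["Action", "Drama", "Comedy"],
        "Year": [2012, 2015, 2009],
        "Rating": [5.8, 8.1, 6.4],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

genre_series = data["Genre"]        # single brackets: a Series
genre_frame = data[["Genre"]]       # double brackets: a one-column DataFrame
subset = data[["Genre", "Rating"]]  # several columns: a DataFrame

print(type(genre_series).__name__, type(genre_frame).__name__)
print(subset)
```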


Extracting data by rows

To extract data from particular rows, you can use the loc and iloc indexers.

loc: locates rows by label.

  • loc slices using the explicit index.
  • Rows are accessed by their index labels, which are strings in this dataset.

iloc: locates rows by integer position.

  • iloc slices using zero-based integer positions, like Python's default numerical index.

When we first read the data, we made a DataFrame with the string index "Title."

Using loc, we will slice and index the DataFrame by "Title".

With iloc, integer positions are used to slice the data instead.
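A small sketch of both indexers, with made-up titles standing in for the real movie names:

```python
import pandas as pd

# Made-up stand-in data indexed by "Title".
data = pd.DataFrame(
    {
        "Genre": ["Action", "Drama", "Comedy", "Action"],
        "Rating": [5.8, 8.1, 6.4, 5.2],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="Title"),
)

print(data.loc["Movie B"])            # one row, selected by label
print(data.loc["Movie B":"Movie D"])  # label slice: the end label is included
print(data.iloc[0:2])                 # positional slice: the end position is excluded
print(data.iloc[0, 1])                # first row, second column
```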


5. Selecting Data with Conditional Filtering

Pandas also allows data to be retrieved from a DataFrame based on conditional filters.

What if we only wanted to choose movies that were released between 2010 and 2016 and had an average audience rating of less than 6.0 but were the highest grossers?

It only takes one line of code to obtain it, making it incredibly straightforward.
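One way to express such a filter is sketched below. The column names are assumed, the rows are made up, and revenue_cutoff is a hypothetical threshold for "highest grossers", since the exact cut-off used in the original analysis is not stated.

```python
import pandas as pd

# Made-up stand-in data with assumed column names.
data = pd.DataFrame(
    {
        "Year": [2012, 2015, 2009, 2014, 2016],
        "Rating": [5.8, 8.1, 5.5, 5.2, 5.9],
        "Revenue (Millions)": [292.3, 45.0, 310.0, 300.5, 12.1],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"], name="Title"),
)

revenue_cutoff = 250  # hypothetical threshold, in millions

# Each condition is wrapped in parentheses and combined with & (element-wise AND).
result = data[
    (data["Year"] >= 2010)
    & (data["Year"] <= 2016)
    & (data["Rating"] < 6.0)
    & (data["Revenue (Millions)"] > revenue_cutoff)
]
print(result)
```

The parentheses around each condition matter, because & binds more tightly than the comparison operators in Python.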


Despite receiving lower ratings, "The Twilight Saga: Breaking Dawn - Part 2" and "The Twilight Saga: Eclipse" dominated the box office.

6. Groupby Operation

Using the groupby() function, data may be grouped and actions can be carried out on top of grouped data. When we wish to apply aggregations and functions on grouped data, this is useful.
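As a sketch, the snippet below groups made-up movies by director and computes the average rating per director. The "Director" and "Rating" column names are assumptions about the dataset.

```python
import pandas as pd

# Made-up stand-in data with assumed column names.
data = pd.DataFrame(
    {
        "Director": ["Director X", "Director Y", "Director X", "Director Z", "Director Y"],
        "Rating": [5.8, 8.1, 6.4, 5.2, 7.3],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"], name="Title"),
)

# Group the rows by director, then aggregate the ratings in each group with mean().
avg_rating = data.groupby("Director")[["Rating"]].mean()
print(avg_rating)
```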


7. Sorting Operation

Another Pandas procedure that is frequently used in data analysis tasks is sorting.

Data can be sorted by a single column or by a list of columns using the sort_values() function.

For example, to order the directors from the previous groupby example from highest to lowest rating, we can sort by the average rating column.
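A sketch of this, on the same made-up director data (column names assumed):

```python
import pandas as pd

# Made-up stand-in data with assumed column names.
data = pd.DataFrame(
    {
        "Director": ["Director X", "Director Y", "Director X", "Director Z", "Director Y"],
        "Rating": [5.8, 8.1, 6.4, 5.2, 7.3],
    }
)

avg_rating = data.groupby("Director")[["Rating"]].mean()

# ascending=False sorts from the highest to the lowest average rating.
sorted_rating = avg_rating.sort_values(by="Rating", ascending=False)
print(sorted_rating)
```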


8. Dealing with Missing Values

Pandas has the isnull() method, which detects null values in a dataframe. Let us see how to use it.
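A common pattern, sketched on made-up data with assumed column names, is to chain isnull() with sum() to count the nulls in each column:

```python
import pandas as pd

# Made-up stand-in data; None marks a missing value.
data = pd.DataFrame(
    {
        "Rating": [5.8, 8.1, 6.4],
        "Revenue (Millions)": [292.3, None, 45.0],
        "Metascore": [52.0, 88.0, None],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# isnull() returns a boolean DataFrame; sum() counts the True values per column.
null_counts = data.isnull().sum()
print(null_counts)
```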


Here, we can see that the columns "Revenue (Millions)" and "Metascore" have null values.

We can decide whether to discard null values or impute them based on what we've observed in the data.

9. Dropping Null Values and Columns

Another action that is crucial for data analysis is dropping columns and rows. Rows or columns can be dropped depending on conditions using the drop() method.
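For instance, dropping the "Metascore" column can be sketched as follows, on made-up data with assumed column names:

```python
import pandas as pd

# Made-up stand-in data.
data = pd.DataFrame(
    {
        "Rating": [5.8, 8.1, 6.4],
        "Metascore": [52.0, 88.0, None],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# axis=1 means "drop a column"; drop() returns a new DataFrame by default.
without_metascore = data.drop("Metascore", axis=1)

print(without_metascore.columns)
print(data.columns)  # the original still contains "Metascore"
```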

Calling drop() on "Metascore" with axis=1 removes that column entirely; axis=1 indicates that a column, rather than a row, is to be removed. Unless we pass the argument inplace=True to the drop() method, or assign the result back, these changes won't be reflected in the original data.

Using the dropna() method, we can also remove rows and columns containing null values.

The thresh argument of dropna() sets the minimum number of non-null values that a row or column must have in order to be retained.
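The variants of dropna() can be sketched on a small made-up DataFrame (column names assumed):

```python
import pandas as pd

# Made-up stand-in data: row A has 3 non-null values, row B has 2, row C has 1.
data = pd.DataFrame(
    {
        "Rating": [5.8, 8.1, 6.4],
        "Revenue (Millions)": [292.3, None, None],
        "Metascore": [52.0, 88.0, None],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

rows_without_nulls = data.dropna()        # drop rows containing any null
cols_without_nulls = data.dropna(axis=1)  # drop columns containing any null
at_least_two = data.dropna(thresh=2)      # keep rows with at least 2 non-null values

print(rows_without_nulls)
print(cols_without_nulls)
print(at_least_two)
```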

Instead of dropping them, we can replace the null values in "Revenue (Millions)" with the mean revenue.

The fillna() method fills null values with the supplied value.
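A minimal sketch of mean imputation, with made-up revenue figures and an assumed column name:

```python
import pandas as pd

# Made-up stand-in data with one missing revenue value.
data = pd.DataFrame(
    {"Revenue (Millions)": [100.0, None, 300.0]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# mean() skips nulls by default, so the mean here is computed from the known values.
mean_revenue = data["Revenue (Millions)"].mean()

# Assign the filled column back so the change is kept in the DataFrame.
data["Revenue (Millions)"] = data["Revenue (Millions)"].fillna(mean_revenue)
print(data)
```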

Now, if we look at the dataframe, the Revenue column won't include any null values.


10. apply()

Whenever we want to apply a function to the data, the apply() method is useful. Each row of the dataframe (or each value of a column) is passed to the function, which returns a result. The function may be user-defined or built-in.

For instance, if we want to categorize the movies based on user ratings, we may construct the appropriate function and then use it on the dataframe as seen below.

I'll create a function that divides movies into categories according to user ratings.
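Such a function could be sketched as below. The names rating_group and "Rating category" come from this blog; the cut-offs, the category labels, and the sample data are assumptions, since the original thresholds are not shown.

```python
import pandas as pd


def rating_group(rating):
    """Map a numeric rating to a category (cut-offs are assumed, not the original ones)."""
    if rating >= 7.5:
        return "Good"
    elif rating >= 6.0:
        return "Average"
    return "Bad"


# Made-up stand-in data.
data = pd.DataFrame(
    {"Rating": [8.1, 6.4, 5.2]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# apply() calls rating_group once per value in the "Rating" column.
data["Rating category"] = data["Rating"].apply(rating_group)
print(data)
```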

Once this function is applied to the actual dataframe, each row's "Rating category" is determined.

Below is the result of the data after applying the rating_group() function

