Pandas DataFrames

A Pandas DataFrame is a labeled two-dimensional data structure similar to a worksheet or a relational database table. The columns of a DataFrame can be of different types.

A DataFrame can be created in the following ways:

  • Using another DataFrame
  • Using a NumPy array or a composite of arrays that has a two-dimensional shape
  • From another Pandas data structure called Series
  • From a file (a CSV file for example)
  • From a dictionary of one-dimensional structures (one-dimensional NumPy arrays, lists, dicts, or Pandas Series, for example)
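As a minimal sketch of several of the options listed above, the snippet below builds small DataFrames from a dictionary, a NumPy array, a Series, and another DataFrame. The variable names and the tiny sample values are made up for illustration and are not taken from any real data set.

```python
import numpy as np
import pandas as pd

# From a dictionary of one-dimensional structures (plain lists here).
from_dict = pd.DataFrame({"Country": ["Afghanistan", "Albania"], "CountryID": [1, 2]})

# From a two-dimensional NumPy array, with column names supplied explicitly.
from_array = pd.DataFrame(np.array([[1, 1], [2, 2]]), columns=["CountryID", "Continent"])

# From a Series: the name of the Series becomes the column name.
from_series = pd.Series(["Afghanistan", "Albania"], name="Country").to_frame()

# From another DataFrame (copy() gives an independent object).
from_frame = from_dict.copy()

print(from_dict.shape, from_array.shape, from_series.shape, from_frame.shape)
```

Creating a DataFrame from a file is covered by the read_csv example that follows.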

Example:

As an example, I have downloaded the file WHO.csv from http://www.exploredata.net/Downloads/WHO-Data-Set and loaded it with read_csv:

import pandas as pd

# Load the CSV file into a DataFrame (WHO.csv must be in the working directory)
df = pd.read_csv("WHO.csv")
print("Dataframe", df)

The output looks something like this (only the beginning is shown):

Dataframe                       Country  CountryID  Continent  \
0                 Afghanistan          1          1   
1                     Albania          2          2   
2                     Algeria          3          3   
3                     Andorra          4          2   
4                      Angola          5          3   
5         Antigua and Barbuda          6          4   
6                   Argentina          7          5   
…

Query the number of rows and columns of a DataFrame

The shape attribute returns a (rows, columns) tuple, and len() returns the number of rows:

print("Shape", df.shape)
print("Length", len(df))

#out
Shape (202, 358)
Length 202
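Because shape is an ordinary tuple, it can be unpacked into separate row and column counts. The following is a small sketch on a made-up three-row frame (the name small and its values are only for illustration):

```python
import pandas as pd

# A tiny illustrative frame: 3 rows, 2 columns.
small = pd.DataFrame({"Country": ["Afghanistan", "Albania", "Algeria"],
                      "CountryID": [1, 2, 3]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = small.shape
print("Rows", n_rows)          # Rows 3
print("Columns", n_cols)       # Columns 2
print("Length", len(small))    # Length 3, the same as the row count
```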

Print the column headers and the data types

You can check the column headers and the data types:

print("Column Headers", df.columns)
print("Data types", df.dtypes)

The headers are stored in a specialized data structure, an Index:

Column Headers Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total', 'Population annual growth rate (%)',
       ...
       'Total_CO2_emissions', 'Total_income', 'Total_reserves',
       'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
       'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object', length=358)

The data types are printed as shown below:

Data types Country                                                                             object
CountryID                                                                            int64
Continent                                                                            int64
Adolescent fertility rate (%)                                                      float64
Adult literacy rate (%)                                                            float64
Gross national income per capita (PPP international $)                             float64
Net primary school enrolment ratio female (%)                                      float64
…..
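The dtypes attribute is itself a Series indexed by column name, so the type of a single column can be looked up by its header. As a hedged sketch on a made-up frame (the column "Rate" and all values are hypothetical), select_dtypes can also be used to keep only the numeric columns:

```python
import pandas as pd

# Illustrative data only: one text column, one integer column, one float column.
small = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                      "CountryID": [1, 2],
                      "Rate": [1.5, 2.5]})

# Look up the data type of a single column by its header.
print("CountryID type", small.dtypes["CountryID"])
print("Rate type", small.dtypes["Rate"])

# Keep only the columns with a numeric data type.
numeric = small.select_dtypes(include="number")
print("Numeric columns", list(numeric.columns))
```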

Index of Pandas DataFrames

Pandas DataFrames have an index (which can be specified or created automatically), and it can be accessed through the index attribute:

print("Index", df.index)

The output will be:

Index RangeIndex(start=0, stop=202, step=1)

Basically, the index is a wrapper around a range that starts at 0 and stops before 202 in our case, with an increment of one for each row. The stop value is exclusive, so the row labels run from 0 to 201.
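To illustrate the difference between an automatically created index and a specified one, here is a short sketch on made-up data (the frames auto and labeled are hypothetical names). The first frame gets a RangeIndex by default; the second uses country names as row labels, which can then be used for lookups with loc.

```python
import pandas as pd

# No index given: Pandas creates a RangeIndex automatically.
auto = pd.DataFrame({"CountryID": [1, 2, 3]})
print(auto.index)         # RangeIndex(start=0, stop=3, step=1)
print(list(auto.index))   # [0, 1, 2] (the stop value 3 is not included)

# Index specified explicitly: rows are labeled by country name.
labeled = pd.DataFrame({"CountryID": [1, 2, 3]},
                       index=["Afghanistan", "Albania", "Algeria"])
print(labeled.loc["Albania", "CountryID"])   # 2
```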

You can access and iterate over the underlying data of a DataFrame through the values attribute, which returns a NumPy array:

print("Values", df.values)

The output looks something like this:

Values [['Afghanistan' 1 1 ..., 5740436.0 5.44 22.9]
 ['Albania' 2 2 ..., 1431793.9 2.21 45.4]
 ['Algeria' 3 3 ..., 20800000.0 2.61 63.3]
 ..., 
 ['Yemen' 200 1 ..., 5759120.5 4.37 27.3]
 ['Zambia' 201 3 ..., 4017411.0 1.95 35.0]
 ['Zimbabwe' 202 3 ..., 4709965.0 1.9 35.9]]
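Since the underlying data is a two-dimensional NumPy array, it can be iterated row by row. The sketch below uses a made-up two-row frame and the to_numpy() method, which the Pandas documentation recommends over the values attribute; both return the data as a NumPy array here.

```python
import pandas as pd

# Illustrative data only.
small = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                      "CountryID": [1, 2]})

# Get the underlying data as a two-dimensional NumPy array.
values = small.to_numpy()

# Iterating over the array yields one row at a time.
rows = []
for row in values:
    rows.append(list(row))
    print(row)
```

For mixed column types such as these, the resulting array holds generic Python objects, so numeric operations on it are slower than on a purely numeric array.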
