Querying Data in Pandas

A Pandas DataFrame is structured very similar to a relational database. We can view operations that read data from a DataFrame as a query. We are going to retrieve the annual sunspot data from Quandl.

We can use the Quandl API or we can download the data manually as a CSV file from Quandl website. To download the data you have to sign up. I will install the API by running the following command in the terminal:

pip3 install Quandl

Note:
Using the API is limited to 50 API calls per day.

Example:

Get the data

After installing the Quandl we can get the data as follow:

import numpy as np
import quandl

sunspots = quandl.get("SIDC/SUNSPOTS_A")

The head() and tail() methods

The head() and tail() methods have the same purpose as the UNIX commands with the same name. Select the first n and last n records of a DataFrame (n is an integer parameter).

print("Head 2", sunspots.head(2) )
print("Tail 2", sunspots.tail(2))

The output will be:

Head 2             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1700-12-31                               8.3                             NaN   
1701-12-31                              18.3                             NaN   

            Number of Observations  Definitive/Provisional Indicator  
Date                                                                  
1700-12-31                     NaN                               1.0  
1701-12-31                     NaN                               1.0  
Tail 2             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
2015-12-31                              69.8                             6.4   
2016-12-31                              39.8                             3.9   

            Number of Observations  Definitive/Provisional Indicator  
Date                                                                  
2015-12-31                  8903.0                               1.0  
2016-12-31                  9940.0                               1.0

We get the first two and last two rows of the sunspot data.

Query for the last value

This is how you query for the last value using the last date:

last_date = sunspots.index[-1]
print("Last value", sunspots.loc[last_date])

The output will be:

Last value Yearly Mean Total Sunspot Number      39.8
Yearly Mean Standard Deviation         3.9
Number of Observations              9940.0
Definitive/Provisional Indicator       1.0
Name: 2016-12-31 00:00:00, dtype: float64

Query the date

Query the date with date strings in the YYYYMMDD format as shown below:

print("Values slice by date:\n", sunspots["20020101": "20131231"])

We will get records records from 2002/01/01 until 2013/12/31. The output will be:

Values slice by date:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
2002-12-31                             163.6                             9.8   
2003-12-31                              99.3                             7.1   
...

            Number of Observations  Definitive/Provisional Indicator  
Date                                                                  
...
2011-12-31                  6077.0                               1.0  
2012-12-31                  5753.0                               1.0  
2013-12-31                  5347.0                               1.0 

Query with a list of indices

A list of indices can be used to query as well.

print("Slice from a list of indices:\n", sunspots.iloc[[2, 4, -4, -2]])

The output will be:

Slice from a list of indices:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1702-12-31                              26.7                             NaN   
1704-12-31                              60.0                             NaN   
2013-12-31                              94.0                             6.9   
2015-12-31                              69.8                             6.4   

            Number of Observations  Definitive/Provisional Indicator  
Date                                                                  
1702-12-31                     NaN                               1.0  
1704-12-31                     NaN                               1.0  
2013-12-31                  5347.0                               1.0  
2015-12-31                  8903.0                               1.0

Select scalar values

There are two options to select scalar values. Two integers are required, the first for the row and the second for the column:

print("Scalar with Iloc:", sunspots.iloc[0, 0])
print("Scalar with iat", sunspots.iat[1, 0])

The output will be:

Scalar with Iloc: 8.3
Scalar with iat 18.3

Querying with Booleans

Querying with Booleans is similar to Where clause of SQL. The following code queries for values larger than the arithmetic mean.

print("Boolean selection", sunspots[sunspots > sunspots.mean()])
print("Boolean selection with column label:\n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()])

The first query yields all the rows (some rows not conforming to the condition that has a value of NaN). The second query returns only the rows where the value is larger than the mean.

The output will be:

Boolean selection             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1700-12-31                               NaN                             NaN   
1701-12-31                               NaN                             NaN   
1702-12-31                               NaN                             NaN
…
2014-12-31                  5273.0                               NaN  
2015-12-31                  8903.0                               NaN  
2016-12-31                  9940.0                               NaN  

[317 rows x 4 columns]
Boolean selection with column label:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1981-12-31                             198.9                            13.1   
1982-12-31                             162.4                            12.1   
1983-12-31                              91.0                             7.6   
1984-12-31                              60.5                             5.9
…
2014-12-31                  5273.0                               1.0  
2015-12-31                  8903.0                               1.0  
2016-12-31                  9940.0                               1.0  

Leave a Reply