Python Skills Assessment#

In this session you will have the opportunity to work on your Python skills assessment. As was the case last semester, you will be asked to complete a series of tasks using the skills that we have covered in the computing workshops. The assessment this week involves:

  • Using Pandas DataFrames to filter and sort data.

  • Performing non-linear regressions for data extracted from a Pandas DataFrame.

  • Performing a weighted linear regression and calculating \(\chi_n^2\).

  • Interpreting the value of \(\chi_n^2\).

  • Presenting data in a high quality plot with subplots.

Note

Please complete these activities in a Jupyter Notebook and then submit this as a PDF to the submission portal on BlackBoard. The submission is due by 9am on Monday the 22nd of April.

The assessment rubric for this assignment can be found here. You will be asked to reattempt and resubmit this skills task if your work “requires improvement” for multiple elements.


Assessment Brief#

You have been provided with three CSV files containing data about the 2020 Covid-19 pandemic. These files are:

  • “modified_country_vaccinations.csv” - this file contains information about the vaccine campaigns of different countries around the world. For every country, metrics are provided for a range of dates.

  • “worldometer_coronavirus_daily_data.csv” - this file contains covid-19 related statistics provided for each country on a per day basis.

  • “worldometer_coronavirus_summary_data.csv” - this file contains a summary of covid-19 related statistics for each country.

You are provided with a Jupyter Notebook in CoCalc to perform the following tasks.


Task 1#

Using Pandas, import the data from the CSV files into three DataFrames and familiarise yourself with how the DataFrames are organised.
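As a minimal sketch of the import-and-inspect step, the example below reads a small in-memory CSV rather than one of the assessment files. The column names and values are hypothetical placeholders, so check the real headers in your own DataFrames.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; the real files will have
# different columns and many more rows.
csv_text = """country,continent,total_confirmed,total_deaths
Aland,Europe,1000,20
Borduria,Europe,500,5
Cascadia,North America,2000,30
"""

# pd.read_csv accepts a file path or a file-like object.
# With a file on disk you would pass its path as a string instead.
summary = pd.read_csv(io.StringIO(csv_text))

print(summary.head())     # first few rows
print(summary.columns)    # column headers
print(summary.shape)      # (number of rows, number of columns)
summary.info()            # data types and non-null counts per column
```

Looking at `head()`, `columns` and `info()` first makes it much easier to choose the right column headers in the later tasks.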


Task 2#

From the worldometer_coronavirus_summary_data, using a for loop, determine:

  • the total number of confirmed cases in each continent,

  • the total number of deaths attributed to covid-19 in each continent,

  • the percentage of confirmed cases that led to deaths for each continent.

The for loop should contain clear print statements outlining your findings.
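One possible shape for such a loop is sketched below on a toy DataFrame. The column names (`continent`, `total_confirmed`, `total_deaths`) and the numbers are assumptions made for illustration and are not taken from the assessment data.

```python
import pandas as pd

# Toy data with hypothetical column names and values.
data = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "total_confirmed": [1000, 500, 4000, 1000],
    "total_deaths": [20, 10, 40, 10],
})

results = {}
# Loop over each distinct value in the grouping column.
for continent in data["continent"].unique():
    # Boolean mask selects only the rows for this continent.
    subset = data[data["continent"] == continent]
    cases = subset["total_confirmed"].sum()
    deaths = subset["total_deaths"].sum()
    percentage = 100 * deaths / cases
    results[continent] = (cases, deaths, percentage)
    print(f"{continent}: {cases} confirmed cases, {deaths} deaths, "
          f"{percentage:.2f}% of confirmed cases resulted in death")
```

Storing the values in a dictionary as well as printing them is optional, but it keeps the numbers available for later cells.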


Task 3#

Filter the worldometer_coronavirus_summary_data to create a new DataFrame containing only information pertaining to countries in Europe. Sort this DataFrame by the total number of deaths and display the top 5 countries by deaths.
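The filter-then-sort pattern can be sketched as follows, again on toy data with hypothetical column names. `sort_values` sorts in ascending order by default, so `ascending=False` is needed to put the largest values first.

```python
import pandas as pd

# Toy data with hypothetical column names and values.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Europe", "Asia", "Europe", "Europe"],
    "total_deaths": [10, 99, 30, 20],
})

# Keep only the rows where the condition is True.
europe = data[data["continent"] == "Europe"]

# Sort from largest to smallest, then take the first rows.
europe_sorted = europe.sort_values("total_deaths", ascending=False)
top = europe_sorted.head(2)
print(top)
```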


Task 4#

Warning

Tasks 4 and 5 will require you to look at data organised by calendar dates.

Although it is easy to plot dates on the axes of a Python plot, fitting data against dates is much more difficult. As such, for Tasks 4 and 5 your x-axis data should be the “day number” for that particular task, e.g. 1, 2, 3, 4.

You can easily make a linear array for the range of days required for a particular task using np.linspace(). To get the corresponding range from your DataFrame, you can simply slice a particular column using the square brackets method, e.g. DataFrame["column_header"][start:stop].
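A minimal sketch of pairing a day-number array with a sliced column is shown below. The column name `new_cases` and the values are made up for illustration. The key point is that the number of points passed to `np.linspace()` must equal the number of rows in the slice.

```python
import numpy as np
import pandas as pd

# Toy data; "new_cases" is a hypothetical column header.
daily = pd.DataFrame({"new_cases": [0, 2, 3, 5, 8, 13, 21]})

start, stop = 1, 6          # the slice keeps rows 1, 2, 3, 4, 5
n_days = stop - start       # number of rows in the slice

# Day numbers 1, 2, ..., n_days (n_days evenly spaced points).
days = np.linspace(1, n_days, n_days)

# Square-bracket slicing of a column; stop is excluded, as with lists.
cases = daily["new_cases"][start:stop].to_numpy()

print(days)
print(cases)
```

Writing `daily["new_cases"].iloc[start:stop]` is an equivalent, more explicit way to request positional slicing.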

Using the worldometer_coronavirus_summary_data, worldometer_coronavirus_daily_data and modified_country_vaccinations_data sets, you will create a figure containing subplots. Ensure that the plots are presented clearly and with the appropriate features of a high quality plot. To create the figure and subplots, use the following snippet of code:

import matplotlib.pyplot as plt

# First, let's set the size of our figure.
fig = plt.figure(figsize=(12, 16))

# Next, we can add our subplots and arrange them vertically
# (3 rows, 1 column; the last digit selects the position).
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)

4.1#

On the first subplot make a bar chart showing the total number of deaths for the 10 European countries with the highest number of total deaths.

4.2#

On the second subplot show the number of new Covid-19 cases for the first 28 days of the pandemic for the UK, France and Germany. Due to incomplete data on day zero, we will consider the first day of the pandemic to be the second-earliest date in the worldometer_coronavirus_daily_data, i.e. when slicing your DataFrame, start at index 1, not 0.

4.3#

The number of new cases can be modelled using the following equation:

\[ y = A(1+R)^t+O\]

where \(A\) is the initial number of cases, \(R\) is the rate of increase, \(t\) is the number of days and \(O\) is an offset. Using this equation and curve_fit, fit the above data, include the fits on your plot and report the rate of increase (and associated error) of the number of cases for the United Kingdom, France and Germany.
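The fitting step might look like the sketch below, which fits the model to synthetic data generated from known parameters instead of the assessment data. The function name `growth`, the initial guess `p0` and all numerical values are illustrative assumptions. The parameter uncertainties are taken as the square roots of the diagonal of the covariance matrix returned by `curve_fit`.

```python
import numpy as np
from scipy.optimize import curve_fit


def growth(t, A, R, O):
    """Model y = A(1 + R)^t + O."""
    return A * (1 + R) ** t + O


# Synthetic data: day numbers 1..28 with known parameters plus noise.
t = np.arange(1, 29)
rng = np.random.default_rng(0)
y = growth(t, 5.0, 0.2, 10.0) + rng.normal(0, 1.0, t.size)

# A sensible initial guess helps non-linear fits converge.
popt, pcov = curve_fit(growth, t, y, p0=[2.0, 0.15, 0.0])

# One-standard-deviation errors on A, R and O.
perr = np.sqrt(np.diag(pcov))

print(f"R = {popt[1]:.3f} +/- {perr[1]:.3f}")

# Smooth curve to overlay on the data points in a plot.
y_fit = growth(t, *popt)
```

With real data the default starting values may not converge, so it is worth trying a few values of `p0` and checking the fitted curve against the data by eye.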

4.4#

On the third subplot show the total number of vaccinations given in the UK between the 10th and 50th day of the vaccination campaign. For the purpose of this exercise, an additional column has been added to modified_country_vaccinations_data called “VaxError”.

  1. Using VaxError perform a weighted linear regression for this subset of the data and include this on your plot.

  2. Report the vaccination rate (and its associated error) over these 40 days.
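One way to perform a weighted linear regression is with `curve_fit` and its `sigma` argument, sketched below on synthetic data. This is an assumption about the method and other approaches from the workshops may be equally valid. The values standing in for the error column are made up. Setting `absolute_sigma=True` tells `curve_fit` to treat the uncertainties as real measurement errors when computing the parameter covariance, not merely as relative weights.

```python
import numpy as np
from scipy.optimize import curve_fit


def line(x, m, c):
    """Straight line y = m x + c."""
    return m * x + c


# Synthetic data: 20 days with a known gradient and intercept.
x = np.arange(1, 21)
rng = np.random.default_rng(1)
sigma = np.full(x.size, 50.0)     # made-up per-point uncertainties
y = line(x, 1000.0, 200.0) + rng.normal(0, sigma)

# Points with smaller sigma carry more weight in the fit.
popt, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
m, c = popt
m_err, c_err = np.sqrt(np.diag(pcov))

print(f"gradient  = {m:.1f} +/- {m_err:.1f}")
print(f"intercept = {c:.1f} +/- {c_err:.1f}")
```

The gradient of the fitted line is the rate of change of y per day, so for cumulative totals it plays the role of a daily rate.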


Task 5#

Calculate the reduced chi-squared (\(\chi_n^2\)) value for the weighted linear regression.

If you were to obtain this value of \(\chi_n^2\) in one of your experiments from the practical lab sessions, what would this suggest about your data/model?
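The calculation can be sketched as below, assuming the usual definition: the sum of squared, uncertainty-normalised residuals divided by the number of degrees of freedom (number of data points minus number of fitted parameters). The helper name `reduced_chi_squared` and the toy numbers are hypothetical.

```python
import numpy as np


def reduced_chi_squared(y, y_model, sigma, n_params):
    """Sum of ((y - y_model) / sigma)^2 divided by the degrees of freedom."""
    residuals = (y - y_model) / sigma
    dof = len(y) - n_params       # data points minus fitted parameters
    return np.sum(residuals ** 2) / dof


# Toy numbers: four measurements compared with a two-parameter model.
y = np.array([1.0, 2.1, 2.9, 4.2])
y_model = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.full(4, 0.1)

value = reduced_chi_squared(y, y_model, sigma, n_params=2)
print(f"reduced chi-squared = {value:.2f}")
```

For a straight-line fit `n_params` is 2, one for the gradient and one for the intercept.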