Random Variables

A random variable is a variable whose possible values are numerical outcomes of a random phenomenon.

Example 1: Tossing a coin - we get Heads or Tails. If we associate a value with each outcome, e.g. Heads = 0, Tails = 1, we can say,

$X = \{0, 1\}$

$X \color{red}{\leftarrow \text{Random variable}},\,\,\{0,1\} \color{red}{\leftarrow \text{Possible values}}$

Example 2: Rolling a die - we get a value between 1 and 6.

$Y = \{1, 2, 3, 4, 5, 6\}$

  • A random variable is denoted by an uppercase letter (X, Y, ...)
  • Specific values are denoted by lowercase letters (x = 0, y = 5, etc.)
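The two examples above can be sketched in Python to show the difference between a random variable and its realized values. This is a minimal illustration, not part of the original examples: the helper names `toss_coin` and `roll_die` are hypothetical, and the Heads = 0, Tails = 1 mapping follows Example 1.

```python
import random

# Mapping from outcome to number, as chosen in Example 1.
COIN_VALUES = {"Heads": 0, "Tails": 1}

def toss_coin(rng):
    """Return one realized value x of the random variable X (0 or 1)."""
    outcome = rng.choice(["Heads", "Tails"])
    return COIN_VALUES[outcome]

def roll_die(rng):
    """Return one realized value y of the random variable Y (1 to 6)."""
    return rng.randint(1, 6)

rng = random.Random(0)  # fixed seed so the run is repeatable
coin_samples = [toss_coin(rng) for _ in range(10)]
die_samples = [roll_die(rng) for _ in range(10)]

print("x values:", coin_samples)
print("y values:", die_samples)
```

Each call produces one lowercase value (x or y); the uppercase X and Y stand for the whole process that generates them.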

There are two types of random variables - Discrete and Continuous variables.

Discrete Variables

Discrete variables take only specific, countable values.

Examples

  • Outcomes of rolling a die
  • No. of players in a team
  • No. of smartphones in Toronto.

Continuous Variables

Continuous variables can take any value within a range, so the number of possible values is infinite.

Examples

  • Weight of children from age 6-13
  • Time taken to score the first goal in soccer
  • Economy rates of a bowler in cricket
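A rough way to see the difference between the two types is to draw many samples of each and count the distinct values. The sketch below assumes a uniform weight between 20 and 60 kg purely for illustration; neither the range nor the shape comes from real data.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Discrete: a die roll can only land on one of six specific values.
die_rolls = [rng.randint(1, 6) for _ in range(1000)]

# Continuous: a weight can be any real number within a range.
# The 20-60 kg range and the uniform shape are assumptions for illustration.
weights_kg = [rng.uniform(20.0, 60.0) for _ in range(1000)]

print("Distinct die values:", sorted(set(die_rolls)))
print("Distinct weight values:", len(set(weights_kg)))
```

The die rolls collapse to at most six distinct values, while the sampled weights are almost all different from one another.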

Probability distribution

Let's construct the probability distribution for a discrete random variable, using a dummy dataset of 40 ratings for a movie on a scale of 1 to 5, stored in the list ratings:

import numpy as np
import pandas as pd

ratings = [5, 4, 4, 5, 1, 4, 3, 3, 5, 5,
           1, 1, 3, 5, 5, 4, 3, 5, 1, 5,
           5, 2, 5, 4, 5, 5, 3, 1, 1, 1,
           5, 4, 4, 4, 3, 5, 2, 1, 2, 4]
ratings_df = pd.DataFrame(ratings, columns = ['ratings'])
print(ratings_df.ratings.value_counts())
5    14
4     9
1     8
3     6
2     3
Name: ratings, dtype: int64

Let's calculate the probability of each rating value.

Rating   Responses   Probability
5        14          14/40 = 35%
4        9           9/40 = 22.5%
3        6           6/40 = 15%
2        3           3/40 = 7.5%
1        8           8/40 = 20%

We can verify the same using pandas

ratings_distribution = ratings_df.ratings.value_counts(normalize = True) * 100
print(ratings_distribution)
5    35.0
4    22.5
1    20.0
3    15.0
2     7.5
Name: ratings, dtype: float64
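A probability distribution must have non-negative entries that sum to 1 (or 100%). The sketch below rebuilds the distribution with only the standard library and checks that property. Using `Counter` and `Fraction` is a choice made here for exact arithmetic, not something the pandas code above relies on.

```python
from collections import Counter
from fractions import Fraction

ratings = [5, 4, 4, 5, 1, 4, 3, 3, 5, 5,
           1, 1, 3, 5, 5, 4, 3, 5, 1, 5,
           5, 2, 5, 4, 5, 5, 3, 1, 1, 1,
           5, 4, 4, 4, 3, 5, 2, 1, 2, 4]

counts = Counter(ratings)
n = len(ratings)

# P(X = x) = (number of responses with rating x) / (total responses)
distribution = {x: Fraction(counts[x], n) for x in sorted(counts)}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p} = {float(p):.3f}")

# Exact fractions avoid floating-point rounding in the total.
total = sum(distribution.values())
print(f"Total probability = {total}")
```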

Let's plot the probability distribution graph.

import matplotlib.pyplot as plt

# Bar chart of the percentage for each rating, highest rating first
ratings_distribution.sort_index(ascending = False).plot.bar()
plt.show()

Figure: Probability Distribution - Ratings

Expected Value

The expected value is the weighted average of the possible values, where each value is weighted by its probability.

For the above example, if X is the random variable,

X    P(X)     Weighted value
5    0.35     5 × 0.35 = 1.75
4    0.225    4 × 0.225 = 0.9
3    0.15     3 × 0.15 = 0.45
2    0.075    2 × 0.075 = 0.15
1    0.2      1 × 0.2 = 0.2

$\text{Expected value } E(X) = 1.75 + 0.9 + 0.45 + 0.15 + 0.2 = 3.45$

We can verify by calculating the average,

ratings_avg = sum(ratings) / len(ratings)
print(f'Expected value = {ratings_avg}')
Expected value = 3.45
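The plain average matches because each rating appears in the list as often as its probability implies. As a sketch of the weighted-sum form itself, the snippet below computes the sum of x × P(X = x) over the distinct ratings. It uses `Counter` and `Fraction` from the standard library, a choice made here to keep the arithmetic exact.

```python
from collections import Counter
from fractions import Fraction

ratings = [5, 4, 4, 5, 1, 4, 3, 3, 5, 5,
           1, 1, 3, 5, 5, 4, 3, 5, 1, 5,
           5, 2, 5, 4, 5, 5, 3, 1, 1, 1,
           5, 4, 4, 4, 3, 5, 2, 1, 2, 4]

counts = Counter(ratings)
n = len(ratings)

# E(X) = sum over distinct values x of x * P(X = x)
expected_value = sum(x * Fraction(c, n) for x, c in counts.items())

print(f"E(X) = {expected_value} = {float(expected_value)}")
```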

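Let's calculate the standard deviation of X. The variance is the probability-weighted average of the squared deviations from E(X), and the standard deviation is its square root.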

X       P(X)     (X - E(X))² * P(X)
5       0.35     (5 - 3.45)² * 0.35 = 0.840875
4       0.225    (4 - 3.45)² * 0.225 = 0.0680625
3       0.15     (3 - 3.45)² * 0.15 = 0.030375
2       0.075    (2 - 3.45)² * 0.075 = 0.1576875
1       0.2      (1 - 3.45)² * 0.2 = 1.2005
Total   1        2.2975 (Variance)

$\text{Standard deviation} = \sqrt{2.2975} \approx 1.516$
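The same calculation can be sketched in Python and cross-checked against the population formulas in the standard library's `statistics` module. The variable names are illustrative.

```python
import math
import statistics
from collections import Counter

ratings = [5, 4, 4, 5, 1, 4, 3, 3, 5, 5,
           1, 1, 3, 5, 5, 4, 3, 5, 1, 5,
           5, 2, 5, 4, 5, 5, 3, 1, 1, 1,
           5, 4, 4, 4, 3, 5, 2, 1, 2, 4]

n = len(ratings)
probs = {x: c / n for x, c in Counter(ratings).items()}

# E(X): probability-weighted average of the values
mean = sum(x * p for x, p in probs.items())

# Var(X): probability-weighted average of squared deviations from E(X)
variance = sum((x - mean) ** 2 * p for x, p in probs.items())
std_dev = math.sqrt(variance)

print(f"Variance = {variance:.4f}")
print(f"Standard deviation = {std_dev:.3f}")

# Cross-check against the population (divide-by-n) formulas.
assert math.isclose(variance, statistics.pvariance(ratings))
assert math.isclose(std_dev, statistics.pstdev(ratings))
```

Note that pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), which divides by n - 1 and gives a slightly larger value; passing `ddof=0` matches the calculation above.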

Mean of sum & difference of two random variables

If we have two random variables X and Y with respective means $\mu_x$ and $\mu_y$, then

$E(X + Y) = \mu_{x + y} = \mu_x + \mu_y$

$E(X - Y) = \mu_{x - y} = \mu_x - \mu_y$
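These identities can be checked exactly on a small case. The sketch below assumes X and Y are two fair six-sided dice (an illustrative choice, not from the text above) and enumerates all 36 equally likely pairs.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
outcomes = list(product(faces, faces))  # 36 equally likely (x, y) pairs
p = Fraction(1, len(outcomes))          # probability of each pair

mu_x = sum(x * p for x, _ in outcomes)
mu_y = sum(y * p for _, y in outcomes)
mean_sum = sum((x + y) * p for x, y in outcomes)
mean_diff = sum((x - y) * p for x, y in outcomes)

print(f"mu_x = {mu_x}, mu_y = {mu_y}")
print(f"E(X + Y) = {mean_sum}, mu_x + mu_y = {mu_x + mu_y}")
print(f"E(X - Y) = {mean_diff}, mu_x - mu_y = {mu_x - mu_y}")
```

The dice here happen to be independent, but the two identities for the mean do not require independence.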