
4    Probit analysis of the participation decision


4.1 Motivation

4.2 Statistical implementation

4.3 Specification and estimation

4.4 Estimating distance to market

4.5 Regression results

4.6 Distance to market estimates

4.7 Conclusions


The natural vehicle for preliminary analysis of household-panel data is the probit model. Having described the data, the motivation for the survey and the survey collection procedures, we turn in this chapter to the motivation for, and application of, Markov chain Monte Carlo (MCMC) methods in probit estimation.

4.1 Motivation

Motivation for application of the probit model follows. Let i = 1, 2, ..., N, denote the households in question. Each household compares the level of utility derived from market participation, yi*, against its reservation utility attainable without market participation, vi*. Here, we use an asterisk (*) to denote the fact that both levels of utility are latent random variables.

Assuming that differences between utilities are determined by characteristics specific to each household, we collect these characteristics in the vector xi ≡ (xi1, xi2, ..., xiq). Without loss of generality, we set vi* = 0, so that yi* denotes the difference between the utility from participation and the reservation utility, and we describe its relationship to the characteristics by a function of xi. The condition characterising the discrete choice about whether to participate in the market can then be written as:

with participation when yi* > 0 and non-participation otherwise. We define the indicator variable δi = 1 when yi* > 0 and the household participates in the market, and δi = 0 under non-participation. This is the standard ‘index-utility’ representation of the probit model and is the natural first step in assessing household-panel data in a market-entry situation.

4.2 Statistical implementation

Statistical implementation of this simple framework closely follows the ideas outlined in Albert and Chib (1993). A linear version of the participation equation (2) has the form

zi = xiβ + ui,

where zi > 0 if δi = 1 and zi ≤ 0 otherwise; β is a vector of unknown coefficients controlling the relationship between household-specific characteristics and market participation; and ui is a random error. The econometrician observes δi = 1 if the latent random variable zi > 0 and δi = 0 otherwise, together with the vector of household-specific covariates, xi. The objective is to draw inferences about β and any other structural parameters by combining the observed and latent data. To do this, we assume that the participation variable, zi, has a normal distribution with mean xiβ (the product of the conditioning data and the unknown coefficient vector) and variance equal to one. The restriction on the variance is imposed for identification purposes.
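The latent-variable set-up just described can be illustrated with simulated data. The following is a minimal sketch, not the survey data: the function name `simulate_probit`, the intercept-plus-standard-normal covariates and the coefficient values are all assumptions made for illustration.

```python
import numpy as np

def simulate_probit(n, beta, seed=0):
    """Simulate (x, z, delta) from the index model z_i = x_i beta + u_i.

    Hypothetical helper: covariates are an intercept plus standard-normal
    columns, and the error variance is fixed at one, as in the text.
    """
    rng = np.random.default_rng(seed)
    q = len(beta)
    x = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
    u = rng.normal(size=n)              # u_i ~ N(0, 1)
    z = x @ beta + u                    # latent participation variable
    delta = (z > 0).astype(int)         # observed indicator: 1 if z_i > 0
    return x, z, delta

x, z, delta = simulate_probit(500, np.array([-0.5, 1.0]))
print("share participating:", delta.mean())
```

Only `x` and `delta` would be available to the econometrician; `z` is kept here purely to show how the observed indicator arises from the latent variable.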

4.3 Specification and estimation

The estimation procedure can be introduced by looking at the complete-data model, which we denote

z = xβ + u,

where z ≡ (z1, z2, ..., zN)' is the latent data; x ≡ (x1', x2', ..., xN')', with x1 ≡ (x11, x12, ..., x1q), x2 ≡ (x21, x22, ..., x2q), ..., xN ≡ (xN1, xN2, ..., xNq), contains the observations on the covariates; β ≡ (β1, β2, ..., βq)' is the parameter vector depicting the effects of changes in the covariates on the latent data; and the error vector u ≡ (u1, u2, ..., uN)' is assumed to have the normal distribution N(0N, IN), where 0N denotes the N-dimensional null vector and IN is the N × N identity matrix.

With this notation at hand we use a conventional non-informative prior for the unknown parameters, namely π(β) ∝ 1. (Recall that the error variance is restricted to take unit value for identification purposes.) Even with the unit restriction on the error variance, the model in its current setting is still intractable, due to the evaluation of integrals implied by the probit set-up. The key step in overcoming this impediment, as ably demonstrated in Albert and Chib (1993), is to augment the observed data likelihood with the latent data and derive estimates of these latent data as part of the estimation exercise. Accordingly, with the prior now specified to include these latent data, π(β, z) ∝ 1, the complete conditional distributions characterising the joint posterior distribution for the parameters β and the latent data z have simple forms. In particular, in terms of the current notation, these conditional distributions are

β | z ~ N(Eβ, Vβ) and z | β, δ ~ N(Ez, Vz), truncated so that zi > 0 when δi = 1 and zi ≤ 0 when δi = 0,

where Ez ≡ xβ, Vz ≡ IN, Eβ ≡ (x'x)–1x'z and Vβ ≡ (x'x)–1. The crucial observation is that these two distributions are easy to sample from. Consequently, simulations from the joint posterior can be undertaken through the following simple algorithm:

Step 1: Select starting values z(s).

Step 2: Draw β(s + 1) from the multivariate-normal (Eβ(s + 1), Vβ(s + 1)) distribution, where Eβ(s + 1) and Vβ(s + 1) denote conditioning on z(s) from Step 1.

Step 3: Draw z(s + 1) from the multivariate-normal (Ez(s + 1), Vz(s + 1)) distribution, truncated so that each zi(s + 1) is positive when δi = 1 and non-positive when δi = 0,          (6)

where Ez(s + 1) and Vz(s + 1) denote conditioning on β(s + 1) from Step 2.

Step 4: Repeat steps 2–3 many times, S1, until convergence is attained.

Step 5: Repeat steps 2–3 a further S2 times and collect samples {β(s), s = 1, 2, ..., S2} and {z(s), s = 1, 2, ..., S2}.

The draws in the last step can be used to compute summary statistics (means, medians, standard deviations) or to plot histograms of any summary measure of interest. In the results reported below, the algorithm is run for a ‘burn-in phase’ of S1 = 2000 iterations followed by a ‘collection phase’ of S2 = 2000 iterations.
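The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch in the spirit of Albert and Chib (1993), not the code used to produce the results: the function names, the starting values of ±0.5 and the simple rejection step used for the sign-restricted draws of z are all assumptions chosen for brevity.

```python
import numpy as np

def draw_truncated_z(mean, delta, rng):
    """Draw z_i ~ N(mean_i, 1) with z_i > 0 if delta_i = 1 and z_i <= 0 otherwise.

    Uses plain rejection sampling (an assumption made for simplicity):
    draws with the wrong sign are redrawn until every sign matches delta.
    """
    z = rng.normal(loc=mean)
    bad = (z > 0) != (delta == 1)
    while bad.any():
        z[bad] = rng.normal(loc=mean[bad])
        bad = (z > 0) != (delta == 1)
    return z

def gibbs_probit(x, delta, burn_in=2000, keep=2000, seed=0):
    """Data-augmentation Gibbs sampler for the probit model with a flat prior."""
    rng = np.random.default_rng(seed)
    n, q = x.shape
    v_beta = np.linalg.inv(x.T @ x)          # V_beta = (x'x)^-1, fixed across rounds
    chol = np.linalg.cholesky(v_beta)
    z = np.where(delta == 1, 0.5, -0.5)      # Step 1: starting values (hypothetical)
    betas = np.empty((keep, q))
    zs = np.empty((keep, n))
    for s in range(burn_in + keep):
        e_beta = v_beta @ (x.T @ z)          # E_beta = (x'x)^-1 x'z
        beta = e_beta + chol @ rng.normal(size=q)    # Step 2: beta | z
        z = draw_truncated_z(x @ beta, delta, rng)   # Step 3: z | beta, delta
        if s >= burn_in:                     # Step 5: collect after burn-in
            betas[s - burn_in] = beta
            zs[s - burn_in] = z
    return betas, zs
```

Calling `gibbs_probit(x, delta)` with an N × q covariate array and a 0/1 indicator vector returns the collected draws, from which posterior means, medians and standard deviations follow directly. Rejection sampling can be slow when an observation's mean lies far from its permitted region; an inverse-CDF draw is a common alternative.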

4.4 Estimating distance to market

While the impacts of the covariates x on the latent participation variable z are important in themselves, more interest resides in computing a measure of the additional resources required for each of the non-participating households to enter the market. We call these measures ‘distances to market’. Specifically, these distance measures are estimates of the additional levels of the regressors that make each non-participating household in the sample become active in the market. This question is answered directly as a by-product of data augmentation, showcasing the power of Markov chain Monte Carlo methods in policy formation.

Recall that, in each round of the algorithm in (6) we compute an estimate of the latent participation variable. For households that do not participate in the market this quantity has a negative value. This (negative-valued) quantity has important implications for policy. A household with a larger negative-valued latent variable is further from the market than one that has a smaller-valued latent quantity. But these estimates can, in turn, be transformed into a meaningful distance measure across each of the covariates in the model using some simple algebra.

Suppose, in the context of equation (4), that we wish to measure distance in terms of independent variable ‘k’. All we need do is solve the probit equation (setting the left-hand side to zero) for the value of ‘covariate k’ and then subtract from it the household’s observed level of the resource in question. The quantity that results is fundamental for policy because it provides an estimate of resource deficiency in the household and, hence, provides an estimate of the additional amount of the resource that is required to engender positive marketable surplus. It follows that these quantities are the ones that precipitate entry into the market, dilute the density of non-participation and, therefore, overcome a main impediment to economic development. They are the values:

where (the censor set) c denotes the set of households that do not participate in the market. Formally, c ≡ {i | δi = 0}.
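The verbal recipe can be made concrete as follows. The displayed formula is not reproduced in this text, so the form used here is an assumption: setting the index xiβ to zero, solving for covariate k and subtracting the observed level gives a shortfall of –xiβ/βk. The function name `distance_draws` and the toy numbers are hypothetical.

```python
import numpy as np

def distance_draws(x_c, betas, k):
    """Distance to market in units of covariate k, one row per draw of beta.

    x_c   : (n_c, q) covariates of the non-participating households
    betas : (S, q) draws of beta, e.g. collected from a Gibbs sampler
    k     : column index of the covariate used as the yardstick

    Assumed form: -(x_i beta) / beta_k, so positive values are shortfalls
    and negative values are 'excess' holdings of the resource.
    """
    index = betas @ x_c.T              # (S, n_c): x_i beta for every draw
    return -index / betas[:, [k]]      # divide each row by that draw's beta_k

# Toy check: intercept -2, slope 1 on the covariate of interest.
x_c = np.array([[1.0, 0.0],            # household holding 0 units
                [1.0, 3.0]])           # household holding 3 units
betas = np.array([[-2.0, 1.0], [-2.0, 1.0]])
d = distance_draws(x_c, betas, k=1)
print(d[0])
```

In the toy check the first household needs two more units to bring its index to zero, while the second already holds one unit more than the threshold, so its distance is negative.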

Note that quantities (7) are available across each non-participating household in the censor set. Therefore, further enhancing their appeal as policy measures, they can be used to provide precise measures of the levels of each covariate for each household. The question remaining (that is particularly relevant in the context of Bayesian inference) is the existence of a posterior distribution of each of these distance estimates and the existence of moments and other measures of central tendency that can be used to characterise these distributions—especially their locations and their scales.

Because the distance measures contain quantities that are either observed or are easily simulated as a by-product of the Gibbs sample, the natural inclination is to use the formulae, together with the outputs {β(s) s = 1, 2, ..., S2} and {z(s) s = 1, 2, ..., S2} in (6) to compute quantities {ki(s) s = 1, 2, ..., S2} from which means, standard errors and histograms can be constructed. However, these measures can only be meaningful when the posterior distribution is proper (that is, the distribution integrates to a finite measure) and there exist posterior moments to which the sample estimates correspond. Each of the quantities in (7) contains a quotient that is (conditionally distributed as) a ratio of normal random variables and, thus, the proposal to use the output of the sample to compute {ki(s) s = 1, 2, ..., S2} requires that the distribution of these ratios of normal random variates be proper and that their moments exist.

Findings by early contributors in the field (Merrill 1928; Geary 1930; Fieller 1932; Hinkley 1969) generate two relevant conclusions. First, the distributions of the quotients are proper but, second, moments may not exist. In loose terms, the requirements for the existence of moments depend on a ‘relative-variance condition’, namely that the means of the quantities in the denominators of the quotients on the right sides of (7) are ‘large’ in relation to their corresponding standard deviations. When this condition is met, moments exist and it is appropriate to characterise the locations and scales of the distributions through sample means and variances. When the moments do not exist (but the distributions are proper), it is inappropriate to use mean and variance estimates, but appropriate to use other descriptive measures such as histograms, posterior modes or, perhaps, medians of the sample estimates. Importantly, when the relative variance condition is met, the exact distribution of the ratio of normals is shown to be approximately normal. Consequently, some idea of the appropriateness of the various measures can be deduced by comparing the locations of modal estimates with estimates of the means and medians computed from the sample. When the relative-variance condition is met and the normal distribution provides a good approximation to the true distribution, the locations of the separate estimates should be similar. Unfortunately, due to the complex form of the posterior (Hinkley 1969, p. 636, equations 1 and 2) and the number of non-participating households (179 in total), computing posterior modes by a Monte Carlo variant of the EM algorithm (Dempster et al. 1977) as proposed by Chib (1996) is infeasible. 
Thus, we settle on four measures of central tendency, namely the mean obtained from the output of the Gibbs sample, the median of the Gibbs sample, a posterior-means estimate obtained by replacing the equation coefficients in (7) by their posterior means and the conditional means computed from the mixtures:

where the expectations on the right-hand sides are taken with respect to the conditional distributions π(xki | z, β), k = 1, 2, ..., q; i ∈ c. Some algebra reveals that, as long as the numerator can be safely assumed to be non-zero, these latter distributions are, themselves, conditionally normal, implying that the desired expectations do, in fact, exist. This point is important due to the fact that the measures in (7) provide more accurate estimates than the means obtained directly from the output of the Gibbs sample, a feature of the Gibbs sample predicated on the Rao-Blackwell theorem and illustrated, lucidly, by Gelfand and Smith (1990).
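The role of the ‘relative-variance condition’ can be seen in a small simulation of a ratio of two independent normal variates. This is a sketch under assumed values: the function name `ratio_summary` and all means and standard deviations are hypothetical and are not taken from the estimates reported in this chapter.

```python
import numpy as np

def ratio_summary(den_mean, den_sd, num_mean=-1.0, num_sd=0.3,
                  draws=100_000, seed=0):
    """Sample mean and median of N(num_mean, num_sd) / N(den_mean, den_sd)."""
    rng = np.random.default_rng(seed)
    numerator = rng.normal(num_mean, num_sd, draws)
    denominator = rng.normal(den_mean, den_sd, draws)
    ratio = numerator / denominator
    return ratio.mean(), np.median(ratio)

# Denominator mean large relative to its standard deviation:
# the sample mean and median should nearly coincide.
print(ratio_summary(den_mean=0.7, den_sd=0.06))

# Denominator mean small relative to its standard deviation: the
# denominator often passes near zero and the sample mean is unreliable.
print(ratio_summary(den_mean=0.05, den_sd=0.5))
```

Comparing the two printed pairs gives an informal version of the diagnostic described above: close agreement between mean and median is what one expects when the condition holds, whereas a sample mean that swings with the seed signals that moments may not exist.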

4.5 Regression results

Table 2 reports the results of the probit regression on the (68 households × 3 visits × 7 days milk sales =) 1428 observations. Column 1 reports definitions and column 2 reports posterior means of the Gibbs sample with implied asymptotic t-statistics in parentheses. All but one of the covariates (years of farm experience) are significant at the conventional 5% significance level, and most of the covariate parameter estimates have marginal significance levels beyond 1%. In addition, the signs of the predicted values of the estimated model suggest that only a small proportion of the observations lie outside their negative (positive) ranges for the non-participating (participating) households.

Table 2. Probit-equation regression estimates.

Regressor                                      Estimate (implied t-statistic)
Number of crossbred cows                        0.7184  (11.1314)
Number of local cows                            0.2609  (5.3243)
Time to the milk group, minutes                –0.0131  (–5.6077)
Farm experience of household head, years        0.0022  (0.4294)
Formal schooling of household head, years       0.0701  (3.7501)
Extension agent visits during the past year     0.2148  (10.2652)
Constant                                       –2.2100  (–10.6799)

Summary statistics     Positive predicted values     Negative predicted values
Participants                                  63                           105
Non-participants                              14                          1246
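The summary counts in Table 2 can be checked with a few lines of arithmetic. The sketch below takes the four counts from the table; treating a predicted value as ‘correctly signed’ when it is positive for a participant or negative for a non-participant is our reading of the text, and the variable names are hypothetical.

```python
# Counts of predicted-value signs, copied from Table 2.
counts = {
    ("participant", "positive"): 63,
    ("participant", "negative"): 105,
    ("non-participant", "positive"): 14,
    ("non-participant", "negative"): 1246,
}

total = sum(counts.values())                      # all observations
correct = (counts[("participant", "positive")]
           + counts[("non-participant", "negative")])
share_correct = correct / total                   # share with the expected sign

print(total, correct, round(share_correct, 4))
```

The four counts sum to the 1428 observations used in the regression, and 1309 of them carry the expected sign.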

These results suggest that the parsimonious formulation adopted here, with entry postulated to depend on animal assets (local and crossbred animals), knowledge assets (education and visits by extension agents) and location (distance to walk to the milk group), is a good approximation to the actual process affecting entry decisions. Hence, the simple probit model seems well suited to indicating the types of policies that could lead to participation among the non-participating households.

The results in general, but more especially those with respect to crossbred cows, extension services and local breed cows, raise interesting questions about the design of appropriate policies to effect participation, their relative potencies and the relative costs of implementing them, which we consider below.

4.6 Distance to market estimates

In considering participation policy, we confine attention to the number of crossbred milking cows in the household, the number of local breed milking cows and the number of visits by extension agents that the household experienced during the 12 months preceding the survey. The focus is restricted primarily due to space limitations, but these three quantities are, perhaps, the most interesting ones due to the fact that they may be readily changed in the short term. In reporting the results we rearrange the (1248) observations corresponding to the non-participating households so that the first observation in the set corresponds to the household that is ‘nearest’ to the market and the last observation is the one that is ‘farthest’ from the market, where ‘near’ and ‘far’ are defined with reference to the units of measurement of the covariate in question (Table 3).

Table 3. Distance to market estimates.

Regressor                                      Estimate (implied t-statistic)
Number of crossbred cows                        2.4758  (1.0852)
Number of local cows                            6.8261  (1.0852)
Extension agent visits during the past year     8.2819  (3.6544)

With the distance estimates reported in ascending order, the three graphs have the following conventions: households with positive requirements are distant from the market, households with zero requirements are located at the market perimeter and households with negative requirements are within the market boundary. Preliminary plots of the four measures of distance (the Gibbs-sample means, the Gibbs-sample medians, the posterior-means estimates, and the conditional-means estimates obtained by the Rao-Blackwell theorem) reveal that the estimates are virtually indistinguishable from one another. This observation suggests that the ‘relative-variance condition’ is met so that the posterior distributions are ‘almost normal’. Hence, either the mean or the median estimates should suffice as accurate estimates of the distance quantities.

Figure 3 reports estimates of crossbred cow requirements. With the Gibbs-sample medians as reference points, there are only three households that are (resource-sufficient) within the market boundary; each of the remaining households has a deficiency of crossbred cows. This observation is important because it identifies crossbred cow use as an (almost) homogeneously deficient factor across non-participants. Across the entire set of censored observations the median requirement for entry is an addition of 2.48 crossbred cows; the maximum additional requirement (the household farthest from the market) is 5.07; and the minimum requirement is –1.32, which is the household with the greatest ‘excess’.

Figure 3. Crossbred cow distance to market estimates.

Turning to local cow requirements (Figure 4), we focus attention again on the Gibbs-sample medians. Average household ownership of indigenous milking cows at the Ilu-Kura and Mirti sites amounts to 1.49 and 1.31 animals, respectively. The maximum median requirement is 13.95 animals and the minimum requirement is –3.56 animals; three of the households have an excess of local breed cows. The median requirement across the non-participating households is 6.82 animals.

Figure 4. Local breed cow distance to market estimates.

Results for the number of visits by extension agents are reported in Figure 5. The Gibbs-sample medians are much closer than in Figure 4, but we will use the median estimates as the reference points. The average number of visits at the Ilu-Kura site amounts to 1.82/household per year and at the Mirti site amounts to 0.36/household per year. From Figure 5, we can deduce that the household closest to the market has an excess of 4.41 visits, and the household farthest from the market requires an additional 16.99 visits before it would enter the market. Hence, the distribution of requirements across the households is more varied than the animal input requirements. The estimated median additional requirement in the censor set is 8.28 visits/household per year, which reflects a substantial increase over current levels. Whether this strategy represents a practical alternative remains to be seen. Further work is needed to establish the best form of extension services to provide and to determine whether their provision within groups of farmers, rather than individually, is useful. Only then can the precise costs involved in administering extension services be ascertained and their potential as a viable, market-precipitating policy be established.

Figure 5. Visits by agents distance to market estimates.

4.7 Conclusions

Collectively, the results support three conclusions. First, standard probit analysis of the participation data provides a useful and informative vehicle for deriving policy estimates. Second, useful quantities for policy analysis are derived simply and robustly as a by-product of the data augmentation step in Gibbs sampling. Third, the results suggest that on average 2.48 crossbred cows, 6.82 local breed cows and 8.28 visits by extension agents/household per year are the primary measures upon which extension agents and policy planners should focus attention.
