Optimality conditions of the Graphical Lasso

Let $\mathbf K = \mathbf \Sigma^{-1}$ be the precision matrix of a multivariate Gaussian distribution and $\mathbf S$ the sample covariance matrix. The Graphical Lasso amounts to minimizing the following penalized negative log-likelihood: $$\operatorname{min}_{\mathbf K} \mathcal J(\mathbf K) = -\operatorname{logdet}(\mathbf K) + \operatorname{trace}(\mathbf S \mathbf K) + \lambda ||\mathbf K||_1,$$ where $\operatorname{logdet}$ is the logarithm of the determinant, $\lambda \geq 0$ is a hyperparameter controlling sparsity and $||\mathbf K||_1$ is the sum of the absolute values of the matrix entries.
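The objective above is straightforward to evaluate numerically. Here is a minimal sketch with NumPy (the function name is mine, not from any library):

```python
import numpy as np

def graphical_lasso_objective(K, S, lam):
    """Penalized negative log-likelihood J(K) of the Graphical Lasso.

    K   : candidate precision matrix (must be positive definite)
    S   : sample covariance matrix
    lam : sparsity hyperparameter (lambda >= 0)
    """
    sign, logdet = np.linalg.slogdet(K)  # numerically stable log-determinant
    assert sign > 0, "K must be positive definite"
    return -logdet + np.trace(S @ K) + lam * np.abs(K).sum()

# Example: with S = K = I (2x2) and lam = 0.1,
# J = -logdet(I) + trace(I) + 0.1 * ||I||_1 = 0 + 2 + 0.2 = 2.2
S = np.eye(2)
K = np.eye(2)
print(graphical_lasso_objective(K, S, 0.1))  # 2.2
```

`np.linalg.slogdet` is preferred over `np.log(np.linalg.det(K))` because the determinant itself can overflow or underflow in higher dimensions.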

The function $\mathcal J(\mathbf K)$ is convex but not differentiable at $K_{ij} = 0$, so at optimality we work with the subgradient instead of the classical gradient. Note that the gradient of $\operatorname{logdet}(\mathbf K)$ is $\mathbf K^{-1}$ (_Boyd and Vandenberghe, 2004_). The **Karush-Kuhn-Tucker (KKT) optimality conditions** can be written as: $$-\mathbf K^{-1} + \mathbf S + \lambda \operatorname{Sign}(\mathbf K) = 0,$$ where \begin{equation} \operatorname{Sign}( K_{ij}) = \begin{cases} -1, & \text{if $K_{ij} < 0$}\\
\in [-1,1], & \text{if $K_{ij} = 0$} \\
+1, & \text{if $K_{ij} > 0.$} \end{cases} \end{equation}
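These conditions can be checked numerically at a candidate solution. A sketch (again with illustrative names): writing $\mathbf G = \mathbf S - \mathbf K^{-1}$, optimality requires $G_{ij} = -\lambda \operatorname{sign}(K_{ij})$ wherever $K_{ij} \neq 0$, and $|G_{ij}| \leq \lambda$ wherever $K_{ij} = 0$.

```python
import numpy as np

def kkt_satisfied(K, S, lam, tol=1e-8):
    """Check the Graphical Lasso KKT conditions at a candidate K."""
    G = S - np.linalg.inv(K)
    nonzero = np.abs(K) > tol
    # Where K_ij != 0: G_ij must equal -lam * sign(K_ij)
    ok_nonzero = np.allclose(G[nonzero], -lam * np.sign(K[nonzero]), atol=tol)
    # Where K_ij == 0: the subgradient only requires |G_ij| <= lam
    ok_zero = np.all(np.abs(G[~nonzero]) <= lam + tol)
    return bool(ok_nonzero and ok_zero)

# Example: when S is diagonal, K = diag(1 / (S_ii + lam)) is optimal
S = np.diag([2.0, 3.0])
K = np.diag(1.0 / (np.diag(S) + 0.5))
print(kkt_satisfied(K, S, 0.5))  # True
```

Note that the unpenalized maximum-likelihood estimate $\mathbf K = \mathbf S^{-1}$ fails this check as soon as $\lambda > 0$, since then $\mathbf G = 0 \neq -\lambda \operatorname{sign}(\mathbf K)$ on the diagonal.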

The covariance matrix $\mathbf \Sigma$ being positive definite, the precision matrix $\mathbf K$ is also positive definite, so the diagonal entries of $\mathbf K$ are all positive (indeed $K_{ii} = \mathbf e_i^\top \mathbf K \mathbf e_i > 0$). Let $\hat{\mathbf{\Sigma}} = \mathbf K^{-1}$ denote the estimated covariance. Since $K_{ii} > 0$ forces $\operatorname{Sign}(K_{ii}) = +1$, the diagonal of the KKT conditions reads $$K_{ii}>0 \implies -\hat{\Sigma}_{ii} + S_{ii} + \lambda = 0,$$ and we get $$\hat{\Sigma}_{ii} = S_{ii} + \lambda.$$
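This identity can be verified exactly in a case where the solution is available in closed form: when $\mathbf S$ is diagonal, $\mathbf K = \operatorname{diag}(1/(S_{ii} + \lambda))$ satisfies all the KKT conditions (the off-diagonal entries of $\mathbf S - \hat{\mathbf\Sigma}$ are zero, hence within $[-\lambda, \lambda]$). A small check:

```python
import numpy as np

lam = 0.5
S = np.diag([2.0, 3.0, 1.5])            # diagonal sample covariance
K = np.diag(1.0 / (np.diag(S) + lam))   # Graphical Lasso solution when S is diagonal
Sigma_hat = np.linalg.inv(K)

# Diagonal relation derived from the KKT conditions: Sigma_hat_ii = S_ii + lambda
print(np.allclose(np.diag(Sigma_hat), np.diag(S) + lam))  # True
```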

This exact relationship between the diagonals of the estimated and empirical covariance matrices is used to initialize solvers of the above optimization problem, such as glasso (Friedman et al., 2007).
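A hedged sketch of such a warm start (glasso-style solvers initialize the working covariance this way; the function name is illustrative, not glasso's API):

```python
import numpy as np

def init_covariance(S, lam):
    """Warm start for the estimated covariance: copy S and set the
    diagonal to its known optimal value S_ii + lambda."""
    Sigma0 = S.copy()
    np.fill_diagonal(Sigma0, np.diag(S) + lam)
    return Sigma0

S = np.array([[2.0, 0.3],
              [0.3, 1.0]])
Sigma0 = init_covariance(S, 0.1)
# Diagonal becomes [2.1, 1.1]; off-diagonal entries are left unchanged.
```

Since the diagonal of the solution is known exactly, the solver only needs to iterate on the off-diagonal entries, which is precisely how the blockwise coordinate-descent scheme of glasso proceeds.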

Edmond Sanou
PhD Student in Applied Mathematics