Clustering with Coefficients

using Random, Distributions, Clustering, LinearAlgebra, Statistics, StatsBase

In class we discussed how to classify individuals by type based on moments or averages. There are ready-made packages for clustering based on sample means, but we have to be careful to choose moments that satisfy the “injectivity” property that we need for consistent classification of types.
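
Concretely, if \(m(\cdot)\) denotes the vector of moments we cluster on, injectivity requires that distinct types imply distinct population moments:

\[ k \neq k' \implies \mathbb{E}[m(Y_n,X_n)\mid k] \neq \mathbb{E}[m(Y_n,X_n)\mid k'] \]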

Consider the following model:

\[ y_{nt} = \beta_{0,k} + \beta_{1,k}x_{nt} + \epsilon_{nt} \]

where \(x_{nt}\) is an AR(1) process:

\[ x_{nt} = \rho x_{nt-1} + \xi_{nt}\qquad \xi_{nt}\sim\mathcal{N}(0,\sigma^2_\xi) \]
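
The simulation below initializes each individual's \(x\) from the stationary distribution of this process; for reference, stationarity of the variance gives

\[ \sigma^2_{stat} = \rho^2\sigma^2_{stat} + \sigma^2_\xi \implies \sigma^2_{stat} = \frac{\sigma^2_\xi}{1-\rho^2}, \]

which is the σ_stat term in gen_data below.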

An Issue with Clustering on Means

Notice that if we clustered based only on averages of \(y\), we could get a lot of misclassification because of persistent differences in \(x\).
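
To see why, average the model over \(t\) for individual \(n\):

\[ \bar{y}_n = \beta_{0,k} + \beta_{1,k}\bar{x}_n + \bar{\epsilon}_n \]

When \(\rho\) is close to one, \(\bar{x}_n\) remains quite variable across individuals even for moderate \(T\), so the within-type distributions of \(\bar{y}_n\) overlap substantially and means alone cannot separate the types cleanly.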

function gen_data(T,N,pars)
    (;ρ,β,σξ,σϵ) = pars
    K = size(β,2)
    X = zeros(T,N)
    Y = zeros(T,N)
    πk = fill(1/K,K)                # types drawn with equal probability
    type = rand(Categorical(πk),N)
    σ_stat = sqrt(σξ^2/(1-ρ^2))     # stationary standard deviation of x
    for n in axes(X,2)
        k = type[n]
        x = rand(Normal(0,σ_stat))  # initialize from the stationary distribution
        for t in axes(X,1)
            X[t,n] = x
            Y[t,n] = β[1,k] + β[2,k]*x + rand(Normal(0,σϵ))
            x = ρ*x + rand(Normal(0,σξ))  # AR(1) update
        end
    end
    return (;type,X,Y)
end
gen_data (generic function with 1 method)
p = (;β = [0 0.5 ; 0.3 0.1],ρ = 0.95,σξ = 0.5,σϵ = 0.2)
d = gen_data(10,1000,p)
(type = [1, 2, 2, 2, 1, 1, 1, 2, 2, 1  …  1, 2, 1, 1, 1, 2, 1, 1, 2, 2], X = [-1.4550229799005174 -0.3235719894861611 … -0.3252102963926345 0.9337550853018559; -1.4453101711005691 0.3197381857283288 … 0.6710521361585193 1.5354948469380725; … ; -0.29949820449345 1.6564662817604354 … -0.6372124345389572 0.9171847039433293; -0.7311611139759487 0.9109754242885167 … -0.8731164815615942 -0.10884432590462445], Y = [-0.551382423663523 0.0881439847705161 … 0.21107863609973787 0.6870144018270388; -0.7040626859383348 0.4464915516381839 … 0.727528604926102 0.4989540522378084; … ; -0.16862695133653327 0.6267404954989101 … 0.2333328051748107 0.4005897820801897; -0.3266392158716257 0.6077923037209964 … 0.7125507711086414 0.6464979375694259])

Let’s try clustering with k-means. Note that kmeans from Clustering.jl treats each column as one observation, so here each individual’s \(T\)-vector of outcomes is a single point to be clustered:

res = kmeans(d.Y,2)
KmeansResult{Matrix{Float64}, Float64, Int64}([-0.2678080833248467 0.4678242878166031; -0.2960095107964568 0.4742407792154177; … ; -0.3169829567953903 0.4577755549304992; -0.3096986820077558 0.47204389558301624], [1, 2, 2, 2, 2, 1, 1, 2, 2, 1  …  2, 2, 2, 1, 1, 2, 1, 2, 2, 2], [0.708035542811815, 0.4531583728266524, 0.7469764391671827, 0.5736154324762568, 0.7114268707590039, 0.164137140625503, 1.2610019632836815, 0.2606893217264279, 0.8490512038847164, 0.5408747521124615  …  1.2533746836273982, 0.4566586107970818, 1.1991676336482504, 1.8417661197646482, 1.0070316955214365, 0.17340415650391972, 4.31971298477642, 1.9929271300368452, 0.4382113308574356, 0.4667067001956937], [290, 710], [290, 710], 1060.5947343696905, 6, true)

We get the following correct classification rate for each type:

function correct_rate(assignments,type)
    K = maximum(type)
    # for each true type, the share assigned to that type's modal cluster
    return [mean(assignments[type.==k].==mode(assignments[type.==k])) for k in 1:K]
end
correct_rate(res.assignments,d.type) 
2-element Vector{Float64}:
 0.6029106029106029
 1.0

Notice that we misclassify many individuals of the first type. A simple histogram of individual means shows us why:

using Plots
histogram(mean(d.Y[:,d.type.==1],dims=1)[:],label="type 1")
histogram!(mean(d.Y[:,d.type.==2],dims=1)[:],label="type 2")

Does this go away when we increase the time dimension \(T\) of the panel?

d = gen_data(20,1000,p)
res = kmeans(d.Y,2)
correct_rate(res.assignments,d.type) 
2-element Vector{Float64}:
 0.5826923076923077
 0.99375
histogram(mean(d.Y[:,d.type.==1],dims=1)[:],label="type 1")
histogram!(mean(d.Y[:,d.type.==2],dims=1)[:],label="type 2")

A bit, yes, but things are still not looking great.

One caveat: cluster labels are only identified up to permutation, so a statistic like correct_rate, which compares each true type to its modal assigned label, can overstate performance, for example when two types share the same modal cluster. A label-permutation-invariant misclassification statistic is more reliable.
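
Here is a minimal sketch of such a statistic (misclass_rate is an illustrative helper, not part of the code above, and it assumes the Combinatorics package is available): it relabels the clusters under every permutation and reports the lowest misclassification rate.

using Combinatorics

# minimum misclassification rate over all relabelings of the K clusters;
# assumes assignments and type both take values in 1:K
function misclass_rate(assignments,type)
    K = maximum(type)
    best = Inf
    for perm in permutations(1:K)
        relabeled = [perm[a] for a in assignments]
        best = min(best,mean(relabeled .!= type))
    end
    return best
end

For example, misclass_rate(res.assignments,d.type) would give a single error rate that does not depend on how the clusters happen to be numbered.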

An Alternative

An alternative would be to use the structure of the model and cluster based on the least squares criterion itself. We would choose a classification \(\hat{\mathcal{K}}\) (an assignment \(k(n)\) of each individual to one of the \(K\) types) and coefficients \(\hat{\beta}\) to solve:

\[ \hat{\mathcal{K}},\hat{\beta} = \arg\min_{k(\cdot),\beta}\sum_{n}\sum_{t}(Y_{nt} - \beta_{0,k(n)} - \beta_{1,k(n)}x_{nt})^2 \]

which we could solve by iterating in the same way as k-means. The \(m\)th iteration would be:

  1. Estimate \(\hat{\beta}^{m+1}\) by linear regression using \(\hat{\mathcal{K}}^{m}\).
  2. Re-assign types as \(k^{m+1}(n) = \arg\min_k\sum_t(Y_{nt} - \hat{\beta}^{m+1}_{0,k} - \hat{\beta}^{m+1}_{1,k}x_{nt})^2\).

Here’s code to do that:

function assign!(type,data,β)
    K = size(β,2)
    lsq = zeros(K)
    for n in axes(data.X,2)
        x = view(data.X,:,n)
        y = view(data.Y,:,n)
        # least squares criterion for individual n under each type's coefficients
        for k in axes(β,2)
            @views lsq[k] = sum((y .- β[1,k] .- β[2,k]*x).^2)
        end
        type[n] = argmin(lsq)  # assign n to the best-fitting type
    end
end
function estimate!(type,data,β)
    T = size(data.X,1)
    for k in axes(β,2)
        Ik = type.==k          # individuals currently assigned to type k
        N = sum(Ik)
        @views Y = data.Y[:,Ik][:]
        @views X = [ones(N*T) data.X[:,Ik][:]]
        β[:,k] = X\Y           # OLS for type k; X\Y is more stable than inv(X'X)X'Y
    end
end

function classify_least_squares!(type,β,data;maxiter = 100,e_tol = 1e-10)
    err = Inf
    iter = 1
    while err>e_tol && iter<maxiter
        β_old = copy(β)
        estimate!(type,data,β)  # update coefficients given assignments
        assign!(type,data,β)    # update assignments given coefficients
        err = norm(β .- β_old,Inf)
        iter += 1
    end
    if iter==maxiter
        println("warning: maximum iterations reached, no convergence yet")
    end
end
classify_least_squares! (generic function with 1 method)
β = zeros(2,2)
type = copy(res.assignments)  # initialize assignments from k-means
classify_least_squares!(type,β,d)
correct_rate(type,d.type)
2-element Vector{Float64}:
 1.0
 0.9875

Notice how much better we do now!
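
One practical caveat: like k-means, this iteration only finds a local optimum of the least squares criterion, so the answer can depend on the initial assignment. Here is a minimal multistart sketch (total_lsq and classify_multistart are illustrative helpers, not part of the code above; it also assumes every type keeps at least one member during the iteration, which a robust version would need to check):

# total least squares objective under a given classification and coefficients
function total_lsq(type,data,β)
    s = 0.0
    for n in axes(data.X,2)
        k = type[n]
        @views s += sum((data.Y[:,n] .- β[1,k] .- β[2,k]*data.X[:,n]).^2)
    end
    return s
end

# run the iteration from several random initial assignments, keep the best run
function classify_multistart(data,K;starts = 10)
    N = size(data.X,2)
    best = (;type = zeros(Int,N), β = zeros(2,K), obj = Inf)
    for _ in 1:starts
        type = rand(1:K,N)  # random initial assignment
        β = zeros(2,K)
        classify_least_squares!(type,β,data)
        obj = total_lsq(type,data,β)
        if obj < best.obj
            best = (;type,β,obj)
        end
    end
    return best
end

One could then call classify_multistart(d,2) and compare its assignments to the single k-means-initialized run above.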