using Random, Distributions, Clustering, LinearAlgebra
Clustering with Coefficients
In class we discussed how to classify individuals by type based on moments or averages. There are ready-made packages for clustering based on sample means, but we have to be careful to choose moments that satisfy the “injectivity” property that we need for consistent classification of types.
Consider the following model:
\[ y_{nt} = \beta_{0,k} + \beta_{1,k}x_{nt} + \epsilon_{nt},\qquad \epsilon_{nt}\sim\mathcal{N}(0,\sigma^2_\epsilon) \]
where \(k\) denotes the latent type of individual \(n\) and \(x_{nt}\) is an AR(1) process:
\[ x_{nt} = \rho x_{nt-1} + \xi_{nt}\qquad \xi_{nt}\sim\mathcal{N}(0,\sigma^2_\xi) \]
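Because \(|\rho|<1\), \(x_{nt}\) has a stationary distribution, and the simulation below draws each individual's initial value from it:
\[ x_{n0} \sim \mathcal{N}\left(0,\frac{\sigma^2_\xi}{1-\rho^2}\right) \]
This is what the σ_stat line in gen_data computes.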
An Issue with Clustering on Means
Notice that if we clustered based only on averages of \(y\), we could get a lot of misclassification because of persistent differences in \(x\).
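To see why, average the outcome equation over time:
\[ \bar{y}_{n} = \beta_{0,k} + \beta_{1,k}\bar{x}_{n} + \bar{\epsilon}_{n} \]
When \(\rho\) is close to one, \(\bar{x}_{n}\) can sit far from zero even after many periods, so two individuals of the same type can have very different average outcomes.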
function gen_data(T,N,pars)
    (;ρ,β,σξ,σϵ) = pars
    K = size(β,2)
    X = zeros(T,N)
    Y = zeros(T,N)
    πk = fill(1/K,K) # equal type probabilities
    type = rand(Categorical(πk),N)
    σ_stat = sqrt(σξ^2/(1-ρ^2)) # stationary standard deviation of the AR(1)
    for n in axes(X,2)
        k = type[n]
        x = rand(Normal(0,σ_stat)) # draw the initial x from the stationary distribution
        for t in axes(X,1)
            X[t,n] = x
            Y[t,n] = β[1,k] + β[2,k]*x + rand(Normal(0,σϵ))
            x = ρ*x + rand(Normal(0,σξ))
        end
    end
    return (;type,X,Y)
end
gen_data (generic function with 1 method)
p = (;β = [0 0.5 ; 0.3 0.1],ρ = 0.95,σξ = 0.5,σϵ = 0.2)
d = gen_data(10,1000,p)
(type = [1, 2, 2, 2, 1, 1, 1, 2, 2, 1 … 1, 2, 1, 1, 1, 2, 1, 1, 2, 2], X = [-1.4550229799005174 -0.3235719894861611 … -0.3252102963926345 0.9337550853018559; -1.4453101711005691 0.3197381857283288 … 0.6710521361585193 1.5354948469380725; … ; -0.29949820449345 1.6564662817604354 … -0.6372124345389572 0.9171847039433293; -0.7311611139759487 0.9109754242885167 … -0.8731164815615942 -0.10884432590462445], Y = [-0.551382423663523 0.0881439847705161 … 0.21107863609973787 0.6870144018270388; -0.7040626859383348 0.4464915516381839 … 0.727528604926102 0.4989540522378084; … ; -0.16862695133653327 0.6267404954989101 … 0.2333328051748107 0.4005897820801897; -0.3266392158716257 0.6077923037209964 … 0.7125507711086414 0.6464979375694259])
Let’s try clustering with k-means. Note that kmeans from Clustering.jl treats each column of d.Y, i.e. each individual’s time series, as one observation:
res = kmeans(d.Y,2)
KmeansResult{Matrix{Float64}, Float64, Int64}([-0.2678080833248467 0.4678242878166031; -0.2960095107964568 0.4742407792154177; … ; -0.3169829567953903 0.4577755549304992; -0.3096986820077558 0.47204389558301624], [1, 2, 2, 2, 2, 1, 1, 2, 2, 1 … 2, 2, 2, 1, 1, 2, 1, 2, 2, 2], [0.708035542811815, 0.4531583728266524, 0.7469764391671827, 0.5736154324762568, 0.7114268707590039, 0.164137140625503, 1.2610019632836815, 0.2606893217264279, 0.8490512038847164, 0.5408747521124615 … 1.2533746836273982, 0.4566586107970818, 1.1991676336482504, 1.8417661197646482, 1.0070316955214365, 0.17340415650391972, 4.31971298477642, 1.9929271300368452, 0.4382113308574356, 0.4667067001956937], [290, 710], [290, 710], 1060.5947343696905, 6, true)
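The piece of this output that we need is the vector of cluster labels, which Clustering.jl stores in res.assignments; counts(res) reports the cluster sizes.

res.assignments # cluster label for each individual (each column of d.Y)
counts(res)     # number of individuals assigned to each cluster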
We get a correct classification rate of:
function correct_rate(assignments,type) #<- to evaluate the classification
    K = maximum(type)
    # fraction of each true type that falls in that type's modal cluster
    return [mean(assignments[type.==k].==mode(assignments[type.==k])) for k in 1:K]
end
correct_rate(res.assignments,d.type)
2-element Vector{Float64}:
0.6029106029106029
1.0
So notice that we are misclassifying many individuals of the first type. A simple histogram shows us why:
using Plots
histogram(mean(d.Y[:,d.type.==1],dims=1)[:])
histogram!(mean(d.Y[:,d.type.==2],dims=1)[:])
Does this go away when we increase the time dimension of the panel?
d = gen_data(20,1000,p)
res = kmeans(d.Y,2)
correct_rate(res.assignments,d.type)
2-element Vector{Float64}:
0.5826923076923077
0.99375
histogram(mean(d.Y[:,d.type.==1],dims=1)[:])
histogram!(mean(d.Y[:,d.type.==2],dims=1)[:])
A bit, yes, but things are still not looking great.
Note: the types are only identified up to a permutation of labels, so correct_rate is an imperfect measure of misclassification; a label-invariant statistic would be better.
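A minimal sketch of such a statistic, assuming the Combinatorics package is available (the permuted_accuracy name is ours): relabel the clusters every possible way and report the accuracy under the best match.

using Combinatorics # assumed extra dependency, for permutations

# Label-invariant accuracy: try every relabeling of the clusters and
# keep the one that agrees most often with the true types.
function permuted_accuracy(assignments,type)
    K = maximum(type)
    return maximum(mean(perm[assignments].==type) for perm in permutations(1:K))
end

Calling permuted_accuracy(res.assignments,d.type) would then give a single number that does not depend on how k-means happens to label the clusters.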
An Alternative
An alternative would be to use the structure of the model and cluster based on the least squares criterion itself. We could derive a classification \(\hat{\mathcal{K}}\) and coefficients \(\hat{\beta}\) to solve:
\[ \hat{\mathcal{K}},\hat{\beta} = \arg\min\sum_{n}\sum_{t}(Y_{nt} - \beta_{0,\mathcal{K}(n)} - \beta_{1,\mathcal{K}(n)}x_{nt})^2 \]
which we could solve by iterating in the same way as K-means. The \(m\)th iteration would be:
- Estimate \(\hat{\beta}^{m+1}\) by linear regression using \(\hat{\mathcal{K}}^{m}\).
- Re-assign types as \(k^{m+1}(n) = \arg\min_{k}\sum_t(Y_{nt} - \hat{\beta}^{m+1}_{0,k} - \hat{\beta}^{m+1}_{1,k}x_{nt})^2\)
Here’s code to do that:
function assign!(type,data,β)
    K = size(β,2)
    lsq = zeros(K)
    for n in axes(data.X,2)
        x = view(data.X,:,n)
        y = view(data.Y,:,n)
        for k in axes(β,2)
            # sum of squared residuals for individual n under type k's coefficients
            @views lsq[k] = sum((y .- β[1,k] .- β[2,k]*x).^2)
        end
        type[n] = argmin(lsq)
    end
end
function estimate!(type,data,β)
    K = size(β,2)
    T = size(data.X,1)
    for k in axes(β,2)
        Ik = type.==k # individuals currently assigned to type k
        N = sum(Ik)
        @views Y = data.Y[:,Ik][:]
        @views X = [ones(N*T) data.X[:,Ik][:]]
        β[:,k] = inv(X'*X)*X'*Y # OLS on the pooled observations for type k
    end
end
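One caveat on the regression step: inv(X'*X)*X'*Y is fine at this scale, but Julia's left-division solver is the more numerically stable idiom, so a drop-in replacement inside the loop would be:

β[:,k] = X\Y # least-squares solution via (pivoted) QR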
function classify_least_squares!(type,β,data;maxiter = 100,e_tol = 1e-10)
    err = Inf
    iter = 1
    while err>e_tol && iter<maxiter
        β_old = copy(β)
        estimate!(type,data,β)
        assign!(type,data,β)
        err = norm(β .- β_old,Inf) # sup-norm change in the coefficients
        iter += 1
    end
    if iter==maxiter
        println("warning: maximum iterations reached, no convergence yet")
    end
end
classify_least_squares! (generic function with 1 method)
β = zeros(2,2)
type = copy(res.assignments) # initialize from the k-means classification
# estimate!(type,d,β)
# assign!(type,d,β)
classify_least_squares!(type,β,d)
correct_rate(type,d.type)
2-element Vector{Float64}:
1.0
0.9875
Notice how much better we do now!
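One thing worth checking is sensitivity to the starting point, since we initialized with the k-means assignment. A quick sketch of a run from a purely random initial classification (the seed is arbitrary; note that estimate! would fail if a random start happened to leave one type empty):

Random.seed!(1234) # arbitrary seed, for reproducibility only
type0 = rand(1:2,size(d.Y,2)) # random initial classification
β0 = zeros(2,2)
classify_least_squares!(type0,β0,d)
correct_rate(type0,d.type)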