It is easy to see that stochastic gradient descent is, in a sense, an “online learning” algorithm: it can be computed in a minibatch manner that can be parallelised. If we take this idea to the extreme, we arrive at an algorithm called “Hogwild!”. The gist of the algorithm is:
Loop
    For some observation, with current solution "x":
        grad_v <- calculate gradient for only the "v"th coordinate
        update "x_v" <- "x_v" - learning_rate * grad_v
    end
end
For a sufficiently large dataset with many coordinates, the chance of a collision is small, and the worst-case scenario, where two workers update the same coordinate at once, has little effect on the result.
In a language like Julia, this approach is easily implemented using @sync @parallel. In my own tests, I only saw a speed improvement on extremely large vectors, though the parallel implementation also allocated far less memory.
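To illustrate the pattern before the full example, here is a minimal sketch (assuming the pre-1.0 Julia parallel API used throughout this post): @sync @parallel splits the loop iterations across the worker processes, each of which writes directly into a SharedArray.

# assumes worker processes have already been added with addprocs, as done below
squares = SharedArray(Float64, 10)   # memory visible to every worker process
@sync @parallel for i = 1:10
    squares[i] = i^2                 # iterations are split across the workers
end
squares                              # 1.0, 4.0, 9.0, ..., 100.0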
In this example, we simply wish to find the minimum point of a function f(x), where x is some extremely large vector (a bit of hand waving on the notation, but you get the idea).
# add worker processes so the @parallel loop has somewhere to run
if length(workers()) < 3
    addprocs(2)
end
workers()

@everywhere using Distributions   # load packages on every worker process

# shared memory that all workers can read and write
input = SharedArray(Float64, (100000))
input[:] = 10;
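The gradient function single_update is called below, but its definition does not appear in this snippet. As a placeholder, assuming the objective is a simple quadratic such as f(x) = sum(x.^2) (my assumption, not necessarily the original), a definition consistent with the call dfunc(SA)[pos_rand] could be:

# hypothetical gradient of f(x) = sum(x.^2); returns the full gradient vector,
# of which the caller only uses a single coordinate
@everywhere single_update(x) = 2 .* x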
function hogwild_sgd(SA, dfunc, alpha, steps)
    size = length(SA)
    @sync @parallel for i = 1:steps
        # pick a random coordinate and update it in place; workers may
        # occasionally collide on the same coordinate, which Hogwild! tolerates
        pos_rand = rand(1:size, 1)[1]
        SA[pos_rand] += -alpha * dfunc(SA)[pos_rand]
    end
    SA
end
@time hogwild_sgd(input, single_update, 0.05, 1000000)
# 204.839810 seconds (353.94 k allocations: 15.714 MB)
# comparison with the single-process version.
function hogwild_sgd_single(SA, dfunc, alpha, steps)
    size = length(SA)
    for i = 1:steps
        pos_rand = rand(1:size, 1)[1]
        SA[pos_rand] += -alpha * dfunc(SA)[pos_rand]
    end
    SA
end
input_single = ones(input) * 10; # this is now an ordinary Array, not a SharedArray
@time hogwild_sgd_single(input_single, single_update, 0.05, 1000000)
# 278.976475 seconds (3.00 M allocations: 745.222 GB, 19.29% gc time)
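As a rough sanity check (still assuming the quadratic objective sketched above for single_update), the entries of both vectors should have shrunk noticeably from their starting value of 10 after the runs:

mean(input)          # parallel Hogwild! result
mean(input_single)   # single-process result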