OpenShiftでアプリケーションプロセスを殺してみる

この投稿はOpen PaaS Advent Calendar 2015の13日目の記事です。

OpenShift / Kubernetesではプロセスやコンテナのモニタリングが行われているので、煮たり焼いたり殴ったり殺したりしてもすぐ復旧するらしいのでやってみましょう。

材料はhello-sinatraさんです。

$ oc new-app https://github.com/nekop/hello-sinatra

とりあえずpsすると、アプリケーションであるrubyのプロセスが見えます。DockerのENTRYPOINTやで起動されるものはpid 1で起動します。これはコンテナプライマリプロセスと呼ばれます。

$ oc exec hello-sinatra-1-wznjc -- ps -eaf
UID         PID   PPID  C STIME TTY          TIME CMD
1000030+      1      0  0 05:51 ?        00:00:00 ruby /opt/app-root/src/bundle/ruby/bin/rackup -E production -P /tmp/rack.pid --host 0.0.0.0 --port 8080
1000030+     75      0  0 06:17 ?        00:00:00 ps -eaf

まずはシンプルにkillしてみましょう。SIGTERMです。

$ oc exec hello-sinatra-1-wznjc -- kill 1
$ oc exec hello-sinatra-1-wznjc -- ps -eaf
error: error executing remote command: Error executing command in container: container not found ("hello-sinatra")
$ oc exec hello-sinatra-1-wznjc -- ps -eaf
UID         PID   PPID  C STIME TTY          TIME CMD
1000030+      1      0 38 06:18 ?        00:00:00 ruby /opt/app-root/src/bundle/ruby/bin/rackup -E production -P /tmp/rack.pid --host 0.0.0.0 --port 8080
1000030+     25      0  0 06:18 ?        00:00:00 ps -eaf
$ curl http://hello-sinatra-test.apps.192.168.100.100.xip.io/
hello

一瞬Pod内のコンテナが居なくなっていますが、すぐ再起動されているようです。Startup timeが変化しました。Podは変化していません。SIGKILLでもまぁ同じ感じになります。

サスペンドしてみましょう。Ctrl-zとかで送られるSIGTSTPです。

$ oc exec hello-sinatra-1-wznjc -- kill -TSTP 1
$ oc exec hello-sinatra-1-wznjc -- ps -eaf
UID         PID   PPID  C STIME TTY          TIME CMD
1000030+      1      0 38 06:18 ?        00:00:00 ruby /opt/app-root/src/bundle/ruby/bin/rackup -E production -P /tmp/rack.pid --host 0.0.0.0 --port 8080
1000030+     25      0  0 06:18 ?        00:00:00 ps -eaf
$ curl http://hello-sinatra-test.apps.192.168.100.100.xip.io/
hello

まったく効きませんね。

oc delete podでpodを潰すとPodのIDが新しくなりますが、killと特に違いはありません。

$ oc delete pod $(oc get pod | grep Running | head -n1 | awk '{print $1}')
pod "hello-sinatra-1-d0b4c" deleted
$ oc exec $(oc get pod | grep Running | head -n1 | awk '{print $1}') -- ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1000030+      1 19.5  1.2 117492 23764 ?        Ssl  06:38   0:00 ruby /opt/app-root/src/bundle/ruby/bin/rackup -E production -P /tmp/rack.pid --host 0.0.0.0 --port 8080
1000030+     21  0.0  0.0  19764  1168 ?        Rs   06:38   0:00 ps aux

アプリケーションがハングしたような状態はどうなるでしょうか。

実は、ハングのチェックはデフォルトで有効になっていないため、特にそのままでは自動的に検知して再起動したりはしません。Deployment Configのspec.containersの下のコンテナ定義にlivenessProbeを加えることでモニタリングを有効化できます。

    spec:
      containers:
(省略)
        livenessProbe:
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 15
          timeoutSeconds: 12

アプリケーションを変更して100秒スリープするというよく寝る子にして再度ビルドとデプロイを行います。

$ oc delete pod $(oc get pod | grep Running | head -n1 | awk '{print $1}')
$ vi app.rb
$ git commit -am 'sleep 100' && git push
$ oc start-build hello-sinatra

少し放置プレイしておいてoc get podを見ると9分で6回再起動されているので、上の設定と合わせて考えるとlivenessProbeの実行は60秒くらいのインターバルがありそうです。このあたりは後でもう少しつっこんで調べようと思います。

$ oc get pod
NAME                    READY     STATUS      RESTARTS   AGE
hello-sinatra-5-yugu3   1/1       Running     6          9m

OpenShiftのシステムログを見てみると、以下のようなログが出ていました。

[vagrant@ose3-master ~]$ sudo journalctl -n10000 | grep Live | grep -v refuse
Dec 15 01:58:48 ose3-master.example.com origin-node[21405]: I1215 01:58:48.913911   21405 prober.go:105] Liveness probe for "hello-sinatra-5-yugu3_test:hello-sinatra" failed (failure): Get http://10.1.0.26:8080/: read tcp 10.1.0.26:8080: use of closed network connection
Dec 15 02:00:18 ose3-master.example.com origin-node[21405]: I1215 02:00:18.913822   21405 prober.go:105] Liveness probe for "hello-sinatra-5-yugu3_test:hello-sinatra" failed (failure): Get http://10.1.0.26:8080/: read tcp 10.1.0.26:8080: use of closed network connection
Dec 15 02:01:48 ose3-master.example.com origin-node[21405]: I1215 02:01:48.913582   21405 prober.go:105] Liveness probe for "hello-sinatra-5-yugu3_test:hello-sinatra" failed (failure): Get http://10.1.0.26:8080/: read tcp 10.1.0.26:8080: use of closed network connection
Dec 15 02:03:18 ose3-master.example.com origin-node[21405]: I1215 02:03:18.914962   21405 prober.go:105] Liveness probe for "hello-sinatra-5-yugu3_test:hello-sinatra" failed (failure): Get http://10.1.0.26:8080/: read tcp 10.1.0.26:8080: use of closed network connection
Dec 15 02:04:48 ose3-master.example.com origin-node[21405]: I1215 02:04:48.913876   21405 prober.go:105] Liveness probe for "hello-sinatra-5-yugu3_test:hello-sinatra" failed (failure): Get http://10.1.0.26:8080/: read tcp 10.1.0.26:8080: use of closed network connection

今回の実験はここまで。

nekop's blog

http://twitter.com/nekop

OpenShiftでアプリケーションプロセスを殺してみる