Syncing Cloud Storage to a LARGE GitHub Repo Using GitHub Actions

June 22, 2022

At our company, we used Google Cloud Storage to serve static websites without any version control. Recently I was asked to set up version control for the web content in Cloud Storage, so I decided to set up automatic syncing between a GitHub repo and the Cloud Storage bucket.

I've had good experiences with GitHub Actions, so my first thought was to write a workflow that automatically copies files to Cloud Storage. I could simply run rsync to sync my repo to my Cloud Storage bucket.

However, there's an issue with this solution: each workflow run on a GitHub-hosted runner starts from a clean environment and clones the entire repo, which is inefficient if you have a multi-gigabyte repo. A GitHub repo can reportedly hold up to 100 GB of files, so a full clone on every run does not scale, and I needed a more efficient way to do this.

After searching around for a while, I found a solution: using a self-hosted runner.

How is a self-hosted runner different?

When you use a GitHub-hosted runner, a clean environment is created for each run, so no state can be kept between workflow runs.

In contrast, a self-hosted runner does not create a clean environment every time. As long as you register a single runner machine, every workflow run executes on that machine, so the files pulled by earlier runs are still there. On each workflow run, we only need to pull the newly committed changes!
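
The idea can be sketched as a small shell function. This is only an illustration: update_checkout is a hypothetical helper name, and the remote and directory arguments are whatever you choose for your own setup.

```shell
# update_checkout REMOTE DIR (hypothetical helper):
# clone on the first run, pull only the new commits on later runs.
update_checkout() {
  if [ -d "$2/.git" ]; then
    # The checkout survived from an earlier run, so only new commits are fetched.
    git -C "$2" pull --quiet --ff-only && echo "pulled"
  else
    # First run on this machine: pay for the full clone once.
    git clone --quiet "$1" "$2" && echo "cloned"
  fi
}
```

On a GitHub-hosted runner the else branch would run every time; on a persistent machine it should only run once.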

Prepare a Virtual Machine

To host a runner, you'll need a virtual machine. I created a Compute Engine instance running Ubuntu 20.04 LTS on GCP. For some reason, I could not install the runner service on Ubuntu 18.04 LTS, although it is supposed to work, so keep that in mind. Once your runner machine is ready, add the runner to your GitHub repo or organization by following the official docs.

If run.sh works, remember to install the runner as a service so it runs in the background.
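
Installing the service might look like the sketch below. It assumes the runner was unpacked into ~/actions-runner (the directory name GitHub's setup instructions suggest; RUNNER_DIR is just a variable I made up so you can point it elsewhere) and uses the svc.sh script that ships with the runner.

```shell
# Assumption: the runner lives in ~/actions-runner; override RUNNER_DIR if not.
RUNNER_DIR="${RUNNER_DIR:-$HOME/actions-runner}"

if [ -x "$RUNNER_DIR/svc.sh" ]; then
  cd "$RUNNER_DIR"
  sudo ./svc.sh install   # register the runner as a service
  sudo ./svc.sh start     # start it in the background
  sudo ./svc.sh status    # confirm that it is running
  runner_service="installed"
else
  echo "svc.sh not found in $RUNNER_DIR; configure the runner first" >&2
  runner_service="skipped"
fi
```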

Prepare the Repo

Since we want to keep the repo between workflow runs, we need to manually clone it onto the VM first. (Put the repo wherever you want; just remember to update the workflow accordingly.)

cd ~
git clone <github-repo-url>

Write the workflow

The workflow is as follows. Remember to update the environment variables and working-directory. The workflow also uses these action secrets:

  • GCP_PROJECT_ID_PROD - your GCP project ID
  • GCP_SA_KEY_PROD - your service account key. Make sure it has permission to update Cloud Storage buckets
  • PERSONAL_GITHUB_TOKEN - your GitHub access token

So remember to set those up.

Notice the runs-on: self-hosted line, which tells GitHub to run the job on the self-hosted runner.

name: Sync repo with cloud storage

# Controls when the action will run.
on:
  # Triggers the workflow on push events, but only for the main branch
  push:
    branches: [main]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

env:
  GITHUB_USERNAME: imjamesku # change to your github username
  BUCKET_NAME: event-web-pages # change to your bucket name
  REPO_NAME: event-web-pages # change to your repo name

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "upload_to_gcs"
  upload_to_gcs:
    # The type of runner that the job will run on
    runs-on: self-hosted
    defaults:
      run:
        working-directory: /home/ku.james/event-web-pages # path to the repo you just cloned

    steps:
      - name: test
        if: ${{ success() }}
        run: |
          pwd
          ls

      - name: Google Cloud setup
        if: ${{ success() }}
        # google cloud deploy tool
        uses: google-github-actions/setup-gcloud@v0
        with:
          project_id: ${{ secrets.GCP_PROJECT_ID_PROD }}
          service_account_key: ${{ secrets.GCP_SA_KEY_PROD }}
          export_default_credentials: true


      - name: git pull
        run: |
          git checkout main
          # The owner (TW-Kadokawa) is hard-coded here; change it to your repo's owner
          git pull https://${{ env.GITHUB_USERNAME }}:${{ secrets.PERSONAL_GITHUB_TOKEN }}@github.com/TW-Kadokawa/${{ env.REPO_NAME }}.git

      - name: Upload static
        if: ${{ success() }}
        run: |
          # -m runs in parallel; rsync -r recurses, -d deletes bucket objects that are missing from the source
          gsutil -m -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24 rsync -rd ./${{ env.BUCKET_NAME }} gs://${{ env.BUCKET_NAME }}

This workflow does just two things: it pulls the repo, then runs rsync to sync the files to Cloud Storage. If you want to avoid merge conflicts on the runner, you can instead use git fetch followed by git reset --hard origin/main (the --hard flag is what updates the working tree, and it discards any local changes). You have all the git commands at your disposal, so write the workflow as you need.
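
A minimal sketch of the fetch-and-reset variant, assuming the remote is named origin: sync_to_remote is a hypothetical helper name, and the directory and branch are whatever your setup uses. Note that it throws away anything committed or edited locally on the runner.

```shell
# sync_to_remote DIR BRANCH (hypothetical helper):
# make the checkout in DIR match origin/BRANCH exactly.
sync_to_remote() {
  git -C "$1" fetch --quiet origin "$2" &&
    # --hard also rewrites the working tree; local commits and edits are lost.
    git -C "$1" reset --quiet --hard "origin/$2"
}
```

In the workflow, the two git commands could replace the git pull line, since working-directory already points at the checkout.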

After you commit the workflow to your repo, the workflow should start running and the files you committed should be uploaded to Cloud Storage automatically!
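
Because the rsync step passes -d, anything in the bucket that is not in the source directory gets deleted, so it may be worth previewing the first sync by hand on the VM. The sketch below uses gsutil's -n (dry run) flag; the bucket name and source directory are the example values from the workflow above, so replace them with your own.

```shell
# Example values from the workflow above; replace with your own.
BUCKET_NAME="event-web-pages"
SOURCE_DIR="./$BUCKET_NAME"

if command -v gsutil >/dev/null 2>&1; then
  # -n only lists what would be copied or deleted; the bucket is not changed.
  gsutil -m rsync -n -r -d "$SOURCE_DIR" "gs://$BUCKET_NAME"
else
  echo "gsutil not found; install the Google Cloud SDK first" >&2
fi
```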

