Setting up NF Core with MicroK8s

Introduction

This tutorial introduces a workflow system designed for scientific data processing: Nextflow, together with the NF Core collection of community pipelines. It helps scientists streamline their computational work through code-defined pipelines that cover everything from data extraction to visualization. Following our previous blog posts, we'll set everything up on a MicroK8s Kubernetes cluster to ensure workload scalability.

We recommend reading our previous blog posts to familiarize yourself with the terminology and techniques used for provisioning and architecting the system. These posts also cover setting up a Docker-based DevOps environment to keep your local system clean.

Nextflow Basics

Nextflow is a domain-specific language (DSL) and workflow manager designed to streamline data analysis pipelines in high-performance computing (HPC), cloud, and containerized environments. Built on Groovy and the Java virtual machine, it enables scalable, reproducible, and portable execution of computational workflows.

Key Features of Nextflow:

  • Pipeline Modularity – Workflows are defined using a simple script format (.nf), making them easy to develop and extend.
  • Parallel Execution – Nextflow automatically schedules and parallelizes tasks, efficiently utilizing available compute resources.
  • Container Support – Native support for Docker, Singularity, and Conda, ensuring environment consistency (see the example after this list).
  • Resumability & Fault Tolerance – If a pipeline fails, Nextflow can resume from the last completed step, reducing re-computation.
  • Cloud & HPC Integration – Seamlessly runs on local machines, SLURM, SGE, Kubernetes, AWS Batch, and Google Cloud Life Sciences.
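
For example, container support is wired directly into the command line. A minimal sketch, assuming Docker is installed locally and using a generic public image (replace <pipeline> with your own script):

nextflow run <pipeline> -with-docker ubuntu:22.04   # run every task inside a container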

Basic Nextflow Syntax:

A simple Nextflow script (example.nf) looks like this:

process HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, $name!"
    """
}

workflow {
    names = Channel.of('Alice', 'Bob', 'Charlie')  // emit each name as a separate task input
    HELLO(names).view()                            // print each task's stdout
}
You can run the script with:
nextflow run example.nf
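
To see the resumability feature in action, run the pipeline a second time with the -resume flag: tasks whose cached results (stored under the ./work directory) are still valid will be skipped.

nextflow run example.nf -resume   # reuse cached task results

nextflow log                      # list the execution history of previous runs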

Why Nextflow for Scientific Workflows?

Nextflow is widely used in bioinformatics, genomics, and large-scale data analysis because it allows researchers to define workflows in code, scale workloads efficiently, and maintain reproducibility. Combined with NF Core, Nextflow ensures robust, versioned, and best-practice pipelines—making it an excellent choice for scientific computing on Kubernetes.

What is NF Core and Why Do We Care?

NF Core is a community-driven framework for building and sharing best-practice computational pipelines using Nextflow, a workflow manager optimized for scalable and reproducible data analysis. NF Core provides a standardized set of pipelines that are rigorously tested and maintained, ensuring robust execution across different computational environments, including local workstations, high-performance computing (HPC) clusters, and cloud-based infrastructure.

Why Use NF Core?

  • Reproducibility – NF Core pipelines enforce strict version control, ensuring results remain consistent across different computational environments.
  • Scalability – Built with Nextflow, NF Core pipelines support seamless execution across multi-threaded, cluster-based, and cloud-based environments.
  • Portability – The pipelines are containerized using Docker, Singularity, or Conda, reducing dependency conflicts.
  • Community-Driven Best Practices – Each pipeline adheres to strict guidelines, is thoroughly tested, and receives updates from the global research community.
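
You can also browse pipelines from the command line using the community's companion nf-core tool. A quick sketch, assuming Python and pip are available:

pip install nf-core
nf-core list   # 'nf-core pipelines list' in newer releases of the tool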

NF Core vs. Galaxy with CVMFS Drivers

Both NF Core and Galaxy are designed to facilitate scientific workflows, but they serve different use cases:

Feature              | NF Core (Nextflow)                          | Galaxy (CVMFS)
---------------------|---------------------------------------------|----------------------------------------------------------
User Experience      | Command-line, highly scriptable             | Web-based GUI for ease of access
Pipeline Development | Requires coding (Nextflow DSL)              | Prebuilt tools with workflow assembly
Scalability          | Highly scalable for cloud and HPC           | Scales well but requires CVMFS for distributed workflows
Portability          | Works with containers (Docker/Singularity)  | CVMFS provides shared software repositories
Community Support    | Strong, pipeline-focused community          | Broad bioinformatics user base

DevOps Environment

As in previous tutorials, we’ll provision a DevOps Docker image that includes Terraform and AWS CLI for EC2 provisioning. These tools will suffice for this tutorial. You can follow our previous blog post’s guide on setting up the environment. Here’s what the provisioning script should look like.

#!/bin/bash
# provision.sh

AWSCLI_VERSION=2.13.25
TERRAFORM_VERSION=1.6.1

function install_base_packages(){
    sudo apt-get update
    sudo apt-get install -y unzip wget curl openssh-client rsync
}

function install_terraform() {
    cd /tmp
    wget https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip
    unzip terraform_${TERRAFORM_VERSION}_linux_amd64.zip
    sudo mv terraform /usr/local/bin/
    rm terraform_${TERRAFORM_VERSION}_linux_amd64.zip
}

function install_aws_cli() {
    cd /tmp
    curl -fsSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-$AWSCLI_VERSION.zip" -o "awscli.zip"
    unzip awscli.zip > /dev/null
    sudo ./aws/install
    cd /tmp
    rm -rf aws awscli.zip
}

function main() {
    install_base_packages
    install_terraform
    install_aws_cli
}

main
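
As a quick recap of how this script is used: it provisions the DevOps Docker image from the earlier posts. The image tag and mount path below are placeholders, so adjust them to whatever your own Dockerfile setup uses:

docker build -t devops-env .   # Dockerfile from the earlier posts runs provision.sh
docker run -it --rm -v "$PWD":/workspace -w /workspace devops-env bash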

Nextflow Configuration

We need to create a Persistent Volume (PV) and a Persistent Volume Claim (PVC) for the Nextflow pods to use. You may need to adjust the storage request size depending on the pipelines you run and the size of their data. Here's what we'll use in this tutorial:

Persistent Volume

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextflow-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nextflow-storage
  hostPath:
    path: "/home/ubuntu/work"

Persistent Volume Claim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextflow-pvc
  namespace: nextflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: nextflow-storage  
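
Once both manifests are applied (the setup script below takes care of this), you can confirm that the claim binds to the volume:

microk8s kubectl get pv                # nextflow-pv should show STATUS Bound
microk8s kubectl get pvc -n nextflow   # nextflow-pvc should be Bound to nextflow-pv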

You will need to define a nextflow.config file that configures how Nextflow runs pipelines on your instance. Here is the configuration we'll use:

process {
    executor = 'k8s'
}

k8s {
    storageClaimName = 'nextflow-pvc'
    storageMountPath = '/home/ubuntu/work'
    namespace = 'nextflow'
}

The configuration specifies Kubernetes as the executor rather than the default local executor. It also defines the PVC to use, container mount paths, and target namespace.
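
Note that Nextflow automatically loads nextflow.config from the directory you launch it in, which is why the commands later in this tutorial must be run alongside this file. Alternatively, you can point at a config file explicitly:

nextflow -c /home/ubuntu/nextflow.config run <pipeline>   # add an explicit config file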

Inside your setup.sh script for EC2 provisioning, you'll find the following method to set up Nextflow:

function install_nextflow(){
    sudo DEBIAN_FRONTEND=noninteractive apt install openjdk-11-jdk -y
    curl -s https://get.nextflow.io | bash
    chmod +x nextflow
    sudo mv nextflow /usr/local/bin/

    # K8s Config for root
    sudo mkdir -p /root/.kube
    sudo ln -s /var/snap/microk8s/current/credentials/client.config /root/.kube/config

    # Set up the PV and PVC
    microk8s kubectl create namespace nextflow
    microk8s kubectl apply -f nextflow-pv.yaml
    microk8s kubectl apply -f nextflow-pvc.yaml -n nextflow

    # Install FUSE plugin - recommended by Nextflow, but not used in this tutorial
    # kubectl create -f https://github.com/nextflow-io/k8s-fuse-plugin/raw/master/manifests/k8s-fuse-plugin.yml
}

Following Nextflow's installation instructions, we set up the required namespace, Persistent Volume, and Persistent Volume Claim for pipeline execution. While Nextflow recommends using FUSE to deploy and run workflows, we've included but commented out the FUSE plugin installation for MicroK8s, as we're not currently using it.

We are using a setup.sh script very similar to the one from the previous posts; its main function does the following:

function main() {
    install_microk8s
    install_nfs
    setup_microK8s
    update_ip_in_microk8s_config
    update_ip_in_kubeconfig
    install_nextflow
    install_docker
}

All of these methods can be found in the previous blog posts.

Deploying EC2 Instance

We’ll again use Terraform to deploy our EC2 instance, with additional files being transferred to the instance after provisioning. The security rules remain similar to our previous blog post. Below is the complete Terraform script. We’ve selected an m5.4xlarge instance type for this tutorial since many nf-core workflows require substantial CPU resources.

provider "aws" {
  region = "us-east-1"
}

data "http" "myip" {
  url = "https://ipv4.icanhazip.com"
}

data "aws_ami" "ubuntu" {
  most_recent = true
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
  owners = ["099720109477"]
}

resource "aws_security_group" "microk8s_sg" {
  name        = "microk8s-sg"
  description = "Security group for MicroK8s EC2 instance"
  vpc_id      = var.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.your_ip, "${chomp(data.http.myip.response_body)}/32"]
    description = "Allow SSH access from your public IP"
  }

  ingress {
    from_port   = 16443
    to_port     = 16443
    protocol    = "tcp"
    cidr_blocks = [var.your_ip, "${chomp(data.http.myip.response_body)}/32"]
    description = "Allow K8s access from your public IP"
  }
}

resource "aws_instance" "microk8s_instance" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = [aws_security_group.microk8s_sg.id]
  key_name               = var.key_name

  root_block_device {
    volume_size = 500
    volume_type = "gp2"
  }

  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = file(var.private_key_path)
    host        = self.public_ip
  }

  provisioner "file" {
    source      = "${path.module}/setup.sh"
    destination = "/home/ubuntu/setup.sh"
  }

  provisioner "file" {
    source      = "${path.module}/nextflow-pv.yaml"
    destination = "/home/ubuntu/nextflow-pv.yaml"
  }

  provisioner "file" {
    source      = "${path.module}/nextflow-pvc.yaml"
    destination = "/home/ubuntu/nextflow-pvc.yaml"
  }

  provisioner "file" {
    source      = "${path.module}/nextflow.config"
    destination = "/home/ubuntu/nextflow.config"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo bash /home/ubuntu/setup.sh"
    ]
  }

  tags = {
    Name = "microk8s-dev-instance"
  }
}

We can now deploy the instance by running the following:

#!/bin/bash

terraform init
terraform validate
terraform plan -out=qc-microk8s-dev-plan
terraform apply qc-microk8s-dev-plan
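
Once the apply completes, we'll SSH into the instance using the connect.sh helper from the earlier posts. If you don't have it handy, a minimal sketch looks like this; note that it assumes a hypothetical instance_public_ip Terraform output, which is not defined above:

#!/bin/bash
# connect.sh - SSH into the provisioned instance.
# 'instance_public_ip' is a hypothetical Terraform output; the key path should
# match the one referenced by var.private_key_path.
ssh -i ~/.ssh/your-key.pem ubuntu@"$(terraform output -raw instance_public_ip)"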

Running our First Nextflow Workflow

Once everything has been deployed, we connect to our EC2 instance using the ./connect.sh script. After connecting, we test that everything worked correctly by running the following commands:

  • Nextflow installed correctly:
sudo nextflow info

########################## OUTPUT ##########################
  Version: 24.10.3 build 5933
  Created: 16-12-2024 15:34 UTC 
  System: Linux 5.15.0-1072-aws
  Runtime: Groovy 4.0.23 on OpenJDK 64-Bit Server VM 11.0.25+9-post-Ubuntu-1ubuntu120.04
  Encoding: UTF-8 (UTF-8)
  • MicroK8s is running:
microk8s status

microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none

If the above two checks pass, we can attempt to run our first workflow. Run the command below from the directory containing the nextflow.config file, and you should see some fun stuff happening!

sudo nextflow run hello

########################## OUTPUT ##########################
 N E X T F L O W   ~  version 24.10.3

Launching `https://github.com/nextflow-io/hello` [astonishing_caravaggio] DSL2 - revision: afff16a9b4 [master]

executor >  k8s (4) # Important that this does NOT say `local`
[fd/f7ba24] sayHello (4) [100%] 4 of 4 ✔
Bonjour world!

Ciao world!

Hello world!

Hola world!

If you get the above output running the hello pipeline AND the k8s executor is being used, then you are in great shape. If it is not working, ensure your config matches what is shown here and that you are running the command from the same directory as the config file.

Running our First NF Core Workflow

As mentioned before, NF Core is community-driven by scientists around the world. As a result, there are hundreds of pre-built pipelines that can be used. All of the pipelines can be found here:

https://nf-co.re/pipelines/

Now, we are not subject matter experts in these scientific fields, but we do know how to set up configurations. Going over the basic pipelines, there is a demo pipeline, nf-core/demo, that we can use to verify our K8s setup is correct. It takes a samplesheet.csv file as its input; a minimal placeholder version is shown below.

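Following the standard nf-core samplesheet layout (sample,fastq_1,fastq_2), a placeholder version looks like this; substitute paths to your own FASTQ reads:

cat > samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2
SAMPLE1_PE,sample1_R1.fastq.gz,sample1_R2.fastq.gz
EOF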

With the samplesheet.csv file in place as our pipeline input, we need to update the nextflow.config file slightly:

process {
    executor = 'k8s'
}

k8s {
    storageClaimName = 'nextflow-pvc'
    storageMountPath = '/home/ubuntu/work'
    namespace = 'nextflow'
    pod = [ // required as the pipeline performs a sym-link
        [volumeClaim: "nextflow-pvc", mountPath: "/home/ubuntu/work"],
        [hostPath: "/root/.nextflow/assets", mountPath: "/root/.nextflow/assets"]
    ]
}

You should also create an output directory for the pipeline results:
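
mkdir -p output   # matches the --outdir value passed to the run command below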

Now we can run the following:

sudo nextflow run nf-core/demo --input samplesheet.csv --outdir output

########################## OUTPUT ##########################
 N E X T F L O W   ~  version 24.10.3

Launching `https://github.com/nf-core/demo` [nice_fourier] DSL2 - revision: 04060b4644 [master]

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/demo 1.0.1
------------------------------------------------------
Input/output options
  input      : samplesheet.csv
  outdir     : output

Core Nextflow options
  revision   : master
  runName    : nice_fourier
  launchDir  : /home/ubuntu
  workDir    : /home/ubuntu/work
  projectDir : /root/.nextflow/assets/nf-core/demo
  userName   : root
  profile    : standard
  configFiles: 

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------

* The pipeline
  https://doi.org/10.5281/zenodo.12192442

* The nf-core framework
    https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
    https://github.com/nf-core/demo/blob/master/CITATIONS.md

executor >  k8s (7)
[d6/384cf0] NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)     [100%] 3 of 3 ✔
[db/180458] NFCORE_DEMO:DEMO:SEQTK_TRIM (SAMPLE1_PE) [100%] 3 of 3 ✔
[1a/799111] NFCORE_DEMO:DEMO:MULTIQC                 [100%] 1 of 1 ✔
-[nf-core/demo] Pipeline completed successfully-

If you monitor your cluster, you should see pods spinning up and completing their jobs. You can then go to the output directory and verify that the data and result HTML files are there.
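
For example, from a second SSH session on the instance:

microk8s kubectl get pods -n nextflow --watch   # watch task pods appear and complete
ls output                                       # inspect results once the run finishes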

Conclusion

In this tutorial, we demonstrated how to set up and execute Nextflow workflows on a MicroK8s Kubernetes cluster, leveraging persistent storage and containerized execution to ensure scalability and reproducibility. We covered the basics of Nextflow, including its advantages for scientific workflows, and explored NF Core, a community-driven repository of curated pipelines.

By provisioning our infrastructure with Terraform, configuring Nextflow on Kubernetes, and successfully running both Nextflow’s Hello World workflow and an NF Core demo pipeline, we validated our deployment. These workflows illustrate how computational pipelines can be streamlined, modularized, and efficiently managed in a cloud-native environment.

This setup provides a solid foundation for running complex data analysis workflows in genomics, bioinformatics, and other scientific domains. Whether you’re processing large datasets, integrating various bioinformatics tools, or scaling workloads across cloud and on-premise clusters, Nextflow and Kubernetes provide a powerful combination for computational reproducibility.
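
One practical note: the m5.4xlarge instance accrues charges while it runs, so remember to tear the environment down from your DevOps container once you're finished experimenting:

terraform destroy   # removes every resource created by this tutorial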

As a next step, you can explore customizing Nextflow workflows, integrating cloud storage (e.g., AWS S3, Google Cloud Storage), or optimizing resource allocation for large-scale datasets. If you encounter any challenges, feel free to reach out.