Setting up NF Core with MicroK8s

Introduction
This tutorial introduces Nextflow and NF Core, a leading workflow system for scientific data processing that helps scientists streamline their computational work through code-defined pipelines for data extraction and visualization. Following our previous blog posts, we’ll set up everything on a MicroK8s Kubernetes cluster to ensure workload scalability.
We recommend reading our previous blog posts to familiarize yourself with the terminology and techniques used for provisioning and architecting the system. These posts also cover setting up a Docker-based DevOps environment to keep your local system clean.
Nextflow Basics
Nextflow is a domain-specific language (DSL) and workflow manager designed to streamline data analysis pipelines in high-performance computing (HPC), cloud, and containerized environments. Built on Groovy and running on the JVM, it enables scalable, reproducible, and portable execution of computational workflows.
Key Features of Nextflow:
- Pipeline Modularity – Workflows are defined using a simple script format (.nf files), making them easy to develop and extend.
- Parallel Execution – Nextflow automatically schedules and parallelizes tasks, efficiently utilizing available compute resources.
- Container Support – Native support for Docker, Singularity, and Conda, ensuring environment consistency.
- Resumability & Fault Tolerance – If a pipeline fails, Nextflow can resume from the last completed step, reducing re-computation (see the example after this list).
- Cloud & HPC Integration – Seamlessly runs on local machines, SLURM, SGE, Kubernetes, AWS Batch, and Google Cloud Life Sciences.
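To see resumability in action: if a run fails partway through, fix the problem and re-run the same command with the -resume flag. Completed tasks are restored from the cache in the work directory instead of re-executing:

# Re-run after a failure; cached results from completed tasks are reused
nextflow run example.nf -resume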
Basic Nextflow Syntax:
A simple Nextflow script (example.nf) looks like this:
process HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, $name!"
    """
}

workflow {
    // A queue channel emitting three values; HELLO runs once per value, in parallel
    names = Channel.of('Alice', 'Bob', 'Charlie')
    HELLO(names)
    // Print each greeting as it completes
    HELLO.out.view()
}
- You can run the script with nextflow run example.nf
- One key thing to note is that all processes in Nextflow are parallelized by default, so it’s important to understand what’s happening under the hood before implementing production-grade workflows.
- More information: https://www.nextflow.io/docs/latest/your-first-script.html
Why Nextflow for Scientific Workflows?
Nextflow is widely used in bioinformatics, genomics, and large-scale data analysis because it allows researchers to define workflows in code, scale workloads efficiently, and maintain reproducibility. Combined with NF Core, Nextflow ensures robust, versioned, and best-practice pipelines—making it an excellent choice for scientific computing on Kubernetes.
What is NF Core and Why Do We Care?
NF Core is a community-driven framework for building and sharing best-practice computational pipelines using Nextflow, a workflow manager optimized for scalable and reproducible data analysis. NF Core provides a standardized set of pipelines that are rigorously tested and maintained, ensuring robust execution across different computational environments, including local workstations, high-performance computing (HPC) clusters, and cloud-based infrastructure.
Why Use NF Core?
- Reproducibility – NF Core pipelines enforce strict version control, ensuring results remain consistent across different computational environments.
- Scalability – Built with Nextflow, NF Core pipelines support seamless execution across multi-threaded, cluster-based, and cloud-based environments.
- Portability – The pipelines are containerized using Docker, Singularity, or Conda, reducing dependency conflicts.
- Community-Driven Best Practices – Each pipeline adheres to strict guidelines, is thoroughly tested, and receives updates from the global research community.
NF Core vs. Galaxy with CVMFS Drivers
Both NF Core and Galaxy are designed to facilitate scientific workflows, but they serve different use cases:
| Feature | NF Core (Nextflow) | Galaxy (CVMFS) |
|---|---|---|
| User Experience | Command-line, highly scriptable | Web-based GUI for ease of access |
| Pipeline Development | Requires coding (Nextflow DSL) | Prebuilt tools with workflow assembly |
| Scalability | Highly scalable for cloud and HPC | Scales well but requires CVMFS for distributed workflows |
| Portability | Works with containers (Docker/Singularity) | CVMFS provides shared software repositories |
| Community Support | Strong, pipeline-focused community | Broad bioinformatics user base |
DevOps Environment
As in previous tutorials, we’ll provision a DevOps Docker image that includes Terraform and the AWS CLI for EC2 provisioning. These tools will suffice for this tutorial. You can follow our previous blog post’s guide on setting up the environment. Here’s what the provisioning script should look like.
#!/bin/bash
# provision.sh
AWSCLI_VERSION=2.13.25
TERRAFORM_VERSION=1.6.1

function install_base_packages() {
    sudo apt-get update
    sudo apt-get install -y unzip wget curl openssh-client rsync
}

function install_terraform() {
    cd /tmp
    wget https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip
    unzip terraform_${TERRAFORM_VERSION}_linux_amd64.zip
    sudo mv terraform /usr/local/bin/
    rm terraform_${TERRAFORM_VERSION}_linux_amd64.zip
}

function install_aws_cli() {
    cd /tmp
    curl -fsSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-$AWSCLI_VERSION.zip" -o "awscli.zip"
    unzip awscli.zip > /dev/null
    sudo ./aws/install
    cd /tmp
    rm -rf aws
}

function main() {
    install_base_packages
    install_terraform
    install_aws_cli
}

main
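As a reminder of how we use this, a hypothetical build-and-run sequence looks like the following; the image tag devops-env is an assumption, and the Dockerfile that bakes provision.sh into an Ubuntu base image comes from the previous posts:

# Build the DevOps image (assumes a Dockerfile that runs provision.sh on an Ubuntu base)
docker build -t devops-env .
# Start a shell with the project mounted and AWS credentials available inside
docker run -it --rm -v "$PWD":/workspace -w /workspace -v "$HOME/.aws":/root/.aws devops-env bash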
Nextflow Configuration
We need to create a Persistent Volume (PV) and a Persistent Volume Claim (PVC) for Nextflow’s work directory. You may need to adjust the storage request size depending on your workloads. Here’s what we’ll use in this tutorial:
Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextflow-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nextflow-storage
  hostPath:
    path: "/home/ubuntu/work"
Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextflow-pvc
  namespace: nextflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: nextflow-storage
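Once the setup script below has applied these manifests, you can sanity-check them with:

# The PVC should report STATUS "Bound" once the claim matches the volume
microk8s kubectl get pv nextflow-pv
microk8s kubectl get pvc nextflow-pvc -n nextflow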
You will need to define a nextflow.config file that configures how Nextflow runs pipelines on your instance. Here is the configuration we’ll use:
process {
    executor = 'k8s'
}

k8s {
    storageClaimName = 'nextflow-pvc'
    storageMountPath = '/home/ubuntu/work'
    namespace = 'nextflow'
}
The configuration specifies Kubernetes as the executor rather than the default local executor. It also defines the PVC to use, container mount paths, and target namespace.
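To double-check what Nextflow will actually use at runtime, you can print the resolved configuration from the directory containing the file:

# Prints the merged configuration Nextflow resolves for this directory
nextflow config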
Inside your setup.sh script for EC2 provisioning, you’ll find the following function to set up Nextflow:
function install_nextflow() {
    sudo DEBIAN_FRONTEND=noninteractive apt install openjdk-11-jdk -y
    curl -s https://get.nextflow.io | bash
    chmod +x nextflow
    sudo mv nextflow /usr/local/bin/

    # K8s config for root
    sudo mkdir -p /root/.kube
    sudo ln -s /var/snap/microk8s/current/credentials/client.config /root/.kube/config

    # Set up PV and PVC
    microk8s kubectl create namespace nextflow
    microk8s kubectl apply -f nextflow-pv.yaml
    microk8s kubectl apply -f nextflow-pvc.yaml -n nextflow

    # FUSE plugin install (recommended by Nextflow, but unused here)
    # kubectl create -f https://github.com/nextflow-io/k8s-fuse-plugin/raw/master/manifests/k8s-fuse-plugin.yml
}
Following Nextflow’s installation instructions, we set up the required namespace, Persistent Volume, and Persistent Volume Claim for pipeline execution. While Nextflow recommends using FUSE to deploy and run workflows, we’ve included but commented out the FUSE plugin installation for MicroK8s, as we’re not currently using it.
We are using a setup.sh script very similar to the one from before; its main function does the following:
function main() {
    install_microk8s
    install_nfs
    setup_microK8s
    update_ip_in_microk8s_config
    update_ip_in_kubeconfig
    install_nextflow
    install_docker
}
All of these methods can be found in the previous blog posts.
Deploying EC2 Instance
We’ll again use Terraform to deploy our EC2 instance, with additional files transferred to the instance after provisioning. The security rules remain similar to our previous blog post. Below is the complete Terraform script. We’ve selected an m5.4xlarge instance type for this tutorial, since many nf-core workflows require substantial CPU resources.
provider "aws" {
  region = "us-east-1"
}

data "http" "myip" {
  url = "https://ipv4.icanhazip.com"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_security_group" "microk8s_sg" {
  name        = "microk8s-sg"
  description = "Security group for MicroK8s EC2 instance"
  vpc_id      = var.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.your_ip, "${chomp(data.http.myip.response_body)}/32"]
    description = "Allow SSH access from your public IP"
  }

  ingress {
    from_port   = 16443
    to_port     = 16443
    protocol    = "tcp"
    cidr_blocks = [var.your_ip, "${chomp(data.http.myip.response_body)}/32"]
    description = "Allow K8s access from your public IP"
  }
}

resource "aws_instance" "microk8s_instance" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = [aws_security_group.microk8s_sg.id]
  key_name               = var.key_name

  root_block_device {
    volume_size = 500
    volume_type = "gp2"
  }

  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = file(var.private_key_path)
    host        = self.public_ip
  }

  provisioner "file" {
    source      = "${path.module}/setup.sh"
    destination = "/home/ubuntu/setup.sh"
  }

  provisioner "file" {
    source      = "${path.module}/nextflow-pv.yaml"
    destination = "/home/ubuntu/nextflow-pv.yaml"
  }

  provisioner "file" {
    source      = "${path.module}/nextflow-pvc.yaml"
    destination = "/home/ubuntu/nextflow-pvc.yaml"
  }

  provisioner "file" {
    source      = "${path.module}/nextflow.config"
    destination = "/home/ubuntu/nextflow.config"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo bash /home/ubuntu/setup.sh"
    ]
  }

  tags = {
    Name = "microk8s-dev-instance"
  }
}
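The script references several input variables that aren’t defined above. A minimal variables.tf sketch covering them might look like this (the names match the references above; the default instance type reflects our choice, so adjust as needed):

variable "vpc_id" {
  type = string
}

variable "subnet_id" {
  type = string
}

variable "key_name" {
  type = string
}

variable "private_key_path" {
  type = string
}

variable "your_ip" {
  type        = string
  description = "Your public IP in CIDR form, e.g. 203.0.113.7/32"
}

variable "instance_type" {
  type    = string
  default = "m5.4xlarge"
}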
We can now deploy the instance by running the following:
#!/bin/bash
terraform init
terraform validate
terraform plan -out=qc-microk8s-dev-plan
terraform apply qc-microk8s-dev-plan
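As a reference, a hypothetical connect.sh sketch might look like the following; it assumes you’ve defined a Terraform output named instance_public_ip, and the key path should be adapted to your setup from the previous posts:

#!/bin/bash
# connect.sh - hypothetical sketch; adapt the key path and output name to your setup
INSTANCE_IP=$(terraform output -raw instance_public_ip)
ssh -i "$HOME/.ssh/your-key.pem" ubuntu@"$INSTANCE_IP"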
Running our First Nextflow Workflow
Once everything has been deployed, we can connect to our EC2 instance using our ./connect.sh script. After connecting, we verify that everything worked correctly by running the following commands:
- Nextflow has installed correctly:
sudo nextflow info
########################## OUTPUT ##########################
Version: 24.10.3 build 5933
Created: 16-12-2024 15:34 UTC
System: Linux 5.15.0-1072-aws
Runtime: Groovy 4.0.23 on OpenJDK 64-Bit Server VM 11.0.25+9-post-Ubuntu-1ubuntu120.04
Encoding: UTF-8 (UTF-8)
- MicroK8s is running:
microk8s status
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
If both checks pass, we can attempt to run our first workflow. Run the following command from the directory containing the nextflow.config file, and you should see some fun stuff happening!
sudo nextflow run hello
########################## OUTPUT ##########################
N E X T F L O W ~ version 24.10.3
Launching `https://github.com/nextflow-io/hello` [astonishing_caravaggio] DSL2 - revision: afff16a9b4 [master]
executor > k8s (4) # Important that this does NOT say `local`
[fd/f7ba24] sayHello (4) [100%] 4 of 4 ✔
Bonjour world!
Ciao world!
Hello world!
Hola world!
If you get the above output running the hello pipeline AND the k8s executor is being used, then you are in great shape. If this is not working, ensure your config matches what is shown here and that you are running the command in the same directory as the config file.
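If it is still failing, two useful places to look are the task pods and Nextflow’s log file in the launch directory:

# Inspect task pod status (Pending/Error states usually point at PVC or namespace issues)
microk8s kubectl get pods -n nextflow
# Nextflow writes a detailed log next to where you launched the run
cat .nextflow.log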
Running our First NF Core Workflow
As mentioned before, NF Core is driven by a community of scientists around the world. As a result, there are hundreds of pre-built pipelines available, all of which can be found at https://nf-co.re/pipelines
We are not subject matter experts in the field, but we do know how to set up configurations. Going over the basic pipelines, there is a demo pipeline (nf-core/demo) we can use to ensure that our K8s setup is correct.
The pipeline takes a samplesheet.csv file as input.
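Here is a minimal sketch of one, assuming the standard nf-core sample,fastq_1,fastq_2 schema; the file paths are hypothetical placeholders for your own FASTQ data:

sample,fastq_1,fastq_2
SAMPLE1_PE,/home/ubuntu/work/data/sample1_R1.fastq.gz,/home/ubuntu/work/data/sample1_R2.fastq.gz

With the samplesheet in place, we need to update the nextflow.config file slightly: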
process {
    executor = 'k8s'
}

k8s {
    storageClaimName = 'nextflow-pvc'
    storageMountPath = '/home/ubuntu/work'
    namespace = 'nextflow'
    // Required because the pipeline creates sym-links into the assets directory
    pod = [
        [volumeClaim: 'nextflow-pvc', mountPath: '/home/ubuntu/work'],
        [hostPath: '/root/.nextflow/assets', mountPath: '/root/.nextflow/assets']
    ]
}
You should also create an output directory (mkdir output). Now we can run the following:
sudo nextflow run nf-core/demo --input samplesheet.csv --outdir output
########################## OUTPUT ##########################
N E X T F L O W ~ version 24.10.3
Launching `https://github.com/nf-core/demo` [nice_fourier] DSL2 - revision: 04060b4644 [master]
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/demo 1.0.1
------------------------------------------------------
Input/output options
input : samplesheet.csv
outdir : output
Core Nextflow options
revision : master
runName : nice_fourier
launchDir : /home/ubuntu
workDir : /home/ubuntu/work
projectDir : /root/.nextflow/assets/nf-core/demo
userName : root
profile : standard
configFiles:
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
* The pipeline
https://doi.org/10.5281/zenodo.12192442
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/nf-core/demo/blob/master/CITATIONS.md
executor > k8s (7)
[d6/384cf0] NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE) [100%] 3 of 3 ✔
[db/180458] NFCORE_DEMO:DEMO:SEQTK_TRIM (SAMPLE1_PE) [100%] 3 of 3 ✔
[1a/799111] NFCORE_DEMO:DEMO:MULTIQC [100%] 1 of 1 ✔
-[nf-core/demo] Pipeline completed successfully-
If you monitor your cluster, you should see pods spin up and complete their jobs. You can then go to the output directory and verify that the data and resulting HTML files are there.
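For example, you can watch the task pods while the pipeline runs and inspect the report afterwards (the multiqc subdirectory is our assumption for this pipeline version):

# Watch task pods get scheduled and complete
microk8s kubectl get pods -n nextflow -w
# The aggregated MultiQC report typically lands here
ls output/multiqc/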
Conclusion
In this tutorial, we demonstrated how to set up and execute Nextflow workflows on a MicroK8s Kubernetes cluster, leveraging persistent storage and containerized execution to ensure scalability and reproducibility. We covered the basics of Nextflow, including its advantages for scientific workflows, and explored NF Core, a community-driven repository of curated pipelines.
By provisioning our infrastructure with Terraform, configuring Nextflow on Kubernetes, and successfully running both Nextflow’s Hello World workflow and an NF Core demo pipeline, we validated our deployment. These workflows illustrate how computational pipelines can be streamlined, modularized, and efficiently managed in a cloud-native environment.
This setup provides a solid foundation for running complex data analysis workflows in genomics, bioinformatics, and other scientific domains. Whether you’re processing large datasets, integrating various bioinformatics tools, or scaling workloads across cloud and on-premise clusters, Nextflow and Kubernetes provide a powerful combination for computational reproducibility.
As a next step, you can explore customizing Nextflow workflows, integrating cloud storage (e.g., AWS S3, Google Cloud Storage), or optimizing resource allocation for large-scale datasets. If you encounter any challenges, feel free to reach out.