CDSNA

Job Scheduler

Laboratories, universities, and private entities can now load and run their HPC jobs on our HPC Job Scheduler clusters, all hosted on our managed infrastructure.

Consolidated Data Storage would:

  • Host the compute infrastructure
  • Operate the scheduler
  • Provide secure access
  • Allocate resources per customer
  • Monitor and optimize workloads
  • Possibly meter usage and bill per GPU-hour
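
The metering item above is simple arithmetic: GPU-hours are GPUs allocated times wall-clock hours, multiplied by a rate. A minimal sketch with purely hypothetical numbers (the $2.10/GPU-hour rate is illustrative, not an actual CDSNA price):

```shell
# Hypothetical metering sketch: 8 GPUs for 12.5 hours at an assumed rate.
gpus=8; hours=12.5; rate=2.10
awk -v g="$gpus" -v h="$hours" -v r="$rate" \
    'BEGIN { printf "GPU-hours: %.1f, cost: $%.2f\n", g*h, g*h*r }'
```

In practice the inputs would come from the scheduler's accounting records rather than hard-coded values.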

Clients would:

  • Log in (SSH or web portal)
  • Upload data or mount storage
  • Submit jobs
  • Monitor queue status
  • Retrieve results
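
The submit step above typically takes a Slurm-style batch file (Slurm underpins the SUNK offering described below); a minimal illustrative sketch, where the resource numbers and script name are placeholders rather than a CDSNA-specific template:

```bash
#!/bin/bash
#SBATCH --job-name=train          # name shown in the queue
#SBATCH --gres=gpu:4              # request 4 GPUs (illustrative)
#SBATCH --time=04:00:00           # wall-clock limit
#SBATCH --output=train-%j.log     # stdout/stderr, tagged with the job ID

srun python train.py              # launch the workload
```

Such a file would be submitted with `sbatch job.sbatch`, watched with `squeue -u $USER`, and its results copied back over SSH or via the mounted storage.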

BasePOD

We offer a validated solution tailored for AI and ML workloads. The architecture scales from one to nine DGX systems within a single rack.

Featuring up to 72 NVIDIA GPUs, it is backed by an NVMe storage back-end that delivers over 90 GB/s of data throughput from a single 2U form factor.

Reach out to us to discover more or test drive it firsthand.

  • Experience the advanced capabilities of the NVIDIA DGX/HGX.

SuperPOD

Successfully tested and integrated with WekaFS, the world’s flagship clustered parallel filesystem storage stack, the NVIDIA DGX SuperPOD scales linearly to tens of DGX systems and provides a robust solution for scalable AI development and deep learning workloads. Key features include:

  • Streamlined model training directly from Spectrum Scale.
  • Automatic use of local resources as cache to reduce data re-reads over the network.
  • Dedicated workspace for long-term storage (LTS) of datasets.
  • A centralized hub for acquiring, manipulating, and sharing results via standard protocols like NFS, SMB, and S3.
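
The standard-protocol access above can be exercised with ordinary client tools; a hedged sketch, in which the host names, export paths, bucket, and endpoint are all placeholders:

```bash
# Illustrative only: server, export path, bucket, and endpoint are placeholders.
mount -t nfs storage.example.com:/export/results /mnt/results       # NFS mount
aws s3 cp s3://example-results/run42/metrics.json . \
    --endpoint-url https://s3.example.com                           # S3-compatible fetch
```

Both require appropriate privileges and credentials; SMB shares would be reached analogously with `mount -t cifs` or a desktop file browser.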

Enterprise-grade research infrastructure at scale

SUNK unifies security, scalability, performance, and observability to deliver a high-performance Slurm experience purpose-built for AI research clusters on CoreWeave’s optimized infrastructure.

Security

SUNK User Provisioning automatically synchronizes POSIX and Slurm users with CoreWeave IAM or any supported Identity Provider, such as Okta or Google Workspace. User and group updates propagate instantly, eliminating manual configuration, reducing operational risk, and accelerating secure researcher access to compute.
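
For context, the Slurm side of such provisioning boils down to accounting-database associations; created by hand they would look roughly like this (the account and user names are illustrative, and SUNK performs the equivalent steps automatically from the Identity Provider):

```bash
# Illustrative only: "lab1" and "alice" are placeholder names; SUNK drives
# the equivalent of these sacctmgr calls automatically from IAM/IdP data.
sacctmgr add account lab1 Description="Example lab"
sacctmgr add user name=alice account=lab1
sacctmgr show associations where user=alice   # verify the new association
```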

Scalability

Workloads and data move fast and friction-free across clouds and regions without lock-in, giving you the flexibility to choose and the performance to stay.

Performance

The SUNK Scheduler runs training, inference, and reinforcement learning workloads on the same cluster to maximize efficiency. Topology-aware scheduling and optimized job requeue improve performance and resource utilization across every phase of research.
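
In Slurm, topology-aware scheduling of the kind described above is typically enabled with the tree topology plugin plus a `topology.conf` describing the switch hierarchy; an illustrative fragment, where the switch and node names are placeholders rather than CoreWeave’s actual layout:

```
# slurm.conf (fragment)
TopologyPlugin=topology/tree

# topology.conf -- illustrative two-level switch hierarchy
SwitchName=leaf1 Nodes=gpu[001-016]
SwitchName=leaf2 Nodes=gpu[017-032]
SwitchName=spine Switches=leaf[1-2]
```

With a hierarchy like this, the scheduler prefers placing a job’s nodes under the fewest switches, cutting cross-switch traffic for tightly coupled training jobs.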

Observability

Quickly troubleshoot and optimize performance using Grafana dashboards purpose-built for SUNK. Access rich visibility into Slurm job metrics, hardware, networking, and storage layers, all tightly integrated with CoreWeave’s observability stack for end-to-end infrastructure insight.

Run on industry-leading Cloud infrastructure services

SUNK runs on infrastructure services that provide the ideal combination of ease of use, workload fungibility, performance, and scale.

Compute Services

Get the latest GPU compute you need for your AI workloads through a Kubernetes-native environment

Storage Services

Flexible, purpose-built, high-performance storage solutions tailored for AI

Networking Services

High-performance networking for optimal cluster scale-out and connectivity

Supercomputing Scale & Enterprise-grade security

With massive megaclusters, CoreWeave GPU infrastructure supports multi-trillion-parameter model training.

    HPC Job Scheduler Request Form

    Tell us what you’re trying to run and we’ll follow up.


    By submitting, you consent to CDSNA contacting you at the email/phone provided to follow up on this request.