job run

Run a job in Studio.

Synopsis

usage: datachain job run [-h] [-v] [-q] [--team TEAM] [--env-file ENV_FILE] [--env ENV [ENV ...]] [--cluster CLUSTER] [--workers WORKERS]
                         [--files FILES [FILES ...]] [--python-version PYTHON_VERSION] [--repository REPOSITORY] [--req-file REQ_FILE] [--req REQ [REQ ...]]
                         [--priority PRIORITY] [--start-time START_TIME] [--cron CRON]
                         file

Description

This command runs a job in Studio using the specified query file. You can configure various aspects of the job including environment variables, Python version, dependencies, and more. When using --start-time or --cron, the job is scheduled as a task and will not show logs immediately. The job will be executed according to the schedule.

Arguments

  • file - Query file to run

Options

  • --team TEAM - Team to run job for (default: from config)
  • --env-file ENV_FILE - File with environment variables for the job
  • --env ENV - Environment variables in KEY=VALUE format
  • --cluster CLUSTER - Compute cluster to run the job on
  • --workers WORKERS - Number of workers for the job
  • --files FILES - Additional files to include in the job
  • --python-version PYTHON_VERSION - Python version for the job (e.g., 3.9, 3.10, 3.11)
  • --repository REPOSITORY - Repository URL to clone before running the job
  • --req-file REQ_FILE - Python requirements file
  • --req REQ - Python package requirements
  • --priority PRIORITY - Priority for the job, in the range 0-5. A lower value means higher priority (default: 5)
  • --start-time START_TIME - Start time for the scheduled task, in ISO format or natural language.
  • --cron CRON - Cron expression for the cron task.
  • -h, --help - Show the help message and exit.
  • -v, --verbose - Be verbose.
  • -q, --quiet - Be quiet.
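
Since --priority only accepts values from 0 to 5, a wrapper script may want to check the value before submitting a job. The following is a minimal sketch of such a pre-flight check; the function name is_valid_priority is hypothetical and not part of datachain, and the command is only printed, not executed.

```shell
# Hypothetical helper: succeed only for a single digit in the documented 0-5 range.
is_valid_priority() {
    case "$1" in
        [0-5]) return 0 ;;
        *)     return 1 ;;
    esac
}

PRIORITY=2
if is_valid_priority "$PRIORITY"; then
    # Print the command instead of running it, so the sketch has no side effects.
    echo "datachain job run --priority $PRIORITY query.py"
else
    echo "priority must be between 0 and 5" >&2
fi
```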

Examples

  1. Run a basic job:

    datachain job run query.py
    

  2. Run a job with specific team and Python version:

    datachain job run --team my-team --python-version 3.11 query.py
    

  3. Run a job with environment variables and requirements:

    datachain job run --env-file .env --req-file requirements.txt query.py
    

  4. Run a job with multiple workers and additional files:

    datachain job run --workers 4 --files utils.py config.json query.py
    

  5. Run a job with inline environment variables and package requirements:

    datachain job run --env API_KEY=123 --req pandas numpy query.py
    

  6. Run a job with a repository (will be cloned in the job working directory):

    datachain job run --repository https://github.com/iterative/datachain query.py
    
    # To specify a branch / revision:
    datachain job run --repository https://github.com/iterative/datachain@main query.py
    
    # Git URLs are also supported:
    datachain job run --repository git@github.com:iterative/datachain.git@main query.py
    

  7. Run a job with higher priority:

    datachain job run --priority 2 query.py
    

  8. Run a job on a specific cluster:

    # List the available clusters to find the cluster ID
    datachain job clusters
    # Use the ID of an active cluster from the output above
    datachain job run --cluster 1 query.py
    

  9. Schedule a job to run once at a specific time:

    # Run job tomorrow at 3pm
    datachain job run --start-time "tomorrow 3pm" query.py
    
    # Run job in 2 hours
    datachain job run --start-time "in 2 hours" query.py
    
    # Run job on Monday at 9am
    datachain job run --start-time "monday 9am" query.py
    
    # Run job at a specific date and time
    datachain job run --start-time "2024-01-15 14:30:00" query.py
    

  10. Schedule a recurring job using a cron expression:

    # Run job daily at midnight
    datachain job run --cron "0 0 * * *" query.py
    
    # Run job every Monday at 9am
    datachain job run --cron "0 9 * * 1" query.py
    
    # Run job every hour
    datachain job run --cron "0 * * * *" query.py
    
    # Run job every month
    datachain job run --cron "@monthly" query.py
    

  11. Schedule a recurring job with a start time:

    # Start the cron job after tomorrow 3pm
    datachain job run --start-time "tomorrow 3pm" --cron "0 0 * * *" query.py
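
When a script needs an explicit timestamp rather than a natural-language phrase, the start time can be computed first and then passed to --start-time. The sketch below assumes GNU date (the -d option; BSD and macOS date use different flags) and produces the same "YYYY-MM-DD HH:MM:SS" layout as the example above. The helper name start_time_in_hours is hypothetical, and how Studio interprets the time zone of such a value is not stated here, so treat the local-time output as an assumption.

```shell
# Hypothetical helper: print the local time N hours from now
# as "YYYY-MM-DD HH:MM:SS". Requires GNU date.
start_time_in_hours() {
    date -d "+$1 hours" '+%Y-%m-%d %H:%M:%S'
}

START_TIME="$(start_time_in_hours 2)"
# Print the command instead of running it, so the sketch has no side effects.
echo "datachain job run --start-time \"$START_TIME\" query.py"
```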
    

Notes

  • Closing the logs command (e.g., with Ctrl+C) only stops displaying the logs; it does not cancel the job.
  • To cancel a running job, use the datachain job cancel command.
  • The job continues running in Studio even after you stop viewing the logs.
  • You can list the available compute clusters with the datachain job clusters command.
  • When using --start-time or --cron options, the job is scheduled as a task and will not show logs immediately. The job will be executed according to the schedule.
  • The --start-time option supports natural language parsing using the dateparser library, allowing flexible time expressions like "tomorrow 3pm", "in 2 hours", "monday 9am", etc.
  • Cron expressions follow the standard format: minute hour day-of-month month day-of-week (e.g., "0 0 * * *" for daily at midnight) or Vixie cron-style “@” keyword expressions.
  • The following Vixie cron-style keywords are supported:
    • @midnight
    • @hourly
    • @daily
    • @weekly
    • @monthly
    • @yearly
    • @annually
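
For readers unfamiliar with the "@" keywords, the sketch below maps each one to the five-field expression it conventionally stands for in Vixie cron. This reflects the usual crontab meaning, not a confirmed detail of how Studio expands these keywords; the function name expand_cron_keyword is hypothetical.

```shell
# Hypothetical helper: expand a Vixie cron-style "@" keyword into its
# conventional five-field form. Any other input is printed unchanged.
expand_cron_keyword() {
    case "$1" in
        "@hourly")              printf '%s\n' '0 * * * *' ;;
        "@daily"|"@midnight")   printf '%s\n' '0 0 * * *' ;;
        "@weekly")              printf '%s\n' '0 0 * * 0' ;;
        "@monthly")             printf '%s\n' '0 0 1 * *' ;;
        "@yearly"|"@annually")  printf '%s\n' '0 0 1 1 *' ;;
        *)                      printf '%s\n' "$1" ;;
    esac
}

expand_cron_keyword "@monthly"   # prints: 0 0 1 * *
```

Under this conventional reading, --cron "@daily" and --cron "0 0 * * *" describe the same schedule.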