What is the Aurora engine?

Aurora is xAI's autoregressive video architecture, trained on 110,000 NVIDIA GB200 GPUs and optimized for temporal consistency and native audio sync.

What's the difference between T2V and I2V?

T2V (text-to-video) requires only a text prompt to generate a video from scratch. I2V (image-to-video) starts from a reference image and animates it forward while preserving the original composition.

Does Grok Imagine generate audio automatically?

Yes. Every video comes with dialogue, music, and sound effects generated natively. No separate audio editing or post-production is needed.

What resolution and duration does Grok Imagine support?

Videos are generated at 720p (1280×720) at 24 frames per second. Duration options are 6 seconds and 10 seconds. All three aspect ratios are supported: 16:9 (landscape), 9:16 (vertical), and 1:1 (square).

How fast is Grok Imagine?

Approximately 30 seconds per generation. This includes both the video rendering and native audio synthesis — a complete, ready-to-use video with synchronized sound in under a minute.

What are the Normal, Fun, and Spicy modes?

Grok Imagine offers three creative modes. Normal produces grounded, realistic output. Fun allows more stylized, playful results with exaggerated motion and color. Spicy pushes creative boundaries with surreal, highly stylized aesthetics.

xAI · Grok Imagine · Aurora Engine

Grok Imagine.
Video with Native Audio.

xAI's Aurora-powered video model. Text or image in — cinematic video with native audio out. Dialogue, music, and sound effects generated in ~30 seconds.

Resolution: 720p
Frame rate: 24fps
Max duration: 10s
Generation time: ~30s
Aspect ratios: 16:9, 9:16, 1:1

Specifications

Technical Specs

Built on the Aurora engine — xAI's autoregressive video architecture trained on 110,000 NVIDIA GB200 GPUs.

Resolution: 720p (1280 × 720)
Frame Rate: 24 fps (cinematic standard)
Duration: 6s or 10s (two options)
Audio: Native (dialogue + music + SFX)
Generation: ~30s (end-to-end)
Engine: Aurora (110K GB200 GPUs)

720p HD · 6s & 10s · T2V + I2V · Native audio · 3 Aspect ratios · 3 Creative modes

Output Formats

Every Platform. Every Ratio.

Generate landscape, vertical, or square video — Grok Imagine supports all three major aspect ratios natively.

Landscape · 16:9 · 1280×720 · YouTube, Desktop, TV
Vertical · 9:16 · 720×1280 · TikTok, Reels, Shorts
Square · 1:1 · 720×720 · Instagram, Twitter/X
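As a quick sanity check on the figures above, here is a minimal Python sketch that maps each platform to its aspect ratio and pixel dimensions. The dictionary names and the `dimensions_for` and `reduced_ratio` helpers are hypothetical and only illustrate the listed values; they are not part of any Grok Imagine interface.

```python
from math import gcd

# Pixel dimensions per aspect ratio, copied from the figures listed above.
DIMENSIONS = {
    "16:9": (1280, 720),  # landscape
    "9:16": (720, 1280),  # vertical
    "1:1": (720, 720),    # square
}

# Hypothetical lookup built from the platform names listed above.
PLATFORM_RATIO = {
    "youtube": "16:9", "desktop": "16:9", "tv": "16:9",
    "tiktok": "9:16", "reels": "9:16", "shorts": "9:16",
    "instagram": "1:1", "twitter/x": "1:1",
}


def dimensions_for(platform: str) -> tuple[int, int]:
    """Return (width, height) in pixels for a platform name."""
    return DIMENSIONS[PLATFORM_RATIO[platform.lower()]]


def reduced_ratio(width: int, height: int) -> str:
    """Reduce pixel dimensions to a 'W:H' ratio string."""
    d = gcd(width, height)
    return f"{width // d}:{height // d}"


# Each listed resolution reduces to the ratio it is filed under.
for ratio, (w, h) in DIMENSIONS.items():
    assert reduced_ratio(w, h) == ratio
```

For example, `dimensions_for("TikTok")` returns `(720, 1280)`.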

Duration

Choose Your Clip Length

Two duration options to fit your creative needs — from punchy hooks to longer narrative sequences.

6 seconds

With native audio

Perfect for social hooks, product reveals, reaction clips, and punchy visual statements that grab attention instantly.

10 seconds (max)

With native audio

Room for narrative beats, character moments, multi-scene pacing, and story arcs with beginning, middle, and end.
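To put the two clip lengths in concrete terms, the sketch below multiplies each duration by the 24 fps frame rate stated in the specs. The `frame_count` helper is a hypothetical name used only for this arithmetic.

```python
FPS = 24              # frame rate stated in the specs
DURATIONS_S = (6, 10)  # the two duration options


def frame_count(duration_s: int, fps: int = FPS) -> int:
    """Number of frames in a clip of the given length."""
    if duration_s not in DURATIONS_S:
        raise ValueError(f"duration must be one of {DURATIONS_S}")
    return duration_s * fps


for seconds in DURATIONS_S:
    print(f"{seconds}s clip: {frame_count(seconds)} frames")
# 6s clip: 144 frames
# 10s clip: 240 frames
```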

Input Modes

Text-to-Video vs Image-to-Video

Two ways to create. Describe a scene from scratch — or animate a photo you already have.

Text-to-Video (T2V)

Write a text prompt describing your scene — characters, setting, camera angle, mood, lighting. Grok Imagine generates the full video with synchronized audio from words alone.

Natural language prompts
Full creative control over every detail
Audio generated from scene context
All aspect ratios supported

Image-to-Video (I2V)

Upload a reference image and describe how it should move. Grok Imagine animates the photo forward while preserving the original composition, colors, and subject identity.

Animate existing photos and artwork
Preserve source composition and style
Add natural motion + synchronized audio
Works with photos, illustrations, renders
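The difference between the two input modes can be sketched as a small data model: a text prompt is always present, and supplying a reference image switches the job from T2V to I2V. The `GenerationInput` class and its field names are assumptions made for illustration, not a documented xAI API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationInput:
    """Hypothetical container for what the user supplies."""
    prompt: str                       # scene description (T2V) or motion description (I2V)
    image_path: Optional[str] = None  # reference image, only for I2V

    def __post_init__(self) -> None:
        # Both modes described above need some text from the user.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")

    @property
    def mode(self) -> str:
        """'I2V' when a reference image is given, otherwise 'T2V'."""
        return "I2V" if self.image_path else "T2V"


t2v = GenerationInput("A lighthouse at dusk, slow aerial pan, calm mood")
i2v = GenerationInput("The waves start to roll in", image_path="beach.jpg")
print(t2v.mode, i2v.mode)  # T2V I2V
```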

Capabilities

What Grok Imagine Can Do

Built on xAI's Aurora engine with native audio generation, temporal coherence, and cinematic shot understanding.

Native Audio Generation

Dialogue with lip-sync, contextual background music, and ambient sound effects — all generated natively alongside the video. No separate audio pipeline.

Temporal Consistency

Aurora maintains frame-to-frame coherence across the full clip. Characters stay consistent, objects persist, and camera motion flows without artifacts or flickering.

Cinematic Shot Language

Describe camera movements in your prompt — tracking shots, close-ups, panning, aerial views — and Grok Imagine executes them with professional-grade framing.
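One way to keep camera direction explicit is to assemble the prompt from labeled parts. The sketch below is only a convention for organizing your own text; the page does not define a prompt syntax, so the `build_prompt` helper and the "Camera:", "Mood:", and "Lighting:" labels are assumptions.

```python
from typing import Optional


def build_prompt(
    scene: str,
    camera: Optional[str] = None,
    mood: Optional[str] = None,
    lighting: Optional[str] = None,
) -> str:
    """Join the scene and any optional directions into one sentence-style prompt."""
    parts = [scene.strip()]
    if camera:
        parts.append(f"Camera: {camera.strip()}")
    if mood:
        parts.append(f"Mood: {mood.strip()}")
    if lighting:
        parts.append(f"Lighting: {lighting.strip()}")
    # Strip trailing periods so the joined prompt has exactly one per part.
    return ". ".join(p.rstrip(".") for p in parts) + "."


print(build_prompt("A fox runs through snow", camera="low tracking shot", mood="tense"))
# A fox runs through snow. Camera: low tracking shot. Mood: tense.
```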

Style Versatility

Three creative modes — Normal, Fun, and Spicy — let you dial the aesthetic from photorealistic to highly stylized. Works across live-action, animation, and abstract styles.

Image-to-Video Animation

Upload any image and Grok Imagine brings it to life. The model preserves subject identity, composition, and visual style while generating natural, fluid motion.

Native Audio

Sound Built In. No Post-Production.

Every Grok Imagine video ships with three layers of audio — dialogue, music, and sound effects — generated natively alongside the visuals.

Dialogue & Lip-Sync

Characters speak with natural voice and precise lip synchronization. The model generates speech that matches mouth movements frame-by-frame — no manual dubbing needed.

Contextual Background Music

Background music adapts to scene mood and tempo automatically. Action scenes get intensity and drive; quiet moments get ambient, atmospheric scoring.

Ambient Sound Effects

Footsteps on gravel, rain on windows, engine rumble, wind through trees — environmental audio is generated and precisely timed to match the visual content.

Under the Hood

The Aurora Engine

xAI's proprietary autoregressive video architecture, trained on the largest known infrastructure for a video model.

110,000 NVIDIA GB200 GPUs

Aurora is trained on the largest known GPU cluster dedicated to video generation — 110,000 NVIDIA GB200 GPUs. This massive compute enables the model to learn complex temporal dynamics, audio-visual synchronization, and physically plausible motion at scale.

Autoregressive architecture for frame-by-frame coherence
Joint audio-visual generation in a single forward pass
Trained on diverse video corpus for style versatility
Optimized inference — ~30 seconds per generation

Architecture: Autoregressive. Sequential frame generation for temporal consistency.
Audio: Joint generation. Video and audio produced in a single model pass.
Training Scale: 110K GPUs. NVIDIA GB200 Blackwell architecture.
Inference Speed: ~30 seconds. Complete video with audio, end-to-end.
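For readers unfamiliar with the term, "autoregressive" means each new frame is produced from the frames that came before it. The toy Python loop below shows only that general pattern: a "frame" is a single brightness number and the "model" is a small random nudge. It is an illustration under those stated assumptions and says nothing about how Aurora is actually implemented.

```python
import random


def next_frame(history: list[float], rng: random.Random) -> float:
    """Toy stand-in for a model: the new frame is the last frame plus a small change."""
    return history[-1] + rng.uniform(-0.1, 0.1)


def generate(first_frame: float, num_frames: int, seed: int = 0) -> list[float]:
    """Build a clip one frame at a time, always conditioning on earlier frames."""
    rng = random.Random(seed)
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(next_frame(frames, rng))
    return frames


clip = generate(first_frame=0.5, num_frames=240)  # 10 seconds at 24 fps
# Adjacent frames never differ by more than the step size, a crude
# analogue of frame-to-frame coherence.
largest_jump = max(abs(b - a) for a, b in zip(clip, clip[1:]))
print(len(clip), largest_jump <= 0.1 + 1e-9)
```

Because every frame depends on the one before it, changes accumulate gradually instead of jumping, which is the intuition behind the temporal consistency claim.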

How It Works

Three Steps. One Video.

From prompt to finished video with audio in under a minute.

1. Write your prompt or upload an image

For T2V, describe the scene, characters, camera angle, and mood. For I2V, upload a reference image and describe how it should animate. Choose Normal, Fun, or Spicy mode.

2. Select aspect ratio and duration

16:9 for landscape (YouTube, desktop), 9:16 for vertical (TikTok, Reels, Shorts), or 1:1 for square (Instagram). Pick 6s or 10s duration.

3. Generate in ~30 seconds

Aurora renders your video with fully synchronized audio: dialogue, music, and sound effects all included. Download and use immediately, no post-processing required.
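The choices made in the steps above come from three small, fixed sets, so they are easy to check before submitting a job. The sketch below assumes a hypothetical `validate_settings` helper; the allowed values are taken from this page, while the function and its names are illustrative only.

```python
# Allowed values, as listed on this page.
MODES = {"normal", "fun", "spicy"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
DURATIONS_S = {6, 10}


def validate_settings(mode: str, aspect_ratio: str, duration_s: int) -> list[str]:
    """Return a list of problems; an empty list means the settings are supported."""
    errors = []
    if mode.lower() not in MODES:
        errors.append(f"unknown mode: {mode!r}")
    if aspect_ratio not in ASPECT_RATIOS:
        errors.append(f"unsupported aspect ratio: {aspect_ratio!r}")
    if duration_s not in DURATIONS_S:
        errors.append(f"unsupported duration: {duration_s}s")
    return errors


print(validate_settings("Fun", "9:16", 6))   # []
print(validate_settings("wild", "4:3", 8))   # three problems reported
```

Returning every problem at once, rather than raising on the first, lets a form or script show all the fixes needed in one pass.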

Use Cases

Who Is Grok Imagine For?

From content creators to marketing teams — Grok Imagine accelerates every video workflow.

Social Media Content

Create vertical 9:16 videos for TikTok, Reels, and Shorts — complete with music and sound effects, ready to post.

Advertising & Promos

Rapid-prototype video ads and promotional clips with cinematic quality. Test concepts before committing to production.

Product Visualization

Animate product images into dynamic showcase videos. Turn static product shots into engaging motion content.

Storytelling & Narrative

Draft short scenes, music video concepts, or story sequences with character consistency and natural dialogue.

Memes & Viral Content

Fun and Spicy modes produce playful, exaggerated, or surreal results — ideal for meme-worthy, shareable video content.

Creative Brainstorming

Visualize ideas before investing in production. Generate quick visual references for pitches, mood boards, and storyboards.

FAQ

Common Questions

Everything you need to know about generating video with Grok Imagine.

Try Grok Imagine Now.

Generate your first video with native audio today. Text or image in — cinematic output in ~30 seconds.

No credit card required for free tier · Cancel anytime
