The spatiotemporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We address these challenges with four novel contributions centered on our proposed method, MotionAura. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked modeling to enhance spatiotemporal video compression. By employing a novel training strategy with full-frame masking, the model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality. Second, we present MotionAura, a text-to-video generation framework that uses vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain via the Fourier transform. This design captures global context and long-range dependencies while reducing computational complexity, maintaining high-quality video generation and denoising. Lastly, we introduce Sketch Guided Video Inpainting, which leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, enabling user-guided inpainting based on sketches. Across various benchmarks, our models match or surpass SOTA performance, offering robust frameworks for spatiotemporal modeling and user-driven video content manipulation.
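To make the full-frame masking idea concrete, the following is a minimal PyTorch sketch of masking entire frames of a video tensor during training: whole frames, rather than patches, are hidden, so reconstruction forces the model to rely on temporal context. The tensor layout, mask ratio, and mask value here are illustrative assumptions, not details taken from the paper.

```python
import torch

def mask_full_frames(video, mask_ratio=0.5, mask_value=0.0):
    """Mask whole frames of a (B, C, T, H, W) video tensor.

    `mask_ratio` and `mask_value` are placeholder choices for
    illustration; the paper's actual schedule may differ.
    """
    b, c, t, h, w = video.shape
    n_mask = int(mask_ratio * t)
    masked = video.clone()
    mask = torch.zeros(b, t, dtype=torch.bool)
    for i in range(b):
        idx = torch.randperm(t)[:n_mask]   # frames to hide for this sample
        masked[i, :, idx] = mask_value     # zero out entire frames
        mask[i, idx] = True                # record positions for the loss
    return masked, mask
```

A reconstruction loss restricted to the positions flagged in `mask` then trains the autoencoder to fill in the missing frames.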
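The discretized latent space that vector-quantized diffusion operates on rests on the standard VQ-VAE codebook lookup. The sketch below shows that core operation; the codebook size, latent dimension, and loss weighting (`num_codes`, `dim`, `beta`) are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup that maps continuous latents
    to discrete token ids, the usual VQ-VAE quantization step."""

    def __init__(self, num_codes=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                        # z: (..., dim) encoder latents
        flat = z.reshape(-1, z.shape[-1])
        d = torch.cdist(flat, self.codebook.weight)  # distances to all codes
        idx = d.argmin(dim=-1)                       # discrete token ids
        zq = self.codebook(idx).view_as(z)
        # standard codebook + commitment losses
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()                   # straight-through gradient
        return zq, idx, loss
```

The discrete ids returned here are what a vector-quantized diffusion model corrupts and denoises in place of continuous latents.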
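For the spectral denoising network, one plausible building block replaces self-attention with token mixing in the frequency domain, trading attention's quadratic cost for an FFT. The block below is a rough FNet-style sketch under that assumption; the paper's actual architecture may differ substantially.

```python
import torch
import torch.nn as nn

class SpectralMixerBlock(nn.Module):
    """Fourier-domain token mixing followed by a feed-forward layer.
    Layer sizes and placement are assumptions, not the paper's design."""

    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, x):                 # x: (B, N, dim) flattened video tokens
        # 2D FFT over the token and channel axes mixes information
        # globally in O(N log N); keep the real part, as in FNet.
        x = x + torch.fft.fft2(self.norm1(x)).real
        x = x + self.mlp(self.norm2(x))   # per-token feed-forward
        return x
```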
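Parameter-efficient fine-tuning with LoRA replaces full weight updates with a trainable low-rank correction to frozen pretrained layers, so only about r * (d_in + d_out) parameters per layer are tuned. The wrapper below illustrates the standard mechanism with placeholder rank and scaling; where it is inserted in the denoiser for sketch guidance is not shown here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update B @ A.
    `rank` and `alpha` are illustrative defaults, not the paper's values."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path plus scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Because `B` starts at zero, fine-tuning begins exactly at the pretrained model's behavior and only gradually departs from it.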