International Journal on Science and Technology

E-ISSN: 2229-7677     Impact Factor: 9.88

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

Batch Loading Data to Google BigQuery using Google Data Fusion

Author(s): Suhas Hanumanthaiah
Country: United States
Abstract: Efficient data ingestion into cloud data warehouses is critical for timely analytics and informed decision-making in modern enterprises. This paper presents a practical, scalable approach to batch loading large volumes of structured and semi-structured data into Google BigQuery using Google Cloud Data Fusion (CDF). BigQuery, a serverless and highly scalable analytical database, excels at processing petabyte-scale datasets but depends on efficient upstream data integration to reach its full potential. Cloud Data Fusion, built on the Cask Data Application Platform (CDAP), offers a visual, code-free interface for designing, managing, and executing ETL pipelines. The paper describes how CDF integrates with other GCP services, such as Dataproc, to orchestrate resource-optimized data pipelines. Through a defined architectural framework and deployment model, it demonstrates how CDF can be used to build modular, reusable, auto-scaling batch data workflows that deliver operational cost savings and performance benefits. Best practices, including namespace segregation, transformation pushdown, autoscaling clusters, and failure alerting, are presented to improve pipeline efficiency and governance. The study also identifies limitations in CDF's real-time data ingestion capabilities and proposes future work to evaluate its streaming performance using Pub/Sub and Spark Streaming. Overall, the approach provides a robust, cost-effective foundation for enterprise-grade data integration on Google Cloud, with potential for hybrid batch-streaming models in future research.
Keywords: Cloud Data Fusion (CDF), Google BigQuery, Batch Data Processing, Google Cloud Platform (GCP)
Field: Engineering
Published In: Volume 15, Issue 1, January-March 2024
Published On: 2024-03-06
DOI: https://doi.org/10.71097/IJSAT.v15.i1.7552
Short DOI: https://doi.org/g9v45g
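
The batch workflow summarized in the abstract (a file source feeding a BigQuery sink through a visually designed pipeline) can be illustrated with a minimal sketch of the kind of JSON specification a CDAP-based pipeline is exported as. This is not taken from the paper: the helper names, bucket path, dataset, table, and artifact version are hypothetical, and the plugin and property names are assumptions that should be checked against a pipeline exported from an actual Cloud Data Fusion instance.

```python
import json


def build_batch_pipeline(name, gcs_path, dataset, table, artifact_version="6.9.0"):
    """Assemble a CDAP-style batch pipeline spec as a plain dict.

    Hypothetical helper. The artifact version and the plugin/property names
    are assumptions; compare them with an export from your own instance.
    """
    stages = [
        {
            "name": "GCSSource",
            "plugin": {
                "name": "GCSFile",  # assumed source plugin name
                "type": "batchsource",
                "properties": {
                    "referenceName": "gcs_input",
                    "path": gcs_path,
                    "format": "csv",
                },
            },
        },
        {
            "name": "BigQuerySink",
            "plugin": {
                "name": "BigQueryTable",  # assumed sink plugin name
                "type": "batchsink",
                "properties": {
                    "referenceName": "bq_output",
                    "dataset": dataset,
                    "table": table,
                },
            },
        },
    ]
    return {
        "name": name,
        "artifact": {
            "name": "cdap-data-pipeline",
            "version": artifact_version,  # assumed version
            "scope": "SYSTEM",
        },
        "config": {
            "stages": stages,
            # Each connection links one stage's output to the next stage's input.
            "connections": [{"from": "GCSSource", "to": "BigQuerySink"}],
        },
    }


def dangling_connections(spec):
    """Return the connections that reference a stage name not defined in the spec."""
    names = {stage["name"] for stage in spec["config"]["stages"]}
    return [
        conn
        for conn in spec["config"]["connections"]
        if conn["from"] not in names or conn["to"] not in names
    ]


# Placeholder values for illustration only.
pipeline = build_batch_pipeline(
    name="gcs_to_bigquery_batch",
    gcs_path="gs://example-bucket/sales/*.csv",
    dataset="analytics",
    table="sales_raw",
)
print(json.dumps(pipeline, indent=2))
```

Keeping the specification as data makes it straightforward to version-control pipelines and to run simple structural checks, such as `dangling_connections(pipeline)`, before deploying them to a namespace.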
