Big Data Analytics: Concepts and Techniques

Authors: Kapil Oberai and Rohith Rajan

Data is a collection of facts, numbers, words, and pictures. Data is everywhere, from social media posts to the steps counted by a smartwatch. Information is what we get when we organize and interpret that data. Why are we talking about “big” data now? The reason is that there is a whole lot more of it, and it is being produced in enormous amounts through phones, sensors, apps, and daily digital activities. Every minute, people watch videos, send messages, search online, and upload photos (fig. 1). We now create more data in a day than we used to create in years, and that is before counting other sources such as satellites, sensors, and GPS devices.

Figure 1: Data never sleeps (Source: https://www.domo.com/learn/infographic/data-never-sleeps-12)

Big data is defined as a vast collection of varied datasets that is hard to process with traditional data processing platforms and approaches (Chen et al., 2014).

The 5 Vs of Big Data: When we talk of Big Data, the first thing that comes to mind is “size”, i.e. the sheer volume of data. However, size is relative and varies with people, situation, and time frame. Hence, big data is primarily defined using five key characteristics, all starting with the letter “V”. These characteristics are known as the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.

Volume: The huge amount of data generated (Example: Twitter (X) and Instagram posts and photos).
Velocity: The speed at which data is generated and must be processed (Example: Google Maps traffic updates).
Variety: The different types of data: structured, semi-structured, or unstructured (Example: text, GPS, video, satellite imagery).
Veracity: The trustworthiness and accuracy of data (Example: whether a location tag on a post can be relied on).
Value: The useful insights that can be drawn from data (Example: Netflix movie recommendations).

Big Data Analytics

Big data analytics examines massive and diverse datasets (structured, semi-structured, and unstructured) to identify underlying patterns and useful insights. It is used in various sectors such as retail, banking, education, and healthcare. Popular examples include e-commerce sites suggesting products by studying shopping patterns, Netflix recommending movies by analysing viewing habits, the healthcare sector detecting diseases early and predicting outbreaks, and Google Maps using millions of live location signals to show real-time traffic. Big data analytics involves several techniques, including data mining, machine learning, predictive modelling, and statistics, applied to the challenges that such large datasets present.

Big data analytics (fig. 2) is classified into the following four categories (Shahnawaz & Kumar, 2025):

Descriptive Analytics: Shows what happened (e.g. mapping which areas were flooded using satellite images).
Diagnostic Analytics: Explains why it happened (e.g. analysing rainfall and other factors causing the flood).
Predictive Analytics: Predicts what may happen (e.g. forecasting which areas may flood next using weather and river data).
Prescriptive Analytics: Suggests what to do (e.g. recommending safe evacuation routes and shelters based on flood predictions).
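
The four categories can be illustrated with a minimal sketch on the flood example. Everything here is an assumption made for illustration: the records, the area names, the rainfall figures, and the simple threshold rule are hypothetical, and a real analysis would use far richer data and models.

```python
# Hypothetical records: (area, rainfall in mm, whether the area flooded).
records = [
    ("North", 220, True),
    ("South", 80, False),
    ("East", 190, True),
    ("West", 60, False),
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Descriptive: what happened? List the areas that flooded.
flooded_areas = [area for area, _, flooded in records if flooded]

# Diagnostic: why did it happen? Compare rainfall in flooded and dry areas.
rain_flooded = mean([rain for _, rain, flooded in records if flooded])
rain_dry = mean([rain for _, rain, flooded in records if not flooded])

# Predictive: what may happen? Use the midpoint between the two averages
# as a naive flood threshold (an assumption, not a validated model).
threshold = (rain_flooded + rain_dry) / 2
forecast = {"Central": 150, "Coast": 90}  # hypothetical forecast rainfall in mm
at_risk = [area for area, rain in forecast.items() if rain > threshold]

# Prescriptive: what should be done? Map each prediction to an action.
actions = {area: ("evacuate" if area in at_risk else "monitor") for area in forecast}

print(flooded_areas, threshold, at_risk, actions)
```

Each step builds on the previous one, which is why the four categories are often described as increasing levels of analytical maturity.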

The process of big data analytics generally follows a fixed flow, much like a factory production line, converting raw data into useful knowledge. Figure 3 shows this process of extracting insights from big data, which consists of a data management phase and an analytics phase (Gandomi & Haider, 2015).

Figure 3: Big data processes for extracting insight

Big Data Analytics Tools: Hadoop and Spark

Just as you wouldn’t use a bicycle to transport cargo containers, you can’t rely on traditional software to manage the massive flow of big data, which requires robust and specialized tools. Two popular tools of this kind are Hadoop and Spark.

Hadoop: It lets us store enormous amounts of data and process it by breaking the task into many small pieces and working on them all at once across many computers. It works on the principle of divide and conquer. Its two major components are HDFS (storage) and MapReduce (processing). HDFS (Hadoop Distributed File System) is the underlying distributed file system that divides large files among many machines (nodes) for high-throughput and fault-tolerant data access. MapReduce works in two steps, Map and Reduce. In the Map step, the large task is split into smaller parts that run on different nodes in parallel, each producing intermediate results. The Reduce step collects these outputs and combines them to produce the final result (Furht & Villanustre, 2016).
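
The Map and Reduce steps can be sketched with the classic word-count example. This is a minimal single-machine illustration in plain Python, not Hadoop's actual API: the function names and the sample text chunks are assumptions, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key so each key is reduced in one place."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of text chunks."""
    mapped = []
    for chunk in chunks:  # on a cluster, chunks are mapped in parallel
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into chunks as HDFS would split a file.
chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

The grouping step between Map and Reduce is what allows every occurrence of a word, regardless of which chunk it came from, to be summed in a single place.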

Spark: It is an open-source, distributed computing framework designed to process huge datasets much faster than traditional systems. Unlike Hadoop MapReduce, which writes intermediate results to disk, Spark is an in-memory computing framework: it keeps working data in memory (RAM) wherever possible instead of rereading it from disk, which makes repeated computations much quicker. Spark breaks large tasks into small chunks and sends them to many computers that work in parallel. It can handle streaming, machine learning, and graph analysis in one unified system. It also supports multiple languages such as Python, SQL, and R.
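
The benefit of in-memory computing can be illustrated with a toy analogy in plain Python. This is not Spark's API: the `DiskDataset` class and both functions are hypothetical, and the read counter merely stands in for expensive disk access. The sketch shows why a job that makes several passes over the same data gains from loading it once and reusing it.

```python
class DiskDataset:
    """Stands in for data stored on disk and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        """Simulate one full (slow) read of the dataset from disk."""
        self.reads += 1
        return list(self._values)


def run_passes_from_disk(dataset):
    """Disk-based style: every pass rereads the data."""
    return {
        "total": sum(dataset.read()),
        "smallest": min(dataset.read()),
        "largest": max(dataset.read()),
    }


def run_passes_in_memory(dataset):
    """In-memory style: read once, keep the data in RAM, reuse it."""
    cached = dataset.read()
    return {
        "total": sum(cached),
        "smallest": min(cached),
        "largest": max(cached),
    }


disk_style = DiskDataset([4, 8, 15, 16, 23, 42])
memory_style = DiskDataset([4, 8, 15, 16, 23, 42])

disk_result = run_passes_from_disk(disk_style)
memory_result = run_passes_in_memory(memory_style)

print(disk_style.reads, memory_style.reads)
```

Both styles produce the same answers, but the in-memory version touches the disk once rather than once per pass, and the gap widens for iterative workloads such as machine learning.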

Spatial extensions are available for both platforms, including SpatialHadoop and Hadoop-GIS for Hadoop, and Apache Sedona and GeoTrellis for Spark. Other tools include NoSQL systems with spatial extensions. In addition, cloud-based platforms such as Google Earth Engine and Microsoft Planetary Computer are also used for large-scale spatial data analytics.

Big Data Analytics Use Case: Geo-social Media Data Analytics for Tourism

The study builds a geospatial framework for analysing tourism using geo-social media data from Twitter (X), TripAdvisor, and Flickr. It uses sentiment analysis to understand visitor experiences and displays the results on an interactive dashboard (fig. 4). Such an application provides a fuller picture of tourist behaviour than traditional survey-based methods.
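
The idea of scoring posts and summarising them by place can be sketched as follows. This is only an illustration and not the method used in the study: the word lists, the sample posts, the place names, and the simple word-counting score are all assumptions, and practical sentiment analysis typically relies on trained language models.

```python
from collections import defaultdict

# Hypothetical sentiment word lists, chosen only for this example.
POSITIVE = {"beautiful", "great", "amazing", "clean"}
NEGATIVE = {"crowded", "dirty", "expensive", "noisy"}


def sentiment_score(text):
    """Score a post as (positive word count) minus (negative word count)."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def average_sentiment_by_place(posts):
    """Average the sentiment scores of all posts tagged with each place."""
    scores = defaultdict(list)
    for post in posts:
        scores[post["place"]].append(sentiment_score(post["text"]))
    return {place: sum(values) / len(values) for place, values in scores.items()}


# Hypothetical geo-tagged posts.
posts = [
    {"place": "Lake", "text": "Beautiful view and clean water!"},
    {"place": "Lake", "text": "Great sunrise, but crowded."},
    {"place": "Market", "text": "Noisy, dirty and expensive."},
]

summary = average_sentiment_by_place(posts)
print(summary)
```

A per-place summary of this kind is the sort of value a dashboard could display on a map, with each location coloured by its average sentiment.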

Figure 4: Geo-social media data analytics for tourism

Conclusion

Big Data Analytics has become essential for making sense of the massive volumes of data generated every second, from social media and e-commerce to sensors, satellites, and smartphones. It enhances geoinformatics by enabling efficient handling of massive spatial datasets from satellites, drones, IoT sensors, and GIS databases. By uncovering complex spatial patterns and trends, it supports accurate, real-time, and data-driven geospatial decision-making for applications such as disaster management, environmental monitoring, and urban planning.

However, big data analytics also faces several challenges, including managing enormous storage and processing demands, maintaining data privacy and security, ensuring data quality, and integrating diverse formats from multiple sources.

References

Chen, M., Mao, S., Zhang, Y., & Leung, V. C. M. (2014). Related Technologies. In M. Chen, S. Mao, Y. Zhang, & V. C. M. Leung (Eds.), Big Data: Related Technologies, Challenges and Future Prospects (pp. 11–18). Springer International Publishing. https://doi.org/10.1007/978-3-319-06245-7_2

Furht, B., & Villanustre, F. (2016). Introduction to Big Data. In B. Furht & F. Villanustre (Eds.), Big Data Technologies and Applications (pp. 3–11). Springer International Publishing. https://doi.org/10.1007/978-3-319-44550-2_1

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35, 137–144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007

Shahnawaz, M., & Kumar, M. (2025). A Comprehensive Survey on Big Data Analytics: Characteristics, Tools and Techniques. ACM Computing Surveys, 57(8). https://doi.org/10.1145/3718364
