Big Data

Big data is a collection of massive data sets that are generated by and collected from various sources. It can be present in structured, semi-structured and unstructured forms. It helps in identifying trends and past behaviours in any field. It has applications in many fields including banking, education, marketing and business.

However, big data poses challenges such as insufficient storage capacity and processing technology. High computing power is required to process big data and extract meaningful insights from it.

Discussion

  • Which software technologies are used to manage big data?
    Apache Hadoop ecosystem. Source: Bappalige 2014

    In traditional computing systems, organizations had to buy and maintain their own infrastructure. This was costly as well as time consuming. Cloud computing solved this problem by offering customized computing systems at affordable costs.

    Big data requires massive on-demand computation power and distributed storage. Its volume often runs into petabytes, which is too large for traditional computing systems. Cloud computing provides scalable, on-demand, integrated computing resources with the storage and computing capacity required to analyse big data. Users can use these resources and end the session when done. They are charged only for the resources used.

    Apache Hadoop, an open source distributed processing framework, is used to process big data. It uses the MapReduce programming model to process large volumes of data. MapReduce follows a divide-and-conquer approach in which a problem is broken down into many smaller parts. Other tools such as Apache Spark and MongoDB Atlas were gradually introduced after Hadoop. Often an ecosystem of different technologies is used for big data processing.
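
    The divide-and-conquer idea behind MapReduce can be illustrated with a minimal single-machine sketch in plain Python. This is not Hadoop's API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are assumptions made for illustration only. In a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(chunks):
    # Each chunk could be mapped on a different node; here we loop sequentially.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))

# Hypothetical input, already split into two chunks.
chunks = ["big data is big", "data lakes store data"]
counts = word_count(chunks)
print(counts)
```

    The large problem (counting words in all the text) is split into small independent map tasks, and the partial results are combined in the reduce step.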

  • What are various definitions of big data?
    Big data architecture. Source: Microsoft Docs 2023

    According to Edd Dumbill on O’Reilly, big data is data that can't be processed using conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of database architectures. An alternative way must be chosen to process it.

    According to the Microsoft Enterprise Insight Blog, big data is the process of applying high computing power, including the latest in machine learning and artificial intelligence, to massive and highly complex sets of information.

    According to Network World, any amount of data that’s too big to be handled by one computer is big data.

    According to Cory Janssen's post on Techopedia, big data is a process that is used when traditional data mining and handling techniques cannot find insights and meaning from data sets.

  • What are the characteristics of big data?
    6 Vs of big data. Source: Phillips 2021

    Big data has been characterised by as many as 42 V's. Some of the major ones are as follows:

    • Volume: The size of big data often runs into terabytes and petabytes. Every second, more data is stored on the internet than the total data stored on the internet just 20 years ago.
    • Variety: It denotes the type and nature of the data. Technologies such as RDBMS were capable of handling structured data efficiently but were not sufficient to process and analyze semi-structured and unstructured data.
    • Velocity: The speed at which data is generated and processed for analysis. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling and processing.
    • Veracity: It signifies the degree of accuracy and reliability of data. Big data is often accumulated from various sources, so it is important to make sure the data collected is accurate and trustworthy.
    • Value: The useful information retrieved from big data determines its value. Not all datasets collected as big data provide useful information.

  • What are sources of big data?

    Big data is made up of text, image, video and audio files, and is generated by many sources. Major sources include Internet of Things (IoT) devices, self-quantification data, multimedia data and social media data.

    IoT data is generated by GPS devices, mobile phones, intelligent clothing, alarms, smart cars, mobile computing devices, PDAs, window blinds and window sensors.

    Self-quantification data is generated by measuring individuals' behaviour. Examples include data from wristbands that monitor movement and exercise, and from sphygmomanometers that measure blood pressure.

    Multimedia data comprises text, images, audio and video. Each individual connected to the internet generates this data. This type of data grows exponentially.

    Social media data is generated by platforms such as Facebook, Twitter, LinkedIn, YouTube, Instagram and so on. This type of data grows the fastest, since heavy use of social media generates a large amount of data each day.

  • How is big data stored and processed?
    Hadoop's core components. Source: Phillips 2021

    Data lakes are commonly used to store big data. Unlike typical data warehouses, which are commonly built on relational databases and contain structured data only, data lakes can support various data types. They are based on Hadoop clusters, cloud object storage services, NoSQL databases or other big data platforms. Hadoop is an open source distributed framework that manages data processing and storage for big data applications in scalable clusters of computer servers. Consistency, availability and partition tolerance are important considerations for big data storage systems.

    Often big data environments are made up of a combination of multiple systems. For instance, a central data lake might be integrated with other platforms, including relational databases or a data warehouse.

    Technologies such as Hadoop and Spark are used to process big data. The heavy computing power required is provided by clustered systems that distribute processing workloads across hundreds or thousands of servers. Hence, the cloud is often the preferred choice for big data systems.
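
    One common way to distribute a workload across servers is to partition records by a hash of their key, so that each server processes its own share independently. The sketch below simulates this on a single machine. Hash partitioning is an assumption chosen for illustration, as are the names assign_node, partition and process_bucket and the sample records. It does not describe how any particular framework assigns work.

```python
import zlib
from collections import defaultdict

def assign_node(key, num_nodes):
    """Pick a node for a record key using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(records, num_nodes):
    """Split (key, value) records into one bucket per node."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[assign_node(key, num_nodes)].append((key, value))
    return buckets

def process_bucket(bucket):
    """Work done locally on one node: sum the values per key."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)

# Hypothetical records: (user id, amount).
records = [("user-1", 5), ("user-2", 3), ("user-1", 7), ("user-3", 1)]
buckets = partition(records, num_nodes=3)

# Each bucket could be processed on its own server; results are then merged.
merged = {}
for bucket in buckets.values():
    merged.update(process_bucket(bucket))
print(merged)
```

    Because all records with the same key land on the same node, each node can compute its totals without coordinating with the others, and merging the results is a simple union.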

  • What are use cases of big data?

    Companies use big data to understand consumer patterns and improve their operations to provide better customer support. Companies that use big data can make better and faster decisions than companies that don't.

    Big data has many use cases in fields such as finance, healthcare, education, media, IoT and especially business. Below are some applications of big data:

    • Product development: Big companies such as Netflix and Procter & Gamble use big data to determine customer demand and needs. They identify the key attributes of past successful products and services and use this data to build new ones.
    • Predictive maintenance: Analysis of data such as log entries, sensor data, error messages and engine temperature can help in predicting the lifespan of products.
    • Customer experience: Big data allows organizations to collect data from social media, web visits, call logs and other sources to improve the customer experience.
    • Fraud and compliance: Big data helps in finding weaknesses in security systems by identifying similar patterns across fraud cases.
    • Machine learning: Big data allows us to teach machines how to perform tasks instead of just programming them.

  • What are challenges faced with big data?

    One of the major challenges with big data is its volume, which is increasing at a very high speed. According to Oracle, data volumes are doubling in size about every two years. Many current data processing algorithms were designed for limited amounts of data and cannot retrieve the required information in time from big data storage.
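
    To get a feel for what doubling every two years implies, the growth can be written as volume(t) = initial × 2^(t / 2), with t in years. The short sketch below evaluates this formula. The starting volume of 1 petabyte and the function name projected_volume are hypothetical, and the doubling period is taken from the Oracle estimate quoted above.

```python
def projected_volume(initial_volume, years, doubling_period_years=2.0):
    """Volume after `years` if it doubles every `doubling_period_years`."""
    return initial_volume * 2 ** (years / doubling_period_years)

# Hypothetical starting point: 1 petabyte today.
for years in (2, 4, 10):
    print(years, "years ->", projected_volume(1.0, years), "PB")
```

    Under this assumption, a data set grows 32-fold in ten years, which is why algorithms and storage sized for today's data can quickly become inadequate.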

    Data is useful only if it conveys meaning or yields results. Organizing data so that it gives insights takes a lot of work. Data scientists spend 50-80 percent of their time curating and preparing data before it can actually be used.

    Initially, Apache Hadoop was the main technology used to handle big data. Apache Spark was introduced soon after. Combining the Hadoop and Spark frameworks is a better approach for big data processing. Staying up to date with big data technologies is also an overhead.

    NoSQL databases are often used for big data. They have many advantages such as flexibility, open source availability, cost effectiveness and scalability. They also have limitations such as lack of maturity and of consistency in performance.

  • What are some good big data practices?

    Big data can be an expensive burden on the organization if its employees don't know how to make insightful decisions from it. Hence, organizations deploying big data strategy should consider investing in their employees' skills and training.

    With the increase in collection and usage of data, data misuse has also increased. In response, the European Union enacted the General Data Protection Regulation (GDPR). It limits the types of data organizations can collect and requires opt-in consent from individuals for collecting personal data. California has a similar law, the California Consumer Privacy Act (CCPA).

    Organizations should focus on their needs rather than on rapidly evolving big data technology. Businesses should first identify the problems or opportunities that big data can address. After that, a collaborative effort between data scientists and business executives can lead to fruitful insights from data sets.

    A limited number of authorised data sources should be chosen to avoid complexity and loads of useless data.

    Backing up big data is important since data can be lost or corrupted. Backups also help protect data against cyber threats.

  • How does big data analytics work?

    Below is the series of steps involved in big data analytics:

    • Data Collection: Organizations can collect data in numerous ways, from cloud storage to mobile applications to in-store IoT sensors and beyond. The data is stored in data warehouses and data lakes.
    • Data Processing: After collection, data is organized properly for further processing. Batch processing is a data processing option that looks at large data blocks over time. It is useful when there is a longer delay between collection and analysis of data. Stream processing is a data processing option that looks at small batches of data at a time. It shortens the delay between collection and analysis of data, enabling quicker decision-making.
    • Data Cleaning: Regardless of whether data is big or small, it must be cleansed before analysis. Any duplicate or irrelevant data should be removed and data should be formatted correctly.
    • Data Analysis: Advanced analytics processes are then used to turn big data into insights. Some of these big data analysis methods include data mining, predictive analytics and deep learning.
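
    The cleaning step and the difference between batch and stream processing described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how a production engine works. The function names (clean, batch_process, stream_process), the event data and the batch size are all hypothetical. The "analysis" here is simply counting distinct cleaned events.

```python
from itertools import islice

def clean(records):
    """Drop duplicates and empty values, and normalise formatting."""
    seen = set()
    cleaned = []
    for record in records:
        value = record.strip().lower()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned

def batch_process(records):
    """Batch: analyse the whole collected data set in one pass."""
    return len(clean(records))

def stream_process(source, batch_size):
    """Stream: analyse small batches as they arrive, yielding running results."""
    iterator = iter(source)
    seen = set()
    while True:
        micro_batch = list(islice(iterator, batch_size))
        if not micro_batch:
            break
        seen.update(clean(micro_batch))
        yield len(seen)

# Hypothetical raw events with a duplicate, stray whitespace and an empty entry.
events = ["Login", "login ", "purchase", "", "Logout", "purchase"]
print(batch_process(events))
print(list(stream_process(events, batch_size=2)))
```

    The batch version returns a single answer once all data has been collected, while the stream version produces an updated answer after every small batch, which is what shortens the delay between collection and analysis.
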
  • What are the myths about big data?

    As big data has evolved, so have myths about it. Some of them are as follows:

    Big data is costly and only for IT department

    Cloud platforms such as Amazon Web Services, Microsoft Azure and Google Cloud have made big data systems very affordable. No hardware or software purchase or installation is required. Access to the right data can improve the performance of an employee or organization regardless of department.

    Big data can predict the future

    Big data only indicates the likelihood of events based on past data. It does not precisely predict the future, and its predictions can be wrong.

    Big data is all about size

    Volume is an important characteristic of big data but variety and velocity are equally important features. Data should not only be large in size but also come from credible sources.

    Big data will replace existing data warehouses

    Big data fulfils specific requirements but it is not the solution for every data-related issue. It cannot replace the traditional data warehouses or RDBMS.

    Big data is just hype

    Marketers in many industries are already using it to increase revenue and reduce operational costs.

Milestones

1990

In his publication Saving All the Bits, Peter J. Denning describes the need for computing machines that can process massively increasing data and produce statistical summaries of it.

2000

The Sloan Digital Sky Survey (SDSS) begins collecting astronomical data. It gathers more data in its first few weeks than had been collected in the entire history of astronomy, at a rate of about 200 GB per night.

2001

Doug Laney introduces the 3Vs concept of big data. The Vs are volume, variety and velocity.

2005
Apache Spark architecture. Source: Das 2021

Computer scientists Doug Cutting and Mike Cafarella create an open source framework called Apache Hadoop. It is used to store and process large data sets. Apache Spark is an open-source data processing framework introduced in 2009. It can quickly perform processing tasks on very large data sets. It provides the computational speed, scalability and programmability required for Big Data. The primary difference between Spark and Hadoop is that Spark processes and retains data in memory for subsequent steps, whereas Hadoop processes data on disk.

2008

The term big data is introduced to the masses in Wired's article “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. In the same year, Google processes 20 petabytes of data in a single day.

2012

The White House announces a national "Big Data Initiative" in which six federal departments and agencies commit more than $200 million to big data research projects. The U.S. state of Massachusetts announces the Massachusetts Big Data Initiative, which provides funding from the state government and private companies to a variety of research institutions. Harvard Business Review calls data scientist the “Sexiest Job of the 21st Century”.

2014

The British government announces the founding of the Alan Turing Institute to focus on new ways to collect and analyze large data sets.

2016

IBM reports that 2.5 quintillion bytes of data are created every day (that is 2.5 × 10^18 bytes, or 2.5 exabytes).

References

  1. Anderson, Chris. 2008. "The end of theory: the data deluge makes the scientific method obsolete." Wired, June 23. Accessed 2023-01-16.
  2. Apac Business Headlines. 2023. "Myths about big data- people should stop believing." Apac Business Headlines. Accessed 2023-02-21.
  3. BBC News. 2014. "Alan Turing Institute to be set up to research big data." BBC News, March 19. Accessed 2023-01-16.
  4. Bappalige, Sachin P. 2014. "An introduction to Apache Hadoop for big data." Opensource, August 26. Accessed 2023-03-06.
  5. Bhadani, A., and D Jothimani. 2016. "Big data: challenges, opportunities and realities." Arxiv. Accessed 2023-01-26.
  6. Bigelow, Stephen J. 2021. "An introduction to big data in the cloud." Techtarget, July 19. Accessed 2023-02-18.
  7. Bizbl marketing. 2023. "The Top Five Myths of Big Data Analytics." Bizbl marketing. Accessed 2023-03-06.
  8. Botelho, Bridget. 2022. "Big data and management: From the Editors." Techtarget. Accessed 2023-01-26.
  9. Brainhub. 2023. "Big Data: 10 myths debunked." Brainhub. Accessed 2023-02-20.
  10. Brown, Eric D. 2023. "4 Big data myths, busted." The Enterprisers Project. Accessed 2023-02-21.
  11. Butler, Brandon. 2012. "Defining 'big data' depends on who's doing the defining." Network world, May 10. Accessed 2023-02-14.
  12. Das, Aveek. 2021. "Introduction to Apache Spark." SqlShack, April 12. Accessed 2023-03-06.
  13. Farmer, Donald. 2021. "6 Essential big data best practices for businesses." Techtarget, May 07. Accessed 2023-02-17.
  14. George, Gerard, Martine R Haas, and Alex Pentland. 2014. "Big data and management: From the Editors." Singapore Management University. Accessed 2023-01-26.
  15. Google Cloud. 2023. "What is a data lake?" Google Cloud. Accessed 2023-02-13.
  16. IBM. 2021. "The respective architectures of Hadoop and Spark, how these big data frameworks compare in multiple contexts and scenarios that fit best with each solution." IBM, May 27. Accessed 2023-03-06.
  17. IBM. 2023. "What is Apache Spark?" IBM. Accessed 2023-02-21.
  18. James, Edd Wilder. 2012. "Big data." O'Reilly, January 11. Accessed 2023-02-13.
  19. Knapton, Ken. 2022. "Four best practices for big data governance." Forbes, June 16. Accessed 2023-02-17.
  20. Malak, Haissam abdul. 2023. "9 Tested Big Data Best Practices to Apply." The ECM consultant, January 27. Accessed 2023-03-06.
  21. Marr, Bernard. 2017. "5 Massive 'Big Data' myths most people believe - but shouldn't." Forbes, September 26. Accessed 2023-02-20.
  22. Microsoft Docs. 2023. "Big data architectures." Microsoft. Accessed 2023-01-29.
  23. Microsoft Enterprise Insight Blog. 2013. "The big bang: how the big data explosion is changing the world." Microsoft News Center, February 11. Accessed 2023-02-14.
  24. Muniswamaiah, Manoj, Tilak Agerwala, and Charles Tappert. 2019. "Big data in cloud computing review and opportunities." International Journal of Computer Science & Information Technology, August 04. Accessed 2023-02-18.
  25. Office of Science and Technology Policy. 2012. "Obama administration unveils “big data” initiative: announces $200 million in new R&D investments." Obama white house, March 29. Accessed 2023-01-21.
  26. Oracle. 2023. "What is big data?" Oracle. Updated 2022-12-16. Accessed 2023-01-17.
  27. Packt. 2023. "Sources of big data." Packt. Accessed 2023-01-27.
  28. Pal, Kaushik. 2016. "10 Big Myths About Big Data." Techopedia, July 14. Accessed 2023-03-06.
  29. Phillips, Andres. 2021. "A history and timeline of big data." Techtarget, April 01. Accessed 2023-01-29.
  30. Press, Gil. 2013. "A very short history of big data." Forbes, May 09. Updated 2013-12-21. Accessed 2023-01-29.
  31. Project pro. 2023. "Big data timeline- series of big data evolution." Project pro. Updated 2023-01-16. Accessed 2023-01-29.
  32. Sagiroglu, Seref, and Duygu Sinanc. 2023. "Big data: a review." Department of Computer Engineering, Gazi University. Updated 2022-12-16. Accessed 2023-01-17.
  33. Schroer, Alyssa. 2022. "Big data." Builtin. Updated 2022-12-16. Accessed 2023-01-17.
  34. Segal, Troy. 2022. "What is big data? Definition, How It Works, and Uses." Investopedia. Updated 2022-11-29. Accessed 2023-01-27.
  35. Shafer, Tom. 2017. "The 42 V’s of big data and Data Science." KD Nuggets, April. Accessed 2023-02-13.
  36. Shah, Shvetank, Andrew Horne, and Jaime Capellá. 2012. "Good data won’t guarantee good decisions." Harvard Business Review, April. Updated 2022-11-29. Accessed 2023-01-27.
  37. Sharma, Gaurav. 2023. "Big data & cloud computing: the roles & relationships." IEEE Computer Society. Accessed 2023-02-18.
  38. Tableau. 2023. "Big data analytics: what it is, how it works, benefits, and challenges." Tableau. Accessed 2023-01-27.
  39. Techopedia. 2019. "Big data." Techopedia. Updated 2019-02-25. Accessed 2023-02-13.
  40. The Economist. 2010. "Data, data everywhere." The Economist, February 27. Accessed 2023-01-16.
  41. Wigmore, Ivy. 2020. "3Vs (volume, variety and velocity)" Techtarget, December. Accessed 2023-02-18.
  42. Wikipedia. 2023. "Big data." Wikipedia. Updated 2023-01-26. Accessed 2023-01-26.
  43. Yaqoob, Ibrar, Ibrahim Abaker Targio Hashem, Abdullah Gani, Salimah Mokhtar, Ejaz Ahmed, Nor Badrul Anuar, and Athanasios V. Vasilakos. 2016. "Big data: From beginning to future." International Journal of Information Management, June 29. Updated 2016-09-16. Accessed 2023-01-26.
  44. Young, Shannon. 2012. "Mass. governor, MIT announce big data initiative." Boston, May 30. Accessed 2023-01-16.

Further Reading

  1. Sagiroglu, Seref, and Duygu Sinanc. 2023. "Big Data: A Review." Department of Computer Engineering, Gazi University. Updated 2022-12-16. Accessed 2023-01-17.
  2. Wikipedia. 2023. "Big data." Wikipedia. Updated 2023-01-26. Accessed 2023-01-26.
  3. Botelho, Bridget. 2022. "Big data and management: From the Editors." Techtarget. Accessed 2023-01-26.

Cite As

Devopedia. 2023. "Big Data." Version 7, March 7. Accessed 2024-06-26. https://devopedia.org/big-data
Contributed by
2 authors


Last updated on
2023-03-07 04:22:42