Structured vs Unstructured Data

Author: arvindpdmn
Created on 2019-10-20 04:13:29. Last updated on 2019-10-21 11:03:39.

Summary

Examples of structured, semi-structured and unstructured data. Source: Jones 2018, fig. 2.

Data is available in many forms, shapes and formats. Broadly, data can be either structured or unstructured. Data that's properly organized, with well-defined constraints and relationships among its different parts, can be considered as structured.

There's no precise definition of structured data. In the hazy boundary between structured and unstructured data, some have identified semi-structured data. Others have argued that all data has some structure: it's just that some data is more difficult to store or analyze.

The general viewpoint is that all data can add value to data mining and analytics. Technologies that evolved for structured data have been adapted, and new ones invented, to handle unstructured data as well. Unstructured data has become increasingly important due to its volume, velocity, variety and value.

Milestones

1960

In the 1960s, businesses start using computers. Storing and managing data becomes important to them. Because memory is expensive, there's a need to store data efficiently. For these reasons, databases are invented. IBM's IMS is an example. These early databases store only structured data.

1970

The 1970s see the arrival of relational databases. In 1974, IBM invents SQL as a language to query such databases.

Sep 1991

Andy Rehn, VP of marketing for Data Base Architects, states that "as much as 90 percent of the information business uses is non-numerical, freeform data." This is one of the earliest published reports that quantifies the prevalence of unstructured data.

1995
Challenges of extracting information from a scanned PDF document. Source: Lawtomated 2019.

The early and mid-1990s is when text mining starts entering real-world applications. Document-management systems emerge, later rebranded as Enterprise Content Management (ECM) systems. The World Wide Web also starts generating lots of unstructured data. Business Intelligence (BI) that had grown up on structured data starts mining text for useful insights.

1998

A rule of thumb is that 80% of all data is unstructured or semi-structured at best. This 80% figure is mentioned in a Merrill Lynch report, but it isn't based on primary research.

2007

From a survey involving data warehousing and business intelligence search, TDWI Research finds that the structured/unstructured ratio is not 20/80 as often claimed. Structured data is at 47%, unstructured data at 31%, and the rest is semi-structured. However, the report recognizes that unstructured data is on the rise.

Mar 2009

OASIS approves Unstructured Information Management Architecture (UIMA), version 1.0. Apache UIMA is an open source implementation of this standard. V1.0.0 of this software is released in January 2014. In August 2019, V3.1.0 is released.

Mar 2019

Analysis of unstructured data is still new in some domains. A case in point is the healthcare industry where doctors' handwritten notes, images, histories, formulae and genetics have not been fully analyzed.

Discussion

  • What do you mean by "structure" with respect to data?
    Structured data is organized as tables, rows, columns and relations. Source: Pickell 2018.

    Data that's highly organized, such as in a database, can be considered structured data. Such databases often use a Relational Database Management System (RDBMS). The data has a schema that defines attributes and their types, constraints on values, and relations with other data tables and attributes. Data must conform to this schema. This makes data easy to query using Structured Query Language (SQL) and thereby facilitates analysis. Data in spreadsheets may also be considered structured.

    Unstructured data is not organized in this manner, that is, in an RDBMS. This doesn't mean it has no structure at all. An audio stream has a well-defined format; otherwise, it couldn't be decoded and played. A linguist would say that newspaper text has structure. But from the perspective of business analysts, such data is unstructured simply because it's harder to analyze and obtain insights from.

    Some say the term unstructured data is a misnomer. If it's truly lacking structure, it's useless to store it or analyze it. It would be better to categorize data as fixed or variable structure; repetitive or hierarchical; textual or non-textual.
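To make the idea of a schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns and the sample rows are hypothetical and chosen only for illustration. The sketch shows a schema with types, constraints and a relation, a SQL query over it, and the database rejecting rows that don't conform.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a small, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customers (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount      REAL NOT NULL CHECK (amount > 0)
        );
    """)
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Asha')")
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 20.0), (1, 5.5)],
    )
    return conn

def total_per_customer(conn):
    """Aggregate order amounts per customer with a plain SQL query."""
    rows = conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
    """)
    return dict(rows.fetchall())

def try_insert_order(conn, customer_id, amount):
    """Return True if the row conforms to the schema, False if it is rejected."""
    try:
        conn.execute(
            "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
            (customer_id, amount),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling `try_insert_order(conn, 1, -3.0)` or `try_insert_order(conn, 99, 10.0)` returns False in this sketch, because the first violates the CHECK constraint and the second refers to a customer that doesn't exist. This enforced conformance is what free text or an image lacks.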

  • Could you give examples of unstructured and semi-structured data?
    Once analyzed, an 'unstructured' image can reveal useful insights. Source: Johnson 2019.

    Rich media such as images, video or audio are unstructured. Social media generates lots of unstructured content. Websites host unstructured content that is commonly textual in nature. Information in documents such as MS Word files or PDF files is also seen as unstructured. Machine-generated content such as satellite images, IoT sensor data, or CCTV video feeds is considered unstructured.

    Many of the same data sources mentioned above could have attributes that make them semi-structured. For instance, Tweets, Facebook posts, blog articles, and news stories published online often carry the number of likes, retweets/shares, and comments, including the names of readers who made them. Email text may be unstructured but the header contains names of sender/receiver, date and subject that give some structure.

    IoT data may be seen as semi-structured when it's in JSON or XML formats. Data about data, called metadata, such as author name and publication date, makes data semi-structured. In fact, rich semantic markup on webpages gives them a lot more structure than HTML alone does. Most unstructured data can be considered semi-structured because of metadata.
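The email and IoT examples can be sketched with Python's standard library. The message text, addresses, and the `sensor_id` and `temperature_c` field names below are made up for illustration. The point is only that headers and JSON keys can be read as named fields, while the email body stays free text.

```python
import json
from email import message_from_string

# A hypothetical message: structured header fields, then an unstructured body.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Pump 3 vibration
Date: Mon, 21 Oct 2019 11:03:39 +0000

The pump was noisier than usual this morning. Please take a look.
"""

def split_email(raw):
    """Separate the structured header fields from the free-text body."""
    msg = message_from_string(raw)
    headers = {key.lower(): value for key, value in msg.items()}
    return headers, msg.get_payload().strip()

def read_sensor(raw_json):
    """Parse a JSON reading; fields are self-describing but not guaranteed."""
    reading = json.loads(raw_json)
    # .get() tolerates missing keys, since no schema forces every field to exist.
    return reading.get("sensor_id"), reading.get("temperature_c")
```

The header dictionary can be filtered or sorted like a table row. Getting anything out of the body would need text analysis.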

  • What are some use cases of analysis on unstructured data?
    Keyword trend analysis based on logs and reports. Source: Min 2017, fig. 4.

    In a manufacturing plant, operation logs and reports are unstructured. Text analysis to pick out frequent words or sentiments can help the plant manager make a maintenance plan or quickly understand the nature of a particular line.

    Koorong Books in Australia uses text analysis to identify duplicate postings or suggest similar books. This is an example of content-based profiling.

    In banking, customers often give feedback or complaints on social media rather than via web forms. If banks wish to be customer centric, they need to act on this unstructured data. In fact, this could apply to any industry that needs to listen to its customers. Companies can use chatbots with NLP capability to automate customer support functions.

    Deep learning techniques are being used to analyze images and sounds. Images can be automatically labelled. Mammograms can be analyzed for cancer. The sound of a motor can inform in advance if it's going to fail. This is of importance in automobile and aviation sectors.
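The plant-log use case, picking out frequent words from free-text reports, can be approximated in a few lines. This is a simplified sketch, not the method used in the cited source. The log entries and the short stop-word list are hypothetical, and real text analysis would add stemming, phrase detection or sentiment scoring.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real projects use a much fuller one.
STOP_WORDS = {"the", "a", "an", "on", "in", "of", "and", "was", "is", "to", "at", "again"}

# Hypothetical free-text entries from an operations log.
LOG_ENTRIES = [
    "Valve leak detected on line 2",
    "Operator replaced the valve seal",
    "Leak on line 2 again, valve inspected",
]

def top_keywords(entries, n=3):
    """Count the most frequent non-stop-words across free-text log entries."""
    counts = Counter()
    for entry in entries:
        # Lowercase and keep alphabetic tokens only; digits are dropped.
        words = re.findall(r"[a-z]+", entry.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)
```

Running `top_keywords(LOG_ENTRIES)` on these entries ranks "valve" first. A plant manager could read that as a hint about where maintenance effort is going.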

  • What are some myths about unstructured data?

    An early myth was that unstructured data can't be quantitatively analyzed. This was perhaps true in the past, but with recent advances in computer vision, speech processing, and natural language processing, algorithms are now able to solve many complex problems in these domains.

    Some might believe that unstructured data replaces structured data. In reality, many companies have not fully exploited the potential of structured data. Good old predictive models and analytical capabilities should continue to be used. Unstructured data will give access to new insights not otherwise available in structured data. In conclusion, both structured and unstructured data are valuable.

    Another myth says that all big data is unstructured data. Telecom companies, smart energy meters, smartphones, and cars fitted with sensors are all generating big data that's structured.

    Some businesses just store data with a vague idea of using it later. In fact, data and insights depreciate over time. Real-time dashboards are probably the best opportunity to act on data in a timely manner. When collecting or acting on data, tie it to a business vision.

  • How should I store unstructured data?

    Big Data technologies such as Hadoop and NoSQL databases have come about to address the needs of storing and managing unstructured data. Data warehouses and data lakes are places where big data is stored. Unstructured data is often used alongside structured data. Many technologies cater for both.

    Hadoop enables distributed storage and computing on big data. It's a good engine for handling unstructured data, though it's a myth to think that unstructured data can't be stored or analyzed without Hadoop.

    NoSQL databases are highly scalable and support flexible schema. Their storage is distributed. They're non-relational. They're good at storing multimedia, social media or textual data. Among the different types are document stores, column stores, key-value stores and graph data stores. A mix of these is often used, each suited to a particular kind of data. This approach is called polyglot persistence.

    As the cost of flash memory drops, flash becomes a faster alternative to disk storage. However, file services such as data protection, backup, and search still need to be in place.

    For a minimalistic storage system, try the open source MinIO. Alternatives include Ceph, Scality and Cleversafe.
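The "flexible schema" of a document store can be illustrated without any database at all. The sketch below is only a toy model in plain Python, not the API of any real NoSQL product. The documents, field names and the `find` helper are all hypothetical. It shows records of different shapes living in one collection and being queried by whichever fields they happen to have.

```python
# Hypothetical records: each document carries only the fields it needs.
DOCUMENTS = [
    {"_id": 1, "type": "tweet", "text": "Great service today!", "likes": 12},
    {"_id": 2, "type": "email", "subject": "Complaint", "body": "My card was declined."},
    {"_id": 3, "type": "image", "path": "cctv/frame_0001.jpg", "labels": ["car", "person"]},
]

def find(documents, **criteria):
    """Return documents whose fields match all criteria; absent fields never match."""
    missing = object()  # sentinel, so a stored None is not confused with "no field"
    return [
        doc for doc in documents
        if all(doc.get(field, missing) == value for field, value in criteria.items())
    ]

def fields_in_use(documents):
    """List every field name seen across the collection, in sorted order."""
    return sorted({field for doc in documents for field in doc})
```

In a relational table these three records would need either many nullable columns or several joined tables. Here a new kind of record can be added without altering the existing ones, at the cost of the database no longer enforcing conformance.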

  • What are some techniques to analyze unstructured data?

    The general perception is that unstructured data is hard to analyze. This is changing due to advances in machine learning models.

    The simplest way to get started is perhaps to call APIs that others have published. For example, cognitive APIs from IBM such as Watson Tradeoff Analytics can help in decision making. Geneea is an NLP API. For speech recognition, we can use AT&T Speech API. Google's Cloud Vision API is useful for many image-specific tasks.

References

  1. Apache UIMA. 2019. "Homepage." Accessed 2019-10-20.
  2. Boe, Benjamin De. 2014. "Use Cases for Unstructured Data." Whitepaper, InterSystems Corporation. Accessed 2019-10-20.
  3. Datar, Shweta. 2019. "10 Machine Learning APIs You Should Learn." DZone, January 31. Accessed 2019-10-20.
  4. Foote, Keith D. 2017. "A Brief History of Database Management." DATAVERSITY, March 23. Accessed 2019-10-20.
  5. Goodwins, Rupert. 2019. "AI weaves value from unstructured data." Raconteur, March 27. Accessed 2019-10-20.
  6. Grimes, Seth. 2008. "Unstructured Data and the 80 Percent Rule." Breakthrough Analysis, August 01. Accessed 2019-10-20.
  7. Grishchenko, Alexey. 2015. "The Myth of “Unstructured Data”." Distributed Systems Architecture, July 28. Accessed 2019-10-20.
  8. Halper, Fern. 2018. "3 Use Cases for Unstructured Data." TDWI, September 25. Accessed 2019-10-20.
  9. Johnson, Matthew G. 2019. "The Myth of Unstructured Data." DataSeries, via Medium, February 10. Accessed 2019-10-20.
  10. Jones, M. Tim. 2018. "Data, structure, and the data science pipeline." IBM Developer, February 01. Accessed 2019-10-20.
  11. Lawtomated. 2019. "Structured Data vs. Unstructured Data: what are they and why care?" Lawtomated, April 07. Accessed 2019-10-20.
  12. Min, Zhi. 2017. "Helping Users Find Value in Unstructured Plant Data." Blog, Yokogawa, June 20. Accessed 2019-10-20.
  13. Pao, Steve. 2017. "3 challenges of high-performance storage in the age of unstructured data explosion." InfoWorld, November 21. Accessed 2019-10-20.
  14. Pickell, David. 2018. "Structured vs Unstructured Data – What's the Difference?" G2.com, November 16. Accessed 2019-10-20.
  15. Robb, Drew. 2017. "Semi-Structured Data." Datamation, July 03. Accessed 2019-10-20.
  16. Schneider, Christie. 2016. "The biggest data challenges that you might not even know you have." Blog, IBM, May 25. Accessed 2019-10-20.
  17. Selvam. 2016. "Note to banks: say “hello” to big data or “goodbye” to your customers." Blog, Crayon, June 09. Accessed 2019-10-20.
  18. Swoyer, Stephen. 2007. "Unstructured Data: Attacking a Myth." TDWI, September 05. Accessed 2019-10-20.
  19. Truxillo, Catherine. 2013. "Five myths about unstructured data and five good reasons you should be analyzing it." Blog, SAS, July 08. Accessed 2019-10-20.
  20. Woodie, Alex. 2017. "Solving Storage Just the Beginning for Minio CEO Periasamy." Datanami, February 01. Accessed 2019-10-20.
  21. Yates, Scott. 2018. "Five Business Intelligence Myths, Debunked." Experfy, February 28. Accessed 2019-10-20.
  22. van der Lans, Rick. 2015. "Big Data Myth 4: Big Data is Unstructured Data." TechTarget, October 12. Accessed 2019-10-20.
  23. van der Lans, Rick. 2016. "Unstructured data is a misnomer." TechTarget, March 25. Accessed 2019-10-20.

See Also

  • Big Data
  • NoSQL Databases
  • Types of Databases
  • Data Analytics
  • Text Mining
  • Apache UIMA

Further Reading

  1. Johnson, Matthew G. 2019. "The Myth of Unstructured Data." DataSeries, via Medium, February 10. Accessed 2019-10-20.
  2. Boe, Benjamin De. 2014. "Use Cases for Unstructured Data." Whitepaper, InterSystems Corporation. Accessed 2019-10-20.
  3. Grimes, Seth. 2008. "Unstructured Data and the 80 Percent Rule." Breakthrough Analysis, August 01. Accessed 2019-10-20.

Cite As

Devopedia. 2019. "Structured vs Unstructured Data." Version 2, October 21. Accessed 2019-11-21. https://devopedia.org/structured-vs-unstructured-data