Structured vs Unstructured Data
Data is available in many forms, shapes and formats. Broadly, data can be either structured or unstructured. Data that's properly organized, with well-defined constraints and relationships among its different parts, can be considered structured.
There's no precise definition of structured data. In the hazy boundary between structured and unstructured data, some have identified semi-structured data. Others have argued that all data has some structure: it's just that some data is more difficult to store or analyze.
The general viewpoint is that all data can add value to data mining and analytics. Technologies that evolved for structured data have been adapted, and new ones invented, to handle unstructured data as well. Unstructured data has become increasingly important due to its volume, velocity, variety and value.
Discussion
-
What do you mean by "structure" with respect to data? Data that's highly organized, such as in a database, can be considered structured data. Such data is often managed with a Relational Database Management System (RDBMS). It has a schema that defines attributes and their types, constraints on values, and relations with other data tables and attributes. Data must conform to this schema. This makes data easy to query using Structured Query Language (SQL) and thereby facilitates analysis. Data in spreadsheets may also be considered structured.
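To make the role of a schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, columns and values are hypothetical and chosen only for illustration. It shows a schema with types, a value constraint and a relation, a row that is rejected for violating the schema, and a simple SQL query.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The schema defines attributes, their types, constraints and relations.
conn.execute("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL CHECK (amount > 0)
    )
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Asha')")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5)],
)

# Data that does not conform to the schema is rejected.
try:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, -3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Because the structure is known in advance, analysis is a short query.
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer_id = 1"
).fetchone()[0]
print(rejected, total)
```

The same guarantees are what make structured data easy to analyze: every row in `orders` is known to have a positive numeric `amount`, so the aggregate needs no cleaning step.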
Unstructured data is not organized in this manner, that is, in an RDBMS. This doesn't mean it has no structure at all. Consider an audio stream: it has a well-defined format, otherwise it couldn't be decoded and played. Consider newspaper text: a linguist would say there's structure. But from the perspective of business analysts, this data is unstructured simply because it's harder to analyze and obtain insights from.
Some say the term unstructured data is a misnomer. If it's truly lacking structure, it's useless to store it or analyze it. It would be better to categorize data as fixed or variable structure; repetitive or hierarchical; textual or non-textual.
-
Could you give examples of unstructured and semi-structured data? Rich media such as image, video or audio are unstructured. Social media generates lots of unstructured content. Websites host unstructured content that is commonly textual in nature. Information in documents such as MS Word files or PDF files is also seen as unstructured. Machine-generated content such as satellite images, IoT sensor data, or CCTV video feeds is considered unstructured.
Many of the same data sources mentioned above could have attributes that make them semi-structured. For instance, Tweets, Facebook posts, blog articles, and news stories published online often carry the number of likes, retweets/shares, and comments, including the names of readers who made them. Email text may be unstructured but the header contains names of sender/receiver, date and subject that give some structure.
IoT data may be seen as semi-structured when it's in JSON or XML formats. Data about data, called metadata, such as author name and publication date, makes data semi-structured. In fact, rich semantic markup on webpages gives them a lot more structure than HTML alone does. Most unstructured data can be considered semi-structured because of metadata.
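The email example can be sketched with Python's standard `email` package. The message below is made up for illustration. The header fields parse into well-defined values, while the body remains free text.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message: structured headers, then a blank line, then free text.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 21 Oct 2019 09:30:00 +0000
Subject: Pump 3 vibration

The pump on line 2 has been noisy since yesterday. Please take a look.
"""

msg = message_from_string(raw)

# Semi-structured part: named header fields with predictable meaning.
structured = {field: msg[field] for field in ("From", "To", "Date", "Subject")}
sent = parsedate_to_datetime(structured["Date"])

# Unstructured part: the body is just text that needs further analysis.
body = msg.get_payload()

print(structured["Subject"], sent.year)
print(body.strip())
```

Filtering by sender or date is straightforward here. Working out what the body is about requires text analysis.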
-
What are some use cases of analysis on unstructured data? In a manufacturing plant, operation logs and reports are unstructured. Text analysis to pick out frequent words or sentiments can help the plant manager make a maintenance plan or quickly understand the nature of a particular line.
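As a rough illustration of picking out frequent words, the sketch below counts terms across a few invented log entries using only the standard library. The log text, the stop-word list and the `frequent_terms` helper are assumptions made for this example, not taken from any real plant system.

```python
import re
from collections import Counter

# Invented operation log entries, for illustration only.
logs = [
    "Valve V12 leaking again, replaced gasket",
    "Pump P3 vibration high, bearing replaced",
    "Valve V12 leaking, scheduled inspection",
]

# A tiny, hypothetical stop-word list; real ones are much longer.
STOPWORDS = {"again", "high", "and", "the"}


def frequent_terms(entries, top_n=3):
    """Return the top_n most common words across all entries."""
    words = []
    for entry in entries:
        for word in re.findall(r"[a-z]+", entry.lower()):
            # Skip stop words and very short fragments such as "v" from "V12".
            if word not in STOPWORDS and len(word) > 2:
                words.append(word)
    return Counter(words).most_common(top_n)


print(frequent_terms(logs))
```

Even this crude count hints at recurring issues. Sentiment analysis or topic modelling would need dedicated NLP libraries.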
Koorong Books in Australia uses text analysis to identify duplicate postings or suggest similar books. This is an example of content-based profiling.
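One simple way to approximate this kind of matching is Jaccard similarity over the words in a title or description. This is only a sketch of the general idea, not the method Koorong uses. The catalogue, the helper names and the duplicate threshold are all hypothetical.

```python
import re


def tokens(text):
    """Lower-case a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(query, catalogue):
    """Return the catalogue entry with the highest word overlap."""
    return max(catalogue, key=lambda title: jaccard(query, title))


# Invented catalogue entries.
catalogue = [
    "Introduction to Data Mining",
    "Cooking for Beginners",
    "Data Mining: Practical Machine Learning Tools",
]

# Likely duplicate posting: high overlap with an existing entry.
DUPLICATE_THRESHOLD = 0.6  # assumed cut-off, to be tuned on real data
score = jaccard(catalogue[0], "Introduction to Data Mining (2nd copy)")
is_duplicate = score >= DUPLICATE_THRESHOLD

# Suggest a similar book for a free-text query.
suggestion = most_similar("Mining data with machine learning", catalogue)
print(is_duplicate, suggestion)
```

Word overlap ignores word order and synonyms. Production systems typically use weighted schemes such as TF-IDF or learned embeddings instead.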
In banking, customers often give feedback or complain on social media rather than via web forms. If banks wish to be customer centric, they need to act on this unstructured data. In fact, this could apply to any industry that needs to listen to its customers. Companies can use chatbots with NLP capability to automate customer support functions.
Deep learning techniques are being used to analyze images and sounds. Images can be automatically labelled. Mammograms can be analyzed for cancer. The sound of a motor can warn in advance that it's going to fail. This is of importance in the automobile and aviation sectors.
-
What are some myths about unstructured data? An early myth was that unstructured data can't be quantitatively analyzed. This was perhaps true in the past, but with recent advances in computer vision, speech processing, and natural language processing, algorithms are able to solve many complex problems in these domains.
Some might believe that unstructured data replaces structured data. In reality, many companies have not fully exploited the potential of structured data. Good old predictive models and analytical capabilities should continue to be used. Unstructured data will give access to new insights not otherwise available in structured data. In conclusion, both structured and unstructured data are valuable.
Another myth says that all big data is unstructured data. Telecom companies, smart energy meters, smartphones, and cars fitted with sensors are all generating big data that's structured.
Some businesses just store data with a vague idea of using it later. In fact, data and insights depreciate over time. Real-time dashboards are probably the best opportunity to act on data in a timely manner. When collecting or acting on data, tie it to a business vision.
-
How should I store unstructured data? Big Data technologies such as Hadoop and NoSQL databases have come about to address the needs of storing and managing unstructured data. Data warehouses and data lakes are places where big data is stored. Unstructured data is often used alongside structured data. Many technologies cater for both.
Hadoop enables distributed storage and computing on big data. It's a good engine for handling unstructured data, though it's a myth to think that unstructured data can't be stored or analyzed without Hadoop.
NoSQL databases are highly scalable and support flexible schema. Their storage is distributed. They're non-relational. They're good at storing multimedia, social media or textual data. Among the different types are document stores, column stores, key-value stores and graph data stores. A mix of these is often used, each suited to a particular kind of data. This approach is called polyglot persistence.
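The idea of a flexible schema can be illustrated without any database at all. The sketch below uses plain Python structures as stand-ins for a document store and a key-value store. It isn't a real NoSQL client, and the records and the `find` helper are hypothetical.

```python
import json

# Document store stand-in: each record is a self-describing JSON document,
# and records need not share the same fields.
documents = [
    json.loads('{"type": "tweet", "text": "Great service!", "likes": 12}'),
    json.loads('{"type": "sensor", "device": "pump-3", "temp_c": 71.5}'),
    json.loads('{"type": "tweet", "text": "App keeps crashing", "tags": ["bug"]}'),
]

# Key-value store stand-in: opaque values looked up by key only.
key_value = {"session:42": b"\x00\x01binary-blob"}


def find(docs, **criteria):
    """Return documents whose fields match all the given criteria.

    Missing fields simply fail to match; no schema change is needed.
    """
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


tweets = find(documents, type="tweet")
hot_pumps = find(documents, device="pump-3")
print(len(tweets), hot_pumps[0]["temp_c"])
```

In a relational table, adding the `tags` field would require altering the schema for every row. Here only the documents that need it carry it, at the cost of weaker guarantees when querying.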
As the cost of flash memory drops, flash becomes a faster alternative to disk storage. However, file services such as data protection, backup, and search still need to be in place.
For a minimalistic storage system, try the open source MinIO. Alternatives include Ceph, Scality and Cleversafe.
-
What are some techniques to analyze unstructured data? The general perception is that unstructured data is hard to analyze. This is changing due to advances in machine learning models.
The simplest way to get started is perhaps to call APIs that others have published. For example, cognitive APIs from IBM such as Watson Tradeoff Analytics can help in decision making. Geneea is an NLP API. For speech recognition, we can use AT&T Speech API. Google's Cloud Vision API is useful for many image-specific tasks.
Milestones
1991
The early and mid-1990s is when text mining starts entering real-world applications. Document-management systems emerge, later rebranded as Enterprise Content Management (ECM) systems. The World Wide Web also starts generating lots of unstructured data. Business Intelligence (BI) that had grown up on structured data starts mining text for useful insights.
From a survey involving data warehousing and business intelligence search, TDWI Research finds that the structured/unstructured ratio is not 20/80 as often claimed. Structured data is at 47%, unstructured data at 31%, and the rest is semi-structured. However, the report recognizes that unstructured data is on the rise.
2009
References
- Apache UIMA. 2019. "Homepage." Accessed 2019-10-20.
- Boe, Benjamin De. 2014. "Use Cases for Unstructured Data." Whitepaper, InterSystems Corporation. Accessed 2019-10-20.
- Datar, Shweta. 2019. "10 Machine Learning APIs You Should Learn." DZone, January 31. Accessed 2019-10-20.
- Foote, Keith D. 2017. "A Brief History of Database Management." DATAVERSITY, March 23. Accessed 2019-10-20.
- Goodwins, Rupert. 2019. "AI weaves value from unstructured data." Raconteur, March 27. Accessed 2019-10-20.
- Grimes, Seth. 2008. "Unstructured Data and the 80 Percent Rule." Breakthrough Analysis, August 01. Accessed 2019-10-20.
- Grishchenko, Alexey. 2015. "The Myth of “Unstructured Data”." Distributed Systems Architecture, July 28. Accessed 2019-10-20.
- Halper, Fern. 2018. "3 Use Cases for Unstructured Data." TDWI, September 25. Accessed 2019-10-20.
- Johnson, Matthew G. 2019. "The Myth of Unstructured Data." DataSeries, via Medium, February 10. Accessed 2019-10-20.
- Jones, M. Tim. 2018. "Data, structure, and the data science pipeline." IBM Developer, February 01. Accessed 2019-10-20.
- Lawtomated. 2019. "Structured Data vs. Unstructured Data: what are they and why care?" Lawtomated, April 07. Accessed 2019-10-20.
- Min, Zhi. 2017. "Helping Users Find Value in Unstructured Plant Data." Blog, Yokogawa, June 20. Accessed 2019-10-20.
- Pao, Steve. 2017. "3 challenges of high-performance storage in the age of unstructured data explosion." InfoWorld, November 21. Accessed 2019-10-20.
- Pickell, David. 2018. "Structured vs Unstructured Data – What's the Difference?" G2.com, November 16. Accessed 2019-10-20.
- Robb, Drew. 2017. "Semi-Structured Data." Datamation, July 03. Accessed 2019-10-20.
- Schneider, Christie. 2016. "The biggest data challenges that you might not even know you have." Blog, IBM, May 25. Accessed 2019-10-20.
- Selvam. 2016. "Note to banks: say “hello” to big data or “goodbye” to your customers." Blog, Crayon, June 09. Accessed 2019-10-20.
- Swoyer, Stephen. 2007. "Unstructured Data: Attacking a Myth." TDWI, September 05. Accessed 2019-10-20.
- Truxillo, Catherine. 2013. "Five myths about unstructured data and five good reasons you should be analyzing it." Blog, SAS, July 08. Accessed 2019-10-20.
- Woodie, Alex. 2017. "Solving Storage Just the Beginning for Minio CEO Periasamy." Datanami, February 01. Accessed 2019-10-20.
- Yates, Scott. 2018. "Five Business Intelligence Myths, Debunked." Experfy, February 28. Accessed 2019-10-20.
- van der Lans, Rick. 2015. "Big Data Myth 4: Big Data is Unstructured Data." TechTarget, October 12. Accessed 2019-10-20.
- van der Lans, Rick. 2016. "Unstructured data is a misnomer." TechTarget, March 25. Accessed 2019-10-20.
Further Reading
- Johnson, Matthew G. 2019. "The Myth of Unstructured Data." DataSeries, via Medium, February 10. Accessed 2019-10-20.
- Boe, Benjamin De. 2014. "Use Cases for Unstructured Data." Whitepaper, InterSystems Corporation. Accessed 2019-10-20.
- Grimes, Seth. 2008. "Unstructured Data and the 80 Percent Rule." Breakthrough Analysis, August 01. Accessed 2019-10-20.
See Also
- Big Data
- NoSQL Databases
- Types of Databases
- Data Analysis
- Text Mining
- Apache UIMA