The availability of data is usually considered the key to building a machine learning model or a data-driven real-world system. Data can take various forms, such as structured, semi-structured, or unstructured. In addition, “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of real-world data.
Structured: Structured data have a well-defined structure and conform to a data model following a standard order. They are highly organized and can be easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemes, such as relational databases, i.e., in a tabular format. Names, dates, addresses, credit card numbers, stock information, and geolocation are examples of structured data.
Unstructured: In contrast, unstructured data have no pre-defined format or organization, which makes them much more difficult to capture, process, and analyze. They mostly contain text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.
Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, and JSON documents, NoSQL databases, etc., are some examples of semi-structured data.
Metadata: It is not the normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or even document something relative to an organization’s data properties, whereas metadata describes the relevant data information, giving it more significance for data users. A basic example of a document’s metadata might be the author, file size, date the document was generated, keywords that describe the document, etc.
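The four data types described above can be illustrated with a minimal Python sketch. All sample values, field names, and variable names here are hypothetical and chosen only for illustration; the sketch uses just the standard `csv`, `io`, and `json` modules and is not meant as a definitive way to model these categories.

```python
import csv
import io
import json

# Structured: tabular rows that all follow the same fixed schema
# (hypothetical sample; in practice this would be a relational table).
structured_csv = "name,date,city\nAlice,2021-03-01,Dhaka\nBob,2021-03-02,Sydney\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: JSON carries keys and nesting, but records are not
# required to share one fixed tabular schema.
semi_structured = json.loads(
    '{"name": "Alice", "contacts": {"email": "alice@example.com"}, "tags": ["a", "b"]}'
)

# Unstructured: free text with no pre-defined fields or organization.
unstructured = "Alice emailed Bob on Monday about the quarterly report."

# Metadata: data that describes the document rather than its content.
metadata = {
    "author": "Alice",
    "file_size_bytes": len(unstructured.encode("utf-8")),
    "keywords": ["report"],
}

print(rows[0]["city"])                      # field access by column name
print(semi_structured["contacts"]["email"])  # field access by nested key
print(metadata["file_size_bytes"])           # size of the text in bytes
```

In this sketch the structured and semi-structured samples can be queried directly by column name or key, while the unstructured text would first need further processing (for example, text mining) before any field could be extracted from it.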
In the area of machine learning and data science, researchers use a variety of widely used datasets for different purposes. Examples include cybersecurity datasets such as NSL-KDD, UNSW-NB15, ISCX’12, CICDDoS2019, and Bot-IoT; smartphone datasets such as phone call logs, SMS logs, mobile application usage logs, and mobile phone notification logs; IoT data; agriculture and e-commerce data [120, 138]; health data such as heart disease, diabetes mellitus, and COVID-19; and many more in various application domains.
The data can be of any of the types discussed above, and may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities.