Processing of Data - DLMBDSA01 Unit 4


Steps in data analysis

• feature extraction
• correlation analysis
• feature selection
• machine learning
• extracting valuable insights

Machine learning

In this step, a mathematical learning algorithm is developed to extract knowledge from the data, uncover its properties, and predict outcomes when new data are introduced. Descriptive analytics are used to understand underlying data patterns; predictive analytics are used to estimate new or future data based on past performance; and prescriptive analytics are used to optimize the resulting actions.

Batch processing

Input data and/or output information are grouped into batches to permit sequential processing. Batch processing is mainly performed offline.
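Grouping records into batches for sequential processing can be sketched in a few lines; the chunking helper and the sample data below are illustrative, not part of the unit.

```python
def batches(records, size):
    """Group an input sequence into fixed-size batches for sequential processing."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# Process 7 records in batches of 3; the final batch may be smaller.
data = list(range(7))
processed = [sum(batch) for batch in batches(data, 3)]
print(processed)  # [0+1+2, 3+4+5, 6] -> [3, 12, 6]
```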

SQL

SQL is not a file format but a widely used language for querying databases. It is a powerful query language that organizes the elements of the input dataset into linked spreadsheets (i.e., tables).
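A small demonstration of defining a table and querying it, using Python's built-in `sqlite3` module; the table name and records are invented for illustration.

```python
import sqlite3

# In-memory database with one table (a "linked spreadsheet") for the dataset.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("Alice", 30), ("Bob", 25), ("Carol", 35)])

# A query: select everyone older than 28, sorted by age.
rows = con.execute(
    "SELECT name FROM people WHERE age > 28 ORDER BY age").fetchall()
print(rows)  # [('Alice',), ('Carol',)]
```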

Data Analysis

The data analysis stage may be performed through multiple threads of simultaneously executed instructions using machine learning and artificial intelligence algorithms. This stage is the "heart" of data processing and may include converting the data to a more suitable format.

Data Preparation

The data preparation stage involves preprocessing. Raw data are cleaned, organized, standardized, and checked for errors. The purpose of this stage is to deal with missing values and eliminate redundant, incomplete, duplicate, and incorrect records.
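The cleaning operations named above can be sketched in plain Python; the records and validity rule below are invented for illustration.

```python
# Raw records containing a duplicate, a missing value, and an incorrect entry.
raw = [
    {"name": "Alice", "age": 30},
    {"name": "Alice", "age": 30},    # duplicate record
    {"name": "Bob",   "age": None},  # missing value
    {"name": "Eve",   "age": -5},    # incorrect record
    {"name": "Carol", "age": 35},
]

seen, clean = set(), []
for rec in raw:
    key = (rec["name"], rec["age"])
    if key in seen:
        continue                     # eliminate duplicates
    if rec["age"] is None or rec["age"] < 0:
        continue                     # eliminate incomplete/incorrect records
    seen.add(key)
    clean.append(rec)

print([r["name"] for r in clean])  # ['Alice', 'Carol']
```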

Data Storage

The final stage of data processing is storing the data, instructions, developed numerical models, and information for future use.

Correlation analysis

The focus of this step is to determine which pairs of data features have the highest degree of correlation

ASCII text

ASCII is an abbreviation for American Standard Code for Information Interchange. ASCII codes represent text for electronic communication in computers.
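Each ASCII character maps to a numeric code point (e.g., 'A' is 65), which is how text is stored and exchanged as bytes. A quick round trip in Python:

```python
# Convert text to its ASCII code points and back again.
text = "Data"
codes = [ord(ch) for ch in text]
print(codes)                          # [68, 97, 116, 97]
print(bytes(codes).decode("ascii"))   # 'Data'
```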

Data Collection

After raw data are collected from one or more sources, they are converted into a computer-friendly format (e.g., tables, text, images) to form a data lake.

Data Input

After the data have been prepared and cleaned, they are entered into their destination location (e.g., a data warehouse) and translated into a format that consumers of the data—e.g., an organization's employees—can easily understand

Extracting valuable insights

After the model is evaluated for accuracy and performance, the most important and relevant information contained in the input data is retrieved and presented

XML

An XML file contains structured, non-tabular data written as annotated text, which can be shared on the World Wide Web using ASCII text. For example, <img fig="Alice.jpg" tag="Alice" /> indicates that the variables are named "fig" and "tag" and that their values are "Alice.jpg" and "Alice", respectively.
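The example element above can be parsed with Python's standard library to recover the variable names and values:

```python
import xml.etree.ElementTree as ET

# Parse the unit's example element; attributes hold the variables and values.
elem = ET.fromstring('<img fig="Alice.jpg" tag="Alice" />')
print(elem.attrib)  # {'fig': 'Alice.jpg', 'tag': 'Alice'}
```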

JSON

A JSON file includes a list of variable-value pairs with their corresponding names and values. It is used to transmit data records, with their complete information, among several operating systems.
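A round trip with Python's `json` module shows the variable-value pairs surviving transmission intact; the record below is invented for illustration.

```python
import json

# A data record as variable-value pairs (names and values).
record = {"name": "Alice", "age": 30, "tags": ["staff", "admin"]}

# Serialize to text for transmission between systems...
payload = json.dumps(record)

# ...and restore the complete record on the receiving side.
restored = json.loads(payload)
print(restored == record)  # True
```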

Mechanical Data Processing

Data are processed using devices such as printers, calculators, and typewriters. This method is faster and more reliable than manual data processing but is still considered primitive.

Stages of Data Processing

• data collection
• data preparation
• data input
• data analysis
• data interpretation
• data storage

The algorithms and calculations used in data processing must be

highly accurate, well built, and correctly performed

Data processing

is the extraction of useful information from collected raw data. It is similar to an industrial production process: inputs (raw data) are put through a particular process (data processing) to produce outputs (insights and information).

Methods of data processing

• manual data processing
• mechanical data processing
• electronic data processing

Real-time processing

Real-time processing responds almost immediately to changes in inputs and requests for outputs. The time savings are desirable because outputs are obtained in real time. An example is banking transactions.

Types of data collection

• statistical populations
• research experiments
• sample surveys
• byproduct operations

XLS

stores data in tables consisting of rows (for records) and columns (for variables). It is also possible to create charts from the data for visualization.

Feature selection

During this step, informative and relevant features are selected by applying correlation analysis to remove redundant features, keeping those that show high correlation with the target variable.

CSV

Each line in a CSV file denotes a single data record, with values separated by commas to specify the value of each data feature
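Reading such a file with Python's `csv` module; the two records below are invented, and `io.StringIO` stands in for a file on disk.

```python
import csv
import io

# A header line plus two records; commas separate the feature values.
text = "name,age\nAlice,30\nBob,25\n"

# Each subsequent line becomes one record keyed by the header fields.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["name"], rows[1]["age"])  # Alice 25
```

Note that `csv` yields every value as a string; converting "age" to an integer would be part of the data preparation stage.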

Data Interpretation

The outcomes of the machine learning predictions need to be translated into actions and interpreted to obtain beneficial information that can guide a company's future decisions. They should be presented to business managers in user-friendly forms such as tables, audio, video, and images.

Distributed processing (multi-processing)

Processing is distributed across remote workstations connected to a large central server. An example is an ATM network.

Online processing

Online processing utilizes internet connections and attached resources. An example is cloud computing.

Data feature

A data feature, also called a variable, is an aspect of the data such as a name, date, or age.

Data lake

A data lake is a repository of data stored in both its natural and transformed formats.

Data warehouse

A data warehouse is a central store of data gathered from various sources and used to guide decision-making in an organization.

Protobuf

A reduced version of XML, Protobuf files transfer small amounts of structured data between programs. This format is used for inter-application communication at Google.

Time-sharing processing

A single computing unit is utilized by multiple users according to predetermined, allocated time slots. The processing is usually performed by supercomputers and mainframes on bulk data such as census surveys, industry statistics, and enterprise resource planning.

Manual Data Processing

Considered a "primitive" method, manual data processing was used at a time when the technology was in its early stages and often unaffordable. It may still be needed today for legacy data that have not been digitized, and it is typically employed only by small businesses and low-capacity governmental offices.

Feature extraction

Data are represented by a number of fixed features, which can be categorical, binary, or continuous.

Electronic Data Processing

data are processed automatically using computer applications, software, and programs developed according to a predefined set of rules. Electronic data processing is fast and accurate. Examples include the processing of customers' bank accounts and students' university grades.

Summary

Without data processing, it is almost impossible to make a good decision. It is difficult to think of any industry that does not implement data processing to obtain insights into areas which require improvements. Data processing is a multidimensional process that starts with the collection of an immeasurable amount of data from various sources. The data are arranged in practical, organized forms and forwarded to the next stage, data preparation. In this stage, all preprocessing operations are performed on the data to remove noise and outliers. The data are then entered into the computer in a usable form. The data analysis stage converts the raw data into meaningful insights and information by using a machine learning model. Machine learning is employed to perform a series of operations on the preprocessed data so that relations within the data elements can be presented in various forms such as distribution curves, reports, and images. Data processing can be manual, mechanical, or electronic. The latter is the fastest and most accurate method and includes many processing types: batch, online, real-time, distributed (multi-processing), and time-sharing. The common formats of a data processed file are XLS, CSV, XML, SQL, JSON, and Protobuf. In each type of file, the data are presented in a predetermined structure specified by the properties of the associated formats.

The common formats of a data processed file

• XLS (Excel spreadsheet)
• CSV (comma-separated values)
• XML (extensible markup language)
• JSON (JavaScript object notation)
• Protobuf (protocol buffers)
• Apache Parquet
• SQL (structured query language)

Apache Parquet

Apache Parquet is a column-oriented database management system format available in the Hadoop ecosystem for big data processing, regardless of the data model or programming language. It stores data in columns rather than rows. There are many other data formats which may be utilized, such as the hierarchical data format (HDF4 and HDF5 versions).
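The row-oriented versus column-oriented distinction can be illustrated in plain Python (reading a real Parquet file would require a library such as pyarrow, not shown here); the two records are invented.

```python
# Row-oriented layout: one dict per record.
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob",   "age": 25},
]

# Column-oriented layout (the idea behind Parquet): one array per variable,
# so a query touching only "age" can skip the "name" column entirely.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns["age"])  # [30, 25]
```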

Data processing can be applied in many scenarios such as

automating office environments; administrating event tickets and reservations; managing work time and monitoring billable hours; organizing and planning the allocation of human or material resources; and conducting forecasting and optimization in an enterprise environment

Types of the electronic data processing

• batch processing
• online processing
• real-time processing
• distributed processing (multi-processing)
• time-sharing processing


Processed data should be presented in a format which meets the following criteria

• Data files are in formats that computers can readily analyze.
• People can easily recognize the data fields and their range of values.
• The formats are popular and/or standard so that the data can be mixed and matched with other data resources.
• The data are clear and express the information they contain without unnecessary features (e.g., highly correlated, redundant ones).

The benefits of data processing in medium and large organizations

• better analysis and presentation of the organization's data;
• reduction of data to only the most meaningful information;
• easier storage and distribution of data;
• simplified report creation;
• improved productivity and increased profits; and
• more accurate decision-making.

Forms of processed data

• user-readable plain text files, exported as Notepad files;
• charts to reflect trends and progress/decay;
• maps for spatial data;
• images for graphical data; and
• software-specific formats for those data requiring further analysis and processing.

