Data Science
Techniques for analyzing quantitative data
Author Jonathan Koomey has recommended a series of best practices for understanding quantitative data. These include:

- Check raw data for anomalies prior to performing your analysis.
- Re-perform important calculations, such as verifying columns of data that are formula driven.
- Confirm main totals are the sum of subtotals.
- Check relationships between numbers that should be related in a predictable way, such as ratios over time.
- Normalize numbers to make comparisons easier, such as analyzing amounts per person, relative to GDP, or as an index value relative to a base year.
- Break problems into component parts by analyzing factors that led to the results, such as DuPont analysis of return on equity.[6]

For the variables under examination, analysts typically obtain descriptive statistics, such as the mean (average), median, and standard deviation. They may also analyze the distribution of the key variables to see how the individual values cluster around the mean.

The consultants at McKinsey and Company named a technique for breaking a quantitative problem down into its component parts the MECE principle. Each layer can be broken down into its components; each of the sub-components must be mutually exclusive of the others and collectively add up to the layer above them. The relationship is referred to as "Mutually Exclusive and Collectively Exhaustive," or MECE. For example, profit by definition can be broken down into total revenue and total cost. In turn, total revenue can be analyzed by its components, such as the revenue of divisions A, B, and C (which are mutually exclusive of each other), which should add up to the total revenue (collectively exhaustive).

Analysts may use robust statistical measurements to solve certain analytical problems. Hypothesis testing is used when the analyst makes a particular hypothesis about the true state of affairs and gathers data to determine whether that state of affairs is true or false. For example, the hypothesis might be that "Unemployment has no effect on inflation," which relates to an economics concept called the Phillips curve. Hypothesis testing involves considering the likelihood of Type I and Type II errors, which relate to whether the data supports accepting or rejecting the hypothesis.

Regression analysis may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?"). This is an attempt to model or fit an equation line or curve to the data, such that Y is a function of X.

Necessary condition analysis (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?"). Whereas (multiple) regression analysis uses additive logic, in which each X-variable can produce the outcome and the X-variables can compensate for each other (they are sufficient but not necessary), NCA uses necessity logic, in which one or more X-variables allow the outcome to exist but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present, and compensation is not possible.
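The descriptive-statistics and regression steps above can be made concrete with a short sketch. This is a minimal illustration in Python, assuming a small invented unemployment/inflation data set (the figures are not from the source); it fits an ordinary least-squares line with NumPy's polyfit:

```python
import numpy as np

# Hypothetical monthly unemployment (X) and inflation (Y) rates, for illustration.
unemployment = np.array([4.1, 4.5, 5.0, 5.8, 6.3, 7.0])
inflation    = np.array([3.2, 2.9, 2.5, 2.1, 1.8, 1.4])

# Descriptive statistics for each variable under examination.
for name, series in [("unemployment", unemployment), ("inflation", inflation)]:
    print(f"{name}: mean={series.mean():.2f}, "
          f"median={np.median(series):.2f}, std={series.std(ddof=1):.2f}")

# Regression: fit Y as a linear function of X (degree-1 least-squares fit).
slope, intercept = np.polyfit(unemployment, inflation, 1)
print(f"inflation = {slope:.2f} * unemployment + {intercept:.2f}")
```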
Forecasts
Scientific Method
The scientific method is a body of techniques for investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge.[2] To be termed scientific, a method of inquiry is commonly based on empirical or measurable evidence subject to specific principles of reasoning.[3] The Oxford Dictionaries Online define the scientific method as "a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses".[4]

The scientific method is an ongoing process, which usually begins with observations about the natural world. Human beings are naturally inquisitive, so they often come up with questions about things they see or hear and often develop ideas (hypotheses) about why things are the way they are. The best hypotheses lead to predictions that can be tested in various ways, including making further observations about nature. In general, the strongest tests of hypotheses come from carefully controlled and replicated experiments that gather empirical data. Depending on how well the tests match the predictions, the original hypothesis may require refinement, alteration, expansion, or even rejection. If a particular hypothesis becomes very well supported, a general theory may be developed.[1]

Although procedures vary from one field of inquiry to another, identifiable features are frequently shared among them. The overall process of the scientific method involves making conjectures (hypotheses), deriving predictions from them as logical consequences, and then carrying out experiments based on those predictions.[5][6] A hypothesis is a conjecture, based on knowledge obtained while formulating the question. The hypothesis might be very specific or it might be broad. Scientists then test hypotheses by conducting experiments. Under modern interpretations, a scientific hypothesis must be falsifiable, implying that it is possible to identify a possible outcome of an experiment that conflicts with predictions deduced from the hypothesis; otherwise, the hypothesis cannot be meaningfully tested.[7] The purpose of an experiment is to determine whether observations agree with or conflict with the predictions derived from a hypothesis.[8] Experiments can take place in a college lab, on a kitchen table, at CERN's Large Hadron Collider, at the bottom of an ocean, on Mars, and so on.

There are difficulties in a formulaic statement of method, however. Though the scientific method is often presented as a fixed sequence of steps, it represents rather a set of general principles.[9] Not all steps take place in every scientific inquiry (or to the same degree), and they are not always in the same order.[10] Some philosophers and scientists, such as Lee Smolin[11] and Paul Feyerabend (in his Against Method), have argued that there is no scientific method. Nola and Sankey remark that "For some, the whole idea of a theory of scientific method is yester-year's debate".[12]
Predictions
Statistical Model
A statistical model is a class of mathematical model, which embodies a set of assumptions concerning the generation of some sample data, and similar data from a larger population. A statistical model represents, often in considerably idealized form, the data-generating process. The assumptions embodied by a statistical model describe a set of probability distributions, some of which are assumed to adequately approximate the distribution from which a particular data set is sampled. The probability distributions inherent in statistical models are what distinguishes statistical models from other, non-statistical, mathematical models. A statistical model is usually specified by mathematical equations that relate one or more random variables and possibly other non-random variables. As such, "a model is a formal representation of a theory". All statistical hypothesis tests and all statistical estimators are derived from statistical models. More generally, statistical models are part of the foundation of statistical inference.
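As a minimal sketch of these ideas, the following Python snippet specifies a simple statistical model (i.i.d. Bernoulli coin flips indexed by an unknown parameter p) and derives an estimator from it; the parameter value and sample size are illustrative assumptions, not from the source:

```python
import numpy as np

# The model: the data are assumed to be i.i.d. Bernoulli(p) draws, i.e. a
# family of probability distributions indexed by the unknown parameter p.
rng = np.random.default_rng(2)
true_p = 0.7                                  # unknown in practice
sample = rng.binomial(1, true_p, size=100)    # data assumed generated by the model

# An estimator derived from the model: the maximum-likelihood estimate of p
# under the Bernoulli assumption is simply the sample mean.
p_hat = sample.mean()
print(f"ML estimate of p: {p_hat:.2f}")
```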
Information
Information (shortened as info) is that which informs. In other words, it is the answer to a question of some kind. It is also that from which data and knowledge can be derived, as data represents values attributed to parameters, and knowledge signifies understanding of real things or abstract concepts.[1] As it regards data, the information's existence is not necessarily coupled to an observer (it exists beyond an event horizon, for example), while in the case of knowledge, the information requires a cognitive observer.

At its most fundamental, information is any propagation of cause and effect within a system. Information is conveyed either as the content of a message or through direct or indirect observation of some thing. That which is perceived can be construed as a message in its own right, and in that sense, information is always conveyed as the content of a message. Information can be encoded into various forms for transmission and interpretation (for example, information may be encoded into a sequence of signs, or transmitted via a sequence of signals). It can also be encrypted for safe storage and communication.

Information resolves uncertainty. The uncertainty of an event is measured by its probability of occurrence: the lower the probability, the greater the uncertainty, and the more information is required to resolve the uncertainty of that event. The bit is a typical unit of information, but other units such as the nat may be used. For example, the information in one "fair" coin flip is log2(2/1) = 1 bit, and in two fair coin flips it is log2(4/1) = 2 bits.

The concept that information is the message has different meanings in different contexts.[2] Thus the concept of information becomes closely related to notions of constraint, communication, control, data, form, education, knowledge, meaning, understanding, mental stimuli, pattern, perception, representation, and entropy.
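The coin-flip arithmetic above can be reproduced directly; this small Python sketch computes the self-information of an event from its probability:

```python
import math

def information_bits(probability: float) -> float:
    """Self-information of an event with the given probability, in bits."""
    return math.log2(1 / probability)

print(information_bits(1 / 2))  # one fair coin flip  -> 1.0 bit
print(information_bits(1 / 4))  # two fair coin flips -> 2.0 bits
```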
Knowledge
Knowledge is a familiarity, awareness or understanding of someone or something, such as facts, information, descriptions, or skills, which is acquired through experience or education by perceiving, discovering, or learning. Knowledge can refer to a theoretical or practical understanding of a subject. It can be implicit (as with practical skill or expertise) or explicit (as with the theoretical understanding of a subject); it can be more or less formal or systematic.[1] In philosophy, the study of knowledge is called epistemology; the philosopher Plato famously defined knowledge as "justified true belief", though this definition is now agreed by most analytic philosophers to be problematic because of the Gettier problems. However, several definitions of knowledge and theories to explain it exist. Knowledge acquisition involves complex cognitive processes: perception, communication, and reasoning;[2] while knowledge is also said to be related to the capacity of acknowledgment in human beings.[3]
Data Processing
Data processing is, generally, "the collection and manipulation of items of data to produce meaningful information."[1] In this sense it can be considered a subset of information processing, "the change (processing) of information in any manner detectable by an observer."[note 1] The term data processing (DP) has also been used to refer to a department within an organization responsible for the operation of data processing applications.[2]

Data processing may involve various processes, including:

- Validation - ensuring that supplied data is correct and relevant.
- Sorting - "arranging items in some sequence and/or in different sets."
- Summarization - reducing detail data to its main points.
- Aggregation - combining multiple pieces of data.
- Analysis - the "collection, organization, analysis, interpretation and presentation of data."
- Reporting - listing detail or summary data or computed information.
- Classification - separating data into various categories.
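A minimal Python sketch of several of these processes, assuming a few invented sales records (the field names and values are illustrative only):

```python
# Hypothetical records; the fields are illustrative assumptions.
records = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 95},
    {"region": "north", "sales": -5},   # fails validation below
    {"region": "south", "sales": 210},
]

# Validation: keep only records whose values are correct and relevant.
valid = [r for r in records if r["sales"] >= 0]

# Sorting: arrange items in some sequence.
valid.sort(key=lambda r: r["sales"], reverse=True)

# Aggregation: combine multiple pieces of data.
totals = {}
for r in valid:
    totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]

# Summarization / reporting: reduce detail data to its main points.
print("total sales by region:", totals)
print("overall total:", sum(totals.values()))
```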
Data set
A data set (or dataset) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows. The term data set may also be used more loosely, to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event. An example of this type is the data sets collected by space agencies performing experiments with instruments aboard space probes.
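The column/row structure described above can be sketched in a few lines of Python; the variables and values here are hypothetical:

```python
# A tiny data set: each key is a variable (column); each index is a member (row).
dataset = {
    "height_cm": [162, 175, 180],
    "weight_kg": [55, 70, 82],
}

# A single value (datum): the height of member 1 (the second row).
print(dataset["height_cm"][1])  # 175
```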
Quantification
In mathematics and empirical science, quantification (or quantitation) is the act of counting and measuring that maps human sense observations and experiences into members of some set of numbers. Quantification in this sense is fundamental to the scientific method.
Philosophy of Information
The philosophy of information (PI) is the area of research that studies conceptual issues arising at the intersection of computer science, information science, information technology, and philosophy. It includes: the critical investigation of the conceptual nature and basic principles of information, including its dynamics, utilisation, and sciences; and the elaboration and application of information-theoretic and computational methodologies to philosophical problems.[1]
Data Scientist
Data scientists use their data and analytical ability to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate the data insights/findings. They are often expected to produce answers in days rather than months, work by exploratory analysis and rapid iteration, and to produce and present results with dashboards (displays of current values) rather than papers/reports, as statisticians normally do.[6]
Data Analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.
Big Data
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.[2] Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk. Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."[3] Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[4] connectomics, complex physics simulations, biology and environmental research.[5]
Data
Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə)[1] is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist's handwritten notes about her interviews with people of an Indigenous tribe. Pieces of data are individual pieces of information. While the concept of data is commonly associated with scientific research, data is collected by a huge range of organizations and institutions, ranging from businesses (e.g., sales data, revenue, profits, stock price) to governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations). Data is measured, collected, reported, and analyzed, whereupon it can be visualized using graphs, images, or other analysis tools.

Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing. Raw data ("unprocessed data") is a collection of numbers or characters before it has been "cleaned" and corrected by researchers. Raw data needs to be corrected to remove outliers or obvious instrument or data entry errors (e.g., a thermometer reading from an outdoor Arctic location recording a tropical temperature). Data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next stage. Field data is raw data that is collected in an uncontrolled "in situ" environment. Experimental data is data that is generated within the context of a scientific investigation by observation and recording.
Data (computing)
Data (/ˈdeɪtə/ day-tə, or /ˈdɑːtə/ dah-tə;[1] treated as singular, plural, or as a mass noun) is any sequence of one or more symbols given meaning by specific acts of interpretation. Data (or a datum, a single unit of data) is not information. Data requires interpretation to become information. To translate data to information, several known factors must be considered. The factors involved are determined by the creator of the data and the desired information.

The term metadata refers to data about the data. Metadata may be implied, specified, or given. Data relating to physical events or processes will also have a temporal component. In almost all cases this temporal component is implied. This is the case when a device such as a temperature logger receives data from a temperature sensor. When the temperature is received, it is assumed that the data has a temporal reference of "now," so the device records the date, time, and temperature together. When the data logger communicates temperatures, it must also report the date and time (metadata) for each temperature.

Digital data is data that is represented using the binary number system of ones (1) and zeros (0), as opposed to analog representation. In modern (post-1960) computer systems, all data is digital. Data within a computer, in most cases, moves as parallel data; data moving to or from a computer, in most cases, moves as serial data (see Parallel communication and Serial communication). Data sourced from an analog device, such as a temperature sensor, must pass through an analog-to-digital converter (ADC) to convert the analog data to digital data.

Data may represent quantities, characters, or symbols on which operations are performed by a computer, stored and recorded on magnetic, optical, or mechanical recording media, and transmitted in the form of digital electrical signals.[2] A program is a set of data that consists of a series of coded software instructions to control the operation of a computer or other machine.[3] Physical computer memory elements consist of an address and a byte/word of data storage. Digital data are often stored in relational databases, like tables or SQL databases, and can generally be represented as abstract key/value pairs. Data can be organized in many different types of data structures, including arrays, graphs, and objects. Data structures can store data of many different types, including numbers, strings, and even other data structures. Data pass in and out of computers via peripheral devices.
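A minimal sketch of the temperature-logger example in Python: the sensor supplies a bare reading (data), and the logger attaches the date and time (metadata). The function name and record layout are illustrative assumptions:

```python
from datetime import datetime, timezone

def record_reading(temperature_c: float) -> dict:
    """Attach the implied temporal reference ("now") to a bare sensor reading."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # metadata
        "temperature_c": temperature_c,                       # data
    }

# Each logged entry carries the date and time alongside the temperature.
log = [record_reading(21.4), record_reading(21.6)]
for entry in log:
    print(entry)
```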
Data Mining
Data mining is an interdisciplinary subfield of computer science.[1][2][3] It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[1] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[1] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[1] Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.[4]

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself.[5] It also is a buzzword[6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java[7] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[8] Often the more general terms (large scale) data analysis and analytics - or, when referring to actual methods, artificial intelligence and machine learning - are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step; they belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
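Two of the mining tasks named above, cluster analysis and anomaly detection, can be sketched on synthetic data. This illustration assumes scikit-learn's KMeans and invented two-dimensional records:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic records: two tight groups plus one deliberately unusual record.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

# Cluster analysis: find groups of similar records.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Anomaly detection: flag the record farthest from every cluster centre.
centres = np.array([X[labels == k].mean(axis=0) for k in range(2)])
dist_to_nearest = np.min(
    np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), axis=1)
print("most anomalous record:", X[np.argmax(dist_to_nearest)])
```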
Data reporting
Data reporting is the process of collecting and submitting data to authorities entrusted with compiling statistics. Accurate data reporting gives rise to accurate analyses of the facts on the ground; inaccurate data reporting can lead to vastly uninformed decisions based on erroneous evidence. When data is not reported, the problem is known as underreporting; the opposite problem leads to false positives. Data reporting can be a difficult endeavor: census bureaus may hire hundreds of thousands of workers to achieve the task of counting all of the residents of a country.[1][2] Teachers use data from student assessments to determine grades; cellphone manufacturers rely on sales data from retailers to determine which models to increase production of. The effective management of nearly any company relies on accurate data.
Data Visualization
Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".[1] A primary goal of data visualization is to communicate information clearly and efficiently via statistical graphics, plots and information graphics. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message.[2] Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables. Data visualization is both an art and a science. It is viewed as a branch of descriptive statistics by some, but also as a grounded theory development tool by others. The rate at which data is generated has increased. Data created by internet activity and an expanding number of sensors in the environment, such as satellites, are referred to as "Big Data". Processing, analyzing and communicating this data present a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address this challenge.[3]
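As a minimal sketch of encoding a quantitative message with bars, here is an illustration using Matplotlib; the categories and values are invented:

```python
import matplotlib.pyplot as plt

# Hypothetical category totals, for illustration only.
categories = ["A", "B", "C"]
values = [23, 45, 12]

fig, ax = plt.subplots()
ax.bar(categories, values)        # bars encode the quantitative message
ax.set_xlabel("Division")
ax.set_ylabel("Revenue")
ax.set_title("Revenue by division")
plt.show()
```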
Experimental Data
Experimental data in science are data produced by a measurement, test method, experimental design or quasi-experimental design. In clinical research any data produced are the result of a clinical trial. Experimental data may be qualitative or quantitative, each being appropriate for different investigations. Generally speaking, qualitative data are considered more descriptive and can be subjective in comparison to having a continuous measurement scale that produces numbers. Whereas quantitative data are gathered in a manner that is normally experimentally repeatable, qualitative information is usually more closely related to phenomenal meaning and is, therefore, subject to interpretation by individual observers. Experimental data can be reproduced by a variety of different investigators and mathematical analysis may be performed on these data.
False Precision
False precision (also called overprecision, fake precision, misplaced precision and spurious accuracy) occurs when numerical data are presented in a manner that implies better precision than is actually the case; since precision is a limit to accuracy, this often leads to overconfidence in the accuracy as well.[1] Madsen Pirie defines the term "false precision" in a more general way: when exact numbers are used for notions that cannot be expressed in exact terms. For example, "I am 90% sure he is wrong". Often false precision is abused to produce an unwarranted confidence in the claim: "our mouthwash is twice as good as our competitor's". [2] In science and engineering, convention dictates that unless a margin of error is explicitly stated, the number of significant figures used in the presentation of data should be limited to what is warranted by the precision of those data. For example, if an instrument can be read to tenths of a unit of measurement, results of calculations using data obtained from that instrument can only be confidently stated to the tenths place, regardless of what the raw calculation returns or whether other data used in the calculation are more accurate. Even outside these disciplines, there is a tendency to assume that all the non-zero digits of a number are meaningful; thus, providing excessive figures may lead the viewer to expect better precision than actually exists. However, in contrast, it is good practice to retain more significant figures than this in the intermediate stages of a calculation, in order to avoid accumulated rounding errors. False precision commonly arises when high-precision and low-precision data are combined, and in conversion of units.
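The significant-figures convention can be illustrated with a small Python helper; round_sig is a hypothetical name introduced here, not a standard library function:

```python
from math import floor, log10

def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures (hypothetical helper)."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))

# Inputs read to tenths carry only a few significant figures; reporting the
# raw quotient would imply precision the data cannot support.
raw = 12.3 / 7.1
print(raw)                # 1.7323943661971832  (false precision)
print(round_sig(raw, 2))  # 1.7  (limited by the least precise input)
```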
Field Research
Field research or fieldwork is the collection of information outside a laboratory, library or workplace setting. The approaches and methods used in field research vary across disciplines. For example, biologists who conduct field research may simply observe animals interacting with their environments, whereas social scientists conducting field research may interview or observe people in their natural environments to learn their languages, folklore, and social structures. Field research involves a range of well-defined, although variable, methods: informal interviews, direct observation, participation in the life of the group, collective discussions, analyses of personal documents produced within the group, self-analysis, results from activities undertaken off- or on-line, and life-histories. Although the method generally is characterized as qualitative research, it may (and often does) include quantitative dimensions.
Code
In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form or representation, sometimes shortened or secret, for communication through a channel or storage in a medium. An early example is the invention of language, which enabled a person, through speech, to communicate what he or she saw, heard, felt, or thought to others. But speech limits the range of communication to the distance a voice can carry, and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time. The process of encoding converts information from a source into symbols for communication or storage. Decoding is the reverse process, converting code symbols back into a form that the recipient understands.
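A toy sketch of encoding and decoding in Python, using a Morse-style substitution table over three letters (illustrative only, not a real communication system):

```python
# A toy code: a table of rules converting letters into symbol sequences.
CODE = {"a": ".-", "b": "-...", "c": "-.-."}
DECODE = {symbol: letter for letter, symbol in CODE.items()}

def encode(message: str) -> str:
    """Convert source information into symbols for the channel."""
    return " ".join(CODE[ch] for ch in message)

def decode(signal: str) -> str:
    """Convert code symbols back into a form the recipient understands."""
    return "".join(DECODE[sym] for sym in signal.split())

signal = encode("cab")
print(signal)           # -.-. .- -...
print(decode(signal))   # cab
```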
Accuracy in Measurement
In industrial instrumentation, accuracy is the measurement tolerance, or transmission, of the instrument, and defines the limits of the errors made when the instrument is used in normal operating conditions.[4] Ideally a measurement device is both accurate and precise, with measurements all close to and tightly clustered around the true value. The accuracy and precision of a measurement process is usually established by repeatedly measuring some traceable reference standard. Such standards are defined in the International System of Units (abbreviated SI from French: Système international d'unités) and maintained by national standards organizations such as the National Institute of Standards and Technology in the United States.

This also applies when measurements are repeated and averaged. In that case, the term standard error is properly applied: the precision of the average is equal to the known standard deviation of the process divided by the square root of the number of measurements averaged. Further, the central limit theorem shows that the probability distribution of the averaged measurements will be closer to a normal distribution than that of individual measurements.

With regard to accuracy we can distinguish: the difference between the mean of the measurements and the reference value, the bias (establishing and correcting for bias is necessary for calibration); and the combined effect of bias and precision. A common convention in science and engineering is to express accuracy and/or precision implicitly by means of significant figures. Here, when not explicitly stated, the margin of error is understood to be one-half the value of the last significant place.
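The standard-error rule stated above (precision of the average = standard deviation / sqrt(n)) can be checked with a short Python sketch; the readings are invented, and the sample standard deviation stands in for the known process standard deviation:

```python
import math
from statistics import mean, stdev

# Hypothetical repeated measurements of a traceable reference standard.
readings = [9.98, 10.02, 10.01, 9.99, 10.00, 10.03, 9.97, 10.01]

n = len(readings)
avg = mean(readings)
sd = stdev(readings)                  # spread of individual measurements
standard_error = sd / math.sqrt(n)    # precision of the average

print(f"mean = {avg:.4f}, sd = {sd:.4f}, SE of mean = {standard_error:.4f}")
```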
Problem Solving
Problem solving consists of using generic or ad hoc methods, in an orderly manner, for finding solutions to problems. Some of the problem-solving techniques developed and used in artificial intelligence, computer science, engineering, mathematics, or medicine are related to mental problem-solving techniques studied in psychology. The term problem-solving is used in many disciplines, sometimes with different perspectives, and often with different terminologies. For instance, it is a mental process in psychology and a computerized process in computer science. Problems can also be classified into two different types (ill-defined and well-defined) from which appropriate solutions are to be made. Ill-defined problems are those that do not have clear goals, solution paths, or expected solutions. Well-defined problems have specific goals, clearly defined solution paths, and clear expected solutions. These problems also allow for more initial planning than ill-defined problems.[1] Being able to solve problems sometimes involves dealing with pragmatics (logic) and semantics (interpretation of the problem). The ability to understand what the goal of the problem is and what rules could be applied represents the key to solving the problem. Sometimes the problem requires some abstract thinking and coming up with a creative solution.
Raw Data
Raw data, also known as primary data, is data (e.g., numbers, instrument readings, figures, etc.) collected from a source. If a scientist sets up a computerized thermometer which records the temperature of a chemical mixture in a test tube every minute, the list of temperature readings for every minute, as printed out on a spreadsheet or viewed on a computer screen, is "raw data". Raw data has not been subjected to processing, "cleaning" by researchers to remove outliers, obvious instrument reading errors, or data entry errors, or any analysis (e.g., determining central tendency aspects such as the average or median result). Nor has raw data been subject to any other manipulation by a software program or a human researcher, analyst, or technician. Raw data is a relative term (see data), because even once raw data has been "cleaned" and processed by one team of researchers, another team may consider this processed data to be "raw data" for another stage of research. Raw data can be input into a computer program or used in manual procedures such as analyzing statistics from a survey. The term "raw data" can also refer to the binary data on electronic storage devices, such as hard disk drives (also referred to as "low-level data").
Empiricism
Empiricism is a theory that states that knowledge comes only or primarily from sensory experience.[1] One of several views of epistemology, the study of human knowledge, along with rationalism and skepticism, empiricism emphasizes the role of empirical evidence in the formation of ideas, over the notion of innate ideas or traditions;[2] empiricists may argue however that traditions (or customs) arise due to relations of previous sense experiences.[3] Empiricism in the philosophy of science emphasizes evidence, especially as discovered in experiments. It is a fundamental part of the scientific method that all hypotheses and theories must be tested against observations of the natural world rather than resting solely on a priori reasoning, intuition, or revelation. Empiricism, often used by natural scientists, says that "knowledge is based on experience" and that "knowledge is tentative and probabilistic, subject to continued revision and falsification."[4] One of the epistemological tenets is that sensory experience creates knowledge. The scientific method, including experiments and validated measurement tools, guides empirical research.
Exploratory Data Analysis
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA),[1] which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
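A minimal EDA sketch using pandas, assuming a small invented data set; describe, corr, histograms, and scatter plots are typical first looks before any formal modeling:

```python
import pandas as pd

# Hypothetical data set; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "height_cm": [162, 175, 180, 168, 171, 199],
    "weight_kg": [55, 70, 82, 61, 66, 95],
})

# Summarize main characteristics before any hypothesis testing.
print(df.describe())   # count, mean, std, quartiles per variable
print(df.corr())       # pairwise correlations may suggest hypotheses

# Visual methods often reveal structure the summaries miss.
df.hist()
df.plot.scatter(x="height_cm", y="weight_kg")
```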
Measurement (Data)
Measurement is the assignment of a number to a characteristic of an object or event, which can be compared with other objects or events.[1][2] The scope and application of a measurement is dependent on the context and discipline. In the natural sciences and engineering, measurements do not apply to nominal properties of objects or events, which is consistent with the guidelines of the International vocabulary of metrology published by the International Bureau of Weights and Measures.[2] However, in other fields such as statistics as well as the social and behavioral sciences, measurements can have multiple levels, which would include nominal, ordinal, interval, and ratio scales.[1][3] Measurement is a cornerstone of trade, science, technology, and quantitative research in many disciplines. Historically, many measurement systems existed for the varied fields of human existence to facilitate comparisons in these fields. Often these were achieved by local agreements between trading partners or collaborators. Since the 18th century, developments progressed towards unifying, widely accepted standards that resulted in the modern International System of Units (SI). This system reduces all physical measurements to a mathematical combination of seven base units. The science of measurement is pursued in the field of metrology.
Accuracy vs. Precision
Precision is a description of random errors, a measure of statistical variability. Accuracy has two definitions: more commonly, it is a description of systematic errors, a measure of statistical bias; alternatively, ISO defines accuracy as describing both types of observational error above (preferring the term trueness for the common definition of accuracy). In the fields of science, engineering and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's true value.[1] The precision of a measurement system, related to reproducibility and repeatability, is the degree to which repeated measurements under unchanged conditions show the same results.[1][2] Although the two words precision and accuracy can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method.

A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy. The result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision. A measurement system is considered valid if it is both accurate and precise. Related terms include bias (non-random or directed effects caused by a factor or factors unrelated to the independent variable) and error (random variability).

The terminology is also applied to indirect measurements—that is, values obtained by a computational procedure from observed data. In addition to accuracy and precision, measurements may also have a measurement resolution, which is the smallest change in the underlying physical quantity that produces a response in the measurement. In numerical analysis, accuracy is also the nearness of a calculation to the true value; while precision is the resolution of the representation, typically defined by the number of decimal or binary digits. Statistical literature prefers to use the terms bias and variability instead of accuracy and precision: bias is the amount of inaccuracy and variability is the amount of imprecision. In military terms, accuracy refers primarily to the accuracy of fire (or "justesse de tir"), the precision of fire expressed by the closeness of a grouping of shots at and around the centre of the target.[3]
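The contrast between the two error types can be simulated. This Python sketch, with invented bias and noise parameters, shows one system that is precise but biased and another that is accurate but imprecise:

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 100.0

# Precise but biased: a systematic error of +5 with little random error.
precise_biased = true_value + 5 + rng.normal(0, 0.1, size=1000)
# Accurate but imprecise: no bias, large random error.
accurate_noisy = true_value + rng.normal(0, 5.0, size=1000)

for name, x in [("precise/biased", precise_biased),
                ("accurate/imprecise", accurate_noisy)]:
    bias = x.mean() - true_value   # systematic error (inaccuracy)
    spread = x.std(ddof=1)         # random error (imprecision)
    print(f"{name}: bias={bias:+.3f}, spread={spread:.3f}")

# Larger samples shrink the spread of the mean but never remove the bias.
```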
MECE Principle
The MECE principle, pronounced 'me see', is a grouping principle for separating a set of items into subsets that are mutually exclusive and collectively exhaustive.[1] The MECE principle is useful in the business mapping process where the optimum arrangement of information is exhaustive and does not double count at any level of the hierarchy. Examples of MECE arrangements include categorizing people by year of birth (assuming all years are known). A non-MECE example would be categorization by nationality, because nationalities are neither mutually exclusive (some people have dual nationality) nor collectively exhaustive (some people have none).
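A minimal Python sketch of checking the two MECE conditions for a proposed grouping; the sets and names are invented for illustration:

```python
def is_mece(whole, subsets):
    """True if the subsets are mutually exclusive and collectively exhaustive."""
    union = set().union(*subsets)
    mutually_exclusive = sum(len(s) for s in subsets) == len(union)
    collectively_exhaustive = union == whole
    return mutually_exclusive and collectively_exhaustive

people = {"ann", "bo", "cy", "di"}
by_birth_year = [{"ann", "bo"}, {"cy", "di"}]       # a clean partition
by_nationality = [{"ann", "bo", "cy"}, {"cy"}]      # overlaps and misses "di"
print(is_mece(people, by_birth_year))    # True
print(is_mece(people, by_nationality))   # False
```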
Analytics
Analytics is the discovery, interpretation, and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight. Organizations may apply analytics to business data to describe, predict, and improve business performance. Specifically, areas within analytics include predictive analytics, prescriptive analytics, enterprise decision management, retail analytics, store assortment and stock-keeping unit optimization, marketing optimization and marketing mix modeling, web analytics, sales force sizing and optimization, price and promotion modeling, predictive science, credit risk analysis, and fraud analytics. Since analytics can require extensive computation (see big data), the algorithms and software used for analytics harness the most current methods in computer science, statistics, and mathematics.[1] Analytics is multidisciplinary. There is extensive use of mathematics and statistics, the use of descriptive techniques and predictive models to gain valuable knowledge from data—data analysis. The insights from data are used to recommend action or to guide decision making rooted in business context. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology. There is a pronounced tendency to use the term analytics in business settings, e.g. text analytics vs. the more generic text mining, to emphasize this broader perspective. There is an increasing use of the term advanced analytics, typically used to describe the technical aspects of analytics, especially in emerging fields such as the use of machine learning techniques like neural networks to do predictive modeling.
Data Science
Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured,[1][2] which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics,[3] similar to Knowledge Discovery in Databases (KDD). Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, operations research,[4] information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, databases, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing. Methods that scale to big data are of particular interest in data science, although the discipline is not generally considered to be restricted to such big data, and big data technologies are often focused on organizing and preprocessing the data instead of analysis. The development of machine learning has enhanced the growth and importance of data science. Data science affects academic and applied research in many domains, including machine translation, speech recognition, robotics, search engines, and the digital economy, as well as the biological sciences, medical informatics, health care, the social sciences, and the humanities. It heavily influences economics, business, and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.[5]
Information Extraction
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, like automatic annotation and content extraction out of images/audio/video, could be seen as information extraction. Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction from news wire reports of corporate mergers, such as denoted by the formal relation MergerBetween(company1, company2, date).

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.

Information extraction is part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage, and display. The discipline of information retrieval (IR)[1] has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of natural language processing (NLP), which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between both IR and NLP.

In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details. As an example, consider a group of newswire articles on Latin American terrorism, where each article is presumed to be based upon one or more terroristic acts. We also define, for any given IE task, a template, which is a case frame (or set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.
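A toy template-filling sketch in Python: a regular expression extracts the company1, company2, and date slots of a merger template from one invented sentence. Real IE systems are far more robust; the pattern and text here are illustrative assumptions:

```python
import re

# Hypothetical input sentence, invented for illustration.
text = "Acme Corp announced a merger with Globex Inc on 2 March 1999."

pattern = re.compile(
    r"(?P<company1>[A-Z]\w+(?: \w+)*?) announced a merger with "
    r"(?P<company2>[A-Z]\w+(?: \w+)*?) on (?P<date>\d{1,2} \w+ \d{4})")

match = pattern.search(text)
if match:
    # Filled slots of the template, as in MergerBetween(company1, company2, date).
    print(match.groupdict())
    # {'company1': 'Acme Corp', 'company2': 'Globex Inc', 'date': '2 March 1999'}
```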
Knowledge Representation and Reasoning
Knowledge representation and reasoning (KR) is the field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can utilize to solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language. Knowledge representation incorporates findings from psychology about how humans solve problems and represent knowledge, in order to design formalisms that will make complex systems easier to design and build. Knowledge representation and reasoning also incorporates findings from logic to automate various kinds of reasoning, such as the application of rules or the relations of sets and subsets. Examples of knowledge representation formalisms include semantic nets, systems architecture, frames, rules, and ontologies. Examples of automated reasoning engines include inference engines, theorem provers, and classifiers.
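A minimal sketch of automated rule application (forward chaining) in Python; the facts and rules are invented, and this is not a real inference engine:

```python
# Knowledge: a set of facts plus if-then rules (conditions -> conclusion).
facts = {"has_fever", "has_cough"}
rules = [
    ({"has_fever", "has_cough"}, "suspect_flu"),
    ({"suspect_flu"}, "recommend_rest"),
]

# Forward chaining: apply rules repeatedly until no new facts are derived.
changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)  # now includes 'suspect_flu' and 'recommend_rest'
```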