Big Data
Four Characteristics of a Fake Website Detection System
(1) Exhibit the ability to generalize across diverse and vast collections of concocted and spoof websites (2) Incorporate rich sets of fraud cues (3) Leverage important domain-specific knowledge regarding the unique properties of fake websites: stylistic similarities and content duplication (4) Provide long-term sustainability against dynamic adversaries by adapting to changes in the properties exhibited by fake websites.
Google File System assumptions
(1) The system is built from inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis. (2) The system stores a modest number of large files. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them. (3) Workloads primarily consist of two kinds of reads: large streaming reads and small random reads. Large streaming reads typically read hundreds of KBs, more commonly 1 MB or more. A small random read typically reads a few KBs. (4) Workloads also have many large, sequential writes that append data to files. (5) The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. The file may be read later, or a consumer may be reading through the file simultaneously. (6) High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write. Ghemawat et al. 2003
Challenges for Applying Topic Modeling
(1) Obtaining data from the web: (a) APIs, (b) web crawlers, (c) file downloads (2) Readying data for analysis (3) Fitting and validating a topic model (crucial: the number of topics one plans to extract) Debortoli et al. 2016
Biases and limitations associated with use of social media data for prediction
(1) Who participates, and do they represent the general population? (Approximately 10% of all social media website users contribute content, and the majority of content is produced by 1%.) - Mitigated by the absolute volume of social media users. (2) Biased Representation of Product Evaluation - Online product ratings are influenced by previously posted ratings. (3) Temporal Shifts in Online Reviews - Variance of product ratings decreases with the number of ratings. (4) Intentional Manipulation - 16% of Yelp reviews are suspected to be manipulated; product rating manipulation may be far more prevalent than expected. (5) Effects of Product Characteristics - Different brands have different representations in social media; cars can be considered highly complex brands, and hence are expected to be mentioned less frequently in social media outlets, which consequently may impact the predictive power of the associated social media data. (Geva et al. 2010)
Meta Learning Principles
(1) the use of organizational and industry contextual information, (2) the use of quarterly and annual data, and the use of more robust classification methods using (3) stacked generalization and (4) adaptive learning. Abbasi et al. 2012
Focused Web Crawler
- Focused crawlers "seek, acquire, index, and maintain pages on a specific set of topics that represent a narrow segment of the web" (Chakrabarti et al., 1999). - Crawlers are defined as "software programs that traverse the World Wide Web information space by following hypertext links and retrieving web documents by standard HTTP protocol" (Cheong, 1996, p. 82). - Accessibility is an important consideration because DarkWeb forums often require membership to access member postings (Chen, 2006). - One study found that the InvisibleWeb contained 400 to 550 times the information present in the traditional surface Web (Bergman, 2000; Lin & Chen, 2002). Fu et al. 2010
Advantages of using large sample
- Provide the opportunity to conduct more powerful data analysis and inference - Detect or quantify small or complex effects, e.g., nonlinear relationships - Sufficient data to conduct analysis on subsamples of the data - Incorporate control variables - Validate predictive models (Lin et al. 2013)
Support Vector Machines
- Tries to achieve classification - Features are extracted - Tries to find a separator (one dimension lower than the space, e.g., a line in two dimensions) - The best line is defined by the width of the gap between the two data sets
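The "widest gap" idea can be illustrated with a minimal sketch, assuming a two-dimensional, linearly separable toy data set. The points, the hand-picked candidate lines, and the function names are all hypothetical; a real SVM finds the separator by solving an optimization problem rather than by comparing a short list of candidates.

```python
import math

# Toy 2-D data (hypothetical): two linearly separable classes.
positives = [(3.0, 3.0), (4.0, 3.5), (3.5, 4.5)]
negatives = [(1.0, 1.0), (0.5, 2.0), (2.0, 0.5)]


def signed_distance(w, b, point):
    """Signed distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    norm = math.hypot(w[0], w[1])
    return (w[0] * point[0] + w[1] * point[1] + b) / norm


def margin(w, b):
    """Distance from the line to the closest point (negative if a point is on the wrong side)."""
    closest_pos = min(signed_distance(w, b, p) for p in positives)
    closest_neg = min(-signed_distance(w, b, p) for p in negatives)
    return min(closest_pos, closest_neg)


# Candidate separators (w, b), chosen by hand for illustration only.
candidates = [
    ((1.0, 0.0), -2.5),   # vertical line x = 2.5
    ((0.0, 1.0), -2.5),   # horizontal line y = 2.5
    ((1.0, 1.0), -4.25),  # diagonal line x + y = 4.25
]

# The "best" line is the one with the widest gap to both classes.
best = max(candidates, key=lambda c: margin(*c))
```

All three candidates separate the classes, but the diagonal line leaves the widest gap, which is the criterion the card describes.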
Key use cases for social media analytics
(1) identifying issues described in user-generated content; (2) identifying ideas and opportunities; and (3) identifying important discussion participants (Zabin et al. 2011).
Agarwal and Dhar 2014 - 5 Questions
1. Are big data, analytics, and data science, as being described in the popular outlets, old wine in new bottles or is it something new? 2. What are the strengths that the information systems (IS) community brings to the discourse on business analytics? In other words, what is our competitive advantage? 3. What are important and interesting research questions and domains that may "fit" with on-going research in our community? How might we push the envelope by extending or modifying our existing research agendas? What about new areas of inquiry? 4. To what extent should robust prediction prowess be used as a criterion in evaluating data-driven models versus current criteria that favor "explanatory" models without subjecting them to rigorous tests of future predictability? 5. As editors and reviewers, how should we evaluate research in this domain? What constitutes a "significant" contribution? Answer: the same as before: Fit, Interestingness, Rigor, Story, Theory (Agarwal 2012).
6 Challenges of Big Data
1. Big Data changes the definition of knowledge 2. Claims to objectivity and accuracy are misleading 3. Bigger data are not always better data 4. Taken out of context, Big Data loses its meaning 5. Just because it is accessible does not make it ethical 6. Limited access to Big Data creates new digital divides Boyd and Crawford 2012
Tree Based Methods
1. Generate the Selection model 2. Identify unbalancing variables 3. Measure intervention effect at each terminal node 4. Display results visually 5. Optional: Pool homogeneous intervention effects Yahav et al. 2016
5 Management Challenges of Big Data
1. Leadership - Set clear goals and define success 2. Talent Management - Employ data scientists and professionals skilled with data 3. Technology - Use technology able to handle the volume, velocity, and variety of data 4. Decision Making - Put information and the relevant decision rights in the same location 5. Company Culture - Move away from making decisions on hunches to data-driven decisions McAfee et al. 2021
High Level steps for SML
1. Train the bottom-level classifiers and run them on the entire test bed. 2. Train the top-level classifiers in the generalized stack, using the training and testing data generated by the bottom-level classifiers. 3. Reset the training data to include only the original training instances. 4. Rank the test instances based on the top-level classifiers' predictions. 5. If the stopping rule has not been satisfied, add the d test instances with the highest rank to the training data (with class labels congruent with the top-level classifiers' predictions) and increment d. Otherwise go to step 7. 6. If d is less than the number of instances in the test bed, repeat steps 1-5, using the expanded training data for steps 1 and 2. 7. Output the predictions from the top-level classifiers in the generalized stacks. Abbasi et al. 2012
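The loop structure of these steps can be sketched as follows. This is a heavily simplified illustration, not the procedure from the paper: the stacked bottom-level and top-level classifiers are collapsed into a single nearest-centroid classifier, the stopping rule is reduced to "d has reached the size of the test bed", and the data, labels, and function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def train(labeled):
    """Nearest-centroid stand-in for the stacked classifiers (an assumption)."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}


def predict(model, x):
    """Return (label, confidence); confidence is the gap between the two nearest centroids."""
    dists = sorted((math.dist(x, c), y) for y, c in model.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def adaptive_self_training(original, test_bed, step=1):
    training = list(original)
    d = step
    while True:
        model = train(training)                                    # steps 1-2, collapsed
        preds = [(x,) + predict(model, x) for x in test_bed]
        if d >= len(test_bed):                                     # simplified stopping check
            return [(x, label) for x, label, _ in preds]           # step 7
        ranked = sorted(preds, key=lambda p: p[2], reverse=True)   # step 4
        # Steps 3 and 5: reset to the original instances, then add the d
        # top-ranked test instances with their predicted labels.
        training = list(original) + [(x, label) for x, label, _ in ranked[:d]]
        d += step


original = [((0.0, 0.0), "legit"), ((4.0, 4.0), "fraud")]
test_bed = [(0.5, 0.2), (3.8, 4.1), (1.8, 1.9), (2.4, 2.3)]
predictions = dict(adaptive_self_training(original, test_bed))
```

Because the pseudo-labeled instances shift the centroids on each pass, predictions for borderline instances can change between iterations, which is the adaptive behavior the steps aim for.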
Abbasi et al. 2019
18 total combinations of signal detection methods; GASD outperformed the mention model by a wide margin in terms of recall and precision in all settings. - Yielded timelier signals in most settings - Better precision rates - Results demonstrate that event types, channels, and models heavily impact event signal detection performance.
4 Categories of data analytics
- Prediction - Summarization - Estimation - Hypothesis testing Varian 2014
Word embedding/word2vec
Represents the semantics of a word using a numeric vector. Words that co-occur with the same neighboring words have similar meanings (Harris 1954); the model thus identifies synonyms from common neighboring words. Word2vec learns the meaning of a specific word via a neural network that reads through textual documents and thereby learns to predict all of its neighboring words (Li et al. 2021)
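The idea that words sharing neighbors have similar meanings can be illustrated with a small count-based sketch. This is not word2vec itself (no neural network is trained); it swaps in raw co-occurrence counts and cosine similarity as a stand-in, and the corpus, window size, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: "firm" and "company" appear with the same neighbors.
corpus = [
    "the firm rewards innovation and teamwork",
    "the company rewards innovation and integrity",
    "the firm values teamwork and quality",
    "the company values integrity and quality",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a count vector of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(corpus)
sim_synonyms = cosine(vectors["firm"], vectors["company"])   # shared neighbors
sim_unrelated = cosine(vectors["firm"], vectors["quality"])  # few shared neighbors
```

In this toy corpus "firm" and "company" end up with nearly identical neighbor counts, so their similarity is much higher than that of "firm" and "quality".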
BigTable Architecture
A scalable distributed storage system that is designed to handle petabytes of data across thousands of commodity servers. As of 2008, it serves 60 applications, including Google Earth, Google Analytics, Google Finance, and Personalized Search. The BigTable API is used to create and delete tables and to manage access controls. BigTable relies on Google's cluster management system for scheduling, resource management, and monitoring of the overall system, and it also uses the Google File System. Chang et al. 2008
Major Limitations of NLP
Analysis is only really available for high-resource languages (English, French, German, Spanish, and Chinese). A key finding is that simple methods, built on words, part-of-speech (POS) tags, and basic templates, work quite well. Hirschberg and Manning 2015
Abbasi et al. 2010
Automated detection systems have emerged as a mechanism for combating fake websites; however, most are fairly simplistic in terms of the fraud cues and detection methods they employ. We conducted a series of experiments, comparing the proposed system against several existing fake website detection systems on a test bed encompassing 900 websites. The results indicate that systems grounded in statistical learning theory (SLT) can more accurately detect various categories of fake websites by utilizing richer sets of fraud cues in combination with problem-specific knowledge.
Necessary components of NLP
Automatic Speech Recognition (ASR), Dialogue Management (DM), and Text-to-Speech (TTS) Hirschberg and Manning 2015
Business Intelligence and Analytics (BI&A) version numbers
BI&A 1.0 - Database management with structured data, collected by companies through various legacy systems and often stored in commercial relational database management systems (RDBMS). BI&A 2.0 - Web-based unstructured content. BI&A 3.0 - Mobile and sensor-based content. Chen et al. 2012
Social Network Analysis - Authors
Borgatti et al. 2009 - Social Network Theory; Martens et al. 2016 - fine-grained behavior data for predictive analysis; Vosoughi et al. 2018 - spread of true and false news
Big Data Fundamentals - Authors
Chang et al. 2008 - Google's Bigtable architecture; Ghemawat et al. 2003 - Google File System; Dean and Ghemawat 2008 - MapReduce; Fu et al. 2010 - Focused crawler on the dark web
Future NLP Research
Classifying Ekman's classic six basic emotions (anger, disgust, fear, happiness, sadness, surprise); identifying deception; medical conditions (e.g., depression, autism, and Parkinson's); speaker characteristics (age, gender, likeability, personality); speaker conditions (cognitive overload, drunkenness, sleepiness). Hirschberg and Manning 2015
Abbasi 2012 Findings
Collectively, the MetaFraud framework improved overall accuracy by 27 percent on average as compared to the baseline classifiers, and by 7 to 20 percent over state-of-the-art methods, with each phase contributing significantly as evidenced by the hypothesis test results.
Natural Language Processing
Computational linguistics: looks at techniques to learn, understand, and produce content understandable by humans (Hirschberg and Manning 2015)
Kokkodis 2020
Context - online communities. When a user takes multiple actions within a week, we map those to the one associated with the higher level of self-confidence. The HMM-AFT framework allows managers to predict individual user engagement (Section 5.3.1) and future community welfare (Section 5.3.2). The HMM-AFT approach significantly (p < 0.001) outperforms all other approaches by an average AUC improvement between 9% and 26% ("Mean across classes" plot in Figure 7).
Yan & Tan 2014
Context - online health care communities. Method - Partially observed Markov model to examine the latent health outcomes of online health community members. Hypothesis 1. Informational support given and received in online healthcare communities has a positive effect on patients' health conditions. - Supported. Hypothesis 2. Emotional support given and received in online healthcare communities has a positive effect on patients' health conditions. - Supported. Hypothesis 3. The effect of social support in online healthcare communities is moderated by patients' health conditions. - Partially supported. Hypothesis 4. Social embeddedness and social competence in online healthcare communities illustrate the depth and strength of social support and thus have a positive effect on patients' health conditions. - Partially supported.
Taxonomy for Big Data Analytics
Decision Time Analytics Techniques
Big Data Definition
Define big data as a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric. Boyd and Crawford 2012
Nichols 2007
Discusses: ordinary regression and panel methods; matching and reweighting estimators; instrumental variables; regression discontinuity designs
Lessons for Machine Learning for Decision Making
First, it illustrates the kind of decisions that make for an ideal application of machine learning: ones that hinge on the prediction of some outcome. The second general lesson is that assessing whether machine predictions improve on human decisions requires confronting a basic selection problem: data on outcomes (labels) can be missing in a nonrandom way. Third, decisions that appear bad may simply reflect different goals. Difficulties: You cannot observe the counterfactual; e.g., if you deny bail, you do not know whether the person would have failed to appear. Shmueli and Koppius 2011
The Google File System
GFS is a scalable distributed file system for large distributed data-intensive applications. Distributed file system (DFS) is a method of storing and accessing files based in a client/server architecture. In a distributed file system, one or more central servers store files that can be accessed, with proper authorization rights, by any number of remote clients in the network (Ghemawat 2003)
Hidden Markov Model
An HMM built from trace data to examine how user engagement relates to dimensions of community welfare (activity, interactivity, and reciprocity). The study evaluates five different predictive models: a support vector regression model (SV) (Shevade et al. 2000, Sapankevych and Sankar 2009), an autoregressive integrated moving average model with explanatory variables (ARIMAX) (Friedman and Meiselman 1963), a k-nearest neighbor regression (kNN), recurrent neural networks (long short-term memory, or LSTM) (Hochreiter and Schmidhuber 1997), and gradient boosting (XGBoost) (Chen and Guestrin 2016). Kokkodis 2020
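How an HMM infers a hidden engagement state from observed actions can be sketched with the standard forward algorithm. This is a generic two-state illustration, not the HMM-AFT model from the paper; the state names, observation names, and all probabilities below are hypothetical.

```python
# Hypothetical hidden states and model parameters.
states = ("lurker", "contributor")
start = {"lurker": 0.8, "contributor": 0.2}
trans = {
    "lurker": {"lurker": 0.9, "contributor": 0.1},
    "contributor": {"lurker": 0.3, "contributor": 0.7},
}
emit = {
    "lurker": {"read": 0.9, "post": 0.1},
    "contributor": {"read": 0.4, "post": 0.6},
}


def forward(observations):
    """Forward algorithm: joint probability of the observations and each final hidden state."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return alpha


def state_posterior(observations):
    """Probability of each hidden state at the last time step, given the observations."""
    alpha = forward(observations)
    total = sum(alpha.values())
    return {s: a / total for s, a in alpha.items()}


mostly_reading = state_posterior(["read", "read", "read"])
started_posting = state_posterior(["read", "post", "post"])
```

A user whose recent trace contains posts is assigned a higher posterior probability of being in the hypothetical "contributor" state than one who only reads.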
Text Mining - Authors
Hirschberg and Manning 2015 - Identifies advances in natural language processing; Debortoli et al. 2016 - Text mining tutorial; Li et al. 2021 - measuring corporate culture; Büschken and Allenby 2016 - compare topic modeling tools LDA, SC-LDA, Sticky SC-LDA; Abbasi et al. 2018 - language-action perspective, sensemaking capabilities of text analysis; Abbasi et al. 2019 - Early warning of adverse product events using topic modeling
Vosoughi et al. 2018
How do truth and falsity diffuse differently, and what factors of human judgment explain these differences? Data: Tweets from 2006-2017; about 126,000 rumor cascades spread by about 3 million people more than 4.5 million times; rumors were investigated by 6 independent fact-checking organizations. The study also assesses novelty. Results: Contrary to conventional wisdom, robots accelerated the spread of true and false news at the same rate, implying that false news spreads more than the truth because humans, not robots, are more likely to spread it. - Falsehood also reached far more people than the truth. Whereas the truth rarely diffused to more than 1000 people, the top 1% of false-news cascades routinely diffused to between 1000 and 100,000 people (Fig. 2B). - Falsehood reached more people at every depth of a cascade than the truth, meaning that many more people retweeted falsehood than they did the truth (Fig. 2C). - It took the truth about six times as long as falsehood to reach 1500 people (Fig. 2F) and 20 times as long as falsehood to reach a cascade depth of 10 (Fig. 2E). As the truth never diffused beyond a depth of 10, we saw that falsehood reached a depth of 19 nearly 10 times faster than the truth reached a depth of 10 (Fig. 2E). - Novelty attracts human attention, contributes to productive decision-making, and encourages information sharing because novelty updates our understanding of the world.
Popular sentiment Analysis
Identify opinions and beliefs about politicians; predict disease spreading from symptoms mentioned in tweets; recognize fake news; identify social networks; product pricing trends and advertising from online reviews; discover common questions and misconceptions in medical forums; identify hate speech or bullying behavior. Hirschberg and Manning 2015
Problem with P values
In very large samples, p-values go quickly to zero, and relying solely on p-values can lead the researcher to claim support for results of no practical significance. Information systems research that is based on large samples might be over-relying on p-values to interpret findings. With extremely large samples, we should go beyond rejecting a null hypothesis based on the sign of the coefficient (positive or negative) and the p-value. (Lin et al. 2013)
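The effect of sample size on the p-value can be shown with a short calculation. This is a sketch under stated assumptions: a two-sided one-sample z-test with known standard deviation, and an observed mean that equals a fixed, practically negligible effect of 0.01 standard deviations at every sample size. The function name and the numbers are hypothetical.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a z-test for an observed mean `effect` with known `sd` and sample size `n`."""
    z = effect * math.sqrt(n) / sd
    return math.erfc(abs(z) / math.sqrt(2))


# The same tiny effect (0.01 standard deviations) at growing sample sizes.
p_small = two_sided_p_value(0.01, 1.0, 100)
p_medium = two_sided_p_value(0.01, 1.0, 10_000)
p_large = two_sided_p_value(0.01, 1.0, 1_000_000)
```

The effect size never changes, yet the p-value falls from clearly non-significant to essentially zero as n grows, which is why effect sizes and confidence intervals need to be reported alongside it.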
What is driving NLP development?
Increase in computing power; accessibility and availability of data; development of new machine learning techniques; greater understanding of human language. Hirschberg and Manning 2015
Sequence Modeling - Authors
Kokkodis 2020 - Uses a Hidden Markov Model to model how less active users (lurkers) move to greater engagement; Zheng et al. 2014 - Latent Growth Modeling for Information Systems; Yan and Tan 2014 - Partially observed Markov model
Latent Dirichlet Allocation benefits
1) LDA has evolved from the seminal LSA idea, and academic research has extensively used both methods; 2) numerous free and open source LDA software libraries exist for most statistical programming languages (including R, Python, Java); and 3) several empirical studies have validated LDA's capability of extracting semantically meaningful topics from texts and categorizing texts according to these topics (e.g., Boyd-Graber, Mimno, & Newman, 2014; Chang, Boyd-Graber, Gerrish, Wang, & Blei, 2009; Lau, Newman, & Baldwin, 2014; Mimno, Wallach, Talley, Leenders, & McCallum, 2011). Debortoli et al. 2016
Latent Growth Modeling
Latent Growth Modeling (LGM) is a complementary method for analyzing longitudinal data, modeling the process of change over time, testing time-centric hypotheses, and building longitudinal theories. Time lag, duration, and rate of change are key parameters when integrating time into research hypotheses (Mitchell and James 2001). LGM examines: (1) change within a single variable, (2) change of a variable conditional on covariates, (3) cross effects of changes among multiple time-varying variables, and (4) shape (e.g., nonlinear) of growth patterns. LGM breaks down the variance into two major components: within-individual and between-individual variation (Whiteman and Mroczek 2007). There are in general four types of hypotheses that are testable using LGM: (1) change within a single variable without covariates (unconditional model), (2) change of a variable conditional on time-invariant covariates (conditional model), (3) cross effects of changes among multiple time-varying variables (multivariate LGM), and (4) nonlinear growth patterns (nonlinear LGM). (Zheng et al. 2014)
BigTable Architecture Lessons Learned
- Large distributed systems are vulnerable to failure - Delay new features until it is known how they are going to be used - Monitoring system performance is important - Simple designs are valuable, in both code and protocols Chang et al. 2008
Text coding categorizations
Manual coding (bottom up); manual coding (top down); dictionaries; supervised machine learning; unsupervised machine learning
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Uses: large-scale machine learning problems; clustering problems for the Google News and Google products; extracting data to produce reports of popular queries (e.g., Google Zeitgeist and Google Trends); extracting properties of Web pages for new experiments and products (e.g., extraction of geographical locations from a large corpus of Web pages for localized search); processing of satellite imagery data; language model processing for statistical machine translation; and large-scale graph computations. More than ten thousand distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, data mining, machine learning, statistical machine translation, and many other areas. Dean and Ghemawat 2008
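The programming model can be illustrated with the classic word-count example. This is a single-process sketch of the map, shuffle, and reduce phases only; it does not model distribution, fault tolerance, or Google's implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: merge all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data big ideas", "data beats opinions"])
```

In a real deployment the map calls and the reduce calls for different keys are independent, so they can run on many machines in parallel.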
Big Data Authors
McAfee et al. 2021 - Introduces Volume, Velocity, and Variety, and 5 main challenges; Chen et al. 2012 - evolution, application, and emerging research areas for big data; Boyd and Crawford 2012 - Provides 6 challenges to big data about the claims it makes and the ethical issues that arise; Goes 2014 - Creates a taxonomy for big data infrastructure based on the 4 V's; Agarwal and Dhar 2014 - Addresses 5 questions on big data, data science, and analytics; Rai 2016 - synergies between big data and theory; Cukier 2010 - Discusses how business and government are tapping into the vast potential of big data
Li et al. 2021
Method: neural network language models; semi-supervised machine learning approach. Data: 209,480 earnings calls from Thomson Reuters' StreetEvents database, 2001-2018. Findings: Firms with a strong corporate culture are associated with greater operational efficiency, more corporate risk-taking and long-term orientation, and higher firm value. The culture-performance link is more pronounced in bad times. A natural language model based on artificial neural networks can learn the context-specific meanings of words and phrases. Using this model, we propose a new semi-supervised machine learning approach to generating a culture dictionary and quantifying corporate disclosures.
Practical Advice for going beyond P values
Presenting effect sizes (marginal analysis); reporting confidence intervals; using charts (Lin et al. 2013)
Wang et al. 2018
Research Questions: 1. Introduce a novel copycat detection method that is able to identify different types of copycat apps in dimensions of both function and appearance by using diverse sources of publicly available data. 2. Empirically analyze how copycat apps affect the demand for original apps. Results: (1) Copycat apps can be either friends or foes of original apps, depending on the quality and imitation type of the copycats. (2) Nondeceptive copycats, particularly those with high quality, are likely to be competitors of the original app and will cannibalize the sales of the original app. (3) Deceptive copycats, especially those with low quality, positively affect demand for the original app.
Predictive Analytics- Authors
Shmueli and Koppius 2011 - Predictive analytics in IS research; Kleinberg et al. 2017 - Can algorithms improve upon judicial decisions using decision trees; Athey 2017 - Big data for policy problems; Abbasi et al. 2010 - detecting fake websites; Abbasi et al. 2012 - detecting financial fraud; Geva et al. 2010 - search trend data successfully augment social media data for sales prediction
Spoken Dialogue System
Siri, Cortana, Bixby, etc. Hirschberg and Manning 2015
Taxonomy for Four V's of Big data
Source Type Source Site Volume Velocity Variety Veracity Data Management Computation control Archival Needs Goes 2014
Benefits of the Google File System
Supports large-scale data processing workloads on commodity hardware. Provides fault tolerance: constant monitoring, replicating crucial data, fast and automatic recovery. Delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. Meets Google's needs: a storage platform for research and development, and for production data processing. Ghemawat 2003
Cukier 2010
The data deluge - Businesses, governments, and society are only starting to tap its vast potential. Data, data everywhere - Information has gone from scarce to superabundant. That brings huge new benefits, says Kenneth Cukier, but also big headaches. All too much - monstrous amounts of data. A different game - information is transforming traditional business. Show me - new ways of visualizing data. Needle in a haystack - the uses of information about information.
Language Action Perspective
The language-action perspective (LAP) emphasizes pragmatics; not what people say, but rather, what people do with language (Winograd and Flores 1986). LAP highlights "what people do by communicating, how language is used to create a common basis for communication partners, and how their activities are coordinated through language" (de Moor and Aakhus 2006, pp. 93-94). LAP's principles are based on several important theories, including speech act theory (Searle 1969), discourse analysis, and argumentation. 1. Conversation structures: LAP advocates considering messages in the context of the conversations in which they occur. Conversations encompass interactions between users and their messages. There are different types of conversations: conversations for action, conversations for clarification, conversations for possibilities, conversations for orientation, etc. 2. Actions and context: LAP advocates the pragmatic view, which can complement the semantic perspective by emphasizing actions, intentions, and communication context through consideration of speech acts. Abbasi et al. 2018
Benefits of Tree based approach
The power of our tree-based approach stems from its simplicity, communicability, automated nature, and generalizability, while still providing nuanced analysis. The tree-based approach assumes that the propensity to self-select an intervention is observable. However, compared to propensity scores (PS), our proposed method - is computationally simpler - requires fewer ad hoc choices by the researcher - detects and provides more nuanced insights and effects - is simpler to understand and use - scales better to big data Yahav et al. 2016
Abbasi 2010 Findings
The results indicate that systems grounded in SLT can more accurately detect various categories of fake websites by utilizing richer sets of fraud cues in combination with problem-specific knowledge.
genetic algorithm-based signal detection (GASD)
The two key aspects of the method are (1) its objective function, which rewards the creation of signals that garner fewer, potentially higher-quality, alerts faster; and (2) its weighting method, which allows better contextualization of references to product, attribute, and user experience terms for each individual product. GASD attempts to better harness the diversity of wise crowds for enhanced aggregation, in an unsupervised manner devoid of overfitting. GASD learns time-series-specific weights for various product, incident, and experience terms. Abbasi et al. 2019
Martens 2016
Theories - Social Network Theory, using fine-grained transaction data. Data Source & Variables - 21 million transactions by 1.2 million customers to 3.2 million merchants, as well as 280 other variables describing the customers. The results show that there is no appreciable improvement from moving to big data when using traditional structured data. In contrast, when using fine-grained behavior data, there continues to be substantial value in increasing the data size across the entire range of the analyses. This suggests that larger firms may have substantially more valuable data assets than smaller firms when using their transaction data for targeted marketing.
Buschken and Allenby 2016
Goals: Uncover latent topics associated with user-generated topics and relate them to product ratings. Use latent topics to predict and explain consumer ratings. Identify themes associated with positive and negative reviews, comparing results from the model-free analysis reported in Table 2 to topics in the SC-LDA-Rating model. Methods: Comparative analysis of topic modeling tools: LDA, SC-LDA, Sticky SC-LDA. Data: Online reviews of hotels and restaurants; 696 reviews of Italian restaurants comprising a corpus of 43,685 words. The corpora of Manhattan hotel reviews and JFK hotel reviews comprise 73,314 and 25,970 words, respectively. Results: Model-based analysis of the data helps to reveal the combination of classification variables for which unique themes and points of differentiation are present.
Explanatory Analytics Authors
Varian 2014 - four categories for data analytics; Nichols 2007 - Cookbook for causal inference with observational data; Lin et al. 2013 - Large samples and the problem with p-values; Yahav et al. 2016 - tree-based approach
Difference between big data and data analytics
Volume - The amount of data being collected and processed Velocity - Speed of data creation Variety - All different types of data; structured/unstructured Veracity - McAfee et al. 2021
Geva et al. 2010 Results
We find that adding search trend data to models based on the more commonly used social media data significantly improves predictive accuracy. We also find that predictive models based on inexpensive search trend data provide predictive accuracy that is comparable, at least, to that of social media data-based predictive models. Last, we show that the improvement in accuracy is considerably larger for "value" car brands, while for "premium" car brands the improvement obtained is more moderate.
Zhang et al. 2018
We investigate the economic impact of images and lower-level image factors that influence property demand in Airbnb. Using Difference-in-Difference analyses on a 16-month Airbnb panel dataset spanning 7,711 properties, we find that units with verified photos (taken by Airbnb's photographers) generate additional revenue of $2,521 per year on average. For an average Airbnb property (booked for 21.057% of the days per month), this corresponds to 17.51% increase in demand due to verified photos. Leveraging computer vision techniques to classify the image quality of more than 510,000 photos, we show that 58.83% of this effect comes from the high image quality of verified photos. Next, we identify 12 interpretable image attributes from photography and marketing literature relevant for real estate photography that capture image quality as well as consumer taste. We quantify (using computer vision algorithms) and characterize unit images to evaluate the economic impact of these human-interpretable attributes.
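The logic of a difference-in-differences estimate can be shown with a basic two-group, two-period calculation. The numbers below are hypothetical and are not taken from the paper, which uses a panel regression on its Airbnb data; the sketch only shows why the control group's change is subtracted.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus the change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical average monthly revenue per property.
# Treated: properties that received verified photos; control: properties that did not.
effect = diff_in_diff(
    treated_before=100.0, treated_after=130.0,
    control_before=90.0, control_after=105.0,
)
```

Here the treated group rose by 30 and the control group by 15, so only the remaining 15 is attributed to the treatment, under the assumption that both groups would otherwise have followed parallel trends.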
Image Mining - Authors
Zhang et al. 2018 - Airbnb verified property photos lead to increased demand; Wang et al. 2018 - Copycats of mobile apps
Advantages of using Latent Topic Modeling Tools
ability to uncover collections of words that co-occur in the customer reviews. In the analysis of our reviews, we find that many words are indiscriminately used in all evaluations of hotels and restaurants (Buschken and Allenby 2016)
Two benefits of text mining for research
automated text mining allows IS researchers to 1) overcome the limitations of manual approaches to analyzing qualitative data and 2) yield insights that they could not otherwise find. Debortoli et al. 2016
Exclusion
A competitive situation in which one node, by forming a relation with another, excludes a third node; a node's power becomes a function of the powers of all other nodes in the network. (Borgatti et al. 2009)
Analysis
drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims. Boyd and Crawford 2012
Latent Dirichlet Allocation
first proposed by Blei et al. (2003), a probability distribution over a fixed set of topics defines each document, and, in turn, a probability distribution over a confined vocabulary of words defines each topic. While LDA assumes all documents to be generated from the same fixed set of topics, each document exhibits these topics in different proportions that can range from 0 percent (if a document fails to talk about a topic entirely) to 100 percent (if a document talks about a topic exclusively). The LDA algorithm computationally estimates the hidden topic and word distributions given the observed per-document word occurrences. LDA can perform this estimation via sampling approaches (e.g., Gibbs sampling) or optimization approaches (e.g., Variational Bayes). Debortoli et al. 2016
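The generative story behind LDA described above can be sketched in a few lines of Python. This is only an illustration of how a document is assumed to arise from topic and word distributions, not the estimation step (Gibbs sampling or Variational Bayes). The two topics, their word probabilities, and the Dirichlet parameter `alpha` are made-up assumptions for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each is a probability distribution over a small vocabulary.
topics = [
    {"room": 0.5, "bed": 0.3, "clean": 0.2},      # a "hotel" topic
    {"menu": 0.4, "taste": 0.4, "waiter": 0.2},   # a "restaurant" topic
]
vocabulary = {w for topic in topics for w in topic}

rng = random.Random(1)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=20, rng=rng)
```

Here `theta` holds the document's topic proportions (they sum to 1), and `words` is the generated bag of words. Fitting LDA runs this story in reverse: given only the words, it estimates the hidden `theta` and topic distributions.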
Structures
invent graph-theoretic properties that characterize structures, positions, and dyadic properties (such as the cohesion or connectedness of the structure) and the overall "shape" (i.e., distribution) of ties (Borgatti et al. 2009)
Bagging
involves averaging across models estimated with several different bootstrap samples in order to improve the performance of an estimator. Varian 2014
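A minimal sketch of bagging, assuming a simple least-squares line as the base estimator (the definition above does not prescribe one). The data points and the helper names `fit_line` and `bagged_predict` are hypothetical.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # degenerate resample (all x equal): fall back to the mean of y
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def bagged_predict(points, x_new, n_models=200, seed=0):
    """Average the predictions of models fitted to bootstrap resamples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))  # draw with replacement
        intercept, slope = fit_line(sample)
        preds.append(intercept + slope * x_new)
    return sum(preds) / len(preds)

# Made-up data lying roughly on y = 2x.
points = [(1, 2.2), (2, 3.9), (3, 6.1), (4, 8.0), (5, 9.8)]
prediction = bagged_predict(points, x_new=6)
```

Each resample yields a slightly different line. Averaging their predictions reduces the variance of the final estimate.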
Bootstrap
involves choosing (with replacement) a sample of size n from a dataset of size n to estimate the sampling distribution of some statistic. A variation is the "m out of n bootstrap" which draws a sample of size m from a dataset of size n > m. Varian 2014
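As a rough sketch of the procedure, the following resamples a small made-up dataset with replacement and uses the spread of the resampled means as an estimate of the standard error of the mean. The function name `bootstrap_means` and the data are assumptions for illustration. Passing `m` gives the "m out of n" variant.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, m=None, seed=0):
    """Return the means of bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)
    size = len(data) if m is None else m  # m < n gives the "m out of n" bootstrap
    return [
        statistics.mean(rng.choices(data, k=size))
        for _ in range(n_resamples)
    ]

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
means = bootstrap_means(data)
std_error = statistics.stdev(means)  # bootstrap estimate of the standard error
```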
Boosting
involves repeated estimation where misclassified observations are given increasing weight in each repetition. The final estimate is then a vote or an average across the repeated estimates. Varian 2014
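One concrete instance of this idea is AdaBoost with one-threshold "stumps" as the repeated estimator. The sketch below swaps in that specific algorithm as an assumption, since the definition above does not name one. The toy data are made up so that no single threshold classifies them all correctly.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict +polarity at or above the threshold and -polarity below it."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, threshold, polarity))
        # Misclassified points get more weight, correct ones get less.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Weighted vote across all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in model)
    return 1 if score >= 0 else -1

def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_round = accuracy(adaboost(xs, ys, rounds=1), xs, ys)
three_rounds = accuracy(adaboost(xs, ys, rounds=3), xs, ys)
```

On this toy data a single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly.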
LASSO
least absolute shrinkage and selection operator Varian 2014
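For reference, the LASSO estimate is usually written as a least-squares fit with an added penalty on the absolute size of the coefficients. The notation below (response $y_i$, predictors $x_{ij}$, intercept $\beta_0$, coefficients $\beta_j$, penalty weight $\lambda \ge 0$) is assumed for illustration and is not taken from the card above.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

With $\lambda = 0$ this reduces to ordinary least squares. Larger values of $\lambda$ shrink coefficients toward zero and set some exactly to zero, which is the "selection" part of the name.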
Technology
maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets. Boyd and Crawford 2012
Adaptation
nodes become homogeneous as a result of experiencing and adapting to similar social environments (Borgatti et al. 2009)
Goal of predictions
(1) Simpler models work better (2) Divide the data into training, testing, and validation sets (3) Tuning parameter (4) Classification and Regression Trees Varian 2014
Tree-based approach
provides a standalone, automated, data-driven methodology that allows for (1) the examination of nascent interventions whose selection is difficult and costly to theoretically specify a priori, (2) detection of heterogeneous intervention effects for different pre-intervention profiles, (3) identification of pre-intervention variables that correlate with the self-selected intervention, and (4) visual presentation of intervention effects that is easy to discern and understand. Yahav et al. 2016
Classification and Regression Trees
set of data mining algorithms that generate IF-THEN rules to link predictors to an outcome.
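To make the IF-THEN idea concrete, here is a simplified sketch that recursively splits on the threshold minimising Gini impurity and prints each leaf as a rule. It omits pruning and other parts of a full CART implementation. The feature names, the data, and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def extract_rules(rows, labels, names, conditions=()):
    """Grow the tree recursively and return one IF-THEN rule per leaf."""
    split = best_split(rows, labels)
    if split is None:  # no split improves purity: emit a leaf rule
        majority = Counter(labels).most_common(1)[0][0]
        cond = " AND ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN {majority}"]
    f, t = split
    rules = []
    for op, keep in (("<=", lambda v: v <= t), (">", lambda v: v > t)):
        part = [(r, y) for r, y in zip(rows, labels) if keep(r[f])]
        part_rows = [r for r, _ in part]
        part_labels = [y for _, y in part]
        rules += extract_rules(part_rows, part_labels, names,
                               conditions + (f"{names[f]} {op} {t}",))
    return rules

# Made-up predictors (visits, rating) and outcome (repeat purchase yes/no).
rows = [(1, 5), (2, 7), (3, 6), (6, 2), (7, 9), (8, 3)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
rules = extract_rules(rows, labels, names=["visits", "rating"])
for rule in rules:
    print(rule)
```

On this data the tree needs a single split on `visits`, giving one rule per side of the threshold.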
Ties
similarities, social relations, interactions, and flows (Borgatti et al. 2009)
Binding
social construction can bind nodes together in such a way to construct new entities (Borgatti et al. 2009)
Supervised machine learning (SML)
software programs take training data sets as input and estimate or "learn" parameters that can be used to make predictions on new data. When SML applications are used "off the shelf" without understanding the underlying assumptions or ensuring that conditions like stability are met, then the validity and usefulness of the conclusions can be compromised. It is sometimes important for stakeholders to understand the reason that a decision has been made, or decision-makers may need to commit a decision rule to memory. Athey 2017
Random Forest
technique using multiple trees (they don't offer simple summaries of relationships in the data) Varian 2014
Conversation Disentanglement
the ability to accurately affiliate messages in discussion threads with their respective conversations. From a LAP perspective, conversations are an important unit of analysis that is presently not represented in text/social media analytics systems: messages are too atomic and threads encompass multiple intertwined conversations (Elsner and Charniak 2010). Abbasi et al. 2018
Coherence analysis
the ability to infer reply-to relations among series of messages within a discussion thread (Nash 2005). Social media technologies make it difficult to accurately infer interrelations between messages (Honeycutt and Herring 2009), impacting quality of participant interaction and social network information (Aumayr et al. 2011; Khan et al. 2002). Abbasi et al. 2018
Message speech act classification
the ability to infer the speech act composition of messages within discussion threads - for instance, assertions, questions, suggestions, etc. (Kim, Li, and Kim 2010). Abbasi et al. 2018
Mythology
the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy. Boyd and Crawford 2012
Rai 2016
thoughts on (1) changes in the practices to generate and source data for research, (2) certain cautions that arise from these changes, and (3) synergies that can be achieved between big data and the testing, elaboration, and generation of theory in IS through research designs and method
Strength of Weak Ties
weak ties are more likely sources of novel information (Borgatti et al. 2009)
