BUSI 344 - R&D Questions
If the correlation between two variables is strong (say, between rent per square foot and quality of retail space), is the relationship deemed to be linear?
Not necessarily, because two variables may have high correlation but exhibit a nonlinear relationship. This is a good reason for visually exploring the relationships between the variables. Using linear regression to predict outcomes for a dependent variable where a nonlinear relationship exists will produce flawed results.
(a) Which is better as a unit of comparison, continuous variables or categorical variables? (b) Which is better for control and blocking, continuous variables or categorical variables?
(a) A continuous (measure) variable, used as a unit of comparison, can significantly improve the precision of an analysis--there is less flexibility with a categorical (discrete) variable. (b) For control and blocking, the best are binary (yes/no) variables. Next best are categorical (discrete) variables with more than two choices.
Steps for building MRA Model
1. Specify the Model (MV = LV + BV)
2. Review the Variables
3. Examine the Variables
4. Transformations
5. Examine the Transformed Variables
6. List the Variables for Calibration
7. Model Calibration
8. Test and Evaluate the Model
How well is the appraisal industry in Canada positioned for adoption of Appraisal Assisted Valuation Models? What barriers do you see for the growth of these AVMs?
AVMs are still in their infancy in Canada. Most of the AVM use is for mortgage insurance or evaluation of pooled mortgage funds (secondary markets). At the present time, the main barriers to further growth of AVMs are:
• acceptance by primary lenders;
• widespread availability of low cost property data; and
• lender and real estate professionals understanding AVM technology, and the risks and benefits associated with its use.
The AIC has recommended that the AVM industry, lenders, and appraisal professionals collaborate to improve public understanding of AVMs and develop national standards for performance.
When would absolute frequencies be more useful than relative frequencies?
Absolute frequencies are more useful than relative frequencies when the sample size is small, since a relative (percentage) comparison may not be meaningful with only a few observations.
Identify the main differences between computer-assisted drafting (CAD) systems, automated mapping/facilities management systems (AM/FM), and geographic information systems (GIS).
• Computer-Assisted Drafting Systems (CAD): suitable for traditional mapping functions.
• Automated Mapping/Facilities Management Systems (AM/FM): provide sophisticated databases for storing and manipulating attribute information, but are limited in their ability to analyze relationships between different layers other than through visual inspection (overlapping of layers).
• Geographic Information Systems (GIS): developed for spatial analysis needs such as planning, natural resources, and land records management. GIS can completely integrate spatial data and attribute data among different layers. The GIS approach is ideal for multipurpose users.
Provide four examples of how GIS might be used in real property appraisal.
Examples of how appraisers might use a GIS include:
• Verify property structures
• Verify site size and attributes
• Identify changes in structures or land use
• Identify the pool of comparable sales and narrow it down by similarity
• Analyze proximity to value influences, either positive (e.g., parks, schools) or negative (e.g., industrial uses)
• Identify neighbourhood boundaries
Assume you need to develop a regression model to explain the impact of view on high-rise condo sales in Burnaby. What type of model would you develop, predictive or explanatory? Why?
Explanatory model. An explanatory model provides the most accurate values for the coefficients, in this case VIEW. A predictive model provides the best possible estimate of selling price, but not necessarily the most reliable estimates of the coefficients.
In a modelling exercise, you are attempting to determine if a "sea-glimpse" view is significant in terms of necessitating an adjustment in the model. You have completed a Kruskal-Wallis test and found the following results:
1. N = 200
2. Mean Rank = 168 for view = 1; Mean Rank = 85 for view = 0
3. Chi-Square = 6.2
4. Asymp. Sig. = 0.04
What is the expected mean rank? What can you conclude from these results?
For a 200 sale database, the expected mean rank is 100. The higher mean rank for view properties versus non-view properties indicates a difference in the valuation of these property types. The chi-square statistic confirms the sea-glimpse data is significantly different: the calculated chi-square is 6.2, and the probability of obtaining a test statistic this large if properties with and without a sea glimpse were valued the same is 0.04, below the 0.05 threshold. Therefore, we conclude an adjustment for this characteristic is necessary.
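For readers who want to reproduce this kind of test outside SPSS, here is a minimal Python sketch using scipy's kruskal function; the sale price arrays are hypothetical and only illustrate the mechanics:

    from scipy import stats

    # Hypothetical sale prices for sea-glimpse (view = 1) and non-view (view = 0) properties
    view_prices = [412_000, 398_500, 450_000, 430_000, 475_500]
    no_view_prices = [355_000, 340_000, 372_500, 361_000, 348_000]

    stat, p_value = stats.kruskal(view_prices, no_view_prices)
    print(f"Chi-square (H statistic): {stat:.2f}")
    print(f"Asymp. Sig. (p-value):    {p_value:.3f}")

    # A p-value below 0.05 indicates the two groups are valued differently,
    # supporting an adjustment for the view characteristic.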
Appraisal has historically used less intensive data analysis methods. Why? How has this changed today?
Historically, appraisal has used less intensive data analysis methods because data was poor in quality, difficult to obtain, and difficult to analyze. Today, good quality data is much more easily available, and computing power and software have made it much simpler to analyze effectively.
Real estate investments are considered a good hedge against inflation. What does this tell you about the correlation between house prices and inflation?
If real estate is in fact a good hedge against inflation, housing prices and inflation must be (very) positively correlated. As such, when inflation increases, so would house prices. For example, let's examine a house bought for $200,000 in 1990. Because housing prices should rise with inflation, the same house will be worth more today. However, because of inflation, if the original $200,000 is held in cash since 1990, it is worth less today as it has less purchasing power. Therefore, by holding onto the real estate, you have hedged against inflation, and retained value in your investment. Whether or not real estate is actually a good hedge against inflation is a very debatable topic.
A young couple is looking to purchase a property. They would like to be within five kilometres of their favourite park, but prefer to be at least three kilometres from the highway. They want a house with at least three bathrooms and a lot larger than two acres. How might they use GIS to find suitable properties? (In your answer, you may make any necessary assumptions regarding access to data and computer systems).
If this couple had access to a suitable GIS, they could quickly and easily identify suitable properties. First, they might establish a point buffer, showing properties within five kilometres of the park. Second, they might establish a line buffer for properties more than three kilometres from the highway. By overlaying these maps, they can focus on their optimal market area. Third, they could query the GIS to display properties that have three or more bathrooms and lots larger than two acres, perhaps using colour coding. This would identify their pool of potential properties. Finally, they could view which of these properties might be available for sale.
In plotting unstandardized predicted dependent values against unstandardized residuals, what type of outcome are we looking for?
In plotting unstandardized predicted dependent values against unstandardized residuals, we are looking for a random outcome with a low R² value. This means that there is no pattern to the remaining unexplained variance in the regression model.
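As an illustration only, a minimal Python sketch of this diagnostic plot is shown below; the sales data and model are hypothetical:

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    # Hypothetical sales data purely for illustration
    sales = pd.DataFrame({
        "fin_area":   [1_100, 1_450, 1_800, 2_050, 2_400, 2_750],
        "sale_price": [248_000, 301_000, 352_000, 396_000, 452_000, 497_000],
    })
    results = smf.ols("sale_price ~ fin_area", data=sales).fit()

    # Plot unstandardized predicted values against unstandardized residuals
    plt.scatter(results.fittedvalues, results.resid, s=20)
    plt.axhline(0, color="grey", linewidth=1)
    plt.xlabel("Unstandardized predicted value")
    plt.ylabel("Unstandardized residual")
    plt.title("Residuals vs predicted values (looking for a random cloud)")
    plt.show()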
Case Study 7 employs regression techniques that apply the coefficient of determination (R2) and the standard error of the estimate. Can you see any weaknesses in these case studies where excessive reliance was placed on one or two statistics, such as R2 or the SEE?
R² represents the approximate percentage of variation in the dependent variable which is explained by the basic regression equation and is one measure of the effectiveness of the regression. The standard error of the estimate, another measure of how well the regression fits, represents the remaining dispersion in the data after the regression equation is applied. However, in Case Studies 6 and 7, there was little information provided in the descriptive statistics used to examine the data variables. It is possible that multicollinearity existed in some of the regression equations or that the regression was, in fact, non-linear, with more data transformations required.
What would happen if you changed the parameters in a step-wise linear regression analysis, as follows:

          Entry Probability   Removal Probability
    From        .15                  .20
    To          .05                  .10

Test the outcomes with the Regina3 model. How do you determine the best thresholds for step-wise regression analysis?
Reducing the entry and removal probability thresholds has the effect of eliminating more variables which do not meet the stricter test. The analyst should consider the objectives of the regression analysis in setting the entry and removal probabilities. If the goal is a regression model which balances predictive and explanatory outcomes, the thresholds can be more relaxed (higher probabilities) than for a model which requires a very high level of confidence in each coefficient for explanatory purposes.
Assume you were working with a large dataset of office vacancy observations in a large market area. The data consists of observations of vacant space (sq ft) and gross leasable area (GLA) for each building sampled. Your interest is predicting vacancy for a specific range of office buildings. How would you accomplish this task with SPSS? What variables would be required?
Regression analysis could be performed to determine the relationship of vacancy (as the dependent variable) to gross leasable area (as the independent variable). Other variables to consider would be the class of building, age, location, etc., and these could be incorporated to perform multiple regression analysis.
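Although the course uses SPSS, the same simple regression step could be sketched in Python as follows; the building data here is hypothetical and only illustrates the mechanics:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical sample of office buildings: vacant space and gross leasable area (sq ft)
    offices = pd.DataFrame({
        "gla_sqft":    [45_000, 60_000, 52_000, 80_000, 38_000, 95_000],
        "vacant_sqft": [ 4_200,  6_900,  5_100,  9_800,  3_000, 12_500],
    })

    model = smf.ols("vacant_sqft ~ gla_sqft", data=offices).fit()
    print(model.summary())   # coefficients, R-squared, standard error of the estimate

    # Predict vacancy for a specific building size, e.g., 50,000 sq ft of GLA
    print(model.predict(pd.DataFrame({"gla_sqft": [50_000]})))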
What is the difference between "reliability" and "credibility"?
Reliability is the more statistical term, relating to low variability (precision). Reliability is measurable. Credibility means worthy of belief and is more subjective, but equally important. Appraisal credibility can only be evaluated in terms of the intended use of the appraisal. Credibility includes appropriateness of the model used, as well as mathematical reliability.
After running a regression, you find that the model yields an SEE of 5,000. Is this a good result? What are the problems with using SEE as a measure of "goodness of fit"?
SEE is an absolute measure, meaning size alone does not tell us very much. In order to use SEE to analyze "goodness of fit", we must convert it to a coefficient of variation (COV) by dividing the SEE by the mean. The COV tells us how well our model is doing in relative terms (percentage). A target COV for a good model is less than 10%.
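For example, converting an SEE of 5,000 into a COV is a one-line calculation; the mean sale price used below is hypothetical:

    # Hypothetical figures purely for illustration
    see = 5_000           # standard error of the estimate from the regression
    mean_price = 240_000  # mean of the dependent variable (e.g., sale price)

    cov = see / mean_price
    print(f"COV = {cov:.1%}")   # about 2.1% here, well under the 10% target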
How is similarity identified and measured?
Similarity is defined by problem identification, including assignment conditions. It is measured by the four dimensions of similarity: transaction, time, space, and physical/financial utility.
What are some statistical measures that you can use to test for the "normality" of your data sample? What are the strengths and weaknesses of the various approaches?
Tests for normality include the Kolmogorov-Smirnov and Shapiro-Wilk tests. Normal distributions also have an equivalent mean, median and mode. For a normal distribution, approximately 68% of the data should lie within one standard deviation from the mean, 95% within two standard deviations, and 99% within three standard deviations. A histogram and normal probability plot can also be used to help determine if a data set is normally distributed.
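Both of these tests are available in Python's scipy library; below is a minimal sketch using a hypothetical array of sale prices:

    import numpy as np
    from scipy import stats

    # Hypothetical sale prices purely for illustration
    prices = np.array([355_000, 372_000, 399_000, 410_000, 428_000,
                       441_000, 463_000, 480_000, 512_000, 550_000])

    # Shapiro-Wilk (generally preferred for smaller samples)
    w_stat, w_p = stats.shapiro(prices)

    # Kolmogorov-Smirnov against a normal curve with the sample's own mean and std
    ks_stat, ks_p = stats.kstest(prices, "norm", args=(prices.mean(), prices.std(ddof=1)))

    print(f"Shapiro-Wilk p = {w_p:.3f}, Kolmogorov-Smirnov p = {ks_p:.3f}")
    # p-values below 0.05 suggest the data departs from a normal distribution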
What are the four appraisal principles of experimental design? Provide a one-sentence explanation of each.
The four appraisal principles of experimental design are control, randomization, replication, and blocking. (a) Control: limiting data to homogeneous group(s). (b) Randomization: relying on control (bracketing and balancing), so that elements are distributed by chance. (c) Replication: relying on similar "experiments," or similarity of missing information. (d) Blocking: exploiting "natural" groupings of variables.
What statistic is most useful in identifying the best unit of measure?
The main statistical measures useful in identifying the best unit of comparison are the correlation between sale prices and that particular variable (the higher the better) and the coefficient of variation of the variable itself (the lower the better).
What is an attribute and why is it important to a GIS?
The term attribute describes any piece of information about an object that can be stored in addition to its geographic properties. GIS can tell you not just the shape of a feature, but also what it is and any other information that may exist about it. A conventional map depicts information, but a GIS stores this information within the map data itself. The maps shown in a GIS are intelligent - the features know their own identity.
Name two major types of changes in market conditions (time)?
The two major types of changes in market conditions (time) are trends and event impacts.
Real estate terminology is very specialized. A variable describing office building class (e.g., Class A, Class B, Class C) would be what type of data variable? What possible problems might you experience in relying on this building class variable?
This Building Class variable is an example of an "Ordinal" variable. Each class is related to the other and provides an indication of which class is "better" than another, but not any objective indication of how much "better" one class is in relation to another. The problem with ordinal data variables is that they are often based on subjective interpretation. For example, one person's understanding of an office building's class may be very different than someone else's interpretation. When conducting data exploration, you need to pay close attention to the definition and consistency of application of ordinal variables.
An analyst has completed a valuation model for real estate prices and is now recommending applying the model to assess all houses in the city for property tax purposes. However, the Assessor first wants to test the ability of the model to reliably predict outcomes for real estate prices. What do you recommend?
To test a model, you compare the predicted model price to the actual sale price, expressed as the ratio of predicted to actual (PAR). If the PAR is above 1, the predicted prices are greater than the actual prices. If the PAR is less than 1, the predicted prices are lower than the actual prices.
To include an ordinal variable for property characteristics (e.g., view) in a regression model, you may transform the variable into separate binary variables. What is a disadvantage of using binary variables versus another re-coding approach?
Transformation of ordinal variables into binary variables means that the values for each variable are limited to two discrete numbers, and no other numbers. Therefore, a binary variable is required for each characteristic to be studied, with the database possibly becoming large and complex. A way to overcome this problem is to transform ordinal variables into a single variable and use numbers to represent the different qualities of view (e.g., no view = 0, moderate view = 1, excellent view = 1.6, where an excellent view is known to be 60% more valuable than a moderate view). Rather than three separate codes for different views (as in the Lesson 5 example), it would be possible to have one view variable with three possible values.
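A minimal Python sketch of the two re-coding approaches is shown below; the view ratings and the 1.6 weight are hypothetical illustrations, not market evidence:

    import pandas as pd

    # Hypothetical view ratings for five sales
    df = pd.DataFrame({"view": ["none", "moderate", "excellent", "none", "moderate"]})

    # Approach 1: separate binary (dummy) variables, one column per category
    dummies = pd.get_dummies(df["view"], prefix="view")
    print(dummies)

    # Approach 2: a single linearized variable, using weights that reflect the
    # (assumed) market relationship: excellent worth 60% more than moderate
    weights = {"none": 0.0, "moderate": 1.0, "excellent": 1.6}
    df["view_linear"] = df["view"].map(weights)
    print(df)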
How could you use visual presentation aids to help a client understand a statistical analysis?
Visual aids, such as graphs, provide the opportunity to simplify complex relationships between data variables so that the key messages about the data become clear.
You have completed a mass appraisal model building exercise and are now in testing mode. You have found the following results. What factor should be applied to the predicted selling prices created by the model in order to bring the estimated sale prices in line with the target ratio of 1.000? (a) The median PAR for a neighbourhood is 1.087 with a 95% confidence interval of 1.035 to 1.111. (b) The median PAR for a neighbourhood is 1.017 with a 95% confidence interval of 0.975 to 1.111.
(a) Because the confidence interval for the median is above 1, we are reasonably certain an adjustment is necessary. We are at least 95% confident that the median ratio is not equal to 1. We will apply an adjustment of 1 / 1.087 = 0.920. Multiplying the predicted values of all properties in this neighbourhood by this 92.0% factor will reduce their predicted values such that, on average, the predictions equal the sale prices. (b) Because the confidence interval for the median overlaps 1, no adjustment is necessary. We cannot be certain, at a 95% confidence level, that the median ratio is different than the target of 1.
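The calculation of the median PAR and the adjustment factor can be sketched in a few lines of Python; the predicted and actual prices below are hypothetical:

    import numpy as np

    # Hypothetical predicted values and actual sale prices for one neighbourhood
    predicted = np.array([435_000, 512_000, 388_000, 460_000, 405_000])
    actual    = np.array([400_000, 470_000, 362_000, 420_000, 375_000])

    par = predicted / actual
    median_par = np.median(par)
    print(f"Median PAR = {median_par:.3f}")

    # If the confidence interval for the median excludes 1, apply the factor
    adjustment = 1 / median_par
    print(f"Adjustment factor = {adjustment:.3f}")
    adjusted_predictions = predicted * adjustment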
(a) List the sequence of five data group subsets we have learned regarding data reduction. (b) Which of the above sets is most similar to the traditional three comparable sales?
(a) Five subsets in data reduction: database, data frame, market dataset, information set, and illustrative set. (b) The illustrative set best serves the simplicity aspects of the traditional "3 comps" report.
(a) How is information "better" than data? (b) How is an information set different from the market dataset?
(a) Information is data that has been organized to make it useful and understandable. (b) An information set is a subset of the data set, useful for a particular analytic method.
(a) What are the four dimensions of comparability (similarity) and why are they important to identifying the right market? (b) The transaction dimension has three elements. How do these differ substantially from the other three dimensions? (c) The utility dimension encompasses the two fundamental types of property benefits, amenities and income. What are the three forms of utility?
(a) The four dimensions of comparability are: transaction elements; time; space; and utility characteristics. They are important because their relative independence enables simple regression and centres comparison methods. (b) The elements of transaction terms do not measure any attributes of the property, only the contract and motivation. (c) Similar to optimal use issues, the three forms of utility are physical characteristics, legal permissibility, and financial characteristics.
In what ways does a geographic information system offer a more powerful tool than conventional maps?
A GIS displays location elements like a map, but also stores attribute information about objects on the map. One of the key advantages of GIS over conventional maps is the ability to layer information. When compiling a conventional map, one has to draw a balance between displaying as much information as possible to make the map useful, without adding so much detail that it becomes cluttered and confusing. With GIS, this problem is removed - many different layers of information can be added, and shown in different combinations and in a different order, depending on the particular message to be conveyed. By switching different data layers on and off, the user can create many different views of the same location.
What are the key elements needed to create a GIS?
A Geographic Information System (GIS) is created through the integration of data, people, hardware, software and methods (applications). GIS is all about taking advantage of the visual power of maps by incorporating different types of data within user friendly systems to create dynamic and interactive maps.
Explain what buffers are and give an example for each type of buffer.
A buffer is a shape drawn on a map, which represents the total area within a certain distance of a given feature (point, line, or area). You can use a GIS to generate buffer zones and then identify all features that lie within a particular distance. Buffers can be point, line, or area. Buffer examples:
• Point buffer: select all properties within 10 kilometres of a nuclear power facility.
• Line buffer: select all properties within 100 metres on either side of an electrical line right-of-way.
• Area buffer: for all properties within Oldtown neighbourhood, select those buildings older than 50 years.
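For illustration, a point buffer query could be sketched in Python with the GeoPandas library; the file names and layers below are hypothetical, and a projected coordinate system measured in metres is assumed:

    import geopandas as gpd

    # Hypothetical layers; assumes a projected coordinate system measured in metres
    parcels = gpd.read_file("parcels.shp")    # property parcels
    facility = gpd.read_file("facility.shp")  # single point: nuclear power facility

    # Point buffer: 10 km zone around the facility
    zone = facility.geometry.buffer(10_000).iloc[0]

    # Select all parcels that fall within the buffer zone
    within_zone = parcels[parcels.geometry.within(zone)]
    print(len(within_zone), "properties within 10 km of the facility")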
Eliminating outliers is an initial data screening activity. What type of problems can you encounter with elimination of these unexplainable data occurrences?
A high level of care is required when "pruning" outliers since you may be altering the statistical relationships between data to confirm your hypothesis. In other words, instead of using the data to determine relationships, you may instead be changing the data to prove your own pre-conceived notions. It is important to document the rationale for elimination of outliers and be consistent in your approach, in order to ensure credibility of the final model outcome.
Provide an example of how you would use an overlay.
A key advantage of being able to layer data in a GIS is to carry out overlay operations. This means combining more than one layer of data to create a new set of data (analogous to combining layers of coloured cellophane, where you combine yellow and blue to make green). The example from the lesson involved a farmer who needed a certain level of rainfall and a type of soil to successfully grow a crop. By combining the rainfall map and soil type map, the farmer can identify the best location. Another example might be a developer targeting construction of seniors facilities. The developer might combine maps of age, socioeconomic status, and seniors' facilities to pinpoint an optimal location.
Prior to building a model to predict property value based on a group of property sales, time adjusting those sales may be important. What time adjustment is likely required for sales in a market that is (a) fairly constant, (b) rising, or (c) declining? What tools are most effective to determine the need for a time adjustment?
A stable market requires no time adjustments. A rising market likely requires older sale prices to be adjusted upwards, while a declining market would require older sales to be adjusted downwards. The most effective tools to test for the need for a time adjustment are scatterplots, boxplots, and the Kruskal-Wallis test.
The first step in testing for multicollinearity is conducted during data-screening where the correlation of each of the independent variables is determined. What other steps can be taken to ensure that multicollinearity is not present in your model?
After creating your model, you should examine both the Tolerance and VIF statistics for each variable, where Tolerance = 1/VIF. If the Tolerance of any of the variables is less than 0.3, or equivalently, the VIF is greater than 3.333, multicollinearity exists and the model should be revised.
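Outside SPSS, these statistics can be computed with statsmodels; the sketch below uses randomly generated, hypothetical variables in which finished area and bedrooms are deliberately related:

    import pandas as pd
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical independent variables; fin_area and bedrooms are deliberately related
    rng = np.random.default_rng(1)
    fin_area = rng.uniform(900, 2_800, 50)
    X = pd.DataFrame({
        "fin_area": fin_area,
        "bedrooms": (fin_area / 700).round() + rng.integers(0, 2, 50),
        "age":      rng.uniform(0, 60, 50),
    })

    X_const = sm.add_constant(X)   # include the intercept, as in the fitted model
    vif = pd.DataFrame({
        "variable": X_const.columns,
        "VIF": [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    })
    vif["Tolerance"] = 1 / vif["VIF"]
    print(vif)   # flag any variable with Tolerance < 0.3 or VIF > 3.333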
You hired a consultant to complete a statistical analysis predicting the need for seniors housing in Langley, BC. In reviewing the results, should you focus on the reliability of the forecast in relation to other benchmark data? Or do you need to examine the consultant's interpretation of the underlying data relationships?
Both approaches will be necessary. Data exploration, in particular, will be necessary to determine the strengths and weaknesses of the statistical analysis. The results may appear reasonable, but end up not adequately supported by the underlying data and analysis.
What are the similarities and differences between traditional appraisal and appraisal using automated valuation models? What are the advantages and disadvantages of each?
Both conventional appraisal and AVMs have the same roots in valuation theory and approaches to value. AVMs simply automate the valuation process that is inherent in traditional "one-off" appraisals. The main differences relate to the power of technology in reducing the cost of appraisal, improving the consistency, and greatly increasing the speed of the output. While AVMs offer the above advantages, they come with some distinct limitations:
• They require a considerable amount of data to produce reliable outcomes;
• AVMs will not work well for unusual properties or situations where properties are distressed through flooding or urban decay; and
• AVMs require in-depth understanding before they can be relied on - many appraisers do not have the statistical training or experience to understand the appropriate application of AVMs.
Assume you have just completed a regression analysis to predict the improvement value of housing in a Canadian city. Your client is an insurance company, who will rely on the outcome for loan underwriting purposes. How would you document and explain the quality of the model to your client?
In explaining your model to your client, you should state the steps involved in: • defining the project goals and assumptions; • capturing the data; • analyzing the data relationships and specifying the model; • calibrating the model; and • testing the outcomes. You should identify the accuracy of the model in terms of its ability to predict improvement values. You should also identify the limitations of the model - i.e., where it should not be used, such as housing of a certain type or age. The challenge for the researcher will be "de-mystifying" the model building process for clients with only a limited familiarity with statistics. In this case, insurance professionals are likely to have a better understanding of probability and statistical measures than many other clients, so it may be possible to provide more in-depth information on the work undertaken.
Additive multiple regression includes a major assumption that the impact of the coefficient for a specific independent variable x1 is independent of the impact of other variables, e.g., x2, x3, x4, etc. In other words, the impact of one independent variable on the dependent variable Y is assumed to not be related to changes in another independent variable. When these assumptions turn out to be false, what problem do we have? How can this issue be overcome?
Multicollinearity. This issue can be identified during initial data exploration. If two independent variables show high correlation, there is a potential for problems in the model. It may be necessary to exclude one or more variables from the model and re-test the regression. The Tolerance and VIF statistics are tests for multicollinearity which can be applied to regression models. A low Tolerance (less than 0.3) and high VIF (greater than 3.333) outcome is a warning sign that multicollinearity exists.
Multiple regression analysis (MRA) is a powerful tool. Why don't we just use MRA to solve all valuation problems?
Multiple regression analysis (MRA) is a powerful tool, but we have emphasized other tools in this lesson for two reasons: 1) simple graphs and two-variable statistics are easier to understand and explain to clients; 2) comparison of means and medians as well as simple regression can be used in conjunction with MRA, or within traditional procedures.
Explain the difference between simple linear regression and multiple regression.
Multiple regression includes two or more independent variables; simple linear regression includes only one independent variable. Multiple regression is difficult to depict spatially since it involves three or more dimensions, while simple linear regression is in two dimensions and can be readily displayed in a graph. Multiple regression involves much more complex calculations than simple linear regression, which results in more robust outcomes but is more difficult for clients and other real estate professionals to understand.
What are the advantages and disadvantages of various types of AVMs (e.g., hedonic, price indices)? Provide some examples of the users for different AVM products.
Price Indices. Very simple models which rely on repeat sales to generate a price index for a specific geographic area, generally defined by zip or postal code. This method does not rely on the detail present in a hedonic model and hence the output is subject to more variability. The Zillow "zindex" is an example of a price index AVM (explored later in the lesson). The reavs AVM is another example.
Hedonic. The most common AVMs, largely based on statistical models using some form of linear regression. The advantage is that these take advantage of a large number of property attributes, such as location, property size, and nature of improvements, to produce reasonably accurate outcomes when data is current. The disadvantages are the complexity, cost to build and maintain, data requirements, and public understandability. AVMs developed by Landcor and MPAC (Municipal Property Assessment Corporation) are examples of hedonic systems.
Expert Systems. Models the behaviour of experts (i.e., appraisers) using mathematical relationships. These systems are complex, expensive, and not very adaptable to changing conditions. This is the latest trend in AVMs. The user would respond to a series of questions which would allow the logic engine to search the knowledge base and lead to a conclusion. Zillow and reavs both have some elements of an expert system.
Neural Networks. An attempt to use computing resources to mimic or model the way the human brain works. Due to the complexity and highly theoretical nature of these networks, they have not been widely adopted as the basis for AVMs.
What is graphical analysis? List some applications of graphical analysis in property valuation.
Refer to The Appraisal of Real Estate, Second Canadian Edition, page 18.10. Graphical analysis is a variant of statistical analysis in which the appraiser arrives at a conclusion by visually interpreting a graphic display of data and applying statistical curve fit analysis. It is primarily useful when there is not a statistically significant volume of data available for full statistical analysis. Applications include:
• illustrating market reaction to variations in the elements of comparison;
• plotting the most reliable equation for the best fit curve;
• identifying the most appropriate equation of those commonly used to solve for an adjustment;
• strengthening logical arguments;
• enhancing the objectivity of an appraisal by allowing the comparable data to articulate "collective market patterns"; and
• testing market value estimated by the three traditional approaches to value, thereby avoiding valuation errors.
In Case Study 1, a sales indexing approach was used to provide a preliminary indication of the value range for a property which is appraised. Can you see any scenarios in which this approach might not work or lead to unexplainable outcomes?
Sales indexing won't work well if there are few sales in the prior period of interest or current date of valuation, to establish historic and contemporary median sales prices. While this approach can be used for typical housing styles within a neighbourhood, sales indexing won't account for non-typical property attributes, such as larger or smaller than typical floor space, lot size, age, or non-typical design.
Which would be a better tool for analyzing the relationship between two continuous variables: a boxplot, scatterplot, or histogram?
Scatterplot, since the other graphing tools would produce outcomes which could likely not be interpreted, especially if a large volume of data was involved. Boxplots and histograms can be used for continuous variables if you recode them into smaller classes (e.g., divide living area into three segments: large, medium, and small).
What information does the SPSS Casewise Diagnostics report provide? What action should you take if data occurrences are noted in the report?
The Casewise Diagnostics report in SPSS provides a list of data occurrences which do not meet the threshold for standard error. In other words, if the threshold for standard error was set at 3, then the report will display any "outliers" that have residuals or errors more than ±3 standard deviations from the mean. In a histogram, this would be the observations at the farthest ends of the tail of the normal curve. There are several possibilities which may account for the unexpected outcome for this data. There may be an error in initially recording a data value for one or more variables or the error may be simply unexplainable. If a review of the data for each record reveals no apparent problems, it may be appropriate to remove the outliers from the regression model and re-run the analysis. Factors to consider are the number of outliers to be removed in relation to the number of data occurrences since extracting a large number of outliers will impact the credibility of the model outcomes.
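A rough Python equivalent of this check is sketched below; it assumes a fitted statsmodels OLS result named results and the source DataFrame df (both hypothetical names), and uses a simple approximation of standardized residuals:

    import numpy as np

    # Approximate standardized residuals (residual divided by the regression's
    # root mean squared error)
    std_resid = results.resid / np.sqrt(results.mse_resid)

    # Cases whose residuals fall more than 3 standard deviations from zero
    outliers = df[np.abs(std_resid) > 3]
    print(outliers)

    # Review these cases for data errors before deciding whether to remove
    # them and re-run the regression.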
What is the VIF statistic useful for?
The VIF (Variance Inflation Factor) statistic is an indicator of multicollinearity. A VIF value greater than 3.333 provides strong indication that multicollinearity exists for an independent variable.
Case Study 2 addresses acceleration and deceleration of market demand for single family detached dwellings. In your own words, explain this concept and whether or not you agree with it as a helpful tool for market analysis.
The advantages of this method include that it is relatively easy to perform, the data (MLS) is readily available from the local Real Estate Board, it may offer the analyst an advantage over their competition, and may provide the basis for good client advice. However, the moving average approach may not be sufficient to fully address the inter-relationships within the data (called autocorrelation — how one period's value is closely related to the next). Controlling for autocorrelation may highlight a new or different pattern in the data.
Two reasons were discussed for why an analyst might need to go back to expand an original dataset or even widen the data frame. One reason was simply that more data was needed. What was the other major reason and why is it important?
The analyst may need to expand a data set or data frame if more data is needed or if the optimal use is not what was originally asserted when the appraisal problem was identified.
A real estate analyst is developing a regression model to predict the rent which can be achieved for different types and sizes of office tenancies in Kanata suburban office parks. Two of the variables in her proposed model are square feet of Rentable Area (reflects "grossed up" area which includes tenant's share of common area) and office Useable Area (actual area occupied by the tenant - usually smaller than rentable area). Can you see a potential problem that the analyst may encounter?
The analyst should complete descriptive statistics and boxplots for each of these variables (in relation to office rents) to see if the two independent variables may be related. The next step is to examine the correlation between the two variables with a scatterplot and bivariate analysis. It is very likely that this analysis will reveal a very high correlation between rentable area and useable area. The analyst will then have to decide which variable to exclude from the regression analysis. Since office rent is normally based on rentable rather than useable area, it will probably make more sense to exclude useable area from the regression. If both variables are left in the regression, the analyst will have a model with a severe case of multicollinearity.
In Step 3 of a regression analysis, an appraiser has developed the following correlation analysis for single family dwellings. What can we learn from this table? Note: Condition Rank and Quality Rank have been transformed from ordinal variables showing ranks (e.g., 1 to 10) into linearized variables ranging from 0 to 1.

Pearson Correlations - Single Family Dwelling Analysis

                  Fin Area  Bedrms  Stories    Age   Cond Rank  Quality Rank  Lot Size  Sale Price
Fin Area sq ft       1       .895    .674    -.45      .185        .001         .629       .891
Bedrms count        .895      1      .563    -.231     .022        .320        -.070       .921
Stories count       .674     .563     1       .567     .234        .397        -.105       .769
Age                 -.45    -.231    .567      1       .764        .830         .392       .562
Cond Rank           .185     .022    .234     .764      1          .932        -.021       .852
Quality Rank        .001     .320    .397     .830     .303         1           .041       .732
Lot Size sq ft      .629    -.070   -.105     .392    -.021        .041          1         .331
Sale Price          .891     .921    .769     .562     .852        .732         .331        1
The correlation table reveals a number of concerns with independent variables which appear to be related. For example, Fin Area and Bedroom Count have a high correlation (over .8). Stories has a high correlation with Fin Area and Bedrooms, and a moderate correlation with Age. Age and Condition, as well as Age and Quality, also have a high correlation (close to .8), not surprisingly. Condition and Quality have a very high correlation. Consequently, it would be necessary to exclude one variable from each of these highly related pairs, re-test the regression, and determine if the correlation issue is eliminated. Another option would be to develop an "interaction model", in which one independent variable has the effect of magnifying the impact of another. This type of analysis is complex and beyond the level of BUSI 344.
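To screen for these pairs programmatically, a minimal Python sketch is shown below; it assumes a pandas DataFrame df (a hypothetical name) containing the variables from the table above, including "Sale Price":

    # Correlations among the independent variables only
    corr = df.drop(columns=["Sale Price"]).corr()

    # Report pairs of independent variables with |r| greater than 0.8
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > 0.8:
                print(f"{a} vs {b}: r = {r:.3f}")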
Assume you had completed a statistical analysis in Excel and produced descriptive statistics for a small sample of office property rents for Class B buildings in Kelowna. Is there any point in considering the Count function when you interpret the findings and report to your client?
The count function should certainly be considered for this particular case as it will provide the number of Class B office properties in Kelowna. This is an important statistic to consider in analysis as there may be very few of these specific properties in a small city, and therefore, each individual property may have a large influence on the descriptive statistics you have calculated.
You are beginning development of a regression model to predict street level retail rents for several prestige retail districts in the City of Vancouver (e.g., South Granville, Robson Street). You have purchased data from BC Assessment and have begun migrating the data from an Excel format into SPSS. What is your next step?
The first step in migrating this data is to gain an understanding of the data. You need to determine: • the number of records; • the number of variables; • distribution of the variables (descriptive statistics, frequencies); and • any variables which may overlap in influence. You must confirm that the data received meets the specifications for the project and determine whether any additional data may be required to meet the project objectives. You should also be aware of how the data was collected and the currency of the information (e.g., GIGO rule, "Garbage In, Garbage Out"). You will need to delete the records which are not required for the study or filter the required records and save these to a new data-set. In preliminary screening of data, you will need to use your local market knowledge to analyze the data. To supplement this, you may need to contact the data supplier to clarify and resolve data issues.
What are the five stages of data reduction for our purposes?
The five stages of data reduction are: a) data base or data sources; b) data frame; c) market data set; d) information set; e) illustrative set.
What are the general approaches available for evaluation of AVMs? Based on the measures of central tendency and dispersion reviewed in earlier lessons, what would you conclude about AVM statistical measures of accuracy?
The lending industry has shown leadership in establishing standards and benchmarks for evaluation of AVM performance. Some key measures include:
• the "hit rate";
• AVM percentage error;
• AVM bias, median;
• AVM value to sale price ratio; and
• standard deviation of AVM outcomes for predicted values in relation to sales prices for a sample population of recently sold properties (e.g., properties sold that were subject to financing approvals).
However, the problem remains that there is no generally accepted statistical methodology for evaluation of AVMs and furthermore, limited public understanding of the reliability of AVMs. The emergence of AVMs that evaluate other AVMs (cascading AVMs) is rapidly gaining more interest.
In Case Study 1, the author has used median rather than mean. Would the mean have been more appropriate for this analysis? Provide the rationale for your answer?
The median is more appropriate to use in this case because the data does not follow a normal distribution. Since the sample includes a number of outlier sale prices considerably outside the typical range (not unusual in some neighbourhoods, say a neighbourhood that has some view lots), use of the mean would tend to skew the average sale price. The goal of the analysis is to test the estimated market value of the subject property against the sale price of a typical property in the subject market. The median sale price, with 50% of sales above and 50% below, would be more indicative of the norm than the average sale price, as the mean may be skewed.
The descriptive statistics summary provides a mean and median of the dataset. Can you think of any reason(s) why the median may be better indication of central tendency than the mean?
The median may be a better indication of central tendency than the mean when there are significant outliers. The mean is more sensitive to these outliers, making the median a better choice for a measure of central tendency.
Your multiple regression model results show a large F value, but a low R2 value. What can you conclude about this result?
The regression is significant but the variables only explain a small amount of the variation in the dependent variable. The F statistic measures performance of the model overall when compared to the result that would be obtained by estimating the sale price by simply using the mean sale price. With a high F, the significance will be low or zero, meaning the result is significant. However, the low R2 value shows that the model is not explaining as much of the variation in sale price as would be optimal.
You conduct a regression analysis of detached single family housing prices in Langley, and then use the regression formula to calculate predicted values for your data set and the residuals (actual sales price - predicted value). What kind of results should you expect when you analyze the descriptive statistics for the residuals?
The residuals should have a mean of zero because the least squares procedure finds the line of best fit that minimizes the sum of squared residuals; with an intercept in the model, the residuals sum (and therefore average) to zero. The median, however, may be positive or negative depending on the skewness in the distribution of the residuals.
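A minimal Python sketch of this check is shown below; the sales data is hypothetical and only illustrates the expected pattern:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical sales purely for illustration
    sales = pd.DataFrame({
        "fin_area":   [1_150, 1_400, 1_750, 2_000, 2_300, 2_650, 3_000],
        "sale_price": [455_000, 498_000, 561_000, 602_000, 655_000, 731_000, 790_000],
    })

    model = smf.ols("sale_price ~ fin_area", data=sales).fit()
    residuals = sales["sale_price"] - model.fittedvalues

    print(residuals.mean())     # effectively zero (within rounding)
    print(residuals.median())   # may be positive or negative
    print(residuals.skew())     # the direction of skew drives the median's sign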
Assume you are analyzing resort condo sales in the resort community of Whistler as a part of building a regression model to predict values for 1 and 2 bedroom units in a large strata complex. Most of these condos are included in rental pools for part of the year. Local real estate brokers have told you that the prime characteristics that drive sales are size of unit, number of bedrooms, floor height, view, amenities, ability to "lock-off" units (bedroom in a condo that can be rented separately), strata fees and taxes, and quality of finish. You would like to develop an adjustment for lock-off suites, but you only have 10 sales with this feature, some of which appear to be outliers. The dataset has 90 cases in total. What should you do?
The risk of including this variable in the regression model is that the small number of sales may not provide enough information to explain the impact of the variable on sale price. What may happen is that another variable, e.g., bedrooms, may explain some of the effect of "lock-off". You should run descriptive statistics and graphs for this variable to confirm the outlier issue, and determine if "lock-off" is already accounted for by bedrooms or another variable. Since the number of occurrences of this variable is quite low, around 11% of the total dataset (10 of 90 cases), and lower still if the outliers are removed, there is a risk that the coefficient values calculated for this variable will not be significant.
You are building a model to predict market rents for retail properties. If you have 20 variables in your regression model, what is the minimum number of rental transactions you need in your dataset?
The rule of thumb for statistical significance is to have at least 5 times as many observations of the dependent variable (in this case, sample rents) as independent variables. Therefore, with 20 variables, your model should contain at least 100 cases or rental transactions.
You have a database of recreational lot sales and are forecasting sale price per front foot for a certain size of waterfront lot. How can you account for non-linear data relationships in your forecast?
With non-linear data, the slope of a simple regression line would be very sensitive to the location of a few data occurrences because of the nature of the least squares calculations. In other words, the coefficient of the independent variable could be dramatically affected by one or two data points, resulting in predictions of the dependent variable which contain high potential error. Non-linear relationships can be accounted for by transforming the variables (e.g., a logarithmic or other curvilinear transformation) before calibrating the model, so that the transformed relationship is approximately linear.
What would you conclude about the data screening processes used by Zillow and Zoocasa versus those used by other commercial AVMs discussed in the case studies?
The source data that powers Zillow is acquired from 3rd party data providers such as Dataquick with their own data integrity measures. However, the factor that separates Zillow from all other AVMs is that end users can modify the data which Zillow uses for its comparable search engine. There is no data screening process in place to address bias, errors, or other data related issues. In contrast, the Taurean model, Landcor, MPAC, and other commercial AVMs offer a very high level of data screening.
If the range for distribution 1 is greater than the range for distribution 2, what can you conclude about their respective standard deviation values? Can you conclude the same thing about their COV?
The standard deviation is a measure of the dispersion of a distribution. While the range of a distribution may have a significant effect on its standard deviation, the actual distribution will determine the size of the standard deviation. If one distribution has a greater range than another, it is likely that it will have a greater standard deviation. However, if the distribution with the larger range is more tightly distributed around the mean with an outlier resulting in a large range, the standard deviation may be smaller than that of a distribution that is more widely dispersed but with a slightly smaller range. The COV is a ratio of the standard deviation divided by the mean, and therefore, the range does not tell us very much about this value.
If one of the variables in your predictive regression model had the following statistics, what could you conclude?
t-value = .095
sig. = .732
VIF = 4.107
The t-score is less than 2, indicating the coefficient associated with the variable may not be statistically significant at a 95% confidence level. This is backed up by the high significance value, the probability that the coefficient is actually equal to zero. However, this may not be a serious concern if the intent of the regression model is predictive, rather than explanatory. A larger issue is the moderately high VIF score of 4.107, where a score below 3.333 is desired. The variable may be exhibiting some multicollinearity, which could be problematic for predictive results. One course of action would be to remove the variable from the list of independent variables and re-run the regression.
What is a thematic map? Provide an example.
Thematic maps use colours and shading of areas to display information related to location. For example, a map of parliamentary constituencies shaded in different colours can show the number of seats held by different political parties. GIS can build this kind of map automatically from the data values (number of seats), and typically offers many alternative ways of presenting this information. An example of a thematic map is a colour-coded zoning map, showing different property uses associated with specified colours.
You want to use a multiple regression model to predict rents for a suburban industrial park. However, you can find only 5 or 6 rent comparables over a 2 month period. What action could you take?
There are two main alternatives: broaden the scope of research to other industrial parks with similar attributes or increase the time-frame for research. It is likely that both options will be required.
Alex has purchased a property sales data-set from a Nova Scotia assessment organization to support his real estate appraisal business. The data includes information on a large number of data variables for residential neighbourhoods. After his initial data exploration, Alex concludes some variables are not very helpful in building a regression model. How could Alex possibly arrive at this conclusion?
Variables with either few occurrences or weak correlation can likely be excluded from further consideration at this stage.
Why is it necessary to examine the relationship between PARs and various data variables during the model testing process? If we use regression to analyze PAR in relation to the data variables, do we want the R2 to be high or low?
We expect to find no relationship between the PAR outcomes and the data variables. If a relationship exists, then the PAR outcomes are influenced by a specific property characteristic and the model will be biased. We would expect a low R² value if a regression of PAR outcomes versus the data variables was completed. In other words, the model indicates the data variables predict very little of the variation in PAR outcomes.
While neighbourhoods and districts are important to real estate economics, what economic division is the one we primarily rely on for defining our datasets? How is it defined?
We primarily rely on the market segment. A market segment is defined as a homogeneous market, as characterized by a set of similarity variables.
You have started an internship with an older well-respected appraiser. Everyone tells you that you are lucky to get a position with someone of his stature. You are ready to begin applying your modern tools. You get your first assignment, and enthusiastically download what you believe is a good market dataset. Being your first assignment, you have your new mentor check your search parameters. He laughs, and says: "All that statistical stuff is OK for school, but a good appraiser never needs to use more than three 'comps'! Just pick the best three and use those". What can you say?
You face an interesting dilemma, if you have what you believe is a useful tool, but something your mentor does not believe in or perhaps does not understand. You will have to decide whether to do exactly what you're told or whether to try and introduce something new to the process. This may be a business decision mixed up with an ethical question. The profession has a long tradition of new appraisers learning new things, then bringing them to practice through their work. Being a professional requires tact as well as being "right". If you were to take an all-or-none "MRA or bust!" attitude, you will likely lose your opportunity to bring new methods to the "old guard" and also to provide additional value-added techniques to clients. The simple regression and grouped comparison methods you have learned here can provide a bridge between traditional methods and modern technology. Graphs, scatter plots, trend lines, and group comparisons are intuitive and easily understood. You can consider bringing these methods into your work, doing good for clients, your own reputation, and the future of the profession.
You are building a regression model to predict the market value of single family residential building lots in a neighbourhood experiencing rapid growth. You need to determine which data characteristic to use for describing lot size. Some possibilities are a standard lot with an adjustment factor for size, lot area in square feet, lot width and depth, and lot area in acres. What should you consider in making this choice?
You need to consider whether the lots are relatively uniform or not. If the lots are all about the same size, it may be possible to use "lot" as a variable and simply include an adjustment for the various attributes affecting value, such as view, corner influence, useable area, cul-de-sac, etc. Alternatively, if the lots are more varied in size, it might be appropriate to use square feet, with price per square foot as the dependent variable. An acreage variable would only be suitable for residential "small holdings" or hobby farms, due to the disproportional impact of small changes in the variable on the dependent variable for residential building lots. Another factor to consider is the units used by market participants - building lots are generally priced on a lot basis rather than price per square foot. Once a decision has been made to select lot or square feet as a variable, the relationships can be tested during data exploration to determine which will be the best option for a regression model.
How can the coefficient of variation be used to help determine appropriate units of comparison for real estate data?
The coefficient of variation (standard deviation divided by the mean) can be calculated for each candidate unit of comparison (e.g., price per square foot, price per front foot); the unit with the lowest COV is the most consistent and therefore generally the most appropriate. From the scatterplot, measure the slope of the regression line and estimate the projection of the intersection of the line on the Y-axis where X = 0. This point will be the constant in the equation. The slope will be the regression coefficient. If the regression represents an inverse relationship, the coefficient will be negative.
multicollinearity
in multiple regression analysis, a situation in which at least some of the independent variables are highly correlated with each other. Such a situation can result in inaccurate estimates of the parameters in the regression model.