DSC 201

What is cloud classification IaaS?

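Infrastructure as a Service (IaaS): the provider supplies processing, storage, networks, and other fundamental computing resources, and users deploy and run their own software on them through the cloud.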

Syntax of commenting in SAS code

A statement beginning with an asterisk (*) and ending with a semicolon is a comment statement, e.g. * this is a comment;

What is TensorFlow?

Open-source library for numerical computation. Suited to machine learning and deep learning on a large scale, and provides GPU acceleration. Its core data type is the tensor: a 1st-order tensor is a vector and a 2nd-order tensor is a matrix (2D).

What is a Linux cluster?

A group of computers linked by an interconnect. The group works together and creates the appearance of one computer. Each computer runs an operating system that uses the Linux kernel.

Virtual machines: Advantages and disadvantages

Advantage: Multiple virtual computing, storage, and network resources can be created with limited hardware.
Disadvantage: Multiple virtual computing, storage, and network resources compete for the underlying hardware resources.

Why do we care about precision of calculations?

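Because precision determines both the accuracy of the results and the resources a calculation needs: higher precision (e.g. double rather than single) uses more memory per value and more compute time, which affects how many cores and how much RAM are necessary to build what you want.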

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.

What are the various ways to run Python?

Command-line shell (interactive interpreter), Jupyter notebook, Python script run from Bash

What is a data frame?

A data frame is a tabular, column-oriented data structure with row and column labels.

What is a parallel file system?

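A file system that spreads data across multiple storage servers so that many nodes can read and write it at the same time. Examples:
Lustre: uses metadata servers, object storage servers, and clients for the file system.
GPFS (General Parallel File System): provides access to block-level storage on multiple nodes. Metadata and blocks are distributed across multiple disk arrays.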

What is InfiniBand and how does it differ from Ethernet?

Both are network interconnects. Ethernet typically has higher latency because traffic passes through the TCP/IP protocol stack. InfiniBand has been designed for low latency (and high bandwidth).

What is a GPU?

A GPU is a graphics processing unit: designed to manipulate and alter memory to accelerate the creation of images. It acts as a coprocessor to the CPU.

What is Amdahl's Law and how do you calculate speedup? Why is that useful to do?

Given a fixed workload, the law gives the expected speedup for a given number of processors.
Speedup = 1 / ((1 - p) + p / N), where p is the fraction of the work that is parallelized and N is the number of cores.
1. If p = 0, none of the work is parallelized and the speedup is 1 (no speedup).
2. If p = 1, the speedup is equal to the number of cores.
3. If p = 0.5, the speedup approaches 2 in the limit of an infinite number of cores.
It is useful because it shows the upper limit on what adding more cores can gain for a given program.
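
A minimal Python sketch of the speedup formula, for checking the limiting cases by hand. The function name `amdahl_speedup` and the sample values are illustrative assumptions, not part of the course material.

```python
def amdahl_speedup(p: float, n_cores: int) -> float:
    """Expected speedup when a fraction p of the work is parallelized over n_cores."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # Serial part (1 - p) is unchanged; parallel part p is divided among the cores.
    return 1.0 / ((1.0 - p) + p / n_cores)


# Print speedups for a few parallel fractions and core counts.
for p in (0.0, 0.5, 0.9, 1.0):
    row = [round(amdahl_speedup(p, n), 2) for n in (1, 4, 16, 1024)]
    print(f"p={p}: {row}")
```

With p = 0 the function returns 1 for any core count, and with p = 0.5 the result stays below 2 no matter how many cores are added.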

What is Anaconda?

A distribution of Python that includes extra packages and libraries. The goal is to provide easy access to the most-used analysis and machine learning libraries. You can create your own conda environments for specific packages.

What is NumPy and why is it useful?

It is a Python library. It provides arrays similar to Python's built-in list, but more efficient. It is useful for manipulating arrays and creating matrices.

LINPACK benchmark - What is it? What is the difference between theoretical value and benchmark value? Why is there a difference?

The LINPACK benchmark measures floating-point performance by solving a dense system of linear equations; it factors the matrix as the product of a lower triangular matrix and an upper triangular matrix. The theoretical value is the peak performance estimated from the hardware specifications, and the benchmark value is the performance actually measured. The benchmark value is lower because real runs incur overheads, such as memory access and communication, that the theoretical estimate ignores.

How are OpenMP and MPI similar and different?

Similar: both are used to write parallel programs.
OpenMP is designed for multi-platform shared-memory parallel programming. It uses compiler directives to control execution.
1. Uses the fork-join model of parallel execution
a. At a parallel region the master thread forks a team of threads, which join again when the region ends
MPI (Message Passing Interface) is used for multiple tasks that cooperate on a computation.
1. Each task uses its own local memory for computation.
2. Can be used to run parallel applications on shared-memory and distributed-memory systems
3. Tasks exchange data by sending and receiving messages

What are the differences between GPUs connected via PCIe vs. NVLink

PCIe: communication between the CPU and GPU happens through the PCIe switches.
NVLink: a higher-bandwidth link available to Pascal- and Volta-class GPUs in systems that support it. More GPUs can be placed in the host computer.

What is the Pandas library?

Pandas provides a data frame object for Python. A Pandas Series object looks like a NumPy array but can have defined indexes.

What is parallel computing and why is it important?

Parallel computing: a problem is broken into many parts that can be solved at the same time. Tasks from each part execute at the same time on different processors.
1. It is a time saver for intense computing problems.
2. Able to tackle difficult or more complex problems
3. Allows for the development of more accurate and detailed models
4. More efficient use of hardware
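
The decomposition idea can be sketched in Python with the standard library's `concurrent.futures`. This is only an illustration of breaking a problem into parts and combining partial results; the helper names are hypothetical. Note that threads in CPython do not speed up CPU-bound work like this, so a real speedup would need separate processes or a compiled library.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work done independently on one part of the problem."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Split values into chunks, process each chunk in a worker, combine the results."""
    chunk_size = max(1, -(-len(values) // n_workers))  # ceiling division
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


print(parallel_sum_of_squares(list(range(1000))))
```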

How to calculate FLOPS

Performance in GFLOPS = (# cores) * (clock speed in GHz) * (floating-point instructions per clock cycle). Divide GFLOPS by 1000 to get TFLOPS.
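
A short Python sketch of this calculation. The function names and the example machine (24 cores, 2.5 GHz, 16 floating-point operations per cycle) are made-up values for illustration only.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak performance in GFLOPS."""
    return cores * clock_ghz * flops_per_cycle


def gflops_to_tflops(gflops: float) -> float:
    """1 TFLOPS = 1000 GFLOPS."""
    return gflops / 1000.0


# Hypothetical node: 24 cores at 2.5 GHz, 16 floating-point operations per cycle per core.
gflops = peak_gflops(24, 2.5, 16)
print(gflops, "GFLOPS =", gflops_to_tflops(gflops), "TFLOPS")
```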

What is Slurm?

A job scheduler for clusters. It provides software for starting, stopping, and monitoring compute jobs and resources. It allocates and deallocates computing resources on nodes and partitions, and manages queues of jobs waiting for resources.

What is scikit-learn? What does it provide?

Python library for machine learning, data analysis, and data mining. Provides regression, classification, clustering, dimensionality reduction, model selection, and preprocessing.

What are the components of Flynn's Taxonomy? Can you provide examples of each?

SISD: single instruction, single data
1. Ex: single-CPU system
SIMD: single instruction, multiple data
1. Ex: GPU
MISD: multiple instruction, single data
1. Ex: rare, but could exist in obscure cryptographic applications
MIMD: multiple instruction, multiple data
1. Ex: Linux cluster

What is linear scaling? How do we get close to linear scaling?

Linear scaling means the speedup grows in proportion to the number of processors: N processors give an N-fold speedup. Scaling is judged by graphing the measured speedup next to the ideal linear graph to compare.
How to get close to linear scaling:
1. Improve the algorithm so the program spends a greater fraction of its time doing parallelized work
2. Reduce parallel overhead
3. Larger data sets may help improve the fraction of time spent doing the parallelized work

What is a Slurm script and what does the sbatch command do?

A Slurm script is a Bash script (it starts with #!/bin/bash) that sets all the job parameters, including partition, time, memory allocation, etc. The sbatch command submits your Slurm script to the scheduler, which queues it and runs it when resources are available.

What is a tensor core?

Tensor Cores are specialized units on NVIDIA GPUs built for tensor processing: they accelerate the matrix multiply-and-accumulate operations that dominate deep learning workloads.

What is a list, tuple, set, dictionary, etc.? What is the syntax for each?

Variable
1. x = 1
List - changeable
1. L = [1, 2, 3]
Tuple - not changeable
1. T = (1, 2, 3)
Set - changeable, no duplicates, no order
1. S = {1, 2, 3}
Dictionary - changeable, maps keys to values
1. D = {'state': 'NY', 'capital': 'Albany'}
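
The behavior of each structure can be checked with a few lines of Python. The appended and added values below are arbitrary examples.

```python
x = 1                      # plain variable

L = [1, 2, 3]              # list: ordered and changeable
L.append(4)

T = (1, 2, 3)              # tuple: ordered and not changeable
try:
    T[0] = 99              # assigning to a tuple element raises TypeError
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

S = {1, 2, 2, 3}           # set: the duplicate 2 is dropped, no order

D = {'state': 'NY', 'capital': 'Albany'}   # dictionary: key-value pairs
D['country'] = 'USA'       # add a new key

print(L, T, S, D, tuple_is_immutable)
```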

What is a socket

a physical location on a computer board that houses CPU

what is serial computing

a problem is broken into a discrete series of instructions that are executed one at a time on a single processor

What is a thread?

an independent stream of instructions that can be scheduled to run by the operating system

What is a Bash script and how is it same/different from command line environment?

A Bash script is a set of Unix/Linux commands saved in a text file and run in batch mode. The commands and syntax are the same as those entered at the command line. A Bash script has a specific header (#!/bin/bash) and needs to be run through the command line.

What is cloud classification PaaS?

Platform as a Service (PaaS): users can deploy applications they have created using the programming languages, libraries, etc. that the provider supports.

What is a CPU

chip that does the computing

What is a core

computing element of a CPU that processes data independently

How is Bash different/similar to a compiled language (e.g. C++) or a scripting language like Python?

Bash has similar constructs such as loops and conditionals. It runs slower than C++ or Python, and it is poorly suited to tasks like floating-point calculation and math functions.

What is the relationship between a repository, revision, changeset, working copy?

i. Actual changes to the files in a repository occur in a user's working copy ii. While a user has checked out a particular revision to their working copy, another user may also have checked out the same revision, made modifications, and committed the changeset to the repository.

Why are relationships important in relational databases?

i. Because they store relationships among the data, which captures structure that is more complicated than a simple list

What does sc.parallelize([1,2,3,4,5]) do?

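i. Creates an RDD from the local list [1,2,3,4,5], distributing its elements across partitions so they can be operated on in parallel. You can specify the number of partitions when creating the RDD, e.g. sc.parallelize([1,2,3,4,5], 2)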

What is a factor?

i. An R data type for categorical data: each value is one of a fixed set of levels

What is the Caret library?

i. An R package for classification and regression (the name is short for Classification And REgression Training) - similar to scikit-learn

What is cloud classification SaaS?

runs on cloud infrastructure and is available remotely through web browser. The user does not manage or control cloud infrastructure except for user specific application settings.

What do these common git commands do: commit, add, checkout, clone, fetch, pull, push, merge, rebase, branch, diff, tag?

i. Commit: creates a revision of a repository (and an associated changeset) from a working copy
ii. Add: add file contents to the index
iii. Checkout: creates a working copy from a particular revision of a repository
iv. Clone: clone a repository into a new directory
v. Fetch: download objects and refs from another repository
vi. Pull: fetch from and integrate with another repository
vii. Push: update remote refs along with associated objects
viii. Merge: join two or more development histories together
ix. Rebase: reapply commits on top of another base tip
x. Branch: list, create, or delete branches
xi. Diff: show changes between commits, commit and working tree, etc.
xii. Tag: create, list, delete, or verify a tag object signed with GPG

Review the RDD operations:

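i. filter: returns a new RDD containing only the elements that satisfy a condition (e.g. keep the test points that were labeled correctly in order to calculate their fraction)
ii. reduceByKey: merges the values for each key using a function such as addition
iii. count: returns the number of elements in the RDD
iv. countByValue: returns the number of times each unique value occurs
v. collect: returns all the data in the RDD to the driver as a list
vi. flatMap: applies a function that returns several items per element and flattens the results (e.g. all the words in a given set of lines)
vii. mapValues: applies a function to the value of each key-value pair, leaving the key unchanged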

How are SAS formats used?

i. A format controls how values such as dates or money are displayed. Name the column in a FORMAT statement and then specify the format you want (format names end with a period) - format DOB mmddyy10.;

how to use and the syntax for proc (means, print, frequency)

i. Freq: counts how often each value of a variable occurs. The TABLES statement lets you specify which variable you want frequencies for. - proc freq data=demographic; tables Gender; run;
ii. Means: computes summary statistics (such as the mean) for one or more numeric variables. The VAR statement specifies which variables you want summarized. - proc means data=demographic; var Age Height Weight; run;
iii. Print: allows the user to print their data - proc print data=

how to use and the syntax for the data step process (infile, dsd, filename, datalines)

i. Infile: tells SAS where the data values are coming from. - ex: infile '/home/yourNetID/Desktop/MySAS/Data/mdata.txt';
ii. dsd, dlm, delimiter: the DSD option does three things. - First, it changes the default delimiter from a blank to a comma. - Next, if there are two delimiters in a row, it assumes there is a missing value between them. - Finally, if character values are placed in quotes (single or double quotes), the quotes are stripped from the value.
iii. Filename: use this statement to identify the file and give it a fileref, then use that fileref in your INFILE statement - ex: filename sharks '/home/yourNetID/Desktop/MySAS/Data/ydata.csv';
iv. Datalines: useful when you want to write a short test program in SAS. Instead of having to place your data in an external file, you can place your lines of data directly in your SAS program by using a DATALINES statement.

Difference between running R in "batch mode" and interactively

i. Interactively: running R in Jupyter Notebook or RStudio, where you can run each line as you go. Batch mode: you write all the code in a script and submit it to run on BlueHive.

why are relational databases important

i. Keeps large amounts of persistent data ii. Control all access to their data through transactions in order to contain complexity of concurrency iii. Transactions also help in error handling by allowing erroneous changes to be rolled back

Provide examples of NoSQL tools and their type

i. Key-value database - Berkeley DB - Apache Cassandra - LevelDB - MemcacheDB - Redis - Riak - Voldemort
ii. Document database - MongoDB - CouchDB - OrientDB - RavenDB - Terrastore - MarkLogic
iii. Column-family database - Amazon DynamoDB - Apache Cassandra - Bigtable - Cloudera - HBase - Hypertable
iv. Graph database - FlockDB - HyperGraphDB - InfiniteGraph - Neo4j - OrientDB

Be able to explain the 4 types of NoSQL databases and where they are commonly used

i. Key-value database: the client can either get the value for a key, put a value for a key, or delete a key from the data store
- Storing session information, user profiles, preferences, shopping cart data
ii. Document database: the database stores and retrieves documents
- Event logging, content management systems and blogging platforms, web analytics and real-time analytics, e-commerce applications
iii. Column-family database: stores data in column families as rows that have many columns associated with a row key
- Using column families, you can store blog entries with tags, categories, links, and trackbacks in different columns.
- Counters
- Expiring usage: you may provide demo access to users, or may want to show ad banners on a website for a specific time. You can do this by using expiring columns.
iv. Graph database: allows you to store entities (also called nodes) along with relationships (also called edges) between those entities
- Social networks are a good example of where graph databases can be used effectively. These social graphs can be more than just the kind related to friends; for example, a social graph can represent employees, their knowledge, and where they worked with other employees on various projects.
- Routing, dispatch, and location-based services: think of an airline or trucking company as good examples. Every location or address that has a delivery is a node, and all the nodes where the delivery has to be made by the transport mechanism can be modeled as a graph of nodes.

What is a vector? What is a list? How are they different?

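i. A list is a heterogeneous vector: it can hold elements of different types
1. list(1, 2, 3, "hello", sqrt)
ii. A vector is a sequence of elements that all have the same type (e.g. all numbers)
iii. They differ in that a list can mix types, such as numbers, strings, and functions, while a vector cannot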

What are map, reduce, and shuffle operations and when are they necessary/unnecessary?

i. Map
1. Operations include map, flatMap, groupByKey, filter, zipWithIndex
2. Example: maps each word in a data set to the line number it is on
ii. Reduce
1. Operations include reduce, reduceByKey, combineByKey, count, mean, sum, min, max, stdev
2. Example: takes the map output and reduces it to show each word and the line numbers that word is in
iii. Shuffle
1. It is the process of redistributing data across partitions
2. Try to avoid it where possible, because moving data between partitions is expensive
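
The word-to-line-number example can be sketched in plain Python to show how the three stages fit together. This is an analogy run on a single machine, not Spark code; the function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit one (word, line_number) pair per word occurrence."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        for word in line.lower().split():
            pairs.append((word, line_number))
    return pairs


def shuffle_phase(pairs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group into a sorted list of distinct line numbers."""
    return {word: sorted(set(numbers)) for word, numbers in groups.items()}


lines = ["the cat sat", "the dog sat", "a cat ran"]
index = reduce_phase(shuffle_phase(map_phase(lines)))
print(index)
```

In a cluster, the shuffle step is where data moves between machines, which is why it is the stage to minimize.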

What is a matrix? What is a data frame? How are they different?

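i. A matrix is a vector arranged in rows and columns; all elements have the same type
ii. A data frame is a table, or heterogeneous matrix
iii. They are different because each column of a data frame can have its own type, so a data frame can hold characters/strings from a document alongside numbers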

Syntax for filtering rows and columns in matrix and data frame

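i. Matrix: m <- matrix(c(1,2,3,4,5,6), nrow=2, ncol=3), then filter with m[rows, columns], e.g. m[1, ] for the first row or m[, 2] for the second column
ii. DF: W <- read.csv("name of doc", header=T), then filter the same way, e.g. W[W$Age > 30, ] for rows or W[, c("Age", "Height")] for columns (Age and Height are placeholder column names)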

What is git for and why is it used?

i. Git is a version control system: a method of organizing and controlling versions of something
ii. Primarily used for tracking changes to textual data
iii. Version control could be as simple as using separate folders to store different copies of a document or different versions of code, though this is redundant and requires careful bookkeeping by users

What is a NoSQL database and why is it significant?

i. NoSQL databases operate without a schema, allowing you to freely add fields to database records without having to define any changes in structure first. This is particularly useful when dealing with non-uniform data and custom fields.
ii. It handles data access with sizes and performance that demand a cluster
iii. It improves the productivity of application development by using a more convenient data interaction style
iv. The common characteristics of NoSQL databases are:
- They do not use the relational model
- They run well on clusters
- They are usually open source
- They're built for the 21st century web estates
- They are, for the most part, schemaless

what are the advantages to using a centralized workflow?

i. Only uses 1 master branch so it is simple to maintain.

Understand the concept of polyglot persistence

i. Polyglot persistence means that when storing data, it is best to use multiple data storage technologies, chosen based on the way data is being used by individual applications or by components of a single application. Different kinds of data are best dealt with by different data stores.

PCA - How do you implement it? How do you use it for clustering/classification?

i. Primarily for visualization of arrays/samples ii. Performs a rotation of the data that maximizes the variance in the new axes

Syntax for extracting and manipulating data in data frame

i. read.csv() ii. head() iii. tail() iv. mean() v. sum()

what is the difference between a merge and a rebase?

i. Rebasing works by simulating a series of changes from a different starting point
ii. Rebasing produces a linear history, though a less accurate one
iii. Rebasing creates two versions of particular changesets (original and simulated)
iv. Merging keeps the history of both branches as it actually happened and joins them with a merge commit, so the history is accurate but not linear

What is an RDD? What kind of structures can an RDD contain?

i. An RDD (Resilient Distributed Dataset) represents an immutable, partitioned collection of elements that can be operated on in parallel.
ii. It can contain any kind of record, including key-value pairs.

Do NoSQL databases have any limitations? If so, what are they?

i. Running on clusters has an effect on their data model as well as their approach to consistency

Why do you train models on a subset of labeled data?

i. So that the model can be evaluated on data it has not seen: you will often want to split your data into a training set for the algorithm to learn on, and a testing set to test the accuracy of the learned algorithm
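
A minimal sketch of such a split using only Python's standard library. The helper `split_train_test`, its default test fraction of 0.25, and the fixed seed are assumptions chosen for illustration; libraries such as scikit-learn ship their own splitting utilities.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(20))
print(len(train), "training records,", len(test), "testing records")
```

Shuffling before the split keeps any ordering in the original data from ending up entirely in one set.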

What are some common Spark modules and what do they do?

i. Spark Core - MapReduce ii. Spark SQL - Database iii. Spark Streaming - Real-time analysis of streaming data iv. Spark MLlib - Distributed machine learning library v. Spark GraphX - Distributed graph processing framework

What is a "workspace image" and why does R ask if you want to save this?

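i. It is a file (.RData) that stores the objects in your current session, such as variables, data frames, and functions. R asks whether to save it so those objects can be restored the next time you start R.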

Describe the significance of the timeline of relational databases and NoSQL

i. The change is that now we see relational databases as just one option for data storage ii. The world wanted to have different options for data storage

what are the differences between valid and invalid SAS variables?

i. Valid names: letters (upper and lower), numbers, underscore ii. Not valid: starts with number, has blanks, other characters besides underscore.
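
These rules can be expressed as a pattern check. The sketch below is written in Python rather than SAS, and it additionally assumes the common SAS defaults that a name may start with a letter or underscore and be at most 32 characters long; the function name and test names are hypothetical.

```python
import re

# First character: letter or underscore. Remaining characters: letters, digits,
# or underscores. Assumed maximum length: 32 characters.
_SAS_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_sas_name(name: str) -> bool:
    """Return True if name follows the variable naming rules described above."""
    return _SAS_NAME.fullmatch(name) is not None


for candidate in ("Age", "_total2", "2ndVisit", "first name", "cost$"):
    print(candidate, is_valid_sas_name(candidate))
```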

How do you assign color to your data, and assign a colormap?

i. colormap winter;

How to assign a vector to a variable (e.g. using ":" and "seq" functions)

i. v <- c(1,2,3), or equivalently v <- 1:3 or v <- seq(1, 3, by=1)

What is a task

a discrete piece of computational work that the code needs to do, such as a job or one part of a larger computation

What is a node?

physical computing unit with sockets that have one or more processors, banks of memory, and network interfaces

What limitations do relational databases have?

procedures would load and manipulate data within their logic, but the procedure itself did not contain the data in any meaningful way

