Data Science Exam Two

coding term (Python): info()

prints information about the dataframe - number of columns, column labels, column data types, etc.
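A minimal sketch of info() in use, assuming pandas is installed. The data frame and its column names (species, mass_g) are made up for illustration. info() prints to the screen by default; capturing it in a buffer here just makes the text easy to inspect.

```python
import io
import pandas as pd

# Hypothetical example data; these column names are not from the course.
df = pd.DataFrame({
    "species": ["frog", "newt", "toad"],
    "mass_g": [22.5, 8.1, None],
})

# info() normally prints its summary; buf= redirects it into a text buffer.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

# The summary lists each column label, its non-null count, and its data type.
print(summary)
```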

coding term (Python): dropna()

by default, removes every row that contains at least one null (missing) value
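A small sketch of dropna(), assuming pandas is installed. The data are invented for illustration. Note that dropna() returns a new data frame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical example data with one missing measurement.
df = pd.DataFrame({
    "species": ["frog", "newt", "toad"],
    "mass_g": [22.5, None, 40.2],
})

# Drop every row that has a null value in any column.
cleaned = df.dropna()

print(len(df), "rows before,", len(cleaned), "rows after")
```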

What are repositories in GitHub?

Basically an entire GitHub project, usually with multiple files and their version history (a bit like a folder)

What is the Console in RStudio?

Console (bottom left): where R is waiting for you to tell it what to do and where it will show the results of a command

What is metadata?

Data about data; provides description of data characteristics and relationships in data - author, date created, date modified, file size

What is the Environment in RStudio?

Environment (top right): virtual space that is a collection of names/variables associated with some values

Which is more permanent, a Git branch or fork? Explain.

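A fork is more permanent. A branch lives inside its repository, so it disappears if that repository is deleted, whereas a fork is a separate copy of the repository that typically remains even if the original is deleted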

What is CoLab?

Google's hosted version of a Python Jupyter notebook, run in a web browser

Why is markdown useful in programming?

It allows someone to write comments/prose in cells of a Jupyter or CoLab notebook

What is a Jupyter Notebook?

It is an open-source web application that allows you to create and share documents built from three types of cells: code, text/Markdown, and raw

What is a commit in git?

It is when you save changes you have made, creating a new version of the copy you are working with

Why file naming conventions could be very important for a researcher using Python to analyze their data

Python finds files by name, so naming has to be consistent; in addition, certain symbols (such as spaces and special characters) do not fit Python syntax or file paths and will produce errors instead of output when put into the code
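One way to see why consistent names matter is to select files by a pattern. This is only a sketch: the file names and the pattern below are hypothetical, and it uses fnmatchcase from Python's standard library.

```python
from fnmatch import fnmatchcase

# Hypothetical file names; the first two follow one convention, the last does not.
filenames = ["site01_2023.csv", "site02_2023.csv", "Site 3 (final).csv"]

# "?" matches exactly one character, so this pattern expects a two-character site code.
pattern = "site??_2023.csv"

# Keep only the names that follow the convention.
matched = [name for name in filenames if fnmatchcase(name, pattern)]

print(matched)
```

The inconsistently named file is silently skipped, which is exactly the kind of problem a naming convention prevents.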

Provide some uses for R

R can be used for data manipulation, statistics and graphs

What is the Script in RStudio?

Script (top left): a text file containing (almost) the same commands that you would enter on the command line of R - cleaner than the console

What are the Tabs in RStudio?

Tabs (bottom right): provide help, packages, and plots alongside the other panels

What is "main" in git?

The default branch of a repository (older repositories often call it "master")

.ipynb

The file extension for Jupyter Notebook

Syntax

The particular way that instructions need to be written in a particular programming language or CLI environment - some may use a language with periods, others commas etc

What is a pull in git?

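A pull request is when someone who has made a new version asks the owner of the original version to review the changes and decide whether they are worth adding to the original. (The command git pull is different: it fetches changes from a remote repository and merges them into your local copy.)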

What is a merge in git?

When two versions (usually branches) are combined to make a single version of a repository

What is a data frame?

a data frame is tabular data arranged in rows and columns (the term is used in both R and Python's pandas library, where it is written DataFrame)

what do these tools (git, GitHub, and Jupyter notebooks) allow scientists to do that they couldn't do before?

allow scientists to collaborate on a project directly in the working data space

Explain (generally), what a function is

an object containing multiple interrelated statements that run together in a predefined order every time the function is called
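As a rough illustration in Python, the function below bundles a few statements that always run in the same order when it is called. The name mean_of is hypothetical.

```python
def mean_of(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = sum(values)   # step 1: add the values together
    count = len(values)   # step 2: count how many there are
    return total / count  # step 3: divide and hand back the result


# Calling the function runs all three steps in order.
result = mean_of([2, 4, 6])
print(result)  # 4.0
```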

Why should students planning to continue their education or career in the biological sciences want to learn the lingo associated with computer programming?

computer programming is becoming increasingly popular in science and is a useful tool to know and understand

How can you recognize a .csv file?

the file name ends in .csv, and the contents are plain text with values separated by commas, one record per line
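A short sketch of what comma-separated data looks like to Python, using the standard library's csv module. The text below is made-up sample data standing in for the contents of a .csv file.

```python
import csv
import io

# Made-up contents of a small .csv file: a header line, then one record per line.
text = "species,mass_g\nfrog,22.5\nnewt,8.1\n"

# csv.reader splits each line at the commas and yields a list of strings.
rows = list(csv.reader(io.StringIO(text)))

for row in rows:
    print(row)
```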

What is "big data" and how is it changing the way science is done?

data that is so large, fast or complex that it's difficult or impossible to process using traditional methods

Explain why R is an "interpreted language"

you enter an expression and the R console executes the code you wrote directly, without a separate compile step

What is Pandas in Python?

A library in Python that makes it way easier to manipulate data sets.

What is Seaborn in Python?

A library in Python that makes it way easier to work with data visualization.

What is Python?

A multi-purpose programming language

What is GitHub?

A privately owned (not open source) web service that hosts git repositories and enables people to use git and especially to collaborate on software

What is CoCalc?

A service that makes it easier for Dr. Mulcahy to send you coding assignments; may also be useful for you as you try to do your EDDIE projects

What is markdown?

A lightweight markup language for creating formatted text using a plain-text editor.

why a biologist might need to use git, GitHub, or Jupyter notebooks for a research project

- Allow for collaboration (beneficial virtually) - GitHub is a place that code from previous projects can be pulled and altered to fit the requirements of the project at hand - Git has a version control tool which will help in organizing and tracking the changes made during the course of the project

What is the difference between GUI and CLI?

- GUI means graphical user interface and often involves moving your mouse around a computer screen (which is in essence a graph - with an X and Y axis) - CLI stands for command line interface, and in CLI, the user must type instructions for the computer. (Synonyms exist for CLI)

"Open Source" - what else does it mean besides "free" software?

- Provides full access to algorithms and their implementation - Gives the community ability to fix bugs/extend software - Promotes reproducible research by providing open and accessible tools

Examples of open source software

- R - Python - Git (NOT Github) - Jupyter

what are binary codes

- These are instructions that are delivered entirely via two symbols (0 and 1) - In binary code, 10 is the number 2 - "I can speak 10 languages. English and Binary."
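The claim that binary 10 equals 2 can be checked in Python with built-in functions only; the variable names are just for illustration.

```python
# Interpret the text "10" as a base-2 (binary) number.
two = int("10", 2)
print(two)  # 2

# Go the other way: write the decimal number 10 in binary.
as_binary = format(10, "b")
print(as_binary)  # 1010
```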

What is git?

- Version control system, especially for software - enables you to manage your website source code.

Why might CLI be preferable to GUI in some biological research projects?

- Very specific instructions telling the computer exactly what you want it to do - Can be more efficient - Most programming is done in CLI

Explain what a virtual machine is, and provide some advantages of using a virtual machine

- a desktop that is someplace else that you can access - don't have to physically contact the machine, saves time, saves money

What are OBJECTS, and explain how to make objects

- almost all things in R are objects (functions, datasets, results, etc.) - to make them, write a script that creates the objects (for example, assigning statistical results to a name with <-)

Provide advantages of using "R"

- fast and free - active user community - excellent for simulation, programming, computer intensive analysis, etc.

characteristics of a project that might determine whether you want to use Python in a CLI mode versus Excel in a GUI mode

- if you need to pull in separate files - if you are working with code - if the data set is very large

When did R start? What year was it created?

- initially written and released as open source software - during the 1990s

What is a package in Python

- large collection of modules that all help with a common problem (fixing dates etc.) - essentially small chunks of code that are shortcuts

Objects have MODE and CLASS! Explain each of these terms and how they might interact.

- mode is a mutually exclusive classification of objects according to their basic structure (numbers, characters, etc.) - the class attribute is a vector of character strings (determines how functions deal with the object)

What are libraries and packages in Python and why are they so handy?

- pandas and seaborn - pandas cleans and standardizes the data - seaborn formats data into visualizations such as graphs

which of the steps in the data science pipeline are associated with particular lines of code?

- pull data: read_csv("https://bit.ly/2Cs1Mq1") - verify data: looking at head & tail - clean/manipulate: removing null data - analyzing/creating visualizations: graphs - sharing data: shared with Dr. Mulcahy
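The pipeline steps above can be sketched in a few lines, assuming pandas is installed. To keep the sketch runnable offline, a small made-up CSV string stands in for the remote file, and a simple mean stands in for the graphing step; the column names site and temp_c are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-in for the remote CSV file, with one missing temperature.
raw_csv = io.StringIO("site,temp_c\nA,12.0\nB,\nC,15.0\nD,18.0\n")

# 1. Pull data (read_csv also accepts a URL in place of this buffer).
df = pd.read_csv(raw_csv)

# 2. Verify data: look at the first and last rows.
print(df.head(2))
print(df.tail(2))

# 3. Clean/manipulate: remove rows with null data.
clean = df.dropna()

# 4. Analyze: a summary number here; a graph would go at this step.
mean_temp = clean["temp_c"].mean()
print(mean_temp)

# 5. Share: export the cleaned data as CSV text.
shared = clean.to_csv(index=False)
```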

What is a library in Python

- set of packages and modules - have to be "called" into our program before we start using certain instructions
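"Calling" a library into a program is done with an import statement. The sketch below uses modules from Python's standard library so it runs without installing anything; pandas is conventionally brought in the same way with import pandas as pd.

```python
# Import a whole module, then reach its functions through the module name.
import statistics

# Import a single function by name from another module.
from math import sqrt

average = statistics.mean([1, 2, 3])
root = sqrt(16)

print(average, root)
```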

Give some specific ways that you could continue to educate yourself after this class to learn more about Git, GitHub, Python, and Jupyter notebooks

- sololearn.com - codefinity.com - jetbrains.com - learnpython.org

Provide disadvantages of using "R"

- steep learning curve - no commercial support - working with large datasets is limited

Explain how R sessions are interactive

- you write small bits of code and get an output - if the output is not what you wanted then you adjust your syntax - at the end you save script files which can be rerun later

cardinal rules of using spreadsheet programs for data

1. put all variables in columns 2. put each observation in its own row 3. don't combine multiple pieces of information in one cell 4. leave the raw data raw 5. export the cleaned data to a text-based format like CSV (ensures that anyone can use the data)

What is a fork in git?

A copy of a repository in which the copy is not synced with the main copy and will remain intact even if the main copy is deleted

What is a branch in git?

A parallel version inside a repository that is not synced with the main version but which will be deleted if the repository is deleted

What is a clone in git?

A copy of a repository (usually on your own computer) that stays linked to the main copy, so changes can be synced between them with pull and push

coding term (Python): columns

shows all of the column titles in the data frame (it is an attribute, so it is written without parentheses)
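A quick sketch of the columns attribute, assuming pandas is installed; the data frame is invented for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "species": ["frog", "newt"],
    "mass_g": [22.5, 8.1],
})

# columns is an attribute, not a method, so there are no parentheses.
column_names = list(df.columns)
print(column_names)
```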

coding term (Python): describe

shows summary statistics for the numeric data, such as the count, mean, standard deviation, min, quartiles, and max
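A minimal sketch of describe(), assuming pandas is installed. The three measurements are made up so the summary values are easy to check by hand.

```python
import pandas as pd

# Hypothetical measurements.
df = pd.DataFrame({"mass_g": [10.0, 20.0, 30.0]})

# describe() returns a table of summary statistics for numeric columns.
stats = df.describe()
print(stats)

# Individual values can be looked up by row label and column name.
mean_mass = stats.loc["mean", "mass_g"]
```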

coding term (Python): head(n)

shows the first (n) rows of data (5 if n is left out)

coding term (Python): tail(n)

shows the last (n) rows of data (5 if n is left out)
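head(n) and tail(n) can be tried side by side. This sketch assumes pandas is installed and uses a made-up column of ten readings numbered 1 to 10.

```python
import pandas as pd

# Hypothetical data: ten readings numbered 1 through 10.
df = pd.DataFrame({"reading": range(1, 11)})

# First three rows and last two rows.
first_three = df.head(3)
last_two = df.tail(2)

print(first_three)
print(last_two)
```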

