Blogs on Lucas Allen, Charlotte, North Carolina Data Scientist
/blog/
Recent content in Blogs on Lucas Allen, Charlotte, North Carolina Data Scientist
Copyright (c) 2018, Lucas Allen; all rights reserved.
Thu, 01 Mar 2018 13:00:47 +0000
Look Ma, I'm on Hugo!
/moving-site-to-hugo/
Thu, 01 Mar 2018 13:00:47 +0000
The last couple months I’ve been playing around with migrating this blog over from WordPress to Hugo, the static site generator built on Go. Due to “reasons,” many of which probably don’t apply to the average WordPress blogger, and some of which involved me being finicky about certain expectations, this turned out to be more of a hassle than I anticipated. While the new version of the site has been up and running for over a month now, I’ve been slow to post because I’ve continued to spend my time tweaking and improving, but I can feel the old itch to write.
Keep on Movin’ On
/data_scientist_charlotte_nc/
Sun, 08 Oct 2017 23:04:35 +0000
This blog has gone pretty quiet the last 6 months or so, which usually signals I’m up to something new. In fact, this time it’s a cross-country relocation for another career move. My new opportunity takes me to a part of the country which is a big change for a guy who’s never lived outside of rural Central Illinois.
Heading South
I’m now located in the Queen City of Charlotte, NC.
Retro Game Retrieval Engine Design
/retro-game-retrieval-engine-design/
Wed, 19 Jul 2017 02:00:44 +0000
I’ve got a new Shiny web app that I’ve embedded on another site where I’m doing some experimental things, and I wanted to talk generally about how I created it. The web app, which can be found at the following link, lets the user search interactively for similar classic home-console games from what are generally known as the third generation (NES, Sega Master System) through the sixth generation (Wii, PS2, Xbox).
Back2School with Vectors, Cosine Similarity, and Word2Vec
/teaching-high-school-students-vectors-cosine-similarity/
Fri, 12 May 2017 01:49:40 +0000
Tomorrow, I’ll be making a return visit to the high school where I spent a decade in the mathematics department as a teacher. I’ve got the chance to speak to ten classes over the course of six class periods and tell them a little bit about what I do as a data scientist.
Since many of the students will be familiar with concepts like vectors and trigonometry, I’ve decided to do an activity involving the Python gensim package and Word2Vec.
A New Introduction to Spark 2.1 Dataframes with Python and MLlib
/introduction-spark-2-1-dataframes-python-mllib/
Wed, 03 May 2017 21:29:22 +0000
A couple of years ago, when I was in the midst of my rookie year as a data scientist, I wrote a blog post and tutorial about using the Python Spark API to build a simple model from housing data with Spark dataframes. Despite the simple nature of the model (a straight train-test split with multivariate linear regression), it was one of the more challenging tutorials I’ve ever written for this blog.
Machine Learning Specialization Cut Short by Coursera
/machine-learning-specialization-coursera-review/
Tue, 31 Jan 2017 01:55:58 +0000
After an extremely long wait, today was the day that the fifth course in Coursera’s Machine Learning Specialization was set to begin. I’ve been with this specialization since it launched in the fall of 2015. Students were initially promised an ambitious slate of six courses, including a capstone that would wrap up by early summer of 2016. With noted husband-and-wife couple Carlos Guestrin and Emily Fox, previously of Carnegie Mellon and now of the University of Washington, teaching the courses, this sounded like a great option.
Minivan Price Comparison With R
/minivan-price-comparison-r/
Thu, 09 Jun 2016 01:38:04 +0000
With my family growing once again and my 13-year-old Mazda Protégé on the fritz, I recently decided it was time to go minivan shopping. A frugal shopper, some might say cheap, I quickly set my focus on the used, domestic market and found that there are only two competitors here, the Dodge Grand Caravan and the Chrysler Town and Country.
Two questions immediately came to mind:
As these two minivans are, for all practical purposes, identical (manufactured at the same facility, same internals, just different branding), if one compared them with a similar set of features, does one name carry a price premium over the other?
University of Washington Machine Learning Classification Review
/university-washington-machine-learning-classification-review/
Mon, 16 May 2016 21:06:10 +0000
I’ve spent the last couple of months working through course three in the University of Washington’s Machine Learning Specialization on Coursera. Course two was regression (review); the topic of the third course is classification. As has been the case with previous courses, this specialization continues to be taught by Carlos Guestrin and Emily Fox. For the classification course, Dr. Guestrin took the lead.
The time requirements did increase a bit with this third course, not excessively, but it felt like I was working an extra hour or so a week on it.
Coursera Review–Machine Learning: Regression
/coursera-review-machine-learning-regression/
Mon, 18 Jan 2016 21:59:43 +0000
I’ve recently completed the second course in the University of Washington Machine Learning Specialization on Coursera, “Machine Learning: Regression.” This comes on the heels of completing course 1, Machine Learning Foundations: A Case Study Approach. This course debuted right at the end of November and wrapped up 6 weeks later (my impression is that these courses are slipping a bit behind the timeline that was originally announced). I’d encourage you to read my review of the first course above, as I was left satisfied with the learning experience I received in the first class, but wondering if some of the concerns that students raised would be addressed.
Constructing a Social Graph With Twitter and Plotly
/constructing-social-graph-twitter-plotly/
Wed, 06 Jan 2016 01:48:28 +0000
In a couple of earlier posts, I showed an example of a social graph created from Twitter data and Plotly, a graph of relationships between educational technology enthusiasts on Twitter. Those posts were more for the educator audience that I write for, but increasingly, I’m getting feedback on my posts from other data scientists, so I’ve decided to include my code, both here on this blog and at my GitHub account.
Teaching Graph Theory With Twitter
/teaching-graph-theory-twitter/
Wed, 06 Jan 2016 01:43:53 +0000
In a recent post, I displayed the social network graph that I created using the Twitter API and Plotly. There are a number of interesting applications here. Given my history with education, one that I think shouldn’t be overlooked is as an interesting way for an innovative teacher and school to teach graph theory.
I taught graph theory myself for several years as part of a discrete mathematics course. While the textbook I used included many examples of “real world” problems that I found engaging, the students didn’t always agree.
#EdTechChat Social Network Graph
/edtechchat-social-network-graph/
Wed, 06 Jan 2016 01:39:21 +0000
Using the Twitter API and Plotly with Python, I created a visualization of a recent #EdTechChat on Twitter, held on December 14. If you aren’t familiar with graph theory, the dots in this visualization are referred to as nodes or vertices. They represent the Twitter users that participated in the chat. The line segments connecting them are called edges and represent a relationship between two Twitter users: one user follows the other.
Coursera Review: Social and Economic Networks
/coursera-review-social-economic-networks/
Thu, 10 Dec 2015 22:13:49 +0000
Because I just couldn’t get enough of the new Machine Learning Specialization from the University of Washington, I decided to fill my schedule to the brim with another Coursera class, Social and Economic Networks: Models and Analysis, from Stanford University. I took a graph theory course at the University of Illinois while getting my master’s degree around the dawn of the new millennium, which, among many other topics, covered things like Euler circuits, Hamiltonian paths, coloring, and the like.
Coursera Review: Machine Learning Foundations—A Case Study Approach
/coursera-review-machine-learning-foundations-a-case-study-approach/
Mon, 09 Nov 2015 22:17:34 +0000
After completing the Data Science Specialization from Johns Hopkins in 2014, my MOOC studies in 2015 have been fairly sporadic, partly as a result of starting a new job, and partly as a result of not seeing something that seemed like the right fit. That’s no longer the case, as I’ve recently jumped into a new specialization, the Machine Learning Specialization from the University of Washington.
As great an experience as I had with the JHU specialization, this new specialization checks a couple of continuing education boxes for me that I felt the JHU specialization left lacking.
Databricks Review
/databricks-review/
Tue, 22 Sep 2015 02:31:20 +0000
Not too long ago, I did my first post on Apache Spark, a Spark dataframes tutorial. I’ve continued to experiment with Spark since taking my first tentative steps with it just a few months ago. One of the challenges with Spark is that it has a reputation for being difficult to deploy at scale. Stepping in to try to solve that problem is Databricks. Databricks offers the ability for corporations to deploy an optimized Spark via the cloud with some very nice extra bells and whistles.
My First Month With Ubuntu
/week-ubuntu/
Thu, 17 Sep 2015 22:00:58 +0000
My journey into data science is taking me all sorts of interesting places that I didn’t originally expect. That’s what I love about it. While I can feel myself accelerating into the learning curve, there’s no shortage of new things to learn and won’t be for years to come.
One of the latest has been setting up a “dual boot” environment on my new Dell PC to run both Windows 10 and Linux Ubuntu.
Spark Dataframes and MLlib
/spark-dataframes-mllib-tutorial/
Mon, 24 Aug 2015 00:08:39 +0000
NOTE: I have created an updated version of my Python Spark Dataframes tutorial that is based on Spark 2.1 and uses an easier, updated Spark ML API. I would encourage readers to check that out over this older post.
A couple of months ago, I got my first experience with Apache Spark. While I am just starting to use it to tackle meaningful problems, in my experience when working with a new tool or technology, just getting one’s feet wet can be crucial to getting a learning snowball rolling.
Statistical Learning Stanford Online Review
/statistical-learning-stanford-online-review/
Fri, 10 Apr 2015 01:20:36 +0000
I just received my certificate from Stanford’s Statistical Learning course, taught by the legendary Trevor Hastie and Rob Tibshirani. This was the first MOOC I’ve completed since making the jump from education to the corporate world, and I did find it challenging to keep up with the material despite the fact that this class required quite a bit less time on a per-week basis than most of the Johns Hopkins Data Science Specialization on Coursera.
Favorite Podcasts for Data Scientists
/favorite-podcasts-data-scientists/
Sun, 22 Feb 2015 21:54:47 +0000
One of my favorite learning methods is via podcasts. They allow me to multitask–exercising, driving, or doing chores–while listening to experts on a particular topic. Some of the podcasts I listen to are purely for entertainment (think Serial or StartUp) but many others are for educational purposes.
As I’ve been trying to build up my data science awareness in a variety of areas, I’ve been putting together a list of podcasts specific to data science.
My MOOC Study Strategies
/mooc-study-strategies/
Mon, 09 Feb 2015 22:00:43 +0000
If you’ve looked into MOOCs (Massive Open Online Courses) at all, you have probably wondered how successful students are at completing them compared to traditional courses. The short answer? Not very. I’ve seen various numbers floating around in a variety of studies, citing completion rates as low as 4% and as high as 8%, but never have I seen an aggregate number over 10%.
People take MOOCs for a variety of reasons.
Johns Hopkins Data Science Specialization Review
/data-science-specialization-coursera-review/
Thu, 29 Jan 2015 22:00:14 +0000
It’s been a couple of weeks since Johns Hopkins issued final certificates for their Data Science Specialization on Coursera. I’m glad to say that I am now among the first crop of “alums” of the program. According to the last email we students received from our Johns Hopkins professors, about 2.3 million students have attempted at least one of the courses in the Data Science Specialization. Of those, 68,000 verified certificates were issued for completing a single course.
Data Science Capstone Review
/data-science-capstone-review/
Tue, 20 Jan 2015 22:00:25 +0000
Overview of the Data Science Capstone Project and Approach
The Johns Hopkins Data Science Capstone project concluded around Christmas last month. It was an interesting experience, and very different than the other classes. The project, a partnership with smartphone app maker SwiftKey, required students to create a predictive text web app that worked much like a smartphone keyboard.
I spent much of the almost 2 months of the project getting up to speed on the basic terminology and approaches of Natural Language Processing, a field dedicated to the interaction between computers and human languages.
Best R Tutorial Sites
/r-tutorial-sites/
Tue, 13 Jan 2015 22:00:16 +0000
There’s no doubt that the ability to analyze data and do predictive modeling by programming in R is a very valuable skill, whether you are looking to learn it for a college statistics class or one of the many great jobs that utilize R. If you are trying to get started on your own, however, you may find it is a little tricky. While there are tons of sites in the Codecademy model to get started with certain languages like JavaScript, PHP, CSS, or HTML, there are fewer options for getting started with R.
Thoughts on Completing the 9 Johns Hopkins Data Science Courses
/thoughts-completing-9-johns-hopkins-data-science-courses/
Mon, 08 Sep 2014 13:00:00 +0000
A process that began 4 months ago, the sequence of 9 Johns Hopkins Data Science Specialization courses on Coursera, wrapped up for me late last week with my last quiz in course 9, Developing Data Products. While I haven’t truly finished the specialization yet (the first ever capstone project doesn’t launch until late October), I still feel a sense of accomplishment.
According to our JHU professors, as of early August, over 800,000 students have attempted at least one course in the sequence.
Developing Data Products Coursera Review
/developing-data-products-coursera-review-2/
Fri, 05 Sep 2014 13:00:33 +0000
The ninth and final course prior to the capstone in Johns Hopkins Data Science Specialization on Coursera is Developing Data Products. This is the third and final course in the sequence taught by Brian Caffo. After taking the lead on two statistics courses, Statistical Inference and Regression Models, this class seemed to bring out a more humorous side in Caffo. On a couple of occasions, including the very first video, he had a bit of fun at his co-instructors’ expense with Go Animate videos.
Practical Machine Learning Coursera Review
/practical-machine-learning-coursera-review/
Wed, 03 Sep 2014 13:00:47 +0000
The eighth course in Johns Hopkins Data Science Specialization on Coursera is Practical Machine Learning. This is the third and final course in the sequence taught by Jeff Leek.
Probably more than any other course in the JHU series of classes, this is the one that feels like it brought the whole sequence together. Students of Practical Machine Learning need the skills developed throughout the rest of the sequence to be successful in this course, from basic R Programming (course 2) through Regression Models (course 7).
Regression Models Coursera Review
/regression-models-coursera-review/
Mon, 25 Aug 2014 19:28:30 +0000
The seventh course in Johns Hopkins Data Science Specialization on Coursera is Regression Models. This is the second course in the sequence taught by Brian Caffo, after Statistical Inference. Much like that course, the emphasis here is on mathematics, and people who have been out of the mathematical loop for a while will probably find this class to be a struggle.
In fact, after breezing through most of Statistical Inference, I found significant portions of this class to be more challenging.
Statistical Inference Coursera Review
/statistical-inference-coursera-review/
Fri, 22 Aug 2014 19:27:03 +0000
The sixth course in Johns Hopkins Data Science Specialization on Coursera is Statistical Inference. This is the first course in the specialization taught by Brian Caffo. In my review of the R Programming course, I mentioned that there were two places in the sequence that seemed (based solely on my observations of forum comments) to be bogging students down. R Programming was obviously the first. Statistical Inference is the second.
Reproducible Research Coursera Review
/reproducible-research-coursera-review/
Wed, 20 Aug 2014 19:15:25 +0000
The fifth course in Johns Hopkins Data Science Specialization on Coursera is Reproducible Research. This is the third and final course in the sequence taught by Roger Peng.
Among the first five courses in the specialization, Reproducible Research is the one (apart from The Data Scientist’s Toolbox) where I spent the least time learning new R code. Instead, the emphasis of this course was more philosophical in nature: writing up your research findings so that they can be shared with others and considered reproducible, though not necessarily replicable.
Exploratory Data Analysis Coursera Review
/exploratory-data-analysis-coursera-review/
Mon, 18 Aug 2014 19:15:15 +0000
The fourth course in Johns Hopkins Data Science Specialization on Coursera is Exploratory Data Analysis. This is the second class in the sequence taught by Roger Peng, after R Programming.
This course could just about as well be titled “Visualizing Data,” since most everything in the class emphasized methods of presenting data visually in R. The bulk of the time in the class was spent on the 3 most popular methods of graphing in R: the base plotting system, lattice plot, and ggplot2.
Getting and Cleaning Data Coursera Review
/getting-cleaning-data-coursera-review/
Fri, 15 Aug 2014 19:14:08 +0000
The third course in Johns Hopkins Data Science Specialization on Coursera is Getting and Cleaning Data. The purpose of this class is to get students familiar with the process of creating a “tidy” data set from a variety of different sources. Like The Data Scientist’s Toolbox, this class is taught by Jeff Leek.
The breadth of material covered in this course was spectacular. Dr. Leek spent the majority of the first two weeks of the course explaining how to read a variety of data sources into R, some of which I was pretty familiar with, but others I was learning about for the first time.
R Programming Coursera Review
/r-programming-coursera-review/
Wed, 13 Aug 2014 19:04:11 +0000
The second course in Johns Hopkins Data Science Specialization on Coursera is R Programming. I took this class concurrently with The Data Scientist’s Toolbox, which was more of a “warm up” class. If you don’t have much of a programming background, you’d better get warm quickly, because this class gets hot in a hurry for the uninitiated. R Programming is substantially more challenging than The Data Scientist’s Toolbox.
R Programming is taught by Roger Peng, who, based on forum feedback, seems to be a student favorite in the data science sequence.
Data Scientist’s Toolbox Review
/data-scientists-toolbox-review-coursera/
Mon, 11 Aug 2014 13:50:41 +0000
The Data Scientist’s Toolbox is the first course in the nine-course sequence (plus capstone) that Johns Hopkins is offering via Coursera towards a Data Science Specialization. This course was not only my first course in that sequence, it was my first class on Coursera. In fact, it was my first MOOC (massive open online course).
While I was under the impression that the general public is now pretty well informed about MOOCs, it’s been pretty obvious from speaking with my college-educated peers that that is not the case.
The Data Science Specialization from Johns Hopkins on Coursera
/data-science-specialization-coursera/
Fri, 08 Aug 2014 21:08:20 +0000
For nearly six months, this blog has gone quiet. Considering I’ve had 3+ years of posting an average of two times a week, that’s quite a stretch without writing. There have been a variety of reasons for my silence, including increasing work and family commitments, but as I alluded to last fall a couple of times on this blog, I’ve also been investigating new career options. The biggest reason for my silence is the significant amount of time that I’ve devoted to career questions over the first half of this year.