Tag Archives: Hadoop

Definitive Guides to Data Science and Analytics Things

The Definitive Guide to anything should be a helpful, informative road map to that topic, including visualizations, lessons learned, best practices, application areas, success stories, suggested reading, and more.  I don’t know if all such “definitive guides” can meet all of those qualifications, but here are some that do a good job:

  1. The Field Guide to Data Science (big data analytics by Booz Allen Hamilton)
  2. The Data Science Capability Handbook (big data analytics by Booz Allen Hamilton)
  3. The Definitive Guide to Becoming a Data Scientist (big data analytics)
  4. The Definitive Guide to Data Science – The Data Science Handbook (analytics)
  5. The Definitive Guide to doing Data Science for Social Good (big data analytics, data4good)
  6. The Definitive Q&A Guide for Aspiring Data Scientists (big data analytics, data science)
  7. The Definitive Guide to Data Literacy for all (analytics, data science)
  8. The Data Analytics Handbook Series (big data, data science, data literacy by Leada)
  9. The Big Analytics Book (big data, data science)
  10. The Definitive Guide to Big Data (analytics, data science)
  11. The Definitive Guide to the Data Lake (big data analytics by MapR)
  12. The Definitive Guide to Business Intelligence (big data, business analytics)
  13. The Definitive Guide to Natural Language Processing (text analytics, data science)
  14. A Gentle Guide to Machine Learning (analytics, data science)
  15. Building Machine Learning Systems with Python (a non-definitive guide) (data analytics)
  16. The Definitive Guide to Data Journalism (journalism analytics, data storytelling)
  17. The Definitive “Getting Started with Apache Spark” ebook (big data analytics by MapR)
  18. The Definitive Guide to Getting Started with Apache Spark (big data analytics, data science)
  19. The Definitive Guide to Hadoop (big data analytics)
  20. The Definitive Guide to the Internet of Things for Business (IoT, big data analytics)
  21. The Definitive Guide to Retail Analytics (customer analytics, digital marketing)
  22. The Definitive Guide to Personalization Maturity in Digital Marketing Analytics (by SYNTASA)
  23. The Definitive Guide to Nonprofit Analytics (business intelligence, data mining, big data)
  24. The Definitive Guide to Marketing Metrics & Analytics
  25. The Definitive Guide to Campaign Tagging in Google Analytics (marketing, SEO)
  26. The Definitive Guide to Channels in Google Analytics (SEO)
  27. A Definitive Roadmap to the Future of Analytics (marketing, machine learning)
  28. The Definitive Guide to Data-Driven Attribution (digital marketing, customer analytics)
  29. The Definitive Guide to Content Curation (content-based marketing, SEO analytics)
  30. The Definitive Guide to Collecting and Storing Social Profile Data (social big data analytics)
  31. The Definitive Guide to Data-Driven API Testing (analytics automation, analytics-as-a-service)
  32. The Definitive Guide to the World’s Biggest Data Breaches (visual analytics, privacy analytics)

Follow Kirk Borne on Twitter @KirkDBorne

Drilling Through Data Silos with Apache Drill

Enterprise data collections are typically stored in silos belonging to different business divisions. Sometimes these silos belong to different projects within the same division. These silos may be further segmented by services/products and functions. Silos (which stifle data-sharing and innovation) are often identified as a primary impediment (both practically and culturally) to business progress, and thus they may be the cause of numerous difficulties. For example, silos make it more challenging to streamline important business processes, ranging from compliance to data discovery. But breaking down the silos may not be so easy to accomplish. In fact, it is often infeasible due to ownership issues, governance practices, and regulatory concerns.

Big Data silos create additional complications, including data duplication (and associated increased costs), complicated data replication solutions, high data latency, and data quality concerns. Worse, they enable the truly problematic situation in which your data repositories hold different versions of the truth. The silos also limit business intelligence (discovery and actionable insights). As big data best practices rise above the hype and noise, we now know that actionable value is more easily extracted when multiple data sets can be integrated and viewed holistically.

Data analysts naturally want to break down silos to combine data from multiple data sources. Unfortunately, this can create its own bottleneck: a complex integration labyrinth—which is costly to maintain, rarely performs well, and can’t be guaranteed to provide consistent results.

In response, many companies have deployed Apache Hadoop to address the problem of segregated data. Hadoop enables multiple types of data to be directly processed in place, and it fully supports data integration from multiple sources across different data storage technologies.
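As a toy illustration of integrating differently formatted sources at read time, here is a minimal Python sketch. It is not Hadoop or Drill code: the two "silos" (a CSV sales export and a JSON-lines profile store), their field names, and the function `revenue_by_region` are all hypothetical, chosen only to show the idea of joining data where it sits rather than first copying it into a single store.

```python
import csv
import io
import json

# Hypothetical silo 1: a sales export kept as CSV by one division.
SALES_CSV = """customer_id,amount
c1,120.50
c2,75.00
c1,30.25
"""

# Hypothetical silo 2: customer profiles kept as JSON lines by another division.
PROFILES_JSONL = """{"customer_id": "c1", "region": "East"}
{"customer_id": "c2", "region": "West"}
"""


def revenue_by_region(sales_csv, profiles_jsonl):
    """Join the two sources on customer_id at read time and total sales per region."""
    # Build a lookup from the profile silo without converting it to another format.
    region_of = {}
    for line in profiles_jsonl.splitlines():
        if line.strip():
            record = json.loads(line)
            region_of[record["customer_id"]] = record["region"]

    # Stream the sales silo and aggregate against the lookup.
    totals = {}
    for row in csv.DictReader(io.StringIO(sales_csv)):
        region = region_of.get(row["customer_id"], "unknown")
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


print(revenue_by_region(SALES_CSV, PROFILES_JSONL))
```

Each source stays in its native format, and the only shared agreement is the join key. Sales for customers with no matching profile are grouped under "unknown" rather than dropped.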

Organizations that use Hadoop are finding additional benefits with Apache Drill, an open source system inspired by Google’s Dremel…

(continue reading here https://www.mapr.com/blog/drive-innovation-breaking-down-data-silos-apache-drill)

These are a few of my favorite things… in Big Data and Data Science: A to Z

A while back, we made a list from A to Z of a few of our favorite things in big data and data science. We have made a lot of progress toward covering several of these topics. Here’s a handy list of the write-ups that I have completed so far:

A – Association rule mining:  described in the article “Association Rule Mining – Not Your Typical Data Science Algorithm.”

C – Characterization:  described in the article “The Big C of Big Data: Top 8 Reasons that Characterization is ‘ROIght’ for Your Data.”

H – Hadoop (of course!):  described in the article “H is for Hadoop, along with a Huge Heap of Helpful Big Data Capabilities.” To learn more, check out the Executive’s Guide to Big Data and Apache Hadoop, available as a free download from MapR.

K – K-anything in data mining:  described in the article “The K’s of Data Mining – Great Things Come in Pairs.”

L – Local linear embedding (LLE):  described in detail in the blog post series “When Big Data Goes Local, Small Data Gets Big – Part 1” and “Part 2.”

N – Novelty detection (also known as “Surprise Discovery”):  described in the articles “Outlier Detection Gets a Makeover – Surprise Discovery in Scientific Big Data” and “N is for Novelty Detection…” To learn more, check out the book Practical Machine Learning: A New Look at Anomaly Detection, available as a free download from MapR.

P – Profiling (specifically, data profiling):  described in the article “Data Profiling – Four Steps to Knowing Your Big Data.”

Q – Quantified and Tracked:  described in the article “Big Data is Everything, Quantified and Tracked: What this Means for You.”

R – Recommender engines:  described in two articles: “Design Patterns for Recommendation Systems – Everyone Wants a Pony” and “Personalization – It’s Not Just for Hamburgers Anymore.” To learn more, check out the book Practical Machine Learning: Innovations in Recommendation, available as a free download from MapR.

S – SVM (Support Vector Machines):  described in the article “The Importance of Location in Real Estate, Weather, and Machine Learning.”

Z – Zero bias, Zero variance:  described in the article “Statistical Truisms in the Age of Big Data.”

Top 10 Conversations That You Don’t Want to Have on Data Privacy Day

On January 28, the world observes Data Privacy Day. Here are the top 10 conversations that you do not want to have on that day. Let the countdown begin….

10.  CDO (Chief Data Officer) speaking to Data Privacy Day event manager who is trying to re-schedule the event for Father’s Day: “Don’t do that! It’s pronounced ‘Day-tuh’, not ‘Dadda’.”

9.  CDO speaking at the company’s Data Privacy Day event regarding an acronym that was used to list his job title in the event program guide: “I am the company’s Big Data ‘As A Service’ guru, not the company’s Big Data ‘As Software Service’ guru.”  (Hint: that’s BigData-aaS, not BigData-aSS)

8.  Data Scientist speaking to Data Privacy Day session chairperson: “Why are all of these cows on stage with me? I said I was planning to give a LASSO demonstration.”

7.  Any person speaking to you: “Our organization has always done big data.”

6.  You speaking to any person: “Seriously? … The title of our Data Privacy Day Event is ‘Big Data is just Small Data, Only Bigger’.”

5.  New cybersecurity administrator (fresh from college) sends this e-mail to company’s Data Scientists at 4:59pm: “The security holes in our data-sharing platform are now fixed. It will now automatically block all ports from accepting incoming data access requests between 5:00pm and 9:00am the next day.  Gotta go now.  Have a nice evening.  From your new BFF.”

4.  Data Scientist to new HR Department Analytics Specialist regarding the truckload of tree seedlings that she received as her end-of-year company bonus:  “I said in my employment application that I like Decision Trees, not Deciduous Trees.”

3.  Organizer for the huge Las Vegas Data Privacy Day Symposium speaking to the conference keynote speaker: “Oops, sorry.  I blew your $100,000 speaker’s honorarium at the poker tables in the Grand Casino.”

2.  Over-zealous cleaning crew speaking to Data Center Manager arriving for work in the morning after Data Privacy Day event that was held in the company’s shiny new Exascale Data Center: “We did a very thorough job cleaning your data center. And we won’t even charge you for the extra hours that we spent wiping the dirty data from all of those disk drives that you kept talking about yesterday.”

1.  Announcement to University staff regarding the Data Privacy Day event:  “Dan Ariely’s keynote talk ‘Big Data is Like Teenage Sex’ is being moved from room B002 in the Physics Department to the Campus Football Stadium due to overwhelming student interest.”

New Directions for Big Data and Analytics in 2015

The world of big data and analytics is remarkably vibrant and marked by incredible innovation, and there are advancements on every front that will continue into 2015. These include increased data science education opportunities and training programs, in-memory analytics, cloud-based everything-as-a-service, innovations in mobile (business intelligence and visual analytics), broader applications of social media (for data generation, consumption and exploration), graph (linked data) analytics, embedded machine learning and analytics in devices and processes, digital marketing automation (in retail, financial services and more), automated discovery in sensor-fed data streams (including the internet of everything), gamification, crowdsourcing, personalized everything (medicine, education, customer experience and more) and smart everything (highways, cities, power grid, farms, supply chain, manufacturing and more).

Within this world of wonder, where will we wander with big data and analytics in 2015? I predict two directions for the coming year…

(continue reading here http://www.ibmbigdatahub.com/blog/new-directions-big-data-and-analytics-2015)

The Power of Three: Big Data, Hadoop, and Finance Analytics

Big data is a universal phenomenon. Every business sector and aspect of society is being touched by the expanding flood of information from sensors, social networks, and streaming data sources. The financial sector is riding this wave as well. We examine here some of the features and benefits of Hadoop (and its family of tools and services) that enable large-scale data processing in finance (and consequently in nearly every other sector).

Three of the greatest benefits of big data are discovery, improved decision support, and greater return on innovation. In the world of finance, these also represent critical business functions….

(continue reading here:  https://www.mapr.com/blog/potent-trio-big-data-hadoop-and-finance-analytics)

When Big Data Goes Local, Small Data Gets Big

This two-part series focuses on the value of doing small data analyses on a big data collection.  In Part 1 of the series, we describe the applications and benefits of “small data” in general terms from several different perspectives.  In Part 2 of the series, we’ll spend some quality time with one specific algorithm (Local Linear Embedding) that enables local subsets of data (i.e., small data) to be used in developing a global understanding of the full big data collection.

We often hear that small data deserves at least as much attention in our analyses as big data.  While there may be as many interpretations of that statement as there are definitions of big data, there are at least two situations where “small data” applications are worth considering…

(continue reading here https://www.mapr.com/blog/when-big-data-goes-local-small-data-gets-big-part-1)

Apervi’s Conflux Gives a Big Boost to a Confluence of Big Data Workflows

Data-driven workflows are the lifeblood of big data professionals everywhere: data scientists, data analysts, and data engineers. We perform all types of data functions in these workflow processes: archive, discover, access, visualize, mine, manipulate, fuse, integrate, transform, feed models, learn models, validate models, deploy models, etc. It is a dizzying day’s work. We start manually in our workflow development, identifying what needs to happen at each stage of the process, what data are needed, when they are needed, where the data need to be staged, what the inputs and outputs are, and more.  If we are really good, we can improve our efficiency in performing these workflows manually, but not substantially. A better path to success is to employ a workflow platform that is scalable (to larger data), extensible (to more tasks), more efficient (shorter time-to-solution), more effective (better solutions), adaptable (to different user skill levels and to different business requirements), comprehensive (providing a wide scope of functionality), and automated (to break the time barrier of manual workflow activities).

(continue reading here http://www.bigdatanews.com/group/bdn-daily-press-releases/forum/topics/apervi-s-conflux-gives-a-big-boost-to-a-confluence-of-big-data-wo)
