<p>Featured Blog Posts - AnalyticBridge (feed retrieved 2018-05-23)</p>

<h2>The Role of Predictive Analytics in Medical Diagnosis</h2>
<p><em>Goli Tajadod, 2018-05-22</em></p>
<p>Pre<span>dictive analytics uses current and historical data in order to determine the probability of a particular outcome. This is a particularly powerful approach when it is applied to medical diagnosis. In an effort to reduce misdiagnosis, historical data of former patient’s symptoms may be applied to the assessment of a new patient.</span></p>
<p></p>
<p><span>While doctors are the ultimate experts and decision-makers, using predictive analytics as a means of establishing precedent for particular patient ailments can be of significant benefit in lieu of a second or even a third or fourth opinion. Predictive analytics can help doctors to make even more informed and insightful diagnoses by using current and/or historical data. After all, the diagnosis phase is certainly the most important when it is put into the context of the patient’s overall journey from definition of initial symptoms to ultimate recovery. This is particularly important when one considers that misdiagnosis accounts for as much as one third of all medical errors, and diagnosis error can be three times more common than prescription error.*</span></p>
<p></p>
<p><span>The latest version of FlexRule includes a new sample project called Predictive Analytics. This describes how a patient’s history at a particular medical centre during a specific period of time could help doctors validate the diagnoses for new patients with a similar set of symptoms during the same time frame.</span></p>
<p></p>
<p><span>At the highest level, the FlexRule project involves three main actions. These are as follows:</span></p>
<p></p>
<ol>
<li><span>Read the patient’s historical data</span></li>
<li><span>Use the Naïve Bayes (NB) algorithm to train the model</span></li>
<li><span>Predict the diagnosis</span></li>
</ol>
<p><strong><span><a href="http://api.ning.com:80/files/67AHmvAIx7SmGU6VGHMDr7Bhvx7msAOB-Ffcj2uboBeOISkSg2jeD0ZTFJ0gPmmlFz67k*JYOwbU2h*1m-3Tbby25xCPaU9s/Goli1.png" target="_self"><img src="http://api.ning.com:80/files/67AHmvAIx7SmGU6VGHMDr7Bhvx7msAOB-Ffcj2uboBeOISkSg2jeD0ZTFJ0gPmmlFz67k*JYOwbU2h*1m-3Tbby25xCPaU9s/Goli1.png" width="458" class="align-left"/></a></span></strong></p>
<ol>
<li><strong><span>Reading the Patient’s Historical Data<br/> <br/></span></strong> During this first stage of the FlexRule project, the patient’s data is read from a Data Sheet (in this case a CSV file). This data sheet includes 20 patients, all of whom have different symptoms associated with their doctor’s final diagnoses and associated lab tests. For example, the patient on the first row experienced a sore throat and sneezing, whereas the second patient did not have a sore throat, but did have a stuffy nose, as well as symptoms of fatigue. Both patients were subsequently diagnosed with colds.</li>
</ol>
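<p>Reading such a data sheet takes only a few lines of Python. The file contents and column names below are invented to mirror the description above; FlexRule's actual sheet layout is not shown here.</p>

```python
import csv
import io

# A tiny stand-in for the patient data sheet (hypothetical column names;
# "T"/"F" symptom flags plus the doctor's final diagnosis).
csv_text = """Sore Throat,Stuffy Nose,Sneezing,Fatigue,Diagnosis
T,F,T,F,Cold
F,T,F,T,Cold
"""

# Each row becomes a dict keyed by the header names.
patients = list(csv.DictReader(io.StringIO(csv_text)))
print(len(patients), patients[0]["Diagnosis"])  # → 2 Cold
```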
<p></p>
<p><a href="http://api.ning.com:80/files/67AHmvAIx7SdRTMwcWlKnmijILOXwgB2PX8GCa9mt-FF2KFVFj6Alp0SQDg*Ql8EYL1ssIjhCGrahkZzrpuny3RI3luIkjEe/Goli2.png" target="_self"><img src="http://api.ning.com:80/files/67AHmvAIx7SdRTMwcWlKnmijILOXwgB2PX8GCa9mt-FF2KFVFj6Alp0SQDg*Ql8EYL1ssIjhCGrahkZzrpuny3RI3luIkjEe/Goli2.png" width="508" class="align-center"/></a></p>
<p><span> </span></p>
<ol start="2">
<li><strong><span>Use the Naïve Bayes (NB) Algorithm to Train the Model<br/> <br/></span></strong> The next step is to train the new predictive model, or reload an existing model, by using the Naïve Bayes (NB) algorithm. This is a scalable classification algorithm for the type of dataset used in the FlexRule project, and it is often applied successfully to medical problems with large repositories of data and features. At this point, the FlexRule model is ready to predict a diagnosis for patients who display similar symptoms to those already assessed. </li>
</ol>
<p><span><br/></span></p>
<ol start="3">
<li><strong><span>Predict the Diagnosis<br/> <br/></span></strong> In the final step, the data relating to a new patient’s symptoms is passed to the predictive model. For example:<br/> <br/>
<pre>{
  "Fatigue": null,
  "Stuffy Nose": "T",
  "Sneezing": "T",
  "Sore Throat": "T"
}</pre>
Using all of the available historical data together with the new patient’s symptoms, the FlexRule predictive model then calculates the probability of each candidate diagnosis, as shown below:</li>
</ol>
<p></p>
<p><a href="http://api.ning.com:80/files/67AHmvAIx7TK3Wcf1rrBTU4sYwXChKfnQM5AvjpmZF6kPomtrCotpKvSJfbEsMegYI6h8ZdbE1yFSqwps3u5w5ulPkMgrXim/Goli3.png" target="_self"><img src="http://api.ning.com:80/files/67AHmvAIx7TK3Wcf1rrBTU4sYwXChKfnQM5AvjpmZF6kPomtrCotpKvSJfbEsMegYI6h8ZdbE1yFSqwps3u5w5ulPkMgrXim/Goli3.png" width="403" class="align-center"/></a><br/> <br/> In this example, it is very clear that our new patient will most likely be diagnosed as having a cold, as the probability is shown as 74.81%. The probability of our patient having the flu or allergies is 8.76% and 16.41% respectively.</p>
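<p>For readers curious how the Naïve Bayes steps above fit together, here is a from-scratch sketch. The training rows, feature names and smoothing choices are invented for illustration; FlexRule's internal implementation is not shown here.</p>

```python
from collections import Counter, defaultdict

# Hypothetical training rows mirroring the symptom/diagnosis sheet described
# earlier (invented values). Each row: (features, final diagnosis).
train = [
    ({"Sore Throat": "T", "Sneezing": "T", "Stuffy Nose": "F"}, "Cold"),
    ({"Sore Throat": "T", "Sneezing": "F", "Stuffy Nose": "T"}, "Cold"),
    ({"Sore Throat": "F", "Sneezing": "F", "Stuffy Nose": "F"}, "Flu"),
    ({"Sore Throat": "T", "Sneezing": "T", "Stuffy Nose": "T"}, "Allergies"),
]

# "Training" is just counting: class priors and per-class symptom value counts.
priors = Counter(label for _, label in train)
counts = defaultdict(Counter)  # (feature, label) -> Counter of observed values
for feats, label in train:
    for f, v in feats.items():
        counts[(f, label)][v] += 1

def predict_proba(feats):
    """Combine counts with add-one (Laplace) smoothing; the +2 in the
    denominator assumes binary "T"/"F" features."""
    scores = {}
    for label, prior in priors.items():
        p = prior / len(train)
        for f, v in feats.items():
            c = counts[(f, label)]
            p *= (c[v] + 1) / (sum(c.values()) + 2)
        scores[label] = p
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

new_patient = {"Sore Throat": "T", "Sneezing": "T", "Stuffy Nose": "T"}
probs = predict_proba(new_patient)
print(max(probs, key=probs.get))  # → Cold
```

With these toy rows the model favours a cold, echoing the result above, though the exact percentages depend entirely on the training data.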
<p>While this is a very simple example, it shows how FlexRule and predictive modelling can help a doctor make better-quality patient diagnoses.</p>
<p><span>No doubt most patients would also agree with this type of approach to medical diagnoses!</span></p>
<p><span>Read more <a href="http://www.flexrule.com" target="_blank" rel="noopener">here</a>. </span></p>
<p></p>
<p><em>* Ian Ayres, Super Crunchers: How Anything Can Be Predicted, page 97</em></p>

<h2>Machine Learning with Signal Processing Techniques</h2>
<p><em>Ahmet Taspinar, 2018-04-29</em></p>
<p>Stochastic Signal Analysis is a field of science concerned with the processing, modification and analysis of (stochastic) signals.</p>
<p>Anyone with a background in Physics or Engineering knows to some degree about signal analysis techniques, what these techniques are, and how they can be used to analyze, model and classify signals.</p>
<p>Data Scientists coming from different fields, like Computer Science or Statistics, might not be aware of the analytical power these techniques bring with them.</p>
<p>In this blog post, we will have a look at how we can use Stochastic Signal Analysis techniques, in combination with traditional Machine Learning Classifiers for accurate classification and modelling of time-series and signals.</p>
<p>At the end of the blog post you should be able to understand the various signal-processing techniques which can be used to retrieve features from signals, and be able to <a href="http://ieeexplore.ieee.org/abstract/document/1306572/" target="_blank" rel="noopener">classify ECG signals</a> (and even <a href="http://ieeexplore.ieee.org/abstract/document/4427376/" target="_blank" rel="noopener">identify a person</a> by their ECG signal), <a href="https://www.kaggle.com/c/seizure-prediction" target="_blank" rel="noopener">predict seizures</a> from EEG signals, <a href="https://www.spiedigitallibrary.org/conference-proceedings-of-spie/5808/0000/Comparative-analysis-of-feature-extraction-2D-FFT-and-wavelet-and/10.1117/12.597305.short" target="_blank" rel="noopener">classify and identify</a> targets in radar signals, and <a href="https://link.springer.com/article/10.1007/s10916-005-5184-7" target="_blank" rel="noopener">identify patients with neuropathy or myopathy</a> from EMG signals by using the FFT.</p>
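<p>As a minimal taste of that kind of feature extraction, the snippet below computes a naive discrete Fourier transform of a synthetic sine wave and recovers its dominant frequency; the sampling rate and test signal are invented for illustration, and a real pipeline would use an FFT library and feed such features to a classifier.</p>

```python
import cmath
import math

def dft_magnitudes(x):
    """Naive O(N^2) discrete Fourier transform; returns the magnitude
    spectrum for the first N/2 bins (the rest mirror them)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N // 2)]

# Synthesize a 5 Hz sine sampled at 64 Hz for one second,
# so frequency bin k corresponds to k Hz.
fs, f0, N = 64, 5, 64
signal = [math.sin(2 * math.pi * f0 * n / fs) for n in range(N)]

spectrum = dft_magnitudes(signal)
dominant = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
print(dominant)  # → 5 (i.e. the 5 Hz component dominates)
```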
<p> <a href="http://api.ning.com:80/files/FeYSMijACsk1jOwDZw4uZFcnvuO5Tpsv3aiU0ON1dCCVIk0jEOdl7duqCAPVwDshTBbJAVUo3*NIPbF2lPYFfL3vss47nMou/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/FeYSMijACsk1jOwDZw4uZFcnvuO5Tpsv3aiU0ON1dCCVIk0jEOdl7duqCAPVwDshTBbJAVUo3*NIPbF2lPYFfL3vss47nMou/Capture.PNG" width="522" class="align-center"/></a></p>
<p>To read more, <a href="http://ataspinar.com/2018/04/04/machine-learning-with-signal-processing-techniques/" target="_blank" rel="noopener">click here</a>.</p>

<h2>I Analyzed 10 MM digits of SQRT(2) - Look at My Findings</h2>
<p><em>Vincent Granville, 2018-04-01</em></p>
<p>This article is intended for practitioners who might not necessarily be statisticians or statistically savvy. The mathematical level is kept as simple as possible, yet I present an original, simple approach to test for randomness, with an interesting application to illustrate the methodology. This material is not something usually discussed in textbooks or classrooms (even for statistics students), offering a fresh perspective and out-of-the-box tools that are useful in many contexts, as an addition or alternative to traditional tests that are widely used. This article is written as a tutorial, but it also features an interesting research result in the last section. The example used in this tutorial shows how intuition can be wrong, and why you need data science.</p>
<p>The main question that we want to answer is: Are some events occurring randomly, or is there a mechanism that keeps them from occurring randomly? What is the gap distribution between two successive events of the same type? In a time-continuous setting (Poisson process) the distribution in question is modeled by the exponential distribution. In the discrete case investigated here, the discrete Poisson process turns out to be a Markov chain, and we are dealing with geometric, rather than exponential, distributions. Let us illustrate this with an example.</p>
<p><strong>Example</strong></p>
<p>The digits of the square root of two (SQRT(2)) are believed to be distributed as if they were occurring randomly. Each of the 10 digits 0, 1, ... , 9 appears with a frequency of 10% based on observations, and at any position in the decimal expansion of SQRT(2), on average the next digit does not seem to depend on the value of the previous digit (in short, its value is unpredictable). An event in this context is defined, for example, as a digit being equal to (say) 3. The next event is the first time when we find a subsequent digit also equal to 3. The <em>gap</em> (or time elapsed) between two occurrences of the same digit is the main metric that we are interested in, and it is denoted as <em>G</em>. If the digits were distributed just like random numbers, the distribution of the gap <em>G</em> between two occurrences of the same digit would be geometric.</p>
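<p>This geometric claim is easy to sanity-check by simulation, with pseudo-random digits standing in for the digits of SQRT(2). Under randomness, P(G = g) = 0.1 × 0.9^(g-1), so for example P(G = 1) should be close to 0.1:</p>

```python
import random

random.seed(42)

# Simulate "random digits" and measure the gaps between successive
# occurrences of the digit 3.
digits = [random.randrange(10) for _ in range(100_000)]
positions = [i for i, d in enumerate(digits) if d == 3]
gaps = [b - a for a, b in zip(positions, positions[1:])]

# Under the geometric model with p = 0.1, P(G = 1) = 0.1.
empirical_p1 = gaps.count(1) / len(gaps)
print(round(empirical_p1, 3))  # close to 0.1
```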
<p>Do you see any pattern in the digits below? <span style="text-decoration: underline;"><a href="https://www.datasciencecentral.com/profiles/blogs/stochastic-processes-new-tests-for-randomness-application-to-numb" target="_blank" rel="noopener"><strong>Read full article here</strong></a></span> to find the answer, and to learn more about a powerful statistical technique.</p>
<p><span><a href="http://api.ning.com:80/files/-GKNVYVm1pzNifdAosz5N7FDRWbbb0flbUCjHl*QvZ83lFwdI5-IAYr0q5Gk5-pgQopIEisLmP9vZIVsv6pYwQ0htdVPboXL/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/-GKNVYVm1pzNifdAosz5N7FDRWbbb0flbUCjHl*QvZ83lFwdI5-IAYr0q5Gk5-pgQopIEisLmP9vZIVsv6pYwQ0htdVPboXL/Capture.PNG" width="528" class="align-center"/></a></span></p>
<h2>What is an Analytics Translator and Why is the Role Important to Your Organization?</h2>
<p><em>Kartik Patel, 2018-02-23</em></p>
<p><img src="http://api.ning.com:80/files/NEOjH3acnx6vbJlScQ7f2s389jvbWAV6qi0SznfnEJGSFCb2GeqH5RoAJQGglrCWiSxz2W6TnkCDeIGbXi4xz3lgJOvQQztu/whatisananalyticstranslatorandwhyistheroleimportanttoyourorganizationbanner.jpg" width="700"/></p>
<p>Today, enterprises recognize the critical value of advanced analytics within the organization and they are implementing data democratization initiatives. As these initiatives evolve, new roles emerge in the organization. The newest of these analysis-related roles is the<span> </span><strong>'analytics translator'</strong>. As the enterprise considers the relevance of this new role within the business, it is important to understand the responsibilities of an Analytics Translator, and how this role might help the organization to achieve its goals. </p>
<p><strong><em>What is an Analytics Translator?</em></strong></p>
<p>The Analytics Translator is an important member of the new analytical team. As organizations encourage data democratization and implement self-serve business intelligence and advanced analytics, business users can leverage machine learning, self-serve data preparation, and predictive analytics to gather, prepare and analyze data. The emerging role of Analytics Translator adds resources to a team that includes IT, data scientists, data architects and others.</p>
<p>Analytics Translators do not have to be analytical specialists or trained professionals. With the right tools, they can easily translate data and analysis without the skills of a highly trained data pro.</p>
<blockquote>Using their knowledge of the business and their area of expertise, translators can help the management team focus on targeted areas like production, distribution, pricing and even cross-functional initiatives.</blockquote>
<p>With self-serve, advanced analytics tools, translators can then identify patterns, trends and opportunities, and problems. This information is then handed off to data scientists and professionals to further clarify and produce crucial reports and data with which management teams can make strategic and operational decisions.</p>
<p><strong><em>Why is an Analytics Translator Important to Your Organization?</em></strong></p>
<p>IT resources and data professionals are typically in short supply within an organization and, if the enterprise wishes to increase staff, the cost of these highly skilled professionals can be prohibitive. In the average organization, these resources are usually stretched thin and time is wasted on projects that are:</p>
<ul>
<li>Too complex for business team members</li>
<li>Ill-conceived, or inappropriate for attention at the data scientist or IT level</li>
<li>Comprised of incomplete requirements</li>
<li>Required for day-to-day or immediate analysis or data sharing initiatives</li>
<li>Tactical or low-level operational in nature</li>
</ul>
<p>The time it takes for a data professional or IT professional to review such a project and assign it a priority takes them away from more strategic or more critical tasks, and in the process the business user may miss day-to-day deadlines or information that is critical to them. The data professional may also need more information on requirements, which will further delay the project. There are many examples of unnecessary or inappropriate data analysis requests, and many instances where a business user with access to analytical tools could do the work themselves. But there are even more examples of projects or analytical requirements that fall somewhere between the skills of a business user and those of a trained data scientist, and just as many examples of poorly understood or poorly translated data analysis that sends a business user off in the wrong direction.</p>
<p>That is where the Analytics Translator comes in. Using her or his knowledge of the industry, the organization, the team and the analytics tools, the translator can play a crucial role in understanding requirements, preparing data and producing and explaining information in a way that is accurate and clear. As this role evolves within your organization, you will find that, by allowing the average business user to work with the Analytics Translator, that business user will become more knowledgeable and skilled in interpreting and understanding data.</p>
<p><strong><em>The Ideal Analytics Translator</em></strong></p>
<p>When identifying possible candidates for the Analytics Translator role, the organization should look for skills that can be nurtured and optimized. The ideal candidate is:</p>
<ul>
<li>A power user of self-serve BI tools</li>
<li>Recognized as an expert in a functional, industry or organizational role</li>
<li>Comfortable with building and presenting reports and use cases</li>
<li>Works well with technical and management teams</li>
<li>Manages projects, milestones and dependencies with ease</li>
<li>Able to translate analysis and conclusions into actionable recommendations</li>
<li>Comfortable with metrics, measurements and prioritization</li>
<li>Acts as a role model for user and team member adoption of new processes and data-driven decisions</li>
</ul>
<p>If this role is recognized as important to the organization, most enterprises will structure a logical program to identify and train candidates to ensure uniform skills and performance.</p>
<p>By combining domain, organizational and industry skills with self-serve analytical tools, the Analytics Translator can help the enterprise to achieve low total cost of ownership (TCO) and rapid return on investment (ROI) for its business intelligence and advanced analytics initiatives and can encourage and nurture data democratization and optimal analytical business results within the organization.</p>
<blockquote><strong>Citizen Data Scientists/Citizen Analysts<span> </span></strong>play a crucial role in day-to-day analysis and decision-making, using self-serve business intelligence tools.<span> </span><strong>Analytics Translators </strong>bridge the gap between IT, data scientists and business users, and move initiatives forward by acting as a liaison and topic expert to help the organization focus on the right things to achieve its goals.</blockquote>
<p></p>
<p>As self-serve Advanced Analytics and data democratization become more common across industries and organizations, the role of the Analytics Translator will also become more important. As a power user of BI tools and Self-Serve Analytics, the translator functions as a liaison between critical analytical and technical resources and the business user community, and ensures that BI tools will be adopted and shared across the enterprise.</p>
<p>In our next article, we will consider the difference between the Analytics Translator and the Citizen Data Scientist or Citizen Analyst.</p>

<h2>What is Clickless Analysis? Can it Simplify Adoption of Augmented Analytics? (Part 1 of 3 articles)</h2>
<p><em>Kartik Patel, 2018-01-25</em></p>
<p><img src="http://api.ning.com:80/files/oIvcR-vaikiOGuU9WbDJOIqNypmKGx*DZZdLcAUoT8aaK6TCg-*xIbc8XKs**ezrwtzBFeQ-URlh-GCcxVwVTOTzCfZU8ahm/whatisclicklessanalysiscanitsimplifyadoptionofaugmentedanalyticsbanner.jpg" width="700"/></p>
<p>The concept of Clickless Analytics is one that will be happily embraced by business users and by the business enterprise. The reason is simple! Clickless Analytics allows users to find and analyze information without specialized skills, by using natural language.</p>
<p>In this, the first of a three-part series we discuss Clickless Analytics and how it can simplify user adoption of augmented analytics.</p>
<p>What is Clickless Analytics?</p>
<p>Clickless Analytics incorporates Natural Language Processing (NLP) and takes augmented analytics to the next level: machine learning and NLP in a self-serve environment that is easy enough for every business user. Business users can leverage sophisticated business intelligence tools to perform advanced data discovery by asking questions in natural language. The system translates that natural-language query into one the analytics platform can interpret, and returns the most appropriate answer in an appropriate form, such as a smart visualization, table, number, or contextual description in simple human language.</p>
<p><em><strong>Can Clickless Analytics Simplify Adoption of Augmented Analytics?</strong></em></p>
<p>Clickless Analytics, NLP and search analytics provide true data democratization of advanced analytics. Clickless Analytics incorporates NLP within a suite of Augmented Analytics features, leveraging computational linguistics, data mining, and analytical algorithms to provide a self-serve, natural language approach to data analysis. Search analytics and NLP filter through mountains of data to answer a question in a way the user can understand, thereby simplifying and speeding the decision process and ensuring clarity.</p>
<p>Clickless Analytics suggests relationships and offers insight to previously hidden information so that business users can 'discover' subtle, crucial business results, patterns, problems and opportunities. Clickless Analytics provides maximum results and business user access with minimum implementation time and minimal training.</p>
<p>Clickless Analytics and an NLP approach to augmented analytics utilize a Google-type interface where business users can enter a question in human language, e.g., 'what is our best-selling product in Arizona' or 'who is the best performing salesperson this year as compared to the previous year.' The ease of use assures user adoption, and the clarity of analysis and reporting achieved by the enterprise results in an environment where the team, managers and executives can achieve rapid, accurate results without the assistance of IT or business analysts.</p>
<p>The evolution of search analytics and the application of NLP search within the confines of a business intelligence solution have allowed the average organization to leap forward with advanced data discovery and the incorporation of these crucial tools into a self-serve environment for user empowerment and accountability. Clickless Analytics and NLP help businesses to achieve rapid ROI and sustain low total cost of ownership (TCO) with meaningful tools that are easy to understand, and as familiar as a Google search. These tools require very little training, and provide interactive tools that 'speak the language' of the user.</p>
<p>Clickless Analytics, NLP and Search Analytics are a crucial component of business intelligence and Augmented Analytics, and are essential to business success and to building and sustaining a competitive advantage.</p>
<p><strong>Watch for Part II and Part III of this article series:</strong> 'What is Search Analytics and Can it Improve Self-Serve Data Discovery?' and 'What is Natural Language Processing & How Does it Benefit a Business?'</p>

<h2>Supervised learning in disguise: the truth about unsupervised learning</h2>
<p><em>Danko Nikolic, 2018-02-14</em></p>
<p>One of the first lessons you’ll receive in machine learning is that there are two broad categories: supervised and unsupervised learning. Supervised learning is usually explained as the kind where you provide the correct answers as training data, and the machine learns the patterns to apply to new data. Unsupervised learning is (apparently) where the machine figures out the correct answer on its own.</p>
<p>Supposedly, unsupervised learning can discover something new that has not been found in the data before. Supervised learning cannot do that.</p>
<h2><span style="color: #ff6600;">The problem with definitions</span></h2>
<p>It’s true that there are two classes of machine learning algorithm, and each is applied to different types of problems, but is unsupervised learning really free of supervision?</p>
<p>In fact, this type of learning also involves a whole lot of supervision, but the supervision steps are hidden from the user. This is because the supervision is not explicitly presented in the data; you can only find it within the algorithm.</p>
<p>To understand this, let us first consider the use of supervised learning. A prototypical method for supervised learning is regression. Here, the input and the output values, named X and Y respectively, are provided to the algorithm. The learning algorithm then adjusts the model’s parameters so that it predicts the outputs (Y) for new inputs (X) as accurately as possible.</p>
<p>In other words, supervised learning finds a function: Y’ = f(X)</p>
<h2><span style="color: #ff6600;">Supervised learning success</span></h2>
<p>Supervised learning success is assessed by seeing how close Y’ is to Y, i.e. by computing an error function.</p>
<p>This general principle of supervision in learning is the basic principle for logistic regression, support vector machines, decision trees, deep learning networks and many other techniques.</p>
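<p>As a concrete sketch of this supervision, the snippet below fits Y' = f(X) for a simple linear model by least squares and scores it with a mean squared error function; the toy data is invented here.</p>

```python
# Supervised learning in miniature: both X and Y are provided,
# and we fit Y' = f(X) = a*X + b by ordinary least squares.
X = [0, 1, 2, 3, 4]
Y = [1, 3, 5, 7, 9]  # exactly Y = 2X + 1

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
a = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
b = my - a * mx

# Success is measured by an error function comparing Y' to Y.
mse = sum((a * x + b - y) ** 2 for x, y in zip(X, Y)) / n
print(a, b, mse)  # → 2.0 1.0 0.0
```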
<p>In contrast, unsupervised learning does not provide Y for the algorithm – only X is provided. Thus, for each given input we do not explicitly provide a correct output. The machine’s task is to “discover” Y on its own.</p>
<p>A common example is <a href="https://en.wikipedia.org/wiki/Cluster_analysis">cluster (or clustering) analysis</a>. Before a clustering analysis, there aren’t known clusters for the data points within the inputs, and yet the machine finds those clusters after the analysis. It’s almost as if the machine is creative – discovering something new in the data.</p>
<h2><span style="color: #ff6600;">Nothing new</span></h2>
<p>In fact, there is nothing new; the machine discovers only what it has been told to discover. Every unsupervised algorithm specifies what needs to be found in the data.</p>
<p>There must be a criterion saying what success is. We don’t let algorithms do whatever they want, or ask machines to perform random analyses. There is always a goal to be accomplished, and that goal is carefully formulated as a constraint within the algorithms.</p>
<p>For example, in a clustering algorithm, you may require the distances between cluster centroids to be maximized while the distances between data points belonging to the same cluster are minimized. In effect, for each data set there is an implicit Y: the objective may, for example, be to maximize the ratio of between-cluster distance to within-cluster distance.</p>
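<p>A minimal k-means sketch (NumPy, with made-up two-blob data) makes the hidden supervisor concrete: the algorithm never sees labels, yet its update rules exist only to minimize an internal criterion, the within-cluster sum of squared distances.</p>

```python
import numpy as np

# Two well-separated blobs of points; no labels are ever provided.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)),   # points around (0, 0)
               rng.normal(5.0, 0.5, (30, 2))])  # points around (5, 5)

k = 2
centroids = X[[0, 30]]   # deterministic init: one point from each region
for _ in range(10):
    # Assignment step: each point joins its nearest centroid ...
    labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
    # ... update step: centroids move to the mean of their members.
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# The internal criterion (the "implicit Y"): within-cluster sum of squares.
inertia = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
```

<p>The "discovered" clusters are exactly what the built-in objective told the algorithm to find, which is the point of this section.</p>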
<p>Therefore, the lack of supervision in these algorithms is nothing like the metaphorical “unsupervised child in a porcelain shop”, as this would not give us particularly useful machine learning. Instead, what we have is more akin to letting adults enter a porcelain shop without having to send a nanny too. The reason for our trust in adults is that they have already been supervised during childhood and have since (hopefully) internalized some of the rules.</p>
<p>Something similar happens with unsupervised machine learning algorithms; supervision has been internalized, as these methods come equipped with criteria that inform them of what good or bad model behaviour is. Just as (most) adults have an internal voice telling them not to smash every item in the shop, unsupervised machine learning methods possess internal machinery that dictates what constitutes good behaviour.</p>
<h2><span style="color: #ff6600;">Supervised vs. unsupervised</span></h2>
<p>Fundamentally, the difference between supervised and unsupervised learning boils down to whether the computation of error utilizes an externally provided Y, or whether Y is internally computed from input data (X).</p>
<p>In both cases there is a form of supervision.</p>
<p>As all unsupervised learning is actually supervised, the main differentiator becomes the frequency at which intervention takes place. For example, do we intervene for each data point or just once, when the algorithm for computing Y out of X is designed?</p>
<p>Hence, within the so-called unsupervised methods, supervision is present, but hidden (it is disguised) because no special effort is required from the end user to supply supervision data. The algorithm seems to be magically supervised without an apparent supervisor. However, this does not mean that someone hasn’t gone through the pain of setting up the proper equations to implement an internal supervisor.</p>
<p>Consequently, unsupervised learning methods don’t truly discover anything new in any way that would overshadow the “discoveries” of supervised methods.</p>
<p></p>
<p>This blog entry is reposted from my original blog entry at <a href="http://www.teradata.com">www.teradata.com</a></p>
<p></p><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/Cbv-amA7BsY" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:380742Easy Dashboards for Everyone Using Google Data Studiotag:www.analyticbridge.datasciencecentral.com,2018-01-11:2004291:BlogPost:3801002018-01-11T23:30:00.000ZLaura Ellishttps://www.analyticbridge.datasciencecentral.com/profile/LauraEllis
<div class="sqs-block html-block sqs-block-html" id="block-yui_3_17_2_4_1502412116261_16863"><div class="sqs-block-content"><p>No matter the job, most professionals do some level of analysis on their computer. There are always some data sets that live outside the corporate walls, or analyses that we know could be performed better in a not-easily-shared tool such as Excel, R, Python, SPSS, SAS and so on.</p>
<p>So how do you share your personal analysis with others? Oftentimes people export the graphs and tables and add them to a presentation file. One of the biggest drawbacks of this approach is that it can cause versioning and updating nightmares.</p>
<p>What if I told you that we could avoid all of this with dashboards? Some of you may say, "Yes, obviously, Laura. But I don't have a licensed BI tool or BI experts at my disposal! It's not a realistic scenario for me." Now in the past, I might've agreed with you. If you don't have a paid BI tool, it can be tricky. Free BI tool versions usually require the owner to host the software, or they limit the number of charts, viewers or users using the tool.</p>
<p>However, earlier this year, Google removed a number of restrictions from its free, hosted dashboarding software, Google Data Studio. Because of this, I decided to give the software a test drive and see how accessible it is to the non-BI expert.</p>
<p>Below I will take you through a tutorial that I wrote which should allow anyone to create a Google Data Studio dashboard about US Home Prices. It should take about half an hour of your time. It really is that easy. So please, have a try and let me know how it goes!</p>
<p><a href="http://api.ning.com:80/files/nZezOgr77xYoJEVQo1afApWG0vwooxwerFy*Jorb*DbPcDEOpUHNa*6UyDnK*OlVmX6l7Zp7Eq1CFcU8DWyx3qeuQj5DfOQB/DashboardPreview.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xYoJEVQo1afApWG0vwooxwerFy*Jorb*DbPcDEOpUHNa*6UyDnK*OlVmX6l7Zp7Eq1CFcU8DWyx3qeuQj5DfOQB/DashboardPreview.png?width=750" width="750" class="align-full"/></a></p>
<div class="sqs-block html-block sqs-block-html" id="block-dd70af8bc85339bd772f"><div class="sqs-block-content"><p></p>
<p><strong>The Tutorial Description</strong></p>
<p></p>
<p>For this tutorial I wanted to use some sample data to make a basic one page dashboard. It will feature some common dashboard elements such as: text, images, summary metrics, summary tables and maps. To do so I searched out free data sets and found out that Zillow offers summary data collected through their real estate business. </p>
<blockquote><em>Side note: Thank you Zillow, I love when companies share their data! </em></blockquote>
<p>I downloaded a number of the data sets that I thought would be interesting to display and did a little data processing to make dashboard creation easier. From there I set out to make a dashboard without reading any instructions to see how usable it really is. I have to say, it was easy! There are some odd beta style behaviors that I outline below, but all in all it is a great solution. </p>
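<p>The "little data processing" step can be sketched like this: collapse raw per-record rows into one row per state with an average metric, then write a tidy CSV for Data Studio to ingest. The column names and figures below are illustrative only, not the actual Zillow schema.</p>

```python
import csv
from collections import defaultdict

# Illustrative raw rows (hypothetical schema, not Zillow's real columns).
rows = [
    {"State": "WA", "HomeValue": 450000},
    {"State": "WA", "HomeValue": 470000},
    {"State": "TX", "HomeValue": 210000},
    {"State": "TX", "HomeValue": 230000},
]

# Group values by state and average them.
by_state = defaultdict(list)
for row in rows:
    by_state[row["State"]].append(row["HomeValue"])
summary = {state: sum(vals) / len(vals) for state, vals in by_state.items()}

# Write one tidy row per state - the shape a dashboard tool expects.
with open("zillow_summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["State", "AvgHomeValue"])
    for state, avg in sorted(summary.items()):
        writer.writerow([state, avg])
```

<p>One tidy row per geographic unit is what lets Data Studio's Geo and Table components work without further wrangling.</p>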
<p><strong>The Tutorial Steps</strong></p>
<p><strong>1. <span> </span></strong>Download the sample<span> </span><a target="_blank" href="https://github.com/lgellis/ZillowSampleData" rel="noopener">data set</a><span> </span>needed to create the sample. </p>
<p>Note: if you have trouble downloading the file from github, go to the main page and select "Clone or Download" and then "Download Zip" as per the picture below.</p>
<p></p>
<p><a href="http://api.ning.com:80/files/nZezOgr77xYJHDTMyaNTrCPmQKRbMZzDwUMaza*BbWUCiUdTGX84EtDX5F1cEUe3hf94*0Ph-IJ-zG0TyR5uvdmiKAtM59n5/ScreenShot20180111at5.25.40PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xYJHDTMyaNTrCPmQKRbMZzDwUMaza*BbWUCiUdTGX84EtDX5F1cEUe3hf94*0Ph-IJ-zG0TyR5uvdmiKAtM59n5/ScreenShot20180111at5.25.40PM.png?width=750" width="570" class="align-full" height="194"/></a></p>
<p>2. Sign up for Google Data Studio</p>
<p>3. Click "Start a New Report"</p>
<p></p>
<p><a href="http://api.ning.com:80/files/nZezOgr77xYsPtl-irOWgiEQ-HLw2V-ypjhaJdcJA9XRC-vCGrX2FYOQwJu7tDq-jfB0jWXTGeYbh-OlhQ5HgTPpQbgZSjNX/ScreenShot20180111at5.25.46PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xYsPtl-irOWgiEQ-HLw2V-ypjhaJdcJA9XRC-vCGrX2FYOQwJu7tDq-jfB0jWXTGeYbh-OlhQ5HgTPpQbgZSjNX/ScreenShot20180111at5.25.46PM.png?width=750" width="750" class="align-full"/></a></p>
</div>
</div>
<p><span>4. In the new report, add the file "</span><a href="https://github.com/lgellis/ZillowSampleData/blob/master/Zillow_Summary_Data_2017-06.csv">Zillow_Summary_Data_2017-06.csv</a><span>" downloaded as part of the zip file from the data set in step 1.</span></p>
<p></p>
<p></p>
<p></p>
<p><span><a href="http://api.ning.com:80/files/nZezOgr77xawLGjBn8G01XUNGLOYJtUzyiCarh8p1OeHkocuvwLMVNRZcTNDhTChwtNjFgVkjCsc83noOcDPqabg0gWR2hOF/ScreenShot20180111at5.25.53PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xawLGjBn8G01XUNGLOYJtUzyiCarh8p1OeHkocuvwLMVNRZcTNDhTChwtNjFgVkjCsc83noOcDPqabg0gWR2hOF/ScreenShot20180111at5.25.53PM.png?width=750" width="750" class="align-full"/></a></span></p>
<p><span>5. Modify the columns of the data set to ensure that "State" is of type "Geo">"Region" with no aggregation and the remaining columns are type "Numeric" >"Number" with "Average" as the aggregation.</span></p>
<p><span><a href="http://api.ning.com:80/files/nZezOgr77xahubcO7paLTOPOybpSfdZHPzAsUzTak6UGg895YPoFk8gdM4UjQsiGuw9lsiIccbrG1zE8KJP*5TsytAAHPwWd/ScreenShot20180111at5.28.33PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xahubcO7paLTOPOybpSfdZHPzAsUzTak6UGg895YPoFk8gdM4UjQsiGuw9lsiIccbrG1zE8KJP*5TsytAAHPwWd/ScreenShot20180111at5.28.33PM.png?width=750" width="750" class="align-full"/></a></span></p>
<p>6. Click "Add to Report". This will make the data source accessible to your new report.</p>
<p>Now we are ready to start building the report piece by piece. To make it easier, I have broken up the dashboard content into 5 pieces that can be added. We will tackle these one by one.</p>
<p><a href="http://api.ning.com:80/files/nZezOgr77xZAjalAnIlB0qUZ3DfDvKro9E15DugpgEMvRgYPnfg9xbYt1kMlr-HhYQAW7wHpoE32HxfONfU6iY5nX8nAbe7M/ScreenShot20180111at5.28.45PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xZAjalAnIlB0qUZ3DfDvKro9E15DugpgEMvRgYPnfg9xbYt1kMlr-HhYQAW7wHpoE32HxfONfU6iY5nX8nAbe7M/ScreenShot20180111at5.28.45PM.png?width=750" width="750" class="align-full"/></a></p>
<p><span>To add each of the components above, you will need to use the Google Data Studio Toolbar on the top navigation. The image below highlights each of the toolbar items that we will be using.</span></p>
<p></p>
<p><span><a href="http://api.ning.com:80/files/nZezOgr77xY4EGSwOBahzJbYkCugJGK4tLw-4rz-zqNzGkAB07imajoSKRqAYHEAMKyUBNPu3vFvzJgZ1GqYvfsBsxrPvVnF/ScreenShot20180111at5.29.51PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xY4EGSwOBahzJbYkCugJGK4tLw-4rz-zqNzGkAB07imajoSKRqAYHEAMKyUBNPu3vFvzJgZ1GqYvfsBsxrPvVnF/ScreenShot20180111at5.29.51PM.png?width=750" width="750" class="align-full"/></a></span></p>
<p>7. "A. Text"- Easy street. Let's add some text to the dashboard. Start by clicking the "Text" button highlighted in the toolbar above. Next, take the cross-hair and drag it over the space you want the text to occupy. Enter your text: "US Home Prices”. In the “Text Properties” select the size and type. I’m using size 72 and type “Roboto Condensed".</p>
<p>8. "B. Image"- Easy street part 2. Now we simply add a pretty picture to the dashboard. Start by clicking the "Image" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the image to occupy. Select the image "houseimage.jpg" that you downloaded from the GitHub repo.</p>
<p><a href="http://api.ning.com:80/files/nZezOgr77xb8D5CvolCwFCXncX1MVzPvJdliT3ibiXxyVQ*hxBABk0*TLgZkBTtvEZRvdLDO3n8CAovWAYKP78GWkk*GApDx/ScreenShot20180111at5.30.36PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xb8D5CvolCwFCXncX1MVzPvJdliT3ibiXxyVQ*hxBABk0*TLgZkBTtvEZRvdLDO3n8CAovWAYKP78GWkk*GApDx/ScreenShot20180111at5.30.36PM.png" width="295" class="align-full" height="336"/></a></p>
<p></p>
<p>9. "C. Scorecard Values"- Now we get into the real dashboarding exercises through metrics and calculations. Start by clicking the "Scorecard" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the first scorecard value to occupy. In the “data” tab, select the data set and appropriate metric. Start with the values in the image to the left. In the “style” tab select size 36 with the type "Roboto".</p>
<p>Repeat this for every metric in the "C. Scorecard Values" section.</p>
<p></p>
<p><a href="http://api.ning.com:80/files/nZezOgr77xYZ5NGJrs7PDAdvYgj2T6Qz92ElQPB3WFWS9oOMKcUTUEJiw*SZZe8P-lFxXOf3BPQd2rtIYl4y6zOObl-RewHZ/ScreenShot20180111at5.30.41PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xYZ5NGJrs7PDAdvYgj2T6Qz92ElQPB3WFWS9oOMKcUTUEJiw*SZZe8P-lFxXOf3BPQd2rtIYl4y6zOObl-RewHZ/ScreenShot20180111at5.30.41PM.png" width="296" class="align-full" height="357"/></a></p>
<p></p>
<p><span>10. "D. Map" - In this step we get more impressive, but not more difficult. We implement a map! Start by clicking the "Geo Map" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the map to occupy. Select the data set and appropriate metric as per the values in the image to the left.</span></p>
<p></p>
<p><span><a href="http://api.ning.com:80/files/nZezOgr77xYGw5WAI25WLSu7brRJpXqqK7jjsV2lWZD4vp921oklYVILTrlmA4BNCxtK79fZGoFyKLHeS9YZz1ecmbutwPC-/ScreenShot20180111at5.30.47PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xYGw5WAI25WLSu7brRJpXqqK7jjsV2lWZD4vp921oklYVILTrlmA4BNCxtK79fZGoFyKLHeS9YZz1ecmbutwPC-/ScreenShot20180111at5.30.47PM.png" width="178" class="align-full" height="407"/></a></span></p>
<p><span>11. "E. List"- Now we are going to list out all values in the Geo Map above ordered by their metric "Average Home Value". Start by clicking the "Table" button highlighted in the toolbar above. Take the cross-hair and drag it over the space you want the list to occupy. Select the data set and appropriate metric as per the values in the image to the left.</span></p>
<p></p>
<p><span><a href="http://api.ning.com:80/files/nZezOgr77xbEeXpJtmDPvsKfIAtjaLSu-2bbY7*MFckzAAtBK2KTG6qldo8MwHW-D0LXbcOGMg9Cj5O69iwUWaZS*P75PAaR/ScreenShot20180111at5.34.24PM.png" target="_self"><img src="http://api.ning.com:80/files/nZezOgr77xbEeXpJtmDPvsKfIAtjaLSu-2bbY7*MFckzAAtBK2KTG6qldo8MwHW-D0LXbcOGMg9Cj5O69iwUWaZS*P75PAaR/ScreenShot20180111at5.34.24PM.png?width=750" width="750" class="align-full"/></a></span></p>
<p><span>12. Make the Report External and Share. Click the person + icon in the top right of your screen. Select "Anyone with the link can view". Copy the external URL and click done. Now take that external URL and send it to all your friends and family with the subject "Prepare to be amazed".</span></p>
<p></p>
<p><span>13. Optional Embedding: Google Data Studio has now released the feature to allow for embedding the reports or dashboards into a web page. </span></p>
<p></p>
<p>And there you have it, your dashboard is created and you can share away!</p>
<p></p>
<p><strong>Some Criticisms</strong></p>
<p>As I'm sure was obvious from above, I'm impressed with their offering. But I do feel it is my duty to outline some oddities I came across. For example, when you set up your data source, you need to specify ahead of time, for each column, what type of summary you plan to do with that value. If you want a chart to display averages, you cannot select this dynamically within the chart; it has to be set at the data source. I find this odd and limiting. Additionally, the CSV import has a 200-column limit and there are some formatting annoyances.</p>
<p></p>
<p><strong>Final Note</strong></p>
<p>I'm happy that I tried out Google Data Studio. While it does not meet my current needs at the enterprise level, I am very impressed at its applicability and accessibility for the personal user. I truly believe that anyone could make a dashboard with this tool. So give it a try, impress your colleagues and mobilize your analysis with Google Data Studio!</p>
<p>Written by Laura Ellis</p>
<p></p>
<p>Original Blog Post <a href="https://www.littlemissdata.com/blog/dashboards-for-everyone" target="_blank" rel="noopener">Here</a></p>
<p></p>
</div>
</div><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/66yHM9TxtTo" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:380100http://api.ning.com:80/files/nZezOgr77xYoJEVQo1afApWG0vwooxwerFy*Jorb*DbPcDEOpUHNa*6UyDnK*OlVmX6l7Zp7Eq1CFcU8DWyx3qeuQj5DfOQB/DashboardPreview.pngBeautiful Number Theory Problem and Sandbox for Data Scientiststag:www.analyticbridge.datasciencecentral.com,2018-01-11:2004291:BlogPost:3801732018-01-11T01:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>The Waring conjecture - actually a problem associated with a number of conjectures, many now being solved - is one of the most fascinating mathematical problems. This article covers new aspects of this problem, with a generalization and new conjectures, some with a tentative solution, and a new framework to tackle the problem. Yet it is written in simple English and accessible to the layman.</p>
<p>I also review a number of famous related mathematical conjectures, including one with a $1 million award still waiting for a solution, as well as Goldbach's conjecture, yet unproved as of today. Many curious properties of the Floor function are also listed, and the emphasis is on machine learning and efficient computer-intensive algorithms to try to find surprising results, which then need to be formally proved or disproved.</p>
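<p>The kind of computer-intensive search mentioned above can be illustrated on Goldbach's conjecture (every even n &gt; 2 is the sum of two primes). The sketch below checks instances; of course, no amount of checking proves the conjecture.</p>

```python
def is_prime(n):
    """Trial division - fine for the small n explored here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def goldbach_pair(n):
    """Return the first primes (p, q) with p + q == n, or None."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None
```

<p>For example, goldbach_pair(28) finds (5, 23). Running the search over a range of even numbers and looking for surprising patterns - which must then be formally proved or disproved - is exactly the workflow the article advocates.</p>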
<p><a href="https://api.ning.com/files/0AXP65IrkpOhipn3sZEsaTEUi79zp1sGpmNTi*Q7XvgI25N4WWm1GKxS5rYKnCzC2aUFIYQApjV4Du7SLACqp4aY5H7vh74u/Capture.PNG" target="_self"><img src="https://api.ning.com/files/0AXP65IrkpOhipn3sZEsaTEUi79zp1sGpmNTi*Q7XvgI25N4WWm1GKxS5rYKnCzC2aUFIYQApjV4Du7SLACqp4aY5H7vh74u/Capture.PNG" class="align-center"/></a></p>
<p><strong>Content of this article</strong>:</p>
<p>1. General Framework</p>
<ul>
<li>Spectacular Result</li>
<li>New Generalization of Goldbach's Conjecture</li>
<li>New Generalization of Fermat's Conjecture</li>
</ul>
<p>2. Generalized Waring Problem</p>
<ul>
<li>Definitions</li>
<li>Main Results</li>
<li>Open Problems</li>
<li>Fun Facts (Actually, Conjectures!)</li>
</ul>
<p>3. Algorithms and Source Code</p>
<ul>
<li>Case n = 2: Sums of Two Terms</li>
<li>Case n = 4: Sums of Four Terms</li>
</ul>
<p>4. Related Conjectures and Solved Problems</p>
<ul>
<li>The One Million Dollar Conjecture</li>
</ul>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/number-theory-nice-generalization-of-the-waring-conjecture" target="_blank" rel="noopener">Read the full article here</a>.</p><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/6PZ4cUnpG48" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:3801736 Predictions about Data Science, Machine Learning, and AI for 2018tag:www.analyticbridge.datasciencecentral.com,2017-12-14:2004291:BlogPost:3772602017-12-14T22:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><em> Here are our 6 predictions for data science, machine learning, and AI for 2018. Some are fast track and potentially disruptive, some take the hype off over blown claims and set realistic expectations for the coming year.</em></p>
<p>It’s that time of year again when we do a look back in order to offer a look forward. What trends will speed up, what things will actually happen, and what things won’t in the coming year for data science, machine learning, and AI.</p>
<p>We’ve been watching and reporting on these trends all year, and we scoured the web and some of our professional contacts to find out what others are thinking. There are only a handful of trends and technologies that look set to disrupt or speed ahead. These are probably the most interesting in any forecast. But it is also valuable to discuss trends we think are a tad overblown and won’t accelerate as fast as some believe. So with a little of both, here’s what we concluded.</p>
<p><strong>Prediction 2:</strong> Data Science continues to develop specialties that mean the mythical ‘full stack’ data scientist will disappear.</p>
<p><strong>Prediction 3:</strong> Non-Data Scientists will perform a greater volume of fairly sophisticated analytics than data scientists.</p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/6-predictions-about-data-science-machine-learning-and-ai-for-2018" target="_blank" rel="noopener">Click here to read full article</a>. </p>
<p><a href="https://api.ning.com/files/ycVWu*1f9fx9eov3gOA2XJfU3BsbD2TgEDWI9UA6GPkxz8DzGqauZqp4a5l0r1aUTZLBTZCn9Q0e*VGefJycye*kUCvMaFwh/bender.png?width=300" target="_self"><img src="https://api.ning.com/files/ycVWu*1f9fx9eov3gOA2XJfU3BsbD2TgEDWI9UA6GPkxz8DzGqauZqp4a5l0r1aUTZLBTZCn9Q0e*VGefJycye*kUCvMaFwh/bender.png?width=300" class="align-center"/></a><b>DSC Resources</b></p>
<ul>
<li>Services:<span> </span><a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a><span> </span>|<span> </span><a href="http://classifieds.datasciencecentral.com">Classifieds</a><span> </span>|<span> </span><a href="http://www.analytictalent.com">Find a Job</a></li>
<li>Contributors:<span> </span><a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us:<span> </span><a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a><span> </span>|<span> </span><a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/MG_D0nRdmco" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:377260High Precision Computing in Python or Rtag:www.analyticbridge.datasciencecentral.com,2017-11-14:2004291:BlogPost:3739902017-11-14T02:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Here we discuss an application of HPC (not high performance computing, but high precision computing, which is a special case of HPC) applied to dynamical systems such as the logistic map in chaos theory, defined as X(k) = 4 X(k-1) (1 - X(k-1)). </p>
<p>For all these systems, the loss of precision propagates exponentially, to the point that after 50 iterations, all generated values are completely wrong. Tons of articles have been written on this subject - none of them acknowledging the faulty numbers being used, as round-off errors propagate as fast as the chaos itself. This is an active research area with applications in population dynamics, physics, and engineering. It does not invalidate the published results, as most of them are theoretical in nature and do not impact the limiting distribution: the faulty sequences behave as instances of processes that are re-seeded every 40 iterations or so due to errors, and these behave the same way regardless of the seed. </p>
<p>The core of the discussion here is about how to write code that produces far more accurate numbers, whether in R, Python or other languages, using super precision. In short, which libraries should you use to handle such problems?</p>
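<p>As one possible approach (a sketch, using only Python's standard-library decimal module rather than any specific library discussed in the full article): iterate the logistic map in ordinary 64-bit floats and in 100-digit decimal arithmetic side by side, and watch the two trajectories part ways as round-off errors compound.</p>

```python
from decimal import Decimal, getcontext

# 100 significant decimal digits for the high-precision trajectory.
getcontext().prec = 100

def trajectory(x0, n, high_precision=False):
    """Iterate x_k = 4 x_{k-1} (1 - x_{k-1}) for n steps."""
    if high_precision:
        x, four, one = Decimal(x0), Decimal(4), Decimal(1)
    else:
        x, four, one = float(x0), 4.0, 1.0
    out = []
    for _ in range(n):
        x = four * x * (one - x)
        out.append(float(x))      # store a rounded copy for comparison
    return out

lo = trajectory("0.1", 60)                        # 53-bit floats
hi = trajectory("0.1", 60, high_precision=True)   # 100 decimal digits
divergence = max(abs(a - b) for a, b in zip(lo[50:], hi[50:]))
```

<p>The two sequences agree at first, but by around iteration 50 the float trajectory bears no relation to the high-precision one, which is the failure mode the article describes.</p>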
<p>You can check out the context, Perl code, Python code, and an Excel spreadsheet that illustrates the issue, in this discussion. </p>
<p><a href="https://www.datasciencecentral.com/forum/topics/question-how-precision-computing-in-python" target="_blank">Click here to read the full article</a>. </p>
<p></p>
<p><a href="http://api.ning.com:80/files/qUnr-syCGlU7aRNPCmJnSXXDfprErYzcP8ruveC1oDWDr*l8WweWVPB-Hwojnphq2Nn6lDp3LFOJ6jCYnwATNNA3-Wd7RkIg/fract.PNG" target="_self"><img src="http://api.ning.com:80/files/qUnr-syCGlU7aRNPCmJnSXXDfprErYzcP8ruveC1oDWDr*l8WweWVPB-Hwojnphq2Nn6lDp3LFOJ6jCYnwATNNA3-Wd7RkIg/fract.PNG" width="350" class="align-center"/></a></p>
<p style="text-align: center;"><em>This broccoli is an example of the self-replicating processes that could benefit from HPC</em></p>
<p style="text-align: left;"><b>DSC Resources</b></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p style="text-align: left;">Popular Articles</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/mG_T3fV0OxA" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:373990Linear Models Don’t have to Fit Exactly for P-Values To Be Accurate, Right, and Usefultag:www.analyticbridge.datasciencecentral.com,2017-11-03:2004291:BlogPost:3740452017-11-03T05:30:00.000ZChirag Shivalkerhttps://www.analyticbridge.datasciencecentral.com/profile/ChiragShivalker
<p>There is no need to get confused between multiple linear regression, the generalized linear model, and the general linear model. The general linear model, or multivariate regression model, is a statistical linear model written as <strong>Y = XB + U</strong>.</p>
<p><br/> <img src="http://api.ning.com:80/files/IQ0ncUB6iv46gmInhSMUmqksefRP62Vf8KPE3lkuGhZiemAfzakEd9AmEseSNzn-xnrbA0vdegJVmimvfisz2q3pZqCIboFh/linearmodelsdonthavetofitexactlyforpvaluestobeaccuraterightanduseful.jpg?width=750" width="750"/></p>
<p><br/> Usually, the linear model encompasses a number of different statistical models, such as ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test and the F-test. The GLM generalizes multiple linear regression to the case of more than one dependent variable. If Y, B, and U are column vectors, the matrix equation above reduces to a multiple linear regression.<br/> <br/> <span class="font-size-4">Which are the key assumptions made in a multiple linear regression analysis?</span><br/> <br/> The independent variables and the outcome variable should have a linear relationship; scatterplots can be used to check whether the relationship is linear or curvilinear.<br/></p>
<ul>
<li><strong>Multivariate Normality:</strong> Residuals are normally distributed, as is assumed in multiple regression.</li>
<li><strong>No Multicollinearity:</strong> Independent variables are not correlated with one another, as is assumed in multiple regression. The Variance Inflation Factor (VIF) is used to test this assumption.</li>
<li><strong>Homoscedasticity:</strong> Error terms have similar variance across the values of the independent variables. A plot of predicted values versus standardized residuals shows whether the points are equally distributed across all values of the independent variables.</li>
</ul>
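<p>The matrix equation Y = XB + U above can be sketched numerically. A minimal NumPy example on simulated data (the coefficients and noise level are made up for illustration): estimate B by ordinary least squares and recover the residuals U.</p>

```python
import numpy as np

# Simulate Y = XB + U with a known B, then estimate B from the data.
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n),              # intercept column
                     rng.normal(size=n),      # predictor 1
                     rng.normal(size=n)])     # predictor 2
B_true = np.array([1.0, 2.0, -3.0])
Y = X @ B_true + rng.normal(scale=0.1, size=n)   # U ~ N(0, 0.1^2)

# Ordinary least squares estimate of B, and the estimated residuals.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
U_hat = Y - X @ B_hat
```

<p>The assumption checks discussed in this post (normality, homoscedasticity) are checks on U_hat, not on Y or X themselves.</p>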
<p><br/> The best <a href="http://www.hitechbpo.com/market-research-and-data-analytics.php" target="_blank">data analytics solutions</a> automatically include assumption tests and plots when conducting a regression. Multiple regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level. A rule of thumb for sample size is that regression analysis requires at least 20 cases per independent variable.<br/> <br/> <span class="font-size-4">Assumptions in your regression or ANOVA model</span><br/> <br/> We all know how important these assumptions are: if they are not met adequately, the p-values become inaccurate and useless. Yet linear models don’t have to fit exactly for p-values to be accurate, right, and useful; they are robust to moderate departures from these assumptions. Statistics classes and other training materials have long taught both of these seemingly contradictory statements, enough to drive analysts crazy.<br/> <br/> <em>It is debatable whether statisticians cooked this stuff up to torture researchers (pun intended), or to satisfy their egos.</em><br/> <br/> Well, the reality is that they don’t. Learning how far the assumptions can be pushed is not that hard a task, especially with professional training and guidance, backed by some practice. <em>Listed below are a few of the mistakes researchers make because of one, or both, of the above claims.</em><br/> <br/> <strong><span class="font-size-3">1. Treating the p-value as a feel-good factor</span></strong><br/> <br/> One way out is to avoid over-testing assumptions. Statistical tests can help determine whether the assumptions are met adequately, and obtaining a p-value feels reassuring: it lets you apply the golden rule of p&lt;.05 and move on. But in no case should these tests ignore robustness. 
Assuming that every distribution is non-normal and heteroskedastic; would be a mistake. Tools may prove helpful, but are developed to treat every data set as if it is a nail. The right thing to do is use the hammer when really required, and not hammer everything.<br/> <br/> <strong><span class="font-size-3">2. GLM is robust but not for all the assumptions</span></strong><br/> <br/> Here, assumptions are made that everything is robust and tests are avoided. It is a normal practice which succeeds most of the times. But there are instances, when it does not work. Yes, the robust GLM, for deviations from some of the assumptions, but are not robust all the way and not for all the assumptions. So check all of them without fail.<br/> <br/> <strong><span class="font-size-3">3. Test wrong assumptions</span></strong><br/> <br/> Testing wrong assumptions also is one of the mistakes that researchers do. They look at any two regression books and it will give them different sets of assumptions.<br/> <br/> <span class="font-size-4">Testing the related, but wrong thing</span><br/> <br/> Insights here might seem partial as several of these “assumptions” should be checked, but they are not model assumptions; instead are data challenges. At times the assumptions are taken to their logical conclusions, which adds up to it looking to be partial. The reference guide is trying to make it more logical for you, but in that attempt, it leads you to test the related but the wrong thing. It might work out most of the times, but not always.</p><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/J1d7cbIi7l0" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:374045Information Retrieval Document Search Engine in Rtag:www.analyticbridge.datasciencecentral.com,2017-11-07:2004291:BlogPost:3739452017-11-07T13:30:00.000Zsuresh kumar Gorakalahttps://www.analyticbridge.datasciencecentral.com/profile/sureshkumarGorakala
<h3>Introduction:</h3>
<p><span>In this post, we learn how to build a basic search engine, or document retrieval system, using the vector space model. This use case is widely encountered in information retrieval systems: given a set of documents and a search query, we need to retrieve the documents that are most similar to the query. </span></p>
<h3>Problem statement:</h3>
<p><span>The problem statement explained above is represented in the image below. </span></p>
<div style="text-align: center;"><a href="http://3.bp.blogspot.com/-w96aVPMy198/WfgQTqkmI7I/AAAAAAAAGAQ/zd3iZfVh0_E5Yy2d_1WSs5N8RB8rPJRbwCK4BGAYYCw/s1600/information%2Bretrieval_1.PNG" target="_blank"><img src="https://3.bp.blogspot.com/-w96aVPMy198/WfgQTqkmI7I/AAAAAAAAGAQ/zd3iZfVh0_E5Yy2d_1WSs5N8RB8rPJRbwCK4BGAYYCw/s320/information%2Bretrieval_1.PNG?width=320" width="320" class="align-center"/></a><em>Document retrieval system</em></div>
<p></p>
<p>The high-level design of the document search system is shown below:</p>
<div class="slate-resizable-image-embed slate-image-embed__resize-middle"><a href="https://media.licdn.com/mpr/mpr/AAIA_wDGAAAAAQAAAAAAAAt5AAAAJDViMmE0Y2NhLTg4ZmYtNDNiOC1hYWQxLTM2ZTkwOWM0NGU0Mg.png" target="_blank"><img src="https://media.licdn.com/mpr/mpr/AAIA_wDGAAAAAQAAAAAAAAt5AAAAJDViMmE0Y2NhLTg4ZmYtNDNiOC1hYWQxLTM2ZTkwOWM0NGU0Mg.png" class="align-center"/></a></div>
<p></p>
<p>The content of the post is as follows:</p>
<ul>
<li>Explanation of various techniques used in information retrieval, such as vector space models, the term-document matrix, and similarity score calculation</li>
<li>Data description </li>
<li>High level design of the document search system</li>
<li>Code implementation in R</li>
</ul>
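<p>The full R implementation is at the link below. As a language-neutral illustration of the same pipeline (term-document matrix, TF-IDF weighting, cosine similarity ranking), here is a minimal Python sketch; the tokenizer, the exact TF-IDF variant, and the sample documents are illustrative choices, not taken from the original post:</p>

```python
import math
from collections import Counter

def build_index(docs):
    """Term-document matrix with TF-IDF weights (vector space model)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    idf = {t: math.log(n / df[t]) + 1 for t in vocab}  # +1 keeps ubiquitous terms
    counts = [Counter(toks) for toks in tokenized]
    vecs = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, idf, vecs

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, docs):
    """Return document indices ranked by similarity to the query."""
    vocab, idf, vecs = build_index(docs)
    q = Counter(query.lower().split())
    qvec = [q[t] * idf[t] for t in vocab]
    return sorted(range(len(docs)),
                  key=lambda i: cosine(qvec, vecs[i]), reverse=True)

docs = ["the cat sat on the mat",
        "dogs chase cats",
        "stock markets fell today"]
print(search("cat on a mat", docs))  # the first document ranks highest
```

<p>A production system would add stemming, stop-word removal, and sparse matrix representations; the linked post walks through the full R implementation.</p>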
<p>Please go through the complete blog at the location below:</p>
<p><a href="http://www.dataperspective.info/2017/11/information-retrieval-document-search-using-vector-space-model-in-r.html">http://www.dataperspective.info/2017/11/information-retrieval-document-search-using-vector-space-model-in-r.html</a></p><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/NFxcfRuS_sg" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:373945https://media.licdn.com/mpr/mpr/AAIA_wDGAAAAAQAAAAAAAAt5AAAAJDViMmE0Y2NhLTg4ZmYtNDNiOC1hYWQxLTM2ZTkwOWM0NGU0Mg.pngFascinating Time Series with Cool Applicationstag:www.analyticbridge.datasciencecentral.com,2017-11-07:2004291:BlogPost:3741632017-11-07T03:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Here we describe well-known chaotic sequences, including new generalizations, with applications to random number generation, highly non-linear auto-regressive models for time series, simulation, random permutations, and the use of big numbers (libraries available in programming languages to work with numbers with hundreds of decimals), as standard computer precision almost always produces completely erroneous results after a few iterations -- a fact rarely if ever mentioned in the scientific literature, but illustrated here, together with a solution. It is possible that all scientists who published on chaotic processes used faulty numbers because of this issue.</p>
<p>This article is accessible to non-experts, even though we solve a special stochastic equation for the first time, providing an unexpected exact solution, for a new chaotic process that generalizes the logistic map. We also describe a general framework for continuous random number generators, and investigate the interesting auto-correlation structure associated with some of these sequences. References are provided, as well as fast source code to process big numbers accurately, and even an elegant mathematical proof in the last section.</p>
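<p>The precision issue mentioned above is easy to reproduce. The sketch below (an illustration, not the article's source code) iterates the classic logistic map x → 4x(1 - x), a well-known chaotic sequence, both in standard double precision and with Python's arbitrary-precision <code>decimal</code> module:</p>

```python
from decimal import Decimal, getcontext

# In a chaotic map, rounding errors grow roughly geometrically, so
# double precision (about 16 significant digits) is exhausted after
# a few dozen iterations.
getcontext().prec = 100  # work with ~100 significant digits

x_float = 0.1
x_big = Decimal("0.1")
for _ in range(60):
    x_float = 4 * x_float * (1 - x_float)
    x_big = 4 * x_big * (1 - x_big)

print(x_float)       # double-precision trajectory
print(float(x_big))  # high-precision trajectory, rounded for display
```

<p>With 100 digits of precision the <code>Decimal</code> trajectory is still accurate after 60 steps (its own rounding error has only grown to roughly 10^-82), while the double-precision value has completely diverged from the true trajectory; this is the phenomenon the article illustrates and works around with big-number libraries.</p>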
<p><a href="https://api.ning.com/files/q03Iqwj-1ZoSi1l9qr9Uib38qslBZxgjww5CHJGfTiguEkEspF0ev5*PHK8F6jozOMo4*a0xybbVPIq6sfb5y2ywOHOAjRsP/Capture.PNG" target="_self"><img src="https://api.ning.com/files/q03Iqwj-1ZoSi1l9qr9Uib38qslBZxgjww5CHJGfTiguEkEspF0ev5*PHK8F6jozOMo4*a0xybbVPIq6sfb5y2ywOHOAjRsP/Capture.PNG" class="align-center"/></a>This article is also a useful read for participants in our <a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/interesting-probability-problem-for-serious-geeks" target="_blank">upcoming competition</a> (to be announced soon) as it addresses a similar stochastic integral equation problem, also with exact solution, in the related context of self-correcting random walks - another kind of memory-less process. </p>
<p>The approach used here starts with traditional data science and simulations for exploratory analysis, with empirical results confirmed later by mathematical arguments in the last section. </p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/amazing-random-sequences-with-cool-applications" target="_blank">Read the full article here</a>. </p><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/ND5Bj6jOOlA" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:374163Interesting Problem: Self-correcting Random Walkstag:www.analyticbridge.datasciencecentral.com,2017-10-04:2004291:BlogPost:3721092017-10-04T20:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><em>Section 3 was added on October 11. Section 4 was added on October 19. A $2,000 award is offered to solve any of the open questions, <a href="https://www.datasciencecentral.com/profiles/blogs/dsc-competition-for-data-scientists-and-quantitative-experts-quan" target="_blank" rel="noopener">click here for details</a>. </em></p>
<p>This is another off-the-beaten-path problem, one that you won't find in textbooks. You can solve it using data science methods (my approach) but a mathematician with some spare time could find an elegant solution. Share it with your colleagues to see how math-savvy they are, or with your students. I was able to make substantial progress in 1-2 hours of work using Excel alone, though I haven't found a final solution yet (maybe you will). My Excel spreadsheet with all computations is accessible from this article. You don't need a deep statistical background to quickly discover some fun and interesting results playing with this stuff. Computer scientists, software engineers, quants, BI and analytics professionals, from beginners to veterans, will also be able to enjoy it!</p>
<p><a href="https://i.imgur.com/nvHjav6.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/nvHjav6.png?width=391" width="391" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="http://demonstrations.wolfram.com/ConstrainedRandomWalk/" target="_blank" rel="noopener">2-D constrained random walk</a> (snapshot - video available <a href="https://www.youtube.com/watch?v=W9jktqV3_Mc" target="_blank" rel="noopener">here</a>)</em></p>
<p><span class="font-size-4"><strong>1. The problem</strong></span></p>
<p>We are dealing with a stochastic process barely more complicated than a random walk. Random walks are also called <em>drunken walks</em>, as they represent the path of a drunken guy moving left and right seemingly randomly, and getting lost over time. Here the process is called <em>self-correcting random walk</em> or also <em>reflective random walk</em>, and is related to <em><a href="https://arxiv.org/abs/1303.3655" target="_blank" rel="noopener">controlled random walks</a></em>, and <em><a href="http://demonstrations.wolfram.com/ConstrainedRandomWalk/" target="_blank" rel="noopener">constrained random walks</a></em> (see also <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2648134" target="_blank" rel="noopener">here</a>)<em> </em>in the sense that the walker, less drunk than in a random walk, is able to correct any departure from a straight path, more and more over time, by either slightly over- or under-correcting at each step. One of the two model parameters (the positive parameter <em>a</em>) represents how drunk the walker is, with <em>a</em> = 0 being the worst. Unless <em>a</em> = 0, the amplitude of the corrections decreases over time to the point that eventually (after many steps) the walker walks almost straight and arrives at his destination. This model represents many physical processes, for instance the behavior of a stock market somewhat controlled by a government to avoid bubbles and implosions, or when it hits a symbolic threshold and has a hard time breaking through. It is defined as follows:</p>
<p>Let's start with <em>X</em>(1) = 0, and define <em>X</em>(<em>k</em>) recursively as follows, for <em>k</em> > 1:</p>
<p><a href="https://i.imgur.com/ozpssO4.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/ozpssO4.png?width=327" width="327" class="align-center"/></a></p>
<p><a href="https://i.imgur.com/0UkIUnK.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/0UkIUnK.png?width=324" width="324" class="align-center"/></a></p>
<p>and let's define <em>U</em>(<em>k</em>), <em>Z</em>(<em>k</em>), and <em>Z</em> as follows:</p>
<p><a href="https://i.imgur.com/ZtV2AJl.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/ZtV2AJl.png?width=120" width="120" class="align-center"/></a></p>
<p><a href="https://i.imgur.com/H1iOwc3.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/H1iOwc3.png?width=116" width="116" class="align-center"/></a></p>
<p><a href="https://i.imgur.com/jT3Y5xF.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/jT3Y5xF.png?width=113" width="113" class="align-center"/></a></p>
<p>where the <em>V</em>(<em>k</em>)'s are deviates from <em>independent</em> uniform variables on [0, 1], obtained for instance using the function RAND in Excel. So there are two <em>positive</em> parameters in this problem, <em>a</em> and <em>b</em>, and <em>U</em>(<em>k</em>) is always between 0 and 1. When <em>b</em> = 1, the <em>U</em>(<em>k</em>)'s are just standard uniform deviates, and if <em>b</em> = 0, then <em>U</em>(<em>k</em>) = 1. The case <em>a</em> = <em>b</em> = 0 is degenerate and should be ignored. The case <em>a</em> > 0 and <em>b</em> = 0 is of special interest, and it is a number theory problem in itself, <a href="http://www.datasciencecentral.com/profiles/blogs/new-representation-of-numbers-with-very-fast-converging-fractions" target="_blank" rel="noopener">related to this problem</a> when <em>a</em> = 1. Also, just like in random walks or Markov chains, the <em>X</em>(<em>k</em>)'s are not independent; they are indeed highly auto-correlated.</p>
<p>Prove that if <em>a</em> < 1, then <em>X</em>(<em>k</em>) converges to 0 as <em>k</em> increases. Under the same condition, prove that the limiting distribution <em>Z</em></p>
<ul>
<li>always exists, (Note: if <em>a</em> > 1, <em>X</em>(<em>k</em>) may not converge to zero, causing a drift and asymmetry)</li>
<li>always takes values between -1 and +1, with min(<em>Z</em>) = -1 and max(<em>Z</em>) = +1,</li>
<li>is symmetric, with mean and median equal to 0</li>
<li>and does not depend on <em>a,</em> but only on <em>b.</em></li>
</ul>
<p>For instance, for <em>b</em> =1, even <em>a</em> = 0 yields the same triangular distribution for <em>Z</em>, as any <em>a</em> > 0.</p>
<p>If <em>a</em> < 1 and <em>b</em> = 0, (the non-stochastic case) prove that </p>
<ul>
<li><em>Z</em> can only take 3 values: -1 with probability 0.25, +1 with probability 0.25, and 0 with probability 0.50</li>
<li>If <em>U</em>(<em>k</em>) and <em>U</em>(<em>k</em>+1) have the same sign, then <em>U</em>(<em>k</em>+2) is of opposite sign </li>
</ul>
<p>And here is a more challenging question: In general, what is the limiting distribution of <em>Z</em>? Also, what happens if you replace the <em>U</em>(<em>k</em>)'s with (say) Gaussian deviates? Or with <em>U</em>(<em>k</em>) = | sin (<em>k</em>*<em>k</em>) | which has a somewhat random behavior?</p>
<p><span class="font-size-4"><strong>2. Hints to solve this problem</strong></span></p>
<p>It is necessary to use a decent random number generator to perform simulations. Even with Excel, by plotting the empirical distribution of <em>Z</em>(<em>k</em>) for large values of <em>k</em>, and matching the kurtosis, variance and empirical percentiles with those of known statistical distributions, one quickly notices that when <em>b</em> = 1 (and even if <em>a</em> = 0) the limiting distribution <em>Z</em> is well approximated by a symmetric triangular distribution on [-1, 1], and thus centered on 0, with a kurtosis of -3/5 and a variance of 1/6. In short, this is the distribution of the difference of two uniform random variables on [0, 1]. In other words, it is the distribution of <em>U</em>(3) - <em>U</em>(2). Of course, this needs to be proved rigorously. Note that the limiting distribution <em>Z</em> can be estimated by computing the values <em>Z</em>(<em>n</em>+1), ..., <em>Z</em>(<em>n</em>+<em>m</em>) for large values of <em>n</em> and <em>m</em>, using just one instance of this simulated stochastic process.<em> </em></p>
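<p>That empirical match is easy to reproduce in a few lines of Python (again assuming the hypothetical update X(k) = X(k-1) - sign(X(k-1)) * U(k), i.e. a = 0 and b = 1 -- a reading of the formulas above, not taken verbatim from them). Since X(k) = Z(k) when a = 0, sampling X(k) for large k from one instance of the process estimates Z directly, and the sample variance comes out near the triangular value 1/6:</p>

```python
import random
import statistics

rng = random.Random(42)
x = 0.0
burn, m = 1_000, 200_000    # discard a burn-in, then collect m samples
samples = []
for k in range(2, burn + m + 2):
    u = rng.random()            # U(k), uniform on [0, 1] since b = 1
    x -= u if x >= 0 else -u    # a = 0: no damping of the correction
    if k > burn:
        samples.append(x)

var = statistics.pvariance(samples)
print(round(var, 3))  # close to 1/6 = 0.1667 for the triangular law
```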
<p>Does it generalize to other values of <em>b</em>? That is, does <em>Z</em> always have the distribution of <em>U</em>(3) - <em>U</em>(2)? Obviously not for the case <em>b</em> = 0. But it could be a function, combination and/or mixture of <em>U</em>(3), -<em>U</em>(2), and <em>U</em>(3) - <em>U</em>(2). This works both for <em>b</em> = 0 and <em>b</em> = 1.</p>
<p><a href="https://i.imgur.com/EXQSt5r.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/EXQSt5r.png?width=516" width="516" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 1</strong>: Mixture-like distribution of Z (estimated) when b = 0.01 and a = 0.8</em></p>
<p>Interestingly, for small values of <em>b</em>, the limiting distribution <em>Z</em> looks like a mixture of (barely overlapping) simple distributions. So it could be used as a statistical model in clustering problems, each component of the mixture representing a cluster. See Figure 1.</p>
<p><a href="https://i.imgur.com/5IgW1Ue.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/5IgW1Ue.png?width=513" width="513" class="align-center"/></a></p>
<p style="text-align: center;"><em>Figure 2: Triangular distribution of Z (estimated) when b = 1 and a = 0.8</em></p>
<p>The spreadsheet with all computations and model fitting <a href="http://api.ning.com:80/files/Dwi1glrkjL4qs53cOt86PQxvltEtxn1XmbA2FwwACaxvwM1H9YnK2J7MVSKaLvRpmYxd9JAWsHFqdckrys8YvoYCf0tJLS-c/ControlledRandomWalk.xlsx" target="_self">can be downloaded here</a>. </p>
<p><span class="font-size-4"><strong>3. Deeper dive</strong></span></p>
<p>So far, my approach has been data science oriented: it looks more like guesswork. Here I switch to mathematics, to try to derive the distribution of <em>Z</em>. Since it does not depend on the parameter <em>a</em>, let us assume here that <em>a</em> = 0. <span>Note that when </span><em>a</em><span> = 0, </span><em>X</em><span>(</span><em>k</em><span>) does not converge to zero; instead </span><em>X</em><span>(</span><em>k</em><span>) = Z(</span><em>k</em><span>) and both converge in distribution to </span><em>Z</em><span>. </span>It is obvious that <em>X</em>(<em>k</em>) is a mixture of distributions, namely <em>X</em>(<em>k</em>-1) + <em>U</em>(<em>k</em>) and <em>X</em>(<em>k</em>-1) - <em>U</em>(<em>k</em>). Since <em>X</em>(<em>k</em>-1) is in turn a mixture, <em>X</em>(<em>k</em>) is actually a mixture of mixtures, and so on. In short, it has the distribution of some nested mixtures.</p>
<p>As a starting point, it would be interesting to study the variance of <em>Z</em> (the expectation of <em>Z</em> is equal to 0). The following formula is incredibly accurate for any value of <em>b</em> between 0 and 1, and even beyond. It is probably an exact formula, not an approximation. It was derived using the tentative density function for <em>Z</em> obtained at the bottom of this section:</p>
<p><a href="https://i.imgur.com/LMSP0Py.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/LMSP0Py.png?width=174" width="174" class="align-center"/></a></p>
<p>It is possible to obtain a functional equation for the distribution <em>P</em>(<em>Z</em> < <em>z</em>), using the equations that define <em>X</em>(<em>k</em>) in section 1, with <em>a</em> = 0, and letting <em>k</em> tend to infinity. It starts with </p>
<p><a href="https://i.imgur.com/W9rNj3W.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/W9rNj3W.png?width=460" width="460" class="align-center"/></a></p>
<p>Let's introduce <em>U</em> as a random variable with the same distribution as <em>U</em>(<em>k</em>) or <em>U</em>(2). As <em>k</em> tends to infinity, and separating the two cases <em>x</em> negative and <em>x</em> positive, we get</p>
<p><a href="https://i.imgur.com/JfMsTMQ.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/JfMsTMQ.png?width=519" width="519" class="align-center"/></a></p>
<p>Taking advantage of symmetries, this can be further simplified to </p>
<p><a href="https://i.imgur.com/fKjVw3Z.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/fKjVw3Z.png?width=394" width="394" class="align-center"/></a></p>
<p>where <em>F</em> represents the distribution function, <em>f</em> represents the density function, and <em>U</em> has the same distribution as <em>U</em>(2), that is</p>
<p><a href="https://i.imgur.com/0VQLhmv.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/0VQLhmv.png?width=216" width="216" class="align-center"/></a></p>
<p>Taking the derivative with respect to <em>z</em>, the functional equation becomes the following <a href="https://en.wikipedia.org/wiki/Fredholm_integral_equation" target="_blank" rel="noopener">Fredholm integral equation</a>, the unknown being <em>Z</em>'s density function:</p>
<p><a href="https://i.imgur.com/Js26YkK.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/Js26YkK.png?width=352" width="352" class="align-center"/></a></p>
<p>We have the following particular cases:</p>
<ul>
<li>When <em>b</em> tends to zero, the distribution of <em>Z</em> converges to a uniform law on [-1, 1] thus with a variance equal to 1/3. </li>
<li>When <em>b</em> = 1/2, <em>Z</em> has a parabolic distribution on [-1, +1], defined by P(<em>Z</em> < <em>z</em>) = (2 + 3<em>z</em> - <em>z</em>^3)/4. This needs to be proved, for instance by plugging this parabolic distribution in the functional equation, and checking that the functional equation is verified if <em>b</em> = 1/2. However, a constructive proof would be far more interesting.</li>
<li>When <em>b</em> = 1, <em>Z</em> has the triangular distribution discussed earlier. The density function for <em>Z</em>, defined as the derivative of P(<em>Z</em> < <em>z</em>) with respect to <em>z</em>, is equal to 1 - |<em>z</em>| when <em>b</em> = 1, and 3 (1 - <em>z</em>^2) / 4 when <em>b</em> = 1/2.</li>
</ul>
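<p>The <em>b</em> = 1/2 case above is easy to sanity-check numerically (a consistency check, not the constructive proof asked for). The density 3(1 - z^2)/4 should be the derivative of P(Z < z) = (2 + 3z - z^3)/4 and integrate to 1 on [-1, 1], and a direct computation gives its variance as 1/5:</p>

```python
def cdf(z):
    """P(Z < z) = (2 + 3z - z^3)/4 for b = 1/2, on [-1, 1]."""
    return (2 + 3 * z - z ** 3) / 4

def pdf(z):
    """Claimed density 3(1 - z^2)/4."""
    return 3 * (1 - z ** 2) / 4

# Endpoints of the CDF: F(-1) = 0 and F(1) = 1.
print(cdf(-1.0), cdf(1.0))

# The density matches the CDF's derivative (central finite differences).
h = 1e-6
max_err = max(abs((cdf(z + h) - cdf(z - h)) / (2 * h) - pdf(z))
              for z in [i / 10 for i in range(-9, 10)])
print(max_err < 1e-6)

# It integrates to 1 and has variance 1/5 (trapezoidal rule; the
# integrand vanishes at both endpoints, so a plain sum suffices).
n = 100_000
zs = [-1 + 2 * i / n for i in range(n + 1)]
dz = 2 / n
total = sum(pdf(z) for z in zs) * dz
var = sum(z * z * pdf(z) for z in zs) * dz
print(round(total, 4), round(var, 4))
```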
<p>So for <em>b</em> = 1, <em>b</em> = 1/2, or the limiting case <em>b</em> = 0, we have the following density for <em>Z</em>, defined on [-1, 1]:</p>
<p><a href="https://i.imgur.com/dTxllJB.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/dTxllJB.png?width=211" width="211" class="align-center"/></a></p>
<p>Is this formula valid for any <em>b</em> between 0 and 1? This is still an open question. The functional equation applies regardless of <em>U</em>'s distribution though, even if exponential or Gaussian. The complexity in the cases discussed here arises from the fact that <em>U</em>'s density is not smooth enough, due to its bounded support domain [0, 1] (outside the support domain, the density is equal to 0). A potentially more generic version of the previous formula would be:</p>
<p><a href="https://i.imgur.com/8AoRsHl.png" target="_blank" rel="noopener"><img src="https://i.imgur.com/8AoRsHl.png?width=345" width="345" class="align-center"/></a></p>
<p>where <em>E</em> denotes the expectation. However, I haven't checked whether and under which conditions this formula is correct or not, except for the particular cases of <em>U</em> discussed here. One of the requirements is that the support domain for <em>U</em> is [0, 1]. If this formula is not exact in general, it might still be a good approximation in some cases. </p>
<p><span class="font-size-4"><strong>4. Potential Areas of Research</strong></span></p>
<p>Here are a few interesting topics for research:</p>
<ul>
<li>Develop a 2-D or 3-D version of this process, investigate potential applications in thermodynamics or statistical mechanics, for instance modeling movements of gas molecules in a cube as the temperature goes down (<em>a</em> > 0) or is constant (<em>a</em> = 0), and comparison with other stochastic processes used in similar contexts.</li>
<li>Continuous version of the discrete reflective random walk investigated here, with <em>a</em> = 0, and increments <em>X</em>(<em>k</em>) - <em>X</em>(<em>k</em>-1) being infinitesimally small, following a Gaussian rather than uniform distribution. The limiting un-constrained case is known as a <em><a href="https://en.wikipedia.org/wiki/Wiener_process" target="_blank" rel="noopener">Wiener process</a></em> or <em>Brownian motion.</em> What happens if this process is also constrained to lie between -1 and +1 on the Y-axis? This would define a reflected Wiener process, <a href="https://en.wikipedia.org/wiki/Reflected_Brownian_motion" target="_blank" rel="noopener">see also here</a> for a similar process, and also <a href="https://en.wikipedia.org/wiki/Reflection_principle_(Wiener_process)" target="_blank" rel="noopener">here</a>.</li>
<li>Another direction is to consider the one-dimensional process as time series (which economists do) and to study the multivariate case, with multiple cross-correlated time series.</li>
<li>For the data scientist, it would be worth checking whether and when, based on cross-validation, my process provides better model fitting, leading to more accurate predictions and thus better stock trading ROI (than say a random walk, after removing any trend or drift) when applied to real stock market data publicly available.</li>
</ul>
<p>This is the kind of mathematics used by Wall Street quants and in operations research. Hopefully my presentation here is much less arcane than the traditional literature on the subject, and accessible to a much broader audience, even though it features the complex equations characterizing such a process (and even hints at a mathematical proof that is not as difficult as it might seem at first glance, supported by simulations). Note that my reflective random walk is not a true random walk in the classical sense of the term: a different name might be more appropriate. </p>
<p><span class="font-size-4"><strong>5. Solution for the (non-stochastic) case <em>b</em> = 0</strong></span></p>
<p><a href="https://www.linkedin.com/in/andrei-chtcheprov-5750731/" target="_blank" rel="noopener">Andrei Chtcheprov</a> submitted the following statements, with proof:</p>
<ul>
<li>If <em>a</em> <= 1, then the sequence {<em>X</em>(<em>k</em>)} converges to zero.</li>
<li>If <em>a</em> = 3, {<em>X</em>(<em>k</em>)} converges to <a href="https://en.wikipedia.org/wiki/Ap%C3%A9ry%27s_constant" target="_blank" rel="noopener">Zeta(3)</a> - 5/4 =~ -0.048.</li>
<li>If <em>a</em> = 4, {<em>X</em>(<em>k</em>)} converges to (Pi^4 / 90) - 9/8 =~ -0.043.</li>
</ul>
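<p>These <em>b</em> = 0 limits can be reproduced numerically under the same hypothetical reading of the recursion sketched earlier, X(k) = X(k-1) - sign(X(k-1)) / k^a, with sign(0) taken as +1 (an assumed convention; b = 0 makes U(k) = 1):</p>

```python
from math import pi

def x_limit(a, n=1_000_000):
    """Iterate the ASSUMED b = 0 recursion and return X(n)."""
    x = 0.0
    for k in range(2, n + 1):
        x -= (1.0 if x >= 0 else -1.0) / k ** a
    return x

# Partial sum approximating Apery's constant Zeta(3).
zeta3 = sum(1 / k ** 3 for k in range(1, 200_000))

print(round(x_limit(3), 4), round(zeta3 - 5 / 4, 4))         # both near -0.048
print(round(x_limit(4), 4), round(pi ** 4 / 90 - 9 / 8, 4))  # both near -0.043
```

<p>Under this reading, the a = 3 walk takes one step to -1/8 and then climbs monotonically by 1/k^3 without ever crossing zero, which is exactly why the limit is the tail sum Zeta(3) - 5/4.</p>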
<p>You can read his proof <a href="http://api.ning.com:80/files/PfhqkPs7YPvl4V1MivGJKU3tNLp0vl95LdWaYCuNLSaKMo1u518lZfDwbb*rweKP2Ls7MSm353kpMU5KHJbnvqD9fsPfBKZl/proof.txt" target="_self">here</a>. Much more can also be explored regarding the case <em>b</em> = 0. For instance, when <em>a</em> = 1 and <em>b</em> = 0, the problem is similar <a href="http://www.datasciencecentral.com/profiles/blogs/new-representation-of-numbers-with-very-fast-converging-fractions" target="_blank" rel="noopener">to this one</a>, where we try to approximate the number 2 by converging sums of elementary positive fractions without ever crossing the boundary Y = 2, staying below at all times. Here, by contrast, we try to approximate 0, also by converging sums of the same elementary fractions, but allowing each term to be either positive or negative, thus crossing the boundary Y = 0 very regularly. The alternation of the signs of <em>X</em>(<em>k</em>) is a problem of interest: it shows strong patterns. </p>
<p><em>To include mathematical formulas in this article, I used <a href="http://www.hostmath.com/" target="_blank" rel="noopener">this app</a>. Those interested in winning the award by offering a theoretical solution should read <a href="https://www.datasciencecentral.com/profiles/blogs/amazing-random-sequences-with-cool-applications" target="_blank" rel="noopener">this article</a>, where I solved another stochastic integral equation of similar complexity (with mathematical proof), in a related context (chaotic systems).</em></p>
<p><em>For related articles from the same author, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">click here</a> or visit <a href="http://www.vincentgranville.com/" target="_blank" rel="noopener">www.VincentGranville.com</a>. Follow me <a href="https://www.linkedin.com/in/vincentg/" target="_blank" rel="noopener">on LinkedIn</a>.</em></p>
<p><span class="font-size-4"><b>DSC Resources</b></span></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p><b>Popular Articles</b></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/h9BSohZ8zaY" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:372109https://i.imgur.com/nvHjav6.pngBook on Computer Programmingtag:www.analyticbridge.datasciencecentral.com,2017-10-20:2004291:BlogPost:3720152017-10-20T02:00:00.000ZMark McIlroyhttps://www.analyticbridge.datasciencecentral.com/profile/MarkMcIlroy
<p>Data scientists use a range of tools in their work and some of these eventually require programming. This book, titled The Art and Craft of Computer Programming, is a guide to computer programming. It does not focus on a specific programming language, but instead contains the essential material from a first year Computer Science course. The book is available from Amazon.com.</p>
<p><a href="http://api.ning.com:80/files/tUZQu0G2R7cPuxOeM6pkCe1U1PLDfdvyPkbejs6cdUvfHuiVD5XnQWaSI8Wuz5DQxBNTES4yowNSFzVEbtwCLUNAngaCeTB7/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/tUZQu0G2R7cPuxOeM6pkCe1U1PLDfdvyPkbejs6cdUvfHuiVD5XnQWaSI8Wuz5DQxBNTES4yowNSFzVEbtwCLUNAngaCeTB7/Capture.PNG" width="554" class="align-center"/></a></p>
<p><strong>Contents</strong></p>
<p>1. Prelude 5</p>
<p>2. Program Structure 6</p>
<p>2.1. Procedural Languages 6<br/> 2.2. Declarative Languages 17<br/> 2.3. Other Languages 20</p>
<p>3. Topics from Computer Science 21</p>
<p>3.1. Execution Platforms 21<br/> 3.2. Code Execution Models 26<br/> 3.3. Data structures 29<br/> 3.4. Algorithms 44<br/> 3.5. Techniques 66<br/> 3.6. Code Models 85<br/> 3.7. Data Storage 100<br/> 3.8. Numeric Calculations 116<br/> 3.9. System Security 132<br/> 3.10. Speed & Efficiency 136</p>
<p>4. The Craft of Programming 157</p>
<p>4.1. Programming Languages 157<br/> 4.2. Development Environments 167<br/> 4.3. System Design 170<br/> 4.4. Software Component Models 180<br/> 4.5. System Interfaces 184<br/> 4.6. System Development 191<br/> 4.7. System Evolution 210<br/> 4.8. Code Design 217<br/> 4.9. Coding 233<br/> 4.10. Testing 262<br/> 4.11. Debugging 275<br/> 4.12. Documentation 285</p>
<p>5. Appendix A - Summary of operators 286</p>
<p><strong>What Kind of OLAP Do We Really Need?</strong> (JIANG Buxing, 2018-02-01)</p>
<p><b>OLAP in the narrow sense</b></p>
<p>OLAP is part and parcel of a BI application. As the name suggests, it is an acronym for online analytical processing: users, frontline employees to be precise, perform various types of data processing online.</p>
<p><b>But the concept of OLAP tends to be used in a very narrow sense</b>. It has almost become an equivalent of multidimensional analysis. Based on a prebuilt data cube, the analysis performs summarization according to specified dimensions and levels and presents the aggregate values as a table or a chart. It uses drill-down, aggregation, rotation, and slicing to change the dimensions, levels, and summarization range. The idea behind multidimensional analysis is this: broad, high-level aggregate results are too coarse to give good insight into an issue; instead, data needs to be sliced into smaller parts and drilled down to more detailed levels to achieve a more valuable analysis.</p>
<p><b>OLAP in the broad sense</b></p>
<p>Is online analytical processing all about multidimensional analysis?</p>
<p>There are data analysis scenarios in which a person with a lot of experience in a field makes predictions about the business. For example:</p>
<ul>
<li>An equity analyst predicts that stocks meeting certain conditions are most likely to rise;</li>
<li>A sales manager knows which types of sales representatives are better at dealing with difficult customers;</li>
<li>A tutor knows what the results of students with very strong subjects and very weak subjects look like.</li>
</ul>
<p>These guesses provide a basis for predictions. After operating for some time, a business system accumulates a huge amount of data, which can verify these guesses. Verified guesses can then serve as principles to guide future decisions. If a guess is proved wrong, a new guess is made.</p>
<p>It is this guess verification that OLAP should focus on. The guess-and-verify work aims to find principles or facts that support a conclusion based on historical data. An OLAP tool helps verify guesses through data manipulation.</p>
<p>Of course, <b>guesses are made by experienced people in a certain field</b>, not by the software. The online analysis is necessary because, most of the time, guesses are made on the spot based on some intermediate result. It is impossible, and unnecessary, to pre-design a complete end-to-end path, which means pre-modeling is unfeasible. The improvised nature of the work also means that IT resources are unavailable at the moment a guess needs verifying.</p>
<p>To address the issue technologically, frontline workers must be equipped with the ability to query and compute data in a flexible, interactive way. In the previously mentioned scenarios, the possible computations are as follows:</p>
<ul>
<li>For a stock that has been rising for 3 days in a month, find the probability of continuous rising on the 4<sup>th</sup> day;</li>
<li>Find the customers whose last orders were half a year ago but who placed an order after their sales representatives were changed;</li>
<li>Get the rankings of the English scores of the students whose Chinese and Math scores are both in the top 10;</li>
<li>…</li>
</ul>
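<p>The third query above can be sketched as a short script. The score data below is invented for illustration, and "top 3" stands in for "top 10" to keep the sample small:</p>

```python
# A sketch of the third query above: rank the English scores of students
# whose Chinese and Math scores are both in the top group.
# Hypothetical data; "top 3" stands in for "top 10".

scores = {
    "Ann":  {"chinese": 92, "math": 88, "english": 75},
    "Bob":  {"chinese": 85, "math": 95, "english": 90},
    "Cleo": {"chinese": 78, "math": 70, "english": 99},
    "Dan":  {"chinese": 90, "math": 91, "english": 60},
    "Eve":  {"chinese": 70, "math": 89, "english": 85},
}

def top_n(subject, n):
    """Names of the n highest scorers in a subject."""
    ranked = sorted(scores, key=lambda name: -scores[name][subject])
    return set(ranked[:n])

# Students in the top 3 for both Chinese and Math...
qualified = top_n("chinese", 3) & top_n("math", 3)

# ...ranked by their English score, best first.
ranking = sorted(qualified, key=lambda name: -scores[name]["english"])
print(ranking)  # ['Bob', 'Dan']
```

<p>Each step is a small, composable query over structured data, which is exactly the kind of on-the-spot computation the article argues frontline users need.</p>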
<p><b>Limitations of multidimensional analysis</b></p>
<p>Obviously, these computations can be performed on historical data. But is multidimensional analysis helpful here?</p>
<p>I’m afraid not!</p>
<p>Multidimensional analysis has two drawbacks. One is that the data cube must be pre-built, giving users no opportunity to reshape it on the fly and requiring a rebuild for each new analysis. The other is that the operations over a data cube are limited to drill-down, aggregation, slicing, and rotation, so it is difficult to cope with complex multi-step computations. Though the agile BI products popular in recent years that perform multidimensional analysis offer much smoother operation and far more attractive interfaces than the early OLAP products, their essential functionality remains unchanged and these limitations persist.</p>
<p>Yet multidimensional analysis has value, such as locating the exact source of a high cost. But it cannot derive a principle that is crucial for predicting and guiding a future move based on data. In this sense, online analytical processing should be more than multidimensional analysis.</p>
<p><b>What kind of OLAP do we need?</b></p>
<p>What functionality should OLAP software for verifying a speculation have?</p>
<p>As mentioned previously, verifying a speculation is a process of data query and computation. <b>It is vital that the query and computation can be defined by frontline workers without the help of IT specialists</b>. In the current application context, an OLAP platform needs to have the following two functionalities:</p>
<p>1. Associated query</p>
<p>The first thing in performing an analysis is acquiring data. Many organizations have data warehouses that non-IT employees can access and query. An important issue is that most OLAP software doesn't provide convenient associated-query functionality for frontline employees. Instead, IT specialists must first create a model to handle the associated query (similar to creating a data cube for multidimensional analysis). Usually not all real-life demands can be handled with this single model, and IT rescue is still needed. This makes online analytical processing not online any more.</p>
<p>2. Interactive computation</p>
<p>After data is collected, computation begins. The distinguishing characteristic of the speculation-verifying computation is that, instead of following a ready-made program, the next move is determined based on the result of the previous move. The process is highly interactive, similar to computing with a calculator. Furthermore, it is structured data in batches, not mere numbers, that needs to be processed. The OLAP tool thus becomes a <b>data calculator</b>. Excel is interactive to a degree, which makes it the most popular desktop BI tool. But Excel doesn't give sufficient support for dealing with multi-level data and regular operations, so it is unable to handle the speculation-verifying computations in the previous scenarios.</p>
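<p>As an illustration of this step-by-step, calculator-like style, the stock query from the earlier list can be worked out one intermediate result at a time (the price series below is invented sample data):</p>

```python
# A step-by-step sketch of the stock query, in the "data calculator" spirit:
# each intermediate result feeds the next step. Invented sample prices.

prices = [10, 11, 12, 13, 12, 13, 14, 15, 16, 15, 16, 17, 18, 19]

# Step 1: did the stock rise on each day compared to the previous day?
rises = [b > a for a, b in zip(prices, prices[1:])]

# Step 2: positions where the stock has risen three days in a row.
streaks = [i for i in range(len(rises) - 3)
           if rises[i] and rises[i + 1] and rises[i + 2]]

# Step 3: among those, how often did it rise again on the 4th day?
prob = sum(rises[i + 3] for i in streaks) / len(streaks)
print(prob)  # 0.5 on this sample series
```

<p>Each step is inspected before the next is decided, which is the interactive pattern the paragraph above describes.</p>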
<p>In later articles, we'll analyze the current popular computing techniques to locate the problems in handling these two types of computation, and suggest solutions to them.</p>
<p><a href="https://www.linkedin.com/pulse/mr-jiangs-datatalk-room-what-kind-olap-do-we-really-han-lijun/" target="_blank" rel="noopener">https://www.linkedin.com/pulse/mr-jiangs-datatalk-room-what-kind-olap-do-we-really-han-lijun/</a></p>
<p><strong>9 Off-the-beaten-path Statistical Science Topics with Interesting Applications</strong> (Vincent Granville, 2017-10-02)</p>
<p><span>You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many data scientists with a PhD degree have not heard about some of these concepts.</span></p>
<p><span><a href="http://api.ning.com/files/hIzuEPIKkM6cOZ73lNYnehvJ6jmXmMAGlRiI8MXMIo9pF*QRTE5wAxpO559lTW-Ynz-xxulWwguEJmCyBrwPZJosP2n8gF4N/Capture.PNG" target="_self"><img src="http://api.ning.com/files/hIzuEPIKkM6cOZ73lNYnehvJ6jmXmMAGlRiI8MXMIo9pF*QRTE5wAxpO559lTW-Ynz-xxulWwguEJmCyBrwPZJosP2n8gF4N/Capture.PNG" class="align-center"/></a></span></p>
<p><span>The topics discussed in this article include:</span></p>
<ul>
<li>Random walks in one, two and three dimensions - With Video</li>
<li>Estimation of the convex hull of a set of points - Application to clustering and oil industry</li>
<li>Constrained linear regression on unusual domains - Application to food industry</li>
<li>Robust and scale-invariant variances</li>
<li>Distribution of arrival times of extreme events - Application to flood predictions</li>
<li>The Tweedie distributions - Numerous applications</li>
<li>The arithmetic-geometric mean - Fast computations of decimals of Pi</li>
<li>Weighted version of the K-NN clustering algorithm</li>
<li>Multivariate exponential distribution and storm modeling</li>
</ul>
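<p>As a taste of one item on the list: the arithmetic-geometric mean drives the Gauss-Legendre algorithm for computing digits of Pi. A minimal sketch (not the article's own code) of why it is fast — the AGM converges quadratically, so a handful of iterations already exhausts double precision:</p>

```python
import math

def gauss_legendre_pi(iterations=3):
    """Approximate Pi with the AGM-based Gauss-Legendre algorithm."""
    a, b = 1.0, 1.0 / math.sqrt(2.0)
    t, p = 0.25, 1.0
    for _ in range(iterations):
        a_next = (a + b) / 2.0       # arithmetic mean
        b = math.sqrt(a * b)         # geometric mean
        t -= p * (a - a_next) ** 2
        a, p = a_next, 2.0 * p
    return (a + b) ** 2 / (4.0 * t)

print(gauss_legendre_pi())  # prints Pi to (nearly) full double precision
```

<p>With arbitrary-precision arithmetic the same iteration roughly doubles the number of correct digits per step, which is what makes it suitable for computing millions of decimals of Pi.</p>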
<p><a href="http://www.datasciencecentral.com/profiles/blogs/9-off-th-beaten-path-statistical-science-topics" target="_blank">Click here to read the article</a>.</p>
<p><strong>Audience is The Future Business Model, Data Analytics Can Improve It</strong> (Chirag Shivalker, 2017-09-26)</p>
<p><img src="http://api.ning.com:80/files/Zk032S8hxsW5EhV6AH5bPFP6D1ncYHrrK2-OrQXYHYOzX*zwYnMcn5Zf7NMGDoe51DEu2ivPqJ16NW0QJGaNTJ4*efe6a9ch/audienceisthefuturebusinessmodeldataanalyticsimprovesit3.jpg" width="750"/></p>
<p></p>
<p>Companies and enterprises face a daily grind: they must keep their customers happy and satisfied, their operations efficient, and their employees content, and all of this makes running the business a real challenge. “<em>Audience is the new business model</em>”, and if an organization struggles to communicate with its customers or audience, there is certainly a negative impact across the business plan, from finances to product development.<br/></p>
<p>The majority of organizations have, in one manner or another, turned to data analytics to identify and understand their audience, focusing on optimizing customer communication, revitalizing their business models, and driving profitable growth. Even Fortune 500 and Fortune 1000 companies have realized that data analytics draws a clear line from marketing campaigns to profits. Here are three tried-and-tested ways of capitalizing on data analytics to improve the overall business plan.<br/></p>
<p><span class="font-size-5">1. Understanding target markets before expansion</span></p>
<p><br/> Some organizations have grown smart enough to start viewing their business models as a constant work in progress. Exploratory data analytics has helped them stay relevant. Leveraging data analytics offerings equips them with real-time information about consumer activity. It also surfaces potential target markets that might otherwise have gone unnoticed. <em>Factors they consider when reaching out to new demographics:</em><br/></p>
<ul>
<li>Identify which aspects of products/services are best received, to focus on reaching out to potential customers.</li>
<li>Assess and strategize how potential customers can be served better than by the direct competition.</li>
<li>Visualize insights derived through data analytics and identify patterns of age, gender, location, etc., to fine-tune offers around these factors.</li>
</ul>
<p><br/> Understanding target markets before implementing expansion plans is imperative. Using data to glean insights, and to identify the strengths and weaknesses of the current business model as reflected in user activity, ultimately enhances the ability to engage potential customers in new landscapes.</p>
<p><br/> <span class="font-size-5">2. Understanding consumer behavior</span></p>
<p><br/> Businesses used to operate with limited abilities when it came to reaching out to customers, and a lot of decisions were made on guesswork. This included the rudimentary marketing and advertising tactics used to attract customers. And then predictive data analytics walked into the picture.<br/></p>
<p>Understanding audience or customer behavior, the critical element of preparing a profitable business plan, is now addressed adequately. Any consumer's online activity gives out data that can be analyzed in real time and used as an opportunity to target them directly. <em>This data also helps businesses organize targeted advertisements to encourage more sales and better manage finances in the following ways:</em></p>
<p></p>
<ul>
<li>Tracking when a customer made purchases, with the help of sales forecasting, helps predict when they will most likely purchase again. Based on this, businesses can organize targeted advertisements to enhance sales.</li>
<li>Consumer purchasing patterns, as highlighted by predictive analytics, help in strategically allocating funds according to times of prosperity or stagnation.</li>
<li>Easy identification of trends and fluctuations in the market and across the industry, facilitated by historical data analytics, helps minimize risks and identify lucrative investment opportunities.</li>
</ul>
<p></p>
<p>The fine blend of sales forecasting and predictive analytics creates a sustainable financial model. Understanding consumer behavior delivers valuable insights about how purchases were made and what can be done to improve overall sales activity.<br/></p>
<p><span class="font-size-5">3. Understanding the power of word of mouth</span></p>
<p><br/> The tremendous rise of social networks has forced businesses to recognize the power of word of mouth. Customer feedback and forums have proved that even the biggest corporations across the globe are subject to customer reviews. Businesses today are modeling their business plans on customer expectations by leveraging consumer data analytics.</p>
<p></p>
<ul>
<li>To build trust and reach a more loyal customer base, businesses can target consumers with personalized messages to prove that they are attentive.</li>
<li>To engage consumers in more meaningful communication, they can leverage data to get in touch with consumers around seasonal events, holidays, and significant occasions.</li>
<li>Being open to criticism is one of the most effective ways businesses have found to improve products/services and even the business model, and to make customers feel they are heard; this encourages customers to return.</li>
</ul>
<p><br/> With the help of data analytics, it becomes much easier to track customer needs and adjust the business model accordingly. Whether in ecommerce / retail, real estate, consulting & professional services, transportation & logistics, BFSI, or healthcare, data analytics has empowered professionals across industries to glean insights about consumer needs, useful for reinforcing confidence in their products/services and impacting everything from consumer awareness to sales & profit.</p>
<p><strong>Why Analytics Projects Fail – And It’s Not The Analytics!</strong> (Ed Crowley, 2017-09-26)</p>
<p>Being in a highly technical, complex field, it is easy to sometimes lose the ‘human aspect’ of the solutions we are developing. We focus on applying edge computing concepts, or on whether a seasonality model yields better predictive accuracy than some other approach. Don't get me wrong, these are all important activities. However, in working with many firms on developing, deploying, and supporting advanced analytics solutions, particularly in the Industrial IoT space, it’s often the people side that fails – not the technology.</p>
<p>How many times have you developed an amazing predictive analytics solution that your team is excited about, but it fails to get corporate funding? Or even worse, you develop a proof of concept, but the business unit responsible for deploying the solution never fully adopts or accepts it. What gives? The solution has been proven to work. It shows significant results. Why the hesitancy?</p>
<p>The reality is that most advanced analytics solutions have some impact on existing business processes. Whether it is replacing staff with an automated process or gaining management's trust that the decision-making algorithms work, the bottom line is that it’s all about making sure the organization, the people, not only accept but embrace the solution.</p>
<p>Here are some suggestions for helping to make this happen:</p>
<ol>
<li><b>KISS - Keep It Simple, Stupid.</b> Yeah, it’s cool to be the ‘wizard in the room’, but the reality is that most people are pretty intimidated by the statistics and buzzwords associated with advanced analytics. Keep it simple; they are already going to be impressed by your understanding of the subject area. Use as many examples as possible and try to avoid buzzwords or highly technical jargon.</li>
<li><b>Make It Relevant.</b> Clearly articulate why this has value. For the executive team, a clear use case with financial detail explaining the value of the solution is critical. For the management team implementing it, before you ever pitch ‘the solution’, make sure you have spent time really understanding how it would impact them and what their needs are, and explain how it will help them in their job or how the improvement in company performance will benefit them.</li>
<li><b>Get Buy In Early.</b> Give everyone involved in implementing, using, or making decisions about the solution a chance to buy in by having a say on how it is designed, and how it will operate. If you wait until you have ‘completed’ the design, then you have to overcome resistance to change and inertia. No matter how good the solution is, folks typically don’t want to change. However, if you let them help you design the solution, they will buy-in to it as part of the process.</li>
</ol>
<p>Clearly, there is a lot that can be done to help drive acceptance of your solution. So my strong advice is: don’t think of just designing and developing the solution as the ‘project’. Half of the project is getting buy-in to the solution from your organization. Be prepared for this, plan for it, and embrace it!</p>
<p>Ed Crowley is CEO of Virtulytix – the Industrial IoT solutions development firm that makes the Industrial IoT smart!</p>
<p><strong>Can you solve these mathematical / statistical problems?</strong> (Vincent Granville, 2017-09-22)</p>
<p>I recently posted an article featuring a non-traditional approach to finding large prime numbers. The research section of that article offers interesting challenges, both for data scientists interested in mathematics and for mathematicians interested in data science and big data. My approach is heavy on data, pattern recognition, and machine learning. Here is the introduction:</p>
<p>Large prime numbers have been a topic of considerable research, for its own mathematical beauty, as well as to develop more powerful cryptographic applications and random number generators. In this article, we show how big data, statistical science (more specifically, pattern recognition) and the use of new efficient, distributed algorithms, could lead to an original research path to discover large primes. Here we also discuss new mathematical conjectures related to our methodology.</p>
<p>Much of the focus so far has been on discovering raw large primes: Any time a new one, bigger than all predecessors, is found, it gets a lot of attention even beyond the mathematical community. Here we explore a different path: finding numbers (usually not primes) that have a very large prime factor. In short, we are looking for special integer-valued functions f(n) such that f(n) has a prime factor bigger than n, hopefully much bigger than n, for most values of n.</p>
<p><a href="http://api.ning.com:80/files/2ZTD6jtt0ki2RHB4NCP5vnQap1rzTO*xnaeSraW2vMSqpizMibvH7WNSzUe*dECAUDydUKL*TkB9yqMq-E3kaCTP-FAzYX0q/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/2ZTD6jtt0ki2RHB4NCP5vnQap1rzTO*xnaeSraW2vMSqpizMibvH7WNSzUe*dECAUDydUKL*TkB9yqMq-E3kaCTP-FAzYX0q/Capture.PNG" width="595" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="http://infosthetics.com/archives/2012/07/on_the_pattern_of_primes.html" target="_blank">click here</a></em></p>
<p>The distribution of the largest prime factor has been studied extensively. If we choose a function that grows fast enough, one would expect that the largest prime factor of f(n) will always be larger than n. However, this would lead to intractable factoring issues when finding the large primes in question. So in practice, we are interested in functions f(n) that do not grow too fast. The problem is that many, if not most, very large integers are friable: their largest prime factor is relatively small. I like to call them porous numbers. So the challenge is to find a function f(n) that does not grow too fast, yet produces very few friable numbers as n becomes extremely large. <a href="http://www.datasciencecentral.com/profiles/blogs/data-science-method-to-discover-large-prime-numbers" target="_blank">Read the full article here</a>.</p>
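<p>To make the setup concrete, here is a small sketch. The choice f(n) = n² + 1 is a hypothetical illustration, not the function studied in the article: for each n we factor f(n) and check whether its largest prime factor exceeds n.</p>

```python
# Hypothetical illustration of the setup above: f(n) = n*n + 1 is just an
# example choice. For each n, find the largest prime factor of f(n) and
# check whether it exceeds n.

def largest_prime_factor(m):
    """Largest prime factor of m, by trial division (fine for small m)."""
    factor = 1
    d = 2
    while d * d <= m:
        while m % d == 0:
            factor = d
            m //= d
        d += 1
    return m if m > 1 else factor  # any leftover m > 1 is itself prime

for n in range(2, 8):
    f = n * n + 1
    p = largest_prime_factor(f)
    print(n, f, p, p > n)
```

<p>On this sample, n = 7 already yields a friable value (50 = 2 · 5²), illustrating why keeping friable numbers rare as n grows is the hard part of the problem.</p>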
<p><em>For another interesting challenge, read the section "Potential Areas of Research" in my article <a href="http://www.analyticbridge.datasciencecentral.com/profiles/blogs/mysterious-sequences-that-look-random-with-surprising-properties" target="_blank">How to detect if numbers are random or not</a>. For other articles featuring difficult mathematical problems, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank">click here</a>. For a statistical problem with several potential applications (clustering, data reduction), <a href="http://www.datasciencecentral.com/profiles/blogs/nice-generalization-of-the-k-nn-clustering-algorithm" target="_blank">click here</a> and read the last section. More challenges can be found <a href="http://www.datasciencecentral.com/group/resources/forum/topics/best-kept-secret-about-data-science-competitions" target="_blank">here</a>.</em></p>
<p><strong>Building Convolutional Neural Networks with Tensorflow</strong> (ahmet taspinar, 2017-09-07)</p>
<p>In the past year I have also worked with Deep Learning techniques, and I would like to share with you how to build and train a Convolutional Neural Network from scratch, using TensorFlow. Later on we can use this knowledge as a building block for interesting Deep Learning applications.</p>
<p>The pictures here are from the full article. Source code is also provided.</p>
<p><a href="http://api.ning.com:80/files/OtMcIKbpmwRWBDj9GQF-YGP9oRvaOCVcm7V69c-mMoJgT7BYOfP10tAyA-7CVNEeeT65kmN0MZazjB08c3rhMHL6-tv-mM5i/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/OtMcIKbpmwRWBDj9GQF-YGP9oRvaOCVcm7V69c-mMoJgT7BYOfP10tAyA-7CVNEeeT65kmN0MZazjB08c3rhMHL6-tv-mM5i/Capture.PNG" width="654" class="align-center"/></a></p>
<p>Before you continue, make sure you understand how a convolutional neural network works. For example:</p>
<ul>
<li>What is a convolutional layer, and what is the filter of this convolutional layer?</li>
<li>What is an activation layer (ReLU, the most widely used, sigmoid, or tanh)?</li>
<li>What is a pooling layer (max pooling / average pooling), dropout?</li>
<li>How does Stochastic Gradient Descent work?</li>
</ul>
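<p>If you want to check your understanding of the first three questions, here is a minimal NumPy sketch (not the article's TensorFlow code) of a convolutional filter, a ReLU activation, and 2x2 max pooling applied to a tiny 4x4 image:</p>

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (truncates odd borders)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)   # a toy 4x4 gradient image
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])                   # a simple edge-detecting filter

features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
print(features)  # [[2.]] -- the filter responds uniformly to the gradient
```

<p>A real convolutional layer learns many such kernels at once and applies them across channels; the sketch only shows the mechanics each question refers to.</p>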
<p><strong>The contents of this blog-post is as follows</strong>:</p>
<p>1. Tensorflow basics:</p>
<ul>
<li>Constants and Variables</li>
<li>Tensorflow Graphs and Sessions</li>
<li>Placeholders and feed_dicts</li>
</ul>
<p>2. Neural Networks in Tensorflow</p>
<ul>
<li>Introduction</li>
<li>Loading in the data</li>
<li>Creating a (simple) 1-layer Neural Network:</li>
<li>The many faces of Tensorflow</li>
<li>Creating the LeNet5 CNN</li>
<li>How the parameters affect the output size of a layer</li>
<li>Adjusting the LeNet5 architecture</li>
<li>Impact of Learning Rate and Optimizer</li>
</ul>
<p>3. Deep Neural Networks in Tensorflow</p>
<ul>
<li>AlexNet</li>
<li>VGG Net-16</li>
<li>AlexNet Performance</li>
</ul>
<p>4. Final words</p>
<p><a href="http://api.ning.com:80/files/OtMcIKbpmwQGE4JomGtJxEJJGt6KnYUy00r0-QEkm5ijh4CUew9kI4K1GfAaWEg373iECKBQR7Q91MCgpVvvCfEN7lohLKe*/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/OtMcIKbpmwQGE4JomGtJxEJJGt6KnYUy00r0-QEkm5ijh4CUew9kI4K1GfAaWEg373iECKBQR7Q91MCgpVvvCfEN7lohLKe*/Capture.PNG" width="506" class="align-center"/></a></p>
<p><em>To read this blog, <a href="http://ataspinar.com/2017/08/15/building-convolutional-neural-networks-with-tensorflow/" target="_blank">click here</a>. <span>The code is also available in my </span><a href="https://github.com/taspinar/sidl" target="_blank">GitHub repository</a><span>, so feel free to use it on your own dataset(s).</span></em></p>
<p><strong>A Day in the life of an Analyst</strong> (Ivy Pro School, 2017-09-03)</p>
<p><a href="http://ivyproschool.com/blog/2016/11/29/a-day-in-the-life-of-an-analyst/" target="_blank"><img src="http://api.ning.com:80/files/n6oHIXSTHuV5RPPEztZ2VArorCdmCs1luq*GOApIsTbZs*-znhBpU-iDgKScC-BCZrwTFYKloOAcMgZea06SjHwLj3erdtbT/Infographic_dayinthelifeofanAnalyst.png?width=750" width="750" class="align-full"/></a></p>
<p><strong><u>A typical day in the life of an Analyst</u></strong></p>
<p>An Analyst works on varied projects, with multiple deliverables and duties that depend on the business objectives.</p>
<p>However, there are some tasks that can easily be classified as “common everyday duties” in the typical work day of a business analyst.</p>
<p><strong>Clarification and investigation of business goals and problems</strong></p>
<p>Analysts like to ask questions. This helps them identify the right question to analyze. The investigation process involves conducting interviews, reading, and observing people at work.</p>
<p><strong>Information analysis</strong></p>
<p>The analysis phase is when the Business Analyst spells out the requirements in detail, stating clearly and unambiguously what the business needs to do in order to solve its issue. During this stage the BA will also interact with the development team and, if appropriate, an architect, to design the layout and define precisely what the solution should look like.</p>
<p> </p>
<p><strong>Meeting stakeholders</strong></p>
<p>Good Business Analysts spend countless hours actively communicating. More than just speaking, this means hearing and recognising verbal and non-verbal information, building an open conversation, verifying that you have understood what you heard, and communicating what you learn to those who will create the actual solution.</p>
<p> </p>
<p><strong>Document research</strong></p>
<p>After the investigation, all the facts that Analysts have collected need to be documented for future use. They use data visualization tools such as Excel charts and spreadsheets.</p>
<p><strong>Evaluate options</strong></p>
<p>Before proposing solutions to the given problems, Analysts like to work through various possible options, weighing which one best fits the business requirements. To draw conclusions, Analysts use analytical tools such as SAS, R, and Python to analyze the data and predict the best outcome.</p>
<p><strong>Take action</strong></p>
<p>Once the optimum solution is derived, an Analyst has to communicate with the client again to catch any missed objectives and to guide them on how to implement the recommended processes.</p>
<p> </p>
<p>A Business Analyst points the way toward viable solutions to complex business problems. It’s no surprise that Business Analysts are in such great demand.</p>
<p><strong>Overpromising and Underperforming: Understanding and Evaluating Self-service BI Tools</strong>, by JIANG Buxing, 2017-08-31</p>
<p>From the OLAP concept of earlier years to the agile BI of the last few years, BI vendors have never stopped advertising self-service capabilities, claiming that business users will be able to perform analytics by themselves. Since there are strong self-service needs among users, the two sides hit it off and a quick deal is often made. The question is: does a BI product’s self-service functionality enable truly flexible data analytics by business users?</p>
<p>There isn’t a standard definition of “data analytics” in the industry, so no one can say for sure whether the claim is objective or exaggerated. But for users who have little BI experience, the fact is that most of their self-service needs can’t be met with the so-called self-service technology. According to industry experience, the best record is solving about 30% of those problems; most BI products lag far behind that number, lingering around 10%.</p>
<p>We’ll look at the phenomenon from three aspects.</p>
<p><b> </b></p>
<p><b>Multidimensional analysis</b></p>
<p>Multidimensional analysis performs interactive operations over a pre-created data set (or a data cube). Today most BI products provide this type of analytic capability. Though a new generation of BI products has improved much on interface design and operational smoothness, their ability to implement computations hasn’t essentially improved.</p>
<p>The key aspect of multidimensional analysis is model creation, which is the pre-preparation of data sets. If the data to be analyzed is all held in a single data set, and if the operations to be performed are among those provided by a BI product (including rotation, drilldown, slicing, and so on), the analysis is well within the product’s capability. But in most real-life scenarios, analytic needs go beyond these pre-installed functionalities, such as adding a provisional data item or performing a join with another data set, leading to a re-creation of the model. The problem is that model creation requires technical professionals, rendering the tool anything but self-service. </p>
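<p>To make these operations concrete, here is a minimal sketch in Python with pandas over an entirely hypothetical sales "cube"; it shows slicing, rotation (pivoting) and drilldown on a pre-prepared data set, which is exactly the class of operations described above:</p>

```python
import pandas as pd

# A tiny pre-aggregated "data set" (the cube): sales by region and product.
cube = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 80, 120],
})

# Slicing: restrict one dimension to a fixed value.
north = cube[cube["region"] == "North"]

# Rotation (pivot): swap which dimensions appear on rows vs. columns.
pivoted = cube.pivot_table(values="sales", index="region", columns="product")

# Drilldown: move from a coarse total to a finer grouping.
by_region = cube.groupby("region")["sales"].sum()
by_region_product = cube.groupby(["region", "product"])["sales"].sum()
```

<p>Anything inside the prepared `cube` is easy; adding a column that is not there, or joining another table, is the "model re-creation" step the article says requires technical professionals.</p>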
<p>Multidimensional analysis can meet only 10% of the self-service needs, which reflects the average self-service ability of today’s BI products.</p>
<p><b> </b></p>
<p><b>Associative query</b></p>
<p>Some BI products provide associative query capability to make up for the limitations of multidimensional analysis. The strategy is to create a new data set by joining multiple data sets before performing the multidimensional analysis, or to implement certain joins between multiple data cubes during the multidimensional analysis. This means business users are to some degree allowed to create models.</p>
<p>It isn’t easy to implement associative query well. Relational databases define the JOIN operation too simply, making the associations between data sets too complicated for many business users to understand. The issue can be partly addressed through a well-designed product interface, and a good BI product enables business users to appropriately handle non-circular joins. But to fully solve the issue, we need to change the data organization scheme at the database level. The reality is that almost no BI products re-define the database model, so the improvement in associative query ability is limited. We’ll discuss the related technologies in later articles.</p>
<p>Here’s a typical example for testing the associative query ability of a BI product: finding the male employees who report to female managers. This simple-sounding query involves multiple self-joins, and most BI products are incapable of handling it (without first creating a model).</p>
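<p>For illustration only, here is how that test query could be expressed outside a BI tool, as a pandas self-join over a hypothetical employee table (all column names and data are invented):</p>

```python
import pandas as pd

# Hypothetical employee table; manager_id is a self-referencing foreign key
# (0 means "no manager").
employees = pd.DataFrame({
    "id":         [1, 2, 3, 4],
    "name":       ["Alice", "Bob", "Carol", "Dave"],
    "gender":     ["F", "M", "F", "M"],
    "manager_id": [0, 1, 1, 3],
})

# Self-join: attach each employee's manager as extra "_mgr" columns.
joined = employees.merge(
    employees, left_on="manager_id", right_on="id", suffixes=("", "_mgr")
)

# Male employees whose manager is female.
result = joined[(joined["gender"] == "M") & (joined["gender_mgr"] == "F")]
```

<p>The point of the example is that the join condition lives on the same table, which is precisely what pre-built BI models tend not to support without re-modeling.</p>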
<p>BI products’ associative query capability can meet 20%-30% of the self-service needs, though the specific number depends on the different capabilities provided by different products.</p>
<p> </p>
<p><b>Procedure computing</b></p>
<p>About 70% or more of self-service demands involve multi-step procedural computations, which are completely beyond the design target of a BI product and can even be considered beyond data analytics, yet they remain a pressing user problem. Users hope that frontline employees can access data as flexibly as possible within their authority.</p>
<p>A simple solution is to export the data from the BI product and let frontline workers handle it with desktop tools like Excel. Unfortunately, Excel is not good at handling multilevel joins (an issue to be discussed later) or large amounts of data, making it unsuitable for many computing scenarios.</p>
<p>Until more advanced interactive computing technology appears, technical specialists will be responsible for tackling those problems. In this context, instead of pursuing self-service procedural computing, BI products should focus on facilitating business users’ access to technical resources and streamlining the development process for developers.</p>
<p>There are two things we can do. One is to establish an algorithm library where the algorithms for previously handled scenarios are stored. Business users can call up an algorithm and change its parameters to reuse it in the same type of computing scenario. They can also point to an algorithm in the library as a reference for the technical specialists handling a new scenario, reducing the chance of misunderstanding between business users and the development team, which is a major cause of delay. The other is to provide efficient, manageable programming technology that facilitates coding and modification, and that supports storing an algorithm in the library for reuse. Corresponding technologies are rare in the industry. SQL has good manageability, but SQL code for procedural computations is too tedious. Stored procedures need recompilation, which is inconvenient for reuse. Java code needs recompilation too, and is nearly unmanageable. Other scripting languages are integration-unfriendly and thus difficult to store and manage in a database for reuse.</p>
<p> </p>
<p>At present, BI products are barely able to meet the most common self-service needs. Usually the BI vendors are talking about multidimensional analysis while the users are thinking of problems that require procedural computing. This misunderstanding invites high expectations as well as big disappointments. In view of this, it’s critical that users have a good understanding of their own self-service needs: Is multidimensional analysis sufficient for the problems at hand? How many associative queries will be needed? Will the frontline employees have a lot of problems that require procedural computing? Having these questions answered is necessary for setting a reasonable expectation of a BI product and for knowing what the product can actually do, thus avoiding being misled by a flashy interface and smooth operation into making a wrong purchase decision.</p>
<p></p>
<p><a href="http://www.linkedin.com/pulse/overpromising-underperforming-understanding-evaluating-han-lijun" target="_blank">http://www.linkedin.com/pulse/overpromising-underperforming-understanding-evaluating-han-lijun</a></p>
<p><strong>8 Great Articles, Tutorials, and Infographics</strong>, by Vincent Granville, 2017-08-22</p>
<p>Posted on DSC today and yesterday</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/new-book-data-science-mindset-methodologies-and-misconceptions">New Book: Data Science: Mindset, Methodologies, and Misconceptions</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/robust-attacks-on-machine-learning-models">Robust Attacks on Machine Learning Models</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/about-quick-r">Quick Guide to R and Statistical Programming</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/book-r-for-data-science">Book: R for Data Science</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/predicting-the-next-eclipse">Predicting the next Eclipse</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/under-the-hood-with-reinforcement-learning-understanding-basic-rl">Understanding Basic Reinforcement Learning Models</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/evolution-of-machine-learning-infographics">Evolution of Machine Learning </a>- Infographics</li>
<li><a href="http://www.bigdatanews.datasciencecentral.com/profiles/blogs/how-marketers-use-data-science-to-increase-reach">How Marketers Use Data Science to Increase Reach </a>- Infographic</li>
</ul>
<p><span>Enjoy the reading!</span></p>
<p><strong>Curious Mathematical Object: Hyperlogarithms</strong>, by Vincent Granville, 2017-08-16</p>
<p>Logarithms turn a product of numbers into a sum of numbers: log(xy) = log(x) + log(y). Hyperlogarithms generalize the concept as follows: Hlog(XY) = Hlog(X) + Hlog(Y), where X and Y are any kind of objects, and the product and sum are replaced by operators in some arbitrary space. </p>
<p><a href="http://api.ning.com:80/files/*K0CPt1rhrQPLLGS8*ARoNUsaedloCkcp2WiyS-5HI7Xxf3xmSL5DVRW*noyClNPAuQJCPhh8EMhqW27*tNv*4Lv8nrmI9yA/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/*K0CPt1rhrQPLLGS8*ARoNUsaedloCkcp2WiyS-5HI7Xxf3xmSL5DVRW*noyClNPAuQJCPhh8EMhqW27*tNv*4Lv8nrmI9yA/Capture.PNG" width="252" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="https://en.wikipedia.org/wiki/Super-logarithm" target="_blank" rel="noopener">click here</a></em></p>
<p>Here we focus exclusively on operations on sets: XY becomes the intersection of the sets X and Y, and X + Y the union of X and Y. The question is: which functions satisfy Hlog(XY) = Hlog(X) + Hlog(Y)? We assume here that the argument of Hlog is a set X, and the returned value Hlog(X) = Y is another set Y from the same collection of sets. Let E = {X, Y, ... } be the collection of all potential arguments for Hlog. E must satisfy the following conditions:</p>
<ul>
<li>The intersection of two sets of E is also a set of E</li>
<li>The union of two sets of E is also a set of E</li>
</ul>
<p>The following additional condition can be added:</p>
<ul>
<li>E does not contain the empty set (thus any intersection of a finite number of sets in E is non-empty)</li>
</ul>
<p>Let's denote as U the union of all sets of E. It is easy to prove that Hlog(U) is the empty set, denoted as O. Also Hlog(O) does not exist, just like log(0) does not exist. It is also easy to prove that Hlog(XYZ) = Hlog(X) + Hlog(Y) + Hlog(Z), and this generalizes to any (finite) number of sets in E.</p>
<p>Two functions Hlog satisfy Hlog(XY) = Hlog(X) + Hlog(Y):</p>
<ul>
<li>Hlog(X) is equal to a constant set if X is different from U, and Hlog(U) = O is the empty set.</li>
<li>Hlog(X) = U - X is the complementary set of X (that is, U - X consists of all the elements that are in U but not in X)</li>
</ul>
<p>The question is: besides these two functions (the first one is a degenerate solution), are there any other functions Hlog satisfying the same property? How do you proceed to solve such a weird functional equation in an unusual space? One could investigate this problem by first analyzing the case where E contains only 3 or 4 sets: in this case, look at all potential functions defined on E (there is only a finite number of them), and see which ones satisfy the equation Hlog(XY) = Hlog(X) + Hlog(Y). </p>
<p>Also, using object-oriented programming, how would you implement a generic function Hlog that works for real numbers, for sets, or in any other context?</p>
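<p>One possible answer, sketched in Python: dispatch Hlog on the type of its argument, so a single generic function covers both the ordinary numeric logarithm and the set-complement solution described above (the universe U below is a made-up example):</p>

```python
import math
from functools import singledispatch

U = frozenset(range(10))  # hypothetical universe: the union of all sets in E

@singledispatch
def hlog(x):
    raise TypeError(f"no Hlog defined for {type(x).__name__}")

@hlog.register
def _(x: float):
    # For numbers, "product" is multiplication and Hlog is the usual log.
    return math.log(x)

@hlog.register
def _(x: frozenset):
    # For sets, "product" is intersection and "sum" is union;
    # Hlog(X) = U - X (the complement) satisfies Hlog(XY) = Hlog(X) + Hlog(Y).
    return U - x

X, Y = frozenset({1, 2, 3}), frozenset({2, 3, 4})
assert hlog(X & Y) == hlog(X) | hlog(Y)   # the defining property
assert hlog(U) == frozenset()             # Hlog(U) = O, the empty set
```

<p>De Morgan’s law is what makes the complement work here: U - (X ∩ Y) = (U - X) ∪ (U - Y), mirroring log(xy) = log(x) + log(y), with U playing the role of 1 and the empty set the role of 0.</p>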
<p>To read more of my math-related articles, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">click here</a>. </p>
<p><span class="font-size-4"><b>DSC Resources</b></span></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p><span class="font-size-4"><b>Popular Articles</b></span></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul>
<p><strong>Fighting eCommerce fraud with graph technology</strong>, by Elise Devaux, 2017-08-09</p>
<p><span><br/> ECommerce fraud is growing quickly, creating new challenges in terms of prevention and detection. As merchants gather more and more information about customers and their behaviors, the key element in the fight against fraud is now to draw on the connections within the data collected to uncover fraudulent behaviors. In this post we explain why and how graph technologies are crucial in the detection of eCommerce fraud.</span></p>
<p></p>
<h2 style="text-align: center;"><span class="font-size-5">eCommerce: a fertile ground for fraud</span></h2>
<p></p>
<p><span>In recent years the eCommerce market has continually expanded, reaching $1.9 trillion in transaction value in 2016. Ecommerce sales are still growing rapidly and are </span><a href="https://www.emarketer.com/Article/Worldwide-Retail-Ecommerce-Sales-Will-Reach-1915-Trillion-This-Year/1014369">forecast to reach $4 trillion</a><span> by the end of 2020, notably with retailers pushing into new international markets.<br/> <br/></span></p>
<p><span>In the meantime, eCommerce fraud has become a multi-billion dollar industry. A </span><a href="http://www.experian.com/assets/decision-analytics/white-papers/juniper-research-online-payment-fraud-wp-2016.pdf">study conducted by Juniper</a><span> estimated the average cost to merchants at between 0.3% and 3% of revenues, depending on the vertical and the region. Below are some examples of today’s most common fraud schemes.</span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/ecommerce_fraud_types_linkurious.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/ecommerce_fraud_types_linkurious.png?width=1230" width="1230" class="align-center"/></a></p>
<p><span>The adoption of new technologies, payment methods and data processing systems has benefited fraudsters, opening new doors to bypass existing security measures and cover their tracks. Professional fraudsters are organizing </span><span><a href="https://www.rte.ie/news/technology/2017/0621/884459-online-fraud/">themselves into networks</a>. They</span><span> exchange knowledge and technology across the globe and devise new schemes to stay one step ahead of the latest anti-fraud technology.<br/> <br/></span></p>
<h2 style="text-align: center;"><span><span class="font-size-5">A layered approach to fighting fraud raises technical challenges</span><br/> <br/></span></h2>
<p>Faced with increasing flows of money and evolving fraud schemes, but also with new technology disrupting traditional security measures, anti-fraud teams in eCommerce companies need to adapt.<br/></p>
<p><span>The traditional “silver bullet” approach of relying on one or two anti-fraud strategies is no longer enough. Best in class organizations combine multiple complementary approaches to maximize the accuracy of fraud detection and avoid false positives that negatively impact reputations. With numerous fraud prevention solutions available – from device authentication to proxy piercing or address verification service – the layered approach shows better results in detecting fraud attempts.<br/> <br/></span></p>
<p><span>In their </span><a href="https://www.gartner.com/doc/3472117/market-guide-online-fraud-detection">Market Guide for Online Fraud Detection</a><span>, Gartner’s fraud analysts outlined five critical layers to tackle today’s threats: end-point, navigation, channel and cross-channel centric layers and an additional entity link analysis layer.</span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/Gartner_layers_fraud.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/Gartner_layers_fraud.png?width=1858" width="1858" class="align-center"/></a></p>
<p style="text-align: center;"><i>Gartner’s conceptual model of a layered approach for fraud detection<br/> <br/></i></p>
<p><span>In order to carry out connected analysis, eCommerce vendors need technologies able to work with cross-channel data and perform relationship analysis at scale. However, most of today’s anti-fraud solutions (whether homemade or provided by a vendor) still rely on relational databases, designed to store data in a tabular format. Detecting connections between entities typically requires joining tables using foreign keys, which becomes computationally intractable after a few hops.<br/> <br/></span></p>
<p><span>As a result, eCommerce anti-fraud teams are still limited by product and channel silos that provide little or no cross-channel view of a subject’s behavior. Existing structures are too rigid to allow the easy adoption of new rules or data, making it hard to keep pace with new products and schemes.<br/> <br/></span></p>
<h2 style="text-align: center;"><span>Graph technology: the answer for connected analysis<br/> <br/></span></h2>
<p><span>To be able to perform connected analysis and reinforce their fraud detection systems, many eCommerce merchants are choosing to leverage graph technology. This approach relies on a graph data model where all the data is stored as a graph. The entities are stored as nodes, connected to each other by edges. Popular graph database vendors include <a href="https://www.datastax.com/">DataStax</a>, <a href="https://neo4j.com/">Neo4j</a> and <a href="http://titan.thinkaurelius.com/">Titan</a>.<br/> <br/></span></p>
<p><span>Graph technology makes it possible to gather and connect customer, transaction, behavior or third-party data into a single data model. This is essential to discover fraud attempts that are often hidden behind layers of deceit. For instance, instead of examining credit card transactions over a lapse of time, analysts can query the graph data to investigate how a transaction is connected to other entities such as IP addresses, customers and devices.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/linkurious_fraud_1.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/linkurious_fraud_1.png?width=2135" width="2135" class="align-center"/></a></p>
<p style="text-align: center;"><i>Visualization in Linkurious of cross-channel data stored as a graph in Neo4j<br/> <br/></i></p>
<p><span>The graph approach makes it easier and faster to query connections within the data. Anti-fraud teams can run queries traversing datasets of millions of records to unveil suspicious connections. This is critical for detecting networks and suspicious patterns in real time. But in order to speed up the analysis process, and therefore the response time, eCommerce merchants need intuitive access to this graph data.</span></p>
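<p>As a toy illustration of such a traversal, here is a Python sketch using the networkx library on made-up entities (the node names and the "shared card" rule below are invented for the example, not taken from any real product):</p>

```python
import networkx as nx

# A tiny graph with hypothetical eCommerce entities: customers, a card, a transaction.
G = nx.Graph()
G.add_node("cust:alice", kind="customer")
G.add_node("cust:bob", kind="customer")
G.add_node("card:1111", kind="card")
G.add_node("tx:9001", kind="transaction")
G.add_edges_from([
    ("cust:alice", "card:1111"),  # two customers share one credit card:
    ("cust:bob", "card:1111"),    # a pattern worth flagging
    ("card:1111", "tx:9001"),
])

# Traverse the connections: flag any card linked to more than one customer.
suspicious = [
    n for n, d in G.nodes(data=True)
    if d["kind"] == "card"
    and sum(G.nodes[m]["kind"] == "customer" for m in G.neighbors(n)) > 1
]
```

<p>In a relational schema the same check would be a multi-way join; in the graph it is a one-hop neighborhood lookup, which is why traversals stay fast as the pattern gets deeper.</p>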
<p></p>
<h2 style="text-align: center;"><span>Detecting eCommerce fraudulent activity<br/> <br/></span></h2>
<p><span>To illustrate what fraud detection analysis with Linkurious Enterprise looks like, we created a small dataset with dummy eCommerce data and loaded it into a graph database. In the following sections, we explain how anti-fraud teams can leverage Linkurious Enterprise to detect and investigate fraud attempts.<br/> <br/></span></p>
<h3 style="text-align: left;"><span><span class="font-size-4">Setting up alerts to monitor fraud attempts</span><br/> <br/></span></h3>
<p><span>As new fraud schemes emerge, the ability to create detection rules on the fly is critical. Graph traversal languages, such as <a href="https://neo4j.com/developer/cypher-query-language/">Cypher </a>or <a href="https://github.com/tinkerpop/gremlin/wiki">Gremlin </a>are simple yet complete enough to let analysts imagine new queries that will flag fraudulent behaviors. Linkurious Enterprise offers an alert dashboard to generate and monitor different alerts and assess the flagged cases.<br/> <br/></span></p>
<p><span>For instance, we want to set up an alert query that returns any transaction connected to an at-risk country. The list of these countries is integrated in the graph model and appears as a property on our country nodes. The data is processed in near-real time and you get an immediate answer as to whether or not such patterns exist in your data.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/alerts_linkurious.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/alerts_linkurious.png?width=400" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><span class="font-size-2"><i>Setting up alerts in Linkurious using Cypher query language<br/> <br/></i></span></p>
<p><span>The above alert will flag transactions where the IP or delivery address is located in one of the countries on the “at-risk-countries” list. For every case reported, the team can visually investigate the data.<br/> <br/></span></p>
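<p>Stripped of the graph database, the logic behind this kind of alert rule can be sketched in a few lines of Python (the country codes and field names below are hypothetical; in Linkurious the rule would be written as a Cypher pattern instead):</p>

```python
AT_RISK = {"XX", "YY"}  # hypothetical at-risk country codes

transactions = [
    {"id": "tx1", "ip_country": "FR", "delivery_country": "XX"},
    {"id": "tx2", "ip_country": "DE", "delivery_country": "DE"},
]

def flag_at_risk(txs, at_risk):
    """Return the ids of transactions whose IP or delivery country is at risk."""
    return [
        t["id"] for t in txs
        if t["ip_country"] in at_risk or t["delivery_country"] in at_risk
    ]

flagged = flag_at_risk(transactions, AT_RISK)
```

<p>The value of expressing the rule as a graph pattern rather than a hard-coded filter is that analysts can add or modify rules on the fly as new schemes emerge.</p>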
<h3 style="text-align: left;"><span><span class="font-size-4">Visually investigating graph data</span><br/> <br/></span></h3>
<p><span>Whether it’s to investigate specific alert cases or learn more about a particular entity, analysts can easily search and visualize their data in real time in the Linkurious Enterprise interface. For instance, we can generate a visualization compiling a subset of transactions and their related connections with a simple query. Below is the visualization of our subset of transactions (red nodes), the associated credit cards (blue nodes) and customer accounts (green nodes). We see that nodes are connected together, depicting the ownership relationships between credit cards and customers.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/linkurious_visualization_2.jpg" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/linkurious_visualization_2.jpg?width=3194" width="3194" class="align-center"/></a></p>
<p style="text-align: center;"><i>Example of visualization of graph data nodes (customer, credit card and transactions) and their connections<br/> <br/></i></p>
<p><span>By using different graph layouts, we can easily reveal structural patterns in the data and identify differences at a glance. In the example below, we switched from a force-directed layout to a hierarchical layout in order to better understand the different cases we have.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_3.jpg" rel="prettyPhoto[gallery]" title="" class="fancybox image"><img class="aligncenter wp-image-5737 size-full" src="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_3.jpg" alt="Transactions visualization in Linkurious" width="3184" height="1504"/></a></p>
<p style="text-align: center;"><i>Visualization with a hierarchical layout a of subset of transactions and their related nodes<br/> <br/></i></p>
<p><span>We immediately notice a suspicious pattern in the graph. Customers (green nodes) are typically connected to a single credit card (blue node) which is connected to one or several transactions (red nodes). But there’s one case where two green nodes (two customers), are connected to the same credit card which is suspicious.<br/> <br/></span></p>
<p><span>It is easy to drill down on a suspicious case with the Linkurious Enterprise interface. Analysts simply expand the nodes around suspicious patterns to reveal other connections within the data and assess the situation. In our example, we expanded the nodes around our two customers and the transactions to unveil a sub-graph with additional information (customer IP addresses, contact information, addresses, goods bought and shipping addresses).<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_4.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_4.png?width=1916" width="1916" class="align-center"/></a></p>
<p style="text-align: center;"><i>Investigation of the neighboring nodes of our suspicious customers<br/> <br/></i></p>
<p><span>With this visualization, we understand that three users are actually involved. Two of them, from different countries, ordered goods and had them shipped to a third client, which could indicate reshipping fraud.<br/> <br/></span></p>
<p><span>Graph technology offers an additional layer of protection for eCommerce companies. It enhances discrete analysis methods by providing connected analysis capabilities over a single source of truth of customer data. Anti-fraud teams have an intuitive tool to detect and investigate fraud attempts and fraud rings that would otherwise stay undetected.</span></p>
<p></p>
<p><a href="https://linkurio.us/" target="_blank">Learn more</a></p><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/sNupOybjmfU" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:369371https://linkurio.us/wp-content/uploads/2017/08/ecommerce_fraud_types_linkurious.pngType I and Type II Errors in One Picturetag:www.analyticbridge.datasciencecentral.com,2017-08-10:2004291:BlogPost:3695862017-08-10T23:17:32.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This picture speaks more than words. It explains the concepts of false positive and false negative, that is, what statisticians refer to as Type I and Type II errors.</p>
<p></p>
<p><a href="http://api.ning.com:80/files/4aD3G6lWhVnWOLeavIi6rI00scQEtKRCp8cBo5rnAIbckXiSk7jOXyuhyG4JyI-C76Hp*lymuaqTdW5D1RKdy1laH7x3S2Ph/fp.PNG" target="_self"><img src="http://api.ning.com:80/files/4aD3G6lWhVnWOLeavIi6rI00scQEtKRCp8cBo5rnAIbckXiSk7jOXyuhyG4JyI-C76Hp*lymuaqTdW5D1RKdy1laH7x3S2Ph/fp.PNG" width="472" class="align-center"/></a></p>
<p>Other great pictures summarizing data science and statistical concepts can be found <a href="http://www.datasciencecentral.com/profiles/blogs/four-great-pictures-illustrating-machine-learning-concepts" target="_blank">here</a> and also <a href="http://www.datasciencecentral.com/profiles/blogs/17-amazing-infographics-and-other-visual-tutorials" target="_blank">here</a>. </p>
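To make the picture concrete, here is a small simulation sketch (not from the article): it estimates the Type I rate (false positives, rejecting a true null hypothesis) and the Type II rate (false negatives, missing a real effect) for a one-sided z-test at the 5% level. The sample size, effect size and seed are arbitrary choices for illustration.

```python
# Illustrative simulation: estimate Type I and Type II error rates
# for a one-sided z-test at the 5% significance level.
import random, math

random.seed(42)
N_EXPERIMENTS, N_SAMPLES, Z_CRIT = 2000, 50, 1.645  # one-sided 5% cutoff

def reject(true_mean):
    """Draw one sample and test H0: mean = 0 against H1: mean > 0."""
    xs = [random.gauss(true_mean, 1.0) for _ in range(N_SAMPLES)]
    z = (sum(xs) / N_SAMPLES) * math.sqrt(N_SAMPLES)  # known sd = 1
    return z > Z_CRIT

# Type I: rejecting H0 when it is true (false positive).
type1 = sum(reject(0.0) for _ in range(N_EXPERIMENTS)) / N_EXPERIMENTS
# Type II: failing to reject H0 when the true mean is 0.5 (false negative).
type2 = sum(not reject(0.5) for _ in range(N_EXPERIMENTS)) / N_EXPERIMENTS
print(type1, type2)
```

The Type I rate should land near the nominal 0.05, while the Type II rate depends on the effect size and sample size (here the test is well powered, so it is small).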
<p><b>DSC Resources</b></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p><b>Popular Articles</b></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/JhuWH7NGErE" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:369586Capturing Low-Probability, High-Impact Events 'Black Swans' in Economic and Financial Modelstag:www.analyticbridge.datasciencecentral.com,2017-07-31:2004291:BlogPost:3691962017-07-31T14:30:00.000ZJamilu Auwalu Adamuhttps://www.analyticbridge.datasciencecentral.com/profile/JamiluAuwaluAdamu
<p style="text-align: center;"><b>Capturing Low-Probability, High-Impact Events 'Black Swans' in Economic and Financial Models</b> <br/> <b>Jamilu Auwalu Adamu, <i>Lecturer, Nigeria</i></b></p>
<p align="center"><b>Incorporation of Fat-Tailed Effects of the Underlying Asset's Probability Distribution Using Advanced Stressed Methods.</b></p>
<p><br/> Capturing the effects of low-probability, high-impact "Black Swans" in existing stochastic and deterministic models is tremendously important. On this page, I would like to share with members the open-access, peer-reviewed published research findings of my PhD thesis on how to capture the effects of low-probability, high-impact events in our existing economic and financial models.<br/> I shall begin with the incorporation of fat-tailed effects of the underlying asset's probability distribution in the popular LOGIT and PROBIT MODELS.<br/> <br/> INTRODUCTION<br/> <br/> Global financial markets have experienced a series of financial and economic crises since their inception, generation after generation. Banks, companies and the world economy have experienced catastrophic deterioration and serious corporate failures through systemic risk effects.<br/> Big banks and companies like Continental Illinois, City Federal Savings and Loan, Bank of New England, Lehman Brothers, General Motors, and WorldCom have failed and been declared bankrupt in the history of the global financial markets. That is why many scholars, past and present, have attempted to come up with models that can precisely calculate the probability of default of a bank or company over a given time period. The probability of default for a given company or bank captures the probability that the company or bank will default within a certain period.<br/> The most popular models used by financial institutions to calculate probabilities of default are LOGIT (1980) and PROBIT (1981). Although Logit and Probit give good approximations, they seem not to capture chaotic market behaviour to some extent. Accurate determination of the probability of default plays a very important role in the entire world economy. 
The probability of default is the major component when determining (i) capital requirements under Basel II (now Basel III), (ii) expected loss, and (iii) risk-weighted assets.<br/> The probability of default (PD) is also a crucial parameter in risk management, used for loan requests, rating estimation, the pricing of credit derivatives and many other key financial applications. False estimation of PD leads to unreasonable ratings and incorrect pricing of financial instruments, and was thereby one of the causes of the recent global financial crisis. <br/> The aim of this research work is to come up with new advanced stressed probability models that can capture chaotic market behaviour, or in other words, that work under financial market distress to some extent.<br/> <br/> LITERATURE REVIEW<br/> <br/> Initially, corporate distress was assessed based on qualitative information, which was very subjective. In particular, four references were mostly used, namely: (i) the capacity of the manager in charge of the project or company, (ii) the fact that the manager had an important financial involvement in the company as a financial guarantee, (iii) the project and the industry itself, and (iv) the fact that the firm possessed assets or collateral to fall back on in case of a bad situation. <br/> <br/> John M. Moody (1909) was the first to publish credit rating grades for publicly traded bonds. In 1941, David Durand applied the discriminant analysis proposed by Fisher (1936) to classify prospective borrowers. Attempts were made in the 1950s to merge automated credit decision-making with statistical techniques so as to enhance credit decisions. For lack of sophisticated computing tools, these models had limitations. 
Myers and Forgy (1963) compared discriminant analysis with regression in a credit scoring application.<br/> <br/> Altman (1960) introduced variables in a multivariate discriminant analysis and obtained a function depending on some financial ratios. Beaver (1966) introduced a univariate approach of discriminant analysis to assess the individual relationships between predictive variables and subsequent failure events. In 1968, Altman expanded the work of Beaver (1966) to allow one to assess the relationship between failure and a set of financial characteristics. Martin (1977) presented a logistic regression model to predict probabilities of failure of banks using data obtained from the Federal Reserve System. Ohlson (1980) used Logit to predict bankruptcy. Zmijewski (1984) used Probit to estimate the probability of default and predict bankruptcy.<br/> <br/> In 1985, West used factor analysis and logit estimation to assign a probability of a bank being problematic. In 2001, Shumway introduced a dynamic logit, or hazard, model to predict bankruptcy. Chava & Jarrow (2004), Hillegeist, Keating, Cram, & Lundstedt (2004), and Beaver, McNichols & Rhie (2005) used Shumway’s approach. In 2004, Jones & Hensher introduced a mixed logit model for financial distress prediction and argued that it offers significant improvements compared to binary logit and multinomial logit models. Campbell, Hilscher, & Szilagyi (2008) introduced a dynamic logit model to predict corporate bankruptcies and failures at short and long horizons using accounting and market variables.<br/> <br/> In 2011, Altman, Fargler, & Kalotay used accounting-based measures, firm characteristics and industry-level expectations of distress conditions to estimate the likelihood of default inferred from equity prices. 
Li, Lee, Zhou, & Sun (2011) introduced a combined random subspace approach (RSB) with binary logit models to generate a so-called RSB-L model that takes into account different decision agents’ opinions as a way to enhance results.<br/> Sun & Li (2011) tested the feasibility and effectiveness of dynamic modelling for financial distress prediction (FDP) based on the Fisher discriminant analysis model.<br/> <br/> Stefan Van der Ploeg (2011) stated that since the seminal work of Martin (1977), the Logit and Probit models have become among the most commonly applied parametric failure prediction models in the academic literature as well as in banking regulation and supervision. Jamilu (2015) introduced new methods, entitled “Jameel’s Advanced Stressed Methods using Jameel’s Criterion”, to stress economic and financial stochastic models, initially using the Logit and Probit models. <br/> <br/> JAMEEL'S ADVANCED STRESSED METHODS<br/> <br/> Before incorporating fat-tailed effects in our existing LOGIT and PROBIT models, we have to discuss how to obtain the BEST FITTED FAT-TAILED PROBABILITY DISTRIBUTIONS OF THE UNDERLYING STOCK RETURNS OF THE COMPANIES UNDER CONSIDERATION. <br/> <br/> The major aim of this post of the research findings is to consider eleven (11) out of the fifty (50) WORLD’S BIGGEST PUBLIC companies in the FORBES 2015 ranking, regardless of the platform on which they are listed, and to run goodness-of-fit tests on them vis-à-vis time series from 2014 – 2009. These include:<br/> <br/> (1) China Construction Bank Corporation (CICHY) from 2014 – 2009<br/> (2) Bank of China Limited (3988.K) from 2014 – 2009 <br/> (3) Berkshire Hathaway Inc. 
(BRK – A) from 2014 – 2009<br/> (4) Toyota Motor Corporation (TM) from 2014 – 2009<br/> (5) Volkswagen Group AG (VLKAY) from 2014 – 2009<br/> (6) Bank of America Corporation (BAC) from 2014 – 2009<br/> (7) Nestle India Limited (NESTLEIND.NS) from 2014 – 2009<br/> (8) International Business Machines Corporation (IBM) from 2014 – 2009<br/> (9) Goldman Sachs Group Securities (GJS) from 2014 – 2009<br/> (10) Google Inc. (GOOG) from 7/2/2015 to 5/18/2012<br/> (11) Facebook Inc. (FB) from 7/2/2015 to 5/18/2012<br/> <br/> This is regardless of the question of WHAT would happen to the stock returns probability distributions IF we consider:<br/> (a) daily, weekly, quarterly or annual stock returns<br/> (b) more than five (5) companies<br/> (c) companies listed on different stock exchanges, not only ONE STOCK EXCHANGE platform<br/> (d) simultaneously long-term and short-term time series of the stock returns<br/> (e) recently listed public companies such as FACEBOOK, which made its initial public offering on 18 May 2012 and began selling stock to the public and trading on the NASDAQ the same date, and GOOGLE on March 9, 2006.<br/> <br/> To achieve this, the author developed what is called JAMEEL'S CRITERION, using the EasyFit software, as follows:<br/> <br/> JAMEEL'S CRITERION:<br/> <br/> (i) We accept if the average of the ranks of the Kolmogorov–Smirnov, Anderson–Darling and Chi-squared tests is less than or equal to three (3)<br/> (ii) We must choose the probability distribution followed by the data ITSELF, regardless of its ranking<br/> (iii) If there is a tie, we include both probability distributions in the selection<br/> (iv) At least two (2) probability distributions must be included in the selection <br/> (v) We select the most frequently occurring probability distribution as the qualifying candidate in each goodness-of-fit test of the stock returns<br/> <br/> In the next post, we shall continue from Jameel's Criterion.<br/> <br/> Thank you.</p><img 
src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/Ql9KfvmNQdA" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:369196Introducing User Behavioral Analysis in the Risk Processtag:www.analyticbridge.datasciencecentral.com,2017-07-31:2004291:BlogPost:3690282017-07-31T17:30:00.000ZAndrew Maranehttps://www.analyticbridge.datasciencecentral.com/profile/AndrewMarane
<div>Many years ago when I was entering the intelligence community, I attended a class in Virginia where the instructor opened the session with a test that I will never forget and that I have applied to almost every analytic task in my career. At the beginning of the class we were shown a ten-minute video of Grand Central Station at rush hour, with tens of thousands of people, and were asked if we could find a single pickpocket in the crowd by the end of the video. At the end of ten minutes no one in the class was able to identify the individual.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>The purpose of the class was to stress the importance of looking for behavior as an indication that something is wrong: looking for a person or thing that is not doing or behaving the same way as everything else. By understanding what is normal in any given place or situation, identifying threats becomes much easier because outliers stand out. If you want to find a pickpocket in a crowd of people, you don’t attempt to look at every person; you understand what the crowd is doing and look for the person who is not doing what the crowd is.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>By the end of the class, the video was shown again and every person was able to find the pickpocket within the first three minutes. Everyone understood how this principle could be applied to everything from counter-terrorism to personal protection and, as I realized after entering the field, fraud. The person who is committing fraud does not behave the same way legitimate people do online. Even their attempts to look “normal” make them stand out, because their goal is completely different from everyone else’s on your site.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Recently I conducted an experiment to see what percentage of users who had committed organized fraud transactions could have been identified by their behavior before they ever made a transaction. By looking at these individuals’ interactions with the platform from account inception through the transaction process, I examined whether this group did anything that a legitimate user would not, and whether that action was exclusive to individuals committing fraud. I used a year’s worth of fraud events, hundreds of thousands of fraud transactions, and began dissecting their activity down to the click. In the end, 93% of individuals who had engaged in organized fraud could have been detected before they ever made a transaction based on their behavior, with the bulk identifiable at account creation. Of that 93%, over half could have been identified by the first eight things they did on the site, based on how they entered, their settings, their network and their flow of interaction from entry to attempted activation. Even worse, and it was something that I hadn’t planned on, 12% of transactions rejected for organized fraud were false positives and misclassified. If behavioral analysis had been utilized, it would have detected that these were legitimate users and applied a different decision.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>The impulse of every company engaged in online commerce is to develop rules and models aimed at identifying fraud in the transaction flow. We throw hundreds of rules that look at thousands of data points at a person who is buying something, writing a post or review, or conducting a financial transaction. Over the years, these rules continue to grow and weave their way through the commerce or submission flow of our operation, and they funnel every single person and every single transaction through them. We continue to scale as transaction volume grows, adding resources and bandwidth to the rules platform as the people and the rules themselves continue to multiply.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Ultimately, our fraud platforms struggle with latency and conflict because we are aiming every big gun we have at every individual who walks through our door. Because we keep adding guns, they eventually start pointing at themselves, creating conflicts within the fraud infrastructure and friction for the user, regardless of who that user is, because none of these rules look at the risk of the individual. In the end, more transactions are pushed to manual review, which adds cost, or worse, are falsely rejected due to a lack of intelligence about the users themselves.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Fraud rules, fraud filters and fraud models are essential and play an important part in any fraud prevention platform, but they are reactionary, not strategic. They do not know the difference between an individual who entered your site three months ago shopping for a TV, returned to your site 20 times in the following weeks looking at the prices, reviews, specs and pictures of the TVs you are selling, and then finally made a purchasing decision on the best and most expensive TV you have, and the guy who appeared out of the blue, set up an account, went directly to the most expensive TV you have, threw it in the cart and hit the purchasing flow. The rules are going to treat both of these transactions exactly the same because they are “new” users buying a high-risk item.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>By starting the fraud identification process by analyzing the behavior of users, particularly new users who enter your site or establish an account, we become strategic and solve any number of issues in our fraud scoring process. Much in the same way we classify transactions in risk, our behavior model looks for attributes and actions that run contrary to what legitimate users do on the site, and it has to begin with an understanding of what that is (I recommend you cozy up with your user experience team; they have piles of information on that very thing). Once there is a good understanding of normal user activity and behavior, we start training models that look for the opposite, and since you know what good behavior is, finding the abnormal behavior becomes much easier (see paragraph one). This can be visualized by scatter-plotting behavioral attributes across millions of activities and users.</div>
<div class="separator"><span style="color: #ffffff;">.</span></div>
<div>Our behavioral model is going to look at attributes and actions and begin scoring them so we can tell the difference between a risky new user and a good new user (new users on first transactions are the hardest to classify and the highest risk in the transaction flow). It will look for things such as the user’s network: did they enter on a hosted service, proxy or botnet, or are they in some way trying to disguise where they are and who they are? Do they have Java, Flash or cookies disabled? Do they enter the site and go directly to the sign-up flow or account creation? What is the amount of time it takes them to complete the sign-up flow; is it 20 seconds when every legitimate user takes 2 minutes? What type of operating system, browser and resolution (user-agent attributes) are they on; is it unique? What language localization are they using, and does it match where the other attributes and user-entered data say they are from? What are the click patterns; are they too efficient in getting from entry to transaction for a normal user?</div>
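As a sketch of how such signals could be combined, the snippet below builds a simple additive risk score. The feature names and weights are hypothetical illustrations, not from any production system; a real deployment would learn weights from labeled behavioral data.

```python
# Illustrative sketch only: an additive risk score over the kinds of
# pre-transaction signals listed above. Feature names and weights are
# hypothetical, not from any production system.
RISK_WEIGHTS = {
    "proxy_or_hosted_network": 3.0,   # entered via proxy/hosting/botnet
    "cookies_disabled": 1.0,
    "straight_to_signup": 1.5,        # no browsing before account creation
    "signup_under_30s": 2.0,          # far faster than legitimate users
    "locale_mismatch": 1.5,           # language vs. claimed location
    "too_efficient_clicks": 2.0,      # entry-to-transaction with no detours
}

def behavior_risk_score(signals):
    """Sum the weights of the risk signals observed for this user."""
    return sum(RISK_WEIGHTS[s] for s in signals if s in RISK_WEIGHTS)

normal_user = behavior_risk_score(["cookies_disabled"])
risky_user = behavior_risk_score(
    ["proxy_or_hosted_network", "signup_under_30s", "too_efficient_clicks"])
print(normal_user, risky_user)
```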
<div><span style="color: #ffffff;">.</span></div>
<div>This could go on for some time; there are literally thousands of things to be gleaned from user and page logs that have enormous value in fraud detection and are hardly ever used anywhere. The behavioral model should be looking at everything you can feed into it from entry up to the transaction. This model doesn’t care about transactions; its job is to make a decision on what to do with this user before they get there.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>A good example of strategic intervention over tactical or reactive intervention is the comparison of airline security in the United States to Israel. In the US, going to the airport starts the screening process: everyone is lumped together and fed through the same security process to identify risk. Everyone goes through the same security line, the same x-ray, the same scanner, and even random selection does not take into account anything about you; it’s just a random number. A five-year-old is basically looked at the same way a military-age individual is when it comes to the overall process, which is why you always see the funny and yet disturbing videos of five-year-olds getting patted down, because their number was up.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>In Israel, airline security knows as much as they can about you before you even show up at the airport. By the time you arrive they know about you and your network, why you’re there, where you’re going, why you’re going and anything else they can possibly dig up about you. They have already risk-scored you from the minute you bought the ticket and decided what kind of screening you are going to have at the airport and, if you are still high risk, which security agent is going to be sitting in the seat behind you ready to blow your head off if you jump too quickly. Strategic vs. tactical risk management, or in our case, proactive vs. reactive fraud detection.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Once the behavior model is built and beginning to score and surface risk, it can begin solving fraud and platform issues. The first step would be to cohort users into different risk groups to determine which fraud rules and models apply based on their behavior. A new user who exhibited normal or predicted user behavior before entering the transaction flow would not be subject to the same filters as a user who showed highly suspicious behavior.</div>
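A minimal sketch of that cohorting step, with hypothetical thresholds on a behavioral risk score (the cohort names and cutoffs are illustrations, not recommendations):

```python
# Sketch of the cohorting idea above: route users to different levels of
# fraud screening based on a behavioral risk score. Thresholds are
# hypothetical illustrations.
def cohort(score):
    """Map a behavioral risk score to a screening cohort."""
    if score < 2.0:
        return "fast_lane"       # exclusion rules only, minimal friction
    if score < 6.0:
        return "standard_rules"  # the normal fraud-rule gauntlet
    return "full_review"         # full models plus manual-review queue

for s in (0.5, 3.0, 8.0):
    print(s, cohort(s))
```

The point of the routing is exactly what the paragraph above describes: low-risk behavior bypasses most filters, while high-risk behavior gets the full platform.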
<div><div class="separator"><span style="color: #ffffff;">.</span></div>
By cohorting users, we can begin more accurately “vetting” users in the transaction flow and also redirect traffic within our fraud platform to improve scalability and reduce friction. An established user with a good behavior profile and predictable buying pattern could bypass all but exclusion rules, freeing up bandwidth for users who exhibit high-risk behavior and would run the gauntlet of our fraud platform.<br/>
<div class="separator"><span style="color: #ffffff;">.</span></div>
</div>
<div>Likewise, a user who exhibits known fraudulent behavior doesn’t need to be routed through the fraud platform at the time of transaction; they can bypass the models straight to rejection. We can tune our cohort groups to optimize manual review toward those users in specific risk groups who have a higher likelihood of transaction completion. By implementing behavior modeling and deep-diving into the data from users’ interactions prior to the transaction flow, we get a better understanding of the user’s intent on our site, gain efficiency and bandwidth in our transactional fraud process, and achieve greater accuracy when making risk-based decisions.</div><img src="http://feeds.feedburner.com/~r/FeaturedBlogPosts-Analyticbridge/~4/8OKXnHibzUc" height="1" width="1" alt=""/>http://www.analyticbridge.datasciencecentral.com/xn/detail/2004291:BlogPost:369028How to Detect if Numbers are Random or Nottag:www.analyticbridge.datasciencecentral.com,2017-07-10:2004291:BlogPost:3665472017-07-10T06:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>In this article, you will learn some modern techniques to detect whether a sequence appears as random or not, whether it satisfies the central limit theorem (CLT) or not -- and what the limiting distribution is if CLT does not apply -- as well as some tricks to detect abnormalities. Detecting lack of randomness is also referred to as signal versus noise detection, or pattern recognition.</p>
<p>It leads to the exploration of time series with massive, large-scale (long term) auto-correlation structure, as well as model-free, data-driven statistical testing. No statistical knowledge is required: we will discuss deep results that can be expressed in simple English. Most of the testing involved here uses big data (more than a billion computations) and data science, to the point that we reached the accuracy limits of our machines. So there is even a tiny piece of numerical analysis in this article.</p>
<p>Potential applications include testing randomness, Monte Carlo simulations for statistical testing, encryption, blurring, and <a href="http://www.datasciencecentral.com/profiles/blogs/interesting-data-science-application-steganography" target="_blank" rel="noopener">steganography</a> (encoding secret messages into images) using pseudo-random numbers. A number of open questions are discussed here, offering the professional post-graduate statistician new research topics both in theoretical statistics and advanced number theory. The level here is state-of-the-art, but we avoid jargon and some technicalities to allow newbies and non-statisticians to understand and enjoy most of the content. An Excel spreadsheet, attached to this document, summarizes our computations and will help you further understand the methodology used here.</p>
<p>Interestingly, I started to research this topic by trying to apply the notorious <a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank" rel="noopener">central limit theorem</a> (CLT) to non-random (static) variables -- that is, to fixed sequences of numbers that look chaotic enough to simulate randomness. Ironically, it turned out to be far more complicated than using CLT for regular random variables. So I start here by describing what the initial CLT problem was, before moving into other directions such as testing randomness, and the distribution of the largest gap in seemingly random sequences. As we will see, these problems are connected. </p>
<p><span class="font-size-3"><strong>1. Central Limit Theorem for Non-Random Variables</strong></span></p>
<p>Here we are interested in sequences generated by a periodic function f(<em>x</em>) that has an irrational period <em>T</em>, that is f(<em>x</em>+<em>T</em>) = f(<em>x</em>). Examples include f(<em>x</em>) = sin <em>x</em> with <em>T</em> = 2π, or f(<em>x</em>) = {α<em>x</em>} where α > 0 is an irrational number, { } represents the <a href="https://en.wikipedia.org/wiki/Fractional_part" target="_blank" rel="noopener">fractional part</a> and <em>T</em> = 1/α. The <em>k</em>-th element in the infinite sequence (starting with <em>k</em> = 1) is f(<em>k</em>). The central limit theorem can be stated as follows:</p>
<p>Under certain conditions to be investigated -- mostly the fact that the sequence seems to represent or simulate numbers generated by a well-behaved stochastic process -- we would have:</p>
<p><a href="http://api.ning.com:80/files/lFg-2*UqyS7c3VI1rJZ-56pQ-GyU5xXXkHlDEUSKvuwlpdSIRwDUfjr3xOgRJrcmHLIZsg5POMw3AEUtxkYMs0yg5m2eTCSz/Capture1.PNG" target="_self"><img src="http://api.ning.com:80/files/lFg-2*UqyS7c3VI1rJZ-56pQ-GyU5xXXkHlDEUSKvuwlpdSIRwDUfjr3xOgRJrcmHLIZsg5POMw3AEUtxkYMs0yg5m2eTCSz/Capture1.PNG?width=703" width="703" class="align-center"/></a></p>
<p>In short, <em>U</em>(<em>n</em>) tends to a normal distribution of mean 0 and variance 1 as <em>n</em> tends to infinity, which means that as both <em>n</em> and <em>m</em> tends to infinity, the values <em>U</em>(<em>n</em>+1), <em>U</em>(<em>n</em>+2) ... <em>U</em>(<em>n</em>+<em>m</em>) have a distribution that converges to the standard bell curve.</p>
<p>From now on, we are dealing exclusively with sequences that are <a href="https://en.wikipedia.org/wiki/Equidistributed_sequence" target="_blank" rel="noopener">equidistributed</a> over [0, 1], thus μ = 1/2 and σ = SQRT(1/12). In particular, we investigate f(<em>x</em>) = {α<em>x</em>} where α > 0 is an irrational number and { } the fractional part. While this function produces a sequence of numbers that seems fairly random, there are major differences with truly random numbers, to the point that CLT is no longer valid. The main difference is the fact that these numbers, while somewhat random and chaotic, are much more evenly spread than random numbers. True random numbers tend to create some clustering as well as empty spaces. Another difference is that these sequences produce highly auto-correlated numbers.</p>
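This claim is easy to check numerically. The sketch below (an illustration, not taken from the article's spreadsheet, with the golden ratio as α) shows that the centered partial sums of {α<em>k</em>} stay far below the σ·SQRT(<em>n</em>) fluctuations the classic CLT would predict for truly random uniform numbers:

```python
# Numerical check of the claim above: partial sums of {alpha*k} - 1/2
# stay far smaller than the sqrt(n)-scale fluctuations the CLT predicts
# for truly random uniform numbers.
import math, random

alpha = (1 + math.sqrt(5)) / 2   # golden ratio, an irrational alpha
n = 100_000

# Centered partial sum for the deterministic alpha-sequence.
dev_alpha = sum((alpha * k) % 1.0 - 0.5 for k in range(1, n + 1))

# Same quantity for truly random uniforms, for comparison.
random.seed(1)
dev_random = sum(random.random() - 0.5 for _ in range(n))

clt_scale = math.sqrt(n / 12.0)  # sigma * sqrt(n), the typical CLT scale
print(dev_alpha, dev_random, clt_scale)
```

For the α-sequence the centered sum is tiny (it grows only logarithmically for well-behaved irrationals such as the golden ratio), which is exactly why the classic √<em>n</em> normalization degenerates and a different scaling is needed.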
<p>As a result, we propose a more general version of CLT, redefining <em>U</em>(<em>n</em>) by adding two parameters <em>a</em> and <em>b</em>: </p>
<p><a href="http://api.ning.com:80/files/lFg-2*UqyS56ZvvpU4DRshu2WGmCT-Qy7KhoDkR33qA-NoS0R3KKs8pyNdgUK7tmwvNckoY0upDSVU2kpnieorvFjugeKavG/Capture2b.PNG" target="_self"><img src="http://api.ning.com:80/files/lFg-2*UqyS56ZvvpU4DRshu2WGmCT-Qy7KhoDkR33qA-NoS0R3KKs8pyNdgUK7tmwvNckoY0upDSVU2kpnieorvFjugeKavG/Capture2b.PNG" width="298" class="align-center"/></a></p>
<p>This more general version of CLT can handle cases like our sequences. Note that the classic CLT corresponds to <em>a</em> = 1/2 and <em>b</em> = 0. In our case, we suspect that <em>a</em> = 1 and <em>b</em> is between 0 and -1. This is discussed in the next section. </p>
<p>Note that if the <em>k</em>-th element of the sequence is f(<em>k^2</em>) instead of f(<em>k</em>), then the numbers generated behave more like random numbers: they are less evenly distributed and less auto-correlated, and thus the classic CLT might apply. We haven't tested it yet. </p>
<p>You will also find an application of CLT to non-random variables, as well as to correlated variables, i<a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank" rel="noopener">n my previous article on this topic</a>.</p>
<p><span class="font-size-4"><strong>2. Testing Randomness: Max Gap, Auto-Correlations and More</strong></span></p>
<p>Let's call the sequence f(1), f(2), f(3) ... f(<em>n</em>) generated by our function f(<em>x</em>) an <span style="font-family: 'symbol', geneva;">a</span>-sequence. Here we compare properties of <span style="font-family: 'symbol', geneva;">a</span>-sequences with those of random numbers on [0, 1] and highlight the striking differences. Both sequences, as <em>n</em> tends to infinity, have a mean value converging to 1/2 and a variance converging to 1/12 (just like any uniform distribution on [0, 1]), and both look quite random at first glance. But the similarities mostly end there. </p>
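<p>These shared limits are easy to check empirically. A quick sketch using <span style="font-family: 'symbol', geneva;">a</span> = SQRT(2):</p>

```python
import math

n = 100_000
# First n terms of the alpha-sequence f(k) = {sqrt(2) * k}
seq = [(math.sqrt(2) * k) % 1.0 for k in range(1, n + 1)]

mean = sum(seq) / n
var = sum((x - mean) ** 2 for x in seq) / n
# mean -> 1/2 and var -> 1/12, as for the uniform distribution on [0, 1]
```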
<p><strong>Maximum gap</strong></p>
<p>The maximum gap among <em>n</em> points scattered between 0 and 1 is another way to test for randomness. If the points are truly randomly distributed, the expected length of the maximum gap (also called the longest segment) is known and equal to</p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB6Voi*GralKMEEv*VZ4x3sYQ1rrR46fXestPxL*UBN-2KCB0r7MTCxVzmrWO3LBNkUB8DmmuV8TDBm370qFI7nv/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB6Voi*GralKMEEv*VZ4x3sYQ1rrR46fXestPxL*UBN-2KCB0r7MTCxVzmrWO3LBNkUB8DmmuV8TDBm370qFI7nv/Capture.PNG" width="150" class="align-center"/></a></p>
<p>See <a href="https://math.stackexchange.com/questions/13959/if-a-1-meter-rope-is-cut-at-two-uniformly-randomly-chosen-points-what-is-the-av" target="_blank" rel="noopener">this article</a> for details, or the book <a href="https://www.amazon.com/Order-Statistics-Herbert-David/dp/0471389269" target="_blank" rel="noopener">Order Statistics</a> published by Wiley, page 135. The max gap values have been computed in the spreadsheet (see section below to download the spreadsheet) both for random numbers and for <span style="font-family: 'symbol', geneva;">a</span>-sequences. It is pretty clear from the Excel spreadsheet computations that the average maximum gaps have the following expected values, as <em>n</em> becomes very large:</p>
<ul>
<li>Maximum gap for random numbers: log(<em>n</em>) / <em>n</em> as expected from the above theoretical formula</li>
<li>Maximum gap for <span style="font-family: 'symbol', geneva;">a</span>-sequences: <em>c</em> / <em>n</em> (<em>c</em> is a constant close to 1.5; the result needs to be formally proved)</li>
</ul>
<p>So <span style="font-family: 'symbol', geneva;">a</span>-sequences have points that are far more evenly distributed than random numbers, by an order of magnitude, not just by a constant factor! This is true for the 8 <span style="font-family: 'symbol', geneva;">a</span>-sequences (8 different values of <span style="font-family: 'symbol', geneva;">a</span>) investigated in the spreadsheet, corresponding to 8 "nice" irrational numbers (see the research section below for what a "nice" irrational number might mean in this context). </p>
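<p>These empirical rates are easy to reproduce. A minimal sketch (the constant <em>c</em> ≈ 1.5 is the empirical value quoted above):</p>

```python
import math, random

def max_gap(points):
    """Length of the longest empty interval among n points in [0, 1],
    counting the gaps at both endpoints of the interval."""
    xs = sorted(points)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return max(gaps + [xs[0], 1.0 - xs[-1]])

n = 10_000
random.seed(0)
g_random = max_gap([random.random() for _ in range(n)])                  # ~ log(n) / n
g_alpha = max_gap([(math.sqrt(2) * k) % 1.0 for k in range(1, n + 1)])   # ~ c / n
```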
<p><strong>Auto-correlations</strong></p>
<p>Unlike random numbers, values of f(<em>k</em>) exhibit strong, large-scale auto-correlations: f(<em>k</em>) is strongly correlated with f(<em>k</em>+<em>p</em>) for some values of <em>p</em> as large as 100. The successive lag-<em>p</em> auto-correlations do not seem to decay with increasing values of <em>p</em>. On the contrary, the maximum lag-<em>p</em> auto-correlation (in absolute value) seems to increase with <em>p</em>, possibly getting very close to 1 eventually. This is in stark contrast with random numbers, which do not show auto-correlations significantly different from zero, as confirmed in the spreadsheet. Also, the vast majority of time series have auto-correlations that quickly decay to 0; this surprising lack of decay could be the subject of some interesting number-theoretic research. These auto-correlations are computed and illustrated in the Excel spreadsheet (see section below) and are worth checking out. </p>
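<p>The contrast is easy to verify. A sketch comparing the maximum lag-<em>p</em> auto-correlation (for <em>p</em> up to 100) of an <span style="font-family: 'symbol', geneva;">a</span>-sequence with that of random numbers:</p>

```python
import math, random

def autocorr(seq, p):
    """Lag-p sample auto-correlation."""
    n = len(seq)
    mean = sum(seq) / n
    var = sum((x - mean) ** 2 for x in seq) / n
    cov = sum((seq[k] - mean) * (seq[k + p] - mean) for k in range(n - p)) / (n - p)
    return cov / var

n = 20_000
alpha_seq = [(math.sqrt(2) * k) % 1.0 for k in range(1, n + 1)]
random.seed(1)
rand_seq = [random.random() for _ in range(n)]

max_ac_alpha = max(abs(autocorr(alpha_seq, p)) for p in range(1, 101))
max_ac_rand = max(abs(autocorr(rand_seq, p)) for p in range(1, 101))
# The alpha-sequence shows near-perfect correlation at some lag (e.g. p = 70
# for alpha = sqrt(2), since {70 * sqrt(2)} is close to 1), while random
# numbers stay near zero at every lag.
```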
<p><strong>Convergence of <em>U</em>(<em>n</em>) to a non-degenerate distribution</strong></p>
<p>Figures 2 and 3 in the next section (extracts from our spreadsheet) illustrate why the classic central limit theorem (that is, <em>a</em> = 1/2, <em>b</em> = 0 in the <em>U</em>(<em>n</em>) formula) does not apply to <span style="font-family: 'symbol', geneva;">a</span>-sequences, and why <em>a</em> = 1 and <em>b</em> = 0 might be the correct parameters to use instead. However, with the data gathered so far, we can't tell whether <em>a</em> = 1 and <em>b</em> = 0 is correct, or whether <em>a</em> = 1 and <em>b</em> = -1 is correct: both exhibit similar asymptotic behavior, and the data collected is not accurate enough to make a final decision. The answer could come from theoretical considerations rather than from big data analysis. Note that the correct parameters should produce a somewhat horizontal band for <em>U</em>(<em>n</em>) in figure 2, with values mostly concentrated between -2 and +2 due to the normalization of <em>U</em>(<em>n</em>) by design. Both <em>a</em> = 1, <em>b</em> = 0 and <em>a</em> = 1, <em>b</em> = -1 do just that, while <em>a</em> = 1/2, <em>b</em> = 0 (the classic CLT) clearly fails, as illustrated in figure 3. You can play with the parameters <em>a</em> and <em>b</em> in the spreadsheet and see how they change figures 2 and 3, interactively. </p>
<p>One issue is that we computed <em>U</em>(<em>n</em>) for <em>n</em> up to 100,000,000 using a formula that is ill-conditioned: it multiplies a large quantity <em>n</em> by a value close to zero (for large <em>n</em>), while the precision available is probably less than 12 digits. This might explain the large, unexpected oscillations found in figure 2. Note that oscillations are expected (after all, <em>U</em>(<em>n</em>) is supposed to converge to a statistical distribution, possibly the bell curve, even though we are dealing with non-random sequences), but such large-scale, smooth oscillations are suspicious. </p>
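<p>The precision issue can be seen directly: in double precision, the error in <span style="font-family: 'symbol', geneva;">a</span> (about 10^-16) is amplified by a factor <em>k</em> when computing {<span style="font-family: 'symbol', geneva;">a</span><em>k</em>}. A sketch using Python's decimal module for higher precision (50 digits is our arbitrary choice):</p>

```python
from decimal import Decimal, getcontext

getcontext().prec = 50          # 50 significant digits
alpha = Decimal(2).sqrt()       # sqrt(2) to 50 digits
k = 10 ** 8

exact = alpha * k
frac_hi = exact - int(exact)            # high-precision fractional part
frac_lo = (float(alpha) * k) % 1.0      # double precision: the ~1e-16 error in
                                        # alpha grows to ~1e-8 after scaling by k
error = abs(float(frac_hi) - frac_lo)
```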
<p><strong><span class="font-size-4">3. Excel Spreadsheet with Computations</span></strong></p>
<p><a href="http://datashaping.com/tctl.xlsx" target="_blank" rel="noopener">Click here</a> to download the spreadsheet. The spreadsheet has 3 tabs: One for <span style="font-family: 'symbol', geneva;">a</span>-sequences, one for random numbers -- each providing auto-correlation, max gap, and some computations related to estimating <em>a</em> and <em>b</em> for <em>U</em>(<em>n</em>) -- and a tab summarizing <em>n</em> = 100,000,000 values of <em>U</em>(<em>n</em>) for <span style="font-family: 'symbol', geneva;">a</span>-sequences, as shown in figures 2 and 3. That tab, based on data computed using a Perl script, also features moving maxima and moving minima, a concept similar to moving averages, to better identify the correct parameters <em>a</em> and <em>b</em> to use in <em>U</em>(<em>n</em>). </p>
<p>Confidence intervals (CI) can be empirically derived to test a number of assumptions, as illustrated in figure 1: in this example, based on 8 measurements, it is clear that maximum gap CI's for <span style="font-family: 'symbol', geneva;">a</span>-sequences are very different from those for random numbers, meaning that <span style="font-family: 'symbol', geneva;">a</span>-sequences do not behave like random numbers.</p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB6KsHYx84NGKv2T24dHWq4PXxXtg7d**3gwNuzqLMLQUR2jrHDa3mMAiyU80buVj7F1Z39Of4ro7UpF1I2UN83c/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB6KsHYx84NGKv2T24dHWq4PXxXtg7d**3gwNuzqLMLQUR2jrHDa3mMAiyU80buVj7F1Z39Of4ro7UpF1I2UN83c/Capture.PNG" width="515" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 1</strong>: max gap times n (n = 10,000), for 8 <span style="font-family: 'symbol', geneva;">a</span>-sequences (top) and 8 sequences of random numbers (bottom)</em></p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB6nBIDr4NLvNg3f*p-xVEt9yMsro5RPphWM*x34ZiC5dcJ1JViH6ZyVQjrTqqO*g2hn80uBmyYTEcLykRWkI3NQ/CaptureA.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB6nBIDr4NLvNg3f*p-xVEt9yMsro5RPphWM*x34ZiC5dcJ1JViH6ZyVQjrTqqO*g2hn80uBmyYTEcLykRWkI3NQ/CaptureA.PNG?width=503" width="503" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 2</strong>: U(n) with a = 1, b = 0 (top) and U(n) moving max / min (bottom) for <span style="font-family: 'symbol', geneva;">a</span>-sequences</em></p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB5il*-FhIyyJS7ibpjMnMP-64KqHhPHEfWrgQlJKkEHnkRG5CJeeLCt88F-b3AAOpwl1lDF0J3wvTkYjcKc3764/Captureb.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB5il*-FhIyyJS7ibpjMnMP-64KqHhPHEfWrgQlJKkEHnkRG5CJeeLCt88F-b3AAOpwl1lDF0J3wvTkYjcKc3764/Captureb.PNG?width=504" width="504" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 3</strong>: U(n) with a = 0.5, b = 0 (top) and U(n) moving max / min (bottom) for <span style="font-family: 'symbol', geneva;">a-</span>sequences</em></p>
<p><span class="font-size-4"><strong>4. Potential Research Areas</strong></span></p>
<p>Here we mention some interesting areas for future research. By sequence, we mean <span style="font-family: 'symbol', geneva;">a</span>-sequence as defined in section 2, unless otherwise specified. </p>
<ul>
<li>Using f(<em>k^c</em>) as the <em>k</em>-th element of the sequence, instead of f(<em>k</em>). Which values of <em>c</em> > 0 lead to equidistribution over [0, 1], as well as to the classic version of CLT with <em>a</em> = 1/2 and <em>b</em> = 0 for <em>U</em>(<em>n</em>)? Also, what happens if f(<em>k</em>) = {<span style="font-family: 'symbol', geneva;">a </span>p(<em>k</em>)} where p(<em>k</em>) is the <em>k</em>-th prime number and { } represents the fractional part? This sequence was proved to be equidistributed on [0, 1] (this by itself is a famous result of analytic number theory, published by Vinogradov in 1948) and behaves much more like random numbers, so maybe the classic CLT applies to this sequence? Nobody knows. </li>
</ul>
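<p>The prime-based sequence above is easy to generate and check for equidistribution. A sketch with <span style="font-family: 'symbol', geneva;">a</span> = SQRT(2) and a basic sieve (the bin count is a crude check, not a formal test):</p>

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

alpha = math.sqrt(2)
seq = [(alpha * p) % 1.0 for p in primes_up_to(200_000)]

# Crude equidistribution check: each tenth of [0, 1] should hold ~10% of points.
counts = [0] * 10
for x in seq:
    counts[min(int(10 * x), 9)] += 1
```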
<ul>
<li>What is the asymptotic distribution of the moments and distribution of the maximum gap among the <em>n</em> first terms of the sequence, both for random numbers on [0, 1] and for the sequences investigated in this article? Does it depend on the parameter <span style="font-family: 'symbol', geneva;">a</span>? Same question for minimum gap and other metrics used to test randomness, such as point concentration, defined for instance in the article <a href="http://www.sciencedirect.com/science/article/pii/S0022314X99924204" target="_blank" rel="noopener">On Uniformly Distributed Dilates of Finite Integer Sequences</a>?</li>
</ul>
<ul>
<li>Does <em>U</em>(<em>n</em>) depend on <span style="font-family: 'symbol', geneva;">a</span>? What are the best choices for <span style="font-family: 'symbol', geneva;">a</span>, to get as much randomness as possible? In a similar context, sqrt(2)-1 and (sqrt(5)-1)/2 are found to be good candidates: see <a href="https://en.wikipedia.org/wiki/Low-discrepancy_sequence" target="_blank" rel="noopener">this Wikipedia article</a> (read the section on additive recurrence.) Also, what are the values of the coefficients <em>a</em> and <em>b</em> in <em>U</em>(<em>n</em>), for <span style="font-family: 'symbol', geneva;">a</span>-sequences? It seems that <em>a</em> must be equal to 1 to guarantee convergence to a non-degenerate distribution. Is the limiting distribution for <em>U</em>(<em>n</em>) also normal for <span style="font-family: 'symbol', geneva;">a</span>-sequences, when using the correct <em>a</em> and <em>b</em>?</li>
</ul>
<ul>
<li>What happens if <span style="font-family: 'symbol', geneva;">a</span> is very close to a simple rational number, for instance if the first 500 digits of <span style="font-family: 'symbol', geneva;">a</span> are identical to those of 3/2?</li>
</ul>
<p><strong>Generalization to higher dimensions</strong></p>
<p>So far we have worked in dimension 1, the support domain being the interval [0, 1]. In dimension 2, f(<em>x</em>) = {<span style="font-family: 'symbol', geneva;">a</span><em>x</em>} becomes f(<em>x</em>, <em>y</em>) = ({<span style="font-family: 'symbol', geneva;">a</span><em>x</em>}, {<span style="font-family: 'symbol', geneva;">b</span><em>y</em>}) with <span style="font-family: 'symbol', geneva;">a</span>, <span style="font-family: 'symbol', geneva;">b</span>, and <span style="font-family: 'symbol', geneva;">a</span>/<span style="font-family: 'symbol', geneva;">b</span> irrational; f(<em>k</em>) becomes f(<em>k</em>,<em>k</em>). Just like the interval [0, 1] can be replaced by a circle to avoid boundary effects when deriving theoretical results, the square [0, 1] x [0, 1] can be replaced by the surface of the torus. The maximum gap becomes the maximum circle (on the torus) with no point inside it. The range statistic (maximum minus minimum) becomes the area of the convex hull of the <em>n</em> points. For a famous result regarding the asymptotic behavior of the area of the convex hull of a set of <em>n</em> points, <a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank" rel="noopener">read this article</a> and check out the sub-section entitled "Other interesting stuff related to the Central Limit Theorem." Note that as the dimension increases, boundary effects become more important. </p>
<p><a href="http://api.ning.com:80/files/5yCEGsYwHQr1dk1G9i2cH1fchLFoDRoH4ONMOJ-6vFsDU1vxyjxKyeEior6HtwcVBGmeNsvaMaUnCpUO6Ogz2CBD72UOYKTy/biv.PNG" target="_self"><img src="http://api.ning.com:80/files/5yCEGsYwHQr1dk1G9i2cH1fchLFoDRoH4ONMOJ-6vFsDU1vxyjxKyeEior6HtwcVBGmeNsvaMaUnCpUO6Ogz2CBD72UOYKTy/biv.PNG" width="467" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 4</strong>: bi-variate example with c = 1/2, <span style="font-family: 'symbol', geneva;">a</span> = SQRT(31), <span style="font-family: 'symbol', geneva;">b</span> = SQRT(17) and n = 1000 points</em></p>
<p>Figure 4 shows an unusual example in two dimensions, with a strong departure from randomness, at least when looking at the first 1,000 points. Usually the point pattern looks much more random, albeit not perfectly random, as in Figure 5.</p>
<p style="text-align: center;"><a href="http://api.ning.com:80/files/QzmRjEe-2paSdrJZvK-BHRQxwI*tUYGdGk28ST-vzHjO4HlXin1L4YcJzzRuwYVkmXWnmXnR003FmDBZ2QUYnKl3NryeeMnF/vvv.PNG" target="_self"><img src="http://api.ning.com:80/files/QzmRjEe-2paSdrJZvK-BHRQxwI*tUYGdGk28ST-vzHjO4HlXin1L4YcJzzRuwYVkmXWnmXnR003FmDBZ2QUYnKl3NryeeMnF/vvv.PNG" width="467" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 5</strong>: bi-variate example with c = 1/2, <span style="font-family: 'symbol', geneva;">a</span> = SQRT(13), <span style="font-family: 'symbol', geneva;">b</span> = SQRT(26) and n = 1000 points</em></p>
<p style="text-align: left;">Computations are found <a href="http://api.ning.com:80/files/QzmRjEe-2pZ3iaGKOuGtjwCfeTA7dpgyYjRCGBMY8WSco0FnNqXlGyGv67job4C6PfHwvKQsMeE7Znnok-7y-OddV0W5Vww7/bivariate.xlsx" target="_self">in this spreadsheet</a>. Note that we have mostly discussed the case <em>c</em> = 1 in this article; it creates very regular patterns (points evenly spread, just like in one dimension). The case <em>c</em> = 1/2 creates interesting patterns, while the case <em>c</em> = 2 produces more random-looking patterns.</p>
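<p>The bivariate point sets behind figures 4 and 5 can be reproduced as follows (a sketch, plotting omitted; we assume the exponent <em>c</em> is applied as f(<em>k^c</em>, <em>k^c</em>)):</p>

```python
import math

def torus_points(alpha, beta, n, c=1.0):
    """Points ({alpha * k^c}, {beta * k^c}) in the unit square, k = 1..n.
    c = 1 gives very regular patterns; c = 1/2 and c = 2 behave differently."""
    return [((alpha * k ** c) % 1.0, (beta * k ** c) % 1.0)
            for k in range(1, n + 1)]

# Figure 4 setup: c = 1/2, alpha = SQRT(31), beta = SQRT(17), n = 1000.
pts = torus_points(math.sqrt(31), math.sqrt(17), 1000, c=0.5)
```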
<p><strong>Related articles</strong></p>
<ul>
<li><a href="http://www.analyticbridge.datasciencecentral.com/profiles/blogs/10-interesting-reads-for-math-geeks" target="_blank" rel="noopener">12 Interesting Reads for Math Geeks</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">My Best Articles</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank" rel="noopener">The Fundamental Statistics Theorem Revisited</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/are-the-digits-of-pi-truly-random" target="_blank" rel="noopener">Are the Digits of Pi Truly Random? - Must Read for Math and Data Geeks</a></li>
</ul>