Tuesday, August 25, 2015

FB: a variant of the catch-the-thief problem

The interviewer said it was a new question: finding Ali Baba.
Ali Baba is a monster who can be hiding in any of the caves [0, 1, ..., numCaves-1]. Every day he has to move to a different cave, but he can only move one cave to the left or right.
You're given a strategy[] array, where strategy[i] is the cave you guess on day i. If one of the guesses hits him, the strategy succeeds.
The task is to write a function alibaba(int numCaves, int[] strategy) that determines whether this strategy is a guaranteed win, i.e., it catches him no matter how he moves. After writing it, the interviewer asked for the time complexity and then, roughly, how it could be optimized.

I feel this is DP but wasn't sure how to write the transition equation; any ideas from the experts passing by? (For the code below, the recurrence is: let survive[i][j] mean Ali Baba can still be, uncaught, in cave j after the guess on day i; then survive[i][j] = (survive[i-1][j-1] || survive[i-1][j+1]) && strategy[i] != j, and the strategy wins exactly when some day leaves no surviving cave. The implementation keeps only one row of this table.)

    // Requires: import java.util.Arrays;
    // DP over "can the thief still be hiding in cave j today?"  Time: O(k * n), space: O(n).
    boolean canCatchTheft(int n, int[] strategy) {
        int k = strategy.length;
        if (k == 0) return false; // no guesses at all cannot be a winning strategy
        // nextDay[j] == true means the thief can still be (uncaught) in cave j on the current day
        boolean[] nextDay = new boolean[n];
        // init the first day: he may be in any cave except the one guessed on day 0
        Arrays.fill(nextDay, true);
        nextDay[strategy[0]] = false;
        if (n == 1) return true; // with a single cave the first guess already catches him
        for (int i = 1; i < k; ++i) {
            boolean canSurvive = false;
            boolean pre = false; // previous day's value of nextDay[j - 1]
            for (int j = 0; j < n; ++j) {
                boolean a = (j == 0) ? false : pre;                // could he have come from the left?
                boolean b = (j == n - 1) ? false : nextDay[j + 1]; // could he have come from the right?
                pre = nextDay[j]; // save this cave's previous-day value before overwriting it
                nextDay[j] = (a || b) && strategy[i] != j;
                if (nextDay[j]) canSurvive = true;
            }
            if (!canSurvive) return true; // no cave left for him: the strategy is a guaranteed win
        }
        return false; // he can still evade every guess
    }
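
A quick sanity check with made-up inputs (my own test, not part of the original post), assuming the method above sits in a class named Alibaba together with import java.util.Arrays:

    public static void main(String[] args) {
        Alibaba a = new Alibaba();
        // 3 caves, guess cave 1 twice: day 0 leaves {0, 2}, and both of those must move to cave 1 on day 1
        System.out.println(a.canCatchTheft(3, new int[]{1, 1})); // expected: true
        // 3 caves, guess cave 0 twice: he can start in cave 2 and then move to cave 1, never caught
        System.out.println(a.canCatchTheft(3, new int[]{0, 0})); // expected: false
    }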

Saturday, August 22, 2015

A hashtable implementation with a value tolerance and a time-window limit

The key part is evicting the entry that falls out of the time window, i.e., the one inserted T steps earlier:

    if (map.entrySet().size() >= T) {
        long lastBucket = ((long) nums[i - T] - Integer.MIN_VALUE) / ((long) t + 1);
        map.remove(lastBucket);
    }
    map.put(bucket, remappedNum);

An extension I thought of while doing a LeetCode problem.
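
For context, the fragment above is only the eviction step. Here is a minimal sketch of the whole idea as I understand it, in the style of the well-known bucket solution to LeetCode 220 (Contains Duplicate III): map each value into a bucket of width t + 1, so two values in the same bucket (or close enough in adjacent buckets) are within the value tolerance t, and keep only the last T elements in the map. The names nums, t, T, bucket and remappedNum follow the fragment above; the class and method names are my own.

    import java.util.HashMap;
    import java.util.Map;

    public class NearbyAlmostDuplicate {
        // true if there exist i != j with |i - j| <= T and |nums[i] - nums[j]| <= t
        public static boolean contains(int[] nums, int T, int t) {
            if (T < 1 || t < 0) return false;
            Map<Long, Long> map = new HashMap<>(); // bucket id -> remapped value inside the window
            for (int i = 0; i < nums.length; i++) {
                // shift into the non-negative range so integer division buckets cleanly
                long remappedNum = (long) nums[i] - Integer.MIN_VALUE;
                long bucket = remappedNum / ((long) t + 1);
                // a hit in the same bucket, or a close-enough value in a neighboring bucket
                if (map.containsKey(bucket)
                        || (map.containsKey(bucket - 1) && remappedNum - map.get(bucket - 1) <= t)
                        || (map.containsKey(bucket + 1) && map.get(bucket + 1) - remappedNum <= t)) {
                    return true;
                }
                // evict the entry that is about to fall out of the time window (the key step above)
                if (map.entrySet().size() >= T) {
                    long lastBucket = ((long) nums[i - T] - Integer.MIN_VALUE) / ((long) t + 1);
                    map.remove(lastBucket);
                }
                map.put(bucket, remappedNum);
            }
            return false;
        }
    }

For example, contains(new int[]{1, 5, 9, 1}, 3, 0) is true, because nums[0] and nums[3] are equal and exactly T = 3 indices apart.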

Monday, August 3, 2015

Facebook: Building Timeline: Scaling up to hold your life story


Timeline isn’t just a bold new look for Facebook—it’s also the product of a remarkably ambitious engineering effort. While our earlier profile pages surfaced a few days or weeks of activity, from the outset we knew that with Timeline we had to think in terms of years and even decades. At a high level we needed to scan, aggregate, and rank posts, shares, photos and check-ins to surface the most significant events over years of Facebook activity.

The schedule for Timeline was very aggressive. When we sat down to build the system, one of our key priorities was eliminating technical risk by keeping the system as simple as possible and relying on internally-proven technologies. After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB  for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching. We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits.

Denormalizing the data
Before we began Timeline, our existing data was highly normalized, which required many round trips to the databases. Because of this, we relied on caching to keep everything fast. When data wasn’t found in cache, it was unlikely to be clustered together on disk, which led to lots of potentially slow, random disk IO. To support our ranking model for Timeline, we would have had to keep the entire data set in cache, including low-value data that wasn’t displayed.

A massive denormalization process was required to ensure all the data necessary for ranking was available in a small number of IO-efficient database requests.

Once denormalized, each row in the database contained both information about the action and enough ranking metadata so it could be selected or discarded without additional data fetches. Data is now sorted by (user, time) on disk and InnoDB does a great job of streaming data from disk with a primary key range query.

Some of the specific challenges of the denormalization process were:

1. Dozens of legacy data formats that evolved over years. Peter Ondruška, a Facebook summer intern, defined a custom language to concisely express our data format conversion rules and wrote a compiler to turn this into runnable PHP. Three “data archeologists” wrote the conversion rules.
2. Non-recent activity data had been moved to slow network storage. We hacked a read-only build of MySQL and deployed hundreds of servers to exert maximum IO pressure and copy this data out in weeks instead of months.
3. Massive join queries that did tons of random IO. We consolidated join tables into a tier of flash-only databases. Traditionally PHP can perform database queries on only one server at a time, so we wrote a parallelizing query proxy that allowed us to query the entire join tier in parallel.
4. Future-proofing the data schema. We adopted a data model that’s compatible with Multifeed. It’s more flexible and provides more semantic context around data with the added benefit of allowing more code reuse.

Timeline aggregator 
We built the Timeline aggregator on top of the database. It started its life as a modified version of the Multifeed Aggregator that powers News Feed, but now it runs locally on each database box, allowing us to max out the disks without sending any data over the network that won’t be displayed on the page.

The aggregator provides a set of story generators that handle everything from geographically clustering nearby check-ins to ranking status updates. These generators are implemented in C++ and can run all these analyses in a few milliseconds, much faster than PHP could. Much of the generator logic is decomposed into a sequence of simple operations that can be reused to write new generators with minimal effort.

Caching is an important part of any Facebook project. One of the nice properties of Timeline is that the results of big queries, such as ranking all your activity in 2010, are small and can be cached for a long period without cache invalidations. A query result cache is of huge benefit and memcached is an excellent solution.

Recent Activity changes frequently so a query cache is frequently invalidated, but regenerating the summary of Recent Activity is quite fast. Here a row cache helps further boost query performance. We rely on the InnoDB buffer pool in RAM and our own Flashcache kernel driver to expand the OS cache onto a flash device.


Developing in parallel
Timeline started as a Hackathon project in late 2010 with two full-time engineers, an engineering intern, and a designer building a working demo in a single night. The full team ramped up in early 2011, and the development team was split into design, front-end engineering, infrastructure engineering, and data migrations. By doing staged and layered prototyping, we achieved an amazing amount of development parallelism and rarely was any part of the team blocked by another. Early on in the project we were simultaneously:

1. Designing UI prototypes with our pre-existing but non-scalable backend,
2. Building production frontend code on a simulation of the scalable backend,
3. Building the scalable backend using samples of de-normalized data from a prototype of denormalization migration,
4. Building the framework to run the full-scale denormalization process,
5. Collecting and copying the data necessary for the denormalization,
6. Performing simulated load testing to validate our capacity planning estimates.

In retrospect, that’s pretty crazy. We had to move a lot of mountains to go from the initial infrastructure review meeting to successfully turning on the 100% backend load test in just six months. Done another way, this project could have taken twice as long - and that’s being generous.

Timeline has been a wonderful opportunity to work closely with the product team. Our constant collaboration was critical to ensuring the infrastructure we built supported their product goals while simultaneously guiding the product toward features that we could implement efficiently.

As millions of users enable Timeline, it is wonderfully exciting to see all the positive feedback and even more exciting to see our performance graphs look just like our simulations.

Ryan Mack, an infrastructure engineer, looks forward to rediscovering this blog post on his timeline a decade from now.

Monday, July 27, 2015

FB design questions: a summary


1. Entry-level: news feed
http://www.quora.com/What-are-best-practices-for-building-somet
http://www.infoq.com/presentations/Scale-at-Facebook
http://www.infoq.com/presentations/Facebook-Software-Stack
A common follow-up question is estimating how many servers you would need.
This thread also has some discussion:
http://www.mitbbs.ca/article_t/JobHunting/32463885.html
This article briefly touches on how to approach this kind of question; worth a quick look:
http://book.douban.com/reading/23757677/


2. Facebook chat; this one is also asked fairly often
http://www.erlang-factory.com/upload/presentations/31/EugeneLet
https://www.facebook.com/note.php?note_id=14218138919
http://www.cnblogs.com/piaoger/archive/2012/08/19/2646530.html
http://essay.utwente.nl/59204/1/scriptie_J_Schipers.pdf

3. Typeahead search / search suggestion; also common
https://www.facebook.com/video/video.php?v=432864835468
The questions are discussed in this thread; basically every one of them is answered in the video:
http://www.mitbbs.com/article_t/JobHunting/32438927.html


4. Facebook Messaging System (inbox search is mentioned, which has been asked before)
The messaging system is a single system that unifies chat, SMS, email, and so on.
http://www.infoq.com/presentations/HBase-at-Facebook
http://sites.computer.org/debull/A12june/facebook.pdf
http://www.slideshare.net/brizzzdotcom/facebook-messages-hbase/
https://www.youtube.com/watch?v=UaGINWPK068


5. Given a phone's location (latitude/longitude), return the POIs within 5 miles
This is discussed here; the NYC office seems to love asking this one...
http://www.mitbbs.ca/article0/JobHunting/32476139_0.html


6. Implement second/minute/hour/day counters
I really don't think this is a system design question, but in case it comes up it's worth being prepared; it seems to get asked at on-site interviews at headquarters... (a small sketch follows at the end of this list)
This thread has a discussion:
http://www.mitbbs.com/article_t/JobHunting/32458451.html


7. Facebook photo storage; not very likely to be asked, but good to know
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
https://www.facebook.com/note.php?note_id=76191543919


8. Facebook Timeline; also not really an interview question, just skim it
https://www.facebook.com/note.php?note_id=10150468255628920
http://highscalability.com/blog/2012/1/23/facebook-timeline-bro
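
For item 6 above, a minimal sketch of one way to answer it (entirely my own illustration, not taken from the linked thread): keep a circular buffer of per-second buckets and sum a suffix of it for each granularity.

    // Minimal single-threaded sketch of a second/minute/hour/day hit counter.
    // Fixed memory: one bucket per second for the last 24 hours (86,400 ints).
    public class HitCounter {
        private static final int WINDOW = 24 * 60 * 60; // seconds in a day
        private final int[] counts = new int[WINDOW];   // hits per second, circular
        private final long[] stamps = new long[WINDOW]; // which second each slot currently holds

        // Record one hit at the given epoch second.
        public void hit(long nowSec) {
            int idx = (int) (nowSec % WINDOW);
            if (stamps[idx] != nowSec) { // slot is stale, reuse it for the new second
                stamps[idx] = nowSec;
                counts[idx] = 0;
            }
            counts[idx]++;
        }

        // Number of hits in the last `seconds` seconds (1, 60, 3600, or 86400).
        public int countLast(long nowSec, int seconds) {
            int total = 0;
            for (long s = nowSec - seconds + 1; s <= nowSec; s++) {
                int idx = (int) (s % WINDOW);
                if (stamps[idx] == s) total += counts[idx];
            }
            return total;
        }

        public static void main(String[] args) {
            HitCounter c = new HitCounter();
            long now = System.currentTimeMillis() / 1000;
            c.hit(now);
            c.hit(now);
            c.hit(now - 30);
            System.out.println(c.countLast(now, 1));    // hits in the last second -> 2
            System.out.println(c.countLast(now, 60));   // hits in the last minute -> 3
            System.out.println(c.countLast(now, 3600)); // hits in the last hour   -> 3
        }
    }

Summing 86,400 buckets for the day-level query is the obvious weak point; the usual refinement is to also keep coarser per-minute and per-hour buckets and combine them, plus locking or atomics if hit() can be called from multiple threads.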


Besides these, also prepare the following topics:
implement memcache
http://www.adayinthelifeof.nl/2011/02/06/memcache-internals/

implement TinyURL (and how to distribute it across multiple servers); a base62 sketch follows this list
http://stackoverflow.com/questions/742013/how-to-code-a-url-sho

determine trending topics (Twitter)
http://www.americanscientist.org/issues/pub/the-britney-spears-
http://www.michael-noll.com/blog/2013/01/18/implementing-real-t

copy one file to multiple servers
http://vimeo.com/11280885

Also know a bit about the Dynamo key-value store, as well as Google's GFS and Bigtable.
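
For the TinyURL item above, the core trick worth mentioning is to hand out a monotonically increasing numeric id (or a disjoint id range per server, which is one simple way to handle the "distribute across multiple servers" part) and base62-encode it as the short code. A minimal sketch, with names of my own choosing:

    // Minimal base62 id <-> short-code sketch for a TinyURL-style service.
    public class Base62Codec {
        private static final String ALPHABET =
                "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

        // Encode a numeric id (e.g., an auto-increment primary key) into a short code.
        public static String encode(long id) {
            if (id == 0) return "0";
            StringBuilder sb = new StringBuilder();
            while (id > 0) {
                sb.append(ALPHABET.charAt((int) (id % 62)));
                id /= 62;
            }
            return sb.reverse().toString();
        }

        // Decode a short code back into the numeric id.
        public static long decode(String code) {
            long id = 0;
            for (char c : code.toCharArray()) {
                id = id * 62 + ALPHABET.indexOf(c);
            }
            return id;
        }

        public static void main(String[] args) {
            long id = 125_000_000_123L;             // pretend this came from the id generator
            String code = encode(id);
            System.out.println(code);               // short code to append to the domain
            System.out.println(decode(code) == id); // true: round-trips
        }
    }

The id-to-original-URL mapping then lives in an ordinary key-value store, sharded by id.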


A few more recommended sites:
http://highscalability.com/blog/category/facebook
High Scalability has a lot of material on system design, and not just about Facebook; if you're short on time, just read the posts about the company you're interviewing with.
facebook engineering blog
http://www.quora.com/Facebook-Engineering/What-is-Facebooks-arc
http://stackoverflow.com/questions/3533948/facebook-architectur

Other companies:
http://www.quora.com/What-are-the-top-startup-engineering-blogs


==================================================================
Now, a few words on how to prepare for this kind of interview.
First, if you don't have even a rough idea of concepts like availability, scalability, consistency, and partitioning, I suggest skimming Wikipedia or a university course site that covers this material first, so you're not going in knowing nothing at all.
This link is also good:
http://www.aosabook.org/en/distsys.html

If you do know these basics, then I'd say you're at about the same level as most people with no real-world experience...
All you can do is prepare bit by bit. If you still have plenty of time, I suggest starting from the engineering blog of the company you're interviewing with, getting a clear picture of the technology stack and products they use, and then working through every interview question you can find. Grinding coding questions is, frankly, the same kind of brute-force practice... And if you keep at it, you really will develop a feel for it: you'll notice that many designs share the same patterns, and after seeing enough of them you can follow the same template yourself.

Next, how to approach these questions during the interview itself; here is what I do:
1. Product spec / usage scenarios: confirm with the interviewer what the thing is actually supposed to do.
You can list a few pieces of major functionality first and, if there's time, add some less important ones afterwards.
Write down everything you think of.

2. Define some major components.
Draw a few boxes and circles, offer your thoughts on each one, and then explain how they interact.

The above is question-specific.
Once that's covered, you can move on to the things every question needs, for example
how to scale, how to partition, and how to achieve consistency; these apply to any problem.



Of course, the questions you actually get and the ways to solve them may differ, and not every question follows a fixed script. What matters most is walking through the problem in an organized way, drawing clear diagrams, keeping the conversation going with the interviewer, and asking for their input along the way.

That's all I can think of. Discussion is welcome; I hope everyone lands the job they want.