Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 11:04:55
|
I have a page that has a filter system on it. To create the filter, I have to select distinct values from the table that the data is pulled from, for each of the filter types.As an example, if there were 8 fields that you could filter on, I have 8 select boxes built from 8 different "Select Distinct X From Y"I believe this is the best way to build those filters, I may be wrong. If this is the best way, my problem is that each of those queries takes about .3 - .5 seconds to run and return to my page, which ends up making the load time a bit long for the entire site. There are about 11 million records right now in said table, and it will grow to about 100 million in a few months, and maybe 3 or 4 hundred million in a year or so, and then it will cease to grow. That means that .5 could turn into a full second. Is there a way I can speed up these queries? Because of the simplicity of the query, I don't believe there are any indexes I can create to speed them up that much. I have used query analyzer and execution plans to look for suggestions, but received none.The hardware isn't amazing, but its only a year old.I'm not sure what information would be useful to help solve this problem, but I can provide anything needed.Thanks in advance. |
|
Vinnie881
Master Smack Fu Yak Hacker
1231 Posts |
Posted - 2012-03-28 : 11:39:50
|
Can you please post a sample query to illustrate. Also, even though I think it is probably pretty self-explanitory once I see the query, try to include some wanted/expected results to help us assist.Thanks Success is 10% Intelligence, 70% Determination, and 22% Stupidity.\_/ _/ _/\_/ _/\_/ _/ _/- 881 |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 11:47:59
|
quote: Originally posted by Vinnie881 Can you please post a sample query to illustrate. Also, even though I think it is probably pretty self-explanitory once I see the query, try to include some wanted/expected results to help us assist.
The sample I gave was pretty accurate, but below is the exact query I use to fill out one of the Select boxes for the filter.SELECT distinct school_nameFROM Student_Test_ItemWHERE COUNTY_DISTRICT = 'xxxxxx'This returns anywhere from 30 - 50 rows out of around 11.5 million and takes about .3 seconds to run. The result, obviously, would be a set of school names in a given district. |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 12:17:41
|
i think issue is your system design. Why do you've to go Student_Test_Item table to get school_names? Ideally, you should be maintaining master tables for reference (dimension) attributes like these . The Student_Test_Item should just have a reference key (a fk id column) which will just specify what school student belongs to. This fk will be the primary key id value generated in the master table for school. The school master table will not have huge amount of data at any point. It will just have distinct list of schools and hence select query from school table will be quick and fast.Suggest you to read on the concept of normalisationWith the current design, you have to fetch data from transaction table (Student_Test_Item ) for any of master reference details which will surely hurt page performance------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 14:33:51
|
quote: Originally posted by visakh16 i think issue is your system design. Why do you've to go Student_Test_Item table to get school_names? Ideally, you should be maintaining master tables for reference (dimension) attributes like these . The Student_Test_Item should just have a reference key (a fk id column) which will just specify what school student belongs to. This fk will be the primary key id value generated in the master table for school. The school master table will not have huge amount of data at any point. It will just have distinct list of schools and hence select query from school table will be quick and fast.Suggest you to read on the concept of normalisationWith the current design, you have to fetch data from transaction table (Student_Test_Item ) for any of master reference details which will surely hurt page performance
I considered this point. The data is not generated by our system. We are given the data by the state and it comes in the format that I query. I could create a table with just school names that references this table by foreign key, however, I would then also have to join to this table because the filter could include additional layers that add to a where clause on other fields. I wasn't sure if it would then become more complicated and take longer still. I thought it best to ask first before building a bunch of tables. In this case I am searching school names, but there are several other fields I have to consider as well. If you think it would be best to create a table that I join two tables instead of selecting distincts, I can definitely go that route.Thank you for your help. |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 14:40:04
|
Also, keep in mind that the master reference table could not simply be School_ID, School_name, District_ID...It would have to include each school multiple times, for grade, year, content area, etc...So while it would dramatically reduce the number of records, by the time I have them all done for the entire state, I would probably still be looking at around 8 or 9 million records and that number will grow every year when I add another year of data being sent from the state. |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 14:55:49
|
quote: Originally posted by tomrippity Also, keep in mind that the master reference table could not simply be School_ID, School_name, District_ID...It would have to include each school multiple times, for grade, year, content area, etc...So while it would dramatically reduce the number of records, by the time I have them all done for the entire state, I would probably still be looking at around 8 or 9 million records and that number will grow every year when I add another year of data being sent from the state.
why should each school be repeated multiple times?Any additional attributes like grade year, content area etc which has no direct relation to school should not be in schools table. Its something related to student's test record and should reside in your current transaction table itself ie Student_Test_Item. All other master entries like student,school,course, etc have to be represnted by master tables and should just have reference key in your transaction table. As suggested before, please read about normalisation principles------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
Vinnie881
Master Smack Fu Yak Hacker
1231 Posts |
Posted - 2012-03-28 : 15:17:29
|
Agree with VisakhTable 1 (SCHOOL)TABLE 2 (Years)table 3 (Student_Tests)table 4 ...Just because the data received comes in one file, does not mean you should not parse it out properly to ensure adequate performance. Success is 10% Intelligence, 70% Determination, and 22% Stupidity.\_/ _/ _/\_/ _/\_/ _/ _/- 881 |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 15:17:41
|
visakh16:I understand normalization, and I think you would understand my question more if you saw the scope of the application and data together, or my understanding of what should and should not be normalized is wrong.Years, Content Area, Grade Level are all repeated many many times, just like a school name would be, even for an individual student. Any one of those fields would have the same data repeated no less than a million times, where a school name might appear at most, 50 or 60 thousand times. |
 |
|
Vinnie881
Master Smack Fu Yak Hacker
1231 Posts |
Posted - 2012-03-28 : 15:22:34
|
If you will not change your data structure, then to answer your original question:All you can do is proper index it possibly partition it, and that is your optimal performance. You have no joins or anything that would impact performance besides the data you are querying. Success is 10% Intelligence, 70% Determination, and 22% Stupidity.\_/ _/ _/\_/ _/\_/ _/ _/- 881 |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 15:23:41
|
quote: Originally posted by tomrippity visakh16:I understand normalization, and I think you would understand my question more if you saw the scope of the application and data together, or my understanding of what should and should not be normalized is wrong.Years, Content Area, Grade Level are all repeated many many times, just like a school name would be, even for an individual student. Any one of those fields would have the same data repeated no less than a million times, where a school name might appear at most, 50 or 60 thousand times.
thats ok. if even thats the case why should you worry?In suggested approach, you'll just have a single entry in School,Student etc for each school,student information and they'll have an associated id value which uniquely identifies them.In your main table ie student_test_item all records that belongs to the particular school /student will have same id repeated against them. This will avoid need of duplicating school name,student name etc for each of those records. And wherever you want to show them you dont need to pull a distinct list out of million records------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 15:25:51
|
how are you getting the data feeds currently? is it coming from file/ application?------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 15:46:29
|
quote: Originally posted by visakh16
quote: Originally posted by tomrippity visakh16:I understand normalization, and I think you would understand my question more if you saw the scope of the application and data together, or my understanding of what should and should not be normalized is wrong.Years, Content Area, Grade Level are all repeated many many times, just like a school name would be, even for an individual student. Any one of those fields would have the same data repeated no less than a million times, where a school name might appear at most, 50 or 60 thousand times.
thats ok. if even thats the case why should you worry?In suggested approach, you'll just have a single entry in School,Student etc for each school,student information and they'll have an associated id value which uniquely identifies them.In your main table ie student_test_item all records that belongs to the particular school /student will have same id repeated against them. This will avoid need of duplicating school name,student name etc for each of those records. And wherever you want to show them you dont need to pull a distinct list out of million records------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/
Ok, I will work based on that information. I appreciate the input.Also, I am wondering if there is any reason why I couldn't use the school name as the foreign key here, as opposed to creating a column with an ID and then having to update the student_test_item table every time we import data from the state for each district. That process occurs a few hundred times each semester. Instead, couldn't I just keep a master table of School_Name, School_Code, District_Number and reference based on those pieces of information? If the performance hit were minimal, I believe there would be far less chance of problems occurring when we get new data.Thanks again. |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 15:47:55
|
quote: Originally posted by visakh16 how are you getting the data feeds currently? is it coming from file/ application?
The files are given to us in flat text files, broken up by districts. There are about 530 districts, and each one would have a minimum of a million records and as many as 10. |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 15:50:14
|
quote: Originally posted by tomrippity
quote: Originally posted by visakh16 how are you getting the data feeds currently? is it coming from file/ application?
The files are given to us in flat text files, broken up by districts. There are about 530 districts, and each one would have a minimum of a million records and as many as 10.
then your ETL process should take care of populating the master tables with reference data from files------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 15:53:59
|
quote: Originally posted by tomrippityOk, I will work based on that information. I appreciate the input.Also, I am wondering if there is any reason why I couldn't use the school name as the foreign key here, as opposed to creating a column with an ID and then having to update the student_test_item table every time we import data from the state for each district. That process occurs a few hundred times each semester. Instead, couldn't I just keep a master table of School_Name, School_Code, District_Number and reference based on those pieces of information? If the performance hit were minimal, I believe there would be far less chance of problems occurring when we get new data.Thanks again.
the main reason is to avoid update anomalies. If at all you decide to change name of school at a later point, you just need to do it in single record which is in master table rather than seraching for all millions of rows within student_test_item having this school. The id will still be the same as its same school but it just changed its name so there wont be any change required in student_test_item. changing id value in student_test_item means changing associated school itself which will not happen for retrospective records unless source feed was wrong.------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 16:04:14
|
quote: Originally posted by visakh16
quote: Originally posted by tomrippityOk, I will work based on that information. I appreciate the input.Also, I am wondering if there is any reason why I couldn't use the school name as the foreign key here, as opposed to creating a column with an ID and then having to update the student_test_item table every time we import data from the state for each district. That process occurs a few hundred times each semester. Instead, couldn't I just keep a master table of School_Name, School_Code, District_Number and reference based on those pieces of information? If the performance hit were minimal, I believe there would be far less chance of problems occurring when we get new data.Thanks again.
the main reason is to avoid update anomalies. If at all you decide to change name of school at a later point, you just need to do it in single record which is in master table rather than seraching for all millions of rows within student_test_item having this school. The id will still be the same as its same school but it just changed its name so there wont be any change required in student_test_item. changing id value in student_test_item means changing associated school itself which will not happen for retrospective records unless source feed was wrong.------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/
In this case, when we get new data, they give us ALL data for that school at the same time. We completely replace data each time its given to us because the state does retroactively change information for student testing. So if the school name did change, the state would go back and change all of the old records and then send us new files which we would then use to replace all of our existing data. So if I understand you correctly, and I am not concerned with name changes causing conflict, it would be ok to do that? |
 |
|
tomrippity
Starting Member
37 Posts |
Posted - 2012-03-28 : 16:05:54
|
quote: Originally posted by visakh16then your ETL process should take care of populating the master tables with reference data from files
Ok, I believe I understand where you are coming from. I will look at trying to create some processes that normalize the data for us in an efficient manner. |
 |
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2012-03-28 : 16:11:08
|
quote: Originally posted by tomrippity
quote: Originally posted by visakh16
quote: Originally posted by tomrippityOk, I will work based on that information. I appreciate the input.Also, I am wondering if there is any reason why I couldn't use the school name as the foreign key here, as opposed to creating a column with an ID and then having to update the student_test_item table every time we import data from the state for each district. That process occurs a few hundred times each semester. Instead, couldn't I just keep a master table of School_Name, School_Code, District_Number and reference based on those pieces of information? If the performance hit were minimal, I believe there would be far less chance of problems occurring when we get new data.Thanks again.
the main reason is to avoid update anomalies. If at all you decide to change name of school at a later point, you just need to do it in single record which is in master table rather than seraching for all millions of rows within student_test_item having this school. The id will still be the same as its same school but it just changed its name so there wont be any change required in student_test_item. changing id value in student_test_item means changing associated school itself which will not happen for retrospective records unless source feed was wrong.------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/
In this case, when we get new data, they give us ALL data for that school at the same time. We completely replace data each time its given to us because the state does retroactively change information for student testing. So if the school name did change, the state would go back and change all of the old records and then send us new files which we would then use to replace all of our existing data. So if I understand you correctly, and I am not concerned with name changes causing conflict, it would be ok to do that?
ideally if school information is unchanged there's no reason to remove and again add it back. it will cause a new id value to be generated for it and then you've to cascade change to your student_test_item which would be big impact. so what you need to do is identify deltas (new data/modified data) out of file and reflect them in appropriate master tables------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
|
|
|
|
|